File Coverage

blib/lib/Search/Indexer.pm

Criterion	Covered	Total	%
statement	13	15	86.6
branch			n/a
condition			n/a
subroutine	5	5	100.0
pod			n/a
total	18	20	90.0

line	stmt	sub	time	code
1				package Search::Indexer;
2
3	2	2	28004	use strict;
	2		4
	2		51
4	2	2	7	use warnings;
	2		3
	2		49
5	2	2	6	no warnings 'uninitialized';
	2		6
	2		44
6
7	2	2	6	use Carp;
	2		6
	2		136
8	2	2	1106	use BerkeleyDB;
	0
	0
9				use locale;
10				use Search::QueryParser;
11				use List::MoreUtils qw/uniq/;
12
13				# TODO : experiment with bit vectors (cf vec() and pack "b*" for combining
14				# result sets
15
16				our $VERSION = "0.78";
17
18				=head1 NAME
19
20				Search::Indexer - full-text indexer
21
22				=head1 SYNOPSIS
23
24				use Search::Indexer;
25				my $ix = new Search::Indexer(dir => $dir, writeMode => 1);
26				foreach my $docId (keys %docs) {
27				$ix->add($docId, $docs{$docId});
28				}
29
30				my $result = $ix->search('+word -excludedWord +"exact phrase"');
31				my @docIds = keys %{$result->{scores}};
32				my $killedWords = join ", ", @{$result->{killedWords}};
33				print scalar(@docIds), " documents found\n", ;
34				print "words $killedWords were ignored during the search\n" if $killedWords;
35				foreach my $docId (@docIds) {
36				my $score = $result->{scores}{$docId};
37				my $excerpts = join "\n", $ix->excerpts($docs{$docId}, $result->{regex});
38				print "DOCUMENT $docId, score $score:\n$excerpts\n\n";
39				}
40
41				my $result2 = $ix->search('word1 AND (word2 OR word3) AND NOT word4');
42
43				$ix->remove($someDocId);
44
45				=head1 DESCRIPTION
46
47				This module provides support for indexing a collection of documents,
48				for searching the collection, and displaying the sorted results,
49				together with contextual excerpts of the original document.
50
51				=head2 Documents
52
53				As far as this module is concerned, a I is just a buffer of
54				plain text, together with a unique identifying number. The caller is
55				responsible for supplying unique numbers, and for converting the
56				original source (HTML, PDF, whatever) into plain text. Documents could
57				also contain more information (other fields like date, author, Dublin
58				Core, etc.), but this must be handled externally, in a database or any
59				other store. A candidate for storing metadata about documents
60				could be L, which uses the same
61				query parser.
62
63				=head2 Search syntax
64
65				Searching requests may include plain terms, "exact phrases",
66				'+' or '-' prefixes, boolean operators and parentheses.
67				See L for details.
68
69				=head2 Index files
70
71				The indexer uses three files in BerkeleyDB format : a) a mapping from
72				words to wordIds; b) a mapping from wordIds to lists of documents ; c)
73				a mapping from pairs (docId, wordId) to lists of positions within the
74				document. This third file holds detailed information and therefore is
75				quite big ; but it allows us to quickly retrieve "exact phrases"
76				(sequences of adjacent words) in the document.
77
78				=head2 Indexing steps
79
80				Indexing of a document buffer goes through the following
81				steps :
82
83				=over
84
85				=item *
86
87				terms are extracted, according to the I regular expression
88
89				=item *
90
91				extracted terms are normalized or filtered out
92				by the I callback function. This function can for example
93				remove accented characters, perform lemmatization, suppress
94				irrelevant terms (such as numbers), etc.
95
96				=item *
97
98				normalized terms are eliminated if they belong to
99				the I list (list of common words to exclude from the index).
100
101				=item *
102
103				remaining terms are stored, together with the positions where they
104				occur in the document.
105
106				=back
107
108				=head2 Limits
109
110				All ids are stored as unsigned 32-bit integers; therefore there is
111				a limit of 4294967295 to the number of documents or to the number of
112				different words.
113
114				=head2 Related modules
115
116				A short comparison with other CPAN indexing modules is
117				given in the L section.
118
119				This module depends on L for analyzing requests and
120				on L for storing the indexes.
121
122				This module was designed together with L.
123
124				=cut
125
126
127				sub addToScore (\$$);
128
129				use constant {
130
131				# max size of various ids
132				MAX_DOC_ID => 0xFFFFFFFF, # unsigned long (32 bits)
133				MAX_POS_ID => 0xFFFFFFFF, # position_id
134
135				# encodings for pack/unpack
136				IXDPACK => 'wC', # docId : compressed int; freq : unsigned char
137				IXDPACK_L => '(wC)*', # list of above
138				IXPPACK => 'w*', # word positions : list of compressed ints
139				IXPKEYPACK => 'ww', # key for ixp : (docId, wordId)
140
141				WRITECACHESIZE => (1 << 24), # arbitrary big value; seems good enough but need tuning
142
143				# default values for args to new()
144				DEFAULT => {
145				writeMode => 0,
146				wregex => qr/\w+/,
147				wfilter => sub { # default filter : lowercase and no accents
148				my $word = lc($_[0]);
149				$word =~ tr[��][caaaaeeeeiiiioooouuuuyy];
150				return $word;
151				},
152				fieldname => '',
153
154				ctxtNumChars => 35,
155				maxExcerpts => 5,
156				preMatch => "",
157				postMatch => "",
158				positions => 1,
159				}
160				};
161
162				=head1 METHODS
163
164				=over
165
166				=item C expr1, ...)>
167
168				Creates an indexer (either for a new index, or for
169				accessing an existing index). Parameters are :
170
171				=over
172
173				=item dir
174
175				Directory for index files. and possibly for the stopwords file.
176				Default is current directory
177
178				=item writeMode
179
180				Give a true value if you intend to write into the index.
181
182				=item wregex
183
184				Regex for matching a word (C by default).
185				Will affect both L and L method.
186				This regex should not contain any capturing parentheses
187				(use non-capturing parentheses C<< (?: ... ) >> instead).
188
189				=item wfilter
190
191				Ref to a callback sub that may normalize or eliminate a word. Will
192				affect both L and L method. The default wfilter
193				translates words in lower case and translates latin1 (iso-8859-1)
194				accented characters into plain characters.
195
196				=item stopwords
197
198				List of words that will be marked into the index as "words to exclude".
199				This should usually occur when creating a new index ; but nothing prevents
200				you to add other stopwords later. Since stopwords are stored in the
201				index, they need not be specified when opening an index for searches or
202				updates.
203
204				The list may be supplied either as a ref to an array of scalars, or
205				as a the name of a file containing the stopwords (full pathname
206				or filename relative to I).
207
208
209				=item fieldname
210
211				Will only affect the L method.
212				Search queries are passed to a general parser
213				(see L).
214				Then, before being applied to the present indexer module,
215				queries are pruned of irrelevant items.
216				Query items are considered relevant if they have no
217				associated field name, or if the associated field name is
218				equal to this C.
219
220				=back
221
222				Below are some additional parameters that only affect the
223				L method.
224
225				=over
226
227				=item ctxtNumChars
228
229				Number of characters determining the size of contextual excerpts
230				return by the L method.
231				A I is a part of the document text,
232				containg a matched word surrounded by I characters
233				to the left and to the right. Default is 35.
234
235
236				=item maxExcerpts
237
238				Maximum number of contextual excerpts to retrieve per document.
239				Default is 5.
240
241				=item preMatch
242
243				String to insert in contextual excerpts before a matched word.
244				Default is C<"EbE">.
245
246				=item postMatch
247
248				String to insert in contextual excerpts after a matched word.
249				Default is C<"E/bE">.
250
251
252				=item positions
253
254				my $indexer = new Search::Indexer(dir => $dir,
255				writeMode => 1,
256				positions => 0);
257
258				Truth value to tell whether or not, when creating a new index,
259				word positions should be stored. The default is true.
260
261				If you turn it off, index files will be much smaller, indexing
262				will be faster, but results will be less precise,
263				because the indexer can no longer find "exact phrases".
264				So if you type C<"quick fox jumped">, the query will be
265				translated into C, and therefore
266				will retrieve documents in which those three words are present, but
267				not necessarily in order.
268
269				Another consequence of C<< positions => 0 >> is that
270				there will be no automatic check of uniqueness of ids
271				when adding documents into the index.
272
273				=back
274
275				=cut
276
277				sub new {
278				my $class = shift;
279				my $args = ref $_[0] eq 'HASH' ? $_[0] : {@_};
280
281				# parse options
282				my $self = {};
283				$self->{$_} = exists $args->{$_} ? delete $args->{$_} : DEFAULT->{$_}
284				foreach qw(writeMode wregex wfilter fieldname
285				ctxtNumChars maxExcerpts preMatch postMatch positions);
286				my $dir = delete $args->{dir} \|\| ".";
287				$dir =~ s{[/\\]$}{}; # remove trailing slash
288				my $stopwords = delete $args->{stopwords};
289
290				# check if invalid options
291				my @remaining = keys %$args;
292				croak "unexpected option : $remaining[0]" if @remaining;
293				croak "can't add 'positions' after index creation time"
294				if $self->{writeMode} and $self->{positions}
295				and -f "$dir/ixd.bdb" and not -f "$dir/ixp.bdb";
296
297				# BerkeleyDB environment should allow us to do proper locking for
298				# concurrent access ; but seems to be incompatible with the
299				# -Cachesize argument, so I commented it out ... need to learn more about
300				# BerkeleyDB ...
301				# my $dbEnv = new BerkeleyDB::Env
302				# -Home => $dir,
303				# -Flags => DB_INIT_CDB \| DB_INIT_MPOOL \| DB_CDB_ALLDB \|
304				# ($self->{writeMode} ? DB_CREATE : 0),
305				# -Verbose => 1
306				# or croak "new BerkeleyDB::Env : $^E $BerkeleyDB::Error" ;
307
308
309				my @bdb_args = (# -Env => $dbEnv, # commented out, see explanation above
310				-Flags => ($self->{writeMode} ? DB_CREATE : DB_RDONLY),
311				($self->{writeMode} ? (-Cachesize => WRITECACHESIZE) : ()));
312
313				# 3 index files :
314				# ixw : word => wordId (or -1 for stopwords)
315				$self->{ixwDb} = tie %{$self->{ixw}},
316				'BerkeleyDB::Btree',
317				-Filename => "$dir/ixw.bdb", @bdb_args
318				or croak "open $dir/ixw.bdb : $^E $BerkeleyDB::Error";
319
320				# ixd : wordId => list of (docId, nOccur)
321				$self->{ixdDb} = tie %{$self->{ixd}},
322				'BerkeleyDB::Hash',
323				-Filename => "$dir/ixd.bdb", @bdb_args
324				or croak "open $dir/ixd.bdb : $^E $BerkeleyDB::Error";
325
326				if (-f "$dir/ixp.bdb" \|\| $self->{writeMode} && $self->{positions}) {
327				# ixp : (docId, wordId) => list of positions of word in doc
328				$self->{ixpDb} = tie %{$self->{ixp}},
329				'BerkeleyDB::Btree',
330				-Filename => "$dir/ixp.bdb", @bdb_args
331				or croak "open $dir/ixp.bdb : $^E $BerkeleyDB::Error";
332				}
333
334
335				# optional list of stopwords may be given as a list or as a filename
336				if ($stopwords) {
337				$self->{writeMode} or croak "must be in writeMode to specify stopwords";
338				if (not ref $stopwords) { # if scalar, name of stopwords file
339				open TMP, $stopwords or
340				(open TMP, "$dir/$stopwords") or
341				croak "open stopwords file $stopwords : $^E ";
342				local $/ = undef;
343				my $buf = ;
344				$stopwords = [$buf =~ /$self->{wregex}/g];
345				close TMP;
346				}
347				foreach my $word (@$stopwords) {
348				$self->{ixw}{$word} = -1;
349				}
350				}
351
352				bless $self, $class;
353				}
354
355
356
357
358
359				=item C
360
361				Add a new document to the index.
362				I is the unique identifier for this doc
363				(the caller is responsible for uniqueness).
364				I is a scalar containing the text representation of this doc.
365
366				=cut
367
368				sub add {
369				my $self = shift;
370				my $docId = shift;
371				# my $buf = shift; # using $_[0] instead for efficiency reasons
372
373				croak "docId $docId is too large" if $docId > MAX_DOC_ID;
374
375				# first check if this docId is already used
376				if ($self->{ixp}) { # can only check if we have the positions index
377				my $c = $self->{ixpDb}->db_cursor;
378				my $k = pack IXPKEYPACK, $docId, 0;
379				my $v; # not used, but needed by c_get()
380				my $status = $c->c_get($k, $v, DB_SET_RANGE);
381				if ($status == 0) {
382				my ($check, $wordId) = unpack IXPKEYPACK, $k;
383				croak "docId $docId is already used (wordId=$wordId)"
384				if $docId == $check;
385				}
386				}
387
388				# OK, let's extract words from the $_[0] buffer
389				my %positions;
390				for (my $nwords = 1; $_[0] =~ /$self->{wregex}/g; $nwords++) {
391
392				my $word = $self->{wfilter}->($&) or next;
393				my $wordId = $self->{ixw}{$word} \|\|
394				($self->{ixw}{$word} = ++$self->{ixw}{_NWORDS}); # create new wordId
395				push @{$positions{$wordId}}, $nwords if $wordId > 0;
396				}
397
398				foreach my $wordId (keys %positions) {
399				my $occurrences = @{$positions{$wordId}};
400				$occurrences = 255 if $occurrences > 255;
401
402				$self->{ixd}{$wordId} .= pack(IXDPACK, $docId, $occurrences);
403				if ($self->{ixp}) {
404				my $ixpKey = pack IXPKEYPACK, $docId, $wordId;
405				$self->{ixp}{$ixpKey} = pack(IXPPACK, @{$positions{$wordId}});
406				}
407				}
408
409				$self->{ixd}{NDOCS} = 0 if not defined $self->{ixd}{NDOCS};
410				$self->{ixd}{NDOCS} += 1;
411				}
412
413
414				=item C
415
416				Removes a document from the index.
417				If the index contains word positions (true by default), then
418				only the C is needed; however, if the index was created
419				without word positions, then the text representation
420				of the document must be given as a scalar string in the second argument
421				(of course this should be the same as the one that was supplied
422				when calling the L method).
423
424
425				=cut
426
427				sub remove {
428				my $self = shift;
429				my $docId = shift;
430				# my $buf = shift; # using $_[0] instead for efficiency reasons
431
432				my $wordIds;
433
434				if ($self->{ixp}) { # if using word positions
435				not $_[0] or carp "remove() : unexpected 'buf' argument";
436				$wordIds= $self->wordIds($docId);
437				}
438				else { # otherwise : recompute word ids
439				$wordIds = [grep {defined $_ and $_ > 0}
440				map {$self->{ixw}{$_}}
441				uniq map {$self->{wfilter}->($_)}
442				($_[0] =~ /$self->{wregex}/g)];
443				}
444
445				return if not @$wordIds;
446
447				foreach my $wordId (@$wordIds) {
448				my %docs = unpack IXDPACK_L, $self->{ixd}{$wordId};
449				delete $docs{$docId};
450				$self->{ixd}{$wordId} = pack IXDPACK_L, %docs;
451				if ($self->{ixp}) {
452				my $ixpKey = pack IXPKEYPACK, $docId, $wordId;
453				delete $self->{ixp}{$ixpKey};
454				}
455				}
456
457				$self->{ixd}{NDOCS} -= 1;
458				}
459
460				=item C
461
462				Returns a ref to an array of word Ids contained in the specified document
463				(not available if the index was created with C<< positions => 0 >>)
464
465				=cut
466
467				sub wordIds {
468				my $self = shift;
469				my $docId_ini = shift;
470
471				$self->{ixpDb}
472				or croak "wordIds() not available (index was created with positions=>0)";
473
474				my @wordIds = ();
475				my $c = $self->{ixpDb}->db_cursor;
476				my ($k, $v);
477				$k = pack IXPKEYPACK, $docId_ini, 0;
478				my $status = $c->c_get($k, $v, DB_SET_RANGE);
479				while ($status == 0) {
480				my ($docId, $wordId) = unpack IXPKEYPACK, $k;
481				last if $docId != $docId_ini;
482				push @wordIds, $wordId;
483				$status = $c->c_get($k, $v, DB_NEXT);
484				}
485				return \@wordIds;
486				}
487
488
489				=item C
490
491				Returns a ref to an array of words found in the dictionary,
492				starting with prefix (i.e. C<< $ix->words("foo") >> will
493				return "foo", "food", "fool", "footage", etc.).
494
495				=cut
496
497				sub words {
498				my $self = shift;
499				my $prefix = shift;
500
501				my $regex = qr/^$prefix/;
502				my @words = ();
503				my $c = $self->{ixwDb}->db_cursor;
504				my ($k, $v);
505				$k = $prefix;
506				my $status = $c->c_get($k, $v, DB_SET_RANGE);
507				while ($status == 0) {
508				last if $k !~ $regex;
509				push @words, $k;
510				$status = $c->c_get($k, $v, DB_NEXT);
511				}
512				return \@words;
513				}
514
515
516
517				=item C
518
519				Debugging function, prints indexed words with list of associated docs.
520
521				=cut
522
523				sub dump {
524				my $self = shift;
525				foreach my $word (sort keys %{$self->{ixw}}) {
526				my $wordId = $self->{ixw}{$word};
527				if ($wordId == -1) {
528				print "$word : STOPWORD\n";
529				}
530				else {
531				my %docs = unpack IXDPACK_L, $self->{ixd}{$wordId};
532				print "$word : ", join (" ", keys %docs), "\n";
533				}
534				}
535				}
536
537
538				=item C
539
540				Searches the index. See the L and L sections
541				above for short descriptions of query strings, or
542				L for details. The second argument is optional ;
543				if true, all words without any prefix will implicitly take prefix '+'
544				(mandatory words).
545
546				The return value is a hash ref containing
547
548				=over
549
550				=item scores
551
552				hash ref, where keys are docIds of matching documents, and values are
553				the corresponding computed scores.
554
555				=item killedWords
556
557				ref to an array of terms from the query string which were ignored
558				during the search (because they were filtered out or were stopwords)
559
560				=item regex
561
562				ref to a regular expression corresponding to all terms in the query
563				string. This will be useful if you later want to get contextual
564				excerpts from the found documents (see the L method).
565
566				=back
567
568				=cut
569
570
571				sub search {
572				my $self = shift;
573				my $query_string = shift;
574				my $implicitPlus = shift;
575
576				$self->{qp} \|\|= new Search::QueryParser;
577
578				my $q = $self->{qp}->parse($query_string, $implicitPlus);
579				my $killedWords = {};
580				my $wordsRegexes = [];
581
582				my $qt = $self->translateQuery($q, $killedWords, $wordsRegexes);
583
584				my $tmp = {};
585				$tmp->{$_} = 1 foreach @$wordsRegexes;
586				my $strRegex = "(?:" . join("\|", keys %$tmp) . ")";
587
588				return {scores => $self->_search($qt),
589				killedWords => [keys %$killedWords],
590				regex => qr/$strRegex/i};
591				}
592
593
594				sub _search {
595				my ($self, $q) = @_;
596
597				my $scores = undef; # hash {doc1 => score1, doc2 => score2 ...}
598
599				# 1) deal with mandatory subqueries
600
601				foreach my $subQ ( @{$q->{'+'}} ) {
602				my $sc = $self->docsAndScores($subQ) or next;
603				$scores = $sc and next if not $scores; # if first result set, just store
604
605				# otherwise, intersect with previous result set
606				foreach my $docId (keys %$scores) {
607				delete $scores->{$docId} and next if not defined $sc->{$docId};
608				addToScore $scores->{$docId}, $sc->{$docId}; # otherwise
609				}
610				}
611
612				my $noMandatorySubq = not $scores;
613
614				# 2) deal with non-mandatory subqueries
615
616				foreach my $subQ (@{$q->{''}}) {
617				my $sc = $self->docsAndScores($subQ) or next;
618				$scores = $sc and next if not $scores; # if first result set, just store
619
620				# otherwise, combine with previous result set
621				foreach my $docId (keys %$sc) {
622				if (defined $scores->{$docId}) { # docId was already there, add new score
623				addToScore $scores->{$docId}, $sc->{$docId};
624				}
625				elsif ($noMandatorySubq){ # insert a new docId to the result set
626				$scores->{$docId} = $sc->{$docId};
627				}
628				# else do nothing (ignore this docId)
629				}
630				}
631
632				return undef if not $scores or not %$scores; # no results
633
634				# 3) deal with negative subqueries (remove corresponding docs from results)
635
636				foreach my $subQ (@{$q->{'-'}}) {
637				my $negScores = $self->docsAndScores($subQ) or next;
638				delete $scores->{$_} foreach keys %$negScores;
639				}
640
641				return $scores;
642				}
643
644
645
646				sub docsAndScores { # returns a hash {docId => score} or undef (no info)
647				my ($self, $subQ) = @_;
648
649				# recursive call to _search if $subQ is a parenthesized query
650				return $self->_search($subQ->{value}) if $subQ->{op} eq '()';
651
652				# otherwise, don't care about $subQ->{op} (assert $subQ->{op} eq ':')
653
654				if (ref $subQ->{value}) { # several words, this is an "exact phrase"
655				return $self->matchExactPhrase($subQ);
656				}
657				elsif ($subQ->{value} <= -1) {# this is a stopword
658				return undef;
659				}
660				else { # scalar value, match single word
661				my $scores = {unpack IXDPACK_L, ($self->{ixd}{$subQ->{value}} \|\| "")};
662				my @k = keys %$scores;
663				if (@k) {
664				my $coeff = log(($self->{ixd}{NDOCS} + 1)/@k) * 100;
665				$scores->{$_} = int($coeff * $scores->{$_}) foreach @k;
666				}
667				return $scores;
668				}
669				}
670
671
672				sub matchExactPhrase {
673				my ($self, $subQ) = @_;
674
675				if (! $self->{ixp}) { # if not indexed with positions
676				# translate into an AND query
677				my $fake_query = {'+' => [map {{op => ':',
678				value => $_ }} @{$subQ->{value}}]};
679				# and search for that one
680				return $self->_search($fake_query);
681				};
682
683				# otherwise, intersect word position sets
684				my %pos;
685				my $wordDelta = 0;
686				my $scores = undef;
687				foreach my $wordId (@{$subQ->{value}}) {
688				my $sc = $self->docsAndScores({op=>':', value=>$wordId});
689				if (not $scores) { # no previous result set
690				if ($sc) {
691				$scores = $sc;
692				foreach my $docId (keys %$scores) {
693				my $ixpKey = pack IXPKEYPACK, $docId, $wordId;
694				$pos{$docId} = [unpack IXPPACK, $self->{ixp}{$ixpKey}];
695				}
696				}
697				}
698				else { # combine with previous result set
699				$wordDelta++;
700				foreach my $docId (keys %$scores) {
701				if ($sc) { # if we have info about current word (is not a stopword)
702				if (not defined $sc->{$docId}) { # current word not in current doc
703				delete $scores->{$docId};
704				} else { # current word found in current doc, check if positions match
705				my $ixpKey = pack IXPKEYPACK, $docId, $wordId;
706				my @newPos = unpack IXPPACK, $self->{ixp}{$ixpKey};
707				$pos{$docId} = nearPositions($pos{$docId}, \@newPos, $wordDelta)
708				and addToScore $scores->{$docId}, $sc->{$docId}
709				or delete $scores->{$docId};
710				}
711				}
712				} # end foreach my $docId (keys %$scores)
713				}
714				} # end foreach my $wordId (@{$subQ->{value}})
715
716				return $scores;
717				}
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735				sub nearPositions {
736				my ($set1, $set2, $wordDelta) = @_;
737				# returns the set of positions in $set2 which are "close enough" (<= $wordDelta)
738				# to positions in $set1. Assumption : input sets are sorted.
739
740
741				my @result;
742				my ($i1, $i2) = (0, 0); # indices into sets
743
744				while ($i1 < @$set1 and $i2 < @$set2) {
745				my $delta = $set2->[$i2] - $set1->[$i1];
746				++$i1 and next if $delta > $wordDelta;
747				push @result, $set2->[$i2] if $delta > 0;
748				++$i2;
749				}
750				return @result ? \@result : undef;
751				}
752
753
754
755				sub addToScore (\$$) { # first score arg gets "incremented" by the second arg
756				my ($ptScore1, $score2) = @_;
757				$$ptScore1 = 0 if not defined $$ptScore1;
758				$$ptScore1 += $score2 if $score2; # TODO : find better formula for score combination !
759				}
760
761
762				sub translateQuery { # replace words by ids, remove irrelevant subqueries
763				my ($self, $q, $killedWords, $wordsRegexes) = @_;
764
765				my $r = {};
766
767				foreach my $k ('+', '-', '') {
768				foreach my $subQ (@{$q->{$k}}) {
769
770				# ignore items concerning other field names
771				next if $subQ->{field} and $subQ->{field} ne $self->{fieldname};
772
773				my $val = $subQ->{value};
774
775				my $clone = undef;
776				if ($subQ->{op} eq '()') {
777				$clone = {op => '()',
778				value => $self->translateQuery($val, $killedWords, $wordsRegexes)};
779				}
780				elsif ($subQ->{op} eq ':') {
781				# split query according to our notion of "term"
782				my @words = ($val =~ /$self->{wregex}/g);
783
784				# TODO : 1) accept '*' suffix; 2) find keys in $self->{ixw}; 3) rewrite into
785				# an 'OR' query
786
787				# my @words = ($str =~ /$self->{wregex}\*?/g);
788
789				my $regex1 = join "\\W+", map quotemeta, @words;
790				my $regex2 = join "\\W+", map quotemeta,
791				map {$self->{wfilter}($_)} @words;
792				foreach my $regex ($regex1, $regex2) {
793				$regex = "\\b$regex" if $regex =~ /^\w/;
794				$regex = "$regex\\b" if $regex =~ /\w$/;
795				}
796				push @$wordsRegexes, $regex1;
797				push @$wordsRegexes, $regex2 unless $regex1 eq $regex2;
798
799				# now translate into word ids
800				foreach my $word (@words) {
801				my $wf = $self->{wfilter}->($word);
802				my $wordId = $wf ? ($self->{ixw}{$wf} \|\| 0) : -1;
803				$killedWords->{$word} = 1 if $wordId < 0;
804				$word = $wordId;
805				}
806
807				$val = (@words>1) ? \@words : # several words : return an array
808				(@words>0) ? $words[0] : # just one word : return its id
809				0; # no word : return 0 (means "no info")
810
811				$clone = {op => ':', value=> $val};
812				}
813				push @{$r->{$k}}, $clone if $clone;
814				}
815				}
816
817				return $r;
818				}
819
820
821
822				=item C
823
824				Searches C for occurrences of C,
825				extracts the occurences together with some context
826				(a number of characters to the left and to the right),
827				and highlights the occurences. See parameters C,
828				C, C, C of the L method.
829
830				=cut
831
832				sub excerpts {
833				my $self = shift;
834				# $_[0] : text buffer ; no copy for efficiency reason
835				my $regex = $_[1];
836
837				my $nc = $self->{ctxtNumChars};
838
839				# find start and end positions of matching fragments
840				my $matches = []; # array of refs to [start, end, number_of_matches]
841				while ($_[0] =~ /$regex/g) {
842				my ($start, $end) = ($-[0], $+[0]);
843				if (@$matches and $start <= $matches->[-1][1] + $nc) {
844				# merge with the last fragment if close enough
845				$matches->[-1][1] = $end; # extend the end position
846				$matches->[-1][2] += 1; # increment the number of matches
847				}
848				else {
849				push @$matches, [$start, $end, 1];
850				}
851				}
852
853				foreach (@$matches) { # extend start and end positions by $self->{ctxtNumChars}
854				$_->[0] = ($_->[0] < $nc) ? 0 : $_->[0] - $nc;
855				$_->[1] += $nc;
856				}
857
858				my $excerpts = [];
859				foreach my $match (sort {$b->[2] <=> $a->[2]} @$matches) {
860				last if @$excerpts >= $self->{maxExcerpts};
861				my $x = substr($_[0], $match->[0], $match->[1] - $match->[0]); # extract
862				$x =~ s/$regex/$self->{preMatch}$&$self->{postMatch}/g ; # highlight
863				push @$excerpts, "...$x...";
864				}
865				return $excerpts;
866				}
867
868				=back
869
870				=head1 TO DO
871
872				=over
873
874				=item *
875
876				Find a proper formula for combining scores from several terms.
877				Current implementation is ridiculously simple-minded (just an addition).
878				Also study the literature to improve the scoring formula.
879
880				=item *
881
882				Handle concurrency through BerkeleyDB locks.
883
884				=item *
885
886				Maybe put all 3 index files as subDatabases in one single file.
887
888				=item *
889
890				Fine tuning of cachesize and other BerkeleyDB parameters.
891
892				=item *
893
894				Compare performances with other packages.
895
896				=item *
897
898				More functionalities : add NEAR operator and boost factors.
899
900				=back
901
902
903
904				=head1 SEE ALSO
905
906				L is nice and compact, but
907				limited in functionality (no +/- prefixes, no "exact phrase" search,
908				no parentheses).
909
910				L is a Perl port of the Java I search engine.
911				Plucene has probably every feature you will ever need, but requires
912				quite an investment to install and learn (more than 60 classes,
913				dependencies on lots of external modules).
914				I haven't done any benchmarks yet to compare performance.
915
916				L is a more recent, more sophisticated search engine,
917				which looks very powerful and should be probably faster and definitely
918				more scalable than C; but also with a less compact
919				API. I haven't performed any detailed comparison yet.
920
921
922				=cut
923
924
925				1;