File Coverage

blib/lib/Search/Indexer.pm
Criterion    Covered  Total      %
statement         10     12   83.3
branch           n/a
condition        n/a
subroutine         4      4  100.0
pod              n/a
total             14     16   87.5


line stmt bran cond sub pod time code
1             package Search::Indexer;
2              
3 2     2   74048 use strict;
  2         5  
  2         83  
4 2     2   36 use warnings;
  2         3  
  2         63  
5             # no warnings 'uninitialized'; ## CHECK IF NEEDED OR NOT
6 2     2   11 use Carp;
  2         8  
  2         440  
7 2     2   2499 use BerkeleyDB;
  0            
  0            
8             use locale;
9             use Search::QueryParser;
10             use List::MoreUtils qw/uniq/;
11              
12             # TODO : experiment with bit vectors (cf vec() and pack "b*" for combining
13             # result sets
14              
15             our $VERSION = "0.77";
16              
17             =head1 NAME
18              
19             Search::Indexer - full-text indexer
20              
21             =head1 SYNOPSIS
22              
23             use Search::Indexer;
24             my $ix = new Search::Indexer(dir => $dir, writeMode => 1);
25             foreach my $docId (keys %docs) {
26             $ix->add($docId, $docs{$docId});
27             }
28              
29             my $result = $ix->search('+word -excludedWord +"exact phrase"');
30             my @docIds = keys %{$result->{scores}};
31             my $killedWords = join ", ", @{$result->{killedWords}};
32             print scalar(@docIds), " documents found\n";
33             print "words $killedWords were ignored during the search\n" if $killedWords;
34             foreach my $docId (@docIds) {
35             my $score = $result->{scores}{$docId};
36             my $excerpts = join "\n", $ix->excerpts($docs{$docId}, $result->{regex});
37             print "DOCUMENT $docId, score $score:\n$excerpts\n\n";
38             }
39              
40             my $result2 = $ix->search('word1 AND (word2 OR word3) AND NOT word4');
41              
42             $ix->remove($someDocId);
43              
44             =head1 DESCRIPTION
45              
46             This module provides support for indexing a collection of documents,
47             for searching the collection, and displaying the sorted results,
48             together with contextual excerpts of the original document.
49              
50             =head2 Documents
51              
52             As far as this module is concerned, a I<document> is just a buffer of
53             plain text, together with a unique identifying number. The caller is
54             responsible for supplying unique numbers, and for converting the
55             original source (HTML, PDF, whatever) into plain text. Documents could
56             also contain more information (other fields like date, author, Dublin
57             Core, etc.), but this must be handled externally, in a database or any
58             other store. A candidate for storing metadata about documents
59             could be L<File::Tabular>, which uses the same
60             query parser.
61              
62             =head2 Search syntax
63              
64             Searching requests may include plain terms, "exact phrases",
65             '+' or '-' prefixes, boolean operators and parentheses.
66             See L<Search::QueryParser> for details.
67              
68             =head2 Index files
69              
70             The indexer uses three files in BerkeleyDB format : a) a mapping from
71             words to wordIds; b) a mapping from wordIds to lists of documents ; c)
72             a mapping from pairs (docId, wordId) to lists of positions within the
73             document. This third file holds detailed information and therefore is
74             quite big ; but it allows us to quickly retrieve "exact phrases"
75             (sequences of adjacent words) in the document.
76              
77             =head2 Indexing steps
78              
79             Indexing of a document buffer goes through the following
80             steps :
81              
82             =over
83              
84             =item *
85              
86             terms are extracted, according to the I regular expression
87              
88             =item *
89              
90             extracted terms are normalized or filtered out
91             by the I callback function. This function can for example
92             remove accented characters, perform lemmatization, suppress
93             irrelevant terms (such as numbers), etc.
94              
95             =item *
96              
97             normalized terms are eliminated if they belong to
98             the I list (list of common words to exclude from the index).
99              
100             =item *
101              
102             remaining terms are stored, together with the positions where they
103             occur in the document.
104              
105             =back
106              
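The four steps above can be sketched in plain Perl. This is a simplified, self-contained illustration; the regex, filter sub and stopword list below are hypothetical stand-ins for the indexer's real configuration, and words are used directly as keys instead of numeric word ids:

```perl
use strict;
use warnings;

# hypothetical configuration, mirroring the documented defaults
my $wregex    = qr/\w+/;
my $wfilter   = sub { lc $_[0] };            # normalize : lowercase
my %stopwords = map { $_ => 1 } qw(the a of);

sub index_buffer {
    my ($buf) = @_;
    my %positions;                           # word => list of positions
    my $pos = 0;
    while ($buf =~ /$wregex/g) {
        $pos++;
        my $word = $wfilter->($&) or next;   # step 2 : normalize / filter out
        next if $stopwords{$word};           # step 3 : eliminate stopwords
        push @{ $positions{$word} }, $pos;   # step 4 : store with positions
    }
    return \%positions;
}

my $p = index_buffer("The quick fox saw the quick dog");
# "quick" occurs at positions 2 and 6 ; "the" is dropped as a stopword
```

The real module additionally maps each word to a numeric wordId and persists the positions in BerkeleyDB files, as described in the L</"Index files"> section.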
107             =head2 Limits
108              
109             All ids are stored as unsigned 32-bit integers; therefore there is
110             a limit of 4294967295 to the number of documents or to the number of
111             different words.
112              
113             =head2 Related modules
114              
115             A short comparison with other CPAN indexing modules is
116             given in the L</"SEE ALSO"> section.
117              
118             This module depends on L<Search::QueryParser> for analyzing requests and
119             on L<BerkeleyDB> for storing the indexes.
120              
121             This module was designed together with L<File::Tabular>.
122              
123             =cut
124              
125              
126             sub addToScore (\$$);
127              
128             use constant {
129              
130             # max size of various ids
131             MAX_DOC_ID => 0xFFFFFFFF, # unsigned long (32 bits)
132             MAX_POS_ID => 0xFFFFFFFF, # position_id
133              
134             # encodings for pack/unpack
135             IXDPACK => 'wC', # docId : compressed int; freq : unsigned char
136             IXDPACK_L => '(wC)*', # list of above
137             IXPPACK => 'w*', # word positions : list of compressed ints
138             IXPKEYPACK => 'ww', # key for ixp : (docId, wordId)
139              
140             WRITECACHESIZE => (1 << 24), # arbitrary big value; seems good enough but needs tuning
141              
142             # default values for args to new()
143             DEFAULT => {
144             writeMode => 0,
145             wregex => qr/\w+/,
146             wfilter => sub { # default filter : lowercase and no accents
147             my $word = lc($_[0]);
148             $word =~ tr[çáàâäéèêëíìîïóòôöúùûüýÿ][caaaaeeeeiiiioooouuuuyy];
149             return $word;
150             },
151             fieldname => '',
152              
153             ctxtNumChars => 35,
154             maxExcerpts => 5,
155             preMatch => "",
156             postMatch => "",
157             positions => 1,
158             }
159             };
160              
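The C<'w'> pack template used in the constants above is Perl's BER compressed-integer format: small values occupy a single byte, which keeps posting lists compact; C<'C'> is one unsigned char, which is why occurrence counts are capped at 255. A small illustration with made-up ids:

```perl
use strict;
use warnings;

# IXDPACK entry : (docId as BER compressed int, frequency as unsigned char)
my $entry = pack 'wC', 42, 7;
my ($docId, $freq) = unpack 'wC', $entry;
# a small docId compresses to a single byte, so length($entry) == 2

# IXDPACK_L : a whole posting list for one word, unpacked into a hash
my $list = pack '(wC)*', 1, 3, 2, 5, 300, 1;
my %docs = unpack '(wC)*', $list;   # docId => frequency
```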
161             =head1 METHODS
162              
163             =over
164              
165             =item C<< new(arg1 => expr1, ...) >>
166              
167             Creates an indexer (either for a new index, or for
168             accessing an existing index). Parameters are :
169              
170             =over
171              
172             =item dir
173              
174             Directory for index files, and possibly for the stopwords file.
175             Default is the current directory.
176              
177             =item writeMode
178              
179             Give a true value if you intend to write into the index.
180              
181             =item wregex
182              
183             Regex for matching a word (C<qr/\w+/> by default).
184             Will affect both the L</add> and L</search> methods.
185             This regex should not contain any capturing parentheses
186             (use non-capturing parentheses C<< (?: ... ) >> instead).
187              
188             =item wfilter
189              
190             Ref to a callback sub that may normalize or eliminate a word. Will
191             affect both the L</add> and L</search> methods. The default wfilter
192             translates words to lower case and translates latin1 (iso-8859-1)
193             accented characters into plain characters.
194              
195             =item stopwords
196              
197             List of words that will be marked into the index as "words to exclude".
198             This should usually occur when creating a new index ; but nothing prevents
199             you from adding other stopwords later. Since stopwords are stored in the
200             index, they need not be specified when opening an index for searches or
201             updates.
202              
203             The list may be supplied either as a ref to an array of scalars, or
204             as the name of a file containing the stopwords (full pathname
205             or filename relative to I<dir>).
206              
207              
208             =item fieldname
209              
210             Will only affect the L</search> method.
211             Search queries are passed to a general parser
212             (see L<Search::QueryParser>).
213             Then, before being applied to the present indexer module,
214             queries are pruned of irrelevant items.
215             Query items are considered relevant if they have no
216             associated field name, or if the associated field name is
217             equal to this C.
218              
219             =back
220              
221             Below are some additional parameters that only affect the
222             L</excerpts> method.
223              
224             =over
225              
226             =item ctxtNumChars
227              
228             Number of characters determining the size of contextual excerpts
229             returned by the L</excerpts> method.
230             A I<contextual excerpt> is a part of the document text,
231             containing a matched word surrounded by I<ctxtNumChars> characters
232             to the left and to the right. Default is 35.
233              
234              
235             =item maxExcerpts
236              
237             Maximum number of contextual excerpts to retrieve per document.
238             Default is 5.
239              
240             =item preMatch
241              
242             String to insert in contextual excerpts before a matched word.
243             Default is C<"E<lt>bE<gt>">.
244              
245             =item postMatch
246              
247             String to insert in contextual excerpts after a matched word.
248             Default is C<"E<lt>/bE<gt>">.
249              
250              
251             =item positions
252              
253             my $indexer = new Search::Indexer(dir => $dir,
254             writeMode => 1,
255             positions => 0);
256              
257             Truth value to tell whether or not, when creating a new index,
258             word positions should be stored. The default is true.
259              
260             If you turn it off, index files will be much smaller, indexing
261             will be faster, but results will be less precise,
262             because the indexer can no longer find "exact phrases".
263             So if you type C<"quick fox jumped">, the query will be
264             translated into C<quick AND fox AND jumped>, and therefore
265             will retrieve documents in which those three words are present, but
266             not necessarily in order.
267              
268             Another consequence of C<< positions => 0 >> is that
269             there will be no automatic check of uniqueness of ids
270             when adding documents into the index.
271              
272             =back
273              
274             =cut
275              
276             sub new {
277             my $class = shift;
278             my $args = ref $_[0] eq 'HASH' ? $_[0] : {@_};
279              
280             # parse options
281             my $self = {};
282             $self->{$_} = exists $args->{$_} ? delete $args->{$_} : DEFAULT->{$_}
283             foreach qw(writeMode wregex wfilter fieldname
284             ctxtNumChars maxExcerpts preMatch postMatch positions);
285             my $dir = delete $args->{dir} || ".";
286             $dir =~ s{[/\\]$}{}; # remove trailing slash
287             my $stopwords = delete $args->{stopwords};
288              
289             # check if invalid options
290             my @remaining = keys %$args;
291             croak "unexpected option : $remaining[0]" if @remaining;
292             croak "can't add 'positions' after index creation time"
293             if $self->{writeMode} and $self->{positions}
294             and -f "$dir/ixd.bdb" and not -f "$dir/ixp.bdb";
295              
296             # BerkeleyDB environment should allow us to do proper locking for
297             # concurrent access ; but seems to be incompatible with the
298             # -Cachesize argument, so I commented it out ... need to learn more about
299             # BerkeleyDB ...
300             # my $dbEnv = new BerkeleyDB::Env
301             # -Home => $dir,
302             # -Flags => DB_INIT_CDB | DB_INIT_MPOOL | DB_CDB_ALLDB |
303             # ($self->{writeMode} ? DB_CREATE : 0),
304             # -Verbose => 1
305             # or croak "new BerkeleyDB::Env : $^E $BerkeleyDB::Error" ;
306              
307              
308             my @bdb_args = (# -Env => $dbEnv, # commented out, see explanation above
309             -Flags => ($self->{writeMode} ? DB_CREATE : DB_RDONLY),
310             ($self->{writeMode} ? (-Cachesize => WRITECACHESIZE) : ()));
311              
312             # 3 index files :
313             # ixw : word => wordId (or -1 for stopwords)
314             $self->{ixwDb} = tie %{$self->{ixw}},
315             'BerkeleyDB::Btree',
316             -Filename => "$dir/ixw.bdb", @bdb_args
317             or croak "open $dir/ixw.bdb : $^E $BerkeleyDB::Error";
318              
319             # ixd : wordId => list of (docId, nOccur)
320             $self->{ixdDb} = tie %{$self->{ixd}},
321             'BerkeleyDB::Hash',
322             -Filename => "$dir/ixd.bdb", @bdb_args
323             or croak "open $dir/ixd.bdb : $^E $BerkeleyDB::Error";
324              
325             if (-f "$dir/ixp.bdb" || $self->{writeMode} && $self->{positions}) {
326             # ixp : (docId, wordId) => list of positions of word in doc
327             $self->{ixpDb} = tie %{$self->{ixp}},
328             'BerkeleyDB::Btree',
329             -Filename => "$dir/ixp.bdb", @bdb_args
330             or croak "open $dir/ixp.bdb : $^E $BerkeleyDB::Error";
331             }
332              
333              
334             # optional list of stopwords may be given as a list or as a filename
335             if ($stopwords) {
336             $self->{writeMode} or croak "must be in writeMode to specify stopwords";
337             if (not ref $stopwords) { # if scalar, name of stopwords file
338             open TMP, $stopwords or
339             (open TMP, "$dir/$stopwords") or
340             croak "open stopwords file $stopwords : $^E ";
341             local $/ = undef;
342             my $buf = <TMP>;
343             $stopwords = [$buf =~ /$self->{wregex}/g];
344             close TMP;
345             }
346             foreach my $word (@$stopwords) {
347             $self->{ixw}{$word} = -1;
348             }
349             }
350              
351             bless $self, $class;
352             }
353              
354              
355              
356              
357              
358             =item C<add(docId, buf)>
359              
360             Add a new document to the index.
361             I is the unique identifier for this doc
362             (the caller is responsible for uniqueness).
363             I is a scalar containing the text representation of this doc.
364              
365             =cut
366              
367             sub add {
368             my $self = shift;
369             my $docId = shift;
370             # my $buf = shift; # using $_[0] instead for efficiency reasons
371              
372             croak "docId $docId is too large" if $docId > MAX_DOC_ID;
373              
374             # first check if this docId is already used
375             if ($self->{ixp}) { # can only check if we have the positions index
376             my $c = $self->{ixpDb}->db_cursor;
377             my $k = pack IXPKEYPACK, $docId, 0;
378             my $v; # not used, but needed by c_get()
379             my $status = $c->c_get($k, $v, DB_SET_RANGE);
380             if ($status == 0) {
381             my ($check, $wordId) = unpack IXPKEYPACK, $k;
382             croak "docId $docId is already used (wordId=$wordId)"
383             if $docId == $check;
384             }
385             }
386              
387             # OK, let's extract words from the $_[0] buffer
388             my %positions;
389             for (my $nwords = 1; $_[0] =~ /$self->{wregex}/g; $nwords++) {
390              
391             my $word = $self->{wfilter}->($&) or next;
392             my $wordId = $self->{ixw}{$word} ||
393             ($self->{ixw}{$word} = ++$self->{ixw}{_NWORDS}); # create new wordId
394             push @{$positions{$wordId}}, $nwords if $wordId > 0;
395             }
396              
397             foreach my $wordId (keys %positions) {
398             my $occurrences = @{$positions{$wordId}};
399             $occurrences = 255 if $occurrences > 255;
400              
401             $self->{ixd}{$wordId} .= pack(IXDPACK, $docId, $occurrences);
402             if ($self->{ixp}) {
403             my $ixpKey = pack IXPKEYPACK, $docId, $wordId;
404             $self->{ixp}{$ixpKey} = pack(IXPPACK, @{$positions{$wordId}});
405             }
406             }
407              
408             $self->{ixd}{NDOCS} = 0 if not defined $self->{ixd}{NDOCS};
409             $self->{ixd}{NDOCS} += 1;
410             }
411              
412              
413             =item C<remove(docId [, buf])>
414              
415             Removes a document from the index.
416             If the index contains word positions (true by default), then
417             only the C is needed; however, if the index was created
418             without word positions, then the text representation
419             of the document must be given as a scalar string in the second argument
420             (of course this should be the same as the one that was supplied
421             when calling the L</add> method).
422              
423              
424             =cut
425              
426             sub remove {
427             my $self = shift;
428             my $docId = shift;
429             # my $buf = shift; # using $_[0] instead for efficiency reasons
430              
431             my $wordIds;
432              
433             if ($self->{ixp}) { # if using word positions
434             not $_[0] or carp "remove() : unexpected 'buf' argument";
435             $wordIds= $self->wordIds($docId);
436             }
437             else { # otherwise : recompute word ids
438             $wordIds = [grep {defined $_ and $_ > 0}
439             map {$self->{ixw}{$_}}
440             uniq map {$self->{wfilter}->($_)}
441             ($_[0] =~ /$self->{wregex}/g)];
442             }
443              
444             return if not @$wordIds;
445              
446             foreach my $wordId (@$wordIds) {
447             my %docs = unpack IXDPACK_L, $self->{ixd}{$wordId};
448             delete $docs{$docId};
449             $self->{ixd}{$wordId} = pack IXDPACK_L, %docs;
450             if ($self->{ixp}) {
451             my $ixpKey = pack IXPKEYPACK, $docId, $wordId;
452             delete $self->{ixp}{$ixpKey};
453             }
454             }
455              
456             $self->{ixd}{NDOCS} -= 1;
457             }
458              
459             =item C<wordIds(docId)>
460              
461             Returns a ref to an array of word Ids contained in the specified document
462             (not available if the index was created with C<< positions => 0 >>)
463              
464             =cut
465              
466             sub wordIds {
467             my $self = shift;
468             my $docId_ini = shift;
469              
470             $self->{ixpDb}
471             or croak "wordIds() not available (index was created with positions=>0)";
472              
473             my @wordIds = ();
474             my $c = $self->{ixpDb}->db_cursor;
475             my ($k, $v);
476             $k = pack IXPKEYPACK, $docId_ini, 0;
477             my $status = $c->c_get($k, $v, DB_SET_RANGE);
478             while ($status == 0) {
479             my ($docId, $wordId) = unpack IXPKEYPACK, $k;
480             last if $docId != $docId_ini;
481             push @wordIds, $wordId;
482             $status = $c->c_get($k, $v, DB_NEXT);
483             }
484             return \@wordIds;
485             }
486              
487              
488             =item C<words(prefix)>
489              
490             Returns a ref to an array of words found in the dictionary,
491             starting with prefix (i.e. C<< $ix->words("foo") >> will
492             return "foo", "food", "fool", "footage", etc.).
493              
494             =cut
495              
496             sub words {
497             my $self = shift;
498             my $prefix = shift;
499              
500             my $regex = qr/^$prefix/;
501             my @words = ();
502             my $c = $self->{ixwDb}->db_cursor;
503             my ($k, $v);
504             $k = $prefix;
505             my $status = $c->c_get($k, $v, DB_SET_RANGE);
506             while ($status == 0) {
507             last if $k !~ $regex;
508             push @words, $k;
509             $status = $c->c_get($k, $v, DB_NEXT);
510             }
511             return \@words;
512             }
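The cursor logic in C<words> amounts to a range scan over sorted keys: DB_SET_RANGE positions the cursor at the first key greater than or equal to the prefix, then the loop advances while keys still match. An in-memory analog of the same logic over a sorted word list (the word list is hypothetical):

```perl
use strict;
use warnings;

my @dict   = sort qw(foo food fool footage bar);  # Btree keys are sorted
my $prefix = "foo";
my @match;
for my $w (@dict) {
    next if $w lt $prefix;           # emulate DB_SET_RANGE positioning
    last if $w !~ /^\Q$prefix\E/;    # past the prefix range : stop
    push @match, $w;
}
# @match now holds foo, food, fool, footage
```

Because the keys are sorted, the scan can stop at the first non-matching key instead of examining the whole dictionary, which is exactly what the BerkeleyDB cursor version exploits.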
513              
514              
515              
516             =item C<dump()>
517              
518             Debugging function, prints indexed words with list of associated docs.
519              
520             =cut
521              
522             sub dump {
523             my $self = shift;
524             foreach my $word (sort keys %{$self->{ixw}}) {
525             my $wordId = $self->{ixw}{$word};
526             if ($wordId == -1) {
527             print "$word : STOPWORD\n";
528             }
529             else {
530             my %docs = unpack IXDPACK_L, $self->{ixd}{$wordId};
531             print "$word : ", join (" ", keys %docs), "\n";
532             }
533             }
534             }
535              
536              
537             =item C<search(queryString, implicitPlus)>
538              
539             Searches the index. See the L</SYNOPSIS> and L</"Search syntax"> sections
540             above for short descriptions of query strings, or
541             L<Search::QueryParser> for details. The second argument is optional ;
542             if true, all words without any prefix will implicitly take prefix '+'
543             (mandatory words).
544              
545             The return value is a hash ref containing
546              
547             =over
548              
549             =item scores
550              
551             hash ref, where keys are docIds of matching documents, and values are
552             the corresponding computed scores.
553              
554             =item killedWords
555              
556             ref to an array of terms from the query string which were ignored
557             during the search (because they were filtered out or were stopwords)
558              
559             =item regex
560              
561             ref to a regular expression corresponding to all terms in the query
562             string. This will be useful if you later want to get contextual
563             excerpts from the found documents (see the L</excerpts> method).
564              
565             =back
566              
567             =cut
568              
569              
570             sub search {
571             my $self = shift;
572             my $query_string = shift;
573             my $implicitPlus = shift;
574              
575             $self->{qp} ||= new Search::QueryParser;
576              
577             my $q = $self->{qp}->parse($query_string, $implicitPlus);
578             my $killedWords = {};
579             my $wordsRegexes = [];
580              
581             my $qt = $self->translateQuery($q, $killedWords, $wordsRegexes);
582              
583             my $tmp = {};
584             $tmp->{$_} = 1 foreach @$wordsRegexes;
585             my $strRegex = "(?:" . join("|", keys %$tmp) . ")";
586              
587             return {scores => $self->_search($qt),
588             killedWords => [keys %$killedWords],
589             regex => qr/$strRegex/i};
590             }
591              
592              
593             sub _search {
594             my ($self, $q) = @_;
595              
596             my $scores = undef; # hash {doc1 => score1, doc2 => score2 ...}
597              
598             # 1) deal with mandatory subqueries
599              
600             foreach my $subQ ( @{$q->{'+'}} ) {
601             my $sc = $self->docsAndScores($subQ) or next;
602             $scores = $sc and next if not $scores; # if first result set, just store
603              
604             # otherwise, intersect with previous result set
605             foreach my $docId (keys %$scores) {
606             delete $scores->{$docId} and next if not defined $sc->{$docId};
607             addToScore $scores->{$docId}, $sc->{$docId}; # otherwise
608             }
609             }
610              
611             my $noMandatorySubq = not $scores;
612              
613             # 2) deal with non-mandatory subqueries
614              
615             foreach my $subQ (@{$q->{''}}) {
616             my $sc = $self->docsAndScores($subQ) or next;
617             $scores = $sc and next if not $scores; # if first result set, just store
618              
619             # otherwise, combine with previous result set
620             foreach my $docId (keys %$sc) {
621             if (defined $scores->{$docId}) { # docId was already there, add new score
622             addToScore $scores->{$docId}, $sc->{$docId};
623             }
624             elsif ($noMandatorySubq){ # insert a new docId to the result set
625             $scores->{$docId} = $sc->{$docId};
626             }
627             # else do nothing (ignore this docId)
628             }
629             }
630              
631             return undef if not $scores or not %$scores; # no results
632              
633             # 3) deal with negative subqueries (remove corresponding docs from results)
634              
635             foreach my $subQ (@{$q->{'-'}}) {
636             my $negScores = $self->docsAndScores($subQ) or next;
637             delete $scores->{$_} foreach keys %$negScores;
638             }
639              
640             return $scores;
641             }
642              
643              
644              
645             sub docsAndScores { # returns a hash {docId => score} or undef (no info)
646             my ($self, $subQ) = @_;
647              
648             # recursive call to _search if $subQ is a parenthesized query
649             return $self->_search($subQ->{value}) if $subQ->{op} eq '()';
650              
651             # otherwise, don't care about $subQ->{op} (assert $subQ->{op} eq ':')
652              
653             if (ref $subQ->{value}) { # several words, this is an "exact phrase"
654             return $self->matchExactPhrase($subQ);
655             }
656             elsif ($subQ->{value} <= -1) {# this is a stopword
657             return undef;
658             }
659             else { # scalar value, match single word
660             my $scores = {unpack IXDPACK_L, ($self->{ixd}{$subQ->{value}} || "")};
661             my @k = keys %$scores;
662             if (@k) {
663             my $coeff = log(($self->{ixd}{NDOCS} + 1)/@k) * 100;
664             $scores->{$_} = int($coeff * $scores->{$_}) foreach @k;
665             }
666             return $scores;
667             }
668             }
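The coefficient computed in the single-word branch above is an inverse-document-frequency weight, log((NDOCS + 1) / df) * 100, multiplied by the raw occurrence count of the word in each document. A worked example with hypothetical numbers:

```perl
use strict;
use warnings;

my $NDOCS = 99;                      # hypothetical collection size
my %raw   = (10 => 2, 11 => 5);      # docId => occurrences of one word

my @k     = keys %raw;               # the word appears in 2 documents
my $coeff = log(($NDOCS + 1) / @k) * 100;   # log(100/2) * 100, about 391
my %scores = map { $_ => int($coeff * $raw{$_}) } @k;
# doc 10 scores int(391.20 * 2) = 782, doc 11 scores int(391.20 * 5) = 1956
```

A word present in few documents gets a large coefficient, so rare words dominate the combined score; a word present in every document scores close to zero.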
669              
670              
671             sub matchExactPhrase {
672             my ($self, $subQ) = @_;
673              
674             if (! $self->{ixp}) { # if not indexed with positions
675             # translate into an AND query
676             my $fake_query = {'+' => [map {{op => ':',
677             value => $_ }} @{$subQ->{value}}]};
678             # and search for that one
679             return $self->_search($fake_query);
680             };
681              
682             # otherwise, intersect word position sets
683             my %pos;
684             my $wordDelta = 0;
685             my $scores = undef;
686             foreach my $wordId (@{$subQ->{value}}) {
687             my $sc = $self->docsAndScores({op=>':', value=>$wordId});
688             if (not $scores) { # no previous result set
689             if ($sc) {
690             $scores = $sc;
691             foreach my $docId (keys %$scores) {
692             my $ixpKey = pack IXPKEYPACK, $docId, $wordId;
693             $pos{$docId} = [unpack IXPPACK, $self->{ixp}{$ixpKey}];
694             }
695             }
696             }
697             else { # combine with previous result set
698             $wordDelta++;
699             foreach my $docId (keys %$scores) {
700             if ($sc) { # if we have info about current word (is not a stopword)
701             if (not defined $sc->{$docId}) { # current word not in current doc
702             delete $scores->{$docId};
703             } else { # current word found in current doc, check if positions match
704             my $ixpKey = pack IXPKEYPACK, $docId, $wordId;
705             my @newPos = unpack IXPPACK, $self->{ixp}{$ixpKey};
706             $pos{$docId} = nearPositions($pos{$docId}, \@newPos, $wordDelta)
707             and addToScore $scores->{$docId}, $sc->{$docId}
708             or delete $scores->{$docId};
709             }
710             }
711             } # end foreach my $docId (keys %$scores)
712             }
713             } # end foreach my $wordId (@{$subQ->{value}})
714              
715             return $scores;
716             }
717              
718              
719              
720              
721              
722              
723              
724              
725              
726              
727              
728              
729              
730              
731              
732              
733              
734             sub nearPositions {
735             my ($set1, $set2, $wordDelta) = @_;
736             # returns the set of positions in $set2 which are "close enough" (<= $wordDelta)
737             # to positions in $set1. Assumption : input sets are sorted.
738              
739              
740             my @result;
741             my ($i1, $i2) = (0, 0); # indices into sets
742              
743             while ($i1 < @$set1 and $i2 < @$set2) {
744             my $delta = $set2->[$i2] - $set1->[$i1];
745             ++$i1 and next if $delta > $wordDelta;
746             push @result, $set2->[$i2] if $delta > 0;
747             ++$i2;
748             }
749             return @result ? \@result : undef;
750             }
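To illustrate the merge: suppose the previous word of the phrase occurs at positions 3 and 17, and the current word at 4, 9 and 18. With a wordDelta of 1, only current-word occurrences immediately following a previous-word position survive. A standalone copy of the routine (positions are made up):

```perl
use strict;
use warnings;

# standalone copy of nearPositions, for illustration
sub near_positions {
    my ($set1, $set2, $wordDelta) = @_;
    my @result;
    my ($i1, $i2) = (0, 0);              # indices into the sorted sets
    while ($i1 < @$set1 and $i2 < @$set2) {
        my $delta = $set2->[$i2] - $set1->[$i1];
        ++$i1 and next if $delta > $wordDelta;  # set1 position too far behind
        push @result, $set2->[$i2] if $delta > 0;
        ++$i2;
    }
    return @result ? \@result : undef;
}

my $pos = near_positions([3, 17], [4, 9, 18], 1);
# positions 4 and 18 survive ; 9 follows nothing closely enough
```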
751              
752              
753              
754             sub addToScore (\$$) { # first score arg gets "incremented" by the second arg
755             my ($ptScore1, $score2) = @_;
756             $$ptScore1 = 0 if not defined $$ptScore1;
757             $$ptScore1 += $score2 if $score2; # TODO : find better formula for score combination !
758             }
759              
760              
761             sub translateQuery { # replace words by ids, remove irrelevant subqueries
762             my ($self, $q, $killedWords, $wordsRegexes) = @_;
763              
764             my $r = {};
765              
766             foreach my $k ('+', '-', '') {
767             foreach my $subQ (@{$q->{$k}}) {
768              
769             # ignore items concerning other field names
770             next if $subQ->{field} and $subQ->{field} ne $self->{fieldname};
771              
772             my $val = $subQ->{value};
773              
774             my $clone = undef;
775             if ($subQ->{op} eq '()') {
776             $clone = {op => '()',
777             value => $self->translateQuery($val, $killedWords, $wordsRegexes)};
778             }
779             elsif ($subQ->{op} eq ':') {
780             # split query according to our notion of "term"
781             my @words = ($val =~ /$self->{wregex}/g);
782              
783             # TODO : 1) accept '*' suffix; 2) find keys in $self->{ixw}; 3) rewrite into
784             # an 'OR' query
785              
786             # my @words = ($str =~ /$self->{wregex}\*?/g);
787              
788             my $regex1 = join "\\W+", map quotemeta, @words;
789             my $regex2 = join "\\W+", map quotemeta,
790             map {$self->{wfilter}($_)} @words;
791             foreach my $regex ($regex1, $regex2) {
792             $regex = "\\b$regex" if $regex =~ /^\w/;
793             $regex = "$regex\\b" if $regex =~ /\w$/;
794             }
795             push @$wordsRegexes, $regex1;
796             push @$wordsRegexes, $regex2 unless $regex1 eq $regex2;
797            
798             # now translate into word ids
799             foreach my $word (@words) {
800             my $wf = $self->{wfilter}->($word);
801             my $wordId = $wf ? ($self->{ixw}{$wf} || 0) : -1;
802             $killedWords->{$word} = 1 if $wordId < 0;
803             $word = $wordId;
804             }
805              
806             $val = (@words>1) ? \@words : # several words : return an array
807             (@words>0) ? $words[0] : # just one word : return its id
808             0; # no word : return 0 (means "no info")
809              
810             $clone = {op => ':', value=> $val};
811             }
812             push @{$r->{$k}}, $clone if $clone;
813             }
814             }
815              
816             return $r;
817             }
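The highlighting regexes that translateQuery pushes onto `@$wordsRegexes` can be illustrated in isolation. This sketch reproduces just the pattern construction (the real method also builds a second variant with `$self->{wfilter}` applied to each word, which is omitted here):

```perl
use strict;
use warnings;

# Stand-alone copy of the pattern construction in translateQuery():
# quote each word, join with \W+ so the phrase matches across any
# punctuation or whitespace, and anchor on word boundaries.
sub phrase_regex {
  my @words = @_;
  my $regex = join "\\W+", map quotemeta, @words;
  $regex = "\\b$regex" if $regex =~ /^\w/;
  $regex = "$regex\\b" if $regex =~ /\w$/;
  return $regex;
}

my $re = phrase_regex('exact', 'phrase');
print "$re\n";       # prints \bexact\W+phrase\b
print "matches\n" if "an exact  phrase, indeed" =~ /$re/;
```

The `\b` anchors are added conditionally because a word may start or end with a non-word character once `quotemeta` has done its work.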
818              
819              
820              
821             =item C<excerpts(buf, regex)>
822              
823             Searches C<buf> for occurrences of C<regex>,
824             extracts the occurrences together with some context
825             (a number of characters to the left and to the right),
826             and highlights the occurrences. See parameters C<ctxtNumChars>,
827             C<maxExcerpts>, C<preMatch>, C<postMatch> of the L</new> method.
828              
829             =cut
830              
831             sub excerpts {
832             my $self = shift;
833             # $_[0] : text buffer; not copied, for efficiency
834             my $regex = $_[1];
835              
836             my $nc = $self->{ctxtNumChars};
837              
838             # find start and end positions of matching fragments
839             my $matches = []; # array of refs to [start, end, number_of_matches]
840             while ($_[0] =~ /$regex/g) {
841             my ($start, $end) = ($-[0], $+[0]);
842             if (@$matches and $start <= $matches->[-1][1] + $nc) {
843             # merge with the last fragment if close enough
844             $matches->[-1][1] = $end; # extend the end position
845             $matches->[-1][2] += 1; # increment the number of matches
846             }
847             else {
848             push @$matches, [$start, $end, 1];
849             }
850             }
851              
852             foreach (@$matches) { # extend start and end positions by $self->{ctxtNumChars}
853             $_->[0] = ($_->[0] < $nc) ? 0 : $_->[0] - $nc;
854             $_->[1] += $nc;
855             }
856              
857             my $excerpts = [];
858             foreach my $match (sort {$b->[2] <=> $a->[2]} @$matches) {
859             last if @$excerpts >= $self->{maxExcerpts};
860             my $x = substr($_[0], $match->[0], $match->[1] - $match->[0]); # extract
861             $x =~ s/$regex/$self->{preMatch}$&$self->{postMatch}/g ; # highlight
862             push @$excerpts, "...$x...";
863             }
864             return $excerpts;
865             }
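The merge/extend/sort pipeline of excerpts() can be exercised on its own. This is a minimal self-contained sketch, with hard-coded settings standing in for the C<ctxtNumChars>, C<maxExcerpts>, C<preMatch> and C<postMatch> constructor parameters:

```perl
use strict;
use warnings;

# Sketch of the excerpts() algorithm with fixed settings.
sub simple_excerpts {
  my ($text, $regex) = @_;
  my $nc  = 10;   # chars of context on each side (ctxtNumChars)
  my $max = 2;    # max number of excerpts       (maxExcerpts)

  # 1) collect matches as [start, end, count], merging close fragments
  my @matches;
  while ($text =~ /$regex/g) {
    my ($start, $end) = ($-[0], $+[0]);
    if (@matches and $start <= $matches[-1][1] + $nc) {
      $matches[-1][1] = $end;   # extend previous fragment
      $matches[-1][2] += 1;     # one more match inside it
    }
    else {
      push @matches, [$start, $end, 1];
    }
  }

  # 2) extend each fragment by the context size
  for (@matches) {
    $_->[0] = $_->[0] < $nc ? 0 : $_->[0] - $nc;
    $_->[1] += $nc;
  }

  # 3) densest fragments first, highlighted with <b>..</b>
  my @excerpts;
  for my $m (sort {$b->[2] <=> $a->[2]} @matches) {
    last if @excerpts >= $max;
    my $x = substr($text, $m->[0], $m->[1] - $m->[0]);
    $x =~ s{$regex}{<b>$&</b>}g;
    push @excerpts, "...$x...";
  }
  return \@excerpts;
}

my $ex = simple_excerpts(
  "aaaa word bbbb word cccccccccccccccccc word dddd", qr/word/);
print "$_\n" for @$ex;   # excerpt with two matches comes first
```

The two leading occurrences of "word" are close enough to merge into one fragment, which therefore sorts before the isolated third occurrence.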
866              
867             =back
868              
869             =head1 TO DO
870              
871             =over
872              
873             =item *
874              
875             Find a proper formula for combining scores from several terms.
876             Current implementation is ridiculously simple-minded (just an addition).
877             Also study the literature to improve the scoring formula.
878              
879             =item *
880              
881             Handle concurrency through BerkeleyDB locks.
882              
883             =item *
884              
885             Maybe put all 3 index files as subDatabases in one single file.
886              
887             =item *
888              
889             Fine tuning of cachesize and other BerkeleyDB parameters.
890              
891             =item *
892              
893             Compare performances with other packages.
894              
895             =item *
896              
897             More functionality: add a NEAR operator and boost factors.
898              
899             =back
900              
901              
902              
903             =head1 SEE ALSO
904              
905             L<Search::FreeText> is nice and compact, but
906             limited in functionality (no +/- prefixes, no "exact phrase" search,
907             no parentheses).
908              
909             L<Plucene> is a Perl port of the Java I<Lucene> search engine.
910             Plucene has probably every feature you will ever need, but requires
911             quite an investment to install and learn (more than 60 classes,
912             dependencies on lots of external modules).
913             I haven't done any benchmarks yet to compare performance.
914              
915             L<KinoSearch> is a more recent, more sophisticated search engine,
916             which looks very powerful and should probably be faster and definitely
917             more scalable than C<Plucene>; but it also has a less compact
918             API. I haven't performed any detailed comparison yet.
919              
920              
921             =cut
922              
923            
924             1;