File Coverage

blib/lib/Search/Indexer.pm
Criterion    Covered  Total      %
statement         10     12   83.3
branch           n/a
condition        n/a
subroutine         4      4  100.0
pod              n/a
total             14     16   87.5


line stmt bran cond sub pod time code
1             package Search::Indexer;
2              
3 2     2   74048 use strict;
  2         5  
  2         83  
4 2     2   36 use warnings;
  2         3  
  2         63  
5             # no warnings 'uninitialized'; ## CHECK IF NEEDED OR NOT
6 2     2   11 use Carp;
  2         8  
  2         440  
7 2     2   2499 use BerkeleyDB;
  0            
  0            
8             use locale;
9             use Search::QueryParser;
10             use List::MoreUtils qw/uniq/;
11              
12             # TODO : experiment with bit vectors (cf vec() and pack "b*" for combining
13             # result sets
14              
15             our $VERSION = "0.77";
16              
17             =head1 NAME
18              
19             Search::Indexer - full-text indexer
20              
21             =head1 SYNOPSIS
22              
23             use Search::Indexer;
24             my $ix = new Search::Indexer(dir => $dir, writeMode => 1);
25             foreach my $docId (keys %docs) {
26             $ix->add($docId, $docs{$docId});
27             }
28              
29             my $result = $ix->search('+word -excludedWord +"exact phrase"');
30             my @docIds = keys %{$result->{scores}};
31             my $killedWords = join ", ", @{$result->{killedWords}};
32             print scalar(@docIds), " documents found\n";
33             print "words $killedWords were ignored during the search\n" if $killedWords;
34             foreach my $docId (@docIds) {
35             my $score = $result->{scores}{$docId};
36             my $excerpts = join "\n", $ix->excerpts($docs{$docId}, $result->{regex});
37             print "DOCUMENT $docId, score $score:\n$excerpts\n\n";
38             }
39              
40             my $result2 = $ix->search('word1 AND (word2 OR word3) AND NOT word4');
41              
42             $ix->remove($someDocId);
43              
44             =head1 DESCRIPTION
45              
46             This module provides support for indexing a collection of documents,
47             for searching the collection, and displaying the sorted results,
48             together with contextual excerpts of the original document.
49              
50             =head2 Documents
51              
52             As far as this module is concerned, a I<document> is just a buffer of
53             plain text, together with a unique identifying number. The caller is
54             responsible for supplying unique numbers, and for converting the
55             original source (HTML, PDF, whatever) into plain text. Documents could
56             also contain more information (other fields like date, author, Dublin
57             Core, etc.), but this must be handled externally, in a database or any
58             other store. A candidate for storing metadata about documents
59             could be L<File::Tabular>, which uses the same
60             query parser.
61              
62             =head2 Search syntax
63              
64             Searching requests may include plain terms, "exact phrases",
65             '+' or '-' prefixes, boolean operators and parentheses.
66             See L<Search::QueryParser> for details.
67              
68             =head2 Index files
69              
70             The indexer uses three files in BerkeleyDB format : a) a mapping from
71             words to wordIds; b) a mapping from wordIds to lists of documents ; c)
72             a mapping from pairs (docId, wordId) to lists of positions within the
73             document. This third file holds detailed information and therefore is
74             quite big ; but it allows us to quickly retrieve "exact phrases"
75             (sequences of adjacent words) in the document.
76              
77             =head2 Indexing steps
78              
79             Indexing of a document buffer goes through the following
80             steps :
81              
82             =over
83              
84             =item *
85              
86             terms are extracted, according to the I regular expression
87              
88             =item *
89              
90             extracted terms are normalized or filtered out
91             by the I callback function. This function can for example
92             remove accented characters, perform lemmatization, suppress
93             irrelevant terms (such as numbers), etc.
94              
95             =item *
96              
97             normalized terms are eliminated if they belong to
98             the I list (list of common words to exclude from the index).
99              
100             =item *
101              
102             remaining terms are stored, together with the positions where they
103             occur in the document.
104              
105             =back
106              
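The four steps above can be sketched in plain Perl. This is a simplified, self-contained illustration; the regex, filter sub and stopword list below are hypothetical stand-ins for the indexer's real configuration, and words are used directly as keys instead of numeric word ids:

```perl
use strict;
use warnings;

# hypothetical configuration, mirroring the documented defaults
my $wregex    = qr/\w+/;
my $wfilter   = sub { lc $_[0] };            # normalize : lowercase
my %stopwords = map { $_ => 1 } qw(the a of);

sub index_buffer {
    my ($buf) = @_;
    my %positions;                           # word => list of positions
    my $pos = 0;
    while ($buf =~ /$wregex/g) {
        $pos++;
        my $word = $wfilter->($&) or next;   # step 2 : normalize / filter out
        next if $stopwords{$word};           # step 3 : eliminate stopwords
        push @{ $positions{$word} }, $pos;   # step 4 : store with positions
    }
    return \%positions;
}

my $p = index_buffer("The quick fox saw the quick dog");
# "quick" occurs at positions 2 and 6 ; "the" is dropped as a stopword
```

The real module additionally maps each word to a numeric wordId and persists the positions in BerkeleyDB files, as described in the L</"Index files"> section.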
107             =head2 Limits
108              
109             All ids are stored as unsigned 32-bit integers; therefore there is
110             a limit of 4294967295 to the number of documents or to the number of
111             different words.
112              
113             =head2 Related modules
114              
115             A short comparison with other CPAN indexing modules is
116             given in the L</"SEE ALSO"> section.
117              
118             This module depends on L<Search::QueryParser> for analyzing requests and
119             on L<BerkeleyDB> for storing the indexes.
120              
121             This module was designed together with L<File::Tabular>.
122              
123             =cut
124              
125              
126             sub addToScore (\$$);
127              
128             use constant {
129              
130             # max size of various ids
131             MAX_DOC_ID => 0xFFFFFFFF, # unsigned long (32 bits)
132             MAX_POS_ID => 0xFFFFFFFF, # position_id
133              
134             # encodings for pack/unpack
135             IXDPACK => 'wC', # docId : compressed int; freq : unsigned char
136             IXDPACK_L => '(wC)*', # list of above
137             IXPPACK => 'w*', # word positions : list of compressed ints
138             IXPKEYPACK => 'ww', # key for ixp : (docId, wordId)
139              
140             WRITECACHESIZE => (1 << 24), # arbitrary big value; seems good enough but needs tuning
141              
142             # default values for args to new()
143             DEFAULT => {
144             writeMode => 0,
145             wregex => qr/\w+/,
146             wfilter => sub { # default filter : lowercase and no accents
147             my $word = lc($_[0]);
148             $word =~ tr[çáàâäéèêëíìîïóòôöúùûüýÿ][caaaaeeeeiiiioooouuuuyy];
149             return $word;
150             },
151             fieldname => '',
152              
153             ctxtNumChars => 35,
154             maxExcerpts => 5,
155             preMatch => "",
156             postMatch => "",
157             positions => 1,
158             }
159             };
160              
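The C<'w'> pack template used in the constants above is Perl's BER compressed-integer format: small values occupy a single byte, which keeps posting lists compact; C<'C'> is one unsigned char, which is why occurrence counts are capped at 255. A small illustration with made-up ids:

```perl
use strict;
use warnings;

# IXDPACK entry : (docId as BER compressed int, frequency as unsigned char)
my $entry = pack 'wC', 42, 7;
my ($docId, $freq) = unpack 'wC', $entry;
# a small docId compresses to a single byte, so length($entry) == 2

# IXDPACK_L : a whole posting list for one word, unpacked into a hash
my $list = pack '(wC)*', 1, 3, 2, 5, 300, 1;
my %docs = unpack '(wC)*', $list;   # docId => frequency
```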
161             =head1 METHODS
162              
163             =over
164              
165             =item C<< new(arg1 => expr1, ...) >>
166              
167             Creates an indexer (either for a new index, or for
168             accessing an existing index). Parameters are :
169              
170             =over
171              
172             =item dir
173              
174             Directory for index files, and possibly for the stopwords file.
175             Default is the current directory.
176              
177             =item writeMode
178              
179             Give a true value if you intend to write into the index.
180              
181             =item wregex
182              
183             Regex for matching a word (C<qr/\w+/> by default).
184             Will affect both the L</add> and L</search> methods.
185             This regex should not contain any capturing parentheses
186             (use non-capturing parentheses C<< (?: ... ) >> instead).
187              
188             =item wfilter
189              
190             Ref to a callback sub that may normalize or eliminate a word. Will
191             affect both the L</add> and L</search> methods. The default wfilter
192             translates words to lower case and translates latin1 (iso-8859-1)
193             accented characters into plain characters.
194              
195             =item stopwords
196              
197             List of words that will be marked into the index as "words to exclude".
198             This should usually occur when creating a new index ; but nothing prevents
199             you from adding other stopwords later. Since stopwords are stored in the
200             index, they need not be specified when opening an index for searches or
201             updates.
202              
203             The list may be supplied either as a ref to an array of scalars, or
204             as the name of a file containing the stopwords (full pathname
205             or filename relative to I<dir>).
206              
207              
208             =item fieldname
209              
210             Will only affect the L</search> method.
211             Search queries are passed to a general parser
212             (see L<Search::QueryParser>).
213             Then, before being applied to the present indexer module,
214             queries are pruned of irrelevant items.
215             Query items are considered relevant if they have no
216             associated field name, or if the associated field name is
217             equal to this C.
218              
219             =back
220              
221             Below are some additional parameters that only affect the
222             L</excerpts> method.
223              
224             =over
225              
226             =item ctxtNumChars
227              
228             Number of characters determining the size of contextual excerpts
229             returned by the L</excerpts> method.
230             A I<contextual excerpt> is a part of the document text,
231             containing a matched word surrounded by I<ctxtNumChars> characters
232             to the left and to the right. Default is 35.
233              
234              
235             =item maxExcerpts
236              
237             Maximum number of contextual excerpts to retrieve per document.
238             Default is 5.
239              
240             =item preMatch
241              
242             String to insert in contextual excerpts before a matched word.
243             Default is C<"E<lt>bE<gt>">.
244              
245             =item postMatch
246              
247             String to insert in contextual excerpts after a matched word.
248             Default is C<"E<lt>/bE<gt>">.
249              
250              
251             =item positions
252              
253             my $indexer = new Search::Indexer(dir => $dir,
254             writeMode => 1,
255             positions => 0);
256              
257             Truth value to tell whether or not, when creating a new index,
258             word positions should be stored. The default is true.
259              
260             If you turn it off, index files will be much smaller, indexing
261             will be faster, but results will be less precise,
262             because the indexer can no longer find "exact phrases".
263             So if you type C<"quick fox jumped">, the query will be
264             translated into C<quick AND fox AND jumped>, and therefore
265             will retrieve documents in which those three words are present, but
266             not necessarily in order.
267              
268             Another consequence of C<< positions => 0 >> is that
269             there will be no automatic check of uniqueness of ids
270             when adding documents into the index.
271              
272             =back
273              
274             =cut
275              
276             sub new {
277             my $class = shift;
278             my $args = ref $_[0] eq 'HASH' ? $_[0] : {@_};
279              
280             # parse options
281             my $self = {};
282             $self->{$_} = exists $args->{$_} ? delete $args->{$_} : DEFAULT->{$_}
283             foreach qw(writeMode wregex wfilter fieldname
284             ctxtNumChars maxExcerpts preMatch postMatch positions);
285             my $dir = delete $args->{dir} || ".";
286             $dir =~ s{[/\\]$}{}; # remove trailing slash
287             my $stopwords = delete $args->{stopwords};
288              
289             # check if invalid options
290             my @remaining = keys %$args;
291             croak "unexpected option : $remaining[0]" if @remaining;
292             croak "can't add 'positions' after index creation time"
293             if $self->{writeMode} and $self->{positions}
294             and -f "$dir/ixd.bdb" and not -f "$dir/ixp.bdb";
295              
296             # BerkeleyDB environment should allow us to do proper locking for
297             # concurrent access ; but seems to be incompatible with the
298             # -Cachesize argument, so I commented it out ... need to learn more about
299             # BerkeleyDB ...
300             # my $dbEnv = new BerkeleyDB::Env
301             # -Home => $dir,
302             # -Flags => DB_INIT_CDB | DB_INIT_MPOOL | DB_CDB_ALLDB |
303             # ($self->{writeMode} ? DB_CREATE : 0),
304             # -Verbose => 1
305             # or croak "new BerkeleyDB::Env : $^E $BerkeleyDB::Error" ;
306              
307              
308             my @bdb_args = (# -Env => $dbEnv, # commented out, see explanation above
309             -Flags => ($self->{writeMode} ? DB_CREATE : DB_RDONLY),
310             ($self->{writeMode} ? (-Cachesize => WRITECACHESIZE) : ()));
311              
312             # 3 index files :
313             # ixw : word => wordId (or -1 for stopwords)
314             $self->{ixwDb} = tie %{$self->{ixw}},
315             'BerkeleyDB::Btree',
316             -Filename => "$dir/ixw.bdb", @bdb_args
317             or croak "open $dir/ixw.bdb : $^E $BerkeleyDB::Error";
318              
319             # ixd : wordId => list of (docId, nOccur)
320             $self->{ixdDb} = tie %{$self->{ixd}},
321             'BerkeleyDB::Hash',
322             -Filename => "$dir/ixd.bdb", @bdb_args
323             or croak "open $dir/ixd.bdb : $^E $BerkeleyDB::Error";
324              
325             if (-f "$dir/ixp.bdb" || $self->{writeMode} && $self->{positions}) {
326             # ixp : (docId, wordId) => list of positions of word in doc
327             $self->{ixpDb} = tie %{$self->{ixp}},
328             'BerkeleyDB::Btree',
329             -Filename => "$dir/ixp.bdb", @bdb_args
330             or croak "open $dir/ixp.bdb : $^E $BerkeleyDB::Error";
331             }
332              
333              
334             # optional list of stopwords may be given as a list or as a filename
335             if ($stopwords) {
336             $self->{writeMode} or croak "must be in writeMode to specify stopwords";
337             if (not ref $stopwords) { # if scalar, name of stopwords file
338             open TMP, $stopwords or
339             (open TMP, "$dir/$stopwords") or
340             croak "open stopwords file $stopwords : $^E ";
341             local $/ = undef;
342             my $buf = <TMP>;
343             $stopwords = [$buf =~ /$self->{wregex}/g];
344             close TMP;
345             }
346             foreach my $word (@$stopwords) {
347             $self->{ixw}{$word} = -1;
348             }
349             }
350              
351             bless $self, $class;
352             }
353              
354              
355              
356              
357              
358             =item C<add(docId, buf)>
359              
360             Add a new document to the index.
361             I is the unique identifier for this doc
362             (the caller is responsible for uniqueness).
363             I is a scalar containing the text representation of this doc.
364              
365             =cut
366              
367             sub add {
368             my $self = shift;
369             my $docId = shift;
370             # my $buf = shift; # using $_[0] instead for efficiency reasons
371              
372             croak "docId $docId is too large" if $docId > MAX_DOC_ID;
373              
374             # first check if this docId is already used
375             if ($self->{ixp}) { # can only check if we have the positions index
376             my $c = $self->{ixpDb}->db_cursor;
377             my $k = pack IXPKEYPACK, $docId, 0;
378             my $v; # not used, but needed by c_get()
379             my $status = $c->c_get($k, $v, DB_SET_RANGE);
380             if ($status == 0) {
381             my ($check, $wordId) = unpack IXPKEYPACK, $k;
382             croak "docId $docId is already used (wordId=$wordId)"
383             if $docId == $check;
384             }
385             }
386              
387             # OK, let's extract words from the $_[0] buffer
388             my %positions;
389             for (my $nwords = 1; $_[0] =~ /$self->{wregex}/g; $nwords++) {
390              
391             my $word = $self->{wfilter}->($&) or next;
392             my $wordId = $self->{ixw}{$word} ||
393             ($self->{ixw}{$word} = ++$self->{ixw}{_NWORDS}); # create new wordId
394             push @{$positions{$wordId}}, $nwords if $wordId > 0;
395             }
396              
397             foreach my $wordId (keys %positions) {
398             my $occurrences = @{$positions{$wordId}};
399             $occurrences = 255 if $occurrences > 255;
400              
401             $self->{ixd}{$wordId} .= pack(IXDPACK, $docId, $occurrences);
402             if ($self->{ixp}) {
403             my $ixpKey = pack IXPKEYPACK, $docId, $wordId;
404             $self->{ixp}{$ixpKey} = pack(IXPPACK, @{$positions{$wordId}});
405             }
406             }
407              
408             $self->{ixd}{NDOCS} = 0 if not defined $self->{ixd}{NDOCS};
409             $self->{ixd}{NDOCS} += 1;
410             }
411              
412              
413             =item C<remove(docId [, buf])>
414              
415             Removes a document from the index.
416             If the index contains word positions (true by default), then
417             only the C is needed; however, if the index was created
418             without word positions, then the text representation
419             of the document must be given as a scalar string in the second argument
420             (of course this should be the same as the one that was supplied
421             when calling the L</add> method).
422              
423              
424             =cut
425              
426             sub remove {
427             my $self = shift;
428             my $docId = shift;
429             # my $buf = shift; # using $_[0] instead for efficiency reasons
430              
431             my $wordIds;
432              
433             if ($self->{ixp}) { # if using word positions
434             not $_[0] or carp "remove() : unexpected 'buf' argument";
435             $wordIds= $self->wordIds($docId);
436             }
437             else { # otherwise : recompute word ids
438             $wordIds = [grep {defined $_ and $_ > 0}
439             map {$self->{ixw}{$_}}
440             uniq map {$self->{wfilter}->($_)}
441             ($_[0] =~ /$self->{wregex}/g)];
442             }
443              
444             return if not @$wordIds;
445              
446             foreach my $wordId (@$wordIds) {
447             my %docs = unpack IXDPACK_L, $self->{ixd}{$wordId};
448             delete $docs{$docId};
449             $self->{ixd}{$wordId} = pack IXDPACK_L, %docs;
450             if ($self->{ixp}) {
451             my $ixpKey = pack IXPKEYPACK, $docId, $wordId;
452             delete $self->{ixp}{$ixpKey};
453             }
454             }
455              
456             $self->{ixd}{NDOCS} -= 1;
457             }
458              
459             =item C<wordIds(docId)>
460              
461             Returns a ref to an array of word Ids contained in the specified document
462             (not available if the index was created with C<< positions => 0 >>)
463              
464             =cut
465              
466             sub wordIds {
467             my $self = shift;
468             my $docId_ini = shift;
469              
470             $self->{ixpDb}
471             or croak "wordIds() not available (index was created with positions=>0)";
472              
473             my @wordIds = ();
474             my $c = $self->{ixpDb}->db_cursor;
475             my ($k, $v);
476             $k = pack IXPKEYPACK, $docId_ini, 0;
477             my $status = $c->c_get($k, $v, DB_SET_RANGE);
478             while ($status == 0) {
479             my ($docId, $wordId) = unpack IXPKEYPACK, $k;
480             last if $docId != $docId_ini;
481             push @wordIds, $wordId;
482             $status = $c->c_get($k, $v, DB_NEXT);
483             }
484             return \@wordIds;
485             }
486              
487              
488             =item C<words(prefix)>
489              
490             Returns a ref to an array of words found in the dictionary,
491             starting with prefix (i.e. C<< $ix->words("foo") >> will
492             return "foo", "food", "fool", "footage", etc.).
493              
494             =cut
495              
496             sub words {
497             my $self = shift;
498             my $prefix = shift;
499              
500             my $regex = qr/^$prefix/;
501             my @words = ();
502             my $c = $self->{ixwDb}->db_cursor;
503             my ($k, $v);
504             $k = $prefix;
505             my $status = $c->c_get($k, $v, DB_SET_RANGE);
506             while ($status == 0) {
507             last if $k !~ $regex;
508             push @words, $k;
509             $status = $c->c_get($k, $v, DB_NEXT);
510             }
511             return \@words;
512             }
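The cursor logic in C<words> amounts to a range scan over sorted keys: DB_SET_RANGE positions the cursor at the first key greater than or equal to the prefix, then the loop advances while keys still match. An in-memory analog of the same logic over a sorted word list (the word list is hypothetical):

```perl
use strict;
use warnings;

my @dict   = sort qw(foo food fool footage bar);  # Btree keys are sorted
my $prefix = "foo";
my @match;
for my $w (@dict) {
    next if $w lt $prefix;           # emulate DB_SET_RANGE positioning
    last if $w !~ /^\Q$prefix\E/;    # past the prefix range : stop
    push @match, $w;
}
# @match now holds foo, food, fool, footage
```

Because the keys are sorted, the scan can stop at the first non-matching key instead of examining the whole dictionary, which is exactly what the BerkeleyDB cursor version exploits.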
513              
514              
515              
516             =item C<dump()>
517              
518             Debugging function, prints indexed words with list of associated docs.
519              
520             =cut
521              
522             sub dump {
523             my $self = shift;
524             foreach my $word (sort keys %{$self->{ixw}}) {
525             my $wordId = $self->{ixw}{$word};
526             if ($wordId == -1) {
527             print "$word : STOPWORD\n";
528             }
529             else {
530             my %docs = unpack IXDPACK_L, $self->{ixd}{$wordId};
531             print "$word : ", join (" ", keys %docs), "\n";
532             }
533             }
534             }
535              
536              
537             =item C<search(queryString, implicitPlus)>
538              
539             Searches the index. See the L</SYNOPSIS> and L</"Search syntax"> sections
540             above for short descriptions of query strings, or
541             L<Search::QueryParser> for details. The second argument is optional ;
542             if true, all words without any prefix will implicitly take prefix '+'
543             (mandatory words).
544              
545             The return value is a hash ref containing
546              
547             =over
548              
549             =item scores
550              
551             hash ref, where keys are docIds of matching documents, and values are
552             the corresponding computed scores.
553              
554             =item killedWords
555              
556             ref to an array of terms from the query string which were ignored
557             during the search (because they were filtered out or were stopwords)
558              
559             =item regex
560              
561             ref to a regular expression corresponding to all terms in the query
562             string. This will be useful if you later want to get contextual
563             excerpts from the found documents (see the L</excerpts> method).
564              
565             =back
566              
567             =cut
568              
569              
570             sub search {
571             my $self = shift;
572             my $query_string = shift;
573             my $implicitPlus = shift;
574              
575             $self->{qp} ||= new Search::QueryParser;
576              
577             my $q = $self->{qp}->parse($query_string, $implicitPlus);
578             my $killedWords = {};
579             my $wordsRegexes = [];
580              
581             my $qt = $self->translateQuery($q, $killedWords, $wordsRegexes);
582              
583             my $tmp = {};
584             $tmp->{$_} = 1 foreach @$wordsRegexes;
585             my $strRegex = "(?:" . join("|", keys %$tmp) . ")";
586              
587             return {scores => $self->_search($qt),
588             killedWords => [keys %$killedWords],
589             regex => qr/$strRegex/i};
590             }
591              
592              
593             sub _search {
594             my ($self, $q) = @_;
595              
596             my $scores = undef; # hash {doc1 => score1, doc2 => score2 ...}
597              
598             # 1) deal with mandatory subqueries
599              
600             foreach my $subQ ( @{$q->{'+'}} ) {
601             my $sc = $self->docsAndScores($subQ) or next;
602             $scores = $sc and next if not $scores; # if first result set, just store
603              
604             # otherwise, intersect with previous result set
605             foreach my $docId (keys %$scores) {
606             delete $scores->{$docId} and next if not defined $sc->{$docId};
607             addToScore $scores->{$docId}, $sc->{$docId}; # otherwise
608             }
609             }
610              
611             my $noMandatorySubq = not $scores;
612              
613             # 2) deal with non-mandatory subqueries
614              
615             foreach my $subQ (@{$q->{''}}) {
616             my $sc = $self->docsAndScores($subQ) or next;
617             $scores = $sc and next if not $scores; # if first result set, just store
618              
619             # otherwise, combine with previous result set
620             foreach my $docId (keys %$sc) {
621             if (defined $scores->{$docId}) { # docId was already there, add new score
622             addToScore $scores->{$docId}, $sc->{$docId};
623             }
624             elsif ($noMandatorySubq){ # insert a new docId to the result set
625             $scores->{$docId} = $sc->{$docId};
626             }
627             # else do nothing (ignore this docId)
628             }
629             }
630              
631             return undef if not $scores or not %$scores; # no results
632              
633             # 3) deal with negative subqueries (remove corresponding docs from results)
634              
635             foreach my $subQ (@{$q->{'-'}}) {
636             my $negScores = $self->docsAndScores($subQ) or next;
637             delete $scores->{$_} foreach keys %$negScores;
638             }
639              
640             return $scores;
641             }
642              
643              
644              
645             sub docsAndScores { # returns a hash {docId => score} or undef (no info)
646             my ($self, $subQ) = @_;
647              
648             # recursive call to _search if $subQ is a parenthesized query
649             return $self->_search($subQ->{value}) if $subQ->{op} eq '()';
650              
651             # otherwise, don't care about $subQ->{op} (assert $subQ->{op} eq ':')
652              
653             if (ref $subQ->{value}) { # several words, this is an "exact phrase"
654             return $self->matchExactPhrase($subQ);
655             }
656             elsif ($subQ->{value} <= -1) {# this is a stopword
657             return undef;
658             }
659             else { # scalar value, match single word
660             my $scores = {unpack IXDPACK_L, ($self->{ixd}{$subQ->{value}} || "")};
661             my @k = keys %$scores;
662             if (@k) {
663             my $coeff = log(($self->{ixd}{NDOCS} + 1)/@k) * 100;
664             $scores->{$_} = int($coeff * $scores->{$_}) foreach @k;
665             }
666             return $scores;
667             }
668             }
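The coefficient computed in the single-word branch above is an inverse-document-frequency weight, log((NDOCS + 1) / df) * 100, multiplied by the raw occurrence count of the word in each document. A worked example with hypothetical numbers:

```perl
use strict;
use warnings;

my $NDOCS = 99;                      # hypothetical collection size
my %raw   = (10 => 2, 11 => 5);      # docId => occurrences of one word

my @k     = keys %raw;               # the word appears in 2 documents
my $coeff = log(($NDOCS + 1) / @k) * 100;   # log(100/2) * 100, about 391
my %scores = map { $_ => int($coeff * $raw{$_}) } @k;
# doc 10 scores int(391.20 * 2) = 782, doc 11 scores int(391.20 * 5) = 1956
```

A word present in few documents gets a large coefficient, so rare words dominate the combined score; a word present in every document scores close to zero.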
669              
670              
671             sub matchExactPhrase {
672             my ($self, $subQ) = @_;
673              
674             if (! $self->{ixp}) { # if not indexed with positions
675             # translate into an AND query
676             my $fake_query = {'+' => [map {{op => ':',
677             value => $_ }} @{$subQ->{value}}]};
678             # and search for that one
679             return $self->_search($fake_query);
680             };
681              
682             # otherwise, intersect word position sets
683             my %pos;
684             my $wordDelta = 0;
685             my $scores = undef;
686             foreach my $wordId (@{$subQ->{value}}) {
687             my $sc = $self->docsAndScores({op=>':', value=>$wordId});
688             if (not $scores) { # no previous result set
689             if ($sc) {
690             $scores = $sc;
691             foreach my $docId (keys %$scores) {
692             my $ixpKey = pack IXPKEYPACK, $docId, $wordId;
693             $pos{$docId} = [unpack IXPPACK, $self->{ixp}{$ixpKey}];
694             }
695             }
696             }
697             else { # combine with previous result set
698             $wordDelta++;
699             foreach my $docId (keys %$scores) {
700             if ($sc) { # if we have info about current word (is not a stopword)
701             if (not defined $sc->{$docId}) { # current word not in current doc
702             delete $scores->{$docId};
703             } else { # current word found in current doc, check if positions match
704             my $ixpKey = pack IXPKEYPACK, $docId, $wordId;
705             my @newPos = unpack IXPPACK, $self->{ixp}{$ixpKey};
706             $pos{$docId} = nearPositions($pos{$docId}, \@newPos, $wordDelta)
707             and addToScore $scores->{$docId}, $sc->{$docId}
708             or delete $scores->{$docId};
709             }
710             }
711             } # end foreach my $docId (keys %$scores)
712             }
713             } # end foreach my $wordId (@{$subQ->{value}})
714              
715             return $scores;
716             }
717              
718              
719              
720              
721              
722              
723              
724              
725              
726              
727              
728              
729              
730              
731              
732              
733              
734             sub nearPositions {
735             my ($set1, $set2, $wordDelta) = @_;
736             # returns the set of positions in $set2 which are "close enough" (<= $wordDelta)
737             # to positions in $set1. Assumption : input sets are sorted.
738              
739              
740             my @result;
741             my ($i1, $i2) = (0, 0); # indices into sets
742              
743             while ($i1 < @$set1 and $i2 < @$set2) {
744             my $delta = $set2->[$i2] - $set1->[$i1];
745             ++$i1 and next if $delta > $wordDelta;
746             push @result, $set2->[$i2] if $delta > 0;
747             ++$i2;
748             }
749             return @result ? \@result : undef;
750             }
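To illustrate the merge: suppose the previous word of the phrase occurs at positions 3 and 17, and the current word at 4, 9 and 18. With a wordDelta of 1, only current-word occurrences immediately following a previous-word position survive. A standalone copy of the routine (positions are made up):

```perl
use strict;
use warnings;

# standalone copy of nearPositions, for illustration
sub near_positions {
    my ($set1, $set2, $wordDelta) = @_;
    my @result;
    my ($i1, $i2) = (0, 0);              # indices into the sorted sets
    while ($i1 < @$set1 and $i2 < @$set2) {
        my $delta = $set2->[$i2] - $set1->[$i1];
        ++$i1 and next if $delta > $wordDelta;  # set1 position too far behind
        push @result, $set2->[$i2] if $delta > 0;
        ++$i2;
    }
    return @result ? \@result : undef;
}

my $pos = near_positions([3, 17], [4, 9, 18], 1);
# positions 4 and 18 survive ; 9 follows nothing closely enough
```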
751              
752              
753              
754             sub addToScore (\$$) { # first score arg gets "incremented" by the second arg
755             my ($ptScore1, $score2) = @_;
756             $$ptScore1 = 0 if not defined $$ptScore1;
757             $$ptScore1 += $score2 if $score2; # TODO : find better formula for score combination !
758             }
759              
760              
761             sub translateQuery { # replace words by ids, remove irrelevant subqueries
762             my ($self, $q, $killedWords, $wordsRegexes) = @_;
763              
764             my $r = {};
765              
766             foreach my $k ('+', '-', '') {
767             foreach my $subQ (@{$q->{$k}}) {
768              
769             # ignore items concerning other field names
770             next if $subQ->{field} and $subQ->{field} ne $self->{fieldname};
771              
772             my $val = $subQ->{value};
773              
774             my $clone = undef;
775             if ($subQ->{op} eq '()') {
776             $clone = {op => '()',
777             value => $self->translateQuery($val, $killedWords, $wordsRegexes)};
778             }
779             elsif ($subQ->{op} eq ':') {
780             # split query according to our notion of "term"
781             my @words = ($val =~ /$self->{wregex}/g);
782              
783             # TODO : 1) accept '*' suffix; 2) find keys in $self->{ixw}; 3) rewrite into
784             # an 'OR' query
785              
786             # my @words = ($str =~ /$self->{wregex}\*?/g);
787              
788             my $regex1 = join "\\W+", map quotemeta, @words;
789             my $regex2 = join "\\W+", map quotemeta,
790             map {$self->{wfilter}($_)} @words;
791             foreach my $regex ($regex1, $regex2) {
792             $regex = "\\b$regex" if $regex =~ /^\w/;
793             $regex = "$regex\\b" if $regex =~ /\w$/;
794             }
795             push @$wordsRegexes, $regex1;
796             push @$wordsRegexes, $regex2 unless $regex1 eq $regex2;
797            
798             # now translate into word ids
799             foreach my $word (@words) {
800             my $wf = $self->{wfilter}->($word);
801             my $wordId = $wf ? ($self->{ixw}{$wf} || 0) : -1;
802             $killedWords->{$word} = 1 if $wordId < 0;
803             $word = $wordId;
804             }
805              
806             $val = (@words>1) ? \@words : # several words : return an array
807             (@words>0) ? $words[0] : # just one word : return its id
808             0; # no word : return 0 (means "no info")
809              
810             $clone = {op => ':', value=> $val};
811             }
812             push @{$r->{$k}}, $clone if $clone;
813             }
814             }
815              
816             return $r;
817             }
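The highlighting regexes that translateQuery pushes onto `@$wordsRegexes` can be illustrated in isolation. This sketch reproduces just the pattern construction (the real method also builds a second variant with `$self->{wfilter}` applied to each word, which is omitted here):

```perl
use strict;
use warnings;

# Stand-alone copy of the pattern construction in translateQuery():
# quote each word, join with \W+ so the phrase matches across any
# punctuation or whitespace, and anchor on word boundaries.
sub phrase_regex {
  my @words = @_;
  my $regex = join "\\W+", map quotemeta, @words;
  $regex = "\\b$regex" if $regex =~ /^\w/;
  $regex = "$regex\\b" if $regex =~ /\w$/;
  return $regex;
}

my $re = phrase_regex('exact', 'phrase');
print "$re\n";       # prints \bexact\W+phrase\b
print "matches\n" if "an exact  phrase, indeed" =~ /$re/;
```

The `\b` anchors are added conditionally because a word may start or end with a non-word character once `quotemeta` has done its work.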
818              
819              
820              
821             =item C<excerpts(buf, regex)>
822              
823             Searches C<buf> for occurrences of C<regex>,
824             extracts the occurrences together with some context
825             (a number of characters to the left and to the right),
826             and highlights the occurrences. See parameters C<ctxtNumChars>,
827             C<maxExcerpts>, C<preMatch>, C<postMatch> of the L</new> method.
828              
829             =cut
830              
831             sub excerpts {
832             my $self = shift;
833             # $_[0] : text buffer; not copied, for efficiency
834             my $regex = $_[1];
835              
836             my $nc = $self->{ctxtNumChars};
837              
838             # find start and end positions of matching fragments
839             my $matches = []; # array of refs to [start, end, number_of_matches]
840             while ($_[0] =~ /$regex/g) {
841             my ($start, $end) = ($-[0], $+[0]);
842             if (@$matches and $start <= $matches->[-1][1] + $nc) {
843             # merge with the last fragment if close enough
844             $matches->[-1][1] = $end; # extend the end position
845             $matches->[-1][2] += 1; # increment the number of matches
846             }
847             else {
848             push @$matches, [$start, $end, 1];
849             }
850             }
851              
852             foreach (@$matches) { # extend start and end positions by $self->{ctxtNumChars}
853             $_->[0] = ($_->[0] < $nc) ? 0 : $_->[0] - $nc;
854             $_->[1] += $nc;
855             }
856              
857             my $excerpts = [];
858             foreach my $match (sort {$b->[2] <=> $a->[2]} @$matches) {
859             last if @$excerpts >= $self->{maxExcerpts};
860             my $x = substr($_[0], $match->[0], $match->[1] - $match->[0]); # extract
861             $x =~ s/$regex/$self->{preMatch}$&$self->{postMatch}/g ; # highlight
862             push @$excerpts, "...$x...";
863             }
864             return $excerpts;
865             }
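The merge/extend/sort pipeline of excerpts() can be exercised on its own. This is a minimal self-contained sketch, with hard-coded settings standing in for the C<ctxtNumChars>, C<maxExcerpts>, C<preMatch> and C<postMatch> constructor parameters:

```perl
use strict;
use warnings;

# Sketch of the excerpts() algorithm with fixed settings.
sub simple_excerpts {
  my ($text, $regex) = @_;
  my $nc  = 10;   # chars of context on each side (ctxtNumChars)
  my $max = 2;    # max number of excerpts       (maxExcerpts)

  # 1) collect matches as [start, end, count], merging close fragments
  my @matches;
  while ($text =~ /$regex/g) {
    my ($start, $end) = ($-[0], $+[0]);
    if (@matches and $start <= $matches[-1][1] + $nc) {
      $matches[-1][1] = $end;   # extend previous fragment
      $matches[-1][2] += 1;     # one more match inside it
    }
    else {
      push @matches, [$start, $end, 1];
    }
  }

  # 2) extend each fragment by the context size
  for (@matches) {
    $_->[0] = $_->[0] < $nc ? 0 : $_->[0] - $nc;
    $_->[1] += $nc;
  }

  # 3) densest fragments first, highlighted with <b>..</b>
  my @excerpts;
  for my $m (sort {$b->[2] <=> $a->[2]} @matches) {
    last if @excerpts >= $max;
    my $x = substr($text, $m->[0], $m->[1] - $m->[0]);
    $x =~ s{$regex}{<b>$&</b>}g;
    push @excerpts, "...$x...";
  }
  return \@excerpts;
}

my $ex = simple_excerpts(
  "aaaa word bbbb word cccccccccccccccccc word dddd", qr/word/);
print "$_\n" for @$ex;   # excerpt with two matches comes first
```

The two leading occurrences of "word" are close enough to merge into one fragment, which therefore sorts before the isolated third occurrence.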
866              
867             =back
868              
869             =head1 TO DO
870              
871             =over
872              
873             =item *
874              
875             Find a proper formula for combining scores from several terms.
876             Current implementation is ridiculously simple-minded (just an addition).
877             Also study the literature to improve the scoring formula.
878              
879             =item *
880              
881             Handle concurrency through BerkeleyDB locks.
882              
883             =item *
884              
885             Maybe put all 3 index files as subDatabases in one single file.
886              
887             =item *
888              
889             Fine tuning of cachesize and other BerkeleyDB parameters.
890              
891             =item *
892              
893             Compare performances with other packages.
894              
895             =item *
896              
897             More functionality: add a NEAR operator and boost factors.
898              
899             =back
900              
901              
902              
903             =head1 SEE ALSO
904              
905             L<Search::FreeText> is nice and compact, but
906             limited in functionality (no +/- prefixes, no "exact phrase" search,
907             no parentheses).
908              
909             L<Plucene> is a Perl port of the Java I<Lucene> search engine.
910             Plucene has probably every feature you will ever need, but requires
911             quite an investment to install and learn (more than 60 classes,
912             dependencies on lots of external modules).
913             I haven't done any benchmarks yet to compare performance.
914              
915             L<KinoSearch> is a more recent, more sophisticated search engine,
916             which looks very powerful and should probably be faster and definitely
917             more scalable than C<Plucene>; but it also has a less compact
918             API. I haven't performed any detailed comparison yet.
919              
920              
921             =cut
922              
923            
924             1;