File Coverage

blib/lib/Search/Indexer.pm
Criterion Covered Total %
statement 13 15 86.6
branch n/a
condition n/a
subroutine 5 5 100.0
pod n/a
total 18 20 90.0


line stmt bran cond sub pod time code
1             package Search::Indexer;
2              
3 2     2   28004 use strict;
  2         4  
  2         51  
4 2     2   7 use warnings;
  2         3  
  2         49  
5 2     2   6 no warnings 'uninitialized';
  2         6  
  2         44  
6              
7 2     2   6 use Carp;
  2         6  
  2         136  
8 2     2   1106 use BerkeleyDB;
  0            
  0            
9             use locale;
10             use Search::QueryParser;
11             use List::MoreUtils qw/uniq/;
12              
13             # TODO : experiment with bit vectors (cf vec() and pack "b*" for combining
14             # result sets
15              
16             our $VERSION = "0.78";
17              
18             =head1 NAME
19              
20             Search::Indexer - full-text indexer
21              
22             =head1 SYNOPSIS
23              
24             use Search::Indexer;
25             my $ix = new Search::Indexer(dir => $dir, writeMode => 1);
26             foreach my $docId (keys %docs) {
27             $ix->add($docId, $docs{$docId});
28             }
29              
30             my $result = $ix->search('+word -excludedWord +"exact phrase"');
31             my @docIds = keys %{$result->{scores}};
32             my $killedWords = join ", ", @{$result->{killedWords}};
33             print scalar(@docIds), " documents found\n", ;
34             print "words $killedWords were ignored during the search\n" if $killedWords;
35             foreach my $docId (@docIds) {
36             my $score = $result->{scores}{$docId};
37             my $excerpts = join "\n", $ix->excerpts($docs{$docId}, $result->{regex});
38             print "DOCUMENT $docId, score $score:\n$excerpts\n\n";
39             }
40              
41             my $result2 = $ix->search('word1 AND (word2 OR word3) AND NOT word4');
42              
43             $ix->remove($someDocId);
44              
45             =head1 DESCRIPTION
46              
47             This module provides support for indexing a collection of documents,
48             for searching the collection, and displaying the sorted results,
49             together with contextual excerpts of the original document.
50              
51             =head2 Documents
52              
53             As far as this module is concerned, a I is just a buffer of
54             plain text, together with a unique identifying number. The caller is
55             responsible for supplying unique numbers, and for converting the
56             original source (HTML, PDF, whatever) into plain text. Documents could
57             also contain more information (other fields like date, author, Dublin
58             Core, etc.), but this must be handled externally, in a database or any
59             other store. A candidate for storing metadata about documents
60             could be L, which uses the same
61             query parser.
62              
63             =head2 Search syntax
64              
65             Searching requests may include plain terms, "exact phrases",
66             '+' or '-' prefixes, boolean operators and parentheses.
67             See L for details.
68              
69             =head2 Index files
70              
71             The indexer uses three files in BerkeleyDB format : a) a mapping from
72             words to wordIds; b) a mapping from wordIds to lists of documents ; c)
73             a mapping from pairs (docId, wordId) to lists of positions within the
74             document. This third file holds detailed information and therefore is
75             quite big ; but it allows us to quickly retrieve "exact phrases"
76             (sequences of adjacent words) in the document.
77              
78             =head2 Indexing steps
79              
80             Indexing of a document buffer goes through the following
81             steps :
82              
83             =over
84              
85             =item *
86              
87             terms are extracted, according to the I regular expression
88              
89             =item *
90              
91             extracted terms are normalized or filtered out
92             by the I callback function. This function can for example
93             remove accented characters, perform lemmatization, suppress
94             irrelevant terms (such as numbers), etc.
95              
96             =item *
97              
98             normalized terms are eliminated if they belong to
99             the I list (list of common words to exclude from the index).
100              
101             =item *
102              
103             remaining terms are stored, together with the positions where they
104             occur in the document.
105              
106             =back
107              
108             =head2 Limits
109              
110             All ids are stored as unsigned 32-bit integers; therefore there is
111             a limit of 4294967295 to the number of documents or to the number of
112             different words.
113              
114             =head2 Related modules
115              
116             A short comparison with other CPAN indexing modules is
117             given in the L section.
118              
119             This module depends on L for analyzing requests and
120             on L for storing the indexes.
121              
122             This module was designed together with L.
123              
124             =cut
125              
126              
127             sub addToScore (\$$);
128              
129             use constant {
130              
131             # max size of various ids
132             MAX_DOC_ID => 0xFFFFFFFF, # unsigned long (32 bits)
133             MAX_POS_ID => 0xFFFFFFFF, # position_id
134              
135             # encodings for pack/unpack
136             IXDPACK => 'wC', # docId : compressed int; freq : unsigned char
137             IXDPACK_L => '(wC)*', # list of above
138             IXPPACK => 'w*', # word positions : list of compressed ints
139             IXPKEYPACK => 'ww', # key for ixp : (docId, wordId)
140              
141             WRITECACHESIZE => (1 << 24), # arbitrary big value; seems good enough but need tuning
142              
143             # default values for args to new()
144             DEFAULT => {
145             writeMode => 0,
146             wregex => qr/\w+/,
147             wfilter => sub { # default filter : lowercase and no accents
148             my $word = lc($_[0]);
149             $word =~ tr[çáàâäéèêëíìîïóòôöúùûüýÿ][caaaaeeeeiiiioooouuuuyy];
150             return $word;
151             },
152             fieldname => '',
153              
154             ctxtNumChars => 35,
155             maxExcerpts => 5,
156             preMatch => "",
157             postMatch => "",
158             positions => 1,
159             }
160             };
161              
162             =head1 METHODS
163              
164             =over
165              
166             =item C expr1, ...)>
167              
168             Creates an indexer (either for a new index, or for
169             accessing an existing index). Parameters are :
170              
171             =over
172              
173             =item dir
174              
175             Directory for index files. and possibly for the stopwords file.
176             Default is current directory
177              
178             =item writeMode
179              
180             Give a true value if you intend to write into the index.
181              
182             =item wregex
183              
184             Regex for matching a word (C by default).
185             Will affect both L and L method.
186             This regex should not contain any capturing parentheses
187             (use non-capturing parentheses C<< (?: ... ) >> instead).
188              
189             =item wfilter
190              
191             Ref to a callback sub that may normalize or eliminate a word. Will
192             affect both L and L method. The default wfilter
193             translates words in lower case and translates latin1 (iso-8859-1)
194             accented characters into plain characters.
195              
196             =item stopwords
197              
198             List of words that will be marked into the index as "words to exclude".
199             This should usually occur when creating a new index ; but nothing prevents
200             you to add other stopwords later. Since stopwords are stored in the
201             index, they need not be specified when opening an index for searches or
202             updates.
203              
204             The list may be supplied either as a ref to an array of scalars, or
205             as a the name of a file containing the stopwords (full pathname
206             or filename relative to I).
207              
208              
209             =item fieldname
210              
211             Will only affect the L method.
212             Search queries are passed to a general parser
213             (see L).
214             Then, before being applied to the present indexer module,
215             queries are pruned of irrelevant items.
216             Query items are considered relevant if they have no
217             associated field name, or if the associated field name is
218             equal to this C.
219              
220             =back
221              
222             Below are some additional parameters that only affect the
223             L method.
224              
225             =over
226              
227             =item ctxtNumChars
228              
229             Number of characters determining the size of contextual excerpts
230             return by the L method.
231             A I is a part of the document text,
232             containg a matched word surrounded by I characters
233             to the left and to the right. Default is 35.
234              
235              
236             =item maxExcerpts
237              
238             Maximum number of contextual excerpts to retrieve per document.
239             Default is 5.
240              
241             =item preMatch
242              
243             String to insert in contextual excerpts before a matched word.
244             Default is C<"EbE">.
245              
246             =item postMatch
247              
248             String to insert in contextual excerpts after a matched word.
249             Default is C<"E/bE">.
250              
251              
252             =item positions
253              
254             my $indexer = new Search::Indexer(dir => $dir,
255             writeMode => 1,
256             positions => 0);
257              
258             Truth value to tell whether or not, when creating a new index,
259             word positions should be stored. The default is true.
260              
261             If you turn it off, index files will be much smaller, indexing
262             will be faster, but results will be less precise,
263             because the indexer can no longer find "exact phrases".
264             So if you type C<"quick fox jumped">, the query will be
265             translated into C, and therefore
266             will retrieve documents in which those three words are present, but
267             not necessarily in order.
268              
269             Another consequence of C<< positions => 0 >> is that
270             there will be no automatic check of uniqueness of ids
271             when adding documents into the index.
272              
273             =back
274              
275             =cut
276              
277             sub new {
278             my $class = shift;
279             my $args = ref $_[0] eq 'HASH' ? $_[0] : {@_};
280              
281             # parse options
282             my $self = {};
283             $self->{$_} = exists $args->{$_} ? delete $args->{$_} : DEFAULT->{$_}
284             foreach qw(writeMode wregex wfilter fieldname
285             ctxtNumChars maxExcerpts preMatch postMatch positions);
286             my $dir = delete $args->{dir} || ".";
287             $dir =~ s{[/\\]$}{}; # remove trailing slash
288             my $stopwords = delete $args->{stopwords};
289              
290             # check if invalid options
291             my @remaining = keys %$args;
292             croak "unexpected option : $remaining[0]" if @remaining;
293             croak "can't add 'positions' after index creation time"
294             if $self->{writeMode} and $self->{positions}
295             and -f "$dir/ixd.bdb" and not -f "$dir/ixp.bdb";
296              
297             # BerkeleyDB environment should allow us to do proper locking for
298             # concurrent access ; but seems to be incompatible with the
299             # -Cachesize argument, so I commented it out ... need to learn more about
300             # BerkeleyDB ...
301             # my $dbEnv = new BerkeleyDB::Env
302             # -Home => $dir,
303             # -Flags => DB_INIT_CDB | DB_INIT_MPOOL | DB_CDB_ALLDB |
304             # ($self->{writeMode} ? DB_CREATE : 0),
305             # -Verbose => 1
306             # or croak "new BerkeleyDB::Env : $^E $BerkeleyDB::Error" ;
307              
308              
309             my @bdb_args = (# -Env => $dbEnv, # commented out, see explanation above
310             -Flags => ($self->{writeMode} ? DB_CREATE : DB_RDONLY),
311             ($self->{writeMode} ? (-Cachesize => WRITECACHESIZE) : ()));
312              
313             # 3 index files :
314             # ixw : word => wordId (or -1 for stopwords)
315             $self->{ixwDb} = tie %{$self->{ixw}},
316             'BerkeleyDB::Btree',
317             -Filename => "$dir/ixw.bdb", @bdb_args
318             or croak "open $dir/ixw.bdb : $^E $BerkeleyDB::Error";
319              
320             # ixd : wordId => list of (docId, nOccur)
321             $self->{ixdDb} = tie %{$self->{ixd}},
322             'BerkeleyDB::Hash',
323             -Filename => "$dir/ixd.bdb", @bdb_args
324             or croak "open $dir/ixd.bdb : $^E $BerkeleyDB::Error";
325              
326             if (-f "$dir/ixp.bdb" || $self->{writeMode} && $self->{positions}) {
327             # ixp : (docId, wordId) => list of positions of word in doc
328             $self->{ixpDb} = tie %{$self->{ixp}},
329             'BerkeleyDB::Btree',
330             -Filename => "$dir/ixp.bdb", @bdb_args
331             or croak "open $dir/ixp.bdb : $^E $BerkeleyDB::Error";
332             }
333              
334              
335             # optional list of stopwords may be given as a list or as a filename
336             if ($stopwords) {
337             $self->{writeMode} or croak "must be in writeMode to specify stopwords";
338             if (not ref $stopwords) { # if scalar, name of stopwords file
339             open TMP, $stopwords or
340             (open TMP, "$dir/$stopwords") or
341             croak "open stopwords file $stopwords : $^E ";
342             local $/ = undef;
343             my $buf = ;
344             $stopwords = [$buf =~ /$self->{wregex}/g];
345             close TMP;
346             }
347             foreach my $word (@$stopwords) {
348             $self->{ixw}{$word} = -1;
349             }
350             }
351              
352             bless $self, $class;
353             }
354              
355              
356              
357              
358              
359             =item C
360              
361             Add a new document to the index.
362             I is the unique identifier for this doc
363             (the caller is responsible for uniqueness).
364             I is a scalar containing the text representation of this doc.
365              
366             =cut
367              
368             sub add {
369             my $self = shift;
370             my $docId = shift;
371             # my $buf = shift; # using $_[0] instead for efficiency reasons
372              
373             croak "docId $docId is too large" if $docId > MAX_DOC_ID;
374              
375             # first check if this docId is already used
376             if ($self->{ixp}) { # can only check if we have the positions index
377             my $c = $self->{ixpDb}->db_cursor;
378             my $k = pack IXPKEYPACK, $docId, 0;
379             my $v; # not used, but needed by c_get()
380             my $status = $c->c_get($k, $v, DB_SET_RANGE);
381             if ($status == 0) {
382             my ($check, $wordId) = unpack IXPKEYPACK, $k;
383             croak "docId $docId is already used (wordId=$wordId)"
384             if $docId == $check;
385             }
386             }
387              
388             # OK, let's extract words from the $_[0] buffer
389             my %positions;
390             for (my $nwords = 1; $_[0] =~ /$self->{wregex}/g; $nwords++) {
391              
392             my $word = $self->{wfilter}->($&) or next;
393             my $wordId = $self->{ixw}{$word} ||
394             ($self->{ixw}{$word} = ++$self->{ixw}{_NWORDS}); # create new wordId
395             push @{$positions{$wordId}}, $nwords if $wordId > 0;
396             }
397              
398             foreach my $wordId (keys %positions) {
399             my $occurrences = @{$positions{$wordId}};
400             $occurrences = 255 if $occurrences > 255;
401              
402             $self->{ixd}{$wordId} .= pack(IXDPACK, $docId, $occurrences);
403             if ($self->{ixp}) {
404             my $ixpKey = pack IXPKEYPACK, $docId, $wordId;
405             $self->{ixp}{$ixpKey} = pack(IXPPACK, @{$positions{$wordId}});
406             }
407             }
408              
409             $self->{ixd}{NDOCS} = 0 if not defined $self->{ixd}{NDOCS};
410             $self->{ixd}{NDOCS} += 1;
411             }
412              
413              
414             =item C
415              
416             Removes a document from the index.
417             If the index contains word positions (true by default), then
418             only the C is needed; however, if the index was created
419             without word positions, then the text representation
420             of the document must be given as a scalar string in the second argument
421             (of course this should be the same as the one that was supplied
422             when calling the L method).
423              
424              
425             =cut
426              
427             sub remove {
428             my $self = shift;
429             my $docId = shift;
430             # my $buf = shift; # using $_[0] instead for efficiency reasons
431              
432             my $wordIds;
433              
434             if ($self->{ixp}) { # if using word positions
435             not $_[0] or carp "remove() : unexpected 'buf' argument";
436             $wordIds= $self->wordIds($docId);
437             }
438             else { # otherwise : recompute word ids
439             $wordIds = [grep {defined $_ and $_ > 0}
440             map {$self->{ixw}{$_}}
441             uniq map {$self->{wfilter}->($_)}
442             ($_[0] =~ /$self->{wregex}/g)];
443             }
444              
445             return if not @$wordIds;
446              
447             foreach my $wordId (@$wordIds) {
448             my %docs = unpack IXDPACK_L, $self->{ixd}{$wordId};
449             delete $docs{$docId};
450             $self->{ixd}{$wordId} = pack IXDPACK_L, %docs;
451             if ($self->{ixp}) {
452             my $ixpKey = pack IXPKEYPACK, $docId, $wordId;
453             delete $self->{ixp}{$ixpKey};
454             }
455             }
456              
457             $self->{ixd}{NDOCS} -= 1;
458             }
459              
460             =item C
461              
462             Returns a ref to an array of word Ids contained in the specified document
463             (not available if the index was created with C<< positions => 0 >>)
464              
465             =cut
466              
467             sub wordIds {
468             my $self = shift;
469             my $docId_ini = shift;
470              
471             $self->{ixpDb}
472             or croak "wordIds() not available (index was created with positions=>0)";
473              
474             my @wordIds = ();
475             my $c = $self->{ixpDb}->db_cursor;
476             my ($k, $v);
477             $k = pack IXPKEYPACK, $docId_ini, 0;
478             my $status = $c->c_get($k, $v, DB_SET_RANGE);
479             while ($status == 0) {
480             my ($docId, $wordId) = unpack IXPKEYPACK, $k;
481             last if $docId != $docId_ini;
482             push @wordIds, $wordId;
483             $status = $c->c_get($k, $v, DB_NEXT);
484             }
485             return \@wordIds;
486             }
487              
488              
489             =item C
490              
491             Returns a ref to an array of words found in the dictionary,
492             starting with prefix (i.e. C<< $ix->words("foo") >> will
493             return "foo", "food", "fool", "footage", etc.).
494              
495             =cut
496              
497             sub words {
498             my $self = shift;
499             my $prefix = shift;
500              
501             my $regex = qr/^$prefix/;
502             my @words = ();
503             my $c = $self->{ixwDb}->db_cursor;
504             my ($k, $v);
505             $k = $prefix;
506             my $status = $c->c_get($k, $v, DB_SET_RANGE);
507             while ($status == 0) {
508             last if $k !~ $regex;
509             push @words, $k;
510             $status = $c->c_get($k, $v, DB_NEXT);
511             }
512             return \@words;
513             }
514              
515              
516              
517             =item C
518              
519             Debugging function, prints indexed words with list of associated docs.
520              
521             =cut
522              
523             sub dump {
524             my $self = shift;
525             foreach my $word (sort keys %{$self->{ixw}}) {
526             my $wordId = $self->{ixw}{$word};
527             if ($wordId == -1) {
528             print "$word : STOPWORD\n";
529             }
530             else {
531             my %docs = unpack IXDPACK_L, $self->{ixd}{$wordId};
532             print "$word : ", join (" ", keys %docs), "\n";
533             }
534             }
535             }
536              
537              
538             =item C
539              
540             Searches the index. See the L and L sections
541             above for short descriptions of query strings, or
542             L for details. The second argument is optional ;
543             if true, all words without any prefix will implicitly take prefix '+'
544             (mandatory words).
545              
546             The return value is a hash ref containing
547              
548             =over
549              
550             =item scores
551              
552             hash ref, where keys are docIds of matching documents, and values are
553             the corresponding computed scores.
554              
555             =item killedWords
556              
557             ref to an array of terms from the query string which were ignored
558             during the search (because they were filtered out or were stopwords)
559              
560             =item regex
561              
562             ref to a regular expression corresponding to all terms in the query
563             string. This will be useful if you later want to get contextual
564             excerpts from the found documents (see the L method).
565              
566             =back
567              
568             =cut
569              
570              
571             sub search {
572             my $self = shift;
573             my $query_string = shift;
574             my $implicitPlus = shift;
575              
576             $self->{qp} ||= new Search::QueryParser;
577              
578             my $q = $self->{qp}->parse($query_string, $implicitPlus);
579             my $killedWords = {};
580             my $wordsRegexes = [];
581              
582             my $qt = $self->translateQuery($q, $killedWords, $wordsRegexes);
583              
584             my $tmp = {};
585             $tmp->{$_} = 1 foreach @$wordsRegexes;
586             my $strRegex = "(?:" . join("|", keys %$tmp) . ")";
587              
588             return {scores => $self->_search($qt),
589             killedWords => [keys %$killedWords],
590             regex => qr/$strRegex/i};
591             }
592              
593              
594             sub _search {
595             my ($self, $q) = @_;
596              
597             my $scores = undef; # hash {doc1 => score1, doc2 => score2 ...}
598              
599             # 1) deal with mandatory subqueries
600              
601             foreach my $subQ ( @{$q->{'+'}} ) {
602             my $sc = $self->docsAndScores($subQ) or next;
603             $scores = $sc and next if not $scores; # if first result set, just store
604              
605             # otherwise, intersect with previous result set
606             foreach my $docId (keys %$scores) {
607             delete $scores->{$docId} and next if not defined $sc->{$docId};
608             addToScore $scores->{$docId}, $sc->{$docId}; # otherwise
609             }
610             }
611              
612             my $noMandatorySubq = not $scores;
613              
614             # 2) deal with non-mandatory subqueries
615              
616             foreach my $subQ (@{$q->{''}}) {
617             my $sc = $self->docsAndScores($subQ) or next;
618             $scores = $sc and next if not $scores; # if first result set, just store
619              
620             # otherwise, combine with previous result set
621             foreach my $docId (keys %$sc) {
622             if (defined $scores->{$docId}) { # docId was already there, add new score
623             addToScore $scores->{$docId}, $sc->{$docId};
624             }
625             elsif ($noMandatorySubq){ # insert a new docId to the result set
626             $scores->{$docId} = $sc->{$docId};
627             }
628             # else do nothing (ignore this docId)
629             }
630             }
631              
632             return undef if not $scores or not %$scores; # no results
633              
634             # 3) deal with negative subqueries (remove corresponding docs from results)
635              
636             foreach my $subQ (@{$q->{'-'}}) {
637             my $negScores = $self->docsAndScores($subQ) or next;
638             delete $scores->{$_} foreach keys %$negScores;
639             }
640              
641             return $scores;
642             }
643              
644              
645              
646             sub docsAndScores { # returns a hash {docId => score} or undef (no info)
647             my ($self, $subQ) = @_;
648              
649             # recursive call to _search if $subQ is a parenthesized query
650             return $self->_search($subQ->{value}) if $subQ->{op} eq '()';
651              
652             # otherwise, don't care about $subQ->{op} (assert $subQ->{op} eq ':')
653              
654             if (ref $subQ->{value}) { # several words, this is an "exact phrase"
655             return $self->matchExactPhrase($subQ);
656             }
657             elsif ($subQ->{value} <= -1) {# this is a stopword
658             return undef;
659             }
660             else { # scalar value, match single word
661             my $scores = {unpack IXDPACK_L, ($self->{ixd}{$subQ->{value}} || "")};
662             my @k = keys %$scores;
663             if (@k) {
664             my $coeff = log(($self->{ixd}{NDOCS} + 1)/@k) * 100;
665             $scores->{$_} = int($coeff * $scores->{$_}) foreach @k;
666             }
667             return $scores;
668             }
669             }
670              
671              
672             sub matchExactPhrase {
673             my ($self, $subQ) = @_;
674              
675             if (! $self->{ixp}) { # if not indexed with positions
676             # translate into an AND query
677             my $fake_query = {'+' => [map {{op => ':',
678             value => $_ }} @{$subQ->{value}}]};
679             # and search for that one
680             return $self->_search($fake_query);
681             };
682              
683             # otherwise, intersect word position sets
684             my %pos;
685             my $wordDelta = 0;
686             my $scores = undef;
687             foreach my $wordId (@{$subQ->{value}}) {
688             my $sc = $self->docsAndScores({op=>':', value=>$wordId});
689             if (not $scores) { # no previous result set
690             if ($sc) {
691             $scores = $sc;
692             foreach my $docId (keys %$scores) {
693             my $ixpKey = pack IXPKEYPACK, $docId, $wordId;
694             $pos{$docId} = [unpack IXPPACK, $self->{ixp}{$ixpKey}];
695             }
696             }
697             }
698             else { # combine with previous result set
699             $wordDelta++;
700             foreach my $docId (keys %$scores) {
701             if ($sc) { # if we have info about current word (is not a stopword)
702             if (not defined $sc->{$docId}) { # current word not in current doc
703             delete $scores->{$docId};
704             } else { # current word found in current doc, check if positions match
705             my $ixpKey = pack IXPKEYPACK, $docId, $wordId;
706             my @newPos = unpack IXPPACK, $self->{ixp}{$ixpKey};
707             $pos{$docId} = nearPositions($pos{$docId}, \@newPos, $wordDelta)
708             and addToScore $scores->{$docId}, $sc->{$docId}
709             or delete $scores->{$docId};
710             }
711             }
712             } # end foreach my $docId (keys %$scores)
713             }
714             } # end foreach my $wordId (@{$subQ->{value}})
715              
716             return $scores;
717             }
718              
719              
720              
721              
722              
723              
724              
725              
726              
727              
728              
729              
730              
731              
732              
733              
734              
735             sub nearPositions {
736             my ($set1, $set2, $wordDelta) = @_;
737             # returns the set of positions in $set2 which are "close enough" (<= $wordDelta)
738             # to positions in $set1. Assumption : input sets are sorted.
739              
740              
741             my @result;
742             my ($i1, $i2) = (0, 0); # indices into sets
743              
744             while ($i1 < @$set1 and $i2 < @$set2) {
745             my $delta = $set2->[$i2] - $set1->[$i1];
746             ++$i1 and next if $delta > $wordDelta;
747             push @result, $set2->[$i2] if $delta > 0;
748             ++$i2;
749             }
750             return @result ? \@result : undef;
751             }
752              
753              
754              
755             sub addToScore (\$$) { # first score arg gets "incremented" by the second arg
756             my ($ptScore1, $score2) = @_;
757             $$ptScore1 = 0 if not defined $$ptScore1;
758             $$ptScore1 += $score2 if $score2; # TODO : find better formula for score combination !
759             }
760              
761              
762             sub translateQuery { # replace words by ids, remove irrelevant subqueries
763             my ($self, $q, $killedWords, $wordsRegexes) = @_;
764              
765             my $r = {};
766              
767             foreach my $k ('+', '-', '') {
768             foreach my $subQ (@{$q->{$k}}) {
769              
770             # ignore items concerning other field names
771             next if $subQ->{field} and $subQ->{field} ne $self->{fieldname};
772              
773             my $val = $subQ->{value};
774              
775             my $clone = undef;
776             if ($subQ->{op} eq '()') {
777             $clone = {op => '()',
778             value => $self->translateQuery($val, $killedWords, $wordsRegexes)};
779             }
780             elsif ($subQ->{op} eq ':') {
781             # split query according to our notion of "term"
782             my @words = ($val =~ /$self->{wregex}/g);
783              
784             # TODO : 1) accept '*' suffix; 2) find keys in $self->{ixw}; 3) rewrite into
785             # an 'OR' query
786              
787             # my @words = ($str =~ /$self->{wregex}\*?/g);
788              
789             my $regex1 = join "\\W+", map quotemeta, @words;
790             my $regex2 = join "\\W+", map quotemeta,
791             map {$self->{wfilter}($_)} @words;
792             foreach my $regex ($regex1, $regex2) {
793             $regex = "\\b$regex" if $regex =~ /^\w/;
794             $regex = "$regex\\b" if $regex =~ /\w$/;
795             }
796             push @$wordsRegexes, $regex1;
797             push @$wordsRegexes, $regex2 unless $regex1 eq $regex2;
798            
799             # now translate into word ids
800             foreach my $word (@words) {
801             my $wf = $self->{wfilter}->($word);
802             my $wordId = $wf ? ($self->{ixw}{$wf} || 0) : -1;
803             $killedWords->{$word} = 1 if $wordId < 0;
804             $word = $wordId;
805             }
806              
807             $val = (@words>1) ? \@words : # several words : return an array
808             (@words>0) ? $words[0] : # just one word : return its id
809             0; # no word : return 0 (means "no info")
810              
811             $clone = {op => ':', value=> $val};
812             }
813             push @{$r->{$k}}, $clone if $clone;
814             }
815             }
816              
817             return $r;
818             }
819              
820              
821              
822             =item C
823              
824             Searches C for occurrences of C,
825             extracts the occurences together with some context
826             (a number of characters to the left and to the right),
827             and highlights the occurences. See parameters C,
828             C, C, C of the L method.
829              
830             =cut
831              
832             sub excerpts {
833             my $self = shift;
834             # $_[0] : text buffer ; no copy for efficiency reason
835             my $regex = $_[1];
836              
837             my $nc = $self->{ctxtNumChars};
838              
839             # find start and end positions of matching fragments
840             my $matches = []; # array of refs to [start, end, number_of_matches]
841             while ($_[0] =~ /$regex/g) {
842             my ($start, $end) = ($-[0], $+[0]);
843             if (@$matches and $start <= $matches->[-1][1] + $nc) {
844             # merge with the last fragment if close enough
845             $matches->[-1][1] = $end; # extend the end position
846             $matches->[-1][2] += 1; # increment the number of matches
847             }
848             else {
849             push @$matches, [$start, $end, 1];
850             }
851             }
852              
853             foreach (@$matches) { # extend start and end positions by $self->{ctxtNumChars}
854             $_->[0] = ($_->[0] < $nc) ? 0 : $_->[0] - $nc;
855             $_->[1] += $nc;
856             }
857              
858             my $excerpts = [];
859             foreach my $match (sort {$b->[2] <=> $a->[2]} @$matches) {
860             last if @$excerpts >= $self->{maxExcerpts};
861             my $x = substr($_[0], $match->[0], $match->[1] - $match->[0]); # extract
862             $x =~ s/$regex/$self->{preMatch}$&$self->{postMatch}/g ; # highlight
863             push @$excerpts, "...$x...";
864             }
865             return $excerpts;
866             }
867              
868             =back
869              
870             =head1 TO DO
871              
872             =over
873              
874             =item *
875              
876             Find a proper formula for combining scores from several terms.
877             Current implementation is ridiculously simple-minded (just an addition).
878             Also study the literature to improve the scoring formula.
879              
880             =item *
881              
882             Handle concurrency through BerkeleyDB locks.
883              
884             =item *
885              
886             Maybe put all 3 index files as subDatabases in one single file.
887              
888             =item *
889              
890             Fine tuning of cachesize and other BerkeleyDB parameters.
891              
892             =item *
893              
894             Compare performances with other packages.
895              
896             =item *
897              
898             More functionalities : add NEAR operator and boost factors.
899              
900             =back
901              
902              
903              
904             =head1 SEE ALSO
905              
906             L is nice and compact, but
907             limited in functionality (no +/- prefixes, no "exact phrase" search,
908             no parentheses).
909              
910             L is a Perl port of the Java I search engine.
911             Plucene has probably every feature you will ever need, but requires
912             quite an investment to install and learn (more than 60 classes,
913             dependencies on lots of external modules).
914             I haven't done any benchmarks yet to compare performance.
915              
916             L is a more recent, more sophisticated search engine,
917             which looks very powerful and should be probably faster and definitely
918             more scalable than C; but also with a less compact
919             API. I haven't performed any detailed comparison yet.
920              
921              
922             =cut
923              
924            
925             1;