File Coverage

blib/lib/Lingua/EN/Sentence.pm
Criterion Covered Total %
statement 70 89 78.6
branch 3 12 25.0
condition n/a
subroutine 15 20 75.0
pod 7 11 63.6
total 95 132 71.9


line stmt bran cond sub pod time code
1             package Lingua::EN::Sentence;
2              
3             #==============================================================================
4             #
5             # Start of POD
6             #
7             #==============================================================================
8              
9             =head1 NAME
10              
11             Lingua::EN::Sentence - split text into sentences
12              
13             =head1 SYNOPSIS
14              
15             use Lingua::EN::Sentence qw( get_sentences add_acronyms );
16              
17             add_acronyms('lt','gen'); ## adding support for 'Lt. Gen.'
18             my $sentences=get_sentences($text); ## Get the sentences.
19             foreach my $sentence (@$sentences) {
20             ## do something with $sentence
21             }
22              
23              
24             =head1 DESCRIPTION
25              
26             The C module contains the function get_sentences, which
27             splits text into its constituent sentences, based on a regular expression and a
28             list of abbreviations (built in and given).
29              
30             Certain well know exceptions, such as abbreviations, may cause incorrect
31             segmentations. But some of them are already integrated into this code and are
32             being taken care of. Still, if you see that there are words causing the
33             get_sentences function to fail, you can add those to the module, so it notices them.
34              
35             =head1 ALGORITHM
36              
37             Basically, I use a 'brute' regular expression to split the text into sentences.
38             (Well, nothing is yet split - I just mark the end-of-sentence). Then I look into
39             a set of rules which decide when an end-of-sentence is justified and when it's a
40             mistake. In case of a mistake, the end-of-sentence mark is removed.
41              
42             What are such mistakes? Cases of abbreviations, for example. I have a list of
43             such abbreviations (Please see public globals belwo for a list), and more
44             general rules (for example, the abbreviations 'i.e.' and '.e.g.' need not to be
45             in the list as a special rule takes care of all single letter abbreviations).
46              
47             =head1 FUNCTIONS
48              
49             All functions used should be requested in the 'use' clause. None is exported by
50             default.
51              
52             =over 4
53              
54             =item get_sentences( $text )
55              
56             The get_sentences function takes a scalar containing ascii text as an argument
57             and returns a reference to an array of sentences that the text has been split
58             into. Returned sentences will be trimmed (beginning and end of sentence) of
59             white space. Strings with no alpha-numeric characters in them, won't be
60             returned as sentences.
61              
62             =item add_acronyms( @acronyms )
63              
64             This function is used for adding acronyms not supported by this code.
65             The input should be regular expressions for matching the desired acronyms,
66             but should not include the final period (C<.>). So, for example, C
67             matches C and C. C will match C. You do not
68             need to bother with acronyms consisting of single letters and dots
69             (e.g. "U.S.A."), as these are found automatically. Note also that acronyms
70             are searched for on a case insensitive basis.
71              
72             Please see 'Acronym/Abbreviations list' section for the abbreviations already
73             supported by this module.
74              
75             =item get_acronyms( )
76              
77             This function will return the defined list of acronyms.
78              
79             =item set_acronyms( @my_acronyms )
80              
81             This function replaces the predefined acronym list with the given list. See
82             L for details on the input specifications.
83              
84              
85             =item get_EOS( )
86              
87             This function returns the value of the string used to mark the end of sentence.
88             You might want to see what it is, and to make sure your text doesn't contain it.
89             You can use set_EOS() to alter the end-of-sentence string to whatever you
90             desire.
91              
92             =item set_EOS( $new_EOS_string )
93              
94             This function alters the end-of-sentence string used to mark the end of sentences.
95              
96             =item set_locale( $new_locale )
97             Receives language locale in the form language.country.character-set
98             for example:
99             "fr_CA.ISO8859-1"
100             for Canadian French using character set ISO8859-1.
101              
102             Returns a reference to a hash containing the current locale formatting values.
103             Returns undef if got undef.
104              
105              
106             The following will set the LC_COLLATE behaviour to Argentinian Spanish.
107             NOTE: The naming and availability of locales depends on your operating sysem.
108             Please consult the perllocale manpage for how to find out which locales are available in your system.
109              
110             $loc = set_locale( "es_AR.ISO8859-1" );
111              
112             This actually does this:
113              
114             $loc = setlocale( LC_ALL, "es_AR.ISO8859-1" );
115              
116             =back
117              
118             =head1 Acronym/Abbreviations list
119              
120             You can use the get_acronyms() function to get acronyms.
121             It has become too long to specify in the documentation.
122              
123             If I come across a good general-purpose list - I'll incorporate it into this module.
124             Feel free to suggest such lists.
125              
126             =head1 FUTURE WORK
127              
128             [1] Object Oriented like usage
129             [2] Supporting more than just English/French
130             [3] Code optimization. Currently everything is RE based and not so optimized RE
131             [4] Possibly use more semantic heuristics for detecting a beginning of a sentence
132              
133             =head1 SEE ALSO
134              
135             Text::Sentence
136            
137             =head1 REPOSITORY
138              
139             L
140              
141             =head1 AUTHOR
142              
143             Shlomo Yona shlomo@cs.haifa.ac.il
144              
145             Currently being maintained by Kim Ryan, kimryan at CPAN d o t org
146              
147              
148             =head1 COPYRIGHT AND LICENSE
149              
150             Copyright (c) 2001-2016 Shlomo Yona. All rights reserved.
151             Copyright (c) 2018 Kim Ryan. All rights reserved.
152              
153             This library is free software; you can redistribute it and/or modify
154             it under the same terms as Perl itself.
155              
156             =cut
157              
158             #==============================================================================
159             #
160             # End of POD
161             #
162             #==============================================================================
163              
164              
165             #==============================================================================
166             #
167             # Pragmas
168             #
169             #==============================================================================
170              
171 1     1   70462 use strict;
  1         2  
  1         26  
172 1     1   4 use warnings;
  1         1  
  1         26  
173 1     1   495 use POSIX qw(locale_h setlocale);
  1         7794  
  1         5  
174             #==============================================================================
175             #
176             # Modules
177             #
178             #==============================================================================
179 1     1   1575 use Exporter;
  1         2  
  1         43  
180              
181             #==============================================================================
182             #
183             # Public globals
184             #
185             #==============================================================================
186 1     1   6 use vars qw/$VERSION @ISA @EXPORT_OK $EOS $LOC $AP $P $PAP @ABBREVIATIONS/;
  1         2  
  1         83  
187 1     1   4 use Carp qw/cluck/;
  1         2  
  1         57  
188 1     1   740 use English;
  1         2976  
  1         5  
189              
190             our $VERSION = '0.31';
191              
192             our $LOC;
193             if ($OSNAME ne 'android') {
194             # Call POSIX function
195             $LOC= setlocale(LC_CTYPE, "en_US.UTF-8");
196             }
197              
198            
199 1     1   832 use locale;
  1         846  
  1         4  
200              
201             @ISA = qw( Exporter );
202             @EXPORT_OK = qw( get_sentences add_acronyms get_acronyms set_acronyms get_EOS set_EOS set_locale);
203              
204             our $EOS="\001";
205              
206             our $P = q/[\.!?]/; # PUNCTUATION
207              
208             $AP = q/(?:'|"|\?|\)|\]|\})?/; # AFTER PUNCTUATION
209             our $PAP = $P.$AP;
210              
211             my @PEOPLE = qw( mr mrs ms dr prof mme ms?gr sens? reps? gov attys? supt insp const det revd? ald rt hon);
212             my @TITLE_SUFFIXES = qw(PhD jn?r sn?r esq md llb);
213             my @MILITARY = qw( col gen lt cdr cmdr adm capt sgt cpl maj pte);
214             my @INSTITUTES = qw( dept univ assn bros);
215             my @COMPANIES = qw( inc ltd co corp);
216             my @PLACES =
217             qw(
218             arc al ave blv?d cl ct cres dr expy? fw?y hwa?y la pde? pl plz rd st tce
219             dist mt km in ft
220             Ala Ariz Ark Cal Calif Col Colo Conn Del Fed Fla Ga Ida Id Ill Ind Ia Kan Kans Ken Ky
221             La Me Md Is Mass Mich Minn Miss Mo Mont Neb Nebr Nev Mex Okla Ok Ore Penna Penn Pa Dak
222             Tenn Tex Ut Vt Va Wash Wis Wisc Wy Wyo USAFA Alta Man Ont Qu? Sask Yuk
223             Aust Vic Qld Tas
224             );
225             my @MONTHS = qw(jan feb mar apr may jun jul aug sep sept oct nov dec);
226             my @MISC = qw(no esp est); # Established
227             my @LATIN = qw(vs etc al ibid sic);
228              
229             our @ABBREVIATIONS = (@PEOPLE, @TITLE_SUFFIXES, @MILITARY, @INSTITUTES, @COMPANIES, @PLACES, @MONTHS, @MISC, @LATIN );
230             my $abbreviation_regex;
231             _set_abbreviations_regex();
232              
233              
234             #==============================================================================
235             #
236             # Public methods
237             #
238             #==============================================================================
239              
240             #------------------------------------------------------------------------------
241             # get_sentences - takes text input and splits it into sentences.
242             # A regular expression viciously cuts the text into sentences,
243             # and then a list of rules (some of them consist of a list of abbreviations)
244             # are applied on the marked text in order to fix end-of-sentence markings in
245             # places which are not indeed end-of-sentence.
246             #------------------------------------------------------------------------------
247             sub get_sentences {
248 2     2 1 70 my ($text)=@_;
249 2 50       7 return [] unless defined $text;
250 2         7 my $marked_text = first_sentence_breaking($text);
251 2         5 my $fixed_marked_text = remove_false_end_of_sentence($marked_text);
252 2         3 $fixed_marked_text = split_unsplit_stuff($fixed_marked_text);
253 2         12 my @sentences = split(/$EOS/,$fixed_marked_text);
254 2         5 my $cleaned_sentences = clean_sentences(\@sentences);
255 2         7 return $cleaned_sentences;
256             }
257              
258             #------------------------------------------------------------------------------
259             # add_acronyms - user can add a list of acronyms/abbreviations.
260             #------------------------------------------------------------------------------
261             sub add_acronyms {
262 1     1 1 835 push @ABBREVIATIONS, @_;
263 1         3 _set_abbreviations_regex();
264             }
265              
266             #------------------------------------------------------------------------------
267             # get_acronyms - get list of defined acronyms.
268             #------------------------------------------------------------------------------
269             sub get_acronyms {
270 0     0 1 0 return @ABBREVIATIONS;
271             }
272              
273             #------------------------------------------------------------------------------
274             # set_acronyms - replace the predefined acronyms list with your own list.
275             #------------------------------------------------------------------------------
276             sub set_acronyms {
277 0     0 1 0 @ABBREVIATIONS=@_;
278 0         0 _set_abbreviations_regex();
279             }
280              
281             #------------------------------------------------------------------------------
282             # get_EOS - get the value of the $EOS variable (end-of-sentence mark).
283             #------------------------------------------------------------------------------
284             sub get_EOS {
285 0     0 1 0 return $EOS;
286             }
287              
288             #------------------------------------------------------------------------------
289             # set_EOS - set the value of the $EOS variable (end-of-sentence mark).
290             #------------------------------------------------------------------------------
291             sub set_EOS {
292 0     0 1 0 my ($new_EOS) = @_;
293 0 0       0 if (not defined $new_EOS) {
294 0         0 cluck "Won't set \$EOS to undefined value!\n";
295 0         0 return $EOS;
296             }
297 0         0 $EOS = $new_EOS;
298 0         0 _set_abbreviations_regex();
299 0         0 return $EOS;
300             }
301              
302             #------------------------------------------------------------------------------
303              
304             # set_locale - set the value of the locale.
305             #
306             # Receieves language locale in the form
307             # language.country.character-set
308             # for example:
309             # "fr_CA.ISO8859-1"
310             # for Canadian French using character set ISO8859-1.
311             #
312             # Returns a reference to a hash containing the current locale formatting values.
313             # Returns undef if got undef.
314             #
315             # The following will set the LC_ALL behaviour to Argentinian Spanish.
316             # NOTE: The naming and availability of locales depends on your operating system.
317             # Please consult the perllocale manpage for how to find out which locales are
318             # available in your system.
319             #
320             # $loc = set_locale( "es_AR.ISO8859-1" );
321             #
322             # This actually does this:
323             #
324             # $loc = setlocale( LC_ALL, "es_AR.ISO8859-1" ); # NOTE, but actually does LC_CTYPE, should be LC_COLLATE?
325              
326             #------------------------------------------------------------------------------
327             sub set_locale {
328 0     0 1 0 my ($new_locale) = @_;
329 0 0       0 if (not defined $new_locale) {
330 0         0 cluck "Won't set locale to undefined value!\n";
331 0         0 return undef;
332             }
333            
334 0 0       0 if ($OSNAME ne 'android') {
335             # Call POSIX function
336 0         0 $LOC = setlocale(LC_CTYPE, $new_locale);
337 0         0 return $LOC;
338             }
339             else {
340 0         0 return undef;
341             }
342             }
343              
344              
345             #==============================================================================
346             #
347             # Private methods
348             #
349             #==============================================================================
350              
351             # save some time by pre-compiling a regex used for working with abbreviations
352             sub _set_abbreviations_regex {
353 2     2   24 my $abbreviations = join '|', @ABBREVIATIONS;
354 2         179 $abbreviation_regex = qr[(\b(?:$abbreviations)$PAP\s)$EOS]is;
355 2         6 return;
356             }
357              
358             ## Please email me any suggestions for optimizing these RegExps.
359             sub remove_false_end_of_sentence {
360 2     2 0 3 my ($marked_segment) = @_;
361             ## ## don't do u.s.a.
362             ## $marked_segment=~s/(\.\w$PAP)$EOS/$1/sg;
363 2         87 $marked_segment=~s/([^-\w]\w$PAP\s)$EOS/$1/sg;
364 2         42 $marked_segment=~s/([^-\w]\w$P)$EOS/$1/sg;
365              
366             # don't split after a white-space followed by a single letter or number followed
367             # by a dot followed by another whitespace., such as "Something. 1. point one"
368             # Note: will fail for 12. Point 12
369 2         43 $marked_segment=~s/(\s[\w\d]\.\s+)$EOS/$1/sg;
370              
371             # fix: bla bla... yada yada
372 2         18 $marked_segment=~s/(\.\.\. )$EOS([[:lower:]])/$1$2/sg;
373             # fix "." "?" "!"
374 2         19 $marked_segment=~s/(['"]$P['"]\s+)$EOS/$1/sg;
375             ## fix where abbreviations exist
376 2         802 $marked_segment=~s/$abbreviation_regex/$1/g;
377            
378             # don't break after quote unless its a capital letter.
379 2         34 $marked_segment=~s/(["']\s*)$EOS(\s*[[:lower:]])/$1$2/sg;
380              
381             # don't break: text . . some more text.
382 2         14 $marked_segment=~s/(\s\.\s)$EOS(\s*)/$1$2/sg;
383 2         32 $marked_segment=~s/(["']\s*)$EOS(\s*[[:lower:]])/$1$2/sg;
384              
385              
386 2         24 $marked_segment=~s/(\s$PAP\s)$EOS/$1/sg;
387 2         7 return $marked_segment;
388             }
389              
390             sub split_unsplit_stuff {
391 2     2 0 5 my ($text) = @_;
392              
393             # $text=~s/(\D\d+)($P)(\s+)/$1$2$EOS$3/sg; # breaks numbered points, such as {EOL}1. point one
394              
395 2         58 $text=~s/([\w $P]\d)($P)(\s+)/$1$2$EOS$3/sg;
396 2         21 $text=~s/($PAP\s)(\s*\()/$1$EOS$2/gs;
397 2         16 $text=~s/('\w$P)(\s)/$1$EOS$2/gs;
398              
399              
400 2         21 $text=~s/(\sno\.)(\s+)(?!\d)/$1$EOS$2/gis;
401              
402             ## # split where single capital letter followed by dot makes sense to break.
403             ## # notice these are exceptions to the general rule NOT to split on single
404             ## # letter.
405             ## # notice also that sibgle letter M is missing here, due to French 'mister'
406             ## # which is represented as M.
407             ## #
408             ## # the rule will not split on names begining or containing
409             ## # single capital letter dot in the first or second name
410             ## # assuming 2 or three word name.
411             ## $text=~s/(\s[[:lower:]]\w+\s+[^[[:^upper:]M]\.)(?!\s+[[:upper:]]\.)/$1$EOS/sg;
412              
413              
414             # add EOS when you see "a.m." or "p.m." followed by a capital letter.
415 2         5 $text=~s/([ap]\.m\.\s+)([[:upper:]])/$1$EOS$2/gs;
416              
417 2         5 return $text;
418             }
419              
420             sub clean_sentences {
421 2     2 0 3 my ($sentences) = @_;
422 2         4 my $cleaned_sentences;
423 2         3 foreach my $s (@$sentences) {
424 9 50       18 next if not defined $s;
425 9 50       22 next if $s!~m/\w+/;
426 9         22 $s=~s/^\s*//;
427 9         109 $s=~s/\s*$//;
428             ## $s=~s/\s+/ /g;
429 9         16 push @$cleaned_sentences,$s;
430             }
431 2         4 return $cleaned_sentences;
432             }
433              
434             sub first_sentence_breaking {
435 2     2 0 3 my ($text) = @_;
436 2         12 $text=~s/\n\s*\n/$EOS/gs; ## double new-line means a different sentence.
437 2         56 $text=~s/($PAP\s)/$1$EOS/gs;
438 2         38 $text=~s/(\s\w$P)/$1$EOS/gs; # breake also when single letter comes before punc.
439 2         6 return $text;
440             }
441              
442              
443             #==============================================================================
444             #
445             # Return TRUE
446             #
447             #==============================================================================
448              
449             1;