File Coverage

blib/lib/Lingua/EN/Sentence.pm

Criterion	Covered	Total	%
statement	70	89	78.6
branch	3	12	25.0
condition			n/a
subroutine	15	20	75.0
pod	7	11	63.6
total	95	132	71.9

line	stmt	bran	sub	pod	time	code
1						package Lingua::EN::Sentence;
2
3						#==============================================================================
4						#
5						# Start of POD
6						#
7						#==============================================================================
8
9						=head1 NAME
10
11						Lingua::EN::Sentence - split text into sentences
12
13						=head1 SYNOPSIS
14
15						use Lingua::EN::Sentence qw( get_sentences add_acronyms );
16
17						add_acronyms('lt','gen'); ## adding support for 'Lt. Gen.'
18						my $sentences=get_sentences($text); ## Get the sentences.
19						foreach my $sentence (@$sentences) {
20						## do something with $sentence
21						}
22
23
24						=head1 DESCRIPTION
25
26						The C module contains the function get_sentences, which
27						splits text into its constituent sentences, based on a regular expression and a
28						list of abbreviations (built in and given).
29
30						Certain well know exceptions, such as abbreviations, may cause incorrect
31						segmentations. But some of them are already integrated into this code and are
32						being taken care of. Still, if you see that there are words causing the
33						get_sentences function to fail, you can add those to the module, so it notices them.
34
35						=head1 ALGORITHM
36
37						Basically, I use a 'brute' regular expression to split the text into sentences.
38						(Well, nothing is yet split - I just mark the end-of-sentence). Then I look into
39						a set of rules which decide when an end-of-sentence is justified and when it's a
40						mistake. In case of a mistake, the end-of-sentence mark is removed.
41
42						What are such mistakes? Cases of abbreviations, for example. I have a list of
43						such abbreviations (Please see public globals belwo for a list), and more
44						general rules (for example, the abbreviations 'i.e.' and '.e.g.' need not to be
45						in the list as a special rule takes care of all single letter abbreviations).
46
47						=head1 FUNCTIONS
48
49						All functions used should be requested in the 'use' clause. None is exported by
50						default.
51
52						=over 4
53
54						=item get_sentences( $text )
55
56						The get_sentences function takes a scalar containing ascii text as an argument
57						and returns a reference to an array of sentences that the text has been split
58						into. Returned sentences will be trimmed (beginning and end of sentence) of
59						white space. Strings with no alpha-numeric characters in them, won't be
60						returned as sentences.
61
62						=item add_acronyms( @acronyms )
63
64						This function is used for adding acronyms not supported by this code.
65						The input should be regular expressions for matching the desired acronyms,
66						but should not include the final period (C<.>). So, for example, C
67						matches C and C. C will match C. You do not
68						need to bother with acronyms consisting of single letters and dots
69						(e.g. "U.S.A."), as these are found automatically. Note also that acronyms
70						are searched for on a case insensitive basis.
71
72						Please see 'Acronym/Abbreviations list' section for the abbreviations already
73						supported by this module.
74
75						=item get_acronyms( )
76
77						This function will return the defined list of acronyms.
78
79						=item set_acronyms( @my_acronyms )
80
81						This function replaces the predefined acronym list with the given list. See
82						L for details on the input specifications.
83
84
85						=item get_EOS( )
86
87						This function returns the value of the string used to mark the end of sentence.
88						You might want to see what it is, and to make sure your text doesn't contain it.
89						You can use set_EOS() to alter the end-of-sentence string to whatever you
90						desire.
91
92						=item set_EOS( $new_EOS_string )
93
94						This function alters the end-of-sentence string used to mark the end of sentences.
95
96						=item set_locale( $new_locale )
97						Receives language locale in the form language.country.character-set
98						for example:
99						"fr_CA.ISO8859-1"
100						for Canadian French using character set ISO8859-1.
101
102						Returns a reference to a hash containing the current locale formatting values.
103						Returns undef if got undef.
104
105
106						The following will set the LC_COLLATE behaviour to Argentinian Spanish.
107						NOTE: The naming and availability of locales depends on your operating sysem.
108						Please consult the perllocale manpage for how to find out which locales are available in your system.
109
110						$loc = set_locale( "es_AR.ISO8859-1" );
111
112						This actually does this:
113
114						$loc = setlocale( LC_ALL, "es_AR.ISO8859-1" );
115
116						=back
117
118						=head1 Acronym/Abbreviations list
119
120						You can use the get_acronyms() function to get acronyms.
121						It has become too long to specify in the documentation.
122
123						If I come across a good general-purpose list - I'll incorporate it into this module.
124						Feel free to suggest such lists.
125
126						=head1 FUTURE WORK
127
128						[1] Object Oriented like usage
129						[2] Supporting more than just English/French
130						[3] Code optimization. Currently everything is RE based and not so optimized RE
131						[4] Possibly use more semantic heuristics for detecting a beginning of a sentence
132
133						=head1 SEE ALSO
134
135						Text::Sentence
136
137						=head1 REPOSITORY
138
139						L
140
141						=head1 AUTHOR
142
143						Shlomo Yona shlomo@cs.haifa.ac.il
144
145						Currently being maintained by Kim Ryan, kimryan at CPAN d o t org
146
147
148						=head1 COPYRIGHT AND LICENSE
149
150						Copyright (c) 2001-2016 Shlomo Yona. All rights reserved.
151						Copyright (c) 2018 Kim Ryan. All rights reserved.
152
153						This library is free software; you can redistribute it and/or modify
154						it under the same terms as Perl itself.
155
156						=cut
157
158						#==============================================================================
159						#
160						# End of POD
161						#
162						#==============================================================================
163
164
165						#==============================================================================
166						#
167						# Pragmas
168						#
169						#==============================================================================
170
171	1		1		70462	use strict;
	1				2
	1				26
172	1		1		4	use warnings;
	1				1
	1				26
173	1		1		495	use POSIX qw(locale_h setlocale);
	1				7794
	1				5
174						#==============================================================================
175						#
176						# Modules
177						#
178						#==============================================================================
179	1		1		1575	use Exporter;
	1				2
	1				43
180
181						#==============================================================================
182						#
183						# Public globals
184						#
185						#==============================================================================
186	1		1		6	use vars qw/$VERSION @ISA @EXPORT_OK $EOS $LOC $AP $P $PAP @ABBREVIATIONS/;
	1				2
	1				83
187	1		1		4	use Carp qw/cluck/;
	1				2
	1				57
188	1		1		740	use English;
	1				2976
	1				5
189
190						our $VERSION = '0.31';
191
192						our $LOC;
193						if ($OSNAME ne 'android') {
194						# Call POSIX function
195						$LOC= setlocale(LC_CTYPE, "en_US.UTF-8");
196						}
197
198
199	1		1		832	use locale;
	1				846
	1				4
200
201						@ISA = qw( Exporter );
202						@EXPORT_OK = qw( get_sentences add_acronyms get_acronyms set_acronyms get_EOS set_EOS set_locale);
203
204						our $EOS="\001";
205
206						our $P = q/[\.!?]/; # PUNCTUATION
207
208						$AP = q/(?:'\|"\|\?\|\)\|\]\|\})?/; # AFTER PUNCTUATION
209						our $PAP = $P.$AP;
210
211						my @PEOPLE = qw( mr mrs ms dr prof mme ms?gr sens? reps? gov attys? supt insp const det revd? ald rt hon);
212						my @TITLE_SUFFIXES = qw(PhD jn?r sn?r esq md llb);
213						my @MILITARY = qw( col gen lt cdr cmdr adm capt sgt cpl maj pte);
214						my @INSTITUTES = qw( dept univ assn bros);
215						my @COMPANIES = qw( inc ltd co corp);
216						my @PLACES =
217						qw(
218						arc al ave blv?d cl ct cres dr expy? fw?y hwa?y la pde? pl plz rd st tce
219						dist mt km in ft
220						Ala Ariz Ark Cal Calif Col Colo Conn Del Fed Fla Ga Ida Id Ill Ind Ia Kan Kans Ken Ky
221						La Me Md Is Mass Mich Minn Miss Mo Mont Neb Nebr Nev Mex Okla Ok Ore Penna Penn Pa Dak
222						Tenn Tex Ut Vt Va Wash Wis Wisc Wy Wyo USAFA Alta Man Ont Qu? Sask Yuk
223						Aust Vic Qld Tas
224						);
225						my @MONTHS = qw(jan feb mar apr may jun jul aug sep sept oct nov dec);
226						my @MISC = qw(no esp est); # Established
227						my @LATIN = qw(vs etc al ibid sic);
228
229						our @ABBREVIATIONS = (@PEOPLE, @TITLE_SUFFIXES, @MILITARY, @INSTITUTES, @COMPANIES, @PLACES, @MONTHS, @MISC, @LATIN );
230						my $abbreviation_regex;
231						_set_abbreviations_regex();
232
233
234						#==============================================================================
235						#
236						# Public methods
237						#
238						#==============================================================================
239
240						#------------------------------------------------------------------------------
241						# get_sentences - takes text input and splits it into sentences.
242						# A regular expression viciously cuts the text into sentences,
243						# and then a list of rules (some of them consist of a list of abbreviations)
244						# are applied on the marked text in order to fix end-of-sentence markings in
245						# places which are not indeed end-of-sentence.
246						#------------------------------------------------------------------------------
247						sub get_sentences {
248	2		2	1	70	my ($text)=@_;
249	2	50			7	return [] unless defined $text;
250	2				7	my $marked_text = first_sentence_breaking($text);
251	2				5	my $fixed_marked_text = remove_false_end_of_sentence($marked_text);
252	2				3	$fixed_marked_text = split_unsplit_stuff($fixed_marked_text);
253	2				12	my @sentences = split(/$EOS/,$fixed_marked_text);
254	2				5	my $cleaned_sentences = clean_sentences(\@sentences);
255	2				7	return $cleaned_sentences;
256						}
257
258						#------------------------------------------------------------------------------
259						# add_acronyms - user can add a list of acronyms/abbreviations.
260						#------------------------------------------------------------------------------
261						sub add_acronyms {
262	1		1	1	835	push @ABBREVIATIONS, @_;
263	1				3	_set_abbreviations_regex();
264						}
265
266						#------------------------------------------------------------------------------
267						# get_acronyms - get list of defined acronyms.
268						#------------------------------------------------------------------------------
269						sub get_acronyms {
270	0		0	1	0	return @ABBREVIATIONS;
271						}
272
273						#------------------------------------------------------------------------------
274						# set_acronyms - replace the predefined acronyms list with your own list.
275						#------------------------------------------------------------------------------
276						sub set_acronyms {
277	0		0	1	0	@ABBREVIATIONS=@_;
278	0				0	_set_abbreviations_regex();
279						}
280
281						#------------------------------------------------------------------------------
282						# get_EOS - get the value of the $EOS variable (end-of-sentence mark).
283						#------------------------------------------------------------------------------
284						sub get_EOS {
285	0		0	1	0	return $EOS;
286						}
287
288						#------------------------------------------------------------------------------
289						# set_EOS - set the value of the $EOS variable (end-of-sentence mark).
290						#------------------------------------------------------------------------------
291						sub set_EOS {
292	0		0	1	0	my ($new_EOS) = @_;
293	0	0			0	if (not defined $new_EOS) {
294	0				0	cluck "Won't set \$EOS to undefined value!\n";
295	0				0	return $EOS;
296						}
297	0				0	$EOS = $new_EOS;
298	0				0	_set_abbreviations_regex();
299	0				0	return $EOS;
300						}
301
302						#------------------------------------------------------------------------------
303
304						# set_locale - set the value of the locale.
305						#
306						# Receieves language locale in the form
307						# language.country.character-set
308						# for example:
309						# "fr_CA.ISO8859-1"
310						# for Canadian French using character set ISO8859-1.
311						#
312						# Returns a reference to a hash containing the current locale formatting values.
313						# Returns undef if got undef.
314						#
315						# The following will set the LC_ALL behaviour to Argentinian Spanish.
316						# NOTE: The naming and availability of locales depends on your operating system.
317						# Please consult the perllocale manpage for how to find out which locales are
318						# available in your system.
319						#
320						# $loc = set_locale( "es_AR.ISO8859-1" );
321						#
322						# This actually does this:
323						#
324						# $loc = setlocale( LC_ALL, "es_AR.ISO8859-1" ); # NOTE, but actually does LC_CTYPE, should be LC_COLLATE?
325
326						#------------------------------------------------------------------------------
327						sub set_locale {
328	0		0	1	0	my ($new_locale) = @_;
329	0	0			0	if (not defined $new_locale) {
330	0				0	cluck "Won't set locale to undefined value!\n";
331	0				0	return undef;
332						}
333
334	0	0			0	if ($OSNAME ne 'android') {
335						# Call POSIX function
336	0				0	$LOC = setlocale(LC_CTYPE, $new_locale);
337	0				0	return $LOC;
338						}
339						else {
340	0				0	return undef;
341						}
342						}
343
344
345						#==============================================================================
346						#
347						# Private methods
348						#
349						#==============================================================================
350
351						# save some time by pre-compiling a regex used for working with abbreviations
352						sub _set_abbreviations_regex {
353	2		2		24	my $abbreviations = join '\|', @ABBREVIATIONS;
354	2				179	$abbreviation_regex = qr[(\b(?:$abbreviations)$PAP\s)$EOS]is;
355	2				6	return;
356						}
357
358						## Please email me any suggestions for optimizing these RegExps.
359						sub remove_false_end_of_sentence {
360	2		2	0	3	my ($marked_segment) = @_;
361						## ## don't do u.s.a.
362						## $marked_segment=~s/(\.\w$PAP)$EOS/$1/sg;
363	2				87	$marked_segment=~s/([^-\w]\w$PAP\s)$EOS/$1/sg;
364	2				42	$marked_segment=~s/([^-\w]\w$P)$EOS/$1/sg;
365
366						# don't split after a white-space followed by a single letter or number followed
367						# by a dot followed by another whitespace., such as "Something. 1. point one"
368						# Note: will fail for 12. Point 12
369	2				43	$marked_segment=~s/(\s[\w\d]\.\s+)$EOS/$1/sg;
370
371						# fix: bla bla... yada yada
372	2				18	$marked_segment=~s/(\.\.\. )$EOS([[:lower:]])/$1$2/sg;
373						# fix "." "?" "!"
374	2				19	$marked_segment=~s/(['"]$P['"]\s+)$EOS/$1/sg;
375						## fix where abbreviations exist
376	2				802	$marked_segment=~s/$abbreviation_regex/$1/g;
377
378						# don't break after quote unless its a capital letter.
379	2				34	$marked_segment=~s/(["']\s)$EOS(\s[[:lower:]])/$1$2/sg;
380
381						# don't break: text . . some more text.
382	2				14	$marked_segment=~s/(\s\.\s)$EOS(\s*)/$1$2/sg;
383	2				32	$marked_segment=~s/(["']\s)$EOS(\s[[:lower:]])/$1$2/sg;
384
385
386	2				24	$marked_segment=~s/(\s$PAP\s)$EOS/$1/sg;
387	2				7	return $marked_segment;
388						}
389
390						sub split_unsplit_stuff {
391	2		2	0	5	my ($text) = @_;
392
393						# $text=~s/(\D\d+)($P)(\s+)/$1$2$EOS$3/sg; # breaks numbered points, such as {EOL}1. point one
394
395	2				58	$text=~s/([\w $P]\d)($P)(\s+)/$1$2$EOS$3/sg;
396	2				21	$text=~s/($PAP\s)(\s*\()/$1$EOS$2/gs;
397	2				16	$text=~s/('\w$P)(\s)/$1$EOS$2/gs;
398
399
400	2				21	$text=~s/(\sno\.)(\s+)(?!\d)/$1$EOS$2/gis;
401
402						## # split where single capital letter followed by dot makes sense to break.
403						## # notice these are exceptions to the general rule NOT to split on single
404						## # letter.
405						## # notice also that sibgle letter M is missing here, due to French 'mister'
406						## # which is represented as M.
407						## #
408						## # the rule will not split on names begining or containing
409						## # single capital letter dot in the first or second name
410						## # assuming 2 or three word name.
411						## $text=~s/(\s[[:lower:]]\w+\s+[^[[:^upper:]M]\.)(?!\s+[[:upper:]]\.)/$1$EOS/sg;
412
413
414						# add EOS when you see "a.m." or "p.m." followed by a capital letter.
415	2				5	$text=~s/([ap]\.m\.\s+)([[:upper:]])/$1$EOS$2/gs;
416
417	2				5	return $text;
418						}
419
420						sub clean_sentences {
421	2		2	0	3	my ($sentences) = @_;
422	2				4	my $cleaned_sentences;
423	2				3	foreach my $s (@$sentences) {
424	9	50			18	next if not defined $s;
425	9	50			22	next if $s!~m/\w+/;
426	9				22	$s=~s/^\s*//;
427	9				109	$s=~s/\s*$//;
428						## $s=~s/\s+/ /g;
429	9				16	push @$cleaned_sentences,$s;
430						}
431	2				4	return $cleaned_sentences;
432						}
433
434						sub first_sentence_breaking {
435	2		2	0	3	my ($text) = @_;
436	2				12	$text=~s/\n\s*\n/$EOS/gs; ## double new-line means a different sentence.
437	2				56	$text=~s/($PAP\s)/$1$EOS/gs;
438	2				38	$text=~s/(\s\w$P)/$1$EOS/gs; # breake also when single letter comes before punc.
439	2				6	return $text;
440						}
441
442
443						#==============================================================================
444						#
445						# Return TRUE
446						#
447						#==============================================================================
448
449						1;