File Coverage

blib/lib/Parse/GutenbergRoget.pm

Criterion	Covered	Total	%
statement	15	103	14.5
branch	0	40	0.0
condition	0	17	0.0
subroutine	5	8	62.5
pod	3	3	100.0
total	23	171	13.4

line	stmt	bran	cond	sub	pod	time	code
1	2			2		137028	use warnings;
	2					5
	2					69
2	2			2		10	use strict;
	2					4
	2					94
3
4							package Parse::GutenbergRoget;
5	2			2		10	use base qw(Exporter);
	2					7
	2					231
6
7	2			2		12	use Carp ();
	2					3
	2					53
8	2			2		2580	use Text::CSV_XS;
	2					24039
	2					2990
9
10							our @EXPORT = qw(parse_roget); ## no critic Export
11
12							=head1 NAME
13
14							Parse::GutenbergRoget - parse Project Gutenberg's Roget's Thesaurus
15
16							=head1 VERSION
17
18							version 0.022
19
20							$Id$
21
22							=cut
23
24							our $VERSION = '0.022';
25
26							=head1 SYNOPSIS
27
28							use Parse::GutenbergRoget
29
30							my %section = parse_roget("./roget15a.txt");
31
32							print $section{1}[0][0]{text}; # existence
33
34							=head1 DESCRIPTION
35
36							A Roget's Thesaurus is more than the simple synonym/antonym finder included in
37							many dictionary sets. It organizes words into semantically realted categories,
38							so that words with related meanings can be found in proximity to one another,
39							with the level of proximity indicating the level of similarity.
40
41							Project Gutenberg has produced an etext of the 1911 edition of Roget's
42							Thesaurus, and later began to revise it, in 1991. While it's not the best
43							Roget-style thesaurus available, it's the best public domain electronic
44							thesaurus datasource I've found.
45
46							This module parses the file's contents into a Perl data structure, which can
47							then be stored in systems for searching and browsing it. This module does
48							I implement those systems.
49
50							The code is not complete. This means that everything that can be parsed is not
51							yet being parsed. It's important to realize that not everything is going to be
52							parseable. There are too many typos and broken rules which, due to the lousy
53							nature of the rules, create ambiguity. For a description of these rules see
54							L below.
55
56							=head1 FUNCTIONS
57
58							=head2 C<< parse_roget($filename) >>
59
60							This function, exported by default, will attempt to open, read, and parse the
61							named file as a Project Gutenberg Roget's Thesaurus. It has only been tested
62							with C, which is not included in the distribution, because it's
63							too big.
64
65							It returns a hash with the following structure:
66
67							%section = (
68							...
69							'100a' => {
70							major => 100, # major and minor form section identity
71							minor => 'a',
72							name => 'Fraction',
73							comments => [ 'Less than one' ],
74							subsections => [
75							{
76							type => 'N', # these entries are nouns
77							groups => [
78							{ entries => [
79							{ text => 'fraction' },
80							{ text => 'fractional part' }
81							] },
82							{ entries => [ { text => 'part &c. 51' } ] }
83							]
84							},
85							{
86							type => 'Adj',
87							groups => [ { entries => [ ... ] } ]
88							}
89							]
90							}
91							...
92							);
93
94							This structure isn't pretty or perfect, and is subject to change. All of its
95							elements are shown here, except for one exception, which is the next likely
96							subject for change: flags. Entries may have flags, in addition to text, which
97							note things like "French" or "archaic". Entries (or possibly groups) will also
98							gain cross reference attribues, replacing the ugly "&c. XX" text. I'd also
99							like to deal with references to other subsections, which come in the form "&c.
100							Adj." There isn't any reason for these to be needed, I think.
101
102							=cut
103
104							sub parse_roget {
105	0			0	1		my ($filename) = @_;
106	0						my %section = parse_sections($filename);
107	0						bloom_sections(\%section);
108	0						return %section;
109							}
110
111							=head2 C<< parse_sections($filename) >>
112
113							This function is used internally by C to read the named file,
114							returning the above structure, parsed only to the section level.
115
116							=cut
117
118							sub parse_sections {
119	0			0	1		my ($filename) = @_;
120
121	0	0					open my $roget, '<', $filename
122							or Carp::croak "couldn't open $filename: $!";
123
124	0						my $previous_section;
125							my %section;
126
127	0						my $peeked_line;
128	0						my ($in_newheader, $in_longcomment);
129
130	0		0				while (my $line = ($peeked_line \|\| <$roget>)) {
131	0						undef $peeked_line;
132
133	0						chomp $line;
134	0	0					next unless $line;
135	0	0					next if ($line =~ /^#/); # comment
136
137	0	0					if ($line =~ /^\s*<--/) { $in_longcomment = 1; }
	0
138	0	0					if ($line =~ /-->$/) { $in_longcomment = 0; next; }
	0
	0
139	0	0					next if $in_longcomment;
140
141	0	0					if ($line =~ /^%/) {
142	0						$in_newheader = not $in_newheader;
143	0						next;
144							}
145	0	0					next if $in_newheader;
146
147	0						$line =~ s/^\s+//;
148
149	0						until ($peeked_line) {
150	0						$peeked_line = <$roget>;
151	0	0					last unless defined $peeked_line;
152	0						chomp $peeked_line;
153	0	0	0				if ($peeked_line and $peeked_line !~ /^\s{4}/
			0
154							and $peeked_line !~ /^(?:#\|%\|<--)/)
155							{
156	0	0					$line .= q{ } unless (substr($line, -1, 1) eq q{-});
157	0						$line .= $peeked_line;
158	0						undef $peeked_line;
159	0	0					if ($line =~ /[^,]+,[^.]+\.\s{4}/) {
160	0						($line, $peeked_line) = split /\s{4}/, $line, 2;
161							}
162							}
163							}
164
165	0						my ($sec, $title, $newline) =
166							($line =~ /^#?(\d+[a-z]?). (.?)(?:--(.))?$/);
167	0	0	0				$line = ($newline\|\|'') if ($sec);
168
169	0	0					if ($sec) {
170	0						(my($comment_beginning), $title, my($comment_end)) =
171							($title =~ /(?:\[(.+?)\.?\])?\s([^.]+)\.?\s(?:\[(.+?)\.?\])?/);
172	0						$title =~ s/(^\s+\|\s{2,}\|\s+$)//g;
173	0						$section{$sec} = {
174							name => $title,
175							subsections => [ { text => $line\|\|'' } ],
176	0		0				comments => [ grep { defined $_ } ($comment_beginning, $comment_end) ]
177							};
178	0						@{$section{$sec}}{qw[major minor]} = ($sec =~ /^(\d+)(.*)$/);
	0
179	0	0					Carp::confess "couldn't parse section: $sec" unless $section{$sec}{major};
180	0						$previous_section = $sec;
181							} else {
182	0		0				$section{$previous_section}{subsections} \|\|= [];
183	0						push @{$section{$previous_section}{subsections}}, { text => $line };
	0
184							}
185							}
186	0						return %section;
187
188							}
189
190							=head2 C<< bloom_sections(\%sections) >>
191
192							Given a reference to the section hash, this subroutine expands the sections
193							into subsections, groups, and entries.
194
195							=cut
196
197							sub bloom_sections {
198	0			0	1		my ($section) = @_;
199
200	0						my $decomma = Text::CSV_XS->new;
201	0						my $desemi = Text::CSV_XS->new({sep_char => q{;}});
202
203	0						my $types = qr/(Adj\|Adv\|Int\|N\|Phr\|Pron\|V)/;
204
205	0						for (values %$section) {
206	0						my $previous_subsection;
207	0						for my $subsection (@{$_->{subsections}}) {
	0
208	0						$subsection->{text} =~ s/\.$//;
209	0						$subsection->{text} =~ s/ {2,}/ /g;
210	0						$subsection->{text} =~ s/(^\s+\|\s+$)//g;
211
212	0	0					if (my ($type) = ($subsection->{text} =~ /^$types\./)) {
		0
213	0						$subsection->{text} =~ s/^$type\.//;
214	0						$subsection->{type} = $type;
215							} elsif ($previous_subsection) {
216	0						$subsection->{type} = $previous_subsection->{type};
217							} else {
218	0						$subsection->{type} = 'UNKNOWN';
219							}
220
221	0						$desemi->parse(delete $subsection->{text});
222	0						$subsection->{groups} = [ map { { text => $_ } } $desemi->fields ];
	0
223
224	0						for my $group (@{$subsection->{groups}}) {
	0
225	0						$decomma->parse(delete $group->{text});
226	0						$group->{entries} = [ map { { text => $_, flags => [] } } $decomma->fields ];
	0
227
228	0						for (@{$group->{entries}}) {
	0
229	0		0				$_->{text}\|\|= 'UNPARSED';
230	0	0					if ($_->{text} =~ s/\[obs3\]//) {
231	0						push @{$_->{flags}}, 'archaic? (1991)';
	0
232							}
233	0	0					if ($_->{text} =~ s/\|!//) {
234	0						push @{$_->{flags}}, 'obsolete (1991)';
	0
235							}
236	0	0					if ($_->{text} =~ s/\|//) {
237	0						push @{$_->{flags}}, 'obsolete (1911)';
	0
238							}
239	0						$_->{text} =~ s/(^\s+\|\s+$)//;
240							}
241							}
242	0						$previous_subsection = $subsection;
243							}
244							}
245							}
246
247							=head1 THE FILE
248
249							=over 4
250
251							=item * The thesaurus file is plain text, in 7-bit ASCII.
252
253							=item * Lines with a C<#> as their first character are comments.
254
255							=item * Lines beginning with C<< <-- >> begin multi-line comments.
256
257							=item * Lines ending with C<< --> >> end multi-line comments.
258
259							These multi-line comments were originally used for including the page numbers
260							from the original text. Later editors used them (instead of C<#> comments) to
261							mark their editorial notes.
262
263							There exists one situation in C where the C<< <-- >> occurs
264							outside of position zero in a line.
265
266							=item * A line containing only C<%> begins or ends new "supersection" data.
267
268							So, if we wanted to begin a new supersection for "States of Mind", we might
269							have the following:
270
271							%
272							STATES OF MIND
273							%
274
275							Unfortunately, there is almost no consistency in how these supersections are
276							organized. Some entries declare new sections "SECTION VII. CHANGE" and
277							immediately begin new subsupersections, "1. SIMPLE CHANGE". Others just give
278							headings: "Present Events"
279
280							Then there is this excerpt:
281
282							%
283							CLASS II
284							WORDS RELATING TO SPACE
285
286							SECTION I. SPACE IN GENERAL
287
288							1. ABSTRACT SPACE
289							%
290
291							I think the only thing to do is ignore all this crap, so that's what is done.
292
293							=item * Lines do not exceed 79 characters in width.
294
295							=item * New lines begin in column 6.
296
297							Some, though, begin in column 5, 6, or 8. A line starting in column 4 or less
298							is always a continuation, unless it's in 0. (I'm not considering comments.)
299
300							Also, new lines might begin in any column, when a period that does not follow
301							the declaration of a subsection type is followed by at least four spaces.
302
303							=item * Lines in column 0 are continuation.
304
305							Unless the previous line ended with a period.
306
307							Continuations are appended to the continued line. Unless the continued line
308							ended with a hyphen, a space is used to join the lines.
309
310							There is at least one case where a word is split without a hyphen, appearing as
311							"crocu\ns". This word is left broken in the parsed text.
312
313							=item * New lines beginning with a C<#> (not in column 0) begin new sections.
314
315							The C<#> is omitted in one case, section 252a.
316
317							=item * The C<#> in a section heading is followed by the section identifier.
318
319							The section identifier is a series of numbers followed by an optional letter
320							and terminated by a period.
321
322							=item * The section header is followed by a name and comments.
323
324							The comments are marked off by square brackets, and may occur before or after
325							the name, or both.
326
327							The name is all the rest of the text between the comments and before a
328							terminating C<-->.
329
330							=item * Once section headers are removed, every new line is a new subsection.
331
332							=item * Subsections may begin with a type declaration.
333
334							A type declaration indicates that all entries in the subsection are of one
335							part of speech: adjective, adverb, noun, interjection, phrase, and so on.
336
337							=item * Subsections with no declared type are of the type of the previous subsection.
338
339							Despite this rule, some subsections have no type. I file them as UNKNOWN.
340
341							=item * Subsections are divided into groups by semicolons.
342
343							Semicolons don't divide subsections, if they occur within double quotes. The
344							quotes can be part of the line, and need not quote the entire group.
345
346							In other words, this is a valid subsection, consisting of three groups:
347
348							...
349							Int. go for it; "go ahead; make my day" [Dirty Harry]; give it a try,
350							take a shot.
351							...
352
353							=item * Groups are divided into entries by commas.
354
355							Commas don't divide groups, if they occur within double quotes. The quotes can
356							be just part of the entry, and need not quote the entire entry.
357
358							Some groups include the entry-breaking comma inside the quotes, like this one:
359
360							Phr. it must go no further, it will go no further; don't tell a
361							soul;"tell it not in Gath,"nobody the wiser; alitur vitium vivitque
362							tegendo[Lat][obs3]; "let it be tenable in your silence still"[Hamlet].
363
364							The comma after "Gath" should be on the outside.
365
366							=item * Flags may follow the text of an entry.
367
368							Flags are either text enclosed in square brackets, as "[Lat]" or the special
369							identifiers C<\|!> or C<\|>. These flags provide metadata about the entry, like
370							its language of origin or the domain of jargon in which it is relevant. C<\|>
371							indicates that a word was obsolete in the 1911 edition, and C<\|!> indicates
372							words that were no longer used as indicated by 1991. (A third obsolete status
373							is indicated by the flag "[obs3]" meaning "entirely archaic as of 1991.")
374
375							Sometimes, flags occur outside of quotes, and sometimes they're inside. The
376							example text above shows flags outside of the Hamlet quote, but this section
377							(which really exists) is a counter-example:
378
379							Phr. noli me tangere[Lat]; nemo me impune lacessit[Lat]; don't tread
380							on me; don't you dare; don't even think of it; "Go ahead, make my day!"
381							[Dirty Harry].
382
383							=item * Cross-references may follow the entry.
384
385							I haven't determined whether any entries have both a cross-reference I
386							flags.
387
388							A cross-reference is in the form "c&. 123" where 123 is a section, or "c&. n."
389							where "n" refers to the noun-type subsections for this section.
390
391							Unfortunately, cross-references are not always followed by commas or
392							semicolons, meaning that they sometimes seem to appear in the middle of an
393							entry, as follows:
394
395							miss, miss one's aim, miss the mark, miss one's footing, miss stays;
396							slip, trip, stumble; make a slip &c., n. blunder &c. 495, make a mess of,
397							...
398
399							The cross-reference after "make a slip" should have a comma after the "n." but
400							does not, so it appears to be the middle of an entry beginning after "stumble;"
401							and ending before "make a mess of."
402
403							=back
404
405							=head1 TODO
406
407							Well, a good first step would be a TODO section.
408
409							I'll write some tests that will only run if you put a C file in
410							the right place. I'll also try the tests with previous revisions of the file.
411
412							I'm also tempted to produce newer revisions on my own, after I contact the
413							address listed in the file. The changes would just be to eliminate anomalies
414							that prevent parsing. Distraction by shiny objects may prevent this goal.
415
416							The flags and cross reference bits above will be implemented.
417
418							The need for Text::CSV_XS may be eliminated.
419
420							Entries with internal quoting (especially common in phrases) will no longer
421							become UNPARSED.
422
423							I'll try to eliminate more UNKNOWN subsection types.
424
425							=head1 AUTHOR
426
427							Ricardo Signes, C<< >>
428
429							=head1 BUGS
430
431							Please report any bugs or feature requests to
432							C, or through the web interface at
433							L. I will be notified, and then you'll automatically be
434							notified of progress on your bug as I make changes.
435
436							=head1 COPYRIGHT
437
438							Copyright 2004 Ricardo Signes, All Rights Reserved.
439
440							This program is free software; you can redistribute it and/or modify it
441							under the same terms as Perl itself.
442
443							=cut
444
445							1;