File Coverage

blib/lib/Parse/GutenbergRoget.pm

Criterion	Covered	Total	%
statement	15	103	14.5
branch	0	40	0.0
condition	0	17	0.0
subroutine	5	8	62.5
pod	3	3	100.0
total	23	171	13.4

line	stmt	bran	cond	sub	pod	time	code
1	2			2		93951	use warnings;
	2					9
	2					59
2	2			2		8	use strict;
	2					4
	2					54
3
4							package Parse::GutenbergRoget;
5	2			2		583	use parent qw(Exporter);
	2					438
	2					8
6
7	2			2		73	use Carp ();
	2					4
	2					22
8	2			2		1271	use Text::CSV_XS;
	2					27110
	2					1855
9
10							our @EXPORT = qw(parse_roget); ## no critic Export
11
12							=head1 NAME
13
14							Parse::GutenbergRoget - parse Project Gutenberg's Roget's Thesaurus
15
16							=head1 VERSION
17
18							version 0.023
19
20							=cut
21
22							our $VERSION = '0.023';
23
24							=head1 SYNOPSIS
25
26							use Parse::GutenbergRoget
27
28							my %section = parse_roget("./roget15a.txt");
29
30							print $section{1}[0][0]{text}; # existence
31
32							=head1 DESCRIPTION
33
34							A Roget's Thesaurus is more than the simple synonym/antonym finder included in
35							many dictionary sets. It organizes words into semantically realted categories,
36							so that words with related meanings can be found in proximity to one another,
37							with the level of proximity indicating the level of similarity.
38
39							Project Gutenberg has produced an etext of the 1911 edition of Roget's
40							Thesaurus, and later began to revise it, in 1991. While it's not the best
41							Roget-style thesaurus available, it's the best public domain electronic
42							thesaurus datasource I've found.
43
44							This module parses the file's contents into a Perl data structure, which can
45							then be stored in systems for searching and browsing it. This module does
46							I implement those systems.
47
48							The code is not complete. This means that everything that can be parsed is not
49							yet being parsed. It's important to realize that not everything is going to be
50							parseable. There are too many typos and broken rules which, due to the lousy
51							nature of the rules, create ambiguity. For a description of these rules see
52							L below.
53
54							=head1 FUNCTIONS
55
56							=head2 C<< parse_roget($filename) >>
57
58							This function, exported by default, will attempt to open, read, and parse the
59							named file as a Project Gutenberg Roget's Thesaurus. It has only been tested
60							with C, which is not included in the distribution, because it's
61							too big.
62
63							It returns a hash with the following structure:
64
65							%section = (
66							...
67							'100a' => {
68							major => 100, # major and minor form section identity
69							minor => 'a',
70							name => 'Fraction',
71							comments => [ 'Less than one' ],
72							subsections => [
73							{
74							type => 'N', # these entries are nouns
75							groups => [
76							{ entries => [
77							{ text => 'fraction' },
78							{ text => 'fractional part' }
79							] },
80							{ entries => [ { text => 'part &c. 51' } ] }
81							]
82							},
83							{
84							type => 'Adj',
85							groups => [ { entries => [ ... ] } ]
86							}
87							]
88							}
89							...
90							);
91
92							This structure isn't pretty or perfect, and is subject to change. All of its
93							elements are shown here, except for one exception, which is the next likely
94							subject for change: flags. Entries may have flags, in addition to text, which
95							note things like "French" or "archaic". Entries (or possibly groups) will also
96							gain cross reference attribues, replacing the ugly "&c. XX" text. I'd also
97							like to deal with references to other subsections, which come in the form "&c.
98							Adj." There isn't any reason for these to be needed, I think.
99
100							=cut
101
102							sub parse_roget {
103	0			0	1		my ($filename) = @_;
104	0						my %section = parse_sections($filename);
105	0						bloom_sections(\%section);
106	0						return %section;
107							}
108
109							=head2 C<< parse_sections($filename) >>
110
111							This function is used internally by C to read the named file,
112							returning the above structure, parsed only to the section level.
113
114							=cut
115
116							sub parse_sections {
117	0			0	1		my ($filename) = @_;
118
119	0	0					open my $roget, '<', $filename
120							or Carp::croak "couldn't open $filename: $!";
121
122	0						my $previous_section;
123							my %section;
124
125	0						my $peeked_line;
126	0						my ($in_newheader, $in_longcomment);
127
128	0		0				while (my $line = ($peeked_line \|\| <$roget>)) {
129	0						undef $peeked_line;
130
131	0						chomp $line;
132	0	0					next unless $line;
133	0	0					next if ($line =~ /^#/); # comment
134
135	0	0					if ($line =~ /^\s*<--/) { $in_longcomment = 1; }
	0
136	0	0					if ($line =~ /-->$/) { $in_longcomment = 0; next; }
	0
	0
137	0	0					next if $in_longcomment;
138
139	0	0					if ($line =~ /^%/) {
140	0						$in_newheader = not $in_newheader;
141	0						next;
142							}
143	0	0					next if $in_newheader;
144
145	0						$line =~ s/^\s+//;
146
147	0						until ($peeked_line) {
148	0						$peeked_line = <$roget>;
149	0	0					last unless defined $peeked_line;
150	0						chomp $peeked_line;
151	0	0	0				if ($peeked_line and $peeked_line !~ /^\s{4}/
			0
152							and $peeked_line !~ /^(?:#\|%\|<--)/)
153							{
154	0	0					$line .= q{ } unless (substr($line, -1, 1) eq q{-});
155	0						$line .= $peeked_line;
156	0						undef $peeked_line;
157	0	0					if ($line =~ /[^,]+,[^.]+\.\s{4}/) {
158	0						($line, $peeked_line) = split /\s{4}/, $line, 2;
159							}
160							}
161							}
162
163	0						my ($sec, $title, $newline) =
164							($line =~ /^#?(\d+[a-z]?). (.?)(?:--(.))?$/);
165	0	0	0				$line = ($newline\|\|'') if ($sec);
166
167	0	0					if ($sec) {
168	0						(my($comment_beginning), $title, my($comment_end)) =
169							($title =~ /(?:\[(.+?)\.?\])?\s([^.]+)\.?\s(?:\[(.+?)\.?\])?/);
170	0						$title =~ s/(^\s+\|\s{2,}\|\s+$)//g;
171							$section{$sec} = {
172							name => $title,
173							subsections => [ { text => $line\|\|'' } ],
174	0		0				comments => [ grep { defined $_ } ($comment_beginning, $comment_end) ]
	0
175							};
176	0						@{$section{$sec}}{qw[major minor]} = ($sec =~ /^(\d+)(.*)$/);
	0
177	0	0					Carp::confess "couldn't parse section: $sec" unless $section{$sec}{major};
178	0						$previous_section = $sec;
179							} else {
180	0		0				$section{$previous_section}{subsections} \|\|= [];
181	0						push @{$section{$previous_section}{subsections}}, { text => $line };
	0
182							}
183							}
184	0						return %section;
185
186							}
187
188							=head2 C<< bloom_sections(\%sections) >>
189
190							Given a reference to the section hash, this subroutine expands the sections
191							into subsections, groups, and entries.
192
193							=cut
194
195							sub bloom_sections {
196	0			0	1		my ($section) = @_;
197
198	0						my $decomma = Text::CSV_XS->new;
199	0						my $desemi = Text::CSV_XS->new({sep_char => q{;}});
200
201	0						my $types = qr/(Adj\|Adv\|Int\|N\|Phr\|Pron\|V)/;
202
203	0						for (values %$section) {
204	0						my $previous_subsection;
205	0						for my $subsection (@{$_->{subsections}}) {
	0
206	0						$subsection->{text} =~ s/\.$//;
207	0						$subsection->{text} =~ s/ {2,}/ /g;
208	0						$subsection->{text} =~ s/(^\s+\|\s+$)//g;
209
210	0	0					if (my ($type) = ($subsection->{text} =~ /^$types\./)) {
		0
211	0						$subsection->{text} =~ s/^$type\.//;
212	0						$subsection->{type} = $type;
213							} elsif ($previous_subsection) {
214	0						$subsection->{type} = $previous_subsection->{type};
215							} else {
216	0						$subsection->{type} = 'UNKNOWN';
217							}
218
219	0						$desemi->parse(delete $subsection->{text});
220	0						$subsection->{groups} = [ map { { text => $_ } } $desemi->fields ];
	0
221
222	0						for my $group (@{$subsection->{groups}}) {
	0
223	0						$decomma->parse(delete $group->{text});
224	0						$group->{entries} = [ map { { text => $_, flags => [] } } $decomma->fields ];
	0
225
226	0						for (@{$group->{entries}}) {
	0
227	0		0				$_->{text}\|\|= 'UNPARSED';
228	0	0					if ($_->{text} =~ s/\[obs3\]//) {
229	0						push @{$_->{flags}}, 'archaic? (1991)';
	0
230							}
231	0	0					if ($_->{text} =~ s/\|!//) {
232	0						push @{$_->{flags}}, 'obsolete (1991)';
	0
233							}
234	0	0					if ($_->{text} =~ s/\|//) {
235	0						push @{$_->{flags}}, 'obsolete (1911)';
	0
236							}
237	0						$_->{text} =~ s/(^\s+\|\s+$)//;
238							}
239							}
240	0						$previous_subsection = $subsection;
241							}
242							}
243							}
244
245							=head1 THE FILE
246
247							=over 4
248
249							=item * The thesaurus file is plain text, in 7-bit ASCII.
250
251							=item * Lines with a C<#> as their first character are comments.
252
253							=item * Lines beginning with C<< <-- >> begin multi-line comments.
254
255							=item * Lines ending with C<< --> >> end multi-line comments.
256
257							These multi-line comments were originally used for including the page numbers
258							from the original text. Later editors used them (instead of C<#> comments) to
259							mark their editorial notes.
260
261							There exists one situation in C where the C<< <-- >> occurs
262							outside of position zero in a line.
263
264							=item * A line containing only C<%> begins or ends new "supersection" data.
265
266							So, if we wanted to begin a new supersection for "States of Mind", we might
267							have the following:
268
269							%
270							STATES OF MIND
271							%
272
273							Unfortunately, there is almost no consistency in how these supersections are
274							organized. Some entries declare new sections "SECTION VII. CHANGE" and
275							immediately begin new subsupersections, "1. SIMPLE CHANGE". Others just give
276							headings: "Present Events"
277
278							Then there is this excerpt:
279
280							%
281							CLASS II
282							WORDS RELATING TO SPACE
283
284							SECTION I. SPACE IN GENERAL
285
286							1. ABSTRACT SPACE
287							%
288
289							I think the only thing to do is ignore all this crap, so that's what is done.
290
291							=item * Lines do not exceed 79 characters in width.
292
293							=item * New lines begin in column 6.
294
295							Some, though, begin in column 5, 6, or 8. A line starting in column 4 or less
296							is always a continuation, unless it's in 0. (I'm not considering comments.)
297
298							Also, new lines might begin in any column, when a period that does not follow
299							the declaration of a subsection type is followed by at least four spaces.
300
301							=item * Lines in column 0 are continuation.
302
303							Unless the previous line ended with a period.
304
305							Continuations are appended to the continued line. Unless the continued line
306							ended with a hyphen, a space is used to join the lines.
307
308							There is at least one case where a word is split without a hyphen, appearing as
309							"crocu\ns". This word is left broken in the parsed text.
310
311							=item * New lines beginning with a C<#> (not in column 0) begin new sections.
312
313							The C<#> is omitted in one case, section 252a.
314
315							=item * The C<#> in a section heading is followed by the section identifier.
316
317							The section identifier is a series of numbers followed by an optional letter
318							and terminated by a period.
319
320							=item * The section header is followed by a name and comments.
321
322							The comments are marked off by square brackets, and may occur before or after
323							the name, or both.
324
325							The name is all the rest of the text between the comments and before a
326							terminating C<-->.
327
328							=item * Once section headers are removed, every new line is a new subsection.
329
330							=item * Subsections may begin with a type declaration.
331
332							A type declaration indicates that all entries in the subsection are of one
333							part of speech: adjective, adverb, noun, interjection, phrase, and so on.
334
335							=item * Subsections with no declared type are of the type of the previous subsection.
336
337							Despite this rule, some subsections have no type. I file them as UNKNOWN.
338
339							=item * Subsections are divided into groups by semicolons.
340
341							Semicolons don't divide subsections, if they occur within double quotes. The
342							quotes can be part of the line, and need not quote the entire group.
343
344							In other words, this is a valid subsection, consisting of three groups:
345
346							...
347							Int. go for it; "go ahead; make my day" [Dirty Harry]; give it a try,
348							take a shot.
349							...
350
351							=item * Groups are divided into entries by commas.
352
353							Commas don't divide groups, if they occur within double quotes. The quotes can
354							be just part of the entry, and need not quote the entire entry.
355
356							Some groups include the entry-breaking comma inside the quotes, like this one:
357
358							Phr. it must go no further, it will go no further; don't tell a
359							soul;"tell it not in Gath,"nobody the wiser; alitur vitium vivitque
360							tegendo[Lat][obs3]; "let it be tenable in your silence still"[Hamlet].
361
362							The comma after "Gath" should be on the outside.
363
364							=item * Flags may follow the text of an entry.
365
366							Flags are either text enclosed in square brackets, as "[Lat]" or the special
367							identifiers C<\|!> or C<\|>. These flags provide metadata about the entry, like
368							its language of origin or the domain of jargon in which it is relevant. C<\|>
369							indicates that a word was obsolete in the 1911 edition, and C<\|!> indicates
370							words that were no longer used as indicated by 1991. (A third obsolete status
371							is indicated by the flag "[obs3]" meaning "entirely archaic as of 1991.")
372
373							Sometimes, flags occur outside of quotes, and sometimes they're inside. The
374							example text above shows flags outside of the Hamlet quote, but this section
375							(which really exists) is a counter-example:
376
377							Phr. noli me tangere[Lat]; nemo me impune lacessit[Lat]; don't tread
378							on me; don't you dare; don't even think of it; "Go ahead, make my day!"
379							[Dirty Harry].
380
381							=item * Cross-references may follow the entry.
382
383							I haven't determined whether any entries have both a cross-reference I
384							flags.
385
386							A cross-reference is in the form "c&. 123" where 123 is a section, or "c&. n."
387							where "n" refers to the noun-type subsections for this section.
388
389							Unfortunately, cross-references are not always followed by commas or
390							semicolons, meaning that they sometimes seem to appear in the middle of an
391							entry, as follows:
392
393							miss, miss one's aim, miss the mark, miss one's footing, miss stays;
394							slip, trip, stumble; make a slip &c., n. blunder &c. 495, make a mess of,
395							...
396
397							The cross-reference after "make a slip" should have a comma after the "n." but
398							does not, so it appears to be the middle of an entry beginning after "stumble;"
399							and ending before "make a mess of."
400
401							=back
402
403							=head1 TODO
404
405							Well, a good first step would be a TODO section.
406
407							I'll write some tests that will only run if you put a C file in
408							the right place. I'll also try the tests with previous revisions of the file.
409
410							I'm also tempted to produce newer revisions on my own, after I contact the
411							address listed in the file. The changes would just be to eliminate anomalies
412							that prevent parsing. Distraction by shiny objects may prevent this goal.
413
414							The flags and cross reference bits above will be implemented.
415
416							The need for Text::CSV_XS may be eliminated.
417
418							Entries with internal quoting (especially common in phrases) will no longer
419							become UNPARSED.
420
421							I'll try to eliminate more UNKNOWN subsection types.
422
423							=head1 AUTHOR
424
425							Ricardo Signes, C<< >>
426
427							=head1 BUGS
428
429							This software has been abandoned. Any bugs, you're on your own. Or look to
430							take it over.
431
432							=head1 COPYRIGHT
433
434							Copyright 2004 Ricardo Signes, All Rights Reserved.
435
436							This program is free software; you can redistribute it and/or modify it
437							under the same terms as Perl itself.
438
439							=cut
440
441							1;