File Coverage

lib/Parse/File/Taxonomy.pm

Criterion	Covered	Total	%
statement	33	33	100.0
branch	4	4	100.0
condition			n/a
subroutine	8	8	100.0
pod	0	4	0.0
total	45	49	91.8

line	stmt	bran	sub	pod	time	code
1						package Parse::File::Taxonomy;
2	7		7		2587	use strict;
	7				12
	7				217
3	7		7		28	use Carp;
	7				8
	7				422
4	7		7		4148	use Text::CSV;
	7				86856
	7				44
5	7		7		311	use Scalar::Util qw( reftype );
	7				12
	7				2718
6						our $VERSION = '0.04';
7						#use Data::Dump;
8
9						=head1 NAME
10
11						Parse::File::Taxonomy - Validate a file for use as a taxonomy
12
13						=head1 VERSION
14
15						This document refers to version 0.04 of Parse::File::Taxonomy. This version was
16						released May 30 2015.
17
18						=head1 SYNOPSIS
19
20						use Parse::File::Taxonomy;
21
22						=head1 DESCRIPTION
23
24						This module is the base class for the Parse-File-Taxonomy extension to the
25						Perl 5 programming language. You will not instantiate objects of this class;
26						rather, you will instantiate objects of subclasses, of which
27						Parse::File::Taxonomy::Path will be the first.
28
29						B
30
31						=head2 Taxonomy: definition
32
33						For the purpose of this library, a B<taxonomy> is defined as a tree-like data
34						structure with a root node, zero or more branch (child) nodes, and one or more
35						leaf nodes. The root node and each branch node must have at least one child
36						node, but leaf nodes have no child nodes. The number of branches
37						between a leaf node and the root node is variable.
38
39						B<Diagram 1:>
40
41						                                   Root
42						                                    |
43						           ----------------------------------------------------
44						           |                            |            |        |
45						        Branch                       Branch       Branch    Leaf
46						           |                            |            |
47						  -------------------------        ------------      |
48						  |                       |        |          |      |
49						Branch                  Branch    Leaf       Leaf  Branch
50						  |                       |                          |
51						  |                  ------------                    |
52						  |                  |          |                    |
53						 Leaf               Leaf       Leaf                 Leaf
54
55						=head2 Taxonomy File: definition
56
57						For the purpose of this module, a B<taxonomy file> is a CSV file in which (a)
58						certain columns hold data from which the position of each record within the
59						taxonomy can be derived; and (b) each node in the tree (other than the root
60						node) is uniquely represented by a record within the file.
61
62						=head3 CSV
63
64						B<"CSV">, strictly speaking, refers to B<comma-separated values>:
65
66						path,nationality,gender,age,income,id_no
67
68						For the purpose of this module, however, the column separators in a taxonomy file
69						may be any user-specified character handled by the F<Text::CSV> library.
70						Formats frequently observed are B<tab-separated values>:
71
72						path nationality gender age income id_no
73
74						and B<pipe-separated values>:
75
76						path|nationality|gender|age|income|id_no
77
78						The documentation for F<Text::CSV> comments that the CSV format could
79						I<"perhaps better [be] called ASV (anything separated values)">, but we shall for
80						convenience use "CSV" herein regardless of the specific delimiter.
81
82						Since it is often the case that the characters used as column separators may
83						occur within the data recorded in the columns as well, it is customary to
84						quote either all columns:
85
86						"path","nationality","gender","age","income","id_no"
87
88						... or, at the very least, all columns which can hold
89						data other than pure integers or floating-point numbers:
90
91						"path","nationality","gender",age,income,id_no
92
93						=head3 Tree structure
94
95						To qualify as a taxonomy file, it is not sufficient for a file to be in CSV
96						format. In each non-header record in that file, there must be one or more
97						columns which hold data capable of exactly specifying the record's position in
98						the taxonomy, I<i.e.,> the route or B<path> from the root node to the node
99						being represented by that record.
100
101						The precise way in which certain columns are used to determine the path from
102						the root node to a given node is what differentiates various types of taxonomy
103						files from one another. In Parse-File-Taxonomy we identify two different
104						flavors of taxonomy files and provide a class for the construction of each.
105
106						=head3 Taxonomy-by-path
107
108						A B<taxonomy-by-path> is one in which a single column -- which we will refer
109						to as the B<path column> -- will represent the path from the root to the given
110						record as a series of strings joined by separator characters.
111						Within that path column the value corresponding to the root node need
112						not be specified, I<i.e.,> may be represented by an empty string.
113
114						Let's rewrite Diagram 1 with values to make this clear.
115
116						B<Diagram 2:>
117
118						                                    ""
119						                                    |
120						           ----------------------------------------------------
121						           |                            |            |        |
122						         Alpha                         Beta        Gamma    Delta
123						           |                            |            |
124						  -------------------------        ------------      |
125						  |                       |        |          |      |
126						Epsilon                  Zeta     Eta       Theta   Iota
127						  |                       |                          |
128						  |                  ------------                    |
129						  |                  |          |                    |
130						Kappa              Lambda       Mu                   Nu
131
132						Let us suppose that our taxonomy file held comma-separated, quoted records.
133						Let us further suppose that the column holding taxonomy paths was, not
134						surprisingly, called C<path> and that the separator within the C<path> column
135						was a pipe (C<|>) character. Let us further suppose that for now we are not
136						concerned with the data in any columns other than C<path> so that, for purpose
137						of illustration, they will hold empty (albeit quoted) strings.
138
139						Then the taxonomy file describing the tree in Diagram 2 would look like this:
140
141						"path","nationality","gender","age","income","id_no"
142						"|Alpha","","","","",""
143						"|Alpha|Epsilon","","","","",""
144						"|Alpha|Epsilon|Kappa","","","","",""
145						"|Alpha|Zeta","","","","",""
146						"|Alpha|Zeta|Lambda","","","","",""
147						"|Alpha|Zeta|Mu","","","","",""
148						"|Beta","","","","",""
149						"|Beta|Eta","","","","",""
150						"|Beta|Theta","","","","",""
151						"|Gamma","","","","",""
152						"|Gamma|Iota","","","","",""
153						"|Gamma|Iota|Nu","","","","",""
154						"|Delta","","","","",""
155
156						Note that while in the C<|Gamma> branch we ultimately have only one leaf node,
157						C<|Gamma|Iota|Nu>, we require separate records in the taxonomy file for
158						C<|Gamma> and C<|Gamma|Iota>. To put this another way, the existence of a
159						C<|Gamma|Iota|Nu> leaf must not be assumed to "auto-vivify" C<|Gamma> and
160						C<|Gamma|Iota> nodes. Each non-root node must be explicitly represented in
161						the taxonomy file for the file to be considered valid.
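
The "no auto-vivification" rule above can be sketched as a simple check. This is only an illustration under stated assumptions: C<paths_missing_parents> is a hypothetical helper, not a function provided by this module, and it assumes the path values have already been extracted from the file as plain strings.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Parse::File::Taxonomy): given a reference
# to a list of path strings and the separator used inside them, return a
# reference to the list of paths whose parent node has no record of its own.
sub paths_missing_parents {
    my ($paths, $sep) = @_;
    my %seen = map { $_ => 1 } @{$paths};
    my @orphans;
    for my $path (@{$paths}) {
        # Split on the literal separator; the leading empty string is the root.
        my @parts = split /\Q$sep\E/, $path, -1;
        pop @parts;               # drop the node's own name
        next if @parts <= 1;      # parent is the root, which needs no record
        my $parent = join $sep, @parts;
        push @orphans, $path unless $seen{$parent};
    }
    return \@orphans;
}

my @valid   = ('|Gamma', '|Gamma|Iota', '|Gamma|Iota|Nu');
my @invalid = ('|Gamma', '|Gamma|Iota|Nu');    # no record for |Gamma|Iota

print scalar @{ paths_missing_parents(\@valid, '|') }, "\n";       # 0
print join(',', @{ paths_missing_parents(\@invalid, '|') }), "\n"; # |Gamma|Iota|Nu
```

A file for which this list is non-empty would not be a valid taxonomy-by-path in the sense described here.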
162
163						Note further that there is no restriction on the values of the B<components> of
164						the C<path> across records. It is only the B<full> path that must be unique.
165						Let us illustrate that by modifying the data in Diagram 2:
166
167						B<Diagram 3:>
168
169						                                    ""
170						                                    |
171						           ----------------------------------------------------
172						           |                            |            |        |
173						         Alpha                         Beta        Gamma    Delta
174						           |                            |            |
175						  -------------------------        ------------      |
176						  |                       |        |          |      |
177						Epsilon                  Zeta     Eta       Theta   Iota
178						  |                       |                          |
179						  |                  ------------                    |
180						  |                  |          |                    |
181						Kappa              Lambda       Mu                 Delta
182
183						Here we have two leaf nodes each named C<Delta>. However, we follow different
184						paths from the root node to get to each of them. The taxonomy file
185						representing this tree would look like this:
186
187						"path","nationality","gender","age","income","id_no"
188						"|Alpha","","","","",""
189						"|Alpha|Epsilon","","","","",""
190						"|Alpha|Epsilon|Kappa","","","","",""
191						"|Alpha|Zeta","","","","",""
192						"|Alpha|Zeta|Lambda","","","","",""
193						"|Alpha|Zeta|Mu","","","","",""
194						"|Beta","","","","",""
195						"|Beta|Eta","","","","",""
196						"|Beta|Theta","","","","",""
197						"|Gamma","","","","",""
198						"|Gamma|Iota","","","","",""
199						"|Gamma|Iota|Delta","","","","",""
200						"|Delta","","","","",""
201
202						=head3 Taxonomy-by-index
203
204						A B<taxonomy-by-index> is one in which each record has a
205						column with a unique identifier (B<id>) and another column holding the unique
206						identifier of the record representing the next higher node in the hierarchy
207						(B<parent_id>). The record must also have a column which holds a datum that is
208						unique among all records having the same parent node.
209
210						Let's make this clearer by rewriting the taxonomy-by-path above for Diagram 3
211						as a taxonomy-by-index.
212
213						"id","parent_id","name","nationality","gender","age","income","id_no"
214						1,,"Alpha","","","","",""
215						2,1,"Epsilon","","","","",""
216						3,2,"Kappa","","","","",""
217						4,1,"Zeta","","","","",""
218						5,4,"Lambda","","","","",""
219						6,4,"Mu","","","","",""
220						7,,"Beta","","","","",""
221						8,7,"Eta","","","","",""
222						9,7,"Theta","","","","",""
223						10,,"Gamma","","","","",""
224						11,10,"Iota","","","","",""
225						12,11,"Delta","","","","",""
226						13,,"Delta","","","","",""
227
228						In the above taxonomy-by-index, the records with C<id>s C<1>, C<7>, C<10>, and
229						C<13> are top-level nodes. They have no parents, so the value of their
230						C<parent_id> column is null or, in Perl terms, an empty string. The records
231						with C<id>s C<2> and C<4> are children of the record with C<id> of C<1>. The
232						record with C<id> C<3> is, in turn, a child of the record with C<id> C<2>.
233
234						In the above taxonomy-by-index, close inspection will show that no two records
235						with the same C<parent_id> share the same C<name>. The property of
236						B<unique sibling names> means that we can construct a non-indexed
237						version of the path from the root to a given node by using the C<parent_id>
238						column in a given record to look up the C<name> of the record with the C<id>
239						value identical to the child's C<parent_id>.
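
The sibling-uniqueness property can be checked mechanically. The following is a minimal sketch, assuming each record has been reduced to an C<[ id, parent_id, name ]> triple; C<duplicate_siblings> is a hypothetical helper and is not part of this module.

```perl
use strict;
use warnings;

# Hypothetical helper: each record is [ id, parent_id, name ].
# Returns a reference to the list of ids whose (parent_id, name) pair
# repeats a pair already seen in an earlier record.
sub duplicate_siblings {
    my ($records) = @_;
    my (%count, @dupes);
    for my $rec (@{$records}) {
        my ($id, $parent_id, $name) = @{$rec};
        # Top-level records have an empty (or undefined) parent_id.
        my $key = join "\0", (defined $parent_id ? $parent_id : ''), $name;
        push @dupes, $id if $count{$key}++;
    }
    return \@dupes;
}

my @records = (
    [ 10, '', 'Gamma' ],
    [ 11, 10, 'Iota'  ],
    [ 12, 11, 'Delta' ],
    [ 13, '', 'Delta' ],    # same name as id 12, but a different parent
);
print scalar @{ duplicate_siblings(\@records) }, "\n";    # 0
```

The two C<Delta> records pass because the check keys on the pair of parent and name, not on the name alone.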
240
241						Via index: 3 2 1
242
243						Via name: Kappa Epsilon Alpha
244
245						We go from C<id> C<3> to its C<parent_id>, C<2>, then to C<2>'s C<parent_id>, C<1>.
246						Putting C<name>s to this, we go from C<Kappa> to C<Epsilon> to C<Alpha>.
247
248						Now, reverse the order of those C<name>s, throw a pipe delimiter before each
249						of them and join them into a single string, and you get:
250
251						|Alpha|Epsilon|Kappa
252
253						... which is the value of the C<path> column in the third record in the
254						taxonomy-by-path displayed previously.
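
The walk from a record up to the root can be sketched as follows. This is an illustration only: C<path_for_id> and the C<%by_id> lookup table are hypothetical names, and the sketch assumes the records have already been indexed by their C<id>.

```perl
use strict;
use warnings;

# Hypothetical helper: $by_id maps id => { parent_id => ..., name => ... }.
# Follows parent_id links up to a top-level record and returns the
# corresponding taxonomy-by-path string.
sub path_for_id {
    my ($by_id, $id, $sep) = @_;
    my (@names, %visited);
    while (defined $id && length $id) {
        die "cycle detected at id $id" if $visited{$id}++;
        my $rec = $by_id->{$id} or die "no record with id $id";
        unshift @names, $rec->{name};    # prepend, so the root end comes first
        $id = $rec->{parent_id};
    }
    return join '', map { $sep . $_ } @names;
}

my %by_id = (
    1 => { parent_id => '', name => 'Alpha'   },
    2 => { parent_id => 1,  name => 'Epsilon' },
    3 => { parent_id => 2,  name => 'Kappa'   },
);
print path_for_id(\%by_id, 3, '|'), "\n";    # |Alpha|Epsilon|Kappa
```

Prepending each name with C<unshift> performs the "reverse the order" step as the walk proceeds.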
255
256						With correct data, a given hierarchy of data can therefore be represented
257						either by a taxonomy-by-path or by a taxonomy-by-index. We would therefore
258						describe these two taxonomies as B<equivalent> to each other.
259
260						=head2 Taxonomy Validation
261
262						Each C<Parse::File::Taxonomy> subclass will have a constructor, C<new()>,
263						which will probe a taxonomy file
264						provided to it as an argument to determine whether it can be considered a
265						valid taxonomy according to the description provided above. The arguments
266						needed for such a constructor will be found in the documentation of the
267						subclass.
268
269						The constructor of a C<Parse::File::Taxonomy> subclass may, if desired, accept
270						a different set of arguments. Suppose you have already read a CSV file and
271						parsed it into one array reference holding its header row -- a list of its
272						columns -- and a second array reference, this one being an array of arrays
273						where each element holds the data in one record in the CSV file. You have the
274						same components needed to validate the taxonomy that you would get by
275						parsing the CSV file.
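
As a rough illustration of those two array references, the sketch below builds them by hand and combines them. The variable names are chosen for this example only, and C<List::Util::first> is used here as one possible way to find a column's position; it is not necessarily how the module does it.

```perl
use strict;
use warnings;
use List::Util qw( first );

# Header row: one array reference listing the columns.
my $fields = [ qw( path nationality gender age income id_no ) ];

# Data records: an array of arrays, one inner array per record.
my $data_records = [
    [ '|Alpha',         '', '', '', '', '' ],
    [ '|Alpha|Epsilon', '', '', '', '', '' ],
];

# Header row first, then each data record, as a single array of arrays.
my @all_rows = ( $fields, @{$data_records} );

# Zero-based position of a named column in the header row (undef if absent).
my $idx = first { $fields->[$_] eq 'gender' } 0 .. $#{$fields};

print scalar(@all_rows), " rows; 'gender' is column $idx\n";
# prints: 3 rows; 'gender' is column 2
```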
276
277						=cut
278
279						sub fields {
280	41		41	0	9340	    my $self = shift;
281	41				154	    return $self->{fields};
282						}
283
284						sub data_records {
285	40		40	0	934	    my $self = shift;
286	40				112	    return $self->{data_records};
287						}
288
289						sub fields_and_data_records {
290	9		9	0	2540	    my $self = shift;
291	9				26	    my @all_rows = $self->fields;
292	9				16	    for my $row (@{$self->data_records}) {
	9				23
293	117				121	        push @all_rows, $row;
294						    }
295	9				40	    return \@all_rows;
296						}
297
298						sub get_field_position {
299	4		4	0	512	    my ($self, $f) = @_;
300	4				10	    my $fields = $self->fields;
301	4				37	    my $idx;
302	4				7	    for (my $i=0; $i<=$#{$fields}; $i++) {
	26				43
303	24	100			45	        if ($fields->[$i] eq $f) {
304	2				3	            $idx = $i;
305	2				4	            last;
306						        }
307						    }
308	4	100			12	    if (defined($idx)) {
309	2				13	        return $idx;
310						    }
311						    else {
312	2				253	        croak "'$f' not a field in this taxonomy";
313						    }
314						}
315
316						1;
317
318						=head1 BUGS
319
320						There are no bug reports outstanding on Parse::File::Taxonomy as of the most recent
321						CPAN upload date of this distribution.
322
323						=head1 SUPPORT
324
325						Please report any bugs by mail to C
326						or through the web interface at L.
327
328						=head1 AUTHOR
329
330						James E. Keenan (jkeenan@cpan.org). When sending correspondence, please
331						include 'Parse::File::Taxonomy' or 'Parse-File-Taxonomy' in your subject line.
332
333						Creation date: May 24 2015. Last modification date: June 17 2015.
334
335						Development repository: L
336
337						=head1 COPYRIGHT
338
339						Copyright (c) 2002-15 James E. Keenan. United States. All rights reserved.
340						This is free software and may be distributed under the same terms as Perl
341						itself.
342
343						=head1 DISCLAIMER OF WARRANTY
344
345						BECAUSE THIS SOFTWARE IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY
346						FOR THE SOFTWARE, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN
347						OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES
348						PROVIDE THE SOFTWARE ''AS IS'' WITHOUT WARRANTY OF ANY KIND, EITHER
349						EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
350						WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE
351						ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE SOFTWARE IS WITH
352						YOU. SHOULD THE SOFTWARE PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL
353						NECESSARY SERVICING, REPAIR, OR CORRECTION.
354
355						IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
356						WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR
357						REDISTRIBUTE THE SOFTWARE AS PERMITTED BY THE ABOVE LICENCE, BE
358						LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL,
359						OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE
360						THE SOFTWARE (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING
361						RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A
362						FAILURE OF THE SOFTWARE TO OPERATE WITH ANY OTHER SOFTWARE), EVEN IF
363						SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF
364						SUCH DAMAGES.
365
366						=cut
367
368						# vim: formatoptions=crqot