File Coverage

lib/Parse/Taxonomy.pm
Criterion Covered Total %
statement 30 30 100.0
branch 4 4 100.0
condition n/a
subroutine 7 7 100.0
pod 0 4 0.0
total 41 45 91.1


line stmt bran cond sub pod time code
1             package Parse::Taxonomy;
2 8     8   2315 use strict;
  8         10  
  8         198  
3 8     8   26 use Carp;
  8         8  
  8         503  
4 8     8   28 use Scalar::Util qw( reftype );
  8         7  
  8         2460  
5             our $VERSION = '0.24';
6              
7             =head1 NAME
8              
9             Parse::Taxonomy - Validate hierarchical data stored in CSV format
10              
11             =head1 VERSION
12              
13             This document refers to version 0.24 of Parse::Taxonomy. This version was
14             released April 09 2016.
15              
16             =head1 SYNOPSIS
17              
18             use Parse::Taxonomy;
19              
20             =head1 DESCRIPTION
21              
22             This module is the base class for the Parse-Taxonomy extension to the
23             Perl 5 programming language. You will not instantiate objects of this class;
24             rather, you will instantiate objects of subclasses, of which
25             Parse::Taxonomy::MaterializedPath and Parse::Taxonomy::AdjacentList are the first.
26              
27             B The documented interfaces are expected to remain
28             stable but are not guaranteed to remain so.
29              
30             =head2 Taxonomy: definition
31              
32             For the purpose of this library, a B<taxonomy> is defined as a tree-like data
33             structure with a root node, zero or more branch (child) nodes, and one or more
34             leaf nodes. The root node and each branch node must have at least one child
35             node, but leaf nodes have no child nodes. The number of branches
36             between a leaf node and the root node is variable.
37              
38             B<Diagram 1>
39              
40             Root
41             |
42             ----------------------------------------------------
43             | | | |
44             Branch Branch Branch Leaf
45             | | |
46             ------------------------- ------------ |
47             | | | | |
48             Branch Branch Leaf Leaf Branch
49             | | |
50             | ------------ |
51             | | | |
52             Leaf Leaf Leaf Leaf
53              
54             =head2 Taxonomy File: definition
55              
56             For the purpose of this module, a B<taxonomy file> is a CSV file in which (a)
57             certain columns hold data from which the position of each record within the
58             taxonomy can be derived; and (b) each node in the tree (with the possible
59             exception of the root node) is uniquely represented by a record within the
60             file.
61              
62             =head3 CSV
63              
64             B<"CSV">, strictly speaking, refers to B<comma-separated values>:
65              
66             path,nationality,gender,age,income,id_no
67              
68             For the purpose of this module, however, the column separators in a taxonomy
69             file may be any user-specified character handled by the
70             L<Text::CSV> library on CPAN. Formats
71             frequently observed are B<tab-separated values>:
72              
73             path nationality gender age income id_no
74              
75             and B<pipe-separated values>:
76              
77             path|nationality|gender|age|income|id_no
78              
79             The documentation for F<Text::CSV> comments that the CSV format could I<"...
80             perhaps better [be] called ASV (anything separated values)">, but we shall for
81             convenience use "CSV" herein regardless of the specific delimiter.
82              
83             Since it is often the case that the characters used as column separators may
84             occur within the data recorded in the columns as well, it is customary to
85             quote either all columns:
86              
87             "path","nationality","gender","age","income","id_no"
88              
89             ... or, at the very least, all columns which can hold
90             data other than pure integers or floating-point numbers:
91              
92             "path","nationality","gender",age,income,id_no
93              
94             =head3 Tree structure
95              
96             To qualify as a taxonomy file, it is not sufficient for a file to be in CSV
97             format. In each non-header record in that file, there must be one or more
98             columns which hold data capable of exactly specifying the record's position in
99             the taxonomy, I<i.e.,> the route from the root node to the node
100             being represented by that record.
101              
102             The precise way in which certain columns are used to determine the path from
103             the root node to a given node is what differentiates various types of taxonomy
104             files from one another. In Parse-Taxonomy we identify two different
105             flavors of taxonomy files and provide a class for the construction of each.
106              
107             =head3 Taxonomy-by-materialized-path
108              
109             A B<taxonomy-by-materialized-path> is one in which a single column -- which we will refer
110             to as the B<path column> -- serves as a B<materialized path>. A materialized
111             path represents the route from the root to the given
112             record as a series of strings joined by separator characters.
113             Within that path column the value corresponding to the root node need
114             not be specified, I<i.e.,> may be represented by an empty string.
115              
116             Let's rewrite Diagram 1 with values to make this clear.
117              
118             B
119              
120             ""
121             |
122             ----------------------------------------------------
123             | | | |
124             Alpha Beta Gamma Delta
125             | | |
126             ------------------------- ------------ |
127             | | | | |
128             Epsilon Zeta Eta Theta Iota
129             | | |
130             | ------------ |
131             | | | |
132             Kappa Lambda Mu Nu
133              
134             Let us suppose that our taxonomy file held comma-separated, quoted records.
135             Let us further suppose that the column holding taxonomy paths was, not
136             surprisingly, called C<path> and that the separator within the C<path> column
137             was a pipe (C<|>) character. Let us further suppose that for now we are not
138             concerned with the data in any columns other than C<path> so that, for purpose
139             of illustration, they will hold empty (albeit quoted) strings.
140              
141             Then the taxonomy file describing the tree in Diagram 2 would look like this:
142              
143             "path","nationality","gender","age","income","id_no"
144             "|Alpha","","","","",""
145             "|Alpha|Epsilon","","","","",""
146             "|Alpha|Epsilon|Kappa","","","","",""
147             "|Alpha|Zeta","","","","",""
148             "|Alpha|Zeta|Lambda","","","","",""
149             "|Alpha|Zeta|Mu","","","","",""
150             "|Beta","","","","",""
151             "|Beta|Eta","","","","",""
152             "|Beta|Theta","","","","",""
153             "|Gamma","","","","",""
154             "|Gamma|Iota","","","","",""
155             "|Gamma|Iota|Nu","","","","",""
156             "|Delta","","","","",""
157              
158             Note that while in the C<|Gamma> branch we ultimately have only one leaf node,
159             C<|Gamma|Iota|Nu>, we require separate records in the taxonomy file for
160             C<|Gamma> and C<|Gamma|Iota>. To put this another way, the existence of a
161             C<|Gamma|Iota|Nu> leaf must not be assumed to "auto-vivify" C<|Gamma> and
162             C<|Gamma|Iota> nodes. Each non-root node must be explicitly represented in
163             the taxonomy file for the file to be considered valid.
164              
165             Note further that there is no restriction on the values of the B<components> of
166             the C<path> across records. It is only the B<full> path that must be unique.
167             Let us illustrate that by modifying the data in Diagram 2:
168              
169             B<Diagram 3>
170              
171             ""
172             |
173             ----------------------------------------------------
174             | | | |
175             Alpha Beta Gamma Delta
176             | | |
177             ------------------------- ------------ |
178             | | | | |
179             Epsilon Zeta Eta Theta Iota
180             | | |
181             | ------------ |
182             | | | |
183             Kappa Lambda Mu Delta
184              
185             Here we have two leaf nodes each named C<Delta>. However, we follow different
186             paths from the root node to get to each of them. The taxonomy file
187             representing this tree would look like this:
188              
189             "path","nationality","gender","age","income","id_no"
190             "|Alpha","","","","",""
191             "|Alpha|Epsilon","","","","",""
192             "|Alpha|Epsilon|Kappa","","","","",""
193             "|Alpha|Zeta","","","","",""
194             "|Alpha|Zeta|Lambda","","","","",""
195             "|Alpha|Zeta|Mu","","","","",""
196             "|Beta","","","","",""
197             "|Beta|Eta","","","","",""
198             "|Beta|Theta","","","","",""
199             "|Gamma","","","","",""
200             "|Gamma|Iota","","","","",""
201             "|Gamma|Iota|Delta","","","","",""
202             "|Delta","","","","",""
203              
204             =head3 Taxonomy-by-adjacent-list
205              
206             A B<taxonomy-by-adjacent-list> is one in which each record has a column with a
207             unique identifier (B<id>) and another column holding the unique identifier of
208             the record representing the next higher node in the hierarchy (B<parent_id>).
209             The record must also have a column which holds a datum that is unique among
210             all records having the same parent node.
211              
212             Let's make this clearer by rewriting the taxonomy-by-materialized-path above
213             for Diagram 3 as a taxonomy-by-adjacent-list.
214              
215             "id","parent_id","name","nationality","gender","age","income","id_no"
216             1,,"Alpha","","","","",""
217             2,1,"Epsilon","","","","",""
218             3,2,"Kappa","","","","",""
219             4,1,"Zeta","","","","",""
220             5,4,"Lambda","","","","",""
221             6,4,"Mu","","","","",""
222             7,,"Beta","","","","",""
223             8,7,"Eta","","","","",""
224             9,7,"Theta","","","","",""
225             10,,"Gamma","","","","",""
226             11,10,"Iota","","","","",""
227             12,11,"Delta","","","","",""
228             13,,"Delta","","","","",""
229              
230             In the above taxonomy-by-adjacent-list, the records with C<id>s C<1>, C<7>, C<10>, and
231             C<13> are top-level nodes. They have no parents, so the value of their
232             C<parent_id> column is null or, in Perl terms, an empty string. The records
233             with C<id>s C<2> and C<4> are children of the record with C<id> of C<1>. The
234             record with C<id 3> is, in turn, a child of the record with C<id 2>.
235              
236             In the above taxonomy-by-adjacent-list, close inspection will show that no two records
237             with the same C<parent_id> share the same C<name>. The property of
238             B<unique sibling names> means that we can construct a non-indexed
239             version of the path from the root to a given node by using the C<parent_id>
240             column in a given record to look up the C<name> of the record with the C<id>
241             value identical to the child's C<parent_id>.
242              
243             Via index: 3 2 1
244              
245             Via name: Kappa Epsilon Alpha
246              
247             We go from C<id 3> to its C<parent_id>, C<2>, then to C<2>'s C<parent_id>, C<1>.
248             Putting names to this, we go from C<Kappa> to C<Epsilon> to C<Alpha>.
249              
250             Now, reverse the order of those C<name>s, throw a pipe delimiter before each
251             of them and join them into a single string, and you get:
252              
253             |Alpha|Epsilon|Kappa
254              
255             ... which is the value of the C<path> column in the third record in the
256             taxonomy-by-materialized-path displayed previously.
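
The walk from a record up to its top-level ancestor can be sketched in a few
lines of Perl. This is an illustration under stated assumptions, not the
conversion code shipped with the distribution: the helper path_for() and the
[id, parent_id, name] row layout are hypothetical.

```perl
use strict;
use warnings;

# Rows as [id, parent_id, name]; an empty parent_id marks a top-level node.
my @rows = (
    [1, '', 'Alpha'],
    [2, 1,  'Epsilon'],
    [3, 2,  'Kappa'],
);

# Index the rows by id so that each parent lookup is a single hash access.
my %by_id = map { $_->[0] => $_ } @rows;

# Hypothetical helper: follow parent_id links upward, collecting names in
# root-to-leaf order, then put the separator in front of each name.
sub path_for {
    my ($id, $by_id, $sep) = @_;
    my (@names, %visited);
    while (defined $id && $id ne '') {
        die "cycle detected at id $id" if $visited{$id}++;
        my $row = $by_id->{$id} or die "no record with id $id";
        unshift @names, $row->[2];
        $id = $row->[1];
    }
    return join '', map { $sep . $_ } @names;
}

my $path = path_for(3, \%by_id, '|');
print "$path\n";    # |Alpha|Epsilon|Kappa
```

Using unshift while climbing performs the "reverse the order" step implicitly,
and the %visited guard keeps malformed data with circular parent links from
looping forever.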
257              
258             With correct data, a given hierarchy of data can therefore be represented
259             either by a taxonomy-by-materialized-path or by a taxonomy-by-adjacent-list.
260             This permits us to describe these two taxonomies as B<equivalent> to each
261             other.
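
"Correct data" for an adjacent list includes the requirement that no two
records with the same parent share a name; otherwise two records would map to
the same materialized path. A minimal sketch of that check follows. The helper
duplicate_siblings() and the [id, parent_id, name] row layout are assumptions
for illustration only.

```perl
use strict;
use warnings;

# Rows as [id, parent_id, name], taken from the adjacent-list example.
my @rows = (
    [10, '', 'Gamma'],
    [11, 10, 'Iota'],
    [12, 11, 'Delta'],
    [13, '', 'Delta'],
);

# Hypothetical helper: return "parent_id:name" keys that occur more than once.
sub duplicate_siblings {
    my @records = @_;
    my %count;
    $count{ join(':', $_->[1], $_->[2]) }++ for @records;
    return sort grep { $count{$_} > 1 } keys %count;
}

# The two Delta records have different parents, so there is no clash.
my @clashes = duplicate_siblings(@rows);

# A second top-level Gamma does clash with the record whose id is 10.
my @clashes_bad = duplicate_siblings(@rows, [14, '', 'Gamma']);

print "clashes in example data: ", scalar(@clashes), "\n";
print "clashes after adding a second Gamma: @clashes_bad\n";
```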
262              
263             =head2 Taxonomy Validation
264              
265             Each C<Parse::Taxonomy> subclass will have a constructor, C<new()>, whose
266             principal interface will take the name of a taxonomy file as an argument. We
267             will call this interface the B<file> interface to the constructor. The
268             purpose of the constructor will be to determine whether the taxonomy file
269             holds a valid taxonomy according to the description provided above. The
270             arguments needed for such a constructor will be found in the documentation of
271             the subclass.
272              
273             The constructor of a C<Parse::Taxonomy> subclass may, if desired, accept
274             a different set of arguments. Suppose you have already read a CSV file and
275             parsed it into one array reference holding its header row -- a list of its
276             columns -- and a second array reference, this one being an array of arrays
277             where each element holds the data in one record in the CSV file. You have the
278             same components needed to validate the taxonomy that you would get by
279             parsing the CSV file, so your subclass may implement a B<components> interface
280             as well as a file interface.
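
To make the shape of those two array references concrete, here is a small
sketch. The variable names $fields and $data_records are chosen to echo the
accessors defined in this file. How a subclass constructor actually receives
them is documented in the subclass, so no constructor call is shown.

```perl
use strict;
use warnings;

# Header row: one array reference holding the column names.
my $fields = [ 'path', 'nationality', 'gender' ];

# Data: an array of arrays, one inner array per CSV record.
my $data_records = [
    [ '|Alpha',         '', '' ],
    [ '|Alpha|Epsilon', '', '' ],
];

# A basic sanity check before any validation: every record should have
# exactly as many columns as the header row.
my @bad = grep { @{ $data_records->[$_] } != @{$fields} } 0 .. $#{$data_records};

print "records with the wrong column count: ", scalar(@bad), "\n";
```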
281              
282             You should now proceed to read the documentation for
283             L<Parse::Taxonomy::MaterializedPath> and L<Parse::Taxonomy::AdjacentList>.
284              
285             =cut
286              
287             sub fields {
288 69     69 0 7442 my $self = shift;
289 69         186 return $self->{fields};
290             }
291              
292             sub data_records {
293 56     56 0 2199 my $self = shift;
294 56         138 return $self->{data_records};
295             }
296              
297             sub fields_and_data_records {
298 9     9 0 1232 my $self = shift;
299 9         20 my @all_rows = $self->fields;
300 9         12 for my $row (@{$self->data_records}) {
  9         18  
301 117         115 push @all_rows, $row;
302             }
303 9         40 return \@all_rows;
304             }
305              
306             sub get_field_position {
307 4     4 0 253 my ($self, $f) = @_;
308 4         9 my $fields = $self->fields;
309 4         4 my $idx;
310 4         6 for (my $i=0; $i<=$#{$fields}; $i++) {
  26         50  
311 24 100       45 if ($fields->[$i] eq $f) {
312 2         4 $idx = $i;
313 2         3 last;
314             }
315             }
316 4 100       18 if (defined($idx)) {
317 2         12 return $idx;
318             }
319             else {
320 2         242 croak "'$f' not a field in this taxonomy";
321             }
322             }
323              
324             1;
325              
326             =head1 BUGS
327              
328             There are no bug reports outstanding on Parse::Taxonomy as of the most recent
329             CPAN upload date of this distribution.
330              
331             =head1 SUPPORT
332              
333             Please report any bugs by mail to C
334             or through the web interface at L.
335              
336             =head1 AUTHOR
337              
338             James E. Keenan (jkeenan@cpan.org). When sending correspondence, please
339             include 'Parse::Taxonomy' or 'Parse-Taxonomy' in your subject line.
340              
341             Creation date: May 24 2016. Last modification date: April 09 2016.
342              
343             Development repository: L
344              
345             =head1 REFERENCES
346              
347             L
348             by Arthur Axel "fREW" Schmidt
349              
350             L
351             by Larry Leszczynski
352              
353             L
354             by Vadim Tropashko
355              
356             L, now maintained by Ron
357             Savage.
358              
359             =head1 COPYRIGHT
360              
361             Copyright (c) 2002-15 James E. Keenan. United States. All rights reserved.
362             This is free software and may be distributed under the same terms as Perl
363             itself.
364              
365             =head1 DISCLAIMER OF WARRANTY
366              
367             BECAUSE THIS SOFTWARE IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY
368             FOR THE SOFTWARE, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN
369             OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES
370             PROVIDE THE SOFTWARE ''AS IS'' WITHOUT WARRANTY OF ANY KIND, EITHER
371             EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
372             WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE
373             ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE SOFTWARE IS WITH
374             YOU. SHOULD THE SOFTWARE PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL
375             NECESSARY SERVICING, REPAIR, OR CORRECTION.
376              
377             IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
378             WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR
379             REDISTRIBUTE THE SOFTWARE AS PERMITTED BY THE ABOVE LICENCE, BE
380             LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL,
381             OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE
382             THE SOFTWARE (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING
383             RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A
384             FAILURE OF THE SOFTWARE TO OPERATE WITH ANY OTHER SOFTWARE), EVEN IF
385             SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF
386             SUCH DAMAGES.
387              
388             =cut
389              
390             # vim: formatoptions=crqot