File Coverage

lib/Parse/File/Taxonomy.pm
Criterion   Covered  Total      %
statement        33     33  100.0
branch            4      4  100.0
condition       n/a
subroutine        8      8  100.0
pod               0      4    0.0
total            45     49   91.8


line stmt bran cond sub pod time code
1             package Parse::File::Taxonomy;
2 7     7   2587 use strict;
  7         12  
  7         217  
3 7     7   28 use Carp;
  7         8  
  7         422  
4 7     7   4148 use Text::CSV;
  7         86856  
  7         44  
5 7     7   311 use Scalar::Util qw( reftype );
  7         12  
  7         2718  
6             our $VERSION = '0.04';
7             #use Data::Dump;
8              
9             =head1 NAME
10              
11             Parse::File::Taxonomy - Validate a file for use as a taxonomy
12              
13             =head1 VERSION
14              
15             This document refers to version 0.04 of Parse::File::Taxonomy. This version was
16             released May 30 2015.
17              
18             =head1 SYNOPSIS
19              
20             use Parse::File::Taxonomy;
21              
22             =head1 DESCRIPTION
23              
24             This module is the base class for the Parse-File-Taxonomy extension to the
25             Perl 5 programming language. You will not instantiate objects of this class;
26             rather, you will instantiate objects of subclasses, of which
27             Parse::File::Taxonomy::Path will be the first.
28              
29             B
30              
31             =head2 Taxonomy: definition
32              
33             For the purpose of this library, a B<taxonomy> is defined as a tree-like data
34             structure with a root node, zero or more branch (child) nodes, and one or more
35             leaf nodes. The root node and each branch node must have at least one child
36             node, but leaf nodes have no child nodes. The number of branches
37             between a leaf node and the root node is variable.
38              
39             B<Diagram 1:>
40              
41                                                       Root
42                                                         |
43                                ----------------------------------------------------
44                                |                          |             |         |
45                             Branch                     Branch        Branch     Leaf
46                                |                          |             |
47                    -------------------------        ------------        |
48                    |                       |        |          |        |
49                 Branch                  Branch    Leaf       Leaf    Branch
50                    |                       |                            |
51                    |                 ------------                       |
52                    |                 |          |                       |
53                  Leaf              Leaf       Leaf                    Leaf
54              
55             =head2 Taxonomy File: definition
56              
57             For the purpose of this module, a B<taxonomy file> is a CSV file in which (a)
58             certain columns hold data from which the position of each record within the
59             taxonomy can be derived; and (b) each node in the tree (other than the root
60             node) is uniquely represented by a record within the file.
61              
62             =head3 CSV
63              
64             B<"CSV">, strictly speaking, refers to B<comma-separated values>:
65              
66             path,nationality,gender,age,income,id_no
67              
68             For the purpose of this module, however, the column separators in a taxonomy file
69             may be any user-specified character handled by the F<Text-CSV> library.
70             Formats frequently observed are B<tab-separated values>:
71              
72             path nationality gender age income id_no
73              
74             and B<pipe-separated values>:
75              
76             path|nationality|gender|age|income|id_no
77              
78             The documentation for F<Text-CSV> comments that the CSV format could
79             I<"perhaps better [be] called ASV (anything separated values)">, but we shall for
80             convenience use "CSV" herein regardless of the specific delimiter.
81              
82             Since it is often the case that the characters used as column separators may
83             occur within the data recorded in the columns as well, it is customary to
84             quote either all columns:
85              
86             "path","nationality","gender","age","income","id_no"
87              
88             ... or, at the very least, all columns which can hold
89             data other than pure integers or floating-point numbers:
90              
91             "path","nationality","gender",age,income,id_no
92              
93             =head3 Tree structure
94              
95             To qualify as a taxonomy file, it is not sufficient for a file to be in CSV
96             format. In each non-header record in that file, there must be one or more
97             columns which hold data capable of exactly specifying the record's position in
98             the taxonomy, I<i.e.,> the route or B<path> from the root node to the node
99             being represented by that record.
100              
101             The precise way in which certain columns are used to determine the path from
102             the root node to a given node is what differentiates various types of taxonomy
103             files from one another. In Parse-File-Taxonomy we identify two different
104             flavors of taxonomy files and provide a class for the construction of each.
105              
106             =head3 Taxonomy-by-path
107              
108             A B<taxonomy-by-path> is one in which a single column -- which we will refer
109             to as the B<path column> -- will represent the path from the root to the given
110             record as a series of strings joined by separator characters.
111             Within that path column the value corresponding to the root node need
112             not be specified, I<i.e.,> may be represented by an empty string.
113              
114             Let's rewrite Diagram 1 with values to make this clear.
115              
116             B<Diagram 2:>
117              
118                                                        ""
119                                                         |
120                                ----------------------------------------------------
121                                |                          |             |         |
122                              Alpha                      Beta          Gamma     Delta
123                                |                          |             |
124                    -------------------------        ------------        |
125                    |                       |        |          |        |
126                 Epsilon                  Zeta      Eta       Theta    Iota
127                    |                       |                            |
128                    |                 ------------                       |
129                    |                 |          |                       |
130                  Kappa            Lambda       Mu                      Nu
131              
132             Let us suppose that our taxonomy file holds comma-separated, quoted records.
133             Let us further suppose that the column holding taxonomy paths is, not
134             surprisingly, called C<path> and that the separator within the C<path> column
135             is a pipe (C<|>) character. Let us further suppose that for now we are not
136             concerned with the data in any columns other than C<path> so that, for purposes
137             of illustration, they will hold empty (albeit quoted) strings.
138              
139             Then the taxonomy file describing the tree in Diagram 2 would look like this:
140              
141             "path","nationality","gender","age","income","id_no"
142             "|Alpha","","","","",""
143             "|Alpha|Epsilon","","","","",""
144             "|Alpha|Epsilon|Kappa","","","","",""
145             "|Alpha|Zeta","","","","",""
146             "|Alpha|Zeta|Lambda","","","","",""
147             "|Alpha|Zeta|Mu","","","","",""
148             "|Beta","","","","",""
149             "|Beta|Eta","","","","",""
150             "|Beta|Theta","","","","",""
151             "|Gamma","","","","",""
152             "|Gamma|Iota","","","","",""
153             "|Gamma|Iota|Nu","","","","",""
154             "|Delta","","","","",""
155              
156             Note that while in the C<|Gamma> branch we ultimately have only one leaf node,
157             C<|Gamma|Iota|Nu>, we require separate records in the taxonomy file for
158             C<|Gamma> and C<|Gamma|Iota>. To put this another way, the existence of a
159             C<|Gamma|Iota|Nu> leaf must not be assumed to "auto-vivify" C<|Gamma> and
160             C<|Gamma|Iota> nodes. Each non-root node must be explicitly represented in
161             the taxonomy file for the file to be considered valid.
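
The rule that every non-root node needs its own record can be checked mechanically. The following is a minimal sketch only: C<missing_parents> is a hypothetical helper written for this illustration, not part of Parse::File::Taxonomy, and it assumes the path values have already been extracted from the C<path> column.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Parse::File::Taxonomy): given a list of
# path strings and the separator used inside them, return the ancestor paths
# that some record implies but that have no record of their own.
sub missing_parents {
    my ($paths, $sep) = @_;
    my %seen = map { $_ => 1 } @{$paths};
    my %missing;
    for my $path (@{$paths}) {
        # Limit of -1 keeps every field; the leading empty string is the root.
        my @parts = split /\Q$sep\E/, $path, -1;
        pop @parts;    # drop the node itself, leaving only its ancestors
        while (@parts > 1) {
            my $parent = join $sep, @parts;
            $missing{$parent} = 1 unless $seen{$parent};
            pop @parts;
        }
    }
    return [ sort keys %missing ];
}

my @valid   = ('|Gamma', '|Gamma|Iota', '|Gamma|Iota|Nu');
my @invalid = ('|Gamma|Iota|Nu');

print "valid file, missing parents: ",
    scalar @{ missing_parents(\@valid, '|') }, "\n";
print "invalid file, missing parents: ",
    join(', ', @{ missing_parents(\@invalid, '|') }), "\n";
```

For the second list the sketch reports C<|Gamma> and C<|Gamma|Iota>, the two records that the lone C<|Gamma|Iota|Nu> leaf cannot bring into existence on its own.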
162              
163             Note further that there is no restriction on the values of the B<components> of
164             the C<path> across records. It is only the B<full> path that must be unique.
165             Let us illustrate that by modifying the data in Diagram 2:
166              
167             B<Diagram 3:>
168              
169                                                        ""
170                                                         |
171                                ----------------------------------------------------
172                                |                          |             |         |
173                              Alpha                      Beta          Gamma     Delta
174                                |                          |             |
175                    -------------------------        ------------        |
176                    |                       |        |          |        |
177                 Epsilon                  Zeta      Eta       Theta    Iota
178                    |                       |                            |
179                    |                 ------------                       |
180                    |                 |          |                       |
181                  Kappa            Lambda       Mu                     Delta
182              
183             Here we have two leaf nodes each named C<Delta>. However, we follow different
184             paths from the root node to get to each of them. The taxonomy file
185             representing this tree would look like this:
186              
187             "path","nationality","gender","age","income","id_no"
188             "|Alpha","","","","",""
189             "|Alpha|Epsilon","","","","",""
190             "|Alpha|Epsilon|Kappa","","","","",""
191             "|Alpha|Zeta","","","","",""
192             "|Alpha|Zeta|Lambda","","","","",""
193             "|Alpha|Zeta|Mu","","","","",""
194             "|Beta","","","","",""
195             "|Beta|Eta","","","","",""
196             "|Beta|Theta","","","","",""
197             "|Gamma","","","","",""
198             "|Gamma|Iota","","","","",""
199             "|Gamma|Iota|Delta","","","","",""
200             "|Delta","","","","",""
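
As a rough illustration of the difference between a repeated component and a repeated path, the sketch below counts both for the paths listed above. It is not taken from the module; the variable names are chosen here for illustration only.

```perl
use strict;
use warnings;

# The path column of the taxonomy file above, header row omitted.
my @paths = (
    '|Alpha', '|Alpha|Epsilon', '|Alpha|Epsilon|Kappa',
    '|Alpha|Zeta', '|Alpha|Zeta|Lambda', '|Alpha|Zeta|Mu',
    '|Beta', '|Beta|Eta', '|Beta|Theta',
    '|Gamma', '|Gamma|Iota', '|Gamma|Iota|Delta',
    '|Delta',
);

my (%path_count, %name_count);
for my $path (@paths) {
    $path_count{$path}++;
    # The last component is the node's own name.
    my $name = (split /\|/, $path)[-1];
    $name_count{$name}++;
}

my @duplicate_paths = grep { $path_count{$_} > 1 } sort keys %path_count;
my @repeated_names  = grep { $name_count{$_} > 1 } sort keys %name_count;

print "duplicate full paths: ", scalar(@duplicate_paths), "\n";
print "repeated node names: @repeated_names\n";
```

The name C<Delta> occurs twice, yet no full path is duplicated, so the file still satisfies the uniqueness requirement.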
201              
202             =head3 Taxonomy-by-index
203              
204             A B<taxonomy-by-index> is one in which each record has a
205             column with a unique identifier (B<id>) and another column holding the unique
206             identifier of the record representing the next higher node in the hierarchy
207             (B<parent_id>). The record must also have a column which holds a datum that is
208             unique among all records having the same parent node.
209              
210             Let's make this clearer by rewriting the taxonomy-by-path above for Diagram 3
211             as a taxonomy-by-index.
212              
213             "id","parent_id","name","nationality","gender","age","income","id_no"
214             1,,"Alpha","","","","",""
215             2,1,"Epsilon","","","","",""
216             3,2,"Kappa","","","","",""
217             4,1,"Zeta","","","","",""
218             5,4,"Lambda","","","","",""
219             6,4,"Mu","","","","",""
220             7,,"Beta","","","","",""
221             8,7,"Eta","","","","",""
222             9,7,"Theta","","","","",""
223             10,,"Gamma","","","","",""
224             11,10,"Iota","","","","",""
225             12,11,"Delta","","","","",""
226             13,,"Delta","","","","",""
227              
228             In the above taxonomy-by-index, the records with C<id>s C<1>, C<7>, C<10>, and
229             C<13> are top-level nodes. They have no parents, so the value of their
230             C<parent_id> column is null or, in Perl terms, an empty string. The records
231             with C<id>s C<2> and C<4> are children of the record with C<id> of C<1>. The
232             record with C<id 3> is, in turn, a child of the record with C<id 2>.
233              
234             In the above taxonomy-by-index, close inspection will show that no two records
235             with the same C<parent_id> share the same C<name>. The property of
236             B<uniqueness of sibling names> means that we can construct a non-indexed
237             version of the path from the root to a given node by using the C<parent_id>
238             column in a given record to look up the C<name> of the record with the C<id>
239             value identical to the child's C<parent_id>.
240              
241             Via index: 3 2 1
242              
243             Via name: Kappa Epsilon Alpha
244              
245             We go from C<id 3> to its C<parent_id>, then to C<2>'s C<parent_id>.
246             Putting C<name>s to this, we go from C<Kappa> to C<Epsilon> to C<Alpha>.
247              
248             Now, reverse the order of those C<name>s, throw a pipe delimiter before each
249             of them and join them into a single string, and you get:
250              
251             |Alpha|Epsilon|Kappa
252              
253             ... which is the value of the C<path> column in the third record in the
254             taxonomy-by-path displayed previously.
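
The lookup just described can be sketched in a few lines. This is an illustration under stated assumptions, not the module's implementation: C<path_for_id> is a hypothetical helper, and the records are reduced to the three columns that matter here (C<id>, C<parent_id>, C<name>).

```perl
use strict;
use warnings;

# The taxonomy-by-index above, reduced to id, parent_id and name.
my @records = (
    [ 1,  '', 'Alpha'   ],
    [ 2,  1,  'Epsilon' ],
    [ 3,  2,  'Kappa'   ],
    [ 4,  1,  'Zeta'    ],
    [ 5,  4,  'Lambda'  ],
    [ 6,  4,  'Mu'      ],
    [ 7,  '', 'Beta'    ],
    [ 8,  7,  'Eta'     ],
    [ 9,  7,  'Theta'   ],
    [ 10, '', 'Gamma'   ],
    [ 11, 10, 'Iota'    ],
    [ 12, 11, 'Delta'   ],
    [ 13, '', 'Delta'   ],
);

# Hypothetical helper: walk parent_id links from a node up to a top-level
# node, collecting names, then reverse them into a path string.
sub path_for_id {
    my ($records, $id, $sep) = @_;
    my %by_id = map { $_->[0] => $_ } @{$records};
    my (@names, %visited);
    while (defined $id && length $id) {
        my $rec = $by_id{$id} or die "no record with id '$id'";
        die "cycle detected at id '$id'" if $visited{$id}++;
        push @names, $rec->[2];
        $id = $rec->[1];    # empty string for a top-level node ends the walk
    }
    return join '', map { $sep . $_ } reverse @names;
}

print path_for_id(\@records, 3,  '|'), "\n";    # |Alpha|Epsilon|Kappa
print path_for_id(\@records, 12, '|'), "\n";    # |Gamma|Iota|Delta
```

The cycle check is a defensive addition for malformed data; with a valid taxonomy the walk always ends at a record whose C<parent_id> is empty.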
255              
256             With correct data, a given hierarchy of data can therefore be represented
257             either by a taxonomy-by-path or by a taxonomy-by-index. We would therefore
258             describe these two taxonomies as B<equivalent> to each other.
259              
260             =head2 Taxonomy Validation
261              
262             Each C<Parse::File::Taxonomy> subclass will have a constructor, C<new()>,
263             which will probe a taxonomy file
264             provided to it as an argument to determine whether it can be considered a
265             valid taxonomy according to the description provided above. The arguments
266             needed for such a constructor will be found in the documentation of the
267             subclass.
268              
269             The constructor of a C<Parse::File::Taxonomy> subclass may, if desired, accept
270             a different set of arguments. Suppose you have already read a CSV file and
271             parsed it into one array reference holding its header row -- a list of its
272             columns -- and a second array reference, this one being an array of arrays
273             where each element holds the data in one record in the CSV file. You then have
274             the same data needed to validate the taxonomy that you would get by
275             parsing the CSV file.
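
One way to arrive at those two array references is sketched below, using the C<parse> and C<fields> methods of Text::CSV on lines held in memory. This is only an illustration: the variable names are assumptions, line-by-line parsing will not cope with fields containing embedded newlines, and how the results are handed to a subclass constructor is described in that subclass's documentation, not here.

```perl
use strict;
use warnings;
use Text::CSV;

# A few lines of the taxonomy-by-path file shown earlier.
my @lines = (
    q{"path","nationality","gender","age","income","id_no"},
    q{"|Alpha","","","","",""},
    q{"|Alpha|Epsilon","","","","",""},
);

my $csv = Text::CSV->new({ binary => 1 })
    or die "Cannot construct Text::CSV object";

my @rows;
for my $line (@lines) {
    $csv->parse($line) or die "Unable to parse line: $line";
    push @rows, [ $csv->fields ];
}

my $fields       = shift @rows;    # header row: reference to a list of columns
my $data_records = \@rows;         # remaining rows: reference to array of arrays

print "columns: @{$fields}\n";
print "records: ", scalar @{$data_records}, "\n";
```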
276              
277             =cut
278              
279             sub fields {
280 41     41 0 9340 my $self = shift;
281 41         154 return $self->{fields};
282             }
283              
284             sub data_records {
285 40     40 0 934 my $self = shift;
286 40         112 return $self->{data_records};
287             }
288              
289             sub fields_and_data_records {
290 9     9 0 2540 my $self = shift;
291 9         26 my @all_rows = $self->fields;
292 9         16 for my $row (@{$self->data_records}) {
  9         23  
293 117         121 push @all_rows, $row;
294             }
295 9         40 return \@all_rows;
296             }
297              
298             sub get_field_position {
299 4     4 0 512 my ($self, $f) = @_;
300 4         10 my $fields = $self->fields;
301 4         37 my $idx;
302 4         7 for (my $i=0; $i<=$#{$fields}; $i++) {
  26         43  
303 24 100       45 if ($fields->[$i] eq $f) {
304 2         3 $idx = $i;
305 2         4 last;
306             }
307             }
308 4 100       12 if (defined($idx)) {
309 2         13 return $idx;
310             }
311             else {
312 2         253 croak "'$f' not a field in this taxonomy";
313             }
314             }
315              
316             1;
317              
318             =head1 BUGS
319              
320             There are no bug reports outstanding on Parse::File::Taxonomy as of the most recent
321             CPAN upload date of this distribution.
322              
323             =head1 SUPPORT
324              
325             Please report any bugs by mail to C
326             or through the web interface at L.
327              
328             =head1 AUTHOR
329              
330             James E. Keenan (jkeenan@cpan.org). When sending correspondence, please
331             include 'Parse::File::Taxonomy' or 'Parse-File-Taxonomy' in your subject line.
332              
333             Creation date: May 24 2015. Last modification date: June 17 2015.
334              
335             Development repository: L
336              
337             =head1 COPYRIGHT
338              
339             Copyright (c) 2002-15 James E. Keenan. United States. All rights reserved.
340             This is free software and may be distributed under the same terms as Perl
341             itself.
342              
343             =head1 DISCLAIMER OF WARRANTY
344              
345             BECAUSE THIS SOFTWARE IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY
346             FOR THE SOFTWARE, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN
347             OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES
348             PROVIDE THE SOFTWARE ''AS IS'' WITHOUT WARRANTY OF ANY KIND, EITHER
349             EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
350             WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE
351             ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE SOFTWARE IS WITH
352             YOU. SHOULD THE SOFTWARE PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL
353             NECESSARY SERVICING, REPAIR, OR CORRECTION.
354              
355             IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
356             WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR
357             REDISTRIBUTE THE SOFTWARE AS PERMITTED BY THE ABOVE LICENCE, BE
358             LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL,
359             OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE
360             THE SOFTWARE (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING
361             RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A
362             FAILURE OF THE SOFTWARE TO OPERATE WITH ANY OTHER SOFTWARE), EVEN IF
363             SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF
364             SUCH DAMAGES.
365              
366             =cut
367              
368             # vim: formatoptions=crqot