File Coverage

lib/Text/CSV/Hashify.pm

Criterion	Covered	Total	%
statement	98	98	100.0
branch	46	48	95.8
condition	23	23	100.0
subroutine	16	16	100.0
pod	6	7	85.7
total	189	192	98.4
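
The total row appears to be the sum of covered items divided by the sum of all items across the five criteria, not an average of the five percentages. A minimal Perl sketch that reproduces the figures above follows; the `%criteria` hash and the `pct` helper are names invented for illustration, and this is not presented as the coverage tool's own code.

```perl
use strict;
use warnings;

# Covered/total counts copied from the summary table above.
my %criteria = (
    statement  => [ 98, 98 ],
    branch     => [ 46, 48 ],
    condition  => [ 23, 23 ],
    subroutine => [ 16, 16 ],
    pod        => [  6,  7 ],
);

# Percentage formatted to one decimal place, as in the report.
sub pct {
    my ($covered, $total) = @_;
    return sprintf '%.1f', 100 * $covered / $total;
}

my ($covered_sum, $total_sum) = (0, 0);
for my $name (sort keys %criteria) {
    my ($covered, $total) = @{ $criteria{$name} };
    $covered_sum += $covered;
    $total_sum   += $total;
    printf "%-10s %3d/%3d  %5s%%\n", $name, $covered, $total, pct($covered, $total);
}
printf "%-10s %3d/%3d  %5s%%\n", 'total', $covered_sum, $total_sum,
    pct($covered_sum, $total_sum);
```

The two uncovered branches and the one subroutine without POD account for the whole gap between 189 and 192.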

line	stmt	bran	cond	sub	pod	time	code
1							package Text::CSV::Hashify;
2	5			5		115519	use strict;
	5					10
	5					144
3	5			5		87	use 5.8.0;
	5					16
4	5			5		25	use Carp;
	5					8
	5					421
5	5			5		29	use Scalar::Util qw( reftype looks_like_number );
	5					9
	5					302
6	5			5		3775	use Text::CSV;
	5					90174
	5					260
7	5			5		2237	use open qw( :encoding(UTF-8) :std );
	5					4784
	5					27
8
9							BEGIN {
10	5			5		58016	use Exporter ();
	5					9
	5					99
11	5			5		17	use vars qw($VERSION @ISA @EXPORT);
	5					6
	5					425
12	5			5		10	$VERSION = '0.08';
13	5					50	@ISA = qw(Exporter);
14	5					5549	@EXPORT = qw( hashify );
15							}
16
17							=head1 NAME
18
19							Text::CSV::Hashify - Turn a CSV file into a Perl hash
20
21							=head1 VERSION
22
23							This document refers to version 0.08 of Text::CSV::Hashify. This version was
24							released March 15 2017.
25
26							=head1 SYNOPSIS
27
28							# Simple functional interface
29							use Text::CSV::Hashify;
30							$hash_ref = hashify('/path/to/file.csv', 'primary_key');
31
32							# Object-oriented interface
33							use Text::CSV::Hashify;
34							$obj = Text::CSV::Hashify->new( {
35							file => '/path/to/file.csv',
36							format => 'hoh', # hash of hashes, which is default
37							key => 'id', # needed except when format is 'aoh'
38							max_rows => 20, # number of records to read; defaults to all
39							... # other key-value pairs possible for Text::CSV
40							} );
41
42							# all records requested
43							$hash_ref = $obj->all;
44
45							# arrayref of fields input
46							$fields_ref = $obj->fields;
47
48							# hashref of specified record
49							$record_ref = $obj->record('value_of_key');
50
51							# value of one field in one record
52							$datum = $obj->datum('value_of_key', 'field');
53
54							# arrayref of all unique keys seen
55							$keys_ref = $obj->keys;
56
57							=head1 DESCRIPTION
58
59							The Comma-Separated-Value ('CSV') format is the most common way to store
60							spreadsheets or the output of relational database queries in plain-text
61							format. However, since commas (or other designated field-separator
62							characters) may be embedded within data entries, the parsing of delimited
63							records is non-trivial. Fortunately, in Perl this parsing is well handled by
64							CPAN distribution Text::CSV. This permits us to address more specific data
65							manipulation problems by building modules on top of Text::CSV.
66
67							B<Note:> In this document we will use I<CSV> as a catch-all for tab-delimited
68							files, pipe-delimited files, and so forth. Please refer to the documentation
69							for Text::CSV to learn how to handle field separator characters other than the
70							comma.
71
72							=head2 Primary Case: CSV (with primary key) to Hash of Hashes
73
74							Text::CSV::Hashify is designed for the case where you simply want to turn a
75							CSV file into a Perl hash. In particular, it is designed for the case where
76							(a) the CSV file's first record is a list of fields in the ancestral database
77							table and (b) one field (column) functions as a B<primary key>, I<i.e.,> each
78							record's entry in that field is non-null and is distinct from every other
79							record's entry therein.
80
81							Text::CSV::Hashify turns that kind of CSV file into one big hash of hashes.
82							Elements of this hash are keyed on the entries in the designated primary key
83							field and the value for each element is a hash reference of all the data in a
84							particular database record (including the primary key field and its value).
85
86							=head2 Secondary Case: CSV (lacking primary key) to Array of Hashes
87
88							You may, however, encounter cases where a CSV file's header row contains the
89							list of database fields but no field is capable of serving as a primary key,
90							I<i.e.,> there is no field in which the entry for that field in any record is
91							guaranteed to be distinct from the entries in that field for all other
92							records.
93
94							In this case, while an individual record can be turned into a hash,
95							the CSV file as a whole cannot accurately be turned into a hash of hashes. As
96							a fallback, Text::CSV::Hashify can, upon request, turn this into an array of
97							hashes. In this case, you will not be able to look up a particular record by
98							its primary key. You will instead have to know its index position within the
99							array (which is equivalent to knowing its record number in the original CSV
100							file minus C<1>).
101
102							=head2 Interfaces
103
104							Text::CSV::Hashify provides two interfaces: one functional, one
105							object-oriented.
106
107							Use the functional interface when all you want is to turn a CSV file with a
108							primary key field into a hash of hashes.
109
110							Use the object-oriented interface for any more sophisticated manipulation of
111							the CSV file. This includes:
112
113							=over 4
114
115							=item * Text::CSV options
116
117							Access to any of the options available to Text::CSV, such as use of a
118							separator character other than a comma.
119
120							=item * Limit number of records
121
122							Selection of a limited number of records from the CSV file, rather than
123							slurping the whole file into your in-memory hash.
124
125							=item * Array of hash references format
126
127							Probably better than the default hash of hash references format when the CSV
128							file has no field able to serve as a primary key.
129
130							=item * Metadata
131
132							Access to the list of fields, the list of all primary key values, the values
133							in an individual record, or the value of an individual field in an individual
134							record.
135
136							=back
137
138							B<Note:> On the recommendation of the authors/maintainers of Text::CSV,
139							Text::CSV::Hashify will internally always set Text::CSV's C<binary =E<gt> 1>
140							option.
141
142							=head1 FUNCTIONAL INTERFACE
143
144							Text::CSV::Hashify by default exports one function: C<hashify()>.
145
146							$hash_ref = hashify('/path/to/file.csv', 'primary_key');
147
148							Function takes two arguments: path to CSV file; field in that file which
149							serves as primary key.
150
151							Returns a reference to a hash of hash references.
152
153							=cut
154
155							sub hashify {
156	3	100		3	0	988	croak "'hashify()' must have two arguments"
157							unless @_ == 2;
158	2					4	my @args = @_;
159	2					7	for (my $i=0;$i<=$#args;$i++) {
160	4	100				123	croak "'hashify()' argument at index '$i' not true" unless $args[$i];
161							}
162	1					9	my $obj = Text::CSV::Hashify->new( {
163							file => $args[0],
164							key => $args[1],
165							} );
166	1					5	return $obj->all();
167							}
168
169							=head1 OBJECT-ORIENTED INTERFACE
170
171							=head2 C<new()>
172
173							=over 4
174
175							=item * Purpose
176
177							Text::CSV::Hashify constructor.
178
179							=item * Arguments
180
181							$obj = Text::CSV::Hashify->new( {
182							file => '/path/to/file.csv',
183							format => 'hoh', # hash of hashes, which is default
184							key => 'id', # needed except when format is 'aoh'
185							max_rows => 20, # number of records to read; defaults to all
186							... # other key-value pairs possible for Text::CSV
187							} );
188
189							Single hash reference. Required element is:
190
191							=over 4
192
193							=item * C<file>
194
195							String: path to CSV file serving as input.
196
197							=back
198
199							Element usually needed:
200
201							=over 4
202
203							=item * C<key>
204
205							String: name of field in CSV file serving as unique key. Needed except when
206							optional element C<format> is C<aoh>.
207
208							=back
209
210							Optional elements are:
211
212							=over 4
213
214							=item * C<format>
215
216							String: possible values are C<hoh> and C<aoh>. Defaults to C<hoh> (hash of
217							hashes). C<new()> will fail if the same value is encountered in more than one
218							record's entry in the C<key> column. So if you know in advance that your data
219							cannot meet this condition, explicitly select C<format =E<gt> aoh>.
220
221							=item * C<max_rows>
222
223							Number: provide this if you do not wish to populate the hash with all data
224							records from the CSV file. (Will have no effect if the number provided is
225							greater than or equal to the number of data records in the CSV file.)
226
227							=item * Any option available to Text::CSV
228
229							See documentation for either Text::CSV or Text::CSV_XS.
230
231							=back
232
233							=item * Return Value
234
235							Text::CSV::Hashify object.
236
237							=item * Comment
238
239							=back
240
241							=cut
242
243							sub new {
244	23			23	1	12767	my ($class, $args) = @_;
245	23					33	my %data;
246
247	23	100	100			595	croak "Argument to 'new()' must be hashref"
248							unless (ref($args) and reftype($args) eq 'HASH');
249	21	100				167	croak "Argument to 'new()' must have 'file' element" unless $args->{file};
250							croak "Cannot locate file '$args->{file}'"
251	20	100				478	unless (-f $args->{file});
252	19					52	$data{file} = delete $args->{file};
253
254	19	100	100			98	if ($args->{format} and ($args->{format} !~ m/^(?:h|a)oh$/i) ) {
255	1					117	croak "Entry '$args->{format}' for format is invalid";
256							}
257	18		100			71	$data{format} = delete $args->{format} || 'hoh';
258
259	18	100	100			64	if (! exists $args->{key} and $data{format} ne 'aoh') {
260	1					112	croak "Argument to 'new()' must have 'key' element unless 'format' element is 'aoh'";
261							}
262	17					32	$data{key} = delete $args->{key};
263
264	17	100				45	if (defined($args->{max_rows})) {
265	6	100				38	if ($args->{max_rows} !~ m/^[0-9]+$/) {
266	3					323	croak "'max_rows' option, if defined, must be numeric";
267							}
268							else {
269	3					7	$data{max_rows} = delete $args->{max_rows};
270							}
271							}
272							# We've now handled all the Text::CSV::Hashify::new-specific options.
273							# Any remaining options are assumed to be intended for Text::CSV::new().
274
275	14					27	$args->{binary} = 1;
276	14	50				89	my $csv = Text::CSV->new ( $args )
277							or croak "Cannot use CSV: ".Text::CSV->error_diag ();
278							open my $IN, "<", $data{file}
279	14	50				2061	or croak "Unable to open '$data{file}' for reading";
280	14					1520	my $header_ref = $csv->getline($IN);
281	14					905	my %header_fields_seen;
282	14					18	for (@{$header_ref}) {
	14					35
283	107	100				130	if (exists $header_fields_seen{$_}) {
284	1					140	croak "Duplicate field '$_' observed in '$data{file}'";
285							}
286							else {
287	106					170	$header_fields_seen{$_}++;
288							}
289							}
290	13					25	$data{fields} = $header_ref;
291	13					17	$csv->column_names(@{$header_ref});
	13					80
292
293							# 'hoh' format
294	13					458	my %keys_seen;
295	13					25	my @keys_list = ();
296	13					53	my %parsed_data;
297							# 'aoh' format
298							my @parsed_data;
299
300	13					66	PARSE_FILE: while (my $record = $csv->getline_hr($IN)) {
301	133	100				7379	if ($data{format} eq 'hoh') {
302	123					155	my $kk = $record->{$data{key}};
303	123	100				139	if ($keys_seen{$kk}) {
304	1					169	croak "Key '$kk' already seen";
305							}
306							else {
307	122					187	$keys_seen{$kk}++;
308	122					142	push @keys_list, $kk;
309	122					114	$parsed_data{$kk} = $record;
310							last PARSE_FILE if (
311							defined $data{max_rows} and
312							scalar(keys %parsed_data) == $data{max_rows}
313	122	100	100			436	);
314							}
315							}
316							else { # format: 'aoh'
317	10					14	push @parsed_data, $record;
318							last PARSE_FILE if (
319							defined $data{max_rows} and
320							scalar(@parsed_data) == $data{max_rows}
321	10	100	100			65	);
322							}
323							}
324	12	100				590	$data{all} = ($data{format} eq 'aoh') ? \@parsed_data : \%parsed_data;
325	12	100				56	$data{keys} = \@keys_list if $data{format} eq 'hoh';
326	12					26	$data{csv} = $csv;
327	12					28	while (my ($k,$v) = each %{$args}) {
	26					88
328	14					35	$data{$k} = $v;
329							}
330	12					233	return bless \%data, $class;
331							}
332
333							=head2 C<all()>
334
335							=over 4
336
337							=item * Purpose
338
339							Get a representation of all data found in a CSV input file.
340
341							=item * Arguments
342
343							$hash_ref = $obj->all; # when format is default or 'hoh'
344							$array_ref = $obj->all; # when format is 'aoh'
345
346							=item * Return Value
347
348							Reference representing all data records in the CSV input file. In the default
349							case, or if you have specifically requested C<format =E<gt> 'hoh'>, the return
350							value is a hash reference. When you have requested C<format =E<gt> 'aoh'>, the
351							return value is an array reference.
352
353							=item * Comment
354
355							In the default (C<hoh>) case, the return value is equivalent to that of
356							C<hashify()>.
357
358							=back
359
360							=cut
361
362							sub all {
363	5			5	1	3296	my ($self) = @_;
364	5					37	return $self->{all};
365							}
366
367							=head2 C<fields()>
368
369							=over 4
370
371							=item * Purpose
372
373							Get a list of the fields in the CSV source.
374
375							=item * Arguments
376
377							$fields_ref = $obj->fields;
378
379							=item * Return Value
380
381							Array reference.
382
383							=item * Comment
384
385							If any field names are duplicate, you will not get this far, as C<new()> would
386							have died.
387
388							=back
389
390							=cut
391
392							sub fields {
393	3			3	1	1297	my ($self) = @_;
394	3					7	return $self->{fields};
395							}
396
397							=head2 C<record()>
398
399							=over 4
400
401							=item * Purpose
402
403							Get a hash representing one record in the CSV input file.
404
405							=item * Arguments
406
407							$record_ref = $obj->record('value_of_key');
408
409							One argument. In the default case (C<format =E<gt> 'hoh'>), this argument is the value in the record in the column serving as unique key.
410
411							In the C<format =E<gt> 'aoh'> case, this will be the index position of the data record
412							in the array. (The header row is not stored in the array; the first data record is at index C<0>.)
413
414							=item * Return Value
415
416							Hash reference.
417
418							=back
419
420							=cut
421
422							sub record {
423	15			15	1	9894	my ($self, $key) = @_;
424	15	100	100			844	croak "Argument to 'record()' either not defined or empty"
425							unless (defined $key and $key ne '');
426							($self->{format} eq 'aoh')
427							? return $self->{all}->[$key]
428	9	100				38	: return $self->{all}->{$key};
429							}
430
431							=head2 C<datum()>
432
433							=over 4
434
435							=item * Purpose
436
437							Get value of one field in one record.
438
439							=item * Arguments
440
441							$datum = $obj->datum('value_of_key', 'field');
442
443							List of two arguments: the value in the record in the column serving as unique
444							key; the name of the field.
445
446							=item * Return Value
447
448							Scalar.
449
450							=back
451
452							=cut
453
454							sub datum {
455	14			14	1	6781	my ($self, @args) = @_;
456	14	100				295	croak "'datum()' needs two arguments" unless @args == 2;
457	11					39	for (my $i=0;$i<=$#args;$i++) {
458	19	100	100			595	croak "Argument to 'datum()' at index '$i' either not defined or empty"
459							unless ((defined($args[$i])) and ($args[$i] ne ''));
460							}
461							($self->{format} eq 'aoh')
462							? return $self->{all}->[$args[0]]->{$args[1]}
463	5	100				36	: return $self->{all}->{$args[0]}->{$args[1]};
464							}
465
466							=head2 C<keys()>
467
468							=over 4
469
470							=item * Purpose
471
472							Get a list of all unique keys found in the input file.
473
474							=item * Arguments
475
476							$keys_ref = $obj->keys;
477
478							=item * Return Value
479
480							Array reference.
481
482							=item * Comment
483
484							If you have selected C<format =E<gt> 'aoh'> in the options to C<new()>, the
485							C<keys()> method is inappropriate and will cause your program to die.
486
487							=back
488
489							=cut
490
491							sub keys {
492	3			3	1	1246	my ($self) = @_;
493	3	100				11	if (exists $self->{keys}) {
494	2					4	return $self->{keys};
495							}
496							else {
497	1					117	croak "'keys()' method not appropriate when 'format' is 'aoh'";
498							}
499							}
500
501							=head1 AUTHOR
502
503							James E Keenan
504							CPAN ID: jkeenan
505							jkeenan@cpan.org
506							http://thenceforward.net/perl/modules/Text-CSV-Hashify
507
508							=head1 COPYRIGHT
509
510							This program is free software; you can redistribute
511							it and/or modify it under the same terms as Perl itself.
512
513							The full text of the license can be found in the
514							LICENSE file included with this module.
515
516							Copyright 2012-2017, James E Keenan. All rights reserved.
517
518							=head1 BUGS
519
520							There are no bug reports outstanding on Text::CSV::Hashify as of the most recent
521							CPAN upload date of this distribution.
522
523							=head1 SUPPORT
524
525							To report any bugs or make any feature requests, please send mail to
526							C or use the web interface at
527							L.
528
529							=head1 ACKNOWLEDGEMENTS
530
531							Thanks to Christine Shieh for serving as the alpha consumer of this
532							library's output.
533
534							=head1 OTHER CPAN DISTRIBUTIONS
535
536							=head2 Text-CSV and Text-CSV_XS
537
538							These distributions underlie Text-CSV-Hashify and provide all of its
539							file-parsing functionality. Where possible, install both. That will enable
540							you to process a file with a single, shared interface but have access to the
541							faster processing speeds of XS where available.
542
543							=head2 Text-CSV-Slurp
544
545							Like Text-CSV-Hashify, Text-CSV-Slurp slurps an entire CSV file into memory,
546							but stores it as an array of hashes instead.
547
548							=head2 Text-CSV-Auto
549
550							This distribution inspired the C<max_rows> option to C<new()>.
551
552							=cut
553
554							1;
555
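
The two output shapes described in the DESCRIPTION section of the listing, a hash of hashes keyed on a primary key and an array of hashes indexed by position, can be sketched in core Perl. This is an illustration under stated assumptions, not the module's implementation: the sample data is hypothetical, and a naive `split` on commas stands in for Text::CSV only because no field here contains an embedded comma.

```perl
use strict;
use warnings;

# Hypothetical sample data; real CSV input should be parsed with Text::CSV.
my @lines = (
    'id,name,color',
    '101,apple,red',
    '102,banana,yellow',
);

# As in new() in the listing, the header row is read first and is not
# stored among the data records.
my @fields = split /,/, shift @lines;

my (%hoh, @aoh);
for my $line (@lines) {
    my %record;
    @record{@fields} = split /,/, $line;

    # A primary key must be unique, so a repeated value is fatal for 'hoh'.
    die "Key '$record{id}' already seen" if exists $hoh{ $record{id} };

    $hoh{ $record{id} } = \%record;   # 'hoh': look up by primary key value
    push @aoh, \%record;              # 'aoh': look up by position
}

print $hoh{102}{name}, "\n";   # record whose 'id' is 102
print $aoh[0]{name}, "\n";     # first data record
```

With the hash of hashes, a record is found by its key value; with the array of hashes, the caller has to know the record's position, which is why the listing's `keys()` method croaks for the `aoh` format.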