File Coverage

Bio/PullParserI.pm
Criterion Covered Total %
statement 137 163 84.0
branch 60 88 68.1
condition 24 37 64.8
subroutine 21 21 100.0
pod 4 4 100.0
total 246 313 78.5


line stmt bran cond sub pod time code
1             #
2             # BioPerl module for Bio::PullParserI
3             #
4             # Please direct questions and support issues to
5             #
6             # Cared for by Sendu Bala
7             #
8             # Copyright Sendu Bala
9             #
10             # You may distribute this module under the same terms as perl itself
11              
12             # POD documentation - main docs before the code
13              
14             =head1 NAME
15              
16             Bio::PullParserI - A base module for fast 'pull' parsing
17              
18             =head1 SYNOPSIS
19              
20             # do not use this class, it is intended for parser module
21             # writers only
22              
23             =head1 DESCRIPTION
24              
25             If you are writing a module to parse some new format, you may wish to use
26             a 'pull' approach whereby you only do work (reading file data, parsing it,
27             turning the parsed data in an object) when absolutely necessary.
28              
29             PullParserI provides a system for doing exactly that. As a PullParser you
30             need a chunk. A chunk is just a Bio::Root::IO that contains all the raw data
31             you would want to parse. You can use the chunk() method to create a chunk from
32             a filename, existing filehandle or even a string. If you make a chunk from a
33             large file, but actually only want your chunk to be some portion of the whole
34             file, supply start and end amounts in bytes to chunk() at the same time.
35             The methods _chunk_seek() and _chunk_tell() provide seeks and tells that are
36             relative to the start and end of your chunk, not the whole file.
37              
38             The other thing you will need to decide when making a chunk is how to handle
39             piped input. A PullParser typically needs seekable data to parse, so if your
40             data is piped in and unseekable, you must decide between creating a temp file
41             or reading the input into memory, which will be done before the chunk becomes
42             usable and you can begin any parsing. Alternatively you can choose to force
43             a sequential read, in which case you can make use of _dependencies() to define
44             the linear order of methods that would result in the file being read
45             sequentially. The return value of _sequential() is also useful here, if you
46             would need to cache some data or otherwise behave differently during a
47             sequential read.
48              
49             The main method in the system is get_field(). This method relies on the
50             existance of a private hash reference accessible to it with the method
51             _fields(). That hash ref should have as keys all the sorts of data you will want
52             to parse (eg. 'score'), and prior to parsing the values would be undefined. A
53             user of your module can then call either $module-Eget_field('score') or
54             $module-Escore and get_field will either return the answer from
55             $self-E_fields-E{score} if it is defined, or call a method _discover_score()
56             first if not. So for the system to work you need to define a _discover_*()
57             method for every field in the fields hash, and ensure that the method stores an
58             answer in the fields hash.
59              
60             How you implement your _discover_* methods is up to you, though you should never
61             call a _discover_* method directly yourself; always use get_field(), since
62             get_field() will deal with calling dependent methods for you if a forced
63             sequenctial read is in progress due to piped input. You will almost certainly
64             want to make use of the various chunk-related methods of this class (that are
65             denoted private by the leading '_'; this means you can use them as the author of
66             a parser class, but users of your parser should not).
67              
68             Primary amongst them is _*_chunk_by_end() to which you provide text that
69             represents the end of your desired chunk and it does a readline with your
70             argument as $/. The chunk knows about its line-endings, so if you want your
71             end definition to include a new line, just always use "\n" and PullParserI will
72             do any necessary conversion for you.
73              
74             If your input data is hierarchical (eg. report-Emany results-Emany hits-Emany
75             hsps), and you want an object at the leaf of the hierarchy to have access to
76             information that is shared amongst all of them (is parsed in the root), you
77             don't have to copy the data to each leaf object; simply by defining parent(),
78             when you call get_field() and the requested field isn't in your leaf's fields
79             hash, the leaf's parent will be asked for the field instead, and so on till
80             root.
81              
82             See Bio::SearchIO::hmmer_pull for an example of implementing a parser using
83             PullParserI.
84              
85             =head1 FEEDBACK
86              
87             =head2 Mailing Lists
88              
89             User feedback is an integral part of the evolution of this and other
90             Bioperl modules. Send your comments and suggestions preferably to
91             the Bioperl mailing list. Your participation is much appreciated.
92              
93             bioperl-l@bioperl.org - General discussion
94             http://bioperl.org/wiki/Mailing_lists - About the mailing lists
95              
96             =head2 Support
97              
98             Please direct usage questions or support issues to the mailing list:
99              
100             I
101              
102             rather than to the module maintainer directly. Many experienced and
103             reponsive experts will be able look at the problem and quickly
104             address it. Please include a thorough description of the problem
105             with code and data examples if at all possible.
106              
107             =head2 Reporting Bugs
108              
109             Report bugs to the Bioperl bug tracking system to help us keep track
110             of the bugs and their resolution. Bug reports can be submitted via the
111             web:
112              
113             https://github.com/bioperl/bioperl-live/issues
114              
115             =head1 AUTHOR - Sendu Bala
116              
117             Email bix@sendu.me.uk
118              
119             =head1 CONTRIBUTORS
120              
121             Inspired by a posting by Aaron J. Mackey
122              
123             =head1 APPENDIX
124              
125             The rest of the documentation details each of the object methods.
126             Internal methods are usually preceded with a _
127              
128             =cut
129              
130             # Let the code begin...
131              
132             package Bio::PullParserI;
133              
134 2     2   11 use vars qw($AUTOLOAD $FORCE_TEMP_FILE);
  2         2  
  2         90  
135 2     2   7 use strict;
  2         2  
  2         34  
136              
137 2     2   6 use Bio::Root::IO;
  2         2  
  2         38  
138              
139 2     2   5 use base qw(Bio::Root::RootI);
  2         2  
  2         135  
140              
141             BEGIN {
142             # chunk() needs perl 5.8 feature for modes other than temp_file, so will
143             # workaround by forcing temp_file mode in <5.8. Could also rewrite using
144             # IO::String, but don't want to.
145 2 50   2   2472 if ($] < 5.008) {
146 0         0 $FORCE_TEMP_FILE = 1;
147             }
148             }
149              
150             =head2 _fields
151              
152             Title : _fields
153             Usage : $obj->_fields( { field1 => undef } );
154             my $fields_ref = $obj->_fields;
155             Function: Get/set the hash reference containing all the fields for this parser
156             Returns : hash ref
157             Args : none to get, OR hash ref to set
158              
159             =cut
160              
161             sub _fields {
162 15738     15738   11136 my $self = shift;
163 15738 100       17931 if (@_) {
164 268         274 $self->{_fields} = shift;
165             }
166 15738 50       18222 unless (defined $self->{_fields}) {
167 0         0 $self->{_fields} = { };
168             }
169 15738         30104 return $self->{_fields};
170             }
171              
172             =head2 has_field
173              
174             Title : has_field
175             Usage : if ($obj->has_field('field_name') {...}
176             Function: Ask if a particular object has a given field (doesn't ask ancestors)
177             Returns : boolean
178             Args : string (the field name to test)
179              
180             =cut
181              
182             sub has_field {
183 6     6 1 8 my ($self, $desired) = @_;
184 6 50       13 $desired || return;
185 6         11 return exists $self->_fields->{$desired};
186             }
187              
188             =head2 get_field
189              
190             Title : get_field
191             Usage : my $field_value = $obj->get_field('field_name');
192             Function: Get the value of a given field. If this $obj doesn't have the field,
193             it's parent() will be asked, and so on until there are no more
194             parents.
195             Returns : scalar, warns if a value for the field couldn't be found and returns
196             undef.
197             Args : string (the field to get)
198              
199             =cut
200              
201             sub get_field {
202 3205     3205 1 2328 my $self = shift;
203 3205   50     4204 my $desired = shift || return keys %{$self->_fields};
204 3205 100       3633 if (exists $self->_fields->{$desired}) {
205 2773 100       2720 unless (defined $self->_fields->{$desired}) {
206 1246         1450 my $method = '_discover_'.$desired;
207            
208 1246         1455 my $dependency = $self->_dependencies($desired);
209 1246 100 100     2160 if ($dependency && ! defined $self->_fields->{$dependency}) {
210 510         677 $self->get_field($dependency);
211             }
212            
213             # it might exist now
214 1246 100       1532 $self->$method unless defined $self->_fields->{$desired};
215             }
216 2773         3271 return $self->_fields->{$desired};
217             }
218            
219             # is it a field of our parent? (checks all ancestors)
220 432 100       612 if (my $parent = $self->parent) {
221 400         543 return $parent->get_field($desired);
222             }
223            
224 32         65 $desired =~ s/_discover_//;
225 32         147 $self->warn("This report does not hold information about '$desired'");
226 32         107 return;
227             }
228              
229             =head2 parent
230              
231             Title : parent
232             Usage : $obj->parent($parent_obj);
233             my $parent_obj = $obj->parent;
234             Function: Get/set the parent object of this one.
235             Returns : Bio::PullParserI
236             Args : none to get, OR Bio::PullParserI to set
237              
238             =cut
239              
240             sub parent {
241 788     788 1 596 my $self = shift;
242 788 100       1033 if (@_) { $self->{parent} = shift }
  254         283  
243 788   100     1778 return $self->{parent} || return;
244             }
245              
246             =head2 chunk
247              
248             Title : chunk
249             Usage : $obj->chunk($filename);
250             my $chunk = $obj->chunk;
251             Function: Get/set the chunk of this parser.
252             Returns : Bio:Root::IO
253             Args : none to get, OR
254             First argument of a GLOB reference, filename string, string data to
255             treat as the chunk, or Bio::Root::IO.
256             Optionally, also provide:
257             -start => int : the byte position within the thing described by the
258             first argument to consider as the start of this
259             chunk (default 0)
260             -end => int : the byte position to consider as the end (default
261             true end)
262             -piped_behaviour => 'memory'|'temp_file'|'sequential_read'
263              
264             The last option comes into effect when the first argument is
265             something that cannot be seeked (eg. piped input filehandle).
266             'memory' means read all the piped input into a string
267             first, then set the chunk to that string.
268             'temp_file' means read all the piped input and output it to
269             a temp file, then set the chunk to that temp file.
270             'sequential_read' means that the piped input should be read
271             sequentially and your parsing code must cope with
272             not being able to seek.
273             'memory' is the fastest but uses the most memory. 'temp_file' and
274             'sequential_read' can be slow, with 'temp_file' being the most memory
275             efficient but requiring disc space. The default is 'sequential_read'.
276             Note that in versions of perl earlier than 5.8 only temp_file works
277             and will be used regardless of what value is supplied here.
278              
279             =cut
280              
281             sub chunk {
282 6531     6531 1 4363 my $self = shift;
283            
284 6531 100       7614 if (@_) {
285 218   33     326 my $thing = shift || $self->throw("Trying to set chunk() to an undefined value");
286 218 50       932 if (ref($thing) eq 'GLOB') {
    100          
    50          
287 0         0 $self->{_chunk} = Bio::Root::IO->new(-fh => $thing);
288             }
289             elsif (ref(\$thing) eq 'SCALAR') {
290 14 50 33     210 if ($thing !~ /\n/ && -e $thing) {
291 14         81 $self->{_chunk} = Bio::Root::IO->new(-file => $thing);
292             }
293             else {
294 0 0       0 unless ($FORCE_TEMP_FILE) {
295             # treat a string as a filehandle
296 0 0       0 open my $fake_fh, "+<", \$thing or $self->throw("Could not open file '$thing': $!"); # requires perl 5.8
297 0         0 $self->{_chunk} = Bio::Root::IO->new(-fh => $fake_fh);
298             }
299             else {
300 0         0 my ($handle) = $self->{_chunk}->tempfile();
301 0         0 print $handle $thing;
302 0         0 $self->{_chunk} = Bio::Root::IO->new(-fh => $handle);
303             }
304             }
305             }
306             elsif ($thing->isa('Bio::Root::IO')) {
307 204         325 $self->{_chunk} = $thing;
308             }
309             else {
310 0         0 $self->throw("Unknown input into chunk()");
311             }
312            
313 218         190 my ($piped_behaviour, $start, $end);
314 218 50       343 if (@_) {
315 218         510 ($piped_behaviour, $start, $end) =
316             $self->_rearrange([qw(PIPED_BEHAVIOUR START END)], @_);
317             }
318 218   100     624 $piped_behaviour ||= 'sequential_read';
319 218 50       342 $FORCE_TEMP_FILE && ($piped_behaviour = 'temp_file');
320 218   100     315 $start ||= 0;
321 218         340 $self->_chunk_true_start($start);
322 218         279 $self->_chunk_true_end($end);
323            
324             # determine if the chunk is seekable
325 218         387 my $fh = $self->{_chunk}->_fh;
326 218         637 seek($fh, 0, 0);
327 218         1520 my $first_line = <$fh>;
328 218         332 seek($fh, 0, 0);
329 218         286 my $seekable = tell($fh) == 0;
330 218 50       295 unless ($seekable) {
331 0 0       0 if ($piped_behaviour eq 'memory') {
    0          
    0          
332 0         0 my $string = $first_line;
333 0         0 while (<$fh>) {
334 0         0 $string .= $_;
335             }
336 0         0 $self->chunk($string);
337             }
338             elsif ($piped_behaviour eq 'temp_file') {
339 0         0 my ($handle) = $self->{_chunk}->tempfile();
340 0         0 print $handle $first_line;
341 0         0 while (<$fh>) {
342 0         0 print $handle $_;
343             }
344 0         0 seek($handle, 0, 0);
345 0         0 $self->chunk($handle);
346             }
347             elsif ($piped_behaviour eq 'sequential_read') {
348 0         0 $self->{_chunk}->_pushback($first_line);
349 0         0 $self->_sequential(1);
350             }
351             else {
352 0         0 $self->throw("Unknown piped behaviour type '$piped_behaviour'");
353             }
354             }
355            
356             # determine our line ending
357 218 100       518 if ($first_line =~ /\r\n/) {
    50          
358 98         168 $self->_line_ending("\r\n");
359             }
360             elsif ($first_line =~ /\r/) {
361 0         0 $self->_line_ending("\r");
362             }
363             else {
364 120         205 $self->_line_ending("\n");
365             }
366             }
367            
368 6531   50     14491 return $self->{_chunk} || return;
369             }
370              
371             =head2 _sequential
372              
373             Title : _sequential
374             Usage : if ($obj->_sequential) {...}
375             Function: Ask if we have to do operations such that the input is read
376             sequentially.
377             Returns : boolean
378             Args : none to get, OR boolean to set (typically, you should never set this
379             yourself)
380              
381             =cut
382              
383             sub _sequential {
384 4079     4079   2492 my $self = shift;
385 4079 50       5107 if (@_) {
386 0         0 $self->{_sequential} = shift;
387             }
388 4079   50     11732 return $self->{_sequential} || 0;
389             }
390              
391             =head2 _dependencies
392              
393             Title : _dependencies
394             Usage : $obj->_dependencies( { field1 => field2 } );
395             my $dependancy = $obj->_dependencies('field_name');
396             Function: Set the fields that are dependent on each other, or get the field
397             than another is dependent upon.
398             Returns : string (a field name)
399             Args : string (a field name) to get, OR hash ref to initially set, with
400             field names as keys and values, key field being dependent upon value
401             field.
402              
403             =cut
404              
405             sub _dependencies {
406 1514     1514   1188 my ($self, $thing) = @_;
407 1514 50       1813 $thing || return;
408 1514 100       1740 if (ref($thing) eq 'HASH') {
409 268         431 $self->{_dependencies} = $thing;
410             }
411             else {
412 1246         1824 return $self->{_dependencies}->{$thing};
413             }
414             }
415              
416             =head2 _chunk_true_start
417              
418             Title : _chunk_true_start
419             Usage : my $true_start = $obj->_chunk_true_start;
420             Function: Get/set the true start position of the chunk within the filehandle
421             it is part of.
422             Returns : int
423             Args : none to get, OR int to set (typically, you won't set this yourself)
424              
425             =cut
426              
427             sub _chunk_true_start {
428 6251     6251   4042 my $self = shift;
429 6251 100       7113 if (@_) {
430 218         241 $self->{_chunk_start} = shift;
431             }
432 6251   100     11459 return $self->{_chunk_start} || 0;
433             }
434              
435             =head2 _chunk_true_end
436              
437             Title : _chunk_true_end
438             Usage : my $true_end = $obj->_chunk_true_end;
439             Function: Get/set for the true end position of the chunk within the filehandle
440             it is part of.
441             Returns : int
442             Args : none to get, OR int to set (typically, you won't set this yourself)
443              
444             =cut
445              
446             sub _chunk_true_end {
447 3623     3623   2506 my $self = shift;
448 3623 100       4251 if (@_) {
449 218         217 $self->{_chunk_end} = shift;
450             }
451 3623         5964 return $self->{_chunk_end};
452             }
453              
454             =head2 _line_ending
455              
456             Title : _line_ending
457             Usage : my $line_ending = $obj->_line_ending;
458             Function: Get/set for the line ending for the chunk.
459             Returns : string
460             Args : none to get, OR string to set (typically, you won't set this
461             yourself)
462              
463             =cut
464              
465             sub _line_ending {
466 1326     1326   938 my $self = shift;
467 1326 100       1695 if (@_) {
468 218         264 $self->{_chunk_line_ending} = shift;
469             }
470 1326         1434 return $self->{_chunk_line_ending};
471             }
472              
473             =head2 _chunk_seek
474              
475             Title : _chunk_seek
476             Usage : $obj->_chunk_seek($pos);
477             Function: seek() the chunk to the provided position in bytes, relative to the
478             defined start of the chunk within its filehandle.
479              
480             In _sequential() mode, this function does nothing.
481              
482             Returns : n/a
483             Args : int
484              
485             =cut
486              
487             sub _chunk_seek {
488 937     937   776 my ($self, $pos) = @_;
489 937 50       1204 $self->throw("Undefined position passed") unless defined $pos;
490 937 50       1148 return if $self->_sequential;
491            
492 937         1120 my $fh = $self->chunk->_fh;
493            
494             # seek to the defined start
495 937         1304 seek($fh, $self->_chunk_true_start, 0);
496            
497             # now seek to desired position relative to defined start
498 937         1408 seek($fh, $pos, 1);
499             }
500              
501             =head2 _chunk_tell
502              
503             Title : _chunk_seek
504             Usage : my $pos = $obj->_chunk_tell;
505             Function: Get the current tell() position within the chunk, relative to the
506             defined start of the chunk within its filehandle.
507              
508             In _sequential() mode, this function does nothing.
509              
510             Returns : int
511             Args : none
512              
513             =cut
514              
515             sub _chunk_tell {
516 2876     2876   2051 my $self = shift;
517 2876 50       2754 return if $self->_sequential;
518            
519 2876         3041 my $fh = $self->chunk->_fh;
520 2876         3901 return tell($fh) - $self->_chunk_true_start;
521             }
522              
523             =head2 _get_chunk_by_nol
524              
525             Title : _chunk_seek
526             Usage : my $string = $obj->_get_chunk_by_nol;
527             Function: Get a chunk of chunk() from the current position onward for the given
528             number of lines.
529             Returns : string
530             Args : int (number of lines you want)
531              
532             =cut
533              
534             sub _get_chunk_by_nol {
535 430     430   428 my ($self, $nol) = @_;
536 430 50       593 $nol > 0 || $self->throw("Can't request a chunk of fewer than 1 lines");
537            
538             # hope that $/ is \n
539            
540 430         284 my ($line, $count);
541 430         496 while (defined($_ = $self->chunk->_readline)) {
542 1175         1239 $line .= $_;
543 1175         846 $count++;
544 1175 100       1989 last if $count == $nol;
545             }
546            
547 430         522 my $current = $self->_chunk_tell;
548 430   50     761 my $end = ($current || 0) + $self->_chunk_true_start;
549 430 100 33     874 if (! $current || ($self->_chunk_true_end ? $end <= $self->_chunk_true_end : 1)) {
    100          
550 301         813 return $line;
551             }
552 129         391 return;
553             }
554              
555             =head2 _get_chunk_by_end
556              
557             Title : _get_chunk_by_end
558             Usage : my $string = $obj->_get_chunk_by_end;
559             Function: Get a chunk of chunk() from the current position onward till the end
560             of the line, as defined by the supplied argument.
561             Returns : string
562             Args : string (line ending - if you want the line ending to include a new
563             line, always use \n)
564              
565             =cut
566              
567             sub _get_chunk_by_end {
568 865     865   801 my ($self, $chunk_ending) = @_;
569            
570 865         993 my $start = $self->_chunk_tell;
571            
572 865         1029 my $line_ending = $self->_line_ending;
573 865         1933 $chunk_ending =~ s/\n/$line_ending/g;
574 865   50     2597 local $/ = $chunk_ending || '';
575 865         1002 my $line = $self->chunk->_readline;
576            
577 865         1213 my $current = $self->_chunk_tell;
578 865   100     1487 my $end = ($current || 0) + $self->_chunk_true_start;
579 865 100 66     1738 if (! $current || ($self->_chunk_true_end ? $end <= $self->_chunk_true_end : 1)) {
    100          
580 612         2193 return $line;
581             }
582            
583 253         320 $self->_chunk_seek($start);
584 253         972 return;
585             }
586              
587             =head2 _find_chunk_by_end
588              
589             Title : _find_chunk_by_end
590             Usage : my $string = $obj->_find_chunk_by_end;
591             Function: Get the start and end of what would be a chunk of chunk() from the
592             current position onward till the end of the line, as defined by the
593             supplied argument.
594              
595             In _sequential() mode, this function does nothing.
596              
597             Returns : _chunk_tell values for start and end in 2 element list
598             Args : string (line ending - if you want the line ending to include a new
599             line, always use \n)
600              
601             =cut
602              
603             sub _find_chunk_by_end {
604 243     243   242 my ($self, $chunk_ending) = @_;
605 243 50       309 return if $self->_sequential;
606            
607 243         346 my $line_ending = $self->_line_ending;
608 243         676 $chunk_ending =~ s/\n/$line_ending/g;
609 243   50     854 local $/ = $chunk_ending || '';
610            
611 243         316 my $start = $self->_chunk_tell;
612 243         310 $self->chunk->_readline;
613 243         365 my $end = $self->_chunk_tell;
614            
615 243         333 my $comp_end = $end + $self->_chunk_true_start;
616 243 100       305 if ($self->_chunk_true_end ? $comp_end <= $self->_chunk_true_end : 1) {
    100          
617 177         624 return ($start, $end);
618             }
619            
620 66         103 $self->_chunk_seek($start);
621 66         195 return;
622             }
623              
624             =head2 AUTOLOAD
625              
626             Title : AUTOLOAD
627             Usage : n/a
628             Function: Assumes that any unknown method called should be treated as
629             get_field($method_name).
630             Returns : n/a
631             Args : n/a
632              
633             =cut
634              
635             sub AUTOLOAD {
636 35     35   901 my $self = shift;
637 35 50       91 ref($self) || return;
638            
639 35         42 my $name = $AUTOLOAD;
640 35         175 $name =~ s/.*://; # strip fully-qualified portion
641            
642             # is it one of our fields?
643 35         80 return $self->get_field($name);
644             }
645              
646             1;