File Coverage

Bio/SeqIO/msout.pm
Criterion Covered Total %
statement 212 223 95.0
branch 54 62 87.1
condition 14 21 66.6
subroutine 29 30 96.6
pod 19 22 86.3
total 328 358 91.6


line stmt bran cond sub pod time code
1             # POD documentation - main docs before the code
2              
3             =head1 NAME
4              
5             Bio::SeqIO::msout - input stream for output by Hudson's ms
6              
7             =head1 SYNOPSIS
8              
9             Do not use this module directly. Use it via the Bio::SeqIO class.
10              
11             =head1 DESCRIPTION
12              
13             ms ( Hudson, R. R. (2002) Generating samples under a Wright-Fisher neutral
14             model. Bioinformatics 18:337-8 ) can be found at
15             http://home.uchicago.edu/~rhudson1/source/mksamples.html.
16              
17             Currently, this object can be used to read output from ms into seq objects.
18             However, because bioperl has no support for haplotypes created using an infinite
19             sites model (where '1' identifies a derived allele and '0' identifies an
20             ancestral allele), the sequences returned by msout are coded using A, T, C and
21             G. To decode the bases, use the sequence conversion table (a hash) returned by
22             get_base_conversion_table(). In the table, 4 and 5 are used when the ancestry is
23             unclear. This should not ever happen when creating files with ms, but it will be
24             used when creating msOUT files from a collection of seq objects ( To be added
25             later ). Alternatively, use get_next_hap() to get a string with 1's and 0's
26             instead of a seq object.
27              
28             =head2 Mapping to Finite Sites
29              
30             This object can now also be used to map haplotypes created using an infinite sites
31             model to sequences of arbitrary finite length. See set_n_sites() for more detail.
32             Thanks to Filipe G. Vieira for the idea and code.
33              
34             =head1 FEEDBACK
35              
36             =head2 Mailing Lists
37              
38             User feedback is an integral part of the evolution of this and other
39             Bioperl modules. Send your comments and suggestions preferably to the
40             Bioperl mailing list. Your participation is much appreciated.
41              
42             bioperl-l@bioperl.org - General discussion
43             http://bioperl.org/wiki/Mailing_lists - About the mailing lists
44              
45             =head2 Reporting Bugs
46              
47             Report bugs to the Bioperl bug tracking system to help us keep track
48             of the bugs and their resolution. Bug reports can be submitted via the
49             web:
50              
51             https://github.com/bioperl/bioperl-live/issues
52              
53             =head1 AUTHOR - Warren Kretzschmar
54              
55             This module was written by Warren Kretzschmar
56              
57             email: wkretzsch@gmail.com
58              
59             This module grew out of a parser written by Aida Andres.
60              
61             =head1 COPYRIGHT
62              
63             =head2 Public Domain Notice
64              
65             This software/database is ``United States Government Work'' under the
66             terms of the United States Copyright Act. It was written as part of
67             the authors' official duties for the United States Government and thus
68             cannot be copyrighted. This software/database is freely available to
69             the public for use without a copyright notice. Restrictions cannot
70             be placed on its present or future use.
71              
72             Although all reasonable efforts have been taken to ensure the accuracy
73             and reliability of the software and data, the National Human Genome
74             Research Institute (NHGRI) and the U.S. Government does not and cannot
75             warrant the performance or results that may be obtained by using this
76             software or data. NHGRI and the U.S. Government disclaims all
77             warranties as to performance, merchantability or fitness for any
78             particular purpose.
79              
80             =head1 METHODS
81              
82             =cut
83              
84             package Bio::SeqIO::msout;
85 1     1   4 use version;
  1         1  
  1         6  
86             our $API_VERSION = qv('1.1.8');
87              
88 1     1   65 use strict;
  1         1  
  1         16  
89 1     1   3 use base qw(Bio::SeqIO); # This ISA Bio::SeqIO object
  1         1  
  1         340  
90 1     1   260 use Bio::Seq::SeqFactory;
  1         2  
  1         1749  
91              
92             =head2 Methods for Internal Use
93              
94             =head3 _initialize
95              
96             Title : _initialize
97             Usage : $stream = Bio::SeqIO::msOUT->new($infile)
98             Function: extracts basic information about the file.
99             Returns : Bio::SeqIO object
100             Args : no_og, gunzip, gzip, n_sites
101             Details :
102             - include 'no_og' flag if the last population of an msout file contains
103             only one haplotype and you don't want the last haplotype to be
104             treated as the outgroup ( suggested when reading data created by ms ).
105             - including 'n_sites' (positive integer) causes all output haplotypes to be
106             mapped to a sequence of length 'n_sites'. See set_n_sites() for more details.
107              
108             =cut
109              
110             sub _initialize {
111 12     12   24 my ( $self, @args ) = @_;
112 12         32 $self->SUPER::_initialize(@args);
113              
114 12 50       31 unless ( defined $self->sequence_factory ) {
115 12         72 $self->sequence_factory( Bio::Seq::SeqFactory->new() );
116             }
117 12         45 my ($no_og) = $self->_rearrange( [qw(NO_OG)], @args );
118 12         43 my ($n_sites) = $self->_rearrange( [qw(N_SITES)], @args );
119              
120 12         152 my %initial_values = (
121             RUNS => undef,
122             N_SITES => undef,
123             SEGSITES => undef,
124             SEEDS => [],
125             MS_INFO_LINE => undef,
126             TOT_RUN_HAPS => undef,
127             POPS => [],
128             NEXT_RUN_NUM => undef, # What run is the next hap from? undef = EOF
129             LAST_READ_HAP_NUM => undef, # What did we just read from
130             LAST_HAPS_RUN_NUM => undef,
131             LAST_READ_POSITIONS => [],
132             LAST_READ_SEGSITES => undef,
133             BUFFER_HAP => undef,
134             NO_OUTGROUP => $no_og,
135             BASE_CONVERSION_TABLE_HASH_REF => {
136             'A' => 0,
137             'T' => 1,
138             'C' => 4,
139             'G' => 5,
140             },
141             );
142 12         47 foreach my $key ( keys %initial_values ) {
143 180         190 $self->{$key} = $initial_values{$key};
144             }
145 12         39 $self->set_n_sites($n_sites);
146              
147             # If the filehandle is defined open it and read a few lines
148 12 50       38 if ( ref( $self->{_filehandle} ) eq 'GLOB' ) {
149 12         33 $self->_read_start();
150 12         48 return $self;
151             }
152              
153             # Otherwise throw a warning
154             else {
155 0         0 $self->throw(
156             "No filehandle defined. Please define a file handle through -file when calling msout with Bio::SeqIO"
157             );
158             }
159             }
160              
161             =head3 _read_start
162              
163             Title : _read_start
164             Usage : $stream->_read_start()
165             Function: reads from the filehandle $stream->{_filehandle} all information up to the first haplotype (sequence). Closes the filehandle if all lines have been read.
166             Returns : void
167             Args : none
168              
169             =cut
170              
171             sub _read_start {
172 12     12   17 my $self = shift;
173              
174 12         15 my $fh_IN = $self->{_filehandle};
175              
176             # get the first five lines and parse for important info
177 12         28 my ( $ms_info_line, $seeds ) = $self->_get_next_clean_hap( $fh_IN, 2, 1 );
178              
179 12         64 my @ms_info_line = split( /\s+/, $ms_info_line );
180              
181 12         12 my ( $tot_pops, @pop_haplos );
182              
183             # Parsing the ms header line
184 12         14 shift @ms_info_line;
185 12         17 my $tot_run_haps = shift @ms_info_line;
186 12         20 my $runs = shift @ms_info_line;
187 12         15 my $segsites;
188              
189 12         29 foreach my $word ( 0 .. $#ms_info_line ) {
190 63 100       95 if ( $ms_info_line[$word] eq '-I' ) {
    100          
191 9         13 $tot_pops = $ms_info_line[ $word + 1 ];
192 9         28 for my $pop_num ( 1 .. $tot_pops ) {
193 27         57 push @pop_haplos, $ms_info_line[ $word + 1 + $pop_num ];
194             }
195              
196             # if @pop_haplos contains a non-digit, then there is an error in the msinfo line.
197 9 50 33     61 if ( !defined $pop_haplos[-1] || $pop_haplos[-1] =~ /\D/ ) {
198 0         0 $self->throw(
199             "Incorrect number of populations in the ms info line (after the -I specifier)"
200             );
201             }
202             }
203             elsif ( $ms_info_line[$word] eq '-s' ) {
204 9         16 $segsites = $ms_info_line[ $word + 1 ];
205             }
206 45         38 else { next; }
207             }
208              
209 12 100       25 unless (@pop_haplos) { @pop_haplos = ($tot_run_haps) }
  3         6  
210              
211 12         38 my @seeds = split( /\s+/, $seeds );
212              
213             # Save ms info data
214 12         24 $self->{RUNS} = $runs;
215 12         15 $self->{SEGSITES} = $segsites;
216 12         18 $self->{SEEDS} = \@seeds;
217 12         13 $self->{MS_INFO_LINE} = $ms_info_line;
218 12         12 $self->{TOT_RUN_HAPS} = $tot_run_haps;
219 12         21 $self->{POPS} = [@pop_haplos];
220              
221 12         24 return;
222             }
223              
224             =head2 Methods to Access Data
225              
226             =head3 get_segsites
227              
228             Title : get_segsites
229             Usage : $segsites = $stream->get_segsites()
230             Function: returns the number of segsites in the msOUT file (according to the msOUT header line's -s option), or the current run's segsites if -s was not specified in the command line (in this case the number of segsites varies from run to run).
231             Returns : scalar
232             Args : NONE
233              
234             =cut
235              
236             sub get_segsites {
237 7     7 1 3143 my $self = shift;
238 7 100       20 if ( defined $self->{SEGSITES} ) {
239 4         12 return $self->{SEGSITES};
240             }
241             else {
242 3         7 return $self->get_current_run_segsites;
243             }
244             }
245              
246             =head3 get_current_run_segsites
247              
248             Title : get_current_run_segsites
249             Usage : $segsites = $stream->get_current_run_segsites()
250             Function: returns the number of segsites in the run of the last read
251             haplotype (sequence).
252             Returns : scalar
253             Args : NONE
254              
255             =cut
256              
257             sub get_current_run_segsites {
258 138     138 1 2576 my $self = shift;
259 138         207 return $self->{LAST_READ_SEGSITES};
260             }
261              
262             =head3 get_n_sites
263              
264             Title : get_n_sites
265             Usage : $n_sites = $stream->get_n_sites()
266             Function: Gets the number of total sites (variable or not) to be output.
267             Returns : scalar if n_sites option is defined at call time of new()
268             Args : NONE
269             Note :
270             WARNING: Final sequence length might not be equal to n_sites if n_sites is
271             too close to number of segregating sites in the msout file.
272              
273             =cut
274              
275             sub get_n_sites {
276 134     134 1 3216 my ($self) = @_;
277              
278 134         149 return $self->{N_SITES};
279             }
280              
281             =head3 set_n_sites
282              
283             Title : set_n_sites
284             Usage : $n_sites = $stream->set_n_sites($value)
285             Function: Sets the number of total sites (variable or not) to be output.
286             Returns : 1 on success; throws an error if $value is not a positive integer or undef
287             Args : positive integer
288             Note :
289             WARNING: Final sequence length might not be equal to n_sites if it is
290             too close to number of segregating sites.
291             - n_sites needs to be at least as large as the number of segsites of
292             the next haplotype returned
293             - n_sites may also be set to undef, in which case haplotypes are returned
294             under the infinite sites model assumptions.
295              
296             =cut
297              
298             sub set_n_sites {
299 14     14 1 457 my ( $self, $value ) = @_;
300              
301             # make sure $value is a positive integer if it is defined
302 14 100       32 if ( defined $value ) {
303 7 100 66     57 $self->throw(
304             "first argument needs to be a positive integer. argument supplied: $value"
305             ) unless ( $value =~ m/^\d+$/ && $value > 0 );
306             }
307 13         17 $self->{N_SITES} = $value;
308              
309 13         16 return 1;
310             }
311              
312             =head3 get_runs
313              
314             Title : get_runs
315             Usage : $runs = $stream->get_runs()
316             Function: returns the number of runs in the msOUT file (according to the
317             msinfo line)
318             Returns : scalar
319             Args : NONE
320              
321             =cut
322              
323             sub get_runs {
324 7     7 1 3424 my $self = shift;
325 7         22 return $self->{RUNS};
326             }
327              
328             =head3 get_Seeds
329              
330             Title : get_Seeds
331             Usage : @seeds = $stream->get_Seeds()
332             Function: returns an array of the seeds used in the creation of the msOUT file.
333             Returns : array
334             Args : NONE
335             Details : In older versions, ms used three seeds. Newer versions of ms seem to
336             use only one (longer) seed. This function will return all the seeds
337             found.
338              
339             =cut
340              
341             sub get_Seeds {
342 7     7 1 3641 my $self = shift;
343 7         12 return @{ $self->{SEEDS} };
  7         27  
344             }
345              
346             =head3 get_Positions
347              
348             Title : get_Positions
349             Usage : @positions = $stream->get_Positions()
350             Function: returns an array of the names of each segsite of the run of the last
351             read hap.
352             Returns : array
353             Args : NONE
354             Details : The Positions may or may not vary from run to run depending on the
355             options used with ms.
356              
357             =cut
358              
359             sub get_Positions {
360 69     69 1 2929 my $self = shift;
361 69         73 return @{ $self->{LAST_READ_POSITIONS} };
  69         245  
362             }
363              
364             =head3 get_tot_run_haps
365              
366             Title : get_tot_run_haps
367             Usage : $number_of_haps_per_run = $stream->get_tot_run_haps()
368             Function: returns the number of haplotypes (sequences) in each run of the msOUT
369             file ( according to the msinfo line ).
370             Returns : scalar >= 0
371             Args : NONE
372             Details : This number should not vary from run to run.
373              
374             =cut
375              
376             sub get_tot_run_haps {
377 7     7 1 3462 my $self = shift;
378 7         20 return $self->{TOT_RUN_HAPS};
379             }
380              
381             =head3 get_ms_info_line
382              
383             Title : get_ms_info_line
384             Usage : $ms_info_line = $stream->get_ms_info_line()
385             Function: returns the header line of the msOUT file.
386             Returns : scalar
387             Args : NONE
388              
389             =cut
390              
391             sub get_ms_info_line {
392 7     7 1 3492 my $self = shift;
393 7         21 return $self->{MS_INFO_LINE};
394             }
395              
396             =head3 tot_haps
397              
398             Title : tot_haps
399             Usage : $number_of_haplotypes_in_file = $stream->tot_haps()
400             Function: returns the number of haplotypes (sequences) in the msOUT file.
401             Information gathered from msOUT header line.
402             Returns : scalar
403             Args : NONE
404              
405             =cut
406              
407             sub get_tot_haps {
408 12     12 0 12 my $self = shift;
409 12         48 return ( $self->{TOT_RUN_HAPS} * $self->{RUNS} );
410             }
411              
412             =head3 get_Pops
413              
414             Title : get_Pops
415             Usage : @pops = $stream->pops()
416             Function: returns an array of population sizes (order taken from the -I flag in
417             the msOUT header line). This array will include the last hap even if
418             it looks like an outgroup.
419             Returns : array of scalars > 0
420             Args : NONE
421              
422             =cut
423              
424             sub get_Pops {
425 91     91 1 2759 my $self = shift;
426 91         72 return @{ $self->{POPS} };
  91         197  
427             }
428              
429             =head3 get_next_run_num
430              
431             Title : get_next_run_num
432             Usage : $next_run_number = $stream->next_run_num()
433             Function: returns the number of the ms run that the next haplotype (sequence)
434             will be taken from (starting at 1). Returns undef if the complete
435             file has been read.
436             Returns : scalar > 0 or undef
437             Args : NONE
438              
439             =cut
440              
441             sub get_next_run_num {
442 162     162 1 7065 my $self = shift;
443 162         211 return $self->{NEXT_RUN_NUM};
444             }
445              
446             =head3 get_last_haps_run_num
447              
448             Title : get_last_haps_run_num
449             Usage : $last_haps_run_number = $stream->get_last_haps_run_num()
450             Function: returns the number of the ms run that the last haplotype (sequence)
451             was taken from (starting at 1). Returns undef if no hap has been
452             read yet.
453             Returns : scalar > 0 or undef
454             Args : NONE
455              
456             =cut
457              
458             sub get_last_haps_run_num {
459 128     128 1 100 my $self = shift;
460 128         100 return $self->{LAST_HAPS_RUN_NUM};
461             }
462              
463             =head3 get_last_read_hap_num
464              
465             Title : get_last_read_hap_num
466             Usage : $last_read_hap_num = $stream->get_last_read_hap_num()
467             Function: returns the number (starting with 1) of the last haplotype read from
468             the ms file
469             Returns : scalar >= 0
470             Args : NONE
471             Details : 0 means that no haplotype has been read yet. Is reset to 0 every run.
472              
473             =cut
474              
475             sub get_last_read_hap_num {
476 321     321 1 3795 my $self = shift;
477 321         332 return $self->{LAST_READ_HAP_NUM};
478             }
479              
480             =head3 outgroup
481              
482             Title : outgroup
483             Usage : $outgroup = $stream->outgroup()
484             Function: returns '1' if the msOUT stream has an outgroup. Returns '0'
485             otherwise.
486             Returns : '1' or '0'
487             Args : NONE
488             Details : This method will return '1' only if the last population in the msOUT
489             file contains only one haplotype. If the last population is not an
490             outgroup then create the msOUT object using 'no_og' as input flag.
491             Also, return 0, if the run has only one population.
492              
493             =cut
494              
495             sub outgroup {
496 6     6 1 2460 my $self = shift;
497 6         13 my @pops = $self->get_Pops;
498 6 100 66     50 if ( $pops[$#pops] == 1 && !defined $self->{NO_OUTGROUP} && @pops > 1 ) {
      66        
499 4         11 return 1;
500             }
501 2         6 else { return 0; }
502             }
503              
504             =head3 get_next_haps_pop_num
505              
506             Title : get_next_haps_pop_num
507             Usage : ($next_haps_pop_num, $num_haps_left_in_pop) = $stream->get_next_haps_pop_num()
508             Function: First return value is the population number (starting with 1) the
509             next hap will come from. The second return value is the number of haps
510             left to read in the population from which the next hap will come.
511             Returns : (scalar > 0, scalar > 0)
512             Args : NONE
513              
514             =cut
515              
516             sub get_next_haps_pop_num {
517 39     39 1 38 my $self = shift;
518 39         52 my $last_read_hap = $self->get_last_read_hap_num;
519 39         57 my @pops = $self->get_Pops;
520              
521 39         117 foreach my $pop_num ( 0 .. $#pops ) {
522 55 100       105 if ( $last_read_hap < $pops[$pop_num] ) {
523 37         88 return ( $pop_num + 1, $pops[$pop_num] - $last_read_hap );
524             }
525 18         22 else { $last_read_hap -= $pops[$pop_num] }
526             }
527              
528             # In this case we're at the beginning of the next run
529 2         4 return ( 1, $pops[0] );
530             }
531              
532             =head3 get_next_seq
533              
534             Title : get_next_seq
535             Usage : $seq = $stream->get_next_seq()
536             Function: reads and returns the next sequence (haplotype) in the stream
537             Returns : Bio::Seq object or void if end of file
538             Args : NONE
539             Note : This function is included only to conform to convention. The
540             returned Bio::Seq object holds a halpotype in coded form. Use the hash
541             returned by get_base_conversion_table() to convert 'A', 'T', 'C', 'G'
542             back into 1,2,4 and 5. Use get_next_hap() to retrieve the halptoype as
543             a string of 1,2,4 and 5s instead.
544              
545             =cut
546              
547             sub get_next_seq {
548 139     139 1 5885 my $self = shift;
549 139         202 my $seqstring = $self->get_next_hap;
550              
551 135 100       182 return unless ($seqstring);
552              
553             # Used to create unique ID;
554 128         157 my $run = $self->get_last_haps_run_num;
555              
556             # Converting numbers to letters so that the haplotypes can be stored as a
557             # seq object
558 128         173 my $rh_base_conversion_table = $self->get_base_conversion_table;
559 128         90 foreach my $base ( keys %{$rh_base_conversion_table} ) {
  128         283  
560 512         3964 $seqstring =~ s/($rh_base_conversion_table->{$base})/$base/g;
561             }
562              
563             # Fill in non-variable positions
564 128         237 my $segsites = $self->get_current_run_segsites;
565 128         159 my $n_sites = $self->get_n_sites;
566 128 100       208 if ( defined($n_sites) ) {
567              
568             # make sure that n_sites is at least as large
569             # as segsites for each run. Throw an exception otherwise.
570 63 100       133 $self->throw( "n_sites:\t$n_sites"
571             . "\nsegsites:\t$segsites"
572             . "\nrun:\t$run"
573             . "\nn_sites needs to be at least the number of segsites of every run"
574             ) unless $segsites <= $n_sites;
575              
576 62         48 my $seq_len = 0;
577 62         52 my @seq;
578 62         80 my @pos = $self->get_Positions;
579 62         129 for ( my $i = 0 ; $i <= $#pos ; $i++ ) {
580 434         462 $pos[$i] *= $n_sites;
581 434         521 push( @seq, "A" x ( $pos[$i] - 1 - $seq_len ) );
582 434         312 $seq_len += length( $seq[-1] );
583 434         378 push( @seq, substr( $seqstring, $i, 1 ) );
584 434         594 $seq_len += length( $seq[-1] );
585             }
586 62         84 push( @seq, "A" x ( $n_sites - $seq_len ) );
587 62         246 $seqstring = join( "", @seq );
588             }
589              
590 127         159 my $last_read_hap = $self->get_last_read_hap_num;
591 127         181 my $id = 'Hap_' . $last_read_hap . '_Run_' . $run;
592 127 100       344 my $description =
593             "Segsites $segsites;"
594             . " Positions "
595             . ( defined $n_sites ? $n_sites : $segsites ) . ";"
596             . " Haplotype $last_read_hap;"
597             . " Run $run;";
598 127         244 my $seq = $self->sequence_factory->create(
599             -seq => $seqstring,
600             -id => $id,
601             -desc => $description,
602             -alphabet => q(dna),
603             -direct => 1,
604             );
605              
606 127         334 return $seq
607              
608             }
609              
610             =head3 next_seq
611              
612             Title : next_seq
613             Usage : $seq = $stream->next_seq()
614             Function: Alias to get_next_seq()
615             Returns : Bio::Seq object or void if end of file
616             Args : NONE
617             Note : This function is only included for convention. It calls get_next_seq().
618             See get_next_seq() for details.
619              
620             =cut
621              
622             sub next_seq {
623 0     0 1 0 my $self = shift;
624 0         0 return $self->get_next_seq();
625             }
626              
627             =head3 get_next_hap
628              
629             Title : get_next_hap
630             Usage : $hap = $stream->next_hap()
631             Function: reads and returns the next sequence (haplotype) in the stream.
632             Returns undef if all sequences in stream have been read.
633             Returns : Haplotype string (e.g. '110110000101101045454000101'
634             Args : NONE
635             Note : Use get_next_seq() if you want the halpotype returned as a
636             Bio::Seq object.
637              
638             =cut
639              
640             sub get_next_hap {
641 148     148 1 2246 my $self = shift;
642              
643             # Let's figure out how many haps to read from the input file so that
644             # we get back to the beginning of the next run.
645              
646 148         114 my $end_run = 0;
647 148 100       288 if ( $self->{TOT_RUN_HAPS} == $self->{LAST_READ_HAP_NUM} + 1 ) {
648 23         18 $end_run = 1;
649             }
650              
651             # Setting last_haps_run_num
652 148         165 $self->{LAST_HAPS_RUN_NUM} = $self->get_next_run_num;
653              
654 148         162 my $last_read_hap = $self->get_last_read_hap_num;
655             my ($seqstring) =
656 148         211 $self->_get_next_clean_hap( $self->{_filehandle}, 1, $end_run );
657 146 100 100     277 if ( !defined $seqstring && $last_read_hap < $self->get_tot_haps ) {
658 2         7 $self->throw(
659             "msout file has only $last_read_hap hap(s), which is less than indicated in msinfo line ( "
660             . $self->get_tot_haps
661             . " )" );
662             }
663              
664 144         174 return $seqstring;
665             }
666              
667             =head3 get_next_pop
668              
669             Title : get_next_pop
670             Usage : @seqs = $stream->next_pop()
671             Function: reads and returns all the remaining sequences (haplotypes) in the
672             population of the next sequence. Returns an empty list if no more
673             haps remain to be read in the stream
674             Returns : array of Bio::Seq objects
675             Args : NONE
676              
677             =cut
678              
679             sub get_next_pop {
680 18     18 1 8087 my $self = shift;
681              
682             # Let's figure out how many haps to read from the input file so that
683             # we get back to the beginning of the next run.
684              
685 18         38 my @pops = $self->get_Pops;
686 18         23 my @seqs; # holds Bio::Seq objects to return
687              
688             # Determine number of the pop that the next hap will be taken from
689 18         30 my ( $next_haps_pop_num, $haps_to_pull ) = $self->get_next_haps_pop_num;
690              
691             # If $haps_to_pull == 0, then we need to pull the whole population
692 18 50       44 if ( $haps_to_pull == 0 ) {
693 0         0 $haps_to_pull = $pops[ $next_haps_pop_num - 1 ];
694             }
695              
696 18         25 for ( 1 .. $haps_to_pull ) {
697 50         69 my $seq = $self->get_next_seq;
698 50 100       76 next unless defined $seq;
699              
700             # Add Population number information to description
701 47         104 $seq->display_id(" Population number $next_haps_pop_num;");
702 47         70 push @seqs, $seq;
703             }
704              
705 18         52 return @seqs;
706             }
707              
708             =head3 next_run
709              
710             Title : next_run
711             Usage : @seqs = $stream->next_run()
712             Function: reads and returns all the remaining sequences (haplotypes) in the ms
713             run of the next sequence. Returns an empty list if all haps have been
714             read from the stream.
715             Returns : array of Bio::Seq objects
716             Args : NONE
717              
718             =cut
719              
720             sub get_next_run {
721 21     21 0 8667 my $self = shift;
722              
723             # Let's figure out how many haps to read from the input file so that
724             # we get back to the beginning of the next run.
725              
726 21         43 my ( $next_haps_pop_num, $haps_to_pull ) = $self->get_next_haps_pop_num;
727 21         23 my @seqs;
728              
729 21         34 my @pops = $self->get_Pops;
730              
731 21         40 foreach ( $next_haps_pop_num .. $#pops ) {
732 20         30 $haps_to_pull += $pops[$_];
733             }
734              
735             # Read those haps from the input file
736             # Next hap read will be the first hap of the first pop of the next run.
737              
738 21         34 for ( 1 .. $haps_to_pull ) {
739 75         106 my $seq = $self->get_next_seq;
740 71 100       110 next unless defined $seq;
741              
742 68         93 push @seqs, $seq;
743             }
744              
745 17         73 return @seqs;
746             }
747              
748             =head2 Methods to Retrieve Constants
749              
750             =head3 base_conversion_table
751              
752             Title : get_base_conversion_table
753             Usage : $table_hash_ref = $stream->get_base_conversion_table()
754             Function: returns a reference to a hash. The keys of the hash are the letters '
755             A','T','G','C'. The values associated with each key are the value that
756             each letter in the sequence of a seq object returned by a
757             Bio::SeqIO::msout stream should be translated to.
758             Returns : reference to a hash
759             Args : NONE
760             Synopsys:
761            
762             # retrieve the Bio::Seq object's sequence
763             my $haplotype = $seq->seq;
764            
765             # need to convert all letters to their corresponding numbers.
766             foreach my $base (keys %{$rh_base_conversion_table}){
767             $haplotype =~ s/($base)/$rh_base_conversion_table->{$base}/g;
768             }
769            
770             # $haplotype is now an ms style haplotype. (e.g. '100101101455')
771            
772             =cut
773              
774             sub get_base_conversion_table {
775 135     135 0 3375 my $self = shift;
776 135         166 return $self->{BASE_CONVERSION_TABLE_HASH_REF};
777             }
778              
779             ##############################################################################
780             ## subs for internal use only
781             ##############################################################################
782              
783             sub _get_next_clean_hap {
784              
785             #By Warren Kretzschmar
786              
787             # return the next non-empty line from file handle (chomped line)
788             # skipps to the next run if '//' is encountered
789 160     160   169 my ( $self, $fh, $times, $end_run ) = @_;
790 160         117 my @data;
791              
792 160 50       296 unless ( ref($fh) eq q(GLOB) ) { return; }
  0         0  
793              
794 160 50 33     505 unless ( defined $times && $times > 0 ) {
795 0         0 $times = 1;
796             }
797              
798 160 100       297 if ( defined $self->{BUFFER_HAP} ) {
799 28         35 push @data, $self->{BUFFER_HAP};
800 28         27 $self->{BUFFER_HAP} = undef;
801 28         24 $self->{LAST_READ_HAP_NUM}++;
802 28         27 $times--;
803             }
804              
805 160         265 while ( 1 <= $times-- ) {
806              
807             # Find next clean line
808 144         422 my $data = <$fh>;
809 144 100       216 last if !defined($data);
810 134         130 chomp $data;
811 134         382 while ( $data !~ /./ ) {
812 28         26 $data = <$fh>;
813 28         59 chomp $data;
814             }
815              
816             # If the next run is encountered here, then we have a programming
817             # or format error
818 134 50       214 if ( $data eq '//' ) { $self->throw("'//' found when not expected\n") }
  0         0  
819              
820 134         107 $self->{LAST_READ_HAP_NUM}++;
821 134         287 push @data, $data;
822             }
823              
824 160 100       210 if ($end_run) {
825 35         67 $self->_load_run_info($fh);
826             }
827              
828 158         284 return (@data);
829             }
830              
831             sub _load_run_info {
832              
833 35     35   35 my ( $self, $fh ) = @_;
834              
835 35         85 my $data = <$fh>;
836              
837             # getting rid of excess newlines
838 35   100     151 while ( defined($data) && $data !~ /./ ) {
839 28         120 $data = <$fh>;
840             }
841              
842             # In this case we are at EOF
843 35 100       73 if ( !defined($data) ) { $self->{NEXT_RUN_NUM} = undef; return; }
  5         9  
  5         5  
844              
845 30         61 while ( $data !~ /./ ) {
846 0         0 $data = <$fh>;
847 0         0 chomp $data;
848             }
849              
850 30         37 chomp $data;
851              
852             # If the next run is encountered, then skip to the next hap and save it in
853             # the buffer.
854 30 100       51 if ( $data eq '//' ) {
855 28         31 $self->{NEXT_RUN_NUM}++;
856 28         24 $self->{LAST_READ_HAP_NUM} = 0;
857 28         52 for ( 1 .. 3 ) {
858 84         121 $data = <$fh>;
859 84         172 while ( $data !~ /./ ) {
860 20         27 $data = <$fh>;
861 20         36 chomp $data;
862             }
863 84         93 chomp $data;
864              
865 84 100       141 if ( $_ eq '1' ) {
    100          
866 28         109 my @sites = split( /\s+/, $data );
867 28         57 $self->{LAST_READ_SEGSITES} = $sites[1];
868             }
869             elsif ( $_ eq '2' ) {
870 28         171 my @positions = split( /\s+/, $data );
871 28         29 shift @positions;
872 28         64 $self->{LAST_READ_POSITIONS} = \@positions;
873             }
874             else {
875 28 50       43 if ( !defined($data) ) {
876 0         0 $self->throw("run $self->{NEXT_RUN_NUM} has no haps./n");
877             }
878 28         61 $self->{BUFFER_HAP} = $data;
879             }
880             }
881             }
882             else {
883 2         7 $self->throw(
884             "'//' not encountered when expected. There are more haplos in one of the msOUT runs than advertised in the msinfo line."
885             );
886             }
887              
888             }
889             1;