File Coverage

blib/lib/Bio/ToolBox/db_helper.pm
Criterion Covered Total %
statement 130 671 19.3
branch 70 550 12.7
condition 20 117 17.0
subroutine 14 29 48.2
pod 16 17 94.1
total 250 1384 18.0


line stmt bran cond sub pod time code
1             package Bio::ToolBox::db_helper;
2             our $VERSION = '1.68';
3              
4             =head1 NAME
5              
6             Bio::ToolBox::db_helper - helper interface to various database formats
7              
8             =head1 DESCRIPTION
9              
10             In most cases, this module does not need to be used directly. The
11             methods available to L<Bio::ToolBox::Data> and L<Bio::ToolBox::Data::Feature>
12             provide convenient access to the methods described here.
13              
14             These are helper subroutines to work with relevant databases that can be
15             accessed through BioPerl modules. These include the L<Bio::DB::SeqFeature::Store>
16             relational database as well as L<Bio::DB::HTS>, L<Bio::DB::Sam>,
17             L<Bio::DB::BigWig>, L<Bio::DB::BigBed>, L<Bio::DB::USeq>, and
18             L<Bio::DB::Fasta> databases. The primary functions
19             include opening connections to these databases, verifying features found
20             within the database, generating lists of features or genomic intervals for
21             subsequent analysis, and collecting features and/or scores within regions
22             of interest. Collected scores may be summarized using a variety of statistical
23             methods.
24              
25             When collecting scores or features, the data may be hosted in a variety of
26             formats and locations. These include:
27              
28             =over 4
29              
30             =item SeqFeature database
31              
32             Genomic feature annotation, including genes, transcripts, exons, etc.,
33             may be stored in a L<Bio::DB::SeqFeature::Store> database. These
34             databases are backed by a relational database, such as
35             MySQL, PostgreSQL, or SQLite. Small GFF3 files may also be loaded into
36             an in-memory database. These are typically loaded from GFF3
37             files; see the adapter documentation for more information. Dataset
38             scores, such as from microarrays, may also be stored as the score
39             value in the source GFF file.
40              
41             References to local, binary, indexed files may also be included as
42             attributes to features stored in the database. This is legacy support
43             from the GBrowse genome browser days, and should
44             not be used in preference to other methods. Supported files include the
45             legacy binary wig files (F<.wib>, see L<Bio::Graphics::Wiggle>) using the
46             C<wigfile> attribute, or bigWig files using the C<wigfile> or C<bigwigfile>
47             attribute. The attribute values must be full paths.
48              
49             =item BigWig files
50              
51             BigWig files (F<.bw> and F<.bigwig>) are compressed, binary, indexed
52             versions of text wig files and may be accessed either locally or remotely.
53             They support extremely fast score retrieval from any genomic location of
54             any size without sacrificing resolution (spatial and numeric). See
55             L<http://genome.ucsc.edu/goldenPath/help/bigWig.html> for more information.
56             BigWig files are supported by either the L<Bio::DB::Big> or the
57             L<Bio::DB::BigWig> adapter, based on either the L<libBigWig|https://github.com/dpryan79/libBigWig>
58             or the UCSC L<kent|https://github.com/ucscGenomeBrowser/kent> libraries,
59             respectively; see their respective documentation for more information.
60              
61             =item Directory of BigWig files
62              
63             A directory containing two or more BigWig files is assembled into a
64             BigWigSet, allowing for metadata, such as strand, to be associated with
65             BigWig files. Additional metadata beyond the filename may be included in
66             a text file F<metadata.txt> within the directory. See the
67             L<Bio::DB::BigWigSet> adapter documentation for more information. When using
68             L<Bio::DB::Big> adapters, BigWigSet support is provided natively by
69             the BioToolBox package.
70              
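As a hedged illustration, a metadata file might look like the following
(the stanza layout and key names here are assumptions; consult the
L<Bio::DB::BigWigSet> documentation for the authoritative format):

    [forward.bw]
    type   = ChIPSeq
    strand = +

    [reverse.bw]
    type   = ChIPSeq
    strand = -
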
71             =item BigBed files
72              
73             BigBed files are compressed, binary, indexed versions of text BED files. See
74             L<http://genome.ucsc.edu/goldenPath/help/bigBed.html> for more information.
75             Both local and remote files may be accessed. BigBed files are supported by
76             either the L<Bio::DB::Big> or the L<Bio::DB::BigBed> adapter, based on either the
77             L<libBigWig|https://github.com/dpryan79/libBigWig> or the UCSC L<kent|https://github.com/ucscGenomeBrowser/kent> libraries,
78             respectively; see their respective documentation for more information.
79              
80             =item Bam files
81              
82             Bam files (F<.bam>) are compressed, binary, indexed versions of the text SAM file,
83             or sequence alignment map. They are used with next generation sequencing
84             technologies. They support individual alignment retrieval as well as
85             read depth coverage. Two different Bam file adapters are supported.
86              
87             The L<Bio::DB::HTS> adapter is an interface to the Bam (or Cram) file.
88             This is based on the modern HTSlib C library (version E<gt>= 1.0), successor
89             to the original samtools library. See L<http://www.htslib.org> for
90             more information.
91              
92             The L<Bio::DB::Sam> adapter is an older interface to the Bam alignment
93             file. This is based on the samtools C library version E<lt>= 0.1.19. While
94             this adapter is still supported, the above should be used for new installs.
95              
96             =item Cram files
97              
98             Cram files (F<.cram>) are similar to Bam files, but are much smaller due to
99             only storing sequence differences for each alignment. As such, they require
100             an indexed, reference fasta to regenerate the complete alignment. Cram files
101             are only supported by the L<Bio::DB::HTS> adapter.
102              
103             =item USeq files
104              
105             USeq files (F<.useq>) are compressed, binary, indexed files that support
106             multiple information types, including region intervals, BED type
107             annotations, or wig type scores distributed across the genome. They
108             support rapid, random access across the genome and are comparable to
109             both BigWig and BigBed files. See L<http://useq.sourceforge.net/useqArchiveFormat.html>
110             for more information. Files may only be local. These files are supported
111             by the L<Bio::DB::USeq> adapter.
112              
113             =item Fasta files
114              
115             Fasta files (F<.fa> or F<.fasta>) may be opened. Fasta files are indexed to
116             rapidly and randomly fetch genomic sequence. Three different adapters are
117             available for indexing the fasta file.
118              
119             The L<Bio::DB::HTS> adapter is preferentially used, and requires a
120             F<.fai> index file.
121              
122             The L<Bio::DB::Sam> adapter may alternatively be used, and also requires a
123             F<.fai> index file.
124              
125             The older style L<Bio::DB::Fasta> adapter is much slower, but will index
126             either a single genomic fasta file or a directory of individual chromosome
127             or contig fasta files.
128              
129             Additionally, genomic sequence stored in a L<Bio::DB::SeqFeature::Store>
130             annotation database may also be used.
131              
132             =back
133              
134             While the functions within this library may appear to be simply a rehashing of
135             the methods and functions in L<Bio::DB::SeqFeature::Store> and other modules, they
136             either provide a simpler interface to often-used database methodologies or are
137             designed to work intimately with the BioToolBox data format file and data structures
138             (see L<Bio::ToolBox::Data>). One key advantage to these functions is the ability
139             to work with datasets that are stranded (transcriptome data, for example).
140              
141             Historically, this module was initially written to use L<Bio::DB::GFF> for
142             database usage. It has since been re-written to use the L<Bio::DB::SeqFeature::Store>
143             database schema. Additional functionality has also been added for other databases
144             and file formats.
145              
146             Complete usage and examples for the functions are provided below.
147              
148             =head1 USAGE
149              
150             Call the module at the beginning of your perl script and include the
151             subroutine names to export. None are exported by default.
152              
153             use Bio::ToolBox::db_helper qw(
154             get_new_feature_list
155             get_db_feature
156             );
157              
158             This will export the indicated subroutine names into the current namespace.
159             Their usage is detailed below.
160              
161             =head1 EXPORTED SUBROUTINES
162              
163             =over 4
164              
165             =item C<$BAM_ADAPTER> variable
166              
167             =item C<$BIG_ADAPTER> variable
168              
169             These variables control which Bam and bigWig/bigBed adapters are used
170             during execution in installations where both adapters are present. These
171             variables are not set initially; adapters are chosen automatically when an
172             adapter is requested during execution and modules are evaluated for loading.
173              
174             =item use_bam_adapter
175              
176             =item use_big_adapter
177              
178             These are convenience methods for explicitly setting and retrieving the
179             C<$BAM_ADAPTER> and C<$BIG_ADAPTER> variables, respectively.
180              
181             Optionally pass the value to set. It will always return the value being used.
182             Values include the following:
183              
184             =over 4
185              
186             =item Bam adapter
187              
188             * hts Bio::DB::HTS (default if available)
189             * sam Bio::DB::Sam
190             * none no adapters allowed
191              
192             =item Big adapter
193              
194             * big Bio::DB::Big (default if available)
195             * ucsc Bio::DB::BigWig and Bio::DB::BigBed
196             * none no adapters allowed
197              
198             =back
199              
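For example, to prefer one adapter explicitly (a minimal sketch; the
values are those listed in the tables above):

    use Bio::ToolBox::db_helper qw(use_bam_adapter use_big_adapter);
    use_bam_adapter('hts');                # prefer Bio::DB::HTS for Bam files
    my $current = use_big_adapter('big');  # returns the adapter value in use
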
200             =item open_db_connection
201              
202             my $db_name = 'cerevisiae';
203             my $db = open_db_connection($db_name);
204            
205             my $file = '/path/to/file.bam';
206             my $db = open_db_connection($file);
207            
208             # within a forked child process
209             # reusing the same variable to simplify code
210             # pass second true value to avoid cache
211             $db = open_db_connection($file, 1);
212              
213             This is a simplified, generalized method that will open a connection to a
214             database or indexed file using any one of a number of different
215             BioPerl-style adapters. For local and remote files, the appropriate
216             adapter is determined by the file extension. Relational database
217             connection and authentication information is checked in a configuration
218             file. Once identified, the appropriate Perl module adapter is loaded
219             during run time automatically, saving the user from knowing in advance
220             to C<use> the appropriate module in the script.
221              
222             Pass the name of a relational database or the path of the database file to
223             the subroutine. The opened database object is returned. If it fails, then
224             an error message should be generated and nothing is returned.
225              
226             B<IMPORTANT:> If you are forking your perl process, B<always> re-open your
227             database objects in the child process, and pass a second true value
228             to avoid using the cached database object. By default, opened databases are
229             cached to improve efficiency, but this will be disastrous when crossing forks.
230              
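As a hedged sketch of fork-safe usage (assuming L<Parallel::ForkManager>
is installed; the file list is hypothetical):

    use Parallel::ForkManager;
    my @bam_files = qw(one.bam two.bam);   # hypothetical files
    my $pm = Parallel::ForkManager->new(4);
    foreach my $file (@bam_files) {
        $pm->start and next;
        # child process: bypass the cache with a second true value
        my $db = open_db_connection($file, 1);
        # ... collect data here ...
        $pm->finish;
    }
    $pm->wait_all_children;
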
231             =over 4
232              
233             =item SeqFeature Store database
234              
235             Provide the name of the relational database to open. Parameters for
236             connecting to a relational database are stored in the BioToolBox
237             configuration file, F<.biotoolbox.cfg>. These include database adaptors,
238             user name, password, etc. Further information regarding the configuration
239             file may be found in L<Bio::ToolBox::db_helper::config>.
240              
241             For SQLite databases, provide the path to the F<.sqlite> or F<.db> file.
242              
243             For in-memory databases, provide the path to the F<.gff> or F<.gff3> file.
244             The larger the file, the more memory and processing time is consumed as
245             the file is parsed into memory. This may be fine for small model organisms
246             like yeast, but not for vertebrate annotation.
247              
248             =item Bam file database
249              
250             Provide the path to the F<.bam> file. This may be a local file, or a remote
251             URL (generally supported by the adapter). If a local F<.bam.bai> index file
252             is not present, it will automatically be built prior to opening; this may
253             fail if the bam file is not sorted. Remote files are not indexed.
254              
255             =item Cram file database
256              
257             Provide the path to the local F<.cram> file. Currently, supplementary fasta
258             files are not supported, so the Cram file header must point to a valid
259             reference. Cram files will be indexed automatically if an index is not available.
260              
261             =item BigWig file database
262              
263             Provide the path to a local F<.bw> or F<.bigwig> file, or URL to a
264             remote file.
265              
266             =item BigWigSet database
267              
268             A local or remote directory of one or more BigWig files that can be treated
269             collectively as a database. A special text file may be included in the
270             directory to assign metadata to each BigWig file, including attributes such
271             as type, source, display name, strand, etc. See L<Bio::DB::BigWigSet> for
272             more information on the formatting of the metadata file. If a metadata
273             file is not included (it's not necessary), then the filenames are used
274             as the name and type. Strand may be parsed from the filenames if the
275             basenames end in F<.f> or F<.r>.
276              
277             =item BigBed database
278              
279             Provide the path to a local F<.bb> or F<.bigbed> file, or URL to a
280             remote file.
281              
282             =item USeq database
283              
284             Provide the path to a local F<.useq> file. Remote files are not
285             supported.
286              
287             =item Fasta database
288              
289             Provide the path to a local fasta file for rapidly fetching genomic
290             sequence. If the fasta file is not indexed, then it will be indexed
291             automatically upon opening.
292              
293             =back
294              
295             =item get_db_name
296              
297             my $name = get_db_name($db);
298              
299             This method will attempt to get the name of an opened Database
300             object if, for some reason, it's unknown, e.g. the user only provided an
301             opened db object. This only works for some databases, at least those I've
302             bothered to track down and find a usable API call to use.
303              
304             =item get_dataset_list
305              
306             my $db_name = 'cerevisiae';
307             my @types = get_dataset_list($db_name);
308              
309             This method will retrieve a list of the available features stored in the
310             database and return an array of the features' types.
311              
312             Pass either the name of the database or an established database object.
313             Supported databases include both L<Bio::DB::SeqFeature::Store> and
314             L<Bio::DB::BigWigSet> databases.
315              
316             For L<Bio::DB::SeqFeature::Store> databases, the type is represented as
317             "C<method:source>", corresponding to the third and second GFF columns,
318             respectively. The types are sorted alphabetically first by source,
319             then by method.
320              
321             For L<Bio::DB::BigWigSet> databases, the C<type>, C<method>, C<primary_tag>,
322             or C<display_name> attribute may be used, in that respective order of
323             availability. The list is sorted alphabetically.
324              
325             =item verify_or_request_feature_types
326              
327             # verify a dataset, either file or database type
328             my $db_name = 'cerevisiae';
329             my $dataset = 'microarray_data';
330             my @features = verify_or_request_feature_types(
331             db => $db_name,
332             feature => $dataset,
333             );
334             unless (@features) {
335             die " no valid feature provided!\n";
336             }
337            
338             # select a list of features
339             my $db_name = 'cerevisiae';
340             my @types = (); # user not provided
341             @types = verify_or_request_feature_types(
342             db => $db_name,
343             feature => \@types,
344             );
345             # user will be prompted to select from database list
346             if (@types) {
347             print "using types " . join(", ", @types);
348             }
349             else {
350             die " no valid features selected!\n";
351             }
352              
353             This subroutine will process a list of feature types or data sources to be
354             used for data or feature collection. There are two modes of action. If a
355             list was provided, it will process and verify the list. If no list was
356             provided, then the user will be prompted to select from a list of available
357             feature types in the provided database.
358              
359             The provided list may be a list of feature types or a list of single
360             file data sources, including Bam, bigWig, or bigBed files. If the list
361             includes feature types, they are verified as existing in the provided
362             database. If the list includes file names, the local files are checked
363             and the names prefixed with "C<file:>" for use downstream.
364              
365             If no list was provided, then a list of available feature types in the
366             provided database will be presented to the user, and the user prompted
367             to make a selection. One or more types may be selected, and a single
368             item may be enforced if requested. The response is filtered through
369             the C<parse_list> method from L<Bio::ToolBox::utility>, so a mix of single
370             numbers or a range of numbers may be accepted. The responses are then
371             validated.
372              
373             Two types of databases are supported: L<Bio::DB::SeqFeature::Store> and
374             L<Bio::DB::BigWigSet> databases. For SeqFeature Store databases, the
375             type is comprised of "C<method:source>", corresponding to the third and
376             second GFF columns. For BigWigSet databases, types correspond to either
377             the C<type>, C<method>, C<primary_tag>, or C<display_name> attributes,
378             in that order of availability.
379              
380             Feature types or files that pass validation are returned as a
381             list. Types or files that do not pass validation are printed to C<STDOUT>.
382              
383             To use this subroutine, pass an array with the following keys
384             and values. Not every key is required.
385              
386             =over 4
387              
388             =item db
389              
390             The name of the database or a reference to an established BioPerl
391             database object. A L<Bio::DB::SeqFeature::Store> or
392             L<Bio::DB::BigWigSet> database is typically used. Only required when
393             no features are provided.
394              
395             =item feature
396              
397             Pass either a single dataset name as a scalar or an anonymous array
398             reference of a list of dataset names. These may have been provided as
399             a command line option and need to be verified. If nothing is passed,
400             then a list of possible datasets will be presented to the user to be
401             chosen.
402              
403             =item prompt
404              
405             Provide a phrase to be prompted to the user to help in selecting
406             datasets from the list. If none is provided, a generic prompt will be
407             used.
408              
409             =item single
410              
411             A Boolean value (1 or 0) indicating whether only a single dataset is
412             allowed when selecting datasets from a presented list. If true, only
413             one dataset choice is accepted. If false, one or more dataset choices
414             are allowed. Default is false.
415              
416             =item limit
417              
418             Optionally provide a word or regular expression that matches the
419             feature type (C<method> only; C<source>, if present, is ignored).
420             For example, provide "geneE<verbar>mrna" to only present gene and mRNA
421             features to the user. This is only applicable when a user must select
422             from a database list. The default is to list all available feature
423             types.
424              
425             =back
426              
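As a brief sketch, file datasets may also be verified without a database;
the file names shown are hypothetical:

    my @datasets = verify_or_request_feature_types(
        feature => [ 'alignments.bam', '/path/to/scores.bw' ],
    );
    # returned names carry a "file:" prefix for downstream use
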
427             =item check_dataset_for_rpm_support($dataset, [$cpu])
428              
429             # count the total number of alignments
430             my $dataset = '/path/to/file.bam';
431             my $total_count = check_dataset_for_rpm_support($dataset);
432            
433             # use multithreading
434             my $total_count = check_dataset_for_rpm_support($dataset, $cpu);
435              
436             This subroutine will check a dataset for RPM, or Reads Per Million mapped,
437             support. Only two types of database files support this, Bam files and
438             BigBed files. If the dataset is either one of these, or the name of a
439             database feature which points to one of these files, then it will
440             calculate the total number of mapped alignments (Bam file) or features
441             (BigBed file). It will return this total number. If the dataset does
442             not support RPM (because it is not a Bam or BigBed file, for example),
443             then it will return undefined.
444              
445             Pass this subroutine one or two values. The first is the name of the
446             dataset. Ideally it should be validated using verify_or_request_feature_types()
447             and have an appropriate prefix (file, http, or ftp).
448              
449             For multi-threaded execution pass a second value, the number of CPU cores
450             available to count Bam files. This will speed up counting bam files
451             considerably. The default is 2 for environments where L<Parallel::ForkManager>
452             is installed, or 1 where it is not.
453              
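For example, the returned total can be turned into an RPM scaling factor
(a minimal sketch):

    my $total = check_dataset_for_rpm_support($dataset);
    if (defined $total) {
        my $rpm_factor = 1_000_000 / $total;
        printf " scaling factor is %.6f\n", $rpm_factor;
    }
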
454             =item get_new_feature_list
455              
456             my $Data = Bio::ToolBox::Data->new;
457             my $db_name = 'cerevisiae';
458             my %data = get_new_feature_list(
459             data => $Data,
460             db => $db_name,
461             features => 'genes',
462             );
463              
464             This subroutine will generate a new feature list collected from an annotation
465             database, such as a L<Bio::DB::SeqFeature::Store> database. It will populate
466             an empty L<Bio::ToolBox::Data> object data table with a list of the collected
467             features. Columns include the Primary ID, Name, and feature Type.
468              
469             The subroutine is passed an array of key =E<gt> value arguments;
470             three are required. The keys include
471              
472             =over 4
473              
474             =item data
475              
476             An empty L<Bio::ToolBox::Data> object. This is required.
477              
478             =item db
479              
480             The name of the database or a reference to an established database object.
481             Currently, this must be a L<Bio::DB::SeqFeature::Store> database.
482             This is required.
483              
484             =item features
485              
486             A scalar value containing a name representing the feature type(s) of
487             feature(s) to collect. The type may be a C<primary_tag> or C<primary_tag:source>.
488             It may be a single element or a comma-delimited list. This is required.
489              
490             =item chrskip
491              
492             Provide a regular-expression compatible string for skipping features from
493             specific chromosomes, for example mitochondrial or unmapped contigs; see the example below.
494              
495             =back
496              
497             Status messages, including the number of features found and kept, are always
498             printed to C<STDOUT>.
499              
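For example, unwanted chromosomes may be skipped with the C<chrskip> option
(the pattern below is illustrative):

    my %data = get_new_feature_list(
        data     => $Data,
        db       => $db_name,
        features => 'gene',
        chrskip  => 'chrM|chrUn|random',
    );
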
500             =item get_new_genome_list
501              
502             my $Data = Bio::ToolBox::Data->new();
503             my $db_name = 'cerevisiae';
504             my $window_size = 500;
505             my $step_size = 250;
506             my %data = get_new_genome_list(
507             data => $Data,
508             db => $db_name,
509             win => $window_size,
510             step => $step_size,
511             );
512              
513             This subroutine will generate a new list of genomic windows or intervals.
514             The genome is split into intervals of a specified size with the specified
515             step size.
516              
517             The subroutine is passed an array containing the required arguments.
518              
519             =over 4
520              
521             =item data
522              
523             An empty L<Bio::ToolBox::Data> object. This is required.
524              
525             =item db
526              
527             The name of the database or a reference to an established database
528             object. Any BioPerl adapter, including those described in the L</DESCRIPTION>,
529             is supported. Required.
530              
531             =item win
532              
533             A scalar value containing an integer representing the size of the
534             window in basepairs. The default value is 500 bp.
535              
536             =item step
537              
538             A scalar value containing an integer representing the step size for
539             advancing the window across the genome. The default is the window size.
540              
541             =item chrskip
542              
543             =item skip
544              
545             =item exclude
546              
547             Provide a regular expression compatible string for excluding specific
548             chromosomes or classes of chromosomes, such as the mitochondrial or
549             unmapped contigs.
550              
551             =back
552              
553             Status messages are printed to C<STDOUT>.
554              
555             =item get_db_feature
556              
557             my $db = open_db_connection('annotation.db');
558             my $seqfeature = get_db_feature(
559             db => $db,
560             type => 'gene',
561             name => 'ABC1',
562             );
563              
564             This subroutine will retrieve a specific feature from a L<Bio::DB::SeqFeature::Store>
565             database for subsequent analysis, manipulation, and/or score retrieval using the
566             L<Bio::ToolBox::Data::Feature> methods. It relies upon unique information to pull out a
567             single, unique feature.
568              
569             Several attributes may be used to pull out the feature, including the feature's
570             unique database primary ID, name andE<sol>or aliases, and GFF type (C<primary_tag>
571             andE<sol>or C<source>). The L</get_new_feature_list> subroutine will generate a list
572             of features with their unique identifiers.
573              
574             The C<id> attribute is preferentially used as it provides the best
575             performance. However, it is not always portable between databases or even a
576             database re-load. In that case, the C<name> and C<type> are used to identify
577             potential features. Note that the C<name> may not always be unique in the
578             database. In this case, the addition of aliases may help. If all else fails, a new
579             feature list should be generated.
580              
581             To get a feature, pass an array of arguments.
582              
583             =over 4
584              
585             =item db
586              
587             The name of the L<Bio::DB::SeqFeature::Store> database or a reference
588             to an established database object.
589              
590             =item id
591              
592             Provide the C<primary_id> tag. In the L<Bio::DB::SeqFeature::Store>
593             database schema this is a (usually) non-portable, unique identifier
594             specific to a database. It provides the fastest lookup.
595              
596             =item name
597              
598             A scalar value representing the feature C<display_name>. Aliases may be
599             appended with semicolon delimiters.
600              
601             =item type
602              
603             Provide the feature type, which is typically expressed as
604             C<primary_tag:source>. Alternatively, provide just the C<primary_tag> only.
605              
606             =back
607              
608             While it is possible to identify features with any two attributes
609             (or possibly just name or ID), the best performance is obtained with
610             all three together. The first SeqFeature object is returned if multiple
611             are found.
612              
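As a sketch, the fastest lookup combines all three attributes collected by
L</get_new_feature_list>; the values shown are hypothetical:

    my $seqfeature = get_db_feature(
        db   => $db,
        id   => $primary_id,     # from the feature list
        name => 'ABC1',
        type => 'gene',
    );
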
613             =item get_segment_score
614              
615             my $score = get_segment_score(
616             $seqfeature->seq_id, # chromosome
617             $seqfeature->start, # start
618             $seqfeature->end, # end
619             $seqfeature->strand, # strand
620             'sense', # strandedness
621             'mean', # method
622             0, # return type
623             $db, # database
624             'scores.bw' # datasets
625             );
626              
627             Generic method for collecting score(s) from a database. This is the
628             primary interface to all of the database handling packages, including
629             Bam, bigWig, bigBed, and USeq files, as well as SeqFeature databases.
630             By passing common parameters here, the appropriate database or file
631             handling packages are loaded during run time and the appropriate
632             methods are called. Essentially, this is where the magic happens.
633              
634             Pass the method an array of nine (or more) parameters. The parameters
635             are not keyed as a hash, and the order is explicit. However, see the
636             L<Bio::ToolBox::db_helper::constants> module for named index positions.
637             The order is as follows.
638              
639             =over 4
640              
641             =item * 0 Chromosome
642              
643             String representing chromosome or sequence ID.
644              
645             =item * 1 Start
646              
647             Start coordinate (integer, 1-based)
648              
649             =item * 2 Stop
650              
651             Stop or end coordinate (integer, 1-based)
652              
653             =item * 3 Strand
654              
655             BioPerl style strand value, including -1, 0, and 1.
656              
657             =item * 4 Strandedness
658              
659             A string indicating strandedness: sense, antisense, all.
660              
661             =item * 5 Method
662              
663             A string representing the method of data collection.
664              
665             Current acceptable methods include mean, median, sum, min, max,
666             stddev, count, ncount (named count), and pcount (precise count).
667             See the individual db_helper sub classes for more details. Not all
668             adapters support all methods.
669              
670             =item * 6 Return type
671              
672             An integer (0, 1, 2) representing the type of value to return.
673              
674             A return type of 0 indicates a single score should be returned,
675             calculated using the specified method. A return type of 1 is an
676             array or array reference of all scores found. A return type of 2
677             is a hash or hash reference of coordinate => score.
678              
679             =item * 7 Dataset database
680              
681             This is either a L<Bio::DB::SeqFeature::Store> or a L<Bio::DB::BigWigSet>
682             database containing multiple score dataset features. If your dataset
683             does not contain multiple different feature types, pass C<undef>.
684              
685             Pass either a string containing the name of database (which will be opened)
686             or an already opened database object.
687              
688             =item * 8 Dataset
689              
690             This is the verified dataset from which the scores will be collected. The
691             dataset should be verified using the L</verify_or_request_feature_types>
692             subroutine. For indexed file datasets, e.g. F<.bam> or F<.bw> files,
693             this will be the verified local path or URL. For a database, this will be
694             a database type, such as C<primary_tag> or C<primary_tag:source>,
695             or the path to a data file, e.g. a Bam, bigWig, bigBed, or USeq file.
696              
697             If multiple datasets are to be combined, simply append them to the
698             array (index 9, 10, etc).
699              
700             =back
701              
702             The returned item is dependent on the value of the return type code.
703              
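For example, a return type of 2 yields positioned scores that can be
iterated (a minimal sketch; the bigWig file name is hypothetical, and it
assumes list context returns the coordinate =E<gt> score hash directly):

    my %pos2score = get_segment_score(
        'chrI', 1000, 2000, 1, 'all', 'mean', 2, undef, 'file:scores.bw'
    );
    foreach my $pos (sort {$a <=> $b} keys %pos2score) {
        printf "%s\t%s\n", $pos, $pos2score{$pos};
    }
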
704             =item calculate_score
705              
706             my $scores = get_segment_score($chr,$start,$stop,$strand,
707             'sense','ncount',1,undef,'file:/path/to/dataset.bam');
708             my $score = calculate_score('ncount', $scores);
709              
710             This subroutine will perform the math to calculate a single score
711             from an array of values. Current acceptable methods include mean,
712             median, sum, min, max, stddev, count, ncount, and pcount.
713              
714             Pass the subroutine two items: the name of the mathematical method
715             enumerated above, and an array reference of the values to work on.
716             A single score will be returned.
717              
718             =item get_chromosome_list
719              
720             my $db = open_db_connection('cerevisiae');
721             # get all chromosomes in the database
722             my @chromosomes = get_chromosome_list($db);
723             foreach (@chromosomes) {
724             my $name = $_->[0];
725             my $length = $_->[1];
726             print "chromosome $name is $length bp\n";
727             }
728              
729             This subroutine will collect a list of chromosomes or reference sequences
730             in a Bio::DB database and return the list along with their sizes in bp.
731             Many BioPerl-based databases are supported, including
732             L<Bio::DB::SeqFeature::Store>, L<Bio::DB::HTS>, L<Bio::DB::Sam>,
733             L<Bio::DB::BigWig>, L<Bio::DB::BigBed>, L<Bio::DB::USeq>,
734             L<Bio::DB::Fasta>, or any others that support the C<seq_ids> method.
735             L<Bio::DB::Big> adapters for big files are also supported, though see the note
736             below. See the L</open_db_connection> subroutine for more information.
737              
738             Pass the subroutine either the name of the database, the path to the
739             database file, or an opened database object.
740              
741             Optionally pass a second value, a regular expression compatible string
742             for skipping specific chromosomes or chromosome classes, such as mitochondrial
743             or unmapped contigs. The default is to return all chromosomes.
744              
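For example (the skip pattern is illustrative):

    # exclude mitochondrial and unplaced contigs
    my @chromosomes = get_chromosome_list($db, 'chrM|chrUn');
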
745             The subroutine will return an array, with each element representing each
746             chromosome or reference sequence in the database. Each element is an anonymous
747             array of two elements, the chromosome name and length in bp.
748              
749             B<NOTE>: L<Bio::DB::Big> databases for bigWig and bigBed files do
750             not return chromosomes in the same order as in the file; they are returned
751             in a sorted order that may differ from the original.
752              
753             =item low_level_bam_coverage
754              
755             my $coverage = low_level_bam_coverage($sam, $tid, $start, $stop);
756              
757             This is a convenience method for running the low level bam coverage method.
758             Since both L<Bio::DB::Sam> and L<Bio::DB::HTS> bam file adapters are
759             supported, and each has slight variation in the API syntax, this method helps
760             to abstract the actual method and use the appropriate syntax depending on
761             which adapter is loaded. It is best if the C<$sam> object was opened using the
762             L</open_db_connection> method, or that C<$BAM_ADAPTER> is set.
763              
764             B<Note> that this is the B<low level> coverage method based on the index object,
765             and not the similarly named high level API method. Read the adapter
766             documentation for proper usage.
767              
768             =item low_level_bam_fetch
769              
770             my $success = low_level_bam_fetch($sam, $tid, $start, $stop, $callback, $data);
771              
772             This is a convenience method for running the low level bam fetch method.
773             Since both L<Bio::DB::Sam> and L<Bio::DB::HTS> bam file adapters are
774             supported, and each has slight variation in the API syntax, this method helps
775             to abstract the actual method and use the appropriate syntax depending on
776             which adapter is loaded. It is best if the C<$sam> object was opened using the
777             L</open_db_connection> method, or that C<$BAM_ADAPTER> is set.
778              
779             B<Note> that this is the B<low level> fetch method based on the index object,
780             and not the similarly named high level API method. Read the adapter
781             documentation for proper usage.
782              
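As a hedged sketch of a fetch callback (assuming the adapter's header
C<parse_region> call for translating a region string into a target ID):

    my ($tid, $zstart, $end) = $sam->header->parse_region('chrI:1000-2000');
    my $data = { count => 0 };
    my $callback = sub {
        my ($alignment, $d) = @_;
        $d->{count}++;    # tally each overlapping alignment
    };
    low_level_bam_fetch($sam, $tid, $zstart, $end, $callback, $data);
    print "found $data->{count} alignments\n";
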
783             =item get_genomic_sequence
784              
785             my $sequence = get_genomic_sequence($db, $chrom, $start, $stop);
786              
787             This is a convenience wrapper function for fetching genomic sequence from
788             three different supported fasta database adapters, which, of course, all
789             have different APIs, including the L<Bio::DB::HTS>, L<Bio::DB::Sam>,
790             and L<Bio::DB::Fasta> adapters, as well as L<Bio::DB::SeqFeature::Store>
791             and possibly other BioPerl databases.
792              
793             Pass the opened database object, chromosome, start, and stop coordinates. This
794             assumes BioPerl standard 1-base coordinates. Only the forward sequence is
795             retrieved. The sequence is returned as a simple string.
796              
797             =back
798              
799             =head1 INTERNAL SUBROUTINES
800              
801             These are not intended for normal general consumption but are documented here
802             so that at least some of us know what is going on.
803              
804             =over 4
805              
806             =item _lookup_db_method
807              
808             Internal method for determining which database adapter and method call
809             should be used given a set of parameters. This result is cached for
810             future data queries (likely hundreds or thousands of repeated calls).
811             The method is stored in the global hash C<%DB_METHODS>.
812              
813             Pass the parameter array reference from L</get_segment_score>. This
814             should load the appropriate db_helper sub module during run time.
815              
816             =item _load_helper_module
817              
818             Subroutine to load and import a db_helper module during run time.
819             Sets the appropriate global variable if successful.
820              
821             =item _load_bam_helper_module
822              
823             Subroutine to determine which Bam adapter db_helper module to load,
824             either the older samtools-based adapter or the newer HTSlib-based adapter.
825             Uses the global and exportable variable C<$BAM_ADAPTER> to set a
826             preference for which adapter to use. Use 'sam' or 'hts' or some
827             string containing these names. Do B<not> change this variable after
828             loading the helper module; it will not do what you think it will do.
829             If both are available and a preference is not set then the hts helper
830             is the default.
831              
832             =back
833              
834             =cut
835              
836 3     3   25 use strict;
  3         6  
  3         128  
837             require Exporter;
838 3     3   14 use Carp qw(carp cluck croak confess);
  3         32  
  3         198  
839 3     3   1729 use Module::Load; # for dynamic loading during runtime
  3         3551  
  3         18  
840 3     3   175 use List::Util qw(min max sum0);
  3         6  
  3         213  
841 3     3   1509 use Statistics::Lite qw(median range stddevp);
  3         4819  
  3         225  
842 3     3   1635 use Bio::ToolBox::db_helper::constants;
  3         7  
  3         207  
843 3     3   1652 use Bio::ToolBox::utility;
  3         8  
  3         27939  
844              
845             # check values for dynamically loaded helper modules
846             # these are loaded only when needed during runtime to avoid wasting resources
847             our $BAM_OK = 0;
848             our $BIGBED_OK = 0;
849             our $BIGWIG_OK = 0;
850             our $SEQFASTA_OK = 0;
851             our $USEQ_OK = 0;
852             our $BAM_ADAPTER = undef; # preference for which bam adapter to use
853             our $BIG_ADAPTER = undef;
854              
855             # define reusable variables
856             our %total_read_number; # for rpm calculations
857             our $primary_id_warning; # for out of date primary IDs
858             our %OPENED_DB; # cache for some opened Bio::DB databases
859             our %DB_METHODS; # cache for database score methods
860              
861             # Exported names
862             our @ISA = qw(Exporter);
863             our @EXPORT = qw();
864             our @EXPORT_OK = qw(
865             $BAM_ADAPTER
866             $BIG_ADAPTER
867             use_bam_adapter
868             use_big_adapter
869             open_db_connection
870             get_dataset_list
871             verify_or_request_feature_types
872             check_dataset_for_rpm_support
873             get_new_feature_list
874             get_new_genome_list
875             validate_included_feature
876             get_db_feature
877             get_segment_score
878             calculate_score
879             get_chromosome_list
880             low_level_bam_coverage
881             low_level_bam_fetch
882             get_genomic_sequence
883             );
884              
885             # The true statement
886             1;
887              
888              
889             ### Open a connection to a database or indexed file
890              
891             sub use_bam_adapter {
892 0   0 0 1 0 my $a = shift || undef;
893 0 0       0 $BAM_ADAPTER = $a if $a;
894 0         0 return $BAM_ADAPTER;
895             }
896              
897              
898             sub use_big_adapter {
899 0   0 0 1 0 my $a = shift || undef;
900 0 0       0 $BIG_ADAPTER = $a if $a;
901 0         0 return $BIG_ADAPTER;
902             }
903              
904              
905             sub open_db_connection {
906 6     6 1 12 my $database = shift;
907 6   50     23 my $no_cache = shift || 0;
908 6 50       13 unless ($database) {
909             # cluck 'no database name passed!';
910 0         0 return;
911             }
912            
913             # first check if it is a database reference
914 6         12 my $db_ref = ref $database;
915 6 100       20 if ($db_ref =~ /DB|big::BigWigSet/) {
916             # the provided database is already an open database object
917             # nothing to open, return as is
918 1         3 return $database;
919             }
920            
921             # check to see if we have already opened it
922 5 100 66     20 if (exists $OPENED_DB{$database} and not $no_cache) {
923             # return the cached database connection
924             # but NOT if user explicitly requested no cached databases
925             # DO NOT reuse database objects if you have forked!!! Bad things happen
926 4         10 return $OPENED_DB{$database};
927             }
928            
929             # skip parsed databases
930 1 50       3 return if $database =~ /^Parsed:/; # we don't open parsed annotation files
931            
932            
933             ### Attempt to open the database
934             # we go through a series of checks to determine if it is remote, local,
935             # an indexed big data file, SQLite file, etc
936             # when all else fails, try to open a SQL connection
937 1         2 my $db;
938             my $error;
939            
940             # check if it is a remote file
941 1 50       11 if ($database =~ /^(?:https?|ftp)/i) {
    50          
    0          
    0          
942            
943             # a remote Bam database
944 0 0       0 if ($database =~ /\.bam$/i) {
    0          
    0          
    0          
    0          
945             # open using Bam adaptor
946 0 0       0 _load_bam_helper_module() unless $BAM_OK;
947 0 0       0 if ($BAM_OK) {
948 0         0 $db = open_bam_db($database);
949 0 0       0 unless ($db) {
950 0         0 $error = " ERROR: could not open remote Bam file" .
951             " '$database'! $!\n";
952             }
953             }
954             else {
955 0         0 $error = " Bam database cannot be loaded because\n" .
956             " Bio::DB::Sam or Bio::DB::HTS is not installed\n";
957             }
958             }
959            
960             # a remote BigBed database
961             elsif ($database =~ /\.(?:bb|bigbed)$/i) {
962             # open using BigBed adaptor
963 0 0       0 $BIGBED_OK = _load_bigbed_helper_module() unless $BIGBED_OK;
964 0 0       0 if ($BIGBED_OK) {
965 0         0 $db = open_bigbed_db($database);
966 0 0       0 unless ($db) {
967 0         0 $error = " ERROR: could not open remote BigBed file" .
968             " '$database'! $!\n";
969             }
970             }
971             else {
972 0         0 $error = " BigBed database cannot be loaded because\n" .
973             " Bio::DB::Big or Bio::DB::BigBed is not installed\n";
974             }
975             }
976            
977             # a remote BigWig database
978             elsif ($database =~ /\.(?:bw|bigwig)$/i) {
979             # open using BigWig adaptor
980 0 0       0 $BIGWIG_OK = _load_bigwig_helper_module() unless $BIGWIG_OK;
981 0 0       0 if ($BIGWIG_OK) {
982 0         0 $db = open_bigwig_db($database);
983 0 0       0 unless ($db) {
984 0         0 $error = " ERROR: could not open remote BigWig file" .
985             " '$database'! $!\n";
986             }
987             }
988             else {
989 0         0 $error = " BigWig database cannot be loaded because\n" .
990             " Bio::DB::Big or Bio::DB::BigWig is not installed\n";
991             }
992             }
993            
994             # a remote useq file
995             elsif ($database =~ /\.useq$/) {
996             # uh oh! remote useq files are not supported
997 0         0 $error = " ERROR: remote useq files are not supported!\n";
998             }
999            
1000             # a remote fasta file???
1001             elsif ($database =~ /\.fa(?:sta)?$/i) {
1002             # uh oh! remote fasta files are not supported
1003 0         0 $error = " ERROR: remote fasta files are not supported!\n";
1004             }
1005            
1006             # a presumed remote directory, presumably of bigwig files
1007             else {
1008             # open using BigWigSet adaptor
1009 0 0       0 $BIGWIG_OK = _load_helper_module('Bio::ToolBox::db_helper::bigwig')
1010             unless $BIGWIG_OK;
1011 0 0       0 if ($BIGWIG_OK) {
1012 0         0 $db = open_bigwigset_db($database);
1013 0 0       0 unless ($db) {
1014 0         0 $error = " ERROR: could not open presumed remote " .
1015             "BigWigSet directory '$database'! $!\n";
1016             }
1017             }
1018             else {
1019 0         0 $error = " Presumed BigWigSet database cannot be loaded because\n" .
1020             " Bio::DB::BigWigSet is not installed\n";
1021             }
1022             }
1023            
1024             }
1025            
1026             # check for a known file type
1027             elsif ($database =~ /gff|bw|bb|bam|useq|db|sqlite|fa|fasta|bigbed|bigwig|cram/i) {
1028            
1029             # remove prefix, just in case
1030 1         2 $database =~ s/^file://;
1031            
1032             # first check that it exists
1033 1 50       27 if (-e $database) {
1034            
1035             # a Bam database
1036 1 50       14 if ($database =~ /\.bam$/i) {
    50          
    50          
    50          
    0          
    0          
    0          
1037             # open using appropriate bam adaptor
1038 0 0       0 _load_bam_helper_module() unless $BAM_OK;
1039 0 0       0 if ($BAM_OK) {
1040 0         0 undef $@;
1041 0         0 $db = open_bam_db($database);
1042 0 0       0 unless ($db) {
1043 0         0 $error = " ERROR: could not open local Bam file" .
1044             " '$database'! $@\n";
1045             }
1046             }
1047             else {
1048 0         0 $error = " Bam database cannot be loaded because\n" .
1049             " Bio::DB::Sam or Bio::DB::HTS is not installed\n";
1050             }
1051             }
1052            
1053             # a BigBed database
1054             elsif ($database =~ /\.(?:bb|bigbed)$/i) {
1055             # open using BigBed adaptor
1056 0 0       0 $BIGBED_OK = _load_bigbed_helper_module() unless $BIGBED_OK;
1057 0 0       0 if ($BIGBED_OK) {
1058 0         0 undef $@;
1059 0         0 $db = open_bigbed_db($database);
1060 0 0       0 unless ($db) {
1061 0         0 $error = " ERROR: could not open local BigBed file" .
1062             " '$database'! $@\n";
1063             }
1064             }
1065             else {
1066 0         0 $error = " BigBed database cannot be loaded because\n" .
1067             " Bio::DB::Big or Bio::DB::BigBed is not installed\n";
1068             }
1069             }
1070            
1071             # a BigWig database
1072             elsif ($database =~ /\.(?:bw|bigwig)$/i) {
1073             # open using BigWig adaptor
1074 0 0       0 $BIGWIG_OK = _load_bigwig_helper_module() unless $BIGWIG_OK;
1075 0 0       0 if ($BIGWIG_OK) {
1076 0         0 undef $@;
1077 0         0 $db = open_bigwig_db($database);
1078 0 0       0 unless ($db) {
1079 0         0 $error = " ERROR: could not open local BigWig file" .
1080             " '$database'! $@\n";
1081             }
1082             }
1083             else {
1084 0         0 $error = " BigWig database cannot be loaded because\n" .
1085             " Bio::DB::Big or Bio::DB::BigWig is not installed\n";
1086             }
1087             }
1088            
1089             # a useq database
1090             elsif ($database =~ /\.useq$/i) {
1091             # open using USeq adaptor
1092 1 50       44 $USEQ_OK = _load_helper_module('Bio::ToolBox::db_helper::useq')
1093             unless $USEQ_OK;
1094 1 50       3 if ($USEQ_OK) {
1095 1         4 $db = open_useq_db($database);
1096 1 50       7 unless ($db) {
1097 0         0 $error = " ERROR: could not open local useq file" .
1098             " '$database'! $!\n";
1099             }
1100             }
1101             else {
1102 0         0 $error = " Useq database cannot be loaded because\n" .
1103             " Bio::DB::USeq is not installed\n";
1104             }
1105             }
1106            
1107             # a Fasta File
1108             elsif ($database =~ /\.fa(?:sta)?$/i) {
1109             # first try a modern fai indexed adapter
1110 0 0       0 _load_bam_helper_module() unless $BAM_OK;
1111 0 0       0 if ($BAM_OK) {
1112 0         0 $db = open_indexed_fasta($database);
1113 0 0       0 unless ($db) {
1114 0         0 $error .= " ERROR: could not open indexed fasta file '$database'!\n";
1115             }
1116             }
1117             else {
1118             # open using the old slow BioPerl Fasta adaptor
1119 0 0       0 $SEQFASTA_OK = _load_helper_module('Bio::ToolBox::db_helper::seqfasta')
1120             unless $SEQFASTA_OK;
1121 0 0       0 if ($SEQFASTA_OK) {
1122 0         0 $db = open_fasta_db($database);
1123 0 0       0 unless ($db) {
1124 0         0 $error .= " ERROR: could not open fasta file '$database'!\n";
1125 0 0       0 if (-e "$database\.index") {
1126 0         0 $error .= " Try deleting $database\.index and try again\n";
1127             }
1128             }
1129             }
1130             else {
1131 0         0 $error .= " Fasta file could not be loaded because neither" .
1132             " Bio::DB::HTS, Bio::DB::Sam, or Bio::DB::Fasta is installed\n";
1133             }
1134             }
1135             }
1136            
1137             # a gff3 or sqlite database
1138             elsif ($database =~ /\.(?:gff3?|gff3?\.gz|db|sqlite)$/) {
1139 0 0       0 $SEQFASTA_OK = _load_helper_module('Bio::ToolBox::db_helper::seqfasta')
1140             unless $SEQFASTA_OK;
1141 0 0       0 if ($SEQFASTA_OK) {
1142 0         0 $db = open_store_db($database);
1143 0 0       0 unless ($db) {
1144 0         0 $error = " ERROR: could not load SeqFeature database file '$database'!\n";
1145             }
1146             }
1147             else {
1148 0         0 $error .= " Module Bio::DB::SeqFeature::Store is required to load " .
1149             "GFF and database files\n";
1150             }
1151             }
1152            
1153             # a cram database
1154             elsif ($database =~ /\.cram$/i) {
1155             # open using HTS bam adaptor only
1156 0   0     0 $BAM_ADAPTER ||= 'hts';
1157 0 0       0 if ($BAM_ADAPTER eq 'sam') {
1158 0         0 $error .= " ERROR: Only HTS adapters support Cram files!\n";
1159             }
1160 0 0       0 _load_bam_helper_module() unless $BAM_OK;
1161 0 0       0 if ($BAM_OK) {
1162 0         0 undef $@;
1163 0         0 $db = open_bam_db($database);
1164 0 0       0 unless ($db) {
1165 0         0 $error = " ERROR: could not open local Cram file" .
1166             " '$database'! $@\n";
1167             }
1168             }
1169             else {
1170 0         0 $error = " Cram database cannot be loaded because\n" .
1171             " Bio::DB::HTS is not installed\n";
1172             }
1173             }
1174            
1175             # something unrecognized?
1176             else {
1177 0         0 $error .= " ERROR! Cannot identify database type for $database!\n";
1178             }
1179             }
1180            
1181             # file does not exist or cannot be read
1182             else {
1183             # file does not exist
1184 0         0 $error = " ERROR: file '$database' does not exist!\n";
1185             }
1186             }
1187            
1188             # a directory, presumably of bigwig files
1189             elsif (-d $database) {
1190             # try opening using the BigWigSet adaptor
1191 0 0       0 $BIGWIG_OK = _load_bigwig_helper_module() unless $BIGWIG_OK;
1192 0 0       0 if ($BIGWIG_OK) {
1193 0         0 $db = open_bigwigset_db($database);
1194 0 0       0 unless ($db) {
1195 0         0 $error = " ERROR: could not open local BigWigSet " .
1196             "directory '$database'!\n";
1197 0         0 $error .= " Does directory contain bigWig .bw files?\n";
1198             }
1199             }
1200             else {
1201 0         0 $error = " Presumed BigWigSet database cannot be loaded because\n" .
1202             " Bio::DB::Big or Bio::DB::BigWigSet is not installed\n";
1203             }
1204            
1205             # try opening with the Fasta adaptor
1206 0 0       0 unless ($db) {
1207 0 0       0 $SEQFASTA_OK = _load_helper_module('Bio::ToolBox::db_helper::seqfasta')
1208             unless $SEQFASTA_OK;
1209 0 0       0 if ($SEQFASTA_OK) {
1210 0         0 $db = open_fasta_db($database);
1211 0 0       0 unless ($db) {
1212 0         0 $error .= " ERROR: could not open fasta directory '$database'!\n";
1213 0         0 $error .= " Does directory contain fasta files? If it contains a" .
1214             " directory.index file,\n try deleting it and try again.\n";
1215             }
1216             }
1217             else {
1218 0         0 $error .= " Module Bio::DB::Fasta is required to open presumed fasta " .
1219             "directory\n";
1220             }
1221             }
1222             }
1223            
1224             # unrecognized real file
1225             elsif (-e $database) {
1226             # file exists, I just don't recognize the extension
1227 0         0 $error = " File '$database' type and/or extension is not recognized\n";
1228             }
1229            
1230            
1231             # otherwise assume the name of a SeqFeature::Store database in the configuration
1232             else {
1233             # attempt to open using information from the configuration
1234             # using default connection information as necessary
1235 0 0       0 $SEQFASTA_OK = _load_helper_module('Bio::ToolBox::db_helper::seqfasta')
1236             unless $SEQFASTA_OK;
1237 0 0       0 if ($SEQFASTA_OK) {
1238 0         0 $db = open_store_db($database);
1239 0 0       0 unless ($db) {
1240 0         0 $error .= " unable to open relational database named '$database'\n";
1241             }
1242             }
1243             else {
1244 0         0 $error .= " Module Bio::DB::SeqFeature::Store is required to connect " .
1245             "to databases\n";
1246             }
1247             }
1248            
1249             # conditional return
1250 1 50       3 if ($db) {
1251             # cache the opened connection for use later
1252 1 50       5 $OPENED_DB{$database} = $db unless $no_cache;
1253            
1254             # return as appropriate either both object and name or just object
1255 1         3 return $db;
1256             }
1257             else {
1258 0         0 print STDERR $error;
1259 0         0 return;
1260             }
1261             }
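
# Example usage (a minimal sketch, not part of the original source; 'data.bam'
# is a hypothetical file, and the second argument is assumed to be the
# no-cache flag referenced above):
#
#   my $db    = open_db_connection('data.bam');     # opened and cached
#   my $same  = open_db_connection('data.bam');     # returns the cached object
#   my $fresh = open_db_connection('data.bam', 1);  # bypasses the cache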
1262              
1263              
1264             ### Retrieve a database name from an db object
1265             sub get_db_name {
1266 0     0 1 0 my $db = shift;
1267 0 0       0 return unless $db;
1268 0         0 my $db_ref = ref($db);
1269 0 0       0 return $db unless $db_ref; # presumption that non-object is just a database name
1270 0         0 my $db_name;
1271 0 0       0 if ($db_ref =~ /^Bio::DB::SeqFeature::Store/) {
    0          
    0          
1272             # a SeqFeature database, using any DBI adapter
1273 0         0 $db_name = $db->{'dbh'}->{'name'};
1274             # dig through the object internals to identify the original
1275             # name of the database
1276             # this should be relatively well documented through DBI
1277             # but could break in the future since it's not official API
1278             }
1279             elsif ($db_ref eq 'Bio::DB::Sam') {
1280             # a Samtools Bam database
1281             # old versions <=1.39 don't have the bam_path method, so it's easier
1282             # to explicitly grab it from the object internals
1283 0         0 $db_name = $db->{bam_path};
1284             }
1285             elsif ($db_ref eq 'Bio::DB::HTS') {
1286             # an HTSlib Bam database
1287 0         0 $db_name = $db->hts_path;
1288             }
1289             # determining the database name from other sources is
1290             # either not possible or not easy, so won't bother unless
1291             # there is a really really good need
1292 0         0 return $db_name;
1293             }
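
# Example (a sketch; 'data.bam' is a hypothetical file):
#
#   my $db   = open_db_connection('data.bam');
#   my $name = get_db_name($db);         # recovers the original path
#   my $same = get_db_name('data.bam');  # plain strings pass through unchanged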
1294              
1295              
1296             ### Retrieve a list of the microarray data sets from the db
1297             sub get_dataset_list {
1298            
1299 0     0 1 0 my $database = shift;
1300 0         0 my $use_all_features = shift;
1301            
1302             # Open a db connection
1303 0         0 my $db = open_db_connection($database);
1304 0 0       0 unless ($db) {
1305 0         0 carp 'no database connected!';
1306 0         0 return;
1307             }
1308 0         0 my $db_ref = ref $db;
1309            
1310             # process the database types, according to the type of database
1311 0         0 my @types;
1312            
1313             # a SeqFeature database
1314 0 0       0 if ($db_ref =~ /^Bio::DB::SeqFeature::Store/) {
    0          
1315            
1316             # collect the database types,
1317             # sort first by source, then by method
1318             @types = (
1319             map $_->[2],
1320 0 0       0 sort { ($a->[0] cmp $b->[0]) or ($a->[1] cmp $b->[1]) }
  0         0  
1321             map [$_->source, $_->method, $_],
1322             $db->types
1323             );
1324             }
1325            
1326             # a BigWigSet database
1327             elsif ($db_ref =~ /BigWigSet/i) {
1328            
1329             # get the metadata
1330 0         0 my $metadata = $db->metadata;
1331            
1332             # collect
1333 0         0 foreach my $file (keys %{ $metadata }) {
  0         0  
1334             # get the type for each file
1335 0         0 my ($primary, $type, $name);
1336            
1337             # get the appropriate tags
1338 0         0 foreach my $attribute (keys %{ $metadata->{$file} } ) {
  0         0  
1339 0 0       0 if ($attribute =~ m/^primary_tag|method$/i) {
    0          
    0          
1340 0         0 $primary = $metadata->{$file}{$attribute};
1341             }
1342             elsif ($attribute =~ m/^type/i) {
1343 0         0 $type = $metadata->{$file}{$attribute};
1344             }
1345             elsif ($attribute =~ m/^display_name/i) {
1346 0         0 $name = $metadata->{$file}{$attribute};
1347             }
1348             }
1349            
1350             # store the type
1351 0   0     0 push @types, $type || $primary || $name;
1352             }
1353            
1354             # sort the types in alphabetical order
1355 0         0 @types = sort {$a cmp $b} @types;
  0         0  
1356             }
1357            
1358             # some other database
1359             else {
1360 0         0 carp " no dataset lists for database type $db_ref!\n";
1361             }
1362            
1363             # finish
1364 0         0 return @types;
1365             }
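
# Example (a sketch; 'annotation.db' is a hypothetical SeqFeature::Store
# database, and the example types shown are illustrative only):
#
#   my @types = get_dataset_list('annotation.db');
#   # e.g. ('gene:SGD', 'mRNA:SGD', ...), sorted by source and then method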
1366              
1367              
1368              
1369             ### Process and verify a dataset
1370             sub verify_or_request_feature_types {
1371            
1372             # Retrieve passed values
1373 1     1 1 4 my %args = @_; # the passed argument values as a hash reference
1374            
1375             # Check for single option
1376 1   50     6 $args{'single'} ||= 0;
1377            
1378             # Collect the datasets
1379 1         2 my @datasets;
1380 1   50     7 $args{'feature'} ||= undef;
1381 1 50       7 if (ref $args{'feature'} eq 'ARRAY') {
    50          
1382             # an anonymous array of datasets
1383 0         0 @datasets = @{ $args{'feature'} };
  0         0  
1384             }
1385             elsif (defined $args{'feature'}) {
1386 1         3 push @datasets, $args{'feature'};
1387             }
1388             # else it is null and @datasets remains empty, prompting user input
1389            
1390             # feature limits
1391             # this is a regex to limit the feature types presented to the user
1392             # otherwise everything in the database is presented to the user
1393 1   50     6 my $limit = $args{'limit'} ||= undef;
1394            
1395            
1396             # Open database object and collect features
1397 1   50     3 $args{'db'} ||= undef;
1398 1 50       4 my $db = $args{'db'} ? open_db_connection( $args{'db'} ) : undef;
1399            
1400             # Initialize main output arrays
1401 1         3 my @good_datasets;
1402             my @bad_datasets;
1403 1         0 my %db_features; # hash for features found in the database, use when needed
1404            
1405             # Check provided datasets
1406 1 50       13 if (@datasets) {
1407            
1408             # check for multiple comma-delimited datasets
1409 1         3 my @list_to_check;
1410 1         3 foreach my $item (@datasets) {
1411 1 50       4 if ($item =~ /,/) {
1412             # this one has a comma, therefore it contains more than one dataset
1413 0         0 push @list_to_check, split(/,/, $item);
1414             }
1415             else {
1416             # a singleton
1417 1         4 push @list_to_check, $item;
1418             }
1419             }
1420            
1421             # now verify the datasets
1422 1         3 foreach my $dataset (@list_to_check) {
1423            
1424             # check for a remote file
1425 1 50       9 if ($dataset =~ /^(?: http | ftp) .+ \. (?: bam | bw | bb) $/xi) {
    50          
1426             # a remote file
1427             # assume it is good, no verification here though
1428             # it will either work or won't work
1429 0         0 push @good_datasets, $dataset;
1430             }
1431            
1432             # a local file
1433             elsif ($dataset =~ /\.(?:bam|bw|bigwig|bb|bigbed|useq)$/i) {
1434             # presume we have a local indexed data file
1435            
1436             # user may have requested two or more files to be merged
1437             # these should be combined with an ampersand
1438             # check each one
1439 1         2 my @files;
1440 1         7 foreach my $file (split /\&/, $dataset) {
1441 1 50       34 if (-e $file) {
1442             # file exists
1443 1         7 push @files, "file:$file";
1444             }
1445             else {
1446             # file doesn't exist! can't use this set of files
1447 0         0 @files = ();
1448 0         0 last;
1449             }
1450             }
1451 1 50       8 if (@files) {
1452 1         5 push @good_datasets, join("&", @files);
1453             }
1454             else {
1455 0         0 push @bad_datasets, $dataset;
1456             }
1457             }
1458            
1459             # a feature type in a database
1460             else {
1461             # must be a database feature type
1462            
1463             # check for database features hash
1464 0 0       0 unless (%db_features) {
1465 0 0       0 if ($db) {
1466             # collect the database features as needed
1467             # I want both method:source and method in the hash
1468 0         0 foreach my $type ( get_dataset_list($db) ) {
1469 0         0 my ($method, $source) = split /:/, $type;
1470 0 0       0 if ($source) {
1471 0         0 $db_features{$type} = 1;
1472 0         0 $db_features{$method} = 1;
1473             }
1474             else {
1475 0         0 $db_features{$type} = 1;
1476             }
1477             }
1478            
1479             # verify
1480 0 0       0 unless (%db_features) {
1481 0         0 carp " provided database has no feature types " .
1482             "to verify dataset(s) against!\n";
1483             }
1484             }
1485             else {
1486             # we need a database
1487 0         0 carp " unable to verify dataset without database";
1488             }
1489             }
1490            
1491             # validate the given dataset
1492             # user may have requested two or more features to be merged
1493             # these should be combined with an ampersand
1494             # check each one
1495 0         0 my $check = 0;
1496 0         0 foreach my $d (split /&/, $dataset) {
1497             # validate this dataset
1498 0 0       0 if (exists $db_features{$d}) {
1499 0         0 $check++;
1500             }
1501             else {
1502             # at least one of these is not good, fail check
1503 0         0 $check = 0;
1504 0         0 last;
1505             }
1506             }
1507            
1508             # report the verification
1509 0 0       0 if ($check) {
1510 0         0 push @good_datasets, $dataset;
1511             }
1512             else {
1513 0         0 push @bad_datasets, $dataset;
1514             }
1515             }
1516             }
1517             }
1518            
1519             # User must select datasets
1520             else {
1521             # dataset not specified
1522             # present the dataset list to the user and get an answer
1523            
1524             # get the dataset list
1525 0 0       0 if ($db) {
1526             # collect the database features as needed
1527 0         0 my $i = 1;
1528 0         0 foreach my $type ( get_dataset_list($db) ) {
1529 0 0       0 if ($limit) {
1530 0         0 my ($p, $s) = split /:/, $type; # split into primary_tag and source
1531             # only keep those types that match our limiter
1532 0 0       0 next unless $p =~ /$limit/i;
1533             }
1534 0         0 $db_features{$i} = $type;
1535 0         0 $i++;
1536             }
1537            
1538             # verify
1539 0 0       0 unless (%db_features) {
1540 0         0 carp " provided database has no feature types " .
1541             "to collect!\n";
1542 0         0 return;
1543             }
1544             }
1545             else {
1546             # we need a database
1547 0         0 carp " no database provided from which to collect features!\n";
1548 0         0 return;
1549             }
1550            
1551             # present the list
1552 0         0 print "\n These are the available data sets in the database:\n";
1553 0         0 foreach (sort {$a <=> $b} keys %db_features) {
  0         0  
1554             # print out the list of microarray data sets
1555 0         0 print " $_\t$db_features{$_}\n";
1556             }
1557            
1558             # prompt the user
1559 0   0     0 $args{'prompt'} ||= undef;
1560 0 0       0 if ($args{'prompt'}) {
1561             # provided custom prompt
1562 0         0 print $args{'prompt'};
1563             }
1564             else {
1565             # generic prompt
1566 0 0       0 if ($args{'single'}) {
1567 0         0 print " Enter one number for the data set or feature ";
1568             }
1569             else {
1570 0         0 print " Enter the number(s) or range of the data set(s) or" .
1571             " feature(s) ";
1572             }
1573             }
1574            
1575             # get answer from the user
1576 0         0 my $answer = <STDIN>;
1577 0         0 chomp $answer;
1578 0         0 my @answer_list = parse_list($answer);
1579            
1580             # take the first one if requested
1581 0 0       0 if ($args{'single'}) {
1582 0 0       0 unless (scalar @answer_list == 1) {
1583 0         0 splice(@answer_list, 1);
1584             }
1585             }
1586            
1587             # verify the answer list
1588 0         0 foreach my $answer (@answer_list) {
1589            
1590             # check for merged datasets
1591 0 0       0 if ($answer =~ /&/) {
1592             # a merged dataset
1593 0         0 my @list = split /&/, $answer;
1594 0         0 my $check = 1;
1595            
1596             # check all are good
1597 0         0 foreach (@list) {
1598 0 0       0 unless (exists $db_features{$_}) {
1599 0         0 $check = 0;
1600             }
1601             }
1602            
1603             # if all are good
1604 0 0       0 if ($check) {
1605             push @good_datasets,
1606 0         0 join( "&", map { $db_features{$_} } @list);
  0         0  
1607             }
1608             else {
1609 0         0 push @bad_datasets, $answer;
1610             }
1611             }
1612            
1613             else {
1614             # a single dataset
1615             # check if it is good
1616            
1617 0 0       0 if (exists $db_features{$answer}) {
1618 0         0 push @good_datasets, $db_features{$answer};
1619             }
1620             else {
1621 0         0 push @bad_datasets, $answer;
1622             }
1623             }
1624             }
1625             }
1626            
1627             # Print bad results
1628 1 50       3 if (@bad_datasets) {
1629 0         0 print " The following datasets could not be verified:\n";
1630 0         0 foreach (@bad_datasets) {
1631 0         0 print " $_\n";
1632             }
1633             }
1634            
1635             # Return good results
1636 1 50       4 if ($args{'single'}) {
1637 0         0 return $good_datasets[0];
1638             }
1639             else {
1640 1         5 return @good_datasets;
1641             }
1642             }
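
# Example (a sketch; the file and type names are hypothetical):
#
#   my @verified = verify_or_request_feature_types(
#       db      => 'annotation.db',
#       feature => 'scores1.bw&scores2.bw,alignments.bam',
#   );
#   # existing local files come back with a 'file:' prefix, merged sets stay
#   # joined by '&', and database feature types are checked against
#   # get_dataset_list(); omit 'feature' to be prompted interactively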
1643              
1644              
1645             ### Process and verify a dataset
1646             sub check_dataset_for_rpm_support {
1647            
1648             # get passed dataset and databases
1649 0     0 1 0 my $dataset = shift;
1650 0         0 my $cpu = shift;
1651            
1652             # Calculate the total number of reads
1653             # this uses the global variable $rpkm_read_sum
1654 0         0 my $rpm_read_sum;
1655            
1656 0 0       0 if (exists $total_read_number{$dataset}) {
    0          
    0          
1657             # this dataset has already been summed
1658             # no need to do it again
1659 0         0 $rpm_read_sum = $total_read_number{$dataset};
1660             }
1661            
1662             elsif ($dataset =~ /\.bam$/) {
1663             # a bam file dataset
1664            
1665 0 0       0 _load_bam_helper_module() unless $BAM_OK;
1666 0 0       0 if ($BAM_OK) {
1667             # Bio::ToolBox::db_helper::bam was loaded ok
1668             # sum the number of reads in the dataset
1669 0         0 $rpm_read_sum = sum_total_bam_alignments($dataset, 0, 0, $cpu);
1670             }
1671             else {
1672 0         0 carp " Bam support is not available! " .
1673             "Is Bio::DB::Sam or Bio::DB::HTS installed?\n";
1674 0         0 return;
1675             }
1676             }
1677            
1678             elsif ($dataset =~ /\.bb$/) {
1679             # a bigbed file dataset
1680            
1681 0 0       0 $BIGBED_OK = _load_helper_module('Bio::ToolBox::db_helper::bigbed') unless
1682             $BIGBED_OK;
1683 0 0       0 if ($BIGBED_OK) {
1684             # Bio::ToolBox::db_helper::bigbed was loaded ok
1685             # sum the number of features in the dataset
1686 0         0 $rpm_read_sum = sum_total_bigbed_features($dataset);
1687             }
1688             else {
1689 0         0 carp " BigBed support is not available! " .
1690             "Is Bio::DB::BigBed installed?\n";
1691 0         0 return;
1692             }
1693             }
1694            
1695             else {
1696             # some other non-supported dataset
1697 0         0 return;
1698             }
1699            
1700             # return the sum value if we've made it this far
1701 0         0 $total_read_number{$dataset} = $rpm_read_sum;
1702 0         0 return $rpm_read_sum;
1703             }
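
# Example (a sketch; 'alignments.bam' is a hypothetical file):
#
#   my $total = check_dataset_for_rpm_support('alignments.bam', 4);
#   # sums alignments using up to 4 CPUs and caches the total for reuse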
1704              
1705              
1706             ### Generate a new list of features
1707             sub get_new_feature_list {
1708              
1709             # Retrieve passed values
1710 0     0 1 0 my %args = @_; # the passed argument values
1711            
1712             # check data object
1713 0   0     0 my $data = $args{data} || undef;
1714 0 0       0 unless ($data) {
1715 0         0 confess "must pass a 'data' key and Bio::ToolBox::Data object!";
1716 0         0 return;
1717             }
1718 0 0       0 unless (ref($data) eq 'Bio::ToolBox::Data') {
1719 0         0 confess 'must pass a Bio::ToolBox::Data object!';
1720 0         0 return;
1721             }
1722            
1723             # Open a db connection
1724 0   0     0 $args{'db'} ||= undef;
1725 0         0 my $db = open_db_connection($args{'db'});
1726 0 0       0 unless ($db) {
1727 0         0 carp 'no database connected!';
1728 0         0 return;
1729             }
1730              
1731             # Verify a SeqFeature::Store database
1732 0         0 my $db_ref = ref $db;
1733 0 0       0 unless ($db_ref =~ /^Bio::DB::SeqFeature::Store/) {
1734 0         0 carp "Database type $db_ref doesn't support generating feature lists!\n";
1735 0         0 return;
1736             }
1737            
1738             # Features to search for
1739 0   0     0 my $searchFeature = $args{'feature'} || $args{'features'} || undef;
1740 0 0       0 unless ($searchFeature) {
1741 0         0 carp "no search feature types passed!";
1742 0         0 return;
1743             }
1744 0         0 my @classes = split(',', $searchFeature); # it may or may not be a list
1745            
1746             # chromosomes to skip
1747 0   0     0 my $chr_exclude = $args{'chrskip'} || undef;
1748            
1749             # Add table columns
1750 0         0 my $pid_i = $data->add_column('Primary_ID');
1751 0         0 my $name_i = $data->add_column('Name');
1752 0         0 my $type_i = $data->add_column('Type');
1753            
1754             # Set up the database iterator
1755 0         0 print " Searching for $searchFeature features\n";
1756 0 0       0 my $iterator = $db->get_seq_stream(
1757             -types => scalar @classes ? \@classes : $searchFeature,
1758             );
1759 0 0       0 unless ($iterator) {
1760             # there should be some features found in the database
1761 0         0 carp "could not get feature iterator for database";
1762 0         0 return;
1763             }
1764            
1765             # Walk through the collected features
1766 0         0 my $total_count = 0; # total found features
1767 0         0 while (my $feature = $iterator->next_seq) {
1768 0         0 $total_count++;
1769            
1770             # skip features from excluded chromosomes
1771 0 0 0     0 next if (defined $chr_exclude and $feature->seq_id =~ $chr_exclude);
1772            
1773             # Record the feature information
1774             # in the B::DB::SF::S database, the primary_ID is a number unique to
1775             # the specific database, and is not portable between databases
1776 0         0 $data->add_row( [
1777             $feature->primary_id,
1778             $feature->display_name,
1779             $feature->type,
1780             ] );
1781             }
1782            
1783             # print result of search
1784 0         0 printf " Found %s features in the database\n", format_with_commas($total_count);
1785 0         0 printf " Kept %s features.\n", format_with_commas($data->last_row);
1786            
1787             # return the new data structure
1788 0         0 return 1;
1789             }
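
# Example (a sketch, assuming an empty Bio::ToolBox::Data object in $data;
# the database name and exclusion regex are hypothetical):
#
#   get_new_feature_list(
#       data    => $data,
#       db      => 'annotation.db',
#       feature => 'gene',
#       chrskip => 'chrM',
#   ) or warn "no features collected";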
1790              
1791              
1792             ### Generate a new list genomic windows
1793             sub get_new_genome_list {
1794              
1795             # Collect the passed arguments
1796 0     0 1 0 my %args = @_;
1797            
1798             # check data object
1799 0   0     0 my $data = $args{data} || undef;
1800 0 0       0 unless ($data) {
1801 0         0 confess "must pass a 'data' key and Bio::ToolBox::Data object!";
1802 0         0 return;
1803             }
1804 0 0       0 unless (ref($data) eq 'Bio::ToolBox::Data') {
1805 0         0 confess 'must pass a Bio::ToolBox::Data object!';
1806 0         0 return;
1807             }
1808            
1809             # Open a db connection
1810 0   0     0 $args{'db'} ||= undef;
1811 0         0 my $db = open_db_connection($args{'db'});
1812 0 0       0 unless ($db) {
1813 0         0 carp 'no database connected!';
1814 0         0 return;
1815             }
1816            
1817             # Determine win and step sizes
1818 0   0     0 $args{'win'} ||= 500; # hard coded default size
1819 0   0     0 $args{'step'} ||= $args{'win'};
1820            
1821            
1822             # Prepare data structures
1823 0         0 my $chr_i = $data->add_column('Chromosome');
1824 0         0 my $start_i = $data->add_column('Start');
1825 0         0 my $stop_i = $data->add_column('Stop');
1826 0         0 $data->metadata($start_i, 'win', $args{'win'});
1827 0         0 $data->metadata($start_i, 'step', $args{'step'});
1828            
1829             # Collect the chromosomes
1830 0   0     0 my $chr_exclude = $args{'skip'} || $args{'chrskip'} || undef;
1831 0         0 my @chromosomes = get_chromosome_list($db, $chr_exclude);
1832 0 0       0 unless (@chromosomes) {
1833 0         0 carp " no sequence IDs were found in the database!\n";
1834 0         0 return;
1835             }
1836            
1837             # Collect the genomic windows
1838 0         0 print " Generating $args{win} bp windows in $args{step} bp increments\n";
1839 0         0 foreach (@chromosomes) {
1840            
1841             # name and length as sub-array in each element
1842 0         0 my ($chr, $length) = @{$_};
  0         0  
1843            
1844 0         0 for (my $start = 1; $start <= $length; $start += $args{step}) {
1845 0         0 my $end = $start + $args{win} - 1;
1846 0 0       0 if ($end > $length) {
1847             # fix end to the length of the chromosome
1848 0         0 $end = $length;
1849             }
1850 0         0 $data->add_row( [ $chr, $start, $end] );
1851             }
1852             }
1853 0         0 print " Kept " . $data->{'last_row'} . " windows.\n";
1854            
1855             # Return the data structure
1856 0         0 return 1;
1857             }
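
# Example (a sketch, assuming a Bio::ToolBox::Data object in $data; the
# database name is hypothetical):
#
#   get_new_genome_list(
#       data => $data,
#       db   => 'genome.fa',
#       win  => 1000,   # 1 kb windows
#       step => 500,    # overlapping windows every 500 bp
#   );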
1858              
1859              
1860             sub get_db_feature {
1861 0     0 1 0 my %args = @_;
1862            
1863             # Open a db connection
1864 0   0     0 $args{db} ||= undef;
1865 0 0       0 unless ($args{db}) {
1866 0         0 croak 'no database provided for getting a feature!';
1867             }
1868 0         0 my $db = open_db_connection($args{db});
1869 0         0 my $db_ref = ref $db;
1870            
1871             # get the name of the feature
1872 0   0     0 my $name = $args{'name'} || undef;
1873            
1874             # check for values and internal nulls
1875 0 0       0 $args{'id'} = exists $args{'id'} ? $args{'id'} : undef;
1876 0   0     0 $args{'type'} ||= undef;
1877 0 0 0     0 undef $name if $name and $name eq '.';
1878 0 0 0     0 undef $args{'id'} if $args{'id'} and $args{'id'} eq '.';
1879 0 0 0     0 undef $args{'type'} if $args{'type'} and $args{'type'} eq '.';
1880            
1881             # quick method for feature retrieval
1882 0 0 0     0 if (defined $args{'id'} and $db->can('fetch')) {
1883             # we have a unique primary_id to identify the feature
1884             # usually this pulls out the feature directly
1885 0   0     0 my $feature = $db->fetch($args{'id'}) || undef;
1886            
1887             # check that the feature we pulled out is what we want
1888 0 0       0 my $check = $feature ? 1 : 0;
1889 0 0       0 if ($check) {
1890 0 0 0     0 $check = 0 if (defined $name and $feature->display_name ne $name);
1891 0 0 0     0 $check = 0 if (defined $args{'type'} and $feature->type ne $args{'type'});
1892             }
1893            
1894             # return if things match up as best we can tell
1895 0 0       0 if ($check) {
1896 0         0 return $feature;
1897             }
1898             else {
1899             # the primary_ids are out of date
1900 0 0       0 unless ($primary_id_warning) {
1901 0         0 warn "CAUTION: Some primary IDs in the input file appear to be out of date\n";
1902 0         0 $primary_id_warning++;
1903             }
1904             }
1905             }
1906            
1907             # otherwise use name and type
1908 0 0       0 return unless $name; # otherwise db will return all features! Not what we want!
1909             my @features = $db->features(
1910             -name => $name, # we assume this name will be unique
1911             -aliases => 0,
1912 0         0 -type => $args{'type'},
1913             );
1914            
1915             # if none are found
1916 0 0       0 unless (@features) {
1917             # try again with aliases enabled
1918             @features = $db->features(
1919             -name => $name,
1920             -aliases => 1,
1921 0         0 -type => $args{'type'},
1922             );
1923             }
1924 0 0 0     0 if (!@features and $name =~ /[;,\|]/) {
1925             # I used to append aliases to the end of the name in old versions of biotoolbox
1926             # crazy, I know, but just in case this happened, let's try breaking them apart
1927 0         0 my $name2 = (split(/\s*[;,\|]\s*/, $name))[0];
1928             # multiple names present using common delimiters ;,|
1929             # take the first name only, assume others are aliases that we don't need
1930             @features = $db->features(
1931             -name => $name2,
1932             -aliases => 1,
1933 0         0 -type => $args{'type'},
1934             );
1935             }
1936            
1937             # check the number of features returned
1938 0 0       0 if (scalar @features > 1) {
    0          
1939             # there should only be one feature found
1940             # if more are returned, there's redundant or duplicated data in the db
1941            
1942             # first check whether addition of aliases may improve accuracy
1943 0 0       0 if ($args{'name'} =~ /;/) {
1944             # we have presumed aliases in the name value
1945 0         0 my $check = $args{'name'};
1946 0         0 $check =~ s/\s*;\s*/;/g; # strip spaces just in case
1947            
1948             # check each feature and try to match name and aliases
1949 0         0 my @candidates;
1950 0         0 foreach my $f (@features) {
1951 0         0 my $f_name = join(';', $f->display_name, ($f->get_tag_values('Alias')));
1952 0 0       0 if ($check eq $f_name) {
1953             # found one
1954 0         0 push @candidates, $f;
1955             }
1956             }
1957            
1958 0 0       0 if (scalar @candidates == 1) {
    0          
1959             # we found it!
1960 0         0 return shift @candidates;
1961             }
1962             elsif (scalar @candidates > 1) {
1963             # hopefully we've improved a little bit?
1964 0         0 @features = @candidates;
1965             }
1966             }
1967            
1968             # warn the user, this should be fixed
1969             printf " Found %s %s features named '$name' in the database! Using first one.\n",
1970 0         0 scalar(@features), $args{'type'};
1971             }
1972             elsif (!@features) {
1973 0         0 printf " Found no %s features named '$name' in the database!\n", $args{'type'};
1974 0         0 return;
1975             }
1976            
1977             # done
1978 0         0 return shift @features;
1979             }
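
# Example (a sketch; all values are hypothetical):
#
#   my $feature = get_db_feature(
#       db   => 'annotation.db',
#       id   => 1234,        # primary_id, tried first via fetch()
#       name => 'YAL001C',   # fallback lookup by name
#       type => 'gene',
#   );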
1980              
1981              
1982             ### Get a dataset score for a single region
1983             sub get_segment_score {
1984             # parameters passed as an array
1985             # we will be passing this array on as a reference to the appropriate
1986             # imported helper subroutine
1987             # chromosome, start, stop, strand, strandedness, method, return type, db, dataset
1988 12 50   12 1 29 confess "incorrect number of parameters passed!" unless scalar @_ == 9;
1989            
1990             # check the database
1991 12 100 66     57 $_[DB] = open_db_connection($_[DB]) if ($_[DB] and not ref($_[DB]));
1992            
1993             # check for combined datasets
1994 12 50       36 if ($_[DATA] =~ /&/) {
1995 0         0 push @_, (split '&', pop @_);
1996             }
1997            
1998             # determine method
1999             # yes, we're only checking the first dataset, but they should all
2000             # be the same type
2001 12   66     83 my $db_method = $DB_METHODS{$_[METH]}{$_[RETT]}{$_[DB]}{$_[DATA]} ||
2002             _lookup_db_method(\@_);
2003            
2004             # return type values
2005             # 0 = calculate score
2006             # 1 = score array
2007             # 2 = hash position scores
2008             # immediately return calculated score if appropriate
2009 12 100       29 if ($_[RETT] > 0) {
2010             # immediately return either indexed hash or array of scores
2011 5         8 return &{$db_method}(\@_);
  5         19  
2012             }
2013             else {
2014             # calculate a score
2015 7         13 my $scores = &{$db_method}(\@_);
  7         19  
2016             # this might be an array reference of scores that need to be combined
2017             # or it could be a single scalar score which just needs to be returned
2018 7 50       32 if (ref($scores)) {
2019 7         19 return calculate_score($_[METH], $scores);
2020             }
2021             else {
2022 0 0       0 if (not defined $scores) {
2023 0 0       0 return 0 if $_[METH] =~ /count|sum/;
2024 0         0 return '.';
2025             }
2026 0         0 return $scores;
2027             }
2028             }
2029             }
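
# Example (a sketch; the nine positional parameters follow the order noted
# in the comment above, and the file name and values are hypothetical):
#
#   my $mean = get_segment_score(
#       'chrI', 1000, 2000,        # chromosome, start, stop
#       1, 0,                      # strand, strandedness
#       'mean', 0,                 # method, return type (0 = single score)
#       undef, 'file:scores.bw',   # database (unused here), dataset
#   );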
2030              
2031              
2032             sub calculate_score {
2033 7     7 1 15 my ($method, $scores) = @_;
2034 7   50     47 $scores ||= []; # just in case
2035            
2036             # calculate a score based on the method
2037 7 100 100     49 if ($method eq 'mean') {
    50          
    100          
    50          
    50          
    100          
    50          
    0          
    0          
    0          
2038 1   50     29 return sum0(@$scores)/(scalar(@$scores) || 1);
2039             }
2040             elsif ($method eq 'sum') {
2041 0         0 return sum0(@$scores);
2042             }
2043             elsif ($method eq 'median') {
2044 3 50       13 return '.' unless scalar(@$scores);
2045 3         23 return median(@$scores);
2046             }
2047             elsif ($method eq 'min') {
2048 0 0       0 return '.' unless scalar(@$scores);
2049 0         0 return min(@$scores);
2050             }
2051             elsif ($method eq 'max') {
2052 0 0       0 return '.' unless scalar(@$scores);
2053 0         0 return max(@$scores);
2054             }
2055             elsif ($method eq 'count' or $method eq 'pcount') {
2056 2         29 return scalar(@$scores);
2057             }
2058             elsif ($method eq 'ncount') {
2059             # Convert names into unique counts
2060 1         2 my %name2count;
2061 1         5 foreach my $s (@$scores) {
2062 437 50       655 if (ref($s) eq 'ARRAY') {
2063             # this is likely from a ncount indexed hash
2064 0         0 foreach (@$s) {
2065 0         0 $name2count{$_} += 1;
2066             }
2067             }
2068             else {
2069 437         1104 $name2count{$s} += 1;
2070             }
2071             }
2072 1         73 return scalar(keys %name2count);
2073             }
2074             elsif ($method eq 'range') {
2075             # the range value is 'min-max'
2076 0 0       0 return '.' unless scalar(@$scores);
2077 0         0 return range(@$scores);
2078             }
2079             elsif ($method eq 'stddev') {
2080             # we are using the standard deviation of the population,
2081             # since these are the only scores we are considering
2082 0 0       0 return '.' unless scalar(@$scores);
2083 0         0 return stddevp(@$scores);
2084             }
2085             elsif ($method =~ /rpk?m/) {
2086 0         0 confess " The rpm methods have been deprecated due to their complexity " .
2087             "and the variable ways of calculating the value. Collect counts and " .
2088             "calculate the value your preferred way.\n";
2089             }
2090             else {
2091             # somehow bad method snuck past our checks
2092 0         0 confess " unrecognized method '$method'!";
2093             }
2094             }
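
# Example (a sketch):
#
#   my $mean   = calculate_score('mean',   [1, 2, 3]);  # 2
#   my $median = calculate_score('median', [1, 2, 9]);  # 2
#   my $count  = calculate_score('count',  [1, 2, 9]);  # 3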
2095              
2096              
2097             sub get_chromosome_list {
2098            
2099             # options
2100 1     1 1 585 my $database = shift;
2101 1   50     7 my $chr_exclude = shift || undef;
2102            
2103             # Open a db connection
2104 1         3 my $db = open_db_connection($database);
2105 1 50       3 unless ($db) {
2106 0         0 carp 'no database connected!';
2107 0         0 return;
2108             }
2109            
2110             # Check for BigWigSet database
2111             # these need to be handled a little differently
2112 1 50       4 if (ref($db) =~ /BigWigSet/) {
2113             # BigWigSet databases are the only databases that don't
2114             # support the seq_ids method
2115             # instead we have to look at one of the bigwigs in the set
2116 0         0 my $bw_file = ($db->bigwigs)[0];
2117 0         0 $db = open_db_connection($bw_file);
2118             }
2119            
2120            
2121             # Collect chromosome lengths
2122             # we want to maintain the original order of the chromosomes so we
2123             # won't be putting it into a hash
2124             # instead an array of arrays
2125 1         2 my @chrom_lengths;
2126            
2127             # Database specific approaches to collecting chromosomes
2128             # I used to have one generic approach, but idiosyncrasies and potential
2129             # bugs make me use different approaches for better consistency
2130 1         2 my $type = ref($db);
2131            
2132             # SeqFeature::Store
2133 1 50 33     24 if ($type =~ /^Bio::DB::SeqFeature::Store/) {
    50 33        
    50          
    50          
    50          
    50          
    50          
2134 0         0 for my $chr ($db->seq_ids) {
2135            
2136             # check for excluded chromosomes
2137 0 0 0     0 next if (defined $chr_exclude and $chr =~ /$chr_exclude/i);
2138            
2139             # get chromosome size
2140 0         0 my ($seqf) = $db->get_features_by_name($chr);
2141 0 0       0 my $length = $seqf ? $seqf->length : 0;
2142            
2143             # store
2144 0         0 push @chrom_lengths, [ $chr, $length ];
2145             }
2146             }
2147            
2148             # libBigWig Bigfile, including both bigWig and bigBed
2149             elsif ($type eq 'Bio::DB::Big::File') {
2150             # this doesn't follow the typical BioPerl convention
2151             # it's a hash, so the order is effectively random and never matches the file order
2152 0         0 my $chroms = $db->chroms();
2153             # so we'll sort the chromosomes by decreasing length
2154             # this is common for a lot of genomes anyway, except for yeast
2155 0         0 foreach (
2156 0         0 sort { $b->[1] <=> $a->[1] }
2157 0         0 map { [$_->{name}, $_->{length}] }
2158             values %$chroms
2159             ) {
2160             # check for excluded chromosomes
2161 0 0 0     0 next if (defined $chr_exclude and $_->[0] =~ /$chr_exclude/i);
2162 0         0 push @chrom_lengths, $_;
2163             }
2164             }
2165            
2166             # UCSC kent Bigfile
2167             elsif ($type eq 'Bio::DB::BigWig' or $type eq 'Bio::DB::BigBed') {
2168 0         0 foreach my $chr ($db->seq_ids) {
2169            
2170             # check for excluded chromosomes
2171 0 0 0     0 next if (defined $chr_exclude and $chr =~ $chr_exclude);
2172            
2173             # get chromosome size
2174 0         0 my $length = $db->length($chr);
2175            
2176             # store
2177 0         0 push @chrom_lengths, [ $chr, $length ];
2178             }
2179             }
2180            
2181             # Samtools Bam
2182             elsif ($type eq 'Bio::DB::Sam' or $type eq 'Bio::DB::HTS') {
2183 0         0 for my $tid (0 .. $db->n_targets - 1) {
2184             # each chromosome is internally represented in the bam file as
2185             # a numeric target identifier
2186             # we can easily convert this to an actual sequence name
2187             # we will force the conversion to go one chromosome at a time
2188            
2189             # sequence info
2190 0         0 my $chr = $db->target_name($tid);
2191 0         0 my $length = $db->target_len($tid);
2192            
2193             # check for excluded chromosomes
2194 0 0 0     0 next if (defined $chr_exclude and $chr =~ $chr_exclude);
2195            
2196             # store
2197 0         0 push @chrom_lengths, [ $chr, $length ];
2198             }
2199             }
2200            
2201             # Fasta through modern HTS fai index
2202             elsif ($type eq 'Bio::DB::HTS::Faidx') {
2203 0         0 for my $chr ($db->get_all_sequence_ids) {
2204             # check for excluded
2205 0 0 0     0 next if (defined $chr_exclude and $chr =~ $chr_exclude);
2206            
2207             # get length and store it
2208 0         0 my $length = $db->length($chr);
2209 0         0 push @chrom_lengths, [$chr, $length];
2210             }
2211             }
2212            
2213             # Fasta through old BioPerl adapter
2214             elsif ($type eq 'Bio::DB::Fasta') {
2215 0         0 for my $chr ($db->get_all_ids) {
2216            
2217             # check for excluded chromosomes
2218 0 0 0     0 next if (defined $chr_exclude and $chr =~ $chr_exclude);
2219            
2220             # get chromosome size
2221 0         0 my $seq = $db->get_Seq_by_id($chr);
2222 0 0       0 my $length = $seq ? $seq->length : 0;
2223            
2224             # store
2225 0         0 push @chrom_lengths, [ $chr, $length ];
2226             }
2227             }
2228            
2229             # some other BioPerl adapter?
2230             elsif ($db->can('seq_ids')) {
2231 1         7 foreach my $chr ($db->seq_ids) {
2232            
2233             # check for excluded chromosomes
2234 1 50 33     27 next if (defined $chr_exclude and $chr =~ $chr_exclude);
2235            
2236             # generate a segment representing the chromosome
2237             # due to fuzzy name matching, we may get more than one back
2238 1         9 my @segments = $db->segment($chr);
2239             # need to find the right one
2240 1         230 my $segment;
2241 1         4 while (@segments) {
2242 1         2 $segment = shift @segments;
2243 1 50       14 last if $segment->seq_id eq $chr;
2244             }
2245            
2246             # check segment
2247 1 50       78 unless ($segment) {
2248 0         0 carp " No genome segment for seq_id $chr!!?? skipping\n";
2249 0         0 next;
2250             }
2251            
2252             # get the chromosome length
2253 1         14 my $length = $segment->length;
2254            
2255             # store
2256 1         43 push @chrom_lengths, [ $chr, $length ];
2257             }
2258             }
2259            
2260             else {
2261 0         0 carp " Unable to identify chromosomes in database type '$type'. Try using a different database file or adapter\n";
2262 0         0 return;
2263             }
2264            
2265             # Return
2266 1 50       4 unless (@chrom_lengths) {
2267 0         0 carp " no chromosome sequences identified in database!\n";
2268 0         0 return;
2269             }
2270 1         4 return @chrom_lengths;
2271             }
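
# Example (a sketch; the file name and exclusion regex are hypothetical):
#
#   foreach ( get_chromosome_list('genome.fa', 'chrM|scaffold') ) {
#       my ($name, $length) = @$_;   # each element is [name, length]
#   }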
2272              
2273              
2274             sub low_level_bam_fetch {
2275 0 0   0 1 0 confess "incorrect number of parameters passed!" unless scalar @_ == 6;
2276 0         0 my ($sam, $tid, $start, $stop, $callback, $data) = @_;
2277             # run the low-level bam fetch based on which adapter is being used
2278 0 0       0 unless ($BAM_ADAPTER) {
2279 0 0       0 $BAM_ADAPTER = ref($sam) =~ /hts/i ? 'hts' : 'sam';
2280             }
2281 0 0       0 if ($BAM_ADAPTER eq 'hts') {
    0          
2282             # using Bio::DB::HTS
2283 0         0 my $index = $sam->hts_index;
2284 0 0       0 return unless $index;
2285 0         0 return $index->fetch($sam->hts_file, $tid, $start, $stop, $callback, $data);
2286             }
2287             elsif ($BAM_ADAPTER eq 'sam') {
2288             # using Bio::DB::Sam
2289 0         0 my $index = $sam->bam_index;
2290 0 0       0 return unless $index;
2291 0         0 return $index->fetch($sam->bam, $tid, $start, $stop, $callback, $data);
2292             }
2293             else {
2294 0         0 confess "no bam adapter loaded!\n";
2295             }
2296             }
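
# Example (a sketch, assuming an opened bam object in $sam and a numeric
# target id in $tid; coordinates follow the 0-based low-level convention):
#
#   my $count = 0;
#   low_level_bam_fetch($sam, $tid, 999, 2000,
#       sub { $count++ },   # called once per overlapping alignment
#       \$count );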
2297              
2298              
2299             sub low_level_bam_coverage {
2300 0 0   0 1 0 confess "incorrect number of parameters passed!" unless scalar @_ == 4;
2301 0         0 my ($sam, $tid, $start, $stop) = @_;
2302             # run the low-level bam coverage based on which adapter is being used
2303 0 0       0 unless ($BAM_ADAPTER) {
2304 0 0       0 $BAM_ADAPTER = ref($sam) =~ /hts/i ? 'hts' : 'sam';
2305             }
2306 0 0       0 if ($BAM_ADAPTER eq 'hts') {
    0          
2307             # using Bio::DB::HTS
2308 0         0 my $index = $sam->hts_index;
2309 0 0       0 return unless $index;
2310 0         0 return $index->coverage($sam->hts_file, $tid, $start, $stop);
2311             }
2312             elsif ($BAM_ADAPTER eq 'sam') {
2313             # using Bio::DB::Sam
2314 0         0 my $index = $sam->bam_index;
2315 0 0       0 return unless $index;
2316 0         0 return $index->coverage($sam->bam, $tid, $start, $stop);
2317             }
2318             else {
2319 0         0 confess "no bam adapter loaded!\n";
2320             }
2321             }
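
# Example (a sketch, assuming an opened bam object in $sam and a numeric
# target id in $tid):
#
#   my @depths = low_level_bam_coverage($sam, $tid, 999, 2000);
#   # per-base read depths across the interval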
2322              
2323              
2324             sub get_genomic_sequence {
2325 0 0   0 1 0 confess "incorrect number of parameters passed!" unless scalar @_ == 4;
2326 0         0 my ($db, $chrom, $start, $stop) = @_;
2327            
2328             # check database
2329 0         0 my $type = ref($db);
2330 0 0       0 $db = open_db_connection($db) if not $type;
2331            
2332             # return sequence based on the type of database adapter we're using
2333 0 0       0 if ($type eq 'Bio::DB::HTS::Faidx') {
    0          
    0          
2334 0         0 return $db->get_sequence_no_length("$chrom:$start-$stop");
2335             }
2336             elsif ($type eq 'Bio::DB::Sam::Fai') {
2337 0         0 return $db->fetch("$chrom:$start-$stop");
2338             }
2339             elsif ($db->can('seq')) {
2340             # BioPerl database including Bio::DB::Fasta and Bio::DB::SeqFeature::Store
2341 0         0 return $db->seq($chrom, $start, $stop);
2342             }
2343             else {
2344 0         0 confess "unsupported database $type for collecting genomic sequence!\n";
2345             }
2346             }
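
# Example (a sketch; 'genome.fa' is a hypothetical indexed fasta):
#
#   my $db  = open_db_connection('genome.fa');
#   my $seq = get_genomic_sequence($db, 'chrI', 1000, 1500);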
2347              
2348              
2349             ### deprecated methods
2350             # just in case
2351             sub validate_included_feature {
2352 0     0 0 0 confess "validate_included_feature() is no longer a valid method. " .
2353             "Please update your script to the current API.\n";
2354             }
2355              
2356              
2357             ### Internal subroutine to retrieve the scores from an established region object
2358             sub _lookup_db_method {
2359             # parameters passed as an array reference
2360 9     9   18 my $param = shift;
2361             # determine the appropriate score method
2362 9         12 my $score_method;
2363 9 50       38 if ($param->[DATA] =~ /^file|http|ftp/) {
    0          
    0          
2364             # collect the data according to file type
2365            
2366             # BigWig Data file
2367 9 50       81 if ($param->[DATA] =~ /\.(?:bw|bigwig)$/i) {
    50          
    50          
    50          
2368             # file is in bigwig format
2369             # get the dataset scores using Bio::ToolBox::db_helper::bigwig
2370             # this uses either Bio::DB::BigWig or Bio::DB::Big
2371            
2372             # check that we have bigwig support
2373 0 0       0 $BIGWIG_OK = _load_bigwig_helper_module() unless $BIGWIG_OK;
2374 0 0       0 if ($BIGWIG_OK) {
2375 0 0       0 if ($param->[RETT] == 2) {
    0          
    0          
    0          
2376 0         0 $score_method = \&collect_bigwig_position_scores;
2377             }
2378             elsif ($param->[RETT] == 1) {
2379 0         0 $score_method = \&collect_bigwig_scores;
2380             }
2381             elsif ($param->[METH] =~ /min|max|mean/) {
2382 0         0 $score_method = \&collect_bigwig_score;
2383             }
2384             elsif ($param->[METH] =~ /sum|count/) {
2385             # difference between ucsc and libBigWig libraries
2386 0 0       0 $score_method = $BIG_ADAPTER eq 'ucsc' ?
2387             \&collect_bigwig_score : \&collect_bigwig_scores;
2388             }
2389             else {
2390 0         0 $score_method = \&collect_bigwig_scores;
2391             }
2392             }
2393             else {
2394 0         0 croak " BigWig support is not enabled! Is Bio::DB::Big or Bio::DB::BigFile installed?";
2395             }
2396             }
2397            
2398             # BigBed Data file
2399             elsif ($param->[DATA] =~ /\.(?:bb|bigbed)$/i) {
2400             # data is in bigbed format
2401             # get the dataset scores using Bio::ToolBox::db_helper::bigbed
2402             # this uses either Bio::DB::BigBed or Bio::DB::Big
2403            
2404             # check that we have bigbed support
2405 0 0       0 $BIGBED_OK = _load_bigbed_helper_module() unless $BIGBED_OK;
2406 0 0       0 if ($BIGBED_OK) {
2407 0 0       0 if ($param->[RETT] == 2) {
2408 0         0 $score_method = \&collect_bigbed_position_scores;
2409             }
2410             else {
2411 0         0 $score_method = \&collect_bigbed_scores;
2412             }
2413             }
2414             else {
2415 0         0 croak " BigBed support is not enabled! Is Bio::DB::Big or Bio::DB::BigFile installed?";
2416             }
2417             }
2418            
2419             # BAM data file
2420             elsif ($param->[DATA] =~ /\.bam$/i) {
2421             # data is in bam format
2422             # get the dataset scores using Bio::ToolBox::db_helper::bam
2423             # this uses the Bio::DB::Sam or Bio::DB::HTS adaptor
2424            
2425             # check that we have Bam support
2426 0 0       0 _load_bam_helper_module() unless $BAM_OK;
2427 0 0       0 if ($BAM_OK) {
2428 0         0 $score_method = \&collect_bam_scores;
2429             }
2430             else {
2431 0         0 croak " Bam support is not enabled! " .
2432             "Is Bio::DB::Sam or Bio::DB::HTS installed?\n";
2433             }
2434             }
2435            
2436             # USeq Data file
2437             elsif ($param->[DATA] =~ /\.useq$/i) {
2438             # data is in useq format
2439             # get the dataset scores using Bio::ToolBox::db_helper::useq
2440             # this uses the Bio::DB::USeq adaptor
2441            
2442             # check that we have useq support
2443 9 50       22 $USEQ_OK = _load_helper_module('Bio::ToolBox::db_helper::useq')
2444             unless $USEQ_OK;
2445 9 50       17 if ($USEQ_OK) {
2446 9 100       19 if ($param->[RETT] == 2) {
2447 4         10 $score_method = \&collect_useq_position_scores;
2448             }
2449             else {
2450 5         10 $score_method = \&collect_useq_scores;
2451             }
2452             }
2453             else {
2454 0         0 croak " USeq support is not enabled! " .
2455             "Is Bio::DB::USeq installed?\n";
2456             }
2457             }
2458            
2459             # Unsupported Data file
2460             else {
2461 0         0 confess sprintf " Unsupported or unrecognized file type for file '%s'!\n",
2462             $param->[DATA];
2463             }
2464             }
2465            
2466            
2467             ### BigWigSet database
2468             elsif (ref($param->[DB]) =~ m/BigWigSet/) {
2469             # calling features from a BigWigSet database object
2470             # this uses either Bio::DB::BigWig or Bio::DB::Big
2471            
2472             # check that we have bigwig support
2473             # duh! we should, we probably opened the stupid database!
2474 0 0       0 $BIGWIG_OK = _load_bigwig_helper_module() unless $BIGWIG_OK;
2475 0 0       0 croak " BigWigSet support is not enabled! Is Bio::DB::Big or Bio::DB::BigFile installed?"
2476             unless $BIGWIG_OK;
2477            
2478             # the data collection depends on the method
2479 0 0       0 if ($param->[RETT] == 2) {
    0          
    0          
2480 0         0 $score_method = \&collect_bigwigset_position_scores;
2481             }
2482             elsif ($param->[RETT] == 1) {
2483 0         0 $score_method = \&collect_bigwigset_scores;
2484             }
2485             elsif ($param->[METH] =~ /min|max|mean|sum|count/) {
2486             # difference between ucsc and libBigWig libraries
2487             # the ucsc library works with summaries and we can handle multiple of these
2488             # but the big adapter doesn't
2489 0 0       0 $score_method = $BIG_ADAPTER eq 'ucsc' ?
2490             \&collect_bigwigset_score : \&collect_bigwigset_scores;
2491             }
2492             else {
2493 0         0 $score_method = \&collect_bigwigset_scores;
2494             }
2495             }
2496            
2497              
2498             ### BioPerl style database
2499             elsif (ref($param->[DB]) =~ m/^Bio::DB/) {
2500             # a BioPerl style database, including Bio::DB::SeqFeature::Store
2501             # most or all Bio::DB databases support generic get_seq_stream() methods
2502             # that return seqfeature objects, which we can use in a generic sense
2503            
2504             # check that we have support
2505             # duh! we should, we probably opened the stupid database!
2506 0 0       0 $SEQFASTA_OK = _load_helper_module('Bio::ToolBox::db_helper::seqfasta')
2507             unless $SEQFASTA_OK;
2508 0 0       0 if ($SEQFASTA_OK) {
2509             # get the dataset scores using Bio::ToolBox::db_helper::seqfasta
2510             # check that we support methods
2511 0 0       0 unless ($param->[DB]->can('get_seq_stream')) {
2512 0         0 confess sprintf "unsupported database! cannot use %s as it does not support get_seq_stream method or equivalent",
2513             ref($param->[DB]);
2514             }
2515 0         0 $score_method = \&collect_store_scores;
2516             }
2517             else {
2518 0         0 croak " SeqFeature Store support is not enabled! " .
2519             "Is BioPerl and Bio::DB::SeqFeature::Store properly installed?\n";
2520             }
2521             }
2522            
2523            
2524             ### Some other database?
2525             else {
2526 0 0       0 confess "no recognizable dataset provided!" unless $param->[DATA];
2527 0 0       0 confess "no database passed!" unless $param->[DB];
2528 0         0 confess "something went wrong! parameters: @$param";
2529             }
2530            
2531            
2532             ### Cache and return the results
2533 9         36 $DB_METHODS{$param->[METH]}{$param->[RETT]}{$param->[DB]}{$param->[DATA]} =
2534             $score_method;
2535 9         26 return $score_method;
2536             }
2537              
2538              
2539             sub _load_helper_module {
2540 1     1   3 my $class = shift;
2541 1         2 my $success = 0;
2542 1         1 eval {
2543 1         48 load $class; # using Module::Load to dynamically load modules
2544 1         50 $class->import; # explicit import is required, particularly for these helper modules
2545 1         3 $success = 1;
2546             };
2547 1         3 return $success;
2548             }
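# _load_helper_module() is a guarded runtime require: Module::Load's load()
# compiles the named class on demand, and the explicit import() call pulls
# its exported symbols into this package. A self-contained sketch of the
# same idiom, with a placeholder module name:
#
#   use Module::Load;
#   my $ok = 0;
#   eval {
#       load 'Some::Optional::Module';    # dies if the module is missing
#       Some::Optional::Module->import;   # bring in exported functions
#       $ok = 1;
#   };
#   warn "optional support disabled: $@" unless $ok;
#
# Wrapping both steps in one eval means a failure at either stage leaves
# $ok false and the error text in $@ for diagnostics.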
2549              
2550              
2551             sub _load_bam_helper_module {
2552 0 0   0     if ($BAM_ADAPTER) {
2553 0 0         if ($BAM_ADAPTER =~ /sam/i) {
2554 0           $BAM_OK = _load_helper_module('Bio::ToolBox::db_helper::bam');
2555 0           $BAM_ADAPTER = 'sam'; # for internal consistency
2556             }
2557             elsif ($BAM_ADAPTER =~ /hts/i) {
2558 0           $BAM_OK = _load_helper_module('Bio::ToolBox::db_helper::hts');
2559 0           $BAM_ADAPTER = 'hts'; # for internal consistency
2560             }
2561             elsif ($BAM_ADAPTER =~ /none/i) {
2562             # basically for testing purposes, don't use a module
2563 0           return 0;
2564             }
2565             else {
2566             # unrecognized
2567 0           $@ = 'unrecognized';
2568             }
2569 0 0         if (not $BAM_OK) {
2570 0           print "Requested '$BAM_ADAPTER' adapter could not be loaded: $@\n";
2571             }
2572             }
2573             else {
2574             # try hts first, then sam
2575             # be sure to set BAM_ADAPTER upon success
2576 0           $BAM_OK = _load_helper_module('Bio::ToolBox::db_helper::hts');
2577 0 0         if ($BAM_OK) {
2578 0           $BAM_ADAPTER = 'hts';
2579             }
2580             else {
2581 0           $BAM_OK = _load_helper_module('Bio::ToolBox::db_helper::bam');
2582 0 0         $BAM_ADAPTER = 'sam' if $BAM_OK;
2583             }
2584             }
2585 0           return $BAM_OK;
2586             }
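# The loader honors a pre-set $BAM_ADAPTER preference and otherwise probes
# the hts helper (backed by Bio::DB::HTS) before falling back to the older
# sam helper (backed by Bio::DB::Sam). A hedged usage sketch, assuming
# $BAM_ADAPTER is the package global referenced above:
#
#   # force a specific adapter before any Bam file is opened
#   $Bio::ToolBox::db_helper::BAM_ADAPTER = 'hts';   # or 'sam', or 'none'
#   _load_bam_helper_module() or warn "no Bam adapter could be loaded";
#
# Setting the value to 'none' disables Bam support entirely, which the
# comments above indicate is intended for testing.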
2587              
2588              
2589             sub _load_bigwig_helper_module {
2590 0 0   0     if ($BIG_ADAPTER) {
2591 0 0         if ($BIG_ADAPTER =~ /ucsc|kent/i) {
2592 0           $BIGWIG_OK = _load_helper_module('Bio::ToolBox::db_helper::bigwig');
2593 0           $BIG_ADAPTER = 'ucsc'; # for internal consistency
2594             }
2595             elsif ($BIG_ADAPTER =~ /big/i) {
2596 0           $BIGWIG_OK = _load_helper_module('Bio::ToolBox::db_helper::big');
2597 0           $BIGBED_OK = $BIGWIG_OK; # bigbed too!
2598 0           $BIG_ADAPTER = 'big'; # for internal consistency
2599             }
2600             elsif ($BIG_ADAPTER =~ /none/i) {
2601             # basically for testing purposes, don't use a module
2602 0           return 0;
2603             }
2604             else {
2605             # unrecognized
2606 0           $@ = 'unrecognized';
2607             }
2608 0 0         if (not $BIGWIG_OK) {
2609 0           print "Requested '$BIG_ADAPTER' adapter could not be loaded: $@\n";
2610             }
2611             }
2612             else {
2613             # we have to try each one out
2614             # try the modern big adapter first
2615 0           $BIGWIG_OK = _load_helper_module('Bio::ToolBox::db_helper::big');
2616 0 0         if ($BIGWIG_OK) {
2617             # success!
2618 0           $BIGBED_OK = $BIGWIG_OK; # bigbed too!
2619 0 0         $BIG_ADAPTER = 'big' if $BIGWIG_OK;
2620             }
2621             else {
2622 0           $BIGWIG_OK = _load_helper_module('Bio::ToolBox::db_helper::bigwig');
2623 0 0         $BIG_ADAPTER = 'ucsc' if $BIGWIG_OK;
2624             }
2625             }
2626 0           return $BIGWIG_OK;
2627             }
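# Note the coupling here: the 'big' adapter (Bio::DB::Big, backed by
# libBigWig) handles both bigWig and bigBed files through a single library,
# so a successful load sets $BIGBED_OK alongside $BIGWIG_OK. The 'ucsc'
# path, by contrast, loads the bigwig and bigbed wrappers as separate
# modules, which is why _load_bigbed_helper_module() below mirrors this
# routine.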
2628              
2629             sub _load_bigbed_helper_module {
2630 0 0   0     if ($BIG_ADAPTER) {
2631 0 0         if ($BIG_ADAPTER =~ /ucsc|kent/i) {
2632 0           $BIGBED_OK = _load_helper_module('Bio::ToolBox::db_helper::bigbed');
2633 0           $BIG_ADAPTER = 'ucsc'; # for internal consistency
2634             }
2635             elsif ($BIG_ADAPTER =~ /big/i) {
2636 0           $BIGBED_OK = _load_helper_module('Bio::ToolBox::db_helper::big');
2637 0           $BIGWIG_OK = $BIGBED_OK; # bigwig too!
2638 0           $BIG_ADAPTER = 'big'; # for internal consistency
2639             }
2640             elsif ($BIG_ADAPTER =~ /none/i) {
2641             # basically for testing purposes, don't use a module
2642 0           return 0;
2643             }
2644             else {
2645             # unrecognized
2646 0           $@ = 'unrecognized';
2647             }
2648 0 0         if (not $BIGBED_OK) {
2649 0           print "Requested '$BIG_ADAPTER' adapter could not be loaded: $@\n";
2650             }
2651             }
2652             else {
2653             # we have to try each one out
2654             # try the modern big adapter first
2655 0           $BIGBED_OK = _load_helper_module('Bio::ToolBox::db_helper::big');
2656 0 0         if ($BIGBED_OK) {
2657             # success!
2658 0           $BIGWIG_OK = $BIGBED_OK; # bigwig too!
2659 0 0         $BIG_ADAPTER = 'big' if $BIGBED_OK;
2660             }
2661             else {
2662 0           $BIGBED_OK = _load_helper_module('Bio::ToolBox::db_helper::bigbed');
2663 0 0         $BIG_ADAPTER = 'ucsc' if $BIGBED_OK;
2664             }
2665             }
2666 0           return $BIGBED_OK;
2667             }
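# As with the Bam loader, callers may steer the choice by assigning the
# shared preference before the first file is opened. A hedged sketch,
# assuming $BIG_ADAPTER is the package global used above:
#
#   $Bio::ToolBox::db_helper::BIG_ADAPTER = 'ucsc';  # or 'big', or 'none'
#   _load_bigbed_helper_module() or warn "no bigBed adapter could be loaded";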
2668              
2669             =head1 AUTHOR
2670              
2671             Timothy J. Parnell, PhD
2672             Howard Hughes Medical Institute
2673             Dept of Oncological Sciences
2674             Huntsman Cancer Institute
2675             University of Utah
2676             Salt Lake City, UT, 84112
2677              
2678             This package is free software; you can redistribute it and/or modify
2679             it under the terms of the Artistic License 2.0.
2680