File Coverage

blib/lib/Bio/ToolBox/db_helper.pm
Criterion    Covered   Total      %
statement        155     662   23.4
branch            66     526   12.5
condition         26     125   20.8
subroutine        15      30   50.0
pod               16      17   94.1
total            278    1360   20.4


line stmt bran cond sub pod time code
1             package Bio::ToolBox::db_helper;
2             our $VERSION = '1.69';
3              
4             =head1 NAME
5              
6             Bio::ToolBox::db_helper - helper interface to various database formats
7              
8             =head1 DESCRIPTION
9              
10             In most cases, this module does not need to be used directly. The
11             methods available to L<Bio::ToolBox::Data> and L<Bio::ToolBox::Data::Feature>
12             provide convenient access to the methods described here.
13              
14             These are helper subroutines to work with relevant databases that can be
15             accessed through BioPerl modules. These include the L<Bio::DB::SeqFeature::Store>
16             relational database as well as L<Bio::DB::HTS>, L<Bio::DB::Sam>,
17             L<Bio::DB::BigWig>, L<Bio::DB::BigBed>, L<Bio::DB::USeq>, and
18             L<Bio::DB::Fasta> databases. The primary functions
19             include opening connections to these databases, verifying features found
20             within the database, generating lists of features or genomic intervals for
21             subsequent analysis, and collecting features and/or scores within regions
22             of interest. Collected scores may be summarized using a variety of statistical
23             methods.
24              
25             When collecting scores or features, the data may be hosted in a variety of
26             formats and locations. These include:
27              
28             =over 4
29              
30             =item SeqFeature database
31              
32             Genomic feature annotation, including genes, transcripts, exons, etc,
33             may be stored in a L<Bio::DB::SeqFeature::Store> database. These
34             databases are backed by a relational database, such as
35             MySQL, PostgreSQL, or SQLite. Small GFF3 files may also be loaded into
36             an in-memory database. These are typically loaded from GFF3
37             files; see the adapter documentation for more information. Dataset
38             scores, such as from microarray, may also be stored as the score
39             value in the source GFF file.
40              
41             References to local, binary, indexed files may also be included as
42             attributes to features stored in the database. This is legacy support
43             from the L<GBrowse genome browser|http://gmod.org>, and should
44             not be used in preference to other methods. Supported files include the
45             legacy binary wig files (F<.wib>, see L<Bio::Graphics::Wiggle>) using the
46             C<wigfile> attribute, or bigWig files using the C<wigfile> or C<bigwigfile>
47             attribute. The attribute values must be full paths.
48              
49             =item BigWig files
50              
51             BigWig files (F<.bw> and F<.bigwig>) are compressed, binary, indexed
52             versions of text wig files and may be accessed either locally or remotely.
53             They support extremely fast score retrieval from any genomic location of
54             any size without sacrificing resolution (spatial and numeric). See
55             L<http://genome.ucsc.edu/goldenPath/help/bigWig.html> for more information.
56             BigWig files are supported by either the L<Bio::DB::Big> or the
57             L<Bio::DB::BigWig> adapter, based on either the libBigWig
58             or the UCSC kent C libraries,
59             respectively; see their respective documentation for more information.
60              
61             =item Directory of BigWig files
62              
63             A directory containing two or more BigWig files is assembled into a
64             BigWigSet, allowing for metadata, such as strand, to be associated with
65             BigWig files. Additional metadata beyond the filename may be included in
66             a text file F<metadata.txt> within the directory. See the
67             L<Bio::DB::BigWigSet> adapter documentation for more information. When using
68             L<Bio::DB::Big> adapters, BigWigSet support is provided natively by
69             the BioToolBox package.
70              
71             =item BigBed files
72              
73             BigBed files are compressed, binary, indexed versions of text BED files. See
74             L<http://genome.ucsc.edu/goldenPath/help/bigBed.html> for more information.
75             Both local and remote files may be accessed. BigBed files are supported by
76             either the L<Bio::DB::Big> or the L<Bio::DB::BigBed> adapter, based on either
77             the libBigWig or the UCSC kent C libraries,
78             respectively; see their respective documentation for more information.
79              
80             =item Bam files
81              
82             Bam files (F<.bam>) are compressed, binary, indexed versions of the text SAM file,
83             or sequence alignment map. They are used with next generation sequencing
84             technologies. They support individual alignment retrieval as well as
85             read depth coverage. Two different Bam file adapters are supported.
86              
87             The L<Bio::DB::HTS> adapter is an interface to the Bam (or Cram) file.
88             This is based on the modern HTSlib C library (version E<gt>= 1.0), successor
89             to the original samtools library. See L<http://www.htslib.org> for
90             more information.
91
92             The L<Bio::DB::Sam> adapter is an older interface to the Bam alignment
93             file. This is based on the samtools C library version E<lt>= 0.1.19. While
94             this adapter is still supported, the above should be used for new installs.
95              
96             =item Cram files
97              
98             Cram files (F<.cram>) are similar to Bam files, but are much smaller due to
99             only storing sequence differences for each alignment. As such, they require
100             an indexed, reference fasta to regenerate the complete alignment. Cram files
101             are only supported by the L<Bio::DB::HTS> adapter.
102              
103             =item USeq files
104              
105             USeq files (F<.useq>) are compressed, binary, indexed files that support
106             multiple information types, including region intervals, BED type
107             annotations, or wig type scores distributed across the genome. They
108             support rapid, random access across the genome and are comparable to
109             both BigWig and BigBed files. See L<http://useq.sourceforge.net/useqArchiveFormat.html>
110             for more information. Files may only be local. These files are supported
111             by the L<Bio::DB::USeq> adapter.
112              
113             =item Fasta files
114              
115             Fasta files (F<.fa> or F<.fasta>) may be opened. Fasta files are indexed to
116             rapidly and randomly fetch genomic sequence. Three different adapters are
117             available for indexing the fasta file.
118
119             The L<Bio::DB::HTS::Faidx> adapter is preferentially used, and requires a
120             F<.fai> index file.
121
122             The L<Bio::DB::Sam::Fai> adapter may alternatively be used, and also requires a
123             F<.fai> index file.
124
125             The older style L<Bio::DB::Fasta> adapter is much slower, but will index
126             either a single genomic fasta file or a directory of individual chromosome
127             or contig fasta files.
128
129             Additionally, genomic sequence stored in a L<Bio::DB::SeqFeature::Store>
130             annotation database may also be used.
131              
132             =back
133              
134             While the functions within this library may appear to be simply a rehashing of
135             the methods and functions in L<Bio::DB::SeqFeature::Store> and other modules, they
136             either provide a simpler interface to frequently used database methodologies or are
137             designed to work intimately with the BioToolBox data format file and data structures
138             (see L<Bio::ToolBox::Data>). One key advantage to these functions is the ability
139             to work with datasets that are stranded (transcriptome data, for example).
140
141             Historically, this module was initially written to use L<Bio::DB::GFF> for
142             database usage. It has since been re-written to use the L<Bio::DB::SeqFeature::Store>
143             database schema. Additional functionality has also been added for other databases
144             and file formats.
145              
146             Complete usage and examples for the functions are provided below.
147              
148             =head1 USAGE
149              
150             Call the module at the beginning of your perl script and include the
151             subroutine names to export. None are exported by default.
152              
153             use Bio::ToolBox::db_helper qw(
154             get_new_feature_list
155             get_db_feature
156             );
157              
158             This will export the indicated subroutine names into the current namespace.
159             Their usage is detailed below.
160              
161             =head1 EXPORTED SUBROUTINES
162              
163             =over 4
164              
165             =item C<$BAM_ADAPTER> variable
166              
167             =item C<$BIG_ADAPTER> variable
168              
169             These variables control which Bam and bigWig/bigBed adapters are used
170             during execution in installations where both adapters are present. These
171             variables are not set initially; adapters are chosen automatically when an
172             adapter is requested during execution and modules are evaluated for loading.
173              
174             =item use_bam_adapter
175              
176             =item use_big_adapter
177              
178             These are convenience methods for explicitly setting and retrieving the
179             C<$BAM_ADAPTER> and C<$BIG_ADAPTER> variables, respectively.
180              
181             Optionally pass the value to set. It will always return the value being used.
182             Values include the following:
183              
184             =over 4
185              
186             =item Bam adapter
187              
188             * hts Bio::DB::HTS (default if available)
189             * sam Bio::DB::Sam
190             * none no adapters allowed
191              
192             =item Big adapter
193              
194             * big Bio::DB::Big (default if available)
195             * ucsc Bio::DB::BigWig and Bio::DB::BigBed
196             * none no adapters allowed
197              
198             =back
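
For example, to prefer the legacy samtools-based adapter when both are
installed (a sketch; the adapter may not be changed once helper modules
have been loaded):

    use Bio::ToolBox::db_helper qw(use_bam_adapter open_db_connection);

    # request the samtools-based adapter before any Bam file is opened
    my $adapter = use_bam_adapter('sam');
    print "using Bam adapter: $adapter\n";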
199              
200             =item open_db_connection
201              
202             my $db_name = 'cerevisiae';
203             my $db = open_db_connection($db_name);
204            
205             my $file = '/path/to/file.bam';
206             my $db = open_db_connection($file);
207            
208             # within a forked child process
209             # reusing the same variable to simplify code
210             # pass second true value to avoid cache
211             $db = open_db_connection($file, 1);
212              
213             This is a simplified, generalized method that will open a connection to a
214             database or indexed file using any one of a number of different
215             BioPerl-style adapters. For local and remote files, the appropriate
216             adapter is determined by the file extension. Relational database
217             connection and authentication information is checked in a configuration
218             file. Once identified, the appropriate Perl module adapter is loaded
219             automatically at run time, saving the user from having to C<use> the
220             appropriate module in the script in advance.
221              
222             Pass the name of a relational database or the path of the database file to
223             the subroutine. The opened database object is returned. If it fails, then
224             an error message should be generated and nothing is returned.
225              
226             B<IMPORTANT:> If you are forking your perl process, B<always> re-open your
227             database objects in the child process, and pass a second true value
228             to avoid using the cached database object. By default, opened databases are
229             cached to improve efficiency, but this will be disastrous when crossing forks.
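
As a minimal sketch of safe fork handling, assuming L<Parallel::ForkManager>
is available and C<@bam_files> is a hypothetical list of file paths:

    use Parallel::ForkManager;

    my $pm = Parallel::ForkManager->new(4);
    foreach my $file (@bam_files) {
        $pm->start and next;   # fork a child process
        # re-open in the child, passing a true second value to bypass the cache
        my $db = open_db_connection($file, 1);
        # ... collect data from $db ...
        $pm->finish;
    }
    $pm->wait_all_children;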
230              
231             =over 4
232              
233             =item SeqFeature Store database
234              
235             Provide the name of the relational database to open. Parameters for
236             connecting to a relational database are stored in the BioToolBox
237             configuration file, F<.biotoolbox.cfg>. These include database adaptors,
238             user name, password, etc. Further information regarding the configuration
239             file may be found in L<Bio::ToolBox::db_helper::config>.
240              
241             For SQLite databases, provide the path to the F<.sqlite> or F<.db> file.
242              
243             For in-memory databases, provide the path to the F<.gff> or F<.gff3> file.
244             The larger the file, the more memory and processing time is consumed as
245             the file is parsed into memory. This may be fine for small model organisms
246             like yeast, but not for vertebrate annotation.
247              
248             =item Bam file database
249              
250             Provide the path to the F<.bam> file. This may be a local file, or a remote
251             URL (generally supported by the adapter). If a local F<.bam.bai> index file
252             is not present, it will automatically be built prior to opening; this may
253             fail if the bam file is not sorted. Remote files are not indexed.
254              
255             =item Cram file database
256              
257             Provide the path to the local F<.cram> file. Currently, supplementary fasta
258             files are not supported, so the Cram file header must point to a valid
259             reference. Cram files will be indexed automatically if an index is not available.
260              
261             =item BigWig file database
262              
263             Provide the path to a local F<.bw> or F<.bigwig> file, or URL to a
264             remote file.
265              
266             =item BigWigSet database
267              
268             A local or remote directory of one or more BigWig files that can be treated
269             collectively as a database. A special text file may be included in the
270             directory to assign metadata to each BigWig file, including attributes such
271             as type, source, display name, strand, etc. See L<Bio::DB::BigWigSet> for
272             more information on the formatting of the metadata file. If a metadata
273             file is not included (it's not necessary), then the filenames are used
274             as the name and type. Strand may be parsed from the filenames if the
275             basename ends in F<.f> or F<.r>.
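
A hypothetical F<metadata.txt> for a stranded pair of files might look like
the following sketch; the attribute names and values shown here are
illustrative, so consult L<Bio::DB::BigWigSet> for the recognized keys:

    [rnaseq_forward.bw]
    type   = rnaseq
    strand = +

    [rnaseq_reverse.bw]
    type   = rnaseq
    strand = -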
276              
277             =item BigBed database
278              
279             Provide the path to a local F<.bb> or F<.bigbed> file, or URL to a
280             remote file.
281              
282             =item USeq database
283              
284             Provide the path to a local F<.useq> file. Remote files are not
285             supported.
286              
287             =item Fasta database
288              
289             Provide the path to a local fasta file for rapidly fetching genomic
290             sequence. If the fasta file is not indexed, then it will be indexed
291             automatically upon opening.
292              
293             =back
294              
295             =item get_db_name
296              
297             my $name = get_db_name($db);
298              
299             This method will attempt to get the name of an opened Database
300             object if, for some reason, it's unknown, e.g. the user only provided an
301             opened db object. This only works for some databases, at least those I've
302             bothered to track down and find a usable API call to use.
303              
304             =item get_dataset_list
305              
306             my $db_name = 'cerevisiae';
307             my @types = get_dataset_list($db_name);
308              
309             This method will retrieve a list of the available features stored in the
310             database and return an array of the features' types.
311              
312             Pass either the name of the database or an established database object.
313             Supported databases include both L<Bio::DB::SeqFeature::Store> and
314             L<Bio::DB::BigWigSet> databases.
315
316             For L<Bio::DB::SeqFeature::Store> databases, the type is represented as
317             "C<method:source>", corresponding to the third and second GFF columns,
318             respectively. The types are sorted alphabetically first by source,
319             then by method.
320
321             For L<Bio::DB::BigWigSet> databases, the C<type>, C<method>, C<primary_tag>,
322             or C<display_name> attribute may be used, in that respective order of
323             availability. The list is sorted alphabetically.
324              
325             =item verify_or_request_feature_types
326              
327             # verify a dataset, either file or database type
328             my $db_name = 'cerevisiae';
329             my $dataset = 'microarray_data';
330             my @features = verify_or_request_feature_types(
331             db => $db_name,
332             feature => $dataset,
333             );
334             unless (@features) {
335             die " no valid feature provided!\n";
336             }
337
338             # select a list of features
339             my $db_name = 'cerevisiae';
340             my @types = (); # user not provided
341             @types = verify_or_request_feature_types(
342             db => $db_name,
343             feature => \@types,
344             );
345             # user will be prompted to select from database list
346             if (@types) {
347             print "using types " . join(", ", @types);
348             }
349             else {
350             die " no valid features selected!\n";
351             }
352              
353             This subroutine will process a list of feature types or data sources to be
354             used for data or feature collection. There are two modes of action. If a
355             list was provided, it will process and verify the list. If no list was
356             provided, then the user will be prompted to select from a list of available
357             feature types in the provided database.
358              
359             The provided list may be a list of feature types or a list of single
360             file data sources, including Bam, bigWig, or bigBed files. If the list
361             includes feature types, they are verified as existing in the provided
362             database. If the list includes file names, the local files are checked
363             and the names prefixed with "C<file:>" for use downstream.
364
365             If no list was provided, then a list of available feature types in the
366             provided database will be presented to the user, and the user prompted
367             to make a selection. One or more types may be selected, and a single
368             item may be enforced if requested. The response is filtered through
369             the parse_list method from L<Bio::ToolBox::utility>, so a mix of single
370             numbers or a range of numbers may be accepted. The responses are then
371             validated.
372              
373             Two types of databases are supported: L<Bio::DB::SeqFeature::Store> and
374             L<Bio::DB::BigWigSet> databases. For SeqFeature Store databases, the
375             type is comprised of "C<method:source>", corresponding to the third and
376             second GFF columns. For BigWigSet databases, types correspond to either
377             the C<type>, C<method>, C<primary_tag>, or C<display_name> attributes,
378             in that order of availability.
379
380             For feature types or files that pass validation, they are returned as a
381             list. Types of files that do not pass validation are printed to C<STDOUT>.
382              
383             To use this subroutine, pass an array with the following keys
384             and values. Not every key is required.
385              
386             =over 4
387              
388             =item db
389              
390             The name of the database or a reference to an established BioPerl
391             database object. A L<Bio::DB::SeqFeature::Store> or
392             L<Bio::DB::BigWigSet> database is typically used. Only required when
393             no features are provided.
394              
395             =item feature
396              
397             Pass either a single dataset name as a scalar or an anonymous array
398             reference of a list of dataset names. These may have been provided as
399             a command line option and need to be verified. If nothing is passed,
400             then a list of possible datasets will be presented to the user to be
401             chosen.
402              
403             =item prompt
404              
405             Provide a phrase to present to the user as a prompt to help in selecting
406             datasets from the list. If none is provided, a generic prompt will be
407             used.
408              
409             =item single
410              
411             A Boolean value (1 or 0) indicating whether only a single dataset is
412             allowed when selecting datasets from a presented list. If true, only
413             one dataset choice is accepted. If false, one or more dataset choices
414             are allowed. Default is false.
415              
416             =item limit
417              
418             Optionally provide a word or regular expression that matches the
419             feature type (C<method> only; C<source>, if present, is ignored).
420             For example, provide "geneE<verbar>mrna" to only present gene and mRNA
421             features to the user. This is only applicable when a user must select
422             from a database list. The default is to list all available feature
423             types.
424              
425             =back
426              
427             =item check_dataset_for_rpm_support($dataset, [$cpu])
428              
429             # count the total number of alignments
430             my $dataset = '/path/to/file.bam';
431             my $total_count = check_dataset_for_rpm_support($dataset);
432            
433             # use multithreading
434             my $total_count = check_dataset_for_rpm_support($dataset, $cpu);
435              
436             This subroutine will check a dataset for RPM, or Reads Per Million mapped,
437             support. Only two types of database files support this, Bam files and
438             BigBed files. If the dataset is either one of these, or the name of a
439             database feature which points to one of these files, then it will
440             calculate the total number of mapped alignments (Bam file) or features
441             (BigBed file). It will return this total number. If the dataset does
442             not support RPM (because it is not a Bam or BigBed file, for example),
443             then it will return undefined.
444              
445             Pass this subroutine one or two values. The first is the name of the
446             dataset. Ideally it should be validated using verify_or_request_feature_types()
447             and have an appropriate prefix (file, http, or ftp).
448              
449             For multi-threaded execution pass a second value, the number of CPU cores
450             available to count Bam files. This will speed up counting bam files
451             considerably. The default is 2 for environments where L<Parallel::ForkManager>
452             is installed, or 1 where it is not.
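
As a sketch, a collected read count may be converted to RPM once the total
is known (the file path and C<$count> here are illustrative):

    my $total = check_dataset_for_rpm_support('file:/path/to/file.bam', 4);
    if (defined $total) {
        # convert a region's read count to Reads Per Million mapped
        my $rpm = $count * 1_000_000 / $total;
    }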
453              
454             =item get_new_feature_list
455              
456             my $Data = Bio::ToolBox::Data->new;
457             my $db_name = 'cerevisiae';
458             my %data = get_new_feature_list(
459             data => $Data,
460             db => $db_name,
461             features => 'genes',
462             );
463              
464             This subroutine will generate a new feature list collected from an annotation
465             database, such as a L<Bio::DB::SeqFeature::Store> database. It will populate
466             an empty L<Bio::ToolBox::Data> object data table with a list of the collected
467             features. Columns include the Primary ID, Name, and feature Type.
468              
469             The subroutine is passed an array of key => value pairs; three of the
470             arguments are required. The keys include
471              
472             =over 4
473              
474             =item data
475              
476             An empty L<Bio::ToolBox::Data> object. This is required.
477              
478             =item db
479              
480             The name of the database or a reference to an established database object.
481             Currently, this must be a L<Bio::DB::SeqFeature::Store> database.
482             This is required.
483              
484             =item features
485              
486             A scalar value containing a name representing the feature type(s) of
487             feature(s) to collect. The type may be a C<primary_tag> or C<primary_tag:source>.
488             It may be a single element or a comma-delimited list. This is required.
489              
490             =item chrskip
491              
492             Provide a regular-expression compatible string for skipping features from
493             specific chromosomes, for example mitochondrial or unmapped contigs.
494              
495             =back
496              
497             Status messages, including the number of features found and kept, are always
498             printed to C<STDOUT>.
499              
500             =item get_new_genome_list
501              
502             my $Data = Bio::ToolBox::Data->new();
503             my $db_name = 'cerevisiae';
504             my $window_size = 500;
505             my $step_size = 250;
506             my %data = get_new_genome_list(
507             data => $Data,
508             db => $db_name,
509             win => $window_size,
510             step => $step_size,
511             );
512              
513             This subroutine will generate a new list of genomic windows or intervals.
514             The genome is split into intervals of a specified size with the specified
515             step size.
516              
517             The subroutine is passed an array containing the required arguments.
518              
519             =over 4
520              
521             =item data
522              
523             An empty L<Bio::ToolBox::Data> object. This is required.
524              
525             =item db
526              
527             The name of the database or a reference to an established database
528             object. Any BioPerl adapter, including those described in the
529             L</DESCRIPTION>, is supported. Required.
530              
531             =item win
532              
533             A scalar value containing an integer representing the size of the
534             window in basepairs. The default value is 500 bp.
535              
536             =item step
537              
538             A scalar value containing an integer representing the step size for
539             advancing the window across the genome. The default is the window size.
540              
541             =item chrskip
542              
543             =item skip
544              
545             =item exclude
546              
547             Provide a regular expression compatible string for excluding specific
548             chromosomes or classes of chromosomes, such as the mitochondrial or
549             unmapped contigs.
550              
551             =back
552              
553             Status messages are printed to C<STDOUT>.
554              
555             =item get_db_feature
556              
557             my $db = open_db_connection('annotation.db');
558             my $seqfeature = get_db_feature(
559             db => $db,
560             type => 'gene',
561             name => 'ABC1',
562             );
563              
564             This subroutine will retrieve a specific feature from a L<Bio::DB::SeqFeature::Store>
565             database for subsequent analysis, manipulation, and/or score retrieval using the
566             L<Bio::ToolBox::Data::Feature> methods. It relies upon unique information to pull out a
567             single, unique feature.
568
569             Several attributes may be used to pull out the feature, including the feature's
570             unique database primary ID, name and/or aliases, and GFF type (C<primary_tag>
571             and/or C<source>). The L</get_new_feature_list> subroutine will generate a list
572             of features with their unique identifiers.
573
574             The C<primary_id> attribute is preferentially used as it provides the best
575             performance. However, it is not always portable between databases or even a
576             database re-load. In that case, the C<display_name> and C<type> are used to identify
577             potential features. Note that the C<display_name> may not always be unique in the
578             database. In this case, the addition of aliases may help. If all else fails, a new
579             feature list should be generated.
580              
581             To get a feature, pass an array of arguments.
582              
583             =over 4
584              
585             =item db
586              
587             The name of the L<Bio::DB::SeqFeature::Store> database or a reference
588             to an established database object.
589
590             =item id
591
592             Provide the C<primary_id> tag. In the L<Bio::DB::SeqFeature::Store>
593             database schema this is a (usually) non-portable, unique identifier
594             specific to a database. It provides the fastest lookup.
595
596             =item name
597
598             A scalar value representing the feature C<display_name>. Aliases may be
599             appended with semicolon delimiters.
600
601             =item type
602
603             Provide the feature type, which is typically expressed as
604             C<primary_tag:source>. Alternatively, provide just the C<primary_tag> only.
605              
606             =back
607              
608             While it is possible to identify features with any two attributes
609             (or possibly just name or ID), the best performance is obtained with
610             all three together. The first SeqFeature object is returned if multiple
611             are found.
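
A sketch passing all three attributes together (the values shown are
illustrative):

    my $feature = get_db_feature(
        db   => $db,
        id   => $primary_id,   # from a previously generated feature list
        name => 'ABC1',
        type => 'gene:SGD',    # hypothetical primary_tag:source
    );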
612              
613             =item get_segment_score
614              
615             my $score = get_segment_score(
616             $seqfeature->seq_id, # chromosome
617             $seqfeature->start, # start
618             $seqfeature->end, # end
619             $seqfeature->strand, # strand
620             'sense', # strandedness
621             'mean', # method
622             0, # return type
623             $db, # database
624             'scores.bw' # datasets
625             );
626              
627             Generic method for collecting score(s) from a database. This is the
628             primary interface to all of the database handling packages, including
629             Bam, bigWig, bigBed, and USeq files, as well as SeqFeature databases.
630             By passing common parameters here, the appropriate database or file
631             handling packages are loaded during run time and the appropriate
632             methods are called. Essentially, this is where the magic happens.
633              
634             Pass the method an array of nine (or more) parameters. The parameters
635             are not keyed as a hash, and the order is explicit. However, see the
636             L<Bio::ToolBox::db_helper::constants> module for named index positions.
637             The order is as follows.
638              
639             =over 4
640              
641             =item * 0 Chromosome
642              
643             String representing chromosome or sequence ID.
644              
645             =item * 1 Start
646              
647             Start coordinate (integer, 1-based)
648              
649             =item * 2 Stop
650              
651             Stop or end coordinate (integer, 1-based)
652              
653             =item * 3 Strand
654              
655             BioPerl style strand value, including -1, 0, and 1.
656              
657             =item * 4 Strandedness
658              
659             A string indicating strandedness: sense, antisense, all.
660              
661             =item * 5 Method
662              
663             A string representing the method of data collection.
664              
665             Current acceptable methods include mean, median, sum, min, max,
666             stddev, count, ncount (named count), and pcount (precise count).
667             See the individual db_helper sub classes for more details. Not all
668             adapters support all methods.
669              
670             =item * 6 Return type
671              
672             An integer (0, 1, 2) representing the type of value to return.
673              
674             A return type of 0 indicates a single score should be returned,
675             calculated using the specified method. A return type of 1 is an
676             array or array reference of all scores found. A return type of 2
677             is a hash or hash reference of coordinate => score.
678              
679             =item * 7 Dataset database
680              
681             This is either a L<Bio::DB::SeqFeature::Store> or a L<Bio::DB::BigWigSet>
682             database containing multiple score dataset features. If your dataset
683             does not contain multiple different feature types, pass C<undef>.
684
685             Pass either a string containing the name of the database (which will be opened)
686             or an already opened database object.
687              
688             =item * 8 Dataset
689              
690             This is the verified dataset from which the scores will be collected. The
691             dataset should be verified using the L</verify_or_request_feature_types>
692             subroutine. For indexed file datasets, e.g. F<.bam> or F<.bw> files,
693             this will be the verified local path or URL. For a database, this will be
694             a database type, such as C<primary_tag> or C<primary_tag:source>,
695             or the path to a data file, e.g. Bam, bigWig, bigBed, or USeq file.
696              
697             If multiple datasets are to be combined, simply append them to the
698             array (index 9, 10, etc).
699              
700             =back
701              
702             The returned item is dependent on the value of the return type code.
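
For example, a return type of 2 collects positioned scores (a sketch
assuming a hypothetical local bigWig file):

    my %pos2score = get_segment_score(
        'chrI', 1000, 2000, 0,       # coordinates, unstranded
        'all',                       # strandedness
        'mean',                      # method
        2,                           # return coordinate => score pairs
        undef,                       # no dataset database needed
        'file:/path/to/scores.bw'    # hypothetical dataset file
    );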
703              
704             =item calculate_score
705              
706             my $scores = get_segment_score($chr,$start,$stop,$strand,
707             'sense','ncount',1,undef,'file:/path/to/dataset.bam');
708             my $score = calculate_score('ncount', $scores);
709              
710             This subroutine will perform the math to calculate a single score
711             from an array of values. Current acceptable methods include mean,
712             median, sum, min, max, stddev, count, ncount, and pcount.
713              
714             Pass the subroutine two items: the name of the mathematical method
715             enumerated above, and an array reference of the values to work on.
716             A single score will be returned.
717              
718             =item get_chromosome_list
719              
720             my $db = open_db_connection('cerevisiae');
721             # get all chromosomes in the database
722             my @chromosomes = get_chromosome_list($db);
723             foreach (@chromosomes) {
724             my $name = $_->[0];
725             my $length = $_->[1];
726             print "chromosome $name is $length bp\n";
727             }
728              
729             This subroutine will collect a list of chromosomes or reference sequences
730             in a Bio::DB database and return the list along with their sizes in bp.
731             Many BioPerl-based databases are supported, including
732             L<Bio::DB::SeqFeature::Store>, L<Bio::DB::HTS>, L<Bio::DB::Sam>,
733             L<Bio::DB::BigWig>, L<Bio::DB::BigBed>, L<Bio::DB::USeq>,
734             L<Bio::DB::Fasta>, or any others that support the C<seq_ids> method.
735             L<Bio::DB::Big> adapters for big files are also supported, though see the note
736             below. See the L</open_db_connection> subroutine for more information.
737              
738             Pass the subroutine either the name of the database, the path to the
739             database file, or an opened database object.
740              
741             Optionally pass a second value, a regular expression compatible string
742             for skipping specific chromosomes or chromosome classes, such as mitochondrial
743             or unmapped contigs. The default is to return all chromosomes.
744              
745             The subroutine will return an array, with each element representing each
746             chromosome or reference sequence in the database. Each element is an anonymous
747             array of two elements, the chromosome name and length in bp.
748              
749             B<Note>: L<Bio::DB::Big> databases for bigWig and bigBed files do
750             not return chromosomes in the file's original order; they are returned
751             in a sorted order that may differ from the original.
752              
753             =item low_level_bam_coverage
754              
755             my $coverage = low_level_bam_coverage($sam, $tid, $start, $stop);
756              
757             This is a convenience method for running the low level bam coverage method.
758             Since both L<Bio::DB::Sam> and L<Bio::DB::HTS> bam file adapters are
759             supported, and each has slight variation in the API syntax, this method helps
760             to abstract the actual method and use the appropriate syntax depending on
761             which adapter is loaded. It is best if the C<$sam> object was opened using the
762             L</open_db_connection> method, or that C<$BAM_ADAPTER> is set.
763
764             B<Note> that this is the B<low level> coverage method based on the index object,
765             and not the similarly named high level API method. Read the adapter
766             documentation for proper usage.
767              
768             =item low_level_bam_fetch
769              
770             my $success = low_level_bam_fetch($sam, $tid, $start, $stop, $callback, $data);
771              
772             This is a convenience method for running the low level bam fetch method.
773             Since both L<Bio::DB::Sam> and L<Bio::DB::HTS> bam file adapters are
774             supported, and each has slight variation in the API syntax, this method helps
775             to abstract the actual method and use the appropriate syntax depending on
776             which adapter is loaded. It is best if the C<$sam> object was opened using the
777             L</open_db_connection> method, or that C<$BAM_ADAPTER> is set.
778
779             B<Note> that this is the B<low level> fetch method based on the index object,
780             and not the similarly named high level API method. Read the adapter
781             documentation for proper usage.
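
A sketch of a fetch callback that tallies alignment start positions; the
region parsing through the adapter's header object and the data structure
are illustrative:

    # assumes $sam was opened via open_db_connection()
    my ($tid, $zstart, $end) = $sam->header->parse_region('chrI:1000-2000');
    my %depth;
    my $callback = sub {
        my ($alignment, $data) = @_;
        $data->{ $alignment->pos } += 1;   # 0-based leftmost position
    };
    low_level_bam_fetch($sam, $tid, $zstart, $end, $callback, \%depth);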
782              
783             =item get_genomic_sequence
784              
785             my $sequence = get_genomic_sequence($db, $chrom, $start, $stop);
786              
787             This is a convenience wrapper function for fetching genomic sequence from
788             three different supported fasta database adapters, which, of course, all
789             have different APIs, namely the L<Bio::DB::HTS::Faidx>, L<Bio::DB::Sam::Fai>,
790             and L<Bio::DB::Fasta> adapters, as well as L<Bio::DB::SeqFeature::Store>
791             and possibly other BioPerl databases.
792              
793             Pass the opened database object, chromosome, start, and stop coordinates. This
794             assumes BioPerl standard 1-base coordinates. Only the forward sequence is
795             retrieved. The sequence is returned as a simple string.
796              
797             =back
798              
799             =head1 INTERNAL SUBROUTINES
800              
801             These are not intended for normal general consumption but are documented here
802             so that at least some of us know what is going on.
803              
804             =over 4
805              
806             =item _lookup_db_method
807              
808             Internal method for determining which database adapter and method call
809             should be used given a set of parameters. This result is cached for
810             future data queries (likely hundreds or thousands of repeated calls).
811             The method is stored in the global hash C<%DB_METHODS>.
812
813             Pass the parameter array reference from L</get_segment_score>. This
814             should load the appropriate db_helper sub module during run time.
815              
816             =item _load_helper_module
817              
818             Subroutine to load and import a db_helper module during run time.
819             Sets the appropriate global variable if successful.
820              
821             =item _load_bam_helper_module
822              
823             Subroutine to determine which Bam adapter db_helper module to load,
824             either the older samtools-based adapter or the newer HTSlib-based adapter.
825             Uses the global and exportable variable C<$BAM_ADAPTER> to set a
826             preference for which adapter to use. Use 'sam' or 'hts' or some
827             string containing these names. Do B<not> change this variable after
828             loading the helper module; it will not do what you think it will do.
829             If both are available and a preference is not set, then the hts helper
830             is the default.
831              
832             =back
833              
834             =cut
835              
836 3     3   16 use strict;
  3         6  
  3         89  
837             require Exporter;
838 3     3   11 use Carp qw(carp cluck croak confess);
  3         4  
  3         148  
839 3     3   1136 use Module::Load; # for dynamic loading during runtime
  3         2491  
  3         15  
840 3     3   124 use List::Util qw(min max sum0 uniq);
  3         5  
  3         166  
841 3     3   1040 use Statistics::Lite qw(median range stddevp);
  3         3542  
  3         157  
842 3     3   1104 use Bio::ToolBox::db_helper::constants;
  3         6  
  3         145  
843 3     3   1168 use Bio::ToolBox::utility;
  3         6  
  3         20771  
844              
845             # check values for dynamically loaded helper modules
846             # these are loaded only when needed during runtime to avoid wasting resources
847             our $BAM_OK = 0;
848             our $BIGBED_OK = 0;
849             our $BIGWIG_OK = 0;
850             our $SEQFASTA_OK = 0;
851             our $USEQ_OK = 0;
852             our $BAM_ADAPTER = undef; # preference for which bam adapter to use
853             our $BIG_ADAPTER = undef;
854              
855             # define reusable variables
856             my %TOTAL_READ_NUMBER; # for rpm calculations
857             my $PRIMARY_ID_WARNING; # for out of date primary IDs
858             my %OPENED_DB; # cache for some opened Bio::DB databases
859             my %DB_METHODS; # cache for database score methods
860              
861             # score calculators
862             my %SCORE_CALCULATOR_SUB = (
863             'mean' => sub {
864             my $s = shift;
865             return sum0(@$s)/(scalar(@$s) || 1);
866             },
867             'sum' => sub {
868             my $s = shift;
869             return sum0(@$s);
870             },
871             'median' => sub {
872             my $s = shift;
873             return '.' unless scalar(@$s);
874             return median(@$s);
875             },
876             'min' => sub {
877             my $s = shift;
878             return '.' unless scalar(@$s);
879             return min(@$s);
880             },
881             'max' => sub {
882             my $s = shift;
883             return '.' unless scalar(@$s);
884             return max(@$s);
885             },
886             'count' => sub {
887             my $s = shift;
888             return scalar(@$s);
889             },
890             'pcount' => sub {
891             my $s = shift;
892             return scalar(@$s);
893             },
894             'ncount' => sub {
895             # Convert names into unique counts
896             my $s = shift;
897             my %name2count;
898             foreach my $n (@$s) {
899             if (ref($n) eq 'ARRAY') {
900             # this is likely from a ncount indexed hash
901             foreach (@$n) {
902             $name2count{$_} += 1;
903             }
904             }
905             else {
906             $name2count{$n} += 1;
907             }
908             }
909             return scalar(keys %name2count);
910             },
911             'range' => sub {
912             # the range value is 'min-max'
913             my $s = shift;
914             return '.' unless scalar(@$s);
915             return range(@$s);
916             },
917             'stddev' => sub {
918             # we are using the standard deviation of the population,
919             # since these are the only scores we are considering
920             my $s = shift;
921             return '.' unless scalar(@$s);
922             return stddevp(@$s);
923             },
924             'rpm' => sub {
925             confess " The rpm methods have been deprecated due to complexity and " .
926             "the variable way of calculating the value. Collect counts and " .
927             "calculate your preferred way.\n";
928             },
929             'rpkm' => sub {
930             confess " The rpm methods have been deprecated due to complexity and " .
931             "the variable way of calculating the value. Collect counts and " .
932             "calculate your preferred way.\n";
933             }
934             );
935              
936              
937             # Exported names
938             our @ISA = qw(Exporter);
939             our @EXPORT = qw();
940             our @EXPORT_OK = qw(
941             $BAM_ADAPTER
942             $BIG_ADAPTER
943             use_bam_adapter
944             use_big_adapter
945             open_db_connection
946             get_dataset_list
947             verify_or_request_feature_types
948             check_dataset_for_rpm_support
949             get_new_feature_list
950             get_new_genome_list
951             validate_included_feature
952             get_db_feature
953             get_segment_score
954             calculate_score
955             get_chromosome_list
956             low_level_bam_coverage
957             low_level_bam_fetch
958             get_genomic_sequence
959             );
960              
961             # The true statement
962             1;
963              
964              
965             ### Adapter preference and database connection subroutines
966              
967             sub use_bam_adapter {
968 0   0 0 1 0 my $a = shift || undef;
969 0 0       0 $BAM_ADAPTER = $a if $a;
970 0         0 return $BAM_ADAPTER;
971             }
972              
973              
974             sub use_big_adapter {
975 0   0 0 1 0 my $a = shift || undef;
976 0 0       0 $BIG_ADAPTER = $a if $a;
977 0         0 return $BIG_ADAPTER;
978             }
979              
980              
981             sub open_db_connection {
982 8     8 1 12 my $database = shift;
983 8   50     20 my $no_cache = shift || 0;
984 8 50       12 unless ($database) {
985             # cluck 'no database name passed!';
986 0         0 return;
987             }
988            
989             # first check if it is a database reference
990 8         14 my $db_ref = ref $database;
991 8 100       25 if ($db_ref =~ /DB|big::BigWigSet/) {
992             # the provided database is already an open database object
993             # nothing to open, return as is
994 2         5 return $database;
995             }
996            
997             # check to see if we have already opened it
998 6 100 66     23 if (exists $OPENED_DB{$database} and not $no_cache) {
999             # return the cached database connection
1000             # but NOT if user explicitly requested no cached databases
1001             # DO NOT reuse database objects if you have forked!!! Bad things happen
1002 5         11 return $OPENED_DB{$database};
1003             }
1004            
1005             # skip parsed databases
1006 1 50       3 return if $database =~ /^Parsed:/; # we don't open parsed annotation files
1007            
1008             # remove file prefix, just in case
1009 1         2 $database =~ s/^file://;
1010            
1011            
1012             ### Attempt to open the database
1013             # we go through a series of checks to determine if it is remote, local,
1014             # an indexed big data file, SQLite file, etc
1015             # when all else fails, try to open a SQL connection
1016 1         2 my $db;
1017             my $error;
1018            
1019             # check if it is a remote file
1020 1 50       20 if ($database =~ /^(?:https?|ftp)/i) {
    50          
    0          
1021            
1022             # a remote Bam database
1023 0 0       0 if ($database =~ /\.bam$/i) {
    0          
    0          
    0          
    0          
1024             # open using Bam adaptor
1025 0 0       0 _load_bam_helper_module() unless $BAM_OK;
1026 0 0       0 if ($BAM_OK) {
1027 0         0 $db = open_bam_db($database);
1028 0 0       0 unless ($db) {
1029 0         0 $error = " ERROR: could not open remote Bam file" .
1030             " '$database'! $!\n";
1031             }
1032             }
1033             else {
1034 0         0 $error = " Bam database cannot be loaded because\n" .
1035             " Bio::DB::Sam or Bio::DB::HTS is not installed\n";
1036             }
1037             }
1038            
1039             # a remote BigBed database
1040             elsif ($database =~ /\.(?:bb|bigbed)$/i) {
1041             # open using BigBed adaptor
1042 0 0       0 $BIGBED_OK = _load_bigbed_helper_module() unless $BIGBED_OK;
1043 0 0       0 if ($BIGBED_OK) {
1044 0         0 $db = open_bigbed_db($database);
1045 0 0       0 unless ($db) {
1046 0         0 $error = " ERROR: could not open remote BigBed file" .
1047             " '$database'! $!\n";
1048             }
1049             }
1050             else {
1051 0         0 $error = " BigBed database cannot be loaded because\n" .
1052             " Bio::DB::Big or Bio::DB::BigBed is not installed\n";
1053             }
1054             }
1055            
1056             # a remote BigWig database
1057             elsif ($database =~ /\.(?:bw|bigwig)$/i) {
1058             # open using BigWig adaptor
1059 0 0       0 $BIGWIG_OK = _load_bigwig_helper_module() unless $BIGWIG_OK;
1060 0 0       0 if ($BIGWIG_OK) {
1061 0         0 $db = open_bigwig_db($database);
1062 0 0       0 unless ($db) {
1063 0         0 $error = " ERROR: could not open remote BigWig file" .
1064             " '$database'! $!\n";
1065             }
1066             }
1067             else {
1068 0         0 $error = " BigWig database cannot be loaded because\n" .
1069             " Bio::DB::Big or Bio::DB::BigWig is not installed\n";
1070             }
1071             }
1072            
1073             # a remote useq file
1074             elsif ($database =~ /\.useq$/) {
1075             # uh oh! remote useq files are not supported
1076 0         0 $error = " ERROR: remote useq files are not supported!\n";
1077             }
1078            
1079             # a remote fasta file???
1080             elsif ($database =~ /\.fa(?:sta)?$/i) {
1081             # uh oh! remote fasta files are not supported
1082 0         0 $error = " ERROR: remote fasta files are not supported!\n";
1083             }
1084            
1085             # a presumed remote directory, presumably of bigwig files
1086             else {
1087             # open using BigWigSet adaptor
1088 0 0       0 $BIGWIG_OK = _load_helper_module('Bio::ToolBox::db_helper::bigwig')
1089             unless $BIGWIG_OK;
1090 0 0       0 if ($BIGWIG_OK) {
1091 0         0 $db = open_bigwigset_db($database);
1092 0 0       0 unless ($db) {
1093 0         0 $error = " ERROR: could not open presumed remote " .
1094             "BigWigSet directory '$database'! $!\n";
1095             }
1096             }
1097             else {
1098 0         0 $error = " Presumed BigWigSet database cannot be loaded because\n" .
1099             " Bio::DB::BigWigSet is not installed\n";
1100             }
1101             }
1102            
1103             }
1104            
1105             # a local existing file
1106             elsif (-f $database) {
1107            
1108             # a Bam database
1109 1 50       17 if ($database =~ /\.bam$/i) {
    50          
    50          
    50          
    0          
    0          
    0          
1110             # open using appropriate bam adaptor
1111 0 0       0 _load_bam_helper_module() unless $BAM_OK;
1112 0 0       0 if ($BAM_OK) {
1113 0         0 undef $@;
1114 0         0 $db = open_bam_db($database);
1115 0 0       0 unless ($db) {
1116 0         0 $error = " ERROR: could not open local Bam file" .
1117             " '$database'! $@\n";
1118             }
1119             }
1120             else {
1121 0         0 $error = " Bam database cannot be loaded because\n" .
1122             " Bio::DB::Sam or Bio::DB::HTS is not installed\n";
1123             }
1124             }
1125            
1126             # a BigBed database
1127             elsif ($database =~ /\.(?:bb|bigbed)$/i) {
1128             # open using BigBed adaptor
1129 0 0       0 $BIGBED_OK = _load_bigbed_helper_module() unless $BIGBED_OK;
1130 0 0       0 if ($BIGBED_OK) {
1131 0         0 undef $@;
1132 0         0 $db = open_bigbed_db($database);
1133 0 0       0 unless ($db) {
1134 0         0 $error = " ERROR: could not open local BigBed file" .
1135             " '$database'! $@\n";
1136             }
1137             }
1138             else {
1139 0         0 $error = " BigBed database cannot be loaded because\n" .
1140             " Bio::DB::Big or Bio::DB::BigBed is not installed\n";
1141             }
1142             }
1143            
1144             # a BigWig database
1145             elsif ($database =~ /\.(?:bw|bigwig)$/i) {
1146             # open using BigWig adaptor
1147 0 0       0 $BIGWIG_OK = _load_bigwig_helper_module() unless $BIGWIG_OK;
1148 0 0       0 if ($BIGWIG_OK) {
1149 0         0 undef $@;
1150 0         0 $db = open_bigwig_db($database);
1151 0 0       0 unless ($db) {
1152 0         0 $error = " ERROR: could not open local BigWig file" .
1153             " '$database'! $@\n";
1154             }
1155             }
1156             else {
1157 0         0 $error = " BigWig database cannot be loaded because\n" .
1158             " Bio::DB::Big or Bio::DB::BigWig is not installed\n";
1159             }
1160             }
1161            
1162             # a useq database
1163             elsif ($database =~ /\.useq$/i) {
1164             # open using USeq adaptor
1165 1 50       35 $USEQ_OK = _load_helper_module('Bio::ToolBox::db_helper::useq')
1166             unless $USEQ_OK;
1167 1 50       3 if ($USEQ_OK) {
1168 1         2 $db = open_useq_db($database);
1169 1 50       3 unless ($db) {
1170 0         0 $error = " ERROR: could not open local useq file" .
1171             " '$database'! $!\n";
1172             }
1173             }
1174             else {
1175 0         0 $error = " Useq database cannot be loaded because\n" .
1176             " Bio::DB::USeq is not installed\n";
1177             }
1178             }
1179            
1180             # a Fasta File
1181             elsif ($database =~ /\.fa(?:sta)?(?:\.gz)?$/i) {
1182             # first try a modern fai indexed adapter
1183 0 0       0 _load_bam_helper_module() unless $BAM_OK;
1184 0 0       0 if ($BAM_OK) {
1185 0         0 $db = open_indexed_fasta($database);
1186 0 0       0 unless ($db) {
1187 0         0 $error .= " ERROR: could not open indexed fasta file '$database'!\n";
1188             }
1189             }
1190             else {
1191             # open using the old slow BioPerl Fasta adaptor
1192 0 0       0 $SEQFASTA_OK = _load_helper_module('Bio::ToolBox::db_helper::seqfasta')
1193             unless $SEQFASTA_OK;
1194 0 0       0 if ($SEQFASTA_OK) {
1195 0         0 $db = open_fasta_db($database);
1196 0 0       0 unless ($db) {
1197 0         0 $error .= " ERROR: could not open fasta file '$database'!\n";
1198 0 0       0 if (-e "$database\.index") {
1199 0         0 $error .= " Try deleting $database\.index and try again\n";
1200             }
1201             }
1202             }
1203             else {
1204 0         0 $error .= " Fasta file could not be loaded because none of" .
1205             " Bio::DB::HTS, Bio::DB::Sam, or Bio::DB::Fasta is installed\n";
1206             }
1207             }
1208             }
1209            
1210             # a gff3 or sqlite database
1211             elsif ($database =~ /\.(?:gff3?|gff3?\.gz|db|sqlite)$/) {
1212 0 0       0 $SEQFASTA_OK = _load_helper_module('Bio::ToolBox::db_helper::seqfasta')
1213             unless $SEQFASTA_OK;
1214 0 0       0 if ($SEQFASTA_OK) {
1215 0         0 $db = open_store_db($database);
1216 0 0       0 unless ($db) {
1217 0         0 $error = " ERROR: could not load SeqFeature database file '$database'!\n";
1218             }
1219             }
1220             else {
1221 0         0 $error .= " Module Bio::DB::SeqFeature::Store is required to load " .
1222             "GFF and database files\n";
1223             }
1224             }
1225            
1226             # a cram database
1227             elsif ($database =~ /\.cram$/i) {
1228             # open using HTS bam adaptor only
1229 0   0     0 $BAM_ADAPTER ||= 'hts';
1230 0 0       0 if ($BAM_ADAPTER eq 'sam') {
1231 0         0 $error .= " ERROR: Only HTS adapters support Cram files!\n";
1232             }
1233 0 0       0 _load_bam_helper_module() unless $BAM_OK;
1234 0 0       0 if ($BAM_OK) {
1235 0         0 undef $@;
1236 0         0 $db = open_bam_db($database);
1237 0 0       0 unless ($db) {
1238 0         0 $error = " ERROR: could not open local Cram file" .
1239             " '$database'! $@\n";
1240             }
1241             }
1242             else {
1243 0         0 $error = " Cram database cannot be loaded because\n" .
1244             " Bio::DB::HTS is not installed\n";
1245             }
1246             }
1247            
1248             # something unrecognized?
1249             else {
1250 0         0 $error .= " ERROR! Cannot identify database type based on extension for $database!\n";
1251             }
1252             }
1253            
1254             # a directory, presumably of bigwig files
1255             elsif (-d $database) {
1256             # try opening using the BigWigSet adaptor
1257 0 0       0 $BIGWIG_OK = _load_bigwig_helper_module() unless $BIGWIG_OK;
1258 0 0       0 if ($BIGWIG_OK) {
1259 0         0 $db = open_bigwigset_db($database);
1260 0 0       0 unless ($db) {
1261 0         0 $error = " ERROR: could not open local BigWigSet " .
1262             "directory '$database'!\n";
1263 0         0 $error .= " Does directory contain bigWig .bw files?\n";
1264             }
1265             }
1266             else {
1267 0         0 $error = " Presumed BigWigSet database cannot be loaded because\n" .
1268             " Bio::DB::Big or Bio::DB::BigWigSet is not installed\n";
1269             }
1270            
1271             # try opening with the Fasta adaptor
1272 0 0       0 unless ($db) {
1273 0 0       0 $SEQFASTA_OK = _load_helper_module('Bio::ToolBox::db_helper::seqfasta')
1274             unless $SEQFASTA_OK;
1275 0 0       0 if ($SEQFASTA_OK) {
1276 0         0 $db = open_fasta_db($database);
1277 0 0       0 unless ($db) {
1278 0         0 $error .= " ERROR: could not open fasta directory '$database'!\n";
1279 0         0 $error .= " Does directory contain fasta files? If it contains a" .
1280             " directory.index file,\n try deleting it and try again.\n";
1281             }
1282             }
1283             else {
1284 0         0 $error .= " Module Bio::DB::Fasta is required to open presumed fasta " .
1285             "directory\n";
1286             }
1287             }
1288             }
1289            
1290             # otherwise assume the name of a SeqFeature::Store database in the configuration
1291             else {
1292             # attempt to open using information from the configuration
1293             # using default connection information as necessary
1294 0 0       0 $SEQFASTA_OK = _load_helper_module('Bio::ToolBox::db_helper::seqfasta')
1295             unless $SEQFASTA_OK;
1296 0 0       0 if ($SEQFASTA_OK) {
1297 0         0 $db = open_store_db($database);
1298 0 0       0 unless ($db) {
1299 0         0 $error .= " unable to open relational database named '$database'\n";
1300             }
1301             }
1302             else {
1303 0         0 $error .= " Module Bio::DB::SeqFeature::Store is required to connect " .
1304             "to databases\n";
1305             }
1306             }
1307            
1308             # conditional return
1309 1 50       3 if ($db) {
1310             # cache the opened connection for use later
1311 1 50       3 $OPENED_DB{$database} = $db unless $no_cache;
1312            
1313             # return the opened database object
1314 1         3 return $db;
1315             }
1316             else {
1317 0         0 print STDERR $error;
1318 0         0 return;
1319             }
1320             }
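
A minimal sketch of how this connection subroutine might be called from a script, assuming open_db_connection is imported from Bio::ToolBox::db_helper; the file path is hypothetical:

    use Bio::ToolBox::db_helper qw(open_db_connection);

    # the file extension determines which adapter is tried
    my $db = open_db_connection('data/alignments.bam')    # hypothetical path
        or die "unable to open database!\n";

    # a second call with the same name returns the cached object
    my $same_db = open_db_connection('data/alignments.bam');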
1321              
1322              
1323             ### Retrieve a database name from an db object
1324             sub get_db_name {
1325 0     0 1 0 my $db = shift;
1326 0 0       0 return unless $db;
1327 0         0 my $db_ref = ref($db);
1328 0 0       0 return $db unless $db_ref; # presumption that non-object is just a database name
1329 0         0 my $db_name;
1330 0 0       0 if ($db_ref =~ /^Bio::DB::SeqFeature::Store/) {
    0          
    0          
1331             # a SeqFeature database, using any DBI adapter
1332 0         0 $db_name = $db->{'dbh'}->{'name'};
1333             # dig through the object internals to identify the original
1334             # name of the database
1335             # this should be relatively well documented through DBI
1336             # but could break in the future since it's not official API
1337             }
1338             elsif ($db_ref eq 'Bio::DB::Sam') {
1339             # a Samtools Bam database
1340             # old versions <=1.39 don't have the bam_path method, so it's easier
1341             # to explicitly grab it from the object internals
1342 0         0 $db_name = $db->{bam_path};
1343             }
1344             elsif ($db_ref eq 'Bio::DB::HTS') {
1345             # a HTSlib Bam database
1346 0         0 $db_name = $db->hts_path;
1347             }
1348             # determining the database name from other sources is
1349             # either not possible or not easy, so won't bother unless
1350             # there is a really really good need
1351 0         0 return $db_name;
1352             }
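
A short sketch of recovering the original name from an opened object; the path is hypothetical:

    use Bio::ToolBox::db_helper qw(open_db_connection get_db_name);

    my $db   = open_db_connection('data/alignments.bam');  # hypothetical path
    my $name = get_db_name($db);  # returns the original path for Bam adapters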
1353              
1354              
1355             ### Retrieve a list of the microarray data sets from the db
1356             sub get_dataset_list {
1357            
1358 0     0 1 0 my $database = shift;
1359 0         0 my $use_all_features = shift;
1360            
1361             # Open a db connection
1362 0         0 my $db = open_db_connection($database);
1363 0 0       0 unless ($db) {
1364 0         0 carp 'no database connected!';
1365 0         0 return;
1366             }
1367 0         0 my $db_ref = ref $db;
1368            
1369             # process the database types, according to the type of database
1370 0         0 my @types;
1371            
1372             # a SeqFeature database
1373 0 0       0 if ($db_ref =~ /^Bio::DB::SeqFeature::Store/) {
    0          
1374            
1375             # collect the database types,
1376             # sort first by source, then by method
1377             @types = (
1378             map $_->[2],
1379 0 0       0 sort { ($a->[0] cmp $b->[0]) or ($a->[1] cmp $b->[1]) }
  0         0  
1380             map [$_->source, $_->method, $_],
1381             $db->types
1382             );
1383             }
1384            
1385             # a BigWigSet database
1386             elsif ($db_ref =~ /BigWigSet/i) {
1387            
1388             # get the metadata
1389 0         0 my $metadata = $db->metadata;
1390            
1391             # collect
1392 0         0 foreach my $file (keys %{ $metadata }) {
  0         0  
1393             # get the type for each file
1394 0         0 my ($primary, $type, $name);
1395            
1396             # get the appropriate tags
1397 0         0 foreach my $attribute (keys %{ $metadata->{$file} } ) {
  0         0  
1398 0 0       0 if ($attribute =~ m/^type/i) {
    0          
    0          
1399 0         0 $type = $metadata->{$file}{$attribute};
1400             }
1401             elsif ($attribute =~ m/name/i) {
1402 0         0 $name = $metadata->{$file}{$attribute};
1403             }
1404             elsif ($attribute =~ m/^primary_tag|method$/i) {
1405 0         0 $primary = $metadata->{$file}{$attribute};
1406             }
1407             }
1408            
1409             # store the type
1410 0   0     0 push @types, $type || $primary || $name;
1411             }
1412            
1413             # sort the types in alphabetical order
1414             # and discard duplicate types - which may occur with stranded entries
1415 0         0 @types = uniq(sort {$a cmp $b} @types);
  0         0  
1416             }
1417            
1418             # some other database
1419             else {
1420 0         0 carp " no dataset lists for database type $db_ref!\n";
1421             }
1422            
1423             # finish
1424 0         0 return @types;
1425             }
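
A sketch of listing the feature types in an annotation database; the SeqFeature::Store SQLite file name is hypothetical:

    use Bio::ToolBox::db_helper qw(get_dataset_list);

    # types come back as method:source strings, sorted by source then method
    my @types = get_dataset_list('annotation.db');   # hypothetical database
    printf " %d feature types found\n", scalar @types;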
1426              
1427              
1428              
1429             ### Process and verify a dataset
1430             sub verify_or_request_feature_types {
1431            
1432             # Retrieve passed values
1433 1     1 1 2 my %args = @_; # the passed argument values as a hash
1434            
1435             # Check for single option
1436 1   50     5 $args{'single'} ||= 0;
1437            
1438             # Collect the datasets
1439 1         1 my @datasets;
1440 1   50     3 $args{'feature'} ||= undef;
1441 1 50       4 if (ref $args{'feature'} eq 'ARRAY') {
    50          
1442             # an anonymous array of datasets
1443 0         0 @datasets = @{ $args{'feature'} };
  0         0  
1444             }
1445             elsif (defined $args{'feature'}) {
1446 1         2 push @datasets, $args{'feature'};
1447             }
1448             # else it is null and @datasets remains empty, prompting user input
1449            
1450             # feature limits
1451             # this is a regex to limit the feature types presented to the user
1452             # otherwise everything in the database is presented to the user
1453 1   50     12 my $limit = $args{'limit'} ||= undef;
1454            
1455            
1456             # Open database object and collect features
1457 1   50     4 $args{'db'} ||= undef;
1458 1 50       3 my $db = $args{'db'} ? open_db_connection( $args{'db'} ) : undef;
1459            
1460             # Initialize main output arrays
1461 1         3 my @good_datasets;
1462             my @bad_datasets;
1463 1         0 my %db_features; # hash for features found in the database, use when needed
1464            
1465             # Check provided datasets
1466 1 50       2 if (@datasets) {
1467            
1468             # check for multiple comma-delimited datasets
1469 1         1 my @list_to_check;
1470 1         2 foreach my $item (@datasets) {
1471 1 50       3 if ($item =~ /,/) {
1472             # this one has a comma, therefore it has more than one dataset
1473 0         0 push @list_to_check, split(/,/, $item);
1474             }
1475             else {
1476             # a singleton
1477 1         4 push @list_to_check, $item;
1478             }
1479             }
1480            
1481             # now verify the datasets
1482 1         2 foreach my $dataset (@list_to_check) {
1483            
1484             # check for a remote file
1485 1 50       9 if ($dataset =~ /^(?: http | ftp) .+ \. (?: bam | bw | bb) $/xi) {
    50          
1486             # a remote file
1487             # assume it is good, no verification here though
1488             # it will either work or won't work
1489 0         0 push @good_datasets, $dataset;
1490             }
1491            
1492             # a local file
1493             elsif ($dataset =~ /\.(?:bam|bw|bigwig|bb|bigbed|useq)$/i) {
1494             # presume we have a local indexed data file
1495            
1496             # user may have requested two or more files to be merged
1497             # these should be combined with an ampersand
1498             # check each one
1499 1         2 my @files;
1500 1         3 foreach my $file (split /\&/, $dataset) {
1501 1 50       17 if (-e $file) {
1502             # file exists
1503 1         4 push @files, "file:$file";
1504             }
1505             else {
1506             # file doesn't exist! can't use this set of files
1507 0         0 @files = ();
1508 0         0 last;
1509             }
1510             }
1511 1 50       3 if (@files) {
1512 1         10 push @good_datasets, join("&", @files);
1513             }
1514             else {
1515 0         0 push @bad_datasets, $dataset;
1516             }
1517             }
1518            
1519             # a feature type in a database
1520             else {
1521             # must be a database feature type
1522            
1523             # check for database features hash
1524 0 0       0 unless (%db_features) {
1525 0 0       0 if ($db) {
1526             # collect the database features as needed
1527             # I want both method:source and method in the hash
1528 0         0 foreach my $type ( get_dataset_list($db) ) {
1529 0         0 my ($method, $source) = split /:/, $type;
1530 0 0       0 if ($source) {
1531 0         0 $db_features{$type} = 1;
1532 0         0 $db_features{$method} = 1;
1533             }
1534             else {
1535 0         0 $db_features{$type} = 1;
1536             }
1537             }
1538            
1539             # verify
1540 0 0       0 unless (%db_features) {
1541 0         0 carp " provided database has no feature types " .
1542             "to verify dataset(s) against!\n";
1543             }
1544             }
1545             else {
1546             # we need a database
1547 0         0 carp " unable to verify dataset without database";
1548             }
1549             }
1550            
1551             # validate the given dataset
1552             # user may have requested two or more features to be merged
1553             # these should be combined with an ampersand
1554             # check each one
1555 0         0 my $check = 0;
1556 0         0 foreach my $d (split /&/, $dataset) {
1557             # validate this dataset
1558 0 0       0 if (exists $db_features{$d}) {
1559 0         0 $check++;
1560             }
1561             else {
1562             # at least one of these is not good, fail check
1563 0         0 $check = 0;
1564 0         0 last;
1565             }
1566             }
1567            
1568             # report the verification
1569 0 0       0 if ($check) {
1570 0         0 push @good_datasets, $dataset;
1571             }
1572             else {
1573 0         0 push @bad_datasets, $dataset;
1574             }
1575             }
1576             }
1577             }
1578            
1579             # User must select datasets
1580             else {
1581             # dataset not specified
1582             # present the dataset list to the user and get an answer
1583            
1584             # get the dataset list
1585 0 0       0 if ($db) {
1586             # collect the database features as needed
1587 0         0 my $i = 1;
1588 0         0 foreach my $type ( get_dataset_list($db) ) {
1589 0 0       0 if ($limit) {
1590 0         0 my ($p, $s) = split /:/, $type; # split into primary_tag and source
1591             # only keep those types that match our limiter
1592 0 0       0 next unless $p =~ /$limit/i;
1593             }
1594 0         0 $db_features{$i} = $type;
1595 0         0 $i++;
1596             }
1597            
1598             # verify
1599 0 0       0 unless (%db_features) {
1600 0         0 carp " provided database has no feature types " .
1601             "to collect!\n";
1602 0         0 return;
1603             }
1604             }
1605             else {
1606             # we need a database
1607 0         0 carp " no database provided from which to collect features!\n";
1608 0         0 return;
1609             }
1610            
1611             # present the list
1612 0         0 print "\n These are the available data sets in the database:\n";
1613 0         0 foreach (sort {$a <=> $b} keys %db_features) {
  0         0  
1614             # print out the list of microarray data sets
1615 0         0 print " $_\t$db_features{$_}\n";
1616             }
1617            
1618             # prompt the user
1619 0   0     0 $args{'prompt'} ||= undef;
1620 0 0       0 if ($args{'prompt'}) {
1621             # provided custom prompt
1622 0         0 print $args{'prompt'};
1623             }
1624             else {
1625             # generic prompt
1626 0 0       0 if ($args{'single'}) {
1627 0         0 print " Enter one number for the data set or feature ";
1628             }
1629             else {
1630 0         0 print " Enter the number(s) or range of the data set(s) or" .
1631             " feature(s) ";
1632             }
1633             }
1634            
1635             # get answer from the user
1636 0         0 my $answer = <STDIN>;
1637 0         0 chomp $answer;
1638 0         0 my @answer_list = parse_list($answer);
1639            
1640             # take the first one if requested
1641 0 0       0 if ($args{'single'}) {
1642 0 0       0 unless (scalar @answer_list == 1) {
1643 0         0 splice(@answer_list, 1);
1644             }
1645             }
1646            
1647             # verify the answer list
1648 0         0 foreach my $answer (@answer_list) {
1649            
1650             # check for merged datasets
1651 0 0       0 if ($answer =~ /&/) {
1652             # a merged dataset
1653 0         0 my @list = split /&/, $answer;
1654 0         0 my $check = 1;
1655            
1656             # check all are good
1657 0         0 foreach (@list) {
1658 0 0       0 unless (exists $db_features{$_}) {
1659 0         0 $check = 0;
1660             }
1661             }
1662            
1663             # if all are good
1664 0 0       0 if ($check) {
1665             push @good_datasets,
1666 0         0 join( "&", map { $db_features{$_} } @list);
  0         0  
1667             }
1668             else {
1669 0         0 push @bad_datasets, $answer;
1670             }
1671             }
1672            
1673             else {
1674             # a single dataset
1675             # check if it is good
1676            
1677 0 0       0 if (exists $db_features{$answer}) {
1678 0         0 push @good_datasets, $db_features{$answer};
1679             }
1680             else {
1681 0         0 push @bad_datasets, $answer; # no feature exists for this number
1682             }
1683             }
1684             }
1685             }
1686            
1687             # Print bad results
1688 1 50       3 if (@bad_datasets) {
1689 0         0 print " The following datasets could not be verified:\n";
1690 0         0 foreach (@bad_datasets) {
1691 0         0 print " $_\n";
1692             }
1693             }
1694            
1695             # Return good results
1696 1 50       3 if ($args{'single'}) {
1697 0         0 return $good_datasets[0];
1698             }
1699             else {
1700 1         4 return @good_datasets;
1701             }
1702             }
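
A sketch of verifying datasets non-interactively; the bigWig file names are hypothetical, and two files joined with an ampersand are verified together as a set to be merged:

    use Bio::ToolBox::db_helper qw(verify_or_request_feature_types);

    my @good = verify_or_request_feature_types(
        feature => 'sample1.bw&sample2.bw',   # hypothetical local files
    );
    die "datasets failed verification\n" unless @good;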
1703              
1704              
1705             ### Check a dataset for RPM normalization support
1706             sub check_dataset_for_rpm_support {
1707            
1708             # get passed dataset and databases
1709 0     0 1 0 my $dataset = shift;
1710 0         0 my $cpu = shift;
1711            
1712             # Calculate the total number of reads
1713             # previously computed sums are cached in the global %TOTAL_READ_NUMBER
1714 0         0 my $rpm_read_sum;
1715            
1716 0 0       0 if (exists $TOTAL_READ_NUMBER{$dataset}) {
    0          
    0          
1717             # this dataset has already been summed
1718             # no need to do it again
1719 0         0 $rpm_read_sum = $TOTAL_READ_NUMBER{$dataset};
1720             }
1721            
1722             elsif ($dataset =~ /\.bam$/) {
1723             # a bam file dataset
1724            
1725 0 0       0 _load_bam_helper_module() unless $BAM_OK;
1726 0 0       0 if ($BAM_OK) {
1727             # Bio::ToolBox::db_helper::bam was loaded ok
1728             # sum the number of reads in the dataset
1729 0         0 $rpm_read_sum = sum_total_bam_alignments($dataset, 0, 0, $cpu);
1730             }
1731             else {
1732 0         0 carp " Bam support is not available! " .
1733             "Is Bio::DB::Sam or Bio::DB::HTS installed?\n";
1734 0         0 return;
1735             }
1736             }
1737            
1738             elsif ($dataset =~ /\.bb$/) {
1739             # a bigbed file dataset
1740            
1741 0 0       0 $BIGBED_OK = _load_helper_module('Bio::ToolBox::db_helper::bigbed') unless
1742             $BIGBED_OK;
1743 0 0       0 if ($BIGBED_OK) {
1744             # Bio::ToolBox::db_helper::bigbed was loaded ok
1745             # sum the number of features in the dataset
1746 0         0 $rpm_read_sum = sum_total_bigbed_features($dataset);
1747             }
1748             else {
1749 0         0 carp " BigBed support is not available! " .
1750             "Is Bio::DB::BigBed installed?\n";
1751 0         0 return;
1752             }
1753             }
1754            
1755             else {
1756             # some other non-supported dataset
1757 0         0 return;
1758             }
1759            
1760             # return the sum value if we've made it this far
1761 0         0 $TOTAL_READ_NUMBER{$dataset} = $rpm_read_sum;
1762 0         0 return $rpm_read_sum;
1763             }
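
A sketch of pre-computing a total read count for RPM normalization, assuming a hypothetical bam file and four threads:

    use Bio::ToolBox::db_helper qw(check_dataset_for_rpm_support);

    my $total = check_dataset_for_rpm_support('data/alignments.bam', 4);
    # an undefined value means the dataset type cannot support RPM
    printf " %d total alignments\n", $total if defined $total;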
1764              
1765              
1766             ### Generate a new list of features
1767             sub get_new_feature_list {
1768              
1769             # Retrieve passed values
1770 0     0 1 0 my %args = @_; # the passed argument values
1771            
1772             # check data object
1773 0   0     0 my $data = $args{data} || undef;
1774 0 0       0 unless ($data) {
1775 0         0 confess "must pass a 'data' key and Bio::ToolBox::Data object!";
1776 0         0 return;
1777             }
1778 0 0       0 unless (ref($data) eq 'Bio::ToolBox::Data') {
1779 0         0 confess 'must pass a Bio::ToolBox::Data object!';
1780 0         0 return;
1781             }
1782            
1783             # Open a db connection
1784 0   0     0 $args{'db'} ||= undef;
1785 0         0 my $db = open_db_connection($args{'db'});
1786 0 0       0 unless ($db) {
1787 0         0 carp 'no database connected!';
1788 0         0 return;
1789             }
1790              
1791             # Verify a SeqFeature::Store database
1792 0         0 my $db_ref = ref $db;
1793 0 0       0 unless ($db_ref =~ /^Bio::DB::SeqFeature::Store/) {
1794 0         0 carp "Database type $db_ref doesn't support generating feature lists!\n";
1795 0         0 return;
1796             }
1797            
1798             # Features to search for
1799 0   0     0 my $searchFeature = $args{'feature'} || $args{'features'} || undef;
1800 0 0       0 unless ($searchFeature) {
1801 0         0 carp "no search feature types passed!";
1802 0         0 return;
1803             }
1804 0         0 my @classes = split(',', $searchFeature); # it may or may not be a list
1805            
1806             # chromosomes to skip
1807 0   0     0 my $chr_exclude = $args{'chrskip'} || undef;
1808            
1809             # Add table columns
1810 0         0 my $pid_i = $data->add_column('Primary_ID');
1811 0         0 my $name_i = $data->add_column('Name');
1812 0         0 my $type_i = $data->add_column('Type');
1813            
1814             # Set up the database iterator
1815 0         0 print " Searching for $searchFeature features\n";
1816 0 0       0 my $iterator = $db->get_seq_stream(
1817             -types => scalar @classes ? \@classes : $searchFeature,
1818             );
1819 0 0       0 unless ($iterator) {
1820             # there should be some features found in the database
1821 0         0 carp "could not get feature iterator for database";
1822 0         0 return;
1823             }
1824            
1825             # Walk through the collected features
1826 0         0 my $total_count = 0; # total found features
1827 0         0 while (my $feature = $iterator->next_seq) {
1828 0         0 $total_count++;
1829            
1830             # skip genes from excluded chromosomes
1831 0 0 0     0 next if (defined $chr_exclude and $feature->seq_id =~ $chr_exclude);
1832            
1833             # Record the feature information
1834             # in the B::DB::SF::S database, the primary_ID is a number unique to the
1835             # the specific database, and is not portable between databases
1836 0         0 $data->add_row( [
1837             $feature->primary_id,
1838             $feature->display_name,
1839             $feature->type,
1840             ] );
1841             }
1842            
1843             # print result of search
1844 0         0 printf " Found %s features in the database\n", format_with_commas($total_count);
1845 0         0 printf " Kept %s features.\n", format_with_commas($data->last_row);
1846            
1847             # return the new data structure
1848 0         0 return 1;
1849             }
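
A sketch of collecting a gene list into an empty Bio::ToolBox::Data table; the database name is hypothetical. In practice, constructing Bio::ToolBox::Data with db and feature options invokes this subroutine internally:

    use Bio::ToolBox::Data;
    use Bio::ToolBox::db_helper qw(get_new_feature_list);

    my $Data = Bio::ToolBox::Data->new();   # empty data table
    get_new_feature_list(
        data    => $Data,
        db      => 'annotation.db',         # hypothetical database
        feature => 'gene',
    ) or die "feature collection failed\n";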
1850              
1851              
1852             ### Generate a new list of genomic windows
1853             sub get_new_genome_list {
1854              
1855             # Collect the passed arguments
1856 1     1 1 5 my %args = @_;
1857            
1858             # check data object
1859 1   50     3 my $data = $args{data} || undef;
1860 1 50       4 unless ($data) {
1861 0         0 confess "must pass a 'data' key and Bio::ToolBox::Data object!";
1862             }
1863 1 50       3 unless (ref($data) eq 'Bio::ToolBox::Data') {
1864 0         0 confess 'must pass a Bio::ToolBox::Data object!';
1865             }
1866            
1867             # Open a db connection
1868 1   50     3 $args{'db'} ||= undef;
1869 1         3 my $db = open_db_connection($args{'db'});
1870 1 50       3 unless ($db) {
1871 0         0 carp 'no database connected!';
1872 0         0 return;
1873             }
1874            
1875             # Determine win and step sizes
1876 1   50     3 $args{'win'} ||= 500; # hard coded default size
1877 1   33     5 $args{'step'} ||= $args{'win'};
1878            
1879            
1880             # Prepare data structures
1881 1         5 my $chr_i = $data->add_column('Chromosome');
1882 1         3 my $start_i = $data->add_column('Start');
1883 1         66 my $stop_i = $data->add_column('Stop');
1884 1         7 $data->metadata($start_i, 'win', $args{'win'});
1885 1         3 $data->metadata($start_i, 'step', $args{'step'});
1886            
1887             # Collect the chromosomes
1888 1   50     6 my $chr_exclude = $args{'skip'} || $args{'chrskip'} || undef;
1889 1         5 my @chromosomes = get_chromosome_list($db, $chr_exclude);
1890 1 50       36 unless (@chromosomes) {
1891 0         0 carp " no sequence IDs were found in the database!\n";
1892 0         0 return;
1893             }
1894            
1895             # Load exclusion intervals
1896 1         2 my %exclusion_tree;
1897 1   50     7 $args{exclude} ||= $args{blacklist} || undef;
      33        
1898 1 50 33     6 if (exists $args{exclude} and defined $args{exclude}) {
1899 0 0       0 if (ref($args{exclude}) eq 'Bio::ToolBox::Data') {
1900 0 0       0 if (_load_helper_module('Set::IntervalTree')) {
1901             # iterate through the data object of intervals
1902             # prepare an Interval Tree for each chromosome
1903             # and load the exclusion interval into it
1904             $args{exclude}->iterate( sub {
1905 0     0   0 my $row = shift;
1906 0 0       0 unless (exists $exclusion_tree{$row->seq_id} ) {
1907 0         0 $exclusion_tree{ $row->seq_id } = Set::IntervalTree->new();
1908             }
1909 0         0 $exclusion_tree{ $row->seq_id }->insert(1, $row->start - 1, $row->end);
1910 0         0 } );
1911             }
1912             else {
1913 0         0 carp " Set::IntervalTree must be installed to use exclusion intervals!";
1914             }
1915             }
1916             else {
1917 0         0 confess " Exclusion data must be a Bio::ToolBox::Data object!";
1918             }
1919             }
1920            
1921             # Collect the genomic windows
1922 1         31 print " Generating $args{win} bp windows in $args{step} bp increments\n";
1923 1         6 foreach (@chromosomes) {
1924            
1925             # name and length as sub-array in each element
1926 1         1 my ($chr, $length) = @{$_};
  1         3  
1927 1   50     5 my $Tree = $exclusion_tree{$chr} || undef;
1928            
1929             # iterate through chromosome
1930 1         4 for (my $start = 1; $start <= $length; $start += $args{step}) {
1931 122         127 my $end = $start + $args{win} - 1;
1932 122 100       143 if ($end > $length) {
1933             # fix end to the length of the chromosome
1934 1         2 $end = $length;
1935             }
1936             # skip if excluded
1937 122 50 33     152 next if ($Tree and scalar( @{ $Tree->fetch($start - 1, $end) } ) >= 1);
  0         0  
1938 122         184 $data->add_row( [ $chr, $start, $end] );
1939             }
1940             }
1941 1         27 print " Kept " . $data->{'last_row'} . " windows.\n";
1942            
1943             # Return the data structure
1944 1         9 return 1;
1945             }
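
A sketch of tiling a genome into windows, with hypothetical names; when win and step are not provided, both fall back to the 500 bp default noted above:

    use Bio::ToolBox::Data;
    use Bio::ToolBox::db_helper qw(get_new_genome_list);

    my $Data = Bio::ToolBox::Data->new();   # empty data table
    get_new_genome_list(
        data => $Data,
        db   => 'genome.fa',                # hypothetical indexed fasta
        win  => 1000,
        step => 500,
    ) or die "window generation failed\n";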
1946              
1947              
1948             sub get_db_feature {
1949 0     0 1 0 my %args = @_;
1950            
1951             # Open a db connection
1952 0   0     0 $args{db} ||= undef;
1953 0 0       0 unless ($args{db}) {
1954 0         0 croak 'no database provided for getting a feature!';
1955             }
1956 0         0 my $db = open_db_connection($args{db});
1957 0         0 my $db_ref = ref $db;
1958            
1959             # get the name of the feature
1960 0   0     0 my $name = $args{'name'} || undef;
1961            
1962             # check for values and internal nulls
1963 0 0       0 $args{'id'} = exists $args{'id'} ? $args{'id'} : undef;
1964 0   0     0 $args{'type'} ||= undef;
1965 0 0 0     0 undef $name if $name and $name eq '.';
1966 0 0 0     0 undef $args{'id'} if $args{'id'} and $args{'id'} eq '.';
1967 0 0 0     0 undef $args{'type'} if $args{'type'} and $args{'type'} eq '.';
1968            
1969             # quick method for feature retrieval
1970 0 0 0     0 if (defined $args{'id'} and $db->can('fetch')) {
1971             # we have a unique primary_id to identify the feature
1972             # usually this pulls out the feature directly
1973 0   0     0 my $feature = $db->fetch($args{'id'}) || undef;
1974            
1975             # check that the feature we pulled out is what we want
1976 0 0       0 my $check = $feature ? 1 : 0;
1977 0 0       0 if ($check) {
1978 0 0 0     0 $check = 0 if (defined $name and $feature->display_name ne $name);
1979 0 0 0     0 $check = 0 if (defined $args{'type'} and $feature->type ne $args{'type'});
1980             }
1981            
1982             # return if things match up as best we can tell
1983 0 0       0 if ($check) {
1984 0         0 return $feature;
1985             }
1986             else {
1987             # the primary_ids are out of date
1988 0 0       0 unless ($PRIMARY_ID_WARNING) {
1989 0         0 warn "CAUTION: Some primary IDs in Input file appear to be out of date\n";
1990 0         0 $PRIMARY_ID_WARNING++;
1991             }
1992             }
1993             }
1994            
1995             # otherwise use name and type
1996 0 0       0 return unless $name; # otherwise db will return all features! Not what we want!
1997             my @features = $db->features(
1998             -name => $name, # we assume this name will be unique
1999             -aliases => 0,
2000 0         0 -type => $args{'type'},
2001             );
2002            
2003             # if none are found
2004 0 0       0 unless (@features) {
2005             # try again with aliases enabled
2006             @features = $db->features(
2007             -name => $name,
2008             -aliases => 1,
2009 0         0 -type => $args{'type'},
2010             );
2011             }
2012 0 0 0     0 if (!@features and $name =~ /[;,\|]/) {
2013             # I used to append aliases to the end of the name in old versions of biotoolbox
2014             # crazy, I know, but just in case this happened, let's try breaking them apart
2015 0         0 my $name2 = (split(/\s*[;,\|]\s*/, $name))[0];
2016             # multiple names present using common delimiters ;,|
2017             # take the first name only, assume others are aliases that we don't need
2018             @features = $db->features(
2019             -name => $name2,
2020             -aliases => 1,
2021 0         0 -type => $args{'type'},
2022             );
2023             }
2024            
2025             # check the number of features returned
2026 0 0       0 if (scalar @features > 1) {
    0          
2027             # there should only be one feature found
2028             # if more are returned, there's redundant or duplicated data in the db
2029            
2030             # first check whether addition of aliases may improve accuracy
2031 0 0       0 if ($args{'name'} =~ /;/) {
2032             # we have presumed aliases in the name value
2033 0         0 my $check = $args{'name'};
2034 0         0 $check =~ s/\s*;\s*/;/g; # strip spaces just in case
2035            
2036             # check each feature and try to match name and aliases
2037 0         0 my @candidates;
2038 0         0 foreach my $f (@features) {
2039 0         0 my $f_name = join(';', $f->display_name, ($f->get_tag_values('Alias')));
2040 0 0       0 if ($check eq $f_name) {
2041             # found one
2042 0         0 push @candidates, $f;
2043             }
2044             }
2045            
2046 0 0       0 if (scalar @candidates == 1) {
    0          
2047             # we found it!
2048 0         0 return shift @candidates;
2049             }
2050             elsif (scalar @candidates > 1) {
2051             # hopefully we've improved a little bit?
2052 0         0 @features = @candidates;
2053             }
2054             }
2055            
2056             # warn the user, this should be fixed
2057             printf " Found %s %s features named '$name' in the database! Using first one.\n",
2058 0         0 scalar(@features), $args{'type'};
2059             }
2060             elsif (!@features) {
2061 0         0 printf " Found no %s features named '$name' in the database!\n", $args{'type'};
2062 0         0 return;
2063             }
2064            
2065             # done
2066 0         0 return shift @features;
2067             }
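
A sketch of fetching a single feature back out of an annotation database; the id, name, and type values here are hypothetical. The primary id is tried first, with name and type as the fallback lookup:

    use Bio::ToolBox::db_helper qw(get_db_feature);

    my $feature = get_db_feature(
        db   => 'annotation.db',   # hypothetical database
        id   => 12345,             # database-specific primary_id, if known
        name => 'YAL001C',
        type => 'gene',
    );
    printf " %s:%d-%d\n", $feature->seq_id, $feature->start, $feature->end
        if $feature;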
2068              
2069              
2070             ### Get a dataset score for a single region
2071             sub get_segment_score {
2072             # parameters passed as an array
2073             # we will be passing this array on as a reference to the appropriate
2074             # imported helper subroutine
2075             # chromosome, start, stop, strand, strandedness, method, return type, db, dataset
2076 12 50   12 1 23 confess "incorrect number of parameters passed!" unless scalar @_ == 9;
2077            
2078             # check the database
2079 12 100 66     41 $_[DB] = open_db_connection($_[DB]) if ($_[DB] and not ref($_[DB]));
2080            
2081             # check for combined datasets
2082 12 50       49 if ($_[DATA] =~ /&/) {
2083 0         0 push @_, (split '&', pop @_);
2084             }
2085            
2086             # determine method
2087             # yes, we're only checking the first dataset, but they should all
2088             # be the same type
2089 12   66     71 my $db_method = $DB_METHODS{$_[METH]}{$_[RETT]}{$_[DB]}{$_[DATA]} ||
2090             _lookup_db_method(\@_);
2091            
2092             # return type values
2093             # 0 = calculate score
2094             # 1 = score array
2095             # 2 = hash position scores
2096             # immediately return calculated score if appropriate
2097 12 100       24 if ($_[RETT] > 0) {
2098             # immediately return either indexed hash or array of scores
2099 5         7 return &{$db_method}(\@_);
  5         11  
2100             }
2101             else {
2102             # calculate a score
2103 7         9 my $scores = &{$db_method}(\@_);
  7         16  
2104             # this might be an array reference of scores that need to be combined
2105             # or it could be a single scalar score which just needs to be returned
2106 7 50       18 if (ref($scores)) {
2107 7         18 return calculate_score($_[METH], $scores);
2108             }
2109             else {
2110 0 0       0 if (not defined $scores) {
2111 0 0       0 return 0 if $_[METH] =~ /count|sum/;
2112 0         0 return '.';
2113             }
2114 0         0 return $scores;
2115             }
2116             }
2117             }
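
A sketch of scoring a region against a bigWig file using the nine positional parameters in the order documented above; the coordinates, parameter values, and file name are illustrative only:

    use Bio::ToolBox::db_helper qw(get_segment_score);

    my $score = get_segment_score(
        'chr1',           # chromosome
        5000,             # start
        6000,             # stop
        1,                # strand
        'all',            # strandedness
        'mean',           # method
        0,                # return type: 0 returns a single calculated score
        undef,            # db is not needed for file-based datasets
        'file:data.bw',   # hypothetical dataset file
    );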
2118              
2119              
2120             sub calculate_score {
2121 7     7 1 12 my ($method, $scores) = @_;
2122 7   50     14 $scores ||= []; # just in case
2123 7 50       14 if (exists $SCORE_CALCULATOR_SUB{$method}) {
2124 7         11 return &{$SCORE_CALCULATOR_SUB{$method}}($scores);
  7         17  
2125             }
2126             else {
2127 0         0 confess " unrecognized method '$method'!";
2128             }
2129             }
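
A sketch of combining raw scores directly, assuming the method name matches one of the registered calculator keys:

    use Bio::ToolBox::db_helper qw(calculate_score);

    my $mean = calculate_score('mean', [ 1, 2, 3, 4 ]);   # 2.5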
2130              
2131              
2132             sub get_chromosome_list {
2133            
2134             # options
2135 2     2 1 297 my $database = shift;
2136 2   50     8 my $chr_exclude = shift || undef;
2137            
2138             # Open a db connection
2139 2         5 my $db = open_db_connection($database);
2140 2 50       5 unless ($db) {
2141 0         0 carp 'no database connected!';
2142 0         0 return;
2143             }
2144            
2145             # Check for BigWigSet database
2146             # these need to be handled a little differently
2147 2 50       7 if (ref($db) =~ /BigWigSet/) {
2148             # BigWigSet databases are the only databases that don't
2149             # support the seq_ids method
2150             # instead we have to look at one of the bigwigs in the set
2151 0         0 my $bw_file = ($db->bigwigs)[0];
2152 0         0 $db = open_db_connection($bw_file);
2153             }
2154            
2155            
2156             # Collect chromosome lengths
2157             # we want to maintain the original order of the chromosomes so we
2158             # won't be putting it into a hash
2159             # instead an array of arrays
2160 2         4 my @chrom_lengths;
2161            
2162             # Database specific approaches to collecting chromosomes
2163             # I used to have one generic approach, but idiosyncrasies and potential
2164             # bugs make me use different approaches for better consistency
2165 2         3 my $type = ref($db);
2166            
2167             # SeqFeature::Store
2168 2 50 33     33 if ($type =~ /^Bio::DB::SeqFeature::Store/) {
    50 33        
    50          
    50          
    50          
    50          
    50          
2169 0         0 for my $chr ($db->seq_ids) {
2170            
2171             # check for excluded chromosomes
2172 0 0 0     0 next if (defined $chr_exclude and $chr =~ /$chr_exclude/i);
2173            
2174             # get chromosome size
2175 0         0 my ($seqf) = $db->get_features_by_name($chr);
2176 0 0       0 my $length = $seqf ? $seqf->length : 0;
2177            
2178             # store
2179 0         0 push @chrom_lengths, [ $chr, $length ];
2180             }
2181             }
2182            
2183             # libBigWig Bigfile, including both bigWig and bigBed
2184             elsif ($type eq 'Bio::DB::Big::File') {
2185             # this doesn't follow the typical BioPerl convention
2186             # it's a hash, so randomly sorted!!!!! Never will be in same order as file!!!!
2187 0         0 my $chroms = $db->chroms();
2188            
2189 0         0 my @list;
2190 0         0 foreach (values %$chroms) {
2191             # check for excluded chromosomes
2192 0 0 0     0 next if (defined $chr_exclude and $_->{name} =~ /$chr_exclude/i);
2193 0         0 push @list, [$_->{name}, $_->{length}];
2194             }
2195             # sort consistently by a sane system
2196 0         0 @chrom_lengths = sane_chromo_sort(@list);
2197             }
2198            
2199             # UCSC kent Bigfile
2200             elsif ($type eq 'Bio::DB::BigWig' or $type eq 'Bio::DB::BigBed') {
2201 0         0 foreach my $chr ($db->seq_ids) {
2202            
2203             # check for excluded chromosomes
2204 0 0 0     0 next if (defined $chr_exclude and $chr =~ /$chr_exclude/i);
2205            
2206             # get chromosome size
2207 0         0 my $length = $db->length($chr);
2208            
2209             # store
2210 0         0 push @chrom_lengths, [ $chr, $length ];
2211             }
2212             }
2213            
2214             # Samtools Bam
2215             elsif ($type eq 'Bio::DB::Sam' or $type eq 'Bio::DB::HTS') {
2216 0         0 for my $tid (0 .. $db->n_targets - 1) {
2217             # each chromosome is internally represented in the bam file as
2218             # a numeric target identifier
2219             # we can easily convert this to an actual sequence name
2220             # we will force the conversion to go one chromosome at a time
2221            
2222             # sequence info
2223 0         0 my $chr = $db->target_name($tid);
2224 0         0 my $length = $db->target_len($tid);
2225            
2226             # check for excluded chromosomes
2227 0 0 0     0 next if (defined $chr_exclude and $chr =~ /$chr_exclude/i);
2228            
2229             # store
2230 0         0 push @chrom_lengths, [ $chr, $length ];
2231             }
2232             }
2233            
2234             # Fasta through modern HTS fai index
2235             elsif ($type eq 'Bio::DB::HTS::Faidx') {
2236 0         0 for my $chr ($db->get_all_sequence_ids) {
2237             # check for excluded
2238 0 0 0     0 next if (defined $chr_exclude and $chr =~ /$chr_exclude/i);
2239            
2240             # get length and store it
2241 0         0 my $length = $db->length($chr);
2242 0         0 push @chrom_lengths, [$chr, $length];
2243             }
2244             }
2245            
2246             # Fasta through old BioPerl adapter
2247             elsif ($type eq 'Bio::DB::Fasta') {
2248 0         0 for my $chr ($db->get_all_ids) {
2249            
2250             # check for excluded chromosomes
2251 0 0 0     0 next if (defined $chr_exclude and $chr =~ /$chr_exclude/i);
2252            
2253             # get chromosome size
2254 0         0 my $seq = $db->get_Seq_by_id($chr);
2255 0 0       0 my $length = $seq ? $seq->length : 0;
2256            
2257             # store
2258 0         0 push @chrom_lengths, [ $chr, $length ];
2259             }
2260             }
2261            
2262             # other Bioperl adapter?????
2263             elsif ($db->can('seq_ids')) {
2264 2         8 foreach my $chr ($db->seq_ids) {
2265            
2266             # check for excluded chromosomes
2267 2 50 33     20 next if (defined $chr_exclude and $chr =~ /$chr_exclude/i);
2268            
2269             # generate a segment representing the chromosome
2270             # due to fuzzy name matching, we may get more than one back
2271 2         18 my @segments = $db->segment($chr);
2272             # need to find the right one
2273 2         247 my $segment;
2274 2         12 while (@segments) {
2275 2         4 $segment = shift @segments;
2276 2 50       8 last if $segment->seq_id eq $chr;
2277             }
2278            
2279             # check segment
2280 2 50       25 unless ($segment) {
2281 0         0 carp " No genome segment for seq_id $chr!!?? skipping\n";
2282 0         0 next;
2283             }
2284            
2285             # get the chromosome length
2286 2         10 my $length = $segment->length;
2287            
2288             # store
2289 2         49 push @chrom_lengths, [ $chr, $length ];
2290             }
2291             }
2292            
2293             else {
2294 0         0 carp " Unable to identify chromosomes in database type '$type'. Try using a different database file or adapter\n";
2295 0         0 return;
2296             }
2297            
2298             # Return
2299 2 50       6 unless (@chrom_lengths) {
2300 0         0 carp " no chromosome sequences identified in database!\n";
2301 0         0 return;
2302             }
2303 2         6 return @chrom_lengths;
2304             }
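
A sketch of listing chromosomes while skipping mitochondrial and contig names via the exclusion regex; the file name is hypothetical:

    use Bio::ToolBox::db_helper qw(get_chromosome_list);

    foreach ( get_chromosome_list('data/alignments.bam', 'chrM|contig') ) {
        my ($chr, $length) = @$_;   # each element is a [name, length] pair
        print "$chr\t$length\n";
    }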
2305              
2306              
2307             sub low_level_bam_fetch {
2308 0 0   0 1 0 confess "incorrect number of parameters passed!" unless scalar @_ == 6;
2309 0         0 my ($sam, $tid, $start, $stop, $callback, $data) = @_;
2310             # run the low level bam fetch based on which adapter is being used
2311 0 0       0 unless ($BAM_ADAPTER) {
2312 0 0       0 $BAM_ADAPTER = ref($sam) =~ /hts/i ? 'hts' : 'sam';
2313             }
2314 0 0       0 if ($BAM_ADAPTER eq 'hts') {
    0          
2315             # using Bio::DB::HTS
2316 0         0 my $index = $sam->hts_index;
2317 0 0       0 return unless $index;
2318 0         0 return $index->fetch($sam->hts_file, $tid, $start, $stop, $callback, $data);
2319             }
2320             elsif ($BAM_ADAPTER eq 'sam') {
2321             # using Bio::DB::Sam
2322 0         0 my $index = $sam->bam_index;
2323 0 0       0 return unless $index;
2324 0         0 return $index->fetch($sam->bam, $tid, $start, $stop, $callback, $data);
2325             }
2326             else {
2327 0         0 confess "no bam adapter loaded!\n";
2328             }
2329             }
2330              
2331              
2332             sub low_level_bam_coverage {
2333 0 0   0 1 0 confess "incorrect number of parameters passed!" unless scalar @_ == 4;
2334 0         0 my ($sam, $tid, $start, $stop) = @_;
2335             # run the the low level bam coverage based on which adapter is being used
2336 0 0       0 unless ($BAM_ADAPTER) {
2337 0 0       0 $BAM_ADAPTER = ref($sam) =~ /hts/i ? 'hts' : 'sam';
2338             }
2339 0 0       0 if ($BAM_ADAPTER eq 'hts') {
    0          
2340             # using Bio::DB::HTS
2341 0         0 my $index = $sam->hts_index;
2342 0 0       0 return unless $index;
2343 0         0 return $index->coverage($sam->hts_file, $tid, $start, $stop);
2344             }
2345             elsif ($BAM_ADAPTER eq 'sam') {
2346             # using Bio::DB::Sam
2347 0         0 my $index = $sam->bam_index;
2348 0 0       0 return unless $index;
2349 0         0 return $index->coverage($sam->bam, $tid, $start, $stop);
2350             }
2351             else {
2352 0         0 confess "no bam adapter loaded!\n";
2353             }
2354             }
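
A sketch of the low-level coverage call against an opened bam object; the target id and zero-based coordinates are hypothetical, and the result is expected to be an array reference of per-position depths:

    use Bio::ToolBox::db_helper qw(open_db_connection low_level_bam_coverage);

    my $sam      = open_db_connection('data/alignments.bam');  # hypothetical
    my $coverage = low_level_bam_coverage($sam, 0, 0, 10_000);
    printf " %d positions returned\n", scalar @$coverage;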
2355              
2356              
2357             sub get_genomic_sequence {
2358 0 0   0 1 0 confess "incorrect number of parameters passed!" unless scalar @_ == 4;
2359 0         0 my ($db, $chrom, $start, $stop) = @_;
2360            
2361             # check database
2362 0         0 $db = open_db_connection($db) unless ref($db);
2363 0 0       0 my $type = ref($db); # determined after any new connection is opened
2364            
2365             # return sequence based on the type of database adapter we're using
2366 0 0       0 if ($type eq 'Bio::DB::HTS::Faidx') {
    0          
    0          
2367 0         0 return $db->get_sequence_no_length("$chrom:$start-$stop");
2368             }
2369             elsif ($type eq 'Bio::DB::Sam::Fai') {
2370 0         0 return $db->fetch("$chrom:$start-$stop");
2371             }
2372             elsif ($db->can('seq')) {
2373             # BioPerl database including Bio::DB::Fasta and Bio::DB::SeqFeature::Store
2374 0         0 return $db->seq($chrom, $start, $stop);
2375             }
2376             else {
2377 0         0 confess "unsupported database $type for collecting genomic sequence!\n";
2378             }
2379             }
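
A sketch of pulling a sequence slice out of an indexed fasta; the file name and coordinates are hypothetical:

    use Bio::ToolBox::db_helper qw(open_db_connection get_genomic_sequence);

    my $fasta = open_db_connection('genome.fa');   # hypothetical fasta file
    my $seq   = get_genomic_sequence($fasta, 'chr1', 101, 200);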
2380              
2381              
2382             ### deprecated methods
2383             # just in case
2384             sub validate_included_feature {
2385 0     0 0 0 confess "validate_included_feature() is no longer a valid method. " .
2386             "Please update your script to the current API.\n";
2387             }
2388              
2389              
2390             ### Internal subroutine to look up the appropriate score collection method
2391             sub _lookup_db_method {
2392             # parameters passed as an array reference
2393 9     9   14 my $param = shift;
2394             # determine the appropriate score method
2395 9         10 my $score_method;
2396 9 50       31 if ($param->[DATA] =~ /^file|http|ftp/) {
    0          
    0          
2397             # collect the data according to file type
2398            
2399             # BigWig Data file
2400 9 50       69 if ($param->[DATA] =~ /\.(?:bw|bigwig)$/i) {
    50          
    50          
    50          
2401             # file is in bigwig format
2402             # get the dataset scores using Bio::ToolBox::db_helper::bigwig
2403             # this uses either Bio::DB::BigWig or Bio::DB::Big
2404            
2405             # check that we have bigwig support
2406 0 0       0 $BIGWIG_OK = _load_bigwig_helper_module() unless $BIGWIG_OK;
2407 0 0       0 if ($BIGWIG_OK) {
2408 0 0       0 if ($param->[RETT] == 2) {
    0          
    0          
    0          
2409 0         0 $score_method = \&collect_bigwig_position_scores;
2410             }
2411             elsif ($param->[RETT] == 1) {
2412 0         0 $score_method = \&collect_bigwig_scores;
2413             }
2414             elsif ($param->[METH] =~ /min|max|mean/) {
2415 0         0 $score_method = \&collect_bigwig_score;
2416             }
2417             elsif ($param->[METH] =~ /sum|count/) {
2418             # difference between ucsc and libBigWig libraries
2419 0 0       0 $score_method = $BIG_ADAPTER eq 'ucsc' ?
2420             \&collect_bigwig_score : \&collect_bigwig_scores;
2421             }
2422             else {
2423 0         0 $score_method = \&collect_bigwig_scores;
2424             }
2425             }
2426             else {
2427 0         0 croak " BigWig support is not enabled! Is Bio::DB::Big or Bio::DB::BigFile installed?";
2428             }
2429             }
2430            
2431             # BigBed Data file
2432             elsif ($param->[DATA] =~ /\.(?:bb|bigbed)$/i) {
2433             # data is in bigbed format
2434             # get the dataset scores using Bio::ToolBox::db_helper::bigbed
2435             # this uses either Bio::DB::BigBed or Bio::DB::Big
2436            
2437             # check that we have bigbed support
2438 0 0       0 $BIGBED_OK = _load_bigbed_helper_module() unless $BIGBED_OK;
2439 0 0       0 if ($BIGBED_OK) {
2440 0 0       0 if ($param->[RETT] == 2) {
2441 0         0 $score_method = \&collect_bigbed_position_scores;
2442             }
2443             else {
2444 0         0 $score_method = \&collect_bigbed_scores;
2445             }
2446             }
2447             else {
2448 0         0 croak " BigBed support is not enabled! Is Bio::DB::Big or Bio::DB::BigFile installed?";
2449             }
2450             }
2451            
2452             # BAM data file
2453             elsif ($param->[DATA] =~ /\.bam$/i) {
2454             # data is in bam format
2455             # get the dataset scores using Bio::ToolBox::db_helper::bam
2456             # this uses the Bio::DB::Sam or Bio::DB::HTS adaptor
2457            
2458             # check that we have Bam support
2459 0 0       0 _load_bam_helper_module() unless $BAM_OK;
2460 0 0       0 if ($BAM_OK) {
2461 0         0 $score_method = \&collect_bam_scores;
2462             }
2463             else {
2464 0         0 croak " Bam support is not enabled! " .
2465             "Is Bio::DB::HTS or Bio::DB::Sam installed?\n";
2466             }
2467             }
2468            
2469             # USeq Data file
2470             elsif ($param->[DATA] =~ /\.useq$/i) {
2471             # data is in useq format
2472             # get the dataset scores using Bio::ToolBox::db_helper::useq
2473             # this uses the Bio::DB::USeq adaptor
2474            
2475             # check that we have useq support
2476 9 50       21 $USEQ_OK = _load_helper_module('Bio::ToolBox::db_helper::useq')
2477             unless $USEQ_OK;
2478 9 50       13 if ($USEQ_OK) {
2479 9 100       13 if ($param->[RETT] == 2) {
2480 4         8 $score_method = \&collect_useq_position_scores;
2481             }
2482             else {
2483 5         11 $score_method = \&collect_useq_scores;
2484             }
2485             }
2486             else {
2487 0         0 croak " USeq support is not enabled! " .
2488             "Is Bio::DB::USeq installed?\n";
2489             }
2490             }
2491            
2492             # Unsupported Data file
2493             else {
2494 0         0 confess sprintf " Unsupported or unrecognized file type for file '%s'!\n",
2495             $param->[DATA];
2496             }
2497             }
2498            
2499            
2500             ### BigWigSet database
2501             elsif (ref($param->[DB]) =~ m/BigWigSet/) {
2502             # calling features from a BigWigSet database object
2503             # this uses either Bio::DB::BigWig or Bio::DB::Big
2504            
2505             # check that we have bigwig support
2506             # duh! we should, we probably opened the stupid database!
2507 0 0       0 $BIGWIG_OK = _load_bigwig_helper_module() unless $BIGWIG_OK;
2508 0 0       0 croak " BigWigSet support is not enabled! Is Bio::DB::Big or Bio::DB::BigFile installed?"
2509             unless $BIGWIG_OK;
2510            
2511             # the data collection depends on the method
2512 0 0       0 if ($param->[RETT] == 2) {
    0          
    0          
2513 0         0 $score_method = \&collect_bigwigset_position_scores;
2514             }
2515             elsif ($param->[RETT] == 1) {
2516 0         0 $score_method = \&collect_bigwigset_scores;
2517             }
2518             elsif ($param->[METH] =~ /min|max|mean|sum|count/) {
2519             # difference between ucsc and libBigWig libraries
2520             # the ucsc library works with summaries and we can handle multiple of these
2521             # but the big adapter doesn't
2522 0 0       0 $score_method = $BIG_ADAPTER eq 'ucsc' ?
2523             \&collect_bigwigset_score : \&collect_bigwigset_scores;
2524             }
2525             else {
2526 0         0 $score_method = \&collect_bigwigset_scores;
2527             }
2528             }
2529            
2530              
2531             ### BioPerl style database
2532             elsif (ref($param->[DB]) =~ m/^Bio::DB/) {
2533             # a BioPerl style database, including Bio::DB::SeqFeature::Store
2534             # most or all Bio::DB databases support generic get_seq_stream() methods
2535             # that return seqfeature objects, which we can use in a generic sense
2536            
2537             # check that we have support
2538             # duh! we should, we probably opened the stupid database!
2539 0 0       0 $SEQFASTA_OK = _load_helper_module('Bio::ToolBox::db_helper::seqfasta')
2540             unless $SEQFASTA_OK;
2541 0 0       0 if ($SEQFASTA_OK) {
2542             # get the dataset scores using Bio::ToolBox::db_helper::seqfasta
2543             # check that we support methods
2544 0 0       0 unless ($param->[DB]->can('get_seq_stream')) {
2545 0         0 confess sprintf "unsupported database! cannot use %s as it does not support get_seq_stream method or equivalent",
2546             ref($param->[DB]);
2547             }
2548 0         0 $score_method = \&collect_store_scores;
2549             }
2550             else {
2551 0         0 croak " SeqFeature Store support is not enabled! " .
2552             "Is BioPerl and Bio::DB::SeqFeature::Store properly installed?\n";
2553             }
2554             }
2555            
2556            
2557             ### Some other database?
2558             else {
2559 0 0       0 confess "no recognizable dataset provided!" unless $param->[DATA];
2560 0 0       0 confess "no database passed!" unless $param->[DB];
2561 0         0 confess "something went wrong! parameters: @$param";
2562             }
2563            
2564            
2565             ### Cache and return the results
2566 9         27 $DB_METHODS{$param->[METH]}{$param->[RETT]}{$param->[DB]}{$param->[DATA]} =
2567             $score_method;
2568 9         19 return $score_method;
2569             }
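
For reference, the cache above keys a resolved code reference by method, return type, database, and dataset, so repeat calls skip the dispatch chain entirely. A minimal sketch of that lookup-then-resolve pattern (the wrapper and resolver names here are hypothetical; %DB_METHODS and the METH, RETT, DB, and DATA indices are from the code above):

    # sketch only: consult the four-level cache before re-resolving
    sub _score_method_for {
        my $param = shift;
        my $code  = $DB_METHODS{ $param->[METH] }{ $param->[RETT] }
                               { $param->[DB] }{ $param->[DATA] };
        return $code if $code;                   # cached code reference
        return _resolve_score_method($param);    # hypothetical resolver that caches
    }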
2570              
2571              
2572             sub _load_helper_module {
2573 1     1   3 my $class = shift;
2574 1         2 my $success = 0;
2575 1         1 eval {
2576 1         4 load $class; # using Module::Load to dynamically load modules
2577 1         36 $class->import; # must be done particularly for my modules
2578 1         3 $success = 1;
2579             };
2580 1         2 return $success;
2581             }
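
The helper above wraps Module::Load in an eval so that a missing optional dependency degrades gracefully instead of dying at compile time. A self-contained sketch of the same pattern, using only the documented Module::Load API and the core JSON::PP module as a stand-in dependency:

    use strict;
    use warnings;
    use Module::Load;

    # attempt the optional module; load() dies if it cannot be found
    my $HAS_JSON = eval { load 'JSON::PP'; JSON::PP->import; 1 } ? 1 : 0;

    if ($HAS_JSON) {
        print JSON::PP->new->encode( { loaded => 1 } ), "\n";
    }
    else {
        warn "JSON::PP not available, continuing without it\n";
    }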
2582              
2583              
2584             sub _load_bam_helper_module {
2585 0 0   0     if ($BAM_ADAPTER) {
2586 0 0         if ($BAM_ADAPTER =~ /sam/i) {
    0          
    0          
2587 0           $BAM_OK = _load_helper_module('Bio::ToolBox::db_helper::bam');
2588 0           $BAM_ADAPTER = 'sam'; # for internal consistency
2589             }
2590             elsif ($BAM_ADAPTER =~ /hts/i) {
2591 0           $BAM_OK = _load_helper_module('Bio::ToolBox::db_helper::hts');
2592 0           $BAM_ADAPTER = 'hts'; # for internal consistency
2593             }
2594             elsif ($BAM_ADAPTER =~ /none/i) {
2595             # basically for testing purposes, don't use a module
2596 0           return 0;
2597             }
2598             else {
2599             # unrecognized
2600 0           $@ = 'unrecognized';
2601             }
2602 0 0         if (not $BAM_OK) {
2603 0           print "Requested '$BAM_ADAPTER' adapter could not be loaded: $@\n";
2604             }
2605             }
2606             else {
2607             # try hts first, then sam
2608             # be sure to set BAM_ADAPTER upon success
2609 0           $BAM_OK = _load_helper_module('Bio::ToolBox::db_helper::hts');
2610 0 0         if ($BAM_OK) {
2611 0           $BAM_ADAPTER = 'hts';
2612             }
2613             else {
2614 0           $BAM_OK = _load_helper_module('Bio::ToolBox::db_helper::bam');
2615 0 0         $BAM_ADAPTER = 'sam' if $BAM_OK;
2616             }
2617             }
2618 0           return $BAM_OK;
2619             }
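
Because the adapter choice is cached in the package variable $BAM_ADAPTER, the fallback order (hts first, then sam) only applies when no adapter has been requested. A hedged example of pinning the adapter before any BAM file is touched, assuming the variable is reachable by its fully qualified name as used above:

    # request the HTS adapter explicitly before first use
    $Bio::ToolBox::db_helper::BAM_ADAPTER = 'hts';

    # or disable BAM support entirely, e.g. for testing
    # $Bio::ToolBox::db_helper::BAM_ADAPTER = 'none';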
2620              
2621              
2622             sub _load_bigwig_helper_module {
2623 0 0   0     if ($BIG_ADAPTER) {
2624 0 0         if ($BIG_ADAPTER =~ /ucsc|kent/i) {
    0          
    0          
2625 0           $BIGWIG_OK = _load_helper_module('Bio::ToolBox::db_helper::bigwig');
2626 0           $BIG_ADAPTER = 'ucsc'; # for internal consistency
2627             }
2628             elsif ($BIG_ADAPTER =~ /big/i) {
2629 0           $BIGWIG_OK = _load_helper_module('Bio::ToolBox::db_helper::big');
2630 0           $BIGBED_OK = $BIGWIG_OK; # bigbed too!
2631 0           $BIG_ADAPTER = 'big'; # for internal consistency
2632             }
2633             elsif ($BIG_ADAPTER =~ /none/i) {
2634             # basically for testing purposes, don't use a module
2635 0           return 0;
2636             }
2637             else {
2638             # unrecognized
2639 0           $@ = 'unrecognized';
2640             }
2641 0 0         if (not $BIGWIG_OK) {
2642 0           print "Requested '$BIG_ADAPTER' adapter could not be loaded: $@\n";
2643             }
2644             }
2645             else {
2646             # we have to try each one out
2647             # try the modern big adapter first
2648 0           $BIGWIG_OK = _load_helper_module('Bio::ToolBox::db_helper::big');
2649 0 0         if ($BIGWIG_OK) {
2650             # success!
2651 0           $BIGBED_OK = $BIGWIG_OK; # bigbed too!
2652 0           $BIG_ADAPTER = 'big';
2653             }
2654             else {
2655 0           $BIGWIG_OK = _load_helper_module('Bio::ToolBox::db_helper::bigwig');
2656 0 0         $BIG_ADAPTER = 'ucsc' if $BIGWIG_OK;
2657             }
2658             }
2659 0           return $BIGWIG_OK;
2660             }
2661              
2662             sub _load_bigbed_helper_module {
2663 0 0   0     if ($BIG_ADAPTER) {
2664 0 0         if ($BIG_ADAPTER =~ /ucsc|kent/i) {
    0          
    0          
    0          
2665 0           $BIGBED_OK = _load_helper_module('Bio::ToolBox::db_helper::bigbed');
2666 0           $BIG_ADAPTER = 'ucsc'; # for internal consistency
2667             }
2668             elsif ($BIG_ADAPTER =~ /big/i) {
2669 0           $BIGBED_OK = _load_helper_module('Bio::ToolBox::db_helper::big');
2670 0           $BIGWIG_OK = $BIGBED_OK; # bigwig too!
2671 0           $BIG_ADAPTER = 'big'; # for internal consistency
2672             }
2673             elsif ($BIG_ADAPTER =~ /none/i) {
2674             # basically for testing purposes, don't use a module
2675 0           return 0;
2676             }
2677             else {
2678             # unrecognized
2679 0           $@ = 'unrecognized';
2680             }
2681 0 0         if (not $BIGBED_OK) {
2682 0           print "Requested '$BIG_ADAPTER' adapter could not be loaded: $@\n";
2683             }
2684             }
2685             else {
2686             # we have to try each one out
2687             # try the modern big adapter first
2688 0           $BIGBED_OK = _load_helper_module('Bio::ToolBox::db_helper::big');
2689 0 0         if ($BIGBED_OK) {
2690             # success!
2691 0           $BIGWIG_OK = $BIGBED_OK; # bigwig too!
2692 0           $BIG_ADAPTER = 'big';
2693             }
2694             else {
2695 0           $BIGBED_OK = _load_helper_module('Bio::ToolBox::db_helper::bigbed');
2696 0 0         $BIG_ADAPTER = 'ucsc' if $BIGBED_OK;
2697             }
2698             }
2699 0           return $BIGBED_OK;
2700             }
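
The two big-file loaders mirror each other: the libBigWig-based 'big' adapter provides bigWig and bigBed support together, while the UCSC adapter loads each format through its own helper module. A hedged example of pinning that choice, assuming $BIG_ADAPTER is settable like $BAM_ADAPTER above:

    # prefer the UCSC/Kent libraries for both bigWig and bigBed access
    $Bio::ToolBox::db_helper::BIG_ADAPTER = 'ucsc';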
2701              
2702             =head1 AUTHOR
2703              
2704             Timothy J. Parnell, PhD
2705             Howard Hughes Medical Institute
2706             Dept of Oncological Sciences
2707             Huntsman Cancer Institute
2708             University of Utah
2709             Salt Lake City, UT, 84112
2710              
2711             This package is free software; you can redistribute it and/or modify
2712             it under the terms of the Artistic License 2.0.
2713