File Coverage

blib/lib/GO/AnnotationProvider/AnnotationParser.pm
Criterion Covered Total %
statement 152 321 47.3
branch 45 154 29.2
condition 13 62 20.9
subroutine 24 37 64.8
pod 22 22 100.0
total 256 596 42.9


line stmt bran cond sub pod time code
1             package GO::AnnotationProvider::AnnotationParser;
2              
3             # File : AnnotationParser.pm
4             # Authors : Elizabeth Boyle; Gavin Sherlock
5             # Date Begun : Summer 2001
6             # Rewritten : September 25th 2002
7              
8             # $Id: AnnotationParser.pm,v 1.35 2008/05/13 23:06:16 sherlock Exp $
9              
10             # Copyright (c) 2003 Gavin Sherlock; Stanford University
11              
12             # Permission is hereby granted, free of charge, to any person
13             # obtaining a copy of this software and associated documentation files
14             # (the "Software"), to deal in the Software without restriction,
15             # including without limitation the rights to use, copy, modify, merge,
16             # publish, distribute, sublicense, and/or sell copies of the Software,
17             # and to permit persons to whom the Software is furnished to do so,
18             # subject to the following conditions:
19              
20             # The above copyright notice and this permission notice shall be
21             # included in all copies or substantial portions of the Software.
22              
23             # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
24             # EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
25             # MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
26             # NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
27             # BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
28             # ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
29             # CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
30             # SOFTWARE.
31              
32             =pod
33              
34             =head1 NAME
35              
36             GO::AnnotationProvider::AnnotationParser - parses a gene annotation file
37              
38             =head1 SYNOPSIS
39              
40             GO::AnnotationProvider::AnnotationParser - reads a Gene Ontology gene
41             associations file, and provides methods by which to retrieve the GO
42             annotations for the an annotated entity. Note, it is case
43             insensitive, with some caveats - see documentation below.
44              
45             my $annotationParser = GO::AnnotationProvider::AnnotationParser->new(annotationFile => "data/gene_association.sgd");
46              
47             my $geneName = "AAT2";
48              
49             print "GO associations for gene: ", join (" ", $annotationParser->goIdsByName(name => $geneName,
50             aspect => 'P')), "\n";
51              
52             print "Database ID for gene: ", $annotationParser->databaseIdByName($geneName), "\n";
53              
54             print "Database name: ", $annotationParser->databaseName(), "\n";
55              
56             print "Standard name for gene: ", $annotationParser->standardNameByName($geneName), "\n";
57              
58             my $i;
59              
60             my @geneNames = $annotationParser->allStandardNames();
61              
62             foreach $i (0..10) {
63              
64             print "$geneNames[$i]\n";
65              
66             }
67              
68             =head1 DESCRIPTION
69              
70             GO::AnnotationProvider::AnnotationParser is a concrete subclass of
71             GO::AnnotationProvider, and creates a data structure mapping gene
72             names to GO annotations by parsing a file of annotations provided by
73             the Gene Ontology Consortium.
74              
75             This package provides object methods for retrieving GO annotations
76             that have been parsed from a 'gene associations' file, provided by
77             the gene ontology consortium. The format for the file is:
78              
79             Lines beginning with a '!' character are comment lines.
80              
81             Column Cardinality Contents
82             ------ ----------- -------------------------------------------------------------
83             0 1 Database abbreviation for the source of annotation (e.g. SGD)
84             1 1 Database identifier of the annotated entity
85             2 1 Standard name of the annotated entity
86             3 0,1 NOT (if a gene is specifically NOT annotated to the term)
87             4 1 GOID of the annotation
88             5 1,n Reference(s) for the annotation
89             6 1 Evidence code for the annotation
90             7 0,n With or From (a bit mysterious)
91             8 1 Aspect of the Annotation (C, F, P)
92             9 0,1 Name of the product being annotated
93             10 0,n Alias(es) of the annotated product
94             11 1 type of annotated entity (one of gene, transcript, protein)
95             12 1,2 taxonomic id of the organism encoding and/or using the product
96             13 1 Date of annotation YYYYMMDD
97             14 1 Assigned_by : The database which made the annotation
98              
99             Columns are separated by tabs. For those entries with a cardinality
100             greater than 1, multiple entries are pipe , |, delimited.
101              
102             Further details can be found at:
103              
104             http://www.geneontology.org/doc/GO.annotation.html#file
105              
106             The following assumptions about the file are made (and should be true):
107              
108             1. All aliases appear for all entries of a given annotated product
109             2. The database identifiers are unique, in that two different
110             entities cannot have the same database id.
111              
112             =head1 TODO
113              
114             Also see the TODO list in the parent, GO::AnnotationProvider.
115              
116             1. Add in methods that will allow retrieval of evidence codes with
117             the annotations for a particular entity.
118              
119             2. Add in methods that return all the annotated entities for a
120             particular GOID.
121              
122             3. Add in the ability to request only annotations either including
123             or excluding particular evidence codes. Such evidence codes
124             could be provided as an anonymous array as the value of a named
125             argument.
126              
127             4. Same as number 3, except allow the retrieval of annotated
128             entities for a particular GOID, based on inclusion or exclusion
129             of certain evidence codes.
130              
131             These first four items will require a reworking of how data are
132             stored on the backend, and thus the parsing code itself, though it
133             should not affect any of the already existing API.
134              
135             5. Instead of 'use'ing Storable, 'require' it instead, only at the
136             point of use, which will mean that AnnotationParser can be
137             happily used in the absence of Storable, just without those
138             functions that need it.
139              
140             6. Extend the ValidateFile class method to check that an entity
141             should never be annotated to the same node twice, with the same
142             evidence, with the same reference.
143              
144             7. An additional checker, that uses an AnnotationProvider in
145             conjunction with an OntologyProvider, would be useful, that
146             checks that some of the annotations themselves are valid, ie
147             that no entities are annotated to the 'unknown' node in a
148             particular aspect, and also to another node within that same
149             aspect. Can annotations be redundant? ie, if an entity is
150             annotated to a node, and an ancestor of the node, is that
151             annotation redundant? Does it depend on the evidence codes and
152             references. Or are such annotations reinforcing? These things
153             are useful to consider when formulating the confidence which can
154             be attributed to an annotation.
155              
156             =cut
157              
158 2     2   221291 use strict;
  2         6  
  2         109  
159 2     2   14 use warnings;
  2         4  
  2         2399  
160 2     2   18 use diagnostics;
  2         5  
  2         18  
161              
162 2     2   6881 use Storable qw (nstore);
  2         9108  
  2         158  
163 2     2   1721 use IO::File;
  2         19875  
  2         294  
164              
165 2     2   16 use vars qw (@ISA $PACKAGE $VERSION);
  2         5  
  2         115  
166              
167 2     2   3515 use GO::AnnotationProvider;
  2         5  
  2         10147  
168             @ISA = qw (GO::AnnotationProvider);
169              
170             $PACKAGE = "GO::AnnotationProvider::AnnotationParser";
171             $VERSION = "0.15";
172              
173             # CLASS Attributes
174             #
175             # These should be considered as constants, and are initialized here
176              
177             my $DEBUG = 0;
178              
179             # constants for instance attribute name
180              
181              
182             my $kDatabaseName = $PACKAGE.'::__databaseName'; # stores the name of the annotating database
183             my $kFileName = $PACKAGE.'::__fileName'; # stores the name of the file used to instantiate the object
184             my $kNameToIdMapInsensitive = $PACKAGE.'::__nameToIdMapInsensitive'; # stores a case insensitive map of all unambiguous names for a gene to the database id
185             my $kNameToIdMapSensitive = $PACKAGE.'::__nameToIdMapSensitive'; # stores a case sensitive map of all names where a particular casing is unambiguous for a gene to the database id
186             my $kAmbiguousNames = $PACKAGE.'::__ambiguousNames'; # stores the database id's for all ambiguous names
187             my $kIdToStandardName = $PACKAGE.'::__idToStandardName'; # stores a map of database id's to standard names of all entities
188             my $kStandardNameToId = $PACKAGE.'::__StandardNameToId'; # stores a map of standard names to their database id's
189             my $kUcIdToId = $PACKAGE.'::__ucIdToId'; # stores a map of uppercased databaseIds to the databaseId
190             my $kUcStdNameToStdName = $PACKAGE.'::__ucStdNameToStdName'; # stores a map of uppercased standard names to the standard name
191             my $kNameToCount = $PACKAGE.'::__nameToCount'; # stores a case sensitive map of the number of times a name has been seen
192             my $kGoids = $PACKAGE.'::__goids'; # stores all the goid annotations
193             my $kNumAnnotatedGenes = $PACKAGE.'::__numAnnotatedGenes'; # stores number of genes with annotations, per aspect
194              
195             my $kAmbiguousNamesSensitive = $PACKAGE.'::__ambiguousNamesSensitive'; # names (case sensitive) that are ambiguous
196              
197             my $kTotalNumAnnotatedGenes = $PACKAGE.'::__totalNumAnnotatedGenes'; # total number of annotated genes
198              
199             # constants to describe what is in which column in the annotation file
200              
201             my $kDatabaseNameColumn = 0;
202             my $kDatabaseIdColumn = 1;
203             my $kStandardNameColumn = 2;
204             my $kNotColumn = 3;
205             my $kGoidColumn = 4;
206             my $kReferenceColumn = 5;
207             my $kEvidenceColumn = 6;
208             my $kWithColumn = 7;
209             my $kAspectColumn = 8;
210             my $kNameColumn = 9;
211             my $kAliasesColumn = 10;
212             my $kEntityTypeColumn = 11;
213             my $kTaxonomicIDColumn = 12;
214             my $kDateColumn = 13;
215             my $kAssignedByColumn = 14;
216              
217             # the following hash of anonymous arrays indicates for each column
218             # what the maximum and minimum number of entries per column can be.
219             # If no maximum is indicated, then the maximum is equal to the
220             # minimum, and exactly that number of entries must exist.
221              
222             my %kColumnsToCardinality = ($kDatabaseNameColumn => [1 ],
223             $kDatabaseIdColumn => [1 ],
224             $kStandardNameColumn => [1 ],
225             $kNotColumn => [0, 1],
226             $kGoidColumn => [1 ],
227             $kReferenceColumn => [1, "n"],
228             $kEvidenceColumn => [1 ],
229             $kWithColumn => [0, "n"],
230             $kAspectColumn => [1 ],
231             $kNameColumn => [0, 1],
232             $kAliasesColumn => [0, "n"],
233             $kEntityTypeColumn => [1 ],
234             $kTaxonomicIDColumn => [1, 2],
235             $kDateColumn => [1 ],
236             $kAssignedByColumn => [1 ]);
237              
238             my $kNumColumnsInFile = scalar keys %kColumnsToCardinality;
239              
240             =pod
241              
242             =head1 Class Methods
243              
244             =cut
245              
246             ############################################################################
247             sub Usage{
248             ############################################################################
249             =pod
250              
251             =head2 Usage
252              
253             This class method simply prints out a usage statement, along with an
254             error message, if one was passed in.
255              
256             Usage :
257              
258             GO::AnnotationProvider::AnnotationParser->Usage();
259              
260             =cut
261              
262 0     0 1 0 my ($class, $message) = @_;
263              
264 0 0       0 defined $message && print $message."\n\n";
265              
266 0         0 print 'The constructor expects one of two arguments, either a
267             \'annotationFile\' argument, or and \'objectFile\' argument. When
268             instantiated with an annotationFile argument, it expects it to
269             correspond to an annotation file created by one of the GO consortium
270             members, according to their file format. When instantiated with an
271             objectFile argument, it expects to open a previously created
272             annotationParser object that has been serialized to disk (see the
273             serializeToDisk method).
274              
275             Usage:
276              
277             my $annotationParser = '.$PACKAGE.'->new(annotationFile => $file);
278              
279             my $annotationParser = '.$PACKAGE.'->new(objectFile => $file);
280             ';
281              
282             }
283              
284             ############################################################################
285             sub ValidateFile{
286             ############################################################################
287             =pod
288              
289             =head2 ValidateFile
290              
291             This class method reads an annotation file, and returns a reference to
292             an array of errors that are present within the file. The errors are
293             simply strings, each beginning with "Line $lineNo : " where $lineNo is
294             the number of the line in the file where the error was found.
295              
296             Usage:
297              
298             my $errorsRef = GO::AnnotationProvider::AnnotationParser->ValidateFile(annotationFile => $file);
299              
300             =cut
301              
302 0     0 1 0 my ($class, %args) = @_;
303            
304 0   0     0 my $file = $args{'annotationFile'} || $class->_handleMissingArgument(argument => 'annotationFile');
305            
306 0   0     0 my $annotationsFh = IO::File->new($file, q{<} )|| die "$PACKAGE cannot open $file : $!";
307              
308 0         0 my (@errors, @line);
309              
310 0         0 my ($databaseId, $standardName, $aliases);
311 0         0 my (%idToName, %idToAliases);
312              
313 0         0 my $lineNo = 0;
314            
315 0         0 while (<$annotationsFh>){
316            
317 0         0 ++$lineNo;
318            
319 0 0       0 next if $_ =~ m/^!/; # skip comment lines
320            
321 0         0 chomp;
322              
323 0 0       0 next unless $_; # skip an empty line, if there is one
324            
325 0         0 @line = split("\t", $_, -1);
326            
327 0 0       0 if (scalar @line != $kNumColumnsInFile){ # doesn't have the correct number of columns
328            
329 0         0 push (@errors, "Line $lineNo has ". scalar @line. "columns, instead of $kNumColumnsInFile.");
330            
331             }
332            
333 0         0 $class->__CheckCardinalityOfColumns(\@errors, \@line, $lineNo);
334            
335             # now want to deal with sanity checks...
336            
337 0         0 ($databaseId, $standardName, $aliases) = @line[$kDatabaseIdColumn, $kStandardNameColumn, $kAliasesColumn];
338            
339 0 0       0 next if ($databaseId eq ""); # will have given incorrect cardinality, but nothing more we can do with it
340              
341 0 0       0 if (!exists $idToName{$databaseId}){
    0          
342              
343 0         0 $idToName{$databaseId} = $standardName;
344              
345             }elsif ($idToName{$databaseId} ne $standardName){
346              
347 0         0 push (@errors, "Line $lineNo : $databaseId has more than one standard name : $idToName{$databaseId} and $standardName.");
348              
349             }
350              
351 0 0       0 if (!exists $idToAliases{$databaseId}){
    0          
352              
353 0         0 $idToAliases{$databaseId} = $aliases;
354              
355             }elsif($idToAliases{$databaseId} ne $aliases){
356              
357 0         0 push (@errors, "Line $lineNo : $databaseId has more than one collections of aliases : $idToAliases{$databaseId} and $aliases.");
358              
359             }
360              
361             }
362            
363 0 0       0 $annotationsFh->close || die "$PACKAGE cannot close $file : $!";
364            
365 0         0 return \@errors;
366              
367             }
368              
369             ############################################################################
370             sub __CheckCardinalityOfColumns{
371             ############################################################################
372             # This method checks the cardinality of each column on a line
373             #
374             # Usage:
375             #
376             # $class->__CheckCardinalityOfColumns(\@errors, \@line, $lineNo);
377            
378 0     0   0 my ($class, $errorsRef, $lineRef, $lineNo) = @_;
379            
380 0         0 my ($cardinality, $min, $max);
381            
382 0         0 foreach my $column (sort {$a<=>$b} keys %kColumnsToCardinality){
  0         0  
383            
384 0         0 ($min, $max) = @{$kColumnsToCardinality{$column}}[0,1];
  0         0  
385            
386 0         0 $cardinality = $class->__GetCardinality($lineRef->[$column], $errorsRef, $lineNo);
387            
388 0 0       0 if (!defined $max){ # must have a defined number of entries
389              
390 0 0       0 if ($cardinality != $min){
391            
392 0         0 push (@{$errorsRef}, "Line $lineNo : column $column has a cardinality of $cardinality, instead of $min.");
  0         0  
393            
394             }
395            
396             }else{ # there's a range of allowed number of entries
397            
398 0 0 0     0 if ($cardinality < $min){ # check if less than minimum
    0          
399            
400 0         0 push (@{$errorsRef}, "Line $lineNo : column $column has a cardinality of $cardinality, which is less than the required $min.");
  0         0  
401            
402             }elsif ($kColumnsToCardinality{$column}->[1] ne 'n' &&
403             $cardinality > $max){ # check if more than maximum
404            
405 0         0 push (@{$errorsRef}, "Line $lineNo : column $column has a cardinality of $cardinality, which is more than the allowed $max.");
  0         0  
406            
407             }
408            
409             }
410            
411             }
412              
413             }
414              
415             ############################################################################
416             sub __GetCardinality{
417             ############################################################################
418             # This private method returns an integer that indicates the
419             # cardinality of a text string, where multiple entries are assumed to
420             # be seperated by the pipe character (|). In addition, it checks
421             # whether there are null or whitespace only entries.
422             #
423             # Usage:
424             #
425             # my $cardinality = $class->__GetCardinality($string);
426              
427 0     0   0 my ($class, $string, $errorsRef, $lineNo) = @_;
428              
429 0         0 my $cardinality;
430              
431 0 0 0     0 if (!defined $string || $string eq ""){
432              
433 0         0 $cardinality = 0;
434              
435             }else{
436              
437 0         0 my @entries = split(/\|/, $string, -1);
438              
439 0         0 foreach my $entry (@entries){
440              
441 0 0       0 if (!defined $entry){
    0          
442              
443 0         0 push (@{$errorsRef}, "Line $lineNo : There is an undefined value in the string $string.");
  0         0  
444              
445             }elsif ($entry =~ /^\s+$/){
446              
447 0         0 push (@{$errorsRef}, "Line $lineNo : There is a white-space only value in the string $string.");
  0         0  
448              
449             }
450              
451             }
452              
453 0         0 $cardinality = scalar @entries;
454              
455             }
456              
457 0         0 return $cardinality;
458              
459             }
460              
461             ############################################################################
462             #
463             # Constructor, and initialization methods.
464             #
465             # All initialization methods are private, except, of course, for the
466             # new() method.
467             #
468             ############################################################################
469              
470             ############################################################################
471             sub new{
472             ############################################################################
473             =pod
474              
475             =head1 Constructor
476              
477             =head2 new
478              
479             This is the constructor for an AnnotationParser object.
480              
481             The constructor expects one of two arguments, either a
482             'annotationFile' argument, or and 'objectFile' argument. When
483             instantiated with an annotationFile argument, it expects it to
484             correspond to an annotation file created by one of the GO consortium
485             members, according to their file format. When instantiated with an
486             objectFile argument, it expects to open a previously created
487             annotationParser object that has been serialized to disk (see the
488             serializeToDisk method).
489              
490             Usage:
491              
492             my $annotationParser = GO::AnnotationProvider::AnnotationParser->new(annotationFile => $file);
493              
494             my $annotationParser = GO::AnnotationProvider::AnnotationParser->new(objectFile => $file);
495              
496             =cut
497              
498              
499 3     3 1 73 my ($class, %args) = @_;
500              
501 3         6 my $self;
502              
503 3 50       27 if (exists($args{'annotationFile'})){
    0          
504              
505 3         6 $self = {};
506              
507 3         9 bless $self, $class;
508              
509 3         16 $self->__init($args{'annotationFile'});
510              
511             }elsif (exists($args{'objectFile'})){
512              
513 0   0     0 $self = Storable::retrieve($args{'objectFile'}) || die "Could not instantiate $PACKAGE object from objectFile : $!";
514              
515 0         0 $self->__setFile($args{'objectFile'});
516              
517             }else{
518              
519 0         0 $class->Usage("An annotationFile or objectFile argument must be provided.");
520 0         0 die;
521              
522             }
523              
524             # now, we have to make some alteration to some hashes to support
525             # our API for case insensitivity. The API says that if a name is
526             # supplied that would otherwise be ambiguous, but has a unique
527             # casing, then we will accept it as that unique cased version.
528             # Thus, we need to make sure that our $kNameToIdMapSensitive hash
529             # only tracks those names that were unique in a particular case
530              
531 3         2223 foreach my $name (keys %{$self->{$kNameToCount}}){
  3         22829  
532              
533             # go through the has that has a count of each name
534              
535 40383 100 100     233133 if ($self->{$kNameToCount}{$name} > 1 || exists $self->{$kNameToIdMapInsensitive}{uc($name)}){
536              
537             # if it was seen more than once, or is known to be unique
538             # in a case insensitive fashion, then delete it. This
539             # will leave just those that are unique in a case
540             # sensitive fashion
541              
542 40368         102206 delete $self->{$kNameToIdMapSensitive}{$name};
543              
544             }
545              
546             }
547              
548 3         7426 return ($self);
549              
550             }
551              
552             ############################################################################
553             sub __init{
554             ############################################################################
555             # This private method initializes the object by reading in the data
556             # from the annotation file.
557             #
558             # Usage :
559             #
560             # $self->__init($file);
561             #
562              
563 3     3   6 my ($self, $file) = @_;
564              
565 3         17 $self->__setFile($file);
566              
567 3   50     29 my $annotationsFh = IO::File->new($file, q{<} )|| die "$PACKAGE cannot open $file : $!";
568              
569             # now read through annotations file
570              
571 3         446 my (@line, $databaseId, $goid, $aspect, $standardName, $aliases);
572            
573 3         91 while (<$annotationsFh>){
574              
575 70620 100       138348 next if $_ =~ m/^!/; # skip commented lines
576              
577 70543         87500 chomp;
578              
579 70543 50       133381 next unless $_; # skip an empty line, if there is one
580              
581 70543         653737 @line = split("\t", $_, -1);
582              
583 70543 100       254001 next if $line[$kNotColumn] eq 'NOT'; # skip annotations NOT to a GOID
584              
585 70387         125340 ($databaseId, $goid, $aspect) = @line[$kDatabaseIdColumn, $kGoidColumn, $kAspectColumn];
586 70387         94770 ($standardName, $aliases) = @line[$kStandardNameColumn, $kAliasesColumn];
587              
588 70387 50       122047 if ($databaseId eq ""){
589              
590 0         0 print "On line $. there is a missing databaseId, so it will be ignored.\n";
591 0         0 next;
592              
593             }
594              
595             # record the source of the annotation
596            
597 70387 100       167118 $self->{$kDatabaseName} = $line[$kDatabaseNameColumn] if (!exists($self->{$kDatabaseName}));
598              
599             # now map the standard name and all aliases to the database id
600            
601 70387         136305 $self->__mapNamesToDatabaseId($databaseId, $standardName, $aliases);
602              
603             # and store the GOID
604            
605 70387         134687 $self->__storeGOID($databaseId, $goid, $aspect);
606              
607             }
608              
609 3 50       32 $annotationsFh->close || die "AnnotationParser can't close $file: $!";
610              
611             # now count up how many annotated things we have
612              
613 3         139 foreach my $databaseId (keys %{$self->{$kGoids}}){
  3         6383  
614              
615 12949         20401 $self->{$kTotalNumAnnotatedGenes}++;
616              
617 12949         12281 foreach my $aspect (keys %{$self->{$kGoids}{$databaseId}}){
  12949         56705  
618              
619 38475         79955 $self->{$kNumAnnotatedGenes}{$aspect}++;
620              
621             }
622              
623             }
624              
625             }
626              
627             ############################################################################
628             sub __setFile{
629             ############################################################################
630             # This method sets the name of the file used for construction.
631             #
632             # Usage:
633             #
634             # $self->__setFile($file);
635             #
636              
637 3     3   7 my ($self, $file) = @_;
638              
639 3         27 $self->{$kFileName} = $file;
640              
641             }
642              
643             ############################################################################
644             sub __mapNamesToDatabaseId{
645             ############################################################################
646             # This private method maps all names and aliases to the databaseId of
647             # an entity. It also maps the databaseId to itself, to facilitate a
648             # single way of mapping any identifier to the database id.
649             #
650             # This mapping is done so that it can be queried in a case insensitive
651             # fashion, and thus allow clients to be able to retrieve annotations
652             # without necessarily knowing the correct casing of any particular
653             # identifier.
654             #
655             # We have to keep the following considerations in mind:
656             #
657             # 1. Any identifier may be non-unique with respect to casing, that is,
658             # it is possible that there is ABC1 and abc1
659             #
660             # 2. We want to be able to returns names and identifiers in their correct
661             # casing, irrespective of the casing that is provided in the query
662             #
663             # 3. In the situation when a name that is ambiguous when considered case
664             # insensitively is provided, we should check to see whether that casing
665             # corresponds to a know correct casing, and assume that that is the one
666             # that they meant.
667             #
668             # Usage :
669             #
670             # $self->__mapNamesToDatabaseId($databaseId, $standardName, $aliases);
671             #
672             # where $aliases is a pipe-delimited list of aliases
673              
674 70387     70387   104485 my ($self, $databaseId, $standardName, $aliases) = @_;
675              
676 70387 100       189957 if (exists $self->{$kIdToStandardName}{$databaseId}){ # we've already seen this databaseId
677              
678 57438 50       136470 if ($self->{$kIdToStandardName}{$databaseId} ne $standardName){
679              
680             # there is a problem in the file - there should only be
681             # one standard name for a given database id, so we'll die
682             # here
683              
684 0         0 die "databaseId $databaseId maps to more than one standard name : $self->{$kIdToStandardName}{$databaseId} ; $standardName\n";
685              
686             }else{
687              
688             # we can simply return, as we've already processed
689             # information for this databaseId
690              
691 57438         84234 return;
692              
693             }
694              
695             }
696              
697             # we haven't see this databaseId before, so process the data
698              
699 12949         28330 my @aliases = split(/\|/, $aliases);
700              
701 12949         15109 my %seen; # sometimes an alias will be the same as the standard name
702              
703 12949         18472 foreach my $name ($databaseId, $standardName, @aliases){
704              
705             # here, we simply store, in case sensitive fashion, a mapping
706             # of the name to databaseId. Later, this map will be
707             # modified, so it only contains those names where the case
708             # sensitive version is unique. We need this map to fulfill
709             # the API requirements that if databaseIdByName() is called
710             # with a name that is ambiguous, but the casing is unique,
711             # then it will correctly determine the casing match
712              
713 43917         150040 $self->{$kNameToIdMapSensitive}{$name} = $databaseId;
714              
715 43917         54621 my $ucName = uc($name); # cache uppercased version for efficiency
716              
717             # occasionally, a standard name is also listed in the aliases,
718             # so we will skip the name if we've already seen it.
719              
720             # note that for now, we are doing this case sensitively - it
721             # is possible that a gene is referred to by the same name
722             # twice but with different casing - however, if those are the
723             # only times that those particular versions are seen, then
724             # they will still be treated unambiguously.
725              
726 43917 100       83029 next if exists ($seen{$name});
727              
728             # let's keep a count of every time a name with the same casing
729             # is seen, across all genes
730              
731 40689         99678 $self->{$kNameToCount}{$name}++;
732              
733             # now we have to deal with the name, depending on whether we
734             # newly determine it is ambiguous, whether we already know
735             # that name is ambiguous, or whether (so far) the name appears
736             # to be unique
737              
738             # for something to be newly ambiguous, the case insensitive
739             # version of its name must have been seen associated with some
740             # other database id already.
741              
742             # if the case insensitive version of the name has already been
743             # seen with the same database id, it is still not ambiguous
744              
745 40689 100 100     185066 if (exists $self->{$kNameToIdMapInsensitive}{$ucName} && $self->{$kNameToIdMapInsensitive}{$ucName} ne $databaseId){
    100          
746              
747             # so record what it maps to
748              
749             # current databaseId
750              
751 277         376 push (@{$self->{$kAmbiguousNames}{$ucName}}, $databaseId);
  277         1214  
752            
753             # and previously seen databaseId
754              
755 277         425 push (@{$self->{$kAmbiguousNames}{$ucName}}, $self->{$kNameToIdMapInsensitive}{$ucName});
  277         912  
756              
757             # and now delete the previously seen databaseId from the unambiguous mapping
758              
759 277         837 delete $self->{$kNameToIdMapInsensitive}{$ucName};
760              
761             }elsif (exists $self->{$kAmbiguousNames}{$ucName}){ # we already know it's ambiguous
762              
763             # so add in this new databaseId
764              
765 36         47 push (@{$self->{$kAmbiguousNames}{$ucName}}, $databaseId);
  36         141  
766              
767             }else{ # otherwise simply map it unambiguously for now, as we haven't see the name before
768              
769 40376         97840 $self->{$kNameToIdMapInsensitive}{$ucName} = $databaseId;
770              
771             }
772              
773 40689         77922 $seen{$name} = undef; # remember that we've seen the name for this row
774              
775             }
776              
777             # now we need to record some useful mappings
778              
779             # map databaseId and standardName to each other - these should
780             # always be unique when treated case sensitively
781              
782 12949         37602 $self->{$kIdToStandardName}{$databaseId} = $standardName; # record the standard name for the database id
783 12949         33134 $self->{$kStandardNameToId}{$standardName} = $databaseId; # also make the reverse look up
784              
785             # Now map upper cased versions of the databaseId and name to their original form
786             # These are not guaranteed to be unique, so we use arrays instead
787              
788 12949         12683 push (@{$self->{$kUcIdToId}{uc($databaseId)}}, $databaseId);
  12949         43808  
789 12949         14721 push (@{$self->{$kUcStdNameToStdName}{uc($standardName)}}, $standardName);
  12949         63755  
790              
791             }
792              
793             ############################################################################
794             sub __storeGOID{
795             ############################################################################
796             # This private method stores a GOID for a given databaseId, on a per
797             # aspect basis, in a hash.
798             #
799             # Usage:
800             #
801             # $self->__storeGOID($databaseId, $goid, $aspect);
802             #
803              
804 70387     70387   98564 my ($self, $databaseId, $goid, $aspect) = @_;
805              
806 70387         393007 $self->{$kGoids}{$databaseId}{$aspect}{$goid} = undef;
807              
808             }
809              
810             =pod
811              
812             =head1 Public instance methods
813              
814             =head1 Some methods dealing with ambiguous names
815              
816             Because there are many names by which an annotated entity may be
817             referred to, that are non-unique, there exist a set of methods for
818             determining whether a name is ambiguous, and to what database
819             identifiers such ambiguous names may refer.
820              
821             Note, that the AnnotationParser is now case insensitive, but with some
822             caveats. For instance, you can use 'cdc6' to retrieve data for CDC6.
823             However, This if gene has been referred to as abc1, and another
824             referred to as ABC1, then these are treated as different, and
825             unambiguous. However, the text 'Abc1' would be considered ambiguous,
826             because it could refer to either. On the other hand, if a single gene
827             is referred to as XYZ1 and xyz1, and no other genes have that name (in
828             any casing), then Xyz1 would still be considered unambiguous.
829              
830             =cut
831              
832             ##############################################################################
833             sub nameIsAmbiguous{
834             ##############################################################################
835              
836             =pod
837              
838             =head2 nameIsAmbiguous
839              
840             This public method returns a boolean to indicate whether a name is
841             ambiguous, i.e. whether the name might map to more than one entity (and
842             therefore more than one databaseId).
843              
844             NB: API change:
845              
846             nameIsAmbiguous is now case insensitive - that is, if there is a name
847             that is used twice using different casing, that will be treated as
848             ambiguous. Previous versions would have not treated these as
849             ambiguous. In the case that a name is provided in a certain casing,
850             which was encountered only once, then it will be treated as
851             unambiguous. This is the price of wanting a case insensitive
852             annotation parser...
853              
854             Usage:
855              
856             if ($annotationParser->nameIsAmbiguous($name)){
857              
858             do something useful....or not....
859              
860             }
861              
862             =cut
863              
864 106406     106406 1 148303 my ($self, $name) = @_;
865              
866 106406 50       191864 die "You must supply a name to nameIsAmbiguous" if !defined ($name);
867              
868             # a name might appear in the hash of ambiguous names - however,
869             # it is possible that the provided name matches the case of one of
870             # the provided versions exactly, and thus may not be ambiguous
871              
872             # of course, it is also possible that there were actually more than
873             # one copy of that alias, with exactly the same casing, which would
874             # be ambiguous
875              
876             # thus, we need to find out whether the provided name matches the case
877             # of a something exactly, which refers to only one entity
878              
879             # a name being ambiguous boils down to whether it has been seen
880             # more than once in that exact case, or in the case that it has
881             # not been seen at all in that exact case, whether it is ambiguous
882             # in upper case form.
883              
884 106406         121246 my $isAmbiguous;
885              
886 106406 100       416688 if (!exists $self->{$kNameToCount}{$name}){
    100          
887              
888             # we haven't seen this casing at all, so see if it's ambiguous
889             # in the uppercased version
890              
891 438         1345 $isAmbiguous = exists $self->{$kAmbiguousNames}{uc($name)};
892            
893             }elsif ($self->{$kNameToCount}{$name} > 1){
894              
895             # we've seen this exact casing more than once, so it has to be
896             # ambiguous
897              
898 137         127 $isAmbiguous = 1;
899              
900             }else{
901              
902             # it must only have ever been seen once in this exact casing,
903             # so it's unambiguous
904              
905 105831         137534 $isAmbiguous = 0;
906              
907             }
908              
909 106406         324102 return $isAmbiguous;
910              
911             }
912              
913             ############################################################################
914             sub databaseIdsForAmbiguousName{
915             ############################################################################
916             =pod
917              
918             =head2 databaseIdsForAmbiguousName
919              
920             This public method returns an array of database identifiers for an
921             ambiguous name. If the name is not ambiguous, an empty list will be
922             returned.
923              
924             NB: API change:
925              
926             databaseIdsForAmbiguousName is now case insensitive - that is, if
927             there is a name that is used twice using different casing, that will
928             be treated as ambiguous. Previous versions would have not treated
929             these as ambiguous. However, if the name provided is of the exact
930             casing as a name that appeared only once with that exact casing, then
931             it is treated as unambiguous. This is the price of wanting a case
932             insensitive annotation parser...
933              
934             Usage:
935              
936             my @databaseIds = $annotationParser->databaseIdsForAmbiguousName($name);
937              
938             =cut
939              
940 2     2 1 4 my ($self, $name) = @_;
941              
942 2 50       8 die "You must supply a name to databaseIdsForAmbiguousName" if !defined ($name);
943            
944 2 50       6 if ($self->nameIsAmbiguous($name)){
945            
946 2         3 return @{$self->{$kAmbiguousNames}{uc($name)}};
  2         13  
947              
948             }else{
949              
950 0         0 return ();
951              
952             }
953              
954             }
955              
956             ############################################################################
957             sub ambiguousNames{
958             ############################################################################
959             =pod
960              
961             =head2 ambiguousNames
962              
963             This method returns an array of names, which from the annotation file
964             have been deemed to be ambiguous.
965              
966             Note - even though we have made the annotation parser case
967             insensitive, if something appeared in the annotations file as BLAH1
968             and blah1, we would not deem either of these to be ambiguous.
969             However, if it appeared as blah1 twice, referring to two different
970             genes, then blah1 would be ambiguous.
971              
972             Usage:
973              
974             my @ambiguousNames = $annotationParser->ambiguousNames();
975              
976             =cut
977              
978 1     1 1 443 my $self = shift;
979              
980             # we can simply generate a list of case-sensitive names that have
981             # appeared more than once - we'll cache them so they don't have to
982             # be recalculated in the event that they're asked for again
983              
984 1 50       8 if (!exists ($self->{$kAmbiguousNamesSensitive})){
985              
986 1         3 my @names;
987              
988 1         2 foreach my $name (keys %{$self->{$kNameToCount}}){
  1         8385  
989              
990 20180 100       49694 push(@names, $name) if ($self->{$kNameToCount}{$name} > 1);
991              
992             }
993              
994 1         3091 $self->{$kAmbiguousNamesSensitive} = \@names;
995              
996             }
997              
998 1         4 return @{$self->{$kAmbiguousNamesSensitive}};
  1         49  
999              
1000             }
1001              
1002             =pod
1003              
1004             =head1 Methods for retrieving GO annotations for entities
1005              
1006             =cut
1007              
1008             ############################################################################
1009             sub goIdsByDatabaseId{
1010             ############################################################################
1011             =pod
1012              
1013             =head2 goIdsByDatabaseId
1014              
1015             This public method returns a reference to an array of GOIDs that are
1016             associated with the supplied databaseId for a specific aspect. If no
1017             annotations are associated with that databaseId in that aspect, then a
1018             reference to an empty array will be returned. If the databaseId is
1019             not recognized, then undef will be returned. In the case that a
1020             databaseId is ambiguous (for instance the same databaseId exists but
1021             with different casings) then if the supplied database id matches the
1022             exact case of one of those supplied, then that is the one it will be
1023             treated as. In the case where the databaseId matches none of the
1024             possibilities by case, then a fatal error will occur, because the
1025             provided databaseId was ambiguous.
1026              
1027             Usage:
1028              
1029             my $goidsRef = $annotationParser->goIdsByDatabaseId(databaseId => $databaseId,
1030             aspect => );
1031              
1032             =cut
1033              
1034 19434     19434 1 60100 my ($self, %args) = @_;
1035              
1036 19434   33     52739 my $aspect = $args{'aspect'} || $self->_handleMissingArgument(argument => 'aspect');
1037 19434   33     43253 my $databaseId = $args{'databaseId'} || $self->_handleMissingArgument(argument => 'databaseId');
1038              
1039 19434         22411 my $mappedId; # will store the id as listed in the annotations file
1040              
1041 19434 50       67659 if (exists $self->{$kUcIdToId}{uc($databaseId)}){ # we recognize it
1042              
1043 19434 100       35353 if (scalar (@{$self->{$kUcIdToId}{uc($databaseId)}}) == 1){
  19434         64529  
1044              
1045             # it's unambiguous
1046              
1047 19432         57853 $mappedId = $self->{$kUcIdToId}{uc($databaseId)}[0];
1048              
1049             }else{
1050              
1051             # it may be ambiguous, but we'll check to see if the provided one
1052             # is of exactly the correct case
1053              
1054 2         3 foreach my $id (@{$self->{$kUcIdToId}{uc($databaseId)}}){
  2         7  
1055            
1056 3 100       8 if ($databaseId eq $id){ # we have a match
1057              
1058 2         3 $mappedId = $id;
1059 2         3 last;
1060              
1061             }
1062              
1063             }
1064              
1065 2 50       6 if (!defined $mappedId){
1066              
1067             # we got no perfect match, so it's ambiguous, and we die
1068              
1069 0         0 die "$databaseId is ambiguous as a databaseId, and could be used to refer to one of:\n\n".
1070 0         0 join("\n", @{$self->{$kUcIdToId}{uc($databaseId)}});
1071              
1072             }
1073              
1074             }
1075              
1076             }else{ # we don't recognize it
1077            
1078 0         0 return ; # note return here
1079              
1080             }
1081              
1082             # if we get here, then we have a recognized, and unambiguous database id
1083              
1084 19434         48870 return $self->_goIdsByMappedDatabaseId(databaseId => $mappedId,
1085             aspect => $aspect);
1086              
1087             }
1088              
1089             ############################################################################
1090             sub _goIdsByMappedDatabaseId{
1091             ############################################################################
1092             # This protected method returns a reference to an array of GOIDs that
1093             # are associated with the supplied databaseId for a specific aspect.
1094             # If no annotations are associated with that databaseId in that
1095             # aspect, then a reference to an empty array will be returned. If the
1096             # databaseId is not recognized, then undef will be returned. The
1097             # supplied databaseId must NOT be ambiguous, i.e. it must be a real
1098             # databaseId known to exist. If it is possibly ambiguous, use the
1099             # goIdsByDatabaseId method instead.
1100             #
1101             # Usage:
1102             #
1103             # my $goidsRef = $annotationParser->_goIdsByMappedDatabaseId(databaseId => $databaseId,
1104             # aspect => );
1105              
1106              
1107 19434     19434   53761 my ($self, %args) = @_;
1108              
1109 19434   33     45607 my $aspect = $args{'aspect'} || $self->_handleMissingArgument(argument => 'aspect');
1110 19434   33     39439 my $mappedId = $args{'databaseId'} || $self->_handleMissingArgument(argument => 'databaseId');
1111              
1112 19434 100       77637 if (exists $self->{$kGoids}{$mappedId}{$aspect}){ # it has annotations
1113              
1114 18903         24652 return [keys %{$self->{$kGoids}{$mappedId}{$aspect}}];
  18903         155797  
1115              
1116             }else{ # it has no annotations
1117            
1118 531         2749 return []; # reference to empty array
1119              
1120             }
1121              
1122             }
1123              
1124             ############################################################################
1125             sub goIdsByStandardName{
1126             ############################################################################
1127             =pod
1128              
1129             =head2 goIdsByStandardName
1130              
1131             This public method returns a reference to an array of GOIDs that are
1132             associated with the supplied standardName for a specific aspect. If
1133             no annotations are associated with the entity with that standard name
1134             in that aspect, then a reference to an empty list will be returned.
1135             If the supplied name is not used as a standard name, then undef will
1136             be returned. In the case that the supplied standardName is ambiguous
1137             (for instance the same standardName exists but with different casings)
1138             then if the supplied standardName matches the exact case of one of
1139             those supplied, then that is the one it will be treated as. In the
1140             case where the standardName matches none of the possibilities by case,
1141             then a fatal error will occur, because the provided standardName was
1142             ambiguous.
1143              
1144             Usage:
1145              
1146             my $goidsRef = $annotationParser->goIdsByStandardName(standardName =>$standardName,
1147             aspect =>);
1148              
1149             =cut
1150              
1151 0     0 1 0 my ($self, %args) = @_;
1152              
1153 0   0     0 my $aspect = $args{'aspect'} || $self->_handleMissingArgument(argument => 'aspect');
1154 0   0     0 my $standardName = $args{'standardName'} || $self->_handleMissingArgument(argument => 'standardName');
1155              
1156             # now we have to determine if the standardName is ambiguous or not
1157              
1158             # first, return if there is no standard name for the supplied string
1159              
1160 0 0       0 return undef if !exists $self->{$kUcStdNameToStdName}{uc($standardName)};
1161              
1162             # now see if we have 1 or more mappings
1163              
1164 0         0 my $mappedName;
1165              
1166 0 0       0 if (scalar @{$self->{$kUcStdNameToStdName}{uc($standardName)}} == 1){
  0         0  
1167              
1168             # we have a single mapping
1169              
1170 0         0 $mappedName = $self->{$kUcStdNameToStdName}{uc($standardName)}[0];
1171              
1172             }else{
1173              
1174             # there's more than one, so see if the case matched exactly
1175              
1176 0         0 foreach my $name (@{$self->{$kUcStdNameToStdName}{uc($standardName)}}){
  0         0  
1177              
1178 0 0       0 if ($name eq $standardName){
1179              
1180 0         0 $mappedName = $name;
1181 0         0 last;
1182              
1183             }
1184              
1185             }
1186              
1187 0 0       0 if (!defined $mappedName){
1188              
1189             # we got no perfect match, so it's ambiguous, and we die
1190              
1191 0         0 die "$standardName is ambiguous as a standardName, and could be used to refer to one of:\n\n".
1192 0         0 join("\n", @{$self->{$kUcStdNameToStdName}{uc($standardName)}});
1193              
1194             }
1195              
1196             }
1197              
1198             # now we're here, we know we have a mapped standard name, which
1199             # must thus map to a databaseId
1200              
1201 0         0 my $databaseId = $self->_databaseIdByMappedStandardName($mappedName);
1202              
1203 0         0 return $self->_goIdsByMappedDatabaseId(databaseId => $databaseId,
1204             aspect => $aspect);
1205              
1206             }
1207              
1208             ############################################################################
1209             sub goIdsByName{
1210             ############################################################################
1211             =pod
1212              
1213             =head2 goIdsByName
1214              
1215             This public method returns a reference to an array of GO IDs that are
1216             associated with the supplied name for a specific aspect. If there are
1217             no GO associations for the entity corresponding to the supplied name
1218             in the provided aspect, then a reference to an empty list will be
1219             returned. If the supplied name does not correspond to any entity,
1220             then undef will be returned. Because the name can be any of the
1221             databaseId, the standard name, or any of the aliases, it is possible
1222             that the name might be ambiguous. Clients of this object should first
1223             test whether the name they are using is ambiguous, using the
1224             nameIsAmbiguous() method, and handle it accordingly. If an ambiguous
1225             name is supplied, then it will die.
1226              
1227             NB: API change:
1228              
1229             goIdsByName is now case insensitive - that is, if there is a name that
1230             is used twice using different casing, that will be treated as
1231             ambiguous. Previous versions would have not treated these as
1232             ambiguous. This is the price of wanting a case insensitive annotation
1233             parser. In the event that a name is provided that is ambiguous
1234             because of case, if it matches exactly the case of one of the possible
1235             matches, it will be treated unambiguously.
1236              
1237             Usage:
1238              
1239             my $goidsRef = $annotationParser->goIdsByName(name => $name,
1240             aspect => );
1241              
1242             =cut
1243              
1244 0     0 1 0 my ($self, %args) = @_;
1245              
1246 0   0     0 my $aspect = $args{'aspect'} || $self->_handleMissingArgument(argument => 'aspect');
1247 0   0     0 my $name = $args{'name'} || $self->_handleMissingArgument(argument => 'name');
1248              
1249 0 0       0 die "You have supplied an ambiguous name to goIdsByName" if ($self->nameIsAmbiguous($name));
1250              
1251             # if we get here, the name is not ambiguous, so it's safe to call
1252             # databaseIdByName
1253              
1254 0         0 my $databaseId = $self->databaseIdByName($name);
1255              
1256 0 0       0 return undef if !defined $databaseId; # there is no such name
1257              
1258             # we should have a databaseId in the correct casing now
1259              
1260 0         0 return $self->_goIdsByMappedDatabaseId(databaseId => $databaseId,
1261             aspect => $aspect);
1262              
1263             }
1264              
1265             =pod
1266              
1267             =head1 Methods for mapping different types of name to each other
1268              
1269             =cut
1270              
1271             ############################################################################
1272             sub standardNameByDatabaseId{
1273             ############################################################################
1274             =pod
1275              
1276             =head2 standardNameByDatabaseId
1277              
1278             This method returns the standard name for a database id.
1279              
1280             NB: API change
1281              
1282             standardNameByDatabaseId is now case insensitive - that is, if there
1283             is a databaseId that is used twice (or more) using different casing,
1284             it will be treated as ambiguous. Previous versions would have not
1285             treated these as ambiguous. This is the price of wanting a case
1286             insensitive annotation parser. In the event that a name is provided
1287             that is ambiguous because of case, if it matches exactly the case of
1288             one of the possible matches, it will be treated unambiguously.
1289              
1290             Usage:
1291              
1292             my $standardName = $annotationParser->standardNameByDatabaseId($databaseId);
1293              
1294             =cut
1295              
1296 0     0 1 0 my ($self, $databaseId) = @_;
1297              
1298 0 0       0 die "You must supply a databaseId to standardNameByDatabaseId" if !defined ($databaseId);
1299              
1300             # first return if there is no databaseId for the supplied string
1301              
1302 0 0       0 return undef if (!exists $self->{$kUcIdToId}{uc($databaseId)});
1303              
1304             # now, check whether it's ambiguous as a databaseId
1305              
1306 0         0 my $mappedId;
1307              
1308 0 0       0 if (scalar(@{$self->{$kUcIdToId}{uc($databaseId)}}) == 1){
  0         0  
1309              
1310             # we have a single mapping
1311              
1312 0         0 $mappedId = $self->{$kUcIdToId}{uc($databaseId)}[0];
1313              
1314             }else{
1315              
1316             # there's more than one, so see if the provided case matches
1317             # exactly one of them
1318              
1319 0         0 foreach my $id (@{$self->{$kUcIdToId}{uc($databaseId)}}){
  0         0  
1320              
1321 0 0       0 if ($databaseId eq $id){
1322              
1323 0         0 $mappedId = $id;
1324 0         0 last;
1325              
1326             }
1327              
1328             }
1329              
1330 0 0       0 if (!defined $mappedId){
1331              
1332             # we got no perfect match, so it's ambiguous, and we die
1333              
1334 0         0 die "$databaseId is ambiguous as a databaseId, and could be used to refer to one of:\n\n".
1335 0         0 join("\n", @{$self->{$kUcIdToId}{uc($databaseId)}});
1336              
1337             }
1338              
1339             }
1340              
1341              
1342 0         0 return ($self->{$kIdToStandardName}{$mappedId});
1343              
1344             }
1345              
1346             ############################################################################
1347             sub databaseIdByStandardName{
1348             ############################################################################
1349             =pod
1350              
1351             =head2 databaseIdByStandardName
1352              
1353             This method returns the database id for a standard name.
1354              
1355             NB: API change
1356              
1357             databaseIdByStandardName is now case insensitive - that is, if there
1358             is a standard name that is used twice (or more) using different
1359             casing, it will be treated as ambiguous. Previous versions would have
1360             not treated these as ambiguous. This is the price of wanting a case
1361             insensitive annotation parser. In the event that a name is provided
1362             that is ambiguous because of case, if it matches exactly the case of
1363             one of the possible matches, it will be treated unambiguously.
1364              
1365             Usage:
1366              
1367             my $databaseId = $annotationParser->databaseIdByStandardName($standardName);
1368              
1369             =cut
1370              
1371 0     0 1 0 my ($self, $standardName) = @_;
1372              
1373 0 0       0 die "You must supply a standardName to databaseIdByStandardName" if !defined ($standardName);
1374              
1375             # first return if there is no standard name for the supplied string
1376              
1377 0 0       0 return undef if (!exists $self->{$kUcStdNameToStdName}{uc($standardName)});
1378              
1379             # now see if it's ambiguous or not
1380              
1381 0         0 my $mappedStandardName;
1382              
1383 0 0       0 if (scalar(@{$self->{$kUcStdNameToStdName}{uc($standardName)}}) == 1){
  0         0  
1384              
1385             # it's not ambiguous
1386              
1387 0         0 $mappedStandardName = $self->{$kUcStdNameToStdName}{uc($standardName)}[0];
1388              
1389             }else{
1390              
1391             # there's more than one, so see if the supplied name matches
1392             # the case of one of them exactly
1393              
1394 0         0 foreach my $name (@{$self->{$kUcStdNameToStdName}{uc($standardName)}}){
  0         0  
1395              
1396 0 0       0 if ($standardName eq $name){
1397              
1398 0         0 $mappedStandardName = $name;
1399 0         0 last;
1400              
1401             }
1402              
1403             }
1404              
1405 0 0       0 if (!defined $mappedStandardName){
1406              
1407 0         0 die "$standardName is ambiguous as a standard name, and could be used to refer to one of:\n\n".
1408 0         0 join("\n", @{$self->{$kUcStdNameToStdName}{uc($standardName)}});
1409              
1410             }
1411              
1412             }
1413              
1414 0         0 return ($self->{$kStandardNameToId}{$standardName});
1415              
1416             }
1417              
1418             ############################################################################
1419             sub _databaseIdByMappedStandardName{
1420             ############################################################################
1421             # This protected method returns the database id for a standard name that is
1422             # guaranteed to be non-ambiguous, and in the correct casing
1423             #
1424             # Usage:
1425             #
1426             # my $databaseId = $annotationParser->_databaseIdByMappedStandardName($standardName);
1427             #
1428              
1429 0     0   0 my ($self, $standardName) = @_;
1430              
1431 0 0       0 die "You must supply a standardName to _databaseIdByMappedStandardName" if !defined ($standardName);
1432              
1433 0         0 return ($self->{$kStandardNameToId}{$standardName});
1434              
1435             }
1436              
1437             ############################################################################
1438             sub databaseIdByName{
1439             ############################################################################
1440             =pod
1441              
1442             =head2 databaseIdByName
1443              
1444             This method returns the database id for any identifier for a gene
1445             (e.g. by databaseId itself, by standard name, or by alias). If the
1446             used name is ambiguous, then the program will die. Thus clients
1447             should call the nameIsAmbiguous() method, prior to using this method.
1448             If the name does not map to any databaseId, then undef will be
1449             returned.
1450              
1451             NB: API change
1452              
1453             databaseIdByName is now case insensitive - that is, if there is a name
1454             that is used twice using different casing, that will be treated as
1455             ambiguous. Previous versions would have not treated these as
1456             ambiguous. This is the price of wanting a case insensitive annotation
1457             parser. In the event that a name is provided that is ambiguous
1458             because of case, if it matches exactly the case of one of the possible
1459             matches, it will be treated unambiguously.
1460              
1461             Usage:
1462              
1463             my $databaseId = $annotationParser->databaseIdByName($name);
1464              
1465             =cut
1466              
1467 53129     53129 1 73450 my ($self, $name) = @_;
1468              
1469 53129 50       103970 die "You must supply a name to databaseIdByName" if !defined ($name);
1470              
1471 53129 50       95474 die "You have supplied an ambiguous name to databaseIdByName" if ($self->nameIsAmbiguous($name));
1472              
1473             # give them the case insensitive unique map, or if there is none,
1474             # then the case sensitive version
1475              
1476 53129   66     218623 my $databaseId = $self->{$kNameToIdMapInsensitive}{uc($name)} || $self->{$kNameToIdMapSensitive}{$name};
1477              
1478 53129         134962 return $databaseId;
1479              
1480             }
1481              
1482             ############################################################################
1483             sub standardNameByName{
1484             ############################################################################
1485             =pod
1486              
1487             =head2 standardNameByName
1488              
1489             This public method returns the standard name for the the gene
1490             specified by the given name. Because a name may be ambiguous, the
1491             nameIsAmbiguous() method should be called first. If an ambiguous name
1492             is supplied, then it will die with an appropriate error message. If
1493             the name does not map to a standard name, then undef will be returned.
1494              
1495             NB: API change
1496              
1497             standardNameByName is now case insensitive - that is, if there is a
1498             name that is used twice using different casing, that will be treated
1499             as ambiguous. Previous versions would have not treated these as
1500             ambiguous. This is the price of wanting a case insensitive annotation
1501             parser.
1502              
1503             Usage:
1504              
1505             my $standardName = $annotationParser->standardNameByName($name);
1506              
1507             =cut
1508              
1509 0     0 1 0 my ($self, $name) = @_;
1510              
1511 0 0       0 die "You must supply a name to standardNameByName" if !defined ($name);
1512              
1513 0 0       0 die "You have supplied an ambiguous name to standardNameByName" if ($self->nameIsAmbiguous($name));
1514              
1515 0         0 my $databaseId = $self->databaseIdByName($name);
1516              
1517 0 0       0 if (defined $databaseId){
1518              
1519 0         0 return $self->{$kIdToStandardName}{$databaseId};
1520              
1521             }else{
1522              
1523 0         0 return undef;
1524            
1525             }
1526              
1527             }
1528              
1529             =pod
1530              
1531             =head1 Other methods relating to names
1532              
1533             =cut
1534              
1535             ############################################################################
1536             sub nameIsStandardName{
1537             ############################################################################
1538             =pod
1539              
1540             =head2 nameIsStandardName
1541              
1542             This method returns a boolean to indicate whether the supplied name is
1543             used as a standard name.
1544              
1545             NB : API change.
1546              
1547             This is now case insensitive. If you provide abC1, and ABc1 is a
1548             standard name, then it will return true.
1549              
1550             Usage :
1551              
1552             if ($annotationParser->nameIsStandardName($name)){
1553              
1554             # do something
1555              
1556             }
1557              
1558             =cut
1559              
1560 6471     6471 1 22646 my ($self, $name) = @_;
1561              
1562 6471 50       10980 die "You must supply a name to nameIsStandardName" if !defined($name);
1563              
1564 6471         20060 return exists ($self->{$kUcStdNameToStdName}{uc($name)});
1565              
1566             }
1567              
1568             ############################################################################
1569             sub nameIsDatabaseId{
1570             ############################################################################
1571             =pod
1572              
1573             =head2 nameIsDatabaseId
1574              
1575             This method returns a boolean to indicate whether the supplied name is
1576             used as a database id.
1577              
1578             NB : API change.
1579              
1580             This is now case insensitive. If you provide abC1, and ABc1 is a
1581             database id, then it will return true.
1582              
1583             Usage :
1584              
1585             if ($annotationParser->nameIsDatabaseId($name)){
1586              
1587             # do something
1588              
1589             }
1590              
1591             =cut
1592              
1593              
1594 6471     6471 1 19683 my ($self, $databaseId) = @_;
1595              
1596 6471 50       10400 die "You must supply a potential databaseId to nameIsDatabaseId" if !defined($databaseId);
1597              
1598 6471         19837 return exists ($self->{$kUcIdToId}{uc($databaseId)});
1599              
1600             }
1601              
1602             ############################################################################
1603             sub nameIsAnnotated{
1604             ############################################################################
1605             =pod
1606              
1607             =head2 nameIsAnnotated
1608              
1609             This method returns a boolean to indicate whether the supplied name has any
1610             annotations, either when considered as a databaseId, a standardName, or
1611             an alias. If an aspect is also supplied, then it indicates whether that
1612             name has any annotations in that aspect only.
1613              
1614             NB: API change.
1615              
1616             This is now case insensitive. If you provide abC1, and ABc1 has
1617             annotation, then it will return true.
1618              
1619             Usage :
1620              
1621             if ($annotationParser->nameIsAnnotated(name => $name)){
1622              
1623             # blah
1624              
1625             }
1626              
1627             or:
1628              
1629             if ($annotationParser->nameIsAnnotated(name => $name,
1630             aspect => $aspect)){
1631              
1632             # blah
1633              
1634             }
1635              
1636              
1637             =cut
1638              
1639 0     0 1 0 my ($self, %args) = @_;
1640              
1641 0   0     0 my $name = $args{'name'} || die "You must supply a name to nameIsAnnotated";
1642            
1643 0         0 my $aspect = $args{'aspect'};
1644              
1645 0         0 my $isAnnotated = 0;
1646              
1647 0         0 my $ucName = uc($name);
1648              
1649 0 0       0 if (!defined ($aspect)){ # if there's no aspect
1650              
1651 0   0     0 $isAnnotated = (exists ($self->{$kNameToIdMapInsensitive}{$ucName}) || exists ($self->{$kAmbiguousNames}{$ucName}));
1652            
1653             }else{
1654              
1655 0 0 0     0 if ($self->nameIsDatabaseId($name) && @{$self->goIdsByDatabaseId(databaseId => $name,
  0 0 0     0  
    0          
1656 0         0 aspect => $aspect)}){
1657              
1658 0         0 $isAnnotated = 1;
1659              
1660             }elsif ($self->nameIsStandardName($name) && @{$self->goIdsByStandardName(standardName => $name,
1661             aspect => $aspect)}){
1662              
1663 0         0 $isAnnotated = 1;
1664              
1665             }elsif (!$self->nameIsAmbiguous($name)){
1666              
1667 0         0 my $goidsRef = $self->goIdsByName(name => $name,
1668             aspect => $aspect);
1669              
1670 0 0 0     0 if (defined $goidsRef && @{$goidsRef}){
  0         0  
1671              
1672 0         0 $isAnnotated = 1;
1673              
1674             }
1675              
1676             }else { # MUST be an ambiguous name, that's not used as a standard name
1677            
1678 0         0 foreach my $databaseId ($self->databaseIdsForAmbiguousName($name)){
1679              
1680 0 0       0 if (@{$self->goIdsByDatabaseId(databaseId => $name,
  0         0  
1681             aspect => $aspect)}){
1682              
1683 0         0 $isAnnotated = 1;
1684 0         0 last; # as soon as we know, we can finish
1685              
1686             }
1687              
1688             }
1689              
1690             }
1691              
1692             }
1693              
1694 0         0 return $isAnnotated;
1695              
1696             }
1697              
1698             =pod
1699              
1700             =head1 Other public methods
1701              
1702             =cut
1703              
1704             ############################################################################
1705             sub databaseName{
1706             ############################################################################
1707             =pod
1708              
1709             =head2 databaseName
1710              
1711             This method returns the name of the annotating authority from the file
1712             that was supplied to the constructor.
1713              
1714             Usage :
1715              
1716             my $databaseName = $annotationParser->databaseName();
1717              
1718             =cut
1719              
1720 0     0 1 0 my $self = shift;
1721              
1722 0         0 return $self->{$kDatabaseName};
1723              
1724             }
1725              
1726             ############################################################################
1727             sub numAnnotatedGenes{
1728             ############################################################################
1729             =pod
1730              
1731             =head2 numAnnotatedGenes
1732              
1733             This method returns the number of entities in the annotation file that
1734             have annotations in the supplied aspect. If no aspect is provided,
1735             then it will return the number of genes with an annotation in at least
1736             one aspect of GO.
1737              
1738             Usage:
1739              
1740             my $numAnnotatedGenes = $annotationParser->numAnnotatedGenes();
1741              
1742             my $numAnnotatedGenes = $annotationParser->numAnnotatedGenes($aspect);
1743              
1744             =cut
1745              
1746 3     3 1 1523 my ($self, $aspect) = @_;
1747              
1748 3 100       17 if (defined ($aspect)){
1749              
1750 1         8 return $self->{$kNumAnnotatedGenes}{$aspect};
1751              
1752             }else{
1753              
1754 2         12 return $self->{$kTotalNumAnnotatedGenes};
1755              
1756             }
1757              
1758             }
1759              
1760             ############################################################################
1761             sub allDatabaseIds{
1762             ############################################################################
1763             =pod
1764              
1765             =head2 allDatabaseIds
1766              
1767             This public method returns an array of all the database identifiers
1768              
1769             Usage:
1770              
1771             my @databaseIds = $annotationParser->allDatabaseIds();
1772              
1773             =cut
1774              
1775 10     10 1 1261 my $self = shift;
1776              
1777 10         18 return keys (%{$self->{$kIdToStandardName}});
  10         26887  
1778              
1779             }
1780              
1781             ############################################################################
1782             sub allStandardNames{
1783             ############################################################################
1784             =pod
1785              
1786             =head2 allStandardNames
1787              
1788             This public method returns an array of all standard names.
1789              
1790             Usage:
1791              
1792             my @standardNames = $annotationParser->allStandardNames();
1793              
1794             =cut
1795              
1796 2     2 1 508 my $self = shift;
1797              
1798 2         6 return keys(%{$self->{$kStandardNameToId}});
  2         3605  
1799              
1800             }
1801              
1802             =pod
1803              
1804             =head1 Methods to do with files
1805              
1806             =cut
1807              
1808             ############################################################################
1809             sub file{
1810             ############################################################################
1811             =pod
1812              
1813             =head2 file
1814              
1815             This method returns the name of the file that was used to instantiate
1816             the object.
1817              
1818             Usage:
1819              
1820             my $file = $annotationParser->file;
1821              
1822             =cut
1823              
1824 1     1 1 3730 return $_[0]->{$kFileName};
1825              
1826             }
1827              
1828             ############################################################################
1829             sub serializeToDisk{
1830             ############################################################################
1831             =pod
1832              
1833             =head2 serializeToDisk
1834              
1835             This public method saves the current state of the Annotation Parser
1836             Object to a file, using the Storable package. The data are saved in
1837             network order for portability, just in case. The name of the object
1838             file is returned. By default, the name of the original file will be
1839             used to make the name of the object file (including the full path from
1840             where the file came), or the client can instead supply their own
1841             filename.
1842              
1843             Usage:
1844              
1845             my $fileName = $annotationParser->serializeToDisk;
1846              
1847             my $fileName = $annotationParser->serializeToDisk(filename => $filename);
1848              
1849             =cut
1850              
1851 0     0 1   my ($self, %args) = @_;
1852              
1853 0           my $fileName;
1854              
1855 0 0         if (exists ($args{'filename'})){ # they supply their own filename
1856              
1857 0           $fileName = $args{'filename'};
1858              
1859             }else{ # we build a name from the file used to instantiate ourselves
1860              
1861 0           $fileName = $self->file;
1862            
1863 0 0         if ($fileName !~ /\.obj$/){ # if we weren't instantiated from an object
1864            
1865 0           $fileName .= ".obj"; # add a .obj suffix to the name
1866            
1867             }
1868              
1869             }
1870              
1871 0 0         nstore ($self, $fileName) || die "$PACKAGE could not serialize itself to $fileName : $!";
1872              
1873 0           return ($fileName);
1874              
1875             }
1876              
1877             1; # to keep perl happy
1878              
1879             ############################################################################
1880             # MORE P O D D O C U M E N T A T I O N #
1881             ############################################################################
1882              
1883             =pod
1884              
1885             =head1 Modifications
1886              
1887             CVS info is listed here:
1888              
1889             # $Author: sherlock $
1890             # $Date: 2008/05/13 23:06:16 $
1891             # $Log: AnnotationParser.pm,v $
1892             # Revision 1.35 2008/05/13 23:06:16 sherlock
1893             # updated to fix bug with querying with a name that was unambiguous when
1894             # taking its casing into account.
1895             #
1896             # Revision 1.34 2007/03/18 03:09:05 sherlock
1897             # couple of PerlCritic suggested improvements, and an extra check to
1898             # make sure that the cardinality between standard names and database ids
1899             # is 1:1
1900             #
1901             # Revision 1.33 2006/07/28 00:02:14 sherlock
1902             # fixed a couple of typos
1903             #
1904             # Revision 1.32 2004/07/28 17:12:10 sherlock
1905             # bumped version
1906             #
1907             # Revision 1.31 2004/07/28 17:03:49 sherlock
1908             # fixed bugs when calling goidsByDatabaseId instead of goIdsByDatabaseId
1909             # on lines 1592 and 1617 - thanks to lfriedl@cs.umass.edu for spotting this.
1910             #
1911             # Revision 1.30 2003/11/26 18:44:28 sherlock
1912             # finished making all the changes that were required to make it case
1913             # insensitive, and modified POD accordingly. It appears to all work as
1914             # expected...
1915             #
1916             # Revision 1.29 2003/11/22 00:05:05 sherlock
1917             # made a very large number of changes to make much of it
1918             # case-insensitive, such that using CDC6 or cdc6 amounts to the same
1919             # query, as long as both versions of that name don't exist in the
1920             # annotations file. Still needs a little work to allow names that are
1921             # potentially ambiguous to be not ambiguous, if their casing matches
1922             # exactly one form of the name that has been seen. Have started to
1923             # update test suite to check all the case insensitive stuff, but is not
1924             # yet finished.
1925             #
1926             #
1927              
1928             =head1 AUTHORS
1929              
1930             Elizabeth Boyle, ell@mit.edu
1931              
1932             Gavin Sherlock, sherlock@genome.stanford.edu
1933              
1934             =cut