File Coverage

blib/lib/XML/DoubleEncodedEntities.pm
Criterion Covered Total %
statement 11 11 100.0
branch 9 12 75.0
condition n/a
subroutine 3 3 100.0
pod 1 1 100.0
total 24 27 88.8


line stmt bran cond sub pod time code
1             package XML::DoubleEncodedEntities;
2              
3 1     1   409 use strict;
  1         2  
  1         46  
4              
5             require Exporter;
6              
7 1     1   5 use vars qw($VERSION @EXPORT_OK @ISA);
  1         1  
  1         266  
8              
9             $VERSION = '1.1';
10             @EXPORT_OK = qw(decode);
11             @ISA = qw(Exporter);
12              
13             # localising prevents the warningness leaking out of this module
14             local $^W = 1; # can't use warnings as that's a 5.6-ism
15              
16             =encoding ISO8859-1
17              
18             =head1 NAME
19              
20             XML::DoubleEncodedEntities - unbreak XML with doubly-encoded entities
21              
22             =head1 DESCRIPTION
23              
24             Occasionally, XML files escape into the wild with their entities encoded
25             twice so instead of this:
26              
27             Green & Blacks
28              
29             you get:
30              
31             <chocolate>Green &amp; Blacks</chocolate>
32              
33             A real-world example of this problem can be seen in this failing test
34             for a module which queries an online XML datasource:
35              
36             http://www.nntp.perl.org/group/perl.cpan.testers/2007/02/msg414642.html
37              
38             (search for the text 'Arcturus' in that page).
39              
40             This module tries to fix that.
41              
42             =head1 SYNOPSIS
43              
44             use XML::DoubleEncodedEntities;
45            
46             my $xmlfile = XML::DoubleEncodedEntities::decode($xmlfile);
47              
48             =head1 Functions
49              
50             =head2 decode
51              
52             This function is not exported, but can be if you wish. It takes one
53             scalar parameter and returns a corresponding scalar, decoded if necessary.
54              
55             The parameter is assumed to be a string. If its first non-whitespace
56             characters are C<<>, or if it contains the sequence C<&amp;> the
57             string is assumed to be a doubly-encoded XML document, in which case the
58             following entities, if present, are decoded:
59             &
60             <
61             >
62             "
63             '
64              
65             No other parameters are decoded. After all, if the input document has been
66             *doubly* encoded then something like C<æ>, which should be the entity C<æ>
67             will be represented by the character sequence C<&aelig;>. Once the
68             C<&> has been corrected by this module, you'll be able to decode the
69             resulting C<æ> in the normal way.
70              
71             =cut
72              
73             # ripped off (and simplified) from XML::Tiny
74             sub decode {
75 6     6 1 21 my $thingy = shift;
76 6 100       44 return $thingy unless($thingy =~ /(^\s*<|&amp;)/);
77              
78 4         16 $thingy =~ s/&(lt;|gt;|quot;|apos;|amp;|.*)/
79 8 100       46 $1 eq 'lt;' ? '<' :
    50          
    50          
    50          
    100          
80             $1 eq 'gt;' ? '>' :
81             $1 eq 'apos;' ? "'" :
82             $1 eq 'quot;' ? '"' :
83             $1 eq 'amp;' ? '&' :
84             die("Illegal ampersand or entity\n\tat &$1\n")
85             /ge;
86 3         13 $thingy;
87             }
88              
89             =head1 BUGS and FEEDBACK
90              
91             I welcome feedback about my code, including constructive criticism.
92             Bug reports should be made using L or by email,
93             and should include the smallest possible chunk of code, along with
94             any necessary data, which demonstrates the bug. Ideally, this
95             will be in the form of a file which I can drop in to the module's
96             test suite. Ideally such files will work in perl 5.004.
97              
98             If you are feeling particularly generous you can encourage me in my
99             open source endeavours by buying me something from my wishlist:
100             L
101              
102             =head1 SEE ALSO
103              
104             L, which does the same job for broken UTF-8.
105              
106             L, which is HTMLish.
107              
108             =head1 AUTHOR
109              
110             David Cantrell EFE
111              
112             =head1 COPYRIGHT and LICENCE
113              
114             Copyright 2007 David Cantrell
115              
116             This module is free-as-in-speech software, and may be used, distributed,
117             and modified under the same terms as Perl itself.
118              
119             =head1 CONSPIRACY
120              
121             This module is also free-as-in-mason software.
122              
123             =cut
124              
125             '&amp;#49'