File Coverage

blib/lib/XML/DoubleEncodedEntities.pm
Criterion Covered Total %
statement 11 11 100.0
branch 9 12 75.0
condition n/a
subroutine 3 3 100.0
pod 0 1 0.0
total 23 27 85.1


line stmt bran cond sub pod time code
1             package XML::DoubleEncodedEntities;
2              
3 1     1   9013 use strict;
  1         2  
  1         53  
4              
5             require Exporter;
6              
7 1     1   4 use vars qw($VERSION @EXPORT_OK @ISA);
  1         1  
  1         491  
8              
9             $VERSION = '1.0';
10             @EXPORT_OK = qw(decode);
11             @ISA = qw(Exporter);
12              
13             # localising prevents the warningness leaking out of this module
14             local $^W = 1; # can't use warnings as that's a 5.6-ism
15              
16             =head1 NAME
17              
18             XML::DoubleEncodedEntities - unbreak XML with doubly-encoded entities
19              
20             =head1 DESCRIPTION
21              
22             Occasionally, XML files escape into the wild with their entities encoded
23             twice so instead of this:
24              
25             Green & Blacks
26              
27             you get:
28              
29             <chocolate>Green &amp; Blacks</chocolate>
30              
31             A real-world example of this problem can be seen in this failing test
32             for a module which queries an online XML datasource:
33              
34             http://www.nntp.perl.org/group/perl.cpan.testers/2007/02/msg414642.html
35              
36             (search for the text 'Arcturus' in that page).
37              
38             This module tries to fix that.
39              
40             =head1 SYNOPSIS
41              
42             use XML::DoubleEncodedEntities;
43            
44             my $xmlfile = XML::DoubleEncodedEntities::decode($xmlfile);
45              
46             =head1 The C function
47              
48             This function is not exported, but can be if you wish. It takes one
49             scalar parameter and returns a corresponding scalar, decoded if necessary.
50              
51             The parameter is assumed to be a string. If its first non-whitespace
52             characters are C<<>, or if it contains the sequence C<&amp;> the
53             string is assumed to be a doubly-encoded XML document, in which case the
54             following entities, if present, are decoded:
55             &
56             <
57             >
58             "
59             '
60              
61             No other parameters are decoded. After all, if the input document has been
62             *doubly* encoded then something like C<æ>, which should be the entity C<æ>
63             will be represented by the character sequence C<&aelig;>. Once the
64             C<&> has been corrected by this module, you'll be able to decode the
65             resulting C<æ> in the normal way.
66              
67             =cut
68              
69             # ripped off (and simplified) from XML::Tiny
70             sub decode {
71 6     6 0 29 my $thingy = shift;
72 6 100       58 return $thingy unless($thingy =~ /(^\s*<|&amp;)/);
73              
74 4         25 $thingy =~ s/&(lt;|gt;|quot;|apos;|amp;|.*)/
75 8 100       66 $1 eq 'lt;' ? '<' :
    50          
    50          
    50          
    100          
76             $1 eq 'gt;' ? '>' :
77             $1 eq 'apos;' ? "'" :
78             $1 eq 'quot;' ? '"' :
79             $1 eq 'amp;' ? '&' :
80             die("Illegal ampersand or entity\n\tat &$1\n")
81             /ge;
82 3         16 $thingy;
83             }
84              
85             =head1 BUGS and FEEDBACK
86              
87             I welcome feedback about my code, including constructive criticism.
88             Bug reports should be made using L or by email,
89             and should include the smallest possible chunk of code, along with
90             any necessary data, which demonstrates the bug. Ideally, this
91             will be in the form of a file which I can drop in to the module's
92             test suite. Ideally such files will work in perl 5.004.
93              
94             If you are feeling particularly generous you can encourage me in my
95             open source endeavours by buying me something from my wishlist:
96             L
97              
98             =head1 SEE ALSO
99              
100             L, which does the same job for broken UTF-8.
101              
102             L, which is HTMLish.
103              
104             =head1 AUTHOR
105              
106             David Cantrell EFE
107              
108             =head1 COPYRIGHT and LICENCE
109              
110             Copyright 2007 David Cantrell
111              
112             This module is free-as-in-speech software, and may be used, distributed,
113             and modified under the same terms as Perl itself.
114              
115             =head1 CONSPIRACY
116              
117             This module is also free-as-in-mason software.
118              
119             =cut
120              
121             '&amp;#49'