File Coverage

blib/lib/XML/Tiny.pm
Criterion Covered Total %
statement 80 80 100.0
branch 66 68 97.0
condition 19 21 90.4
subroutine 4 4 100.0
pod 1 1 100.0
total 170 174 97.7


line stmt bran cond sub pod time code
1             package XML::Tiny;
2              
3 14     14   39847 use strict;
  14         24  
  14         684  
4              
5             require Exporter;
6              
7 14     14   72 use vars qw($VERSION @EXPORT_OK @ISA);
  14         22  
  14         26979  
8              
9             $VERSION = '2.06';
10             @EXPORT_OK = qw(parsefile);
11             @ISA = qw(Exporter);
12              
13             # localising prevents the warningness leaking out of this module
14             local $^W = 1; # can't use warnings as that's a 5.6-ism
15              
16             =head1 NAME
17              
18             XML::Tiny - simple lightweight parser for a subset of XML
19              
20             =head1 DESCRIPTION
21              
22             XML::Tiny is a simple lightweight parser for a subset of XML
23              
24             =head1 SYNOPSIS
25              
26             use XML::Tiny qw(parsefile);
27             open($xmlfile, 'something.xml);
28             my $document = parsefile($xmlfile);
29              
30             This will leave C<$document> looking something like this:
31              
32             [
33             {
34             type => 'e',
35             attrib => { ... },
36             name => 'rootelementname',
37             content => [
38             ...
39             more elements and text content
40             ...
41             ]
42             }
43             ]
44              
45             =head1 FUNCTIONS
46              
47             The C function is optionally exported. By default nothing is
48             exported. There is no objecty interface.
49              
50             =head2 parsefile
51              
52             This takes at least one parameter, optionally more. The compulsory
53             parameter may be:
54              
55             =over 4
56              
57             =item a filename
58              
59             in which case the file is read and parsed;
60              
61             =item a string of XML
62              
63             in which case it is read and parsed. How do we tell if we've got a string
64             or a filename? If it begins with C<_TINY_XML_STRING_> then it's
65             a string. That prefix is, of course, ignored when it comes to actually
66             parsing the data. This is intended primarily for use by wrappers which
67             want to retain compatibility with Ye Aunciente Perl. Normal users who want
68             to pass in a string would be expected to use L.
69              
70             =item a glob-ref or IO::Handle object
71              
72             in which case again, the file is read and parsed.
73              
74             =back
75              
76             The former case is for compatibility with older perls, but makes no
77             attempt to properly deal with character sets. If you open a file in a
78             character-set-friendly way and then pass in a handle / object, then the
79             method should Do The Right Thing as it only ever works with character
80             data.
81              
82             The remaining parameters are a list of key/value pairs to make a hash of
83             options:
84              
85             =over 4
86              
87             =item fatal_declarations
88              
89             If set to true, E!ENTITY...E and E!DOCTYPE...E declarations
90             in the document
91             are fatal errors - otherwise they are *ignored*.
92              
93             =item no_entity_parsing
94              
95             If set to true, the five built-in entities are passed through unparsed.
96             Note that special characters in CDATA and attributes may have been turned
97             into C<&>, C<<> and friends.
98              
99             =item strict_entity_parsing
100              
101             If set to true, any unrecognised entities (ie, those outside the core five
102             plus numeric entities) cause a fatal error. If you set both this and
103             C (but why would you do that?) then the latter takes
104             precedence.
105              
106             Obviously, if you want to maximise compliance with the XML spec, you should
107             turn on fatal_declarations and strict_entity_parsing.
108              
109             =back
110              
111             The function returns a structure describing the document. This contains
112             one or more nodes, each being either an 'element' node or a 'text' mode.
113             The structure is an arrayref which contains a single 'element' node which
114             represents the document entity. The arrayref is redundant, but exists for
115             compatibility with L.
116              
117             Element nodes are hashrefs with the following keys:
118              
119             =over 4
120              
121             =item type
122              
123             The node's type, represented by the letter 'e'.
124              
125             =item name
126              
127             The element's name.
128              
129             =item attrib
130              
131             A hashref containing the element's attributes, as key/value pairs where
132             the key is the attribute name.
133              
134             =item content
135              
136             An arrayref of the element's contents. The array's contents is a list of
137             nodes, in the order they were encountered in the document.
138              
139             =back
140              
141             Text nodes are hashrefs with the following keys:
142              
143             =over 4
144              
145             =item type
146              
147             The node's type, represented by the letter 't'.
148              
149             =item content
150              
151             A scalar piece of text.
152              
153             =back
154              
155             If you prefer a DOMmish interface, then look at L on the CPAN.
156              
157             =cut
158              
159             my %regexps = (
160             name => '[:_a-z][\\w:\\.-]*'
161             );
162              
163             my $strict_entity_parsing; # mmm, global. don't worry, parsefile sets it
164             # explicitly every time
165              
166             sub parsefile {
167 223     223 1 42237 my($arg, %params) = @_;
168 223         982 my($file, $elem) = ('', { content => [] });
169 223         661 local $/; # sluuuuurp
170              
171 223         356 $strict_entity_parsing = $params{strict_entity_parsing};
172              
173 223 100       500 if(ref($arg) eq '') { # we were passed a filename or a string
174 221 100       484 if($arg =~ /^_TINY_XML_STRING_/) { # it's a string
175 29         182 $file = substr($arg, 17);
176             } else {
177 192         405 local *FH;
178 192 100       7863 open(FH, $arg) || die(__PACKAGE__."::parsefile: Can't open $arg\n");
179 191         9005 $file = ;
180 191         2811 close(FH);
181             }
182 2         60 } else { $file = <$arg>; }
183              
184             # strip any BOM
185 222         881 $file =~ s/^(\xff\xfe(\x00\x00)?|(\x00\x00)?\xfe\xff|\xef\xbb\xbf)//;
186              
187 222 100 66     1478 die("No elements\n") if (!defined($file) || $file =~ /^\s*$/);
188              
189             # illegal low-ASCII chars
190 220 100       715 die("Not well-formed (Illegal low-ASCII chars found)\n") if($file =~ /[\x00-\x08\x0b\x0c\x0e-\x1f]/);
191              
192             # turn CDATA into PCDATA
193 214         501 $file =~ s{}{
194 9         27 $_ = $1.chr(0); # this makes sure that empty CDATAs become
195 9         24 s/([&<>'"])/ # the empty string and aren't just thrown away.
196 5 100       52 $1 eq '&' ? '&' :
    50          
    100          
    100          
197             $1 eq '<' ? '<' :
198             $1 eq '"' ? '"' :
199             $1 eq "'" ? ''' :
200             '>'
201             /eg;
202 9         37 $_;
203             }egs;
204              
205 176 100       1132 die("Not well-formed (CDATA not delimited or bad comment)\n") if(
206             $file =~ /]]>/ || # ]]> not delimiting CDATA
207             $file =~ //s || # ---> can't end a comment
208 214 100 100     8154 grep { $_ && /--/ } ($file =~ /^\s+||\s+$/gs) # -- in comm
      100        
209             );
210              
211             # strip leading/trailing whitespace and comments (which don't nest - phew!)
212 204         6400 $file =~ s/^\s+||\s+$//gs;
213            
214             # turn quoted > in attribs into >
215             # double- and single-quoted attrib values get done seperately
216 204         10532 while($file =~ s/($regexps{name}\s*=\s*"[^"]*)>([^"]*")/$1>$2/gsi) {}
217 204         8976 while($file =~ s/($regexps{name}\s*=\s*'[^']*)>([^']*')/$1>$2/gsi) {}
218              
219 204 100 100     1081 if($params{fatal_declarations} && $file =~ /
220 111         1536 die("I can't handle this document\n");
221             }
222              
223             # ignore empty tokens/whitespace tokens
224 93 100       962 foreach my $token (grep { length && $_ !~ /^\s+$/ }
  924         3677  
225             split(/(<[^>]+>)/, $file)) {
226 533 100 100     8938 if(
    100          
    100          
    100          
227             $token =~ /<\?$regexps{name}.*?\?>/is || # PI
228             $token =~ /^
229             ) {
230 6         18 next;
231             } elsif($token =~ m!^!i) { # close tag
232 168 100       633 die("Not well-formed\n\tat $token\n") if($elem->{name} ne $1);
233 165         461 $elem = delete $elem->{parent};
234             } elsif($token =~ /^<$regexps{name}(\s[^>]*)*(\s*\/)?>/is) { # open tag
235 241         1905 my($tagname, $attribs_raw) = ($token =~ m!<(\S*)(.*?)(\s*/)?>!s);
236             # first make attribs into a list so we can spot duplicate keys
237 241         3082 my $attrib = [
238             # do double- and single- quoted attribs seperately
239             $attribs_raw =~ /\s($regexps{name})\s*=\s*"([^"]*?)"/gi,
240             $attribs_raw =~ /\s($regexps{name})\s*=\s*'([^']*?)'/gi
241             ];
242 241 100       328 if(@{$attrib} == 2 * keys %{{@{$attrib}}}) {
  241         365  
  241         241  
  241         819  
243 240         305 $attrib = { @{$attrib} }
  240         518  
244 1         15 } else { die("Not well-formed - duplicate attribute\n"); }
245            
246             # now trash any attribs that we *did* manage to parse and see
247             # if there's anything left
248 240         1884 $attribs_raw =~ s/\s($regexps{name})\s*=\s*"([^"]*?)"//gi;
249 240         1388 $attribs_raw =~ s/\s($regexps{name})\s*=\s*'([^']*?)'//gi;
250 240 100 100     769 die("Not well-formed\n$attribs_raw") if($attribs_raw =~ /\S/ || grep { /
  73         310  
  230         844  
251              
252 228 100       477 unless($params{no_entity_parsing}) {
253 227         238 foreach my $key (keys %{$attrib}) {
  227         598  
254 71         152 ($attrib->{$key} = _fixentities($attrib->{$key})) =~ s/\x00//g; # get rid of CDATA marker
255             }
256             }
257             $elem = {
258 224         1006 content => [],
259             name => $tagname,
260             type => 'e',
261             attrib => $attrib,
262             parent => $elem
263             };
264 224         266 push @{$elem->{parent}->{content}}, $elem;
  224         500  
265             # now handle self-closing tags
266 224 100       1030 if($token =~ /\s*\/>$/) {
267 33         110 $elem->{name} =~ s/\/$//;
268 33         260 $elem = delete $elem->{parent};
269             }
270             } elsif($token =~ /^
271 13         170 die("I can't handle this document\n\tat $token\n");
272             } else { # ordinary content
273 105         173 $token =~ s/\x00//g; # get rid of our CDATA marker
274 105 100       236 unless($params{no_entity_parsing}) { $token = _fixentities($token); }
  104         222  
275 97         131 push @{$elem->{content}}, { content => $token, type => 't' };
  97         461  
276             }
277             }
278 52 50       251 die("Not well-formed (Duplicated parent)\n") if(exists($elem->{parent}));
279 52 100       67 die("Junk after end of document\n") if($#{$elem->{content}} > 0);
  52         334  
280 40         386 die("No elements\n") if(
281 40 100 66     57 $#{$elem->{content}} == -1 || $elem->{content}->[0]->{type} ne 'e'
282             );
283 39         5255 return $elem->{content};
284             }
285              
286             sub _fixentities {
287 175     175   322 my $thingy = shift;
288              
289 175 100       418 my $junk = ($strict_entity_parsing) ? '|.*' : '';
290 175         1224 $thingy =~ s/&((#(\d+|x[a-fA-F0-9]+);)|lt;|gt;|quot;|apos;|amp;$junk)/
291 220 100       1548 $3 ? (
    100          
    100          
    100          
    100          
    100          
    100          
292             substr($3, 0, 1) eq 'x' ? # using a =~ match here clobbers $3
293             chr(hex(substr($3, 1))) : # so don't "fix" it!
294             chr($3)
295             ) :
296             $1 eq 'lt;' ? '<' :
297             $1 eq 'gt;' ? '>' :
298             $1 eq 'apos;' ? "'" :
299             $1 eq 'quot;' ? '"' :
300             $1 eq 'amp;' ? '&' :
301             die("Illegal ampersand or entity\n\tat $1\n")
302             /ge;
303 163         591 $thingy;
304             }
305              
306             =head1 COMPATIBILITY
307              
308             =head2 With other modules
309              
310             The C function is so named because it is intended to work in a
311             similar fashion to L with the L style.
312             Instead of saying this:
313              
314             use XML::Parser;
315             use XML::Parser::EasyTree;
316             $XML::Parser::EasyTree::Noempty=1;
317             my $p=new XML::Parser(Style=>'EasyTree');
318             my $tree=$p->parsefile('something.xml');
319              
320             you would say:
321              
322             use XML::Tiny;
323             my $tree = XML::Tiny::parsefile('something.xml');
324              
325             Any valid document that can be parsed like that using XML::Tiny should
326             produce identical results if you use the above example of how to use
327             L.
328              
329             If you find a document where that is not the case, please report it as
330             a bug.
331              
332             =head2 With perl 5.004
333              
334             The module is intended to be fully compatible with every version of perl
335             back to and including 5.004, and may be compatible with even older
336             versions of perl 5.
337              
338             The lack of Unicode and friends in older perls means that XML::Tiny
339             does nothing with character sets. If you have a document with a funny
340             character set, then you will need to open the file in an appropriate
341             mode using a character-set-friendly perl and pass the resulting file
342             handle to the module. BOMs are ignored.
343              
344             =head2 The subset of XML that we understand
345              
346             =over 4
347              
348             =item Element tags and attributes
349              
350             Including "self-closing" tags like Epie type = 'steak n kidney' /E;
351              
352             =item Comments
353              
354             Which are ignored;
355              
356             =item The five "core" entities
357              
358             ie C<&>, C<<>, C<>>, C<'> and C<">;
359              
360             =item Numeric entities
361              
362             eg C<A> and C<A>;
363              
364             =item CDATA
365              
366             This is simply turned into PCDATA before parsing. Note how this may interact
367             with the various entity-handling options;
368              
369             =back
370              
371             The following parts of the XML standard are handled incorrectly or not at
372             all - this is not an exhaustive list:
373              
374             =over 4
375              
376             =item Namespaces
377              
378             While documents that use namespaces will be parsed just fine, there's no
379             special treatment of them. Their names are preserved in element and
380             attribute names like 'rdf:RDF'.
381              
382             =item DTDs and Schemas
383              
384             This is not a validating parser. declarations are ignored
385             if you've not made them fatal.
386              
387             =item Entities and references
388              
389             declarations are ignored if you've not made them fatal.
390             Unrecognised entities are ignored by default, as are naked & characters.
391             This means that if entity parsing is enabled you won't be able to tell
392             the difference between C<&nbsp;> and C< >. If your
393             document might use any non-core entities then please consider using
394             the C option, and then use something like
395             L.
396              
397             =item Processing instructions
398              
399             These are ignored.
400              
401             =item Whitespace
402              
403             We do not guarantee to correctly handle leading and trailing whitespace.
404              
405             =item Character sets
406              
407             This is not practical with older versions of perl
408              
409             =back
410              
411             =head1 PHILOSOPHY and JUSTIFICATION
412              
413             While feedback from real users about this module has been uniformly
414             positive and helpful, some people seem to take issue with this module
415             because it doesn't implement every last jot and tittle of the XML
416             standard and merely implements a useful subset. A very useful subset,
417             as it happens, which can cope with common light-weight XML-ish tasks
418             such as parsing the results of queries to the Amazon Web Services.
419             Many, perhaps most, users of XML do not in fact need a full implementation
420             of the standard, and are understandably reluctant to install large complex
421             pieces of software which have many dependencies. In fact, when they
422             realise what installing and using a full implementation entails, they
423             quite often don't *want* it. Another class of users, people
424             distributing applications, often can not rely on users being able to
425             install modules from the CPAN, or even having tools like make or a shell
426             available. XML::Tiny exists for those people.
427              
428             =head1 BUGS and FEEDBACK
429              
430             I welcome feedback about my code, including constructive criticism.
431             Bug reports should be made using L or by email,
432             and should include the smallest possible chunk of code, along with
433             any necessary XML data, which demonstrates the bug. Ideally, this
434             will be in the form of a file which I can drop in to the module's
435             test suite. Please note that such files must work in perl 5.004.
436              
437             =head1 SEE ALSO
438              
439             =over 4
440              
441             =item For more capable XML parsers:
442              
443             L
444              
445             L
446              
447             L
448              
449             =item The requirements for a module to be Tiny
450              
451             L
452              
453             =back
454              
455             =head1 AUTHOR, COPYRIGHT and LICENCE
456              
457             David Cantrell EFE
458              
459             Thanks to David Romano for some compatibility patches for Ye Aunciente Perl;
460              
461             to Matt Knecht and David Romano for prodding me to support attributes,
462             and to Matt for providing code to implement it in a quick n dirty minimal
463             kind of way;
464              
465             to the people on L and elsewhere who have been kind
466             enough to point out ways it could be improved;
467              
468             to Sergio Fanchiotti for pointing out a bug in handling self-closing tags,
469             for reporting another bug that I introduced when fixing the first one,
470             and for providing a patch to improve error reporting;
471              
472             to 'Corion' for finding a bug with localised filehandles and providing a fix;
473              
474             to Diab Jerius for spotting that element and attribute names can begin
475             with an underscore;
476              
477             to Nick Dumas for finding a bug when attribs have their quoting character
478             in CDATA, and providing a patch;
479              
480             to Mathieu Longtin for pointing out that BOMs exist.
481              
482             Copyright 2007-2010 David Cantrell Edavid@cantrell.org.ukE
483              
484             This software is free-as-in-speech software, and may be used,
485             distributed, and modified under the terms of either the GNU
486             General Public Licence version 2 or the Artistic Licence. It's
487             up to you which one you use. The full text of the licences can
488             be found in the files GPL2.txt and ARTISTIC.txt, respectively.
489              
490             =head1 CONSPIRACY
491              
492             This module is also free-as-in-mason software.
493              
494             =cut
495              
496             'zero';