File Coverage

blib/lib/XML/Tiny.pm

Criterion	Covered	Total	%
statement	80	80	100.0
branch	66	68	97.0
condition	19	21	90.4
subroutine	4	4	100.0
pod	1	1	100.0
total	170	174	97.7

line	stmt	bran	cond	sub	pod	time	code
1							package XML::Tiny;
2
3	14			14		39847	use strict;
	14					24
	14					684
4
5							require Exporter;
6
7	14			14		72	use vars qw($VERSION @EXPORT_OK @ISA);
	14					22
	14					26979
8
9							$VERSION = '2.06';
10							@EXPORT_OK = qw(parsefile);
11							@ISA = qw(Exporter);
12
13							# localising prevents the warningness leaking out of this module
14							local $^W = 1; # can't use warnings as that's a 5.6-ism
15
16							=head1 NAME
17
18							XML::Tiny - simple lightweight parser for a subset of XML
19
20							=head1 DESCRIPTION
21
22							XML::Tiny is a simple lightweight parser for a subset of XML
23
24							=head1 SYNOPSIS
25
26							use XML::Tiny qw(parsefile);
27							open($xmlfile, 'something.xml);
28							my $document = parsefile($xmlfile);
29
30							This will leave C<$document> looking something like this:
31
32							[
33							{
34							type => 'e',
35							attrib => { ... },
36							name => 'rootelementname',
37							content => [
38							...
39							more elements and text content
40							...
41							]
42							}
43							]
44
45							=head1 FUNCTIONS
46
47							The C function is optionally exported. By default nothing is
48							exported. There is no objecty interface.
49
50							=head2 parsefile
51
52							This takes at least one parameter, optionally more. The compulsory
53							parameter may be:
54
55							=over 4
56
57							=item a filename
58
59							in which case the file is read and parsed;
60
61							=item a string of XML
62
63							in which case it is read and parsed. How do we tell if we've got a string
64							or a filename? If it begins with C<_TINY_XML_STRING_> then it's
65							a string. That prefix is, of course, ignored when it comes to actually
66							parsing the data. This is intended primarily for use by wrappers which
67							want to retain compatibility with Ye Aunciente Perl. Normal users who want
68							to pass in a string would be expected to use L.
69
70							=item a glob-ref or IO::Handle object
71
72							in which case again, the file is read and parsed.
73
74							=back
75
76							The former case is for compatibility with older perls, but makes no
77							attempt to properly deal with character sets. If you open a file in a
78							character-set-friendly way and then pass in a handle / object, then the
79							method should Do The Right Thing as it only ever works with character
80							data.
81
82							The remaining parameters are a list of key/value pairs to make a hash of
83							options:
84
85							=over 4
86
87							=item fatal_declarations
88
89							If set to true, E!ENTITY...E and E!DOCTYPE...E declarations
90							in the document
91							are fatal errors - otherwise they are ignored.
92
93							=item no_entity_parsing
94
95							If set to true, the five built-in entities are passed through unparsed.
96							Note that special characters in CDATA and attributes may have been turned
97							into C<&>, C<<> and friends.
98
99							=item strict_entity_parsing
100
101							If set to true, any unrecognised entities (ie, those outside the core five
102							plus numeric entities) cause a fatal error. If you set both this and
103							C (but why would you do that?) then the latter takes
104							precedence.
105
106							Obviously, if you want to maximise compliance with the XML spec, you should
107							turn on fatal_declarations and strict_entity_parsing.
108
109							=back
110
111							The function returns a structure describing the document. This contains
112							one or more nodes, each being either an 'element' node or a 'text' mode.
113							The structure is an arrayref which contains a single 'element' node which
114							represents the document entity. The arrayref is redundant, but exists for
115							compatibility with L.
116
117							Element nodes are hashrefs with the following keys:
118
119							=over 4
120
121							=item type
122
123							The node's type, represented by the letter 'e'.
124
125							=item name
126
127							The element's name.
128
129							=item attrib
130
131							A hashref containing the element's attributes, as key/value pairs where
132							the key is the attribute name.
133
134							=item content
135
136							An arrayref of the element's contents. The array's contents is a list of
137							nodes, in the order they were encountered in the document.
138
139							=back
140
141							Text nodes are hashrefs with the following keys:
142
143							=over 4
144
145							=item type
146
147							The node's type, represented by the letter 't'.
148
149							=item content
150
151							A scalar piece of text.
152
153							=back
154
155							If you prefer a DOMmish interface, then look at L on the CPAN.
156
157							=cut
158
159							my %regexps = (
160							name => '[:_a-z][\\w:\\.-]*'
161							);
162
163							my $strict_entity_parsing; # mmm, global. don't worry, parsefile sets it
164							# explicitly every time
165
166							sub parsefile {
167	223			223	1	42237	my($arg, %params) = @_;
168	223					982	my($file, $elem) = ('', { content => [] });
169	223					661	local $/; # sluuuuurp
170
171	223					356	$strict_entity_parsing = $params{strict_entity_parsing};
172
173	223	100				500	if(ref($arg) eq '') { # we were passed a filename or a string
174	221	100				484	if($arg =~ /^_TINY_XML_STRING_/) { # it's a string
175	29					182	$file = substr($arg, 17);
176							} else {
177	192					405	local *FH;
178	192	100				7863	open(FH, $arg) \|\| die(__PACKAGE__."::parsefile: Can't open $arg\n");
179	191					9005	$file = ;
180	191					2811	close(FH);
181							}
182	2					60	} else { $file = <$arg>; }
183
184							# strip any BOM
185	222					881	$file =~ s/^(\xff\xfe(\x00\x00)?\|(\x00\x00)?\xfe\xff\|\xef\xbb\xbf)//;
186
187	222	100	66			1478	die("No elements\n") if (!defined($file) \|\| $file =~ /^\s*$/);
188
189							# illegal low-ASCII chars
190	220	100				715	die("Not well-formed (Illegal low-ASCII chars found)\n") if($file =~ /[\x00-\x08\x0b\x0c\x0e-\x1f]/);
191
192							# turn CDATA into PCDATA
193	214					501	$file =~ s{}{
194	9					27	$_ = $1.chr(0); # this makes sure that empty CDATAs become
195	9					24	s/([&<>'"])/ # the empty string and aren't just thrown away.
196	5	100				52	$1 eq '&' ? '&' :
		50
		100
		100
197							$1 eq '<' ? '<' :
198							$1 eq '"' ? '"' :
199							$1 eq "'" ? ''' :
200							'>'
201							/eg;
202	9					37	$_;
203							}egs;
204
205	176	100				1132	die("Not well-formed (CDATA not delimited or bad comment)\n") if(
206							$file =~ /]]>/ \|\| # ]]> not delimiting CDATA
207							$file =~ //s \|\| # ---> can't end a comment
208	214	100	100			8154	grep { $_ && /--/ } ($file =~ /^\s+\|\|\s+$/gs) # -- in comm
			100
209							);
210
211							# strip leading/trailing whitespace and comments (which don't nest - phew!)
212	204					6400	$file =~ s/^\s+\|\|\s+$//gs;
213
214							# turn quoted > in attribs into >
215							# double- and single-quoted attrib values get done seperately
216	204					10532	while($file =~ s/($regexps{name}\s=\s"[^"])>([^"]")/$1>$2/gsi) {}
217	204					8976	while($file =~ s/($regexps{name}\s=\s'[^'])>([^']')/$1>$2/gsi) {}
218
219	204	100	100			1081	if($params{fatal_declarations} && $file =~ /
220	111					1536	die("I can't handle this document\n");
221							}
222
223							# ignore empty tokens/whitespace tokens
224	93	100				962	foreach my $token (grep { length && $_ !~ /^\s+$/ }
	924					3677
225							split(/(<[^>]+>)/, $file)) {
226	533	100	100			8938	if(
		100
		100
		100
227							$token =~ /<\?$regexps{name}.*?\?>/is \|\| # PI
228							$token =~ /^
229							) {
230	6					18	next;
231							} elsif($token =~ m!^!i) { # close tag
232	168	100				633	die("Not well-formed\n\tat $token\n") if($elem->{name} ne $1);
233	165					461	$elem = delete $elem->{parent};
234							} elsif($token =~ /^<$regexps{name}(\s[^>])(\s*\/)?>/is) { # open tag
235	241					1905	my($tagname, $attribs_raw) = ($token =~ m!<(\S)(.?)(\s*/)?>!s);
236							# first make attribs into a list so we can spot duplicate keys
237	241					3082	my $attrib = [
238							# do double- and single- quoted attribs seperately
239							$attribs_raw =~ /\s($regexps{name})\s=\s"([^"]*?)"/gi,
240							$attribs_raw =~ /\s($regexps{name})\s=\s'([^']*?)'/gi
241							];
242	241	100				328	if(@{$attrib} == 2 * keys %{{@{$attrib}}}) {
	241					365
	241					241
	241					819
243	240					305	$attrib = { @{$attrib} }
	240					518
244	1					15	} else { die("Not well-formed - duplicate attribute\n"); }
245
246							# now trash any attribs that we did manage to parse and see
247							# if there's anything left
248	240					1884	$attribs_raw =~ s/\s($regexps{name})\s=\s"([^"]*?)"//gi;
249	240					1388	$attribs_raw =~ s/\s($regexps{name})\s=\s'([^']*?)'//gi;
250	240	100	100			769	die("Not well-formed\n$attribs_raw") if($attribs_raw =~ /\S/ \|\| grep { /
	73					310
	230					844
251
252	228	100				477	unless($params{no_entity_parsing}) {
253	227					238	foreach my $key (keys %{$attrib}) {
	227					598
254	71					152	($attrib->{$key} = _fixentities($attrib->{$key})) =~ s/\x00//g; # get rid of CDATA marker
255							}
256							}
257							$elem = {
258	224					1006	content => [],
259							name => $tagname,
260							type => 'e',
261							attrib => $attrib,
262							parent => $elem
263							};
264	224					266	push @{$elem->{parent}->{content}}, $elem;
	224					500
265							# now handle self-closing tags
266	224	100				1030	if($token =~ /\s*\/>$/) {
267	33					110	$elem->{name} =~ s/\/$//;
268	33					260	$elem = delete $elem->{parent};
269							}
270							} elsif($token =~ /^
271	13					170	die("I can't handle this document\n\tat $token\n");
272							} else { # ordinary content
273	105					173	$token =~ s/\x00//g; # get rid of our CDATA marker
274	105	100				236	unless($params{no_entity_parsing}) { $token = _fixentities($token); }
	104					222
275	97					131	push @{$elem->{content}}, { content => $token, type => 't' };
	97					461
276							}
277							}
278	52	50				251	die("Not well-formed (Duplicated parent)\n") if(exists($elem->{parent}));
279	52	100				67	die("Junk after end of document\n") if($#{$elem->{content}} > 0);
	52					334
280	40					386	die("No elements\n") if(
281	40	100	66			57	$#{$elem->{content}} == -1 \|\| $elem->{content}->[0]->{type} ne 'e'
282							);
283	39					5255	return $elem->{content};
284							}
285
286							sub _fixentities {
287	175			175		322	my $thingy = shift;
288
289	175	100				418	my $junk = ($strict_entity_parsing) ? '\|.*' : '';
290	175					1224	$thingy =~ s/&((#(\d+\|x[a-fA-F0-9]+);)\|lt;\|gt;\|quot;\|apos;\|amp;$junk)/
291	220	100				1548	$3 ? (
		100
		100
		100
		100
		100
		100
292							substr($3, 0, 1) eq 'x' ? # using a =~ match here clobbers $3
293							chr(hex(substr($3, 1))) : # so don't "fix" it!
294							chr($3)
295							) :
296							$1 eq 'lt;' ? '<' :
297							$1 eq 'gt;' ? '>' :
298							$1 eq 'apos;' ? "'" :
299							$1 eq 'quot;' ? '"' :
300							$1 eq 'amp;' ? '&' :
301							die("Illegal ampersand or entity\n\tat $1\n")
302							/ge;
303	163					591	$thingy;
304							}
305
306							=head1 COMPATIBILITY
307
308							=head2 With other modules
309
310							The C function is so named because it is intended to work in a
311							similar fashion to L with the L style.
312							Instead of saying this:
313
314							use XML::Parser;
315							use XML::Parser::EasyTree;
316							$XML::Parser::EasyTree::Noempty=1;
317							my $p=new XML::Parser(Style=>'EasyTree');
318							my $tree=$p->parsefile('something.xml');
319
320							you would say:
321
322							use XML::Tiny;
323							my $tree = XML::Tiny::parsefile('something.xml');
324
325							Any valid document that can be parsed like that using XML::Tiny should
326							produce identical results if you use the above example of how to use
327							L.
328
329							If you find a document where that is not the case, please report it as
330							a bug.
331
332							=head2 With perl 5.004
333
334							The module is intended to be fully compatible with every version of perl
335							back to and including 5.004, and may be compatible with even older
336							versions of perl 5.
337
338							The lack of Unicode and friends in older perls means that XML::Tiny
339							does nothing with character sets. If you have a document with a funny
340							character set, then you will need to open the file in an appropriate
341							mode using a character-set-friendly perl and pass the resulting file
342							handle to the module. BOMs are ignored.
343
344							=head2 The subset of XML that we understand
345
346							=over 4
347
348							=item Element tags and attributes
349
350							Including "self-closing" tags like Epie type = 'steak n kidney' /E;
351
352							=item Comments
353
354							Which are ignored;
355
356							=item The five "core" entities
357
358							ie C<&>, C<<>, C<>>, C<'> and C<">;
359
360							=item Numeric entities
361
362							eg C<A> and C<A>;
363
364							=item CDATA
365
366							This is simply turned into PCDATA before parsing. Note how this may interact
367							with the various entity-handling options;
368
369							=back
370
371							The following parts of the XML standard are handled incorrectly or not at
372							all - this is not an exhaustive list:
373
374							=over 4
375
376							=item Namespaces
377
378							While documents that use namespaces will be parsed just fine, there's no
379							special treatment of them. Their names are preserved in element and
380							attribute names like 'rdf:RDF'.
381
382							=item DTDs and Schemas
383
384							This is not a validating parser. declarations are ignored
385							if you've not made them fatal.
386
387							=item Entities and references
388
389							declarations are ignored if you've not made them fatal.
390							Unrecognised entities are ignored by default, as are naked & characters.
391							This means that if entity parsing is enabled you won't be able to tell
392							the difference between C< > and C< >. If your
393							document might use any non-core entities then please consider using
394							the C option, and then use something like
395							L.
396
397							=item Processing instructions
398
399							These are ignored.
400
401							=item Whitespace
402
403							We do not guarantee to correctly handle leading and trailing whitespace.
404
405							=item Character sets
406
407							This is not practical with older versions of perl
408
409							=back
410
411							=head1 PHILOSOPHY and JUSTIFICATION
412
413							While feedback from real users about this module has been uniformly
414							positive and helpful, some people seem to take issue with this module
415							because it doesn't implement every last jot and tittle of the XML
416							standard and merely implements a useful subset. A very useful subset,
417							as it happens, which can cope with common light-weight XML-ish tasks
418							such as parsing the results of queries to the Amazon Web Services.
419							Many, perhaps most, users of XML do not in fact need a full implementation
420							of the standard, and are understandably reluctant to install large complex
421							pieces of software which have many dependencies. In fact, when they
422							realise what installing and using a full implementation entails, they
423							quite often don't want it. Another class of users, people
424							distributing applications, often can not rely on users being able to
425							install modules from the CPAN, or even having tools like make or a shell
426							available. XML::Tiny exists for those people.
427
428							=head1 BUGS and FEEDBACK
429
430							I welcome feedback about my code, including constructive criticism.
431							Bug reports should be made using L or by email,
432							and should include the smallest possible chunk of code, along with
433							any necessary XML data, which demonstrates the bug. Ideally, this
434							will be in the form of a file which I can drop in to the module's
435							test suite. Please note that such files must work in perl 5.004.
436
437							=head1 SEE ALSO
438
439							=over 4
440
441							=item For more capable XML parsers:
442
443							L
444
445							L
446
447							L
448
449							=item The requirements for a module to be Tiny
450
451							L
452
453							=back
454
455							=head1 AUTHOR, COPYRIGHT and LICENCE
456
457							David Cantrell EFE
458
459							Thanks to David Romano for some compatibility patches for Ye Aunciente Perl;
460
461							to Matt Knecht and David Romano for prodding me to support attributes,
462							and to Matt for providing code to implement it in a quick n dirty minimal
463							kind of way;
464
465							to the people on L and elsewhere who have been kind
466							enough to point out ways it could be improved;
467
468							to Sergio Fanchiotti for pointing out a bug in handling self-closing tags,
469							for reporting another bug that I introduced when fixing the first one,
470							and for providing a patch to improve error reporting;
471
472							to 'Corion' for finding a bug with localised filehandles and providing a fix;
473
474							to Diab Jerius for spotting that element and attribute names can begin
475							with an underscore;
476
477							to Nick Dumas for finding a bug when attribs have their quoting character
478							in CDATA, and providing a patch;
479
480							to Mathieu Longtin for pointing out that BOMs exist.
481
482							Copyright 2007-2010 David Cantrell Edavid@cantrell.org.ukE
483
484							This software is free-as-in-speech software, and may be used,
485							distributed, and modified under the terms of either the GNU
486							General Public Licence version 2 or the Artistic Licence. It's
487							up to you which one you use. The full text of the licences can
488							be found in the files GPL2.txt and ARTISTIC.txt, respectively.
489
490							=head1 CONSPIRACY
491
492							This module is also free-as-in-mason software.
493
494							=cut
495
496							'zero';