File Coverage

lib/Decl/Semantics/Parse.pm

Criterion	Covered	Total	%
statement	13	24	54.1
branch	0	2	0.0
condition			n/a
subroutine	5	8	62.5
pod	3	3	100.0
total	21	37	56.7

line	stmt	bran	sub	pod	time	code
1						package Decl::Semantics::Parse;
2
3	12		12		78	use warnings;
	12				27
	12				482
4	12		12		72	use strict;
	12				25
	12				482
5
6	12		12		65	use base qw(Decl::Node);
	12				27
	12				1239
7	12		12		73	use Iterator::Simple qw(:all);
	12				28
	12				6887
8
9						=head1 NAME
10
11						Decl::Semantics::Parse - implements a parser specification.
12
13						=head1 VERSION
14
15						Version 0.01
16
17						=cut
18
19						our $VERSION = '0.01';
20
21
22						=head1 SYNOPSIS
23
24						A parser, by nature, converts a text stream into a tree structure. You can get it to do things I<other> than building a tree structure, but that is its
25						inherent nature, because we human beings parse our incoming text (in the form of an audio stream) into abstract syntax trees in our heads while understanding
26						things (well, depending on who you listen to, but it's a useful model). And of course, our computers work the same way. So, in the
27						process of getting from text - that is, code - into actions taken, every computer program or data structure goes through a phase of being an abstract
28						tree, even if only in a potential sense.
29
30						Well, C happens to be I of trees, so naturally that's the default output of a parser built into this language. Now note:
31						you can also define a parser that, given some text, outputs a callable code object. The Perl parser is used in just such a manner, and is the default parser
32						for code in C. But because C tries to be as flexible as possible, you can override that, either at the global level or in any
33						particular code block, as I'll illustrate below. So if you build a parser that returns a callable object in some way (whether by building code in Perl,
34						or by doing something fancy with L for all I know), then you can use it to define code objects on the fly.
35
36						The parsers in C are based on those in Mark Jason Dominus' marvelous, marvelous book I<Higher-Order Perl>. The standard setup in I<Higher-Order Perl>
37						is to define a lexer (to break the text up into tokens), then to pass the token stream through the parser itself. This makes the entire process a lot
38						easier to organize, and since tokenization is already a useful tool, I decided to go with it.
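
The lexer stage described above can be sketched in plain Perl. This is only an illustration, not the tokenizer this module actually builds: the C<tokenize> function and the C<@token_table> layout are hypothetical names, and the patterns are borrowed from the regex example in the next section. Each pattern is tried in order at the current position, and the first one that matches produces a C<[TYPE, text]> pair.

```perl
use strict;
use warnings;

# Hypothetical token table: [ TYPE, pattern ] pairs, tried in order.
my @token_table = (
    [ ATOM  => qr/\\x[0-9a-fA-F]{0,2}|\\\d+|\\./ ],
    [ PAREN => qr/[()]/ ],
    [ QUANT => qr/[*+?]/ ],
    [ BAR   => qr/\|/ ],
    [ ATOM  => qr/./s ],    # catch-all: any single character
);

# Break the input into a list of [TYPE, text] pairs.
sub tokenize {
    my ($input) = @_;
    my @tokens;
    pos($input) = 0;
    TOKEN: while (pos($input) < length $input) {
        for my $entry (@token_table) {
            my ($type, $re) = @$entry;
            # \G anchors at the current position; /c keeps pos() on failure.
            if ($input =~ /\G($re)/gc) {
                push @tokens, [ $type, $1 ];
                next TOKEN;
            }
        }
        die "No token matches at position " . pos($input) . "\n";
    }
    return @tokens;
}

my @tokens = tokenize('(a|b)+(c|d*)');
print join(' ', map { "$_->[0]($_->[1])" } @tokens), "\n";
```

Because the table is ordered, the escaped-atom pattern gets first refusal and the single-character ATOM only fires when nothing more specific matched. The parser proper then only has to deal with token types, not with raw characters.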
39
40						=head2 BASIC EXAMPLE: REGEX
41
42						Let's start with one of his examples, shall we? He provides a regexp parser on page 436 in Chapter 8. In C, it looks like this:
43
44						parse regex
45						tokens
46						ATOM "\\x[0-9a-fA-F]{0,2}|\\\d+|\\."
47						PAREN "[()]"
48						QUANT "[*+?]"
49						BAR "\|"
50						ATOM "."
51
52						rules
53						regex
54						alternative BAR regex
55						alternative
56						alternative
57						qatom alternative
58						(nothing)
59						qatom
60						atom QUANT
61						atom
62						atom
63						ATOM
64						"(" regex ")"
65
66						Given the input (a|b)+(c|d*) - see below for how to pass input to a parser - this returns the nodal structure
67
68						regex
69						alternative
70						qatom
71						atom
72						regex
73						atom
74						ATOM "a"
75						BAR "\|"
76						atom
77						ATOM "b"
78						QUANT "+"
79						atom
80						PAREN "("
81						regex
82						atom
83						ATOM "c"
84						BAR
85						qatom
86						atom
87						ATOM "d"
88						QUANT "*"
89						PAREN ")"
90
91						That's a pretty lanky structure, but it I<does> serve the purpose of getting text into a data structure
92						you can do stuff with (like searching it or walking it or passing it off to a template for some other purpose).
93
94						We can tweak it a little. If you tack an asterisk onto any tag in the grammar, the parser will omit that
95						level from the output tree:
96
97						parse regex
98						tokens
99						ATOM "\\x[0-9a-fA-F]{0,2}|\\\d+|\\."
100						PAREN "[()]"
101						QUANT "[*+?]"
102						BAR "\|"
103						ATOM "."
104
105						rules
106						regex
107						alternative* BAR regex
108						alternative*
109						alternative
110						qatom alternative*
111						(nothing)
112						qatom
113						atom QUANT
114						atom
115						atom
116						ATOM*
117						"("* regex* ")"*
118
119						Given the input (a|b)+(c|d*), this returns the nodal structure
120
121						regex
122						qatom
123						atom
124						atom "a"
125						BAR "|"
126						atom "b"
127						QUANT "+"
128						atom
129						atom "c"
130						BAR
131						qatom
132						atom "d"
133						QUANT "*"
134
135						That arguably preserves the semantics of the original regex, without keeping the syntactic overhead, and will probably be more useful.
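
To make the effect of the asterisk concrete, here is a minimal sketch of the pruning idea on a toy tree. It assumes a representation that is not the module's own: each node is an arrayref C<[tag, children...]>, leaves are plain strings, and the hypothetical C<prune> function takes a hash of tags to omit. An omitted node is replaced by its (already pruned) children, so its content moves up one level.

```perl
use strict;
use warnings;

# Toy tree: [ tag, children... ]; leaves are plain strings.
# Returns a list: the node itself, or its children if the tag is omitted.
sub prune {
    my ($node, $omit) = @_;
    return $node unless ref $node eq 'ARRAY';    # leaf text passes through
    my ($tag, @children) = @$node;
    my @kept = map { prune($_, $omit) } @children;
    return $omit->{$tag} ? @kept : [ $tag, @kept ];
}

# Render a tree on one line so the result is easy to inspect.
sub to_string {
    my ($node) = @_;
    return qq{"$node"} unless ref $node eq 'ARRAY';
    my ($tag, @children) = @$node;
    return "$tag(" . join(' ', map { to_string($_) } @children) . ")";
}

my $tree = [ regex => [ alternative => [ qatom => [ atom => [ ATOM => 'a' ] ] ] ] ];
my ($pruned) = prune($tree, { alternative => 1, ATOM => 1 });
print to_string($pruned), "\n";
```

Omitting C<ATOM> leaves the matched text hanging directly off C<atom>, which is the same shape as the C<atom "a"> lines in the pruned output above.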
136
137						Once our parser is defined, it becomes a new tag, so
138
139						regex "(a|b)+(c|d*)"
140
141						is now shorthand for the tree structure shown above. To insert at build time, we use
142
143						<= (regex) "(a|b)+(c|d*)"
144
145
146						For a longer non-callable macro insertion, we'll want a better example, but let's assume something like this:
147
148						<= (regex)
149						lkjlkjsdf
150						lkjljlksjdf
151						lkjljsdf
152
153						=head2 USING A TOKENIZER ALONE
154
155						A parser can also be run as a tokenizer alone, returning a stream of tokens. This is used for the text streams in L, where commands
156						can be interspersed into the text (that's still a work in progress, of course). If you override that parser, you can build PDFs using whatever text
157						stream formalism you find useful.
158
159						example here after it's written
160
161						To iterate over that stream, we treat it as a filter on a given text stream, like this:
162
163						do {
164						^foreach token in my_text|pdf_tokenizer {
165						if (ref $token eq 'ARRAY') {
166						# handle a command token
167						} else {
168						# we have a word
169						}
170						}
171						}
172
173						A token stream is a special type of stream, actually - the iterator returns strings for words, and arrayrefs for identified tokens, which are
174						generally equivalent to commands. This distinguishes it from normal data iterators, which return an arrayref for each row. I mention this because
175						it affects the way you build your ^foreach specification; a data iterator returning arrayrefs would allow you to provide two local variables, but
176						a token stream can't, because some of the tokens aren't arrayrefs.
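
The mixed nature of a token stream can be sketched with an ordinary closure. Nothing here is the module's API: C<make_token_stream> and C<describe_tokens> are hypothetical helpers, and the stream is assumed to return C<undef> when it is exhausted. The point is only the dispatch on C<ref>, which is why a single loop variable is all you can use.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a token stream: a closure that returns plain
# strings for words, arrayrefs for command tokens, and undef when exhausted.
sub make_token_stream {
    my @queue = @_;
    return sub { return shift @queue };
}

# Walk the stream and label each item by kind.
sub describe_tokens {
    my ($next) = @_;
    my @seen;
    # Test with defined() so that a word like "0" does not end the loop.
    while (defined(my $token = $next->())) {
        if (ref $token eq 'ARRAY') {
            my ($command) = @$token;    # first element names the command
            push @seen, "command:$command";
        } else {
            push @seen, "word:$token";
        }
    }
    return @seen;
}

my $stream = make_token_stream('Hello', [ 'bold', 'on' ], '0', 'world');
print "$_\n" for describe_tokens($stream);
```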
177
178						To call a tokenizer from outside C, you'd do something like this:
179
180						use Decl qw(-nofilter PDF::Declarative);
181
182						$tree = Decl->new;
183						$tree->load (<<EOF);
184						text my_text
185						...
186						EOF
187
188						$iterator = $tree->iterate ("my_text|pdf_tokenizer");
189						while (defined($token = $iterator->next)) {
190						if (ref $token eq 'ARRAY') {
191						# handle a command token
192						} else {
193						# we have a word
194						}
195						}
196
197
198						=head2 CALLABLE PARSED OBJECTS - EXAMPLE: CALCULATE
199
200						A parser can also skip right past the nodal structure stage, transforming your language directly into callable code. The I
201						example that best fits that model is the calculator; Dominus actually uses the calculator as his first example, but I thought the regexp was a simpler
202						initial example.
203
204						First, let's translate the I calculator grammar into C style, allowing it to generate a nodal structure. Even if you define a parser
205						to be able to build a callable object, its parse tree is still available if you ask for it explicitly, so even the decorated parser below, if used
206						in a non-callable context, will generate a parse tree for you. It's just easier to illustrate without the extra syntax.
207
208						grammar here
209
210						Now let's go ahead and add the specifications necessary to generate a callable object. These are mostly making use of the "actions" feature.
211
212						grammar here
213
214						Now we have a number of different ways to use this parser. First is simply as a parser to extract the parse tree of whatever we defined; I'll
215						skip that, because it was covered in the previous section.
216
217						Second, we can call it just like any other code-generating object, say as an event handler. The default parser for code snippets is "perl", of course,
218						but you can direct C to use any other code-generating parser like this:
219
220						on my_event calculate < {
221						something
222						}
223
224						That's pretty boring in this case, because the grammar we've defined doesn't permit us to use parameters, so we will always calculate the same thing.
225						Eventually, I'll need and use this feature in some actual application, and I'll try to remember to link to it here.
226
227						Finally, we can just call the parser from Perl, like this:
228
229						parser calculate
230						...
231
232						do {
233						print ^calculate ("1 + 2 * (4 - 5)") . "\n";
234						}
235
236						For simple parsers, this last case will probably be the most useful.
237
238
239						=head2 CALLING A PARSER FROM OUTSIDE CLASS::DECLARATIVE
240
241						Of course, we can also call the parser from outside C, like this:
242
243						use Decl (-nofilter);
244
245						$tree = Decl->new;
246						$tree->load (<<EOF);
247						parse calculate
248						...
249						EOF
250
251						$result = $tree->parser('calculate')->parse('1 + 2 * (4 - 5)');
252
253						Here, C<$result> gets the value of -1. If you call a non-code-generating parser like this, you'll get a Decl::Node structure back.
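
To show where the -1 comes from, here is a small stand-alone calculator in plain Perl. It is a hand-written recursive-descent sketch, not the parser this module generates, and the names C<calculate>, C<_expression>, C<_term> and C<_factor> are made up for the illustration. It computes values while it parses instead of building a tree first, which is the "callable" behaviour described above.

```perl
use strict;
use warnings;

# Evaluate an arithmetic expression with + - * / and parentheses.
sub calculate {
    my ($text) = @_;
    # Lexer: numbers and single-character operators. Anything else
    # (whitespace included) is silently skipped in this sketch.
    my @tokens = $text =~ m{\d+(?:\.\d+)?|[-+*/()]}g;
    my $value = _expression(\@tokens);
    die "Unexpected token '$tokens[0]'\n" if @tokens;
    return $value;
}

# expression := term (('+' | '-') term)*
sub _expression {
    my ($tokens) = @_;
    my $value = _term($tokens);
    while (@$tokens && $tokens->[0] =~ /^[-+]$/) {
        my $op  = shift @$tokens;
        my $rhs = _term($tokens);
        $value = $op eq '+' ? $value + $rhs : $value - $rhs;
    }
    return $value;
}

# term := factor (('*' | '/') factor)*
sub _term {
    my ($tokens) = @_;
    my $value = _factor($tokens);
    while (@$tokens && $tokens->[0] =~ /^[*\/]$/) {
        my $op  = shift @$tokens;
        my $rhs = _factor($tokens);
        $value = $op eq '*' ? $value * $rhs : $value / $rhs;
    }
    return $value;
}

# factor := NUMBER | '(' expression ')'
sub _factor {
    my ($tokens) = @_;
    my $token = shift @$tokens;
    die "Unexpected end of input\n" unless defined $token;
    if ($token eq '(') {
        my $value = _expression($tokens);
        my $close = shift @$tokens;
        die "Expected ')'\n" unless defined $close && $close eq ')';
        return $value;
    }
    die "Expected a number, got '$token'\n" unless $token =~ /^\d/;
    return $token;
}

print calculate('1 + 2 * (4 - 5)'), "\n";
```

Splitting the grammar into expression, term and factor levels is what gives multiplication its higher precedence: C<2 * (4 - 5)> is fully evaluated as a term before the addition sees it.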
254
255
256						=head2 EXAMPLE: CLASS::DECLARATIVE'S OWN PARSER
257						The standard parser for a C line is this:
258
259						parse Dline
260						tokens
261						WORD
262						LPAREN "\("
263						RPAREN "\)"
264						LBRACK "\["
265						RBRACK "\]"
266						COMMA ","
267						EQUALS "="
268						STRING
269						PARSEFLAG "<"
270						actions
271						rules
272
273						That's the actual parser used by default in C. You can override the line parser for a given tag; we use this for the 'select'
274						tag, for instance. The indentation structure and bracketing is currently handled by C, and that probably won't change (but you never
275						know).
276
277						=head2 EXAMPLE: SELECT PARSER
278
279						The select tag uses SQL to retrieve information from data iterators, and since SQL is, well, a standard query language (kind of), it's supported
280						natively in C, mostly because we already have this fancy parser just sitting around ready to do that kind of thing. The nice thing,
281						of course, is that means you don't have to write an SQL parser, because I've already done it for you.
282
283						parse SQLselect
284
285						=head1 IMPLEMENTATION
286
287						This particular class implements the C node in the specification structure; the class L implements the parser itself.
288						In other words, here we are concerned with building a C object that will then be asked to do actual parsing. The tags claimed
289						by user-defined parsers are also registered in this phase, constituting macros.
290
291						=head2 defines(), tags_defined()
292
293						Called by Decl::Semantics during import, to find out what xmlapi tags this plugin claims to implement.
294
295						=cut
296	0		0	1	0	sub defines { ('parse') }
297	12		12	1	3482	sub tags_defined { Decl->new_data(<<EOF); }
298						parse (body=vanilla)
299						EOF
300
301						=head2 build_payload ()
302
303						The C<build_payload> function is then called when this object's payload is built (i.e. in the stage when we're adding semantics to our
304						parsed syntax). It builds the parser and registers its tag with the application. Instances are handled by L.
305
306						=cut
307
308						sub build_payload {
309	0		0	1		my ($self) = @_;
310
311	0					my $p = Decl::Parser->new();
312	0					$self->{payload} = $p;
313
314	0					my $t = $self->find ('tokens');
315	0					foreach ($t->elements) {
316	0					$p->add_tokenizer ($_->name, $_->label); # TODO: error handling and default definitions for selected tokenizers
317						}
318
319	0	0				if ($self->name) {
320	0					my $root = $self->root();
321	0		0			$root->build_handler($self->name . "*", "", sub { Decl::Macro->new($self, @_) });
	0
322						}
323						}
324
325
326						=head1 AUTHOR
327
328						Michael Roberts, C<< >>
329
330						=head1 BUGS
331
332						Please report any bugs or feature requests to C, or through
333						the web interface at L. I will be notified, and then you'll
334						automatically be notified of progress on your bug as I make changes.
335
336						=head1 LICENSE AND COPYRIGHT
337
338						Copyright 2010 Michael Roberts.
339
340						This program is free software; you can redistribute it and/or modify it
341						under the terms of either: the GNU General Public License as published
342						by the Free Software Foundation; or the Artistic License.
343
344						See http://dev.perl.org/licenses/ for more information.
345
346						=cut
347
348						1; # End of Decl::Semantics::Parse