File Coverage

lib/Decl/Semantics/Parse.pm
Criterion Covered Total %
statement 13 24 54.1
branch 0 2 0.0
condition n/a
subroutine 5 8 62.5
pod 3 3 100.0
total 21 37 56.7


line stmt bran cond sub pod time code
1             package Decl::Semantics::Parse;
2            
3 12     12   78 use warnings;
  12         27  
  12         482  
4 12     12   72 use strict;
  12         25  
  12         482  
5            
6 12     12   65 use base qw(Decl::Node);
  12         27  
  12         1239  
7 12     12   73 use Iterator::Simple qw(:all);
  12         28  
  12         6887  
8            
9             =head1 NAME
10            
11             Decl::Semantics::Parse - implements a parser specification.
12            
13             =head1 VERSION
14            
15             Version 0.01
16            
17             =cut
18            
19             our $VERSION = '0.01';
20            
21            
22             =head1 SYNOPSIS
23            
24             A parser, by nature, converts a text stream into a tree structure. You can get it to do things I<other> than building a tree structure, but that is its
25             inherent nature, because we human beings parse our incoming text (in the form of an audio stream) into abstract syntax trees in our heads while understanding
26             things (well, depending on who you listen to, but it's a useful model). And of course, our computers work the same way. So I<somewhere> in the
27             process of getting from text - that is, code - into actions taken, every computer program or data structure goes through a phase of being an abstract
28             tree, even if only in a potential sense.
29            
30             Well, C<Decl> happens to be I<made> of trees, so naturally that's the default output of a parser built into this language. Now note:
31             you can also define a parser that, given some text, outputs a callable code object. The Perl parser is used in just such a manner, and is the default parser
32             for code in C<Decl>. But because C<Decl> tries to be as flexible as possible, you can override that, either at the global level or in any
33             particular code block, as I'll illustrate below. So if you build a parser that returns a callable object in some way (whether by building code in Perl,
34             or by doing something fancy with L for all I know), then you can use it to define code objects on the fly.
35            
36             The parsers in C<Decl> are based on those in Mark Jason Dominus' marvelous, marvelous book I<Higher-Order Perl>. The standard setup in I<Higher-Order Perl>
37             is to define a lexer (to break the text up into tokens), then to pass the token stream through the parser itself. This makes the entire process a lot
38             easier to organize, and since tokenization is already a useful tool, I decided to go with it.
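
The lexer stage described above can be sketched in plain Perl. This is only an illustration of the idea, not the code used by this module: the C<tokenize> function and the C<@token_defs> table are hypothetical names, and the patterns are borrowed from the regex example in the next section. Each definition is tried in order at the current position, and the first one that matches produces a token.

```perl
use strict;
use warnings;

# Hypothetical token table: [LABEL, pattern], tried in order at each position.
my @token_defs = (
    [ ATOM  => qr/\\x[0-9a-fA-F]{0,2}|\\\d+|\\./ ],
    [ PAREN => qr/[()]/ ],
    [ QUANT => qr/[*+?]/ ],
    [ BAR   => qr/[|]/ ],
    [ ATOM  => qr/./s ],    # catch-all: any single character
);

# Break $text into [LABEL, value] pairs; the first matching definition wins.
sub tokenize {
    my ($text) = @_;
    my @tokens;
    pos($text) = 0;
    TOKEN: while (pos($text) < length $text) {
        for my $def (@token_defs) {
            my ($label, $re) = @$def;
            # \G anchors at the current position; /c keeps pos() on failure.
            if ($text =~ /\G($re)/gc) {
                push @tokens, [ $label, $1 ];
                next TOKEN;
            }
        }
        die "No token matches at position " . pos($text) . "\n";
    }
    return @tokens;
}

# Print each token as LABEL:value, one per line.
print "$_->[0]:$_->[1]\n" for tokenize('(a|b)+');
```

The parser proper then consumes this list of labelled tokens instead of raw characters, which is what keeps the grammar rules short.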
39            
40             =head2 BASIC EXAMPLE: REGEX
41            
42             Let's start with one of his examples, shall we? He provides a regexp parser on page 436 in Chapter 8. In C<Decl>, it looks like this:
43            
44                parse regex
45                   tokens
46                      ATOM "\\x[0-9a-fA-F]{0,2}|\\\d+|\\."
47                      PAREN "[()]"
48                      QUANT "[*+?]"
49                      BAR "|"
50                      ATOM "."
51            
52                   rules
53                      regex
54                         alternative BAR regex
55                         alternative
56                      alternative
57                         qatom alternative
58                         (nothing)
59                      qatom
60                         atom QUANT
61                         atom
62                      atom
63                         ATOM
64                         "(" regex ")"
65            
66             Given the input (a|b)+(c|d*) - see below for how to pass input to a parser - this returns the nodal structure
67            
68             regex
69             alternative
70             qatom
71             atom
72             regex
73             atom
74             ATOM "a"
75             BAR "|"
76             atom
77             ATOM "b"
78             QUANT "+"
79             atom
80             PAREN "("
81             regex
82             atom
83             ATOM "c"
84             BAR
85             qatom
86             atom
87             ATOM "d"
88             QUANT "*"
89             PAREN ")"
90            
91             That's a pretty lanky structure, but it I<does> serve the purpose of getting text into a data structure
92             you can do stuff with (like searching it or walking it or passing it off to a template for some other purpose).
93            
94             We can tweak it a little. If you tack an asterisk onto any tag in the grammar, the output will omit that
95             level from the output tree:
96            
97                parse regex
98                   tokens
99                      ATOM "\\x[0-9a-fA-F]{0,2}|\\\d+|\\."
100                      PAREN "[()]"
101                      QUANT "[*+?]"
102                      BAR "|"
103                      ATOM "."
104            
105                   rules
106                      regex
107                         alternative* BAR regex
108                         alternative*
109                      alternative
110                         qatom alternative*
111                         (nothing)
112                      qatom
113                         atom QUANT
114                         atom
115                      atom
116                         ATOM*
117                         "("* regex* ")"*
118            
119             Given the input (a|b)+(c|d*), this returns the nodal structure
120            
121             regex
122             qatom
123             atom
124             atom "a"
125             BAR "|"
126             atom "b"
127             QUANT "+"
128             atom
129             atom "c"
130             BAR
131             qatom
132             atom "d"
133             QUANT "*"
134            
135             That arguably preserves the semantics of the original regex, without keeping the syntactic overhead, and will probably be more useful.
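
One plausible reading of "omit that level" is that a starred node is replaced by its own children in the output tree. The sketch below illustrates that reading only. The C<prune> function and the C<[name, children...]> node layout are assumptions made for this example; they are not how Decl::Node stores trees, and the real parser may also merge token values into the parent.

```perl
use strict;
use warnings;

# A node is [name, @children]; a leaf child is a plain string.
# Any node whose name appears in %$omit is replaced by its (pruned) children.
sub prune {
    my ($node, $omit) = @_;
    return $node unless ref $node;              # leaves pass through untouched
    my ($name, @children) = @$node;
    my @kept = map { prune($_, $omit) } @children;
    return $omit->{$name} ? @kept : [ $name, @kept ];
}

my $tree = [ 'regex',
    [ 'alternative',
        [ 'qatom', [ 'atom', 'a' ], 'QUANT' ],
    ],
];

# Drop the 'alternative' level, as "alternative*" would in the grammar.
my ($pruned) = prune($tree, { alternative => 1 });
print "$pruned->[0] -> $pruned->[1][0]\n";
```

After pruning, the C<qatom> node hangs directly off C<regex>, which is the flattening effect described above.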
136            
137             Once our parser is defined, it becomes a new tag, so
138            
139             regex "(a|b)+(c|d*)"
140            
141             is now shorthand for the tree structure shown above. To insert at build time, we use
142            
143             <= (regex) "(a|b)+(c|d*)"
144            
145            
146             For a longer non-callable macro insertion, we'll want a better example, but let's assume something like this:
147            
148             <= (regex)
149             lkjlkjsdf
150             lkjljlksjdf
151             lkjljsdf
152            
153             =head2 USING A TOKENIZER ALONE
154            
155             A parser can also be run as a tokenizer alone, returning a stream of tokens. This is used for the text streams in L<PDF::Declarative>, where commands
156             can be interspersed into the text (that's still a work in progress, of course). If you override that parser, you can build PDFs using whatever text
157             stream formalism you find useful.
158            
159             example here after it's written
160            
161             To iterate over that stream, we treat it as a filter on a given text stream, like this:
162            
163                do {
164                   ^foreach token in my_text|pdf_tokenizer {
165                      if (ref $token eq 'ARRAY') {
166                         # handle a command token
167                      } else {
168                         # we have a word
169                      }
170                   }
171                }
172            
173             A token stream is a special type of stream, actually - the iterator returns strings for words, and arrayrefs for identified tokens, which are
174             generally equivalent to commands. This distinguishes it from normal data iterators, which return an arrayref for each row. I mention this because
175             it affects the way you build your ^foreach specification; a data iterator returning arrayrefs would allow you to provide two local variables, but
176             a token stream can't, because some of the tokens aren't arrayrefs.
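
The mixed nature of a token stream can be shown without any of the Decl machinery. In this sketch C<make_token_stream> and the C<BOLD> command are hypothetical stand-ins: the iterator is a plain closure that hands back strings for words and arrayrefs for commands, so the consumer has to test each item with C<ref> rather than unpack it into several variables.

```perl
use strict;
use warnings;

# Hypothetical stream: plain strings are words, arrayrefs are command tokens.
sub make_token_stream {
    my @queue = @_;
    return sub { return shift @queue };    # returns undef once exhausted
}

my $next = make_token_stream('Hello', [ 'BOLD', 'on' ], 'world', [ 'BOLD', 'off' ]);

my (@words, @commands);
while (defined(my $token = $next->())) {
    if (ref $token eq 'ARRAY') {
        push @commands, join(':', @$token);    # handle a command token
    } else {
        push @words, $token;                   # we have a word
    }
}

print "words: @words\n";
print "commands: @commands\n";
```

A row-oriented data iterator, by contrast, could be destructured directly, because every item it returns is an arrayref.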
177            
178             To call a tokenizer from outside C<Decl>, you'd do something like this:
179            
180             use Decl qw(-nofilter PDF::Declarative);
181            
182             $tree = new Decl;
183             $tree->load (<<EOF);
184             text my_text
185             ...
186             EOF
187            
188             $iterator = $tree->iterate ("my_text|pdf_tokenizer");
189             while ($token = $iterator->next) {
190             if (ref $token eq 'ARRAY') {
191             # handle a command token
192             } else {
193             # we have a word
194             }
195             }
196            
197            
198             =head2 CALLABLE PARSED OBJECTS - EXAMPLE: CALCULATE
199            
200             A parser can also skip right past the nodal structure stage, transforming your language directly into callable code. The I<Higher-Order Perl>
201             example that best fits that model is the calculator; Dominus actually uses the calculator as his first example, but I thought the regexp was a simpler
202             initial example.
203            
204             First, let's translate the I<Higher-Order Perl> calculator grammar into C<Decl> style, allowing it to generate a nodal structure. Even if you define a parser
205             to be able to build a callable object, its parse tree is still available if you ask for it explicitly, so even the decorated parser below, if used
206             in a non-callable context, will generate a parse tree for you. It's just easier to illustrate without the extra syntax.
207            
208             grammar here
209            
210             Now let's go ahead and add the specifications necessary to generate a callable object. These are mostly making use of the "actions" feature.
211            
212             grammar here
213            
214             Now we have a number of different ways to use this parser. First is simply as a parser to extract the parse tree of whatever we defined; I'll
215             skip that, because it was covered in the previous section.
216            
217             Second, we can call it just like any other code-generating object, say as an event handler. The default parser for code snippets is "perl", of course,
218             but you can direct C<Decl> to use any other code-generating parser like this:
219            
220             on my_event calculate < {
221             something
222             }
223            
224             That's pretty boring in this case, because the grammar we've defined doesn't permit us to use parameters, so we will always calculate the same thing.
225             Eventually, I'll need and use this feature in some actual application, and I'll try to remember to link to it here.
226            
227             Finally, we can just call the parser from Perl, like this:
228            
229             parser calculate
230             ...
231            
232             do {
233             print ^calculate ("1 + 2 * (4 - 5)") . "\n";
234             }
235            
236             For simple parsers, this last case will probably be the most useful.
237            
238            
239             =head2 CALLING A PARSER FROM OUTSIDE CLASS::DECLARATIVE
240            
241             Of course, we can also call the parser from outside C<Decl>, like this:
242            
243             use Decl (-nofilter);
244            
245             $tree = new Decl;
246             $tree->load (<<EOF);
247             parse calculate
248             ...
249             EOF
250            
251             $result = $tree->parser('calculate')->parse('1 + 2 * (4 - 5)');
252            
253             Here, C<$result> gets the value of -1. If you call a non-code-generating parser like this, you'll get a Decl::Node structure back.
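
Since the calculator grammar itself is not written out here, the following stand-alone sketch shows the kind of evaluation such a parser performs and why the expression above yields -1. It is a hand-written recursive-descent evaluator in plain Perl, not the Decl grammar; C<calculate> and its three helper closures are names chosen for this example.

```perl
use strict;
use warnings;

# Evaluate +, -, *, / and parentheses over non-negative numeric literals.
# Characters that are not part of a number or operator are silently skipped.
sub calculate {
    my ($text) = @_;
    my @tokens = $text =~ /\d+(?:\.\d+)?|[-+*\/()]/g;

    my ($expr, $term, $factor);

    # factor: number | "(" expr ")" | "-" factor
    $factor = sub {
        my $tok = shift @tokens;
        die "Unexpected end of input\n" unless defined $tok;
        if ($tok eq '(') {
            my $value = $expr->();
            my $close = shift @tokens;
            die "Expected ')'\n" unless defined $close && $close eq ')';
            return $value;
        }
        return -$factor->() if $tok eq '-';
        die "Unexpected token '$tok'\n" unless $tok =~ /^\d/;
        return $tok;
    };

    # term: factor (("*" | "/") factor)*
    $term = sub {
        my $value = $factor->();
        while (@tokens && $tokens[0] =~ m{^[*/]$}) {
            my $op  = shift @tokens;
            my $rhs = $factor->();
            $value = $op eq '*' ? $value * $rhs : $value / $rhs;
        }
        return $value;
    };

    # expr: term (("+" | "-") term)*
    $expr = sub {
        my $value = $term->();
        while (@tokens && $tokens[0] =~ /^[-+]$/) {
            my $op  = shift @tokens;
            my $rhs = $term->();
            $value = $op eq '+' ? $value + $rhs : $value - $rhs;
        }
        return $value;
    };

    my $result = $expr->();
    die "Trailing input\n" if @tokens;
    return $result;
}

print calculate('1 + 2 * (4 - 5)'), "\n";    # prints -1
```

The parenthesised C<4 - 5> is evaluated first (giving -1), multiplication binds tighter than addition (2 * -1 = -2), and the final sum is 1 + -2 = -1.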
254            
255            
256             =head2 EXAMPLE: CLASS::DECLARATIVE'S OWN PARSER
257             The standard parser for a C<Decl> line is this:
258            
259             parse Dline
260             tokens
261             WORD
262             LPAREN "\("
263             RPAREN "\)"
264             LBRACK "\["
265             RBRACK "\]"
266             COMMA ","
267             EQUALS "="
268             STRING
269             PARSEFLAG "<"
270             actions
271             rules
272            
273             That's the actual parser used by default in C<Decl>. You can override the line parser for a given tag; we use this for the 'select'
274             tag, for instance. The indentation structure and bracketing is currently handled by C, and that probably won't change (but you never
275             know).
276            
277             =head2 EXAMPLE: SELECT PARSER
278            
279             The select tag uses SQL to retrieve information from data iterators, and since SQL is, well, a standard query language (kind of), it's supported
280             natively in C<Decl>, mostly because we already have this fancy parser just sitting around ready to do that kind of thing. The nice thing,
281             of course, is that means you don't have to write an SQL parser, because I've already done it for you.
282            
283             parse SQLselect
284            
285             =head1 IMPLEMENTATION
286            
287             This particular class implements the C<parse> node in the specification structure; the class L<Decl::Parser> implements the parser itself.
288             In other words, here we are concerned with building a C<Decl::Parser> object that will then be asked to do actual parsing. The tags claimed
289             by user-defined parsers are also registered in this phase, constituting macros.
290            
291             =head2 defines(), tags_defined()
292            
293             Called by Decl::Semantics during import, to find out what xmlapi tags this plugin claims to implement.
294            
295             =cut
296 0     0 1 0 sub defines { ('parse') }
297 12     12 1 3482 sub tags_defined { Decl->new_data(<<EOF); }
298             parse (body=vanilla)
299             EOF
300            
301             =head2 build_payload ()
302            
303             The C<build_payload> function is then called when this object's payload is built (i.e. in the stage when we're adding semantics to our
304             parsed syntax). It builds the parser and registers its tag with the application. Instances are handled by L<Decl::Macro>.
305            
306             =cut
307            
308             sub build_payload {
309 0     0 1   my ($self) = @_;
310            
311 0           my $p = Decl::Parser->new();
312 0           $self->{payload} = $p;
313            
314 0           my $t = $self->find ('tokens');
315 0           foreach ($t->elements) {
316 0           $p->add_tokenizer ($_->name, $_->label); # TODO: error handling and default definitions for selected tokenizers
317             }
318            
319 0 0         if ($self->name) {
320 0           my $root = $self->root();
321 0     0     $root->build_handler($self->name . "*", "", sub { Decl::Macro->new($self, @_) });
  0            
322             }
323             }
324            
325            
326             =head1 AUTHOR
327            
328             Michael Roberts, C<< >>
329            
330             =head1 BUGS
331            
332             Please report any bugs or feature requests to C, or through
333             the web interface at L. I will be notified, and then you'll
334             automatically be notified of progress on your bug as I make changes.
335            
336             =head1 LICENSE AND COPYRIGHT
337            
338             Copyright 2010 Michael Roberts.
339            
340             This program is free software; you can redistribute it and/or modify it
341             under the terms of either: the GNU General Public License as published
342             by the Free Software Foundation; or the Artistic License.
343            
344             See http://dev.perl.org/licenses/ for more information.
345            
346             =cut
347            
348             1; # End of Decl::Semantics::Parse