File Coverage

blib/lib/Pod/Cats.pm
Criterion Covered Total %
statement 120 138 86.9
branch 18 26 69.2
condition 23 32 71.8
subroutine 15 20 75.0
pod 9 9 100.0
total 185 225 82.2


line stmt bran cond sub pod time code
1             package Pod::Cats;
2              
3 5     5   143689 use warnings;
  5         12  
  5         171  
4 5     5   25 use strict;
  5         11  
  5         144  
5 5     5   133 use 5.010;
  5         22  
  5         218  
6              
7 5     5   2769 use Pod::Cats::Parser::MGC;
  5         20  
  5         178  
8 5     5   52 use List::Util qw(min max);
  5         10  
  5         695  
9 5     5   27 use Carp;
  5         12  
  5         12900  
10              
11             =head1 NAME
12              
13             Pod::Cats - The POD-like markup language written for podcats.in
14              
15             =head1 VERSION
16              
17             Version 0.06
18              
19             =head1 DESCRIPTION
20              
21             POD is an expressive markup language - like Perl is an expressive programming
22             language - and for a plain text file format there is little finer. Pod::Cats is
23             an extension of the POD semantics that adds more syntax and more flexibility to
24             the language.
25              
26             Pod::Cats is designed to be extended and doesn't implement any default
27             commands or entities.
28              
29             =head1 SYNTAX
30              
31             Pod::Cats syntax borrows ideas from POD and adds its own.
32              
33             A paragraph is any block of text delimited by blank lines (whitespace ignored).
34             This is the same as POD, and basically allows you to use hard word wrapping in
35             your markup without having to join them all together for output later.
36              
37             There are three command paragraphs, which are defined by their first character.
38             This character must be in the first column; whitespace at the start of a
39             paragraph is syntactically relevant.
40              
41             =over 4
42            
43             =item C<=COMMAND CONTENT>
44             X
45              
46             A line beginning with the C<=> symbol denotes a single I. Usually this
47             will be some sort of header, perhaps the equivalent of a C<<
>>, something
48             like that. It is roughly equivalent to the self-closing tag in XML. B
49             is just text that may or may not be present. The relationship of B to
50             the B is for you to define, as is the meaning of B.
51              
52             When a C<=COMMAND> block is completed, it is passed to L.
53              
54             =item C<+NAME CONTENT>
55             X
56              
57             A line beginning with C<+> opens a named block; its name is B. Similar to
58             C<=COMMAND>, the B is arbitrary, and its relationship to the B of
59             the block is up to you.
60              
61             When this is encountered you are invited to L.
62              
63             =item C<-NAME>
64             X
65              
66             A line beginning with C<-> is the end of the named block previously started.
67             These must match in reverse order to the C<+> block with the matching B -
68             basically the same as XML's pairs. It is passed to L,
69             and unlike the other two command paragraphs it accepts no content.
70              
71             =back
72              
73             Then there are two types of text paragraph, for which the text is not
74             syntactically relevant but whitespace still is:
75              
76             =over 4
77              
78             =item Verbatim paragraphs
79              
80             A line whose first character is whitespace is considered verbatim. No removal of
81             whitespace is done to the rest of the paragraph if the first character is
82             whitespace; all your text is repeated verbatim, hence the name
83              
84             The verbatim paragraph continues until the first non-verbatim paragraph is
85             encountered. A blank line is no longer considered to end the paragraph.
86             Therefore, two verbatim paragraphs can only be separated by a non-verbatim
87             paragraph with non-whitespace content. The special formatting code C<< ZZ<><> >>
88             can be used on its own to separate them with zero-width content.
89              
90             All lines in the verbatim paragraph will have their leading whitespace removed.
91             This is done intelligently: the I amount of leading whitespace found on
92             any line is removed from all lines. This allows you to indent other lines (even
93             the first one) relative to the syntactic whitespace that defines the verbatim
94             paragraph without your indentation being parsed out.
95              
96             L are not parsed in verbatim paragraphs, as expected.
97              
98             When a verbatim paragraph has been collated, it is passed to L.
99              
100             =item Paragraphs
101              
102             Everything that doesn't get caught by one of the above rules is deemed to be a
103             plain text paragraph. As with all paragraphs, a single line break is removed by
104             the parser and a blank line causes the paragraph to be processed. It is passed
105             to L.
106              
107             =back
108              
109             And finally the inline formatting markup, entities.
110              
111             =over
112              
113             =item C<< XZ<><> >>
114             X X
115              
116             An entity is defined as a capital letter followed by a delimiter that is
117             repeated n times, then any amount of text up to a matching quantity of a
118             balanced delimiter.
119              
120             In normal POD the only delimiter is C<< < >>, so entities have the format C<<
121             XZ<><> >>; except that the opening delimiter may be duplicated as long as the
122             closing delimiter matches, allowing you to put the delimiter itself inside the
123             entity: C<<< XZ<><<>> >>>; in Pod::Cats you can use any delimiter, removing the
124             requirement to duplicate it at all: C<< C[ XZ<><> ] >>.
125              
126             Once an entity has begun, nested entities are only considered if the delimiters
127             are the same as those used for the outer entity: C<< B[ I[bold-italic] ] >>;
128             C<< B[IZ<>] >>.
129              
130             Apart from the special entity C<< ZZ<><> >>, the letter used for the entity has
131             no inherent meaning to Pod::Cats. The parsed entity is provided to
132             L. C<< ZZ<><> >> retains its meaning from POD, which is to be a
133             zero-width 'divider' to break up things that would otherwise be considered
134             syntax. You are not given C<< ZZ<><> >> to handle, and C<< ZZ<><> >> itself will
135             produce undef if it is the only content to an element. A paragraph comprising solely
136             C<< ZZ<><> >> will never generate a parsed paragraph; it will be skipped.
137              
138             =back
139              
140             =head1 METHODS
141              
142             =cut
143              
144             our $VERSION = '0.06';
145              
146             =head2 new
147              
148             Create a new parser. Options are provided as a hashref, but there is currently
149             only one:
150              
151             =over
152              
153             =item delimiters
154              
155             A string containing delimiters to use. Bracketed delimiters will be balanced;
156             other delimiters will simply be used as-is. This echoes the delimiter philosophy
157             of Perl syntax such as regexes and C. The string should be all the possible
158             delimiters, listed once each, and only the opening brackets of balanced pairs.
159              
160             The default is C<< '<' >>, same as POD.
161              
162             =back
163              
164             =cut
165              
166             sub new {
167 4     4 1 104 my $class = shift;
168 4   100     26 my $opts = shift || {};
169 4         14 my $self = bless $opts, $class; # FIXME
170              
171 4         14 return $self;
172             }
173              
174             =head2 parse
175              
176             Parses a string containing whatever Pod::Cats code you have.
177              
178             =cut
179              
180             sub parse {
181 0     0 1 0 my ($self, $string) = @_;
182              
183 0         0 return $self->parse_lines(split /\n/, $string);
184             }
185              
186             =head2 parse_file
187              
188             Opens the file given by filename and reads it all in and then parses that.
189              
190             =cut
191              
192             sub parse_file {
193 0     0 1 0 my ($self, $filename) = @_;
194            
195 0 0       0 carp "File not found: " . $filename unless -e $filename;
196              
197 0         0 open my $fh, "<", $filename;
198 0         0 chomp(my @lines = <$fh>);
199 0         0 close $fh;
200              
201 0         0 return $self->parse_lines(@lines);
202             }
203              
204             =head2 parse_lines
205              
206             L and L both come here, which just takes the markup text
207             as an array of lines and parses them. This is where the logic happens. It is
208             exposed publicly so you can parse an array of your own if you want.
209              
210             =cut
211              
212             sub parse_lines {
213 4     4 1 138 my ($self, @lines) = @_;
214              
215 4         11 my $result = "";
216              
217             # The buffer type goes in the first element, and its
218             # contents, if any, in the rest.
219 4         8 my @buffer;
220 4         34 $self->{dom} = [];
221              
222             # Special lines are:
223             # - a blank line. An exception is between verbatim paragraphs, so we will
224             # simply re-merge verbatim paras later on
225             # - A line starting with =, + or -. Command paragraph. Process the previous
226             # buffer and start a new one with this.
227             # - Anything else continues the previous buffer, or starts a normal paragraph
228              
229 4         40 shift @lines while $lines[0] !~ /\S/; # shift off leading blank lines!
230              
231 4         11 for my $line (@lines) {
232 57         76 for ($line) {
233 57         131 when (/^\s*$/) {
234 25         83 $self->_process_buffer(@buffer);
235 25         69 @buffer = ();
236             }
237 32         60 when (/^([=+-])/) {
238 3         9 my $type = $1;
239 3 50       8 if (@buffer) {
240 0         0 warn "$type command found without leading blank line.";
241              
242 0         0 $self->_process_buffer(@buffer);
243 0         0 @buffer = ();
244             }
245              
246 3 50       18 push @buffer, {
247             '+' => 'begin',
248             '-' => 'end',
249             '=' => 'command',
250             }->{$type} or die "Don't know what to do with $type";
251              
252             # find and push the command name onto it; the rest is the first
253             # bit of buffer contents.
254 3         63 push @buffer, grep {$_} ($line =~ /^\Q$type\E(.+?)\b\s*(.*)$/);
  6         19  
255             }
256 29         48 when (/^\s+\S/) {
257 10 100       23 push @buffer, "verbatim" if !@buffer;
258 10         26 push @buffer, $line;
259             }
260 19         23 default {
261             # Nothing special, continue previous buffer or start a paragraph.
262 19 100       57 push @buffer, "paragraph" if !@buffer;
263 19         55 push @buffer, $line;
264             }
265             }
266             }
267              
268 4 50       25 $self->_process_buffer(@buffer) if @buffer;
269 4         32 $self->_postprocess_dom();
270              
271 4         41 $self->_postprocess_paragraphs();
272 4         2273 return $self->{dom};
273             }
274              
275             # Adds the buffer and some metadata to the DOM, returning nothing.
276             sub _process_buffer {
277 29     29   125 my ($self, @buffer) = @_;
278              
279 29 50       64 return '' unless @buffer;
280              
281 29         37 my $buffer_type = shift @buffer;
282            
283 29         69 my $node = {
284             type => $buffer_type
285             };
286              
287 29         46 for ($buffer_type) {
288 29         71 when('paragraph') {
289             # concatenate the lines and normalise whitespace.
290 17         36 my $para = join " ", @buffer;
291 17         114 $node->{content} = $para;
292             }
293 12         38 when('verbatim') {
294             # find the lowest level of indentation in this buffer and strip it
295 9         12 my $indent_level = min map { /^(\s+)/; length $1 } @buffer;
  10         20  
  10         38  
296 9         20 $node->{content} = join "\n", @buffer;
297 9         24 $node->{indent_level} = $indent_level;
298             }
299 3   100     12 when($_ eq 'command' || $_ eq 'begin') {
300 2         10 $node->{name} = shift @buffer;
301 2         5 my $content = join " ", @buffer;
302 2         5 $node->{content} = $content;
303             }
304 1         3 when('end') {
305 1         4 $node->{name} = shift @buffer; # end tags take no content
306             }
307             }
308              
309 29         41 push @{$self->{dom}}, $node;
  29         72  
310             }
311              
312             # This is basically just to merge verbatims together
313             sub _postprocess_dom {
314 4     4   9 my $self = shift;
315              
316 4         7 my @new_dom;
317             my $last_node;
318 4         9 for my $node (@{$self->{dom}}) {
  4         20  
319 29 100 50     63 $last_node = $node and next unless defined $last_node;
320              
321             # Don't change the last node until we stop finding verbatims.
322             # That way we can keep using it as the concatenated node.
323 25 100 100     89 if ($last_node->{type} eq 'verbatim' && $node->{type} eq 'verbatim') {
324             # The smallest indent level is considered the level for the merged node.
325 4         9 $last_node->{indent_level} =
326             min( $last_node->{indent_level}, $node->{indent_level});
327 4         9 $last_node->{content} .= "\n\n" . $node->{content};
328              
329             } else {
330             # Node type changed, push old one
331 21 100       73 if ($last_node->{type} eq 'verbatim') {
332 5         6 my $to_remove = $last_node->{indent_level};
333 5 50       54 $last_node->{content} =~ s/^ {$to_remove}//mg if $to_remove;
334             }
335 21         32 push @new_dom, $last_node;
336 21         36 $last_node = $node;
337             }
338             }
339              
340 4         15 push @new_dom, $last_node;
341 4         13 $self->{dom} = \@new_dom;
342             }
343              
344             # Now is the sax-like bit, where it goes through and fires the user's events for
345             # the various types. TODO: what's the point in sax-like if you already made a
346             # DOM? Make this part of the parsing process and create the DOM out of the SAX.
347             sub _postprocess_paragraphs {
348 4     4   7 my $self = shift;
349              
350 4         14 for my $node (@{ $self->{dom} }) {
  4         19  
351 25         14754 for ($node->{type}) {
352 25         60 when ('paragraph') {
353             # If _process_entities gives us undef, that was a single Z<>, which should not
354             # generate a new paragraph.
355 17   100     67 $node->{content} = $self->_process_entities($node->{content}) // next;
356 16         27 $self->handle_paragraph(@{ $node->{content} });
  16         66  
357             }
358 8         16 when ('begin') {
359 1         4 $node->{content} = $self->_process_entities($node->{content});
360             # Check for balance later
361 1         2 push @{$self->{begin_stack}}, $node->{name};
  1         4  
362              
363 1   50     3 $self->handle_begin($node->{name}, @{ $node->{content} // [] });
  1         21  
364             }
365 7         14 when ('end') {
366 1         6 warn "$node->{name} is ended out of sync!"
367 1 50       2 if pop @{$self->{begin_stack}} ne $node->{name};
368              
369 1         5 $self->handle_end($node->{name});
370             }
371 6         10 when ('command') {
372 1         8 $node->{content} = $self->_process_entities($node->{content});
373 1   50     4 $self->handle_command($node->{name}, @{ $node->{content} // [] });
  1         6  
374             }
375 5         9 when ('verbatim') {
376 5         16 $self->handle_verbatim($node->{content});
377             }
378             }
379             }
380             }
381              
382             =head2 handle_verbatim
383              
384             The verbatim paragraph as it was in the code, except with the minimum amount of
385             whitespace stripped from each line as described in L.
386             Passed in as a single string with line breaks preserved.
387              
388             Do whatever you want. Default is to return the string straight back atcha.
389              
390             =cut
391              
392             sub handle_verbatim {
393 5     5 1 25 shift;
394 5         12 shift;
395             }
396              
397             =head2 handle_entity
398              
399             Passed the letter of the L as the first argument and its content
400             as the rest of @_. The content will alternate between plain text and the return
401             value of this function for any nested entities inside this one.
402              
403             For this reason you should return a scalar from this method, be it text or a
404             ref. The default is to concatenate @_, thus replacing entities with their
405             contents.
406              
407             Note that this method is the only one whose return value is of relevance to the
408             parser, since what you return from this will appear in another handler,
409             depending on what type of paragraph the entity is in.
410              
411             You will never get the C<< ZZ<><> >> entity.
412              
413             =cut
414              
415             sub handle_entity {
416 5     5 1 6440 shift; shift;
  5         7  
417 5   100     40 join ' ', map $_ // '', @_;
418             }
419              
420             # preprocess paragraph before giving it to the user. handle_entity is called
421             # from the parser itself.
422             sub _process_entities {
423 19     19   30 my ($self, $para) = @_;
424              
425             # 1. replace POD-like Z<...> with user-defined functions.
426             # Z itself is the only actual exception to that.
427 19   100     125 $self->{parser} ||= Pod::Cats::Parser::MGC->new(
      66        
428             object => $self,
429             delimiters => $self->{delimiters} // '<'
430             );
431              
432 19         80 my $parsed = $self->{parser}->from_string( $para );
433              
434             # Single return of undef was Z<>
435 19 100 66     1722 return defined $parsed->[0] || @$parsed > 1 ? $parsed : ();
436             }
437              
438             =head2 handle_paragraph
439              
440             The paragraph is split into sections that alternate between plain text and the
441             return values of L as described above. These
442             sections are arrayed in @_. Note that the paragraph could start with an entity.
443              
444             By default it returns @_ concatenated, since the default behaviour of
445             L is to remove the formatting but keep the
446             contents.
447              
448             =cut
449              
450             sub handle_paragraph {
451 14   100 14 1 37 shift; join ' ', map $_ // '', @_;
  14         152  
452             }
453              
454             =head2 handle_command
455              
456             When a L is encountered it comes here. The first argument is
457             the B (from B<=COMMAND>); the rest of the arguments follow the rules of
458             L and alternate between plain text and parsed
459             entities.
460              
461             By default it returns @_ concatenated, same as paragraphs.
462              
463             =cut
464              
465             sub handle_command {
466 0   0 0 1   shift; shift; join ' ', map $_ // '', @_;
  0            
  0            
467             }
468              
469             =head2 handle_begin
470              
471             This is handled the same as L, except it is called when a
472             L command is encountered. The same rules apply.
473              
474             =cut
475              
476             sub _handle_begin {
477 0   0 0     shift; shift; join ' ', map $_ // '', @_;
  0            
  0            
478             }
479              
480             =head2 handle_end
481              
482             The counterpart to the begin handler. This is called when the L paragraph
483             is encountered. The parser will already have discovered whether your begins and
484             ends are not balanced so you don't need to worry about that.
485              
486             Note that there is no content for an end paragraph so the only argument this
487             gets is the command name.
488              
489             =cut
490              
491 0     0 1   sub handle_end { }
492              
493             =head1 TODO
494              
495             =over
496              
497             =item The document is parsed into DOM, then events are fired SAX-like.
498             Preferable to fire the events and build the DOM from that.
499              
500             =item Currently the matching of begin/end commands is a bit naive.
501              
502             =item Line numbers of errors are not yet reported.
503              
504             =back
505              
506             =head1 AUTHOR
507              
508             Altreus, C<< >>
509              
510             =head1 BUGS
511              
512             Bug reports to github please: http://github.com/Altreus/Pod-Cats/issues
513              
514             =head1 SUPPORT
515              
516             You are reading the only documentation for this module.
517              
518             For more help, give me a holler on irc.freenode.com #perl
519              
520             =head1 ACKNOWLEDGEMENTS
521              
522             Paul Evans (LeoNerd) basically wrote Parser::MGC because I was whining about not
523             being able to parse these entity delimiters with any of the token parsers I
524             could find; and then he wrote a POD example that I only had to tweak in order to
525             do so. So a lot of the credit should go to him!
526              
527             =head1 LICENSE AND COPYRIGHT
528              
529             Copyright 2013 Altreus.
530              
531             This module is released under the MIT licence.
532              
533             =cut
534              
535             1;