File Coverage

blib/lib/Pod/Cats.pm

Criterion	Covered	Total	%
statement	120	138	86.9
branch	18	26	69.2
condition	23	32	71.8
subroutine	15	20	75.0
pod	9	9	100.0
total	185	225	82.2

line	stmt	bran	cond	sub	pod	time	code
1							package Pod::Cats;
2
3	5			5		143689	use warnings;
	5					12
	5					171
4	5			5		25	use strict;
	5					11
	5					144
5	5			5		133	use 5.010;
	5					22
	5					218
6
7	5			5		2769	use Pod::Cats::Parser::MGC;
	5					20
	5					178
8	5			5		52	use List::Util qw(min max);
	5					10
	5					695
9	5			5		27	use Carp;
	5					12
	5					12900
10
11							=head1 NAME
12
13							Pod::Cats - The POD-like markup language written for podcats.in
14
15							=head1 VERSION
16
17							Version 0.06
18
19							=head1 DESCRIPTION
20
21							POD is an expressive markup language - like Perl is an expressive programming
22							language - and for a plain text file format there is little finer. Pod::Cats is
23							an extension of the POD semantics that adds more syntax and more flexibility to
24							the language.
25
26							Pod::Cats is designed to be extended and doesn't implement any default
27							commands or entities.
28
29							=head1 SYNTAX
30
31							Pod::Cats syntax borrows ideas from POD and adds its own.
32
33							A paragraph is any block of text delimited by blank lines (whitespace ignored).
34							This is the same as POD, and basically allows you to use hard word wrapping in
35							your markup without having to join them all together for output later.
36
37							There are three command paragraphs, which are defined by their first character.
38							This character must be in the first column; whitespace at the start of a
39							paragraph is syntactically relevant.
40
41							=over 4
42
43							=item C<=COMMAND CONTENT>
44							X
45
46							A line beginning with the C<=> symbol denotes a single I. Usually this
47							will be some sort of header, perhaps the equivalent of a C<< >>, something
48							like that. It is roughly equivalent to the self-closing tag in XML. B
49							is just text that may or may not be present. The relationship of B to
50							the B is for you to define, as is the meaning of B.
51
52							When a C<=COMMAND> block is completed, it is passed to L.
53
54							=item C<+NAME CONTENT>
55							X
56
57							A line beginning with C<+> opens a named block; its name is B. Similar to
58							C<=COMMAND>, the B is arbitrary, and its relationship to the B of
59							the block is up to you.
60
61							When this is encountered you are invited to L.
62
63							=item C<-NAME>
64							X
65
66							A line beginning with C<-> is the end of the named block previously started.
67							These must match in reverse order to the C<+> block with the matching B -
68							basically the same as XML's pairs. It is passed to L,
69							and unlike the other two command paragraphs it accepts no content.
70
71							=back
72
73							Then there are two types of text paragraph, for which the text is not
74							syntactically relevant but whitespace still is:
75
76							=over 4
77
78							=item Verbatim paragraphs
79
80							A line whose first character is whitespace is considered verbatim. No removal of
81							whitespace is done to the rest of the paragraph if the first character is
82							whitespace; all your text is repeated verbatim, hence the name
83
84							The verbatim paragraph continues until the first non-verbatim paragraph is
85							encountered. A blank line is no longer considered to end the paragraph.
86							Therefore, two verbatim paragraphs can only be separated by a non-verbatim
87							paragraph with non-whitespace content. The special formatting code C<< ZZ<><> >>
88							can be used on its own to separate them with zero-width content.
89
90							All lines in the verbatim paragraph will have their leading whitespace removed.
91							This is done intelligently: the I amount of leading whitespace found on
92							any line is removed from all lines. This allows you to indent other lines (even
93							the first one) relative to the syntactic whitespace that defines the verbatim
94							paragraph without your indentation being parsed out.
95
96							L are not parsed in verbatim paragraphs, as expected.
97
98							When a verbatim paragraph has been collated, it is passed to L.
99
100							=item Paragraphs
101
102							Everything that doesn't get caught by one of the above rules is deemed to be a
103							plain text paragraph. As with all paragraphs, a single line break is removed by
104							the parser and a blank line causes the paragraph to be processed. It is passed
105							to L.
106
107							=back
108
109							And finally the inline formatting markup, entities.
110
111							=over
112
113							=item C<< XZ<><> >>
114							X X
115
116							An entity is defined as a capital letter followed by a delimiter that is
117							repeated n times, then any amount of text up to a matching quantity of a
118							balanced delimiter.
119
120							In normal POD the only delimiter is C<< < >>, so entities have the format C<<
121							XZ<><> >>; except that the opening delimiter may be duplicated as long as the
122							closing delimiter matches, allowing you to put the delimiter itself inside the
123							entity: C<<< XZ<><<>> >>>; in Pod::Cats you can use any delimiter, removing the
124							requirement to duplicate it at all: C<< C[ XZ<><> ] >>.
125
126							Once an entity has begun, nested entities are only considered if the delimiters
127							are the same as those used for the outer entity: C<< B[ I[bold-italic] ] >>;
128							C<< B[IZ<>] >>.
129
130							Apart from the special entity C<< ZZ<><> >>, the letter used for the entity has
131							no inherent meaning to Pod::Cats. The parsed entity is provided to
132							L. C<< ZZ<><> >> retains its meaning from POD, which is to be a
133							zero-width 'divider' to break up things that would otherwise be considered
134							syntax. You are not given C<< ZZ<><> >> to handle, and C<< ZZ<><> >> itself will
135							produce undef if it is the only content to an element. A paragraph comprising solely
136							C<< ZZ<><> >> will never generate a parsed paragraph; it will be skipped.
137
138							=back
139
140							=head1 METHODS
141
142							=cut
143
144							our $VERSION = '0.06';
145
146							=head2 new
147
148							Create a new parser. Options are provided as a hashref, but there is currently
149							only one:
150
151							=over
152
153							=item delimiters
154
155							A string containing delimiters to use. Bracketed delimiters will be balanced;
156							other delimiters will simply be used as-is. This echoes the delimiter philosophy
157							of Perl syntax such as regexes and C. The string should be all the possible
158							delimiters, listed once each, and only the opening brackets of balanced pairs.
159
160							The default is C<< '<' >>, same as POD.
161
162							=back
163
164							=cut
165
166							sub new {
167	4			4	1	104	my $class = shift;
168	4		100			26	my $opts = shift \|\| {};
169	4					14	my $self = bless $opts, $class; # FIXME
170
171	4					14	return $self;
172							}
173
174							=head2 parse
175
176							Parses a string containing whatever Pod::Cats code you have.
177
178							=cut
179
180							sub parse {
181	0			0	1	0	my ($self, $string) = @_;
182
183	0					0	return $self->parse_lines(split /\n/, $string);
184							}
185
186							=head2 parse_file
187
188							Opens the file given by filename and reads it all in and then parses that.
189
190							=cut
191
192							sub parse_file {
193	0			0	1	0	my ($self, $filename) = @_;
194
195	0	0				0	carp "File not found: " . $filename unless -e $filename;
196
197	0					0	open my $fh, "<", $filename;
198	0					0	chomp(my @lines = <$fh>);
199	0					0	close $fh;
200
201	0					0	return $self->parse_lines(@lines);
202							}
203
204							=head2 parse_lines
205
206							L and L both come here, which just takes the markup text
207							as an array of lines and parses them. This is where the logic happens. It is
208							exposed publicly so you can parse an array of your own if you want.
209
210							=cut
211
212							sub parse_lines {
213	4			4	1	138	my ($self, @lines) = @_;
214
215	4					11	my $result = "";
216
217							# The buffer type goes in the first element, and its
218							# contents, if any, in the rest.
219	4					8	my @buffer;
220	4					34	$self->{dom} = [];
221
222							# Special lines are:
223							# - a blank line. An exception is between verbatim paragraphs, so we will
224							# simply re-merge verbatim paras later on
225							# - A line starting with =, + or -. Command paragraph. Process the previous
226							# buffer and start a new one with this.
227							# - Anything else continues the previous buffer, or starts a normal paragraph
228
229	4					40	shift @lines while $lines[0] !~ /\S/; # shift off leading blank lines!
230
231	4					11	for my $line (@lines) {
232	57					76	for ($line) {
233	57					131	when (/^\s*$/) {
234	25					83	$self->_process_buffer(@buffer);
235	25					69	@buffer = ();
236							}
237	32					60	when (/^([=+-])/) {
238	3					9	my $type = $1;
239	3	50				8	if (@buffer) {
240	0					0	warn "$type command found without leading blank line.";
241
242	0					0	$self->_process_buffer(@buffer);
243	0					0	@buffer = ();
244							}
245
246	3	50				18	push @buffer, {
247							'+' => 'begin',
248							'-' => 'end',
249							'=' => 'command',
250							}->{$type} or die "Don't know what to do with $type";
251
252							# find and push the command name onto it; the rest is the first
253							# bit of buffer contents.
254	3					63	push @buffer, grep {$_} ($line =~ /^\Q$type\E(.+?)\b\s(.)$/);
	6					19
255							}
256	29					48	when (/^\s+\S/) {
257	10	100				23	push @buffer, "verbatim" if !@buffer;
258	10					26	push @buffer, $line;
259							}
260	19					23	default {
261							# Nothing special, continue previous buffer or start a paragraph.
262	19	100				57	push @buffer, "paragraph" if !@buffer;
263	19					55	push @buffer, $line;
264							}
265							}
266							}
267
268	4	50				25	$self->_process_buffer(@buffer) if @buffer;
269	4					32	$self->_postprocess_dom();
270
271	4					41	$self->_postprocess_paragraphs();
272	4					2273	return $self->{dom};
273							}
274
275							# Adds the buffer and some metadata to the DOM, returning nothing.
276							sub _process_buffer {
277	29			29		125	my ($self, @buffer) = @_;
278
279	29	50				64	return '' unless @buffer;
280
281	29					37	my $buffer_type = shift @buffer;
282
283	29					69	my $node = {
284							type => $buffer_type
285							};
286
287	29					46	for ($buffer_type) {
288	29					71	when('paragraph') {
289							# concatenate the lines and normalise whitespace.
290	17					36	my $para = join " ", @buffer;
291	17					114	$node->{content} = $para;
292							}
293	12					38	when('verbatim') {
294							# find the lowest level of indentation in this buffer and strip it
295	9					12	my $indent_level = min map { /^(\s+)/; length $1 } @buffer;
	10					20
	10					38
296	9					20	$node->{content} = join "\n", @buffer;
297	9					24	$node->{indent_level} = $indent_level;
298							}
299	3		100			12	when($_ eq 'command' \|\| $_ eq 'begin') {
300	2					10	$node->{name} = shift @buffer;
301	2					5	my $content = join " ", @buffer;
302	2					5	$node->{content} = $content;
303							}
304	1					3	when('end') {
305	1					4	$node->{name} = shift @buffer; # end tags take no content
306							}
307							}
308
309	29					41	push @{$self->{dom}}, $node;
	29					72
310							}
311
312							# This is basically just to merge verbatims together
313							sub _postprocess_dom {
314	4			4		9	my $self = shift;
315
316	4					7	my @new_dom;
317							my $last_node;
318	4					9	for my $node (@{$self->{dom}}) {
	4					20
319	29	100	50			63	$last_node = $node and next unless defined $last_node;
320
321							# Don't change the last node until we stop finding verbatims.
322							# That way we can keep using it as the concatenated node.
323	25	100	100			89	if ($last_node->{type} eq 'verbatim' && $node->{type} eq 'verbatim') {
324							# The smallest indent level is considered the level for the merged node.
325	4					9	$last_node->{indent_level} =
326							min( $last_node->{indent_level}, $node->{indent_level});
327	4					9	$last_node->{content} .= "\n\n" . $node->{content};
328
329							} else {
330							# Node type changed, push old one
331	21	100				73	if ($last_node->{type} eq 'verbatim') {
332	5					6	my $to_remove = $last_node->{indent_level};
333	5	50				54	$last_node->{content} =~ s/^ {$to_remove}//mg if $to_remove;
334							}
335	21					32	push @new_dom, $last_node;
336	21					36	$last_node = $node;
337							}
338							}
339
340	4					15	push @new_dom, $last_node;
341	4					13	$self->{dom} = \@new_dom;
342							}
343
344							# Now is the sax-like bit, where it goes through and fires the user's events for
345							# the various types. TODO: what's the point in sax-like if you already made a
346							# DOM? Make this part of the parsing process and create the DOM out of the SAX.
347							sub _postprocess_paragraphs {
348	4			4		7	my $self = shift;
349
350	4					14	for my $node (@{ $self->{dom} }) {
	4					19
351	25					14754	for ($node->{type}) {
352	25					60	when ('paragraph') {
353							# If _process_entities gives us undef, that was a single Z<>, which should not
354							# generate a new paragraph.
355	17		100			67	$node->{content} = $self->_process_entities($node->{content}) // next;
356	16					27	$self->handle_paragraph(@{ $node->{content} });
	16					66
357							}
358	8					16	when ('begin') {
359	1					4	$node->{content} = $self->_process_entities($node->{content});
360							# Check for balance later
361	1					2	push @{$self->{begin_stack}}, $node->{name};
	1					4
362
363	1		50			3	$self->handle_begin($node->{name}, @{ $node->{content} // [] });
	1					21
364							}
365	7					14	when ('end') {
366	1					6	warn "$node->{name} is ended out of sync!"
367	1	50				2	if pop @{$self->{begin_stack}} ne $node->{name};
368
369	1					5	$self->handle_end($node->{name});
370							}
371	6					10	when ('command') {
372	1					8	$node->{content} = $self->_process_entities($node->{content});
373	1		50			4	$self->handle_command($node->{name}, @{ $node->{content} // [] });
	1					6
374							}
375	5					9	when ('verbatim') {
376	5					16	$self->handle_verbatim($node->{content});
377							}
378							}
379							}
380							}
381
382							=head2 handle_verbatim
383
384							The verbatim paragraph as it was in the code, except with the minimum amount of
385							whitespace stripped from each line as described in L.
386							Passed in as a single string with line breaks preserved.
387
388							Do whatever you want. Default is to return the string straight back atcha.
389
390							=cut
391
392							sub handle_verbatim {
393	5			5	1	25	shift;
394	5					12	shift;
395							}
396
397							=head2 handle_entity
398
399							Passed the letter of the L as the first argument and its content
400							as the rest of @_. The content will alternate between plain text and the return
401							value of this function for any nested entities inside this one.
402
403							For this reason you should return a scalar from this method, be it text or a
404							ref. The default is to concatenate @_, thus replacing entities with their
405							contents.
406
407							Note that this method is the only one whose return value is of relevance to the
408							parser, since what you return from this will appear in another handler,
409							depending on what type of paragraph the entity is in.
410
411							You will never get the C<< ZZ<><> >> entity.
412
413							=cut
414
415							sub handle_entity {
416	5			5	1	6440	shift; shift;
	5					7
417	5		100			40	join ' ', map $_ // '', @_;
418							}
419
420							# preprocess paragraph before giving it to the user. handle_entity is called
421							# from the parser itself.
422							sub _process_entities {
423	19			19		30	my ($self, $para) = @_;
424
425							# 1. replace POD-like Z<...> with user-defined functions.
426							# Z itself is the only actual exception to that.
427	19		100			125	$self->{parser} \|\|= Pod::Cats::Parser::MGC->new(
			66
428							object => $self,
429							delimiters => $self->{delimiters} // '<'
430							);
431
432	19					80	my $parsed = $self->{parser}->from_string( $para );
433
434							# Single return of undef was Z<>
435	19	100	66			1722	return defined $parsed->[0] \|\| @$parsed > 1 ? $parsed : ();
436							}
437
438							=head2 handle_paragraph
439
440							The paragraph is split into sections that alternate between plain text and the
441							return values of L as described above. These
442							sections are arrayed in @_. Note that the paragraph could start with an entity.
443
444							By default it returns @_ concatenated, since the default behaviour of
445							L is to remove the formatting but keep the
446							contents.
447
448							=cut
449
450							sub handle_paragraph {
451	14		100	14	1	37	shift; join ' ', map $_ // '', @_;
	14					152
452							}
453
454							=head2 handle_command
455
456							When a L is encountered it comes here. The first argument is
457							the B (from B<=COMMAND>); the rest of the arguments follow the rules of
458							L and alternate between plain text and parsed
459							entities.
460
461							By default it returns @_ concatenated, same as paragraphs.
462
463							=cut
464
465							sub handle_command {
466	0		0	0	1		shift; shift; join ' ', map $_ // '', @_;
	0
	0
467							}
468
469							=head2 handle_begin
470
471							This is handled the same as L, except it is called when a
472							L command is encountered. The same rules apply.
473
474							=cut
475
476							sub _handle_begin {
477	0		0	0			shift; shift; join ' ', map $_ // '', @_;
	0
	0
478							}
479
480							=head2 handle_end
481
482							The counterpart to the begin handler. This is called when the L paragraph
483							is encountered. The parser will already have discovered whether your begins and
484							ends are not balanced so you don't need to worry about that.
485
486							Note that there is no content for an end paragraph so the only argument this
487							gets is the command name.
488
489							=cut
490
491	0			0	1		sub handle_end { }
492
493							=head1 TODO
494
495							=over
496
497							=item The document is parsed into DOM, then events are fired SAX-like.
498							Preferable to fire the events and build the DOM from that.
499
500							=item Currently the matching of begin/end commands is a bit naive.
501
502							=item Line numbers of errors are not yet reported.
503
504							=back
505
506							=head1 AUTHOR
507
508							Altreus, C<< >>
509
510							=head1 BUGS
511
512							Bug reports to github please: http://github.com/Altreus/Pod-Cats/issues
513
514							=head1 SUPPORT
515
516							You are reading the only documentation for this module.
517
518							For more help, give me a holler on irc.freenode.com #perl
519
520							=head1 ACKNOWLEDGEMENTS
521
522							Paul Evans (LeoNerd) basically wrote Parser::MGC because I was whining about not
523							being able to parse these entity delimiters with any of the token parsers I
524							could find; and then he wrote a POD example that I only had to tweak in order to
525							do so. So a lot of the credit should go to him!
526
527							=head1 LICENSE AND COPYRIGHT
528
529							Copyright 2013 Altreus.
530
531							This module is released under the MIT licence.
532
533							=cut
534
535							1;