File Coverage

blib/lib/PPI/Tokenizer.pm
Criterion    Covered   Total       %
statement        212     243    87.2
branch           112     140    80.0
condition         27      35    77.1
subroutine        25      29    86.2
pod                5       5   100.0
total            381     452    84.2


line stmt bran cond sub pod time code
1             package PPI::Tokenizer;
2              
3             =pod
4              
5             =head1 NAME
6              
7             PPI::Tokenizer - The Perl Document Tokenizer
8              
9             =head1 SYNOPSIS
10              
11             # Create a tokenizer for a file, array or string
12             $Tokenizer = PPI::Tokenizer->new( 'filename.pl' );
13             $Tokenizer = PPI::Tokenizer->new( \@lines );
14             $Tokenizer = PPI::Tokenizer->new( \$source );
15            
16             # Return all the tokens for the document
17             my $tokens = $Tokenizer->all_tokens;
18            
19             # Or we can use it as an iterator
20             while ( my $Token = $Tokenizer->get_token ) {
21             print "Found token '$Token'\n";
22             }
23            
24             # If we REALLY need to manually nudge the cursor, you
25             # can do that too (The lexer needs this ability to do rollbacks)
26             $is_incremented = $Tokenizer->increment_cursor;
27             $is_decremented = $Tokenizer->decrement_cursor;
28              
29             =head1 DESCRIPTION
30              
31             PPI::Tokenizer is the class that provides Tokenizer objects for use in
32             breaking strings of Perl source code into Tokens.
33              
34             By the time you are reading this, you probably need to know a little
35             about the difference between how perl parses Perl "code" and how PPI
36             parses Perl "documents".
37              
38             "perl" itself (the interpreter) uses a heavily modified lex specification
39             to specify its parsing logic, maintains several types of state as it
40             goes, and incrementally tokenizes, lexes AND EXECUTES at the same time.
41              
42             In fact, it is provably impossible to use perl's parsing method without
43             simultaneously executing code. A formal mathematical proof has been
44             published demonstrating the method.
45              
46             This is where the truism "Only perl can parse Perl" comes from.
47              
48             PPI uses a completely different approach by abandoning the (impossible)
49             ability to parse Perl the same way that the interpreter does, and instead
50             parsing the source as a document, using a document structure independently
51             derived from the Perl documentation and approximating the perl interpreter
52             interpretation as closely as possible.
53              
54             It was touch and go for a long time whether we could get it close enough,
55             but in the end it turned out that it could be done.
56              
57             In this approach, the tokenizer C<PPI::Tokenizer> is implemented separately
58             from the lexer L<PPI::Lexer>.
59              
60             The job of C<PPI::Tokenizer> is to take pure source as a string and break it
61             up into a stream/set of tokens, and contains most of the "black magic" used
62             in PPI. By comparison, the lexer implements a relatively straightforward
63             tree structure, and has an implementation that is uncomplicated (compared
64             to the insanity in the tokenizer at least).
65              
66             The Tokenizer uses an immense amount of heuristics, guessing and cruft,
67             supported by a very B<VERY> flexible internal API, but fortunately it was
68             possible to largely encapsulate the black magic, so there is not a lot that
69             gets exposed to people using the C<PPI::Tokenizer> itself.
70              
71             =head1 METHODS
72              
73             Despite the incredible complexity, the Tokenizer itself only exposes a
74             relatively small number of methods, with most of the complexity implemented
75             in private methods.
76              
77             =cut
78              
79             # Make sure everything we need is loaded so
80             # we don't have to go and load all of PPI.
81 64     64   459 use strict;
  64         125  
  64         2357  
82 64     64   370 use Params::Util qw{_INSTANCE _SCALAR0 _ARRAY0};
  64         177  
  64         3575  
83 64     64   377 use List::Util 1.33 ();
  64         1308  
  64         1404  
84 64     64   362 use PPI::Util ();
  64         158  
  64         1246  
85 64     64   402 use PPI::Element ();
  64         206  
  64         1310  
86 64     64   373 use PPI::Token ();
  64         154  
  64         1568  
87 64     64   427 use PPI::Exception ();
  64         175  
  64         1295  
88 64     64   29056 use PPI::Exception::ParserRejection ();
  64         186  
  64         199195  
89              
90             our $VERSION = '1.277';
91              
92             # The x operator cannot follow most Perl operators, implying that
93             # anything beginning with x following an operator is a word.
94             # These are the exceptions.
95             my %X_CAN_FOLLOW_OPERATOR = map { $_ => 1 } qw( -- ++ );
96              
97             # The x operator cannot follow most structure elements, implying that
98             # anything beginning with x following a structure element is a word.
99             # These are the exceptions.
100             my %X_CAN_FOLLOW_STRUCTURE = map { $_ => 1 } qw( } ] \) );
101              
102             # Something that looks like the x operator but follows a word
103             # is usually that word's argument.
104             # These are the exceptions.
105             # chop, chomp, dump are ambiguous because they can have either parms
106             # or no parms.
107             my %X_CAN_FOLLOW_WORD = map { $_ => 1 } qw(
108             endgrent
109             endhostent
110             endnetent
111             endprotoent
112             endpwent
113             endservent
114             fork
115             getgrent
116             gethostent
117             getlogin
118             getnetent
119             getppid
120             getprotoent
121             getpwent
122             getservent
123             setgrent
124             setpwent
125             time
126             times
127             wait
128             wantarray
129             __SUB__
130             );
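The exception tables above can be illustrated with plain Perl: after argument-less builtins such as C<time>, an C<x> is the repetition operator rather than the start of an argument list. A minimal sketch (not part of PPI itself):

```perl
use strict;
use warnings;

# 'x' following a complete expression is Perl's repetition operator:
my $rule = '-' x 10;        # "----------"

# 'time' takes no arguments, so an 'x' after it is also the repetition
# operator -- which is why 'time' appears in %X_CAN_FOLLOW_WORD above:
my $stamp = time x 2;       # epoch seconds, repeated twice as a string

print "$rule\n";
```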
131              
132              
133              
134             #####################################################################
135             # Creation and Initialization
136              
137             =pod
138              
139             =head2 new $file | \@lines | \$source
140              
141             The main C<new> constructor creates a new Tokenizer object. These
142             objects have no configuration parameters, and can only be used once,
143             to tokenize a single perl source file.
144              
145             It takes as argument either a normal scalar containing source code,
146             a reference to a scalar containing source code, or a reference to an
147             ARRAY containing newline-terminated lines of source code.
148              
149             Returns a new C<PPI::Tokenizer> object on success, or throws a
150             L<PPI::Exception> exception on error.
151              
152             =cut
153              
154             sub new {
155 16855   33 16855 1 56136 my $class = ref($_[0]) || $_[0];
156              
157             # Create the empty tokenizer struct
158 16855         124361 my $self = bless {
159             # Source code
160             source => undef,
161             source_bytes => undef,
162              
163             # Line buffer
164             line => undef,
165             line_length => undef,
166             line_cursor => undef,
167             line_count => 0,
168              
169             # Parse state
170             token => undef,
171             class => 'PPI::Token::BOM',
172             zone => 'PPI::Token::Whitespace',
173              
174             # Output token buffer
175             tokens => [],
176             token_cursor => 0,
177             token_eof => 0,
178              
179             # Perl 6 blocks
180             perl6 => [],
181             }, $class;
182              
183 16855 50       70875 if ( ! defined $_[1] ) {
    100          
    50          
    0          
184             # We weren't given anything
185 0         0 PPI::Exception->throw("No source provided to Tokenizer");
186              
187             } elsif ( ! ref $_[1] ) {
188 502         2287 my $source = PPI::Util::_slurp($_[1]);
189 502 50       1958 if ( ref $source ) {
190             # Content returned by reference
191 502         1887 $self->{source} = $$source;
192             } else {
193             # Errors returned as a string
194 0         0 return( $source );
195             }
196              
197             } elsif ( _SCALAR0($_[1]) ) {
198 16353         23487 $self->{source} = ${$_[1]};
  16353         35040  
199              
200             } elsif ( _ARRAY0($_[1]) ) {
201 0         0 $self->{source} = join '', map { "\n" } @{$_[1]};
  0         0  
  0         0  
202              
203             } else {
204             # We don't support whatever this is
205 0         0 PPI::Exception->throw(ref($_[1]) . " is not supported as a source provider");
206             }
207              
208             # We can't handle a null string
209 16855         32015 $self->{source_bytes} = length $self->{source};
210 16855 100       31874 if ( $self->{source_bytes} ) {
211             # Split on local newlines
212 16851         321397 $self->{source} =~ s/(?:\015{1,2}\012|\015|\012)/\n/g;
213 16851         229118 $self->{source} = [ split /(?<=\n)/, $self->{source} ];
214              
215             } else {
216 4         9 $self->{source} = [ ];
217             }
218              
219             ### EVIL
220             # I'm explaining this earlier than I should so you can understand
221             # why I'm about to do something that looks very strange. There's
222             # a problem with the Tokenizer, in that tokens tend to change
223             # classes as each letter is added, but they don't get allocated
224             # their definite final class until the "end" of the token, the
225             # detection of which occurs in about a hundred different places,
226             # all through various crufty code (that triples the speed).
227             #
228             # However, in general, this does not apply to tokens in which a
229             # whitespace character is valid, such as comments, whitespace and
230             # big strings.
231             #
232             # So what we do is add a space to the end of the source. This
233             # triggers normal "end of token" functionality for all cases. Then,
234             # once the tokenizer hits end of file, it examines the last token to
235             # manually either remove the ' ' token, or chop it off the end of
236             # a longer one in which the space would be valid.
237 16855 100   71071   70470 if ( List::Util::any { /^__(?:DATA|END)__\s*$/ } @{$self->{source}} ) {
  71071 100       156580  
  16855 100       59831  
238 10         38 $self->{source_eof_chop} = '';
239             } elsif ( ! defined $self->{source}->[0] ) {
240 4         10 $self->{source_eof_chop} = '';
241             } elsif ( $self->{source}->[-1] =~ /\s$/ ) {
242 1086         3513 $self->{source_eof_chop} = '';
243             } else {
244 15755         30291 $self->{source_eof_chop} = 1;
245 15755         32873 $self->{source}->[-1] .= ' ';
246             }
247              
248 16855         73570 $self;
249             }
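The newline handling in C<new> above can be demonstrated in isolation: mixed CR, LF and CRLF line endings are normalised to C<\n>, and the lookbehind split keeps each line's trailing newline. A self-contained sketch of those two statements:

```perl
use strict;
use warnings;

my $source = "one\015\012two\012three\015four";   # CRLF, LF and bare CR mixed

# Normalise all local newline conventions to a single "\n"
$source =~ s/(?:\015{1,2}\012|\015|\012)/\n/g;

# Split into lines, keeping the trailing "\n" on each line
my @lines = split /(?<=\n)/, $source;

print scalar(@lines), " lines\n";   # 4 lines; only the last has no newline
```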
250              
251              
252              
253              
254              
255             #####################################################################
256             # Main Public Methods
257              
258             =pod
259              
260             =head2 get_token
261              
262             When using the PPI::Tokenizer object as an iterator, the C<get_token>
263             method is the primary method that is used. It increments the cursor
264             and returns the next Token in the output array.
265              
266             The actual parsing of the file is done only as-needed, and a line at
267             a time. When C<get_token> hits the end of the token array, it will
268             cause the parser to pull in the next line and parse it, continuing
269             as needed until there are more tokens on the output array that
270             C<get_token> can then return.
271              
272             This means that a number of Tokenizer objects can be created, and
273             won't consume significant CPU until you actually begin to pull tokens
274             from them.
275              
276             Returns a L<PPI::Token> object on success, C<0> if the Tokenizer had
277             reached the end of the file, or C<undef> on error.
278              
279             =cut
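Putting the three documented return values together, an iterator loop that distinguishes end-of-file (C<0>) from error (C<undef>) might look like this (a usage sketch, assuming PPI is installed):

```perl
use strict;
use warnings;
use PPI::Tokenizer;

my $Tokenizer = PPI::Tokenizer->new( \'my $x = 1;' );

while ( defined( my $Token = $Tokenizer->get_token ) ) {
    last unless $Token;    # 0 => clean end of file
    printf "%-30s %s\n", ref($Token), $Token->content;
}
# Falling out of the loop via the condition itself means undef => error
```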
280              
281             sub get_token {
282 381789     381789 1 543655 my $self = shift;
283              
284             # Shortcut for EOF
285 381789 50 66     758735 if ( $self->{token_eof}
286 13248         34677 and $self->{token_cursor} > scalar @{$self->{tokens}}
287             ) {
288 0         0 return 0;
289             }
290              
291             # Return the next token if we can
292 381789 100       1027083 if ( my $token = $self->{tokens}->[ $self->{token_cursor} ] ) {
293 305184         401883 $self->{token_cursor}++;
294 305184         1152939 return $token;
295             }
296              
297 76605         98982 my $line_rv;
298              
299             # Catch exceptions and return undef, so that we
300             # can start to convert code to exception-based code.
301 76605         107550 my $rv = eval {
302             # No token, we need to get some more
303 76605         150944 while ( $line_rv = $self->_process_next_line ) {
304             # If there is something in the buffer, return it
305             # The defined() prevents a ton of calls to PPI::Util::TRUE
306 67351 100       175075 if ( defined( my $token = $self->{tokens}->[ $self->{token_cursor} ] ) ) {
307 46591         64864 $self->{token_cursor}++;
308 46591         90860 return $token;
309             }
310             }
311 30013         47005 return undef;
312             };
313 76605 100       221162 if ( $@ ) {
    100          
314 1 50       13 if ( _INSTANCE($@, 'PPI::Exception') ) {
315 1         6 $@->throw;
316             } else {
317 0         0 my $errstr = $@;
318 0         0 $errstr =~ s/^(.*) at line .+$/$1/;
319 0         0 PPI::Exception->throw( $errstr );
320             }
321             } elsif ( $rv ) {
322 46591         228104 return $rv;
323             }
324              
325 30013 50       53409 if ( defined $line_rv ) {
326             # End of file, but we can still return things from the buffer
327 30013 50       61735 if ( my $token = $self->{tokens}->[ $self->{token_cursor} ] ) {
328 0         0 $self->{token_cursor}++;
329 0         0 return $token;
330             }
331              
332             # Set our token end of file flag
333 30013         42742 $self->{token_eof} = 1;
334 30013         107732 return 0;
335             }
336              
337             # Error, pass it up to our caller
338 0         0 undef;
339             }
340              
341             =pod
342              
343             =head2 all_tokens
344              
345             When not being used as an iterator, the C<all_tokens> method tells
346             the Tokenizer to parse the entire file and return all of the tokens
347             in a single ARRAY reference.
348              
349             It should be noted that C<all_tokens> does B<not> interfere with the
350             use of the Tokenizer object as an iterator (does not modify the token
351             cursor) and use of the two different mechanisms can be mixed safely.
352              
353             Returns a reference to an ARRAY of L<PPI::Token> objects on success
354             or throws an exception on error.
355              
356             =cut
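A matching usage sketch for batch processing (again assuming PPI is installed):

```perl
use strict;
use warnings;
use PPI::Tokenizer;

my $Tokenizer = PPI::Tokenizer->new( \'print "Hello";' );

# One call parses the whole document and returns every token at once
my $tokens = $Tokenizer->all_tokens;

for my $Token ( @$tokens ) {
    print ref($Token), ": '", $Token->content, "'\n";
}
```

Because the tokenizer preserves every character, joining the token contents back together reproduces the original source exactly.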
357              
358             sub all_tokens {
359 4     4 1 21 my $self = shift;
360              
361             # Catch exceptions and return undef, so that we
362             # can start to convert code to exception-based code.
363 4         6 my $ok = eval {
364             # Process lines until we get EOF
365 4 50       13 unless ( $self->{token_eof} ) {
366 4         5 my $rv;
367 4         10 while ( $rv = $self->_process_next_line ) {}
368 4 50       10 unless ( defined $rv ) {
369 0         0 PPI::Exception->throw("Error while processing source");
370             }
371              
372             # Clean up the end of the tokenizer
373 4         9 $self->_clean_eof;
374             }
375 4         7 1;
376             };
377 4 50       10 if ( !$ok ) {
378 0         0 my $errstr = $@;
379 0         0 $errstr =~ s/^(.*) at line .+$/$1/;
380 0         0 PPI::Exception->throw( $errstr );
381             }
382              
383             # End of file, return a copy of the token array.
384 4         6 return [ @{$self->{tokens}} ];
  4         14  
385             }
386              
387             =pod
388              
389             =head2 increment_cursor
390              
391             Although exposed as a public method, C<increment_cursor> is implemented
392             for expert use only, when writing lexers or other components that work
393             directly on token streams.
394              
395             It manually increments the token cursor forward through the file, in effect
396             "skipping" the next token.
397              
398             Returns true if the cursor is incremented, C<0> if already at the end of
399             the file, or C<undef> on error.
400              
401             =cut
402              
403             sub increment_cursor {
404             # Do this via the get_token method, which makes sure there
405             # is actually a token there to move to.
406 0 0   0 1 0 $_[0]->get_token and 1;
407             }
408              
409             =pod
410              
411             =head2 decrement_cursor
412              
413             Although exposed as a public method, C<decrement_cursor> is implemented
414             for expert use only, when writing lexers or other components that work
415             directly on token streams.
416              
417             It manually decrements the token cursor backwards through the file, in
418             effect "rolling back" the token stream. And indeed that is what it is
419             primarily intended for, when the component that is consuming the token
420             stream needs to implement some sort of "roll back" feature in its use
421             of the token stream.
422              
423             Returns true if the cursor is decremented, C<0> if already at the
424             beginning of the file, or C<undef> on error.
425              
426             =cut
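The two cursor methods combine into the rollback pattern the lexer relies on: consume a token speculatively, then back the cursor up if it is not what was expected. A sketch, assuming PPI is installed:

```perl
use strict;
use warnings;
use PPI::Tokenizer;

my $Tokenizer = PPI::Tokenizer->new( \'my $x;' );

# Peek: read one token ahead, then roll the cursor back
my $Token = $Tokenizer->get_token;
$Tokenizer->decrement_cursor;

# The next get_token call returns the very same token again
my $Again = $Tokenizer->get_token;
print "rolled back to '", $Again->content, "'\n";
```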
427              
428             sub decrement_cursor {
429 0     0 1 0 my $self = shift;
430              
431             # Check for the beginning of the file
432 0 0       0 return 0 unless $self->{token_cursor};
433              
434             # Decrement the token cursor
435 0         0 $self->{token_eof} = 0;
436 0         0 --$self->{token_cursor};
437             }
438              
439              
440              
441              
442              
443             #####################################################################
444             # Working With Source
445              
446             # Fetches the next line from the input line buffer
447             # Returns undef at EOF.
448             sub _get_line {
449 105211     105211   134916 my $self = shift;
450 105211 100       213268 return undef unless $self->{source}; # EOF hit previously
451              
452             # Pull off the next line
453 89427         113738 my $line = shift @{$self->{source}};
  89427         191462  
454              
455             # Flag EOF if we hit it
456 89427 100       189083 $self->{source} = undef unless defined $line;
457              
458             # Return the line (or EOF flag)
459 89427         163609 return $line; # string or undef
460             }
461              
462             # Fetches the next line, ready to process
463             # Returns 1 on success
464             # Returns 0 on EOF
465             sub _fill_line {
466 102894     102894   134086 my $self = shift;
467 102894         132710 my $inscan = shift;
468              
469             # Get the next line
470 102894         168774 my $line = $self->_get_line;
471 102894 100       192518 unless ( defined $line ) {
472             # End of file
473 32077 100       57637 unless ( $inscan ) {
474 30017         55561 delete $self->{line};
475 30017         42008 delete $self->{line_cursor};
476 30017         38986 delete $self->{line_length};
477 30017         67470 return 0;
478             }
479              
480             # In the scan version, just set the cursor to the end
481             # of the line, and the rest should just cascade out.
482 2060         3330 $self->{line_cursor} = $self->{line_length};
483 2060         4950 return 0;
484             }
485              
486             # Populate the appropriate variables
487 70817         127168 $self->{line} = $line;
488 70817         102234 $self->{line_cursor} = -1;
489 70817         99770 $self->{line_length} = length $line;
490 70817         94691 $self->{line_count}++;
491              
492 70817         147654 1;
493             }
494              
495             # Get the current character
496             sub _char {
497 0     0   0 my $self = shift;
498 0         0 substr( $self->{line}, $self->{line_cursor}, 1 );
499             }
500              
501              
502              
503              
504              
505             ####################################################################
506             # Per line processing methods
507              
508             # Processes the next line
509             # Returns 1 on success completion
510             # Returns 0 if EOF
511             # Returns undef on error
512             sub _process_next_line {
513 97379     97379   135369 my $self = shift;
514              
515             # Fill the line buffer
516 97379         123424 my $rv;
517 97379 100       162767 unless ( $rv = $self->_fill_line ) {
518 30017 50       57946 return undef unless defined $rv;
519              
520             # End of file, finalize last token
521 30017         64225 $self->_finalize_token;
522 30017         69140 return 0;
523             }
524              
525             # Run the __TOKENIZER__on_line_start
526 67362         195717 $rv = $self->{class}->__TOKENIZER__on_line_start( $self );
527 67362 100       126712 unless ( $rv ) {
528             # If there are no more source lines, then clean up
529 27787 100 66     62639 if ( ref $self->{source} eq 'ARRAY' and ! @{$self->{source}} ) {
  27787         74840  
530 307         1210 $self->_clean_eof;
531             }
532              
533             # Defined but false means next line
534 27787 50       73354 return 1 if defined $rv;
535 0         0 PPI::Exception->throw("Error at line $self->{line_count}");
536             }
537              
538             # If we can't deal with the entire line, process char by char
539 39575         77395 while ( $rv = $self->_process_next_char ) {}
540 39574 50       84041 unless ( defined $rv ) {
541 0         0 PPI::Exception->throw("Error at line $self->{line_count}, character $self->{line_cursor}");
542             }
543              
544             # Trigger any action that needs to happen at the end of a line
545 39574         121659 $self->{class}->__TOKENIZER__on_line_end( $self );
546              
547             # If there are no more source lines, then clean up
548 39574 100 100     110265 unless ( ref($self->{source}) eq 'ARRAY' and @{$self->{source}} ) {
  37271         115800  
549 16543         35091 return $self->_clean_eof;
550             }
551              
552 23031         56392 return 1;
553             }
554              
555              
556              
557              
558              
559             #####################################################################
560             # Per-character processing methods
561              
562             # Process on a per-character basis.
563             # Note that due the high number of times this gets
564             # called, it has been fairly heavily in-lined, so the code
565             # might look a bit ugly and duplicated.
566             sub _process_next_char {
567 435850     435850   599762 my $self = shift;
568              
569             ### FIXME - This checks for a screwed up condition that triggers
570             ### several warnings, amongst other things.
571 435850 50 33     1332179 if ( ! defined $self->{line_cursor} or ! defined $self->{line_length} ) {
572             # $DB::single = 1;
573 0         0 return undef;
574             }
575              
576 435850         587949 $self->{line_cursor}++;
577 435850 100       664240 return 0 if $self->_at_line_end;
578              
579             # Pass control to the token class
580 396276         570806 my $result;
581 396276 100       941060 unless ( $result = $self->{class}->__TOKENIZER__on_char( $self ) ) {
582             # undef is error. 0 is "Did stuff ourself, you don't have to do anything"
583 76267 50       252580 return defined $result ? 1 : undef;
584             }
585              
586             # We will need the value of the current character
587 320008         579844 my $char = substr( $self->{line}, $self->{line_cursor}, 1 );
588 320008 100       577707 if ( $result eq '1' ) {
589             # If __TOKENIZER__on_char returns 1, it is signaling that it thinks that
590             # the character is part of it.
591              
592             # Add the character
593 57848 50       104131 if ( defined $self->{token} ) {
594 57848         98470 $self->{token}->{content} .= $char;
595             } else {
596 0 0       0 defined($self->{token} = $self->{class}->new($char)) or return undef;
597             }
598              
599 57848         155618 return 1;
600             }
601              
602             # We have been provided with the name of a class
603 262160 100       613181 if ( $self->{class} ne "PPI::Token::$result" ) {
    100          
604             # New class
605 102240         189987 $self->_new_token( $result, $char );
606             } elsif ( defined $self->{token} ) {
607             # Same class as current
608 29378         44980 $self->{token}->{content} .= $char;
609             } else {
610             # Same class, but no current
611 130542 50       313219 defined($self->{token} = $self->{class}->new($char)) or return undef;
612             }
613              
614 262160         675034 1;
615             }
616              
617             sub _at_line_end {
618 435850     435850   655789 my ($self) = @_;
619 435850         1105075 return $self->{line_cursor} >= $self->{line_length};
620             }
621              
622              
623              
624              
625              
626             #####################################################################
627             # Altering Tokens in Tokenizer
628              
629             # Finish the end of a token.
630             # Returns the resulting parse class as a convenience.
631             sub _finalize_token {
632 395011     395011   537834 my $self = shift;
633 395011 100       723502 return $self->{class} unless defined $self->{token};
634              
635             # Add the token to the token buffer
636 364992         447566 push @{ $self->{tokens} }, $self->{token};
  364992         704228  
637 364992         524428 $self->{token} = undef;
638              
639             # Return the parse class to that of the zone we are in
640 364992         846819 $self->{class} = $self->{zone};
641             }
642              
643             # Creates a new token and sets it in the tokenizer
644             # The defined() in here prevent a ton of calls to PPI::Util::TRUE
645             sub _new_token {
646 234448     234448   333339 my $self = shift;
647             # throw PPI::Exception() unless @_;
648 234448 100       568078 my $class = substr( $_[0], 0, 12 ) eq 'PPI::Token::'
649             ? shift : 'PPI::Token::' . shift;
650              
651             # Finalize any existing token
652 234448 100       564831 $self->_finalize_token if defined $self->{token};
653              
654             # Create the new token and update the parse class
655 234448 50       654713 defined($self->{token} = $class->new($_[0])) or PPI::Exception->throw;
656 234448         401944 $self->{class} = $class;
657              
658 234448         368046 1;
659             }
660              
661             # At the end of the file, we need to clean up the results of the erroneous
662             # space that we inserted at the beginning of the process.
663             sub _clean_eof {
664 16854     16854   24278 my $self = shift;
665              
666             # Finish any partially completed token
667 16854 100       35357 $self->_finalize_token if $self->{token};
668              
669             # Find the last token, and if it has no content, kill it.
670             # There appears to be some evidence that such "null tokens" are
671             # somehow getting created accidentally.
672 16854         25577 my $last_token = $self->{tokens}->[ -1 ];
673 16854 50       37490 unless ( length $last_token->{content} ) {
674 0         0 pop @{$self->{tokens}};
  0         0  
675             }
676              
677             # Now, if the last character of the last token is a space we added,
678             # chop it off, deleting the token if there's nothing else left.
679 16854 100       34743 if ( $self->{source_eof_chop} ) {
680 15480         22930 $last_token = $self->{tokens}->[ -1 ];
681 15480         72300 $last_token->{content} =~ s/ $//;
682 15480 100       40346 unless ( length $last_token->{content} ) {
683             # Popping token
684 13189         16847 pop @{$self->{tokens}};
  13189         22727  
685             }
686              
687             # The hack involving adding an extra space is now reversed, and
688             # now nobody will ever know. The perfect crime!
689 15480         27136 $self->{source_eof_chop} = '';
690             }
691              
692 16854         54509 1;
693             }
694              
695              
696              
697              
698              
699             #####################################################################
700             # Utility Methods
701              
702             # Context
703             sub _last_token {
704 0     0   0 $_[0]->{tokens}->[-1];
705             }
706              
707             sub _last_significant_token {
708 3091     3091   4758 my $self = shift;
709 3091         4603 my $cursor = $#{ $self->{tokens} };
  3091         6101  
710 3091         7340 while ( $cursor >= 0 ) {
711 4119         7525 my $token = $self->{tokens}->[$cursor--];
712 4119 100       13926 return $token if $token->significant;
713             }
714 388         867 return;
715             }

# Get a list of previous significant tokens.
# Like _last_significant_token, except it gets more than just one token.
# Returns a list with 0 to $count entries.
sub _previous_significant_tokens {
	my $self   = shift;
	my $count  = shift || 1;
	my $cursor = $#{ $self->{tokens} };

	my @tokens;
	while ( $cursor >= 0 ) {
		my $token = $self->{tokens}->[$cursor--];
		next if not $token->significant;
		push @tokens, $token;
		last if @tokens >= $count;
	}

	return @tokens;
}

my %OBVIOUS_CLASS = (
	'PPI::Token::Symbol'              => 'operator',
	'PPI::Token::Magic'               => 'operator',
	'PPI::Token::Number'              => 'operator',
	'PPI::Token::ArrayIndex'          => 'operator',
	'PPI::Token::Quote::Double'       => 'operator',
	'PPI::Token::Quote::Interpolate'  => 'operator',
	'PPI::Token::Quote::Literal'      => 'operator',
	'PPI::Token::Quote::Single'       => 'operator',
	'PPI::Token::QuoteLike::Backtick' => 'operator',
	'PPI::Token::QuoteLike::Command'  => 'operator',
	'PPI::Token::QuoteLike::Readline' => 'operator',
	'PPI::Token::QuoteLike::Regexp'   => 'operator',
	'PPI::Token::QuoteLike::Words'    => 'operator',
);

my %OBVIOUS_CONTENT = (
	'(' => 'operand',
	'{' => 'operand',
	'[' => 'operand',
	';' => 'operand',
	'}' => 'operator',
);


my %USUALLY_FORCES = map { $_ => 1 } qw( sub package use no );

# Try to determine operator/operand context, if possible.
# Returns "operator", "operand", or "" if unknown.
sub _opcontext {
	my $self   = shift;
	my @tokens = $self->_previous_significant_tokens(1);
	my $p0     = $tokens[0];
	return '' if not $p0;
	my $c0 = ref $p0;

	# Map the obvious cases
	return $OBVIOUS_CLASS{$c0}   if defined $OBVIOUS_CLASS{$c0};
	return $OBVIOUS_CONTENT{$p0} if defined $OBVIOUS_CONTENT{$p0};

	# Most of the time after an operator, we are an operand
	return 'operand' if $p0->isa('PPI::Token::Operator');

	# If there's NOTHING, it's operand
	return 'operand' if $p0->content eq '';

	# Otherwise, we don't know
	return '';
}

# Assuming we are currently parsing the word 'x', return true
# if previous tokens imply the x is an operator, false otherwise.
sub _current_x_is_operator {
	my ( $self ) = @_;
	return if !@{$self->{tokens}};

	my ( $prev, $prevprev ) = $self->_previous_significant_tokens(2);
	return if !$prev;

	return !$self->__current_token_is_forced_word if $prev->isa('PPI::Token::Word');

	return (!$prev->isa('PPI::Token::Operator') || $X_CAN_FOLLOW_OPERATOR{$prev})
		&& (!$prev->isa('PPI::Token::Structure') || $X_CAN_FOLLOW_STRUCTURE{$prev})
		&& !$prev->isa('PPI::Token::Label')
	;
}


# Assuming we are at the end of parsing the current token that could be a word,
# a wordlike operator, or a version string, try to determine whether context
# before or after it forces it to be a bareword. This method is only useful
# during tokenization.
sub __current_token_is_forced_word {
	my ( $t, $word ) = @_;

	# Check if forced by preceding tokens.

	my ( $prev, $prevprev ) = $t->_previous_significant_tokens(2);
	if ( !$prev ) {
		pos $t->{line} = $t->{line_cursor};
	}
	else {
		my $content = $prev->{content};

		# We are forced if we are a method name.
		# '->' will always be an operator, so we don't check its type.
		return 1 if $content eq '->';

		# If we are contained in a pair of curly braces, we are probably a
		# forced bareword hash key. '{' is never a word or operator, so we
		# don't check its type.
		pos $t->{line} = $t->{line_cursor};
		return 1 if $content eq '{' and $t->{line} =~ /\G\s*\}/gc;

		# sub, package, use, and no all indicate that what immediately follows
		# is a word, not an operator or (in the case of sub and package) a
		# version string. However, we don't want to be fooled by 'package
		# package v10' or 'use no v10'. We're a forced word unless we're
		# preceded by 'package sub', in which case we're a version string.
		# We also have to make sure that the sub/package/etc doing the forcing
		# is not a method call.
		if ( $USUALLY_FORCES{$content} ) {
			return if defined $word and $word =~ /^v[0-9]+$/ and ( $content eq "use" or $content eq "no" );
			return 1 if not $prevprev;
			return 1 if not $USUALLY_FORCES{$prevprev->content} and $prevprev->content ne '->';
			return;
		}
	}
	# pos on $t->{line} is guaranteed to be set at this point.

	# Check if forced by following tokens.

	# If the word is followed by => it is probably a word, not a regex.
	return 1 if $t->{line} =~ /\G\s*=>/gc;

	# Otherwise we probably aren't forced
	return '';
}

1;

=pod

=head1 NOTES

=head2 How the Tokenizer Works

Understanding the Tokenizer is not for the faint-hearted. It is by far
the most complex and twisty piece of perl I've ever written that is actually
still built properly and isn't a terrible spaghetti-like mess. In fact, you
probably want to skip this section.

But if you really want to understand, well then here goes.

=head2 Source Input and Clean Up

The Tokenizer starts by taking source in a variety of forms, sucking it
all in and merging into one big string, and doing our own internal line
split, using a "universal line separator" which allows the Tokenizer to
take source for any platform (and even supports a few known types of
broken newlines caused by mixed mac/pc/*nix editor screw ups).

The resulting array of lines is used to feed the tokenizer, and is also
accessed directly by the heredoc-logic to do the line-oriented part of
here-doc support.
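As an illustrative sketch (not PPI's actual code; the helper name and the exact regex are assumptions), a "universal line separator" split that keeps each line's own terminator attached might look like:

```perl
use strict;
use warnings;

# Hypothetical sketch of a "universal newline" line split. It accepts
# Unix (\012), DOS (\015\012), classic Mac (\015) and even the broken
# \015\015\012 sequences that mixed-editor files can contain, keeping
# the terminator attached to its line.
sub split_universal {
	my ($source) = @_;
	my @lines = $source =~ /\G([^\015\012]*(?:\015{1,2}\012|\015|\012|\z))/g;
	# Drop the single empty match the regex engine allows at end-of-string.
	return grep { length } @lines;
}

my @lines = split_universal("one\015\012two\012three\015four");
print scalar(@lines), "\n";    # 4 lines despite three newline styles
```

The real Tokenizer performs its split once, up front, so that both the main cursor loop and the here-doc logic can work from the same line array.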
881              
882             =head2 Doing Things the Old Fashioned Way
883              
884             Due to the complexity of perl, and after 2 previously aborted parser
885             attempts, in the end the tokenizer was fashioned around a line-buffered
886             character-by-character method.
887              
888             That is, the Tokenizer pulls and holds a line at a time into a line buffer,
889             and then iterates a cursor along it. At each cursor position, a method is
890             called in whatever token class we are currently in, which will examine the
891             character at the current position, and handle it.
892              
893             As the handler methods in the various token classes are called, they
894             build up an output token array for the source code.
895              
896             Various parts of the Tokenizer use look-ahead, arbitrary-distance
897             look-behind (although currently the maximum is three significant tokens),
898             or both, and various other heuristic guesses.
899              
900             I've been told it is officially termed a I<"backtracking parser
901             with infinite lookaheads">.
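The line-buffered, cursor-driven dispatch described above can be sketched as follows. This is a toy model with made-up class names and a deliberately trivial handler API, not PPI's real one:

```perl
use strict;
use warnings;

# Each "class" supplies a handler that inspects the character at the
# cursor and returns the class the tokenizer should be in afterwards.
my %handlers = (
	whitespace => sub { $_[0] =~ /\d/ ? 'number' : 'whitespace' },
	number     => sub { $_[0] =~ /\d/ ? 'number' : 'whitespace' },
);

# Walk a cursor along the line buffer, dispatching to the handler of
# whatever class we are currently in, and record each transition.
sub classify {
	my ($line) = @_;
	my $class  = 'whitespace';    # the "base" class of a document
	my @events;
	for my $cursor ( 0 .. length($line) - 1 ) {
		my $char = substr $line, $cursor, 1;
		$class = $handlers{$class}->($char);
		push @events, $class;
	}
	return @events;
}

my @e = classify("a 12");
print "@e\n";    # whitespace whitespace number number
```

In the real Tokenizer the handlers also accumulate token content and can hand off to another class's handler for the same character, which is why handler calls do not map one-to-one onto characters.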
902              
903             =head2 State Variables
904              
905             Aside from the current line and the character cursor, the Tokenizer
906             maintains a number of different state variables.
907              
908             =over
909              
910             =item Current Class
911              
912             The Tokenizer maintains the current token class at all times. Much of the
913             time is just going to be the "Whitespace" class, which is what the base of
914             a document is. As the tokenizer executes the various character handlers,
915             the class changes a lot as it moves a long. In fact, in some instances,
916             the character handler may not handle the character directly itself, but
917             rather change the "current class" and then hand off to the character
918             handler for the new class.
919              
920             Because of this, and some other things I'll deal with later, the number of
921             times the character handlers are called does not in fact have a direct
922             relationship to the number of actual characters in the document.
923              
924             =item Current Zone
925              
926             Rather than create a class stack to allow for infinitely nested layers of
927             classes, the Tokenizer recognises just a single layer.
928              
929             To put it a different way, in various parts of the file, the Tokenizer will
930             recognise different "base" or "substrate" classes. When a Token such as a
931             comment or a number is finalised by the tokenizer, it "falls back" to the
932             base state.
933              
934             This allows proper tokenization of special areas such as __DATA__
935             and __END__ blocks, which also contain things like comments and POD,
936             without allowing the creation of any significant Tokens inside these areas.
937              
938             For the main part of a document we use L for this,
939             with the idea being that code is "floating in a sea of whitespace".
940              
941             =item Current Token
942              
943             The final main state variable is the "current token". This is the Token
944             that is currently being built by the Tokenizer. For certain types, it
945             can be manipulated and morphed and change class quite a bit while being
946             assembled, as the Tokenizer's understanding of the token content changes.
947              
948             When the Tokenizer is confident that it has seen the end of the Token, it
949             will be "finalized", which adds it to the output token array and resets
950             the current class to that of the zone that we are currently in.
951              
952             I should also note at this point that the "current token" variable is
953             optional. The Tokenizer is capable of knowing what class it is currently
954             set to, without actually having accumulated any characters in the Token.
955              
956             =back
957              

=head2 Making It Faster

As I'm sure you can imagine, calling several different methods for each
character and running regexes and other complex heuristics made the first
fully working version of the tokenizer extremely slow.

During testing, I created a metric to measure parsing speed called
LPGC, or "lines per gigacycle". A gigacycle is simply a billion CPU
cycles on a typical single-core CPU, and so a Tokenizer running at
"1000 lines per gigacycle" should tokenize around 1200 lines of
code per second when running on a 1200 MHz processor.
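To make the unit concrete, converting LPGC to throughput is just a multiplication (a worked example added here for illustration, not code from the module):

```perl
use strict;
use warnings;

# lines/second = LPGC (lines per billion cycles) * clock speed in GHz,
# since a CPU at N GHz executes N gigacycles per second.
sub lines_per_second {
	my ( $lpgc, $ghz ) = @_;
	return $lpgc * $ghz;
}

print lines_per_second( 1000, 1.2 ), "\n";    # 1200 lines/sec at 1.2 GHz
print lines_per_second( 350,  1.2 ), "\n";    # 420, the first working version
```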

The first working version of the tokenizer ran at only 350 LPGC, so to
tokenize a typical large module such as L<ExtUtils::MakeMaker> took
10-15 seconds. This sluggishness made it impractical for many uses.

So in the current parser, there are multiple layers of optimisation
very carefully built into the basic engine. This has brought the tokenizer
up to a more reasonable 1000 LPGC, at the expense of making the code
quite a bit twistier.

=head2 Making It Faster - Whole Line Classification

The first step in the optimisation process was to add a new handler to
enable several of the more basic classes (whitespace, comments) to be
able to be parsed a line at a time. At the start of each line, a
special optional handler (only supported by a few classes) is called to
check and see if the entire line can be parsed in one go.

This is used mainly to handle things like POD, comments, empty lines,
and a few other minor special cases.
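A hedged sketch of what such a whole-line check might look like. The function name, class names, and regexes here are illustrative assumptions, not PPI's real handlers:

```perl
use strict;
use warnings;

# Returns a class name if the entire line can be consumed in one go,
# or undef if it must fall back to char-by-char parsing.
sub classify_whole_line {
	my ($line) = @_;
	return 'Whitespace' if $line =~ /^\s*$/;    # empty or blank line
	return 'Comment'    if $line =~ /^\s*#/;    # full-line comment
	return 'Pod'        if $line =~ /^=\w+/;    # POD directive
	return undef;                               # needs the slow path
}

print classify_whole_line("# a comment\n"), "\n";    # Comment
```

Only a few classes can make this promise safely, which is why the hook is optional per class.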
989              
990             =head2 Making It Faster - Inlining
991              
992             The second stage of the optimisation involved inlining a small
993             number of critical methods that were repeated an extremely high number
994             of times. Profiling suggested that there were about 1,000,000 individual
995             method calls per gigacycle, and by cutting these by two thirds a significant
996             speed improvement was gained, in the order of about 50%.
997              
998             You may notice that many methods in the C code look
999             very nested and long hand. This is primarily due to this inlining.
1000              
1001             At around this time, some statistics code that existed in the early
1002             versions of the parser was also removed, as it was determined that
1003             it was consuming around 15% of the CPU for the entire parser, while
1004             making the core more complicated.
1005              
1006             A judgment call was made that with the difficulties likely to be
1007             encountered with future planned enhancements, and given the relatively
1008             high cost involved, the statistics features would be removed from the
1009             Tokenizer.

=head2 Making It Faster - Quote Engine

Once inlining had reached diminishing returns, it became obvious from
the profiling results that a huge amount of time was being spent
stepping a char at a time through long, simple and "syntactically boring"
code such as comments and strings.

The existing regex engine was expanded to also encompass quotes and
other quote-like things, and a special abstract base class was added
that provided a number of specialised parsing methods that would "scan
ahead", looking out ahead to find the end of a string, and updating
the cursor to leave it in a valid position for the next call.

This is also the point at which the number of character handler calls began
to greatly differ from the number of characters. But it has been done
in a way that allows the parser to retain the power of the original
version at the critical points, while skipping through the "boring bits"
as needed for additional speed.

The addition of this feature allowed the tokenizer to exceed 1000 LPGC
for the first time.
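The "scan ahead" idea can be sketched like this. This is a simplified, hypothetical helper; PPI's real Quote Engine methods have different names and also handle strings that continue across lines:

```perl
use strict;
use warnings;

# Given a line and a cursor sitting just past an opening quote, jump
# straight to the closing delimiter in one regex step instead of
# stepping a character at a time. Returns the string body and the new
# cursor position, or undef if the string is not closed on this line.
sub scan_for_unescaped {
	my ( $line, $cursor, $delim ) = @_;
	pos($line) = $cursor;
	# Consume backslash-escaped pairs or non-delimiter characters,
	# then the unescaped closing delimiter itself.
	if ( $line =~ /\G((?:\\.|[^\\\Q$delim\E])*)\Q$delim\E/gc ) {
		return ( $1, pos($line) );    # cursor now sits after the close quote
	}
	return ( undef, $cursor );        # string continues on the next line
}

my ( $body, $pos ) = scan_for_unescaped( q{my $s = 'it\'s here'; ...}, 9, "'" );
print "$body\n";
```

Leaving the cursor in a valid position for the next handler call is the crucial part: the char-by-char machinery simply resumes from wherever the scan stopped.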

=head2 Making It Faster - The "Complete" Mechanism

As it became evident that great speed increases were available by using
this "skipping ahead" mechanism, a new handler method was added that
explicitly handles the parsing of an entire token, where the structure
of the token is relatively simple. Tokens such as symbols fit this case,
as once we are past the initial sigil and word char, we know that we
can skip ahead and "complete" the rest of the token much more easily.

A number of these have been added for most or possibly all of the common
cases, with most of these "complete" handlers implemented using regular
expressions.

In fact, so many have been added that at this point, you could arguably
reclassify the tokenizer as a "hybrid regex, char-by-char heuristic
tokenizer". More tokens are now consumed in "complete" methods in a
typical program than are handled by the normal char-by-char methods.
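As a sketch of the idea (the function name and regex are illustrative assumptions, not PPI's real handler), a "complete" handler for a symbol might consume the whole token in one regex step:

```perl
use strict;
use warnings;

# Once the sigil and the first word character have been seen, the rest
# of a symbol token has a simple enough structure to match all at once.
sub complete_symbol {
	my ( $line, $cursor ) = @_;
	pos($line) = $cursor;
	$line =~ /\G([\$\@\%\&\*][\w:]+)/gc or return undef;
	return $1;    # the entire symbol token, matched in one step
}

print complete_symbol( 'my $foo_bar = 1;', 3 ), "\n";    # $foo_bar
```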

Many of these complete-handlers were implemented during the writing
of the Lexer, and this has allowed the full parser to maintain around
1000 LPGC despite the increasing weight of the Lexer.

=head2 Making It Faster - Porting To C (In Progress)

While it would be extraordinarily difficult to port all of the Tokenizer
to C, work has started on a L<PPI::XS> "accelerator" package which acts as
a separate and automatically-detected add-on to the main PPI package.

L<PPI::XS> implements faster versions of a variety of functions scattered
over the entire PPI codebase, from the Tokenizer Core, Quote Engine, and
various other places, and implements them identically in XS/C.

In particular, the skip-ahead methods from the Quote Engine would appear
to be extremely amenable to being done in C, and a number of other
functions could be cherry-picked one at a time and implemented in C.

Each method is heavily tested to ensure that the functionality is
identical, and a versioning mechanism is included to ensure that if a
function gets out of sync, L<PPI::XS> will degrade gracefully and just
not replace that single method.

=head1 TO DO

- Add an option to reset or seek the token stream...

- Implement more Tokenizer functions in L<PPI::XS>

=head1 SUPPORT

See the L<support section|PPI/SUPPORT> in the main module.

=head1 AUTHOR

Adam Kennedy E<lt>adamk@cpan.orgE<gt>

=head1 COPYRIGHT

Copyright 2001 - 2011 Adam Kennedy.

This program is free software; you can redistribute
it and/or modify it under the same terms as Perl itself.

The full text of the license can be found in the
LICENSE file included with this module.

=cut