File Coverage

blib/lib/PPI/Tokenizer.pm
Criterion    Covered   Total       %
statement        212     243    87.2
branch           112     140    80.0
condition         27      35    77.1
subroutine        25      29    86.2
pod                5       5   100.0
total            381     452    84.2


line stmt bran cond sub pod time code
1             package PPI::Tokenizer;
2              
3             =pod
4              
5             =head1 NAME
6              
7             PPI::Tokenizer - The Perl Document Tokenizer
8              
9             =head1 SYNOPSIS
10              
11             # Create a tokenizer for a file, array or string
12             $Tokenizer = PPI::Tokenizer->new( 'filename.pl' );
13             $Tokenizer = PPI::Tokenizer->new( \@lines );
14             $Tokenizer = PPI::Tokenizer->new( \$source );
15            
16             # Return all the tokens for the document
17             my $tokens = $Tokenizer->all_tokens;
18            
19             # Or we can use it as an iterator
20             while ( my $Token = $Tokenizer->get_token ) {
21             print "Found token '$Token'\n";
22             }
23            
24             # If we REALLY need to manually nudge the cursor, you
25             # can do that too (The lexer needs this ability to do rollbacks)
26             $is_incremented = $Tokenizer->increment_cursor;
27             $is_decremented = $Tokenizer->decrement_cursor;
28              
29             =head1 DESCRIPTION
30              
31             PPI::Tokenizer is the class that provides Tokenizer objects for use in
32             breaking strings of Perl source code into Tokens.
33              
34             By the time you are reading this, you probably need to know a little
35             about the difference between how perl parses Perl "code" and how PPI
36             parses Perl "documents".
37              
38             "perl" itself (the interpreter) uses a heavily modified lex specification
39             to specify its parsing logic, maintains several types of state as it
40             goes, and incrementally tokenizes, lexes AND EXECUTES at the same time.
41              
42             In fact, it is provably impossible to use perl's parsing method without
43             simultaneously executing code. A formal mathematical proof has been
44             published demonstrating the method.
45              
46             This is where the truism "Only perl can parse Perl" comes from.
47              
48             PPI uses a completely different approach by abandoning the (impossible)
49             ability to parse Perl the same way that the interpreter does, and instead
50             parsing the source as a document, using a document structure independently
51             derived from the Perl documentation and approximating the perl interpreter
52             interpretation as closely as possible.
53              
54             It was touch and go for a long time whether we could get it close enough,
55             but in the end it turned out that it could be done.
56              
57             In this approach, the tokenizer C<PPI::Tokenizer> is implemented separately
58             from the lexer L<PPI::Lexer>.
59              
60             The job of C<PPI::Tokenizer> is to take pure source as a string and break it
61             up into a stream/set of tokens, and contains most of the "black magic" used
62             in PPI. By comparison, the lexer implements a relatively straightforward
63             tree structure, and has an implementation that is uncomplicated (compared
64             to the insanity in the tokenizer at least).
65              
66             The Tokenizer uses an immense amount of heuristics, guessing and cruft,
67             supported by a very B<VERY> flexible internal API, but fortunately it was
68             possible to largely encapsulate the black magic, so there is not a lot that
69             gets exposed to people using the C<PPI::Tokenizer> itself.
70              
71             =head1 METHODS
72              
73             Despite the incredible complexity, the Tokenizer itself only exposes a
74             relatively small number of methods, with most of the complexity implemented
75             in private methods.
76              
77             =cut
78              
79             # Make sure everything we need is loaded so
80             # we don't have to go and load all of PPI.
81 64     64   459 use strict;
  64         125  
  64         2357  
82 64     64   370 use Params::Util qw{_INSTANCE _SCALAR0 _ARRAY0};
  64         177  
  64         3575  
83 64     64   377 use List::Util 1.33 ();
  64         1308  
  64         1404  
84 64     64   362 use PPI::Util ();
  64         158  
  64         1246  
85 64     64   402 use PPI::Element ();
  64         206  
  64         1310  
86 64     64   373 use PPI::Token ();
  64         154  
  64         1568  
87 64     64   427 use PPI::Exception ();
  64         175  
  64         1295  
88 64     64   29056 use PPI::Exception::ParserRejection ();
  64         186  
  64         199195  
89              
90             our $VERSION = '1.277';
91              
92             # The x operator cannot follow most Perl operators, implying that
93             # anything beginning with x following an operator is a word.
94             # These are the exceptions.
95             my %X_CAN_FOLLOW_OPERATOR = map { $_ => 1 } qw( -- ++ );
96              
97             # The x operator cannot follow most structure elements, implying that
98             # anything beginning with x following a structure element is a word.
99             # These are the exceptions.
100             my %X_CAN_FOLLOW_STRUCTURE = map { $_ => 1 } qw( } ] \) );
101              
102             # Something that looks like the x operator but follows a word
103             # is usually that word's argument.
104             # These are the exceptions.
105             # chop, chomp, dump are ambiguous because they can have either parms
106             # or no parms.
107             my %X_CAN_FOLLOW_WORD = map { $_ => 1 } qw(
108             endgrent
109             endhostent
110             endnetent
111             endprotoent
112             endpwent
113             endservent
114             fork
115             getgrent
116             gethostent
117             getlogin
118             getnetent
119             getppid
120             getprotoent
121             getpwent
122             getservent
123             setgrent
124             setpwent
125             time
126             times
127             wait
128             wantarray
129             __SUB__
130             );
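The exception tables above can be illustrated with plain Perl: after argument-less builtins such as C<time>, an C<x> is the repetition operator rather than the start of an argument list. A minimal sketch (not part of PPI itself):

```perl
use strict;
use warnings;

# 'x' following a complete expression is Perl's repetition operator:
my $rule = '-' x 10;        # "----------"

# 'time' takes no arguments, so an 'x' after it is also the repetition
# operator -- which is why 'time' appears in %X_CAN_FOLLOW_WORD above:
my $stamp = time x 2;       # epoch seconds, repeated twice as a string

print "$rule\n";
```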
131              
132              
133              
134             #####################################################################
135             # Creation and Initialization
136              
137             =pod
138              
139             =head2 new $file | \@lines | \$source
140              
141             The main C<new> constructor creates a new Tokenizer object. These
142             objects have no configuration parameters, and can only be used once,
143             to tokenize a single perl source file.
144              
145             It takes as argument either a normal scalar containing source code,
146             a reference to a scalar containing source code, or a reference to an
147             ARRAY containing newline-terminated lines of source code.
148              
149             Returns a new C<PPI::Tokenizer> object on success, or throws a
150             L<PPI::Exception> exception on error.
151              
152             =cut
153              
154             sub new {
155 16855   33 16855 1 56136 my $class = ref($_[0]) || $_[0];
156              
157             # Create the empty tokenizer struct
158 16855         124361 my $self = bless {
159             # Source code
160             source => undef,
161             source_bytes => undef,
162              
163             # Line buffer
164             line => undef,
165             line_length => undef,
166             line_cursor => undef,
167             line_count => 0,
168              
169             # Parse state
170             token => undef,
171             class => 'PPI::Token::BOM',
172             zone => 'PPI::Token::Whitespace',
173              
174             # Output token buffer
175             tokens => [],
176             token_cursor => 0,
177             token_eof => 0,
178              
179             # Perl 6 blocks
180             perl6 => [],
181             }, $class;
182              
183 16855 50       70875 if ( ! defined $_[1] ) {
    100          
    50          
    0          
184             # We weren't given anything
185 0         0 PPI::Exception->throw("No source provided to Tokenizer");
186              
187             } elsif ( ! ref $_[1] ) {
188 502         2287 my $source = PPI::Util::_slurp($_[1]);
189 502 50       1958 if ( ref $source ) {
190             # Content returned by reference
191 502         1887 $self->{source} = $$source;
192             } else {
193             # Errors returned as a string
194 0         0 return( $source );
195             }
196              
197             } elsif ( _SCALAR0($_[1]) ) {
198 16353         23487 $self->{source} = ${$_[1]};
  16353         35040  
199              
200             } elsif ( _ARRAY0($_[1]) ) {
201 0         0 $self->{source} = join '', map { "\n" } @{$_[1]};
  0         0  
  0         0  
202              
203             } else {
204             # We don't support whatever this is
205 0         0 PPI::Exception->throw(ref($_[1]) . " is not supported as a source provider");
206             }
207              
208             # We can't handle a null string
209 16855         32015 $self->{source_bytes} = length $self->{source};
210 16855 100       31874 if ( $self->{source_bytes} ) {
211             # Split on local newlines
212 16851         321397 $self->{source} =~ s/(?:\015{1,2}\012|\015|\012)/\n/g;
213 16851         229118 $self->{source} = [ split /(?<=\n)/, $self->{source} ];
214              
215             } else {
216 4         9 $self->{source} = [ ];
217             }
218              
219             ### EVIL
220             # I'm explaining this earlier than I should so you can understand
221             # why I'm about to do something that looks very strange. There's
222             # a problem with the Tokenizer, in that tokens tend to change
223             # classes as each letter is added, but they don't get allocated
224             # their definite final class until the "end" of the token, the
225             # detection of which occurs in about a hundred different places,
226             # all through various crufty code (that triples the speed).
227             #
228             # However, in general, this does not apply to tokens in which a
229             # whitespace character is valid, such as comments, whitespace and
230             # big strings.
231             #
232             # So what we do is add a space to the end of the source. This
233             # triggers normal "end of token" functionality for all cases. Then,
234             # once the tokenizer hits end of file, it examines the last token to
235             # manually either remove the ' ' token, or chop it off the end of
236             # a longer one in which the space would be valid.
237 16855 100   71071   70470 if ( List::Util::any { /^__(?:DATA|END)__\s*$/ } @{$self->{source}} ) {
  71071 100       156580  
  16855 100       59831  
238 10         38 $self->{source_eof_chop} = '';
239             } elsif ( ! defined $self->{source}->[0] ) {
240 4         10 $self->{source_eof_chop} = '';
241             } elsif ( $self->{source}->[-1] =~ /\s$/ ) {
242 1086         3513 $self->{source_eof_chop} = '';
243             } else {
244 15755         30291 $self->{source_eof_chop} = 1;
245 15755         32873 $self->{source}->[-1] .= ' ';
246             }
247              
248 16855         73570 $self;
249             }
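The newline handling in C<new> above can be demonstrated in isolation: mixed CR, LF and CRLF line endings are normalised to C<\n>, and the lookbehind split keeps each line's trailing newline. A self-contained sketch of those two statements:

```perl
use strict;
use warnings;

my $source = "one\015\012two\012three\015four";   # CRLF, LF and bare CR mixed

# Normalise all local newline conventions to a single "\n"
$source =~ s/(?:\015{1,2}\012|\015|\012)/\n/g;

# Split into lines, keeping the trailing "\n" on each line
my @lines = split /(?<=\n)/, $source;

print scalar(@lines), " lines\n";   # 4 lines; only the last has no newline
```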
250              
251              
252              
253              
254              
255             #####################################################################
256             # Main Public Methods
257              
258             =pod
259              
260             =head2 get_token
261              
262             When using the PPI::Tokenizer object as an iterator, the C<get_token>
263             method is the primary method that is used. It increments the cursor
264             and returns the next Token in the output array.
265              
266             The actual parsing of the file is done only as-needed, and a line at
267             a time. When C<get_token> hits the end of the token array, it will
268             cause the parser to pull in the next line and parse it, continuing
269             as needed until there are more tokens on the output array that
270             C<get_token> can then return.
271              
272             This means that a number of Tokenizer objects can be created, and
273             won't consume significant CPU until you actually begin to pull tokens
274             from them.
275              
276             Returns a L<PPI::Token> object on success, C<0> if the Tokenizer had
277             reached the end of the file, or C<undef> on error.
278              
279             =cut
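Putting the three documented return values together, an iterator loop that distinguishes end-of-file (C<0>) from error (C<undef>) might look like this (a usage sketch, assuming PPI is installed):

```perl
use strict;
use warnings;
use PPI::Tokenizer;

my $Tokenizer = PPI::Tokenizer->new( \'my $x = 1;' );

while ( defined( my $Token = $Tokenizer->get_token ) ) {
    last unless $Token;    # 0 => clean end of file
    printf "%-30s %s\n", ref($Token), $Token->content;
}
# Falling out of the loop via the condition itself means undef => error
```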
280              
281             sub get_token {
282 381789     381789 1 543655 my $self = shift;
283              
284             # Shortcut for EOF
285 381789 50 66     758735 if ( $self->{token_eof}
286 13248         34677 and $self->{token_cursor} > scalar @{$self->{tokens}}
287             ) {
288 0         0 return 0;
289             }
290              
291             # Return the next token if we can
292 381789 100       1027083 if ( my $token = $self->{tokens}->[ $self->{token_cursor} ] ) {
293 305184         401883 $self->{token_cursor}++;
294 305184         1152939 return $token;
295             }
296              
297 76605         98982 my $line_rv;
298              
299             # Catch exceptions and return undef, so that we
300             # can start to convert code to exception-based code.
301 76605         107550 my $rv = eval {
302             # No token, we need to get some more
303 76605         150944 while ( $line_rv = $self->_process_next_line ) {
304             # If there is something in the buffer, return it
305             # The defined() prevents a ton of calls to PPI::Util::TRUE
306 67351 100       175075 if ( defined( my $token = $self->{tokens}->[ $self->{token_cursor} ] ) ) {
307 46591         64864 $self->{token_cursor}++;
308 46591         90860 return $token;
309             }
310             }
311 30013         47005 return undef;
312             };
313 76605 100       221162 if ( $@ ) {
    100          
314 1 50       13 if ( _INSTANCE($@, 'PPI::Exception') ) {
315 1         6 $@->throw;
316             } else {
317 0         0 my $errstr = $@;
318 0         0 $errstr =~ s/^(.*) at line .+$/$1/;
319 0         0 PPI::Exception->throw( $errstr );
320             }
321             } elsif ( $rv ) {
322 46591         228104 return $rv;
323             }
324              
325 30013 50       53409 if ( defined $line_rv ) {
326             # End of file, but we can still return things from the buffer
327 30013 50       61735 if ( my $token = $self->{tokens}->[ $self->{token_cursor} ] ) {
328 0         0 $self->{token_cursor}++;
329 0         0 return $token;
330             }
331              
332             # Set our token end of file flag
333 30013         42742 $self->{token_eof} = 1;
334 30013         107732 return 0;
335             }
336              
337             # Error, pass it up to our caller
338 0         0 undef;
339             }
340              
341             =pod
342              
343             =head2 all_tokens
344              
345             When not being used as an iterator, the C<all_tokens> method tells
346             the Tokenizer to parse the entire file and return all of the tokens
347             in a single ARRAY reference.
348              
349             It should be noted that C<all_tokens> does B<not> interfere with the
350             use of the Tokenizer object as an iterator (does not modify the token
351             cursor) and use of the two different mechanisms can be mixed safely.
352              
353             Returns a reference to an ARRAY of L<PPI::Token> objects on success
354             or throws an exception on error.
355              
356             =cut
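A matching usage sketch for batch processing (again assuming PPI is installed):

```perl
use strict;
use warnings;
use PPI::Tokenizer;

my $Tokenizer = PPI::Tokenizer->new( \'print "Hello";' );

# One call parses the whole document and returns every token at once
my $tokens = $Tokenizer->all_tokens;

for my $Token ( @$tokens ) {
    print ref($Token), ": '", $Token->content, "'\n";
}
```

Because the tokenizer preserves every character, joining the token contents back together reproduces the original source exactly.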
357              
358             sub all_tokens {
359 4     4 1 21 my $self = shift;
360              
361             # Catch exceptions and return undef, so that we
362             # can start to convert code to exception-based code.
363 4         6 my $ok = eval {
364             # Process lines until we get EOF
365 4 50       13 unless ( $self->{token_eof} ) {
366 4         5 my $rv;
367 4         10 while ( $rv = $self->_process_next_line ) {}
368 4 50       10 unless ( defined $rv ) {
369 0         0 PPI::Exception->throw("Error while processing source");
370             }
371              
372             # Clean up the end of the tokenizer
373 4         9 $self->_clean_eof;
374             }
375 4         7 1;
376             };
377 4 50       10 if ( !$ok ) {
378 0         0 my $errstr = $@;
379 0         0 $errstr =~ s/^(.*) at line .+$/$1/;
380 0         0 PPI::Exception->throw( $errstr );
381             }
382              
383             # End of file, return a copy of the token array.
384 4         6 return [ @{$self->{tokens}} ];
  4         14  
385             }
386              
387             =pod
388              
389             =head2 increment_cursor
390              
391             Although exposed as a public method, C<increment_cursor> is implemented
392             for expert use only, when writing lexers or other components that work
393             directly on token streams.
394              
395             It manually increments the token cursor forward through the file, in effect
396             "skipping" the next token.
397              
398             Returns true if the cursor is incremented, C<0> if already at the end of
399             the file, or C<undef> on error.
400              
401             =cut
402              
403             sub increment_cursor {
404             # Do this via the get_token method, which makes sure there
405             # is actually a token there to move to.
406 0 0   0 1 0 $_[0]->get_token and 1;
407             }
408              
409             =pod
410              
411             =head2 decrement_cursor
412              
413             Although exposed as a public method, C<decrement_cursor> is implemented
414             for expert use only, when writing lexers or other components that work
415             directly on token streams.
416              
417             It manually decrements the token cursor backwards through the file, in
418             effect "rolling back" the token stream. And indeed that is what it is
419             primarily intended for, when the component that is consuming the token
420             stream needs to implement some sort of "roll back" feature in its use
421             of the token stream.
422              
423             Returns true if the cursor is decremented, C<0> if already at the
424             beginning of the file, or C<undef> on error.
425              
426             =cut
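The two cursor methods combine into the rollback pattern the lexer relies on: consume a token speculatively, then back the cursor up if it is not what was expected. A sketch, assuming PPI is installed:

```perl
use strict;
use warnings;
use PPI::Tokenizer;

my $Tokenizer = PPI::Tokenizer->new( \'my $x;' );

# Peek: read one token ahead, then roll the cursor back
my $Token = $Tokenizer->get_token;
$Tokenizer->decrement_cursor;

# The next get_token call returns the very same token again
my $Again = $Tokenizer->get_token;
print "rolled back to '", $Again->content, "'\n";
```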
427              
428             sub decrement_cursor {
429 0     0 1 0 my $self = shift;
430              
431             # Check for the beginning of the file
432 0 0       0 return 0 unless $self->{token_cursor};
433              
434             # Decrement the token cursor
435 0         0 $self->{token_eof} = 0;
436 0         0 --$self->{token_cursor};
437             }
438              
439              
440              
441              
442              
443             #####################################################################
444             # Working With Source
445              
446             # Fetches the next line from the input line buffer
447             # Returns undef at EOF.
448             sub _get_line {
449 105211     105211   134916 my $self = shift;
450 105211 100       213268 return undef unless $self->{source}; # EOF hit previously
451              
452             # Pull off the next line
453 89427         113738 my $line = shift @{$self->{source}};
  89427         191462  
454              
455             # Flag EOF if we hit it
456 89427 100       189083 $self->{source} = undef unless defined $line;
457              
458             # Return the line (or EOF flag)
459 89427         163609 return $line; # string or undef
460             }
461              
462             # Fetches the next line, ready to process
463             # Returns 1 on success
464             # Returns 0 on EOF
465             sub _fill_line {
466 102894     102894   134086 my $self = shift;
467 102894         132710 my $inscan = shift;
468              
469             # Get the next line
470 102894         168774 my $line = $self->_get_line;
471 102894 100       192518 unless ( defined $line ) {
472             # End of file
473 32077 100       57637 unless ( $inscan ) {
474 30017         55561 delete $self->{line};
475 30017         42008 delete $self->{line_cursor};
476 30017         38986 delete $self->{line_length};
477 30017         67470 return 0;
478             }
479              
480             # In the scan version, just set the cursor to the end
481             # of the line, and the rest should just cascade out.
482 2060         3330 $self->{line_cursor} = $self->{line_length};
483 2060         4950 return 0;
484             }
485              
486             # Populate the appropriate variables
487 70817         127168 $self->{line} = $line;
488 70817         102234 $self->{line_cursor} = -1;
489 70817         99770 $self->{line_length} = length $line;
490 70817         94691 $self->{line_count}++;
491              
492 70817         147654 1;
493             }
494              
495             # Get the current character
496             sub _char {
497 0     0   0 my $self = shift;
498 0         0 substr( $self->{line}, $self->{line_cursor}, 1 );
499             }
500              
501              
502              
503              
504              
505             ####################################################################
506             # Per line processing methods
507              
508             # Processes the next line
509             # Returns 1 on success completion
510             # Returns 0 if EOF
511             # Returns undef on error
512             sub _process_next_line {
513 97379     97379   135369 my $self = shift;
514              
515             # Fill the line buffer
516 97379         123424 my $rv;
517 97379 100       162767 unless ( $rv = $self->_fill_line ) {
518 30017 50       57946 return undef unless defined $rv;
519              
520             # End of file, finalize last token
521 30017         64225 $self->_finalize_token;
522 30017         69140 return 0;
523             }
524              
525             # Run the __TOKENIZER__on_line_start
526 67362         195717 $rv = $self->{class}->__TOKENIZER__on_line_start( $self );
527 67362 100       126712 unless ( $rv ) {
528             # If there are no more source lines, then clean up
529 27787 100 66     62639 if ( ref $self->{source} eq 'ARRAY' and ! @{$self->{source}} ) {
  27787         74840  
530 307         1210 $self->_clean_eof;
531             }
532              
533             # Defined but false means next line
534 27787 50       73354 return 1 if defined $rv;
535 0         0 PPI::Exception->throw("Error at line $self->{line_count}");
536             }
537              
538             # If we can't deal with the entire line, process char by char
539 39575         77395 while ( $rv = $self->_process_next_char ) {}
540 39574 50       84041 unless ( defined $rv ) {
541 0         0 PPI::Exception->throw("Error at line $self->{line_count}, character $self->{line_cursor}");
542             }
543              
544             # Trigger any action that needs to happen at the end of a line
545 39574         121659 $self->{class}->__TOKENIZER__on_line_end( $self );
546              
547             # If there are no more source lines, then clean up
548 39574 100 100     110265 unless ( ref($self->{source}) eq 'ARRAY' and @{$self->{source}} ) {
  37271         115800  
549 16543         35091 return $self->_clean_eof;
550             }
551              
552 23031         56392 return 1;
553             }
554              
555              
556              
557              
558              
559             #####################################################################
560             # Per-character processing methods
561              
562             # Process on a per-character basis.
563             # Note that due the high number of times this gets
564             # called, it has been fairly heavily in-lined, so the code
565             # might look a bit ugly and duplicated.
566             sub _process_next_char {
567 435850     435850   599762 my $self = shift;
568              
569             ### FIXME - This checks for a screwed up condition that triggers
570             ### several warnings, amongst other things.
571 435850 50 33     1332179 if ( ! defined $self->{line_cursor} or ! defined $self->{line_length} ) {
572             # $DB::single = 1;
573 0         0 return undef;
574             }
575              
576 435850         587949 $self->{line_cursor}++;
577 435850 100       664240 return 0 if $self->_at_line_end;
578              
579             # Pass control to the token class
580 396276         570806 my $result;
581 396276 100       941060 unless ( $result = $self->{class}->__TOKENIZER__on_char( $self ) ) {
582             # undef is error. 0 is "Did stuff ourself, you don't have to do anything"
583 76267 50       252580 return defined $result ? 1 : undef;
584             }
585              
586             # We will need the value of the current character
587 320008         579844 my $char = substr( $self->{line}, $self->{line_cursor}, 1 );
588 320008 100       577707 if ( $result eq '1' ) {
589             # If __TOKENIZER__on_char returns 1, it is signaling that it thinks that
590             # the character is part of it.
591              
592             # Add the character
593 57848 50       104131 if ( defined $self->{token} ) {
594 57848         98470 $self->{token}->{content} .= $char;
595             } else {
596 0 0       0 defined($self->{token} = $self->{class}->new($char)) or return undef;
597             }
598              
599 57848         155618 return 1;
600             }
601              
602             # We have been provided with the name of a class
603 262160 100       613181 if ( $self->{class} ne "PPI::Token::$result" ) {
    100          
604             # New class
605 102240         189987 $self->_new_token( $result, $char );
606             } elsif ( defined $self->{token} ) {
607             # Same class as current
608 29378         44980 $self->{token}->{content} .= $char;
609             } else {
610             # Same class, but no current
611 130542 50       313219 defined($self->{token} = $self->{class}->new($char)) or return undef;
612             }
613              
614 262160         675034 1;
615             }
616              
617             sub _at_line_end {
618 435850     435850   655789 my ($self) = @_;
619 435850         1105075 return $self->{line_cursor} >= $self->{line_length};
620             }
621              
622              
623              
624              
625              
626             #####################################################################
627             # Altering Tokens in Tokenizer
628              
629             # Finish the end of a token.
630             # Returns the resulting parse class as a convenience.
631             sub _finalize_token {
632 395011     395011   537834 my $self = shift;
633 395011 100       723502 return $self->{class} unless defined $self->{token};
634              
635             # Add the token to the token buffer
636 364992         447566 push @{ $self->{tokens} }, $self->{token};
  364992         704228  
637 364992         524428 $self->{token} = undef;
638              
639             # Return the parse class to that of the zone we are in
640 364992         846819 $self->{class} = $self->{zone};
641             }
642              
643             # Creates a new token and sets it in the tokenizer
644             # The defined() in here prevent a ton of calls to PPI::Util::TRUE
645             sub _new_token {
646 234448     234448   333339 my $self = shift;
647             # throw PPI::Exception() unless @_;
648 234448 100       568078 my $class = substr( $_[0], 0, 12 ) eq 'PPI::Token::'
649             ? shift : 'PPI::Token::' . shift;
650              
651             # Finalize any existing token
652 234448 100       564831 $self->_finalize_token if defined $self->{token};
653              
654             # Create the new token and update the parse class
655 234448 50       654713 defined($self->{token} = $class->new($_[0])) or PPI::Exception->throw;
656 234448         401944 $self->{class} = $class;
657              
658 234448         368046 1;
659             }
660              
661             # At the end of the file, we need to clean up the results of the erroneous
662             # space that we inserted at the beginning of the process.
663             sub _clean_eof {
664 16854     16854   24278 my $self = shift;
665              
666             # Finish any partially completed token
667 16854 100       35357 $self->_finalize_token if $self->{token};
668              
669             # Find the last token, and if it has no content, kill it.
670             # There appears to be some evidence that such "null tokens" are
671             # somehow getting created accidentally.
672 16854         25577 my $last_token = $self->{tokens}->[ -1 ];
673 16854 50       37490 unless ( length $last_token->{content} ) {
674 0         0 pop @{$self->{tokens}};
  0         0  
675             }
676              
677             # Now, if the last character of the last token is a space we added,
678             # chop it off, deleting the token if there's nothing else left.
679 16854 100       34743 if ( $self->{source_eof_chop} ) {
680 15480         22930 $last_token = $self->{tokens}->[ -1 ];
681 15480         72300 $last_token->{content} =~ s/ $//;
682 15480 100       40346 unless ( length $last_token->{content} ) {
683             # Popping token
684 13189         16847 pop @{$self->{tokens}};
  13189         22727  
685             }
686              
687             # The hack involving adding an extra space is now reversed, and
688             # now nobody will ever know. The perfect crime!
689 15480         27136 $self->{source_eof_chop} = '';
690             }
691              
692 16854         54509 1;
693             }
694              
695              
696              
697              
698              
699             #####################################################################
700             # Utility Methods
701              
702             # Context
703             sub _last_token {
704 0     0   0 $_[0]->{tokens}->[-1];
705             }
706              
707             sub _last_significant_token {
708 3091     3091   4758 my $self = shift;
709 3091         4603 my $cursor = $#{ $self->{tokens} };
  3091         6101  
710 3091         7340 while ( $cursor >= 0 ) {
711 4119         7525 my $token = $self->{tokens}->[$cursor--];
712 4119 100       13926 return $token if $token->significant;
713             }
714 388         867 return;
715             }

# Get a list of previous significant tokens.
# Like _last_significant_token, except it gets more than just one token.
# Returns a list with 0 to $count entries.
sub _previous_significant_tokens {
	my $self   = shift;
	my $count  = shift || 1;
	my $cursor = $#{ $self->{tokens} };

	my @tokens;
	while ( $cursor >= 0 ) {
		my $token = $self->{tokens}->[$cursor--];
		next if not $token->significant;
		push @tokens, $token;
		last if @tokens >= $count;
	}

	return @tokens;
}

my %OBVIOUS_CLASS = (
	'PPI::Token::Symbol'              => 'operator',
	'PPI::Token::Magic'               => 'operator',
	'PPI::Token::Number'              => 'operator',
	'PPI::Token::ArrayIndex'          => 'operator',
	'PPI::Token::Quote::Double'       => 'operator',
	'PPI::Token::Quote::Interpolate'  => 'operator',
	'PPI::Token::Quote::Literal'      => 'operator',
	'PPI::Token::Quote::Single'       => 'operator',
	'PPI::Token::QuoteLike::Backtick' => 'operator',
	'PPI::Token::QuoteLike::Command'  => 'operator',
	'PPI::Token::QuoteLike::Readline' => 'operator',
	'PPI::Token::QuoteLike::Regexp'   => 'operator',
	'PPI::Token::QuoteLike::Words'    => 'operator',
);

my %OBVIOUS_CONTENT = (
	'(' => 'operand',
	'{' => 'operand',
	'[' => 'operand',
	';' => 'operand',
	'}' => 'operator',
);


my %USUALLY_FORCES = map { $_ => 1 } qw( sub package use no );

# Try to determine operator/operand context, if possible.
# Returns "operator", "operand", or "" if unknown.
sub _opcontext {
	my $self   = shift;
	my @tokens = $self->_previous_significant_tokens(1);
	my $p0     = $tokens[0];
	return '' if not $p0;
	my $c0 = ref $p0;

	# Map the obvious cases
	return $OBVIOUS_CLASS{$c0}   if defined $OBVIOUS_CLASS{$c0};
	return $OBVIOUS_CONTENT{$p0} if defined $OBVIOUS_CONTENT{$p0};

	# Most of the time after an operator, we are an operand
	return 'operand' if $p0->isa('PPI::Token::Operator');

	# If there's NOTHING, it's operand
	return 'operand' if $p0->content eq '';

	# Otherwise, we don't know
	return '';
}

# Assuming we are currently parsing the word 'x', return true
# if previous tokens imply the x is an operator, false otherwise.
sub _current_x_is_operator {
	my ( $self ) = @_;
	return if !@{$self->{tokens}};

	my ( $prev, $prevprev ) = $self->_previous_significant_tokens(2);
	return if !$prev;

	return !$self->__current_token_is_forced_word if $prev->isa('PPI::Token::Word');

	return (!$prev->isa('PPI::Token::Operator') || $X_CAN_FOLLOW_OPERATOR{$prev})
		&& (!$prev->isa('PPI::Token::Structure') || $X_CAN_FOLLOW_STRUCTURE{$prev})
		&& !$prev->isa('PPI::Token::Label')
	;
}


# Assuming we are at the end of parsing the current token that could be a word,
# a wordlike operator, or a version string, try to determine whether context
# before or after it forces it to be a bareword. This method is only useful
# during tokenization.
sub __current_token_is_forced_word {
	my ( $t, $word ) = @_;

	# Check if forced by preceding tokens.

	my ( $prev, $prevprev ) = $t->_previous_significant_tokens(2);
	if ( !$prev ) {
		pos $t->{line} = $t->{line_cursor};
	}
	else {
		my $content = $prev->{content};

		# We are forced if we are a method name.
		# '->' will always be an operator, so we don't check its type.
		return 1 if $content eq '->';

		# If we are contained in a pair of curly braces, we are probably a
		# forced bareword hash key. '{' is never a word or operator, so we
		# don't check its type.
		pos $t->{line} = $t->{line_cursor};
		return 1 if $content eq '{' and $t->{line} =~ /\G\s*\}/gc;

		# sub, package, use, and no all indicate that what immediately follows
		# is a word, not an operator or (in the case of sub and package) a
		# version string. However, we don't want to be fooled by 'package
		# package v10' or 'use no v10'. We're a forced word unless we're
		# preceded by 'package sub', in which case we're a version string.
		# We also have to make sure that the sub/package/etc doing the forcing
		# is not a method call.
		if ( $USUALLY_FORCES{$content} ) {
			return if defined $word and $word =~ /^v[0-9]+$/ and ( $content eq "use" or $content eq "no" );
			return 1 if not $prevprev;
			return 1 if not $USUALLY_FORCES{$prevprev->content} and $prevprev->content ne '->';
			return;
		}
	}
	# pos on $t->{line} is guaranteed to be set at this point.

	# Check if forced by following tokens.

	# If the word is followed by => it is probably a word, not a regex.
	return 1 if $t->{line} =~ /\G\s*=>/gc;

	# Otherwise we probably aren't forced
	return '';
}

1;

=pod

=head1 NOTES

=head2 How the Tokenizer Works

Understanding the Tokenizer is not for the faint-hearted. It is by far
the most complex and twisty piece of perl I've ever written that is actually
still built properly and isn't a terrible spaghetti-like mess. In fact, you
probably want to skip this section.

But if you really want to understand, well then here goes.

=head2 Source Input and Clean Up

The Tokenizer starts by taking source in a variety of forms, sucking it
all in and merging into one big string, and doing our own internal line
split, using a "universal line separator" which allows the Tokenizer to
take source for any platform (and even supports a few known types of
broken newlines caused by mixed mac/pc/*nix editor screw ups).

The resulting array of lines is used to feed the tokenizer, and is also
accessed directly by the heredoc-logic to do the line-oriented part of
here-doc support.
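As an illustrative sketch (not PPI's actual code; the helper name and the exact regex are assumptions), a "universal line separator" split that keeps each line's own terminator attached might look like:

```perl
use strict;
use warnings;

# Hypothetical sketch of a "universal newline" line split. It accepts
# Unix (\012), DOS (\015\012), classic Mac (\015) and even the broken
# \015\015\012 sequences that mixed-editor files can contain, keeping
# the terminator attached to its line.
sub split_universal {
	my ($source) = @_;
	my @lines = $source =~ /\G([^\015\012]*(?:\015{1,2}\012|\015|\012|\z))/g;
	# Drop the single empty match the regex engine allows at end-of-string.
	return grep { length } @lines;
}

my @lines = split_universal("one\015\012two\012three\015four");
print scalar(@lines), "\n";    # 4 lines despite three newline styles
```

The real Tokenizer performs its split once, up front, so that both the main cursor loop and the here-doc logic can work from the same line array.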
881              
882             =head2 Doing Things the Old Fashioned Way
883              
884             Due to the complexity of perl, and after 2 previously aborted parser
885             attempts, in the end the tokenizer was fashioned around a line-buffered
886             character-by-character method.
887              
888             That is, the Tokenizer pulls and holds a line at a time into a line buffer,
889             and then iterates a cursor along it. At each cursor position, a method is
890             called in whatever token class we are currently in, which will examine the
891             character at the current position, and handle it.
892              
893             As the handler methods in the various token classes are called, they
894             build up an output token array for the source code.
895              
896             Various parts of the Tokenizer use look-ahead, arbitrary-distance
897             look-behind (although currently the maximum is three significant tokens),
898             or both, and various other heuristic guesses.
899              
900             I've been told it is officially termed a I<"backtracking parser
901             with infinite lookaheads">.
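The line-buffered, cursor-driven dispatch described above can be sketched as follows. This is a toy model with made-up class names and a deliberately trivial handler API, not PPI's real one:

```perl
use strict;
use warnings;

# Each "class" supplies a handler that inspects the character at the
# cursor and returns the class the tokenizer should be in afterwards.
my %handlers = (
	whitespace => sub { $_[0] =~ /\d/ ? 'number' : 'whitespace' },
	number     => sub { $_[0] =~ /\d/ ? 'number' : 'whitespace' },
);

# Walk a cursor along the line buffer, dispatching to the handler of
# whatever class we are currently in, and record each transition.
sub classify {
	my ($line) = @_;
	my $class  = 'whitespace';    # the "base" class of a document
	my @events;
	for my $cursor ( 0 .. length($line) - 1 ) {
		my $char = substr $line, $cursor, 1;
		$class = $handlers{$class}->($char);
		push @events, $class;
	}
	return @events;
}

my @e = classify("a 12");
print "@e\n";    # whitespace whitespace number number
```

In the real Tokenizer the handlers also accumulate token content and can hand off to another class's handler for the same character, which is why handler calls do not map one-to-one onto characters.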
902              
903             =head2 State Variables
904              
905             Aside from the current line and the character cursor, the Tokenizer
906             maintains a number of different state variables.
907              
908             =over
909              
910             =item Current Class
911              
912             The Tokenizer maintains the current token class at all times. Much of the
913             time is just going to be the "Whitespace" class, which is what the base of
914             a document is. As the tokenizer executes the various character handlers,
915             the class changes a lot as it moves a long. In fact, in some instances,
916             the character handler may not handle the character directly itself, but
917             rather change the "current class" and then hand off to the character
918             handler for the new class.
919              
920             Because of this, and some other things I'll deal with later, the number of
921             times the character handlers are called does not in fact have a direct
922             relationship to the number of actual characters in the document.
923              
924             =item Current Zone
925              
926             Rather than create a class stack to allow for infinitely nested layers of
927             classes, the Tokenizer recognises just a single layer.
928              
929             To put it a different way, in various parts of the file, the Tokenizer will
930             recognise different "base" or "substrate" classes. When a Token such as a
931             comment or a number is finalised by the tokenizer, it "falls back" to the
932             base state.
933              
934             This allows proper tokenization of special areas such as __DATA__
935             and __END__ blocks, which also contain things like comments and POD,
936             without allowing the creation of any significant Tokens inside these areas.
937              
938             For the main part of a document we use L for this,
939             with the idea being that code is "floating in a sea of whitespace".
940              
941             =item Current Token
942              
943             The final main state variable is the "current token". This is the Token
944             that is currently being built by the Tokenizer. For certain types, it
945             can be manipulated and morphed and change class quite a bit while being
946             assembled, as the Tokenizer's understanding of the token content changes.
947              
948             When the Tokenizer is confident that it has seen the end of the Token, it
949             will be "finalized", which adds it to the output token array and resets
950             the current class to that of the zone that we are currently in.
951              
952             I should also note at this point that the "current token" variable is
953             optional. The Tokenizer is capable of knowing what class it is currently
954             set to, without actually having accumulated any characters in the Token.
955              
956             =back
957              

=head2 Making It Faster

As I'm sure you can imagine, calling several different methods for each
character and running regexes and other complex heuristics made the first
fully working version of the tokenizer extremely slow.

During testing, I created a metric to measure parsing speed called
LPGC, or "lines per gigacycle". A gigacycle is simply a billion CPU
cycles on a typical single-core CPU, and so a Tokenizer running at
"1000 lines per gigacycle" should tokenize around 1200 lines of
code per second when running on a 1200 MHz processor.
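To make the unit concrete, converting LPGC to throughput is just a multiplication (a worked example added here for illustration, not code from the module):

```perl
use strict;
use warnings;

# lines/second = LPGC (lines per billion cycles) * clock speed in GHz,
# since a CPU at N GHz executes N gigacycles per second.
sub lines_per_second {
	my ( $lpgc, $ghz ) = @_;
	return $lpgc * $ghz;
}

print lines_per_second( 1000, 1.2 ), "\n";    # 1200 lines/sec at 1.2 GHz
print lines_per_second( 350,  1.2 ), "\n";    # 420, the first working version
```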

The first working version of the tokenizer ran at only 350 LPGC, so to
tokenize a typical large module such as L<ExtUtils::MakeMaker> took
10-15 seconds. This sluggishness made it impractical for many uses.

So in the current parser, there are multiple layers of optimisation
very carefully built into the basic engine. This has brought the tokenizer
up to a more reasonable 1000 LPGC, at the expense of making the code
quite a bit twistier.

=head2 Making It Faster - Whole Line Classification

The first step in the optimisation process was to add a new handler to
enable several of the more basic classes (whitespace, comments) to be
able to be parsed a line at a time. At the start of each line, a
special optional handler (only supported by a few classes) is called to
check and see if the entire line can be parsed in one go.

This is used mainly to handle things like POD, comments, empty lines,
and a few other minor special cases.
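A hedged sketch of what such a whole-line check might look like. The function name, class names, and regexes here are illustrative assumptions, not PPI's real handlers:

```perl
use strict;
use warnings;

# Returns a class name if the entire line can be consumed in one go,
# or undef if it must fall back to char-by-char parsing.
sub classify_whole_line {
	my ($line) = @_;
	return 'Whitespace' if $line =~ /^\s*$/;    # empty or blank line
	return 'Comment'    if $line =~ /^\s*#/;    # full-line comment
	return 'Pod'        if $line =~ /^=\w+/;    # POD directive
	return undef;                               # needs the slow path
}

print classify_whole_line("# a comment\n"), "\n";    # Comment
```

Only a few classes can make this promise safely, which is why the hook is optional per class.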
989              
990             =head2 Making It Faster - Inlining
991              
992             The second stage of the optimisation involved inlining a small
993             number of critical methods that were repeated an extremely high number
994             of times. Profiling suggested that there were about 1,000,000 individual
995             method calls per gigacycle, and by cutting these by two thirds a significant
996             speed improvement was gained, in the order of about 50%.
997              
998             You may notice that many methods in the C code look
999             very nested and long hand. This is primarily due to this inlining.
1000              
1001             At around this time, some statistics code that existed in the early
1002             versions of the parser was also removed, as it was determined that
1003             it was consuming around 15% of the CPU for the entire parser, while
1004             making the core more complicated.
1005              
1006             A judgment call was made that with the difficulties likely to be
1007             encountered with future planned enhancements, and given the relatively
1008             high cost involved, the statistics features would be removed from the
1009             Tokenizer.

=head2 Making It Faster - Quote Engine

Once inlining had reached diminishing returns, it became obvious from
the profiling results that a huge amount of time was being spent
stepping a char at a time through long, simple and "syntactically boring"
code such as comments and strings.

The existing regex engine was expanded to also encompass quotes and
other quote-like things, and a special abstract base class was added
that provided a number of specialised parsing methods that would "scan
ahead", looking out ahead to find the end of a string, and updating
the cursor to leave it in a valid position for the next call.

This is also the point at which the number of character handler calls began
to greatly differ from the number of characters. But it has been done
in a way that allows the parser to retain the power of the original
version at the critical points, while skipping through the "boring bits"
as needed for additional speed.

The addition of this feature allowed the tokenizer to exceed 1000 LPGC
for the first time.
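The "scan ahead" idea can be sketched like this. This is a simplified, hypothetical helper; PPI's real Quote Engine methods have different names and also handle strings that continue across lines:

```perl
use strict;
use warnings;

# Given a line and a cursor sitting just past an opening quote, jump
# straight to the closing delimiter in one regex step instead of
# stepping a character at a time. Returns the string body and the new
# cursor position, or undef if the string is not closed on this line.
sub scan_for_unescaped {
	my ( $line, $cursor, $delim ) = @_;
	pos($line) = $cursor;
	# Consume backslash-escaped pairs or non-delimiter characters,
	# then the unescaped closing delimiter itself.
	if ( $line =~ /\G((?:\\.|[^\\\Q$delim\E])*)\Q$delim\E/gc ) {
		return ( $1, pos($line) );    # cursor now sits after the close quote
	}
	return ( undef, $cursor );        # string continues on the next line
}

my ( $body, $pos ) = scan_for_unescaped( q{my $s = 'it\'s here'; ...}, 9, "'" );
print "$body\n";
```

Leaving the cursor in a valid position for the next handler call is the crucial part: the char-by-char machinery simply resumes from wherever the scan stopped.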

=head2 Making It Faster - The "Complete" Mechanism

As it became evident that great speed increases were available by using
this "skipping ahead" mechanism, a new handler method was added that
explicitly handles the parsing of an entire token, where the structure
of the token is relatively simple. Tokens such as symbols fit this case,
as once we are past the initial sigil and word char, we know that we
can skip ahead and "complete" the rest of the token much more easily.

A number of these have been added for most or possibly all of the common
cases, with most of these "complete" handlers implemented using regular
expressions.

In fact, so many have been added that at this point, you could arguably
reclassify the tokenizer as a "hybrid regex, char-by-char heuristic
tokenizer". More tokens are now consumed in "complete" methods in a
typical program than are handled by the normal char-by-char methods.
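As a sketch of the idea (the function name and regex are illustrative assumptions, not PPI's real handler), a "complete" handler for a symbol might consume the whole token in one regex step:

```perl
use strict;
use warnings;

# Once the sigil and the first word character have been seen, the rest
# of a symbol token has a simple enough structure to match all at once.
sub complete_symbol {
	my ( $line, $cursor ) = @_;
	pos($line) = $cursor;
	$line =~ /\G([\$\@\%\&\*][\w:]+)/gc or return undef;
	return $1;    # the entire symbol token, matched in one step
}

print complete_symbol( 'my $foo_bar = 1;', 3 ), "\n";    # $foo_bar
```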

Many of these complete-handlers were implemented during the writing
of the Lexer, and this has allowed the full parser to maintain around
1000 LPGC despite the increasing weight of the Lexer.

=head2 Making It Faster - Porting To C (In Progress)

While it would be extraordinarily difficult to port all of the Tokenizer
to C, work has started on a L<PPI::XS> "accelerator" package which acts as
a separate and automatically-detected add-on to the main PPI package.

L<PPI::XS> implements faster versions of a variety of functions scattered
over the entire PPI codebase, from the Tokenizer Core, Quote Engine, and
various other places, and implements them identically in XS/C.

In particular, the skip-ahead methods from the Quote Engine would appear
to be extremely amenable to being done in C, and a number of other
functions could be cherry-picked one at a time and implemented in C.

Each method is heavily tested to ensure that the functionality is
identical, and a versioning mechanism is included to ensure that if a
function gets out of sync, L<PPI::XS> will degrade gracefully and just
not replace that single method.

=head1 TO DO

- Add an option to reset or seek the token stream...

- Implement more Tokenizer functions in L<PPI::XS>

=head1 SUPPORT

See the L<support section|PPI/SUPPORT> in the main module.

=head1 AUTHOR

Adam Kennedy E<lt>adamk@cpan.orgE<gt>

=head1 COPYRIGHT

Copyright 2001 - 2011 Adam Kennedy.

This program is free software; you can redistribute
it and/or modify it under the same terms as Perl itself.

The full text of the license can be found in the
LICENSE file included with this module.

=cut