27 March 2000 Release 2.22 Notes for New Users of PCCTS Version 1.33MR22
where the user meant it to be. Since it is okay to specify a #lexclass in several pieces it might be a good idea when
using #lexclass to place "#lexclass START" just before the first rule - then any inline definitions of tokens will be
placed in the START #lexclass automatically:
#lexclass START
...
#lexclass COMMENT
...
#lexclass START
#40.
Use a stack of #lexclass modes in order to emulate lexical subroutines
Consider a grammar in which lexical elements have internal structure. An example of this is C strings and character
literals which may contain elements like:
    escaped characters    \" and \'
    symbolic codes        \t
    numbers               \xff \200 \0
Rather than implementing a separate #lexclass to handle these sequences for both character literals and string literals
it would be possible to have a single #lexclass which would handle both. To implement such a scheme one needs
something like a subroutine stack to remember the previous #lexclass. See Example #9 for a set of such routines.
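The idea of a subroutine stack for #lexclass modes can be sketched in a few lines of C++. This is not the code of Example #9; it is a minimal stand-alone illustration. The class ModeStackLexer, the mode constants and the depth() helper are hypothetical names chosen for this sketch, and the class only tracks the current mode rather than scanning any input.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical #lexclass identifiers. In a real grammar DLG generates
// one integer constant per #lexclass; these values are made up.
enum LexMode { START = 0, STRING_LIT = 1, CHAR_LIT = 2, ESCAPE = 3 };

// Stand-in for a scanner: it only remembers which mode is active.
class ModeStackLexer {
public:
    ModeStackLexer() : current_(START) {}

    // Plain mode switch: nothing is remembered.
    void mode(int m) { current_ = m; }

    // "Call" a #lexclass: save the current mode, then switch.
    void pushMode(int newMode) {
        saved_.push_back(current_);
        current_ = newMode;
    }

    // "Return" from a #lexclass: restore the most recently saved mode.
    void popMode() {
        assert(!saved_.empty() && "popMode() without matching pushMode()");
        current_ = saved_.back();
        saved_.pop_back();
    }

    int currentMode() const { return current_; }
    std::size_t depth() const { return saved_.size(); }

private:
    int current_;
    std::vector<int> saved_;
};
```

With such a stack, the #token action for a backslash in either STRING_LIT or CHAR_LIT would call pushMode(ESCAPE), and the action that completes the escape sequence would call popMode(), so one ESCAPE #lexclass can serve both kinds of literal.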
#41.
Sometimes a stack of #lexclass modes isn't enough
Consider a log file consisting of clauses, each of which has its own #lexclass and in which a given word is reserved
in some clauses and not others:
#1;1-JAN-94 01:23:34;enable;forge bellows alarm;move to station B;
#2;1-JAN-94 08:01:56;operator;john bellows;shift change at 08:00;
#3;1-JAN-94 09:10:11;move;old pos=5.0 new pos=6.0;operator request;
#4;1-JAN-94 10:11:12;alarm;bellows;2-JAN-94 00:00:01;
If the item is terminated by a separator then there is a problem because the separator will be consumed in the
recognition of the most nested item - with nothing left over to be consumed by other elements which end at the
separator. The problem appears when it is necessary to leave a #lexclass and return more than one level. To be
more specific, a #token action can only be executed when one or more characters are consumed - so to return
through three levels of #lexclass calls would appear to require the consumption of at least three characters. In the
case of balanced constructs like "..." and '...' this is not a problem since the terminating character can be used
to trigger the #token action. However, if the scan is terminated by a separator such as the semi-colon above (";"),
one cannot use the same technique. Once the semi-colon is consumed it is unavailable for the other #lexclass
routines on the stack to see.
One solution is to allow the user to specify (during the call to pushMode) a "lookahead" routine to be called when
the corresponding element of the mode stack is popped. At that point the "lookahead" routine can examine ch to
determine whether it also wants to pop the stack, and so on up the mode stack. The consumption of a single
character can result in popping multiple modes from the mode stack based on a single character of lookahead.
For anything more complicated than this you might as well write a second parser just to handle the so-called
lexical elements.
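The lookahead-on-pop scheme can be sketched as follows. This is a hedged illustration, not the PCCTS implementation: the class LookaheadLexer, the LookaheadFn signature and the routine popOnSeparator are hypothetical, and the public member ch is assumed to play the role of the scanner's one character of lookahead (the next character not yet consumed).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

class LookaheadLexer;

// Hypothetical signature for the "lookahead" routine described above.
typedef void (*LookaheadFn)(LookaheadLexer&);

class LookaheadLexer {
public:
    LookaheadLexer() : ch(0), current_(0) {}

    int ch;  // assumed: the next, not yet consumed, input character

    // Save the current mode together with the routine to run when
    // this stack element is popped, then switch to newMode.
    void pushMode(int newMode, LookaheadFn onPop = nullptr) {
        Saved s = { current_, onPop };
        saved_.push_back(s);
        current_ = newMode;
    }

    // Restore the saved mode, then let the routine registered at push
    // time inspect ch. The routine may call popMode() again.
    void popMode() {
        assert(!saved_.empty() && "popMode() without matching pushMode()");
        Saved s = saved_.back();
        saved_.pop_back();
        current_ = s.mode;
        if (s.onPop) s.onPop(*this);
    }

    int currentMode() const { return current_; }
    std::size_t depth() const { return saved_.size(); }

private:
    struct Saved { int mode; LookaheadFn onPop; };
    int current_;
    std::vector<Saved> saved_;
};

// Example lookahead routine: keep returning while the next character
// is the separator and there is still something to return to.
void popOnSeparator(LookaheadLexer& lx) {
    if (lx.ch == ';' && lx.depth() > 0) lx.popMode();
}
```

If three nested modes were each entered with pushMode(m, popOnSeparator), a single popMode() issued while ch is ';' unwinds all three levels, whereas with any other lookahead character only the innermost mode is left.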
Continuing with the example of the log file (above): each statement type has its fields in a specific order. When the
statement type is recognized, a pointer is set to a list of #lexclasses arranged in the same order as the remaining
fields of that kind of statement. An action attached to every #token which recognizes a semi-colon (";") advances
the pointer in the list of #lexclasses and then changes the #lexclass by calling mode() to set the #lexclass for the
next field of the statement.
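The table-driven approach for the log file might look like the sketch below. All names here are assumptions made for illustration: the FieldMode constants stand for #lexclasses that a real grammar would define, the field lists per statement type are guessed from the four sample log lines, and FieldSequencer is a stand-in that records the current mode instead of calling the scanner's mode().

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical #lexclasses, one per kind of field.
enum FieldMode { HEADER, FREE_TEXT, PERSON_NAME, POSITIONS, DEVICE, TIMESTAMP };

class FieldSequencer {
public:
    FieldSequencer() : remaining_(nullptr), next_(0), current_(HEADER) {
        // Field order after the statement keyword (guessed from the samples).
        table_["enable"]   = {DEVICE, FREE_TEXT};
        table_["operator"] = {PERSON_NAME, FREE_TEXT};
        table_["move"]     = {POSITIONS, FREE_TEXT};
        table_["alarm"]    = {DEVICE, TIMESTAMP};
    }

    // Action for the statement-type token: point at that statement's list.
    bool statementType(const std::string& keyword) {
        std::map<std::string, std::vector<FieldMode> >::const_iterator it =
            table_.find(keyword);
        if (it == table_.end()) return false;
        remaining_ = &it->second;
        next_ = 0;
        return true;
    }

    // Action for every ";" token: advance the pointer and switch mode.
    // When the list is exhausted, go back to scanning a statement header.
    void semicolon() {
        if (remaining_ != nullptr && next_ < remaining_->size()) {
            mode((*remaining_)[next_++]);
        } else {
            remaining_ = nullptr;
            mode(HEADER);
        }
    }

    void mode(FieldMode m) { current_ = m; }  // stand-in for the scanner's mode()
    FieldMode currentMode() const { return current_; }

private:
    std::map<std::string, std::vector<FieldMode> > table_;
    const std::vector<FieldMode>* remaining_;
    std::size_t next_;
    FieldMode current_;
};
```

For the fourth sample line, recognizing "alarm" selects the list DEVICE, TIMESTAMP; the following three semi-colons then switch the mode to DEVICE, to TIMESTAMP, and back to HEADER, so "bellows" is scanned as a device name there but as part of a person's name in the "operator" statement.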
Lexical Lookahead
#42.
Vern Paxson's flex has more powerful features for lookahead than dlg
Flex is a superset of lex. For an example of how to use flex with ANTLR in C++ mode see Example #14. For C
mode download http://www.polhode.com/NOTES.flex.
#43.
Extra lookahead is available from class BufFileInput (subclass of DLGInputStream)