27 March 2000 Release 2.22 Notes for New Users of PCCTS Version 1.33MR22
with input:
else stop
When DLG gets to the end of "else" it realizes that the space will allow it to match a longer string than "else" by
itself, so DLG accepts the spaces. Everything is fine until DLG gets to the initial "s" in "stop". It then realizes it has
no match, but it can't backtrack. It passes back an error status to ANTLR, which (normally) prints out something
like:
invalid token near line 1 (text was 'else ') ...
There is an "extra" space between the "else" and the closing single quote mark.
This problem is not detected by the DLG option -Wambiguity.
The section "Lexical Lookahead" has some additional information.
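The failure described above can be imitated with a small hand-written scanner. This is only a sketch: it assumes two hypothetical competing tokens, "else" and "else[ ]+if", and the names ScanResult and scanNoBackup are invented for illustration. It is not DLG's generated automaton, but like DLG it never backs up to the last accepting state once it has moved past it.

```cpp
#include <cstddef>
#include <string>

// Result of one scan attempt.
struct ScanResult {
    bool ok;           // true if a token was recognized
    std::string text;  // characters consumed up to the accept or failure point
};

// Hand-coded scanner for two hypothetical tokens: "else" and "else[ ]+if".
ScanResult scanNoBackup(const std::string& in) {
    const std::string kw = "else";
    std::size_t i = 0;
    // Match the literal "else".
    for (; i < kw.size(); ++i) {
        if (i >= in.size() || in[i] != kw[i]) return {false, in.substr(0, i)};
    }
    // "else" alone is acceptable, but a following space promises a longer match.
    if (i >= in.size() || in[i] != ' ') return {true, in.substr(0, i)};
    while (i < in.size() && in[i] == ' ') ++i;  // accept the spaces
    // Past the spaces only "if" completes a token.
    if (in.compare(i, 2, "if") == 0) return {true, in.substr(0, i + 2)};
    // No match and no backing up: the reported text includes the spaces.
    return {false, in.substr(0, i)};
}
```

With the input "else stop" the sketch fails after consuming "else " (including the trailing space), which corresponds to the text shown in the error message.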
#27.
The lexical routines mode(), skip(), and more() are not complicated!
All they do is set status bits in a structure owned by the lexical analyzer and then return immediately. Thus it is OK
to call these routines anywhere from within a lexical action. You can even call them from within a subroutine called
from a lexical action routine.
It is meaningless to call both more() and skip() in the same action.
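As a rough picture of why these calls are safe anywhere inside an action, the sketch below models them as plain flag setters. The types LexStatus and SketchLexer, the field names, and the helper enterComment() are all hypothetical; the real structure lives inside the DLG support code and is not reproduced here.

```cpp
// Hypothetical stand-in for the lexer's status structure; not the DLG source.
struct LexStatus {
    bool skipToken = false;  // set by skip(): discard the current match
    bool moreText = false;   // set by more(): keep accumulating text
    int lexMode = 0;         // set by mode(): which automaton to use next
};

class SketchLexer {
public:
    // Each routine only records a request and returns immediately.
    void skip() { status.skipToken = true; }
    void more() { status.moreText = true; }
    void mode(int m) { status.lexMode = m; }
    const LexStatus& current() const { return status; }

private:
    LexStatus status;
};

// A subroutine called from a lexical action. Nothing happens here beyond
// setting flags, so calling it from a nested routine is harmless.
void enterComment(SketchLexer& lx, int commentMode) {
    lx.mode(commentMode);
    lx.skip();
}
```

The lexer would inspect the flags only after the action returns, which is why the point at which they are set inside the action does not matter.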
#28.
lextext() includes strings accumulated via more() - begexpr()/endexpr() refer only to the last matched RE
#29.
Use "if (_lextext != _begexpr) {...}" to test for RE being appended to lextext using more()
To track the line number of the start of a lexical element that may span several lines I use the following test:
if (_lextext == _begexpr) {startingLine=_line;} // user-defined var
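The relationship between lextext, begexpr, and more() in Items #28 and #29 can be sketched with a small model. LexBuffer, match(), and startingLineOf() are hypothetical names, and offsets into a std::string stand in for the _lextext and _begexpr pointers; this is an illustration of the bookkeeping, not DLG code.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical model of the lexer's text buffer.
struct LexBuffer {
    std::string text;         // text accumulated for the current token
    std::size_t lextext = 0;  // offset where the accumulated token starts
    std::size_t begexpr = 0;  // offset where the last matched RE starts
    int line = 1;             // current line number
    bool pendingMore = false; // true if the previous action called more()

    // Record one matched piece; callsMore models the action calling more().
    void match(const std::string& piece, bool callsMore) {
        if (!pendingMore) text.clear();  // previous token was complete
        begexpr = text.size();           // the new RE starts after any old text
        text += piece;
        pendingMore = callsMore;
    }
    void newline() { ++line; }
};

// Feed the pieces of one token and return the line on which it started.
int startingLineOf(LexBuffer& lb, const std::vector<std::string>& pieces) {
    int startingLine = 0;
    for (std::size_t i = 0; i < pieces.size(); ++i) {
        bool last = (i + 1 == pieces.size());
        lb.match(pieces[i], !last);
        // Same test as in the text: true only for the first piece.
        if (lb.lextext == lb.begexpr) startingLine = lb.line;
        for (char c : pieces[i]) {
            if (c == '\n') lb.newline();  // the action counts line breaks
        }
    }
    return startingLine;
}
```

For a quoted string recognized in four pieces across three lines, the test is true only for the opening quote, so the recorded line is the one on which the string began.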
#30.
#token actions can access protected variables of the DLG base class
#31.
When lookahead will break semantic routines in #token actions, consider using semantic predicates
In early versions of PCCTS it was common to change the token code based on semantic routines in the #token
actions.
Old style:
#token TypedefName
#token ID "[a-z A-Z]*"
<<if (isTypedefName(lextext)) return TypedefName;>>
New style, C++ mode:
#token ID "[a-z A-Z]*"
typedefName : <<isTypedefName(LT(1)->getText())>>? ID;
The old technique is appropriate for making lexical decisions based on the input: for instance, treating a number
appearing in columns 1 through 5 as a statement label rather than a number. The new style is important because of
the buffer between the lexer and parser introduced by large amounts of lookahead, especially syntactic predicates.
For instance, a declaration of a type may not have been entered into the symbol table by the parser by the time the
lexer encounters a declaration of a variable of that type. An extreme case is infinite lookahead in C mode: parsing
doesn't even begin until the entire input has been processed by the lexer. See Item #138 for an extended discussion
of semantic predicates. Example #10 shows how some semantic decisions can be moved from the lexer to the token
buffer.
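The timing problem can be shown with a toy model. The sketch below is not PCCTS code: the input is reduced to a list of words, "typedef" followed by a word stands for a type declaration, and both function names are invented. The first function classifies every word before any parsing has happened (the extreme case of a fully buffered lexer); the second makes the same test at the moment the parser reaches each word, as a semantic predicate would.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Old style with a fully buffered lexer: every word is classified before
// the parser has recorded a single typedef, so the symbol table is empty.
int typeNamesSeenByEagerLexer(const std::vector<std::string>& words) {
    std::set<std::string> symtab;  // still empty: parsing has not begun
    int n = 0;
    for (const std::string& w : words) {
        if (symtab.count(w) > 0) ++n;
    }
    return n;
}

// New style: the test runs at parse time, after earlier declarations
// have already been entered into the symbol table.
int typeNamesSeenByPredicate(const std::vector<std::string>& words) {
    std::set<std::string> symtab;
    int n = 0;
    for (std::size_t i = 0; i < words.size(); ++i) {
        if (words[i] == "typedef" && i + 1 < words.size()) {
            symtab.insert(words[++i]);  // parser action records the type name
        } else if (symtab.count(words[i]) > 0) {
            ++n;  // predicate succeeds: this ID is a typedef name
        }
    }
    return n;
}
```

For the word list typedef T T x, the eager lexer recognizes no type names, while the parse-time test recognizes the second T.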
#32.
For 8 bit characters use flex or in DLG make char variables unsigned (g++ option -funsigned-char)
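The underlying issue is that a plain char may be signed, so a character code above 127 can become a negative table index. The sketch below shows an explicit cast as an alternative to the compiler option. The function name tableIndex is hypothetical and the cast is a general C++ technique, not something DLG does for you.

```cpp
// Convert a character to an index suitable for a 256-entry table.
// Going through unsigned char gives 0..255 whether or not plain char
// is signed on the platform.
int tableIndex(char c) {
    return static_cast<unsigned char>(c);
}
```

Compiling with -funsigned-char has the same effect on plain char everywhere in the program, which is why the option is the simpler fix for generated code you do not want to edit.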
#33.
The maximum size of a DLG token is set by an optional argument of the ctor DLGLexer() - default is 2000
The maximum size of a character string stored in an ANTLRToken is independent of the maximum size of a DLG
token. See Item #60.
#34.
If a token is recognized using more() and its #lexclass ignores end-of-file then the very last token will be lost
When a token is recognized in several pieces using more() it may happen that an end-of-file is detected before the
entire token is recognized. Without treatment of this special case the portions of the token already recognized will
be ignored and the error of a lexically incomplete token will be ignored. Since all appearances of the regular
expression "@", regardless of #lexclass, are mapped to the same #token value, proper handling requires some work-