27 March 2000 Release 2.22 Notes for New Users of PCCTS Version 1.33MR22
with input:
else stop
When DLG gets to the end of "else" it realizes that the space will allow it to match a longer string than "else" by
itself. So DLG accepts the spaces. Everything is fine until DLG gets to the initial "s" in "stop". It then realizes it has
no match, but it can't backtrack. It passes back an error status to ANTLR, which (normally) prints out something
like:
invalid token near line 1 (text was 'else ') ...
There is an "extra" space between the "else" and the closing single quote mark.
This problem is not detected by the DLG option -Wambiguity.
The section "Lexical Lookahead" has some additional information.
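The failure can be reproduced with a toy scanner that, like the behavior described above, never backs up to an earlier accepting state. This is only a sketch: the two patterns ("else" and "else +if"), the state numbering, and the names step, ScanResult and scanNoBackup are assumptions made for illustration and are not taken from DLG.

```cpp
#include <string>

// Toy DFA for two hypothetical patterns: "else" and "else +if".
// States 0..3 spell "else"; state 4 accepts "else"; state 5 is the run of
// spaces; state 6 has seen 'i'; state 7 accepts "else +if".
static int step(int state, char c) {
    static const char* word = "else";
    if (state < 4) return c == word[state] ? state + 1 : -1;
    if (state == 4 || state == 5) {
        if (c == ' ') return 5;
        if (state == 5 && c == 'i') return 6;
        return -1;
    }
    if (state == 6) return c == 'f' ? 7 : -1;
    return -1;  // state 7 has no outgoing transitions
}

struct ScanResult {
    bool ok;           // true if the scanner stopped in an accepting state
    std::string text;  // everything consumed before it stopped
};

// Consume characters while a transition exists. There is no backup: if the
// scanner stops in a non-accepting state, the earlier accepting state
// (after "else") is not recovered and the scan fails.
ScanResult scanNoBackup(const std::string& input) {
    int state = 0;
    std::string text;
    for (char c : input) {
        int next = step(state, c);
        if (next < 0) break;
        state = next;
        text += c;
    }
    bool accepting = (state == 4 || state == 7);
    return {accepting, text};
}
```

With the input "else stop" this toy stops in the space-run state and reports failure with the text "else " (note the trailing space), which mirrors the error message shown above. With "else  if x" it accepts "else  if".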
#27. The lexical routines mode(), skip(), and more() are not complicated!
All they do is set status bits in a structure owned by the lexical analyzer and then return immediately. Thus it is OK
to call these routines anywhere from within a lexical action. You can even call them from within a subroutine called
from a lexical action routine.
It is meaningless to call both more() and skip() in the same action.
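To illustrate why these routines are safe to call from anywhere, here is a minimal model of a lexer whose mode(), skip() and more() only record a request. The struct StatusLexer, its field names, and the helper enterComment are hypothetical; this is not the DLG source, only a sketch of the "set a status bit and return" idea.

```cpp
// Hypothetical model: each routine records a request and returns at once.
// The scanner would inspect these fields only after the action has finished.
struct StatusLexer {
    bool skipRequested = false;
    bool moreRequested = false;
    int  currentMode   = 0;

    void skip()      { skipRequested = true; }
    void more()      { moreRequested = true; }
    void mode(int m) { currentMode = m; }
};

// A subroutine called from a lexical action. Because the routines only set
// state, it makes no difference that they are called one level down.
void enterComment(StatusLexer& lex, int commentMode) {
    lex.mode(commentMode);
    lex.skip();
}
```

In this model calling both more() and skip() would simply set two contradictory requests, which is one way to see why the combination is meaningless.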
#28. lextext() includes strings accumulated via more() - begexpr()/endexpr() refer only to the last matched RE
#29. Use "if (_lextext != _begexpr) {...}" to test for RE being appended to lextext using more()
To track the line number of the start of a lexical element that may span several lines I use the following test:
if (_lextext == _begexpr) {startingLine=_line;} // user-defined var
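The effect of that test can be seen in a small stand-alone model. The struct LineTrackingLexer below is hypothetical: it stands in for the lexer's buffer, with begexpr kept as an offset into lextext rather than as a pointer, and match() playing the role of one RE match followed by an action that may call more().

```cpp
#include <cstddef>
#include <string>

// Hypothetical model of text accumulated across more() calls.
struct LineTrackingLexer {
    std::string lextext;         // text accumulated so far
    std::size_t begexpr = 0;     // offset where the last matched RE begins
    int line = 1;                // current line number
    int startingLine = 0;        // user-defined: line where the element began
    bool moreRequested = false;  // did the previous action call more()?

    void match(const std::string& piece, bool callMore) {
        if (!moreRequested) lextext.clear();  // start a fresh element
        begexpr = lextext.size();
        lextext += piece;
        // Same idea as (_lextext == _begexpr): nothing was accumulated
        // before this match, so this is the start of the element.
        if (begexpr == 0) startingLine = line;
        for (char c : piece) {
            if (c == '\n') ++line;
        }
        moreRequested = callMore;
    }
};
```

Feeding it a three-piece comment that begins on line 1 leaves startingLine at 1 even though line has advanced to 3 by the time the last piece is matched.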
#30. #token actions can access protected variables of the DLG base class
#31. When lookahead will break semantic routines in #token actions, consider using semantic predicates
In early versions of PCCTS it was common to change the token code based on semantic routines in the #token
actions.
Old style:
#token TypedefName
#token ID "[a-z A-Z]*"
<<if (isTypedefName(lextext)) return TypedefName;>>
New Style C mode:
#token ID "[a-z A-Z]*"
typedefName : <<isTypedefName(LA(1)->getText())>>? ID;
The old technique is appropriate for making lexical decisions based on the input: for instance, treating a number
appearing in columns 1 through 5 as a statement label rather than a number. The new style is important because of
the buffer between the lexer and parser introduced by large amounts of lookahead, especially syntactic predicates.
For instance a declaration of a type may not have been entered into the symbol table by the parser by the time the
lexer encounters a declaration of a variable of that type. An extreme case is infinite lookahead in C mode: parsing
doesn't even begin until the entire input has been processed by the lexer. See Item #138 for an extended discussion
of semantic predicates. Example #10 shows how some semantic decisions can be moved from the lexer to the token
buffer.
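The timing problem can be shown with a small model in which the lexer tokenizes the whole input before the parser sees any of it, as in the infinite lookahead case. Everything here is an assumption made for illustration: the Token struct, the whitespace tokenizer lexAll, the rule that the second word after "typedef" names a new type, and the function classifyLast. It is not PCCTS code.

```cpp
#include <cstddef>
#include <set>
#include <sstream>
#include <string>
#include <vector>

enum Kind { ID, TypedefName };

struct Token {
    std::string text;
    Kind kind;  // the kind assigned at lex time (old style)
};

// Old style: the lexer consults the symbol table while tokenizing.
std::vector<Token> lexAll(const std::string& input,
                          const std::set<std::string>& table) {
    std::vector<Token> out;
    std::istringstream in(input);
    std::string w;
    while (in >> w) out.push_back({w, table.count(w) ? TypedefName : ID});
    return out;
}

// Returns the kind of the last token. If decideInLexer is true the kind
// fixed at lex time is used; otherwise the table is consulted when the
// parser reaches the token (the semantic predicate style).
Kind classifyLast(const std::string& input, bool decideInLexer) {
    std::set<std::string> table;
    // The lexer runs to the end of input first; the table is still empty.
    std::vector<Token> toks = lexAll(input, table);
    Kind last = ID;
    for (std::size_t i = 0; i < toks.size(); ++i) {
        last = decideInLexer
                   ? toks[i].kind
                   : (table.count(toks[i].text) ? TypedefName : ID);
        // The "parser" records a typedef name once it has passed it.
        if (i >= 2 && toks[i - 2].text == "typedef") table.insert(toks[i].text);
    }
    return last;
}
```

For the input "typedef int T ; T" the final T is classified as ID when the decision is made in the lexer, and as TypedefName when the decision is deferred to parse time.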
#32. For 8 bit characters use flex or in DLG make char variables unsigned (g++ option -funsigned-char)
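The usual reason 8 bit characters cause trouble is that plain char may be signed, so a character code of 128 or above becomes a negative value when it is widened to int and used as a table index. The sketch below shows the general C++ behavior only; the function names tableIndex and naiveIndex are made up for the example, and it does not claim to show how DLG itself indexes its tables.

```cpp
// Convert through unsigned char so that codes 128..255 stay in 0..255.
int tableIndex(char c) {
    return static_cast<unsigned char>(c);
}

// Direct conversion: on a compiler where plain char is signed this yields
// a negative number for codes 128..255; with -funsigned-char it does not.
int naiveIndex(char c) {
    return c;
}
```

tableIndex('\xE9') is 233 regardless of whether plain char is signed, while the result of naiveIndex('\xE9') depends on the compiler setting.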
#33. The maximum size of a DLG token is set by an optional argument of the ctor DLGLexer() - default is 2000
The maximum size of a character string stored in an ANTLR Token is independent of the maximum size of a DLG
token. See Item #60.
#34. If a token is recognized using more() and its #lexclass ignores end-of-file then the very last token will be lost
When a token is recognized in several pieces using more() it may happen that an end-of-file is detected before the
entire token is recognized. Without treatment of this special case the portions of the token already recognized will
be ignored and the error of a lexically incomplete token will be ignored. Since all appearances of the regular
expression "@", regardless of #lexclass, are mapped to the same #token value, proper handling requires some work-