27 March 2000 Release 2.22 Notes for New Users of PCCTS Version 1.33MR22
Alexey Demakov has supplied this class to provide more than one character of lookahead for the input stream. The
class is located in pccts/h/BufFileInput.*.
#44. One extra character of lookahead is available to the #token action routine in ch (except in interactive mode)
In interactive mode (DLG switch -i is not supported in C++ mode) DLG fetches a character only when it needs it to determine if the end of a token has been reached. In non-interactive mode the content of ch is always valid. The debug code described in Item #149 can help debug problems with interactive lookahead.
For the remainder of this discussion assume that DLG is in non-interactive mode.
Consider the problem of distinguishing floating point numbers from range expressions such as those used in Pascal:
range: 1..23 float: 1.23
As a first effort one might try:
#token Int "[0-9]+"
#token Range ".."
#token Float "[0-9]+.[0-9]*"
The problem is that "1..23" looks like the floating point number "1." with an illegal "." at the end. DLG always takes the longest matching string, so "1." will always look more appetizing than "1". What one needs to do is to look at the character following "1." to see if it is another ".", and if it is to assume that it is a range expression. The flex lexer has trailing context, but DLG doesn't - except for the single character in ch.
A solution in DLG is to write the #token Float action routine to look at what's been accepted, and at ch, in order to decide what to do:
#token Float "[0-9]*.[0-9]*"
    <<if (*endexpr() == '.' &&      /* might use more complex test */
          ch == '.') {
        mode(LC_Range);             /* treat it like a range expression */
        return Int;                 /* looks like an int followed by ".." */
      };
    >>
#lexclass LC_Range
#token Range "." <<mode(START);>> // consume second "." of range
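The decision made by the Float action can be seen in isolation in a small hand-written scanner. This is only a sketch, not DLG-generated code: the names scan, Token and TokenKind are invented for illustration, and the scanner emits Int and Range directly instead of switching lexical classes as the DLG version does. The one character examined past the "." plays the role of ch.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

enum TokenKind { Int, Float, Range };

struct Token {
    TokenKind kind;
    std::string text;
};

// Hypothetical helper: true if position i holds a decimal digit.
static bool digitAt(const std::string& in, std::size_t i) {
    return i < in.size() && std::isdigit(static_cast<unsigned char>(in[i]));
}

// Hypothetical stand-alone scanner (not part of PCCTS). After a run of
// digits followed by '.', it peeks at exactly one more character before
// deciding between a float and an int followed by "..".
std::vector<Token> scan(const std::string& in) {
    std::vector<Token> out;
    std::size_t i = 0;
    while (i < in.size()) {
        if (digitAt(in, i)) {
            std::size_t start = i;
            while (digitAt(in, i)) ++i;
            if (i < in.size() && in[i] == '.') {
                // One character of lookahead past the '.'.
                bool nextIsDot = (i + 1 < in.size() && in[i + 1] == '.');
                if (nextIsDot) {
                    out.push_back({Int, in.substr(start, i - start)});
                    out.push_back({Range, ".."});
                    i += 2;                     // consume both dots
                    continue;
                }
                ++i;                            // consume the '.'
                while (digitAt(in, i)) ++i;     // optional fraction digits
                out.push_back({Float, in.substr(start, i - start)});
            } else {
                out.push_back({Int, in.substr(start, i - start)});
            }
        } else if (in.compare(i, 2, "..") == 0) {
            out.push_back({Range, ".."});
            i += 2;
        } else {
            ++i;                                // skip anything else
        }
    }
    return out;
}
```

With this sketch, "1..23" is scanned as Int, Range, Int and "1.23" as a single Float.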
#45. There is no easy way in DLG to distinguish integer "1" from floating point "1." when "1.and.2" is valid
This differs from Item #44 in that two characters of lookahead are required before a decision can be made on whether the "." is part of ".and." or it is part of a floating point number. This is a frequent problem which can only be handled by using a more powerful lexer such as flex.
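To make the two-character requirement concrete, here is a sketch of the test a hand-written scanner might apply once it has read the digits. The function name digitsBeginFloat is invented for illustration and is not part of DLG or flex. It assumes that a dotted operator such as ".and." always has a letter right after the first ".", and that a floating point number never does.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper, not part of DLG: i is the index just past a run of
// digits. The decision needs two characters: in[i] (is it a '.'?) and
// in[i + 1] (what follows the '.').
bool digitsBeginFloat(const std::string& in, std::size_t i) {
    if (i >= in.size() || in[i] != '.') {
        return false;                   // no '.', so a plain integer
    }
    if (i + 1 >= in.size()) {
        return true;                    // "1." at end of input
    }
    unsigned char next = static_cast<unsigned char>(in[i + 1]);
    // Assumption: a letter after '.' means an operator like ".and."
    return std::isalpha(next) == 0;
}
```

For "1.and.2" with i == 1 the function returns false, so "1" is an integer; for "1.5" it returns true.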
#46. For lex operators "^" and "$" (anchor pattern to start/end of line) use flex - don't bother with dlg
Line and Column Information
Most names in this section refer to members of class DLGLexerBase or DLGLexer.
Before C++ mode the proper handling of line and column information was a large part of these notes.
#47. If you want column information for error messages (or other reasons) use C++ mode
#48. If you want accurate line information even with many characters of lookahead use C++ mode
#49. Call trackColumns() to request that DLG maintain column information
#50. To report column information in syntax error messages override ANTLRParser::syn() - See Example #5
#51. Call newline() and then set_endcol(0) in the #token action when a newline is encountered
#52. Adjusting column position for tab characters
Assume that tabs are set every eight characters starting with column 9.
Computing the column position will be simple if you match tab characters in isolation:
#token Tab "\t" <<_endcol=((_endcol-1) & ~7) + 8;>>
This would be off by 1, except that DLG, on return from the #token action, computes the next column using: