27 March 2000 Release 2.22 Notes for New Users of PCCTS Version 1.33MR22
#12.
Be consistent with in-line token definitions: "&&" will not be assigned the same token number as "\&\&"
#13.
Duplicate definition of a #token name is not reported if there are no actions attached
ANTLR will simply use the later definition and forget the earlier one. Using the ANTLR -w2 option does not help
#14.
Use ANTLR option -info o to detect orphan rules when ambiguities are reported
#15.
LT(i) and LATEXT(i) are magical names in semantic predicates - punctuation is critical
ANTLR wants to determine the amount of lookahead required for evaluating a semantic predicate. It does this by
searching in C++ mode for strings of the form "LT(" and in C mode for strings of the form "LATEXT(". If there
are spaces before the open "(" it won't make a match. It evaluates the expression following the "(" under the
assumption that it is an integer literal (e.g. "1"). If it is something like "LT(1+i)" then you'll have problems.
With ANTLR switch -w2 you will receive a warning if ANTLR doesn't find at least one LT(i) in a semantic predicate.
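The pitfalls in item #15 can be checked mechanically before running ANTLR. The following is a minimal sketch in Python, not ANTLR's own code: the function name check_predicate and its warning texts are hypothetical, and the check is deliberately strict (it accepts only bare digits between the parentheses).

```python
import re

def check_predicate(predicate, cpp_mode=True):
    """Return a list of warnings for the text of one semantic predicate.

    Hypothetical helper: it only mirrors the rules described in item #15,
    it is not how ANTLR itself is implemented.
    """
    name = "LT" if cpp_mode else "LATEXT"
    warnings = []
    # Pitfall 1: whitespace between the name and the open parenthesis.
    if re.search(r"\b" + name + r"\s+\(", predicate):
        warnings.append("space before '(' after " + name)
    # Pitfall 2: the argument is not a plain integer literal.
    for m in re.finditer(r"\b" + name + r"\(([^)]*)\)", predicate):
        arg = m.group(1)
        if not re.fullmatch(r"\d+", arg):
            warnings.append("non-literal argument: " + name + "(" + arg + ")")
    # Pitfall 3: no well-formed reference at all (the case -w2 warns about).
    if not re.search(r"\b" + name + r"\(\d+\)", predicate):
        warnings.append("no " + name + "(i) with an integer literal found")
    return warnings

print(check_predicate("LT(1)->getType() == ID"))    # no warnings
print(check_predicate("LT (1)->getType() == ID"))   # space before "("
print(check_predicate("LT(1+i)->getType() == ID"))  # non-literal argument
```

Running such a check over the predicates in a grammar file is only a convenience; the ANTLR -w2 warning remains the authoritative signal.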
#token
#16.
To change the token name appearing in syntax error messages: #token ID("identifier") "[a-z A-Z]+"
The string appearing inside the parentheses will be used for the token name in zztokens and _token_tbl
#17.
To match any single character use: "~[]", to match everything to a newline use: "~[\n]*"
#18.
To match an "@" in your input text use "\@", otherwise it will be interpreted as the end-of-file symbol
#19.
The escaped literals in #token regular expressions are: \t \n \r \b (not the same as ANSI C)
#20.
In #token expressions "\12" is decimal, "\012" is octal, and "\0x12" is hex (not the same as ANSI C)
Contributed by John D. Mitchell (johnm@jGuru.net).
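To make the rule in item #20 concrete, here is a minimal sketch in Python of how the three forms decode. The function decode_numeric_escape is hypothetical and follows only the rule as stated above; it is not taken from the DLG source. For comparison, in ANSI C "\12" is octal (value 10) and hex is written "\x12".

```python
def decode_numeric_escape(escape):
    """Decode a numeric escape, as described in item #20, to an integer code.

    Hypothetical helper: leading "0x" means hex, a leading "0" followed by
    more digits means octal, anything else is decimal.
    """
    if not escape.startswith("\\"):
        raise ValueError("escape must start with a backslash")
    digits = escape[1:]
    if digits[:2] in ("0x", "0X"):
        return int(digits[2:], 16)
    if digits.startswith("0") and len(digits) > 1:
        return int(digits, 8)
    return int(digits, 10)

print(decode_numeric_escape(r"\12"))    # decimal 12
print(decode_numeric_escape(r"\012"))   # octal 12, i.e. 10
print(decode_numeric_escape(r"\0x12"))  # hex 12, i.e. 18
```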
#21.
DLG wants to find the longest possible string that matches
The regular expression "~[]*" will cause problems - it will gobble up everything to the end-of-file.
#22.
When two regular expressions match strings of equal length the first one is chosen
Thus more specific regular expressions should appear in the grammar file before more general ones:
#token HELP "help" /* should appear before "symbol" */
#token Symbol "[a-z A-Z]*" /* should appear after keywords */
Some of these may be caught by using the DLG switch -Wambiguity. In the following grammar the input string
"HELP" will never be matched as token HELP, because the ID definition that appears first also matches it:
#token WhiteSpace "[\ \t]" <<skip();>>
#token ID "[a-z A-Z]+"
#token HELP "HELP"
statement
: HELP "@" <<printf("token HELP\n");>> /* a1 */
| "inline" "@" <<printf("token inline\n");>> /* a2 */
| ID "@" <<printf("token ID\n");>> /* a3 */
;
The best advice may be to follow the practice of TJP: place "#token ID" at the end of the grammar file.
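The effect of definition order can be reproduced outside DLG. The sketch below, in Python, is an illustration under stated assumptions rather than DLG's algorithm: next_token is a hypothetical function, the patterns are Python translations of the #token expressions above, and the only rules modelled are "longest match wins" (item #21) and "first definition wins a tie" (item #22).

```python
import re

def next_token(rules, text, pos=0):
    """Return (name, lexeme) for the longest match at pos.

    rules is a list of (name, pattern) pairs in definition order.
    A later rule replaces an earlier one only if its match is strictly
    longer, so the first definition wins ties.
    """
    best = None
    for name, pattern in rules:
        m = re.compile(pattern).match(text, pos)
        if m and m.end() > pos:
            if best is None or len(m.group()) > len(best[1]):
                best = (name, m.group())
    return best

# Same order as the grammar above: ID is defined before HELP.
id_first = [("ID", r"[a-zA-Z]+"), ("HELP", r"HELP")]
# Order recommended in the text: the general ID rule comes last.
help_first = [("HELP", r"HELP"), ("ID", r"[a-zA-Z]+")]

print(next_token(id_first, "HELP@"))     # ('ID', 'HELP')
print(next_token(help_first, "HELP@"))   # ('HELP', 'HELP')
print(next_token(help_first, "HELPER@")) # ('ID', 'HELPER')
```

With ID last, the keyword is recognized when the lengths tie, while a longer identifier such as "HELPER" is still returned as ID because the longest match takes priority.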
#23.
Inline regular expressions are no different from #token statements
PCCTS code does not check for a match to "inline" (Item #22 line a2) before attempting a match to the regular
expressions defined by #token statements. The first two alternatives ("a1" and "a2") will never be matched. All of
this will be clear from examination of the file "parser.dlg" (the name does not depend on the parser's class name).
Another way of looking at this is to recognize that the conversion of character strings to tokens takes place in class
DLGLexer, not class ANTLRParser, and that all that is happening with an inline regular expression is that ANTLR is
allowing you to define a token's regular expression in a more convenient fashion - not changing the fundamental
behavior.
If one builds the example above using the DLG switch -Wambiguity one gets the message:
dlg warning: ambigious regular expression 3 4