27 March 2000 Release 2.22 Notes for New Users of PCCTS Version 1.33MR22
Where is
#1. The current maintenance release of PCCTS, these notes, and related examples are available on the net
#2. Some other items available at http://www.polhode.com:
#3. Newsgroup is comp.compilers.tools.pccts. Mailing list is pccts_1-33 at onelist.com.
Basics
#4. Invoke ANTLR or DLG with no arguments to get a switch summary
#5. Tokens begin with uppercase characters, rules begin with lowercase characters
#6. Even in C mode you can use C++ style comments in the non-action portion of ANTLR source code
#7. In #token regular expressions spaces and tabs which are not escaped are ignored
#8. Never choose names which coincide with compiler reserved words or library names
#9. Write <<predicate>>? not <<predicate;>>? (semantic predicates go in "if" conditions)
#10. Some constructs which cause warnings about ambiguities and optional paths
Checklist
#11. Locate incorrectly spelled #token symbols using ANTLR -w2 switch or by inspecting parserClassName.cpp
#12. Be consistent with in-line token definitions: "&&" will not be assigned the same token number as "\&\&"
#13. Duplicate definition of a #token name is not reported if there are no actions attached
#14. Use ANTLR option -info o to detect orphan rules when ambiguities are reported
#15. LT(i) and LATEXT(i) are magical names in semantic predicates - punctuation is critical
#token
#16. To change the token name appearing in syntax error messages: #token ID("identifier") "[a-z A-Z]+"
#17. To match any single character use: "~[]", to match everything to a newline use: "~[\n]*"
#18. To match an "@" in your input text use "\@", otherwise it will be interpreted as the end-of-file symbol
#19. The escaped literals in #token regular expressions are: \t \n \r \b (not the same as ANSI C)
#20. In #token expressions "\12" is decimal, "\012" is octal, and "\0x12" is hex (not the same as ANSI C)
#21. DLG wants to find the longest possible string that matches
#22. When two regular expressions match strings of equal length the first one is chosen
#23. Inline regular expressions are no different from #token statements
#24. Watch out when you see ~[list-of-characters] at the end of a regular expression
#25. Watch out when one regular expression is the prefix of another
#26. DLG is not able to backtrack (unlike flex)
#27. The lexical routines mode(), skip(), and more() are not complicated!
#28. lextext() includes strings accumulated via more() - begexpr()/endexpr() refer only to the last matched RE
#29. Use "if (_lextext != _begexpr) {...}" to test for RE being appended to lextext using more()
#30. #token actions can access protected variables of the DLG base class
#31. When lookahead will break semantic routines in #token actions, consider using semantic predicates
#32. For 8 bit characters use flex or in DLG make char variables unsigned (g++ option -funsigned-char)
#33. The maximum size of a DLG token is set by an optional argument of the ctor DLGLexer() - default is 2000
#34. If a token is recognized using more() and its #lexclass ignores end-of-file then the very last token will be lost
#35. Sometimes the easiest DLG solution is to accept one character at a time.
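Items #21 and #22 above describe how the scanner picks between competing regular expressions: the longest match wins, and on a tie the expression defined first wins. The following is a minimal C++ sketch of that selection rule only. It is not how DLG is implemented (DLG builds a DFA and does not use std::regex), and the names Rule and pickToken are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for one #token definition: a name plus its pattern.
struct Rule {
    std::string name;
    std::regex pattern;
};

// Try every rule at position `pos` and return {rule name, match length}.
// The longest match wins; on a tie the rule listed first is kept, because
// a later rule replaces the current best only when it is strictly longer.
// Returns {"", 0} when no rule matches.
std::pair<std::string, std::size_t> pickToken(const std::vector<Rule>& rules,
                                              const std::string& input,
                                              std::size_t pos) {
    assert(pos <= input.size());
    std::string bestName;
    std::size_t bestLen = 0;
    for (const Rule& rule : rules) {
        std::smatch m;
        // match_continuous anchors the match at the starting position.
        if (std::regex_search(input.begin() + pos, input.end(), m, rule.pattern,
                              std::regex_constants::match_continuous)) {
            std::size_t len = static_cast<std::size_t>(m.length(0));
            if (len > bestLen) {
                bestName = rule.name;
                bestLen = len;
            }
        }
    }
    return {bestName, bestLen};
}
```

With a keyword rule "if" listed before an identifier rule "[a-z]+", the input "if" goes to the keyword rule (equal length, first rule wins) while "ifx" goes to the identifier rule (longer match). This is why the order of #token definitions matters for reserved words.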
#tokclass
#36. #tokclass provides an efficient way to combine reserved words into reserved word sets
#37. Use ANTLRParser::set_el() to test whether an ANTLRTokenType is in a #tokclass or #FirstSetSymbol
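The efficiency mentioned in #36 comes from treating a group of tokens as a set, so that membership (#37) is a single bit test rather than a chain of comparisons. The sketch below illustrates that idea in plain C++. It assumes a packed bit array as the representation; the class TokenSet and the token numbers are hypothetical, and this is not the signature or the source of ANTLRParser::set_el().

```cpp
#include <cassert>
#include <cstddef>
#include <initializer_list>
#include <vector>

// Hypothetical token numbers, standing in for values of ANTLRTokenType.
enum TokenType { Eof = 1, ID = 2, IF = 3, THEN = 4, ELSE = 5, WHILE = 6, Number = 20 };

// A token set stored as a packed bit array, one bit per token number.
class TokenSet {
public:
    TokenSet(std::initializer_list<TokenType> members) {
        for (TokenType t : members) add(t);
    }

    void add(TokenType t) {
        std::size_t word = static_cast<std::size_t>(t) / 8;
        if (word >= bits_.size()) bits_.resize(word + 1, 0);
        bits_[word] |= static_cast<unsigned char>(1u << (static_cast<unsigned>(t) % 8));
    }

    // Membership is one index computation plus one mask test.
    bool contains(TokenType t) const {
        std::size_t word = static_cast<std::size_t>(t) / 8;
        if (word >= bits_.size()) return false;
        return (bits_[word] & (1u << (static_cast<unsigned>(t) % 8))) != 0;
    }

private:
    std::vector<unsigned char> bits_;
};
```

A set built as TokenSet{IF, THEN, ELSE, WHILE} plays the role of a reserved word #tokclass: contains(IF) is true, contains(ID) is false.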
#tokdef
#38. A #tokdef must appear near the start of the grammar file (only #first and #header may precede it)
#lexclass
#39. Inline regular expressions are put in the most recently defined lexical class
#40. Use a stack of #lexclass modes in order to emulate lexical subroutines
#41. Sometimes a stack of #lexclass modes isn't enough
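The mode stack in #40 can be pictured as follows: entering a nested construct saves the current lexical class and switches to a new one, and leaving it restores whatever was active before, much like a subroutine call and return. This is a self-contained sketch only. The class ModeStackSketch, its pushMode()/popMode() members, and the mode numbers are hypothetical; its mode() is a stand-in for the lexer routine of the same name, not the DLG base class itself.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical lexical class numbers, standing in for #lexclass names.
enum LexMode { START = 0, COMMENT = 1, STRING = 2 };

class ModeStackSketch {
public:
    int currentMode() const { return current_; }

    // Stand-in for the lexer's mode() routine: switch lexical class.
    void mode(int m) { current_ = m; }

    // "Call": remember the current class, then switch to the new one.
    void pushMode(int newMode) {
        saved_.push_back(current_);
        mode(newMode);
    }

    // "Return": restore the class that was active before the last push.
    // Returns false (and changes nothing) if there is nothing to restore.
    bool popMode() {
        if (saved_.empty()) return false;
        mode(saved_.back());
        saved_.pop_back();
        return true;
    }

    std::size_t depth() const { return saved_.size(); }

private:
    int current_ = START;
    std::vector<int> saved_;
};
```

In a grammar the push would sit in the #token action that recognizes the opening delimiter and the pop in the action for the closing delimiter, so the same #lexclass can be entered from several different contexts.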
Lexical Lookahead
#42. Vern Paxson's flex has more powerful features for lookahead than DLG
#43. Extra lookahead is available from class BufFileInput (subclass of DLGInputStream)
#44. One extra character of lookahead is available to the #token action routine in ch (except in interactive mode)
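The extra lookahead in #43 amounts to buffering input so that upcoming characters can be inspected without being consumed. The sketch below shows that general technique over a std::istream. The class LookaheadInput and its members are hypothetical and are not the interface of BufFileInput or DLGInputStream.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical buffered input with arbitrary lookahead.
class LookaheadInput {
public:
    explicit LookaheadInput(std::istream& in) : in_(in) {}

    // Return the next character, or -1 at end of input.
    // The cast keeps 8-bit characters non-negative.
    int nextChar() {
        if (!fill(1)) return -1;
        char c = pending_.front();
        pending_.pop_front();
        return static_cast<unsigned char>(c);
    }

    // True when the upcoming characters spell `text`; nothing is consumed.
    bool lookahead(const std::string& text) {
        if (!fill(text.size())) return false;
        for (std::size_t i = 0; i < text.size(); ++i) {
            if (pending_[i] != text[i]) return false;
        }
        return true;
    }

private:
    // Read ahead until at least n characters are buffered.
    // Returns false if the input ends first.
    bool fill(std::size_t n) {
        while (pending_.size() < n) {
            int c = in_.get();
            if (c == std::istream::traits_type::eof()) return false;
            pending_.push_back(static_cast<char>(c));
        }
        return true;
    }

    std::istream& in_;
    std::deque<char> pending_;
};
```

A #token action could use a test of this kind to decide, for example, whether a "." begins a ".." range operator or ends a number, before committing to a token.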