27 March 2000 Release 2.22 Notes for New Users of PCCTS Version 1.33MR22
Where is
#1.
The current maintenance release of PCCTS, these notes, and related examples are available on the net
#2.
Some other items available at http://www.polhode.com:
#3.
Newsgroup is comp.compilers.tools.pccts. Mailing list is pccts_1-33 at onelist.com.
Basics
#4.
Invoke ANTLR or DLG with no arguments to get a switch summary
#5.
Tokens begin with uppercase characters, rules begin with lowercase characters
#6.
Even in C mode you can use C++ style comments in the non-action portion of ANTLR source code
#7.
In #token regular expressions spaces and tabs which are not escaped are ignored
#8.
Never choose names which coincide with compiler reserved words or library names
#9.
Write <<predicate>>? not <<predicate semi-colon>>? (semantic predicates go in "if" conditions)
#10.
Some constructs which cause warnings about ambiguities and optional paths
Checklist
#11.
Locate incorrectly spelled #token symbols using the ANTLR -w2 switch or by inspecting parserClassName.cpp
#12.
Be consistent with in-line token definitions: "&&" will not be assigned the same token number as "\&\&"
#13.
Duplicate definition of a #token name is not reported if there are no actions attached
#14.
Use ANTLR option -info o to detect orphan rules when ambiguities are reported
#15.
LT(i) and LATEXT(i) are magical names in semantic predicates - punctuation is critical
#token
#16.
To change the token name appearing in syntax error messages: #token ID("identifier") "[a-z A-Z]+"
#17.
To match any single character use: "~[]", to match everything to a newline use: "~[\n]*"
#18.
To match an "@" in your input text use "\@", otherwise it will be interpreted as the end-of-file symbol
#19.
The escaped literals in #token regular expressions are: \t \n \r \b (not the same as ANSI C)
#20.
In #token expressions "\12" is decimal, "\012" is octal, and "\0x12" is hex (not the same as ANSI C)
#21.
DLG wants to find the longest possible string that matches
#22.
When two regular expressions match strings of equal length, the first one is chosen
#23.
Inline regular expressions are no different from #token statements
#24.
Watch out when you see ~[list-of-characters] at the end of a regular expression
#25.
Watch out when one regular expression is the prefix of another
#26.
DLG is not able to backtrack (unlike flex)
#27.
The lexical routines mode(), skip(), and more() are not complicated!
#28.
lextext() includes strings accumulated via more() - begexpr()/endexpr() refer only to the last matched RE
#29.
Use "if (_lextext != _begexpr) {...}" to test for RE being appended to lextext using more()
#30.
#token actions can access protected variables of the DLG base class
#31.
When lookahead will break semantic routines in #token actions, consider using semantic predicates
#32.
For 8-bit characters use flex, or in DLG make char variables unsigned (g++ option -funsigned-char)
#33.
The maximum size of a DLG token is set by an optional argument of the ctor DLGLexer() - default is 2000
#34.
If a token is recognized using more() and its #lexclass ignores end-of-file then the very last token will be lost
#35.
Sometimes the easiest DLG solution is to accept one character at a time.
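Items #21 and #22 describe how DLG chooses among competing regular expressions: the longest match wins, and on a tie the rule defined first wins. The sketch below illustrates only that selection rule in plain C++ with std::regex. It is not DLG code, and the names Rule, pickToken, and demoRules are hypothetical, introduced here for illustration.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical rule table entry: a token name plus its pattern.
struct Rule {
    std::string name;
    std::regex pattern;
};

// Returns {token name, matched length} for the rule that matches the
// longest prefix of `input`. Ties go to the earliest rule in `rules`.
// Returns {"", 0} when no rule matches at the start of the input.
std::pair<std::string, std::size_t>
pickToken(const std::vector<Rule>& rules, const std::string& input) {
    std::pair<std::string, std::size_t> best{"", 0};
    for (const Rule& r : rules) {
        std::smatch m;
        // match_continuous anchors the match at the first character.
        if (std::regex_search(input, m, r.pattern,
                              std::regex_constants::match_continuous)) {
            std::size_t len = static_cast<std::size_t>(m.length(0));
            if (len > best.second) {  // strict '>' keeps the earlier rule on a tie
                best = {r.name, len};
            }
        }
    }
    return best;
}

// Two illustrative rules: a keyword listed before a general identifier.
std::vector<Rule> demoRules() {
    return { {"IF", std::regex("if")}, {"ID", std::regex("[a-z]+")} };
}
```

With these two rules, pickToken(demoRules(), "if") selects IF because both rules match two characters and IF is listed first, while pickToken(demoRules(), "iffy") selects ID because its match is longer.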
#tokclass
#36.
#tokclass provides an efficient way to combine reserved words into reserved word sets
#37.
Use ANTLRParser::set_el() to test whether an ANTLRTokenType is in a #tokclass or #FirstSetSymbol
#tokdef
#38.
A #tokdef must appear near the start of the grammar file (only #first and #header may precede it)
#lexclass
#39.
Inline regular expressions are put in the most recently defined lexical class
#40.
Use a stack of #lexclass modes in order to emulate lexical subroutines
#41.
Sometimes a stack of #lexclass modes isn't enough
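Item #40 suggests keeping a stack of #lexclass modes so that one lexical class can "call" another and later return to it. The sketch below shows the bookkeeping in plain C++. It is a stand-alone illustration under assumptions, not DLG code: the class ModeStack and the member names pushMode, popMode, current, and depth are hypothetical, and modes are represented as strings, whereas a real scanner would switch classes with mode().

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper that remembers which lexical class to return to.
class ModeStack {
public:
    explicit ModeStack(std::string start) : current_(std::move(start)) {}

    // Enter a new lexical class, saving the one that was active.
    void pushMode(const std::string& next) {
        saved_.push_back(current_);
        current_ = next;
    }

    // Return to the class that was active before the matching pushMode().
    // Returns false and changes nothing when there is nothing to pop.
    bool popMode() {
        if (saved_.empty()) return false;
        current_ = saved_.back();
        saved_.pop_back();
        return true;
    }

    const std::string& current() const { return current_; }
    std::size_t depth() const { return saved_.size(); }

private:
    std::string current_;             // the active lexical class
    std::vector<std::string> saved_;  // classes waiting to be resumed
};
```

A #token action that enters a nested construct would call pushMode() in place of a plain mode switch, and the action that recognizes the end of the construct would call popMode(), so the same class can be entered from several different contexts.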
Lexical Lookahead
#42.
Vern Paxson's flex has more powerful features for lookahead than DLG
#43.
Extra lookahead is available from class BufFileInput (subclass of DLGInputStream)
#44.
One extra character of lookahead is available to the #token action routine in ch (except in interactive mode)