27 March 2000 Release 2.22 Notes for New Users of PCCTS Version 1.33MR22
dlg warning: ambigious regular expression 3 5
The numbers which appear in the DLG message refer to the assigned token numbers. Examine the array _token_tbl
in parserClassName.cpp to find the regular expression which corresponds to the token number reported by DLG:
ANTLRChar *Parser::_token_tbl[]={
/* 00 */ "Invalid",
/* 01 */ "@",
/* 02 */ "WhiteSpace",
/* 03 */ "ID",
/* 04 */ "HELP",
/* 05 */ "inline"
};
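The lookup can be tried out with a small stand-alone sketch. It copies the example table into a local array; the names kTokenTbl and tokenText are hypothetical and are not generated by ANTLR or DLG.

```cpp
#include <cstddef>
#include <string>

// Copy of the example table; in a real project the table is the
// _token_tbl array found in parserClassName.cpp.
static const char* const kTokenTbl[] = {
    /* 00 */ "Invalid",
    /* 01 */ "@",
    /* 02 */ "WhiteSpace",
    /* 03 */ "ID",
    /* 04 */ "HELP",
    /* 05 */ "inline"
};

// Hypothetical helper: maps a token number taken from a DLG message
// to the corresponding table entry.
std::string tokenText(std::size_t n) {
    const std::size_t count = sizeof(kTokenTbl) / sizeof(kTokenTbl[0]);
    return n < count ? kTokenTbl[n] : "(out of range)";
}
```

For the message shown earlier, tokenText(3) gives "ID" and tokenText(5) gives "inline", which would be consistent with the ID expression also matching the literal inline.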
Well, there is one important difference for those using Sorcerer. With in-line regular expressions there is no
symbolic name for the token, hence it can't be referenced in a Sorcerer rule. Contributed by John D. Mitchell
(johnm@jGuru.com).
#24. Watch out when you see ~[list-of-characters] at the end of a regular expression
What the user usually wants to express is that the regular expression should stop before the list-of-characters.
However, the expression will include the complement of that list as part of the regular expression. Users often
forget that a character from the complement of the set is consumed as part of the match.
Consider for example a #lexclass for a C style comment:
/* C-style comment handling */
#lexclass COMMENT /* a1 */
#token "\*/" << mode(START); skip(); >> /* a2 */
#token "~[\*]+" << skip(); >> /* a3 */
#token "\*~[/]" << skip(); >> /* WRONG */ /* a4 */
/* Should be "\*" */ /* a5 */
/* Correction due to Tim Corringham */ /* a6 */
/* tim@ramjam.u-net.com 20-Dec-94 */ /* a7 */
The RE at line a2 accepts "*/" and changes to #lexclass START. The RE at line a4 accepts a "*" which is not
followed by a "/". The problem arises with comments of the form:
/* this comment breaks the example **/
The RE at line a4 consumes the "**" at the end of the comment leaving nothing to be matched by "\*/".
This is a relatively efficient way to span a comment. However it is not the simplest. A simpler description is:
#token "\*/" << mode(START); skip(); >> /* b1 */
#token "~[]" << skip(); >> /* b2 */
This works because b1 ("*/") is two characters long while b2 is only one character long - and DLG always prefers
the longest expression which matches.
For those who are concerned with the efficiency of scanning:
#token "[\n\r]"      <<skip();newline();>>
#token "\*/"         <<mode(START);skip();>>
#token "\*"          <<skip();>>
#token "~[\*\n\r]+"  <<skip();>>
Contributed by Brad Schick
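The difference between rules a2-a4 and rules b1-b2 in this item can be checked with a small hand-written simulation. This is only a sketch: the matchers below are hypothetical stand-ins written for this note, not code generated by DLG, and they model only the text that follows the opening "/*".

```cpp
#include <cstddef>
#include <string>

// Hypothetical matchers. Each returns the length matched at position i
// of s, or 0 when the rule does not match there.
static std::size_t starSlash(const std::string& s, std::size_t i) {     // "\*/"    (a2, b1)
    return (i + 1 < s.size() && s[i] == '*' && s[i + 1] == '/') ? 2 : 0;
}
static std::size_t nonStars(const std::string& s, std::size_t i) {      // "~[\*]+" (a3)
    std::size_t n = 0;
    while (i + n < s.size() && s[i + n] != '*') ++n;
    return n;
}
static std::size_t starNotSlash(const std::string& s, std::size_t i) {  // "\*~[/]" (a4)
    return (i + 1 < s.size() && s[i] == '*' && s[i + 1] != '/') ? 2 : 0;
}

// Rules a2-a4. At any position at most one of the three rules can match,
// so no tie-breaking is needed. Returns true if "\*/" is ever matched.
bool closesWithRulesA(const std::string& body) {
    std::size_t i = 0;
    while (i < body.size()) {
        if (starSlash(body, i) > 0) return true;   // a2: back to START
        std::size_t n = nonStars(body, i);         // a3
        if (n == 0) n = starNotSlash(body, i);     // a4
        if (n == 0) return false;                  // no rule matches here
        i += n;
    }
    return false;                                  // input used up, comment still open
}

// Rules b1-b2. "\*/" (two characters) beats "~[]" (one character)
// whenever both match at the same position.
bool closesWithRulesB(const std::string& body) {
    std::size_t i = 0;
    while (i < body.size()) {
        if (starSlash(body, i) > 0) return true;   // b1 is the longer match
        ++i;                                       // b2 consumes one character
    }
    return false;
}
```

With the body " this comment breaks the example **/", closesWithRulesA returns false because a4 swallows the "**", while closesWithRulesB returns true. A body ending in "***/" is closed by both rule sets, which is why the flaw in a4 is easy to miss in testing.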
#25. Watch out when one regular expression is the prefix of another
If the shorter regular expression is followed by something which can be the first character of the suffix of the longer
regular expression, DLG will happily assume that it is looking at the longer regular expression. See Item #44 for one
approach to this problem.
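The effect can be illustrated with a toy scanner that, like the behaviour described in this item, commits to the longer alternative and never backs up. This is a sketch under stated assumptions: the function scanNoBacktrack and the literals "ab" and "abcd" are hypothetical, only literal strings are modelled, and the code is not taken from DLG.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Consumes characters from the start of input for as long as the consumed
// text is still a prefix of at least one literal. Returns the consumed text
// if it is exactly one of the literals, or "" if the scanner stopped in the
// middle of a literal (there is no fallback to a shorter match).
std::string scanNoBacktrack(const std::vector<std::string>& literals,
                            const std::string& input) {
    std::string consumed;
    for (char c : input) {
        const std::string next = consumed + c;
        bool viable = false;
        for (const std::string& lit : literals) {
            // True when next is a prefix of lit.
            if (lit.compare(0, next.size(), next) == 0) { viable = true; break; }
        }
        if (!viable) break;
        consumed = next;
    }
    for (const std::string& lit : literals) {
        if (lit == consumed) return consumed;
    }
    return "";
}
```

With the literals "ab" and "abcd", the input "ab+" yields "ab" and the input "abcd" yields "abcd". The input "abce" yields "": after "ab" the scanner sees "c", the first character of the suffix "cd", commits to the longer literal, and cannot return to "ab" when "e" arrives.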
#26. DLG is not able to backtrack (unlike flex)
Consider the following example:
#token "[\ \t]*" <<skip();>>
#token ELSE "else"
#token ELSEIF "else [\ \t]* if"
#token STOP "stop"