27 March 2000 Release 2.22 Notes for New Users of PCCTS Version 1.33MR22
38
rule1 : rule2a | rule2b | rule2c ;
rule2a : A X | B Y | C Z ;
rule2b : B X | B Z ;
rule2c : C X ;
It should be clear that with the sentence being only two tokens this should be parseable with LL(2).
Instead, because k=1 and ck=2
ANTLR
will produce the following messages:
/pccts120/bin/antlr -k 1 -gs -ck 2 -gh example.g
ANTLR parser generator Version 1.20 1989-1994
example.g, line 23: warning: alts 1 and 2 of the rule itself
ambiguous upon { B }, { X Z }
example.g, line 23: warning: alts 1 and 3 of the rule itself
ambiguous upon { C }, { X }
The code generated resembles the following:
if (LA(1)==A || LA(1)==B || LA(1)==C) &&
(LA(2)==X || LA(2)==Y || LA(2)==Z) then rule2a()
else if (LA(1)==B) &&
(LA(2)==X || LA(2)==Z) then rule2b()
else if (LA(1)==C) &&
(LA(2)==X) then rule3a()
...
This might be called "product-of-sums". There is an "or" part for LA(1), an "or" part for LA(2), and they are
combined using "and". To match, the first lookahead token must be in the first set and the second lookahead token
must be in the second set. It doesn't matter that what one really wants is:
if (LA(1)==A && LA(2)==X) ||
(LA(1)==B && LA(2)==Y) ||
(LA(1)==C && LA(2)==Z) then rule2a()
else if (LA(1)==B && LA(2)==X) ||
(LA(1)==B && LA(2)==Z) then rule2b()
else if (LA(1)==C && LA(2)==X) then rule2c()
The problem is that each product involves one element from LA(1) and one from LA(2) and as the number of
possible tokens increases the number of terms grows as N
2
. With the linear approximation the number of terms
grows (surprise) linearly in the number of tokens.
ANTLR
won't do this with k=1 (it would for k=2). It will only do "product-of-sums". However, all is not lost - you
simply add a few well chosen semantic predicates which you have computed using your LL(
k
>1), mobile, water-
resistant, all purpose, guaranteed-for-a-lifetime, carbon based, analog computer.
The linear approximation selects for each branch of the "if" a set which may include
more
than what is wanted but
never selects a
subset
of the correct lookahead sets. We simply insert a hand-coded version of the LL(2)
computation. It's ugly, especially in this case, but it fixes the problem. In large grammars it may not be possible to
run
ANTLR
with k=2, so this fixes a few rules which cause problems. The generated parser may run faster because it
will have to evaluate fewer terms at execution time.
<<
int use_rule2a() {
if ( LA(1)==A && LA(2)==X ) return 1;
if ( LA(1)==B && LA(2)==Y ) return 1;
if ( LA(1)==C && LA(2)==Z ) return 1;
return 0;
}
>>
rule1 : <<use_rule2a()>>? rule2a | rule2b | rule2c ;
rule2a : A X | B Y | C Z ;
rule2b : B X | B Z ;
rule2c : C X ;
Correction due to Monty Zukowski (jaz@jGuru.com)
The real cases I've coded have shorter code sequences in the semantic predicate. I coded this as a function to make
it easier to read. Another reason to use a function (or macro) is to make it easier to read the generated code to