CHANGES_FROM_1.33: source contents

=======================================================================
List of Implemented Fixes and Changes for Maintenance Releases of PCCTS
=======================================================================
DISCLAIMER
The software and these notes are provided "as is". They may include
typographical or technical errors and their authors disclaim all
liability of any kind or nature for damages due to error, fault,
defect, or deficiency regardless of cause. All warranties of any
kind, either express or implied, including, but not limited to, the
implied warranties of merchantability and fitness for a particular
purpose are disclaimed.
#197. (Changed in MR14) Resetting the lookahead buffer of the parser
Explanation and fix by Sinan Karasu (sinan.karasu@boeing.com)
Consider the code used to prime the lookahead buffer LA(i)
of the parser when init() is called:
void
ANTLRParser::
prime_lookahead()
{
    int i;
    for(i=1;i<=LLk; i++) consume();
    dirty=0;
    //lap = 0;      // MR14 - Sinan Karasu (sinan.karasu@boeing.com)
    //labase = 0;   // MR14
    labase=lap;     // MR14
}
When the parser is instantiated, lap=0,labase=0 is set.
The "for" loop runs LLk times. In consume(), lap = lap +1 (mod LLk) is
computed. Therefore, lap(before the loop) == lap (after the loop).
Now the only problem comes in when one does an init() of the parser
after an Eof has been seen. At that time, lap could be non zero.
Assume it was lap==1. Now we do a prime_lookahead(). If LLk is 2,
then
consume()
{
NLA = inputTokens->getToken()->getType();
dirty--;
lap = (lap+1)&(LLk-1);
}
or expanding NLA,
token_type[lap&(LLk-1)] = inputTokens->getToken()->getType();
dirty--;
lap = (lap+1)&(LLk-1);
so now we prime locations 1 and 0 (the index wraps around modulo LLk).
In prime_lookahead it used to set lap=0 and labase=0. Now, the next
token will be read from location 0, NOT 1 as it should have been.
This was never caught before, because if a parser is just instantiated,
then lap and labase are 0, the offending assignment lines are
basically no-ops, since the for loop wraps around back to 0.
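The effect can be reproduced with a small stand-alone model. The
following is only a sketch, not code from the PCCTS runtime: the names
lap, labase and LLk follow the notes above, while MiniParser, its fake
token stream and la1AfterPrime() are invented for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical miniature of the circular lookahead buffer (LLk must be a power of two).
struct MiniParser {
    static const int LLk = 2;
    int token_type[LLk];
    int lap;                    // next slot to be filled by consume()
    int labase;                 // slot that holds LA(1)
    std::vector<int> input;     // fake token stream
    std::size_t pos;

    explicit MiniParser(const std::vector<int> &in)
        : lap(0), labase(0), input(in), pos(0)
    {
        for (int i = 0; i < LLk; i++) token_type[i] = -1;
    }

    void consume()
    {
        token_type[lap & (LLk - 1)] = input[pos++];
        lap = (lap + 1) & (LLk - 1);
    }

    // resetToZero == true models the pre-MR14 assignments lap=0, labase=0.
    void prime_lookahead(bool resetToZero)
    {
        for (int i = 1; i <= LLk; i++) consume();
        if (resetToZero) { lap = 0; labase = 0; }
        else             { labase = lap; }
    }

    int LA1() const { return token_type[labase & (LLk - 1)]; }
};

// Returns LA(1) after priming a parser whose lap had the given value beforehand.
int la1AfterPrime(int startLap, bool resetToZero)
{
    std::vector<int> tokens;
    tokens.push_back(10);       // first token of the new input
    tokens.push_back(20);       // second token
    MiniParser p(tokens);
    p.lap = startLap;
    p.prime_lookahead(resetToZero);
    return p.LA1();
}
```

In this model la1AfterPrime(0, true) and la1AfterPrime(0, false) both
see the first token, which is why a freshly instantiated parser never
showed the problem. With startLap equal to 1 only the MR14 variant
(resetToZero false) sees the first token.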
#196. (Changed in MR14) Problems with "(alpha)? beta" guess
Consider the following syntactic predicate in a grammar
with 2 tokens of lookahead (k=2 or ck=2):
rule : ( alpha )? beta ;
alpha : S t ;
t : T U
| T
;
beta : S t Z ;
When antlr computes the prediction expression with one token
of lookahead for alts 1 and 2 of rule t it finds an ambiguity.
Because the grammar has a lookahead of 2 it tries to compute
two tokens of lookahead for alts 1 and 2 of t. Alt 1 clearly
has a lookahead of (T U). Alt 2 is one token long so antlr
tries to compute the follow set of alt 2, which means finding
the things which can follow rule t in the context of (alpha)?.
This cannot be computed, because alpha is only part of a rule,
and antlr can't tell what part of beta is matched by alpha and
what part remains to be matched. Thus it is impossible for antlr
to properly determine the follow set of rule t.
Prior to 1.33MR14, the follow of (alpha)? was computed as
FIRST(beta) as a result of the internal representation of
guess blocks.
With MR14 the follow set will be the empty set for that context.
Normally, one expects a rule appearing in a guess block to also
appear elsewhere. When the follow context for this other use
is "ored" with the empty set, the context from the other use
results, and a reasonable follow context results. However if
there is *no* other use of the rule, or it is used in a different
manner then the follow context will be inaccurate - it was
inaccurate even before MR14, but it will be inaccurate in a
different way.
For the example given earlier, a reasonable way to rewrite the
grammar:
rule : ( alpha )? beta ;
alpha : S t ;
t : T U
| T
;
beta : alpha Z ;
If there are no other uses of the rule appearing in the guess
block it will generate a test for EOF - a workaround for
representing a null set in the lookahead tests.
If you encounter such a problem you can use the -alpha option
to get additional information:
line 2: error: not possible to compute follow set for alpha
in an "(alpha)? beta" block.
With the antlr -alpha command line option the following information
is inserted into the generated file:
#if 0
Trace of references leading to attempt to compute the follow set of
alpha in an "(alpha)? beta" block. It is not possible for antlr to
compute this follow set because it is not known what part of beta has
already been matched by alpha and what part remains to be matched.
Rules which make use of the incorrect follow set will also be incorrect
1 #token T alpha/2 line 7 brief.g
2 end alpha alpha/3 line 8 brief.g
2 end (...)? block at start/1 line 2 brief.g
#endif
At the moment, with the -alpha option selected the program marks
any rules which appear in the trace back chain (above) as rules with
possible problems computing follow set.
Reported by Greg Knapen (gregory.knapen@bell.ca).
#195. (Changed in MR14) #line directive not at column 1
Under certain circumstances a predicate test could generate
a #line directive which was not at column 1.
Reported with fix by David Kågedal (davidk@lysator.liu.se)
(http://www.lysator.liu.se/~davidk/).
#194. (Changed in MR14) (C Mode only) Demand lookahead with #tokclass
In C mode with the demand lookahead option there is a bug in the
code which handles matches for #tokclass (zzsetmatch and
zzsetmatch_wsig).
The bug causes the lookahead pointer to get out of synchronization
with the current token pointer.
The problem was reported with a fix by Ger Hobbelt (hobbelt@axa.nl).
#193. (Changed in MR14) Use of PCCTS_USE_NAMESPACE_STD
The pcctscfg.h now contains the following definitions:
#ifdef PCCTS_USE_NAMESPACE_STD
#define PCCTS_STDIO_H <cstdio>
#define PCCTS_STDLIB_H <cstdlib>
#define PCCTS_STDARG_H <cstdarg>
#define PCCTS_SETJMP_H <csetjmp>
#define PCCTS_STRING_H <cstring>
#define PCCTS_ASSERT_H <cassert>
#define PCCTS_ISTREAM_H <istream>
#define PCCTS_IOSTREAM_H <iostream>
#define PCCTS_NAMESPACE_STD namespace std {}; using namespace std;
#else
#define PCCTS_STDIO_H <stdio.h>
#define PCCTS_STDLIB_H <stdlib.h>
#define PCCTS_STDARG_H <stdarg.h>
#define PCCTS_SETJMP_H <setjmp.h>
#define PCCTS_STRING_H <string.h>
#define PCCTS_ASSERT_H <assert.h>
#define PCCTS_ISTREAM_H <istream.h>
#define PCCTS_IOSTREAM_H <iostream.h>
#define PCCTS_NAMESPACE_STD
#endif
The runtime support in pccts/h uses these pre-processor symbols
consistently.
Also, antlr and dlg have been changed to generate code which uses
these pre-processor symbols rather than having the names of the
#include files hard-coded in the generated code.
This required the addition of "#include pcctscfg.h" to a number of
files in pccts/h.
It appears that this sometimes causes problems for MSVC 5 in
combination with the "automatic" option for pre-compiled headers.
In such cases disable the "automatic" pre-compiled headers option.
Suggested by Hubert Holin (Hubert.Holin@Bigfoot.com).
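A sketch of how code can pick up its headers through these symbols.
It is an illustration only: the definitions are repeated locally so
the fragment stands alone (normally they come from pcctscfg.h), the
stray semicolon after "namespace std {}" is dropped, and
tokenTextLength() is a made-up helper, not part of PCCTS.

```cpp
#include <cassert>

// Normally supplied by pcctscfg.h; repeated here so the sketch is self-contained.
#define PCCTS_USE_NAMESPACE_STD
#ifdef PCCTS_USE_NAMESPACE_STD
#define PCCTS_STRING_H <cstring>
#define PCCTS_NAMESPACE_STD namespace std {} using namespace std;
#else
#define PCCTS_STRING_H <string.h>
#define PCCTS_NAMESPACE_STD
#endif

// The header name is taken from the macro instead of being hard-coded.
#include PCCTS_STRING_H
PCCTS_NAMESPACE_STD

// Hypothetical helper: strlen is found unqualified in either configuration.
unsigned long tokenTextLength(const char *text)
{
    return (unsigned long) strlen(text);
}
```

Removing the #define of PCCTS_USE_NAMESPACE_STD switches the same
source to the classic .h headers without any other change.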
#192. (Changed in MR14) Change setText() to accept "const ANTLRChar *"
Changed ANTLRToken::setText(ANTLRChar *) to setText(const ANTLRChar *).
This allows literal strings to be used to initialize tokens. Since
the usual token implementation (ANTLRCommonToken) makes a copy of the
input string, this was an unnecessary limitation.
Suggested by Bob McWhirter (bob@netwrench.com).
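A minimal sketch of why the const qualifier matters. SketchToken is a
hypothetical stand-in that copies its text (as ANTLRCommonToken is
described as doing above), and the typedef of ANTLRChar to char is an
assumption made for the example.

```cpp
#include <cassert>
#include <string>

typedef char ANTLRChar;     // assumption for this sketch

// Hypothetical token class that keeps its own copy of the text.
class SketchToken {
public:
    // Accepting a pointer to const lets callers pass string literals directly.
    void setText(const ANTLRChar *s) { text_ = s; }
    const ANTLRChar *getText() const { return text_.c_str(); }
private:
    std::string text_;      // private copy; the caller's buffer is never modified
};

// Initializes a token from a literal and returns the stored copy.
std::string textOfLiteralToken()
{
    SketchToken t;
    t.setText("identifier");
    return t.getText();
}
```

With a parameter of type "ANTLRChar *" the call with a literal would
need a cast, or would be rejected by stricter compilers.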
#191. (Changed in MR14) HP/UX aCC compiler compatibility problem
Needed to explicitly declare zzINF_DEF_TOKEN_BUFFER_SIZE and
zzINF_BUFFER_TOKEN_CHUNK_SIZE as ints in pccts/h/AParser.cpp.
Reported by David Cook (dcook@bmc.com).
#190. (Changed in MR14) IBM OS/2 CSet compiler compatibility problem
Name conflict with "_cs" in pccts/h/ATokenBuffer.cpp
Reported by David Cook (dcook@bmc.com).
#189. (Changed in MR14) -gxt switch in C mode
The -gxt switch in C mode didn't work because of incorrect
initialization.
Reported by Sinan Karasu (sinan@boeing.com).
#188. (Changed in MR14) Added pccts/h/DLG_stream_input.h
This is a DLG stream class based on C++ istreams.
Contributed by Hubert Holin (Hubert.Holin@Bigfoot.com).
#187. (Changed in MR14) Rename config.h to pcctscfg.h
The PCCTS configuration file has been renamed from config.h to
pcctscfg.h. The problem with the original name is that it led
to name collisions when pccts parsers were combined with other
software.
All of the runtime support routines in pccts/h/* have been
changed to use the new name. Existing software can continue
to use pccts/h/config.h. The contents of pccts/h/config.h are
now just: #include "pcctscfg.h".
I don't have a record of the user who suggested this.
#186. (Changed in MR14) Pre-processor symbol DllExportPCCTS class modifier
Classes in the C++ runtime support routines are now declared:
class DllExportPCCTS className ....
By default, the pre-processor symbol is defined as the empty
string. This is for use by MSVC++ users to create DLL classes.
Suggested by Manfred Kogler (km@cast.uni-linz.ac.at).
#185. (Changed in MR14) Option to not use PCCTS_AST base class for ASTBase
Normally, the ASTBase class is derived from PCCTS_AST which contains
functions useful to Sorcerer. If these are not necessary then the
user can define the pre-processor symbol "PCCTS_NOT_USING_SOR" which
will cause the ASTBase class to replace references to PCCTS_AST with
references to ASTBase where necessary.
The class ASTDoublyLinkedBase will contain a pure virtual function
shallowCopy() that was formerly defined in class PCCTS_AST.
Suggested by Bob McWhirter (bob@netwrench.com).
#184. (Changed in MR14) Grammars with no tokens generate invalid tokens.h
Reported by Hubert Holin (Hubert.Holin@bigfoot.com).
#183. (Changed in MR14) -f to specify file with names of grammar files
In DEC/VMS it is difficult to specify very long command lines.
The -f option allows one to place the names of the grammar files
in a data file in order to bypass limitations of the DEC/VMS
command language interpreter.
Addition supplied by Bernard Giroud (b_giroud@decus.ch).
#182. (Changed in MR14) Output directory option for DEC/VMS
Fix some problems with the -o option under DEC/VMS.
Fix supplied by Bernard Giroud (b_giroud@decus.ch).
#181. (Changed in MR14) Allow chars > 127 in DLGStringInput::nextChar()
Changed DLGStringInput to cast the character using (unsigned char)
so that languages with character codes greater than 127 work
without changes.
Suggested by Manfred Kogler (km@cast.uni-linz.ac.at).
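The effect of the cast can be sketched as follows. SketchStringInput
is a hypothetical reader, not the real DLGStringInput, and the use of
-1 as the end-of-input value is an assumption of the sketch.

```cpp
#include <cassert>

// Hypothetical reader over a NUL-terminated string.
class SketchStringInput {
public:
    explicit SketchStringInput(const char *s) : p_(s) {}

    // With the cast, bytes 128..255 are returned as 128..255.
    int nextChar()
    {
        return *p_ ? (int) (unsigned char) *p_++ : -1;
    }

    // Models a platform where plain char is signed: bytes above 127 turn negative.
    int nextCharSigned()
    {
        return *p_ ? (int) (signed char) *p_++ : -1;
    }
private:
    const char *p_;
};

int firstCharWithCast(const char *s)
{
    SketchStringInput in(s);
    return in.nextChar();
}

int firstCharWithoutCast(const char *s)
{
    SketchStringInput in(s);
    return in.nextCharSigned();
}
```

A negative character code can be mistaken for an end-of-input marker
or used as a negative table index, which is presumably why text with
codes above 127 needed the cast.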
#180. (Added in MR14) ANTLRParser::getEofToken()
Added "ANTLRToken ANTLRParser::getEofToken() const" to match the
setEofToken routine.
Requested by Manfred Kogler (km@cast.uni-linz.ac.at).
#179. (Fixed in MR14) Memory leak for BufFileInput subclass of DLGInputStream
The BufFileInput class described in Item #142 neglected to release
the allocated buffer when an instance was destroyed.
Reported by Manfred Kogler (km@cast.uni-linz.ac.at).
#178. (Fixed in MR14) Bug in "(alpha)? beta" guess blocks first sets
In 1.33 vanilla, and all maintenance releases prior to MR14
there is a bug in the handling of guess blocks which use the
"long" form:
(alpha)? beta
inside a (...)*, (...)+, or {...} block.
This problem does *not* apply to the case where beta is omitted
or when the syntactic predicate is on the leading edge of an
alternative.
The problem is that both alpha and beta are stored in the
syntax diagram, and that some analysis routines would fail
to skip the alpha portion when it was not on the leading edge.
Consider the following grammar with -ck 2:
r : ( (A)? B )* C D
| A B /* forces -ck 2 computation for old antlr */
/* reports ambig for alts 1 & 2 */
| B C /* forces -ck 2 computation for new antlr */
/* reports ambig for alts 1 & 3 */
;
The prediction expression for the first alternative should be
LA(1)={B C} LA(2)={B C D}, but previous versions of antlr
would compute the prediction expression as LA(1)={A C} LA(2)={B D}
Reported by Arpad Beszedes (beszedes@inf.u-szeged.hu) who provided
a very clear example of the problem and identified the probable cause.
#177. (Changed in MR14) #tokdefs and #token with regular expression
In MR13 the change described by Item #162 caused an existing
feature of antlr to fail. Prior to the change it was possible
to give regular expression definitions and actions to tokens
which were defined via the #tokdefs directive.
This now works again.
Reported by Manfred Kogler (km@cast.uni-linz.ac.at).
#176. (Changed in MR14) Support for #line in antlr source code
Note: this was implemented by Arpad Beszedes (beszedes@inf.u-szeged.hu).
In 1.33MR14 it is possible for a pre-processor to generate #line
directives in the antlr source and have those line numbers and file
names used in antlr error messages and in the #line directives
generated by antlr.
The #line directive may appear in the following forms:
#line ll "sss" xx xx ...
where ll represents a line number, "sss" represents the name of a file
enclosed in quotation marks, and xx are arbitrary integers.
The following form (without "line") is not supported at the moment:
# ll "sss" xx xx ...
The result:
zzline
is replaced with ll from the # or #line directive
FileStr[CurFile]
is updated with the contents of the string (if any)
following the line number
Note
----
The file-name string following the line number can be a complete
name with a directory-path. Antlr generates the output files from
the input file name (by replacing the extension from the file-name
with .c or .cpp).
If the input file (or the file-name from the line-info) contains
a path:
"../grammar.g"
the generated source code will be placed in "../grammar.cpp" (i.e.
in the parent directory). This is inconvenient in some cases
(even the -o switch can not be used) so the path information is
removed from the #line directive. Thus, if the line-info was
#line 2 "../grammar.g"
then the current file-name will become "grammar.g"
In this way, the generated source code according to the grammar file
will always be in the current directory, except when the -o switch
is used.
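The path removal described in the note can be sketched like this.
stripDirectory() is a hypothetical helper, not the routine used inside
antlr. Treating both "/" and "\" as separators is an assumption.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: keep only the part after the last directory separator.
std::string stripDirectory(const std::string &name)
{
    std::string::size_type pos = name.find_last_of("/\\");
    if (pos == std::string::npos) return name;     // no directory part present
    return name.substr(pos + 1);
}
```

For the example in the note, stripDirectory("../grammar.g") yields
"grammar.g", so the generated file lands in the current directory.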
#175. (Changed in MR14) Bug when guess block appears at start of (...)*
In 1.33 vanilla and all maintenance releases prior to 1.33MR14
there is a bug when a guess block appears at the start of a (...)*.
Consider the following k=1 (ck=1) grammar:
rule :
( (STAR)? ZIP )* ID ;
Prior to 1.33MR14, the generated code resembled:
...
zzGUESS_BLOCK
while ( 1 ) {
if ( !(LA(1)==STAR) ) break;
zzGUESS
if ( !zzrv ) {
zzmatch(STAR);
zzCONSUME;
zzGUESS_DONE
zzmatch(ZIP);
zzCONSUME;
...
Note that the routine uses STAR for the prediction expression
rather than ZIP. With 1.33MR14 the generated code resembles:
...
while ( 1 ) {
if ( !(LA(1)==ZIP) ) break;
...
This problem existed only with (...)* blocks and was caused
by the slightly more complicated graph which represents (...)*
blocks. This caused the analysis routine to compute the first
set for the alpha part of the "(alpha)? beta" rather than the
beta part.
Both (...)+ and {...} blocks handled the guess block correctly.
Reported by Arpad Beszedes (beszedes@inf.u-szeged.hu) who provided
a very clear example of the problem and identified the probable cause.
#174. (Changed in MR14) Bug when action precedes syntactic predicate
In 1.33 vanilla, and all maintenance releases prior to MR14,
there was a bug when a syntactic predicate was immediately
preceded by an action. Consider the following -ck 2 grammar:
rule :
<<int i;>>
(alpha)? beta C
| A B
;
alpha : A ;
beta : A B;
Prior to MR14, the code generated for the first alternative
resembled:
...
zzGUESS
if ( !zzrv && LA(1)==A && LA(2)==A) {
alpha();
zzGUESS_DONE
beta();
zzmatch(C);
zzCONSUME;
} else {
...
The prediction expression (i.e. LA(1)==A && LA(2)==A) is clearly
wrong because LA(2) should be matched to B (first[2] of beta is {B}).
With 1.33MR14 the prediction expression is:
...
if ( !zzrv && LA(1)==A && LA(2)==B) {
alpha();
zzGUESS_DONE
beta();
zzmatch(C);
zzCONSUME;
} else {
...
This will only affect grammars in which alpha is shorter than
max(k,ck) and there is an action immediately preceding
the syntactic predicate.
This problem was reported by Arpad Beszedes
(beszedes@inf.u-szeged.hu) who provided a very clear example
of the problem and identified the presence of the init-action
as the likely culprit.
#173. (Changed in MR13a) -glms for Microsoft style filenames with -gl
With the -gl option antlr generates #line directives using the
exact name of the input files specified on the command line.
An oddity of the Microsoft C and C++ compilers is that they
don't accept file names in #line directives containing "\"
even though these are names from the native file system.
With the -glms option, the "\" in file names appearing in #line
directives is replaced with a "/" in order to conform to
Microsoft compiler requirements.
Reported by Erwin Achermann (erwin.achermann@switzerland.org).
#172. (Changed in MR13) \r\n in antlr source counted as one line
Some MS software uses \r\n to indicate a new line. Antlr
now recognizes this in counting lines.
Reported by Edward L. Hepler (elh@ece.vill.edu).
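A sketch of line counting in which a carriage return followed by a
line feed counts as a single line break. countLineBreaks() is a
hypothetical function, not the antlr scanner. Ignoring a lone carriage
return is a choice made for this sketch only.

```cpp
#include <cassert>
#include <string>

// Hypothetical counter: "\r\n" and "\n" each end exactly one line.
int countLineBreaks(const std::string &text)
{
    int lines = 0;
    for (std::string::size_type i = 0; i < text.size(); ++i) {
        if (text[i] == '\r' && i + 1 < text.size() && text[i + 1] == '\n') {
            ++i;            // skip the '\n' that belongs to this pair
            ++lines;
        } else if (text[i] == '\n') {
            ++lines;
        }
    }
    return lines;
}
```

With this rule a grammar file gives the same line numbers in error
messages whichever of the two line endings it was saved with.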
#171. (Changed in MR13) #tokclass L..U now allowed
The following is now allowed:
#tokclass ABC { A..B C }
Reported by Dave Watola (dwatola@amtsun.jpl.nasa.gov)
#170. (Changed in MR13) Suppression for predicates with lookahead depth >1
In MR12 the capability for suppression of predicates with lookahead
depth=1 was introduced. With MR13 this has been extended to
predicates with lookahead depth > 1 and released for use by users
on an experimental basis.
Consider the following grammar with -ck 2 and the predicate in rule
"a" with depth 2:
r1 : (ab)* "@"
;
ab : a
| b
;
a : (A B)? => <<p(LATEXT(2))>>? A B C
;
b : A B C
;
Normally, the predicate would be hoisted into rule r1 in order to
determine whether to call rule "ab". However it should *not* be
hoisted because, even if p is false, there is a valid alternative
in rule b. With "-mrhoistk on" the predicate will be suppressed.
If "-info p" command line option is present the following information
will appear in the generated code:
while ( (LA(1)==A)
#if 0
Part (or all) of predicate with depth > 1 suppressed by alternative
without predicate
pred << p(LATEXT(2))>>?
depth=k=2 ("=>" guard) rule a line 8 t1.g
tree context:
(root = A
B
)
The token sequence which is suppressed: ( A B )
The sequence of references which generate that sequence of tokens:
1 to ab r1/1 line 1 t1.g
2 ab ab/1 line 4 t1.g
3 to b ab/2 line 5 t1.g
4 b b/1 line 11 t1.g
5 #token A b/1 line 11 t1.g
6 #token B b/1 line 11 t1.g
#endif
A slightly more complicated example:
r1 : (ab)* "@"
;
ab : a
| b
;
a : (A B)? => <<p(LATEXT(2))>>? (A B | D E)
;
b : <<q(LATEXT(2))>>? D E
;
In this case, the sequence (D E) in rule "a" which lies behind
the guard is used to suppress the predicate with context (D E)
in rule b.
while ( (LA(1)==A || LA(1)==D)
#if 0
Part (or all) of predicate with depth > 1 suppressed by alternative
without predicate
pred << q(LATEXT(2))>>?
depth=k=2 rule b line 11 t2.g
tree context:
(root = D
E
)
The token sequence which is suppressed: ( D E )
The sequence of references which generate that sequence of tokens:
1 to ab r1/1 line 1 t2.g
2 ab ab/1 line 4 t2.g
3 to a ab/1 line 4 t2.g
4 a a/1 line 8 t2.g
5 #token D a/1 line 8 t2.g
6 #token E a/1 line 8 t2.g
#endif
&&
#if 0
pred << p(LATEXT(2))>>?
depth=k=2 ("=>" guard) rule a line 8 t2.g
tree context:
(root = A
B
)
#endif
(! ( LA(1)==A && LA(2)==B ) || p(LATEXT(2)) ) {
ab();
...
#169. (Changed in MR13) Predicate test optimization for depth=1 predicates
When MR12 generated a test of a predicate which had depth 1
it would use the depth >1 routines, resulting in correct but
inefficient behavior. In MR13, a bit test is used.
#168. (Changed in MR13) Token expressions in context guards
The token expressions appearing in context guards such as:
(A B)? => <<test(LT(1))>>? someRule
are computed during an early phase of antlr processing. As
a result, prior to MR13, complex expressions such as:
~B
L..U
~L..U
TokClassName
~TokClassName
were not computed properly. This resulted in incorrect
context being computed for such expressions.
In MR13 these context guards are verified for proper semantics
in the initial phase and then re-evaluated after complex token
expressions have been computed in order to produce the correct
behavior.
Reported by Arpad Beszedes (beszedes@inf.u-szeged.hu).
#167. (Changed in MR13) ~L..U
Prior to MR13, the complement of a token range was
not properly computed.
#166. (Changed in MR13) token expression L..U
The token U was represented as an unsigned char, restricting
the use of L..U to cases where U was assigned a token number
less than 256. This is corrected in MR13.
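The restriction follows from ordinary narrowing. In the sketch below
upperBoundAsStored() is a hypothetical function that only models
storing a token number in an unsigned char. It assumes 8-bit chars.

```cpp
#include <cassert>

// Hypothetical model of the old representation of U in L..U.
unsigned int upperBoundAsStored(unsigned int tokenNumber)
{
    unsigned char narrowed = (unsigned char) tokenNumber;   // keeps only the low 8 bits
    return narrowed;
}
```

A token number of 200 survives unchanged, while 300 comes back as 44,
so a range ending at such a token would cover the wrong tokens.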
#165. (Changed in MR13) option -newAST
To create ASTs from an ANTLRTokenPtr antlr usually calls
"new AST(ANTLRTokenPtr)". This option generates a call
to "newAST(ANTLRTokenPtr)" instead. This allows a user
to define a parser member function to create an AST object.
Similar changes for ASTBase::tmake and ASTBase::link were not
thought necessary since they do not create AST object, only
use existing ones.
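A sketch of the kind of parser member function this option makes
possible. All of the types here (SketchToken, SketchAST, SketchParser)
are hypothetical stand-ins. The real function would take an
ANTLRTokenPtr and return the user's AST type.

```cpp
#include <cassert>

struct SketchToken { int type; };               // stand-in for a token
struct SketchAST   { int type; bool viaFactory; };

// Hypothetical parser with a user-defined AST factory.
class SketchParser {
public:
    SketchParser() : created_(0) {}

    // Code generated with -newAST would call a member like this
    // where it otherwise would write "new AST(tok)".
    SketchAST *newAST(const SketchToken &tok)
    {
        ++created_;                              // e.g. bookkeeping or pooling goes here
        SketchAST *node = new SketchAST;
        node->type = tok.type;
        node->viaFactory = true;
        return node;
    }

    int created() const { return created_; }
private:
    int created_;
};
```

Because the factory is a member of the parser, it has access to parser
state (here just a counter) when nodes are created.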
#164. (Changed in MR13) Unused variable _astp
For many compilations, we have lived with warnings about
the unused variable _astp. It turns out that this variable
can *never* be used because the code which references it was
commented out.
This investigation was sparked by a note from Erwin Achermann
(erwin.achermann@switzerland.org).
#163. (Changed in MR13) Incorrect makefiles for testcpp examples
All the examples in pccts/testcpp/* had incorrect definitions
in the makefiles for the symbol "CCC". Instead of CCC=CC they
had CC=$(CCC).
There was an additional problem in testcpp/1/test.g due to the
change in ANTLRToken::getText() to a const member function
(Item #137).
Reported by Maurice Mass (maas@cuci.nl).
#162. (Changed in MR13) Combining #token with #tokdefs
When it became possible to change the print-name of a
#token (Item #148) it became useful to give a #token
statement whose only purpose was to give a print name
to the #token. Prior to this change this could not be
combined with the #tokdefs feature.
#161. (Changed in MR13) Switch -gxt inhibits generation of tokens.h
#160. (Changed in MR13) Omissions in list of names for remap.h
When a user selects the -gp option antlr creates a list
of macros in remap.h to rename some of the standard
antlr routines from zzXXX to userprefixXXX.
There were a number of omissions from the remap.h name
list related to the new trace facility. This was reported,
along with a fix, by Bernie Solomon (bernard@ug.eds.com).
#159. (Changed in MR13) Violations of classic C rules
There were a number of violations of classic C style in
the distribution kit. This was reported, along with fixes,
by Bernie Solomon (bernard@ug.eds.com).
#158. (Changed in MR13) #header causes problem for pre-processors
A user who runs the C pre-processor on antlr source suggested
that another syntax be allowed. With MR13 directives
such as #header, #pragma, etc. may be written as "\#header",
"\#pragma", etc. For escaping pre-processor directives inside
a #header use something like the following:
\#header
<<
\#include <stdio.h>
>>
#157. (Fixed in MR13) empty error sets for rules with infinite recursion
When the first set for a rule cannot be computed due to infinite
left recursion and it is the only alternative for a block then
the error set for the block would be empty. This would result
in a fatal error.
Reported by Darin Creason (creason@genedax.com)
#156. (Changed in MR13) DLGLexerBase::getToken() now public
#155. (Changed in MR13) Context behind predicates can suppress
With -mrhoist enabled the context behind a guarded predicate can
be used to suppress other predicates. Consider the following grammar:
r0 : (r1)+;
r1 : rp
| rq
;
rp : <<p LATEXT(1)>>? B ;
rq : (A)? => <<q LATEXT(1)>>? (A|B);
In earlier versions both predicates "p" and "q" would be hoisted into
rule r0. With MR12c predicate p is suppressed because the context which
follows predicate q includes "B" which can "cover" predicate "p". In
other words, in trying to decide in r0 whether to call r1, it doesn't
really matter whether p is false or true because, either way, there is
a valid choice within r1.
#154. (Changed in MR13) Making hoist suppression explicit using <<nohoist>>
A common error, even among experienced pccts users, is to code
an init-action to inhibit hoisting rather than a leading action.
An init-action does not inhibit hoisting.
This was coded:
rule1 : <<;>> rule2
This is what was meant:
rule1 : <<;>> <<;>> rule2
With MR13, the user can code:
rule1 : <<;>> <<nohoist>> rule2
The following will give an error message:
rule1 : <<nohoist>> rule2
If the <<nohoist>> appears as an init-action rather than a leading
action an error message is issued. The meaning of an init-action
containing "nohoist" is unclear: does it apply to just one
alternative or to all alternatives ?
#153. (Changed in MR12b) Bug in computation of -mrhoist suppression set
Consider the following grammar with k=1 and "-mrhoist on":
r1 : (A)? => <<p>>? x /* l1 */
| r2 /* l2 */
;
r2 : A /* l4 */
| (B)? => <<q>>? y /* l5 */
;
In earlier versions the mrhoist routine would see that both l1 and
l2 contained predicates and would assume that this prevented either
from acting to suppress the other predicate. In the example above
it didn't realize the A at line l4 is capable of suppressing the
predicate at l1 even though alt l2 contains (indirectly) a predicate.
This is fixed in MR12b.
Reported by Reinier van den Born (reinier@vnet.ibm.com)
#153. (Changed in MR12a) Bug in computation of -mrhoist suppression set
An oversight similar to that described in Item #152 appeared in
the computation of the set that "covered" a predicate. If a
predicate expression included a term such as p=AND(q,r) the context
of p was taken to be context(q) & context(r), when it should have
been context(q) | context(r). This is fixed in MR12a.
#152. (Changed in MR12) Bug in generation of predicate expressions
The primary purpose for MR12 is to make quite clear that MR11 is
obsolete and to fix the bug related to predicate expressions.
In MR10 code was added to optimize the code generated for
predicate expression tests. Unfortunately, there was a
significant oversight in the code which resulted in a bug in
the generation of code for predicate expression tests which
contained predicates combined using AND:
r0 : (r1)* "@" ;
r1 : (AAA)? => <<p LATEXT(1)>>? r2 ;
r2 : (BBB)? => <<q LATEXT(1)>>? Q
| (BBB)? => <<r LATEXT(1)>>? Q
;
In MR11 (and MR10 when using "-mrhoist on") the code generated
for r0 to predict r1 would be equivalent to:
if ( LA(1)==Q &&
(LA(1)==AAA && LA(1)==BBB) &&
( p && ( q || r )) ) {
This is incorrect because it expresses the idea that LA(1)
*must* be AAA in order to attempt r1, and *must* be BBB to
attempt r2. The result was that r1 became unreachable since
both conditions cannot be simultaneously true.
The general philosophy of code generation for predicates
can be summarized as follows:
a. If the context is true don't enter an alt
for which the corresponding predicate is false.
If the context is false then it is okay to enter
the alt without evaluating the predicate at all.
b. A predicate created by ORing of predicates has
context which is the OR of their individual contexts.
c. A predicate created by ANDing of predicates has
(surprise) context which is the OR of their individual
contexts.
d. Apply these rules recursively.
e. Remember rule (a)
The correct code should express the idea that *if* LA(1) is
AAA then p must be true to attempt r1, but if LA(1) is *not*
AAA then it is okay to attempt r1, provided that *if* LA(1) is
BBB then one of q or r must be true.
if ( LA(1)==Q &&
( !(LA(1)==AAA || LA(1)==BBB) ||
( !(LA(1)==AAA) || p ) &&
( !(LA(1)==BBB) || q || r ) ) ) {
I believe this is fixed in MR12.
Reported by Reinier van den Born (reinier@vnet.ibm.com)
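Rule (a) and the corrected test can be checked with a small truth
table. This is only a sketch: the enum, guarded(), mayEnterR1() and
mayEnterR1Buggy() are invented names, the predicates are passed in as
plain booleans, and the leading LA(1)==Q term of the generated test is
left out so that only the predicate part is modelled.

```cpp
#include <cassert>

enum SketchTokenType { AAA, BBB, Q };

// Rule (a): a predicate only matters when its context is true.
bool guarded(bool inContext, bool pred)
{
    return !inContext || pred;
}

// Corrected form: each predicate is guarded by its own context.
bool mayEnterR1(SketchTokenType la1, bool p, bool q, bool r)
{
    return guarded(la1 == AAA, p) && guarded(la1 == BBB, q || r);
}

// MR10/MR11 form: the contexts are ANDed, which can never be true.
bool mayEnterR1Buggy(SketchTokenType la1, bool p, bool q, bool r)
{
    return (la1 == AAA && la1 == BBB) && (p && (q || r));
}
```

In this model mayEnterR1Buggy() is false for every input, matching the
observation that r1 became unreachable, while mayEnterR1() rejects the
alternative only when a context matches and its predicate is false.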
#151a. (Changed in MR12) ANTLRParser::getLexer()
As a result of several requests, I have added public methods to
get a pointer to the lexer belonging to a parser.
ANTLRTokenStream *ANTLRParser::getLexer() const
Returns a pointer to the lexer being used by the
parser. ANTLRTokenStream is the base class of
DLGLexer
ANTLRTokenStream *ANTLRTokenBuffer::getLexer() const
Returns a pointer to the lexer being used by the
ANTLRTokenBuffer. ANTLRTokenStream is the base
class of DLGLexer
You must manually cast the ANTLRTokenStream to your program's
lexer class because the name of the lexer's class is not fixed.
For the same reason it is impossible to incorporate it into the
DLGLexerBase class.
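A sketch of the manual cast. The classes below are hypothetical
stand-ins (SketchTokenStream for ANTLRTokenStream, MyLexer for the
user's lexer, SketchParser for the parser). The use of dynamic_cast,
which needs run-time type information, is a choice made for the
sketch; a static_cast is the alternative when the lexer type is known.

```cpp
#include <cassert>

// Stand-in for the token stream base class.
class SketchTokenStream {
public:
    virtual ~SketchTokenStream() {}
};

// Stand-in for the user's lexer class, whose name the runtime cannot know.
class MyLexer : public SketchTokenStream {
public:
    explicit MyLexer(int line) : line_(line) {}
    int line() const { return line_; }
private:
    int line_;
};

// Stand-in for the parser, exposing only a base-class pointer.
class SketchParser {
public:
    explicit SketchParser(SketchTokenStream *s) : stream_(s) {}
    SketchTokenStream *getLexer() const { return stream_; }
private:
    SketchTokenStream *stream_;
};

// Recovers the concrete lexer; returns -1 if the stream is not a MyLexer.
int currentLine(const SketchParser &parser)
{
    MyLexer *lexer = dynamic_cast<MyLexer *>(parser.getLexer());
    return lexer ? lexer->line() : -1;
}
```

The checked cast makes the failure case explicit when a parser is fed
from some other kind of token stream.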
#151b.(Changed in MR12) ParserBlackBox member getLexer()
The template class ParserBlackBox now has a member getLexer()
which returns a pointer to the lexer.
#150. (Changed in MR12) syntaxErrCount and lexErrCount now public
See Item #127 for more information.
#149. (Changed in MR12) antlr option -info o (letter o for orphan)
If there is more than one rule which is not referenced by any
other rule then all such rules are listed. This is useful for
alerting one to rules which are not used, but which can still
contribute to ambiguity. For example:
start : a Z ;
unused: a A ;
a : (A)+ ;
will cause an ambiguity report for rule "a" which will be
difficult to understand if the user forgets about rule "unused"
simply because it is not used in the grammar.
#148. (Changed in MR11) #token names appearing in zztokens,token_tbl
In a #token statement like the following:
#token Plus "+"
the string "Plus" appears in the zztokens array (C mode) and
token_tbl (C++ mode). This string is used in most error
messages. In MR11 one has the option of using some other string,
(e.g. "+") in those tables.
In MR11 one can write:
#token Plus ("+") "+"
#token RP ("(") "("
#token COM ("comment begin") "/*"
A #token statement is allowed to appear in more than one #lexclass
with different regular expressions. However, the token name appears
only once in the zztokens/token_tbl array. This means that only
one substitute can be specified for a given #token name. The second
attempt to define a substitute name (different from the first) will
result in an error message.
#147. (Changed in MR11) Bug in follow set computation
There is a bug in 1.33 vanilla and all maintenance releases
prior to MR11 in the computation of the follow set. The bug is
different than that described in Item #82 and probably more
common. It was discovered in the ansi.g grammar while testing
the "ambiguity aid" (Item #119). The search for a bug started
when the ambiguity aid was unable to discover the actual source
of an ambiguity reported by antlr.
The problem appears when an optimization of the follow set
computation is used inappropriately. The result is that the
follow set used is the "worst case". In other words, the error
can lead to false reports of ambiguity. The good news is that
if you have a grammar in which you have addressed all reported
ambiguities you are ok. The bad news is that you may have spent
time fixing ambiguities that were not real, or used k=2 when
ck=2 might have been sufficient, and so on.
The following grammar demonstrates the problem:
------------------------------------------------------------
expr : ID ;
start : stmt SEMI ;
stmt : CASE expr COLON
| expr SEMI
| plain_stmt
;
plain_stmt : ID COLON ;
------------------------------------------------------------
When compiled with k=1 and ck=2 it will report:
warning: alts 2 and 3 of the rule itself ambiguous upon
{ IDENTIFIER }, { COLON }
When antlr analyzes "stmt" it computes the first[1] set of all
alternatives. It finds an ambiguity between alts 2 and 3 for ID.
It then computes the first[2] set for alternatives 2 and 3 to resolve
the ambiguity. In computing the first[2] set of "expr" (which is
only one token long) it needs to determine what could follow "expr".
Under a certain combination of circumstances antlr forgets that it
is trying to analyze "stmt" which can only be followed by SEMI and
adds to the first[2] set of "expr" the "global" follow set (including
"COLON") which could follow "expr" (under other conditions) in the
phrase "CASE expr COLON".
#146. (Changed in MR11) Option -treport for locating "difficult" alts
It can be difficult to determine which alternatives are causing
pccts to work hard to resolve an ambiguity. In some cases the
ambiguity is successfully resolved after much CPU time so there
is no message at all.
A rough measure of the amount of work being performed which is
independent of the CPU speed and system load is the number of
tnodes created. Using "-info t" gives information about the
total number of tnodes created and the peak number of tnodes.
Tree Nodes: peak 1300k created 1416k lost 0
It also puts in the generated C or C++ file the number of tnodes
created for a rule (at the end of the rule). However this
information is not sufficient to locate the alternatives within
a rule which are causing the creation of tnodes.
Using:
antlr -treport 100000 ....
causes antlr to list on stdout any alternatives which require the
creation of more than 100,000 tnodes, along with the lookahead sets
for those alternatives.
The following is a trivial case from the ansi.g grammar which shows
the format of the report. This report might be of more interest
in cases where 1,000,000 tuples were created to resolve the ambiguity.
-------------------------------------------------------------------------
There were 0 tuples whose ambiguity could not be resolved
by full lookahead
There were 157 tnodes created to resolve ambiguity between:
Choice 1: statement/2 line 475 file ansi.g
Choice 2: statement/3 line 476 file ansi.g
Intersection of lookahead[1] sets:
IDENTIFIER
Intersection of lookahead[2] sets:
LPARENTHESIS COLON AMPERSAND MINUS
STAR PLUSPLUS MINUSMINUS ONESCOMPLEMENT
NOT SIZEOF OCTALINT DECIMALINT
HEXADECIMALINT FLOATONE FLOATTWO IDENTIFIER
STRING CHARACTER
-------------------------------------------------------------------------
#145. (Documentation) Generation of Expression Trees
Item #99 was misleading because it implied that the optimization
for tree expressions was available only for trees created by
predicate expressions and neglected to mention that it required
the use of "-mrhoist on". The optimization applies to tree
expressions created for grammars with k>1 and for predicates with
lookahead depth >1.
In MR11 the optimized version is always used so the -mrhoist on
option need not be specified.
#144. (Changed in MR11) Incorrect test for exception group
In testing for a rule's exception group the label, a pointer,
is compared against '\0'. The intention is "*pointer".
Reported by Jeffrey C. Fried (Jeff@Fried.net).
#143. (Changed in MR11) Optional ";" at end of #token statement
Fixes problem of:
#token X "x"
<<
parser action
>>
Being confused with:
#token X "x" <<lexical action>>
#142. (Changed in MR11) class BufFileInput subclass of DLGInputStream
Alexey Demakov (demakov@kazbek.ispras.ru) has supplied class
BufFileInput derived from DLGInputStream which provides a
function lookahead(char *string) to test characters in the
input stream more than one character ahead.
The default amount of lookahead is specified by the constructor
and defaults to 8 characters. This does *not* include the one
character of lookahead maintained internally by DLG in member "ch"
and which is not available for testing via BufFileInput::lookahead().
This is a useful class for overcoming the one-character-lookahead
limitation of DLG without resorting to a lexer capable of
backtracking (like flex) which is not integrated with antlr as is
DLG.
There are no restrictions on copying or using BufFileInput.* except
that the authorship and related information must be retained in the
source code.
The class is located in pccts/h/BufFileInput.* of the kit.
#141. (Changed in MR11) ZZDEBUG_CONSUME for ANTLRParser::consume()
A debug aid has been added to ANTLRParser::consume() in
file AParser.cpp:
#ifdef ZZDEBUG_CONSUME_ACTION
zzdebug_consume_action();
#endif
Suggested by Sramji Ramanathan (ps@kumaran.com).
#140. (Changed in MR11) #pred to define predicates
+---------------------------------------------------+
| Note: Assume "-prc on" for this entire discussion |
+---------------------------------------------------+
A problem with predicates is that each one is regarded as
unique and capable of disambiguating cases where two
alternatives have identical lookahead. For example:
rule : <<pred(LATEXT(1))>>? A
| <<pred(LATEXT(1))>>? A
;
will not cause any error messages or warnings to be issued
by earlier versions of pccts. To compare the text of the
predicates is an incomplete solution.
In 1.33MR11 I am introducing the #pred statement in order to
solve some problems with predicates. The #pred statement allows
one to give a symbolic name to a "predicate literal" or a
"predicate expression" in order to refer to it in other predicate
expressions or in the rules of the grammar.
The predicate literal associated with a predicate symbol is C
or C++ code which can be used to test the condition. A
predicate expression defines a predicate symbol in terms of other
predicate symbols using "!", "&&", and "||". A predicate symbol
can be defined in terms of a predicate literal, a predicate
expression, or *both*.
When a predicate symbol is defined with both a predicate literal
and a predicate expression, the predicate literal is used to generate
code, but the predicate expression is used to check whether two
alternatives have identical predicates.
Here are some examples of #pred statements:
#pred IsLabel <<isLabel(LATEXT(1))>>?
#pred IsLocalVar <<isLocalVar(LATEXT(1))>>?
#pred IsGlobalVar <<isGlobalVar(LATEXT(1))>>?
#pred IsVar <<isVar(LATEXT(1))>>? IsLocalVar || IsGlobalVar
#pred IsScoped <<isScoped(LATEXT(1))>>? IsLabel || IsLocalVar
I hope that the use of EBNF notation to describe the syntax of the
#pred statement will not cause problems for my readers (joke).
predStatement : "#pred"
CapitalizedName
(
"<<predicate_literal>>?"
| "<<predicate_literal>>?" predOrExpr
| predOrExpr
)
;
predOrExpr : predAndExpr ( "||" predAndExpr ) * ;
predAndExpr : predPrimary ( "&&" predPrimary ) * ;
predPrimary : CapitalizedName
| "!" predPrimary
| "(" predOrExpr ")"
;
What is the purpose of this nonsense ?
To understand how predicate symbols help, you need to realize that
predicate symbols are used in two different ways with two different
goals.
a. Allow simplification of predicates which have been combined
during predicate hoisting.
b. Allow recognition of identical predicates which can't disambiguate
alternatives with common lookahead.
First we will discuss goal (a). Consider the following rule:
rule0: rule1
| ID
| ...
;
rule1: rule2
| rule3
;
rule2: <<isX(LATEXT(1))>>? ID ;
rule3: <<!isX(LATEXT(1))>>? ID ;
When the predicates in rule2 and rule3 are combined by hoisting
to create a prediction expression for rule1 the result is:
if ( LA(1)==ID
&& ( isX(LATEXT(1)) || !isX(LATEXT(1)) ) ) { rule1(); ...
This is inefficient, but more importantly, can lead to false
assumptions that the predicate expression distinguishes the rule1
alternative with some other alternative with lookahead ID. In
MR11 one can write:
#pred IsX <<isX(LATEXT(1))>>?
...
rule2: <<IsX>>? ID ;
rule3: <<!IsX>>? ID ;
During hoisting MR11 recognizes this as a special case and
eliminates the predicates. The result is a prediction
expression like the following:
if ( LA(1)==ID ) { rule1(); ...
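The special case can be pictured as a search for a predicate symbol that
occurs both plain and negated in the same OR. The C++ sketch below is
only an illustration of that idea, not the code in antlr; PredRef and
orIsAlwaysTrue are hypothetical names.

```cpp
#include <cassert>
#include <string>
#include <vector>

// A reference to a predicate symbol as written in a rule:
// <<IsX>>? (negated == false) or <<!IsX>>? (negated == true).
struct PredRef {
    std::string symbol;   // name given in the #pred statement
    bool negated;         // true for the "!" form
};

// An OR of hoisted predicates is trivially true when it contains some
// symbol together with its negation; the whole test can then be dropped.
bool orIsAlwaysTrue(const std::vector<PredRef>& alts) {
    for (const PredRef& a : alts)
        for (const PredRef& b : alts)
            if (a.symbol == b.symbol && a.negated != b.negated) return true;
    return false;
}

void exampleHoistSimplify() {
    // rule2: <<IsX>>? ID ;   rule3: <<!IsX>>? ID ;
    assert(orIsAlwaysTrue({{"IsX", false}, {"IsX", true}}));
    // Two different symbols: nothing can be concluded from the names alone.
    assert(!orIsAlwaysTrue({{"IsX", false}, {"NotX", false}}));
}
```

The second assertion mirrors the restriction described next: a symbol
such as NotX is opaque to a check that looks only for "!" in front of
the same name.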
Please note that the following cases which appear to be equivalent
*cannot* be simplified by MR11 during hoisting because the hoisting
logic only checks for a "!" in the predicate action, not in the
predicate expression for a predicate symbol.
*Not* equivalent and is not simplified during hoisting:
#pred IsX <<isX(LATEXT(1))>>?
#pred NotX <<!isX(LATEXT(1))>>?
...
rule2: <<IsX>>? ID ;
rule3: <<NotX>>? ID ;
*Not* equivalent and is not simplified during hoisting:
#pred IsX <<isX(LATEXT(1))>>?
#pred NotX !IsX
...
rule2: <<IsX>>? ID ;
rule3: <<NotX>>? ID ;
Now we will discuss goal (b).
When antlr discovers that there is a lookahead ambiguity between
two alternatives it attempts to resolve the ambiguity by searching
for predicates in both alternatives. In the past any predicate
would do, even if the same one appeared in both alternatives:
rule: <<p(LATEXT(1))>>? X
| <<p(LATEXT(1))>>? X
;
The #pred statement is a start towards solving this problem.
During ambiguity resolution (*not* predicate hoisting) the
predicates for the two alternatives are expanded and compared.
Consider the following example:
#pred Upper <<isUpper(LATEXT(1))>>?
#pred Lower <<isLower(LATEXT(1))>>?
#pred Alpha <<isAlpha(LATEXT(1))>>? Upper || Lower
rule0: rule1
| <<Alpha>>? ID
;
rule1:
| rule2
| rule3
...
;
rule2: <<Upper>>? ID;
rule3: <<Lower>>? ID;
The definition of #pred Alpha expresses:
a. to test the predicate use the C code "isAlpha(LATEXT(1))"
b. to analyze the predicate use the information that
Alpha is equivalent to the union of Upper and Lower.
During ambiguity resolution the definition of Alpha is expanded
into "Upper || Lower" and compared with the predicate in the other
alternative, which is also "Upper || Lower". Because they are
identical MR11 will report a problem.
-------------------------------------------------------------------------
t10.g, line 5: warning: the predicates used to disambiguate rule rule0
(file t10.g alt 1 line 5 and alt 2 line 6)
are identical when compared without context and may have no
resolving power for some lookahead sequences.
-------------------------------------------------------------------------
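The expand-and-compare step can be sketched in C++ as follows. This is
an assumption-laden illustration, not antlr's implementation: predicate
expressions are restricted to "||" of symbols, the definitions are
assumed to be acyclic, and OrExpr, PredTable and expand are hypothetical
names.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

using OrExpr = std::set<std::string>;              // an "||" of predicate symbols
using PredTable = std::map<std::string, OrExpr>;   // symbol -> its #pred expression, if any

// Replace every symbol that has a predicate expression by the symbols
// it is built from. Symbols with only a predicate literal are kept.
OrExpr expand(const OrExpr& e, const PredTable& table) {
    OrExpr out;
    for (const std::string& sym : e) {
        PredTable::const_iterator it = table.find(sym);
        if (it == table.end()) {
            out.insert(sym);
        } else {
            const OrExpr inner = expand(it->second, table);
            out.insert(inner.begin(), inner.end());
        }
    }
    return out;
}

void exampleAlpha() {
    PredTable table;
    table["Alpha"] = OrExpr{"Upper", "Lower"};      // #pred Alpha <<...>>? Upper || Lower
    const OrExpr alt1 = OrExpr{"Upper", "Lower"};   // hoisted from rule2 and rule3
    const OrExpr alt2 = OrExpr{"Alpha"};            // <<Alpha>>? ID
    // Identical after expansion, so a warning would be issued.
    assert(expand(alt1, table) == expand(alt2, table));
}
```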
If you use the "-info p" option the output file will contain:
+----------------------------------------------------------------------+
|#if 0 |
| |
|The following predicates are identical when compared without |
| lookahead context information. For some ambiguous lookahead |
| sequences they may not have any power to resolve the ambiguity. |
| |
|Choice 1: rule0/1 alt 1 line 5 file t10.g |
| |
| The original predicate for choice 1 with available context |
| information: |
| |
| OR expr |
| |
| pred << Upper>>? |
| depth=k=1 rule rule2 line 14 t10.g |
| set context: |
| ID |
| |
| pred << Lower>>? |
| depth=k=1 rule rule3 line 15 t10.g |
| set context: |
| ID |
| |
| The predicate for choice 1 after expansion (but without context |
| information): |
| |
| OR expr |
| |
| pred << isUpper(LATEXT(1))>>? |
| depth=k=1 rule line 1 t10.g |
| |
| pred << isLower(LATEXT(1))>>? |
| depth=k=1 rule line 2 t10.g |
| |
| |
|Choice 2: rule0/2 alt 2 line 6 file t10.g |
| |
| The original predicate for choice 2 with available context |
| information: |
| |
| pred << Alpha>>? |
| depth=k=1 rule rule0 line 6 t10.g |
| set context: |
| ID |
| |
| The predicate for choice 2 after expansion (but without context |
| information): |
| |
| OR expr |
| |
| pred << isUpper(LATEXT(1))>>? |
| depth=k=1 rule line 1 t10.g |
| |
| pred << isLower(LATEXT(1))>>? |
| depth=k=1 rule line 2 t10.g |
| |
| |
|#endif |
+----------------------------------------------------------------------+
The comparison of the predicates for the two alternatives takes
place without context information, which means that in some cases
the predicates will be considered identical even though they operate
on disjoint lookahead sets. Consider:
#pred Alpha
rule1: <<Alpha>>? ID
| <<Alpha>>? Label
;
Because the comparison of predicates takes place without context
these will be considered identical. The reason for comparing
without context is that otherwise it would be necessary to re-evaluate
the entire predicate expression for each possible lookahead sequence.
This would require more code to be written and more CPU time during
grammar analysis, and it is not yet clear whether anyone will even make
use of the new #pred facility.
A temporary workaround might be to use different #pred statements
for predicates you know have different context. This would avoid
extraneous warnings.
The above example might be termed a "false positive". Comparison
without context will also lead to "false negatives". Consider the
following example:
#pred Alpha
#pred Beta
rule1: <<Alpha>>? A
| rule2
;
rule2: <<Alpha>>? A
| <<Beta>>? B
;
The predicate used for alt 2 of rule1 is (Alpha || Beta). This
appears to be different than the predicate Alpha used for alt1.
However, the context of Beta is B. Thus when the lookahead is A
Beta will have no resolving power and Alpha will be used for both
alternatives. Using the same predicate for both alternatives isn't
very helpful, but this will not be detected with 1.33MR11.
To properly handle this the predicate expression would have to be
evaluated for each distinct lookahead context.
To determine whether two predicate expressions are identical is
difficult. The routine may fail to identify identical predicates.
The #pred feature also compares predicates to see whether a choice
between alternatives is resolved by a predicate which makes the second
choice unreachable. Consider the following example:
#pred A <<A(LATEXT(1))>>?
#pred B <<B(LATEXT(1))>>?
#pred A_or_B A || B
r : s
| t
;
s : <<A_or_B>>? ID
;
t : <<A>>? ID
;
----------------------------------------------------------------------------
t11.g, line 5: warning: the predicate used to disambiguate the
first choice of rule r
(file t11.g alt 1 line 5 and alt 2 line 6)
appears to "cover" the second predicate when compared without context.
The second predicate may have no resolving power for some lookahead
sequences.
----------------------------------------------------------------------------
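For predicates that are plain ORs of symbols, "cover" amounts to a
superset test. The C++ sketch below makes that assumption explicit; it
is not the routine used by antlr, and covers is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <string>

using OrExpr = std::set<std::string>;   // an "||" of predicate symbols

// The first predicate covers the second when every symbol of the second
// also occurs in the first: whenever the second is true, so is the first.
bool covers(const OrExpr& first, const OrExpr& second) {
    // std::includes needs sorted ranges, which std::set guarantees.
    return std::includes(first.begin(), first.end(),
                         second.begin(), second.end());
}

void exampleCover() {
    const OrExpr altS = OrExpr{"A", "B"};   // s : <<A_or_B>>? ID, after expansion
    const OrExpr altT = OrExpr{"A"};        // t : <<A>>? ID
    assert(covers(altS, altT));             // alt 2 of rule r may be unreachable
    assert(!covers(altT, altS));            // the reverse order would be fine
}
```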
#139. (Changed in MR11) Problem with -gp in C++ mode
The -gp option to add a prefix to rule names did not work in
C++ mode. This has been fixed.
Reported by Alexey Demakov (demakov@kazbek.ispras.ru).
#138. (Changed in MR11) Additional makefiles for non-MSVC++ MS systems
Sramji Ramanathan (ps@kumaran.com) has supplied makefiles for
building antlr and dlg with Win95/NT development tools that
are not based on MSVC5. They are pccts/antlr/AntlrMS.mak and
pccts/dlg/DlgMS.mak.
The first line of each makefile requires a definition of PCCTS_HOME.
These are in addition to the AntlrMSVC50.* and DlgMSVC50.*
supplied by Jeff Vincent (JVincent@novell.com).
#137. (Changed in MR11) Token getType(), getText(), getLine() const members
--------------------------------------------------------------------
If you use ANTLRCommonToken this change probably does not affect you.
--------------------------------------------------------------------
For a long time it has bothered me that these accessor functions
in ANTLRAbstractToken were not const member functions. I have
refrained from changing them because it requires users to modify
existing token class definitions which are derived directly
from ANTLRAbstractToken. I think it is now time.
For those who are not used to C++, a "const member function" is a
member function which does not modify its own object - the thing
to which "this" points. This is quite different from a function
which does not modify its arguments.
Most token definitions based on ANTLRAbstractToken have something like
the following in order to create concrete definitions of the pure
virtual methods in ANTLRAbstractToken:
class MyToken : public ANTLRAbstractToken {
...
ANTLRTokenType getType() {return _type; }
int getLine() {return _line; }
ANTLRChar * getText() {return _text; }
...
};
The required change is simply to put "const" following the function
prototype in the header (.h file) and the definition file (.cpp if
it is not inline):
class MyToken : public ANTLRAbstractToken {
...
ANTLRTokenType getType() const {return _type; }
int getLine() const {return _line; }
ANTLRChar * getText() const {return _text; }
...
};
This was originally proposed a long time ago by Bruce
Guenter (bruceg@qcc.sk.ca).
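The practical benefit is that tokens can be inspected through const
references. The C++ sketch below uses a hypothetical stand-in,
AbstractToken, instead of the real ANTLRAbstractToken so that it
compiles without the pccts headers; the return types are simplified as
well.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for ANTLRAbstractToken with const accessors.
class AbstractToken {
public:
    virtual ~AbstractToken() {}
    virtual int getType() const = 0;
    virtual int getLine() const = 0;
    virtual const char* getText() const = 0;
};

class MyToken : public AbstractToken {
public:
    MyToken(int type, int line, const std::string& text)
        : _type(type), _line(line), _text(text) {}
    // "const" is part of the signature: without it these functions would
    // not override the pure virtuals and MyToken would remain abstract.
    int getType() const { return _type; }
    int getLine() const { return _line; }
    const char* getText() const { return _text.c_str(); }
private:
    int _type;
    int _line;
    std::string _text;
};

// Only const member functions may be called through a const reference.
int lineOf(const AbstractToken& tok) { return tok.getLine(); }

void exampleConstToken() {
    const MyToken tok(7, 42, "x");
    assert(lineOf(tok) == 42);
    assert(std::string(tok.getText()) == "x");
}
```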
#136. (Changed in MR11) Added getLength() to ANTLRCommonToken
Classes ANTLRCommonToken and ANTLRCommonTokenNoRefCountToken
now have a member function:
int getLength() const { return strlen(getText()); }
Suggested by Sramji Ramanathan (ps@kumaran.com).
#135. (Changed in MR11) Raised antlr's own default ZZLEXBUFSIZE to 8k
#134a. (ansi_mr10.zip) T.J. Parr's ANSI C grammar made 1.33MR11 compatible
There is a typographical error in the definition of BITWISEOREQ:
#token BITWISEOREQ "!=" should be "|="
When this change is combined with the bugfix to the follow set cache
problem (Item #147) and a minor rearrangement of the grammar
(Item #134b) it becomes a k=1 ck=2 grammar.
#134b. (ansi_mr10.zip) T.J. Parr's ANSI C grammar made 1.33MR11 compatible
The following changes were made in the ansi.g grammar (along with
using -mrhoist on):
ansi.g
======
void tracein(char *) ====> void tracein(const char *)
void traceout(char *) ====> void traceout(const char *)
<<LT(1)->getType()==IDENTIFIER ? isTypeName(LT(1)->getText()) : 1>>?
====> <<isTypeName(LT(1)->getText())>>?
<<(LT(1)->getType()==LPARENTHESIS && LT(2)->getType()==IDENTIFIER) ?
isTypeName(LT(2)->getText()) : 1>>?
====> (LPARENTHESIS IDENTIFIER)? => <<isTypeName(LT(2)->getText())>>?
<<(LT(1)->getType()==LPARENTHESIS && LT(2)->getType()==IDENTIFIER) ?
isTypeName(LT(2)->getText()) : 1>>?
====> (LPARENTHESIS IDENTIFIER)? => <<isTypeName(LT(2)->getText())>>?
added to init(): traceOptionValueDefault=0;
added to init(): traceOption(-1);
change rule "statement":
statement
: plain_label_statement
| case_label_statement
| <<;>> expression SEMICOLON
| compound_statement
| selection_statement
| iteration_statement
| jump_statement
| SEMICOLON
;
plain_label_statement
: IDENTIFIER COLON statement
;
case_label_statement
: CASE constant_expression COLON statement
| DEFAULT COLON statement
;
support.cpp
===========
void tracein(char *) ====> void tracein(const char *)
void traceout(char *) ====> void traceout(const char *)
added to tracein(): ANTLRParser::tracein(r); // call superclass method
added to traceout(): ANTLRParser::traceout(r); // call superclass method
Makefile
========
added to AFLAGS: -mrhoist on -prc on
#133. (Changed in 1.33MR11) Make trace options public in ANTLRParser
In checking T.J. Parr's ANSI C grammar for compatibility with
1.33MR11 it was discovered that it was inconvenient to have the
trace facilities with protected access.
#132. (Changed in 1.33MR11) Recognition of identical predicates in alts
Prior to 1.33MR11, there would be no ambiguity warning when the
very same predicate was used to disambiguate both alternatives:
test: ref B
| ref C
;
ref : <<pred(LATEXT(1))>>? A ;
In 1.33MR11 this will cause the warning:
warning: the predicates used to disambiguate rule test
(file v98.g alt 1 line 1 and alt 2 line 2)
are identical and have no resolving power
----------------- Note -----------------
This is different than the following case
test: <<pred(LATEXT(1))>>? A B
| <<pred(LATEXT(1))>>? A C
;
In this case there are two distinct predicates
which have exactly the same text. In the first
example there are two references to the same
predicate. The problem represented by this
grammar will be addressed later.
#131. (Changed in 1.33MR11) Case insensitive command line options
Command line switches like "-CC" and keywords like "on", "off",
and "stdin" are no longer case sensitive in antlr, dlg, and sorcerer.
#130. (Changed in 1.33MR11) Changed ANTLR_VERSION to int from string
The ANTLR_VERSION was not an integer, making it difficult to
perform conditional compilation based on the antlr version.
Henceforth, ANTLR_VERSION will be:
(base_version * 100) + release number
thus 1.33MR11 will be: 133*100+11 = 13311
Suggested by Rainer Janssen (Rainer.Janssen@Informatik.Uni-Oldenburg.DE).
#129. (Changed in 1.33MR11) Addition of ANTLR_VERSION to <parserName>.h
The following code is now inserted into <parserName>.h and
stdpccts.h:
#ifndef ANTLR_VERSION
#define ANTLR_VERSION 13311
#endif
Suggested by Rainer Janssen (Rainer.Janssen@Informatik.Uni-Oldenburg.DE)
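With an integer ANTLR_VERSION, conditional compilation becomes a plain
comparison. The following C++ sketch shows one way it might be used;
antlrVersion, kHasPredStatement and requireMR11 are hypothetical names,
and the fallback #define only exists so the fragment compiles without a
generated header.

```cpp
#include <cassert>

// Fallback so the sketch compiles without <parserName>.h or stdpccts.h.
#ifndef ANTLR_VERSION
#define ANTLR_VERSION 13311
#endif

// Encoding from the example above: 133*100 + 11 = 13311 for 1.33MR11.
constexpr int antlrVersion(int base, int maintenanceRelease) {
    return base * 100 + maintenanceRelease;
}

#if ANTLR_VERSION >= 13311
const bool kHasPredStatement = true;    // #pred was introduced in MR11 (Item #140)
#else
const bool kHasPredStatement = false;
#endif

// Run-time guard for code that relies on MR11 behaviour.
void requireMR11() {
    assert(ANTLR_VERSION >= antlrVersion(133, 11));
}
```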
#128. (Changed in 1.33MR11) Redundant predicate code in (<<pred>>? ...)+
Prior to 1.33MR11, the following grammar would generate
redundant tests for the "while" condition.
rule2 : (<<pred>>? X)+ X
| B
;
The code would resemble:
if (LA(1)==X) {
if (pred) {
do {
if (!pred) {zzfailed_pred(" pred");}
zzmatch(X); zzCONSUME;
} while (LA(1)==X && pred && pred);
} else {...
With 1.33MR11 the redundant predicate test is omitted.
#127. (Changed in 1.33MR11)
             Count Syntax Errors       Count DLG Errors
             -------------------       ----------------
C++ mode     ANTLRParser::             DLGLexerBase::
               syntaxErrCount            lexErrCount
C mode       zzSyntaxErrCount          zzLexErrCount
The C mode variables are global and initialized to 0.
They are *not* reset to 0 automatically when antlr is
restarted.
The C++ mode variables are public. They are initialized
to 0 by the constructors. They are *not* reset to 0 by the
ANTLRParser::init() method.
Suggested by Reinier van den Born (reinier@vnet.ibm.com).
#126. (Changed in 1.33MR11) Addition of #first <<...>>
The #first <<...>> inserts the specified text in the output
files before any other #include statements required by pccts.
The only things before the #first text are comments and
a #define ANTLR_VERSION.
Requested by Esa Pulkkinen (esap@cs.tut.fi) and Alexin
Zoltan (alexin@inf.u-szeged.hu).
#125. (Changed in 1.33MR11) Lookahead for (guard)? && <<p>>? predicates
When implementing the new style of guard predicate (Item #113)
in 1.33MR10 I decided to temporarily ignore the problem of
computing the "narrowest" lookahead context.
Consider the following k=1 grammar:
start : a
| b
;
a : (A)? && <<pred1(LATEXT(1))>>? ab ;
b : (B)? && <<pred2(LATEXT(1))>>? ab ;
ab : A | B ;
In MR10 the context for both "a" and "b" was {A B} because this is
the first set of rule "ab". Normally, this is not a problem because
the predicate which follows the guard inhibits any ambiguity report
by antlr.
In MR11 the first set for rule "a" is {A} and for rule "b" it is {B}.
#124. A Note on the New "&&" Style Guarded Predicates
I've been asked several times, "What is the difference between
the old "=>" style guard predicates and the new style "&&" guard
predicates, and how do you choose one over the other" ?
The main difference is that the "=>" does not apply the
predicate if the context guard doesn't match, whereas
the && form always does. What is the significance ?
If you have a predicate which is not on the "leading edge"
it cannot be hoisted. Suppose you need a predicate that
looks at LA(2). You must introduce it manually. The
classic example is:
castExpr :
LP typeName RP
| ....
;
typeName : <<isTypeName(LATEXT(1))>>? ID
| STRUCT ID
;
The problem is that isTypeName() isn't on the leading edge
of typeName, so it won't be hoisted into castExpr to help
make a decision on which production to choose.
The *first* attempt to fix it is this:
castExpr :
<<isTypeName(LATEXT(2))>>?
LP typeName RP
| ....
;
Unfortunately, this won't work because it ignores
the problem of STRUCT. The solution is to apply
isTypeName() in castExpr if LA(2) is an ID and
don't apply it when LA(2) is STRUCT:
castExpr :
(LP ID)? => <<isTypeName(LATEXT(2))>>?
LP typeName RP
| ....
;
In conclusion, the "=>" style guarded predicate is
useful when:
a. the tokens required for the predicate
are not on the leading edge
b. there are alternatives in the expression
selected by the predicate for which the
predicate is inappropriate
If (b) were false, then one could use a simple
predicate (assuming "-prc on"):
castExpr :
<<isTypeName(LATEXT(2))>>?
LP typeName RP
| ....
;
typeName : <<isTypeName(LATEXT(1))>>? ID
;
So, when do you use the "&&" style guarded predicate ?
The new-style "&&" predicate should always be used with
predicate context. The context guard is in ADDITION to
the automatically computed context. Thus it is useful for
predicates which depend on the token type for reasons
other than context.
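The difference between the two forms can be reduced to two boolean
expressions. The C++ sketch below assumes the reading given in Item
#113, in which "(guard)? => <<pred>>?" behaves like (!guard || pred)
and "(guard)? && <<pred>>?" behaves like (guard && pred); the function
names are hypothetical and the ordinary lookahead test is left out.

```cpp
#include <cassert>

// (guard)? => <<pred>>? : the predicate only matters when the guard matches.
bool arrowGuard(bool guardMatches, bool pred) {
    return !guardMatches || pred;
}

// (guard)? && <<pred>>? : the guard and the predicate must both hold.
bool andGuard(bool guardMatches, bool pred) {
    return guardMatches && pred;
}

void exampleCastExpr() {
    // castExpr with lookahead "LP STRUCT": the guard (LP ID)? does not match.
    const bool guardMatches = false;
    const bool isTypeName = false;       // result is irrelevant for STRUCT
    // "=>" keeps the alternative viable, which is what castExpr needs.
    assert(arrowGuard(guardMatches, isTypeName));
    // "&&" would reject the alternative for the same input.
    assert(!andGuard(guardMatches, isTypeName));
}
```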
The following example is contributed by Reinier van den Born
(reinier@vnet.ibm.com).
+-------------------------------------------------------------------------+
| This grammar has two ways to call functions: |
| |
| - a "standard" call syntax with parens and comma separated args |
| - a shell command like syntax (no parens and spacing separated args) |
| |
| The former also allows a variable to hold the name of the function, |
| the latter can also be used to call external commands. |
| |
| The grammar (simplified) looks like this: |
| |
| fun_call : ID "(" { expr ("," expr)* } ")" |
| /* ID is function name */ |
| | "@" ID "(" { expr ("," expr)* } ")" |
| /* ID is var containing fun name */ |
| ; |
| |
| command : ID expr* /* ID is function name */ |
| | path expr* /* path is external command name */ |
| ; |
| |
| path : ID /* left out slashes and such */ |
| | "@" ID /* ID is environment var */ |
| ; |
| |
| expr : .... |
| | "(" expr ")"; |
| |
| call : fun_call |
| | command |
| ; |
| |
| Obviously the call is wildly ambiguous. This is more or less how this |
| is to be resolved: |
| |
| A call begins with an ID or an @ followed by an ID. |
| |
| If it is an ID and if it is an ext. command name -> command |
| if followed by a paren -> fun_call |
| otherwise -> command |
| |
| If it is an @ and if the ID is a var name -> fun_call |
| otherwise -> command |
| |
| One can implement these rules quite neatly using && predicates: |
| |
| call : ("@" ID)? && <<isVarName(LT(2))>>? fun_call |
| | (ID)? && <<isExtCmdName>>? command |
| | (ID "(")? fun_call |
| | command |
| ; |
| |
| This can be done better, so it is not an ideal example, but it |
| conveys the principle. |
+-------------------------------------------------------------------------+
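The decision made by the four alternatives of "call" can be written out
as ordinary C++. This is only a sketch of the resolution order shown
above: token types are passed as strings, the two semantic tests are
passed in as booleans, and chooseCall is a hypothetical name.

```cpp
#include <cassert>
#include <string>

enum class Choice { FunCall, Command };

// tok1 and tok2 are the token types of LA(1) and LA(2).
// isVarName and isExtCmdName stand for the results of the semantic tests.
Choice chooseCall(const std::string& tok1, const std::string& tok2,
                  bool isVarName, bool isExtCmdName) {
    if (tok1 == "@" && tok2 == "ID" && isVarName) return Choice::FunCall;  // alt 1
    if (tok1 == "ID" && isExtCmdName) return Choice::Command;              // alt 2
    if (tok1 == "ID" && tok2 == "(") return Choice::FunCall;               // alt 3
    return Choice::Command;                                                // alt 4
}

void exampleCall() {
    // An ID followed by a paren is a function call ...
    assert(chooseCall("ID", "(", false, false) == Choice::FunCall);
    // ... unless the ID names an external command.
    assert(chooseCall("ID", "(", false, true) == Choice::Command);
}
```

Because alternatives are tried in order, the external command test in
alt 2 takes priority over the paren test in alt 3.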
#123. (Changed in 1.33MR11) Correct definition of operators in ATokPtr.h
The return value of operators in ANTLRTokenPtr:
changed: unsigned ... operator !=(...)
to: int ... operator != (...)
changed: unsigned ... operator ==(...)
to: int ... operator == (...)
Suggested by R.A. Nelson (cowboy@VNET.IBM.COM)
#122. (Changed in 1.33MR11) Member functions to reset DLG in C++ mode
void DLGFileReset(FILE *f) { input = f; found_eof = 0; }
void DLGStringReset(DLGChar *s) { input = s; p = &input[0]; }
Supplied by R.A. Nelson (cowboy@VNET.IBM.COM)
#121. (Changed in 1.33MR11) Another attempt to fix -o (output dir) option
Another attempt is made to improve the -o option of antlr, dlg,
and sorcerer. This one by JVincent (JVincent@novell.com).
The current rule:
a. If -o is not specified then any explicit directory
names are retained.
b. If -o is specified then the -o directory name overrides any
explicit directory names.
c. The directory name of the grammar file is *not* stripped
to create the main output file. However it is still subject
to override by the -o directory name.
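One reading of rules (a) through (c) is sketched below in C++. It is not
the code used by antlr, dlg, or sorcerer: outputPath is a hypothetical
helper, and "/" is assumed to be the only directory separator.

```cpp
#include <cassert>
#include <string>

// fileName: an output file name, possibly with an explicit directory.
// outDir:   the argument of -o, or "" when -o was not given.
std::string outputPath(const std::string& fileName, const std::string& outDir) {
    if (outDir.empty()) return fileName;                  // rules (a) and (c)
    const std::string::size_type slash = fileName.rfind('/');
    const std::string base =
        (slash == std::string::npos) ? fileName : fileName.substr(slash + 1);
    const bool needSep = outDir[outDir.size() - 1] != '/';
    return outDir + (needSep ? "/" : "") + base;          // rule (b)
}

void exampleOutputDir() {
    assert(outputPath("src/expr.c", "") == "src/expr.c");
    assert(outputPath("src/expr.c", "gen") == "gen/expr.c");
}
```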
#120. (Changed in 1.33MR11) "-info f" output to stdout rather than stderr
Added option 0 (e.g. "-info 0") which is a noop.
#119. (Changed in 1.33MR11) Ambiguity aid for grammars
The user can ask for additional information on ambiguities reported
by antlr to stdout. At the moment, only one ambiguity report can
be created in an antlr run.
This feature is enabled using the "-aa" (Ambiguity Aid) option.
The following options control the reporting of ambiguities:
-aa ruleName Selects reporting by name of rule
-aa lineNumber Selects reporting by line number
(file name not compared)
-aam Selects "multiple" reporting for a token
in the intersection set of the
alternatives.
For instance, the token ID may appear dozens
of times in various paths as the program
explores the rules which are reachable from
the point of an ambiguity. With option -aam
every possible path the search program
encounters is reported.
Without -aam only the first encounter is
reported. This may result in incomplete
information, but the information may be
sufficient and much shorter.
-aad depth Selects the depth of the search.
The default value is 1.
The number of paths to be searched, and the
size of the report can grow geometrically
with the -ck value if a full search for all
contributions to the source of the ambiguity
is explored.
The depth represents the number of tokens
in the lookahead set which are matched against
the set of ambiguous tokens. A depth of 1
means that the search stops when a lookahead
sequence of just one token is matched.
A k=1 ck=6 grammar might generate 5,000 items
in a report if a full depth 6 search is made
with the Ambiguity Aid. The source of the
problem may be in the first token and obscured
by the volume of data - I hesitate to call
it information.
When the user selects a depth > 1, the search
is first performed at depth=1 for both
alternatives, then depth=2 for both alternatives,
etc.
Sample output for rule grammar in antlr.g itself:
+---------------------------------------------------------------------+
| Ambiguity Aid |
| |
| Choice 1: grammar/70 line 632 file a.g |
| Choice 2: grammar/82 line 644 file a.g |
| |
| Intersection of lookahead[1] sets: |
| |
| "}" "class" "#errclass" "#tokclass" |
| |
| Choice:1 Depth:1 Group:1 ("#errclass") |
| 1 in (...)* block grammar/70 line 632 a.g |
| 2 to error grammar/73 line 635 a.g |
| 3 error error/1 line 894 a.g |
| 4 #token "#errclass" error/2 line 895 a.g |
| |
| Choice:1 Depth:1 Group:2 ("#tokclass") |
| 2 to tclass grammar/74 line 636 a.g |
| 3 tclass tclass/1 line 937 a.g |
| 4 #token "#tokclass" tclass/2 line 938 a.g |
| |
| Choice:1 Depth:1 Group:3 ("class") |
| 2 to class_def grammar/75 line 637 a.g |
| 3 class_def class_def/1 line 669 a.g |
| 4 #token "class" class_def/3 line 671 a.g |
| |
| Choice:1 Depth:1 Group:4 ("}") |
| 2 #token "}" grammar/76 line 638 a.g |
| |
| Choice:2 Depth:1 Group:5 ("#errclass") |
| 1 in (...)* block grammar/83 line 645 a.g |
| 2 to error grammar/93 line 655 a.g |
| 3 error error/1 line 894 a.g |
| 4 #token "#errclass" error/2 line 895 a.g |
| |
| Choice:2 Depth:1 Group:6 ("#tokclass") |
| 2 to tclass grammar/94 line 656 a.g |
| 3 tclass tclass/1 line 937 a.g |
| 4 #token "#tokclass" tclass/2 line 938 a.g |
| |
| Choice:2 Depth:1 Group:7 ("class") |
| 2 to class_def grammar/95 line 657 a.g |
| 3 class_def class_def/1 line 669 a.g |
| 4 #token "class" class_def/3 line 671 a.g |
| |
| Choice:2 Depth:1 Group:8 ("}") |
| 2 #token "}" grammar/96 line 658 a.g |
+---------------------------------------------------------------------+
For a linear lookahead set ambiguity (where k=1, or where k>1 but
all lookahead sets [i] with i<k have degree one) the
reports appear in the following order:
for (depth=1 ; depth <= "-aad depth" ; depth++) {
for (alternative=1; alternative <=2 ; alternative++) {
while (matches-are-found) {
group++;
print-report
};
};
};
For reporting a k-tuple ambiguity, the reports appear in the
following order:
for (depth=1 ; depth <= "-aad depth" ; depth++) {
while (matches-are-found) {
for (alternative=1; alternative <=2 ; alternative++) {
group++;
print-report
};
};
};
This is because matches are generated in different ways for
linear lookahead and k-tuples.
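The two orderings can be compared with a small C++ sketch. It assumes,
purely for illustration, that every alternative yields the same fixed
number of matches at every depth; Report, linearOrder and tupleOrder are
hypothetical names.

```cpp
#include <cassert>
#include <vector>

struct Report { int group; int depth; int alternative; };

// Linear lookahead: all matches of alt 1, then all matches of alt 2.
std::vector<Report> linearOrder(int maxDepth, int matchesPerAlt) {
    std::vector<Report> out;
    int group = 0;
    for (int depth = 1; depth <= maxDepth; depth++)
        for (int alt = 1; alt <= 2; alt++)
            for (int m = 0; m < matchesPerAlt; m++)
                out.push_back({++group, depth, alt});
    return out;
}

// k-tuple ambiguity: the alternatives are interleaved match by match.
std::vector<Report> tupleOrder(int maxDepth, int matchesPerAlt) {
    std::vector<Report> out;
    int group = 0;
    for (int depth = 1; depth <= maxDepth; depth++)
        for (int m = 0; m < matchesPerAlt; m++)
            for (int alt = 1; alt <= 2; alt++)
                out.push_back({++group, depth, alt});
    return out;
}

void exampleOrder() {
    // Four matches per alternative at depth 1, as in the sample report:
    // groups 1-4 belong to choice 1 and groups 5-8 to choice 2.
    const std::vector<Report> lin = linearOrder(1, 4);
    assert(lin[3].group == 4 && lin[3].alternative == 1);
    assert(lin[4].group == 5 && lin[4].alternative == 2);
}
```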
#118. (Changed in 1.33MR11) DEC VMS makefile and VMS related changes
Revised makefiles for DEC/VMS operating system for antlr, dlg,
and sorcerer.
Reduced names of routines with external linkage to less than 32
characters to conform to DEC/VMS linker limitations.
Jean-Francois Pieronne discovered problems with dlg and antlr
due to the VMS linker not being case sensitive for names with
external linkage. In dlg the problem was with "className" and
"ClassName". In antlr the problem was with "GenExprSets" and
"genExprSets".
Added genmms, a version of genmk for the DEC/VMS version of make.
The source is in directory pccts/support/DECmms.
All VMS contributions by Jean-Francois Pieronne (jfp@iname.com).
#117. (Changed in 1.33MR10) new EXPERIMENTAL predicate hoisting code
The hoisting of predicates into rules to create prediction
expressions is a problem in antlr. Consider the following
example (k=1 with -prc on):
start : (a)* "@" ;
a : b | c ;
b : <<isUpper(LATEXT(1))>>? A ;
c : A ;
Prior to 1.33MR10 the code generated for "start" would resemble:
while {
if (LA(1)==A &&
(!(LA(1)==A) || isUpper())) {
a();
}
};
This code is wrong because it makes rule "c" unreachable from
"start". The essence of the problem is that antlr fails to
recognize that there can be a valid alternative within "a" even
when the predicate <<isUpper(LATEXT(1))>>? is false.
In 1.33MR10 with -mrhoist the hoisting of the predicate into
"start" is suppressed because antlr recognizes that "c" can
cover all the cases where the predicate is false:
    while {
      if (LA(1)==A) {
        a();
      }
    };
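To see why the older guard hides rule "c", the two conditions can be evaluated side by side for a token of type A whose text is not upper case. A minimal sketch in C follows. The enum and the functions old_guard and mrhoist_guard are hypothetical stand-ins for the generated code, and the predicate result is passed in as a flag rather than computed from LATEXT(1).

```c
#include <assert.h>

/* Hypothetical token types; only T_A matters for this grammar. */
enum TokenType { T_A = 1, T_AT = 2 };

/* Guard resembling the pre-1.33MR10 code: the predicate is hoisted into
   "start", so a false predicate blocks the call to a() entirely. */
int old_guard(enum TokenType la1, int is_upper)
{
    assert(is_upper == 0 || is_upper == 1);
    return la1 == T_A && (!(la1 == T_A) || is_upper);
}

/* Guard resembling the -mrhoist code: hoisting is suppressed because
   alternative "c" accepts A whether or not the predicate holds. */
int mrhoist_guard(enum TokenType la1, int is_upper)
{
    assert(is_upper == 0 || is_upper == 1);
    (void)is_upper; /* the predicate is left to be tested inside rule "a" */
    return la1 == T_A;
}
```

With la1 equal to T_A and the predicate false, old_guard yields 0 while mrhoist_guard yields 1, so only the second form lets the parser enter "a" and reach "c".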
With the antlr "-info p" switch the user will receive information
about the predicate suppression in the generated file:
--------------------------------------------------------------
#if 0
Hoisting of predicate suppressed by alternative without predicate.
The alt without the predicate includes all cases where
the predicate is false.
WITH predicate: line 7 v1.g
WITHOUT predicate: line 7 v1.g
The context set for the predicate:
A
The lookahead set for the alt WITHOUT the semantic predicate:
A
The predicate:
pred << isUpper(LATEXT(1))>>?
depth=k=1 rule b line 9 v1.g
set context:
A
tree context: null
Chain of referenced rules:
#0 in rule start (line 5 v1.g) to rule a
#1 in rule a (line 7 v1.g)
#endif
--------------------------------------------------------------
A predicate can also be suppressed by a combination of alternatives
which, taken together, cover the predicate's entire context:
    start : (a)* "@" ;
    a     : b | ca | cb | cc ;
    b     : <<isUpper(LATEXT(1))>>? ( A | B | C ) ;
    ca    : A ;
    cb    : B ;
    cc    : C ;
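The covering test can be pictured as a set comparison: hoisting is suppressed when the union of the lookahead sets of the predicate-free alternatives contains the predicate's context set. The sketch below models k=1 token sets as bit masks. The type TokenSet, the token bit constants, and the function alts_cover_context are illustrative assumptions and do not reflect antlr's internal data structures.

```c
#include <assert.h>
#include <stddef.h>

/* One bit per token type (an assumption made for this sketch). */
typedef unsigned int TokenSet;

#define TOK_A (1u << 0)
#define TOK_B (1u << 1)
#define TOK_C (1u << 2)

/* Returns 1 when every token in `context` appears in at least one of
   the n lookahead sets in `alt_sets`. */
int alts_cover_context(TokenSet context, const TokenSet *alt_sets, size_t n)
{
    TokenSet covered = 0;
    size_t i;

    assert(alt_sets != NULL || n == 0);
    for (i = 0; i < n; i++) {
        covered |= alt_sets[i];   /* union of the predicate-free alts */
    }
    return (context & ~covered) == 0;
}
```

For the grammar above, the context { A, B, C } is covered by the three sets { A }, { B }, { C } taken together, but not by any two of them.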
Consider a more complex example in which "c" covers only part of
a predicate:
    start : (a)* "@" ;
    a     : b
          | c
          ;
    b     : <<isUpper(LATEXT(1))>>?
            ( A
            | X
            );
    c     : A
          ;
Prior to 1.33MR10 the code generated for "start" would resemble:
    while {
      if ( (LA(1)==A || LA(1)==X) &&
           (! (LA(1)==A || LA(1)==X) || isUpper()) ) {
        a();
      }
    };
With 1.33MR10 and -mrhoist the predicate context is restricted to
the non-covered lookahead. The code resembles:
    while {
      if ( (LA(1)==A || LA(1)==X) &&
           (! (LA(1)==X) || isUpper()) ) {
        a();
      }
    };
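The restriction amounts to a set difference: the predicate keeps only the part of its context that no predicate-free alternative can match. A small sketch in C, using bit masks for k=1 token sets, is given below. The names LookaheadSet, SET_A, SET_X, restrict_context, and restricted_guard are hypothetical, and the predicate result is passed in as a flag.

```c
#include <assert.h>

/* One bit per token type (an assumption made for this sketch). */
typedef unsigned int LookaheadSet;

#define SET_A (1u << 0)
#define SET_X (1u << 1)

/* Context left for the predicate once the covered lookahead is removed. */
LookaheadSet restrict_context(LookaheadSet context, LookaheadSet covered)
{
    return context & ~covered;
}

/* Guard resembling the -mrhoist code for "start": enter a() on A or X,
   but require the predicate only when la1 lies in the restricted context. */
int restricted_guard(LookaheadSet la1, LookaheadSet restricted, int is_upper)
{
    assert((la1 & (la1 - 1)) == 0);   /* la1 holds at most one token bit */
    return (la1 & (SET_A | SET_X)) != 0 &&
           ((la1 & restricted) == 0 || is_upper);
}
```

Here restrict_context(SET_A | SET_X, SET_A) leaves only SET_X, so a lookahead of A passes the guard even when the predicate is false, while a lookahead of X still depends on the predicate.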
With the antlr "-info p" switch the user will receive information
about the predicate restriction in the generated file:
--------------------------------------------------------------
#if 0
Restricting the context of a predicate because of overlap
in the lookahead set between the alternative with the
semantic predicate and one without
Without this restriction the alternative without the predicate
could not be reached when input matched the context of the
predicate and the predicate was false.