CHANGES_FROM_1.33
- =======================================================================
- List of Implemented Fixes and Changes for Maintenance Releases of PCCTS
- =======================================================================
- DISCLAIMER
- The software and these notes are provided "as is". They may include
- typographical or technical errors and their authors disclaim all
- liability of any kind or nature for damages due to error, fault,
- defect, or deficiency regardless of cause. All warranties of any
- kind, either express or implied, including, but not limited to, the
- implied warranties of merchantability and fitness for a particular
- purpose are disclaimed.
- #197. (Changed in MR14) Resetting the lookahead buffer of the parser
- Explanation and fix by Sinan Karasu (sinan.karasu@boeing.com)
- Consider the code used to prime the lookahead buffer LA(i)
- of the parser when init() is called:
- void
- ANTLRParser::
- prime_lookahead()
- {
- int i;
- for(i=1;i<=LLk; i++) consume();
- dirty=0;
- //lap = 0; // MR14 - Sinan Karasu (sinan.karasu@boeing.com)
- //labase = 0; // MR14
- labase=lap; // MR14
- }
- When the parser is instantiated, lap=0,labase=0 is set.
- The "for" loop runs LLk times. In consume(), lap = lap +1 (mod LLk) is
- computed. Therefore, lap(before the loop) == lap (after the loop).
- Now the only problem comes in when one does an init() of the parser
- after an Eof has been seen. At that time, lap could be non zero.
- Assume it was lap==1. Now we do a prime_lookahead(). If LLk is 2,
- then
- consume()
- {
- NLA = inputTokens->getToken()->getType();
- dirty--;
- lap = (lap+1)&(LLk-1);
- }
- or expanding NLA,
- token_type[lap&(LLk-1)] = inputTokens->getToken()->getType();
- dirty--;
- lap = (lap+1)&(LLk-1);
- so now we prime locations 1 and 0 (the index wraps modulo LLk).
- prime_lookahead used to set lap=0 and labase=0, so the next token
- would be read from location 0, NOT 1 as it should have been.
- This was never caught before because, if a parser has just been
- instantiated, lap and labase are 0 and the offending assignment lines
- are basically no-ops, since the for loop wraps around back to 0.
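- The wrap-around described in Item #197 can be reproduced with a small
- stand-alone model. This is only a sketch: LookaheadSketch, its input
- vector and the start_lap argument are hypothetical, and it assumes that
- LA(i) is read relative to labase.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the parser's circular lookahead buffer.
// LLk must be a power of two so that "& (LLk-1)" acts as "mod LLk".
struct LookaheadSketch {
    enum { LLk = 2 };
    int token_type[LLk];
    int lap;      // next slot to be filled by consume()
    int labase;   // slot that holds LA(1)
    std::vector<int> input;
    std::size_t pos;

    // start_lap models the value lap happens to have when init() is called.
    LookaheadSketch(const std::vector<int>& in, int start_lap)
        : lap(start_lap), labase(start_lap), input(in), pos(0) {
        assert(start_lap >= 0 && start_lap < LLk);
        for (int i = 0; i < LLk; ++i) token_type[i] = -1;
    }

    void consume() {
        token_type[lap & (LLk - 1)] = input.at(pos++);
        lap = (lap + 1) & (LLk - 1);
    }

    // reset_to_zero == true mimics the pre-MR14 code (lap=0; labase=0).
    // reset_to_zero == false mimics the MR14 code (labase=lap).
    void prime_lookahead(bool reset_to_zero) {
        for (int i = 1; i <= LLk; ++i) consume();
        if (reset_to_zero) { lap = 0; labase = 0; }
        else               { labase = lap; }
    }

    int LA(int i) const { return token_type[(labase + i - 1) & (LLk - 1)]; }
};
```

- Priming from lap==1 with the old reset makes LA(1) return the second
- token of the input; with labase=lap it returns the first.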
- #196. (Changed in MR14) Problems with "(alpha)? beta" guess
- Consider the following syntactic predicate in a grammar
- with 2 tokens of lookahead (k=2 or ck=2):
- rule : ( alpha )? beta ;
- alpha : S t ;
- t : T U
- | T
- ;
- beta : S t Z ;
- When antlr computes the prediction expression with one token
- of lookahead for alts 1 and 2 of rule t it finds an ambiguity.
- Because the grammar has a lookahead of 2 it tries to compute
- two tokens of lookahead for alts 1 and 2 of t. Alt 1 clearly
- has a lookahead of (T U). Alt 2 is one token long so antlr
- tries to compute the follow set of alt 2, which means finding
- the things which can follow rule t in the context of (alpha)?.
- This cannot be computed, because alpha is only part of a rule,
- and antlr can't tell what part of beta is matched by alpha and
- what part remains to be matched. Thus it is impossible for antlr
- to properly determine the follow set of rule t.
- Prior to 1.33MR14, the follow of (alpha)? was computed as
- FIRST(beta) as a result of the internal representation of
- guess blocks.
- With MR14 the follow set will be the empty set for that context.
- Normally, one expects a rule appearing in a guess block to also
- appear elsewhere. When the follow context for this other use
- is "ored" with the empty set, the context from the other use
- results, and a reasonable follow context results. However if
- there is *no* other use of the rule, or it is used in a different
- manner then the follow context will be inaccurate - it was
- inaccurate even before MR14, but it will be inaccurate in a
- different way.
- For the example given earlier, a reasonable way to rewrite the
- grammar:
- rule : ( alpha )? beta ;
- alpha : S t ;
- t : T U
- | T
- ;
- beta : alpha Z ;
- If there are no other uses of the rule appearing in the guess
- block it will generate a test for EOF - a workaround for
- representing a null set in the lookahead tests.
- If you encounter such a problem you can use the -alpha option
- to get additional information:
- line 2: error: not possible to compute follow set for alpha
- in an "(alpha)? beta" block.
- With the antlr -alpha command line option the following information
- is inserted into the generated file:
- #if 0
- Trace of references leading to attempt to compute the follow set of
- alpha in an "(alpha)? beta" block. It is not possible for antlr to
- compute this follow set because it is not known what part of beta has
- already been matched by alpha and what part remains to be matched.
- Rules which make use of the incorrect follow set will also be incorrect
- 1 #token T alpha/2 line 7 brief.g
- 2 end alpha alpha/3 line 8 brief.g
- 2 end (...)? block at start/1 line 2 brief.g
- #endif
- At the moment, with the -alpha option selected the program marks
- any rules which appear in the trace back chain (above) as rules with
- possible problems computing follow set.
- Reported by Greg Knapen (gregory.knapen@bell.ca).
- #195. (Changed in MR14) #line directive not at column 1
- Under certain circumstances a predicate test could generate
- a #line directive which was not at column 1.
- Reported with fix by David Kågedal (davidk@lysator.liu.se)
- (http://www.lysator.liu.se/~davidk/).
- #194. (Changed in MR14) (C Mode only) Demand lookahead with #tokclass
- In C mode with the demand lookahead option there is a bug in the
- code which handles matches for #tokclass (zzsetmatch and
- zzsetmatch_wsig).
- The bug causes the lookahead pointer to get out of synchronization
- with the current token pointer.
- The problem was reported with a fix by Ger Hobbelt (hobbelt@axa.nl).
- #193. (Changed in MR14) Use of PCCTS_USE_NAMESPACE_STD
- The pcctscfg.h now contains the following definitions:
- #ifdef PCCTS_USE_NAMESPACE_STD
- #define PCCTS_STDIO_H <Cstdio>
- #define PCCTS_STDLIB_H <Cstdlib>
- #define PCCTS_STDARG_H <Cstdarg>
- #define PCCTS_SETJMP_H <Csetjmp>
- #define PCCTS_STRING_H <Cstring>
- #define PCCTS_ASSERT_H <Cassert>
- #define PCCTS_ISTREAM_H <istream>
- #define PCCTS_IOSTREAM_H <iostream>
- #define PCCTS_NAMESPACE_STD namespace std {}; using namespace std;
- #else
- #define PCCTS_STDIO_H <stdio.h>
- #define PCCTS_STDLIB_H <stdlib.h>
- #define PCCTS_STDARG_H <stdarg.h>
- #define PCCTS_SETJMP_H <setjmp.h>
- #define PCCTS_STRING_H <string.h>
- #define PCCTS_ASSERT_H <assert.h>
- #define PCCTS_ISTREAM_H <istream.h>
- #define PCCTS_IOSTREAM_H <iostream.h>
- #define PCCTS_NAMESPACE_STD
- #endif
- The runtime support in pccts/h uses these pre-processor symbols
- consistently.
- Also, antlr and dlg have been changed to generate code which uses
- these pre-processor symbols rather than having the names of the
- #include files hard-coded in the generated code.
- This required the addition of "#include pcctscfg.h" to a number of
- files in pccts/h.
- It appears that this sometimes causes problems for MSVC 5 in
- combination with the "automatic" option for pre-compiled headers.
- In such cases disable the "automatic" pre-compiled headers option.
- Suggested by Hubert Holin (Hubert.Holin@Bigfoot.com).
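- A minimal sketch of how code can rely on these symbols (Item #193).
- The two #define lines stand in for pcctscfg.h and are an assumption of
- this example, as is the helper sketch_length(). The sketch uses the
- standard lowercase spelling <cstring>, which case-sensitive file systems
- require.

```cpp
#include <cassert>

// Assumption: these two lines stand in for what pcctscfg.h provides when
// PCCTS_USE_NAMESPACE_STD is defined; real code would include pcctscfg.h.
#define PCCTS_STRING_H <cstring>
#define PCCTS_NAMESPACE_STD namespace std {}; using namespace std;

// The header name is taken from the macro instead of being hard-coded.
#include PCCTS_STRING_H
PCCTS_NAMESPACE_STD

// Hypothetical helper: strlen is found without a std:: prefix because of
// the using-directive expanded from PCCTS_NAMESPACE_STD.
inline unsigned long sketch_length(const char* s) {
    return static_cast<unsigned long>(strlen(s));
}
```

- The same source compiles in the other configuration when the macros
- name <string.h> and expand PCCTS_NAMESPACE_STD to nothing.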
- #192. (Changed in MR14) Change setText() to accept "const ANTLRChar *"
- Changed ANTLRToken::setText(ANTLRChar *) to setText(const ANTLRChar *).
- This allows literal strings to be used to initialize tokens. Since
- the usual token implementation (ANTLRCommonToken) makes a copy of the
- input string, this was an unnecessary limitation.
- Suggested by Bob McWhirter (bob@netwrench.com).
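- A sketch of why the const qualifier in Item #192 is harmless when the
- token copies its text. SketchToken is hypothetical and only mirrors the
- copying behaviour the item attributes to ANTLRCommonToken; ANTLRChar is
- assumed to be plain char here.

```cpp
#include <cassert>
#include <string>

typedef char ANTLRChar;  // assumption: ANTLRChar is a plain char here

// Hypothetical token class that copies the text it is given.
class SketchToken {
public:
    // Accepting a pointer to const lets callers pass string literals.
    void setText(const ANTLRChar* s) { text_ = s ? s : ""; }
    const ANTLRChar* getText() const { return text_.c_str(); }
private:
    std::string text_;  // owns a private copy of the text
};
```

- With this signature a call such as tok.setText("while") compiles
- without a cast.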
- #191. (Changed in MR14) HP/UX aCC compiler compatibility problem
- Needed to explicitly declare zzINF_DEF_TOKEN_BUFFER_SIZE and
- zzINF_BUFFER_TOKEN_CHUNK_SIZE as ints in pccts/h/AParser.cpp.
- Reported by David Cook (dcook@bmc.com).
- #190. (Changed in MR14) IBM OS/2 CSet compiler compatibility problem
- Name conflict with "_cs" in pccts/h/ATokenBuffer.cpp
- Reported by David Cook (dcook@bmc.com).
- #189. (Changed in MR14) -gxt switch in C mode
- The -gxt switch in C mode didn't work because of incorrect
- initialization.
- Reported by Sinan Karasu (sinan@boeing.com).
- #188. (Changed in MR14) Added pccts/h/DLG_stream_input.h
- This is a DLG stream class based on C++ istreams.
- Contributed by Hubert Holin (Hubert.Holin@Bigfoot.com).
- #187. (Changed in MR14) Rename config.h to pcctscfg.h
- The PCCTS configuration file has been renamed from config.h to
- pcctscfg.h. The problem with the original name is that it led
- to name collisions when pccts parsers were combined with other
- software.
- All of the runtime support routines in pccts/h/* have been
- changed to use the new name. Existing software can continue
- to use pccts/h/config.h. The content of pccts/h/config.h is
- now just '#include "pcctscfg.h"'.
- I don't have a record of the user who suggested this.
- #186. (Changed in MR14) Pre-processor symbol DllExportPCCTS class modifier
- Classes in the C++ runtime support routines are now declared:
- class DllExportPCCTS className ....
- By default, the pre-processor symbol is defined as the empty
- string. This is for use by MSVC++ users to create DLL classes.
- Suggested by Manfred Kogler (km@cast.uni-linz.ac.at).
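- The class-modifier pattern of Item #186 might look like the following
- sketch. SketchRuntimeClass is hypothetical, and the __declspec value
- mentioned in the comment is an MSVC-specific illustration rather than a
- setting taken from these notes.

```cpp
#include <cassert>

// Default: the symbol expands to nothing, as the item describes.
// A DLL build could instead define it on the command line, for example
// as __declspec(dllexport) with MSVC (illustration only).
#ifndef DllExportPCCTS
#define DllExportPCCTS
#endif

// Hypothetical class standing in for a runtime support class.
class DllExportPCCTS SketchRuntimeClass {
public:
    int version() const { return 14; }
};
```

- Because the default expansion is empty, builds that do not create a
- DLL see an ordinary class declaration.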
- #185. (Changed in MR14) Option to not use PCCTS_AST base class for ASTBase
- Normally, the ASTBase class is derived from PCCTS_AST which contains
- functions useful to Sorcerer. If these are not necessary then the
- user can define the pre-processor symbol "PCCTS_NOT_USING_SOR" which
- will cause the ASTBase class to replace references to PCCTS_AST with
- references to ASTBase where necessary.
- The class ASTDoublyLinkedBase will contain a pure virtual function
- shallowCopy() that was formerly defined in class PCCTS_AST.
- Suggested by Bob McWhirter (bob@netwrench.com).
- #184. (Changed in MR14) Grammars with no tokens generate invalid tokens.h
- Reported by Hubert Holin (Hubert.Holin@bigfoot.com).
- #183. (Changed in MR14) -f to specify file with names of grammar files
- In DEC/VMS it is difficult to specify very long command lines.
- The -f option allows one to place the names of the grammar files
- in a data file in order to bypass limitations of the DEC/VMS
- command language interpreter.
- Addition supplied by Bernard Giroud (b_giroud@decus.ch).
- #182. (Changed in MR14) Output directory option for DEC/VMS
- Fix some problems with the -o option under DEC/VMS.
- Fix supplied by Bernard Giroud (b_giroud@decus.ch).
- #181. (Changed in MR14) Allow chars > 127 in DLGStringInput::nextChar()
- Changed DLGStringInput to cast the character using (unsigned char)
- so that languages with character codes greater than 127 work
- without changes.
- Suggested by Manfred Kogler (km@cast.uni-linz.ac.at).
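- The effect of the cast in Item #181 can be seen in a reduced model.
- SketchStringInput is hypothetical, and the use of -1 as the end-of-input
- value is an assumption of this sketch.

```cpp
#include <cassert>

// Hypothetical reduced version of a string-backed input class.
class SketchStringInput {
public:
    explicit SketchStringInput(const char* s) : p_(s) { assert(s != 0); }

    // Returns the next character as a non-negative int, or -1 at the end.
    // Without the unsigned char cast, a byte such as 0xE9 would come back
    // negative on platforms where plain char is signed.
    int nextChar() {
        if (*p_ == '\0') return -1;
        return static_cast<int>(static_cast<unsigned char>(*p_++));
    }
private:
    const char* p_;
};
```

- A character code above 127 is then returned as a value in the range
- 128..255 and cannot be confused with the end-of-input marker.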
- #180. (Added in MR14) ANTLRParser::getEofToken()
- Added "ANTLRToken ANTLRParser::getEofToken() const" to match the
- setEofToken routine.
- Requested by Manfred Kogler (km@cast.uni-linz.ac.at).
- #179. (Fixed in MR14) Memory leak for BufFileInput subclass of DLGInputStream
- The BufFileInput class described in Item #142 neglected to release
- the allocated buffer when an instance was destroyed.
- Reported by Manfred Kogler (km@cast.uni-linz.ac.at).
- #178. (Fixed in MR14) Bug in "(alpha)? beta" guess blocks first sets
- In 1.33 vanilla, and all maintenance releases prior to MR14
- there is a bug in the handling of guess blocks which use the
- "long" form:
- (alpha)? beta
- inside a (...)*, (...)+, or {...} block.
- This problem does *not* apply to the case where beta is omitted
- or when the syntactic predicate is on the leading edge of an
- alternative.
- The problem is that both alpha and beta are stored in the
- syntax diagram, and that some analysis routines would fail
- to skip the alpha portion when it was not on the leading edge.
- Consider the following grammar with -ck 2:
- r : ( (A)? B )* C D
- | A B /* forces -ck 2 computation for old antlr */
- /* reports ambig for alts 1 & 2 */
- | B C /* forces -ck 2 computation for new antlr */
- /* reports ambig for alts 1 & 3 */
- ;
- The prediction expression for the first alternative should be
- LA(1)={B C} LA(2)={B C D}, but previous versions of antlr
- would compute the prediction expression as LA(1)={A C} LA(2)={B D}
- Reported by Arpad Beszedes (beszedes@inf.u-szeged.hu) who provided
- a very clear example of the problem and identified the probable cause.
- #177. (Changed in MR14) #tokdefs and #token with regular expression
- In MR13 the change described by Item #162 caused an existing
- feature of antlr to fail. Prior to the change it was possible
- to give regular expression definitions and actions to tokens
- which were defined via the #tokdefs directive.
- This now works again.
- Reported by Manfred Kogler (km@cast.uni-linz.ac.at).
- #176. (Changed in MR14) Support for #line in antlr source code
- Note: this was implemented by Arpad Beszedes (beszedes@inf.u-szeged.hu).
- In 1.33MR14 it is possible for a pre-processor to generate #line
- directives in the antlr source and have those line numbers and file
- names used in antlr error messages and in the #line directives
- generated by antlr.
- The #line directive may appear in the following forms:
- #line ll "sss" xx xx ...
- where ll represents a line number, "sss" represents the name of a file
- enclosed in quotation marks, and xx are arbitrary integers.
- The following form (without "line") is not supported at the moment:
- # ll "sss" xx xx ...
- The result:
- zzline
- is replaced with ll from the # or #line directive
- FileStr[CurFile]
- is updated with the contents of the string (if any)
- following the line number
- Note
- ----
- The file-name string following the line number can be a complete
- name with a directory-path. Antlr generates the output files from
- the input file name (by replacing the extension from the file-name
- with .c or .cpp).
- If the input file (or the file-name from the line-info) contains
- a path:
- "../grammar.g"
- the generated source code will be placed in "../grammar.cpp" (i.e.
- in the parent directory). This is inconvenient in some cases
- (even the -o switch can not be used) so the path information is
- removed from the #line directive. Thus, if the line-info was
- #line 2 "../grammar.g"
- then the current file-name will become "grammar.g"
- In this way, the generated source code according to the grammar file
- will always be in the current directory, except when the -o switch
- is used.
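- The path stripping described in Item #176 can be sketched as follows.
- Both functions are hypothetical and are not antlr's own code; treating
- the backslash as a directory separator in addition to "/" is an
- assumption of this example.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Drop any directory part, keeping only the final file-name component.
inline std::string strip_path(const std::string& name) {
    std::string::size_type cut = name.find_last_of("/\\");
    return cut == std::string::npos ? name : name.substr(cut + 1);
}

// Hypothetical parser for:  #line ll "sss" xx xx ...
// On success stores the line number and, if present, the path-stripped
// file name. The form without the word "line" is rejected.
inline bool parse_line_directive(const std::string& text,
                                 int& line, std::string& file) {
    std::istringstream in(text);
    std::string keyword;
    if (!(in >> keyword) || keyword != "#line") return false;
    if (!(in >> line)) return false;
    std::string rest;
    std::getline(in, rest);
    std::string::size_type open = rest.find('"');
    if (open == std::string::npos) return true;   // no file name given
    std::string::size_type close = rest.find('"', open + 1);
    if (close == std::string::npos) return false; // unterminated string
    file = strip_path(rest.substr(open + 1, close - open - 1));
    return true;
}
```

- For the directive #line 2 "../grammar.g" the sketch yields line 2 and
- the file name grammar.g.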
- #175. (Changed in MR14) Bug when guess block appears at start of (...)*
- In 1.33 vanilla and all maintenance releases prior to 1.33MR14
- there is a bug when a guess block appears at the start of a (...)*.
- Consider the following k=1 (ck=1) grammar:
- rule :
- ( (STAR)? ZIP )* ID ;
- Prior to 1.33MR14, the generated code resembled:
- ...
- zzGUESS_BLOCK
- while ( 1 ) {
- if ( !(LA(1)==STAR) ) break;
- zzGUESS
- if ( !zzrv ) {
- zzmatch(STAR);
- zzCONSUME;
- zzGUESS_DONE
- zzmatch(ZIP);
- zzCONSUME;
- ...
- Note that the routine uses STAR for the prediction expression
- rather than ZIP. With 1.33MR14 the generated code resembles:
- ...
- while ( 1 ) {
- if ( !(LA(1)==ZIP) ) break;
- ...
- This problem existed only with (...)* blocks and was caused
- by the slightly more complicated graph which represents (...)*
- blocks. This caused the analysis routine to compute the first
- set for the alpha part of the "(alpha)? beta" rather than the
- beta part.
- Both (...)+ and {...} blocks handled the guess block correctly.
- Reported by Arpad Beszedes (beszedes@inf.u-szeged.hu) who provided
- a very clear example of the problem and identified the probable cause.
- #174. (Changed in MR14) Bug when action precedes syntactic predicate
- In 1.33 vanilla, and all maintenance releases prior to MR14,
- there was a bug when a syntactic predicate was immediately
- preceded by an action. Consider the following -ck 2 grammar:
- rule :
- <<int i;>>
- (alpha)? beta C
- | A B
- ;
- alpha : A ;
- beta : A B;
- Prior to MR14, the code generated for the first alternative
- resembled:
- ...
- zzGUESS
- if ( !zzrv && LA(1)==A && LA(2)==A) {
- alpha();
- zzGUESS_DONE
- beta();
- zzmatch(C);
- zzCONSUME;
- } else {
- ...
- The prediction expression (i.e. LA(1)==A && LA(2)==A) is clearly
- wrong because LA(2) should be matched to B (first[2] of beta is {B}).
- With 1.33MR14 the prediction expression is:
- ...
- if ( !zzrv && LA(1)==A && LA(2)==B) {
- alpha();
- zzGUESS_DONE
- beta();
- zzmatch(C);
- zzCONSUME;
- } else {
- ...
- This will only affect grammars in which alpha is shorter than
- max(k,ck) and there is an action immediately preceding
- the syntactic predicate.
- This problem was reported by Arpad Beszedes
- (beszedes@inf.u-szeged.hu) who provided a very clear example
- of the problem and identified the presence of the init-action
- as the likely culprit.
- #173. (Changed in MR13a) -glms for Microsoft style filenames with -gl
- With the -gl option antlr generates #line directives using the
- exact name of the input files specified on the command line.
- An oddity of the Microsoft C and C++ compilers is that they
- don't accept file names in #line directives containing "\"
- even though these are names from the native file system.
- With the -glms option, the "\" in file names appearing in #line
- directives is replaced with a "/" in order to conform to
- Microsoft compiler requirements.
- Reported by Erwin Achermann (erwin.achermann@switzerland.org).
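- The substitution performed for -glms (Item #173) amounts to replacing
- each backslash in the file name with a forward slash. A minimal sketch,
- with glms_file_name() as a hypothetical helper name:

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Return a copy of the name with every '\' turned into '/', which is the
// form the notes say the Microsoft compilers accept in #line directives.
inline std::string glms_file_name(std::string name) {
    std::replace(name.begin(), name.end(), '\\', '/');
    return name;
}
```

- Names without a backslash pass through unchanged.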
- #172. (Changed in MR13) \r\n in antlr source counted as one line
- Some MS software uses \r\n to indicate a new line. Antlr
- now recognizes this in counting lines.
- Reported by Edward L. Hepler (elh@ece.vill.edu).
- #171. (Changed in MR13) #tokclass L..U now allowed
- The following is now allowed:
- #tokclass ABC { A..B C }
- Reported by Dave Watola (dwatola@amtsun.jpl.nasa.gov)
- #170. (Changed in MR13) Suppression for predicates with lookahead depth >1
- In MR12 the capability for suppression of predicates with lookahead
- depth=1 was introduced. With MR13 this has been extended to
- predicates with lookahead depth > 1 and released for use by users
- on an experimental basis.
- Consider the following grammar with -ck 2 and the predicate in rule
- "a" with depth 2:
- r1 : (ab)* "@"
- ;
- ab : a
- | b
- ;
- a : (A B)? => <<p(LATEXT(2))>>? A B C
- ;
- b : A B C
- ;
- Normally, the predicate would be hoisted into rule r1 in order to
- determine whether to call rule "ab". However it should *not* be
- hoisted because, even if p is false, there is a valid alternative
- in rule b. With "-mrhoistk on" the predicate will be suppressed.
- If "-info p" command line option is present the following information
- will appear in the generated code:
- while ( (LA(1)==A)
- #if 0
- Part (or all) of predicate with depth > 1 suppressed by alternative
- without predicate
- pred << p(LATEXT(2))>>?
- depth=k=2 ("=>" guard) rule a line 8 t1.g
- tree context:
- (root = A
- B
- )
- The token sequence which is suppressed: ( A B )
- The sequence of references which generate that sequence of tokens:
- 1 to ab r1/1 line 1 t1.g
- 2 ab ab/1 line 4 t1.g
- 3 to b ab/2 line 5 t1.g
- 4 b b/1 line 11 t1.g
- 5 #token A b/1 line 11 t1.g
- 6 #token B b/1 line 11 t1.g
- #endif
- A slightly more complicated example:
- r1 : (ab)* "@"
- ;
- ab : a
- | b
- ;
- a : (A B)? => <<p(LATEXT(2))>>? (A B | D E)
- ;
- b : <<q(LATEXT(2))>>? D E
- ;
- In this case, the sequence (D E) in rule "a" which lies behind
- the guard is used to suppress the predicate with context (D E)
- in rule b.
- while ( (LA(1)==A || LA(1)==D)
- #if 0
- Part (or all) of predicate with depth > 1 suppressed by alternative
- without predicate
- pred << q(LATEXT(2))>>?
- depth=k=2 rule b line 11 t2.g
- tree context:
- (root = D
- E
- )
- The token sequence which is suppressed: ( D E )
- The sequence of references which generate that sequence of tokens:
- 1 to ab r1/1 line 1 t2.g
- 2 ab ab/1 line 4 t2.g
- 3 to a ab/1 line 4 t2.g
- 4 a a/1 line 8 t2.g
- 5 #token D a/1 line 8 t2.g
- 6 #token E a/1 line 8 t2.g
- #endif
- &&
- #if 0
- pred << p(LATEXT(2))>>?
- depth=k=2 ("=>" guard) rule a line 8 t2.g
- tree context:
- (root = A
- B
- )
- #endif
- (! ( LA(1)==A && LA(2)==B ) || p(LATEXT(2)) ) {
- ab();
- ...
- #169. (Changed in MR13) Predicate test optimization for depth=1 predicates
- When MR12 generated a test of a predicate which had depth 1
- it would use the depth >1 routines, resulting in correct but
- inefficient behavior. In MR13, a bit test is used.
- #168. (Changed in MR13) Token expressions in context guards
- The token expressions appearing in context guards such as:
- (A B)? => <<test(LT(1))>>? someRule
- are computed during an early phase of antlr processing. As
- a result, prior to MR13, complex expressions such as:
- ~B
- L..U
- ~L..U
- TokClassName
- ~TokClassName
- were not computed properly. This resulted in incorrect
- context being computed for such expressions.
- In MR13 these context guards are verified for proper semantics
- in the initial phase and then re-evaluated after complex token
- expressions have been computed in order to produce the correct
- behavior.
- Reported by Arpad Beszedes (beszedes@inf.u-szeged.hu).
- #167. (Changed in MR13) ~L..U
- Prior to MR13, the complement of a token range was
- not properly computed.
- #166. (Changed in MR13) token expression L..U
- The token U was represented as an unsigned char, restricting
- the use of L..U to cases where U was assigned a token number
- less than 256. This is corrected in MR13.
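- The restriction in Item #166 follows from ordinary integer narrowing.
- The sketch below uses hypothetical helper names and assumes 8-bit
- chars; it shows what happens to a token number of 300 when it is kept
- in an unsigned char.

```cpp
#include <cassert>

// Storing a token number in an unsigned char keeps only its low 8 bits
// (assuming 8-bit chars), so values of 256 and above are altered.
inline int store_as_unsigned_char(int token) {
    unsigned char narrow = static_cast<unsigned char>(token);
    return narrow;
}

// Storing it in an int preserves the value.
inline int store_as_int(int token) { return token; }
```

- Token numbers below 256 survive either representation, which is why
- the problem only appeared for larger token numbers.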
- #165. (Changed in MR13) option -newAST
- To create ASTs from an ANTLRTokenPtr antlr usually calls
- "new AST(ANTLRTokenPtr)". This option generates a call
- to "newAST(ANTLRTokenPtr)" instead. This allows a user
- to define a parser member function to create an AST object.
- Similar changes for ASTBase::tmake and ASTBase::link were not
- thought necessary since they do not create AST objects, only
- use existing ones.
- #164. (Changed in MR13) Unused variable _astp
- For many compilations, we have lived with warnings about
- the unused variable _astp. It turns out that this variable
- can *never* be used because the code which references it was
- commented out.
- This investigation was sparked by a note from Erwin Achermann
- (erwin.achermann@switzerland.org).
- #163. (Changed in MR13) Incorrect makefiles for testcpp examples
- All the examples in pccts/testcpp/* had incorrect definitions
- in the makefiles for the symbol "CCC". Instead of CCC=CC they
- had CC=$(CCC).
- There was an additional problem in testcpp/1/test.g due to the
- change in ANTLRToken::getText() to a const member function
- (Item #137).
- Reported by Maurice Maas (maas@cuci.nl).
- #162. (Changed in MR13) Combining #token with #tokdefs
- When it became possible to change the print-name of a
- #token (Item #148) it became useful to give a #token
- statement whose only purpose was to give a print name
- to the #token. Prior to this change this could not be
- combined with the #tokdefs feature.
- #161. (Changed in MR13) Switch -gxt inhibits generation of tokens.h
- #160. (Changed in MR13) Omissions in list of names for remap.h
- When a user selects the -gp option antlr creates a list
- of macros in remap.h to rename some of the standard
- antlr routines from zzXXX to userprefixXXX.
- There were a number of omissions from the remap.h name
- list related to the new trace facility. This was reported,
- along with a fix, by Bernie Solomon (bernard@ug.eds.com).
- #159. (Changed in MR13) Violations of classic C rules
- There were a number of violations of classic C style in
- the distribution kit. This was reported, along with fixes,
- by Bernie Solomon (bernard@ug.eds.com).
- #158. (Changed in MR13) #header causes problem for pre-processors
- A user who runs the C pre-processor on antlr source suggested
- that another syntax be allowed. With MR13 directives
- such as #header, #pragma, etc. may be written as "\#header",
- "\#pragma", etc. For escaping pre-processor directives inside
- a #header use something like the following:
- \#header
- <<
- \#include <stdio.h>
- >>
- >>
- #157. (Fixed in MR13) empty error sets for rules with infinite recursion
- When the first set for a rule cannot be computed due to infinite
- left recursion and it is the only alternative for a block then
- the error set for the block would be empty. This would result
- in a fatal error.
- Reported by Darin Creason (creason@genedax.com)
- #156. (Changed in MR13) DLGLexerBase::getToken() now public
- #155. (Changed in MR13) Context behind predicates can suppress
- With -mrhoist enabled the context behind a guarded predicate can
- be used to suppress other predicates. Consider the following grammar:
- r0 : (r1)+;
- r1 : rp
- | rq
- ;
- rp : <<p LATEXT(1)>>? B ;
- rq : (A)? => <<q LATEXT(1)>>? (A|B);
- In earlier versions both predicates "p" and "q" would be hoisted into
- rule r0. With MR12c predicate p is suppressed because the context which
- follows predicate q includes "B" which can "cover" predicate "p". In
- other words, in trying to decide in r0 whether to call r1, it doesn't
- really matter whether p is false or true because, either way, there is
- a valid choice within r1.
- #154. (Changed in MR13) Making hoist suppression explicit using <<nohoist>>
- A common error, even among experienced pccts users, is to code
- an init-action to inhibit hoisting rather than a leading action.
- An init-action does not inhibit hoisting.
- This was coded:
- rule1 : <<;>> rule2
- This is what was meant:
- rule1 : <<;>> <<;>> rule2
- With MR13, the user can code:
- rule1 : <<;>> <<nohoist>> rule2
- The following will give an error message:
- rule1 : <<nohoist>> rule2
- If the <<nohoist>> appears as an init-action rather than a leading
- action an error message is issued. The meaning of an init-action
- containing "nohoist" is unclear: does it apply to just one
- alternative or to all alternatives ?
- #153. (Changed in MR12b) Bug in computation of -mrhoist suppression set
- Consider the following grammar with k=1 and "-mrhoist on":
- r1 : (A)? => <<p>>? x /* l1 */
- | r2 /* l2 */
- ;
- r2 : A /* l4 */
- | (B)? => <<q>>? y /* l5 */
- ;
- In earlier versions the mrhoist routine would see that both l1 and
- l2 contained predicates and would assume that this prevented either
- from acting to suppress the other predicate. In the example above
- it didn't realize the A at line l4 is capable of suppressing the
- predicate at l1 even though alt l2 contains (indirectly) a predicate.
- This is fixed in MR12b.
- Reported by Reinier van den Born (reinier@vnet.ibm.com)
- #153. (Changed in MR12a) Bug in computation of -mrhoist suppression set
- An oversight similar to that described in Item #152 appeared in
- the computation of the set that "covered" a predicate. If a
- predicate expression included a term such as p=AND(q,r) the context
- of p was taken to be context(q) & context(r), when it should have
- been context(q) | context(r). This is fixed in MR12a.
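- The difference between the two set operations is easy to see with two
- disjoint contexts. This is a sketch with hypothetical token numbers and
- function names, not the representation antlr uses internally.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>

static const std::size_t kTokenCount = 2;  // hypothetical: tokens A and B
static const std::size_t kTokA = 0;
static const std::size_t kTokB = 1;
typedef std::bitset<kTokenCount> Context;

// Pre-MR12a: context of AND(q,r) taken as the intersection.
inline Context context_of_and_old(const Context& q, const Context& r) {
    return q & r;
}

// MR12a: context of AND(q,r) taken as the union.
inline Context context_of_and_new(const Context& q, const Context& r) {
    return q | r;
}
```

- With context(q)={A} and context(r)={B} the intersection is empty, so
- the old computation loses both tokens, while the union keeps {A,B}.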
- #152. (Changed in MR12) Bug in generation of predicate expressions
- The primary purpose for MR12 is to make quite clear that MR11 is
- obsolete and to fix the bug related to predicate expressions.
- In MR10 code was added to optimize the code generated for
- predicate expression tests. Unfortunately, there was a
- significant oversight in the code which resulted in a bug in
- the generation of code for predicate expression tests which
- contained predicates combined using AND:
- r0 : (r1)* "@" ;
- r1 : (AAA)? => <<p LATEXT(1)>>? r2 ;
- r2 : (BBB)? => <<q LATEXT(1)>>? Q
- | (BBB)? => <<r LATEXT(1)>>? Q
- ;
- In MR11 (and MR10 when using "-mrhoist on") the code generated
- for r0 to predict r1 would be equivalent to:
- if ( LA(1)==Q &&
- (LA(1)==AAA && LA(1)==BBB) &&
- ( p && ( q || r )) ) {
- This is incorrect because it expresses the idea that LA(1)
- *must* be AAA in order to attempt r1, and *must* be BBB to
- attempt r2. The result was that r1 became unreachable since
- both conditions cannot be simultaneously true.
- The general philosophy of code generation for predicates
- can be summarized as follows:
- a. If the context is true don't enter an alt
- for which the corresponding predicate is false.
- If the context is false then it is okay to enter
- the alt without evaluating the predicate at all.
- b. A predicate created by ORing of predicates has
- context which is the OR of their individual contexts.
- c. A predicate created by ANDing of predicates has
- (surprise) context which is the OR of their individual
- contexts.
- d. Apply these rules recursively.
- e. Remember rule (a)
- The correct code should express the idea that *if* LA(1) is
- AAA then p must be true to attempt r1, but if LA(1) is *not*
- AAA then it is okay to attempt r1, provided that *if* LA(1) is
- BBB then one of q or r must be true.
- if ( LA(1)==Q &&
- ( !(LA(1)==AAA || LA(1)==BBB) ||
- ( !(LA(1)==AAA) || p ) &&
- ( !(LA(1)==BBB) || q || r ) ) ) {
- I believe this is fixed in MR12.
- Reported by Reinier van den Born (reinier@vnet.ibm.com)
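- The two shapes of the test in Item #152 can be compared directly. The
- sketch below models only the predicate part of the expression (the
- leading first-set test is left out), and the enum and function names
- are hypothetical.

```cpp
#include <cassert>

enum SketchToken { AAA, BBB, Q };  // hypothetical token numbers

// Pre-MR12 shape: both contexts are ANDed together, so the test can
// never succeed because LA(1) cannot equal AAA and BBB at once.
inline bool predict_r1_old(SketchToken la1, bool p, bool q, bool r) {
    return (la1 == AAA && la1 == BBB) && (p && (q || r));
}

// MR12 shape, following rule (a): a predicate only matters when its own
// context holds; outside every context the alternative may be entered.
inline bool predict_r1_new(SketchToken la1, bool p, bool q, bool r) {
    return !(la1 == AAA || la1 == BBB) ||
           ((!(la1 == AAA) || p) && (!(la1 == BBB) || q || r));
}
```

- With LA(1)==AAA the new test depends only on p, with LA(1)==BBB only on
- q or r, and with any other token it succeeds without evaluating a
- predicate.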
- #151a. (Changed in MR12) ANTLRParser::getLexer()
- As a result of several requests, I have added public methods to
- get a pointer to the lexer belonging to a parser.
- ANTLRTokenStream *ANTLRParser::getLexer() const
- Returns a pointer to the lexer being used by the
- parser. ANTLRTokenStream is the base class of
- DLGLexer
- ANTLRTokenStream *ANTLRTokenBuffer::getLexer() const
- Returns a pointer to the lexer being used by the
- ANTLRTokenBuffer. ANTLRTokenStream is the base
- class of DLGLexer
- You must manually cast the ANTLRTokenStream to your program's
- lexer class because the name of the lexer's class is not fixed.
- For the same reason it is impossible to incorporate it into the
- DLGLexerBase class.
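- The manual cast might be written as in the sketch below. All class
- names here are hypothetical miniatures of the relationships described
- in Item #151a, and the use of dynamic_cast rather than static_cast is a
- choice made for this example.

```cpp
#include <cassert>

// Hypothetical base class playing the role of ANTLRTokenStream.
struct SketchTokenStream {
    virtual ~SketchTokenStream() {}
};

// Hypothetical user lexer; its name is chosen by the user, which is why
// the library cannot return this type directly.
struct SketchLexer : SketchTokenStream {
    int line() const { return 1; }
};

// Hypothetical parser exposing the lexer through the base class only.
class SketchParser {
public:
    explicit SketchParser(SketchTokenStream* in) : input_(in) {}
    SketchTokenStream* getLexer() const { return input_; }
private:
    SketchTokenStream* input_;
};

// The caller, not the library, knows the concrete lexer type.
inline SketchLexer* lexer_of(const SketchParser& p) {
    return dynamic_cast<SketchLexer*>(p.getLexer());
}
```

- dynamic_cast returns a null pointer if the stream is not of the
- expected lexer class, which gives the caller a chance to check.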
- #151b.(Changed in MR12) ParserBlackBox member getLexer()
- The template class ParserBlackBox now has a member getLexer()
- which returns a pointer to the lexer.
- #150. (Changed in MR12) syntaxErrCount and lexErrCount now public
- See Item #127 for more information.
- #149. (Changed in MR12) antlr option -info o (letter o for orphan)
- If there is more than one rule which is not referenced by any
- other rule then all such rules are listed. This is useful for
- alerting one to rules which are not used, but which can still
- contribute to ambiguity. For example:
- start : a Z ;
- unused: a A ;
- a : (A)+ ;
- will cause an ambiguity report for rule "a" which will be
- difficult to understand if the user forgets about rule "unused"
- simply because it is not used in the grammar.
- #148. (Changed in MR11) #token names appearing in zztokens,token_tbl
- In a #token statement like the following:
- #token Plus "\+"
- the string "Plus" appears in the zztokens array (C mode) and
- token_tbl (C++ mode). This string is used in most error
- messages. In MR11 one has the option of using some other string,
- (e.g. "+") in those tables.
- In MR11 one can write:
- #token Plus ("+") "\+"
- #token RP ("(") "\("
- #token COM ("comment begin") "/\*"
- A #token statement is allowed to appear in more than one #lexclass
- with different regular expressions. However, the token name appears
- only once in the zztokens/token_tbl array. This means that only
- one substitute can be specified for a given #token name. The second
- attempt to define a substitute name (different from the first) will
- result in an error message.
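- The one-substitute-per-token rule of Item #148 can be modelled with a
- small table. PrintNameTable is hypothetical and says nothing about how
- antlr stores zztokens or token_tbl.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical table mapping a #token name to its substitute print name.
class PrintNameTable {
public:
    // Registers a substitute. Returns false when a different substitute
    // was already registered for the same token name.
    bool define(const std::string& token, const std::string& print_name) {
        std::map<std::string, std::string>::iterator it = names_.find(token);
        if (it == names_.end()) {
            names_[token] = print_name;
            return true;
        }
        return it->second == print_name;
    }

    // Falls back to the token name when no substitute was given.
    std::string lookup(const std::string& token) const {
        std::map<std::string, std::string>::const_iterator it =
            names_.find(token);
        return it == names_.end() ? token : it->second;
    }
private:
    std::map<std::string, std::string> names_;
};
```

- Repeating the same substitute in a second #lexclass is accepted in
- this model; only a conflicting one is rejected.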
- #147. (Changed in MR11) Bug in follow set computation
- There is a bug in 1.33 vanilla and all maintenance releases
- prior to MR11 in the computation of the follow set. The bug is
- different than that described in Item #82 and probably more
- common. It was discovered in the ansi.g grammar while testing
- the "ambiguity aid" (Item #119). The search for a bug started
- when the ambiguity aid was unable to discover the actual source
- of an ambiguity reported by antlr.
- The problem appears when an optimization of the follow set
- computation is used inappropriately. The result is that the
- follow set used is the "worst case". In other words, the error
- can lead to false reports of ambiguity. The good news is that
- if you have a grammar in which you have addressed all reported
- ambiguities you are ok. The bad news is that you may have spent
- time fixing ambiguities that were not real, or used k=2 when
- ck=2 might have been sufficient, and so on.
- The following grammar demonstrates the problem:
- ------------------------------------------------------------
- expr : ID ;
- start : stmt SEMI ;
- stmt : CASE expr COLON
- | expr SEMI
- | plain_stmt
- ;
- plain_stmt : ID COLON ;
- ------------------------------------------------------------
- When compiled with k=1 and ck=2 it will report:
- warning: alts 2 and 3 of the rule itself ambiguous upon
- { ID }, { COLON }
- When antlr analyzes "stmt" it computes the first[1] set of all
- alternatives. It finds an ambiguity between alts 2 and 3 for ID.
- It then computes the first[2] set for alternatives 2 and 3 to resolve
- the ambiguity. In computing the first[2] set of "expr" (which is
- only one token long) it needs to determine what could follow "expr".
- Under a certain combination of circumstances antlr forgets that it
- is trying to analyze "stmt" which can only be followed by SEMI and
- adds to the first[2] set of "expr" the "global" follow set (including
- "COLON") which could follow "expr" (under other conditions) in the
- phrase "CASE expr COLON".
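- The effect of using the "global" follow set can be reproduced with
- plain sets of token sequences. This is a sketch only: the helper
- names and the encoding of a two-token sequence as a string are
- assumptions and have nothing to do with antlr's internal tree nodes.

```cpp
#include <set>
#include <string>

using TokenSet = std::set<std::string>;

// expr derives exactly one token (ID), so its two-token lookahead is
// "ID" followed by whatever may come after expr.
TokenSet first2OfExpr(const TokenSet& followOfExpr) {
    TokenSet result;
    for (const auto& t : followOfExpr) result.insert("ID " + t);
    return result;
}

// Sequences common to both sets indicate a (possibly false) ambiguity.
TokenSet intersect(const TokenSet& a, const TokenSet& b) {
    TokenSet out;
    for (const auto& s : a)
        if (b.count(s)) out.insert(s);
    return out;
}

// Alt 3 of stmt is plain_stmt, i.e. exactly "ID COLON".
const TokenSet kAlt3 = {"ID COLON"};

// Within alt 2 of stmt ("expr SEMI") only SEMI can follow expr.
const TokenSet kLocalFollow = {"SEMI"};

// Over the whole grammar expr is followed by COLON (alt 1) or SEMI (alt 2).
const TokenSet kGlobalFollow = {"COLON", "SEMI"};
```

- With kLocalFollow the intersection with alt 3 is empty; with
- kGlobalFollow it contains "ID COLON", which corresponds to the
- false ambiguity report.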
- #146. (Changed in MR11) Option -treport for locating "difficult" alts
- It can be difficult to determine which alternatives are causing
- pccts to work hard to resolve an ambiguity. In some cases the
- ambiguity is successfully resolved after much CPU time so there
- is no message at all.
- A rough measure of the amount of work being performed which is
- independent of the CPU speed and system load is the number of
- tnodes created. Using "-info t" gives information about the
- total number of tnodes created and the peak number of tnodes.
- Tree Nodes: peak 1300k created 1416k lost 0
- It also puts in the generated C or C++ file the number of tnodes
- created for a rule (at the end of the rule). However this
- information is not sufficient to locate the alternatives within
- a rule which are causing the creation of tnodes.
- Using:
- antlr -treport 100000 ....
- causes antlr to list on stdout any alternatives which require the
- creation of more than 100,000 tnodes, along with the lookahead sets
- for those alternatives.
- The following is a trivial case from the ansi.g grammar which shows
- the format of the report. This report might be of more interest
- in cases where 1,000,000 tnodes were created to resolve the ambiguity.
- -------------------------------------------------------------------------
- There were 0 tuples whose ambiguity could not be resolved
- by full lookahead
- There were 157 tnodes created to resolve ambiguity between:
- Choice 1: statement/2 line 475 file ansi.g
- Choice 2: statement/3 line 476 file ansi.g
- Intersection of lookahead[1] sets:
- IDENTIFIER
- Intersection of lookahead[2] sets:
- LPARENTHESIS COLON AMPERSAND MINUS
- STAR PLUSPLUS MINUSMINUS ONESCOMPLEMENT
- NOT SIZEOF OCTALINT DECIMALINT
- HEXADECIMALINT FLOATONE FLOATTWO IDENTIFIER
- STRING CHARACTER
- -------------------------------------------------------------------------
- #145. (Documentation) Generation of Expression Trees
- Item #99 was misleading because it implied that the optimization
- for tree expressions was available only for trees created by
- predicate expressions and neglected to mention that it required
- the use of "-mrhoist on". The optimization applies to tree
- expressions created for grammars with k>1 and for predicates with
- lookahead depth >1.
- In MR11 the optimized version is always used so the -mrhoist on
- option need not be specified.
- #144. (Changed in MR11) Incorrect test for exception group
- In testing for a rule's exception group, the label pointer itself
- was compared against '\0'. The intention was to test "*pointer".
- Reported by Jeffrey C. Fried (Jeff@Fried.net).
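- The difference between testing the pointer and testing what it
- points to can be shown in a few lines. This sketch assumes that the
- character involved is the string terminator '\0'; the function
- names are hypothetical and the code is not taken from antlr.

```cpp
// What the old test effectively asked: "is the pointer non-null?"
bool testsPointer(const char* label) {
    return label != nullptr;
}

// What was intended: "does the label contain at least one character?"
bool testsFirstChar(const char* label) {
    return label != nullptr && *label != '\0';
}
```

- For an empty label string the two tests disagree: the first
- reports a label, the second does not.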
- #143. (Changed in MR11) Optional ";" at end of #token statement
- Fixes problem of:
- #token X "x"
- <<
- parser action
- >>
- Being confused with:
- #token X "x" <<lexical action>>
- #142. (Changed in MR11) class BufFileInput subclass of DLGInputStream
- Alexey Demakov (demakov@kazbek.ispras.ru) has supplied class
- BufFileInput derived from DLGInputStream which provides a
- function lookahead(char *string) to test characters in the
- input stream more than one character ahead.
- The default amount of lookahead is specified by the constructor
- and defaults to 8 characters. This does *not* include the one
- character of lookahead maintained internally by DLG in member "ch"
- and which is not available for testing via BufFileInput::lookahead().
- This is a useful class for overcoming the one-character-lookahead
- limitation of DLG without resorting to a lexer capable of
- backtracking (like flex) which is not integrated with antlr as is
- DLG.
- There are no restrictions on copying or using BufFileInput.* except
- that the authorship and related information must be retained in the
- source code.
- The class is located in pccts/h/BufFileInput.* of the kit.
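- The technique of multi-character lookahead over a buffered stream
- can be sketched as follows. The class LookaheadInput is a
- hypothetical stand-in written for this note; it is NOT the
- interface of BufFileInput, only the default capacity of 8 is taken
- from the description above.

```cpp
#include <cstddef>
#include <deque>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical stand-in; not the real BufFileInput interface.
class LookaheadInput {
public:
    explicit LookaheadInput(std::istream& in, std::size_t capacity = 8)
        : in_(in), capacity_(capacity) {}

    // Return the next character, or -1 at end of input.
    int nextChar() {
        fill(1);
        if (buf_.empty()) return -1;
        int c = static_cast<unsigned char>(buf_.front());
        buf_.pop_front();
        return c;
    }

    // True if the upcoming characters spell `s`; nothing is consumed.
    // Strings longer than the buffer capacity can never match.
    bool lookahead(const std::string& s) {
        if (s.size() > capacity_) return false;
        fill(s.size());
        if (buf_.size() < s.size()) return false;
        for (std::size_t i = 0; i < s.size(); ++i)
            if (buf_[i] != s[i]) return false;
        return true;
    }

private:
    // Read from the stream until at least n characters are buffered.
    void fill(std::size_t n) {
        char c;
        while (buf_.size() < n && in_.get(c)) buf_.push_back(c);
    }

    std::istream& in_;
    std::size_t capacity_;
    std::deque<char> buf_;
};
```

- A lexer action could use such a test to tell "/*" from "//" before
- committing to a token, which is the kind of decision one character
- of lookahead cannot make.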
- #141. (Changed in MR11) ZZDEBUG_CONSUME for ANTLRParser::consume()
- A debug aid has been added to ANTLRParser::consume() in
- file AParser.cpp:
- #ifdef ZZDEBUG_CONSUME_ACTION
- zzdebug_consume_action();
- #endif
- Suggested by Sramji Ramanathan (ps@kumaran.com).
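- A user-supplied hook might look like the sketch below. It uses the
- macro name that appears in the code fragment
- (ZZDEBUG_CONSUME_ACTION). The function consumeStandIn and the
- counter are hypothetical; the real consume() is not reproduced.

```cpp
#include <cstdio>

// Normally supplied with -D on the compiler command line.
#define ZZDEBUG_CONSUME_ACTION

static int consumeCalls = 0;

// User-supplied hook; the name matches the call shown above.
void zzdebug_consume_action() {
    ++consumeCalls;
    std::fprintf(stderr, "consume() call #%d\n", consumeCalls);
}

// Hypothetical stand-in for ANTLRParser::consume(), reduced to the hook.
void consumeStandIn() {
#ifdef ZZDEBUG_CONSUME_ACTION
    zzdebug_consume_action();
#endif
    // ... the real routine would advance the lookahead buffer here ...
}
```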
- #140. (Changed in MR11) #pred to define predicates
- +---------------------------------------------------+
- | Note: Assume "-prc on" for this entire discussion |
- +---------------------------------------------------+
- A problem with predicates is that each one is regarded as
- unique and capable of disambiguating cases where two
- alternatives have identical lookahead. For example:
- rule : <<pred(LATEXT(1))>>? A
- | <<pred(LATEXT(1))>>? A
- ;
- will not cause any error messages or warnings to be issued
- by earlier versions of pccts. To compare the text of the
- predicates is an incomplete solution.
- In 1.33MR11 I am introducing the #pred statement in order to
- solve some problems with predicates. The #pred statement allows
- one to give a symbolic name to a "predicate literal" or a
- "predicate expression" in order to refer to it in other predicate
- expressions or in the rules of the grammar.
- The predicate literal associated with a predicate symbol is C
- or C++ code which can be used to test the condition. A
- predicate expression defines a predicate symbol in terms of other
- predicate symbols using "!", "&&", and "||". A predicate symbol
- can be defined in terms of a predicate literal, a predicate
- expression, or *both*.
- When a predicate symbol is defined with both a predicate literal
- and a predicate expression, the predicate literal is used to generate
- code, but the predicate expression is used to check for two
- alternatives with identical predicates in both alternatives.
- Here are some examples of #pred statements:
- #pred IsLabel <<isLabel(LATEXT(1))>>?
- #pred IsLocalVar <<isLocalVar(LATEXT(1))>>?
- #pred IsGlobalVar <<isGlobalVar(LATEXT(1))>>?
- #pred IsVar <<isVar(LATEXT(1))>>? IsLocalVar || IsGlobalVar
- #pred IsScoped <<isScoped(LATEXT(1))>>? IsLabel || IsLocalVar
- I hope that the use of EBNF notation to describe the syntax of the
- #pred statement will not cause problems for my readers (joke).
- predStatement : "#pred"
- CapitalizedName
- (
- "<<predicate_literal>>?"
- | "<<predicate_literal>>?" predOrExpr
- | predOrExpr
- )
- ;
- predOrExpr : predAndExpr ( "||" predAndExpr ) * ;
- predAndExpr : predPrimary ( "&&" predPrimary ) * ;
- predPrimary : CapitalizedName
- | "!" predPrimary
- | "(" predOrExpr ")"
- ;
- What is the purpose of this nonsense ?
- To understand how predicate symbols help, you need to realize that
- predicate symbols are used in two different ways with two different
- goals.
- a. Allow simplification of predicates which have been combined
- during predicate hoisting.
- b. Allow recognition of identical predicates which can't disambiguate
- alternatives with common lookahead.
- First we will discuss goal (a). Consider the following rule:
- rule0: rule1
- | ID
- | ...
- ;
- rule1: rule2
- | rule3
- ;
- rule2: <<isX(LATEXT(1))>>? ID ;
- rule3: <<!isX(LATEXT(1))>>? ID ;
- When the predicates in rule2 and rule3 are combined by hoisting
- to create a prediction expression for rule1 the result is:
- if ( LA(1)==ID
- && ( isX(LATEXT(1)) || !isX(LATEXT(1)) ) ) { rule1(); ...
- This is inefficient, but more importantly, can lead to false
- assumptions that the predicate expression distinguishes the rule1
- alternative from some other alternative with lookahead ID. In
- MR11 one can write:
- #pred IsX <<isX(LATEXT(1))>>?
- ...
- rule2: <<IsX>>? ID ;
- rule3: <<!IsX>>? ID ;
- During hoisting MR11 recognizes this as a special case and
- eliminates the predicates. The result is a prediction
- expression like the following:
- if ( LA(1)==ID ) { rule1(); ...
- Please note that the following cases which appear to be equivalent
- *cannot* be simplified by MR11 during hoisting because the hoisting
- logic only checks for a "!" in the predicate action, not in the
- predicate expression for a predicate symbol.
- *Not* equivalent and is not simplified during hoisting:
- #pred IsX <<isX(LATEXT(1))>>?
- #pred NotX <<!isX(LATEXT(1))>>?
- ...
- rule2: <<IsX>>? ID ;
- rule3: <<NotX>>? ID ;
- *Not* equivalent and is not simplified during hoisting:
- #pred IsX <<isX(LATEXT(1))>>?
- #pred NotX !IsX
- ...
- rule2: <<IsX>>? ID ;
- rule3: <<NotX>>? ID ;
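- The special case for goal (a) amounts to noticing that an OR of
- hoisted predicates contains both a symbol and its negation. The
- sketch below illustrates that check only; PredRef and
- orIsTautology are hypothetical names and this is not the code used
- by antlr.

```cpp
#include <set>
#include <string>
#include <vector>

// One hoisted predicate reference: the symbol name plus a "!" flag taken
// from the predicate action itself (<<IsX>>? or <<!IsX>>?).
struct PredRef {
    std::string symbol;
    bool negated;
};

// True if the OR of all references is always true, i.e. some symbol
// occurs both plain and negated.
bool orIsTautology(const std::vector<PredRef>& terms) {
    std::set<std::string> plain, negated;
    for (const auto& t : terms)
        (t.negated ? negated : plain).insert(t.symbol);
    for (const auto& s : plain)
        if (negated.count(s)) return true;
    return false;
}
```

- In this model IsX combined with !IsX is recognized, while IsX
- combined with a separate symbol NotX is not, because the check
- looks only at the "!" in the action and never at the definition
- of NotX.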
- Now we will discuss goal (b).
- When antlr discovers that there is a lookahead ambiguity between
- two alternatives it attempts to resolve the ambiguity by searching
- for predicates in both alternatives. In the past any predicate
- would do, even if the same one appeared in both alternatives:
- rule: <<p(LATEXT(1))>>? X
- | <<p(LATEXT(1))>>? X
- ;
- The #pred statement is a start towards solving this problem.
- During ambiguity resolution (*not* predicate hoisting) the
- predicates for the two alternatives are expanded and compared.
- Consider the following example:
- #pred Upper <<isUpper(LATEXT(1))>>?
- #pred Lower <<isLower(LATEXT(1))>>?
- #pred Alpha <<isAlpha(LATEXT(1))>>? Upper || Lower
- rule0: rule1
- | <<Alpha>>? ID
- ;
- rule1: rule2
- | rule3
- ...
- ;
- rule2: <<Upper>>? ID;
- rule3: <<Lower>>? ID;
- The definition of #pred Alpha expresses:
- a. to test the predicate use the C code "isAlpha(LATEXT(1))"
- b. to analyze the predicate use the information that
- Alpha is equivalent to the union of Upper and Lower.
- During ambiguity resolution the definition of Alpha is expanded
- into "Upper || Lower" and compared with the predicate in the other
- alternative, which is also "Upper || Lower". Because they are
- identical MR11 will report a problem.
- -------------------------------------------------------------------------
- t10.g, line 5: warning: the predicates used to disambiguate rule rule0
- (file t10.g alt 1 line 5 and alt 2 line 6)
- are identical when compared without context and may have no
- resolving power for some lookahead sequences.
- -------------------------------------------------------------------------
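- The expansion and comparison step for goal (b) can be sketched with
- sets of predicate symbols. The sketch handles only "||", assumes
- the definitions contain no cycles, and uses hypothetical names
- (PredDefs, expand, expandOr); it is not antlr's implementation.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical table: symbol -> symbols OR-ed together in its predicate
// expression. A symbol with no entry stands for itself.
using PredDefs = std::map<std::string, std::vector<std::string>>;

// Expand a symbol into the set of leaf symbols it is the union of.
// Assumes the definitions contain no cycles.
std::set<std::string> expand(const PredDefs& defs, const std::string& sym) {
    PredDefs::const_iterator it = defs.find(sym);
    std::set<std::string> leaves;
    if (it == defs.end()) {
        leaves.insert(sym);
        return leaves;
    }
    for (const auto& part : it->second) {
        std::set<std::string> sub = expand(defs, part);
        leaves.insert(sub.begin(), sub.end());
    }
    return leaves;
}

// Expand an OR of several symbols (e.g. predicates hoisted from sub-rules).
std::set<std::string> expandOr(const PredDefs& defs,
                               const std::vector<std::string>& syms) {
    std::set<std::string> leaves;
    for (const auto& s : syms) {
        std::set<std::string> sub = expand(defs, s);
        leaves.insert(sub.begin(), sub.end());
    }
    return leaves;
}
```

- With Alpha defined as Upper || Lower, expanding {Upper, Lower} for
- alt 1 and {Alpha} for alt 2 gives the same set, which is the
- situation the warning describes.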
- If you use the "-info p" option the output file will contain:
- +----------------------------------------------------------------------+
- |#if 0 |
- | |
- |The following predicates are identical when compared without |
- | lookahead context information. For some ambiguous lookahead |
- | sequences they may not have any power to resolve the ambiguity. |
- | |
- |Choice 1: rule0/1 alt 1 line 5 file t10.g |
- | |
- | The original predicate for choice 1 with available context |
- | information: |
- | |
- | OR expr |
- | |
- | pred << Upper>>? |
- | depth=k=1 rule rule2 line 14 t10.g |
- | set context: |
- | ID |
- | |
- | pred << Lower>>? |
- | depth=k=1 rule rule3 line 15 t10.g |
- | set context: |
- | ID |
- | |
- | The predicate for choice 1 after expansion (but without context |
- | information): |
- | |
- | OR expr |
- | |
- | pred << isUpper(LATEXT(1))>>? |
- | depth=k=1 rule line 1 t10.g |
- | |
- | pred << isLower(LATEXT(1))>>? |
- | depth=k=1 rule line 2 t10.g |
- | |
- | |
- |Choice 2: rule0/2 alt 2 line 6 file t10.g |
- | |
- | The original predicate for choice 2 with available context |
- | information: |
- | |
- | pred << Alpha>>? |
- | depth=k=1 rule rule0 line 6 t10.g |
- | set context: |
- | ID |
- | |
- | The predicate for choice 2 after expansion (but without context |
- | information): |
- | |
- | OR expr |
- | |
- | pred << isUpper(LATEXT(1))>>? |
- | depth=k=1 rule line 1 t10.g |
- | |
- | pred << isLower(LATEXT(1))>>? |
- | depth=k=1 rule line 2 t10.g |
- | |
- | |
- |#endif |
- +----------------------------------------------------------------------+
- The comparison of the predicates for the two alternatives takes
- place without context information, which means that in some cases
- the predicates will be considered identical even though they operate
- on disjoint lookahead sets. Consider:
- #pred Alpha
- rule1: <<Alpha>>? ID
- | <<Alpha>>? Label
- ;
- Because the comparison of predicates takes place without context
- these will be considered identical. The reason for comparing
- without context is that otherwise it would be necessary to re-evaluate
- the entire predicate expression for each possible lookahead sequence.
- This would require more code to be written and more CPU time during
- grammar analysis, and it is not yet clear whether anyone will even make
- use of the new #pred facility.
- A temporary workaround might be to use different #pred statements
- for predicates you know have different context. This would avoid
- extraneous warnings.
- The above example might be termed a "false positive". Comparison
- without context will also lead to "false negatives". Consider the
- following example:
- #pred Alpha
- #pred Beta
- rule1: <<Alpha>>? A
- | rule2
- ;
- rule2: <<Alpha>>? A
- | <<Beta>>? B
- ;
- The predicate used for alt 2 of rule1 is (Alpha || Beta). This
- appears to be different than the predicate Alpha used for alt1.
- However, the context of Beta is B. Thus when the lookahead is A
- Beta will have no resolving power and Alpha will be used for both
- alternatives. Using the same predicate for both alternatives isn't
- very helpful, but this will not be detected with 1.33MR11.
- To properly handle this the predicate expression would have to be
- evaluated for each distinct lookahead context.
- To determine whether two predicate expressions are identical is
- difficult. The routine may fail to identify identical predicates.
- The #pred feature also compares predicates to see whether a choice
- between alternatives is resolved by a predicate which makes the
- second choice unreachable. Consider the following example:
- #pred A <<A(LATEXT(1))>>?
- #pred B <<B(LATEXT(1))>>?
- #pred A_or_B A || B
- r : s
- | t
- ;
- s : <<A_or_B>>? ID
- ;
- t : <<A>>? ID
- ;
- ----------------------------------------------------------------------------
- t11.g, line 5: warning: the predicate used to disambiguate the
- first choice of rule r
- (file t11.g alt 1 line 5 and alt 2 line 6)
- appears to "cover" the second predicate when compared without context.
- The second predicate may have no resolving power for some lookahead
- sequences.
- ----------------------------------------------------------------------------
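- Once predicates are expanded into sets of OR-ed symbols, the
- "cover" test is a subset test. The following is a sketch under that
- assumption; LeafSet and covers are hypothetical names and context
- is ignored, as in the warning text.

```cpp
#include <algorithm>
#include <set>
#include <string>

using LeafSet = std::set<std::string>;

// With predicates expanded into sets of OR-ed leaves (no context),
// `first` covers `second` when every leaf of `second` is also in `first`.
bool covers(const LeafSet& first, const LeafSet& second) {
    return std::includes(first.begin(), first.end(),
                         second.begin(), second.end());
}
```

- For the example, the set for A_or_B is {A, B} and the set for A is
- {A}, so the first predicate covers the second but not the other way
- around.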
- #139. (Changed in MR11) Problem with -gp in C++ mode
- The -gp option to add a prefix to rule names did not work in
- C++ mode. This has been fixed.
- Reported by Alexey Demakov (demakov@kazbek.ispras.ru).
- #138. (Changed in MR11) Additional makefiles for non-MSVC++ MS systems
- Sramji Ramanathan (ps@kumaran.com) has supplied makefiles for
- building antlr and dlg with Win95/NT development tools that
- are not based on MSVC5. They are pccts/antlr/AntlrMS.mak and
- pccts/dlg/DlgMS.mak.
- The first line of each makefile requires a definition of PCCTS_HOME.
- These are in addition to the AntlrMSVC50.* and DlgMSVC50.*
- supplied by Jeff Vincent (JVincent@novell.com).
- #137. (Changed in MR11) Token getType(), getText(), getLine() const members
- --------------------------------------------------------------------
- If you use ANTLRCommonToken this change probably does not affect you.
- --------------------------------------------------------------------
- For a long time it has bothered me that these accessor functions
- in ANTLRAbstractToken were not const member functions. I have
- refrained from changing them because it would require users to modify
- existing token class definitions which are derived directly
- from ANTLRAbstractToken. I think it is now time.
- For those who are not used to C++, a "const member function" is a
- member function which does not modify its own object - the thing
- to which "this" points. This is quite different from a function
- which does not modify its arguments.
- Most token definitions based on ANTLRAbstractToken have something like
- the following in order to create concrete definitions of the pure
- virtual methods in ANTLRAbstractToken:
- class MyToken : public ANTLRAbstractToken {
- ...
- ANTLRTokenType getType() {return _type; }
- int getLine() {return _line; }
- ANTLRChar * getText() {return _text; }
- ...
- };
- The required change is simply to put "const" following the function
- prototype in the header (.h file) and the definition file (.cpp if
- it is not inline):
- class MyToken : public ANTLRAbstractToken {
- ...
- ANTLRTokenType getType() const {return _type; }
- int getLine() const {return _line; }
- ANTLRChar * getText() const {return _text; }
- ...
- };
- This was originally proposed a long time ago by Bruce
- Guenter (bruceg@qcc.sk.ca).
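- The reason the "const" has to be added to derived classes is that it
- is part of the function signature. The sketch below shows this with
- a hypothetical base class, AbstractTokenStandIn, which only imitates
- the shape of ANTLRAbstractToken and uses simplified return types.

```cpp
#include <string>

// Hypothetical stand-in for an abstract token base; not the PCCTS class.
class AbstractTokenStandIn {
public:
    virtual ~AbstractTokenStandIn() {}
    virtual int         getType() const = 0;
    virtual int         getLine() const = 0;
    virtual const char* getText() const = 0;
};

class MyToken : public AbstractTokenStandIn {
public:
    MyToken(int type, int line, const std::string& text)
        : type_(type), line_(line), text_(text) {}

    // "const" is part of the signature: without it these would declare new,
    // unrelated functions and MyToken would remain abstract.
    int         getType() const { return type_; }
    int         getLine() const { return line_; }
    const char* getText() const { return text_.c_str(); }

private:
    int type_;
    int line_;
    std::string text_;
};

// Accessors can now be called through a reference to a const token.
int lineOf(const AbstractTokenStandIn& t) { return t.getLine(); }
```

- A derived class that omits "const" fails to compile as soon as an
- object of it is created, which makes the required change easy to
- find.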
- #136. (Changed in MR11) Added getLength() to ANTLRCommonToken
- Classes ANTLRCommonToken and ANTLRCommonNoRefCountToken
- now have a member function:
- int getLength() const { return strlen(getText()); }
- Suggested by Sramji Ramanathan (ps@kumaran.com).
- #135. (Changed in MR11) Raised antlr's own default ZZLEXBUFSIZE to 8k
- #134a. (ansi_mr10.zip) T.J. Parr's ANSI C grammar made 1.33MR11 compatible
- There is a typographical error in the definition of BITWISEOREQ:
- #token BITWISEOREQ "!=" should be "|="
- When this change is combined with the bugfix to the follow set cache
- problem (Item #147) and a minor rearrangement of the grammar
- (Item #134b) it becomes a k=1 ck=2 grammar.
- #134b. (ansi_mr10.zip) T.J. Parr's ANSI C grammar made 1.33MR11 compatible
- The following changes were made in the ansi.g grammar (along with
- using -mrhoist on):
- ansi.g
- ======
- void tracein(char *) ====> void tracein(const char *)
- void traceout(char *) ====> void traceout(const char *)
- <<LT(1)->getType()==IDENTIFIER ? isTypeName(LT(1)->getText()) : 1>>?
- ====> <<isTypeName(LT(1)->getText())>>?
- <<(LT(1)->getType()==LPARENTHESIS && LT(2)->getType()==IDENTIFIER) ?
- isTypeName(LT(2)->getText()) : 1>>?
- ====> (LPARENTHESIS IDENTIFIER)? => <<isTypeName(LT(2)->getText())>>?
- <<(LT(1)->getType()==LPARENTHESIS && LT(2)->getType()==IDENTIFIER) ?
- isTypeName(LT(2)->getText()) : 1>>?
- ====> (LPARENTHESIS IDENTIFIER)? => <<isTypeName(LT(2)->getText())>>?
- added to init(): traceOptionValueDefault=0;
- added to init(): traceOption(-1);
- change rule "statement":
- statement
- : plain_label_statement
- | case_label_statement
- | <<;>> expression SEMICOLON
- | compound_statement
- | selection_statement
- | iteration_statement
- | jump_statement
- | SEMICOLON
- ;
- plain_label_statement
- : IDENTIFIER COLON statement
- ;
- case_label_statement
- : CASE constant_expression COLON statement
- | DEFAULT COLON statement
- ;
- support.cpp
- ===========
- void tracein(char *) ====> void tracein(const char *)
- void traceout(char *) ====> void traceout(const char *)
- added to tracein(): ANTLRParser::tracein(r); // call superclass method
- added to traceout(): ANTLRParser::traceout(r); // call superclass method
- Makefile
- ========
- added to AFLAGS: -mrhoist on -prc on
- #133. (Changed in 1.33MR11) Make trace options public in ANTLRParser
- In checking T.J. Parr's ANSI C grammar for compatibility with
- 1.33MR11 it was discovered that it was inconvenient to have the
- trace facilities with protected access. They are now public.
- #132. (Changed in 1.33MR11) Recognition of identical predicates in alts
- Prior to 1.33MR11, there would be no ambiguity warning when the
- very same predicate was used to disambiguate both alternatives:
- test: ref B
- | ref C
- ;
- ref : <<pred(LATEXT(1))>>? A ;
- In 1.33MR11 this will cause the warning:
- warning: the predicates used to disambiguate rule test
- (file v98.g alt 1 line 1 and alt 2 line 2)
- are identical and have no resolving power
- ----------------- Note -----------------
- This is different than the following case
- test: <<pred(LATEXT(1))>>? A B
- | <<pred(LATEXT(1))>>? A C
- ;
- In this case there are two distinct predicates
- which have exactly the same text. In the first
- example there are two references to the same
- predicate. The problem represented by this
- grammar will be addressed later.
- #131. (Changed in 1.33MR11) Case insensitive command line options
- Command line switches like "-CC" and keywords like "on", "off",
- and "stdin" are no longer case sensitive in antlr, dlg, and sorcerer.
- #130. (Changed in 1.33MR11) Changed ANTLR_VERSION to int from string
- The ANTLR_VERSION was not an integer, making it difficult to
- perform conditional compilation based on the antlr version.
- Henceforth, ANTLR_VERSION will be:
- (base_version * 100) + release number
- thus 1.33MR11 will be: 133*100+11 = 13311
- Suggested by Rainer Janssen (Rainer.Janssen@Informatik.Uni-Oldenburg.DE).
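- An integer version makes tests with #if possible. The sketch below
- shows the arithmetic for 1.33MR11 and a conditional section. The
- helper makeAntlrVersion and the constant kHasMR11 are hypothetical,
- and the fallback #define is there only so that the fragment
- compiles on its own.

```cpp
// Fallback so the sketch compiles on its own (an assumption of this
// example, not something the generated code depends on).
#ifndef ANTLR_VERSION
#define ANTLR_VERSION 13311
#endif

// Hypothetical helper mirroring the arithmetic: base 133, release 11.
constexpr int makeAntlrVersion(int baseVersion, int release) {
    return baseVersion * 100 + release;
}

#if ANTLR_VERSION >= 13311
// Code that relies on MR11 features would go here.
constexpr bool kHasMR11 = true;
#else
constexpr bool kHasMR11 = false;
#endif
```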
- #129. (Changed in 1.33MR11) Addition of ANTLR_VERSION to <parserName>.h
- The following code is now inserted into <parserName>.h and
- stdpccts.h:
- #ifndef ANTLR_VERSION
- #define ANTLR_VERSION 13311
- #endif
- Suggested by Rainer Janssen (Rainer.Janssen@Informatik.Uni-Oldenburg.DE)
- #128. (Changed in 1.33MR11) Redundant predicate code in (<<pred>>? ...)+
- Prior to 1.33MR11, the following grammar would generate
- redundant tests for the "while" condition.
- rule2 : (<<pred>>? X)+ X
- | B
- ;
- The code would resemble:
- if (LA(1)==X) {
- if (pred) {
- do {
- if (!pred) {zzfailed_pred(" pred");}
- zzmatch(X); zzCONSUME;
- } while (LA(1)==X && pred && pred);
- } else {...
- With 1.33MR11 the redundant predicate test is omitted.
- #127. (Changed in 1.33MR11)
-                 Count Syntax Errors    Count DLG Errors
-                 -------------------    ----------------
- C++ mode        ANTLRParser::          DLGLexerBase::
-                   syntaxErrCount         lexErrCount
- C mode          zzSyntaxErrCount       zzLexErrCount
- The C mode variables are global and initialized to 0.
- They are *not* reset to 0 automatically when antlr is
- restarted.
- The C++ mode variables are public. They are initialized
- to 0 by the constructors. They are *not* reset to 0 by the
- ANTLRParser::init() method.
- Suggested by Reinier van den Born (reinier@vnet.ibm.com).
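- Because the counters are not reset by init(), code that parses
- several inputs with one parser object has to reset them itself or
- work with differences. The following sketch uses a hypothetical
- class, ParserStandIn, which models only the counter and is not
- ANTLRParser.

```cpp
// Hypothetical stand-in for a parser class; only the counter is modelled.
class ParserStandIn {
public:
    int syntaxErrCount;                      // public, as described above

    ParserStandIn() : syntaxErrCount(0) {}   // the constructor starts at 0

    void init() { /* resets parser state, but not syntaxErrCount */ }

    void reportSyntaxError() { ++syntaxErrCount; }
};

// Errors attributable to one run, given the count saved before that run.
int errorsSince(const ParserStandIn& p, int savedCount) {
    return p.syntaxErrCount - savedCount;
}
```

- Since the member is public, assigning 0 to it before each run is
- the other obvious choice.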
- #126. (Changed in 1.33MR11) Addition of #first <<...>>
- The #first <<...>> inserts the specified text in the output
- files before any other #include statements required by pccts.
- The only things before the #first text are comments and
- a #define ANTLR_VERSION.
- Requested by Esa Pulkkinen (esap@cs.tut.fi) and Alexin
- Zoltan (alexin@inf.u-szeged.hu).
- #125. (Changed in 1.33MR11) Lookahead for (guard)? && <<p>>? predicates
- When implementing the new style of guard predicate (Item #113)
- in 1.33MR10 I decided to temporarily ignore the problem of
- computing the "narrowest" lookahead context.
- Consider the following k=1 grammar:
- start : a
- | b
- ;
- a : (A)? && <<pred1(LATEXT(1))>>? ab ;
- b : (B)? && <<pred2(LATEXT(1))>>? ab ;
- ab : A | B ;
- In MR10 the context for both "a" and "b" was {A B} because this is
- the first set of rule "ab". Normally, this is not a problem because
- the predicate which follows the guard inhibits any ambiguity report
- by antlr.
- In MR11 the first set for rule "a" is {A} and for rule "b" it is {B}.
- #124. A Note on the New "&&" Style Guarded Predicates
- I've been asked several times, "What is the difference between
- the old "=>" style guard predicates and the new style "&&" guard
- predicates, and how do you choose one over the other" ?
- The main difference is that the "=>" does not apply the
- predicate if the context guard doesn't match, whereas
- the && form always does. What is the significance ?
- If you have a predicate which is not on the "leading edge"
- it cannot be hoisted. Suppose you need a predicate that
- looks at LA(2). You must introduce it manually. The
- classic example is:
- castExpr :
- LP typeName RP
- | ....
- ;
- typeName : <<isTypeName(LATEXT(1))>>? ID
- | STRUCT ID
- ;
- The problem is that isTypeName() isn't on the leading edge
- of typeName, so it won't be hoisted into castExpr to help
- make a decision on which production to choose.
- The *first* attempt to fix it is this:
- castExpr :
- <<isTypeName(LATEXT(2))>>?
- LP typeName RP
- | ....
- ;
- Unfortunately, this won't work because it ignores
- the problem of STRUCT. The solution is to apply
- isTypeName() in castExpr if LA(2) is an ID and
- don't apply it when LA(2) is STRUCT:
- castExpr :
- (LP ID)? => <<isTypeName(LATEXT(2))>>?
- LP typeName RP
- | ....
- ;
- In conclusion, the "=>" style guarded predicate is
- useful when:
- a. the tokens required for the predicate
- are not on the leading edge
- b. there are alternatives in the expression
- selected by the predicate for which the
- predicate is inappropriate
- If (b) were false, then one could use a simple
- predicate (assuming "-prc on"):
- castExpr :
- <<isTypeName(LATEXT(2))>>?
- LP typeName RP
- | ....
- ;
- typeName : <<isTypeName(LATEXT(1))>>? ID
- ;
- So, when do you use the "&&" style guarded predicate ?
- The new-style "&&" predicate should always be used with
- predicate context. The context guard is in ADDITION to
- the automatically computed context. Thus it is useful for
- predicates which depend on the token type for reasons
- other than context.
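- The difference between the two styles can be reduced to two boolean
- expressions. This is my reading of the description above; it leaves
- out the ordinary lookahead test that antlr generates in front of
- both forms, and the function names are hypothetical.

```cpp
// guard : does the lookahead match the context guard?
// pred  : result of the semantic predicate

// "(guard)? => <<pred>>?" : the predicate matters only when the guard
// matches; otherwise the alternative stays viable.
bool arrowStyle(bool guard, bool pred) {
    return !guard || pred;
}

// "(guard)? && <<pred>>?" : the guard is an additional condition, so both
// must hold.
bool andStyle(bool guard, bool pred) {
    return guard && pred;
}
```

- The two forms differ only when the guard does not match: the "=>"
- form then lets the alternative through, the "&&" form rejects it.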
- The following example is contributed by Reinier van den Born
- (reinier@vnet.ibm.com).
- +-------------------------------------------------------------------------+
- | This grammar has two ways to call functions: |
- | |
- | - a "standard" call syntax with parens and comma separated args |
- | - a shell command like syntax (no parens and spacing separated args) |
- | |
- | The former also allows a variable to hold the name of the function, |
- | the latter can also be used to call external commands. |
- | |
- | The grammar (simplified) looks like this: |
- | |
- | fun_call : ID "(" { expr ("," expr)* } ")" |
- | /* ID is function name */ |
- | | "@" ID "(" { expr ("," expr)* } ")" |
- | /* ID is var containing fun name */ |
- | ; |
- | |
- | command : ID expr* /* ID is function name */ |
- | | path expr* /* path is external command name */ |
- | ; |
- | |
- | path : ID /* left out slashes and such */ |
- | | "@" ID /* ID is environment var */ |
- | ; |
- | |
- | expr : .... |
- | | "(" expr ")"; |
- | |
- | call : fun_call |
- | | command |
- | ; |
- | |
- | Obviously the call is wildly ambiguous. This is more or less how this |
- | is to be resolved: |
- | |
- | A call begins with an ID or an @ followed by an ID. |
- | |
- | If it is an ID and if it is an ext. command name -> command |
- | if followed by a paren -> fun_call |
- | otherwise -> command |
- | |
- | If it is an @ and if the ID is a var name -> fun_call |
- | otherwise -> command |
- | |
- | One can implement these rules quite neatly using && predicates: |
- | |
- | call : ("@" ID)? && <<isVarName(LT(2))>>? fun_call |
- | | (ID)? && <<isExtCmdName>>? command |
- | | (ID "(")? fun_call |
- | | command |
- | ; |
- | |
- | This can be done better, so it is not an ideal example, but it |
- | conveys the principle. |
- +-------------------------------------------------------------------------+
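- The order in which the four alternatives of "call" are tried can be
- written out as an ordinary function. This is a sketch of the
- decision only: classifyCall and its parameters are hypothetical,
- token types are passed as strings for readability, and the results
- of the semantic checks are passed in as flags.

```cpp
#include <string>

enum CallKind { FUN_CALL, COMMAND };

// la1 and la2 are the types of the first two lookahead tokens
// ("@", "ID", "(" or anything else).
CallKind classifyCall(const std::string& la1, const std::string& la2,
                      bool isVarName, bool isExtCmdName) {
    if (la1 == "@" && la2 == "ID" && isVarName) return FUN_CALL;  // alt 1
    if (la1 == "ID" && isExtCmdName)            return COMMAND;   // alt 2
    if (la1 == "ID" && la2 == "(")              return FUN_CALL;  // alt 3
    return COMMAND;                                               // alt 4
}
```

- An ID that names an external command is treated as a command even
- when a parenthesis follows, because alt 2 is tried before alt 3.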
- #123. (Changed in 1.33MR11) Correct definition of operators in ATokPtr.h
- The return value of operators in ANTLRTokenPtr:
- changed: unsigned ... operator !=(...)
- to: int ... operator != (...)
- changed: unsigned ... operator ==(...)
- to: int ... operator == (...)
- Suggested by R.A. Nelson (cowboy@VNET.IBM.COM)
- #122. (Changed in 1.33MR11) Member functions to reset DLG in C++ mode
- void DLGFileReset(FILE *f) { input = f; found_eof = 0; }
- void DLGStringReset(DLGChar *s) { input = s; p = &input[0]; }
- Supplied by R.A. Nelson (cowboy@VNET.IBM.COM)
- #121. (Changed in 1.33MR11) Another attempt to fix -o (output dir) option
- Another attempt is made to improve the -o option of antlr, dlg,
- and sorcerer. This one by JVincent (JVincent@novell.com).
- The current rule:
- a. If -o is not specified then any explicit directory
- names are retained.
- b. If -o is specified then the -o directory name overrides any
- explicit directory names.
- c. The directory name of the grammar file is *not* stripped
- to create the main output file. However it is still subject
- to override by the -o directory name.
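- Rules (a) and (b) can be sketched as a small path helper. The
- function outputPath is hypothetical, it assumes "/" as the
- directory separator, and it does not reproduce the code in antlr,
- dlg, or sorcerer.

```cpp
#include <string>

// Hypothetical helper illustrating rules (a) and (b).
// oDir is the argument of -o, or empty when -o was not given.
std::string outputPath(const std::string& oDir, const std::string& fileName) {
    if (oDir.empty()) return fileName;            // (a) keep explicit directory

    // (b) drop any explicit directory and use the -o directory instead.
    std::string::size_type slash = fileName.find_last_of('/');
    std::string base = (slash == std::string::npos)
                           ? fileName
                           : fileName.substr(slash + 1);
    if (oDir[oDir.size() - 1] == '/') return oDir + base;
    return oDir + "/" + base;
}
```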
- #120. (Changed in 1.33MR11) "-info f" output to stdout rather than stderr
- Added option 0 (e.g. "-info 0") which is a noop.
- #119. (Changed in 1.33MR11) Ambiguity aid for grammars
- The user can ask for additional information on ambiguities reported
- by antlr to stdout. At the moment, only one ambiguity report can
- be created in an antlr run.
- This feature is enabled using the "-aa" (Ambiguity Aid) option.
- The following options control the reporting of ambiguities:
- -aa ruleName Selects reporting by name of rule
- -aa lineNumber Selects reporting by line number
- (file name not compared)
- -aam Selects "multiple" reporting for a token
- in the intersection set of the
- alternatives.
- For instance, the token ID may appear dozens
- of times in various paths as the program
- explores the rules which are reachable from
- the point of an ambiguity. With option -aam
- every possible path the search program
- encounters is reported.
- Without -aam only the first encounter is
- reported. This may result in incomplete
- information, but the information may be
- sufficient and much shorter.
- -aad depth Selects the depth of the search.
- The default value is 1.
- The number of paths to be searched, and the
- size of the report can grow geometrically
- with the -ck value if a full search for all
- contributions to the source of the ambiguity
- is explored.
- The depth represents the number of tokens
- in the lookahead set which are matched against
- the set of ambiguous tokens. A depth of 1
- means that the search stops when a lookahead
- sequence of just one token is matched.
- A k=1 ck=6 grammar might generate 5,000 items
- in a report if a full depth 6 search is made
- with the Ambiguity Aid. The source of the
- problem may be in the first token and obscured
- by the volume of data - I hesitate to call
- it information.
- When the user selects a depth > 1, the search
- is first performed at depth=1 for both
- alternatives, then depth=2 for both alternatives,
- etc.
- Sample output for rule grammar in antlr.g itself:
- +---------------------------------------------------------------------+
- | Ambiguity Aid |
- | |
- | Choice 1: grammar/70 line 632 file a.g |
- | Choice 2: grammar/82 line 644 file a.g |
- | |
- | Intersection of lookahead[1] sets: |
- | |
- | "}" "class" "#errclass" "#tokclass" |
- | |
- | Choice:1 Depth:1 Group:1 ("#errclass") |
- | 1 in (...)* block grammar/70 line 632 a.g |
- | 2 to error grammar/73 line 635 a.g |
- | 3 error error/1 line 894 a.g |
- | 4 #token "#errclass" error/2 line 895 a.g |
- | |
- | Choice:1 Depth:1 Group:2 ("#tokclass") |
- | 2 to tclass grammar/74 line 636 a.g |
- | 3 tclass tclass/1 line 937 a.g |
- | 4 #token "#tokclass" tclass/2 line 938 a.g |
- | |
- | Choice:1 Depth:1 Group:3 ("class") |
- | 2 to class_def grammar/75 line 637 a.g |
- | 3 class_def class_def/1 line 669 a.g |
- | 4 #token "class" class_def/3 line 671 a.g |
- | |
- | Choice:1 Depth:1 Group:4 ("}") |
- | 2 #token "}" grammar/76 line 638 a.g |
- | |
- | Choice:2 Depth:1 Group:5 ("#errclass") |
- | 1 in (...)* block grammar/83 line 645 a.g |
- | 2 to error grammar/93 line 655 a.g |
- | 3 error error/1 line 894 a.g |
- | 4 #token "#errclass" error/2 line 895 a.g |
- | |
- | Choice:2 Depth:1 Group:6 ("#tokclass") |
- | 2 to tclass grammar/94 line 656 a.g |
- | 3 tclass tclass/1 line 937 a.g |
- | 4 #token "#tokclass" tclass/2 line 938 a.g |
- | |
- | Choice:2 Depth:1 Group:7 ("class") |
- | 2 to class_def grammar/95 line 657 a.g |
- | 3 class_def class_def/1 line 669 a.g |
- | 4 #token "class" class_def/3 line 671 a.g |
- | |
- | Choice:2 Depth:1 Group:8 ("}") |
- | 2 #token "}" grammar/96 line 658 a.g |
- +---------------------------------------------------------------------+
- For a linear lookahead set ambiguity (where k=1, or where k>1 but
- all lookahead sets [i] with i<k have degree one) the
- reports appear in the following order:
- for (depth=1 ; depth <= "-aad depth" ; depth++) {
-   for (alternative=1; alternative <=2 ; alternative++) {
-     while (matches-are-found) {
-       group++;
-       print-report
-     };
-   };
- };
- For reporting a k-tuple ambiguity, the reports appear in the
- following order:
- for (depth=1 ; depth <= "-aad depth" ; depth++) {
-   while (matches-are-found) {
-     for (alternative=1; alternative <=2 ; alternative++) {
-       group++;
-       print-report
-     };
-   };
- };
- This is because matches are generated in different ways for
- linear lookahead and k-tuples.
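- The two orderings above can be sketched in C as follows. This is only an
- illustration, not code from antlr: the Report struct, the function names,
- and the match counts passed in are hypothetical, and the sketch assumes
- that the group counter keeps running from one depth to the next.

```c
#include <assert.h>
#include <stddef.h>

/* One entry per report the Ambiguity Aid would print (hypothetical type). */
typedef struct {
    int depth;
    int alternative;
    int group;
} Report;

/* Append one report to out[] and return the new count. */
static size_t emit(Report *out, size_t n, size_t cap,
                   int depth, int alternative, int group)
{
    assert(n < cap);  /* caller must supply a large enough buffer */
    out[n].depth = depth;
    out[n].alternative = alternative;
    out[n].group = group;
    return n + 1;
}

/* Linear lookahead: at each depth, every match for alternative 1 is
 * reported before any match for alternative 2.
 * matches[d-1][a-1] is the assumed number of matches at depth d for
 * alternative a. */
static size_t order_linear(int aad_depth, const int matches[][2],
                           Report *out, size_t cap)
{
    size_t n = 0;
    int group = 0;
    for (int depth = 1; depth <= aad_depth; depth++)
        for (int alt = 1; alt <= 2; alt++)
            for (int m = 0; m < matches[depth - 1][alt - 1]; m++)
                n = emit(out, n, cap, depth, alt, ++group);
    return n;
}

/* k-tuple: each matched tuple is reported for alternative 1 and then
 * alternative 2 before the next tuple is considered.
 * tuples[d-1] is the assumed number of matched tuples at depth d. */
static size_t order_ktuple(int aad_depth, const int tuples[],
                           Report *out, size_t cap)
{
    size_t n = 0;
    int group = 0;
    for (int depth = 1; depth <= aad_depth; depth++)
        for (int m = 0; m < tuples[depth - 1]; m++)
            for (int alt = 1; alt <= 2; alt++)
                n = emit(out, n, cap, depth, alt, ++group);
    return n;
}
```

- With a depth of 1 and four matches per alternative, order_linear() yields
- groups 1-4 for alternative 1 followed by groups 5-8 for alternative 2,
- which is the grouping seen in the sample output above.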
- #118. (Changed in 1.33MR11) DEC VMS makefile and VMS related changes
- Revised makefiles for DEC/VMS operating system for antlr, dlg,
- and sorcerer.
- Shortened names of routines with external linkage to fewer than 32
- characters to conform to DEC/VMS linker limitations.
- Jean-Francois Pieronne discovered problems with dlg and antlr
- due to the VMS linker not being case sensitive for names with
- external linkage. In dlg the problem was with "className" and
- "ClassName". In antlr the problem was with "GenExprSets" and
- "genExprSets".
- Added genmms, a version of genmk for the DEC/VMS version of make.
- The source is in directory pccts/support/DECmms.
- All VMS contributions by Jean-Francois Pieronne (jfp@iname.com).
- #117. (Changed in 1.33MR10) new EXPERIMENTAL predicate hoisting code
- The hoisting of predicates into rules to create prediction
- expressions is a problem in antlr. Consider the following
- example (k=1 with -prc on):
- start : (a)* "@" ;
- a : b | c ;
- b : <<isUpper(LATEXT(1))>>? A ;
- c : A ;
- Prior to 1.33MR10 the code generated for "start" would resemble:
- while {
-   if (LA(1)==A &&
-       (!(LA(1)==A) || isUpper())) {
-     a();
-   }
- };
- This code is wrong because it makes rule "c" unreachable from
- "start". The essence of the problem is that antlr fails to
- recognize that there can be a valid alternative within "a" even
- when the predicate <<isUpper(LATEXT(1))>>? is false.
- In 1.33MR10 with -mrhoist the hoisting of the predicate into
- "start" is suppressed because antlr recognizes that "c" can
- cover all the cases where the predicate is false:
- while {
-   if (LA(1)==A) {
-     a();
-   }
- };
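- The difference between the two prediction expressions can be checked with
- a small C sketch. It is not generated parser code: TOK_A and TOK_AT are
- hypothetical token codes, and the isUpper() test is assumed to look only
- at the first character of the token text.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

enum { TOK_A = 1, TOK_AT = 2 };  /* hypothetical token codes */

/* Assumed stand-in for the user's isUpper(LATEXT(1)). */
static bool is_upper_text(const char *text)
{
    assert(text != NULL);
    return isupper((unsigned char)text[0]) != 0;
}

/* Prediction for rule "a" with the predicate hoisted (before 1.33MR10).
 * The predicate guards the whole rule, so alternative "c" is never tried
 * when the predicate is false. */
static bool predict_a_hoisted(int la1, const char *text)
{
    return la1 == TOK_A && (!(la1 == TOK_A) || is_upper_text(text));
}

/* Prediction for rule "a" with hoisting suppressed (-mrhoist). The choice
 * between "b" and "c" is then left to rule "a" itself. */
static bool predict_a_suppressed(int la1, const char *text)
{
    (void)text;  /* token text is no longer consulted here */
    return la1 == TOK_A;
}
```

- For a token A whose text is not upper case, predict_a_hoisted() returns
- false, so "start" never calls a() and "c" cannot match, while
- predict_a_suppressed() returns true.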
- With the antlr "-info p" switch the user will receive information
- about the predicate suppression in the generated file:
- --------------------------------------------------------------
- #if 0
- Hoisting of predicate suppressed by alternative without predicate.
- The alt without the predicate includes all cases where
- the predicate is false.
- WITH predicate: line 7 v1.g
- WITHOUT predicate: line 7 v1.g
- The context set for the predicate:
- A
- The lookahead set for the alt WITHOUT the semantic predicate:
- A
- The predicate:
- pred << isUpper(LATEXT(1))>>?
- depth=k=1 rule b line 9 v1.g
- set context:
- A
- tree context: null
- Chain of referenced rules:
- #0 in rule start (line 5 v1.g) to rule a
- #1 in rule a (line 7 v1.g)
- #endif
- --------------------------------------------------------------
- A predicate can be suppressed by a combination of alternatives
- which, taken together, cover a predicate:
- start : (a)* "@" ;
- a : b | ca | cb | cc ;
- b : <<isUpper(LATEXT(1))>>? ( A | B | C ) ;
- ca : A ;
- cb : B ;
- cc : C ;
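- For k=1 the coverage test can be pictured as a set comparison: the
- predicate is covered when every token in its context set also begins
- some alternative that has no predicate. The C sketch below illustrates
- that idea with bit masks. The token masks and the function name are
- hypothetical, and this is not a description of how antlr implements
- the test internally.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical one-bit-per-token encoding of k=1 lookahead sets. */
enum { TOK_A = 1u << 0, TOK_B = 1u << 1, TOK_C = 1u << 2, TOK_X = 1u << 3 };

/* Returns true when the union of the lookahead sets of the alternatives
 * without predicates contains the whole context set of the predicate. */
static bool predicate_is_covered(unsigned pred_context,
                                 const unsigned *plain_alts, int n)
{
    assert(n == 0 || plain_alts != NULL);
    unsigned covered = 0;
    for (int i = 0; i < n; i++)
        covered |= plain_alts[i];
    return (pred_context & ~covered) == 0;
}
```

- For the grammar above, the context set { A, B, C } of the predicate in
- "b" is covered by the lookahead sets { A }, { B }, and { C } of "ca",
- "cb", and "cc" taken together, although none of them covers it alone.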
- Consider a more complex example in which "c" covers only part of
- a predicate:
- start : (a)* "@" ;
- a : b
- | c
- ;
- b : <<isUpper(LATEXT(1))>>?
- ( A
- | X
- );
- c : A
- ;
- Prior to 1.33MR10 the code generated for "start" would resemble:
- while {
-   if ( (LA(1)==A || LA(1)==X) &&
-        (! (LA(1)==A || LA(1)==X) || isUpper()) ) {
-     a();
-   }
- };
- With 1.33MR10 and -mrhoist the predicate context is restricted to
- the non-covered lookahead. The code resembles:
- while {
-   if ( (LA(1)==A || LA(1)==X) &&
-        (! (LA(1)==X) || isUpper()) ) {
-     a();
-   }
- };
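- The restriction can be viewed as a set difference: the tokens that the
- alternative without a predicate can start with are removed from the
- context of the predicate. The C sketch below works this through for the
- grammar above, in which rule "a" can start with A or X and rule "c" with
- A only. The bit masks and function names are hypothetical and are not
- taken from antlr.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical one-bit-per-token encoding of k=1 lookahead sets. */
enum { TOK_A = 1u << 0, TOK_X = 1u << 1 };

/* Remove the lookahead of the alternative without a predicate from the
 * context of the predicate. */
static unsigned restrict_context(unsigned pred_context,
                                 unsigned plain_lookahead)
{
    return pred_context & ~plain_lookahead;
}

/* Prediction for rule "a" with the restricted predicate context.
 * la1 must be exactly one token bit; is_upper stands for the result of
 * the user's isUpper() predicate. */
static bool predict_a_restricted(unsigned la1, bool is_upper)
{
    assert(la1 != 0 && (la1 & (la1 - 1)) == 0);
    unsigned lookahead = TOK_A | TOK_X;                        /* first set of "a" */
    unsigned context = restrict_context(TOK_A | TOK_X, TOK_A); /* leaves only X */
    return (la1 & lookahead) != 0 && (!(la1 & context) || is_upper);
}
```

- With this restriction a token A always predicts "a", so that "c" stays
- reachable, while a token X predicts "a" only when the predicate is true.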
- With the antlr "-info p" switch the user will receive information
- about the predicate restriction in the generated file:
- --------------------------------------------------------------
- #if 0
- Restricting the context of a predicate because of overlap
- in the lookahead set between the alternative with the
- semantic predicate and one without
- Without this restriction the alternative without the predicate
- could not be reached when input matched the context of the
- predicate and the predicate was false.