CUP User's Manual
Scott E. Hudson
Graphics Visualization and Usability Center
Georgia Institute of Technology
Modified by Frank Flannery, C. Scott Ananian, Dan Wang with advice from Andrew W. Appel
Last updated July 1999 (v0.10j)
Table of Contents
i.
1.
2.
3.
4.
5.
6.
7.
A.
B.
About CUP Version 0.10
Introduction and Example
Specification Syntax
Running CUP
Customizing the Parser
Scanner interface
Error Recovery
Conclusion
References
Grammar for CUP Specification Files
A Very Simple Example Scanner
C.
D.
E.
Incompatibilites between CUP 0.9 and CUP 0.10
Bugs
Change log
i. About CUP Version 0.10
Version 0.10 of CUP adds many new changes and features over the previous releases of version
0.9. These changes attempt to make CUP more like its predecessor, YACC. As a result, the old 0.9
parser specifications for CUP are not compatible and a reading of appendix C of the new manual
will be necessary to write new specifications. The new version, however, gives the user more
power and options, making parser specifications easier to write.
1. Introduction and Example
This manual describes the basic operation and use of the Java(tm) Based Constructor of Useful
Parsers (CUP for short). CUP is a system for generating LALR parsers from simple specifications.
It serves the same role as the widely used program YACC [1] and in fact offers most of the
features of YACC. However, CUP is written in Java, uses specifications including embedded Java
code, and produces parsers which are implemented in Java.
Although this manual covers all aspects of the CUP system, it is relatively brief, and assumes you
have at least a little bit of knowledge of LR parsing. A working knowledge of YACC is also very
helpful in understanding how CUP specifications work. A number of compiler construction
textbooks (such as [2,3]) cover this material, and discuss the YACC system (which is quite similar
to this one) as a specific example.
Using CUP involves creating a simple specification based on the grammar for which a parser is
needed, along with construction of a scanner capable of breaking characters up into meaningful
tokens (such as keywords, numbers, and special symbols).
As a simple example, consider a system for evaluating simple arithmetic expressions over integers.
This system would read expressions from standard input (each terminated with a semicolon),
evaluate them, and print the result on standard output. A grammar for the input to such a system
might look like:
expr_list ::= expr_list expr_part | expr_part
expr_part ::= expr ';'
expr
::= expr '+' expr | expr '-' expr | expr '*' expr
| expr '/' expr | expr '%' expr | '(' expr ')'
| '-' expr | number
To specify a parser based on this grammar, our first step is to identify and name the set of terminal
symbols that will appear on input, and the set of non-terminal symbols. In this case,
the
non-terminals are:
expr_list, expr_part and expr.
For terminal names we might choose:
SEMI, PLUS, MINUS, TIMES, DIVIDE, MOD, NUMBER, LPAREN,
and RPAREN
The experienced user will note a problem with the above grammar. It is ambiguous. An ambiguous
grammar is a grammar which, given a certain input, can reduce the parts of the input in two
different ways such as to give two different answers. Take the above grammar, for example. given
the following input:
3 + 4 * 6
The grammar can either evaluate the 3 + 4 and then multiply seven by six, or it can evaluate 4 * 6
and then add three. Older versions of CUP forced the user to write unambiguous grammars, but
now there is a construct allowing the user to specify precedences and associativities for terminals.
This means that the above ambiguous grammar can be used, after specifying precedences and
associativities. There is more explanation later. Based on these namings we can construct a small
CUP specification as follows:
// CUP specification for a simple expression evaluator (no actions)
import java_cup.runtime.*;
/* Preliminaries to set up and use the scanner. */
init with {: scanner.init();
:};
scan with {: return scanner.next_token(); :};
/* Terminals (tokens returned by the scanner). */
terminal
terminal
terminal Integer
NUMBER;
SEMI, PLUS, MINUS, TIMES, DIVIDE, MOD;
UMINUS, LPAREN, RPAREN;
/* Non terminals */
non terminal
non terminal Integer
expr_list, expr_part;
expr, term, factor;
/* Precedences */
precedence left PLUS, MINUS;
precedence left TIMES, DIVIDE, MOD;
precedence left UMINUS;
/* The grammar */
expr_list ::= expr_list expr_part |
expr_part;
expr_part ::= expr SEMI;
expr
::= expr PLUS expr
| expr MINUS expr
| expr TIMES expr
| expr DIVIDE expr
| expr MOD expr
| MINUS expr %prec UMINUS
| LPAREN expr RPAREN
| NUMBER
;
We will consider each part of the specification syntax in detail later. However, here we can quickly
see that the specification contains four main parts. The first part provides preliminary and
miscellaneous declarations to specify how the parser is to be generated, and supply parts of the
runtime code. In this case we indicate that the java_cup.runtime classes should be imported, then
supply a small bit of initialization code, and some code for invoking the scanner to retrieve the
next input token. The second part of the specification declares terminals and non-terminals, and
associates object classes with each. In this case, the terminals are declared as either with no type,
or of type Integer. The specified type of the terminal or non-terminal is the type of the value of
those terminals or non-terminals. If no type is specified, the terminal or non-terminal carries no
value. Here, no type indicates that these terminals and non-terminals hold no value. The third part
specifies the precedence and associativity of terminals. The last precedence declaration give its
terminals the highest precedence. The final part of the specification contains the grammar.
To produce a parser from this specification we use the CUP generator. If this specification were
stored in a file parser.cup, then (on a Unix system at least) we might invoke CUP using a
command like:
java java_cup.Main < parser.cup
In this case, the system will produce two Java source files containing parts of the generated parser:
sym.java and parser.java. As you might expect, these two files contain declarations for the classes
sym and parser. The sym class contains a series of constant declarations, one for each terminal
symbol. This is typically used by the scanner to refer to symbols (e.g. with code such as "return
new Symbol(sym.SEMI);" ). The parser class implements the parser itself.
The specification above, while constructing a full parser, does not perform any semantic actions
&emdash; it will only indicate success or failure of a parse. To calculate and print values of each
expression, we must embed Java code within the parser to carry out actions at various points. In
CUP, actions are contained in code strings which are surrounded by delimiters of the form {: and :}
(we can see examples of this in the init with and scan with clauses above). In general, the system
records all characters within the delimiters, but does not try to check that it contains valid Java
code.
A more complete CUP specification for our example system (with actions embedded at various
points in the grammar) is shown below:
// CUP specification for a simple expression evaluator (w/ actions)
import java_cup.runtime.*;
/* Preliminaries to set up and use the scanner. */
init with {: scanner.init();
:};
scan with {: return scanner.next_token(); :};
/* Terminals (tokens returned by the scanner). */
terminal
terminal
terminal Integer
NUMBER;
SEMI, PLUS, MINUS, TIMES, DIVIDE, MOD;
UMINUS, LPAREN, RPAREN;
/* Non-terminals */
non terminal
non terminal Integer
expr_list, expr_part;
expr;
/* Precedences */
precedence left PLUS, MINUS;
precedence left TIMES, DIVIDE, MOD;
precedence left UMINUS;
/* The grammar */
expr_list ::= expr_list expr_part
|
expr_part;
expr_part ::= expr:e
{: System.out.println("= " + e); :}
SEMI
;
expr
::= expr:e1 PLUS expr:e2
{: RESULT = new Integer(e1.intValue() + e2.intValue()); :}
|
expr:e1 MINUS expr:e2
{: RESULT = new Integer(e1.intValue() - e2.intValue()); :}
|
expr:e1 TIMES expr:e2
{: RESULT = new Integer(e1.intValue() * e2.intValue()); :}
|
expr:e1 DIVIDE expr:e2
{: RESULT = new Integer(e1.intValue() / e2.intValue()); :}
|
expr:e1 MOD expr:e2
{: RESULT = new Integer(e1.intValue() % e2.intValue()); :}
|
NUMBER:n
{: RESULT = n; :}
|
MINUS expr:e
{: RESULT = new Integer(0 - e.intValue()); :}
%prec UMINUS
|
LPAREN expr:e RPAREN
{: RESULT = e; :}
;
Here we can see several changes. Most importantly, code to be executed at various points in the
parse is included inside code strings delimited by {: and:}. In addition, labels have been placed on
various symbols in the right hand side of productions. For example in:
expr:e1 PLUS expr:e2
{: RESULT = new Integer(e1.intValue() + e2.intValue()); :}
the first non-terminal expr has been labeled with e1, and the second with e2. The left hand side
value of each production is always implicitly labeled asRESULT.Each symbol appearing in a
production is represented at runtime by an object of type Symbol on the parse stack. The labels
refer to the instance variablevalue in those objects. In the expression expr:e1 PLUS expr:e2, e1
and e2 refer to objects of type Integer. These objects are in the value fields of the objects of type
Symbol representing those non-terminals on the parse stack. RESULT is of type Integer as well,
since the resulting non-terminal expr was declared as of type Integer. This object becomes the
value instance variable of a new Symbol object.
For each label, two more variables accessible to the user are declared. A left and right value labels
are passed to the code string, so that the user can find out where the left and right side of each
terminal or non-terminal is in the input stream. The name of these variables is the label name, plus
left orright. for example, given the right hand side of a production expr:e1 PLUS expr:e2 the user
could not only access variables e1 and e2, but also e1left, e1right, e2left and e2right. these
variables are of type int.
The final step in creating a working parser is to create a scanner (also known as a lexical analyzer
or simply a lexer). This routine is responsible for reading individual characters, removing things
things like white space and comments, recognizing which terminal symbols from the grammar
each group of characters represents, then returning Symbol objects representing these symbols to
the parser. The terminals will be retrieved with a call to the scanner function. In the example, the
should
parser will
type
java_cup.runtime.Symbol. This
than older versions of CUP's
java_cup.runtime.symbol. These Symbol objects contains the instance variable value of type
Object, which should be set by the lexer. This variable refers to the value of that symbol, and the
type of object in value should be of the same type as declared in the terminal andnon terminal
declarations. In the above example, if the lexer wished to pass a NUMBER token, it should create
a Symbol with the value instance variable filled with an object of type Integer. Symbol objects
corresponding to terminals and non-terminals with no value have a null value field.
scanner
is very different
scanner.next_token(). The
type
call
return
objects
of
The code contained in the init with clause of the specification will be executed before any tokens
are requested. Each token will be requested using whatever code is found in the scan with clause.
Beyond this, the exact form the scanner takes is up to you; however note that each call to the
scanner function should return a new instance of java_cup.runtime.Symbol (or a subclass). These
symbol objects are annotated with parser information and pushed onto a stack; reusing objects will
result in the parser annotations being scrambled. As of CUP 0.10j, Symbol reuse should be
detected if it occurs; the parser will throw an Errortelling you to fix your scanner.
In the next section a more detailed and formal explanation of all parts of a CUP specification will
be given. Section 3 describes options for running the CUP system. Section 4 discusses the details
of how to customize a CUP parser, while section 5 discusses the scanner interface added in CUP
0.10j. Section 6considers error recovery. Finally, Section 7 provides a conclusion.
2. Specification Syntax
Now that we have seen a small example, we present a complete description of all parts of a CUP
specification. A specification has four sections with a total of eight specific parts (however, most
of these are optional). A specification consists of:
package and import specifications,
user code components,
symbol (terminal and non-terminal) lists,
precedence declarations, and
the grammar.
Each of these parts must appear in the order presented here. (A complete grammar for the
specification language is given in Appendix A.) The particulars of each part of the specification
are described in the subsections below.
Package and Import Specifications
A specification begins with optional package and import declarations. These have the same syntax,
and play the same role, as the package and import declarations found in a normal Java program. A
package declaration is of the form:
package name;
where name name is a Java package identifier, possibly in several parts separated by ".". In
general, CUP employs Java lexical conventions. So for example, both styles of Java comments are
supported, and identifiers are constructed beginning with a letter, dollar sign ($), or underscore (_),
which can then be followed by zero or more letters, numbers, dollar signs, and underscores.
After an optional package declaration, there can be zero or more import declarations. As in a Java
program these have the form:
import package_name.class_name;
or
import package_name.*;
The package declaration indicates what package the sym and parser classes that are generated by
the system will be in. Any import declarations that appear in the specification will also appear in
the source file for the parser class allowing various names from that package to be used directly in
user supplied action code.
User Code Components
Following the optional package and import declarations are a series of optional declarations that
allow user code to be included as part of the generated parser (see Section 4 for a full description
of how the parser uses this code). As a part of the parser file, a separate non-public class to contain
all embedded user actions is produced. The first action code declaration section allows code to be
included in this class. Routines and variables for use by the code embedded in the grammar would
normally be placed in this section (a typical example might be symbol table manipulation
routines). This declaration takes the form:
action code {: ... :};
where {: ... :} is a code string whose contents will be placed directly within the action class class
declaration.
After the action code declaration is an optional parser code declaration. This declaration allows
methods and variable to be placed directly within the generated parser class. Although this is less
common, it can be helpful when customizing the parser &emdash; it is possible for example, to
include scanning methods inside the parser and/or override the default error reporting routines.
This declaration is very similar to the action code declaration and takes the form:
parser code {: ... :};
Again, code from the code string is placed directly into the generated parser class definition.
Next in the specification is the optional init declaration which has the form:
init with {: ... :};
This declaration provides code that will be executed by the parser before it asks for the first token.
Typically, this is used to initialize the scanner as well as various tables and other data structures
that might be needed by semantic actions. In this case, the code given in the code string forms the
body of a void method inside the parser class.
The final (optional) user code section of the specification indicates how the parser should ask for
the next token from the scanner. This has the form:
scan with {: ... :};
As with the init clause, the contents of the code string forms the body of a method in the generated
parser. However, in this case the method returns an object of type java_cup.runtime.Symbol.
Consequently the code found in the scan with clause should return such a value. See section 5 for
information on the default behavior if the scan with section is omitted.
As of CUP 0.10j the action code, parser code, init code, and scan with sections may appear in any
order. They must, however, precede the symbol lists.
Symbol Lists
Following user supplied code comes the first required part of the specification: the symbol lists.
These declarations are responsible for naming and supplying a type for each terminal and