Week 10/11
Pithy Quote: Programming languages are merely the most reusable programs; any program that gets sufficiently reusable starts to look like a programming language.
During this course so far, we have built some powerful abstractions. But perhaps the most mind-boggling one comes when you end up designing a language and realize that it, too, is *just* an abstraction barrier.
Designing a good language is a lot like designing a good user interface. It needs to be simple to use, make it easy to express the types of problems you want to talk about, hide irrelevant details while exposing the right set of knobs, and so on. Yet there is no "right" way to do it --- just think of the different decisions made by Scheme and C++! If you like languages, classes like CS152/153 teach a lot about language design. Here we will only focus on one aspect: given a language design, how do we write the program to support it? This is where it gets somewhat shocking...
How can implementing a language be easier than writing programs in that given language?
One reason is that there is a well-developed theory of languages that allows us to know when our "interface" (language) is complete and allows us to translate one language to another (CS121). Secondly, interpretation and compilation strategies are largely shared between all languages. These are programs whose design people actually understand well!
In this last section of the course, we will look at languages as abstraction barriers. In particular we will expose what is below that barrier, and look at the design of that program.
Our goal is to build an interpreter for arithmetic expressions.
Given a sequence of statements such as:
a = 1; b = (a * 6) / (a - 4); temp = (b - a) * -b; ans = temp * (1 + temp);
The interpreter should print the value assigned to the last variable.
aterp> ???
How did you come up with the answer?
An interpreter captures that intuition in a formal way. One piece
of the puzzle is the Grammar.
<PROGRAM> ::= { <asst> }*
<asst> ::= <ATOM> = <expr>;
<expr> ::= <term> { [+,-] <term> }*
<term> ::= <factor> {[*,/] <factor> }*
<factor> ::= <ATOM> | <NUMBER> | ( <expr> ) | -<factor>
where,
PROGRAM is the start symbol
ATOM and NUMBER, are terminals
( ) ; + - * / = eof are explicit strings
You have seen grammars like this before in Assignment 2, where you created and used such a grammar to generate sentences. But more importantly, you can use the grammar to recognize whether a sentence was created by your grammar. Let's use this grammar to look at some simple arithmetic examples:
EXAMPLE:
========
3 + 4;
b = 3 + 4;
a = x + 4 * 5;
Which ones are legal sentences given our grammar?
 string    -----------    tokens     ----------                  ---------------
 a + b     |         |  "a" "+" "b"  |        |                  |             |
=========> | Lexical | ============> | Parser | ===============> | Interpreter |
           |         |               |        | ||               |             |
           -----------               ---------- ||               ---------------
                                                ||               ---------------
                                                || parse tree    |             |
                                                ===============> |  Compiler   |
                                                                 |             |
                                                                 ---------------
The Parts of the Problem
Consider the string A+B
LEXER: decides if A+B is "A" "+" "B" or a variable name "A+B"
PARSER: decides if writing "A" "+" "B" is a legal sentence
INTERPRETER: decides what "+" means and what "A" and "B" map to.
EXAMPLE: "ab=1+2;" is the same as "ab = 1 + 2;"
but not the same as "a b = 1 + 2;"
The lexer transforms "ab=1+2;"
into ab = 1 + 2 ;
[ATOM] [OP] [NUM] [OP] [NUM] [OP]
The lexer transforms "ab12" into ab[ATOM] 12[NUM]
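The lexer examples above can be sketched in a few lines. This is only an illustration, not the course's lexer: the names `Kind`, `Tok`, and `tokenize` are made up for the sketch, and it assumes (following the "ab12" example) that an atom is a run of letters and a number is a run of digits.

```cpp
#include <cassert>
#include <cctype>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical token kinds for this sketch only.
enum class Kind { Atom, Number, Symbol };

struct Tok {
    Kind kind;
    std::string text;
};

// Split the input into tokens. Assumptions: an atom is a run of letters,
// a number is a run of digits, and each of ( ) ; + - * / = is a token on
// its own. Whitespace only separates tokens.
std::vector<Tok> tokenize(const std::string &in) {
    std::vector<Tok> out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char c = static_cast<unsigned char>(in[i]);
        if (std::isspace(c)) { ++i; continue; }
        std::size_t start = i;
        if (std::isalpha(c)) {
            while (i < in.size() && std::isalpha(static_cast<unsigned char>(in[i]))) ++i;
            out.push_back({Kind::Atom, in.substr(start, i - start)});
        } else if (std::isdigit(c)) {
            while (i < in.size() && std::isdigit(static_cast<unsigned char>(in[i]))) ++i;
            out.push_back({Kind::Number, in.substr(start, i - start)});
        } else if (std::string("();+-*/=").find(in[i]) != std::string::npos) {
            out.push_back({Kind::Symbol, std::string(1, in[i])});
            ++i;
        } else {
            throw std::runtime_error(std::string("unexpected character: ") + in[i]);
        }
    }
    return out;
}
```

With this sketch, "ab=1+2;" and "ab = 1 + 2;" produce the same six tokens, while "a b = 1 + 2;" produces seven, because the space splits the name into two atoms.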
Lexer says 3+4;
Parser says, not legal
Lexer says ab = x + 4 * 5 ;
Parser says, it's legal and here's the tree showing how it was created
(=)
|
(ab)--+---(+)
|
(x)-+--(*)
|
(4)--+--(5)
Parser says a=5
Interpreter makes name "a" correspond to 5
Parser says a*5
Interpreter computes 5*5 and returns the value 25
There are many ways to put these three pieces together. We may not even want to use all three pieces. For example, instead of an interpreter that "evaluates" the result, we could have a compiler that simply "translates" the parse tree into machine code instructions.
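To make the interpreter/compiler distinction concrete, here is a small sketch that walks the same parse tree in two ways. Everything in it is hypothetical: the `Node` type, the helper functions, and the instruction names (PUSH, LOAD, ADD, ...) of an imaginary stack machine are invented for illustration and are not part of the course code.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical parse-tree node: a leaf holds a number or a variable name,
// an interior node holds an operator and two children.
struct Node {
    char op = 0;          // '+', '-', '*', '/' for interior nodes, 0 for leaves
    long value = 0;       // used when the leaf is a number
    std::string name;     // non-empty when the leaf is a variable
    std::unique_ptr<Node> left, right;
};

std::unique_ptr<Node> num(long v) {
    auto n = std::make_unique<Node>();
    n->value = v;
    return n;
}
std::unique_ptr<Node> var(const std::string &s) {
    auto n = std::make_unique<Node>();
    n->name = s;
    return n;
}
std::unique_ptr<Node> bin(char op, std::unique_ptr<Node> l, std::unique_ptr<Node> r) {
    auto n = std::make_unique<Node>();
    n->op = op;
    n->left = std::move(l);
    n->right = std::move(r);
    return n;
}

// Interpreter: walk the tree and compute a value right away.
long evaluate(const Node &n, const std::map<std::string, long> &env) {
    if (n.op == 0) return n.name.empty() ? n.value : env.at(n.name);
    assert(n.left && n.right);  // interior nodes always have two children
    long a = evaluate(*n.left, env), b = evaluate(*n.right, env);
    switch (n.op) {
        case '+': return a + b;
        case '-': return a - b;
        case '*': return a * b;
        default:  return a / b;
    }
}

// "Compiler": walk the same tree, but emit instructions for an imaginary
// stack machine instead of computing anything.
std::string compile(const Node &n) {
    if (n.op == 0)
        return n.name.empty() ? "PUSH " + std::to_string(n.value) + "\n"
                              : "LOAD " + n.name + "\n";
    std::string instr = n.op == '+' ? "ADD" : n.op == '-' ? "SUB"
                      : n.op == '*' ? "MUL" : "DIV";
    return compile(*n.left) + compile(*n.right) + instr + "\n";
}
```

For the tree of x + 4 * 5, `evaluate` needs to know the value of x and returns a number, while `compile` needs no values at all and returns a list of instructions that could be run later.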
For our simple arithmetic language, it is relatively easy to see how the lexer and the interpreter might work. But it's unclear how the parser does its job: how do we create a recognizer from a grammar? It turns out that if you design your grammar well, it can be amazingly simple to do.
We will use a very simple recursive descent parser. This parser is suitable only for grammars that obey the following properties: (a) the LHS has only one non-terminal symbol (b) if the RHS has multiple possibilities, all we need to do is look at the next token, to determine which rule applies. Basically, we will not allow the grammar to be ambiguous.
Here's how it works. Given a lexer input (a list of tokens) and a grammar (a list of rules), we will start with the START symbol in the grammar and keep replacing non-terminal symbols from the left, in a recursive fashion. That is, until we actually hit a terminal symbol. Then we'll take a token from our lexer input and see if it matches the expectation. If not, then we say it failed. If so, then we continue expanding the rest of the non-terminals in the same way as before. The pattern in which we will explore looks a lot like the way we explored the state space tree in Asst 3 -- hence the name "recursive descent".
<PROGRAM> ::= { <asst> }*
<asst> ::= <ATOM> = <expr>;
<expr> ::= <term> { [+,-] <term> }*
<term> ::= <factor> {[*,/] <factor> }*
<factor> ::= <ATOM> | <NUMBER> | ( <expr> ) | -<factor>
where,
PROGRAM is the start symbol
ATOM and NUMBER, are terminals
( ) ; + - * / = eof are explicit strings
Suppose the Lexer gives us the token list A = 5 + 6; Then we can use the above grammar to recognize or reject it.
PROGRAM
|
asst asst asst .....(zero or more) // We will expand the left-most case
|
ATOM = expr ; // We got a TERMINAL!
// (is A an atom, yes! therefore continue)
= expr ; // We got another TERMINAL (OPERATOR =)
// (is = the symbol =, yes!)
expr ; // We must expand expr
|
term (possibly) +/- term .... // We will expand left-most term
|
factor (possibly) */ factor ...
|
MULTIPLE CHOICES!!!
ATOM, NUMBER, (expr), -factor // note that the symbols "(" and "-" help here
// (Look ahead one token -- 5 -- pick number!)
NUMBER
Now that we've completed this branch we can go back up and take other branches. Our current input is 5 + 6 ;
NUMBER (possibly) */ factor // This won't work, since next token is +
NUMBER (possibly) +/- terms... // Going up further, this will work!
Indeed 5 is a number. And the next symbol is +. So we now have to expand the next part
NUMBER +/- term ....
| does our remaining input ( 6 ;) work with this?
As we can see, this branch isn't much different from what we just did. Term will reduce to the number 6. The next token is then not a "+" but a semicolon. This will cause us to pop back up to the top: expr is finished once it has consumed 5 + 6, and the next token, the semicolon, matches the end of the asst rule. So our sentence is legal.
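The hand trace above can be written down as a recognizer with one function per grammar rule. This is a minimal sketch under stated assumptions: the `Recognizer` struct and `recognize` function are hypothetical names, the tokens are plain strings already produced by some lexer, and a token counts as an ATOM if it starts with a letter and as a NUMBER if it starts with a digit.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

struct Recognizer {
    std::vector<std::string> toks;  // e.g. {"A", "=", "5", "+", "6", ";"}
    std::size_t pos = 0;

    // Look at the next token without consuming it ("" means end of input).
    std::string peek() const { return pos < toks.size() ? toks[pos] : ""; }

    // Consume the next token only if it is exactly the expected string.
    bool match(const std::string &s) {
        if (peek() != s) return false;
        ++pos;
        return true;
    }
    bool isAtom() const {
        std::string t = peek();
        return !t.empty() && std::isalpha(static_cast<unsigned char>(t[0]));
    }
    bool isNumber() const {
        std::string t = peek();
        return !t.empty() && std::isdigit(static_cast<unsigned char>(t[0]));
    }

    // <factor> ::= <ATOM> | <NUMBER> | ( <expr> ) | -<factor>
    bool factor() {
        if (isAtom() || isNumber()) { ++pos; return true; }
        if (match("(")) return expr() && match(")");
        if (match("-")) return factor();
        return false;
    }
    // <term> ::= <factor> { [*,/] <factor> }*
    bool term() {
        if (!factor()) return false;
        while (match("*") || match("/"))
            if (!factor()) return false;
        return true;
    }
    // <expr> ::= <term> { [+,-] <term> }*
    bool expr() {
        if (!term()) return false;
        while (match("+") || match("-"))
            if (!term()) return false;
        return true;
    }
    // <asst> ::= <ATOM> = <expr> ;
    bool asst() {
        if (!isAtom()) return false;
        ++pos;
        return match("=") && expr() && match(";");
    }
    // <PROGRAM> ::= { <asst> }*
    bool program() {
        while (pos < toks.size())
            if (!asst()) return false;
        return true;
    }
};

// Returns true if the token list is a legal sentence of the grammar.
bool recognize(const std::vector<std::string> &toks) {
    Recognizer r;
    r.toks = toks;
    return r.program();
}
```

Note how each choice is made by looking at only the next token, which is exactly the property (b) required of the grammar.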
Now we will look in more detail at how each piece is implemented. Note that, although we may think of this design as having three distinct phases that run one at a time in a sequence, we don't have to implement it that way. In fact the implementation we will look at behaves more like "streams" in Scheme. You can think of it as the interpreter "pulls" a statement from the parser, who pulls tokens from the lexer, who pulls characters from the input stream. As a result, we are only computing things "just-in-time". If we make an error, this implementation will tell us quickly and not waste time lexing and parsing the remaining program.
program file ==> Lexer ==> Parser ==> Interpreter (Push model)
Acts more like
program file <== Lexer <== Parser <== Interpreter (Pull model)
Just in time computation at work!!
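A rough sketch of the pull model, to show why an error is reported before the rest of the program is ever lexed. The class `PullLexer` and the function `pullStatement` are hypothetical stand-ins for the real lexer and parser, and the "parser" here only checks that a statement starts with a name.

```cpp
#include <cassert>
#include <cctype>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>

// Pulls one token at a time from a character stream, only when asked.
class PullLexer {
public:
    explicit PullLexer(std::istream &in) : in_(in) {}

    // Returns the next token, or "" at end of input.
    std::string next() {
        in_ >> std::ws;                        // skip whitespace
        int c = in_.get();
        if (c == std::char_traits<char>::eof()) return "";
        ++tokens_;
        std::string t(1, static_cast<char>(c));
        if (std::isalnum(c))                   // names and numbers may be longer
            while (std::isalnum(in_.peek())) t += static_cast<char>(in_.get());
        return t;
    }
    int tokensRead() const { return tokens_; }

private:
    std::istream &in_;
    int tokens_ = 0;
};

// Pulls one statement (tokens up to and including ';') from the lexer.
// Throws as soon as a statement does not start with a name.
std::string pullStatement(PullLexer &lex) {
    std::string first = lex.next();
    if (first.empty()) return "";              // no more statements
    if (!std::isalpha(static_cast<unsigned char>(first[0])))
        throw std::runtime_error("statement must start with a name, got: " + first);
    std::string stmt = first;
    for (std::string t = lex.next(); !t.empty(); t = lex.next()) {
        stmt += " " + t;
        if (t == ";") break;
    }
    return stmt;
}
```

Given the input "a = 1; 3 + 4; b = 2; c = 3;", the second call to `pullStatement` throws after the lexer has produced only five tokens; the remaining statements are never read from the stream.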
The Lexical analyzer will do a bunch of tasks
Now we will look at the implementation, which involves two classes: token_class and lexer.
(Note: The code for this lecture will be available from the website. Here we mainly show relevant excerpts)
// Some restricted type definitions to help us.
enum tokentype { // Possible Token types
TOK_SYMBOL,
TOK_NUMBER,
TOK_ASSIGN,
TOK_SEMICOLON,
TOK_OP,
TOK_LEFT_PAREN,
TOK_RIGHT_PAREN,
TOK_EOF
};
enum optype { // Possible Operator Types
OP_DIVIDE,
OP_MINUS,
OP_PLUS,
OP_TIMES
};
// Instead of writing int or long everywhere, we use "avalue"
typedef long avalue;
// EXCERPTS FROM token.h
//-----------------------
// Now we can define the TOKEN CLASS
class token_class {
public:
// CONSTRUCTORS
token_class(tokentype t); // can create tokens
token_class(optype o); // from many types of things
token_class(avalue v);
token_class(char *p);
~token_class();
// MUTATOR:
void setval(avalue v); // if the token is a SYMBOL
// then we can set its numerical value
// ACCESSORS
avalue getval(void); // we can get information about a token
string getname(void);
bool getinit(void);
optype op_of(void);
// TYPE AND PREDICATES to check type
tokentype type_of(void);
bool is_plus();
bool is_minus();
bool is_times();
bool is_divide();
private:
tokentype type; // Type of this token
string *name; // Print name (valid only for TOK_SYMBOL)
optype op; // Operator type (valid only for TOK_OP)
avalue val; // Value (valid only for TOK_NUMBER, or TOK_SYMBOL)
bool isinit; // True if val is set for TOK_SYMBOL
};
Note that the token class is essentially a container that can contain objects of possibly many types. This is especially obvious from the private data members, most of which only apply to one type of token. Our language is simple so this is not too cumbersome, but how else could you design tokens?
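One possible answer to that question, sketched here only as an illustration: give each kind of token its own class, so that every class stores just the data it needs. The class names below (`Token`, `NumberToken`, `SymbolToken`, `OpToken`) are hypothetical and are not the ones used in the course code.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <utility>

// Base class: what every token can do, with no per-kind data members.
class Token {
public:
    virtual ~Token() = default;
    virtual std::string describe() const = 0;
    virtual bool isOperator() const { return false; }
};

class NumberToken : public Token {
public:
    explicit NumberToken(long v) : value_(v) {}
    long value() const { return value_; }
    std::string describe() const override { return "NUMBER " + std::to_string(value_); }
private:
    long value_;
};

// Only symbols carry a name and an "initialized" flag.
class SymbolToken : public Token {
public:
    explicit SymbolToken(std::string name) : name_(std::move(name)) {}
    void setValue(long v) { value_ = v; initialized_ = true; }
    bool isInitialized() const { return initialized_; }
    long value() const {
        if (!initialized_) throw std::logic_error("symbol has no value yet: " + name_);
        return value_;
    }
    std::string describe() const override { return "SYMBOL " + name_; }
private:
    std::string name_;
    long value_ = 0;
    bool initialized_ = false;
};

class OpToken : public Token {
public:
    explicit OpToken(char op) : op_(op) {}
    bool isOperator() const override { return true; }
    std::string describe() const override { return std::string("OP ") + op_; }
private:
    char op_;
};
```

The trade-off is more classes and virtual calls in exchange for never having a data member that is "valid only for" some other kind of token. For a language this small, the single container class is arguably just as reasonable.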
// EXCERPTS FROM token.cxx Implementation
//----------------------------------------
// Some Constructors
token_class::token_class(tokentype t){
type = t;
}
token_class::token_class(optype o){
type = TOK_OP;
op = o;
}
token_class::token_class(avalue v) {
type = TOK_NUMBER;
val = v;
}
// How we deal with SYMBOLS
token_class::token_class(char *p) {
type = TOK_SYMBOL;
name = new string(p);
isinit = false;
}
void token_class::setval(avalue v) {
if (type != TOK_SYMBOL)
throw iexcept ("setval only allowed on SYMBOL");
val = v;
isinit = true;
}
// Some Accessors and Predicates
avalue token_class::getval(void) { return (val); }
string token_class::getname(void) { return (*name); }
bool token_class::is_plus() { return (op == OP_PLUS); }
A key thing to note here is how we deal with symbols. A symbol has
a name but it also has a potential value. The boolean isinit
tells us if this variable has been assigned a value yet or not. The
method setval allows us to set a symbol's value.
The lexer interface provides some key methods that the parser will use in order to request and return tokens: get and unget. In addition the lexer maintains the symbol table, so that variables are handled correctly.
// EXCERPTS from lexer.h and lexer.cxx
//-------------------------------------
// The lexer class is a "static" or singleton class.
// We do not expect to create any objects, most methods/data are static
// This is necessary for allowing us to use the stream operators
class lexer {
public:
void init(void); // Initialize (empty) the symbol table
token get(void); // Get the next token (from cin)
void unget(void); // Put the token back (useful for peeking)
bool eof(void); // Check if end of stream
// These methods need to be called from the stream operators,
// they must be static (globally defined)
static void add_token(token tok); // add token to the Symbol Table
static token lookup (char *name); // lookup symbol in Symbol Table
private:
static map<string, token> symTable; // The symbol table (name -> token)
static token _lasttok; // Last token read
static bool _backed_up; // True if lasttok is set
};
// Managing the Symbol Table
//---------------------------
void lexer::add_token(token tok) {
symTable[tok->getname()] = tok;
}
token lexer::lookup (char *name) {
string s(name);
if (symTable.find(s) == symTable.end())
return (NULL);
else
return (symTable[s]);
}
// Managing the Input Stream
//---------------------------
// Check if there are any more tokens in the input stream
bool lexer::eof(void)
{
if (!_backed_up) {
cin >> _lasttok; // Read a token from input stream cin
if (_lasttok != NULL) // store the token in "lasttok"
_backed_up = true; // and remember in "backedup"
}
return (!_backed_up);
}
// Get the next token from the input stream
token lexer::get(void)
{
if (eof()) // check the next token
throw iexcept("premature eof");
// Give away the token (no longer "backedup")
_backed_up = false;
return (_lasttok);
}
// Return the last token to the stream
void
lexer::unget(void)
{
assert(_lasttok != NULL && !_backed_up);
_backed_up = true;
}
// Notice that lasttok and backedup only give us enough space to
// remember one token, the most recent one. This is because the parser
// will never need to look more than one token ahead to take decisions
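The get/unget idea can be tried out in isolation. The following sketch is not the course's lexer: `TokenBuffer` is a hypothetical class that reads from a vector of strings instead of cin, but it keeps the same one-token memory and shows how peeking is just a get followed by an unget.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

class TokenBuffer {
public:
    explicit TokenBuffer(std::vector<std::string> toks) : toks_(std::move(toks)) {}

    // True when no token is backed up and the input is exhausted.
    bool eof() const { return !backedUp_ && next_ >= toks_.size(); }

    // Hand out the backed-up token if there is one, else read a new one.
    std::string get() {
        if (eof()) throw std::runtime_error("premature eof");
        if (backedUp_) {
            backedUp_ = false;
        } else {
            last_ = toks_[next_++];
            haveLast_ = true;
        }
        return last_;
    }

    // Put the most recent token back; only one token can be backed up.
    void unget() {
        assert(haveLast_ && !backedUp_);
        backedUp_ = true;
    }

    // Peeking is a get followed by an unget.
    std::string peek() {
        std::string t = get();
        unget();
        return t;
    }

private:
    std::vector<std::string> toks_;
    std::size_t next_ = 0;
    std::string last_;
    bool haveLast_ = false;
    bool backedUp_ = false;
};
```

Calling unget twice in a row trips the assertion, which mirrors the one-token limit described in the comment above.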
In the lexer method eof() we see that it is possible to read a token from the input stream, just as we have read integers or strings in the past.
int t;
token tok;
cin >> t;     // provided by cin implementation
cin >> tok;   // we must implement this
This is achieved by overloading the operator >> to behave correctly with tokens. Here is a pseudocode version of how this works:
// Overload input/output operators.
//----------------------------------
// Streams are a very convenient way of dealing with input and output
// We've already seen this in the form of cout and cin
// Here we will use streams to manage our program input
// However there is a bit of cost,
// Overloading these operators involves a lot of funny syntax
// See Lippman, page 129 for an explanation and example.
/* Note that streams provide some nice functions
s >> ws; // Consumes white space
s.get(); // Get a character from the stream
s.putback(c); // Return a character back to the stream
s.eof(); // Test for end of stream
*/
istream& operator>> (istream &s, token &tok){
  s >> ws;                      // Skip whitespace
  char c = s.get();             // Get first character
  if (s.eof()){
    tok = NULL;                 // If EOF, then nothing left to do
  } else if (isdigit(c)){       // IF IT'S A DIGIT,
    s.putback(c);               // then read off the integer
    avalue v;
    s >> v;
    tok = new token_class(v);   // and wrap it in a NUMBER token
  } else if (isalpha(c)){       // IF IT'S AN ENGLISH ALPHABET LETTER,
                                // then it must be a symbol
    ..read characters into the string
      "variablename" until not alpha/digit..
    // then check in case it's already in the symbol table
    tok = lexer::lookup(variablename);
    if (tok == NULL) {          // first time we see this name
      tok = new token_class(variablename);
      lexer::add_token(tok);
    }
  } else {                      // must be a SPECIAL CHARACTER, (+,*,-)
    ... create the appropriate token....
  }
  // must return the stream at the end
  return(s);
}
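For comparison with the pseudocode, here is a small version that actually compiles. It is a sketch under simplifying assumptions: `SimpleToken` is a hypothetical plain struct (not the course's token_class), there is no symbol table, and, as in the pseudocode, a symbol is a letter followed by letters or digits.

```cpp
#include <cassert>
#include <cctype>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical, simplified token used only in this sketch.
struct SimpleToken {
    enum Kind { END, NUMBER, SYMBOL, SPECIAL };
    Kind kind = END;
    long value = 0;      // meaningful only for NUMBER
    std::string text;    // meaningful only for SYMBOL and SPECIAL
};

// Overloaded input operator: reads exactly one token from the stream.
std::istream &operator>>(std::istream &s, SimpleToken &tok) {
    tok = SimpleToken();                       // reset to END
    s >> std::ws;                              // skip whitespace
    int c = s.peek();                          // look at the first character
    if (c == std::char_traits<char>::eof())
        return s;                              // nothing left: tok stays END
    if (std::isdigit(c)) {
        s >> tok.value;                        // let the stream read the integer
        tok.kind = SimpleToken::NUMBER;
    } else if (std::isalpha(c)) {
        while (std::isalnum(s.peek()))         // letters and digits form the name
            tok.text += static_cast<char>(s.get());
        tok.kind = SimpleToken::SYMBOL;
    } else {
        tok.text = std::string(1, static_cast<char>(s.get()));
        tok.kind = SimpleToken::SPECIAL;       // ( ) ; + - * / = and so on
    }
    return s;                                  // must return the stream
}
```

Using peek() instead of get() followed by putback() is a small design choice in this sketch: the first character is never removed from the stream, so there is nothing to put back.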
The parser is wishful thinking at its finest. It's so easy to write, it's hard to believe it actually works.
The Parser Class
// Each element in the grammar simply returns an avalue.
class parser {
public:
parser(interpreter &interp, lexer &lex);
avalue program(void);
private:
avalue expr(void);
avalue term(void);
avalue factor(void);
avalue asst(void);
interpreter &interp;
lexer &lex;
};
Some cool features
Recursive Descent Parsing
Implementing a recursive descent parser is extremely simple. For each rule in the grammar we will create a function. The name of the function is the non-terminal symbol on the left (e.g. expr()). The function does what we did by hand - it expands using the RHS of the rule, starting with the leftmost symbol. If the symbol is a non-terminal (i.e. another rule) then it calls that function. If the symbol is a terminal (variable name or number or operator) then it checks to see if the next token is the right thing.
The parser is also a great example of The Power of Exceptions. If we mistype a program in the interpreter (a common fault) it would be terrible if the interpreter just crashed. Instead each function in the parser will throw an exception -- the exception will tell us what kind of mistake we made.
<PROGRAM> ::= { <asst> }*
<asst> ::= <ATOM> = <expr>;
<expr> ::= <term> { [+,-] <term> }*
<term> ::= <factor> {[*,/] <factor> }*
<factor> ::= <ATOM> | <NUMBER> | ( <expr> ) | -<factor>
where,
PROGRAM is the start symbol
ATOM and NUMBER, are terminals
( ) ; + - * / = eof are explicit strings
Implementing Parsing: Let's look at a few examples in pseudocode
GRAMMAR RULE
<expr> ::= <term> { [+,-] <term> }*
BASIC IDEA (PSEUDOCODE)
expr(){
  v1 = term();                   // Get a term from the input
  while (next token is operator + or -){
    v2 = term();
    v1 = apply(operator, v1, v2);
  }
  return v1;
}
GRAMMAR RULE
<term> ::= <factor> {[*,/] <factor> }*
SIMILARLY TRANSLATES TO (PSEUDOCODE)
term(){
  v1 = factor();
  while (next token is operator * or /){
    v2 = factor();
    v1 = apply(operator, v1, v2);
  }
  return v1;
}
Code snippets from parser.cxx
avalue parser::expr(void)
{
  avalue v1, v2;
  token tok;
  v1 = term();
  tok = lex.get();
  while ((tok->type_of() == TOK_OP) &&
         (tok->is_plus() || tok->is_minus()))
  {
    // We got a plus or a minus
    v2 = term();
    if (tok->is_plus())
      v1 = interp.evaluate("plus", v1, v2);
    else
      v1 = interp.evaluate("minus", v1, v2);
    // Fetch the token after this term, so the loop test sees it
    tok = lex.get();
  }
  // Return the token that wasn't part of this expression
  lex.unget();
  return (v1);
}
avalue
parser::term(void)
{
  avalue v1, v2;
  token tok;
  v1 = factor();
  tok = lex.get();
  while ((tok->type_of() == TOK_OP) &&
         (tok->is_divide() || tok->is_times()))
  {
    // We got a times or a divide
    v2 = factor();
    if (tok->is_times())
      v1 = interp.evaluate("multiply", v1, v2);
    else
      v1 = interp.evaluate("divide", v1, v2);
    // Fetch the token after this factor, so the loop test sees it
    tok = lex.get();
  }
  // Return the token that wasn't part of this term
  lex.unget();
  return (v1);
}
class interpreter {
public:
// Evaluate a token to produce a value
// (tok must be of type NUMBER or SYMBOL)
avalue evaluate (token tok);
// Apply a binary operator ("plus", "minus", "multiply", "divide") to two values
avalue evaluate (string type, avalue v1, avalue v2);
// Assign a value to a token; return the value assigned
avalue assign (token tok, avalue v);
};
int main()
{
avalue result;
interpreter interp;
lexer lex;
parser parse(interp, lex);
while (1) {
lex.init(); // Initialize lexical analyzer.
try {
cout << "Please enter your program, "
<< "ending with a '" << EOF_MARKER << "'.\n"
<< "Type ^D to quit.\n\n";
if (lex.eof())
break;
result = parse.program();
// program() may write to cout, so it must be
// finished before the next statement begins
cout << "\nValue: " << result << "\n\n";
}
catch (iexcept &i) {
i.what();
lex.flush();
continue;
}
}
return (0);
}
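To see all the pieces working together in one place, here is a compact, self-contained sketch of the same grammar. It is deliberately much simpler than the course code and makes several assumptions: the class `MiniInterp` and the helper `run` are hypothetical names, lexing is done character by character inside the parser, variable names consist of letters only, variables live in a std::map, and errors are reported with std::runtime_error instead of iexcept.

```cpp
#include <cassert>
#include <cctype>
#include <map>
#include <sstream>
#include <stdexcept>
#include <string>

class MiniInterp {
public:
    explicit MiniInterp(const std::string &src) : in_(src) {}

    // <PROGRAM> ::= { <asst> }*   Returns the value of the last assignment.
    long program() {
        long last = 0;
        while (peek() != std::char_traits<char>::eof()) last = asst();
        return last;
    }

private:
    // Skip whitespace and look at the next character without consuming it.
    int peek() { in_ >> std::ws; return in_.peek(); }

    // Consume the next character, which must be c.
    void expect(char c) {
        if (peek() != c) throw std::runtime_error(std::string("expected '") + c + "'");
        in_.get();
    }
    std::string atom() {
        if (!std::isalpha(peek())) throw std::runtime_error("expected a variable name");
        std::string name;
        while (std::isalpha(in_.peek())) name += static_cast<char>(in_.get());
        return name;
    }
    // <asst> ::= <ATOM> = <expr> ;
    long asst() {
        std::string name = atom();
        expect('=');
        long v = expr();
        expect(';');
        return vars_[name] = v;
    }
    // <expr> ::= <term> { [+,-] <term> }*
    long expr() {
        long v = term();
        while (peek() == '+' || peek() == '-') {
            char op = static_cast<char>(in_.get());
            long rhs = term();
            v = (op == '+') ? v + rhs : v - rhs;
        }
        return v;
    }
    // <term> ::= <factor> { [*,/] <factor> }*
    long term() {
        long v = factor();
        while (peek() == '*' || peek() == '/') {
            char op = static_cast<char>(in_.get());
            long rhs = factor();
            if (op == '/' && rhs == 0) throw std::runtime_error("division by zero");
            v = (op == '*') ? v * rhs : v / rhs;
        }
        return v;
    }
    // <factor> ::= <ATOM> | <NUMBER> | ( <expr> ) | -<factor>
    long factor() {
        int c = peek();
        if (c == '(') { in_.get(); long v = expr(); expect(')'); return v; }
        if (c == '-') { in_.get(); return -factor(); }
        if (std::isdigit(c)) { long v = 0; in_ >> v; return v; }
        if (std::isalpha(c)) {
            std::string name = atom();
            auto it = vars_.find(name);
            if (it == vars_.end()) throw std::runtime_error("uninitialized variable: " + name);
            return it->second;
        }
        throw std::runtime_error("unexpected input in factor");
    }

    std::istringstream in_;
    std::map<std::string, long> vars_;
};

// Convenience wrapper: interpret a whole program given as a string.
long run(const std::string &src) { return MiniInterp(src).program(); }
```

Calling `run` on a program such as the one at the start of these notes returns the value of the last assignment, while an illegal sentence such as "3 + 4;" makes it throw. A main function would wrap the call in try/catch, much like the main shown above.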