CS 51 Week 10, L2:

Language, The Final Frontier


Today's topics:

Week 10/11, Pithy Quote

Programming languages are merely the most reusable programs;
Any program that gets sufficiently reusable starts to look like a programming language.


Language as an Abstraction Barrier

During this course so far, we have built some powerful abstractions. But perhaps the most mind-boggling one comes when you end up designing a language, and realize that it, too, is *just* an abstraction barrier.

Designing a good language is a lot like designing a good user interface. It needs to be simple to use, make it easy to express the types of problems you want to talk about, hide irrelevant details while exposing the right set of knobs, etc, etc. Yet there is no "right" way to do it --- just think of the different decisions made by Scheme and C++! If you like languages, classes like CS152/153 teach a lot about language design. Here we will only focus on one aspect: given a language design, how do we write the program to support it? This is where it gets somewhat shocking...

How can implementing a language be easier than writing programs in that given language?

One reason is that there is a well-developed theory of languages that allows us to know when our "interface" (language) is complete and allows us to translate one language to another (CS121). Secondly, interpretation and compilation strategies are largely shared between all languages. These are programs whose design people actually understand well!

In this last section of the course, we will look at languages as abstraction barriers. In particular we will expose what is below that barrier, and look at the design of that program.

A Simple Arithmetic Language

Our goal is to build an interpreter for arithmetic expressions.
Given a sequence of statements such as:

	a = 1;
	b = (a * 6) / (a - 4);
	temp = (b - a) * -b;
	ans = temp * (1 + temp);

The interpreter should print the value assigned to the last variable.

 
	aterp> ???

How did you come up with the answer?
An interpreter captures that intuition in a formal way. One piece of the puzzle is the Grammar.


<PROGRAM> ::= { <asst> }*
<asst> ::= <ATOM> = <expr>;
<expr> ::= <term> { [+,-] <term> }*
<term> ::= <factor> {[*,/] <factor> }*
<factor> ::= <ATOM> | <NUMBER> | ( <expr> ) | -<factor>

where,
   PROGRAM is the start symbol
   ATOM and NUMBER, are terminals
   ( ) ; + - * / = eof are explicit strings


You have seen grammars like this before in Assignment 2, where you created and used such a grammar to generate sentences. But more importantly you can use the grammar to recognize whether a sentence was created by your grammar. Let's use this grammar to look at some simple arithmetic examples:

EXAMPLE:
========
3 + 4;
b = 3 + 4;
a =  x + 4 * 5;

Which ones are legal sentences given our grammar?




Information flow


    string   -------------     tokens       ___________           _______________
    a + b    |           |   "a" "+" "b"    |          |          |             |
    =======> |  Lexical  |================> |  Parser  |========> | Interpreter |
             |           |                  |          |          |             |
             -------------                  ------------          ---------------
                                                 ||               _______________
                                                 || parse tree    |             | 
                                                 ===============> |  Compiler   |
                                                                  |             |
                                                                  ---------------

The Parts of the Problem

Consider the string A+B

	LEXER: decides if A+B is "A" "+" "B" or a variable name "A+B"
	PARSER: decides if writing "A" "+" "B" is a legal sentence
	INTERPRETER: decides what "+" means and what "A" and "B" map to.
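
To make the lexer's decision concrete, here is a minimal sketch of a function that splits a string into token strings. The name split_tokens and its rules (a run of letters and digits is one token, any other non-blank character is a token by itself) are assumptions made for illustration, not the lecture's lexer.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper: split an input string into token strings.
// A run of letters and digits forms one token; every other
// non-blank character is a one-character token.
std::vector<std::string> split_tokens(const std::string &input) {
    std::vector<std::string> tokens;
    std::size_t i = 0;
    while (i < input.size()) {
        unsigned char c = static_cast<unsigned char>(input[i]);
        if (std::isspace(c)) {          // blanks only separate tokens
            ++i;
            continue;
        }
        std::size_t start = i;
        if (std::isalnum(c)) {
            // Extend the token while letters or digits continue.
            while (i < input.size() &&
                   std::isalnum(static_cast<unsigned char>(input[i])))
                ++i;
        } else {
            ++i;                        // a special character such as + or ;
        }
        tokens.push_back(input.substr(start, i - start));
    }
    return tokens;
}
```

Under these rules split_tokens("A+B") produces the three tokens "A", "+" and "B", which is the first of the two readings the lexer has to choose between.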

Information flow in the Simple Arithmetic Interpreter

There are many ways to put these three pieces together. We may not even want to use all three pieces. For example, instead of an interpreter that "evaluates" the result, we could have a compiler that simply "translates" the parse tree into machine code instructions.
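
As a rough illustration of the compiler alternative, the sketch below walks a parse tree and emits instructions for an imaginary stack machine. The node type, the helper functions, and the PUSH/APPLY instruction names are all invented for this example; real machine code would look quite different.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical parse-tree node: a leaf (number or variable name)
// or a binary operator with two children.
struct node {
    std::string text;                    // "+", "*", "a", "6", ...
    std::unique_ptr<node> left, right;   // both null for a leaf
};

std::unique_ptr<node> leaf(const std::string &text) {
    std::unique_ptr<node> n(new node);
    n->text = text;
    return n;
}

std::unique_ptr<node> binary(const std::string &op,
                             std::unique_ptr<node> l,
                             std::unique_ptr<node> r) {
    std::unique_ptr<node> n(new node);
    n->text = op;
    n->left = std::move(l);
    n->right = std::move(r);
    return n;
}

// Post-order walk: emit code for both operands, then the operator.
// Nothing is evaluated here; we only translate.
void compile(const node &n, std::vector<std::string> &out) {
    if (!n.left) {
        out.push_back("PUSH " + n.text);
        return;
    }
    compile(*n.left, out);
    compile(*n.right, out);
    out.push_back("APPLY " + n.text);
}
```

For the tree of a + b this emits PUSH a, PUSH b, APPLY +. The same tree could instead be handed to an interpreter, which would compute a value rather than emit instructions.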

Grammars and Parsing

It is relatively easy to see how the lexer and the interpreter might work for our simple arithmetic language. But it's unclear how the parser does its job: how do we create a recognizer from a grammar? It turns out that if you design your grammar well, it can be amazingly simple to do.

We will use a very simple recursive descent parser. This parser is suitable only for grammars that obey the following properties: (a) the LHS has only one non-terminal symbol (b) if the RHS has multiple possibilities, all we need to do is look at the next token, to determine which rule applies. Basically, we will not allow the grammar to be ambiguous.

Here's how it works. Given a lexer input (a list of tokens) and a grammar (a list of rules), we will start with the START symbol in the grammar and keep replacing non-terminal symbols from the left, in a recursive fashion. That is, until we actually hit a terminal symbol. Then we'll take a token from our lexer input and see if it matches the expectation. If not, then we say it failed. If so, then we continue expanding the rest of the non-terminals in the same way as before. The pattern in which we explore looks a lot like the way we explored the state space tree in Asst 3 -- hence the name "recursive descent".

<PROGRAM> ::= { <asst> }*
<asst> ::= <ATOM> = <expr>;
<expr> ::= <term> { [+,-] <term> }*
<term> ::= <factor> {[*,/] <factor> }*
<factor> ::= <ATOM> | <NUMBER> | ( <expr> ) | -<factor>

where,
   PROGRAM is the start symbol
   ATOM and NUMBER, are terminals
   ( ) ; + - * / = eof are explicit strings

Suppose the Lexer gives us the token list A = 5 + 6; Then we can use the above grammar to recognize or reject it.

PROGRAM
 |
asst asst asst .....(zero or more)       // We will expand the left-most case
 |
ATOM = expr ;                            // We got a TERMINAL!
                                         // (is A an atom, yes! therefore continue)
     = expr ;				 // We got another TERMINAL (OPERATOR =)
                                         // (is = the symbol =, yes!)
       expr ;				 // We must expand expr
        |
       term (possibly) +/- term ....     // We will expand left-most term
        |
       factor (possibly) */ factor ...
        |
       MULTIPLE CHOICES!!!
       ATOM, NUMBER, (expr), -factor     // note that the symbols "(" and "-" help here
                                         // (Look ahead one token -- 5 -- pick number!)
       NUMBER



Now that we've completed this branch we can go back up and take other branches. Our current input is 5 + 6 ;



NUMBER (possibly) */ factor             // This won't work, since next token is +
NUMBER (possibly) +/- terms...          // Going up further, this will work!

Indeed 5 is a number. And the next symbol is +. So we now have to expand the next part

NUMBER +/- term ....
            | does our remaining input ( 6 ;) work with this?


As we can see, this branch isn't much different from what we just did. Term will reduce to the number 6. The next token is then not a "+" but a semicolon. This causes us to pop back up to the top: expr is finished, and the semicolon matches the end of the asst rule. So our sentence is legal.
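
The hand trace above can be sketched as a small recognizer with one function per grammar rule. This is only an illustration under simplifying assumptions: tokens are plain strings, an ATOM is anything starting with a letter, a NUMBER is anything starting with a digit, and the recognizer struct and its member names are hypothetical.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical recognizer: each rule consumes tokens from `toks`
// starting at `pos` and returns false on a mismatch.
struct recognizer {
    std::vector<std::string> toks;
    std::size_t pos;

    explicit recognizer(const std::vector<std::string> &t) : toks(t), pos(0) {}

    // The next token, or "" once the input is exhausted.
    std::string peek() const { return pos < toks.size() ? toks[pos] : ""; }

    // Consume the next token only if it is exactly `s`.
    bool match(const std::string &s) {
        if (peek() != s) return false;
        ++pos;
        return true;
    }

    bool is_atom(const std::string &s) const {
        return !s.empty() && std::isalpha(static_cast<unsigned char>(s[0]));
    }
    bool is_number(const std::string &s) const {
        return !s.empty() && std::isdigit(static_cast<unsigned char>(s[0]));
    }

    // <factor> ::= <ATOM> | <NUMBER> | ( <expr> ) | -<factor>
    bool factor() {
        std::string t = peek();     // one token of lookahead picks the case
        if (is_atom(t) || is_number(t)) { ++pos; return true; }
        if (match("(")) return expr() && match(")");
        if (match("-")) return factor();
        return false;
    }

    // <term> ::= <factor> { [*,/] <factor> }*
    bool term() {
        if (!factor()) return false;
        while (peek() == "*" || peek() == "/") {
            ++pos;
            if (!factor()) return false;
        }
        return true;
    }

    // <expr> ::= <term> { [+,-] <term> }*
    bool expr() {
        if (!term()) return false;
        while (peek() == "+" || peek() == "-") {
            ++pos;
            if (!term()) return false;
        }
        return true;
    }

    // <asst> ::= <ATOM> = <expr> ;
    bool asst() {
        if (!is_atom(peek())) return false;
        ++pos;
        return match("=") && expr() && match(";");
    }

    // <PROGRAM> ::= { <asst> }*
    bool program() {
        while (pos < toks.size())
            if (!asst()) return false;
        return true;
    }
};
```

Given the token list A = 5 + 6 ; the call to program() follows the same path as the trace: asst matches A and =, expr descends to factor to accept 5, the + is picked up on the way back in expr, and the final ; is matched by asst.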

Arithmetics: Implementation

Now we will look in more detail at how each piece is implemented. Note that, although we may think of this design as having three distinct phases that run one at a time in a sequence, we don't have to implement it that way. In fact the implementation we will look at behaves more like "streams" in Scheme. You can think of it this way: the interpreter "pulls" a statement from the parser, which pulls tokens from the lexer, which pulls characters from the input stream. As a result, we are only computing things "just-in-time". If we make an error, this implementation will tell us quickly and not waste time lexing and parsing the remaining program.

  program file ==> Lexer ==> Parser ==> Interpreter      (Push model)

Acts more like

  program file <== Lexer <== Parser <== Interpreter      (Pull model)

Just in time computation at work!!
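
A small sketch of the pull idea, assuming a hypothetical pull_lexer class that is not the lecture's lexer: it reads characters from its stream only when asked for a token, and counts how many it has consumed so the laziness is visible.

```cpp
#include <cassert>
#include <cctype>
#include <cstdio>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical pull-style lexer: characters are read from the stream
// only when the caller asks for the next token.
class pull_lexer {
public:
    explicit pull_lexer(std::istream &in) : in_(in), chars_read_(0) {}

    // Return the next token, or "" at end of input.
    std::string next() {
        int c = in_.get();
        while (c != EOF && std::isspace(c)) {   // skip leading blanks
            ++chars_read_;
            c = in_.get();
        }
        if (c == EOF) return "";
        ++chars_read_;
        std::string tok(1, static_cast<char>(c));
        if (std::isalnum(c)) {
            // Keep pulling while the token continues; peek() does not consume.
            while (std::isalnum(in_.peek())) {
                tok += static_cast<char>(in_.get());
                ++chars_read_;
            }
        }
        return tok;
    }

    // Number of characters consumed from the stream so far.
    std::size_t chars_read() const { return chars_read_; }

private:
    std::istream &in_;
    std::size_t chars_read_;
};
```

After two calls to next() on the input "a = 1; ..." only three characters have been consumed, however long the rest of the program is. If the parser rejects one of those early tokens, the remainder is never lexed.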

Arithmetics: The Lexical Analyzer

The Lexical analyzer will do a bunch of tasks

Now we will look at the implementation, which involves two classes: token_class and lexer.

(Note: The code for this lecture will be available from the website. Here we mainly show relevant excerpts)


// Some restricted type definitions to help us.

enum tokentype {                    // Possible Token types
        TOK_SYMBOL,
        TOK_NUMBER, 
	TOK_ASSIGN, 
        TOK_SEMICOLON, 
        TOK_OP,
        TOK_LEFT_PAREN, 
	TOK_RIGHT_PAREN, 
        TOK_EOF
};

enum optype {                      // Possible Operator Types
        OP_DIVIDE, 
        OP_MINUS, 
        OP_PLUS, 
        OP_TIMES 
};

// Instead of writing int or long everywhere, we use "avalue"

typedef long avalue;    



// EXCERPTS FROM token.h
//-----------------------


// Now we can define the TOKEN CLASS

class token_class {
   public:
 
    // CONSTRUCTORS

    token_class(tokentype t);     // can create tokens
    token_class(optype o);        // from many types of things
    token_class(avalue v);
    token_class(char *p);
    ~token_class();

    // MUTATOR:

    void setval(avalue v);      // if the token is a SYMBOL
	                	// then we can set its numerical value

        // ACCESSORS
    avalue	getval(void);  // we can get information about a token
    string	getname(void);
    bool	getinit(void);
    optype	op_of(void);

        // TYPE AND PREDICATES to check type
    tokentype	type_of(void);
    bool	is_plus();
    bool	is_minus();
    bool	is_times();
    bool	is_divide();

private:
    tokentype	type;	// Type of this token
    string	*name;	// Print name (valid only for TOK_SYMBOL)
    optype    	op;	// Operator type (valid only for TOK_OP)
    avalue    	val;	// Value (valid only for TOK_NUMBER, or TOK_SYMBOL)
    bool	isinit;	// True if val is set for TOK_SYMBOL
};



Note that the token class is essentially a container that can contain objects of possibly many types. This is especially obvious from the private data members, most of which only apply to one type of token. Our language is simple so this is not too cumbersome, but how else could you design tokens?
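
One possible answer, sketched here purely as an illustration, is a small class hierarchy in which each kind of token is its own subclass and carries only the fields it needs. The class names token_base, number_token, symbol_token and op_token are hypothetical and do not appear in the course code.

```cpp
#include <cassert>
#include <string>

typedef long avalue;

// Hypothetical alternative design: one subclass per kind of token.
class token_base {
public:
    virtual ~token_base() {}
    virtual std::string describe() const = 0;
};

// A number needs only its value.
class number_token : public token_base {
public:
    explicit number_token(avalue v) : val(v) {}
    std::string describe() const { return "number " + std::to_string(val); }
    avalue val;
};

// A symbol needs a name, a value, and a flag saying whether the value is set.
class symbol_token : public token_base {
public:
    explicit symbol_token(const std::string &n)
        : name(n), val(0), isinit(false) {}
    std::string describe() const { return "symbol " + name; }
    std::string name;
    avalue val;
    bool isinit;
};

// An operator needs only its character.
class op_token : public token_base {
public:
    explicit op_token(char c) : op(c) {}
    std::string describe() const { return std::string("operator ") + op; }
    char op;
};
```

The trade-off is that code which needs to know the kind of a token must now use virtual functions or casts, whereas the single container class can simply check type_of().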


// EXCERPTS FROM token.cxx Implementation
//----------------------------------------

// Some Constructors

token_class::token_class(tokentype t){
     type = t;
}

token_class::token_class(optype o){
     type = TOK_OP;
     op = o;
}

token_class::token_class(avalue v) {
     type = TOK_NUMBER;
     val = v;
}

// How we deal with SYMBOLS

token_class::token_class(char *p) {
		type = TOK_SYMBOL;
		name = new string(p);
		isinit = false;
}

void token_class::setval(avalue v) { 
	if (type != TOK_SYMBOL)
		throw iexcept ("setval only allowed on SYMBOL");
	val = v;
	isinit = true;
}

// Some Accessors and Predicates

avalue	token_class::getval(void)	{ return (val); }
string	token_class::getname(void)	{ return (*name); }
bool	token_class::is_plus()          { return (op == OP_PLUS); }

A key thing to note here is how we deal with symbols. A symbol has a name but it also has a potential value. The boolean isinit tells us if this variable has been assigned a value yet or not. The method setval allows us to set a symbol's value.
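
The sketch below illustrates why the isinit flag is useful, using a simplified stand-in for a symbol token and a table keyed by name. The symbol struct, the lookup_or_add helper, and the check inside getval are assumptions for this example; the excerpted getval above does not perform that check.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>

typedef long avalue;

// Hypothetical stand-in for a TOK_SYMBOL token.
struct symbol {
    std::string name;
    avalue val;
    bool isinit;

    symbol() : val(0), isinit(false) {}
    explicit symbol(const std::string &n) : name(n), val(0), isinit(false) {}

    // Assigning a value also records that the symbol is now initialized.
    void setval(avalue v) { val = v; isinit = true; }

    // Assumption: reading a symbol before it is assigned is an error.
    avalue getval() const {
        if (!isinit)
            throw std::runtime_error("uninitialized variable: " + name);
        return val;
    }
};

// Hypothetical symbol table access: one shared entry per variable name.
symbol &lookup_or_add(std::map<std::string, symbol> &table,
                      const std::string &name) {
    std::map<std::string, symbol>::iterator it = table.find(name);
    if (it == table.end())
        it = table.insert(std::make_pair(name, symbol(name))).first;
    return it->second;
}
```

Because every mention of a name resolves to the same table entry, a value stored while handling "a = 1;" is the value seen when a later statement reads a.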

The lexer interface provides some key methods that the parser will use in order to request and return tokens: get and unget. In addition the lexer maintains the symbol table, so that variables are handled correctly.


// EXCERPTS from lexer.h and lexer.cxx
//-------------------------------------

// The lexer class is a "static" or singleton class.
// We do not expect to create any objects, most methods/data are static
// This is necessary for allowing us to use the stream operators

class lexer {

  public:
	void init(void);         // Initialize (empty) the symbol table

	token get(void);         // Get the next token (from cin)
	void unget(void);        // Put the token back (useful for peeking)
	bool eof(void);          // Check if end of stream

	// These methods need to be called from the stream operators, 
        // they must be static (globally defined)

	static void add_token(token tok);   // add token to the Symbol Table
	static token lookup (char *name);   // lookup symbol in Symbol Table

  private:
 	static map<string, token> symTable;   // The symbol table
	static token _lasttok;		     // Last token read
	static bool _backed_up;		     // True if lasttok is set
};


// Managing the Symbol Table
//---------------------------

void lexer::add_token(token tok) { 
       symTable[tok->getname()] = tok; 
}

token lexer::lookup (char *name) {
	string s(name);

	if (symTable.find(s) == symTable.end())
		return (NULL);
	else 
		return (symTable[s]);
}

// Managing the Input Stream
//---------------------------

// Check if there are any more tokens in the input stream

bool lexer::eof(void)
{
	if (!_backed_up) {
		cin >> _lasttok;   // Read a token from input stream cin
		if (_lasttok != NULL)  // store the token in "lasttok"
			_backed_up = true;   // and remember in "backedup"
	}
	return (!_backed_up);
}

// Get the next token from the input stream

token lexer::get(void)
{
	if (eof()) // check the next token
            throw iexcept("premature eof");

        // Give away the token (no longer "backedup")
	_backed_up = false;
	return (_lasttok);
}

// Return the last token to the stream

void
lexer::unget(void)
{
	assert(_lasttok != NULL && !_backed_up);

	_backed_up = true;
}

// Notice that lasttok and backedup only give us enough space to
// remember one token, the most recent one. This is because the parser
// will never need to look more than one token ahead to take decisions
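
The one-token buffer can be tried out in isolation. The following is a minimal sketch, assuming a hypothetical token_buffer class that hands out strings from a vector instead of reading cin. Its eof() is simplified and, unlike the lexer's eof() above, does not read ahead.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical token source with a one-token "unget" buffer.
class token_buffer {
public:
    explicit token_buffer(const std::vector<std::string> &toks)
        : toks_(toks), next_(0), backed_up_(false) {}

    // True when nothing is backed up and the input is used up.
    bool eof() const { return !backed_up_ && next_ >= toks_.size(); }

    // Hand out the backed-up token if there is one, else read a new one.
    std::string get() {
        if (backed_up_) {
            backed_up_ = false;
            return last_;
        }
        assert(next_ < toks_.size());
        last_ = toks_[next_++];
        return last_;
    }

    // Put the most recent token back; only one level of undo is kept.
    void unget() {
        assert(!backed_up_);
        backed_up_ = true;
    }

private:
    std::vector<std::string> toks_;
    std::size_t next_;       // index of the next unread token
    std::string last_;       // most recent token handed out
    bool backed_up_;         // true if last_ should be handed out again
};
```

A parser function that reads one token too many calls unget(), and the next get() returns that same token to whichever function needs it.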

In the lexer method eof() we see that it is possible to read a token from the input stream, just as we have read integers or strings in the past.

  int t;
  token tok;
  cin >> t;    // provided by cin implementation
  cin >> tok;  // we must implement this

This is achieved by overloading the operator >> to behave correctly with tokens. Here is a pseudocode version of how this works:

// Overload input/output operators.  
//----------------------------------
// Streams are a very convenient way of dealing with input and output
// We've already seen this in the form of cout and cin
// Here we will use streams to manage our program input
// However there is a bit of cost,
// Overloading these operators involves a lot of funny syntax
// See Lippman, page 129 for an explanation and example.

/* Note that streams provide some nice functions
	s >> ws;      // Consumes white space
	s.get();      // Get a character from the stream
	s.putback(c); // Return a character back to the stream
	s.eof();      // Test for end of stream
*/

istream& operator>> (istream &s, token &tok){

    s >> ws;                    // Skip whitespace
    char c = s.get();           // Get first character
    
    if (s.eof()){
               tok = NULL;     // If EOF, then nothing left to do

    } else if (isdigit(c)){    // If IT'S A DIGIT, 
          s.putback(c);        // then read off the integer
          int v;
          s >> v;
          tok = new token_class((avalue) v);   // and wrap it in a NUMBER token

    } else if (isalpha(c)){   // IF IT'S AN ENGLISH ALPHABET LETTER, 
                              // then must be a symbol
        ..read characters into the string 
          "variablename" until not alpha/digit..

       // then check in case it's already in the symbol table
        tok = lexer::lookup(variablename);   
        if (tok == NULL) {
           tok = new token_class(variablename);
           lexer::add_token(tok);
        }

   } else {                   // must be a SPECIAL CHARACTER, (+,*,-)
      ... create the appropriate token....
   }          

   // must return the stream at the end
   return(s);
}
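
For comparison with the pseudocode, here is a compilable sketch of the same overloading technique on a deliberately simplified token type. The simple_token struct and its kind names are assumptions for illustration; it has no symbol table and stores special characters as text.

```cpp
#include <cassert>
#include <cctype>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical simplified token, used only for this sketch.
struct simple_token {
    enum kind_t { END, NUMBER, SYMBOL, SPECIAL } kind;
    long value;          // valid for NUMBER
    std::string text;    // valid for SYMBOL and SPECIAL
    simple_token() : kind(END), value(0) {}
};

// Overloaded input operator: lets us write  stream >> tok;
std::istream &operator>>(std::istream &s, simple_token &tok) {
    tok = simple_token();
    s >> std::ws;                          // skip whitespace
    int c = s.peek();                      // look at the first character
    if (c == std::char_traits<char>::eof()) {
        tok.kind = simple_token::END;      // nothing left to read
    } else if (std::isdigit(c)) {
        tok.kind = simple_token::NUMBER;   // let the stream parse the integer
        s >> tok.value;
    } else if (std::isalpha(c)) {
        tok.kind = simple_token::SYMBOL;   // letters and digits form a name
        while (std::isalnum(s.peek()))
            tok.text += static_cast<char>(s.get());
    } else {
        tok.kind = simple_token::SPECIAL;  // one-character token such as + or ;
        tok.text = std::string(1, static_cast<char>(s.get()));
    }
    return s;                              // must return the stream
}
```

With this in place, reading from a stream holding "b2 = 12+a;" yields the symbol b2, the special character =, the number 12, and so on, one token per use of >>.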

Arithmetics: The Parser

The parser is wishful thinking at its finest. It's so easy to write, it's hard to believe it actually works.

The Parser Class

// Each element in the grammar simply returns an avalue.

class parser {
public:
	parser(interpreter &interp, lexer &lex);
	avalue program(void);

private:
	avalue expr(void);
	avalue term(void);
	avalue factor(void);
	avalue asst(void);

	interpreter &interp;
	lexer &lex;
};

#endif

Some cool features



Recursive Descent Parsing

Implementing a recursive descent parser is extremely simple. For each rule in the grammar we will create a function. The name of the function is the non-terminal symbol on the left (e.g. expr()). The function does what we did by hand - it expands using the RHS of the rule, starting with the leftmost symbol. If the symbol is a non-terminal (i.e. another rule) then it calls that function. If the symbol is a terminal (variable name or number or operator) then it checks to see if the next token is the right thing.

The parser is also a great example of The Power of Exceptions. If we mistype a program in the interpreter (a common fault) it would be terrible if the interpreter just crashed. Instead each function in the parser will throw an exception -- the exception will tell us what kind of mistake we made.
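
A minimal sketch of that pattern follows. The lecture's iexcept class is not shown in these excerpts, so the parse_error class here is a hypothetical stand-in built on std::runtime_error, and expect and check_statement_end are illustrative names.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the lecture's iexcept class.
class parse_error : public std::runtime_error {
public:
    explicit parse_error(const std::string &msg) : std::runtime_error(msg) {}
};

// A low-level parsing step: reject an unexpected token by throwing.
void expect(const std::string &wanted, const std::string &got) {
    if (wanted != got)
        throw parse_error("expected '" + wanted + "' but saw '" + got + "'");
}

// Top level: catch the error and report it instead of crashing.
std::string check_statement_end(const std::string &last_token) {
    try {
        expect(";", last_token);
        return "ok";
    } catch (const parse_error &e) {
        return std::string("error: ") + e.what();
    }
}
```

The function that detects the problem does not need to know how it will be reported. The exception carries the message up through however many recursive calls are active until something catches it.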


<program> ::= { <asst> }*
<asst> ::= <atom> = <expr>;
<expr> ::= <term> { [+,-] <term> }*
<term> ::= <factor> {[*,/] <factor> }*
<factor> ::= <ATOM> | <NUMBER> | ( <expr> ) | -<factor>

where,
   program is the start symbol
   ATOM and NUMBER, are terminals
   ( ) ; + - * / = eof are explicit strings

Implementing Parsing: Lets look at a few examples in pseudocode

GRAMMAR RULE
<expr> ::= <term> { [+,-] <term> }*

BASIC IDEA (PSEUDOCODE)
expr(){
   v1 = term();          // Get a term from the input
   while (next token is operator + or -){
      v2 = term();
      v1 = apply(operator, v1, v2);
   }
}

GRAMMAR RULE
<term> ::= <factor> {[*,/] <factor> }*

SIMILARLY TRANSLATES TO (PSEUDOCODE)
term(){
   v1 = factor();
   while (next token is operator * or /){
      v2 = factor();
      v1 = apply(operator, v1, v2);
   }
}

Code snippets from parser.cxx

avalue parser::expr(void)
{
	avalue v1, v2;
	token tok;

	v1 = term();
        tok = lex.get();

	while ((tok->type_of() == TOK_OP) &&
	       (tok->is_plus() || tok->is_minus()))
        {
		// We got a plus or a minus
		v2 = term();
		if (tok->is_plus())
			v1 = interp.evaluate("plus", v1, v2);
		else 
			v1 = interp.evaluate("minus", v1, v2);
		tok = lex.get();	// Look at the token after this term
	}

	// Return the token that wasn't part of this expression
	lex.unget();

	return (v1);
}

avalue
parser::term(void)
{
	avalue v1, v2;
	token tok;

	v1 = factor();
        tok = lex.get();

	while ((tok->type_of() == TOK_OP) &&
	       (tok->is_divide() || tok->is_times()))
        {
		v2 = factor();
		if (tok->is_times())
			v1 = interp.evaluate("multiply", v1, v2);			
		else
			v1 = interp.evaluate("divide", v1, v2);
		tok = lex.get();	// Look at the token after this factor
	}

	// Return the token that wasn't part of this expression
	lex.unget();

	return (v1);
}
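
To experiment with the same structure without the lexer and interpreter classes, here is a self-contained sketch that evaluates <expr>, <term> and <factor> directly over a string. The mini_parser class is hypothetical; it supports numbers, parentheses and unary minus but no variables, and it does the arithmetic inline rather than calling an interpreter.

```cpp
#include <cassert>
#include <cctype>
#include <stdexcept>
#include <string>

typedef long avalue;

// Hypothetical evaluator: one member function per grammar rule.
class mini_parser {
public:
    explicit mini_parser(const std::string &text) : text_(text), pos_(0) {}

    // <expr> ::= <term> { [+,-] <term> }*
    avalue expr() {
        avalue v1 = term();
        for (;;) {
            char op = peek();
            if (op != '+' && op != '-') return v1;   // not ours: leave it
            ++pos_;                                  // consume the operator
            avalue v2 = term();
            v1 = (op == '+') ? v1 + v2 : v1 - v2;
        }
    }

    // <term> ::= <factor> { [*,/] <factor> }*
    avalue term() {
        avalue v1 = factor();
        for (;;) {
            char op = peek();
            if (op != '*' && op != '/') return v1;
            ++pos_;
            avalue v2 = factor();
            if (op == '/' && v2 == 0)
                throw std::runtime_error("divide by zero");
            v1 = (op == '*') ? v1 * v2 : v1 / v2;
        }
    }

    // <factor> ::= <NUMBER> | ( <expr> ) | -<factor>
    avalue factor() {
        char c = peek();
        if (c == '-') { ++pos_; return -factor(); }
        if (c == '(') {
            ++pos_;
            avalue v = expr();
            if (peek() != ')') throw std::runtime_error("expected )");
            ++pos_;
            return v;
        }
        if (!std::isdigit(static_cast<unsigned char>(c)))
            throw std::runtime_error("expected a number");
        avalue v = 0;
        while (pos_ < text_.size() &&
               std::isdigit(static_cast<unsigned char>(text_[pos_])))
            v = v * 10 + (text_[pos_++] - '0');
        return v;
    }

private:
    // Skip blanks; return the next character without consuming it
    // ('\0' at end of input).
    char peek() {
        while (pos_ < text_.size() && text_[pos_] == ' ') ++pos_;
        return pos_ < text_.size() ? text_[pos_] : '\0';
    }

    std::string text_;
    std::size_t pos_;
};
```

For example, mini_parser("2 + 3 * 4").expr() yields 14. Because expr() calls term() for each operand, the multiplication is finished before the addition is applied, so operator precedence falls out of the shape of the grammar.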

The very last part is the interpreter


class interpreter {
public:
	// Evaluate a token to produce a value
	// (tok must be of type NUMBER or SYMBOL)
	avalue evaluate (token tok);
	avalue evaluate (string type, avalue v1, avalue v2);

	// Assign a value to a token; return the value assigned
	avalue assign (token tok, avalue v);
};
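
The body of the two-operand evaluate is not shown in these excerpts. A plausible sketch, written as a free function and assuming the operation names are the strings the parser passes ("plus", "minus", "multiply", "divide"), might look like this. The error handling is an assumption too.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

typedef long avalue;

// Hypothetical free-function version of the two-operand evaluate().
avalue evaluate(const std::string &type, avalue v1, avalue v2) {
    if (type == "plus")     return v1 + v2;
    if (type == "minus")    return v1 - v2;
    if (type == "multiply") return v1 * v2;
    if (type == "divide") {
        // Assumption: report division by zero as an error.
        if (v2 == 0) throw std::runtime_error("divide by zero");
        return v1 / v2;
    }
    throw std::runtime_error("unknown operation: " + type);
}
```

Keeping the meaning of each operator here, and out of the parser, is what lets the parser stay a pure description of the grammar.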


int main()
{
	avalue result;
	interpreter interp;
	lexer lex;
	parser parse(interp, lex);

	while (1) {
		lex.init();	// Initialize lexical analyzer.

		try {
			cout << "Please enter your program, "
			     << "ending with a '" << EOF_MARKER << "'.\n"
			     << "Type ^D to quit.\n\n";

			if (lex.eof())
				break;

			result = parse.program();

			// program() may write to cout, so it must be
			// finished before the next statement begins

			cout << "\nValue: " << result << "\n\n";
		}

		catch (iexcept &i) {
			i.what();
			lex.flush();
			continue;
		}
	}

	return (0);
}


CS51, Spring 2008 Radhika Nagpal