Grammar 1: Basic Operators

February 1st, 2007

This is part of a series of articles that started with Test-Driving a Compiler.

Grammar 0 was silly. We went to the trouble of creating a tokenizer and a recursive descent parser for a language where the only valid programs were integer constants. With that silliness out of the way, we can start doing things with these integers. Grammar 1 includes arithmetic and relational operators, as well as nesting expressions with parentheses. The precedence of operators will be as follows, where parentheses are the highest priority, and equality tests are lowest:

Operators Priority
( ) Grouping
* / Multiplicative
+ - Additive
< <= > >= Relational
== != Equality

Starting with Grammar 0, we'll grow this in pieces. Using the arithmetic grammar on this PEG wiki page as a guide, we introduce low-priority operators earlier within the grammar.

	//Grammar 0
	Expression := integer-literal

	//Introduce grouping operators.
	Expression := Value
	Value := integer-literal / '(' Expression ')'

	//Introduce multiplication and division.
	Expression := MultiplicativeExpression
	MultiplicativeExpression := Value (('*' / '/') Value)*
	Value := integer-literal / '(' Expression ')'

	//Introduce addition and subtraction.
	Expression := AdditiveExpression
	AdditiveExpression := MultiplicativeExpression (('+' / '-') MultiplicativeExpression)*
	MultiplicativeExpression := Value (('*' / '/') Value)*
	Value := integer-literal / '(' Expression ')'

	//Introduce comparisons.
	Expression := RelationalExpression
	RelationalExpression := AdditiveExpression (('<' / '<=' / '>' / '>=') AdditiveExpression)*
	AdditiveExpression := MultiplicativeExpression (('+' / '-') MultiplicativeExpression)*
	MultiplicativeExpression := Value (('*' / '/') Value)*
	Value := integer-literal / '(' Expression ')'

One last pair of operators to go, and we have the completed grammar:

	Grammar 1
	Expression := EqualityExpression
	EqualityExpression := RelationalExpression (('==' / '!=') RelationalExpression)*
	RelationalExpression := AdditiveExpression (('<' / '<=' / '>' / '>=') AdditiveExpression)*
	AdditiveExpression := MultiplicativeExpression (('+' / '-') MultiplicativeExpression)*
	MultiplicativeExpression := Value (('*' / '/') Value)*
	Value := integer-literal / '(' Expression ')'

Each change in the above sequence was driven by tests. I won't belabor the point by showing every test and every change in the parsing code as I did in the development of Grammar 0.

The tokenizer recognizes each operator as single token, even when an operator consists of 2 characters. The tokenizer had to be greedy, so that the input string "1<=2" is interpreted as 3 tokens ("1", "<=", "2") rather than 2 tokens followed by garbage ("1", "<", "=2").

The recursive descent parsing code very closely mimics the grammar. Each nonterminal in the grammar has a mutually-recursive method with the same name. The body of the method corresponds with the right-hand-side of that grammar rule. Notice that several of the grammar rules are nearly identical, and the BinaryOperation helper method embodies what those rules have in common (everything other than the particular operators to use and the name of the nonterminal method to call recursively). Here's the part of the code that matches the grammar:


	public sealed class Expression
	{
		private delegate AstNode NodeParser();

		private readonly AstNode _abstractSyntaxTree;
		private readonly Tokenizer _tokenizer;

		public Expression(Tokenizer tokenizer)
		{
			_tokenizer = tokenizer;
			_abstractSyntaxTree = ParseExpression();
		}

		private AstNode ParseExpression()
		{
			return EqualityExpression();
		}

		private AstNode EqualityExpression()
		{
			return BinaryOperation(ParseRelationalExpression, “==”, “!=”);
		}

		private AstNode ParseRelationalExpression()
		{
			return BinaryOperation(AdditiveExpression, “<”, “<=”, “>”, “>=”);
		}

		private AstNode AdditiveExpression()
		{
			return BinaryOperation(MultiplicativeExpression, “+”, “-”);
		}

		private AstNode MultiplicativeExpression()
		{
			return BinaryOperation(Value, “*”, “/”);
		}

		private AstNode Value()
		{
			string token = _tokenizer.Consume();

			if (token == “(”)
			{
				AstNode node = ParseExpression();
				_tokenizer.Consume(“)”);
				return node;
			}

			return new IntegerNode(token);
		}

		private AstNode BinaryOperation(NodeParser parseOperand, params string[] operators)
		{
			AstNode node = parseOperand();

			while (_tokenizer.Peek(operators))
				node = new OperatorNode(node, _tokenizer.Consume(), parseOperand());

			return node;
		}

		public override string ToString()
		{
			return _abstractSyntaxTree.ToString();
		}
	}

At this point, it helps to see the relationship between a sample input string, the tree of nodes produced by the recursion, and the S-Expression that is used as shorthand for that tree. For the input string "1 + 2 * 3 + 4 <= 12 / ( 2 + 1 )", we have the following tree and S-Expression:

Sample Abstract Syntax Tree

Armed with this tree, it's actually pretty easy to evaluate the expression. A pretty straight-forward recursive algorithm would work here.

//Psueodcode
Value(node):
	if node is a leaf:
		return node contents
	else:
		operator = function corresponding with node's token
		return operator(Value(node child A), Value(node child B))

Lastly, the code for the tokenizer started to get hard to read, even though it was passing all of the tests and contained no obvious duplication. There was a lot of string-index-twiddling as I walked along the input string, occasionally peeking at the next token and jumping back to where I was before, etc. Because I had high code coverage in my tests, I was able to confidently rewrite the private helpers, without having to change the tests or the public interface of the tokenizer. At each step in the rewrite, the tests still passed, so I knew that I (probably) didn’t break anything. Now, I use a series of prioritized regular expressions to recognize and extract each new token.

Filed under: TDD, Language Design

Powered by WordPress