Grammar 4: Statements & Function Calls

February 14th, 2007

This is part of a series of articles that started with Test-Driving a Compiler.

Grammar 3 introduced function definitions. Since then, I’ve introduced some unary operators, statements, and function call expressions. First, I added the unary-minus and unary-not operators with a higher priority than multiplication, using tests to confirm the basic parsing routines as well as the relative precedence compared to multiplication/division and parenthetical expressions:

MultiplicativeExpression := UnaryExpression (('*' / '/') UnaryExpression)*
UnaryExpression := ('-' / '!')? Value

Next, I introduced function calls. I expect to have higher-order functions, so function calls can be made against an identifier, as well as parenthetical expressions:

FunctionName(arg1, arg2);
(SomeExpressionThatIsCallable)(arg1, arg2);

To deal with the complexity of the grammar for these cases, I took advantage of the fact that PEG grammars have ordered sub-rules that allow backtracking, as I already discussed for Grammar 2:

Value := boolean-literal
       / integer-literal
       / '(' Expression ')' '(' ArgumentList ')'
       / '(' Expression ')'
       / identifier '(' ArgumentList ')'
       / identifier

Function definitions are already set up for a special shorthand: whenever the entire body is an expression whose value should be returned, curly-braces and an explicit ‘return’ can be omitted. Now, I added the more common scenario for multi-statement function bodies.

Function := identifier '(' ParameterList ')' &'{' CompoundStatement
          / identifier '(' ParameterList ')' Expression

Lastly, I introduced the usual statements available in imperative languages. Here’s the complete grammar to date:

Program := Function*
Function := identifier '(' ParameterList ')' &'{' CompoundStatement
          / identifier '(' ParameterList ')' Expression
ParameterList := &')' / Parameter (',' Parameter)*
Parameter := identifier identifier?

Statement := &'{' CompoundStatement
           / &'if' IfStatement
           / &'for' ForStatement
           / &'while' WhileStatement
           / &'return' ReturnStatement
           / &'yield' YieldStatement
           / identifier '=' Expression ';'
           / Expression ';'
CompoundStatement := '{' Statement* '}'
IfStatement := 'if' '(' Expression ')' Statement ('else' Statement)?
ForStatement := 'for' '(' identifier 'in' Expression ')' Statement
WhileStatement := 'while' '(' Expression ')' Statement
ReturnStatement := 'return' (Expression)? ';'
YieldStatement := 'yield' Expression ';'

Expression := IfExpression
IfExpression := 'if' '(' Expression ')' Expression 'else' Expression
              / OrExpression
OrExpression := AndExpression ('||' AndExpression)*
AndExpression := EqualityExpression ('&&' EqualityExpression)*
EqualityExpression := RelationalExpression (('==' / '!=') RelationalExpression)*
RelationalExpression := AdditiveExpression (('<' / '<=' / '>' / '>=') AdditiveExpression)*
AdditiveExpression := MultiplicativeExpression (('+' / '-') MultiplicativeExpression)*
MultiplicativeExpression := UnaryExpression (('*' / '/') UnaryExpression)*
UnaryExpression := ('-' / '!')? Value
Value := boolean-literal
       / integer-literal
       / '(' Expression ')' '(' ArgumentList ')'
       / '(' Expression ')'
       / identifier '(' ArgumentList ')'
       / identifier
ArgumentList := &')' / Expression (',' Expression)*

I’ve noticed a pattern while developing each of these changes to the grammar. Usually, when I decide on a single rule change or rule addition, that change suggests a change to the tokenizer, such as recognizing a new keyword or symbol. The new rule often suggests a new AST node type as well. So, I have been tending toward the following cycle:

  1. Write a test for the new tokens to be recognized.
  2. Update the tokenizer to pass the test (this has become a smaller and smaller task and is now trivial).
  3. Write a stub implemention of the new AST node type, as well as a test for the new class, asserting that ToString yields an expected S-Expression. The stub ToString returns null so that the test initially fails.
  4. Flesh out the AST node type to pass the test.
  5. Write a test for the Recursive Descent Parser class, for the new grammar rule.
  6. Change the RDP to match the grammar and pass the test.

For nontrivial rules, or rules that need to be composed with other rules, I’ll repeat the last two steps until I’m satisfied that all the relevant variations have been tested.

Filed under: TDD, Language Design

Big, Ugly, Naggingly Red 99s

February 11th, 2007

I’ve been using C# Express Edition so far to write my parser, and I’ve been running NUnit as a separate process. I haven’t been running NCover at all, so far. I like the idea of integrating NUnit and NCover into the IDE, but Express Edition doesn’t allow for plugins (as far as I know). This got me interested in SharpDevelop, which seems to have come a long way since the last time I tried it out. I got the latest version (2.1 RC 1), as well as the latest NCover (1.5.5), opened up the solution file that Express Edition had created, and everything just worked. Kudos to the SharpDevelop crew.

I ran the unit tests with code coverage from inside the IDE, and it turns out that I already have 100% code coverage. This was a pleasant surprise. There was one previous project where I used NUnit, but I only occasionally wrote tests first. In that project, reaching 80% coverage felt like a success, and I considered 90% to be a realistic goal. That was kind of silly of me; I was defining success as having confidence in only 4 out of every 5 lines of code I wrote.

On this project, reaching 100% coverage has been trivial. In fact, it’s been accidental. Since I never wrote a new line without first having a failing test, the chances of slipping in untested code would only come during the short refactoring step in each “Red, green, refactor” cycle. Since the refactoring steps are all about removing duplication, they mostly involve removing code rather than writing entirely new lines. There is always going to be a chance that a little untested code will slip through, despite taking things in these disciplined steps, but that chance is small and will bathe my screen with big, ugly, naggingly red 99s just moments after introducing the offending line of code. Over time, responding immediately to those red 99s will have a Pavlovian effect: every time I introduce untrusted code, I’ll feel the need to 1) immediately reach 100% again before proceeding and 2) crave me some Kibbles ‘n Bits.

Filed under: TDD

Grammar 3: Function Definitons

February 7th, 2007

This is part of a series of articles that started with Test-Driving a Compiler.

Grammar 3 introduces functions, identifiers, and programs composed of multiple function definitions. Here’s the grammar, with changes from Grammar 2 in bold:

Program := Function*
Function := identifier '(' ParameterList ')' Expression
ParameterList := &')'
                  / Parameter (',' Parameter)*
Parameter := identifier identifier?
Expression := IfExpression
IfExpression := 'if' Expression Expression 'else' Expression
              / 'if' Expression Expression
              / OrExpression
OrExpression := AndExpression ('||' AndExpression)*
AndExpression := EqualityExpression ('&&' EqualityExpression)*
EqualityExpression := RelationalExpression (('==' / '!=') RelationalExpression)*
RelationalExpression := AdditiveExpression (('<' / '<=' / '>' / '>=') AdditiveExpression)*
AdditiveExpression := MultiplicativeExpression (('+' / '-') MultiplicativeExpression)*
MultiplicativeExpression := Value (('*' / '/') Value)*
Value := boolean-literal / integer-literal / identifier / '(' Expression ')'

In English (more or less), this says that a program is zero or more functions. A function is an identifier, a possibly-empty parameter list, and a body composed of a single expression. Items in the parameter list are each specified by one or two identifiers (a name, or a name with an explicit type identifier).

My intention is to use type inference with optional explicit type declarations, so that’s why parameters don’t require the explicit type identifier. Also, since the entire function body is just one expression, I don’t need curly braces { } to mark the start and end, and I don’t need an explicit ‘return’ keyword. Multi-statement functions with curly braces and return statements will likely be included later, but I like having a shorthand for one-liner functions.

In the ParameterList grammar rule, I start by testing &')'. This is PEG notation for peeking ahead. &')' will match when the next token is a ‘)’, and will fail otherwise. Upon failure, we try the next part of the rule. Whether the test for ‘)’ passes or fails, none of the input is actually consumed by the test. So, when a parameter list is empty, this lets us avoid looking for any parameters while protecting the fact that the Function rule will expect to consume ‘)’. As I’ve stated before, I have one parsing method per grammar rule. In the implementation of ParameterList, I perform the &')' part by calling the Peek method, which returns true if the next token matches the given string, without consuming any of the input:

// ParameterList := &')' / Parameter (',' Parameter)*
private ParameterListNode ParameterList()
{
	List<ParameterNode> parameters = new List<ParameterNode>();

	if (_tokenizer.Peek(")"))
		return new ParameterListNode(parameters);

	parameters.Add(Parameter());
	while (_tokenizer.TryConsume(","))
		parameters.Add(Parameter());

	return new ParameterListNode(parameters);
}

Filed under: Language Design

Grammar 2: Lost in New York

February 2nd, 2007

This is part of a series of articles that started with Test-Driving a Compiler.

Some of the operators introduced in Grammar 1 have to do with boolean tests: ==, !=, <, etc. Further boolean support introduced in this article includes && (and), || (or), and if statements. Different languages deal with if statements in different ways. In fact, even calling them if statements can be a misnomer in some languages. Usually, language designers make a distinction between statements and expressions. Expressions have values ("2*3"), and statements don’t ("int x;"). In some languages, the distinction is blurry: sometimes expressions are valid statements, sometimes everything is an expression, etc. I’ve decided that I like the idea of letting ifs have values. The value of an if expression will be the last expression executed in either of the two blocks. One nice thing about letting ifs have values is that we can do away with the pesky ?: operator from C. The purpose of ?: is like blessing if statements with expression semantics.

Changing a grammar to support nested if expressions correctly is not easy! In general, we want each ‘else’ block to match up with the innermost preceding ‘if’, but it turns out that it’s easy to create a grammar for if statements that is ambiguous (in other words, two completely different trees could be generated by the same grammar, each of which would have a different meaning at run time). This complication is called the Dangling Else Problem.

I’m coding my parser manually, which is unusual. Most languages use parser-generators that take a grammar as input and generate code for you that does a lot of the parsing work. I prefer doing it manually because there is a much smaller learning curve involved, and I like that I can fit it all in my head at once (the real reason is that this is a goofy side project of no import, and it’s just plain entertaining to reinvent this wheel).

For those who rely on a third-party tool, the dangling else problem is a bigger deal. Depending on the tool, one might solve the problem by adding grammar rules that remove any ambiguity, which can make the grammar much harder to understand at a glance, or by asking the parser generator to detect ambiguity and resolve it with a particular heuristic.

This is a much smaller problem when you use PEG grammars, even when using a parser generator, because the slash symbol / implies a priority and backtracking that other notations do not. For the PEG rule e1 := e2 / e3, the parser generator is supposed to first attempt parsing e2. If e2 fails, then it backtracks and attempts e3 before deciding whether the rule has actually failed. That backtracking is what lets us fix the grammar without as much complexity as other notations. Even though I’m writing the parser manually, using the PEG notation still simplifies the problem. Knowing what a general-purpose PEG parser generator would do (try-and-backtrack), I can write a manual parser that is also unambiguous, without actually having to waste time with backtracking. The best way to illustrate that is by showing the new grammar as well as the parsing method which implements the new IfExpression rule.

	Grammar 1
	Expression := EqualityExpression
	EqualityExpression := RelationalExpression (('==' / '!=') RelationalExpression)*
	RelationalExpression := AdditiveExpression (('<' / '<=' / '>' / '>=') AdditiveExpression)*
	AdditiveExpression := MultiplicativeExpression (('+' / '-') MultiplicativeExpression)*
	MultiplicativeExpression := Value (('*' / '/') Value)*
	Value := integer-literal / '(' Expression ')'

	Grammar 2
	Expression := IfExpression
	IfExpression := 'if' Expression Expression 'else' Expression
	              / 'if' Expression Expression
	              / OrExpression
	OrExpression := AndExpression ('||' AndExpression)*
	AndExpression := EqualityExpression ('&&' EqualityExpression)*
	EqualityExpression := RelationalExpression (('==' / '!=') RelationalExpression)*
	RelationalExpression := AdditiveExpression (('<' / '<=' / '>'/ '>=') AdditiveExpression)*
	AdditiveExpression := MultiplicativeExpression (('+' / '-') MultiplicativeExpression)*
	MultiplicativeExpression := Value (('*' / '/') Value)*
	Value := boolean-literal / integer-literal / '(' Expression ')'

The IfExpression rule is adapted from Bryan Ford’s work on PEGs,
in particular this slide show about PEGs (PDF).

This rule works because the ‘else’ version is supposed to be attempted first; we’re supposed to be gluttonous when it comes to eating ‘else’ blocks, which gives us Lioi Life Lesson #1:

Lioi Life Lesson #1: Gluttony is bad for your health, unless you’re talking about a solution to the dangling else problem or Mallomars.*

*Consult a physician before following any Lioi Life Lesson.

In this particular case, the recursive methods that correspond with each rule can benefit from the test-and-backtrack approach without actually having to backtrack. If the next token is not ‘if’, then we know that the 3rd part of the rule is in effect, and defer to the OrExpression rule. If instead the next token is ‘if’, we know we have to parse two Expressions (condition and when-true). After that, we can peek ahead one token, to see if ‘else’ is present. If so, we know we’re in the first part of the rule, which has a higher priority. If ‘else’ isn’t present, then we’re done. We don’t explicitly backtrack, and we don’t waste time traversing paths that will ultimately fail, but we still benefit from the way that PEG grammars avoid ambiguity.


private AstNode IfExpression()
{
	if (_tokenizer.Peek("if"))
	{
		_tokenizer.Consume("if");

		AstNode condition = ParseExpression();
		AstNode whenTrue = ParseExpression();

		if (_tokenizer.Peek("else"))
		{
			_tokenizer.Consume("else");
			AstNode whenFalse = ParseExpression();
			return new IfNode(condition, whenTrue, whenFalse);
		}

		return new IfNode(condition, whenTrue);
	}

	return OrExpression();
}

All the theory in the world won’t give me enough confidence to trust this code on its own, so it was driven by a few tests. The first tests confirmed that the expected S-Expressions resulted from the simplest input if expressions, and then I followed those tests with a test that had a single nested if expression inside each block of an outer if expression. The ‘else’ blocks always matched up with the appropriate ‘if’.

Filed under: Language Design

Grammar 1: Basic Operators

February 1st, 2007

This is part of a series of articles that started with Test-Driving a Compiler.

Grammar 0 was silly. We went to the trouble of creating a tokenizer and a recursive descent parser for a language where the only valid programs were integer constants. With that silliness out of the way, we can start doing things with these integers. Grammar 1 includes arithmetic and relational operators, as well as nesting expressions with parentheses. The precedence of operators will be as follows, where parentheses are the highest priority, and equality tests are lowest:

Operators Priority
( ) Grouping
* / Multiplicative
+ - Additive
< <= > >= Relational
== != Equality

Starting with Grammar 0, we'll grow this in pieces. Using the arithmetic grammar on this PEG wiki page as a guide, we introduce low-priority operators earlier within the grammar.

	//Grammar 0
	Expression := integer-literal

	//Introduce grouping operators.
	Expression := Value
	Value := integer-literal / '(' Expression ')'

	//Introduce multiplication and division.
	Expression := MultiplicativeExpression
	MultiplicativeExpression := Value (('*' / '/') Value)*
	Value := integer-literal / '(' Expression ')'

	//Introduce addition and subtraction.
	Expression := AdditiveExpression
	AdditiveExpression := MultiplicativeExpression (('+' / '-') MultiplicativeExpression)*
	MultiplicativeExpression := Value (('*' / '/') Value)*
	Value := integer-literal / '(' Expression ')'

	//Introduce comparisons.
	Expression := RelationalExpression
	RelationalExpression := AdditiveExpression (('<' / '<=' / '>' / '>=') AdditiveExpression)*
	AdditiveExpression := MultiplicativeExpression (('+' / '-') MultiplicativeExpression)*
	MultiplicativeExpression := Value (('*' / '/') Value)*
	Value := integer-literal / '(' Expression ')'

One last pair of operators to go, and we have the completed grammar:

	Grammar 1
	Expression := EqualityExpression
	EqualityExpression := RelationalExpression (('==' / '!=') RelationalExpression)*
	RelationalExpression := AdditiveExpression (('<' / '<=' / '>' / '>=') AdditiveExpression)*
	AdditiveExpression := MultiplicativeExpression (('+' / '-') MultiplicativeExpression)*
	MultiplicativeExpression := Value (('*' / '/') Value)*
	Value := integer-literal / '(' Expression ')'

Each change in the above sequence was driven by tests. I won't belabor the point by showing every test and every change in the parsing code as I did in the development of Grammar 0.

The tokenizer recognizes each operator as single token, even when an operator consists of 2 characters. The tokenizer had to be greedy, so that the input string "1<=2" is interpreted as 3 tokens ("1", "<=", "2") rather than 2 tokens followed by garbage ("1", "<", "=2").

The recursive descent parsing code very closely mimics the grammar. Each nonterminal in the grammar has a mutually-recursive method with the same name. The body of the method corresponds with the right-hand-side of that grammar rule. Notice that several of the grammar rules are nearly identical, and the BinaryOperation helper method embodies what those rules have in common (everything other than the particular operators to use and the name of the nonterminal method to call recursively). Here's the part of the code that matches the grammar:


	public sealed class Expression
	{
		private delegate AstNode NodeParser();

		private readonly AstNode _abstractSyntaxTree;
		private readonly Tokenizer _tokenizer;

		public Expression(Tokenizer tokenizer)
		{
			_tokenizer = tokenizer;
			_abstractSyntaxTree = ParseExpression();
		}

		private AstNode ParseExpression()
		{
			return EqualityExpression();
		}

		private AstNode EqualityExpression()
		{
			return BinaryOperation(ParseRelationalExpression, “==”, “!=”);
		}

		private AstNode ParseRelationalExpression()
		{
			return BinaryOperation(AdditiveExpression, “<”, “<=”, “>”, “>=”);
		}

		private AstNode AdditiveExpression()
		{
			return BinaryOperation(MultiplicativeExpression, “+”, “-”);
		}

		private AstNode MultiplicativeExpression()
		{
			return BinaryOperation(Value, “*”, “/”);
		}

		private AstNode Value()
		{
			string token = _tokenizer.Consume();

			if (token == “(”)
			{
				AstNode node = ParseExpression();
				_tokenizer.Consume(“)”);
				return node;
			}

			return new IntegerNode(token);
		}

		private AstNode BinaryOperation(NodeParser parseOperand, params string[] operators)
		{
			AstNode node = parseOperand();

			while (_tokenizer.Peek(operators))
				node = new OperatorNode(node, _tokenizer.Consume(), parseOperand());

			return node;
		}

		public override string ToString()
		{
			return _abstractSyntaxTree.ToString();
		}
	}

At this point, it helps to see the relationship between a sample input string, the tree of nodes produced by the recursion, and the S-Expression that is used as shorthand for that tree. For the input string "1 + 2 * 3 + 4 <= 12 / ( 2 + 1 )", we have the following tree and S-Expression:

Sample Abstract Syntax Tree

Armed with this tree, it's actually pretty easy to evaluate the expression. A pretty straight-forward recursive algorithm would work here.

//Psueodcode
Value(node):
	if node is a leaf:
		return node contents
	else:
		operator = function corresponding with node's token
		return operator(Value(node child A), Value(node child B))

Lastly, the code for the tokenizer started to get hard to read, even though it was passing all of the tests and contained no obvious duplication. There was a lot of string-index-twiddling as I walked along the input string, occasionally peeking at the next token and jumping back to where I was before, etc. Because I had high code coverage in my tests, I was able to confidently rewrite the private helpers, without having to change the tests or the public interface of the tokenizer. At each step in the rewrite, the tests still passed, so I knew that I (probably) didn’t break anything. Now, I use a series of prioritized regular expressions to recognize and extract each new token.

Filed under: TDD, Language Design
« Previous PageNext Page »

Powered by WordPress