Grammar 0

January 29th, 2007

This is part of a series of articles that started with Test-Driving a Compiler

I’m going to approach this project by incrementally adjusting the Parsing Expression Grammar that goes along with the language. You can think of the PEG as a specification. It’s not a specification of the whole language, because it says nothing about semantics. The grammar only specifies which input strings are valid in the language. PEGs are like regular expressions, except that they can recognize far more complex patterns.

The page I linked to above on PEGs gives a sample grammar for basic arithmetic expressions (I flipped the order, but that doesn’t matter):

	Expr := Sum
	Sum := Product (('+' / '-') Product)*
	Product := Value (('*' / '/') Value)*
	Value := [0-9]+ / '(' Expr ')'

Note that operators that have a higher precedence appear “deeper” in the grammar. When we turn the grammar into a recursive descent parser, this ordering will ensure that our resulting AST respects the order of operations. Note that this grammar assumes that we’ll be looking at the input on a character-by-character basis.

Also note that this grammar is already too complicated to fit in my head all at once! There are three things I’ll do to deal with that complexity. First, we’ll deal with tokens that are potentially larger than a single character. This way, our grammar can talk about integers as if they were each a single unit. The grammar will be simpler, with the cost of an extra upfront tokenizing step. Second, we’ll just plain start with a simpler grammar and work our way towards the one above. Third, I’ll use names that are, hopefully, more descriptive. With no further ado, I give you Grammar 0 (”grammar zero” to programmers, “grammar aught” to old-timey gold miners):

	Grammar 0
	Expression := integer-literal

Now I can start trying out this fancy TDD that all the cool kids keep talking about. I’ll present each step in 3 parts: “Red” (write a test that compiles and fails), “Green” (make the test pass as quickly as possible, even if we have to commit a few coding sins to get there), “Refactor” (make up for any recent sins by removing duplicated code). “Duplication of code” can be subtle and subjective, but the idea is that we’ll strive to prevent cruft from accumulating over time. If we demand from ourselves to make everything clean before continuing with the next test, we won’t even have the chance of getting buried in a pile of our own spaghetti code. We might even kick that nasty debugger habit.

Tokenize the Empty String

Red

	//TokenizerTest.cs
	[TestFixture]
	public class TokenizerTest
	{
		[Test]
		public void Empty()
		{
			Assert.AreEqual(new string[] { }, new Tokenizer(“”).ToArray());
		}
	}

	//Tokenizer.cs
	public sealed class Tokenizer
	{
		public Tokenizer(string source) { }

		public string[] ToArray() { return null; }
	}

Green

	//Tokenizer.cs
	public sealed class Tokenizer
	{
		public Tokenizer(string source) { }

		public string[] ToArray() { return new string[] { }; }
	}

Tokenize a Single Integer

Red

	//TokenizerTest.cs
	[Test]
	public void Zero()
	{
		Assert.AreEqual(new string[] { “0″ }, new Tokenizer(“0″).ToArray());
	}

Green

	//Tokenizer.cs
	public sealed class Tokenizer
	{
		private readonly string _source;

		public Tokenizer(string source)
		{
			_source = source;
		}

		public string[] ToArray()
		{
			if (_source.Length == 0)
				return new string[] { };

			return new string[] { _source };
		}
	}

Eat Leading White Space

Red

	//TokenizerTest.cs
	[Test]
	public void EatLeadingWhiteSpace()
	{
		Assert.AreEqual(new string[] { “0″ }, new Tokenizer(” 0″).ToArray());
	}

Green

	//Tokenizer.cs
	public string[] ToArray()
	{
		if (_source.Length == 0)
			return new string[] { };

		int i = 0;
		while (i < _source.Length && Char.IsWhiteSpace(_source[i]))
			i++;

		return new string[] { _source.Substring(i) };
	}

Refactor - The whitespace-consuming loop is a little ugly. It’s tempting to put a comment in front of it. Rather, we’ll pull it out into a function with a descriptive name.

	//Tokenizer.cs
	public sealed class Tokenizer
	{
		private readonly string _source;
		private int _index;

		public Tokenizer(string source)
		{
			_source = source;
			_index = 0;
		}

		private void EatWhiteSpace()
		{
			while (_index < _source.Length && Char.IsWhiteSpace(_source[_index]))
				_index++;
		}

		public string[] ToArray()
		{
			if (_source.Length == 0)
				return new string[] { };

			EatWhiteSpace();

			return new string[] { _source.Substring(_index) };
		}
	}

White Space Separates Multiple Tokens

Red

	//TokenizerTest.cs
	[Test]
	public void WhitespaceSeparatesTokens()
	{
		Assert.AreEqual(new string[] { “0″, “1″, “2147483647″ }, new Tokenizer(“0 1 \t\n 2147483647″).ToArray());
	}

Green

	//Tokenizer.cs
	private string EatDigits()
	{
		int start = _index;

		while (_index < _source.Length && Char.IsDigit(_source[_index]))
			_index++;

		return _source.Substring(start, _index - start);
	}

	public string[] ToArray()
	{
		List<string> tokens = new List<string>();

		while (_index < _source.Length)
		{
			EatWhiteSpace();

			if (_index < _source.Length)
				tokens.Add(EatDigits());
		}

		return tokens.ToArray();
	}

Refactor - This is getting out of hand! First, _index < _source.Length is a condition we keep repeating, and it isn’t entirely clear what it means. So, we pull that condition out into a helper property called HasNext. At this point, we run the tests to confirm they still pass. Then, we notice that EatWhiteSpace and EatDigits are nearly identical. The main difference is which character-testing method to use, so we turn these methods into a single method that accepts a character-testing method as its argument. The tests still pass. Lastly, we add a helper method to the test class to avoid the copy-paste cycle that we’ve started in those test methods.

	//Tokenizer.cs
	public sealed class Tokenizer
	{
		private readonly string _source;
		private int _index;

		public Tokenizer(string source)
		{
			_source = source;
			_index = 0;
		}

		private bool HasNext
		{
			get { return _index < _source.Length; }
		}

		private string Eat(Predicate<char> test)
		{
			int start = _index;

			while (HasNext && test(_source[_index]))
				_index++;

			return _source.Substring(start, _index - start);
		}

		public string[] ToArray()
		{
			List<string> tokens = new List<string>();

			while (HasNext)
			{
				Eat(Char.IsWhiteSpace);

				if (HasNext)
					tokens.Add(Eat(Char.IsDigit));
			}

			return tokens.ToArray();
		}
	}

	//TokenizerTest.cs
	[TestFixture]
	public class TokenizerTest
	{
		private void AssertTokens(string source, params string[] expectedTokens)
		{
			Assert.AreEqual(expectedTokens, new Tokenizer(source).ToArray());
		}

		[Test]
		public void Empty()
		{
			AssertTokens(“”, new string[] { });
		}

		[Test]
		public void Zero()
		{
			AssertTokens(“0″, “0″);
		}

		[Test]
		public void EatLeadingWhiteSpace()
		{
			AssertTokens(” 0″, “0″);
		}

		[Test]
		public void WhitespaceSeparatesTokens()
		{
			AssertTokens(“0 1 \t\n 2147483647″, “0″, “1″, “2147483647″);
		}
	}

Zero AST

Red - We wish to use S-Expressions as a compact string representation of our ASTs. For integer constants, this is simply the string version of the integer. We’ll see more interesting S-Expressions later.

	//ExpressionTest.cs
	[TestFixture]
	public class ExpressionTest
	{
		[Test]
		public void ZeroTree()
		{
			Assert.AreEqual(“0″, new Expression(“0″).ToString());
		}
	}

	//Expression.cs
	public sealed class Expression
	{
		public Expression(string source) { }

		public override string ToString()
		{
			return null;
		}
	}

Green

	//Expression.cs
	public override string ToString()
	{
		return “0″;
	}

Max Integer AST

Red

	//ExpressionTest.cs
	[Test]
	public void MaxIntTree()
	{
		Assert.AreEqual(“2147483647″, new Expression(“2147483647″).ToString());
	}

Green

	//Expression.cs
	public sealed class Expression
	{
		private readonly string _source;

		public Expression(string source)
		{
			_source = source;
		}

		public override string ToString()
		{
			return _source;
		}
	}

Refactor

	//ExpressionTest.cs
	[TestFixture]
	public class ExpressionTest
	{
		private void AssertTree(string expectedAbstractSyntaxTree, string source)
		{
			Assert.AreEqual(expectedAbstractSyntaxTree, new Expression(source).ToString());
		}

		[Test]
		public void ZeroTree()
		{
			AssertTree(“0″, “0″);
		}

		[Test]
		public void MaxIntTree()
		{
			AssertTree(“2147483647″, “2147483647″);
		}
	}

Integer Node Wraps Integers

Red - This is our first AST node type. Integer nodes will just wrap an integer value. ToString() will provide an S-Expression that contributes to the overall Expression’s ToString() values.

	//IntegerNodeTest.cs
	[TestFixture]
	public class IntegerNodeTest
	{
		[Test]
		public void WrapsInt()
		{
			Assert.AreEqual(“0″, new IntegerNode(“0″).ToString());
		}
	}

	//IntegerNode.cs
	public sealed class IntegerNode
	{
		public IntegerNode(string integer) { }

		public override string ToString()
		{
			return null;
		}
	}

Green

	//IntegerNode.cs
	public sealed class IntegerNode
	{
		private readonly string _value;

		public IntegerNode(string integer)
		{
			_value = integer;
		}

		public override string ToString()
		{
			return _value;
		}
	}

Integer Node Throws on Overflow

Red

	//IntegerNodeTest.cs
	[Test]
	[ExpectedException(typeof(OverflowException))]
	public void ThrowUponOverflow()
	{
		new IntegerNode(“2147483648″);
	}

Green

	//IntegerNode.cs
	public sealed class IntegerNode
	{
		private readonly int _value;

		public IntegerNode(string integer)
		{
			_value = int.Parse(integer);
		}

		public override string ToString()
		{
			return _value.ToString();
		}
	}

Refactor - Now that our integer node type is ready, we’ll finally translate the single rule of Grammar 0 into code. Below, that’s _tree = new IntegerNode(source);.

	//Expression.cs
	public sealed class Expression
	{
		private readonly IntegerNode _tree;

		public Expression(string source)
		{
			_tree = new IntegerNode(source);
		}

		public override string ToString()
		{
			return _tree.ToString();
		}
	}

Tokenizer Consumes Tokens One at a Time

Red

	//TokenizerTest.cs
	[Test]
	public void ZeroTokens()
	{
		Assert.IsNull(new Tokenizer(“”).Consume());
	}

	//Tokenizer.cs
	public string Consume()
	{
		return “”;
	}

Green

	//Tokenizer.cs
	public string Consume()
	{
		return null;
	}

Refactor - Now we have enough tools to assert that expressions are defined in terms of tokens, not arbitrary strings. We give a meaningful definition of Consume(), and simplify ToArray() by calling it. Then, we let Expression accept a Tokenizer instead of a string. At this point, we’ll just assume that there is one token to pull from the Tokenizer.

	//Tokenizer.cs
	public string Consume()
	{
		if (HasNext)
			Eat(Char.IsWhiteSpace);

		if (HasNext)
			return Eat(Char.IsDigit);

		return null;
	}

	public string[] ToArray()
	{
		List<string> tokens = new List<string>();

		while (true)
		{
			string token = Consume();

			if (token == null)
				return tokens.ToArray();

			tokens.Add(token);
		}
	}

	//Expression.cs
	public Expression(Tokenizer tokenizer)
	{
		_tree = new IntegerNode(tokenizer.Consume());
	}

	//ExpressionTest.cs
	private void AssertTree(string expectedAbstractSyntaxTree, string source)
	{
		Expression expression = new Expression(new Tokenizer(source));
		Assert.AreEqual(expectedAbstractSyntaxTree, expression.ToString());
	}
Filed under: TDD, Language Design

Test-Driving a Compiler

January 25th, 2007

For some time now, I’ve been interested in making a programming language. Whenever I’ve attempted this in the past, I instantly ran into resistance because I was too concerned with making a Big Design Up Front. I would try to think about the entire language all at once, and put that into a grammar. Of course, that’s way too much to fit in your head all at once, so I would start with a grammar for an existing language, like Python or C#, and whittle away at it until I liked what I was looking at. Half-way through whittling, the house of cards that were my thoughts would come tumbling down.

At work, I’ve been Unit Testing a great deal for about a year now, but I have only been able to seriously write test-first code a fraction of the time. I’d like to attempt to create a programming language while strictly adhering to the TDD mantra “Red, Green, Refactor”, and I’ll be posting my progress here. This is a learning opportunity for me, so I’ll be posting my missteps as well as my successes.

About two years ago, I attended a conference where Robert C. Martin lead a session. Although the session wasn’t supposed to be about TDD, it quickly took that turn. I’m glad that it did, because I hadn’t seen TDD in action like that before. During this session, he started out by drawing a UML diagram to collect his thoughts and get the big picture down. But, he made a point of saying that now that we had a design document, we should hide it in a drawer. Although the diagram seemed appropriate and full of OOG (Object-Oriented Goodness; look it up), the end product that came out of the TDD style was nothing like that original diagram. Had we stuck with the diagram, we would have made something overly complicated and just plane wrong.Following that example, I’ll give a very high-level design for the language, but will then shove it in a drawer lest it get in the way. In future posts, I’ll actually show the code as it develops. Early on, I’ll show every minute step that I take, but after I get on a roll I’ll begin to give less tedious summaries of my progress.

A Back-of-a-Napkin Design

People have been writing compilers for quite some time now, and it only makes sense to base the design on those successes.
So, I know that I’ll start with a grammar using one of the standard notations, I’ll build a parser from that grammar which
turns source code into an Abstract Syntax Tree (AST), and then I’ll use the tree to validate the program and generate code for the target platform.

Back of a Napkin Design

For the parsing, I’ll be using the Recursive Descent technique, which is less efficient than bottom-up parsing, but is much easier to fit in my head. It turns out that there is a special grammar notation that is perfect for recursive descent, called a Parsing Expression Grammar. These references on RD and PEGs are my primary background materials as I start this project, and I’ll consider them a prerequisite to anyone who wants to follow my progress.

The target platform may be MSIL, Java byte code, Parrot, or something else; maybe even another high-level language. If I target a high level language, then I’ll benefit from all the brilliant people who already wrote slick, optimized compilers. Imagine that VB.NET were compiled by first translating to C#, and then sending that translation to the C# compiler.

It is quite likely that at some point I’ll want to replace the top-down, recursive descent parser with a bottom-up technique, since those are more sophisticated. I’ll keep that in mind as a possible change to be made in the future, but I won’t be over-engineering things now just to make that potential-change easier.

I suspect that I’ll eventually need to use the Visitor Design Pattern for inspecting the AST. Most discussions of the Visitor pattern use ASTs as the prime example. I’ll keep this pattern in mind once I start needing to do varying actions against the tree, but I won’t actually include this pattern in my code until I really feel the need for it. When things start to hurt in a way that Visitor will help with, I’ll start to use it.

Now that I know where I’m going, I’ll put this 100,000 foot view off to the side, and later I’ll see how close it is to what the TDD process produces.

Filed under: TDD, Language Design

Powered by WordPress