Grammar 0
January 29th, 2007This is part of a series of articles that started with Test-Driving a Compiler
I’m going to approach this project by incrementally adjusting the Parsing Expression Grammar that goes along with the language. You can think of the PEG as a specification. It’s not a specification of the whole language, because it says nothing about semantics. The grammar only specifies which input strings are valid in the language. PEGs are like regular expressions, except that they can recognize far more complex patterns.
The page I linked to above on PEGs gives a sample grammar for basic arithmetic expressions (I flipped the order, but that doesn’t matter):
Expr := Sum
Sum := Product (('+' / '-') Product)*
Product := Value (('*' / '/') Value)*
Value := [0-9]+ / '(' Expr ')'
Note that operators that have a higher precedence appear “deeper” in the grammar. When we turn the grammar into a recursive descent parser, this ordering will ensure that our resulting AST respects the order of operations. Note that this grammar assumes that we’ll be looking at the input on a character-by-character basis.
Also note that this grammar is already too complicated to fit in my head all at once! There are three things I’ll do to deal with that complexity. First, we’ll deal with tokens that are potentially larger than a single character. This way, our grammar can talk about integers as if they were each a single unit. The grammar will be simpler, with the cost of an extra upfront tokenizing step. Second, we’ll just plain start with a simpler grammar and work our way towards the one above. Third, I’ll use names that are, hopefully, more descriptive. With no further ado, I give you Grammar 0 (”grammar zero” to programmers, “grammar aught” to old-timey gold miners):
Grammar 0
Expression := integer-literal
Now I can start trying out this fancy TDD that all the cool kids keep talking about. I’ll present each step in 3 parts: “Red” (write a test that compiles and fails), “Green” (make the test pass as quickly as possible, even if we have to commit a few coding sins to get there), “Refactor” (make up for any recent sins by removing duplicated code). “Duplication of code” can be subtle and subjective, but the idea is that we’ll strive to prevent cruft from accumulating over time. If we demand from ourselves to make everything clean before continuing with the next test, we won’t even have the chance of getting buried in a pile of our own spaghetti code. We might even kick that nasty debugger habit.
Tokenize the Empty String
Red
//TokenizerTest.cs
[TestFixture]
public class TokenizerTest
{
[Test]
public void Empty()
{
Assert.AreEqual(new string[] { }, new Tokenizer(“”).ToArray());
}
}
//Tokenizer.cs
public sealed class Tokenizer
{
public Tokenizer(string source) { }
public string[] ToArray() { return null; }
}
Green
//Tokenizer.cs
public sealed class Tokenizer
{
public Tokenizer(string source) { }
public string[] ToArray() { return new string[] { }; }
}
Tokenize a Single Integer
Red
//TokenizerTest.cs
[Test]
public void Zero()
{
Assert.AreEqual(new string[] { “0″ }, new Tokenizer(“0″).ToArray());
}
Green
//Tokenizer.cs
public sealed class Tokenizer
{
private readonly string _source;
public Tokenizer(string source)
{
_source = source;
}
public string[] ToArray()
{
if (_source.Length == 0)
return new string[] { };
return new string[] { _source };
}
}
Eat Leading White Space
Red
//TokenizerTest.cs
[Test]
public void EatLeadingWhiteSpace()
{
Assert.AreEqual(new string[] { “0″ }, new Tokenizer(” 0″).ToArray());
}
Green
//Tokenizer.cs
public string[] ToArray()
{
if (_source.Length == 0)
return new string[] { };
int i = 0;
while (i < _source.Length && Char.IsWhiteSpace(_source[i]))
i++;
return new string[] { _source.Substring(i) };
}
Refactor - The whitespace-consuming loop is a little ugly. It’s tempting to put a comment in front of it. Rather, we’ll pull it out into a function with a descriptive name.
//Tokenizer.cs
public sealed class Tokenizer
{
private readonly string _source;
private int _index;
public Tokenizer(string source)
{
_source = source;
_index = 0;
}
private void EatWhiteSpace()
{
while (_index < _source.Length && Char.IsWhiteSpace(_source[_index]))
_index++;
}
public string[] ToArray()
{
if (_source.Length == 0)
return new string[] { };
EatWhiteSpace();
return new string[] { _source.Substring(_index) };
}
}
White Space Separates Multiple Tokens
Red
//TokenizerTest.cs
[Test]
public void WhitespaceSeparatesTokens()
{
Assert.AreEqual(new string[] { “0″, “1″, “2147483647″ }, new Tokenizer(“0 1 \t\n 2147483647″).ToArray());
}
Green
//Tokenizer.cs
private string EatDigits()
{
int start = _index;
while (_index < _source.Length && Char.IsDigit(_source[_index]))
_index++;
return _source.Substring(start, _index - start);
}
public string[] ToArray()
{
List<string> tokens = new List<string>();
while (_index < _source.Length)
{
EatWhiteSpace();
if (_index < _source.Length)
tokens.Add(EatDigits());
}
return tokens.ToArray();
}
Refactor - This is getting out of hand! First, _index < _source.Length is a condition we keep repeating, and it isn’t entirely clear what it means. So, we pull that condition out into a helper property called HasNext. At this point, we run the tests to confirm they still pass. Then, we notice that EatWhiteSpace and EatDigits are nearly identical. The main difference is which character-testing method to use, so we turn these methods into a single method that accepts a character-testing method as its argument. The tests still pass. Lastly, we add a helper method to the test class to avoid the copy-paste cycle that we’ve started in those test methods.
//Tokenizer.cs
public sealed class Tokenizer
{
private readonly string _source;
private int _index;
public Tokenizer(string source)
{
_source = source;
_index = 0;
}
private bool HasNext
{
get { return _index < _source.Length; }
}
private string Eat(Predicate<char> test)
{
int start = _index;
while (HasNext && test(_source[_index]))
_index++;
return _source.Substring(start, _index - start);
}
public string[] ToArray()
{
List<string> tokens = new List<string>();
while (HasNext)
{
Eat(Char.IsWhiteSpace);
if (HasNext)
tokens.Add(Eat(Char.IsDigit));
}
return tokens.ToArray();
}
}
//TokenizerTest.cs
[TestFixture]
public class TokenizerTest
{
private void AssertTokens(string source, params string[] expectedTokens)
{
Assert.AreEqual(expectedTokens, new Tokenizer(source).ToArray());
}
[Test]
public void Empty()
{
AssertTokens(“”, new string[] { });
}
[Test]
public void Zero()
{
AssertTokens(“0″, “0″);
}
[Test]
public void EatLeadingWhiteSpace()
{
AssertTokens(” 0″, “0″);
}
[Test]
public void WhitespaceSeparatesTokens()
{
AssertTokens(“0 1 \t\n 2147483647″, “0″, “1″, “2147483647″);
}
}
Zero AST
Red - We wish to use S-Expressions as a compact string representation of our ASTs. For integer constants, this is simply the string version of the integer. We’ll see more interesting S-Expressions later.
//ExpressionTest.cs
[TestFixture]
public class ExpressionTest
{
[Test]
public void ZeroTree()
{
Assert.AreEqual(“0″, new Expression(“0″).ToString());
}
}
//Expression.cs
public sealed class Expression
{
public Expression(string source) { }
public override string ToString()
{
return null;
}
}
Green
//Expression.cs
public override string ToString()
{
return “0″;
}
Max Integer AST
Red
//ExpressionTest.cs
[Test]
public void MaxIntTree()
{
Assert.AreEqual(“2147483647″, new Expression(“2147483647″).ToString());
}
Green
//Expression.cs
public sealed class Expression
{
private readonly string _source;
public Expression(string source)
{
_source = source;
}
public override string ToString()
{
return _source;
}
}
Refactor
//ExpressionTest.cs
[TestFixture]
public class ExpressionTest
{
private void AssertTree(string expectedAbstractSyntaxTree, string source)
{
Assert.AreEqual(expectedAbstractSyntaxTree, new Expression(source).ToString());
}
[Test]
public void ZeroTree()
{
AssertTree(“0″, “0″);
}
[Test]
public void MaxIntTree()
{
AssertTree(“2147483647″, “2147483647″);
}
}
Integer Node Wraps Integers
Red - This is our first AST node type. Integer nodes will just wrap an integer value. ToString() will provide an S-Expression that contributes to the overall Expression’s ToString() values.
//IntegerNodeTest.cs
[TestFixture]
public class IntegerNodeTest
{
[Test]
public void WrapsInt()
{
Assert.AreEqual(“0″, new IntegerNode(“0″).ToString());
}
}
//IntegerNode.cs
public sealed class IntegerNode
{
public IntegerNode(string integer) { }
public override string ToString()
{
return null;
}
}
Green
//IntegerNode.cs
public sealed class IntegerNode
{
private readonly string _value;
public IntegerNode(string integer)
{
_value = integer;
}
public override string ToString()
{
return _value;
}
}
Integer Node Throws on Overflow
Red
//IntegerNodeTest.cs
[Test]
[ExpectedException(typeof(OverflowException))]
public void ThrowUponOverflow()
{
new IntegerNode(“2147483648″);
}
Green
//IntegerNode.cs
public sealed class IntegerNode
{
private readonly int _value;
public IntegerNode(string integer)
{
_value = int.Parse(integer);
}
public override string ToString()
{
return _value.ToString();
}
}
Refactor - Now that our integer node type is ready, we’ll finally translate the single rule of Grammar 0 into code. Below, that’s _tree = new IntegerNode(source);.
//Expression.cs
public sealed class Expression
{
private readonly IntegerNode _tree;
public Expression(string source)
{
_tree = new IntegerNode(source);
}
public override string ToString()
{
return _tree.ToString();
}
}
Tokenizer Consumes Tokens One at a Time
Red
//TokenizerTest.cs
[Test]
public void ZeroTokens()
{
Assert.IsNull(new Tokenizer(“”).Consume());
}
//Tokenizer.cs
public string Consume()
{
return “”;
}
Green
//Tokenizer.cs
public string Consume()
{
return null;
}
Refactor - Now we have enough tools to assert that expressions are defined in terms of tokens, not arbitrary strings. We give a meaningful definition of Consume(), and simplify ToArray() by calling it. Then, we let Expression accept a Tokenizer instead of a string. At this point, we’ll just assume that there is one token to pull from the Tokenizer.
//Tokenizer.cs
public string Consume()
{
if (HasNext)
Eat(Char.IsWhiteSpace);
if (HasNext)
return Eat(Char.IsDigit);
return null;
}
public string[] ToArray()
{
List<string> tokens = new List<string>();
while (true)
{
string token = Consume();
if (token == null)
return tokens.ToArray();
tokens.Add(token);
}
}
//Expression.cs
public Expression(Tokenizer tokenizer)
{
_tree = new IntegerNode(tokenizer.Consume());
}
//ExpressionTest.cs
private void AssertTree(string expectedAbstractSyntaxTree, string source)
{
Expression expression = new Expression(new Tokenizer(source));
Assert.AreEqual(expectedAbstractSyntaxTree, expression.ToString());
}