Grammar 2: Lost in New York

February 2nd, 2007

This is part of a series of articles that started with Test-Driving a Compiler.

Some of the operators introduced in Grammar 1 have to do with boolean tests: ==, !=, <, etc. Further boolean support introduced in this article includes && (and), || (or), and if statements. Different languages deal with if statements in different ways. In fact, even calling them if statements can be a misnomer in some languages. Usually, language designers make a distinction between statements and expressions. Expressions have values ("2*3"), and statements don’t ("int x;"). In some languages, the distinction is blurry: sometimes expressions are valid statements, sometimes everything is an expression, etc. I’ve decided that I like the idea of letting ifs have values. The value of an if expression will be the last expression executed in either of the two blocks. One nice thing about letting ifs have values is that we can do away with the pesky ?: operator from C. The purpose of ?: is like blessing if statements with expression semantics.

Changing a grammar to support nested if expressions correctly is not easy! In general, we want each ‘else’ block to match up with the innermost preceding ‘if’, but it turns out that it’s easy to create a grammar for if statements that is ambiguous (in other words, two completely different trees could be generated by the same grammar, each of which would have a different meaning at run time). This complication is called the Dangling Else Problem.

I’m coding my parser manually, which is unusual. Most languages use parser-generators that take a grammar as input and generate code for you that does a lot of the parsing work. I prefer doing it manually because there is a much smaller learning curve involved, and I like that I can fit it all in my head at once (the real reason is that this is a goofy side project of no import, and it’s just plain entertaining to reinvent this wheel).

For those who rely on a third-party tool, the dangling else problem is a bigger deal. Depending on the tool, one might solve the problem by adding grammar rules that remove any ambiguity, which can make the grammar much harder to understand at a glance, or by asking the parser generator to detect ambiguity and resolve it with a particular heuristic.

This is a much smaller problem when you use PEG grammars, even when using a parser generator, because the slash symbol / implies a priority and backtracking that other notations do not. For the PEG rule e1 := e2 / e3, the parser generator is supposed to first attempt parsing e2. If e2 fails, then it backtracks and attempts e3 before deciding whether the rule has actually failed. That backtracking is what lets us fix the grammar without as much complexity as other notations. Even though I’m writing the parser manually, using the PEG notation still simplifies the problem. Knowing what a general-purpose PEG parser generator would do (try-and-backtrack), I can write a manual parser that is also unambiguous, without actually having to waste time with backtracking. The best way to illustrate that is by showing the new grammar as well as the parsing method which implements the new IfExpression rule.

	Grammar 1
	Expression := EqualityExpression
	EqualityExpression := RelationalExpression (('==' / '!=') RelationalExpression)*
	RelationalExpression := AdditiveExpression (('<' / '<=' / '>' / '>=') AdditiveExpression)*
	AdditiveExpression := MultiplicativeExpression (('+' / '-') MultiplicativeExpression)*
	MultiplicativeExpression := Value (('*' / '/') Value)*
	Value := integer-literal / '(' Expression ')'

	Grammar 2
	Expression := IfExpression
	IfExpression := 'if' Expression Expression 'else' Expression
	              / 'if' Expression Expression
	              / OrExpression
	OrExpression := AndExpression ('||' AndExpression)*
	AndExpression := EqualityExpression ('&&' EqualityExpression)*
	EqualityExpression := RelationalExpression (('==' / '!=') RelationalExpression)*
	RelationalExpression := AdditiveExpression (('<' / '<=' / '>'/ '>=') AdditiveExpression)*
	AdditiveExpression := MultiplicativeExpression (('+' / '-') MultiplicativeExpression)*
	MultiplicativeExpression := Value (('*' / '/') Value)*
	Value := boolean-literal / integer-literal / '(' Expression ')'

The IfExpression rule is adapted from Bryan Ford’s work on PEGs,
in particular this slide show about PEGs (PDF).

This rule works because the ‘else’ version is supposed to be attempted first; we’re supposed to be gluttonous when it comes to eating ‘else’ blocks, which gives us Lioi Life Lesson #1:

Lioi Life Lesson #1: Gluttony is bad for your health, unless you’re talking about a solution to the dangling else problem or Mallomars.*

*Consult a physician before following any Lioi Life Lesson.

In this particular case, the recursive methods that correspond with each rule can benefit from the test-and-backtrack approach without actually having to backtrack. If the next token is not ‘if’, then we know that the 3rd part of the rule is in effect, and defer to the OrExpression rule. If instead the next token is ‘if’, we know we have to parse two Expressions (condition and when-true). After that, we can peek ahead one token, to see if ‘else’ is present. If so, we know we’re in the first part of the rule, which has a higher priority. If ‘else’ isn’t present, then we’re done. We don’t explicitly backtrack, and we don’t waste time traversing paths that will ultimately fail, but we still benefit from the way that PEG grammars avoid ambiguity.


private AstNode IfExpression()
{
	if (_tokenizer.Peek("if"))
	{
		_tokenizer.Consume("if");

		AstNode condition = ParseExpression();
		AstNode whenTrue = ParseExpression();

		if (_tokenizer.Peek("else"))
		{
			_tokenizer.Consume("else");
			AstNode whenFalse = ParseExpression();
			return new IfNode(condition, whenTrue, whenFalse);
		}

		return new IfNode(condition, whenTrue);
	}

	return OrExpression();
}

All the theory in the world won’t give me enough confidence to trust this code on its own, so it was driven by a few tests. The first tests confirmed that the expected S-Expressions resulted from the simplest input if expressions, and then I followed those tests with a test that had a single nested if expression inside each block of an outer if expression. The ‘else’ blocks always matched up with the appropriate ‘if’.

Filed under: Language Design

Grammar 1: Basic Operators

February 1st, 2007

This is part of a series of articles that started with Test-Driving a Compiler.

Grammar 0 was silly. We went to the trouble of creating a tokenizer and a recursive descent parser for a language where the only valid programs were integer constants. With that silliness out of the way, we can start doing things with these integers. Grammar 1 includes arithmetic and relational operators, as well as nesting expressions with parentheses. The precedence of operators will be as follows, where parentheses are the highest priority, and equality tests are lowest:

Operators Priority
( ) Grouping
* / Multiplicative
+ - Additive
< <= > >= Relational
== != Equality

Starting with Grammar 0, we'll grow this in pieces. Using the arithmetic grammar on this PEG wiki page as a guide, we introduce low-priority operators earlier within the grammar.

	//Grammar 0
	Expression := integer-literal

	//Introduce grouping operators.
	Expression := Value
	Value := integer-literal / '(' Expression ')'

	//Introduce multiplication and division.
	Expression := MultiplicativeExpression
	MultiplicativeExpression := Value (('*' / '/') Value)*
	Value := integer-literal / '(' Expression ')'

	//Introduce addition and subtraction.
	Expression := AdditiveExpression
	AdditiveExpression := MultiplicativeExpression (('+' / '-') MultiplicativeExpression)*
	MultiplicativeExpression := Value (('*' / '/') Value)*
	Value := integer-literal / '(' Expression ')'

	//Introduce comparisons.
	Expression := RelationalExpression
	RelationalExpression := AdditiveExpression (('<' / '<=' / '>' / '>=') AdditiveExpression)*
	AdditiveExpression := MultiplicativeExpression (('+' / '-') MultiplicativeExpression)*
	MultiplicativeExpression := Value (('*' / '/') Value)*
	Value := integer-literal / '(' Expression ')'

One last pair of operators to go, and we have the completed grammar:

	Grammar 1
	Expression := EqualityExpression
	EqualityExpression := RelationalExpression (('==' / '!=') RelationalExpression)*
	RelationalExpression := AdditiveExpression (('<' / '<=' / '>' / '>=') AdditiveExpression)*
	AdditiveExpression := MultiplicativeExpression (('+' / '-') MultiplicativeExpression)*
	MultiplicativeExpression := Value (('*' / '/') Value)*
	Value := integer-literal / '(' Expression ')'

Each change in the above sequence was driven by tests. I won't belabor the point by showing every test and every change in the parsing code as I did in the development of Grammar 0.

The tokenizer recognizes each operator as single token, even when an operator consists of 2 characters. The tokenizer had to be greedy, so that the input string "1<=2" is interpreted as 3 tokens ("1", "<=", "2") rather than 2 tokens followed by garbage ("1", "<", "=2").

The recursive descent parsing code very closely mimics the grammar. Each nonterminal in the grammar has a mutually-recursive method with the same name. The body of the method corresponds with the right-hand-side of that grammar rule. Notice that several of the grammar rules are nearly identical, and the BinaryOperation helper method embodies what those rules have in common (everything other than the particular operators to use and the name of the nonterminal method to call recursively). Here's the part of the code that matches the grammar:


	public sealed class Expression
	{
		private delegate AstNode NodeParser();

		private readonly AstNode _abstractSyntaxTree;
		private readonly Tokenizer _tokenizer;

		public Expression(Tokenizer tokenizer)
		{
			_tokenizer = tokenizer;
			_abstractSyntaxTree = ParseExpression();
		}

		private AstNode ParseExpression()
		{
			return EqualityExpression();
		}

		private AstNode EqualityExpression()
		{
			return BinaryOperation(ParseRelationalExpression, “==”, “!=”);
		}

		private AstNode ParseRelationalExpression()
		{
			return BinaryOperation(AdditiveExpression, “<”, “<=”, “>”, “>=”);
		}

		private AstNode AdditiveExpression()
		{
			return BinaryOperation(MultiplicativeExpression, “+”, “-”);
		}

		private AstNode MultiplicativeExpression()
		{
			return BinaryOperation(Value, “*”, “/”);
		}

		private AstNode Value()
		{
			string token = _tokenizer.Consume();

			if (token == “(”)
			{
				AstNode node = ParseExpression();
				_tokenizer.Consume(“)”);
				return node;
			}

			return new IntegerNode(token);
		}

		private AstNode BinaryOperation(NodeParser parseOperand, params string[] operators)
		{
			AstNode node = parseOperand();

			while (_tokenizer.Peek(operators))
				node = new OperatorNode(node, _tokenizer.Consume(), parseOperand());

			return node;
		}

		public override string ToString()
		{
			return _abstractSyntaxTree.ToString();
		}
	}

At this point, it helps to see the relationship between a sample input string, the tree of nodes produced by the recursion, and the S-Expression that is used as shorthand for that tree. For the input string "1 + 2 * 3 + 4 <= 12 / ( 2 + 1 )", we have the following tree and S-Expression:

Sample Abstract Syntax Tree

Armed with this tree, it's actually pretty easy to evaluate the expression. A pretty straight-forward recursive algorithm would work here.

//Psueodcode
Value(node):
	if node is a leaf:
		return node contents
	else:
		operator = function corresponding with node's token
		return operator(Value(node child A), Value(node child B))

Lastly, the code for the tokenizer started to get hard to read, even though it was passing all of the tests and contained no obvious duplication. There was a lot of string-index-twiddling as I walked along the input string, occasionally peeking at the next token and jumping back to where I was before, etc. Because I had high code coverage in my tests, I was able to confidently rewrite the private helpers, without having to change the tests or the public interface of the tokenizer. At each step in the rewrite, the tests still passed, so I knew that I (probably) didn’t break anything. Now, I use a series of prioritized regular expressions to recognize and extract each new token.

Filed under: TDD, Language Design

Grammar 0

January 29th, 2007

This is part of a series of articles that started with Test-Driving a Compiler

I’m going to approach this project by incrementally adjusting the Parsing Expression Grammar that goes along with the language. You can think of the PEG as a specification. It’s not a specification of the whole language, because it says nothing about semantics. The grammar only specifies which input strings are valid in the language. PEGs are like regular expressions, except that they can recognize far more complex patterns.

The page I linked to above on PEGs gives a sample grammar for basic arithmetic expressions (I flipped the order, but that doesn’t matter):

	Expr := Sum
	Sum := Product (('+' / '-') Product)*
	Product := Value (('*' / '/') Value)*
	Value := [0-9]+ / '(' Expr ')'

Note that operators that have a higher precedence appear “deeper” in the grammar. When we turn the grammar into a recursive descent parser, this ordering will ensure that our resulting AST respects the order of operations. Note that this grammar assumes that we’ll be looking at the input on a character-by-character basis.

Also note that this grammar is already too complicated to fit in my head all at once! There are three things I’ll do to deal with that complexity. First, we’ll deal with tokens that are potentially larger than a single character. This way, our grammar can talk about integers as if they were each a single unit. The grammar will be simpler, with the cost of an extra upfront tokenizing step. Second, we’ll just plain start with a simpler grammar and work our way towards the one above. Third, I’ll use names that are, hopefully, more descriptive. With no further ado, I give you Grammar 0 (”grammar zero” to programmers, “grammar aught” to old-timey gold miners):

	Grammar 0
	Expression := integer-literal

Now I can start trying out this fancy TDD that all the cool kids keep talking about. I’ll present each step in 3 parts: “Red” (write a test that compiles and fails), “Green” (make the test pass as quickly as possible, even if we have to commit a few coding sins to get there), “Refactor” (make up for any recent sins by removing duplicated code). “Duplication of code” can be subtle and subjective, but the idea is that we’ll strive to prevent cruft from accumulating over time. If we demand from ourselves to make everything clean before continuing with the next test, we won’t even have the chance of getting buried in a pile of our own spaghetti code. We might even kick that nasty debugger habit.

Tokenize the Empty String

Red

	//TokenizerTest.cs
	[TestFixture]
	public class TokenizerTest
	{
		[Test]
		public void Empty()
		{
			Assert.AreEqual(new string[] { }, new Tokenizer(“”).ToArray());
		}
	}

	//Tokenizer.cs
	public sealed class Tokenizer
	{
		public Tokenizer(string source) { }

		public string[] ToArray() { return null; }
	}

Green

	//Tokenizer.cs
	public sealed class Tokenizer
	{
		public Tokenizer(string source) { }

		public string[] ToArray() { return new string[] { }; }
	}

Tokenize a Single Integer

Red

	//TokenizerTest.cs
	[Test]
	public void Zero()
	{
		Assert.AreEqual(new string[] { “0″ }, new Tokenizer(“0″).ToArray());
	}

Green

	//Tokenizer.cs
	public sealed class Tokenizer
	{
		private readonly string _source;

		public Tokenizer(string source)
		{
			_source = source;
		}

		public string[] ToArray()
		{
			if (_source.Length == 0)
				return new string[] { };

			return new string[] { _source };
		}
	}

Eat Leading White Space

Red

	//TokenizerTest.cs
	[Test]
	public void EatLeadingWhiteSpace()
	{
		Assert.AreEqual(new string[] { “0″ }, new Tokenizer(” 0″).ToArray());
	}

Green

	//Tokenizer.cs
	public string[] ToArray()
	{
		if (_source.Length == 0)
			return new string[] { };

		int i = 0;
		while (i < _source.Length && Char.IsWhiteSpace(_source[i]))
			i++;

		return new string[] { _source.Substring(i) };
	}

Refactor - The whitespace-consuming loop is a little ugly. It’s tempting to put a comment in front of it. Rather, we’ll pull it out into a function with a descriptive name.

	//Tokenizer.cs
	public sealed class Tokenizer
	{
		private readonly string _source;
		private int _index;

		public Tokenizer(string source)
		{
			_source = source;
			_index = 0;
		}

		private void EatWhiteSpace()
		{
			while (_index < _source.Length && Char.IsWhiteSpace(_source[_index]))
				_index++;
		}

		public string[] ToArray()
		{
			if (_source.Length == 0)
				return new string[] { };

			EatWhiteSpace();

			return new string[] { _source.Substring(_index) };
		}
	}

White Space Separates Multiple Tokens

Red

	//TokenizerTest.cs
	[Test]
	public void WhitespaceSeparatesTokens()
	{
		Assert.AreEqual(new string[] { “0″, “1″, “2147483647″ }, new Tokenizer(“0 1 \t\n 2147483647″).ToArray());
	}

Green

	//Tokenizer.cs
	private string EatDigits()
	{
		int start = _index;

		while (_index < _source.Length && Char.IsDigit(_source[_index]))
			_index++;

		return _source.Substring(start, _index - start);
	}

	public string[] ToArray()
	{
		List<string> tokens = new List<string>();

		while (_index < _source.Length)
		{
			EatWhiteSpace();

			if (_index < _source.Length)
				tokens.Add(EatDigits());
		}

		return tokens.ToArray();
	}

Refactor - This is getting out of hand! First, _index < _source.Length is a condition we keep repeating, and it isn’t entirely clear what it means. So, we pull that condition out into a helper property called HasNext. At this point, we run the tests to confirm they still pass. Then, we notice that EatWhiteSpace and EatDigits are nearly identical. The main difference is which character-testing method to use, so we turn these methods into a single method that accepts a character-testing method as its argument. The tests still pass. Lastly, we add a helper method to the test class to avoid the copy-paste cycle that we’ve started in those test methods.

	//Tokenizer.cs
	public sealed class Tokenizer
	{
		private readonly string _source;
		private int _index;

		public Tokenizer(string source)
		{
			_source = source;
			_index = 0;
		}

		private bool HasNext
		{
			get { return _index < _source.Length; }
		}

		private string Eat(Predicate<char> test)
		{
			int start = _index;

			while (HasNext && test(_source[_index]))
				_index++;

			return _source.Substring(start, _index - start);
		}

		public string[] ToArray()
		{
			List<string> tokens = new List<string>();

			while (HasNext)
			{
				Eat(Char.IsWhiteSpace);

				if (HasNext)
					tokens.Add(Eat(Char.IsDigit));
			}

			return tokens.ToArray();
		}
	}

	//TokenizerTest.cs
	[TestFixture]
	public class TokenizerTest
	{
		private void AssertTokens(string source, params string[] expectedTokens)
		{
			Assert.AreEqual(expectedTokens, new Tokenizer(source).ToArray());
		}

		[Test]
		public void Empty()
		{
			AssertTokens(“”, new string[] { });
		}

		[Test]
		public void Zero()
		{
			AssertTokens(“0″, “0″);
		}

		[Test]
		public void EatLeadingWhiteSpace()
		{
			AssertTokens(” 0″, “0″);
		}

		[Test]
		public void WhitespaceSeparatesTokens()
		{
			AssertTokens(“0 1 \t\n 2147483647″, “0″, “1″, “2147483647″);
		}
	}

Zero AST

Red - We wish to use S-Expressions as a compact string representation of our ASTs. For integer constants, this is simply the string version of the integer. We’ll see more interesting S-Expressions later.

	//ExpressionTest.cs
	[TestFixture]
	public class ExpressionTest
	{
		[Test]
		public void ZeroTree()
		{
			Assert.AreEqual(“0″, new Expression(“0″).ToString());
		}
	}

	//Expression.cs
	public sealed class Expression
	{
		public Expression(string source) { }

		public override string ToString()
		{
			return null;
		}
	}

Green

	//Expression.cs
	public override string ToString()
	{
		return “0″;
	}

Max Integer AST

Red

	//ExpressionTest.cs
	[Test]
	public void MaxIntTree()
	{
		Assert.AreEqual(“2147483647″, new Expression(“2147483647″).ToString());
	}

Green

	//Expression.cs
	public sealed class Expression
	{
		private readonly string _source;

		public Expression(string source)
		{
			_source = source;
		}

		public override string ToString()
		{
			return _source;
		}
	}

Refactor

	//ExpressionTest.cs
	[TestFixture]
	public class ExpressionTest
	{
		private void AssertTree(string expectedAbstractSyntaxTree, string source)
		{
			Assert.AreEqual(expectedAbstractSyntaxTree, new Expression(source).ToString());
		}

		[Test]
		public void ZeroTree()
		{
			AssertTree(“0″, “0″);
		}

		[Test]
		public void MaxIntTree()
		{
			AssertTree(“2147483647″, “2147483647″);
		}
	}

Integer Node Wraps Integers

Red - This is our first AST node type. Integer nodes will just wrap an integer value. ToString() will provide an S-Expression that contributes to the overall Expression’s ToString() values.

	//IntegerNodeTest.cs
	[TestFixture]
	public class IntegerNodeTest
	{
		[Test]
		public void WrapsInt()
		{
			Assert.AreEqual(“0″, new IntegerNode(“0″).ToString());
		}
	}

	//IntegerNode.cs
	public sealed class IntegerNode
	{
		public IntegerNode(string integer) { }

		public override string ToString()
		{
			return null;
		}
	}

Green

	//IntegerNode.cs
	public sealed class IntegerNode
	{
		private readonly string _value;

		public IntegerNode(string integer)
		{
			_value = integer;
		}

		public override string ToString()
		{
			return _value;
		}
	}

Integer Node Throws on Overflow

Red

	//IntegerNodeTest.cs
	[Test]
	[ExpectedException(typeof(OverflowException))]
	public void ThrowUponOverflow()
	{
		new IntegerNode(“2147483648″);
	}

Green

	//IntegerNode.cs
	public sealed class IntegerNode
	{
		private readonly int _value;

		public IntegerNode(string integer)
		{
			_value = int.Parse(integer);
		}

		public override string ToString()
		{
			return _value.ToString();
		}
	}

Refactor - Now that our integer node type is ready, we’ll finally translate the single rule of Grammar 0 into code. Below, that’s _tree = new IntegerNode(source);.

	//Expression.cs
	public sealed class Expression
	{
		private readonly IntegerNode _tree;

		public Expression(string source)
		{
			_tree = new IntegerNode(source);
		}

		public override string ToString()
		{
			return _tree.ToString();
		}
	}

Tokenizer Consumes Tokens One at a Time

Red

	//TokenizerTest.cs
	[Test]
	public void ZeroTokens()
	{
		Assert.IsNull(new Tokenizer(“”).Consume());
	}

	//Tokenizer.cs
	public string Consume()
	{
		return “”;
	}

Green

	//Tokenizer.cs
	public string Consume()
	{
		return null;
	}

Refactor - Now we have enough tools to assert that expressions are defined in terms of tokens, not arbitrary strings. We give a meaningful definition of Consume(), and simplify ToArray() by calling it. Then, we let Expression accept a Tokenizer instead of a string. At this point, we’ll just assume that there is one token to pull from the Tokenizer.

	//Tokenizer.cs
	public string Consume()
	{
		if (HasNext)
			Eat(Char.IsWhiteSpace);

		if (HasNext)
			return Eat(Char.IsDigit);

		return null;
	}

	public string[] ToArray()
	{
		List<string> tokens = new List<string>();

		while (true)
		{
			string token = Consume();

			if (token == null)
				return tokens.ToArray();

			tokens.Add(token);
		}
	}

	//Expression.cs
	public Expression(Tokenizer tokenizer)
	{
		_tree = new IntegerNode(tokenizer.Consume());
	}

	//ExpressionTest.cs
	private void AssertTree(string expectedAbstractSyntaxTree, string source)
	{
		Expression expression = new Expression(new Tokenizer(source));
		Assert.AreEqual(expectedAbstractSyntaxTree, expression.ToString());
	}
Filed under: TDD, Language Design

Test-Driving a Compiler

January 25th, 2007

For some time now, I’ve been interested in making a programming language. Whenever I’ve attempted this in the past, I instantly ran into resistance because I was too concerned with making a Big Design Up Front. I would try to think about the entire language all at once, and put that into a grammar. Of course, that’s way too much to fit in your head all at once, so I would start with a grammar for an existing language, like Python or C#, and whittle away at it until I liked what I was looking at. Half-way through whittling, the house of cards that were my thoughts would come tumbling down.

At work, I’ve been Unit Testing a great deal for about a year now, but I have only been able to seriously write test-first code a fraction of the time. I’d like to attempt to create a programming language while strictly adhering to the TDD mantra “Red, Green, Refactor”, and I’ll be posting my progress here. This is a learning opportunity for me, so I’ll be posting my missteps as well as my successes.

About two years ago, I attended a conference where Robert C. Martin lead a session. Although the session wasn’t supposed to be about TDD, it quickly took that turn. I’m glad that it did, because I hadn’t seen TDD in action like that before. During this session, he started out by drawing a UML diagram to collect his thoughts and get the big picture down. But, he made a point of saying that now that we had a design document, we should hide it in a drawer. Although the diagram seemed appropriate and full of OOG (Object-Oriented Goodness; look it up), the end product that came out of the TDD style was nothing like that original diagram. Had we stuck with the diagram, we would have made something overly complicated and just plane wrong.Following that example, I’ll give a very high-level design for the language, but will then shove it in a drawer lest it get in the way. In future posts, I’ll actually show the code as it develops. Early on, I’ll show every minute step that I take, but after I get on a roll I’ll begin to give less tedious summaries of my progress.

A Back-of-a-Napkin Design

People have been writing compilers for quite some time now, and it only makes sense to base the design on those successes.
So, I know that I’ll start with a grammar using one of the standard notations, I’ll build a parser from that grammar which
turns source code into an Abstract Syntax Tree (AST), and then I’ll use the tree to validate the program and generate code for the target platform.

Back of a Napkin Design

For the parsing, I’ll be using the Recursive Descent technique, which is less efficient than bottom-up parsing, but is much easier to fit in my head. It turns out that there is a special grammar notation that is perfect for recursive descent, called a Parsing Expression Grammar. These references on RD and PEGs are my primary background materials as I start this project, and I’ll consider them a prerequisite to anyone who wants to follow my progress.

The target platform may be MSIL, Java byte code, Parrot, or something else; maybe even another high-level language. If I target a high level language, then I’ll benefit from all the brilliant people who already wrote slick, optimized compilers. Imagine that VB.NET were compiled by first translating to C#, and then sending that translation to the C# compiler.

It is quite likely that at some point I’ll want to replace the top-down, recursive descent parser with a bottom-up technique, since those are more sophisticated. I’ll keep that in mind as a possible change to be made in the future, but I won’t be over-engineering things now just to make that potential-change easier.

I suspect that I’ll eventually need to use the Visitor Design Pattern for inspecting the AST. Most discussions of the Visitor pattern use ASTs as the prime example. I’ll keep this pattern in mind once I start needing to do varying actions against the tree, but I won’t actually include this pattern in my code until I really feel the need for it. When things start to hurt in a way that Visitor will help with, I’ll start to use it.

Now that I know where I’m going, I’ll put this 100,000 foot view off to the side, and later I’ll see how close it is to what the TDD process produces.

Filed under: TDD, Language Design
« Previous Page

Powered by WordPress