Grammar 2: Lost in New York
February 2nd, 2007This is part of a series of articles that started with Test-Driving a Compiler.
Some of the operators introduced in Grammar 1 have to do with boolean tests: ==, !=, <, etc. Further boolean support introduced in this article includes && (and), || (or), and if statements. Different languages deal with if statements in different ways. In fact, even calling them if statements can be a misnomer in some languages. Usually, language designers make a distinction between statements and expressions. Expressions have values ("2*3"), and statements don’t ("int x;"). In some languages, the distinction is blurry: sometimes expressions are valid statements, sometimes everything is an expression, etc. I’ve decided that I like the idea of letting ifs have values. The value of an if expression will be the last expression executed in either of the two blocks. One nice thing about letting ifs have values is that we can do away with the pesky ?: operator from C. The purpose of ?: is like blessing if statements with expression semantics.
Changing a grammar to support nested if expressions correctly is not easy! In general, we want each ‘else’ block to match up with the innermost preceding ‘if’, but it turns out that it’s easy to create a grammar for if statements that is ambiguous (in other words, two completely different trees could be generated by the same grammar, each of which would have a different meaning at run time). This complication is called the Dangling Else Problem.
I’m coding my parser manually, which is unusual. Most languages use parser-generators that take a grammar as input and generate code for you that does a lot of the parsing work. I prefer doing it manually because there is a much smaller learning curve involved, and I like that I can fit it all in my head at once (the real reason is that this is a goofy side project of no import, and it’s just plain entertaining to reinvent this wheel).
For those who rely on a third-party tool, the dangling else problem is a bigger deal. Depending on the tool, one might solve the problem by adding grammar rules that remove any ambiguity, which can make the grammar much harder to understand at a glance, or by asking the parser generator to detect ambiguity and resolve it with a particular heuristic.
This is a much smaller problem when you use PEG grammars, even when using a parser generator, because the slash symbol / implies a priority and backtracking that other notations do not. For the PEG rule e1 := e2 / e3, the parser generator is supposed to first attempt parsing e2. If e2 fails, then it backtracks and attempts e3 before deciding whether the rule has actually failed. That backtracking is what lets us fix the grammar without as much complexity as other notations. Even though I’m writing the parser manually, using the PEG notation still simplifies the problem. Knowing what a general-purpose PEG parser generator would do (try-and-backtrack), I can write a manual parser that is also unambiguous, without actually having to waste time with backtracking. The best way to illustrate that is by showing the new grammar as well as the parsing method which implements the new IfExpression rule.
Grammar 1
Expression := EqualityExpression
EqualityExpression := RelationalExpression (('==' / '!=') RelationalExpression)*
RelationalExpression := AdditiveExpression (('<' / '<=' / '>' / '>=') AdditiveExpression)*
AdditiveExpression := MultiplicativeExpression (('+' / '-') MultiplicativeExpression)*
MultiplicativeExpression := Value (('*' / '/') Value)*
Value := integer-literal / '(' Expression ')'
Grammar 2
Expression := IfExpression
IfExpression := 'if' Expression Expression 'else' Expression
/ 'if' Expression Expression
/ OrExpression
OrExpression := AndExpression ('||' AndExpression)*
AndExpression := EqualityExpression ('&&' EqualityExpression)*
EqualityExpression := RelationalExpression (('==' / '!=') RelationalExpression)*
RelationalExpression := AdditiveExpression (('<' / '<=' / '>'/ '>=') AdditiveExpression)*
AdditiveExpression := MultiplicativeExpression (('+' / '-') MultiplicativeExpression)*
MultiplicativeExpression := Value (('*' / '/') Value)*
Value := boolean-literal / integer-literal / '(' Expression ')'
The IfExpression rule is adapted from Bryan Ford’s work on PEGs,
in particular this slide show about PEGs (PDF).
This rule works because the ‘else’ version is supposed to be attempted first; we’re supposed to be gluttonous when it comes to eating ‘else’ blocks, which gives us Lioi Life Lesson #1:
Lioi Life Lesson #1: Gluttony is bad for your health, unless you’re talking about a solution to the dangling else problem or Mallomars.*
*Consult a physician before following any Lioi Life Lesson.
In this particular case, the recursive methods that correspond with each rule can benefit from the test-and-backtrack approach without actually having to backtrack. If the next token is not ‘if’, then we know that the 3rd part of the rule is in effect, and defer to the OrExpression rule. If instead the next token is ‘if’, we know we have to parse two Expressions (condition and when-true). After that, we can peek ahead one token, to see if ‘else’ is present. If so, we know we’re in the first part of the rule, which has a higher priority. If ‘else’ isn’t present, then we’re done. We don’t explicitly backtrack, and we don’t waste time traversing paths that will ultimately fail, but we still benefit from the way that PEG grammars avoid ambiguity.
private AstNode IfExpression()
{
if (_tokenizer.Peek("if"))
{
_tokenizer.Consume("if");
AstNode condition = ParseExpression();
AstNode whenTrue = ParseExpression();
if (_tokenizer.Peek("else"))
{
_tokenizer.Consume("else");
AstNode whenFalse = ParseExpression();
return new IfNode(condition, whenTrue, whenFalse);
}
return new IfNode(condition, whenTrue);
}
return OrExpression();
}
All the theory in the world won’t give me enough confidence to trust this code on its own, so it was driven by a few tests. The first tests confirmed that the expected S-Expressions resulted from the simplest input if expressions, and then I followed those tests with a test that had a single nested if expression inside each block of an outer if expression. The ‘else’ blocks always matched up with the appropriate ‘if’.