Adding Type Inference to Rook

July 29th, 2007

Earlier posts on Rook showed that it was a dynamically typed language: the type of an object was not declared, and was not known until the last moment at runtime. Typing mistakes on the part of the programmer would be discovered at runtime. Although I like the brevity of dynamic languages like Python, I have also been interested in using type inference to get some of the benefits of dynamic typing along with the benefits of static typing. The latest version of Rook, which is not quite complete enough to be made public, uses the Hindley-Milner Type Inference Algorithm instead of dynamic typing. The basic idea is that all of the sample source code in previous posts still works unchanged, with the added benefit of having all types checked prior to running.

I found this paper to be the most accessible description of the Hindley-Milner algorithm: Basic Polymorphic Type Checking by Luca Cardelli. It gives a lot of background information in plain English, assumes little about the reader other than familiarity with language constructs similar to those in Scheme, and includes a working implementation of the algorithm. Some of the formal theory went a little over my head, but the remainder of the paper proved very useful.

The sample implementation was in Modula-2, and understandably it did not include tests that would translate directly into NUnit tests. Still, I wanted to make sure that as I translated the sample implementation into C#, I would have a large set of tests to back it up. First, I made a translation from Modula-2 to C#-flavored pseudocode. Then, I attacked that code in several small pieces: whenever a relatively small part of the pseudocode could be tested/developed in isolation, I would move that piece into C# with tests. This approach allowed me to chip away at the stone with confidence that I wasn’t missing anything in the translation. Another fortunate side effect of this approach is that by making myself writes tests for each small piece being translated, I gained a better understanding of how the type inference system actually works.

Filed under: TDD, Language Design

Spaghetti Stack Attack!

March 10th, 2007

When I started writing Rook in January, I had a general idea of how things would work. I’d start with a grammar, then I’d write a recursive descent parser to turn the input source code into an abstract syntax tree. These steps aren’t trivial, but they’re also not as tough as they sound. It’s easy to find examples of these techniques.

A few weeks back I was armed with this totally awesome syntax tree, and didn’t quite know what to do with it next. I knew it was the right tool for the job, but I wasn’t 100% sure what that job was. The trouble is, this is the point where most of my sources just plain had nothing to say about the next step. What I wanted to be able to do was take a reference to the root expression in my syntax tree and ask it, “What is your value, after you evaluate your own subexpressions?”

This is a tricky question to answer, since some expressions introduce new scope, and some variables in one scope may “shadow” variables in the parent scope. A special data structure called a Spaghetti Stack helps us keep track of all these variations in scope. I’ll get to the specifics of spaghetti stacks in a moment, but the bottom line is that this data structure lets you keep track of which variables are in scope at any particular point in the input source code. My particular spaghetti stack also stores the runtime value associated with each identifier in each scope.

Note to self: Data structures named after food are more likely to be remembered. This fact has been demonstrated time and time again: see Spaghetti Stacks & Hash Tables. Commence work on Hamburger Queue immediately.

Spaghetti Stack

A spaghetti stack is really just a plain old tree structure, but there are two special details worth noting. First, all of the references point “up” toward the root rather than the other way around. Second, all of these references are immutable; once a node is created it will always point at the same parent, until it gets garbage collected. The nifty result of these restrictions is that from the point of view of any one node, that node is the top of a singly-linked stack. For instance, the colored nodes in the example do form a stack, with the “top” of the stack located at the bottom of the tree. Nodes have no sense of gravity, so it doesn’t matter that it’s upside down: it’s a stack. To sum up, we have a collection of stacks that are sharing nodes whenever they can, and every node thinks that it is at the top of the stack.

In this diagram, we see the state of a spaghetti stack after evaluation of a program has started. The root node represents the global namespace, and contains things like primitive operator definitions (Rook treats operators as functions, so the operator symbols are just identifiers). Node 1 represents a new scope that introduces three local variables, such as a function with 3 parameters. Node 3 represents a scope that shadows the “x” in the parent scope, but can see the parent scope’s “y” unhindered. Some expressions that can provide new items in the new scope don’t always need to, such as a function with zero parameters, and this case results in uninteresting nodes like node 6. All in all, this is just a tree representation of namespaces, and we probably already build these in our heads while we read code, so it isn’t really anything new.

In Rook, each node in the spaghetti stack is an instance of a class named Environment, which contains a name/object dictionary and a reference to the parent Environment. When lookup attempts fail, we defer to the parent before deciding whether the identifier exists:


public object this[string identifier]
{
	get
	{
		if (_locals.ContainsKey(identifier))
			return _locals[identifier];

		if (_parent != null)
			return _parent[identifier];

		throw new UndefinedIdentifierException(identifier);
	}

	set
	{
		if (_locals.ContainsKey(identifier))
			throw new DuplicateIdentifierException(identifier);

		_locals[identifier] = value;
	}
}

Each node in the accompanying abstract syntax tree is an Expression whose value depends on an Environment instance:

public interface Expression
{
	object Value(Environment environment);
}

To run a Rook program, we just pass the root environment to the Value method of the root expression. We let the Value method “push” a new Environment whenever the expression node introduces new scope. Simple expression types like integer and boolean literals just ignore the Environment, since they can determine their own values without it:


//IntegerLiteral Expressions
//Example: 1234
public override object Value(Environment environment)
{
	return _value;
}

For an Identifier expression, the value of the expression is the value found by looking up the identifier in the Environment’s dictionary:


//Identifier Expressions:
//Example: x
public override object Value(Environment environment)
{
	return environment[_identifier];
}

For If expressions, we also don’t need to introduce new scope. Rather, we just pass the given Environment along to each subexpression:


//If Expressions:
//Example: if (x>1) y else z
public override object Value(Environment environment)
{
	if ((bool)_condition.Value(environment))
		return _whenTrue.Value(environment);

	if (_whenFalse == null)
		return null;

	return _whenFalse.Value(environment);
}

Compound expressions introduce new scope, so they “push” a new Environment onto the stack, add the identifiers for each new variable binding, and then use that new Environment to evaluate subexpressions:


//Compound Expressions:
//Example: { let a = 5; def square(x) x*x; square(a); }
public override object Value(Environment environment)
{
	Environment withBindings = new Environment(environment);

	foreach (Binding binding in _bindings)
		withBindings[binding.Identifier] = binding.Value(withBindings);

	object result = null;

	foreach (Expression expression in _expressions)
		result = expression.Value(withBindings);

	return result;
}

We never explicitly pop from one of these stacks. Each implementation of Value may or may not push a new Environment onto a stack, and will use that Environment, but the caller never gains a direct reference to it. We let the garbage collector do our “pops” for us. In fact, there are cases where we will deliberately preserve some references to these otherwise-temporary Environments, preventing them from being garbage collected. This makes implementing closures painfully simple, but that is a discussion for another day:


//Lambda Expressions:
//Example: fn(x, y) x*y
public override object Value(Environment environment)
{
	return new Closure(this, environment);
}
Filed under: Language Design

(Re)Introducing Rook

March 5th, 2007

Sample Rook Session
For the past few weeks, I’ve been documenting my progress on the development of a new programming language. It is mostly an experiment to see whether or not I could do it, rather than an attempt to actually gain widespread use. Also, it’s giving me an opportunity to learn more about how functional languages work. In fact, I’m not even releasing the source code yet. I think it’ll be worth releasing after a few more key features are implemented. Still, I’ve made a lot of progress in the past week, and it’s worth presenting the first draft of the language’s documentation as well as some examples.

It’s called Rook. If Guido van Rossum can name a language after his favorite comedy troupe, I can name mine after my favorite member of the passerine order of birds.

In this post I’ll just dive right in with some examples. If something seems fishy, I’d recommend skimming the documentation linked to above. The bottom line is that it behaves a lot like a subset of Scheme, but its syntax is supposed to be more familiar to people who use C-derived languages. I’ve even got a little editor with an interactive REPL based loosely on DrScheme.

In the example screenshot, I start by defining a recursive function (factorial) as well as two mutually-recursive functions (even and odd). The ‘def’ expressions define functions. Rook is dynamically typed, so if we tried to pass a boolean value to the factorial function, it wouldn’t let you know something was wrong until it actually tried to evaluate the function body. Also, the identifiers inside method bodies aren’t checked until they are actually used, so we can define even(n) before odd(n) actually exists, and all is well. By being lazy about these checks, recursion works with no need for special-case code in the interpreter.

Next, the screenshot shows a function named apply, which takes in a function and two additional arguments, and calls that function with those arguments. first(a,b) and second(a,b) return the respective argument, ignoring the other one. In the REPL, you can see that apply(first, 1, 2) evaluates to 1 and apply(second, 1, 2) evaluates to 2. Also, note that we can treat operators as identifiers, passing in the plus operator so that apply((+), 1, 2) evaluates to 3.

The more exhaustive documentation linked to above goes into more about scope, nested scope, local variables, closures, and the like, but more examples won’t be very interesting until I have support for some collection data types like linked lists.

Filed under: Language Design

Grammar 4: Statements & Function Calls

February 14th, 2007

This is part of a series of articles that started with Test-Driving a Compiler.

Grammar 3 introduced function definitions. Since then, I’ve introduced some unary operators, statements, and function call expressions. First, I added the unary-minus and unary-not operators with a higher priority than multiplication, using tests to confirm the basic parsing routines as well as the relative precedence compared to multiplication/division and parenthetical expressions:

MultiplicativeExpression := UnaryExpression (('*' / '/') UnaryExpression)*
UnaryExpression := ('-' / '!')? Value

Next, I introduced function calls. I expect to have higher-order functions, so function calls can be made against an identifier, as well as parenthetical expressions:

FunctionName(arg1, arg2);
(SomeExpressionThatIsCallable)(arg1, arg2);

To deal with the complexity of the grammar for these cases, I took advantage of the fact that PEG grammars have ordered sub-rules that allow backtracking, as I already discussed for Grammar 2:

Value := boolean-literal
       / integer-literal
       / '(' Expression ')' '(' ArgumentList ')'
       / '(' Expression ')'
       / identifier '(' ArgumentList ')'
       / identifier

Function definitions are already set up for a special shorthand: whenever the entire body is an expression whose value should be returned, curly-braces and an explicit ‘return’ can be omitted. Now, I added the more common scenario for multi-statement function bodies.

Function := identifier '(' ParameterList ')' &'{' CompoundStatement
          / identifier '(' ParameterList ')' Expression

Lastly, I introduced the usual statements available in imperative languages. Here’s the complete grammar to date:

Program := Function*
Function := identifier '(' ParameterList ')' &'{' CompoundStatement
          / identifier '(' ParameterList ')' Expression
ParameterList := &')' / Parameter (',' Parameter)*
Parameter := identifier identifier?

Statement := &'{' CompoundStatement
           / &'if' IfStatement
           / &'for' ForStatement
           / &'while' WhileStatement
           / &'return' ReturnStatement
           / &'yield' YieldStatement
           / identifier '=' Expression ';'
           / Expression ';'
CompoundStatement := '{' Statement* '}'
IfStatement := 'if' '(' Expression ')' Statement ('else' Statement)?
ForStatement := 'for' '(' identifier 'in' Expression ')' Statement
WhileStatement := 'while' '(' Expression ')' Statement
ReturnStatement := 'return' (Expression)? ';'
YieldStatement := 'yield' Expression ';'

Expression := IfExpression
IfExpression := 'if' '(' Expression ')' Expression 'else' Expression
              / OrExpression
OrExpression := AndExpression ('||' AndExpression)*
AndExpression := EqualityExpression ('&&' EqualityExpression)*
EqualityExpression := RelationalExpression (('==' / '!=') RelationalExpression)*
RelationalExpression := AdditiveExpression (('<' / '<=' / '>' / '>=') AdditiveExpression)*
AdditiveExpression := MultiplicativeExpression (('+' / '-') MultiplicativeExpression)*
MultiplicativeExpression := UnaryExpression (('*' / '/') UnaryExpression)*
UnaryExpression := ('-' / '!')? Value
Value := boolean-literal
       / integer-literal
       / '(' Expression ')' '(' ArgumentList ')'
       / '(' Expression ')'
       / identifier '(' ArgumentList ')'
       / identifier
ArgumentList := &')' / Expression (',' Expression)*

I’ve noticed a pattern while developing each of these changes to the grammar. Usually, when I decide on a single rule change or rule addition, that change suggests a change to the tokenizer, such as recognizing a new keyword or symbol. The new rule often suggests a new AST node type as well. So, I have been tending toward the following cycle:

  1. Write a test for the new tokens to be recognized.
  2. Update the tokenizer to pass the test (this has become a smaller and smaller task and is now trivial).
  3. Write a stub implemention of the new AST node type, as well as a test for the new class, asserting that ToString yields an expected S-Expression. The stub ToString returns null so that the test initially fails.
  4. Flesh out the AST node type to pass the test.
  5. Write a test for the Recursive Descent Parser class, for the new grammar rule.
  6. Change the RDP to match the grammar and pass the test.

For nontrivial rules, or rules that need to be composed with other rules, I’ll repeat the last two steps until I’m satisfied that all the relevant variations have been tested.

Filed under: TDD, Language Design

Grammar 3: Function Definitons

February 7th, 2007

This is part of a series of articles that started with Test-Driving a Compiler.

Grammar 3 introduces functions, identifiers, and programs composed of multiple function definitions. Here’s the grammar, with changes from Grammar 2 in bold:

Program := Function*
Function := identifier '(' ParameterList ')' Expression
ParameterList := &')'
                  / Parameter (',' Parameter)*
Parameter := identifier identifier?
Expression := IfExpression
IfExpression := 'if' Expression Expression 'else' Expression
              / 'if' Expression Expression
              / OrExpression
OrExpression := AndExpression ('||' AndExpression)*
AndExpression := EqualityExpression ('&&' EqualityExpression)*
EqualityExpression := RelationalExpression (('==' / '!=') RelationalExpression)*
RelationalExpression := AdditiveExpression (('<' / '<=' / '>' / '>=') AdditiveExpression)*
AdditiveExpression := MultiplicativeExpression (('+' / '-') MultiplicativeExpression)*
MultiplicativeExpression := Value (('*' / '/') Value)*
Value := boolean-literal / integer-literal / identifier / '(' Expression ')'

In English (more or less), this says that a program is zero or more functions. A function is an identifier, a possibly-empty parameter list, and a body composed of a single expression. Items in the parameter list are each specified by one or two identifiers (a name, or a name with an explicit type identifier).

My intention is to use type inference with optional explicit type declarations, so that’s why parameters don’t require the explicit type identifier. Also, since the entire function body is just one expression, I don’t need curly braces { } to mark the start and end, and I don’t need an explicit ‘return’ keyword. Multi-statement functions with curly braces and return statements will likely be included later, but I like having a shorthand for one-liner functions.

In the ParameterList grammar rule, I start by testing &')'. This is PEG notation for peeking ahead. &')' will match when the next token is a ‘)’, and will fail otherwise. Upon failure, we try the next part of the rule. Whether the test for ‘)’ passes or fails, none of the input is actually consumed by the test. So, when a parameter list is empty, this lets us avoid looking for any parameters while protecting the fact that the Function rule will expect to consume ‘)’. As I’ve stated before, I have one parsing method per grammar rule. In the implementation of ParameterList, I perform the &')' part by calling the Peek method, which returns true if the next token matches the given string, without consuming any of the input:

// ParameterList := &')' / Parameter (',' Parameter)*
private ParameterListNode ParameterList()
{
	List<ParameterNode> parameters = new List<ParameterNode>();

	if (_tokenizer.Peek(")"))
		return new ParameterListNode(parameters);

	parameters.Add(Parameter());
	while (_tokenizer.TryConsume(","))
		parameters.Add(Parameter());

	return new ParameterListNode(parameters);
}

Filed under: Language Design
Next Page »

Powered by WordPress