## Writing interpreters and compilers

Writing an interpreter or a compiler is one of the most educational tasks in programming because you can become familiarized with the details of the code interpretation and evaluation process. You can obtain much deeper knowledge of what sorts of things are going on behind the scenes and gain some insights into the decisions behind language design.

This article will perform a basic overview of this process by showing how to write an interpreter for a simple language that we can use for a calculator application.

This article assumes that you have intermediate programming experience and basic familiarity with Javascript.

## Some Background

### The difference between interpreters, compilers and transpilers

In this article, we will be creating an *interpreter*, as opposed to a *compiler.* Our program will take some code as input and immediately execute it.

If we were writing a compiler, we would instead transform the inputted code in the source language into code in some lower-level target language, such as MSIL, or an assembly language, or even machine code.

A transpiler is similar to a compiler, except that the source language and target language are about the same level of abstraction. The canonical example of this in the client side web programming world is CoffeeScript, which transcompiles into Javascript.

### The traditional language code interpretation process

Most compilers will interpret code in a multi-step process that involves processes that include lexical analysis, preprocessing, parsing, optimization, and finally code generation, or, in the case of an interpreter, evaluation.

In this article, we will only focus on lexing, parsing, and evaluation.

#### Lexing

A lexer takes a input a string of characters, and outputs a list of tokens. For example, if we had some code like this

(12 + 4) / 6

the lexer would divide it up into the individual parts, called tokens, and output a list that might look something like this

[
["operator", "("],
["number", 12],
["operator", "+"],
["number", 4],
["operator", ")"],
["operator", "/"],
["number", 6]
]

This process is usually relatively simple. Somebody with a little bit of experience with string manipulation can work their way through building a lexer fairly quickly.

#### Parsing

The parser takes the list of tokens produced by the lexer as input, parses it according to some syntax rules, and outputs a representation of the syntactic structure called a parse tree.

{
operation: "/",
left: {
operation: "+",
left: 12,
right: 4
}
right: 6
}

This tends to be the most difficult part of the process.

#### Evalutation

The evaluator takes the parse tree produced by the parser and evaluates it. Evaluators usually traverse the parse tree recursively. Given the syntax tree above, the evaluator might first evaluate the left operand of the top-level `/`

operation, then the right operand, and then return the result of the division.

Evaluating the left operand would simplify the syntax tree to this

{
operation: "/",
left: 16,
right: 6
}

which in turn will evaluate to the final result

2.6666666666666665

## AEL: the calculator language

Before we can start writing our interpreter, we need to understand the language that we will be interpreting, which I just made up and will refer to as AEL, short for Arithmetic Expression Language.

The language is written as a series of arithmetic expressions composed of numbers and the arithmetic operators `+`

(addition), `-`

(subtraction and negation), `*`

(multiplication), `/`

(division), `%`

(modulo), `^`

(exponentiation), as well as parentheses `()`

for grouping.

The interpreter will evalutate each expression and print the result of each in the order that they were presented. For example, given the following input,

3
2 ^ 8
(12 % 7) * (3 + 2)
19 / -9

the interpreter will output

3
256
25
-2.111111111111111

AEL also will allow you to define variable assignments using the `=`

operator. The left hand side will be some identifier, and the right hand side will be an arithmetic expression. A statement with an assignment has no return value, so the interpreter will not print out a corresponding line.

For example,

hoursPerDay = 24
minutesPerHour = 60
minutesPerDay = minutesPerHour * hoursPerDay
minutesPerDay
minutesPerDay * 60

outputs

1440
86400

AEL will have two pre-defined variables: `pi`

and `e`

, which will correspond to the values `3.141592653589793`

and `2.718281828459045`

, respectively.

Finally, AEL will allow you to define functions. A function assignment also uses the `=`

operator, but the left hand argument must be an identifier followed by parentheses, which will contain zero or more argument names separated by commas. Once defined, a function can be called by writing an identifier name followed by parentheses containing zero or more arithmetic expressions separated by commas.

For example,

toDegrees(radians) = radians * 180 / pi
toDegrees(2 * pi)
cylinderVolume(r, h) = pi * r ^ 2 * h
cylinderVolume(2, 4)

outputs

360
50.26548245743669

Now that we know what the language is like, we can start writing the interpreter.

For reference, I have put an implementation of an AEL interpreter online.

## Step 1: The Lexer

The lexer takes text input and returns a list of tokens. The skeleton of our `lex`

function thus should look like this

var lex = function (input) {
var tokens = [];
return tokens;
};

We have several different types of tokens. There are operators, which are defned by the set of characters `+-*/^%=(),`

there are numbers composed of numeral digits, there is whitespace, which our lexer will ignore, and there are identifiers, which will be defined as any string of characters that does not contain operators, digits, or whitespace.

var isOperator = function (c) { return /[+\-*\/\^%=(),]/.test(c); },
isDigit = function (c) { return /[0-9]/.test(c); },
isWhiteSpace = function (c) { return /\s/.test(c); },
isIdentifier = function (c) { return typeof c === "string" && !isOperator(c) && !isDigit(c) && !isWhiteSpace(c); };

We will create a simple loop to scan the input text. We will have a variable `c`

which corresponds to the current character, create a function called `advance `

to step forward one character in the input and return the next character, and a function called `addToken`

which appends a token to the list of tokens.

var tokens = [], c, i = 0;
var advance = function () { return c = input[++i]; };
var addToken = function (type, value) {
tokens.push({
type: type,
value: value
});
};
while (i < input.length) {
c = input[i];
}

If `c`

is whitespace, we just ignore it and move on. If `c`

is an operator, add an operator token to the list and move on

if (isWhiteSpace(c)) advance();
else if (isOperator(c)) {
addToken(c);
advance();
}

For simplicity, we will define a number as a series of digits, optionally followed by a decimal point and another series of digits.

else if (isDigit(c)) {
var num = c;
while (isDigit(advance())) num += c;
if (c === ".") {
do num += c; while (isDigit(advance()));
}
num = parseFloat(num);
if (!isFinite(num)) throw "Number is too large or too small for a 64-bit double.";
addToken("number", num);
}

Every other character is an identifier.

else if (isIdentifier(c)) {
var idn = c;
while (isIdentifier(advance())) idn += c;
addToken("identifier", idn);
}

And, just in case something weird gets thrown at us

else throw "Unrecognized token.";

After we're done scanning the input, we'll add a token to demarcate the end. Then we're done.

addToken("(end)");
return tokens;

## Step 2: The Parser

There are lots of different strategies for writing a parser. One of the most common is writing a Backus-Naur grammar and using recursive descent. In this article, we will be using a somewhat less common strategy called operator-precedence parsing, using techiques described Douglas Crockford's Top Down Operator Precedence article.

Operator precedence parsing begins with the following process

- Associate every operational token with a left binding power, and an operational function.
- If the operator manipulates tokens to its left (such as
`+`

), associate it with a left denotative function (hereafter abbreviated as `led`

). If the operator does not manipulate the tokens on its left (such as the unary `-`

), associate it with a null denotative function (hereafter abbreviated as `nud`

). Identifiers and numbers also have a `nud`

function associated with them.

After this is done, it can start to generate the parse tree. To show how this works, I'll use the arithmetic expresion below as an example.

12 + 4 / 6

The parser function starts by executing the `nud`

of the first token (`12`

), which returns itself.

{ type: "number", value: 12 }

This become the left hand side that we pass into the `led`

of the next token (`+`

), which will recursively call the parser function to parse the remaining tokens.

{
type: "+",
left: { type: "number", value: 12 },
right: }

We start over by calling the `nud`

of the first of the remaining tokens (`4`

), which will return itself.

{ type: "number", value: 4 }

This becomes the left hand side that we pass into the led of the next token (/), which will recursively call the parser function to parse the remaining tokens.

{
type: "+",
left: { type: "number", value: 12 },
right: {
type: "/"
left: { type: "number", value: 4 },
right: }
}

We call the `nud`

of the first of the remaining tokens (`6`

), which will return itself. This is the end of the list of tokens, so the parse tree is complete.

{
type: "+",
left: { type: "number", value: 12 },
right: {
type: "/"
left: { type: "number", value: 4 },
right: { type: "number", value: 6 }
}
}

If we now switch the `+`

and `/`

, we will see how binding power comes into play. `/`

has a higher binding power than `+`

, so it will bind more tightly to the operands around it.

12 / 4 + 6

We start out the same way as before

{
type: "/",
left: { type: "number", value: 12 },
right: }

But this time, after we call the `nud`

of the `4`

, we will look ahead one token, and see that `+`

has lower precedence than the `/`

. This means that we will stick the 4 in the right hand side, and then pass the entire current tree into the led of `+`

, which makes us end up with this parse tree:

{
type: "+",
left: {
type: "/",
left: { type: "number", value: 12 },
right: { type: "number", value: 4 }
},
right: { type: "number", value: 6 },
}

Now that we have some understanding of the parsing process, we can begin writing our parser. The parser accepts the tokens that the lexer produced, and returns a parse tree, so the skeleton of our `parse `

function will look like this:

var parse = function (tokens) {
var parseTree = [];
return parseTree;
};

We need to have some sort of symbol table that associates symbol with a binding power, nud or led, and we need a function that will associate a token with the corresponding symbol.

var symbols = {},
symbol = function (id, nud, lbp, led) {
var sym = symbols[id] || {};
symbols[id] = {
lbp: sym.lbp || lbp,
nud: sym.nud || nud,
led: sym.lef || led
};
};
var interpretToken = function (token) {
var sym = Object.create(symbols[token.type]);
sym.type = token.type;
sym.value = token.value;
return sym;
};

We need some way to see what the current token is and a way to advance to the next token.

var i = 0, token = function () { return interpretToken(tokens[i]); };
var advance = function () { i++; return token(); };

Now we can write an expression function which will generate the parse tree of an expression according the way it was described above

var expression = function (rbp) {
var left, t = token();
advance();
if (!t.nud) throw "Unexpected token: " + t.type;
left = t.nud(t);
while (rbp < token().lbp) {
t = token();
advance();
if (!t.led) throw "Unexpected token: " + t.type;
left = t.led(left);
}
return left;
};

We can now create the infix and prefix functions, which we will be able to use to define operators

var infix = function (id, lbp, rbp, led) {
rbp = rbp || lbp;
symbol(id, null, lbp, led || function (left) {
return {
type: id,
left: left,
right: expression(rbp)
};
});
},
prefix = function (id, rbp) {
symbol(id, function () {
return {
type: id,
right: expression(rbp)
};
});
};

Now we can define all of arithmetic operators declaratively. Note that the exponentiation operator (`^`

) has a right binding power that is smaller than its left binding power. This is because exponentiation is right associative.

prefix("-", 7);
infix("^", 6, 5);
infix("*", 4);
infix("/", 4);
infix("%", 4);
infix("+", 3);
infix("-", 3);

There are some operators that just exist as separation or end demarcation. They are not prefixes or infixes so it is sufficient to add them to the symbol table.

symbol(",");
symbol(")");
symbol("(end)");

The parentheses `nud`

just returns what is inside of it, and the number `nud`

just returns itself.

symbol("(", function () {
value = expression(2);
if (token().type !== ")") throw "Expected closing parenthesis ')'";
advance();
return value;
});
symbol("number", function (number) {
return number;
});

The identifier `nud `

and the `=`

operator's `nud`

are more complicated because they have context-dependent behaviour.

symbol("identifier", function (name) {
if (token().type === "(") {
var args = [];
if (tokens[i + 1].type === ")") advance();
else {
do {
advance();
args.push(expression(2));
} while (token().type === ",");
if (token().type !== ")") throw "Expected closing parenthesis ')'";
}
advance();
return {
type: "call",
args: args,
name: name.value
};
}
return name;
});
infix("=", 1, 2, function (left) {
if (left.type === "call") {
for (var i = 0; i < left.args.length; i++) {
if (left.args[i].type !== "identifier") throw "Invalid argument name";
}
return {
type: "function",
name: left.name,
args: left.args,
value: expression(2)
};
} else if (left.type === "identifier") {
return {
type: "assign",
name: left.value,
value: expression(2)
};
}
else throw "Invalid lvalue";
});

Now that all the hard stuff is out of the way, we can just put on the finishing glue.

var parseTree = [];
while (token().type !== "(end)") {
parseTree.push(expression(0));
}
return parseTree;

## Step 3: The evaluator

The evaluator accepts the parse tree generated by the parser, evaluates each item, and produces the evaluated output. The skeleton of our evaluate function will look like this.

var evaluate = function (parseTree) {
var parseNode = function (node) {
};
var output = "";
for (var i = 0; i < parseTree.length; i++) {
var value = parseNode(parseTree[i]);
if (typeof value !== "undefined") output += value + "\n";
}
return output;
};

It is straightforward to define the behavior of the operators and all of the predefined variables and functions

var operators = {
"+": function(a, b) {
return a + b;
},
"-": function(a, b) {
if (typeof b === "undefined") return -a;
return a - b;
},
"*": function(a, b) {
return a * b;
},
"/": function(a, b) {
return a / b;
},
"%": function(a, b) {
return a % b;
},
"^": function(a, b) {
return Math.pow(a, b);
}
};
var variables = {
pi: Math.PI,
e: Math.E
};
var functions = {
sin: Math.sin,
cos: Math.cos,
tan: Math.cos,
asin: Math.asin,
acos: Math.acos,
atan: Math.atan,
abs: Math.abs,
round: Math.round,
ceil: Math.ceil,
floor: Math.floor,
log: Math.log,
exp: Math.exp,
sqrt: Math.sqrt,
max: Math.max,
min: Math.min,
random: Math.random
};
var args = {};

Now we can put in the meat of our evaluator, and we're done. The `parseNode`

function recursively traverses and evaluates the parse tree.

var parseNode = function (node) {
if (node.type === "number") return node.value;
else if (operators[node.type]) {
if (node.left) return operators[node.type](parseNode(node.left), parseNode(node.right));
return operators[node.type](parseNode(node.right));
}
else if (node.type === "identifier") {
var value = args.hasOwnProperty(node.value) ? args[node.value] : variables[node.value];
if (typeof value === "undefined") throw node.value + " is undefined";
return value;
}
else if (node.type === "assign") {
variables[node.name] = parseNode(node.value);
}
else if (node.type === "call") {
for (var i = 0; i < node.args.length; i++) node.args[i] = parseNode(node.args[i]);
return functions[node.name].apply(null, node.args);
}
else if (node.type === "function") {
functions[node.name] = function () {
for (var i = 0; i < node.args.length; i++) {
args[node.args[i].value] = arguments[i];
}
var ret = parseNode(node.value);
args = {};
return ret;
};
}
};

## Putting it all together

Now we can wrap it all up and put the thee pieces together in one single `calculate `

function.

var calculate = function (input) {
try {
var tokens = lex(input);
var parseTree = parse(tokens);
var output = evaluate(parseTree);
return output;
} catch (e) {
return e;
}
};

## What to do next

There are many things you could add to the language to make it more useful or just to see how things work. Some additions would be fairly easy, some could be very hard.

### Easy stuff

- Play around with adding your own predefined functions.
- Fiddle with the binding power numbers to see how that changes the way expressions are evaluated.
- Allow scientific notation in number literals or make it possible to use binary or hexidecimal number literals.

### Intermediate stuff

- Allow manipulation of more types of data, such as strings, booleans and arrays.
- Allow functions to have multiple statements and conditionally return values
- Replace the evaluator with a compiler or transpiler that takes the parse tree and outputs code in another language
- Make a package manager that allows you to import code from other files

### Harder stuff

- Implement optimizations to allow calculations to be performed more quickly.
- Make function first-class citizens and allow closures
- Make it so that all the data is evaluated lazily

If you know how to do some of this stuff, you are prepared to start making interpreters and compilers for your own domain specific languages.

##
That's all, folks!

If there is an error in this article or you see some way to make either the code or the explanations of the code easier to understand, let me know, and I can update it accordingly.