
The long journey from character to AST

By madflame, 16 Jan 2013

Introduction

In this tutorial you'll learn about the intermediate steps most compilers take to transform programs given as raw source code into more manageable data structures, like abstract syntax trees. This tutorial comes together with the source code of a parser. If you want a sneak peek at the final product that we're going to build here, you can either download the source code, or you can try it online here. Its basic structure is the same as that of most compilers. To follow this article easily, you'll need basic knowledge of string manipulation, trees (the data structure), and context-free grammars (preferably in EBNF notation).

The code is written in JavaScript because it's easy to prototype in, and you can find a browser everywhere that will run it. You shouldn't worry too much about the language choice: the code is written in such a way that it's trivial to port to any other modern object-oriented language.

First translation: Characters -> Tokens

Programs are entered by the user as a sequence of characters. Characters are all equal and bear absolutely no meaning on their own - they're just characters (OK, there are some languages with one-character instructions, but they're esoteric and not typically used for real work). We need to divide the string of characters into groups that together translate into something more meaningful. The basic building block so far is the character; after this phase we'll end up with tokens, which will be the basic building block for the next phase.

After the tokenization process we'll obtain, from a string that looks like this:

"fun(x) {return x-5;}"

a list of tokens which looks like this:

[identifier(fun), left-paren, identifier(x), right-paren, whitespace, left-curly, keyword(return), ...]

Characters are a basic data type and contain nothing besides their value. Tokens, however, contain information (be it numeric or string) AND some meta-information, like the type of the token.

In the string "/*nothing of interest*/" the characters "/*" are used to say "hey, what follows is a comment!" and then again characters are used to define the comment content itself: "nothing of interest". After converting this comment to a token we'll end up with a data structure of the form [type: comment, content (String): "nothing of interest"]. Note that "/*" and "*/" don't appear anywhere in this representation - they've only served as meta-information.

Content will obviously be of the type best suited to represent it, and not necessarily a string. Number tokens will have the structure [type: fixed-point-number, value (Integer): 3746]. This time a conversion was needed from the string "3746" to the number 3746, which is not very spectacular, but now think of numbers in hexadecimal notation, for which a conversion from "0x10C" to 268 involves a bit more effort.

Moreover, tokens usually have only one content/value container to store information, but you can come up with notations that require more, like complex numbers. Take for example "4r0.9i"; the token structure for this one would look like this: [type: complex-num, real-part (Float): 4, imaginary-part (Float): 0.9].
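To make the last few paragraphs concrete, here's a sketch of the three token shapes discussed above as plain JavaScript objects. The field names are illustrative choices, not part of any standard tokenizer format.

```javascript
// A comment token: the "/*" and "*/" markers are gone;
// only the type and the content survive.
var commentToken = { type: 'comment', content: 'nothing of interest' };

// A number token: the string "0x10C" is converted to its numeric value.
// parseInt with radix 16 does the "0x10C" -> 268 conversion for us.
var numberToken = { type: 'fixed-point-number', value: parseInt('10C', 16) };

// A complex-number token needs two value slots instead of one.
var complexToken = { type: 'complex-num', realPart: 4, imaginaryPart: 0.9 };
```

Note how the meta-information (`type`) travels alongside the converted value, while the purely notational characters have disappeared.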

It's important to note that in general there's a size reduction after tokenization. "123456789" uses 9 bytes, while 123456789 fits neatly in an Integer (only 4 bytes). Also, after tokenization we get rid of "/*" and other notations like that.

Once we tokenize our code we can then convert it back to its string representation (think of syntax highlighting).
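The round trip back to text can be sketched with the same simplified {type, value} tokens (an illustration, not the article's actual classes): the notation stripped during tokenization has to be re-added here.

```javascript
// Turn one token back into source text; stripped notation
// like '/*...*/' is reattached based on the token type.
function tokenToString(t) {
    switch (t.type) {
        case 'comment': return '/*' + t.value + '*/';
        case 'string':  return '"' + t.value + '"';
        default:        return String(t.value);
    }
}

function detokenize(tokens) {
    return tokens.map(tokenToString).join('');
}

var src = detokenize([
    { type: 'identifier', value: 'x' },
    { type: 'operator',   value: '=' },
    { type: 'number',     value: 5 },
    { type: 'comment',    value: 'five' }
]);
// src is "x=5/*five*/" - since whitespace tokens were discarded,
// this doubles as a very crude minifier
```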

A basic tokenizer (lexer) works like this:

1. Start by having the code as a string and an index, to keep track of how far we've gone in processing the source

2. Take a character

2.1 Is it a whitespace, tab, or newline character? Then either throw it away, as it usually has no significance, or add it to the token list.

2.2 Else: is it a letter? Then give control to the "letter-muncher" which will "munch" letters (and digits) and return the sequence of munched alphanumeric characters when it encounters something else. This sequence will be neatly wrapped as a "generic identifier" or "keyword". Obviously this subroutine will give control back to the generic character processor.

2.3 Else: is it a digit? Then give control to the number builder. This works in the same fashion as the "letter-muncher" described above, except it likes digits only, and optionally a dot followed by other digits (floating point numbers).

2.4 Else: is it a '"'? Then give control to the string builder, which will consume characters until another '"' is encountered. (It will apply a different treatment to a '"' preceded by a '\'.) If a newline character is encountered before the terminating '"', then throw an error.

2.5 Else: is it a '/' followed by a '*'? This means that this is a comment and it should be processed in the same manner as strings.

2.6 Else: is it any of these ['+', '-', '*', '=', ...] characters? Then try to find the longest string of characters that matches a supported operator. Why this "longest string of..."? Well, because you'd want ">=" to be translated to [Operator(greater-or-equal)] and not [Operator(greater), Operator(equal)]. Other examples include ">>=" and even ">>>=".

2.7 Else: is it anything else that the language supports (single line comments, multiline strings, regexes, numbers in hexadecimal notation)? Then handle it.

2.8 Else: throw an error! That's all we can do if we have checked all possibilities and found no matches.
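Step 2.6 - the "longest string of characters" rule - can be sketched like this. The operator list is an illustrative assumption; sorting it by length guarantees that ">=" and ">>>=" win over ">".

```javascript
// A toy operator set for illustration purposes.
var operators = ['>>>=', '>>=', '>=', '>', '==', '='];

// Pick the longest operator that matches at position pos in src.
function matchLongestOperator(src, pos) {
    var sorted = operators.slice().sort(function(a, b) {
        return b.length - a.length; //longest first
    });
    for (var i = 0; i < sorted.length; i++) {
        var op = sorted[i];
        if (src.substring(pos, pos + op.length) === op)
            return op;
    }
    return null; //no operator matches here
}
```

For example, `matchLongestOperator('a >>>= b', 2)` returns the single operator ">>>=" rather than four separate ">"-ish tokens.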

Still, "/*" seems parsable in two ways. Can you clarify a bit?

When the main character processor encounters a character that is a prefix to different possible tokens then it advances to some intermediary state and makes a decision after it analyses the next character. The particular case of "/*" can be treated as follows:

if current_char == '/' then
 if next_char == '*' then 
  token += get_multiline_comment()
 else 
  token += get_longest_matching_operator()
else
...

The test to determine whether the current character is the start of an operator must come after the test for the beginning of a comment. If it's handled before, "/*" will be labeled as [Operator(/), Operator(*)].

Note that this isn't the fastest way to tokenize stuff, since it requires peeking at future characters. The structure of this tokenizer is meant to be easy to understand and not insanely fast. The fastest tokenizers you can make involve a lot of gotos instead of "munchers".

Here's the source code for the tokenizer used for the parser we're building:

function Tokenizer() {

}

Tokenizer.chop = function(s) {
    //chop the string of characters into tokens
    var str = new IterableString(s + ' ');
    var tok = [];
 
    while(str.hasNext()) {
        var c = str.cur();
        
        //get an idea of what we're dealing with
        //and give control to specialized munchers
        if(Tokenizer.chop.isDigit(c))
            tok.push(Tokenizer.chop.munchNum(str));
        else if(Tokenizer.chop.isIdHeadChar(c))
            tok.push(Tokenizer.chop.munchAlphanum(str));
        else if(Tokenizer.chop.munchOperator.headList.indexOf(c) != -1)
            tok.push(Tokenizer.chop.munchOperator(str));
        else if(Tokenizer.chop.isWS(c))
            str.adv(); //ignore whitespace
        else throw 'Illegal character: ' + c;
    }
    
    tok.push(new End());
    
    return tok;
}

Tokenizer.chop.isWS = function(c) {
    return c == ' ' || c == '\n' || c == '\t';
}

Tokenizer.chop.isDigit = function(c) {
    return c >= '0' && c <= '9';
}

Tokenizer.chop.isLetter = function(c) {
    return (c >= 'A' && c <= 'Z') 
        || (c >= 'a' && c <= 'z');
}

Tokenizer.chop.isIdChar = function(c) {
    return c == '_' 
        || Tokenizer.chop.isLetter(c) 
        || Tokenizer.chop.isDigit(c);
}

Tokenizer.chop.isIdHeadChar = function(c) {
    return c == '_' || Tokenizer.chop.isLetter(c);
}

Tokenizer.chop.munchNum = function(str) {
    str.setMarker();

    //munch digits
    while(Tokenizer.chop.isDigit(str.cur()))
        str.adv();
        
    //and possibly one '.'
    if(str.cur() == '.') {
        str.adv();
    
        //and some more digits
        while(Tokenizer.chop.isDigit(str.cur()))
            str.adv();
    }
    
    return new Num(str.getMarked());
}

Tokenizer.chop.munchAlphanum = function(str) {
    str.setMarker();
    
    //munch alphanumeric characters
    while(Tokenizer.chop.isIdChar(str.cur()))
        str.adv();
    
    //alphanumeric stuff can only be variable identifiers here
    return new Identifier(str.getMarked());
}

Tokenizer.chop.munchOperator = function(str) {
    //create an alias
    var list = Tokenizer.chop.munchOperator.list;
    for(var i = 0; i < list.length; i++)
        if(str.match(list[i]))
            return new Operator(list[i]);
            
    throw 'Unknown symbol sequence';
}
    
//list of all operators 
Tokenizer.chop.munchOperator.list = ['+','-','*','/','(',')'];

//set of first character of every operator
//consequently these two lists match as we don't have lengthy operators
Tokenizer.chop.munchOperator.headList = ['+','-','*','/','(',')'];

We made use of "IterableString" which is just a wrapper for Strings to allow adding some needed operations.

function IterableString(s) {
    this.s = s;       //the wrapped string
    this.pointer = 0; //keeps track of our current position
    this.marker = 0;  //marks where the current token started (see getMarked)
}

IterableString.prototype.adv = function() {
    //advances the current position
    this.pointer++;
}

IterableString.prototype.setMarker = function() {
    //updates the marker
    this.marker = this.pointer;
}

IterableString.prototype.cur = function() {
    //get the current character
    return this.s.charAt(this.pointer);
}

IterableString.prototype.next = function() {
    //get the next character
    return this.s.charAt(this.pointer + 1);
}

IterableString.prototype.hasNext = function() {
    //are there any more characters?
    return this.pointer < this.s.length;
}

IterableString.prototype.getMarked = function() {
    //get the substring from the marker to the current position
    return this.s.substring(this.marker, this.pointer);
}

IterableString.prototype.match = function(str) {
    //checks if str matches s starting from the current position
    if(this.s.substring(this.pointer, this.pointer + str.length) == str) {
        this.pointer += str.length;
        return true;
    }
    return false;
}

IterableString.prototype.toString = function() {
    return 'IterableString(pointer: ' + this.pointer 
        + ', marker: ' + this.marker
        + ', content: "' + this.s + '")';
}

So, what can we do at this level?

1. We can identify some errors, like invalid tokens, or strings or comments not properly terminating.

2. We can offer SOME syntax highlighting and auto-indentation. Nicer highlighting or auto-indentation requires a deeper analysis of the code. We can't quite colour variables and functions differently, as all we know about them at this stage is that they're "identifiers". Likewise, we can't colour local fields differently from global fields, since the tokenizer has no idea what those notions are. Sure, we could prefix local vars with L and global vars with G, but that would make the language pretty ugly, with lots of Ls and Gs.

3. We can reformat our code by removing unnecessary whitespace (very basic code compressor/minifier).

4. We can rename identifiers. This will treat all variables of the same name alike, regardless of their scope, but it's still smarter than basic text replacement. With basic text replacement we might get nasty results: think of replacing "number" with "n" in "number = 'I like numbers'" - you'd get "n = 'I like ns'". But a smarter renaming tool that replaces Identifier(number) with Identifier(n) gives you what you wished for, namely "n = 'I like numbers'".
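A token-aware rename along these lines can be sketched as follows, using a simplified {type, value} token shape for illustration rather than the article's actual classes:

```javascript
// Rename every identifier token bearing the old name,
// leaving string contents (and all other tokens) untouched.
function renameIdentifier(tokens, oldName, newName) {
    return tokens.map(function(t) {
        if (t.type === 'identifier' && t.value === oldName)
            return { type: 'identifier', value: newName };
        return t;
    });
}

// "number = 'I like numbers'" after tokenization:
var tokens = [
    { type: 'identifier', value: 'number' },
    { type: 'operator',   value: '=' },
    { type: 'string',     value: 'I like numbers' }
];

var renamed = renameIdentifier(tokens, 'number', 'n');
// the identifier becomes 'n'; the string literal stays intact
```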

Second translation: Lexical preprocessing

Lexical preprocessing refers to operations that you can do on a program at the token level. The best example I can give is the C preprocessor with its #define directives. These allow replacement of tokens with sets of other tokens.

Preprocessing in C is neat and useful because it helps by creating abstractions. However, nowadays most modern languages are expressive enough to not need further abstractions, and most of them are cross-platform by design since they either compile to VM code or are interpreted. To quote Wood from Stack Overflow: "The (lexical) preprocessor is a cheap method to provide incomplete metaprogramming facilities to a language in an ugly fashion."
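The token-replacement idea behind #define can be sketched over the same simplified {type, value} tokens. This is nothing like a full C preprocessor (no macro arguments, no rescanning of the replacement) - just the core substitution:

```javascript
// defines maps a macro name to the list of tokens that replace it.
function preprocess(tokens, defines) {
    var out = [];
    for (var i = 0; i < tokens.length; i++) {
        var t = tokens[i];
        if (t.type === 'identifier' && defines.hasOwnProperty(t.value))
            out = out.concat(defines[t.value]); //substitute the macro body
        else
            out.push(t);
    }
    return out;
}

// Equivalent of: #define BUFSIZE 512
var defines = { BUFSIZE: [{ type: 'number', value: 512 }] };

var result = preprocess(
    [{ type: 'identifier', value: 'x' },
     { type: 'operator',   value: '=' },
     { type: 'identifier', value: 'BUFSIZE' }],
    defines);
// the BUFSIZE identifier is now the number token 512
```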

Note: some languages support "syntactic preprocessing" which performs operations on the syntax tree. This is a higher level type of preprocessor as it operates on a syntactic tree rather than on a list of tokens.

Third translation: Tokens -> AST

This one is more fun than the previous one as it finally gives a higher dimension to your source code. Up till now we've transformed a flat sequence of characters into a flat sequence of tokens. After that we've done some replacements in this sequence (with lexical preprocessing). And now, finally we're going to escape this flat realm and represent the program in a fancier way, as a tree - an Abstract Syntax Tree.

Another marvelous part of this higher level representation is that we'll get rid of a ton of tokens. When we converted from characters to tokens we got rid of notations ("/*", "*/", '"', "0x"). Now we're going to get rid of the tokens that merely describe the structure of the tree.

There's no need for parentheses of any kind in the AST. Parentheses, brackets, commas, and semicolons are used to group and separate stuff, but now the grouping is visible at first glance of an AST - that's what the higher dimensionality is about.

Building an AST

There are several ways of building an AST, but the one I'll cover here is called "recursive descent parsing".

Why recursive descent parsing?

Because it's the most intuitive and simple way of parsing. Hand-written recursive descent parsers (RDP from now on) are very easy to write and maintain. It's also very easy to give users meaningful error messages when they make a syntax error.

Let's now have a look at how we'd use an RDP for our toy language (which consists of simple arithmetic expressions). First, we need a nice* grammar for our arithmetic expressions (I'll come back to this "nice" measure later). I'll assume you're familiar with the EBNF notation.

Exp -> Term { ("+" | "-") Term }
Term -> Factor { ("*" | "/") Factor }
Factor -> "(" Exp ")" | Number | Identifier

And now here's the actual code used by the parser:

function RDP() {

}

RDP.tree = function(t) {
    var token = new TokenList(t);
    var tree = RDP.tree.exp(token);
    token.expect(new End(), 'RDP: expression not properly terminated');
    return tree;
}

RDP.tree.num = new Num();

RDP.tree.identifier = new Identifier();

RDP.tree.exp = function(token) {
    // Exp -> Term { ( '+' | '-' ) Term }
    var operand = [];
    var operator = [];
    
    operand.push(RDP.tree.term(token));
    while(token.matchAny([[new Operator('+')], [new Operator('-')]])) {
        operator.push(token.next());
        operand.push(RDP.tree.term(token));
    }
    
    //if there's only one operand then we might as well return it without wrapping it
    if(operand.length == 1) return operand[0];
    else return new Exp(operand, operator);
}

RDP.tree.term = function(token) {
    // Term -> Factor { ( '*' | '/' ) Factor }
    var operand = [];
    var operator = [];

    operand.push(RDP.tree.factor(token));
    while(token.matchAny([[new Operator('*')], [new Operator('/')]])) {
        operator.push(token.next());
        operand.push(RDP.tree.factor(token));
    }
    
    //if there's only one operand then we might as well return it without wrapping it
    if(operand.length == 1) return operand[0];
    else return new Term(operand, operator);
} 

RDP.tree.factor = function(token) {
    // Factor -> '(' Exp ')' | Value | Variable
    if(token.match([new Operator('(')])) {
        token.adv();
        var tmp = RDP.tree.exp(token);
        token.expect(new Operator(')'), 'RDP: Factor: missing ")"');
        return tmp;
    }
    else if(token.match([RDP.tree.num])) {
        return new Val(token.next().val);
    }
    else if(token.match([RDP.tree.identifier])) {
        return new Var(token.next().name);
    }
    else throw 'RDP: Factor: Expected either a number, an identifier or a "("';
}

This recursive descent parser makes use of a helper structure similar to IterableString. This time it's an "Iterable Token List" but I've named it just "TokenList".

function TokenList(token) {
    //token list
    this.token = token;
    this.pointer = 0;
}

TokenList.prototype.matchAny = function(token) {
    for(var i=0;i<token.length;i++)
        if(this.match(token[i]))
            return true;

    return false;
}

TokenList.prototype.match = function(token) {
    for(var i=0;i<token.length;i++)
        if(!(this.token[this.pointer + i].match(token[i])))
            return false;
            
    return true;
}

TokenList.prototype.expect = function(token, exMessage) {
    if(this.match([token])) this.adv();
    else throw exMessage;
}

TokenList.prototype.adv = function() {
    if(this.pointer >= this.token.length)
        throw 'TokenList: You\'ve not enough tokens!';
    
    this.pointer++;
}

TokenList.prototype.next = function() {
    this.adv();
    return this.token[this.pointer - 1];
}

TokenList.prototype.toString = function() {
    return 'TokenList(pointer: ' + this.pointer 
        + ', content: [' + this.token.join(', ') + '])';
}

Notice how the structure of the three parsing functions (exp, term, factor) reflects the structure of the grammar.

EBNF -> RDP, in general please

First let's identify the elements of the EBNF. We have 4 operators: concatenation (" "), alternation ("|"), optionality ("[ ... ]") and repetition ("{ ... }"). Below, I will give examples of how to obtain the code that does what each of these EBNF operators describe.

Concatenation

Here's a production rule that describes the syntax of a common if-then-else statement found in most languages.

if-expression -> "if" "(" condition ")" "then" statement-list "else" statement-list

The associated subroutine for parsing this would be:

function if_expression() {
  expect("if")
  expect("(")
  cond = condition()
  expect(")")
  expect("then")
  stmt1 = statement_list()
  expect("else")
  stmt2 = statement_list()
 
  return new if_expression_node(cond, stmt1, stmt2)
}

Alternation:

Here's another example - a pair of rules for parsing a condition:

condition -> expression relation expression
relation -> "<" | "==" | ">" | "!="

The associated subroutine for parsing this would be:

function condition() {
  exp1 = expression()
  rel = relation()
  exp2 = expression()
 
  return new condition_node(exp1, rel, exp2)
}

function relation() {
  if(match("<")) return lesser_relation
  else if(match("==")) return equal_relation
  else if(match(">")) return greater_relation
  else if(match("!=")) return notequal_relation
  else throw "Error: invalid relation symbol"
}

Repetition and Optionality

Consider the rule for defining a set of variables, some of which may be initialized.

var-declaration-list -> { var-declaration }
var-declaration -> "var" _IDENTIFIER ["=" _NUMBER] ";"

The associated subroutines for this grammar are:

function var_declaration_list() {
  list = []
  while(match("var"))
    list += var_declaration()
 
  return list
}

function var_declaration() {
  expect("var")
  id = get_current()
  advance()
  num = null
  if(match("=")) {
    advance()
    num = get_current()
    advance()
  }
  expect(";")
 
  return new var_declaration_node(id, num)
}

(We could formalize this a bit by defining some sort of morphism from (Nonterminals & Terminals, EBNF operators) to (subroutines, some set of program control statements and instructions).)

You can find a bigger example on Wikipedia. It doesn't contain code necessary to build the tree; it's just to verify the syntactic correctness of the code.

I also would like to mention the SEBNF notation. This simplifies the translation process from grammar to parser immensely, but with slight cost of adding more rules to the grammar. You can find more info with this project.

Can I build a working RDP for any grammar?
No. You'll quickly find this out when you start writing grammars of your own. You'll have to deal with 2 problems when defining grammars:

1. Infinite recursion
Take for example the grammar defined in the previous example. It's not the first way you'd probably think of when writing a grammar for arithmetic expressions. Let me show you a more intuitive grammar:

Exp -> Number | Variable
Exp -> Exp ( "+" | "-" | "*" | "/" ) Exp
Exp -> "(" Exp ")"

It's really easy to see how we got this grammar: an expression can be a number or a variable, or be rewritten as an operation between 2 other expressions, or it can be wrapped in a pair of '(', ')'. But this grammar has a big flaw: the second rule will win you a nice stack overflow. This happens because exp() will call exp() without munching any token.
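You can see the problem directly in code. A naive exp() for the left-recursive rule calls itself before consuming a single token, so the recursion never bottoms out - a sketch, with a depth counter standing in for the JS engine's real stack limit:

```javascript
// Naive parser for the rule: Exp -> Exp ("+" | "-" | "*" | "/") Exp
// exp() recurses immediately, without advancing in the input.
function exp(tokens, pos, depth) {
    if (depth > 1000) //stand-in for the engine's stack limit
        throw new Error('stack overflow');
    // to parse Exp op Exp we must first parse the left Exp...
    return exp(tokens, pos, depth + 1); //...and we're already stuck
}

function parses(tokens) {
    try {
        exp(tokens, 0, 0);
        return true;
    } catch (e) {
        return false; //blew the (simulated) stack
    }
}
// parses(anything) is always false: the grammar loops on every input
```

The grammar at the top of the article avoids this by making each level (Exp, Term, Factor) delegate downward, so a token is always consumed before any self-reference.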

2. Common prefixes
Sometimes you may need to look ahead a bit more. Let's take for example a simple language in which you can only declare variables or constants. A possible grammar for this would be:

declarations -> { variable-declaration | constant-declaration }
variable-declaration -> _IDENTIFIER "=" _NUMBER
constant-declaration -> "const" _IDENTIFIER "=" _NUMBER

And a sample "program" written in this language would be:

a = 0, const b = 10, const pi = 5.21, bar = 2

("a" and "bar" are variables and "b" and "pi" are constants)

The declarations() subroutine would parse declarations as long as it sees either an identifier (the first "terminal" produced by variable-declaration) or a "const" keyword (the first terminal produced by constant-declaration). So, declarations() needs to look only at the next token in order to determine if it should pass parsing control to variable_declaration() or to constant_declaration(). (Oh, and yeah, "const" is not a valid identifier because it would clash with the keyword bearing the same name).

Now let's make the RDP's task a bit harder. Consider a similar language, but in which you add the "const" keyword at the end of a declaration. A list of declared variables and constants would look like this:

a = 0, b = 10 const, pi = 5.21 const, bar = 2

(again "a" and "bar" are variables and "b" and "pi" are constants)

The associated grammar for this would be:

declarations -> { variable-declaration | constant-declaration }
variable-declaration -> _IDENTIFIER "=" _NUMBER
constant-declaration -> _IDENTIFIER "=" _NUMBER "const"

Now, the declarations() function cannot determine if it's dealing with a constant declaration just by looking at the current token from the input. And neither by looking at the second or third. Only by checking the fourth token can it determine if this was a constant declaration or not.

This situation is really uncommon - you can tell by how backwards this example looks. For simplicity's sake and for faster parsing, most languages in use have grammars with no common prefixes.
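With the simplified {type, value} tokens once more, the suffix-"const" grammar forces declarations() to peek four tokens ahead - here's a sketch of just that decision:

```javascript
// For: _IDENTIFIER "=" _NUMBER ["const"]
// Only the token AFTER the number (four tokens ahead) tells us
// whether this is a constant or a variable declaration.
function declarationKind(tokens, pos) {
    var fourth = tokens[pos + 3];
    if (fourth && fourth.type === 'keyword' && fourth.value === 'const')
        return 'constant-declaration';
    return 'variable-declaration';
}

// "b = 10 const" tokenized (separators omitted for brevity):
var constDecl = [
    { type: 'identifier', value: 'b' },
    { type: 'operator',   value: '=' },
    { type: 'number',     value: 10 },
    { type: 'keyword',    value: 'const' }
];

var varDecl = [
    { type: 'identifier', value: 'a' },
    { type: 'operator',   value: '=' },
    { type: 'number',     value: 0 }
];
```

Compare this with the prefix-"const" language, where a single-token peek suffices.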

Nearing the end

Turning back to ASTs...

Now that we have the AST we should do some semantic analysis on it. This means stuff like type checking and object binding. Since our language is so tiny it has only one type of data, namely numbers, and it, well, doesn't have anything to bind.

Once we've obtained an AST we can proceed by either:

1. doing complex code transformations and repacking everything back in the original language (think of refactoring)

2. translating the code into another same-level language (this is called transcompilation or source-to-source compilation). The best examples here are CoffeeScript -> JavaScript and Haxe -> various targets

3. translating the code into a lower level language (usually Assembly); this process is what we traditionally call compilation

4. interpreting the AST directly

This time, the latter option is preferred, as the other options deserve being covered more thoroughly in articles of their own.

The evaluation process is easy-peasy compared to what we dealt with so far. Once we have the AST we just have to traverse it and keep track of the partial results.

Here are 3 types of nodes with their evaluation functions:

function Exp(operand, operator) {
    this.operand = operand;
    this.operator = operator;
}

Exp.prototype.evaluate = function(env) {
    var acc = this.operand[0].evaluate(env);
    
    for(var i=1;i<this.operand.length;i++)
        if(this.operator[i-1].match(new Operator('+'))) acc += this.operand[i].evaluate(env);
        else acc -= this.operand[i].evaluate(env);
    
    return acc;
}

Exp contains a reference to one or more nodes with "+" or "-" between them. Val contains a single value.

function Val(val) {
    this.val = val;
}

Val.prototype.evaluate = function(env) {
    return this.val;
}

Var is a variable. The AST knows only its name. Its value is to be retrieved from memory (dubbed here the "environment") at runtime.

function Var(name) {
    this.name = name;
}

Var.prototype.evaluate = function(env) {
    if(!env.container.hasOwnProperty(this.name)) 
        throw 'Interpreter: ' + this.name + ' was not declared in the Environment';
    
    return env.container[this.name];
}
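Putting the three node types together: here's a hand-built AST for "x - 5" (the tree the RDP would produce for that input) evaluated in an environment where x is 12. The small Operator stand-in is an assumption - the article doesn't list its full source - but only a match() method is needed here.

```javascript
// Minimal stand-in for the Operator class used but not listed above.
function Operator(sym) { this.sym = sym; }
Operator.prototype.match = function(other) { return this.sym === other.sym; };

// The three node types, as described in the article.
function Exp(operand, operator) { this.operand = operand; this.operator = operator; }
Exp.prototype.evaluate = function(env) {
    var acc = this.operand[0].evaluate(env);
    for (var i = 1; i < this.operand.length; i++) {
        if (this.operator[i - 1].match(new Operator('+'))) acc += this.operand[i].evaluate(env);
        else acc -= this.operand[i].evaluate(env);
    }
    return acc;
};

function Val(val) { this.val = val; }
Val.prototype.evaluate = function(env) { return this.val; };

function Var(name) { this.name = name; }
Var.prototype.evaluate = function(env) {
    if (!env.container.hasOwnProperty(this.name))
        throw 'Interpreter: ' + this.name + ' was not declared in the Environment';
    return env.container[this.name];
};

// The AST for "x - 5", built by hand: one Exp node, no parentheses,
// no semicolons - just operands and the operator between them.
var ast = new Exp([new Var('x'), new Val(5)], [new Operator('-')]);

var env = { container: { x: 12 } };
var result = ast.evaluate(env); // 12 - 5 = 7
```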

All in all, this article presents a fully functional proof of concept of the translation phases that transform your code from a flat string of characters to a nice abstract syntax tree. As pointed out throughout the article, there is much you can vary and add.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

madflame

Romania

Article Copyright 2013 by madflame