Parser in JavaScript

Ivan Vodišek

4.71/5 (9 votes)

Nov 17, 2013

Public Domain

6 min read

33992

472

Universal text parsing by given grammar

Download mparser2-2.zip - 127.2 KB

0. Changes Log

Moony Parser 1.0
    - shift-reduce algorithm implemented

Moony Parser 2.0
    - implemented Earley parsing algorithm 

Moony Parser 2.1
    - bug fixes

Moony Parser 2.2
    - implemented “@AnyCharacter” variable
    - implemented substraction operator “\”
    - building grammar with javascript functions is marked obsolete. Preferred
      method is passing grammar string to compile function
    - changed returned parsed tree internal structure for memory and performance
      improvements
    - fixed recursive function call stack overflow error for big texts. Only
      memory is a limit now
    - cool grammar tester environment included in download bundle

1. Introduction

Parsing texts is something that every programmer encounters sooner or later. Word “parsing” in programming world denotes extraction of “abstract syntax tree” from textual input (string or file). For textual input that has some internal structure, we can use parsers to arrange fragments of that textual input over resulting syntax tree. Once built from input text, syntax tree is giving us an easy access to each syntax fragment that builds input text.

For example, we can have a date stored inside string. If we want to extract day, month and year separately from string like this: “20/10/2013”, we can make use of following grammar:

Date (
    Day {@Number},
    "/",
    Month {@Number},
    "/",
    Year {@Number}
)

After parsing with this grammar, abstract syntax tree for string “20/10/2013” would have this structure:

             Date
              |
     +---+----+----+---+
     |   |    |    |   |
    Day “/” Month “/” Year
     |        |        |
   “20”     “10”     “2013”

Although usual date parsing would be done with three simple substring functions, we used the date example for a sake of explanation simplicity. In the same way that date is parsed, we can parse any kind of text, assuming that we provide correct grammar. These grammars can include boolean comparison definition, math expression definition or even definitions for entire programming languages.

Moony Parser is a javascript library tool that can parse texts and build abstract syntax trees by any provided grammar.

2. About Moony Parser

Moony Parser is a type of “Earley” parser written in javascript. What is interesting with Moony parser is that it uses its own minimalistic language for defining grammars, named Moony grammar. Moony grammar tends to be context free complete, although it differes from context free grammar language. The main difference is structured representation of grammars, which makes it more readable than i.e. BNF’s or PEG’s grammars. Otherwise, Moony grammar language could be considered as a variant of context free grammar language.

3. Moony Grammar

Grammars in Moony parser are built from blocks that denote sequences (represented as tuples), choices (options determined at parsing time), variables, constants and regular expressions.

Tuples are building blocks that take any number of attributes which all have to exist in a sequence at parsing time. Tuples are defined inside round braces, preceeding with optional tuple name. Attributes of a tuple are delimited by a comma. A simple tuple named “Person” could have following structure:

Person (
    Name {@Word},
    @WhiteSpace,
    Surname {@Word}
)

This grammar succeeds at parsing strings like “John Doe”, but not “John Doe Smith”.

Choices represent building blocks of which any, but only one element should succeed at parsing time. Choices are defined within curly braces, preceeding with optional choice name. Elements of a choice are delimited by sign “|”. As an example of choice structure we can define a choice named “Vehicle”:

Vehicle {"car" | "ship" | "plane"}

In parsing time this grammar will succeed weather parsed string says “car”, “ship” or “plane”.

Another building blocks are variables. Variables are denoted with preceeding “@” sign. They can represent built in variables like:

@CR
@LF
@Blank
@WhiteSpace
@AnyCharacter
@Number
@Word
@Variable
@String
@RegExp
@Null

In our first example we used built in variables @Word and @Whitespace. A variable also can refer to existing tuple or choice, like in following example:

Person (
    Name {@Word},
    Next {(@WhiteSpace, @Person) | @Null}
)

In this example, besides built in variables we have also used a variable @Person that refers recursively to a tuple Person. Grammar in previous example can parse a name followed by any number of middle names and surnames. Variables can also make use of dots, giving us a way to denote a path to the final variable, in which case a variable “@Person.Name” would refer to its respective content.

Constants in Moony grammar are denoted inside single or double quotes. If a quote is a part of a constant, we can escape it by preceeding “\” sign (like “a quote inside const: \””). Example:

VariableDeclaration (
    Type {
        "string" |
        'int' |
        "decimal"
    },
    @WhiteSpace,
    Name {@Variable},
    EOL {';'}
)

This grammar can parse strings like “string name;” or “int num1;”.

Also, grammars can include regular expressions that are surrounded with “/” signs. Regular expressions can be followed by modifyers “i”, “m”, or both to denote “ignore case” and/or “multiline” respectively. An example would be:

AnyCaseWords {
    /[a-z]+/i,
    Next {(@WhiteSpace, @AnyCaseWords) | "!"}
}

This grammar can parse strings like “U can PARSE mE!”.

4. Keyword “abstract”

Moony grammar also accepts a keyword “abstract” that preceedes tuple or choice notation. The keyword “abstract” denotes that its respective tuple or choice would not be parsed at the spot where it resides. Yet, its tuple or choice can be parsed only at places where they are indirectly referenced through “@” variable. Example:

AbstractTest {
    abstract Str {"parse me there, not here"} |
    Test1 ("now ", @AbstractTest.Str) |
    Test2 ("again ", @AbstractTest.Str)
}

Previous example parses strings “now parse me there, not here”, “again parse me there, not here”, but not “parse me there, not here”. Abstractions are ought to be used when multiple references to the same structure would be required, to prevent repeating code. You can also abstract an attribute of a tuple, in which case that attribute is ignored at given position.

5. Left recursion

Moony Parser, being of “Earley” type, has no problems with parsing left recursive grammars. Left recursion can be genious used to produce interesting results with operator precedence:

Sum {
    Fact {
        Exp (
            {@WhiteSpace | @Null},
            {'-' | @Null},
            Value {
                Num {@Number} |
                Var {@Variable} |
                Bra (Left {'('}, In {@Sum}, Right {')'})
            },
            {@WhiteSpace | @Null}
        ) |
        MulDiv (Left {@Fact}, In {'*' | '/'}, Right {@Exp})
    } |
    AddSub (Left {@Sum}, In {'+' | '-'}, Right {@Fact})
}

Grammar in previous example happily respects all rules of operator precedence in math and can correctly parse strings like “1 + 2 * a - (b + c) / (d - e)”.

6. Substractions

Substractions are used when we want to avoid parsing of specific structures. For example if we want to build a quoted string parser, we want a quote to be excluded it from string value while interpreting it as a string terminator. Substractions exclude expressions from parsing and are noted after backslash operator like in following example:

Str (
    '"',
    Value (
        Char {@AnyCharacter} \ '"',
        Next {@Value | @Null}
    ),
    '"'
)

Substractions can take any valid grammar expression as an argument.

7. Programming interface

Grammar is defined in string format and it should be firstly compiled before parsing any text. Once when compile succeeds, we can pass “compiled” property of a result to parsing function that actually parses text. After parsing a text against compiled grammar rules, a syntax tree is returned.

Regular call to compiler and parser is done with “parser.compileSync” and “parser.parseSync” functions:

// this is synchronous call to compiler and parser
var rules = parser.compileSync (grammarString);
if (rules.errors.length === 0) {
    var tree = parser.parseSync (rules.compiled, stringToParse);
    if (tree.errors.length === 0) {
        // to do something with tree.parsed
    } else {
        alert ("Text error count: " + tree.errors.length);
    }
} else {
    alert ("Grammar error count: " + rules.errors.length);
}

Is some cases, when grammars are complicated or string to parse is very large, parsing could take a few or more seconds to complete. In those cases we might want to use a parsing call that returns control to a program right away after invoking parser, but triggers completing function after whole text is parsed. This kind of behavior is called asynchronous call and it can be done by functions “parser.compileSync” and “parser.parseAsync”:

// this is asynchronous call to parser
parser.compileAsync (grammarString, 
    function (rules) {
        if (rules.errors.length > 0) {
            alert ("Grammar error count: " + tree.errors.length);
        } else {
            parser.parseAsync (rules.compiled, text,
                function (tree) {
                    if (tree.errors.length > 0) {
                        alert ("Text error count: " + tree.errors.length);
                    } else {
                    // to do something with tree.parsed
                    }
                },
                function (progress) {
                    progresPercent = Math.round(progress * 100) + "%";
                    return true;
                }
            )
        }
    },
    function (progress) {
        progressPercent = Math.round (progress * 100) + "%";
        return true;
    }
);

As we can see, “parser.compileAsync” function has following parameters:

    - grammar: string where a grammar is stored in textual format,
    - done function: this function is called when parsing grammar is done. It takes
      compiled grammar or errors as parameter
    - progress function:  this function is repeatedly called while parsing executes.
      If it returns false, parsing breaks

And “parser.parseAsync” function has following parameters:

    - rules: top rule
    - text: text to parse by rules
    - done function: this function is called when parsing text is done. It takes
      syntax tree or errors as a parameter
    - progress function:  this function is repeatedly called while parsing executes.
      If it returns false, parsing breaks

8. Using parsed tree

Parsed tree returned from “parseSync” or passed by “parse” function has following structure:

    - if a node is a token:
        - “value” property holds a string as a fragment of parsed text
    - if a node is a choice:
        - “value” property holds a parsed sub-tree
    - if a node is a tuple:
        - “value” property contains an array of tuple attributes’ sub-trees
    - in all cases 
        - “parsedBy” property holds a rule by which a node is parsed, where:
            - “name” property holds a name of the rule
            - “type” property holds one of strings: “token”, “tuple” or “choice”
        - properties “from” and “to” hold integer positions of parsed node inside
          parsed string

A code that alerts whole syntax tree structure after parsing would look like following:

function pTree (tree) {
    var tmp;
    if (tree.parsedBy.type === "choice") {
        alert (tree.parsedBy.name)
        pTree (tree.value);
    } else if (tree.parsedBy.type === "tuple") {
        alert (tree.parsedBy.name)
        for (var i = 0; i < tree.value.length; i++) {
            pTree (tree.value[i]);
        }
    } else if (tree.parsedBy.type === "token") {
        alert (tree.value);
    }
}
pTree (tree.parsed);

9. Errors

If there were any errors during parsing gramar or text, result’s property “errors” holds an array of objects structured this way:

tree.errors[x] = {  
    absolutePosition: ...int...,
    line: ...int...,
    column: ...int...,
    description: ...string...
}

For parsing grammars, multiple errors can be reported (like “Element not found at...” with reference variables or “Root node can not be abstract”). For parsing texts only one error can be reported at the same time (that is “Syntax error at…”).

10. Conclusion

Moony parser is an “Earley” parser which makes it context free grammar complete. It has no problems with nullable grammars, at least none I’ve noticed. I guess it will find uses in parsing texts like database search engine queries or parsing textual file formats through AJAX calls. Basically, for parsing large files only amount of memory is the limit. 2 gigs of ram enables parsing only few thousands lines of text and bigger files tend to use virtual memory which degrades performance a lot.

The parser is relatively fast (half line per millisecond) and the code is free for any use, commercial or non-commercial, modified or non-modified by third party. For using the code, download bundled zip from this site and include “parser.js” file in your HTML. There is a parser testing environment included in download bundle in which U can test grammars before implementing actual code.

Good luck!