Click here to Skip to main content
65,938 articles
CodeProject is changing. Read more.
Articles / Languages / C#

Glory: A GLR Parser Generator for .NET

5.00/5 (4 votes)
22 Feb 2020MIT20 min read 8K   52  
Glory can parse virtually anything, even natural language. Add powerful parsers to your projects. Parse languages like C# or incorporate human language parsing with your AI code.
GLR parsing is the most powerful general parsing algorithm available. It will parse ambiguous grammars and return multiple valid trees. There is no such thing as a conflict in a GLR grammar. Because of this property, it is suitable for processing natural language as well as generally ambiguous computer languages like C#.

Glory output

Introduction

I have found very little in the way of GLR parser offerings that target .NET. The things I have found were either incomplete or rolled into part of a much larger project. Glory aims to provide the .NET developer with access to the world of GLR parsers which are powerful enough to parse natural language.

Conceptualizing this Mess

Overview

GLR parsing is the most powerful of the generalized parsing algorithms but it shares the awkwardness of other bottom-up parsing techniques. Glory does a lot to hide this, but it's impossible to present a complete facade.

On the other hand, it shares the benefits of LR parsing as well such as power, efficiency (for non-ambiguous grammars), and ability to handle left recursion.

Due to the ability to handle left recursion and ambiguous grammars, there is no grammar that a GLR parser cannot accept. It's only a matter of how efficiently it can parse it. The more ambiguities, the less efficient it will be and more alternative parse trees will be returned.

As mentioned, Generalized LR parsing is a parsing technique that builds on existing LR parsing algorithms, like LR(1), LALR(1) or LR(0). Glory, like most GLR parsers, uses LALR(1) underneath due to its power and relatively compact table size compared to LR(1). LR(0) could be used, but it would degrade the performance of the parsing due to the frequency of grammar conflicts.

The Command Line Interface

The command line represents the primary interface to Glory so it's a good idea to understand it. First, here's the usage screen, with apologies for the word wrapping:

Glory generates a GLR parser and optional lexer spec

glory.exe <inputfile> [/output <outputfile>] [/rolex <rolexfile>]
        [/namespace <codenamespace>] [/class <codeclass>]
        [/langage <codelanguage>] [/fast] [/noshared]
        [/verbose] [/ifstale] [/noparser]

        <inputfile>             The XBNF input file to use.
        <outputfile>            The output file to use - default stdout.
        <rolexfile>             Output a Rolex lexer specification to the specified file
        <codenamespace>         Generate code under the specified namespace - default none
        <codeclass>             Generate code with the specified class name - default derived
                                from <outputfile> or the grammar.
        <codelanguage>          Generate code in the specified language - default derived 
                                from <outputfile> or C#.
        <fast>                  Generate code quickly, without resolution 
                                using C# only - not valid with the <codelanguage> option.
        <noshared>              Do not include shared library prerequisites
        <verbose>               Output all messages from the generation process
        <ifstale>               Do not generate unless output files are older than 
                                the input files.
        <noparser>              Generate any specified lexers with the appropriate 
                                symbol table but do not generate the parser output.
  • inputfile is the XBNF specification (explored below) that the parser will be generated from.
  • outputfile is optional and specifies the output code file to generate. If it's not specified, the code will be dumped to the stdout.
  • rolexfile is optional and specifies the rolex lexer specification file to generate. If it's not specified, then no rolex specification will be generated.
  • codenamespace is the namespace under which to generate the code. If it's not specified, the code won't be generated under a namespace.
  • codeclass is the name of the parser class to use. If you don't specify one, it will be the same as the outputfile, unless that's not specified, in which case it will take the name from the start element in the XBNF grammar.
  • codelanguage is the language of the code to generate. If you don't specify one, it will be based on outputfile's file extension. If outputfile is not specified, it will default to C#.
  • fast indicates that the code generation should skip the resolution step and simply emit in C#. This is not valid when specifying a codelanguage.
  • noshared skips generating the shared code. This is important to specify for generating a second parser in the same project. The first parser should be generated without this switch, and subsequent parsers should be generated with this switch. If noshared is specified here, it should also be specified to Rolex when Rolex is used to generate the lexer/tokenizer code. You don't want duplicates of shared code in your source, because it will cause compile errors.
  • verbose emits all grammar transformation and validation messages, including messages about refactored rules. This can create huge dumps in the case of large grammars with lots of refactoring, but it allows you to see what changes it made. You can also see the changes in the doc comments of the generated code, regardless.
  • ifstale skips generating the output unless the input has been modified. This is useful for pre-build steps as it prevents the overhead of rebuilding the parser every time.

You'll probably want to use this tool in tandem with Rolex, the tokenizer included in the solution. This is the easiest way to use, Glory. Almost all parsers (excepting PEG parsers particularly) require lexers/tokenizers and Glory is no exception. For further reading about Rolex, I wrote an article about creating Rolex here though you should use this codebase since it's much newer. With rolex, you'll want to use its /external option with your parser's namespace.

The XBNF Grammar Format

Glory takes an XBNF grammar document as its input and produces a single source file as its output. The XBNF format should be familiar if you've used any of my other parser generators but we'll go over it again for those who haven't or if you're rusty. The XBNF format is designed to be easy to learn if you know a little about composing grammars.

The productions take the form of:

identifier [ < attributes > ] = expressions ;

There are more advanced variations of the above production syntax we'll approach later.

So for example, here's a simple standards compliant JSON grammar:

// based on spec @ json.org
Json<start>= Object | Array;
Object= "{" [ Field { "," Field } ] "}";
Field= string ":" Value;
Array= "[" [ Value { "," Value } ] "]";
Value<collapsed>= 
    string    |
    number    |
    Object    |
    Array     |
    Boolean   |
    null      ;
Boolean= true|false;
number= '\-?(0|[1-9][0-9]*)(\.[0-9]+)?([Ee][\+\-]?[0-9]+)?';
string= '"([^\n"\\]|\\([btrnf"\\/]|(u[0-9A-Fa-f]{4})))*"';
true= "true";
false= "false";
null= "null";
lbracket<collapsed>= "[";
rbracket<collapsed>= "]";
lbrace<collapsed>= "{";
rbrace<collapsed>= "}";
colon<collapsed>= ":";
comma<collapsed>= ",";
whitespace<hidden>= '[\n\r\t ]+';

The first thing to note is the Json production is marked with a start attribute. Since start's value was not specified it is implicitly, start=true.

That tells the parser that Json is the start production. If it is not specified, the first non-terminal in the grammar will be used. Furthermore, this can cause a warning during generation since it's not a great idea to leave it implicit. Only the first occurrence of start will be honored.

Object | Array tells us the Json production is derived as an object or array. The Object production contains a repeat {} construct inside and optional [] construct, itself containing a reference to Field. Array is similar, except it uses "[" and "]" and it refers to Value instead of Field.

Expressions

  • ( ) parentheses allow you to create subexpressions like Foo (Bar|Baz)
  • [ ] optional expressions allow the subexpression to occur zero or once
  • { } this repeat construct repeats a subexpression zero or more times
  • { }+ this repeat construct repeats a subexpression one or more times
  • | this alternation construct derives any one of the subexpressions
  • Concatenation is implicit, separated by whitespace

Terminals

The terminals are all defined at the bottom but they can be anywhere in the document. XBNF considers any production that does not reference another production to be a terminal. You can force an element to be terminal with the terminal attribute (see below).

Regular expressions are between ' single quotes and literal expressions are between " double quotes. You may declare a terminal by using XBNF constructs or by using regular expressions. The specifics of the regular expression language and features depends on the lexer generator you are using. Not all attributes on terminals will be supported by all lexers. Currently, Glory will only generate Rolex specifications. However, you can use other lexer generators as long as they can be wrapped to expose an IEnumerable<Token> interface.

Attributes

The collapsed element tells Glory that this node should not appear in the parse tree. Instead, its children will be propagated to its parent. This is helpful if the grammar needs a nonterminal or a terminal in order to resolve a construct, but it's not useful to the consumer of the parse tree. Above, we've used it to significantly trim the parse tree of nodes we won't need including collapsing unnecessary terminals like : in the JSON grammar. This is because they don't help us define anything - they just help the parser recognize the input, so we can throw them out to make the parse tree smaller. Note that collapsed doesn't impact what is reported by GlrTableParser and its derivatives, it only impacts the end parse trees.

The hidden element tells the lexer/tokenizer that this terminal should be skipped. This is useful for things like comments and whitespace.

The blockEnd attribute is intended for terminals who have a multi character ending condition like C block comments, XML CDATA sections, and SGML/XML/HTML comments. If present, the lexer will continue until the literal specified as the blockEnd is matched. This is supported by Rolex, but may be supported by others in the future. I've had it working with GPLEX in the past, but I removed all of the GPLEX support code because it was ugly.

The terminal attribute declares a production to be explicitly terminal. Such a production is considered terminal even if it references other productions. If it does, those other productions will be included in their terminal form as though they were part of the original expression. This allows you to create composite terminals out of several terminal definitions.

The ignoreCase attribute specifies that case shouldn't matter for matching. This only applies to terminals and only works with Rolex.

The type attribute specifies a .NET type or intrinsic C# type of the associated evaluation code block (see below). Values returned from such "typed" non-terminals are automatically converted to the target type, obviating the need to cast return values, or convert them using int.Parse() or the like. It will be handled for you. Furthermore, it makes the generated routine return a typed value instead of just object.

The priority attribute marks a terminal production with a particular priority. A negative number is lower in priority than a positive number. Must be integer. By default, items are priority 0.

The errorSentinel attribute marks a terminal as one of the restart points in case of an error. This helps the parser find its way back after an error so that it can continue parsing. See the next section on error handling.

Error Handling

Glory currently uses a simple "panic-mode" error recovery process, and it requires a little help from the grammar author. The errorSentinel attribute should be used to mark terminals as restart points - basically in the case of an error, the parser will "eat" tokens until it encounters one of these symbols and then attempt to restore the stack state so that the sentinel will be accepted as a valid parse. It's a good idea to use things like end of statement markers and end of method markers and the like. The more of these you specify the better, up to a point, which varies depending on the grammar. This part isn't fun or easy - it just takes patience. The best thing to do is start by adding an error sentinel or two, and then trying grammars with errors in them. If it eats too much text, find a sentinel within that text that you could have used and mark that one up with the attribute as well. In the future, Glory will use a global generalized error recovery technique instead of this local error recovery technique.

The @include Directive

The @include directive allows you to create grammar out of several files and include them in your master grammar. This can allow for a better structure of a large grammar project. Like any directive, this must appear before any productions. The path is specified as the single argument and must be a JSON style string literal. The paths are relative to the path of the document that included them, not the current path. Here's an example:

@include "spec/types.xbnf"

The @options Directive

The @options directive allows you to override much of the command line of Glory with your own settings in the grammar. In the case where a grammar includes other grammars, only the top level grammar's @options settings will be honored. The format is as follows:

@options option1=optionValue1, option2=optionValue2, option2=optionValue3;

The @options directive may appear once, and like @include must appear before any productions. An option value can be a quoted string, an number, a boolean, or null. The current available options are as follows, and correspond to their command line counterparts: outputfile, codenamespace, codeclass, codelanguage, rolexfile, fast and verbose. Like with attributes, boolean values that are true don't need their value specified explicitly, just their name, so verbose and verbose=true are equivalent. Again, these override the command line options. You will typically use them with the Visual Studio integration, explored below.

Ambiguous Grammars

This deserves its own special section. It might seem that because a GLR parser generator will accept a grammar no matter the form, it would be easier to use. The opposite is true. GLR has no concept of "invalid" so your grammar mistakes won't register, except by giving you different parse tree(s) than you expected. Worse, it has been proven impossible to detect if any grammar is generally ambiguous, so Glory can't even warn you.

Using Glory is no cakewalk. You have to be able to work with ambiguity, both in the grammar and the parser output. This requires a lot of manual inspection of the trees you get back as you're crafting your grammar. Plus table generation can take significant time for complicated "real world" grammars, so my recommendation is when prototyping, use @include and only generate the part of the grammar you need to test, because you absolutely will need to test.

I've included a grammar for Slang expressions, a subset of C# expressions. C# is an ambiguous language to parse. The only way to select one of the multiple parse trees you'll get back is through the application of type information to narrow down the trees you get back, which the demo doesn't do. It simply returns all the possible parse trees for the expression it was given. Sometimes the results can be surprising as it will find syntax combinations you didn't expect. Again, see the above disclaimer about testing. However, this can happen even if your grammar is accurate because Glory is good at finding all possibilities of a parse - better than humans.

Coding this Mess

We'll start with XBNF code blocks in the grammar:

Evaluation Code Blocks

Evaluation code blocks are specified in the grammar using a lambda/anonymous method-like syntax. => { ... } at the end of a non-terminal production. They take the form of:

identifier [ < attributes > ] = expressions => { ... }

Like:

C#
Foo = Bar Baz => { // my code  } // notice we don't terminate with a semi-colon

This signifies a block of code to associate with that non-terminal. This code is for parse actions, triggered by the EvaluateXXXX() methods. In this code, you take a ParseNode object, indicated by node and evaluate it, returning the result. If you do not return a result, a default value will be returned. See the example below.

On grammar notation conventions: I use title case for the non-terminals in the grammar. This isn't necessary, but is better in the end because functions like EvaluateXXXX() take the XXXX portion directly from the non-terminal name. Ergo, the non-terminal foo would end up generating and Evaluatefoo(). This is obviously undesirable but it's not a show stopper. I thought about mangling the name to "correct" this, but in the end I decided not to make unexpected changes to the code. This way, it was straightforward to know what the eventual names of these methods will be. The symbol constants are generated directly from the symbol names as well, so if you wanted to be consistent with Microsoft's naming guidelines, you'd name your terminals with Pascal case/.NET case as well. However, this is far from important, especially since unlike with non-terminals, their names aren't appended to anything.

Let's take a look at the grammar for an expression parser:

C#
Term<start>= Factor { ("+"|"-") Factor };
Factor= Unary { ("*"|"/") Unary };
Unary= ("+"|"-") Unary | Leaf;
Leaf= integer | identifier | "(" Term ")";
// we only need to declare terminals explicitly if we want 
// constant names for them or to apply attributes to them
// we don't have to explicitly declare anything but whitespace
// we can define it all inline above but this is nicer
integer='[0-9]+';
identifier='[A-Z_a-z][0-9A-Z_a-z]*';
whitespace<hidden>='\s+';

We can get a parse tree back with the above, but we can't evaluate it. Let's fix that by adding some code to the grammar:

C#
Term<start>= Factor { ("+"|"-") Factor } => {
    int result = (int)EvaluateFactor(node.Children[0]);
    int i = 2;
    while (i<node.Children.Length) 
    {
        if(node.Children[i-1].SymbolId==add)
            result += (int)EvaluateFactor(node.Children[i]);
        else // sub
            result -= (int)EvaluateFactor(node.Children[i]);
        i+=2;
    }
    return result;
}

Factor= Unary { ("*"|"/") Unary } => { 
    int result = (int)EvaluateUnary(node.Children[0]);
    int i = 2;
    // Because of the way the grammar is factored, this can 
    // end up being Factors when collapsed. We have to check
    while (i<node.Children.Length) 
    {
        if(node.Children[i].SymbolId==Unary) // Unary
        {
            if(node.Children[i-1].SymbolId==mul) 
                result *= (int)EvaluateUnary(node.Children[i]);
            else // div
                result /= (int)EvaluateUnary(node.Children[i]);
        } else // Factor
        {
            if(node.Children[i-1].SymbolId==mul) 
                result *= (int)EvaluateFactor(node.Children[i]);
            else // div
                result /= (int)EvaluateFactor(node.Children[i]);
        }
        i+=2;
    }
    return result;
}
Unary= ("+"|"-") Unary | Leaf => {
    if(node.Children.Length==1)
        return EvaluateLeaf(node.Children[0]);
    if(node.Children[0].SymbolId==add)
        return EvaluateUnary(node.Children[1]);
    else
        return -(int)EvaluateUnary(node.Children[1]);
}
Leaf= integer | identifier | "(" Term ")" => {
    if(node.Children.Length==1) // integer | identifier
    {
        if(node.Children[1].SymbolId==integer)
            return int.Parse(node.Children[0].Value);
        else // identifier
            throw new NotImplementedException("Variables are not implemented.");
    } else   // Term
        return EvaluateTerm(node.Children[1]); 
}
// we only need to declare terminals explicitly if we want 
// constant names for them or to apply attributes to them
// in this case, we don't need divide, subtract or 
// parentheses to be declared
add="+";
mul="*";
integer='[0-9]+';
identifier='[A-Z_a-z][0-9A-Z_a-z]*';
whitespace<hidden>='\s+';

Note the code blocks indicated by => { ... } which tell our parser what to do with the parse tree we made.

This indicates an integer expression parser. Because there is code in the grammar, it creates a public static ExpressionParser.Evaluate() method that can be used to call it, and thus evaluate a simple integer expression like 4+2*8.

Here is the code it generates for EvaluateLeaf(). You can see the code looks a lot like the associated code in the grammar above, with minor changes.

C#
public static object EvaluateLeaf(ParseNode node, object state) {
    if ((ExpressionParser.Leaf == node.SymbolId)) {
        if ((node.Children.Length == 1)) {
            if ((node.Children[1].SymbolId == ExpressionParser.integer)) {
                return int.Parse(node.Children[0].Value);
            }
            else {
                throw new NotImplementedException("Variables are not implemented.");
            }
        }
        else {
            return ExpressionParser.EvaluateTerm(node.Children[1]);
        }
    }
    throw new SyntaxException("Expecting Leaf", node.Line, node.Column, node.Position);
}

Here's that same code in VB.NET.

VB.NET
Public Overloads Shared Function EvaluateLeaf_
    (ByVal node As ParseNode, ByVal state As Object) As Object
    If (ExpressionParser.Leaf = node.SymbolId) Then
        If (node.Children.Length = 1) Then
            If (node.Children(1).SymbolId = ExpressionParser.[integer]) Then
                Return Integer.Parse(node.Children(0).Value)
            Else
                Throw New NotImplementedException("Variables are not implemented.")
            End If
        Else
            Return ExpressionParser.EvaluateTerm(node.Children(1))
        End If
    End If
    Throw New SyntaxException("Expecting Leaf", node.Line, node.Column, node.Position)
End Function

As you can see, the code block in the grammar was translated to the target language. This is Slang. Slang is still experimental, but it works well enough for doing simple evaluation inside code blocks. There's one problem here. EvaluateUnary() (and Evaluate() and all of the EvaluateXXXX() methods return object! This is an integer expression parser so returning int is far more appropriate. Fortunately, we can tell the parser what type to return using the type attribute. Here, we add the type attributes to the elements we've changed the grammar, which I've put in bold:

C#
Term<start,type="int">= Factor { ("+"|"-") Factor } => {
    int result = EvaluateFactor(node.Children[0]);
    int i = 2;
    while (i<node.Children.Length) 
    {
        if(node.Children[i-1].SymbolId==add)
            result += EvaluateFactor(node.Children[i]);
        else // sub
            result -= EvaluateFactor(node.Children[i]);
        i+=2;
    }
    return result;
}

Factor<type="int">= Unary { ("*"|"/") Unary } => { 
    int result = EvaluateUnary(node.Children[0]);
    int i = 2;
    // Because of the way the grammar is factored, this can 
    // end up being Factors when collapsed. We have to check
    while (i<node.Children.Length) 
    {
        if(node.Children[i].SymbolId==Unary) // Unary
        {
            if(node.Children[i-1].SymbolId==mul) 
                result *= EvaluateUnary(node.Children[i]);
            else // div
                result /= EvaluateUnary(node.Children[i]);
        } else // Factor
        {
            if(node.Children[i-1].SymbolId==mul) 
                result *= EvaluateFactor(node.Children[i]);
            else // div
                result /= EvaluateFactor(node.Children[i]);
        }
        i+=2;
    }
    return result;
}
Unary<type="int">= ("+"|"-") Unary | Leaf => {
    if(node.Children.Length==1)
        return EvaluateLeaf(node.Children[0]);
    if(node.Children[0].SymbolId==add)
        return EvaluateUnary(node.Children[1]);
    else
        return -EvaluateUnary(node.Children[1]);
}
Leaf<type="int">= integer | identifier | "(" Term ")" => {
    if(node.Children.Length==1) // integer | identifier
    {
        if(node.Children[1].SymbolId==integer)
            return node.Children[0].Value;
        else // identifier
            throw new NotImplementedException("Variables are not implemented.");
    } else // Term
        return EvaluateTerm(node.Children[1]); 
}
// we only need to declare terminals explicitly if we want 
// constant names for them or to apply attributes to them
// in this case we don't need divide, subtract or 
// parentheses to be declared
add="+";
mul="*";
integer='[0-9]+';
identifier='[A-Z_a-z][0-9A-Z_a-z]*';
whitespace<hidden>='\s+';

Note that not only do we now have the attribute type="int" marked up on each non-terminal production, we also have changed the code (particularly in Leaf) slightly, from:

C#
if(node.Children.Length==1) // integer | identifier
{
    if(node.Children[1].SymbolId==integer)
        return int.Parse(node.Children[0].Value);
    else // identifier
        throw new NotImplementedException("Variables are not implemented.");
} else // Term
    return EvaluateTerm(node.Children[1]); 

to:

C#
if(node.Children.Length==1) // integer | identifier
{
    if(node.Children[1].SymbolId==integer)
        return node.Children[0].Value;
    else // identifier
        throw new NotImplementedException("Variables are not implemented.");
} else // Term
    return EvaluateTerm(node.Children[1]); 

This change wasn't necessary, but it simplifies things slightly. This uses Microsoft's TypeConverter framework in tandem with Convert.ChangeType() to automatically translate your return values to the target type.

Regardless of all that code, using the generated code is pretty simple:

C#
var text = "3*5+7*2"; 
var exprTokenizer = new ExpressionTokenizer(text);
var parser = new ExpressionParser(exprTokenizer); 
foreach(var pt in parser.ParseReductions()) 
    Console.WriteLine("{0} = {1}",text,ExpressionParser.Evaluate(pt));

The reason creating the parser is a separate step than creating the tokenizer is because it's actually possible to use a tokenizer other than a Rolex tokenizer with this parser, but you'll have to provide your own Token struct and a class implementing IEnumerable<Token>. For performance reasons, there is no IToken interface. However, your token must have the following fields: (int) SymbolId, (string) Symbol, (string) Value, (int) Line, (int) Column, and (long) Position. See the reference source under Glory\Export\Token.cs. Modifying the original file will change the behavior of Glory and probably break it.

The reason we are looping and printing potentially multiple results is because the parser might return multiple trees. We could have just chosen the first one as there will only ever be one with this unambiguous grammar, but I wanted to show you what it usually looks like to get parse trees back from Glory even where ambiguous grammars are concerned. That's why we're looping over all the parse trees returned by ParseReductions() even though in this case there will always be just one.

Finally, now we have Evaluate(), but there's one niggling issue to address, and that is variables, which were previously not implemented. We can implement variables using the state argument in Evaluate() which allows us to pass a user defined value in to our evaluation code. We'll have to change the code in the grammar in order to support it, so let's modify the grammar again.

C#
Term<start,type="int">= Factor { ("+"|"-") Factor } => {
    int result = EvaluateFactor(node.Children[0],state);
    int i = 2;
    while (i<node.Children.Length) 
    {
        if(node.Children[i-1].SymbolId==add)
            result += EvaluateFactor(node.Children[i],state);
        else // sub
            result -= EvaluateFactor(node.Children[i],state);
        i+=2;
    }
    return result;
}

Factor<type="int">= Unary { ("*"|"/") Unary } => { 
    int result = EvaluateUnary(node.Children[0],state);
    int i = 2;
    // Because of the way the grammar is factored, this can 
    // end up being Factors when collapsed. We have to check
    while (i<node.Children.Length) 
    {
        if(node.Children[i].SymbolId==Unary) // Unary
        {
            if(node.Children[i-1].SymbolId==mul) 
                result *= EvaluateUnary(node.Children[i],state);
            else // div
                result /= EvaluateUnary(node.Children[i],state);
        } else // Factor
        {
            if(node.Children[i-1].SymbolId==mul) 
                result *= EvaluateFactor(node.Children[i],state);
            else // div
                result /= EvaluateFactor(node.Children[i],state);
        }
        i+=2;
    }
    return result;
}
Unary<type="int">= ("+"|"-") Unary | Leaf => {
    if(node.Children.Length==1)
        return EvaluateLeaf(node.Children[0],state);
    if(node.Children[0].SymbolId==add)
        return EvaluateUnary(node.Children[1],state);
    else
        return -EvaluateUnary(node.Children[1],state);
}
Leaf<type="int">= integer | identifier | "(" Term ")" => {
    if(node.Children.Length==1) // integer | identifier
    {
        if(node.Children[0].SymbolId==integer)
            return node.Children[0].Value;
        else // identifier
        {
            if(state!=null) 
            {
                int val;
                var d = (IDictionary<string,int>)state;
                if(d.TryGetValue(node.Children[0].Value,out val))
                    return val;
            }    
            throw new Exception(string.Format("Reference to undefined variable {0}",
                                               node.Children[0].Value));
        }
    } else // Term
        return EvaluateTerm(node.Children[1],state); 
}
// we only need to declare terminals explicitly if we want 
// constant names for them or to apply attributes to them
// in this case we don't need divide, subtract or 
// parentheses to be declared
add="+";
mul="*";
integer='[0-9]+';
identifier='[A-Z_a-z][0-9A-Z_a-z]*';
whitespace<hidden>='\s+';

Note we're now using the state variable to hold an IDictionary<string,int> instance which we use to hold our variables. We also pass state along to each of our evaluation methods. This is important, as the evaluation methods do not have access to any state other than that and a portion of the parse tree.

When we encounter an identifier, we resolve it using a simple dictionary lookup.

We have to change the code we use to call it now, to pass in our variables:

C#
var text = "3*5+a*2";
var vars = new Dictionary<string, int>();
vars["a"] = 1;
var exprTokenizer = new ExpressionTokenizer(text);
var parser = new ExpressionParser(exprTokenizer);
foreach(var pt in parser.ParseReductions())
  Console.WriteLine("{0} = {1}",text,ExpressionParser.Evaluate(pt,vars));

And Bob's your uncle. There you go.

However, it's a bit complicated. So I've added some shorthand to simplify. For starters, there are virtual variables named <Production>1 to <Production>N where <Production> is the name of a production in your grammar.

For the above grammar, we have Unary1, Unary2, UnaryN, and Expression1, Expression2, ExpressionN, and Factor1 and Factor2 and so on. We also have SymbolId1, SymbolId2, etc.

The numbers are 1 based positions into the child nodes. SymbolId1 is an alias for node.Children[0].SymbolId. The terminals just point to node.Children[x].Value and the non-terminal productions call the associated evaluate method for the non-terminal at the specified position like, so Factor3 from the above grammar would resolve to ExpressionParser.EvaluateFactor(node.Children[2], state)!

You can also index in to these like Unary[i]. In addition, there is a Child macro (Child1..ChildN, Child[i]) that gives you the ability to evaluate the node type at run time and it will call the appropriate EvaluateXXXX() method for you. This is useful in cases where we have collapsed nodes in the underlying grammar because sometimes that means we can't know which nodes we'll see in our evaluate function ahead of time. We'll visit this below:

Also, this may not seem very intuitive above until you look at the final grammar format:

C#
Term<start,type="int">= Factor { ("+"|"-") Factor } => {
    int result = Factor1;
    int i = 2;
    while (i<Length) 
    {
        if(SymbolId[i-1]==add)
            result += Factor[i];
        else // sub
            result -= Factor[i];
        i+=2;
    }
    return result;
}

Factor<type="int">= Unary { ("*"|"/") Unary } => { 
    int result = Unary1;
    int i = 2;
    // Because of the way the grammar is 
    // factored, this can end up being 
    // Factors when collapsed. We use
    // Child[n] which is a "smart"
    // macro that evaluates the symbol
    // at runtime and calls the right
    // EvaluateXXXX method
    while (i<Length) 
    {
        // Child always returns an object type so 
        // be sure to cast as necessary
        if(SymbolId[i-1]==mul) 
            result *= (int)Child[i];
        else // div
            result /= (int)Child[i];
        i+=2;
    }
    return result;
}
Unary<type="int">= ("+"|"-") Unary | Leaf => {
    if(Length==1)
        return Leaf1;
    if(SymbolId[0]==add)
        return Unary2;
    else
        return -Unary2;
}
Leaf<type="int">= integer | identifier | "(" Term ")" => {
    if(Length==1) // integer | identifier
    {
        if(SymbolId[0]==integer)
            return integer1;
        else // identifier
        {
            if(state!=null) 
            {
                int val;
                var d = (IDictionary<string,int>)state;
                if(d.TryGetValue(identifier1,out val))
                    return val;
            }    
            throw new Exception(string.Format
                  ("Reference to undefined variable {0}",identifier1));
        }
    } else // Term
        return Term2;
}
// we only need to declare terminals explicitly if we want 
// constant names for them or to apply attributes to them
// in this case we don't need divide, subtract or 
// parentheses to be declared
add="+";
mul="*";
integer='[0-9]+';
identifier='[A-Z_a-z][0-9A-Z_a-z]*';
whitespace<hidden>='\s+';

Take a look at that! That's pretty easy to follow once you get the hang of it. The macros clear things right up and keep it simple.

The demo project is in C# and has several grammars you can try.

Code Regions In The Grammar

Anywhere between productions, you can specify regions of code to be added to the parser class using { }. Any fields or methods must be static or const, and not named after any symbol in the grammar.

These regions are added as members of the class and will be accessible from any evaluation code blocks.

Using as a Pre-Build Step

Simply navigate to your project options, under Build Events (C#) and add it to the pre-build event command line. Finding the pre-build event location in your project options is more involved in VB than it is in C#, but dig around. It's there.

Either way, you need to know what to feed it, so here you go. You need to make two things - a parser and a tokenizer/lexer. Each will have its own command line to execute. For your parser, you'll use glory.exe, and for your tokenizer/lexer, you'll usually use rolex.exe. You'll want glory.exe to come first, and make sure to give it the right arguments to generate a lexer file for you (assuming the grammar doesn't indicate one). Following that, you'll almost certainly want to run rolex.exe. If you're making your own lexer from scratch, you can skip that second command line but most people won't ever do that.

Anyway, here are the basics (see GloryDemo and GloryDemoVB):

"glory.exe" "$(ProjectDir)Expression.xbnf" /output "$(ProjectDir)ExpressionParser.cs" 
/rolex "$(ProjectDir)Expression.rl" /namespace GloryDemo /ifstale
"rolex.exe" "$(ProjectDir)Expression.rl" /output "$(ProjectDir)ExpressionTokenizer.cs" 
/namespace GloryDemo /external GloryDemo /ifstale

First, put quotes around paths, including macros, and including your initial executable path. We simply specified "glory.exe" and "rolex.exe" here, but you may need to put in their full paths unless they're in your PATH. Often times, I will simply copy the necessary tools to the solution folder and then set the build steps to use that EXE. That way, if you copy the project to another machine, it will still build. I hope that's clear.

Notice with Rolex we've specified /external GloryDemo. That forces Rolex to use an external Token struct already declared in the specified namespace rather than generating one. This is because Glory provides its own Token struct.

Note we've used macros to refer to the project directory. The list of macros is available in the edit window of the pre-build events if you click around/click through.

Visual Studio Integration

I've removed Visual Studio integration from this project because there are too many issues with it.

I'll be releasing a Visual Studio integration pack for all of my parsers eventually. There are two issues that are showstoppers: Language independent code generation requires a serious hack of the custom tool facilities and I haven't figured out how to detect if shared code is already present in the project before I generate shared code again. This is especially a problem where Rolex is concerned, since Rolex must use Glory's Token definition, but the Rolex custom tool can't tell if a Token struct is already present in the project. Microsoft hasn't documented their extensibility features that much so I'm not sure when I'll be able to fix these issues.

The other issue is that LR table generation can take a very long time for real world grammars, so it will freeze Visual Studio for a lengthy period. It's best to do this as a build step.

Points of Interest

My husband speaks multiple languages and studies linguistics and we've finally got a convergent project now. I teach him XBNF, and he hopefully gives me a partial Mixtec or Spanish grammar from it, for which I can produce parse/translation trees which apparently he can use for who knows what. It should be fun. Yay GLR.

History

  • 22nd February, 2020 - Initial submission

License

This article, along with any associated source code and files, is licensed under The MIT License