Newt: A Powerful C# Parser Generator in a Small Package

honey the codewitch

Rate me:

5.00/5 (15 votes)

14 May 2019CPOL11 min read

41.3K

An LL(1) pull parser and generator that thinks it's an LL(k) parser - with a rich, simple and beautiful EBNF syntax

Download source (Newt) - 79.5 KB

Below is for Visual Studio 2017 integration:

Download Cauldron (binaries) - 6.4 MB

Introduction

I've long been dissatisfied with the parser generator offerings available. Either they are limited and clunky like Cocos/R or they are huge and involved, like the latest renditions of ANTLR. Worse, their grammars can get overly complicated to the point of being unreadable, or they use complex engines like GLR or LL(k) parsing with FA to handle the parsing process, making them both more difficult to implement, and to use.

I also didn't care for how few of them expose a pull parser - something you can loop on and call read to get a chunk at a time. This is essential for parsing huge documents like mountains of JSON data, or even just anything with a big enough parse tree.

Originally, I was going to write a parser generator that allowed you to choose the underlying algorithm to use as a way of solving some of these limitations until I stumbled onto a much better and seemingly uncommon way to use a very simple algorithm - LL(1) - to parse very complex grammars and yield reasonable parse trees. Since it's LL, it allows for simple reading and traversal during the parse.

Background

To follow much of this, you'll probably need to be at least a little familiar with parser generators. If you aren't, I'll try to keep things as boiled down as I can, and the bits that are unfamiliar probably aren't important for using the end result, just understanding the full mechanics behind it.

I'm going to avoid talking code here, and instead talk about grammars. First, let's define what we mean by grammar, in this case, a Context Free grammar (Type-2 on the Chomsky hierarchy if you care)

All a grammar is essentially, is a series of rules. A rule is a left hand side, followed by one or more symbols on the right hand side, *or* epsilon, which can appear once, and by itself.

What is a symbol? It's just a handle. It can be anything, as long as you can get back to it, and find it again. For example, a symbol can be the string "foo" and you can tell if it's the same as another string because the string is the same.

A -> a B 

A -> C B d

Here we have two rules, with the same left hand side symbol, and a list of symbols on the right hand side.

A group of rules with the same left hand side are collectively known as a "non-terminal". Here, we have defined the non-terminal A with two rules. We can reference it in any other rule, and we can even make a non-terminal self-referential / recursive. We can also of course define other non-terminals in the same grammar such as:

B -> B A
B -> A
B -> <epsilon>

Epsilon simply means "nothing" or "empty" here.

How do the rules define a parse? Well, we use these rules to get an idea of the hierarchy of the grammar we are parsing. And we've not covered terminals yet, just non-terminals.

Non-terminals are the nodes of the parse tree, terminals are the leaves. A Context Free Grammar can reference terminals by symbol, but does not care how they are defined. Usually terminals are defined using small regular expressions and string literals

Above, we can see that B references A. B also references itself. What this does is define a loop of zero or more times, using substitution. Each iteration, A gets replaced by the definition of A which is parsed, and B gets replaced by yet another instance of B. Luckily, we have the two other rules which define the termination of the loop - either A by itself, or nothing (empty/epsilon)

Let's stop right there. This may work fine for a computer, but it's a downright awful way to define something to parse. So rather than teach you about all the specifics here, let's just say a CFG is that ugly thing.

Fortunately, we don't have to make them by hand. We can use something like EBNF to describe expressions which can then produce these grammars. Consider the following self describing EBNF grammar:

grammar<start>=productions;
productions= productions production | production;
production=identifier ["<" attributes ">"] "=" expressions ";";
expressions= expressions "|" expression | expression;
symbol = literal | regex | identifier | 
    "(" expressions ")" | 
    "[" expressions "]" |
    "{" expressions "}";
expression = { symbol };
attributes= attributes "," attribute | attribute;
attribute= identifier ["=" attrval];
attrval= literal | integer | identifier;

literal = '"([^"\\]|\\.)*"';
regex = '\'([^\'\\]|\\.)*\'';
identifier= '[A-Z_a-z][\-0-9A-Z_a-z]*';
integer='\-?[0-9]+';
whitespace<hidden>= '[ \v\f\t\r\n]';
lineComment<hidden>= '//[^\n]*';
blockComment<hidden,blockEnd="*/">="/*";

Sorry for the lack of syntax highlighting!

As you study this, assuming you know basic regex, it should start to come together. This describes an EBNF grammar in the syntax it describes.

[ ] optionally match whatever occurs between
{ } repeatedly match whatever occurs between
( ) standard grouping
< > attribute sets - special declarators that modify or specify the grammar or parse tree in some way
" " string literal
' ' regex expression - non-backtracking, basic.
| alternation - simple concatenation is implicit just like it is in regex.

This is a much easier way to describe a syntax! We can create a CFG from this. The trouble is, the CFG won't work out of the box with some common parser algorithms, including the one that ships with this article. The grammar is not "LL(1)" which just means there are two rules with the same left hand and same first right hand sides.

The grammar is also left-recursive because of expressions, which means it can create an infinite loop. We don't want that.

Fortunately, there are mathematical workarounds for these issues. They can be solved by what's known as left factoring. Left factoring at Tutorials Point

What we have to do is refactor the CFG grammar in order to make it usable with our little LL(1) parser.

The included project handles this. The problem with left factoring is that it bloats the parse tree. You have to create additional rules to workaround the limitations of the parsing algorithm. The extra rules mean extra non-terminals, which in turn means extra useless nodes on the parse tree. "What's the big deal!" you say. I say, talk to me when you've parsed a C# file that uses 7 nested non-terminals just to start an expression. Also, the deeper the parse tree, the more complicated and resource intensive it is to parse and schlep around.

What I've done is I've allowed the non-terminals and terminals to have attributes associated with them. You'll notice a few in the above EBNF. (Above "hidden", "start", and "blockEnd")

Here are the ones currently supported:

type - must be a literal string representing either a .NET or C# intrinsic type name - used on terminals for coercing the string value from the lexer into the specified type using System.ComponentModel.TypeConverter framework in .NET so you can add your own. See the MS docs.
hidden - boolean, used on terminals for telling the parser to throw them away and not report them. This is useful for whitespace and comments, and can be useful in some other contexts too, like CDATA sections in HTML if you don't want to parse them
blockEnd - string literal, used on terminals to specify a skip end-point for things like C and HTML comment blocks. These constructs are not easily matchable with a non-backtracking FA, so this is a special mode (but common to parsers) to simply skip until it finds the specified string literal in the output. It then takes that entire text and reports it as a terminal. Specifying "*/" would be useful for C block comments. Specifying "]]>" would be useful for CDATA sections. Specifying "-->" is useful for SGML/XML/HTML comments.
(not used above:)
substitute - string literal, used on terminals or non-terminals to indicate a substitute symbol (which must be defined) to report in the place of this symbol. Does not impact the parsing, just the reporting of the parse. The same rules are still followed and internally the symbol still has its underlying identity - it's just never reported.

Another you don't see is one of the most powerful ones the system uses behind the scenes. It's called "collapse" and all it does is replace itself with a list of its children. In other words, it removes itself from the parse tree, and moves its children up the chain to its parent node, so it basically "hides" from the tree, but propagates its children.

Adding "collapse" to any non-terminals as created during left-factoring has yielded parse trees that look exactly like the grammar that generated them would suggest.

If you're familiar with parser generators you know that last bit is a tall order when the EBNF grammars are as expressive as the above.

Using the Code

As I usually do, I will be providing sample code here to do most of the talking. This builds on my regular expression engine.

Please note that while the API is fairly easy once you know how, this is a research project I started, and I haven't done much refactoring or commenting. I decided to submit it as is because I'll be working on it for the next year, and I'm at least two weeks out from a git hub submission, so you fine readers get it first. Vote for Pedro! (or me)

// This demonstrates loading the ebnf.ebnf grammar file and 
// then using it to parse itself.
//
var grammarfile = @"..\..\..\ebnf.ebnf";

// parse our document into an EBNF DOM
// Note that an EBNF document is not a CFG grammar
// but it can be used to create one. It's also possible
// to create CFG grammars in other ways.
EbnfDocument doc = EbnfDocument.ReadFrom(grammarfile);

// create a CFG (context free grammar) from the DOM
var cfg = doc.ToCfGrammar();

// Write out the current CFG
Console.WriteLine(cfg);

// LL(1) is restrictive and most grammars won't be
// ready for LL(1) parsing without modification.
// We must remove left recursion, and then left
// factor the grammar to remove any ambiguity.
// PrepareLL1() does all of this, and returns any
// messages, warnings, and errors
//
// Note that this alters the grammar itself. 
Console.WriteLine("Preparing grammar for LL(1) parsing.");
var msgs = cfg.PrepareLL1();
bool hasError = false;
foreach (var m in msgs) {
    if (CfgErrorLevel.Error == m.ErrorLevel)
        hasError = true;
    Console.WriteLine(m);
}
if (hasError)
    return;

// write out the results of preparing the CFG grammar
Console.WriteLine();
Console.WriteLine("Modified grammar follows:");
Console.WriteLine();
Console.WriteLine(cfg);
Console.WriteLine();

// A parser uses a lexer to tokenize input, and a parser (based on a CFG) in order 
// to parse. A CFG is used to compute the parser. A lexer comes from elsewhere.
// In this case, we built it about using the DOM. We didn't have to. All we need to
// do is make sure that the tokens the lexer produces and the symbols the parser
// expects for the terminals match up. This is all handled for you if you just
// use the built-in facilities for lexing.
//
// use the DOM and the CFG together to create a lexer. CFGs don't know about lexing,
// only parsing. The DOM here knows what its lexer terminals are, so it is 
// responsible for building the lexer we'll pass to the parser later. It needs the 
// CFG so it can tie symbol names together with their ids and feed those to the lexer
// so that they agree on which symbols mean what.

// NOTE: If you do this before preparing the grammar, the lexer won't match up with the 
// modified CFG and you'll get a parse error, probably immediately.
var lexer = doc.ToLexer(cfg);

Console.WriteLine("Generating Table Driven Parser");
            
// cheap trick, but File.Exists doesn't work for UNC paths
try
{
    File.Delete(@"..\..\..\EbnfLLTableDrivenParser.Generated.cs");
}
catch { }
            
using (StreamWriter sw = new StreamWriter
      (File.OpenWrite(@"..\..\..\EbnfLLDrivenTableParser.Generated.cs")))
    // here below we generate a class that will parse the grammar represented by the cfg.
    // we write this to a file which is then used in this C# project to parse when the 
    // PARSER compile constant is defined. This method produces a table driven parser
    cfg.WriteCSharpTableDrivenLL1ParserClassTo(sw, "EbnfLLTableDrivenParser", null, lexer);

Console.WriteLine("Generating Compiled Parser");
try
{
    File.Delete(@"..\..\..\EbnfLLCompiledParser.Generated.cs");
}
catch { }
using (StreamWriter sw = new StreamWriter
      (File.OpenWrite(@"..\..\..\EbnfLLCompiledParser.Generated.cs")))
    // here we generate a class same as above but for the compiled parser. It doesn't use
    // tables to parse it generates switch cases for pretty much everything
    cfg.WriteCSharpCompiledLL1ParserClassTo(sw, "EbnfLLCompiledParser", null, lexer);

// parse itself
//
// here we tell the CFG to create an LL1Parser using the specified lexer we created.
// this is how to make a runtime parser and lexer - no code generation required, but 
// not as efficient
LLParser parser = cfg.ToLL1Parser(lexer, null);
            

Console.WriteLine("Using generated parser.");
parser.Close();
var pc = ParseContext.CreateFromFile(grammarfile);
            
// pick your poison - we've generated two classes, a table
// driven one, and a straight compiled one. The table driven
// one is usually a better choice, especially for large 
// grammars, but they should both be fast, and the compiled
// one probably takes less memory.

// commment out both lines below to use the runtime parser

parser = new EbnfLLTableDrivenParser(pc);
//parser = new EbnfLLCompiledParser();

// we use a parse context as the cursor over our input.
// restarting a parser simply allows you to feed it a new
// parse context. That way you can recycle an instance over
// several documents
parser.Restart(ParseContext.CreateFromFile(grammarfile));
            
// Dump our output to the console
// Since the Parser only keeps integer ids instead of symbol names around, the CFG
// is needed in order to do symbol-id to name resolution for printing.
// in your code you'll want to use the ints
OutputParseTo(Console.Out, parser, cfg);
parser.Close();

// one more time. This time grab the parse tree
parser.Restart(ParseContext.CreateFromFile(grammarfile));

var tree = parser.ParseSubtree();
parser.Close();
Console.WriteLine(tree.ToString(cfg)); // give it our parser so it can resolve ids

I've included a simple expression evaluator and the "newt" command line utility to generate code.

I've also added the remedial syntax highlighter back.

It's still an early submission - needs about a year of work before I'll be super happy with it but I'm already using it and it's fun, and pretty powerful.

If you still run VS2017 like I do, you may need to install the 4.7.2 or 4.8 targeting packs for Visual Studio to be able to build some of these projects.

The Other Beast, Cauldron

I've included a separate set of binaries for a VSIX package that integrates Newt into Visual Studio as a "custom tool" - the source for the project is too big to distribute here, but I'll offer it on request.

After installing it, what you do is you add a grammar to your project, set the custom tool property of the document in the solution explorer to NewtParserGenerator

Now, whenever you edit that file, and save it, code is generated automatically. The created class is <Input Grammar Filename>Parser so for example Ebnf.ebnf would create Ebnf.cs with the class EbnfParser underneath it. Eventually, I may add an attribute to the grammars that specify its incode representation so you can alter the class name, but for now, this is how it's done.

Cauldron has other generators, which I'll mention briefly here. Use the specified names in the Custom Tool field to instantiate the tool for a given input file. If you have questions about how this works in Visual Studio, use the comments here to ask them.

MergeMinifyPreprocessor takes an input file with a list of filenames in it (file relative path) and produces a single "brick" of minified C# source code you can more easily redistribute. I use it internally for my own projects, when I want to pull a chunk of code out of my massive codebase I've built. You can also specify ;foo to include the line "foo" somewhere in the list as raw text. Put it between files. Or at the top. using ;; will make it always appear at the top.
RegexGenerator: I wouldn't use this if I were you. It works, but explaining why you would even use it requires its own article. It's for building hand rolled parsers with partial FA lexers in them. It also doesn't generate classes that support complex error recovery yet.
TinyT4Preprocessor: I've run application centric, inproc webservers served by code generated with this little monster. All it is, is a barebones T4 processor that only generates C# and does not require Microsoft's CodeDom nuget package. The only template directive it supports is template but you shouldn't need the others for this. It supports basic T4. It has one major advantage over Microsoft's T4 engine as well - it streams. So you can use it with chunked transfer encoding in HTTP for better performance on large documents. Microsoft's does not stream for some reason. However, it's of primary import not for serving web pages, but for writing C# code generation routines. I more often just use more C# code, not HTML, in between those tags!

Points of Interest

I think this might be very similar to the algorithm used by NorKen Programmar, but I can't be sure because they never disclosed theirs. I didn't try to imitate them, but I do use something similar to their EBNF format because it's the cleanest I've seen.

History

May 11, 2019: Initial submission
May 13, 2019: Updated
- Removed the pure compiled parser support because it was too slow, and with it, I removed pre C#7 support.
- The generated table driven parser is just so much faster, I couldn't justify keeping the alternative.
- Added command line tool support and a new sample app. Removed the syntax highlighter since it wasn't a .NET core app. New updates coming shortly, including a .NET Core web based syntax highlighter that accepts your grammars.
May 14, 2019: Updated
- I've added support back for later versions of the .NET framework. Previous revisions didn't work due to a different dictionary constructor in .NET core. I've also optimized table parser generation some, at the cost of a little more memory.
- I've included an additional sample now that .NET framework is back. The highlighter is back too.