Object-Oriented Parsing: Breaking With Tradition (Part 4)
A non-traditional method of parsing languages in an extensible way.
Introduction
This article describes a non-traditional technique of parsing computer languages in an object-oriented and extensible way. Specifically, it deals with parsing C# directly into codeDOM objects, but the same technique could be applied to any computer language. Sources are included. This is Part 4 of a series on codeDOMs, but it may be useful for anyone interested in parsing techniques. In the previous parts, I’ve discussed “CodeDOMs”, presented a C# codeDOM, and provided a WPF UI and IDE for the codeDOM.
Background
In the first part of this series, I talked about how traditional computer languages are generally closed to extension because of their text nature, and also about the traditional method of defining them with a grammar which is used to generate parsing logic which converts their text form into an abstract syntax tree (AST) that can be further processed by a compiler. The main goal of the codeDOM idea described in this series is to break with the tradition of defining computer languages in text form only, and to get away from the restrictions of text. However, the reality of the situation is that if it can’t interoperate easily with text, the codeDOM won’t see much use. So, it’s time to add the capability of parsing existing C# source files.
How Should The Parsing Work?
My main design goal for the parsing logic was to easily allow for possible modifications to the language, unlike traditional parsing techniques. It should be possible to add new codeDOM objects or subclass existing ones, and include parsing support without making changes to a separate grammar and re-generating the parsing logic. I also decided that parsing is a critical feature that the codeDOM objects will always need, and that I wanted to include parsing code within them, so that they “know” how to parse themselves. If you add or change a codeDOM object, you should be able to provide or modify the necessary text parsing logic internally without making changes anywhere else. Each codeDOM object encapsulates its own parsing ability, and this is why I refer to this technique as “object-oriented parsing”.
Part of the old traditional method still seems necessary: Converting the language text into a series of tokens. From there, how do we give our objects the ability to parse themselves? My idea was to think about how humans parse code when looking at it, and to try to give my objects the same ability. Sometimes you recognize code from keywords – that’s easy. But, other times it’s from where braces or other symbols are, or just general patterns in the code. For example, consider a C# property and a method:
public int Property { get; set; }
protected string Method(int parameter) { /*body*/ }
Both look like an optional modifier followed by a type of some sort followed by a name. A property then has braces around get/set methods, and a method has parentheses with optional parameters and braces containing the body. We want to be able to recognize what we're looking at without looking too far ahead, so even though the get/set methods in the body of a property might stand out to a human, it's best to avoid relying on those. Going by the first thing that looks different, we can say that if the next token after the name is an open brace ('{'), then it's probably a property, and if it's an open parenthesis ('('), then it's probably a method. Both of these are also expected to appear in a type declaration, and not inside a method body, so scope could also perhaps come into consideration when parsing (more on that later).
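To make that heuristic concrete, here's a minimal sketch (hypothetical code, not from the attached sources) of the decision as a simple function over a token list:

using System.Collections.Generic;

// A minimal sketch (not from the attached sources) of the "first
// distinguishing token" heuristic: after the modifier/type/name prefix,
// peek at the next token to decide what we're probably looking at.
public static class MemberClassifier
{
    public static string Classify(IReadOnlyList<string> tokens, int indexAfterName)
    {
        string next = indexAfterName < tokens.Count ? tokens[indexAfterName] : "";
        if (next == "{")
            return "property";  // e.g., "public int Property { get; set; }"
        if (next == "(")
            return "method";    // e.g., "protected string Method(int parameter) { /*body*/ }"
        return "unknown";       // defer; something else will have to recognize it
    }
}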
I ended up with the idea of having each codeDOM object register a specific token (or tokens) that indicated that it might have found a text instance of itself, with a callback method that could then do some additional verification if necessary before deciding for sure and attempting to parse a new instance of itself. One complication that came up is that some tokens might possibly indicate any one of several different objects. My solution was simply to allow a precedence to also be registered with the token, and if what an object “sees” in the token stream doesn’t look right, it just defers from parsing and the next object “in line” is given the opportunity. Much of C# can be parsed by relying upon unique keywords and symbols, but this precedence chaining is needed in a few cases (and can be optimized by having more commonly used objects check first). Choosing a precedence value does require some knowledge of existing objects when creating new ones, but it’s a very loose coupling, and the values are separated by large gaps (such as increments of 100) so that it’s easy for a new object to “insert” itself into an existing precedence chain.
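Here's a simplified sketch of how such precedence-chained parse points might be organized. The names below are illustrative, not the actual Nova.CodeDOM API (the real registration call, Parser.AddParsePoint, appears later in this article):

using System;
using System.Collections.Generic;
using System.Linq;

// Simplified sketch of precedence-chained parse points. Each codeDOM class
// registers a callback for a trigger token with a precedence value; values
// are spaced in large increments (e.g., 100) so a new object can easily
// insert itself into an existing chain.
public sealed class TokenStream { /* token cursor; details omitted in this sketch */ }

public delegate object ParseCallback(TokenStream tokens);

public static class ParsePointRegistry
{
    private static readonly Dictionary<string, List<(int Precedence, ParseCallback Callback)>>
        _parsePoints = new Dictionary<string, List<(int, ParseCallback)>>();

    public static void Register(string token, int precedence, ParseCallback callback)
    {
        if (!_parsePoints.TryGetValue(token, out var chain))
            _parsePoints[token] = chain = new List<(int, ParseCallback)>();
        chain.Add((precedence, callback));
        chain.Sort((a, b) => a.Precedence.CompareTo(b.Precedence));
    }

    // Give each registered object a chance, in precedence order; a callback that
    // doesn't like what it "sees" returns null to defer to the next one in line.
    public static object TryParse(string token, TokenStream tokens)
    {
        return _parsePoints.TryGetValue(token, out var chain)
            ? chain.Select(p => p.Callback(tokens)).FirstOrDefault(result => result != null)
            : null;
    }
}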
Context-Sensitive Keywords and Scope
The C# language has some context-sensitive keywords, such as 'get', 'set', and 'value' used for properties. They are only keywords if they occur in the expected scope – elsewhere they can be used for other names, such as for local variables. I handled this by including a scope in the form of a type when registering a token for parsing, and the callback is not activated if the parent object is not of the expected type.
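For example, the registration for a context-sensitive keyword might look like the following (a hypothetical sketch using the same AddParsePoint(token, callback, scopeType) shape shown for ClassDecl later in this article; the PropertyDeclBase scope type name is illustrative):

// Hypothetical sketch: register the context-sensitive 'get' keyword so that
// its callback only fires when the parent object has the expected scope type
// (the PropertyDeclBase name is illustrative). Everywhere else, 'get' remains
// an ordinary identifier and can be used for local variable names, etc.
internal static void AddParsePoints()
{
    Parser.AddParsePoint("get", Parse, typeof(PropertyDeclBase));
}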
While implementing the scoping, it occurred to me that many of the other keywords in C# that aren't context-sensitive could be. For example, modifiers such as 'public', 'private', 'protected', etc. are only used for type declarations and members of types – there's no reason for them to be global. Fewer global keywords means fewer restricted names, and when you're coding a project like this (one that models the language itself), it's not uncommon to want to name local variables after language keywords. Of course, C# provides the '@' prefix to work around name collisions, but context-sensitive keywords still seem to make a lot of sense – why not use them whenever you can?
In the end, I made the parsing context-sensitive for some C# global keywords when there wasn't any extra work to do so (I didn't want to go out of my way to force them to be global). For example, the modifiers mentioned above aren't registered as parse points, since they are used by many different objects. Instead, they are ignored in the token stream until an object triggered by another token (such as 'class') parses, and then it parses backwards a bit to pick up any modifiers. So, they end up being treated effectively as context-sensitive rather than global keywords. The fact that this differs from the C# spec doesn't really matter, because this technique will produce valid results for valid C# code. Invalid programs might parse differently, but that's not really a problem either, as long as the error messages are appropriate.
As a matter of fact, I'm pretty sure that my parser will actually produce better results for some cases of invalid code than the C# compiler itself, but it also probably misses the mark in others, because it hasn't been tested thoroughly on invalid code yet. Just try deleting an open brace ('{') on a method in VS, and watch the errors fly from there to the end of the file – it's easy to handle invalid code better than that, but most parsers don't try very hard (actually, handling this sort of thing well is probably an argument against context-sensitive parsing – it's a trade-off). Note that in theory, if you're editing a codeDOM tree and using incremental parsing to parse changes, having an unmatched brace would produce exactly one error message, and would not have any impact on any code outside the affected block. This is yet another benefit of codeDOMs over text – in theory, editing code could be done just as easily, but without an explosion of error messages from the edit point to the end of the file just because you haven't finished the edit yet and you have unmatched braces temporarily (VS tries to avoid this mess by auto-inserting a close brace when you type an open brace, but that hardly covers all editing situations).
Implementation Details
Each codeDOM class implements a static method named AddParsePoints() in order to register one or more tokens with the Parser class. The CodeObject base class calls all of these methods in a static constructor (using reflection, but it's a one-time thing at startup). For each file or code fragment to be parsed, the Parser class is instantiated and parses the text into Tokens, also classifying each token as an identifier, symbol, comment, literal, etc. Whitespace and comments are skipped so that they don't interfere with the parsing of tokens, but (unlike with most compilers) they are kept around instead of being thrown away. Comments remain attached to the Token that preceded them, and are later attached to the object that is parsed from that token. Callbacks are executed for tokens that were registered and are in the proper scope; the callback does any additional necessary validation and, if everything looks good, it parses an instance of the codeDOM object from the token stream (otherwise it returns null to indicate to the parser that the token wasn't parsed).
For example, here's the new PARSING region of the ClassDecl class (some code dealing with edge cases such as compiler directives has been removed for simplicity):
#region /* PARSING */

// The token used to parse the code object.
public new const string ParseToken = "class";

internal static void AddParsePoints()
{
    // Classes are only valid with a Namespace or TypeDecl parent, but we'll allow
    // any IBlock so that we can properly parse them if they accidentally end up at
    // the wrong level (only to flag them as errors).
    // This also allows for them to be embedded in a DocCode object.
    Parser.AddParsePoint(ParseToken, Parse, typeof(IBlock));
}

// Parse a ClassDecl.
public static ClassDecl Parse(Parser parser, CodeObject parent, ParseFlags flags)
{
    return new ClassDecl(parser, parent);
}

protected ClassDecl(Parser parser, CodeObject parent)
    : base(parser, parent)
{
    parser.NextToken();                        // Move past 'class'
    ParseNameTypeParameters(parser);           // Parse the name and any type parameters
    ParseModifiersAndAnnotations(parser);      // Parse any attributes and/or modifiers
    ParseBaseTypeList(parser);                 // Parse the optional base-type list
    ParseConstraintClauses(parser);            // Parse any constraint clauses
    new Block(out _body, parser, this, true);  // Parse the body
}

#endregion
Refer to the sources to see the details of what the subroutines do, but you get the general idea of what's going on. One thing that's not obvious is that the ParseModifiersAndAnnotations() method actually parses backwards in the token stream (from a list of unused tokens), while the rest of the parsing is forwards. This is actually one of the more complicated language features to parse – most are simpler.
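Here's a simplified sketch of the backwards idea (hypothetical code, not the actual ParseModifiersAndAnnotations() implementation, which works on Token objects and also handles attributes): modifiers that didn't trigger any parse point accumulate in an unused-token list, and the triggered object pops them off from the end:

using System.Collections.Generic;

// Simplified sketch of backwards modifier parsing. Tokens like 'public' or
// 'static' are skipped by the parser and land in an unused-token list; an
// object triggered later (such as by 'class') walks that list backwards.
public static class ModifierHelper
{
    private static readonly HashSet<string> Modifiers = new HashSet<string>
        { "public", "protected", "private", "internal", "static", "abstract", "sealed" };

    public static List<string> ParseModifiersBackwards(List<string> unusedTokens)
    {
        var modifiers = new List<string>();
        while (unusedTokens.Count > 0 && Modifiers.Contains(unusedTokens[unusedTokens.Count - 1]))
        {
            // Insert at the front so the original source order is preserved.
            modifiers.Insert(0, unusedTokens[unusedTokens.Count - 1]);
            unusedTokens.RemoveAt(unusedTokens.Count - 1);
        }
        return modifiers;
    }
}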
It's worth noting that we're not resolving symbolic references yet, so most of them will parse to UnresolvedRef objects (which are rendered in red). Built-in types (such as 'int', 'string', 'object', etc.) are exceptions, because they are parsed by their keywords, and the referenced type is known.
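For instance, a built-in type keyword could be registered like any other parse point, and since the keyword uniquely identifies the type, the reference is known immediately (a hypothetical sketch; the registration details and scope type used for built-in types in the actual sources may differ):

// Hypothetical sketch: 'int' triggers its own parse point, so the parsed
// object can carry a resolved reference to the built-in type, unlike
// user-defined names, which parse to UnresolvedRef objects until a later
// resolve pass.
internal static void AddParsePoints()
{
    Parser.AddParsePoint("int", Parse, typeof(CodeObject));  // assumed scope type
}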
Complications
It shouldn’t be any surprise to anyone that parsing a language as complex as C# isn’t really as simple as I might have made it appear so far. Some of the complications that came up are:
- Blocks and Expressions both require various special handling to parse. This is taken care of in the "parse constructor" of Block, and a special static Parse() method on the Expression class.
- Operators have left/right associativity to worry about in addition to precedence. This was handled by using a different method to register them, and Expression.Parse() retrieves and makes use of it while parsing expressions.
- The Parser class must "maximize" tokens, treating "+=" as a single token instead of two separate ones. Conversely, a ">>" token might have to be treated as two separate tokens, depending upon the context ("A<T1, B<T2>>" vs "A >> B") – this is a well-known ambiguity in the C# grammar (see the sketch after this list).
- There are cases where it's necessary to determine if a series of tokens looks like a valid type reference, and nested generic types make this rather complicated. Generic types (and methods) also have an alternate syntax ("A{T}") in documentation comments which must be handled.
- Conditionals ("a ? b : c") are very tricky to parse, because they can be nested in very complex expressions, and nullable types also use the '?' token. Built-in nullable types (such as "int?") are parsed by the keyword, but distinguishing other possible nullable types from conditional expressions requires recursive look-ahead of tokens.
- Conditional compiler directives are very complicated to handle, since they can split up any code that should otherwise be parsed into a single object. A lot of special logic is required to deal with such cases, and there are probably some rare-but-possible cases which aren't handled yet, but it's also easy to avoid them by tweaking the sources being parsed.
- There were plenty of other more minor complications.
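As an illustration of the ">>" ambiguity mentioned above, here's a minimal sketch (hypothetical code; the actual Parser handles this differently as part of expression and type parsing) of splitting a maximized ">>" token while generic type argument lists are open:

using System.Collections.Generic;

// Illustrative sketch of the ">>" ambiguity: the tokenizer "maximizes" to
// ">>", but while type argument lists are open (e.g., "A<T1, B<T2>>"), each
// '>' must close a pending '<'. The openAngleDepth count is assumed to be
// tracked by the caller; a right-shift operator is only possible when no
// generic type argument list is open.
public static class TokenSplitter
{
    public static IEnumerable<string> SplitIfClosingGenerics(string token, int openAngleDepth)
    {
        if (token == ">>" && openAngleDepth > 0)
        {
            yield return ">";  // closes the inner type argument list (B<T2>)
            yield return ">";  // closes the outer type argument list (A<...>)
        }
        else
        {
            yield return token;  // leave ">>" intact as a right-shift operator
        }
    }
}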
Although this parsing technique is relatively straightforward in principle, parsing a language as complex as C# is certainly never going to be simple. The primary advantage of doing it this way is having objects that basically know how to parse themselves, making it object-oriented and extensible.
Does This Really Work?
Yes, it really does. You might think that this technique could crash and burn on really complex code that only traditional parsing methods can deal with. All I can say is that I’ve fed every big steaming pile of C# that I could find to it, and it’s been able to digest them all. Sure, there was the occasional hiccup at first, but I was able to fix them all easily enough within the confines of my “object-oriented parsing” design.
There were times while I was implementing this when I wasn't sure if it would really handle all situations. Now that it's done and working just great, I feel that most computer languages should really be parsed using a similar method rather than the old traditional way. I'm sure others can improve upon my implementation provided here, but the general technique just seems better to me. I'm sure this will be controversial, so I guess I'm painting a target on my back… anyway, at least this method allows for a language to be open and easily extensible by its users.
Performance: Is This Slower Than Traditional Parsing?
I don’t really think so. Most of the time is taken up by disk I/O, and if you mostly eliminate that (by running the test twice consecutively), I’m seeing performance in the hundreds of thousands of lines of code per second (100K to 500K+, depending upon the CodeDOM version, multi-threading, server-mode GC, and of course the code base being parsed and the CPU type). That’s pretty fast. It’s mostly just parsing the text into tokens, doing a hash table lookup for each token, and fairly straightforward parsing. Some C# features can get a bit gnarly and require some look-ahead in the token stream, but these don’t occur all that often. There hasn’t been any performance tuning done yet, so there’s probably room for improvement. I’ll take a closer look at performance in a future article, when I compare this project to Roslyn.
Formatting
Parsed code retains newline information in the codeDOM, so when emitted as text it should be very similar to the original input with the possible exception of whitespace. There are options to control the use of tabs vs spaces which can be set in the config file. Whitespace in the output conforms to common default formatting for C#, but options could easily be added to use whatever is desired. Extra spaces used for alignment purposes can be handled by adding support for automatic alignment, some of which has already been added. This codeDOM is not currently designed to exactly reproduce whitespace (although such support could be added easily enough for the tradeoff of using more memory). But, it comes pretty close, including handling some multi-line alignment scenarios. Additional such scenarios and/or whitespace options could easily be added, and will be as needed.
Nova Studio provides a Diff option which compares the input text to the output using a diff tool configured in the config file, which is useful for seeing what differences there might be. One benefit of running code through this parser is that it automatically re-formats it to be consistent according to the configured settings. There is also an AutomaticFormattingCleanup option available (it can be turned on in the config file). It does things such as:
- Remove optional braces from single-line bodies, and add them to multi-line bodies.
- Force empty braces onto a single line for methods (if the open brace is on a new line).
- Force property get/set accessors to a single line if they have a single-line body.
- Force constructor initializers to start on a new line.
- Move operators that appear on the end of a line to the start of the next line.
- Remove parentheses from expressions that don’t need them or use them by default.
- Remove excessive blank lines or those in unusual places.
Using the Attached Source Code
The new code added in this installment is located in a new Parsing folder in the Nova.CodeDOM project, along with various parsing related methods in most of the existing CodeDOM classes – segregated into regions with a comment of “PARSING”. Also, Nova.Examples has a couple of new tests that parse an existing file and code fragment, and Nova.Studio can now load, parse, save, or diff any C# file (it diffs the original input file with the codeDOM output, using any diff tool setup in the .config file). Try loading various files from the sources, or try some from your own projects. Unlike manually generated code, in which local references are usually resolved, you’ll see that the majority of references in parsed code are unresolved – so you’ll be seeing a lot of red. As usual, a separate ZIP file containing binaries is provided so that you can run them without having to build them first.
Summary
I’ve built a parser that violates the laws of computer science as recorded long ago in the ancient tomes... so, perhaps I will be excommunicated. It’s not how you’re supposed to do it, but it seems to get the job done, and rather quickly and easily, and it allows for easy extension of my codeDOM with new objects. You’re welcome to go back to grammars, and parser generators, and ASTs if you like – but, I’m sticking with my technique of parsing text directly into codeDOM objects.
Of course, there’s still a lot of work to be done. My Nova Studio “IDE” currently only loads one file at a time, which isn’t very useful. It needs to be able to load “.sln” and “.csproj” files, and all of the source files they contain. In my next article, I’ll add codeDOM classes and parsing for VS Solution and Project files, so that it’s possible to easily load entire projects or solutions into a codeDOM.