
MinosseCC: a lexer/parser generator for C#

Antonello Provenzano, 13 Sep 2005
This article describes MinosseCC, a project that aims to give developers a new parser/lexer generator for the C# language that is powerful, fast and AST-ready.

Introduction

About a year ago, I decided to start Minosse RDBMS, a project to create a complete RDBMS written entirely in C# that could run under both the .NET and Mono frameworks. The first fundamental point to settle was the language for interacting with the system: although SQL is widely used, well known and standardized for data processing, I also considered alternatives such as Novell Db or XQuery (which would have made the system heavily XML-based). In the end, I decided to implement SQL-92, the easiest of the standardized ANSI SQL versions to implement.

Why a new parser generator in this world?

Once I had decided on the language for the system, I had to create a parser that would accept and analyze the input and then produce results or perform operations. I studied many existing solutions, focusing mainly on ANTLR, Coco/R and Jay (which you can find in the Mono project), the only serious parser/lexer generators available. For each of them I found problems that stalled the development of the project for a long time:

  • ANTLR: It has many features and supports a wide range of grammar options, but its LL(k) support is limited (the lookahead factor can be specified at the beginning of the lexer, but in many cases this is problematic). It requires a reference to the external ANTLR library (which in some cases, such as Minosse's, would slow down parsing), and it is quite slow in execution compared to other solutions. Still, the large community around the project and the large number of grammars available for different languages make it a good way to approach parsing.
  • Coco/R: A very basic generator that creates parsers in C#, Java and C++. It has a very complex conflict resolution method and limited support for LL(k). In addition, it needs a *.frame file describing the scanner structure, which Coco/R uses to generate the lexer. The generated parser executes quite fast, but the grammar is difficult to implement.
  • Jay: A port by Miguel de Icaza of a well-known generator in the Lex/Bison/YACC family that was originally used to generate Java parsers. It lets you create fast C# parsers, but it supports just one entry point per parser and requires you to implement a scanner/lexer yourself for it to work properly. Furthermore, it is complex to implement external utilities that work with it.

Because of these problems, I approached the task from a different point of view: porting. I considered JavaCC, the most famous Java parser generator, which has become a standard among developers using Sun's language. It provides many options for setting up a parser, works faster than its competitors, and offers LL(k) parsing with configurable lookahead that is very easy to use. Furthermore, it generates standalone parsers and lexers that can be embedded in an application without referencing any external library.

Porting the JavaCC code

Originally, I thought of fully porting the JavaCC source code to obtain a pure C# project, compiled to .NET/Mono bytecode and embeddable in applications written for these frameworks. After a few attempts I could not figure out how to do it: the project is self-hosting (it requires a previous version of JavaCC to produce the parser for its own grammars), so I moved away from this plan. Instead, I started rewriting the original code, modifying the JavaCC grammar and the code that generates the Java parsers and lexers so that it produces C# sources that are fully compatible (from version 0.7.1) with both the Mono and .NET Frameworks. The result is a *.jar application that works much like the original JavaCC.

Differences between JavaCC and MinosseCC

The main difference between the two parser generators is in the grammar syntax. In JavaCC, you can write the following Java-like production within the grammar:

void Analyze(Vector vector) : {
    final String toAnalyze;
}
{
    toAnalyze = AnalyzeArgument()
    { vector.addElement(toAnalyze); }
}

without getting any error (final is a Java keyword); under MinosseCC, you would convert it to:

void Analyze(ArrayList list) : {
    string toAnalyze;
}
{
    toAnalyze = AnalyzeArgument()
    { list.Add(toAnalyze); }
}

This is mainly due to the conversion of keywords and operators between Java and C# that I had to perform in order to produce well-formed C# sources. Note that the correct types must also be provided: in the previous example, Vector is java.util.Vector, which does not exist in the .NET/Mono Framework; its counterpart is System.Collections.ArrayList. (If you implement your own type named Vector, there won't be any compile-time error.) Another big difference from the original JavaCC grammar is the way packages are defined and imported. As you can guess, the Java syntax for defining a package is the following:

package org.deveel.minossecc;

package, however, is not a C# keyword: C# expresses the same idea with a namespace, in the form

namespace Deveel.MinosseCC {
    ...
}

which requires the code to be placed inside the braces that follow the namespace declaration. Because of the structure of the grammar file, this cannot be done, so I implemented a way to declare the namespace as follows (very similar to the Java way):

namespace Deveel.MinosseCC;

This will include the parser and the lexer within the namespace Deveel.MinosseCC (or whatever namespace you like). The importing syntax follows the same logic as described above. The following Java syntax:

import java.lang.*;
import java.util.Vector;

can be easily converted to:

using System;
using System.Collections;

The using declarations must be specified right after the namespace declaration, because of formatting requirements: placing a using System; declaration before namespace Deveel.MinosseCC; won't raise any error when the grammar is compiled, but it will produce badly formatted source (which will cause errors when the C# code is compiled).
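The expansion of the single-line namespace form can be pictured with a small sketch in Java, the language MinosseCC's *.jar is built with. This is a hypothetical illustration, not MinosseCC's actual code: the class NamespaceWrapper, the method wrap and the exact layout of the output are assumptions made for the example.

```java
import java.util.List;

/** Hypothetical helper, NOT taken from MinosseCC: illustrates the idea only. */
final class NamespaceWrapper {

    /**
     * Expands the grammar-level form "namespace X;" followed by "using Y;"
     * lines into a C#-style namespace block that encloses the generated class.
     */
    static String wrap(String namespaceName, List<String> usings, String classSource) {
        StringBuilder out = new StringBuilder();
        out.append("namespace ").append(namespaceName).append(" {\n");
        for (String u : usings) {
            // using directives are emitted right after the opening brace
            out.append("    using ").append(u).append(";\n");
        }
        out.append(classSource).append("\n");
        out.append("}\n");   // the closing brace the grammar file never contains
        return out.toString();
    }
}
```

Calling NamespaceWrapper.wrap("Deveel.Analysis", java.util.Arrays.asList("System.Collections"), "    class Analyzer { }") returns a namespace block with the using directive and the class inside it.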

A brief grammar example

Following is a brief example of a grammar that parses a statement of the form ANALYZE number; (for example, ANALYZE 42;):

options {
    STATIC=false;
    IGNORE_CASE=true;
}

PARSER_BEGIN(Analyzer)
    namespace Deveel.Analysis;

    using System.Collections;

    class Analyzer {
        public void CloseAnalyzer() {
        }
    }
PARSER_END(Analyzer)

// this defines the characters to skip
SKIP: {
    " " |
    "\t" |
    "\r" |
    "\n"
}

TOKEN [IGNORE_CASE]: {
    <ANALYZE:   "analyze">
}

TOKEN : {
    <NUMBER: <FLOAT> |
        <FLOAT> ( ["e","E"] ([ "-","+"])? <FLOAT> )? > |
    <#FLOAT: <INTEGER> |
        <INTEGER> ( "." <INTEGER> )? |
        "." <INTEGER> > |
    <#INTEGER: ( <DIGIT> )+ > |
    <#DIGIT: ["0" - "9"] >
}

// start of the production
void Analyze(ArrayList list) : {
    string toAnalyze;
}
{
    <ANALYZE> toAnalyze = AnalyzeArgument() ";"
    { list.Add(toAnalyze); }
}

string AnalyzeArgument() : {
    Token t;
}
{
    // <INTEGER> and <FLOAT> are private (#) tokens, so match <NUMBER> instead
    t = <NUMBER>
    { return t.image; }
}
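As a rough cross-check of what the grammar above is meant to accept, the same statement shape can be written as a regular expression. The Java sketch below is only an approximation built by hand with java.util.regex; it is not output of MinosseCC, and the names AnalyzeStatementCheck and argumentOf are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hand-written approximation of the example grammar; names are hypothetical. */
final class AnalyzeStatementCheck {
    // Same characters as the SKIP section: space, tab, CR, LF
    private static final String SKIP = "[ \\t\\r\\n]*";
    // FLOAT: INTEGER, INTEGER "." INTEGER, or "." INTEGER
    private static final String FLOAT = "(?:\\d+(?:\\.\\d+)?|\\.\\d+)";
    // NUMBER: FLOAT with an optional exponent part
    private static final String NUMBER = FLOAT + "(?:[eE][-+]?" + FLOAT + ")?";

    private static final Pattern STATEMENT = Pattern.compile(
        SKIP + "analyze" + SKIP + "(" + NUMBER + ")" + SKIP + ";" + SKIP,
        Pattern.CASE_INSENSITIVE);   // mirrors IGNORE_CASE=true

    /** Returns the text of the numeric argument, or null if the input does not match. */
    static String argumentOf(String input) {
        Matcher m = STATEMENT.matcher(input);
        return m.matches() ? m.group(1) : null;
    }
}
```

For instance, argumentOf("ANALYZE 42;") yields the argument text, while an input with a non-numeric argument or a missing semicolon yields null.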

Producing parsers

You can produce the parser by simply calling the minossecc.jar (as with JavaCC):

java -classpath PATH_TO_MINOSSECC/minossecc.jar minossecc Analyzer.mcc

For all the syntax details specific to the parser grammar, please refer to the JavaCC documentation. I'm planning to write full documentation for MinosseCC (mainly on the differences from JavaCC), but for now you can start from the issues described above and follow the syntax defined in the JavaCC documentation.

Persisting problems

At the time of writing, just one bug still persists: the parser is unable to recognize the special built-in <EOF> token. In JavaCC, this token is recognized by catching the IOException that is thrown when the end of the stream is reached. Because of the differences between the I/O implementations of Java and .NET/Mono, this apparently easy task has proved difficult to solve.
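The exception-based end-of-input detection described above can be sketched in Java as follows. The underlying reader reports the end of the stream by returning -1, and a thin wrapper turns that into an IOException, which the reading loop catches at the point where a lexer would emit <EOF>. This is a simplified illustration rather than code taken from JavaCC or MinosseCC; the names EofByException, readCharOrThrow and countChars are hypothetical.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

/** Simplified illustration of exception-based EOF detection; names are hypothetical. */
final class EofByException {

    /** Reads one character and throws at end of input instead of returning -1. */
    static char readCharOrThrow(Reader in) throws IOException {
        int c = in.read();              // Reader.read() returns -1 at end of stream
        if (c == -1) {
            throw new IOException("end of stream");
        }
        return (char) c;
    }

    /** Consumes the whole text and returns how many characters were read before EOF. */
    static int countChars(String text) {
        Reader in = new StringReader(text);
        int n = 0;
        try {
            while (true) {
                readCharOrThrow(in);
                n++;
            }
        } catch (IOException eof) {
            return n;                   // a lexer would produce its <EOF> token here
        }
    }
}
```

A C# port would need an equivalent explicit check, since System.IO.TextReader.Read() likewise returns -1 at the end of input rather than throwing.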

Issues and future implementations

MinosseCC was born out of the main Minosse RDBMS project in order to produce an SQL-92 parser. For this reason, I have implemented only the parser generator and skipped other interesting parts of JavaCC (JJTree and JJDoc). Future versions of MinosseCC should include them; in fact, we are planning to add XQuery support, which can be done simply by producing the ASTs that JJTree generates.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

About the Author

Antonello Provenzano
Architect
Italy

Last Updated 14 Sep 2005
Article Copyright 2005 by Antonello Provenzano