Click here to Skip to main content
Click here to Skip to main content
Add your own
alternative version

Irony - .NET Compiler Construction Kit

, 4 Jan 2008 MIT
Introduction to Irony - a new technology of parser/compiler construction for .NET.
irony_article.zip
Irony_article
irony_exprTree.jpg
irony_GrammarExplorer.jpg
Irony_src.zip
irony_src.zip
Irony_src
Irony.GrammarExplorer
.svn
all-wcprops
entries
format
prop-base
props
text-base
030.Irony.GrammarExplorer.csproj.svn-base
app.config.svn-base
fmGrammarExplorer.cs.svn-base
fmGrammarExplorer.Designer.cs.svn-base
fmGrammarExplorer.resx.svn-base
fmShowException.cs.svn-base
fmShowException.Designer.cs.svn-base
fmShowException.resx.svn-base
License.txt.svn-base
Program.cs.svn-base
tmp
prop-base
props
text-base
Properties
.svn
all-wcprops
entries
format
prop-base
props
text-base
AssemblyInfo.cs.svn-base
Resources.Designer.cs.svn-base
Resources.resx.svn-base
Settings.Designer.cs.svn-base
Settings.settings.svn-base
tmp
prop-base
props
text-base
Settings.settings
Irony.Samples
.svn
all-wcprops
entries
format
prop-base
props
text-base
020.Irony.Samples.csproj.svn-base
License.txt.svn-base
tmp
prop-base
props
text-base
OtherGrammars
.svn
all-wcprops
entries
format
prop-base
props
text-base
ExpressionGrammar.cs.svn-base
GrammarEx434.cs.svn-base
GrammarEx446.cs.svn-base
GrammarExL514.cs.svn-base
tmp
prop-base
props
text-base
Properties
.svn
all-wcprops
entries
format
prop-base
props
text-base
AssemblyInfo.cs.svn-base
tmp
prop-base
props
text-base
Python
.svn
all-wcprops
entries
format
prop-base
props
text-base
Python_auth_svn.txt.svn-base
PythonGrammar.cs.svn-base
tmp
prop-base
props
text-base
Ruby
.svn
all-wcprops
entries
format
prop-base
props
text-base
Ruby_auth.txt.svn-base
RubyGrammar.cs.svn-base
tmp
prop-base
props
text-base
Scheme
.svn
all-wcprops
entries
format
prop-base
props
text-base
SampleAstNodes.cs.svn-base
SchemeGrammar.cs.svn-base
tmp
prop-base
props
text-base
SourceSamples
.svn
all-wcprops
entries
format
prop-base
props
text-base
_about.txt.svn-base
99 bottles.py.svn-base
99 bottles.rb.svn-base
99 bottles.scm.svn-base
ExprSample.txt.svn-base
tmp
prop-base
props
text-base
99 bottles.rb
99 bottles.scm
Irony
.svn
all-wcprops
entries
format
prop-base
props
text-base
010.Irony.csproj.svn-base
Common.cs.svn-base
License.txt.svn-base
tmp
prop-base
props
text-base
Compiler
.svn
all-wcprops
entries
format
prop-base
props
text-base
CompilerContext.cs.svn-base
Enums.cs.svn-base
EventArgs.cs.svn-base
LanguageCompiler.cs.svn-base
SyntaxError.cs.svn-base
tmp
prop-base
props
text-base
AST
.svn
all-wcprops
entries
format
prop-base
props
text-base
AstNode.cs.svn-base
GenericNode.cs.svn-base
tmp
prop-base
props
text-base
Grammar
.svn
all-wcprops
entries
format
prop-base
props
text-base
BnfElement.cs.svn-base
BnfExpression.cs.svn-base
Grammar.cs.svn-base
GrammarData.cs.svn-base
GrammarDataBuilder.cs.svn-base
tmp
prop-base
props
text-base
NonTerminals
.svn
all-wcprops
entries
format
prop-base
props
text-base
NonTerminal.cs.svn-base
tmp
prop-base
props
text-base
Parser
.svn
all-wcprops
entries
format
prop-base
props
text-base
Parser.cs.svn-base
ParserStack.cs.svn-base
tmp
prop-base
props
text-base
Scanner
.svn
all-wcprops
entries
format
prop-base
props
text-base
Scanner.cs.svn-base
SourceFile.cs.svn-base
Token.cs.svn-base
tmp
prop-base
props
text-base
Terminals
.svn
all-wcprops
entries
format
prop-base
props
text-base
_Terminal.cs.svn-base
CharLiteral.cs.svn-base
CommentTerminal.cs.svn-base
ConstantSetTerminal.cs.svn-base
CustomTerminal.cs.svn-base
IdentifierTerminal.cs.svn-base
NumberTerminal.cs.svn-base
RegExBasedTerminal.cs.svn-base
StringLiteral.cs.svn-base
SymbolTerminal.cs.svn-base
tmp
prop-base
props
text-base
TokenFilters
.svn
all-wcprops
entries
format
prop-base
props
text-base
BraceMatchFilter.cs.svn-base
CodeOutlineFilter.cs.svn-base
TokenFilter.cs.svn-base
tmp
prop-base
props
text-base
Properties
.svn
all-wcprops
entries
format
prop-base
props
text-base
AssemblyInfo.cs.svn-base
tmp
prop-base
props
text-base
<!--------------------------------------------------------------------------->  
<!--                        IGNORE THIS SECTION                            -->
<html>
<head>
<title>Irony - .NET Compiler Construction Kit</title>
<Style>
BODY, P, TD { font-family: Verdana, Arial, Helvetica, sans-serif; font-size: 10pt }
H2,H3,H4,H5 { color: #ff9900; font-weight: bold; }
H2 { font-size: 13pt; }
H3 { font-size: 12pt; }
H4 { font-size: 10pt; color: black; }
PRE { BACKGROUND-COLOR: #FBEDBB; FONT-FAMILY: "Courier New", Courier, mono; WHITE-SPACE: pre; }
CODE { COLOR: #990000; FONT-FAMILY: "Courier New", Courier, mono; }
</style>
<link rel="stylesheet" type=text/css href="http://www.codeproject.com/styles/global.css">
</head>
<body bgcolor="#FFFFFF" color=#000000>
<!--------------------------------------------------------------------------->  


<!-------------------------------     STEP 1      --------------------------->
<!--  Fill in the details (CodeProject will reformat this section for you) -->

<pre>
Title:       Irony - .NET Compiler Construction Kit
Author:      Roman Ivantsov 
Email:       rivantsov@gmail.com
Member ID:   2829616
Language:    C# 2.0
Platform:    Windows XP/Vista, .NET 2.0, Visual Studio 2005/Express
Technology:  .NET 
Level:       Advanced
Description: Introduction to Irony - a new technology of parser/compiler construction
Section      General c# programming
SubSection   Parsers and Interpreters
</pre>

<!-------------------------------     STEP 2      --------------------------->
<!--  Include download and sample image information.                       --> 

<ul class=download>
<li><a href="Irony_src.zip">Download source - 113 Kb</a></li>
</ul>


<!-------------------------------     STEP 3      --------------------------->
<!--  Add the article text. Please use simple formatting (<h2>, <p> etc)   --> 

<h2>Introduction</h2>
<p>
The goal of Irony, a new open source project, is to build a complete set of libraries and tools for language implementations in .NET platform. It is currently in its first phase, which includes building the two compiler front-end modules - scanner and parser. This article provides an overview of the technology and focuses on parser implementation with Irony. 
The project is hosted at CodePlex: <a href="http://www.codeplex.com/irony">http://www.codeplex.com/irony</a>. 

<p>Irony brings several principal innovations into the field of compiler construction. Like many parser-building tools in use today, Irony produces a working parser from grammar specifications. However, unlike the existing parser builders, Irony does not use a separate meta-language for encoding the target grammar. In Irony <strong>the grammar is encoded directly in c#</strong> using BNF-like expressions over grammatic elements represented by .NET objects. Additionally, <strong>no code generation is employed</strong> - Irony's universal LALR parser uses the information derived from c#-encoded grammar to control the parsing process. 

<h2>Using the code in download</h2>
<p>The download supplied with this article contains Irony's core assembly, Grammar Explorer application, and a set of sample parsers: a parser for arithmetic expressions and parsers based on simplified grammars for Scheme, Python and Ruby. With Grammar Explorer you can view the analysis data for sample grammars and test the parsers on source samples. The following image shows a source sample and its parsed syntax tree in the Grammar Explorer: </p>  

<img src="irony_GrammarExplorer.jpg" alt="GrammarExplorer" width=586 height=516>

<p>In order to try out the project:
<ol>
  <li>Download and unzip the source files</li>
  <li>Open <code>Irony_all.sln</code> solution file in Visual Studio</li>
  <li>Set the GrammarExplorer project as a startup project </li>
  <li>Build and run the application (F5)</li> 
  <li>When the Grammar Explorer's window opens, select the grammar in the top combobox and explore the grammar data</li>
  <li>Load source code sample from <em>Irony.Samples\SourceSamples</em> folder</li> 
  <li>Parse the source and explore the output syntax tree</li> 
</ol>

<h2>Background</h2>
<p>Many modern applications use parsing technologies in one form or another. Compiler development is one area that heavily employs  parsers. The other examples are script engines and expression evaluators, source code analysis and refactoring tools, template-based code generators, formatters and colorizers, data import and conversion tools, etc. Therefore it is important that application developers have reliable and straightforward methods of implementing a parser when one is needed. 
Unfortunately, this is not the case - with the current state of technology constructing a parser is still a big challenge. 

<p>In author's opinion, the reason for this is the fact that parser construction technology did not change much since the introduction of LEX/YACC generators in the mid-seventies. Most of the generators in use today are ports or derivatives of these original tools. Many things had changed, new languages and technologies had been introduced, OOP being probably the most prominent example. But somehow these advances didn't make into parser construction technology. Same plain c-style generated code, even in java- or c#- targeted parsers, with the same global variables and even the same yy- naming conventions. 

<p>Irony is an attempt to fix the situation, to bridge the existing gap and to bring the power of a modern language environment like c# and .NET into parser and compiler construction field. 

<h2>Simple Expression Parser</h2>
<p>To illustrate how Irony's approach is different, let's use it to build a parser for simple arithmetic expressions. We will target expressions that contain only numbers, variables, parenthesis and arithmetic operators +, -, *, / and **.  
<p>The first step is to write a formal grammar for expression language using the Backus-Naur Form (BNF) notation:
<pre>
  Expr  -> n | v | Expr BinOp Expr | UnOp Expr | (  Expr )
  BinOp -> + | - | * | / | **
  UnOp  -> -
  ExprLine -> Expr EOF
</pre>
<p>Our grammar has four non-terminals: <code>Expr</code>, <code>BinOp</code>, <code>UnOp</code> and <code>ExprLine</code> - shortcuts for expression, binary operation, unary operation and expression line. The <code>ExprLine</code> is the top-level element of our grammar representing the complete expression terminating with the <code>EOF</code> (end-of-file) symbol.
<p>Our grammar references several terminals: <code>n</code> (number), <code>v</code> (variable), and symbols for arithmetic operators and parenthesis. Terminal symbols are elements that are predefined for grammar rules - they don't have a definition in grammar. Terminals represent primary indivisible words (tokens) in the input stream and are produced by input pre-processor called Scanner or Lexer. Non-terminals on the other hand are compound elements and their internal structure is defined by BNF expressions. 
  
<p>To define this grammar in Irony, we create a new class inherited from the <code>Irony.Compiler.Grammar</code> base class: 

<pre lang=cs>
using System;
using System.Collections.Generic;
using System.Text;
using Irony.Compiler;
 
namespace Irony.Samples {
  //Sample expression grammar - recognizes arithmetic expressions with numbers and variables
  public class ExpressionGrammar : Irony.Compiler.Grammar {
    public ExpressionGrammar() {
      // 1. Terminals
      Terminal n = new NumberTerminal("number");
      Terminal v = new IdentifierTerminal("variable");
 
      // 2. Non-terminals
      NonTerminal Expr = new NonTerminal("Expr");
      NonTerminal BinOp = new NonTerminal("BinOp");
      NonTerminal UnOp = new NonTerminal("UnOp");
      NonTerminal ExprLine = new NonTerminal("ExprLine");
 
      // 3. BNF rules
      Expr.Expression = n | v | Expr + BinOp + Expr | UnOp + Expr | "(" + Expr + ")";
      BinOp.Expression = Symbol("+") | "-" | "*" | "/" | "**";
      UnOp.Expression = "-";              //Irony provides default conversion here 
      ExprLine.Expression = Expr + Eof;   //Eof is a predefined EOF terminal
      this.Root = ExprLine;               //Set grammar top element
 
      // 4. Set operators precedence and associativity
      RegisterOperators(1, "+", "-");
      RegisterOperators(2, "*", "/");
      RegisterOperators(3, Associativity.Right, "**");
 
      // 5. Register parenthesis as punctuation symbols so they will not appear in the syntax tree
      PunctuationSymbols.AddRange(new string[] { "(", ")" });

    }
  }
}
</pre>

<p>The entire grammar is defined in the class constructor. First we define terminal and non-terminal variables that we will use in grammar rules. We use standard classes defined by Irony. The name parameter in constructor is used to identify the terminal in listings of grammar data that you see in Grammar Explorer. Notice that we do not need to explicitly create terminals for operator symbols and parenthesis - we will use the string symbols directly in grammar expressions. Irony grammar processor discovers all these symbols automatically and transform them into symbol terminals. 

<p>Next comes the main part - encoding the grammar rules. NonTerminal objects have a property Expression which is assigned an expression over other terminals and non-terminals. We use c# expressions that look very similar to the original BNF - Irony overloads "+" and "|" operators for syntax element classes so we can write BNF expressions directly in c#! 

<p>A few comments on the code. In the expression for <code>BinOp</code> the first "+" symbol is wrapped into a <code>Symbol(...)</code> method. This method converts "+" into an instance of <code>SymbolTerminal</code> class, so c# compiler understands correctly the meaning of the first <code>|</code> operation and invokes the overload defined in Irony. Without this wrapper the compiler would not know how to interpret an expression <code>string | string</code>. The string needs to be wrapped just once when both initial operands are strings; other operators in the same expression will use the result of previous operation as an operand - which will be an instance of one of Irony's syntax classes - and operator overloading would work automatically. 
<p>The last statement in step 3 sets the <code>ExprLine</code> as the root element of our grammar by assigning it to the <code>Root</code> property of our grammar. This is an entry point into our grammar for Irony; starting from this non-terminal, it can discover the entire grammar tree through hierarchy of expressions.  
<p>Finally, we specify operators' precedence and associativity - this information will be used by the parser to identify the correct grouping of operations. The last statement marks parenthesis as punctuation symbols - they will be excluded from the output syntax tree. Once we have the syntax tree, we don't need such symbols anymore as the order of operations is encoded explicitly in the tree structure.
<p>The following is the example of how to use this grammar to parse an expression:               

<pre lang=cs>
  ExpressionGrammar grammar = new ExpressionGrammar();
  LanguageCompiler compiler = new LanguageCompiler(grammar);

  string expr = "a + b * 3.14 / (2.0 * x + y**z**2)";
  AstNode rootNode = compiler.Parse(expr);
  //rootNode now contains an AST node that is the root of the syntax tree.
</pre>

<p>The following image shows the parsed expression tree as it appears in the Grammar Explorer:</p>
<img src="irony_exprTree.jpg" alt="Parsed Expression Tree" width=299 height=455></p>

<p>The following sections provide more detailed information on the parsing process as it is implemented in Irony. 


<h2>Irony Parsing Pipeline</h2>
<p>A parser is a program that transforms the source text written in an input language into a tree-like representation called Abstract Syntax Tree (AST) or simply syntax tree. The tree nodes correspond to the syntactic structures of the input language. Schematically the process may be represented as follows:

<pre>
  source -&gt; PARSER -&gt; syntax tree
</pre>

<p>Traditionally this transformation is performed in two steps. First, the module called scanner combines the  sequences of input characters into more meaningful combinations called tokens - the words of the input language. The examples of tokens are numbers, string literals, keywords, special characters, etc. Scanner is sometimes called "lexer" or "tokenizer". 
The tokens produced by a scanner are then passed to the parser proper that actually constructs the syntax tree from the stream of tokens. The processing schema now changes to the following:   
<pre>
  source -&gt; SCANNER -&gt; tokens -&gt; PARSER -&gt; syntax tree
</pre>

<p>This structure is a standard processing pipeline used by all parsing solutions today. In Irony this standard schema is extended. Irony adds optional processing modules called Token Filters between scanner and parser:
<pre>
  source -&gt; SCANNER -&gt; tokens -&gt; (TOKEN FILTER)* -&gt; tokens -&gt; PARSER -&gt; syntax tree
</pre>

<p>(The Kleene star indicates "zero or more" objects). We will talk more about token filters and what they are used for in one of the following sections. For now lets talk about the connection between the modules. Irony uses a bit unconventional method to connect the modules - the pipeline is as a chain of c# iterators. Iterators are a very powerful feature introduced in c# 2.0. To illustrate how it works let's look at a sample token filter. The following code snippet shows filter's main processing method. The sample filter does nothing, simply passes all input tokens to the module in the pipeline: 

<pre lang=cs>
  public override IEnumerable<Token> BeginFiltering(CompilerContext context, IEnumerable<Token> tokens) {
    foreach (Token token in tokens) {
      // Analyze incoming token here and take the action:
      // we can either pass the token, supress it, or produce extra tokens before or after if we need to. 
      yield return token; //default action - pass the token down the pipeline
    }
  }	
</pre>
<p>As the comment in the code indicates, we can easily add or exclude tokens to/from the stream. Using iterators for connecting the modules provides much more flexibility and independence in constructing the individual modules while maintaining high pseudo-parallelism in execution. The execution point literally jumps from one module to another while processing the input as the tokens are moved along the pipeline.
<p>Let's now discuss each module in the pipeline in more details.  

<h2>Scanner</h2>
<p>The purpose of the scanner is to group input characters into meaningful words represented by objects called tokens. The examples of tokens are: numbers, string literals, keywords, variables, special symbols like parenthesis, comma, operator symbols, etc.
The other responsibility of the scanner is to eliminate the whitespace and comments from the input stream, leaving only meaningful content. Irony's <code>Scanner</code> class is a general implementation of a scanner. It is to be used by a custom language implementations directly, without inheriting into a custom scanner class. The customized behavior for particular input language is determined by the list of terminal objects that Irony finds in the language grammar rules.
<p>The expression grammar we just built uses two standard terminal classes: <code>NumberTerminal</code> and <code>IdentifierTerminal</code>. The expression grammar also implicitly defined a number of symbol terminals - all string literals in BNF expressions are converted automatically to <code>SymbolTerminal</code> objects. In Irony, the terminal classes are simply token recognizers. The main recognition method is defined in the base <code>Terminal</code> class:
<pre lang=cs>
   public virtual Token TryMatch(CompilerContext context, ISourceStream source) {
     //recognize a token in the source stream and return it, or return null if there is no match
	 ....
   }
</pre>
<p>The <code>ISourceStream</code> interface represents the input character stream.  
Language implementor can either use the standard terminals provided with Irony, or create custom terminal classes for language-specific tokens. See the examples of the <code>TryMatch</code> implementations in standard terminal classes in the article's download. 

<p>The following is a simplified description of Irony's scanning algorithm. The scanning process starts at initial (0) position in source code. First the scanner skips all characters that represent whitespace. Once a non-space character is reached, the scanner runs through the list of all defined terminals and for each calls the <code>TryMatch</code> method. It tries to find the terminal that can construct a token at current position in the input. If there is more than one terminal and several constructed tokens, the scanner chooses the "longest" token. This winner token is passed (yielded) to the output stream and the source position is advanced to the character right after the yielded token. After that the process is repeated at new position, and until the end of the source file is reached. 
<p>The actual implementation also does some optimization. The scanner does not actually run through all terminals all the time - it uses a hash table to quickly lookup the list of terminals that need to be called based on the current character in the input stream. The lookup table is constructed at initialization time from prefixes that each terminal may explicitly declare. This terminal hash table significantly speeds up the scanning process. 

<p>In multiple aspects  the Irony scanner operates differently  compared to existing compiler toolkits.   
First, the scanner completely ignores the whitespace, even for indentation-sensitive languages like Python. However, the indentation information is not completely lost - each token produced by the scanner contains the location (row and column) information. The indentation tokens can be recreated if neccessary from content tokens later in a dedicated token filter. This arrangement simplifies the general-purpose scanner and separates the indentation-analysis code into a separate layer (token filter) resulting in a better layered architecture. 

<p>Secondly, unlike other other LEX-like solutions, Irony scanner does not need to define hundreds of numeric constants for each distinct token type. Irony parser matches the tokens to grammar elements directly by their content, so the token type is identified by a containing symbol, e.g. "+", therefore it can be matched by the parser against a "+" terminal referenced in the grammar. 

<p>Finally, in most cases Irony scanner does not need to distinguish between keywords and identifiers. It simply tokenizes all alpha-numeric words as identifiers leaving it to the parser to differentiate them from each other. Some LEX-based solutions try to put too much responsibility on the scanner and make it recognize not only the token itself, but also its role in the surrounding context. This in turn requires extensive lookaheads, making scanner code quite complex. In author's opinion, identifying the token role is responsibility of a parser, not scanner. The scanner should recognize tokens without using any contextual information.    

<h2>Token Filters</h2>
<p>Token filters are special processors that intercept the token stream between scanner and parser. Although the term "filter" might suggest that filters only remove some tokens from the stream, they in fact are free to remove from or add tokens to the original stream. So why Irony introduces this extra processing layer in addition to traditional scanner/parser pair? The answer lies in realization that many processing tasks in scanning and parsing are best handled by an intermediate processor that sits between parser and scanner. The following are examples of such tasks: 
<ol>
  <li> Implementing macro expansion.</li>
  <li> Handling conditional compilation clauses.</li>
  <li> Verifying matching pairs of braces/brackets/parenthesis.</li>
  <li> Handling commented-out code blocks. The Scheme programming language allows commenting out blocks of code by prefixing them with the symbol "#;". This type of comment is difficult to handle in the scanner as it requires parser-like token processing to locate the end of the block. At the same time the parser is not a good place to handle this either, as parser normally expects only content tokens, without comments or whitespace. A dedicated token filter is an ideal place for implementing this task - after scanning but before parsing. </li>  
  <li>Handling special documentation comment blocks, e.g., XML documentation in c#. The filter can assemble the documentation block from individual comment lines and attach it to the next content token; later the information can be associated with the syntax node  that starts with this token.</li>
  <li>Handling code outlining for indentation-sensitive languages like Python. Tracking this information in the scanner makes things really complicated, and it does not fit well into general-purpose scanner. Token filter on the other hand can handle it easily. The scanner ignores the whitespace in this case, and the filter injects <code>NewLine</code>, <code>Indent</code> and <code>Unindent</code> tokens into the stream based on line/column information of the content tokens. </li>
</ol>
<p>The list above is certainly not all exaustive. Importantly, implementations can define or reuse several filters in the pipeline, each responsible for its own task. This results in a clear layered architecture and opens opportunities for wide code reuse. 
<p>Irony defines the base class <code>TokenFilter</code> that can be inherited to implement a custom filter. Irony initially provides two filters: <code>BraceMatchFilter</code> for matching open/close block symbols, and <code>CodeOutlineFilter</code> for handling indentation. Several more general-purpose filters are planned for implementation in the future releases.  

<h2>Parser</h2>

<p>After the token stream produced by the scanner passes the token filters it is consumed by the parser. Irony implements the so-called <a href="http://en.wikipedia.org/wiki/LALR_parser" target="_blank">LALR(1)</a>  parser. This type of parser is considered to be the most efficient, yet difficult to implement. It requires building huge action/transition tables from language grammar, so LALR parsers are usually generated by parser generators. Irony does not employ code generation; instead, it construct the transition graph (equivalent of a table) from the language grammar during initialization and uses general-purpose parser implementation. 

<p>LALR parser is a Deterministic Finite Automaton (DFA) - an abstract machine with finite number of internal states. The parser state represents a phase in analysis process of the input stream. Two states are special: initial state and final state.
The parser starts in initial state and moves from one state to another while consuming input tokens and executing the predefined actions, until it arrives at the final state. In each state, it looks at the input token, retrieves the appropriate action from action table for the current state and input token, executes the action and moves to the new state. The new state is derived indirectly from the result of the action. When the parser arrives at the final state, the result of the last Reduce action is an AST node for the entire input. It is the node corresponding to the root non-terminal of the grammar. This node is the root of the syntax tree and is returned as a result of the parsing. 

<p>The two main action types that parser executes are Shift and Reduce (that's why this kind of parser is also called shift-reduce parser). Each action manipulates the parser stack - an internal memory buffer. The parser accumulates input tokens in the stack by "shifting" them from the input. At some point it recognizes that there is a sequence on top of the stack that represents some language construct with corresponding non-terminal defined in the grammar. The parser then executes the Reduce action - it creates an AST node object corresponding to the recognized non-terminal, and replaces the sequence in the stack with the created node. The newly constructed node receives a list of its child nodes so it can properly link to them. The stack is effectively "reduced" in size in Reduce actions. Thus the stack contains either tokens shifted from the input by Shift actions, or AST nodes produced by Reduce actions. In fact they are all AST nodes because the <code>Token</code> class is a subclass of the <code>AstNode</code> class, so token is a trivial AST node. 

<p>The final result of the parsing process is a syntax tree for the source text. Normally each node in the tree would be of a specific type (class) matching the construct it represents. The language implementation would define node classes for all constructs in the language: there will be a node class for "if" statement, "for" statement, etc. Irony will also predefine some common node types in the future. 

<p>In order to know which class is needed to create a node the parser looks up the type stored in the <code>NodeType</code> property
of the  <code>NonTerminal</code> object specified in Reduce action. Custom language grammars would normally set this property to appropriate type reference for each non-terminal used in the grammar. If this property is null then the parser uses the default type <code>GenericNode</code>. Also, additional customization points are provided for AST nodes creation, like <code>NodeCreator</code> factory method delegate in <code>NonTerminal</code> class - see Irony source for more details.  Most of the samples in the download use the <code>GenericNode</code> type for all nodes in the output tree. This is a convenient default node - it has a <code>ChildNodes</code> property that contains a plain list of all child nodes. The sample parser for Scheme in download provides examples of custom node classes.

<h2>Grammar Data</h2>
<p>Irony's scanner and parser are implemented as c# classes which are intended to be used as is, without inheriting in custom language implementations. All custom behavior associated with the target language is encoded in the custom grammar class. The list of token filters is also specified in the grammar. 
<p>However, Irony does not use the target grammar class directly. Instead, it uses derived information in the <code>GrammarData</code> object that is produced at initialization. This object is used directly by the scanner and parser. The scanner in particular uses the list of all Terminals used in the grammar, while the parser uses the State Graph which encodes the behavior of the parsing automaton. Schematically the information flow looks as follows:          

<pre>
  Grammar -&gt; GrammarData -&gt; Terminals -&gt; SCANNER
                         -&gt; State Graph -&gt; PARSER
</pre>
<p>The construction of the <code>GrammarData</code> is probably the most complex process in Irony. It uses some sophisticated algorithms well described in compiler construction literature. 


<h2>LALR Grammars</h2>
<p>Any programming language syntax may be expressed in more than one grammar. While these multiple grammars are all equivalent in terms of language they produce, they are not equally usable by different parsing algorithms. The LALR(1) algorithm used by Irony requires a special kind of grammar - the so-called LALR grammar. All deviations from LALR requirements result in conflicts - situations when parser wouldn't know what to do, or will have more than one available action. The target grammar therefore must be carefully written and fine-tuned. To make this process easier Irony provides a Grammar Explorer application that displays grammar analysis data in multiple tabs on the form. This application is included in the download. 
<p>One of the tabs in Grammar Explorer entitled "Grammar Errors" shows all conflicts that were detected during grammar analysis. Language implementor should try to eliminate all errors and warning by tweaking the grammar, although it is not always possible to eliminate all of them. For ambiguous situations the parser chooses a preferred action - for example, Shift is preferred over Reduce in shift-reduce conflicts.

<h2>Performance</h2>
<p>Contrary to the common belief that universal parsers are slow (especially scanners), Irony's parsing is really fast. The initial performance measurements for languages like Python or Ruby give estimate in between 10 and 20 thousand lines per second for a standard 2GB/2GHz machine, which is impressive, even compared to professional-grade compilers. The initialization time (analyzing the grammar and building the state graph) is about 300 milliseconds, which is acceptable in most cases, but may be a bit too much for production compilers. The initialization time will be reduced in the future by introducing <code>GrammarData</code> serialization to code, so the data will be deserialized at start up instead of rebuilding it from scratch.          

<h2>Future Development</h2>
<p>The ultimate goal of the project is to create a complete set of tools for language implementations on .NET platform. As for more short term plan, several features are under development:
<ol>
  <li>Completing implementation of error recovery mechanisms
  <li>Implementing a complete set of standard Terminal classes
  <li>Creating set of reusable AST Node classes for common syntax constructs
  <li>Implementing direct evaluation (interpreter) of the syntax tree, without compiling to IL 
  <li>Creating basic interactive console
  <li>Adding support for big integers. 
  <li>Implementing back-end parts of the compiler - IL code generation.  
  <li>Adding CodeDom support and Visual Studio integration
  <li>Implementing the <code>GrammarData</code> serialization to code, so that it can be quickly deserialized from code at start-up  like Windows Forms.
</ol>
<p>More long-term goals include building a runtime infrastructure and implementing back-end parts of the compiler with IL code generation.
<p>Finally, I would like to welcome everybody to participate, both in development of the core feature set, and in language implementations based on Irony.


<h2>References</h2>
<ol>
<li><a href="http://lambda.uta.edu/cse5317/long/index.html">Leonidas Fegaras: Design and Construction of Compilers</a> -
concise but very comprehensive course notes. A good place to start online.</li>
<li>The "Dragon book": 
<a href="http://www.amazon.com/Compilers-Principles-Techniques-Alfred-Aho/dp/0201100886">Compilers: Principles, Techniques, and Tools</a> by A.Aho, R.Sethi, J.Ullman. </li>
</ol>

<h2>History</h2>
<p>January 2008 - Initial version.


</body>
</html>

By viewing downloads associated with this article you agree to the Terms of Service and the article's licence.

If a file you wish to view isn't highlighted, and is a text file (not binary), please let us know and we'll add colourisation support for it.

License

This article, along with any associated source code and files, is licensed under The MIT License

Share

About the Author

Roman Ivantsov
Architect Pulsar Informatics, Inc
United States United States
No Biography provided

| Advertise | Privacy | Mobile
Web02 | 2.8.141015.1 | Last Updated 4 Jan 2008
Article Copyright 2008 by Roman Ivantsov
Everything else Copyright © CodeProject, 1999-2014
Terms of Service
Layout: fixed | fluid