Click here to Skip to main content
Email Password   helpLost your password?
TinyPG_v1.1.png

Introduction

@TinyPG stands for "a Tiny Parser Generator". This particular generator is an LL(1) recursive descent parser generator. This means that instead of generating a state machine out of a grammar like most compiler compilers, instead it will generate source code directly; basically generating a method for each non-terminal in the grammar. Terminals are expressed using .Net's powerful regular expressions. To help the programmer create .Net regular expressions a Regex tool is embedded in TinyPG. Grammars can be written using the extended BNF notation. TinyGP v1.2 now allows you to generate a scanner, parser and parsetree file in either C# or VB code (!). These can be compiled directly into your own projects. Additionally now it is possible to generate code for your own text highlighter which you can use directly into your own text editor project. A simple example is added at the end of this article.

In this article I will not go into depth about compiler theory. Basic understanding of compilers and grammars is required. For your reference I have included a list of terms used in this article explained on Wikipedia: grammar, BNF, terminal, non-terminal, LL(1), lookahead, recursive decent, concrete syntax tree (CST) , parse tree,

Nowadays, with the availability of numerous compiler compilers, it becomes hard to think of a new name for a compiler compiler. 'Yet Another Compiler Compiler' was taken so I decided to name this tiny tool after its many strengths:

Background

Since there are already a number of compiler compilers out there, you might wonder why write another one? The reasons for writing this utility are:

Using the tool

I will explain the usage of the tool by means of two tiny tutorials. In these tutorials we will start writing a tiny expression calculator. In the former we will define the grammar for the expression evaluator; this will allow us to parse expressions. In the latter tutorial we will add some codeblocks to implement the semantics of the expression evaluator; after this step we will be able to evaluate (=calculate) simple expressions.

Before we start the tutorial I want to explain a little about naming conventions I prefer to use. Right now there are no clear conventions for naming. Also this utility does not require you to use any specific convention. However for readability also in the generated code I propose the following:

To write valid grammars you need to take the following rules into account:

Let's start the tutorial.

Simple expression calculator syntax

The goal of the expression evaluator will be to parse and evaluate simple numeric expressions consisting of (integer) numbers, +,-, * and / symbols. To make it a bit more fun we will also allow to use sub-expressions with the ( ) symbols. E.g. a valid expression could be: 4*(24/2-5)+14 which should evaluate to 42. This example is included in TinyPG, by opening the "simple expression1.tpg" (or the "simple expression1 vb.tpg" for the VB variant) file.

Terminals & regular expressions

Since we have decided what symbols to allow for in the calculator we can already start by defining the terminal symbols of our grammar using terminal production rules and regular expressions:

NUMBER -> @"[0-9]+"; 
PLUSMINUS -> @"(\+|-)"; 
MULTDIV -> @"\*|/"; 
BROPEN -> @"\("; 
BRCLOSE -> @"\)"; 

Terminals can be defined using .Net's Regex expression syntax (including the @ sign if it is so required). As you may have already guessed, the terminal definitions will be directly used within the Regex by the generated parser. Using .Net's Regex saves a lot of coding and keeps the scanner.cs very small and easily readable.

So indeed, this may be a good opportunity to brush up on your regular expression skills also. In order to play around with regular expressions I have included the Regex tool within the utility. Just click on the 'Regex tool'-tab and enter your regular expression. Any matches will be highlighted in the text immediately. This way you can test if your regular expression is matching the right symbols. At the end of this article I will include a number of often used regular expressions that may be very useful if you want to support those in your own language.

TinyPG evolvement cycle

The TinyPG does not have any reserved or predefined terminal symbols. E.g. some parser generators reserve the EOF terminal because it may be difficult to express. Regex can almost cope with any kind of terminal symbol, including an EOF symbol. An EOF symbol can be defined as follows:

EOF -> @"^$";

The ^ character indicates the regex will scan from the beginning of the text string/file, the $ indicates that the regex should scan until the end of the string/file. Because no characters where specified in between, this must be the end of the file (or text/string).

In order to be less restrictive about the formatting of the expression, we would like to allow whitespaces. However we do not want to check in the grammar for whitespaces. In fact we would like to have the scanner simply ignore whitespaces and continue scanning to the next symbol/token. This can be done by defining a terminal production rule for whitespace and prefixing it with the [Skip] attribute like so:

[Skip] WHITESPACE -> @"\s+";

[Skip] WHITESPACE -> @"\s+";

This indicates that any whitespace character will be skipped.

Non-terminals & production rules

Now that we have defined the terminals, let's continue to define the production rules for the non-terminals. TinyPG supports the extended BNF notation allowing the following symbols to be used in the production rules: *, +, ?, (, ), |, and a whitespace. These have the following meaning:

The grammar must start with a Start non-terminal. Hence the Start nonterminal is a reserved word. Let's define the grammar as follows:

Start     -> (AddExpr)? EOF;
AddExpr     -> MultExpr (PLUSMINUS MultExpr)*;
MultExpr     -> Atom (MULTDIV Atom)*;
Atom     -> NUMBER | BROPEN AddExpr BRCLOSE;

The Start production rule will check if there is an AddExpr (optional), and then expects an end of file (no other tokens are expected).
The AddExpr will only add or substract MultExpr expressions. The MultExpr will multiply or divide Atoms. By writing the grammar this way, the precedence of the * and / symbols over the + and - symbols is defined explicitly. The Atom for now can be either a number (integer) or another expression. So with only 4 simple production rules you can already parse complicated expressions like: (((5*8)/4+3*3-(7+2)) / 5)

Running the parser

Now press F6 to compile the grammar. The output pane should become visible displaying again the grammar as it is internally represented, but also displaying the 'First' symbols per non-terminal. The 'first' symbols are the symbols that the parser will make its decision on for which non-terminal or production rule to parse next. Also the generated c# code will be compiled internally and can be run. If all goes well the 'Compilation successful.' should display.

Press ctrl-e to open the Expression Evaluator pane. Now type in an expression for your grammar to evaluate, e.g. (((5*8)/4+3*3-(7+2)) / 5) and press F5 to run the parser. The expression will be evaluated. 'Parse was successful' should be displayed now. If the parse was successful the evaluator will continue to evaluate the resulting parse tree. Because we have not implemented any logic behind the production rules, you will get the following warning: 'Result: Could not interpret input; no semantics implemented.'

Simple expression calculator semantics

Adding codeblocks

Now that we have a working grammar we can start adding semantics to the production rules in terms of code blocks. Code blocks are snippets of c# (or VB) code that are almost directly inserted into the generated parser. Before it is inserted some variables are replaced first. Let's start with the first production rule:

Start -> (AddExpr)? EOF { return $AddExpr; };  

Notice the variable $AddExpr. $AddExpr corresponds to the value of the non-terminal AddExpr. During code generation, the $AddExpr is replaced by a .Net expression that will evaluate the AddExpr nonterminal. a $ variable is defined for each terminal and non-terminal in the production rule. E.g. in this example $EOF would also be a valid variable. Note that terminal and non-terminals always return values of the type object. You will need to do your own explicit casting if you want to do calculations. E.g. like in the following snippet:

AddExpr -> MultExpr (PLUSMINUS MultExpr)* 
    { 
        int Value = Convert.ToInt32($MultExpr); 
        int i = 1; 
        while ($MultExpr[i] != null) { 
            string sign = $PLUSMINUS[i-1].ToString(); 
            if (sign == "+") 
                Value += Convert.ToInt32($MultExpr[i++]); 
            else 
                Value -= Convert.ToInt32($MultExpr[i++]); 
        } 
        return Value; 
    }; 

Notice that in this production rule the term MultExpr is defined 2x. So to which value instance of MultExpr is the $MultExpr referring? Even more so, the latter MultExpr can be repeated endlessly. To refer to a specific instance of the MultExpr value, you can use indexers on the $ variable, e.g. $MultExpr[1] will refer to the second defined instance of the MultExpr of the input expression. So if we have for instance the expression '3+4*2-6' we will have 3 MultExpr non-terminals: 3, 4*2 (=8) and 6. So $MultExpr[1] will evaluate to the value 8. $MultExpr[3] however is not available and will therefore evaluate to null (= the .Net null value).

So this code evaluates the first MultExpr ($MultExpr is short for $MultExpr[0]) and assigns it to Value. We know that this one always exists accoording to the grammar. Then we loop through the following $MultExpr[i] and add or subtract the $MultExpr[i] to the Value until the $MultExpr[i] evaluates to null. In order to decide if a MultExpr should be added or subtracted we evaluate the $PLUSMINUS token. Hence now we can actually calculate additions and subtractions.

The same approach we can use for the MultExpr production rule. I will skip this in this article but you view the code if you open the 'simple expression2.tpg' file. Then last but not least there is the Atom production rule which can be defined as follows:

Atom -> NUMBER | BROPEN AddExpr BRCLOSE 
    { if ($NUMBER != null) 
            return $NUMBER; 
        else 
            return $AddExpr; 
    };

Because the Atom rule contains a choice between a NUMBER or a sub expression, we also need to check this choice in the code. Therefore we check if either of the sub rules is null. If not, we return this value.

Running the generated parser / compiler

Compile the grammar again by pressing F6. This should not result in any errors. Then type in the expression (((5*8)/4+3*3-(7+2)) / 5) and press F5. This time the expression should parse successfully and the outcome is calculated and should return 'Result: 2'.

congratulations, you have written your first TinyPG compiler!

Highlighting your expressions

To make things more interesting v1.2 allows you to add text highlighting to your grammar. Adding Text highlighting is done in two steps:

  • Add the TextHighlighter directive to the grammar
  • Add the Color attribute to terminals you want to have highlighted
When the TextHighlighter directive is added, @TinyPG will generate also a TextHighlighter.cs (or .vb) file. The generated TextHighlighter makes use of the generated parser/scanner to parse the input text and apply the color from the Color attribute to any terminal it recognizes. E.g. the directives and terminals of the simple expression evaluator could look like this:
TinyPG evolvement cycle
// By default the TextHighlighter is not generated. Set its Generate attribute to true
<% @TextHighlighter Generate="true" %>

// highlight numbers in red and symbols in blue
[Color(255, 0, 0)] NUMBER -> @"[0-9]+"; 
[Color(0, 0, 255)] PLUSMINUS -> @"(\+|-)"; 
[Color(0, 0, 255)] MULTDIV -> @"\*|/"; 
[Color(0, 0, 255)] BROPEN -> @"\("; 
[Color(0, 0, 255)] BRCLOSE -> @"\)"; 

When running expressions in the input pane, the expressions would be calculated and additionally highlighted in the input pane! This makes it extra simple to write your own text highlighters

Additional remarks

Note 1: it is not allowed to include terminal characters immediately in the production rules. This is allowed for some compiler compilers however I feel it does not add to the readability of the generated code. So each terminal must be defined explicitly as a regex expression. Non terminal production rules may only refer to teminals or other non-terminals in a LL(1) manner. This means that the parser will only be able to look ahead 1 token. When scanning and parsing an input token always corresponds to a terminal symbol (if it does not, you have a syntax error). So the parser will only need to look ahead 1 terminal to decide which production rule to choose.

E.g. the following grammar will result in errors because it is not LL(1):

// this is an LL(2) rule, you need to look ahead 2 symbols to determine which rule to choose 
Start -> PLUSMINUS NUMBER | PLUSMINUS Expression; 

The parser must choose the production (sub)rule based on 1 lookahead. Because both subrules are starting with the PLUSMINUS terminal, the parser cannot decide. TinyPG does not check grammars to be LL(1). It will simply generate the code. However on compilation of the code you will run into compilation errors. Luckely an LL(k) (k>1) rule can always be rewritten to LL(1) format, like so:

// the rule has been rewritten to LL(1), now only 1 symbol at a time is required to be looked at to make the decicion
Start -> PLUSMINUS ( NUMBER | Expression); 

By rewriting the rule like above, the production rule is LL(1) and can now be successfully generated into an LL(1) parser... or can it? The LL(k) problem is perhaps slightly more complicated. What if Expression is defined as:

Expression -> NUMBER | IDENTIFIER;

Again the parser will have the same problem, when it encounters a NUMBER token, should it choose the NUMBER rule of Start, or should it continue parsing an Expression? So here also the problem occurs. This time it is more difficult to solve the problem. In this case there is no easy solution but to rethink (partly rewrite) your grammar.

Note 2: TinyPG will not detect errors inside codeblocks itself. the .Net compiler will of course. This can lead to .Net compilation errors that are hard to track and map on the grammar codeblocks. It may be difficult to see what the issue is. the best way to debug this is to open the source code in visual studio and try compile it.

Note 3: It is also possible to separate the semantics from the syntax by not inserting codeblocks directly into the grammar. TinyPG will generate 3 source code files, the scanner, the parser and the parsetree. When parsing an input string successfully, the parser will return a filled parsetree. Normally TinyPG will insert codeblocks directly into the parsetree. The parsetree can then be evaluated separately. In this case we create a subclass of the parsetree and insert our own code there (the methods to implement can be overridden by the subclass). Then when calling the parser, you supply it with a new instance of your own parsetree. The parser will then fill this parsetree and return it again.

Of course an alternate manner is to simply evaluate the parsetree directly in the code, by traversing the tree nodes. However somehow I feel this option is less 'clean'.

Partial Context sensitive/ambiguous grammars

@TinyPG v1.2 now supports partial ambiguous grammars. Given the simple expression grammar, assume we would like to make a distinction between FACTORs and TERMs. The problem is, both FACTORs and TERMs are numbers and can be defined as:

        [Color(255, 0, 0)] FACTOR_NUMBER -> @"[0-9]+";    // mark factors in red
        [Color(0, 255, 0)] TERM_NUMBER -> @"[0-9]+";      // mark terms in green
    
This is typcically an ambiguous grammar because a number as input can match both symbols. Unless if you define your grammar for instance only to expect a term. e.g.
    Start -> TERM_NUMBER (MULTDIV) FACTOR_NUMBER;
    
The first input number is expected to be a TERM, the second number is expected to be a factor. So depending on the context (the rule the parser is parsing), the scanner will interpret a number as a TERM or as a FACTOR respectively. So the first number will be marked green, and the second in red.

Using the code

Once you have generated the Scanner, Parser, ParseTree and optionally the TextHighlighter classes and tested it with TinyPG, you obviously now want to use the code in your own project. This can be done be creating a new c# project with Visual Studio. Add the generated files to the project and compile, just to make sure there are no errors. To call the parser, use the following code:

#using TinyPG; // add the TinyPG namespace

...

// create the scanner to use
Scanner scanner = new Scanner(); 

// create the parser, and supply the scanner it should use
Parser parser = new Parser(scanner); 

//create a texthighligher (if one was generated) and attach the RichTextbox and parser and scanner.
TextHighlighter highlighter = new TextHighlighter(richTextbox, scanner, parser);

// define the input for the parser
string input = "... your expression here ...";

// parse the input. the result is a parse tree.
ParseTree tree = parser.Parse(input);

Notice that a ParseTree object is returned. The parse tree contains the structure of the input. If the syntax of the input is not correct the ParseTree contains errors. You can check for errors by investigating the ParseTree.Errors property.

Notice also that the TextHighlighter accepts a RichTextbox control. TextHighlighter will automatically start capturing its events analyzing its content and updating the content of the RichTextbox control.

If all is well, you can go ahead and evaluate the parse tree:

// evaluate the parse tree; do not pass any additional parameters
object result = ParseTree.Eval(null);

// write the result of the evaluation to the console:
Console.WriteLine("result: " + result.ToString());

Notice that the Eval(...) function returns a result of type object. During evaluation you are free to decide on what the return type will be. This will give you more freedom, however this also means you will have to cast types explicitly. To display the result cast it to a string using the result.ToString() method.

To conclude, this is all that is needed to build and implement your own grammar and parser. The generated code does not have any external dependencies nor are any additional libraries required in order to work with the code.

The tiny parser generator evolvement cycle

What I have always found intriguing about compiler compilers is that they are able to compile their own grammar and hence generate a compiler from that which in this case is a compiler compiler. Initially of course there is no compiler compiler available yet. So how to build one? In this section will explain the evolvement cycle as I have applied it for TinyPG as shown in the following graph:
TinyPG evolvement cycle Step 1: define the grammar as input text to be parsed.
Step 2: derive a concrete parse tree by parsing the input grammar
Step 3: transform the parse tree to an abstract parse tree: the grammar tree.
step 4: the grammar tree contains all information about the grammar stored as a tree.
step 4a: generate the grammar text from the grammar tree. This is a check to see if the input grammar corresponds to the grammar in the grammar tree. If they do not correspond, then most likely the transformation has gone wrong.
step4b: generate the parser c# source code
step 5: compile the sources into the actual parser.

next steps: take the generated grammar from step 4a and use it as input for the parser, continue the cycle in step 2.

To bootstrap the whole process, start with creating the abstract grammar tree manually (in code).

The most tricky parts are the Transformation and Generation processes. Once you have that going, the rest is relatively easy.

Using TinyPG Directives

Sometimes you want to be able to set some additional parameters for the tool so it knows how and where to generate the code, for instance specifying you want to generate C# or VB code. For this purpose I included the option to insert meta information directly into the grammar by means of directives. I was inspired by the way this is handled by aspx pages using the <% ... %> tags. I decided this would be handy and compact format which would be strict enough to allow only for some parameters to be specified. Also this will be easy to extend at a later stage, that is add more directives.

Notice that the syntax highlighting for codeblocks is now also implemented in v1.2. Codeblocks will be highlighted accoording to their respective Language setting of the @TinyPG directive

Currently the following directives are supported: @TinyPG, @Parser, @Scanner and @ParseTree and can be used as follows:

 // Namespace allows you to define a different namespace for the Scanner, 
//     Parser and ParseTree generated classes. By default this is set to "TinyPG"
// Language allows you to define the language to generate. Only C# (default) or VB are supported for now. 
// OutputPath allows you to define the absolute or relative outputpath for the
//     files. The relative path is relative to the grammar file. 
//     By default this is set to "./"
// Template path is relative to the TinyPG.exe file. By default this is set to 
//     "Templates"
<% @TinyPG Namespace="MyNamespace" Language="C#" OutputPath="MyGrammarPath" TemplatePath="MyParserTemplates" %>

// the Generate parameter specifies wether the file should be generated. By default this is set to True
<% @Parser Generate="True" %>
<% @ParserTree Generate="False" %>  // turn off generation of the the ParseTree.cs file
<% @Scanner Generate="True" %>
<% @TextHighlighter Generate="True" %> // indicates code for the TextHighlighter should be generated also

It is required that the directives are defined before the grammar implementation.

Some handy regular expressions.

Writing your own regular expressions not always easy. Specially not if you want to match tokens that are often used also in programming languages. I have summarized a few regular expressions here that can be very helpful.

// codeblock will match any text between { ... }; 
Regex codeblock = new Regex(@"\s*\{[^\}]*\}([^};][^}]*\})*;"); 

//eof will match the end of an input string only
Regex eof = new Regex(@"^$");

//whitespace will match any whitespace including tabs. This one is trivial, but often required.
Regex whitespace = new Regex(@"\s+");

//regex_string will match any text that is a .Net string. It also takes the ", "", @ and \ into account
Regex regex_string = new Regex(@"@?\""(\""\""|[^\""])*\""");

//commentline will match with a single line of text that starts with // and scan it until the end of the line
//very handy if you want to support commenting by your parser
Regex commentline = new Regex(@"//[^\n]*\n?"); 

//commentblock will match any text between /* ... */ . Very handy if you want to support commenting by your parser
Regex commentblock = new Regex(@"/\*[^*]*\*+(?:[^/*][^*]*\*+)*/", RegexOptions.Compiled); 

That's it for now for the interesting regular expressions I found on the web or wrote myself. If you have any interesting/complicated ones please drop me a line.

Points of interest

Apart from the parser generator functionality, the TinyPG utility also contains a number of additional components that may be interesting. The controls/feaures that are new in version 1.2 are made bold:

All All in all even though I named this the Tiny Parser Generator, this has become more than a tiny project. I have spent quite some effort in making this all work together nicely and now I feel this is worth sharing with the community. Although in a future release I would like to have additional functionility:

If you have any ideas for new features, comments or remarks please drop a note!

History

@TinyPG v1.0

Tiny Parser Generator Version 1.0 was released on 1st of August 2008. This version includes the basic implementation for generating simple top down parsers.

@TinyPG v1.1

Tiny Parser Generator Version 1.1 was released on 17th of August . Version 1.1 contains additional features, making the editor easier to use:

@TinyPG v1.2

Tiny Parser Generator Version 1.2 was released on 1st of September. Version 1.2 contains additional features, making the editor easier to use:

Special thanks go out to William A. McKee for being actively involved with @TinyPG, inspiring me to improve the tool further and help me revise the EBNF grammar. I have implemented some of his ideas in @TinyPG v1.2, including support for (partly) ambiguous grammars.

You must Sign In to use this message board.
 
 
Per page   
 FirstPrevNext
GeneralTiny PG with great man
reza_arab
20:35 2 Feb '10  
When I saw this software and code, I was very excited.thank you That have been present to share it with others. And congratulate to you due to this beautiful mind.
please present test class associated with this project if you would like.
thank you again.
QuestionTiny PG Parsing Queries
Manhar Goindi
1:25 6 Jan '10  
Hi,

Tiny PG is really a great tool.

We are using the TinyPG Parser and found it to be useful In generating C# code. However, we would like to know the following about this parser’s capabilities:

1. Is it possible to advance the parser from some input text position to skip some portion of the text and let it resume parsing from some new position in the input text? How can this be achieved?
2. Is it possible to delink the default TinyPG lexical analyzer from TinyPG Parser and link the TinyPG Parser to some custom Lexical Analyzer? How can this be achieved?


Thanks & Best Regards,
Manhar Goindi
AnswerRe: Tiny PG Parsing Queries
Herre Kuijpers
4:24 6 Jan '10  
Hi,

thank you for your feedback. To answer your questions:
1. in short: no. the parser analyzes all input text. I suppose there are 2 options:
1.1 somehow embed the text to be skipped as part of the grammar. this way the text is not really skipped, the grammar is extended so the text will be parsed anyway.
1.2 if this is not possible, then you will need to create your own pre-processor that feeds the parser with seperate textblocks, and merge the resulting parsetrees.

2. the only way i think this is possible is to write your own scanner.cs file that implements the same classes and methods as the current generated scanner.cs file, which functions as a wrapper around the custom Lexical Analyzer.


I hope this clarifies some things for you.
Regards,
Herre

Herre

GeneralCompiling with project files.
captncraig
11:20 5 Nov '09  
First off, awesome job with this. It is surprisingly powerful, and simple to use.

I am using the evaluation process to generate my own AST, using my custom AST classes. I am placing the parser files in the same namespace as my AST project, and they all build and run beautifully in visual studio In the TinyPG editor though, the compile step cannot find my AST files and so it says the compilation fails.
This isn't a huge deal, but it means I cannot use the super sweet parse tree viewer without first commenting out my AST generating code. I feel like there should be a way to link my files in so that the compilation step can find them, but I'm not sure how. I could copy and paste them into the template file, but I hope it doesn't come to that.
GeneralRe: Compiling with project files.
Herre Kuijpers
22:47 5 Nov '09  
Hi, It is true that the TinyPG compiler cannot include custom created (AST) classes. whenever TinyPG compiles the parser for internal use, it makes sure that the generated code implements specific interfaces (which are left out when generating output files).
These interfaces are needed by TinyPG to run the dynamically generated parser within the tool. Right now it will require a number of code modifications to allow an AST in. Also it will require some metadata to have the TinyPG understand the structure of the AST, so it all becomes a lot more complex then.

Best way to go about it for now, is to create your own visual studio project including the AST code and have TinyPG generate the parser code. then build your own specific AST viewer (it really isn't that much code).

Herre

GeneralRe: Compiling with project files.
captncraig
12:41 6 Nov '09  
Thanks! I took a look at your ParseTreeViewer code and was able to get something like it to run in my application to view the CST. I like the idea to make one for my AST too, but that will be a little tougher. Thanks for all the help.

I did notice TinyPG runs into problems with some of my C# code I put in the evaluations. Namely when I used a collection initializer (IList x = new List{1,2,3,4,5}) The parser goes all funny and tells me productions are not defined and so forth. Its not a big deal at all, but I thought you should know.
GeneralRe: Compiling with project files.
Herre Kuijpers
5:31 7 Nov '09  
hmm good point, could be an issue with the curly braces...

Herre

GeneralExtra expressions in tree
Lee Atkinson
2:28 28 Oct '09  
Hi

Many thanks for an excellent application.

I want to be able to walk the parsetree generated from my own query language to generate SQL query. If I run the "Simple Expression Evaluator' sample, I get extra exressions detected that don't really exist. Is there any way to remove them? Below is an example. Looking at the grammar, I think I can see why it happens, but wonder if it's possible to prevent them from occuring, or do I need to ignore them when I'm walking the tree (e.g. if they only have one child node when they should have more).

Many thanks

Lee

Grammar:

<% @TinyPG Language="vb" %>

NUMBER -> @"[0-9]+";
PLUSMINUS -> @"(\+|-)";
MULTDIV -> @"\*|/";
BROPEN -> @"\(";
BRCLOSE -> @"\)";
EOF -> @"^$";

[Skip] WHITESPACE -> @"\s+";

Start -> (AddExpr)? EOF;
AddExpr -> MultExpr (PLUSMINUS MultExpr)*;
MultExpr -> Atom (MULTDIV Atom)*;
Atom -> NUMBER | BROPEN AddExpr BRCLOSE;

Expression: 1 + (2 * 3)

Tree:

Start
-AddExpr
--MultExpr --- extra?
---Atom
----NUMBER '1'
--PLUSMINUS '+'
--MultExpr --- extra?
---Atom
----BROPEN '('
----AddExpr --- extra?
-----MultExpr
------Atom
-------NUMBER '2'
------MULTDIV '*'
------Atom
-------NUMBER '3'
----BRCLOSE ')'
-EOF
GeneralRe: Extra expressions in tree
Herre Kuijpers
6:27 29 Oct '09  
Hi Lee,

the 'extra' expressions are in fact there by design. What you get as a result is a so called Concrete Syntax Tree (CST). It shows all the intermediary steps including all tokens that may turn out to be redundant in the end (e.g. the brackets).

The point of a is that it is a direct mapping of the expression onto a tree structure. Therefore the exact input could theoretically also be generated again from the CST.

What you are looking for is the next step in the normal compiler process: the Abstract Syntax Tree (AST), which basically transforms the CST into a tree structure that is needed for your specific purpose only (e.g. it might remove the redundant tokens and nodes that you do not need).

In order to create an AST additional logic is required that is not within the scope of TinyPG. This will need to be implemented separately.
So in order to evaluate your SQL expression, you should first convert the CST to a AST containing only the necessary SQL nodes and representations (e.g. tables, attributes, operations etc).

hope this helps.
regards,

Herre

GeneralNice article
el_marcondon
7:35 21 Sep '09  
Got my 5.

However, I was trying to play with the 1.2 source code, and I noticed that it's missing the unit tests.
GeneralHow smart is your error reporting?
Yumashin Alex
9:25 19 Aug '09  
First, let me thank you for your excellent article and code. It's very nice realy!
Second, let me ask you two questions:

1. Are you planning to improve your error messages? It will be very good to know the exact reason (and position) of the error.

2. Suppose we have the following grammar description (for an xml-file):

Element A1 (i.e. element with its "name" attribute set to "A1") is mandatory and must be the very first element in the file. Then goes an optional element A2 which must have its "id" attribute set to "ABCD". Then goes an optinal repetitive (min = 0, max = unbounded) group "Z" of elements A3 and A4, mandatory and optional correspondingly. Then goes a mandatory element A5 which must have its "id" attribute set to "EFGH". And then goes a mandatory element A6 which must be the last element in the file.
I'm not ready now to express this grammar in EBNF, but I think it's not very hard to do.
Now let me show 2 samples of valid files (root element is omitted):

Sample 1:
<T name="A1" />
<T name="A2" id="ABCD" />
<T name="A3" />
<T name="A4" />
<T name="A3" />
<T name="A3" />
<T name="A4" />
<T name="A5" id="EFGH" />
<T name="A6" />
Sample 2:
<T name="A1" />
<T name="A5" id="EFGH" />
<T name="A6" />
I suppose your parser is quite capable to check these files against the given grammar.
Now let me show a sample of an invalid file:

*** mandatory element A1 is missing ***
<T name="A2" id="AB" /> *** attribute 'id' has wrong value ***
<T name="A3" /> <T name="A4" /> <T name="B1" /> *** element B1 is excessive ***
<T name="A6" /> *** element A6 must go after element A5, not before it ***
<T name="A5" id="EFGH" />
Now my question: is it possible to get "smart" error messages - as those enclosed in "***" symbols? or your parser will report only "Element 'X' not found at its expected position"?

And a "subquestion": imagine that "group Z" (see my grammar description above) is mandatory, i.e. min = 1, not zero. And imagine that the invalid file has the following contents:

<T name="A1" />
<T name="A5" id="EFGH" />
<T name="A6" />
Will your parser say just "Element A3 not found at its expected position" (or smth. like that) - or it is capable to say "Group 'Z' is missing" (which is more correct and "smart" in this situation)?

Thank You.
GeneralRe: How smart is your error reporting?
Herre Kuijpers
10:08 19 Aug '09  
hi Yumashin,

the answer in short: no, not out of the box.

the path from a grammar to a full blown compiler is a lengthy one.
you need to distinguish between the following concepts:
1- syntactical analysis, this involves checking if the structure of the input is correct. the result is Concrete Syntax Tree (CST).
2- semantical analysis, this involves checking of the contents of the parsed elemements. for example type checking, and checking if certain content is allowed only in a certain context. This requires to transform the Concrete Parse Tree to an Abstract Syntax Tree (AST).
3- Once the generation, once the input is found to be correct, execute the AST, that is, generate the output based on the AST.

@TinyPG allows step 1 by default, given a correct grammar. to come to step 2 you will have to implement your own logic (e.g. code inline)

Regarding your examples, I'm somewhat confused. Why did you choose XML? there are a lot of XML parsers out there already, why would you want to create another one?

well anyway I hope it answers your questions.
cheers

Herre

GeneralRe: How smart is your error reporting?
Yumashin Alex
0:36 20 Aug '09  
@TinyPG allows step 1 by default, given a correct grammar. to come to step 2 you will have to implement your own logic (e.g. code inline)
- well, my idea was to check my grammar (as described in the beginning of my first message) during step 1 only. I can dynamically generate grammar in EBNF based on some formalized representation of such description, can't I? E.g. imagine that my grammar is first formalized in the following way:

<ExpectedFileFormat>
  <T name="A1" use="M" />
  <T name="A2" use="O" id="ABCD" />
  <Group name="Z" minCount="0" maxCount="unbounded" >
    <T name="A3" use="M" />
    <T name="A4" use="O" />
  </Group>
  <T name="A5" use="M" id="EFGH" />
  <T name="A6" use="M" />
</ExpectedFileFormat>
I can convert it (with XSLT or in some other way) to the following representation (not sure that it is syntactically correct):

Start -> (ElementsChain)? EOF;
ElementsChain -> A1 | A2 | GroupZ* | A5 | A6;
A1 -> @"A1";
A2 -> @"(A2:id='ABCD')?";
A3 -> @"A3";
A4 -> @"(A4)?";
A5 -> @"(A5:id='EFGH')";
A6 -> @"A6";
GroupZ -> @"(" A3 | A4 ")";
Then I must prepare my xml-file (the one to be checked) by converting it to a string expected by this final grammar, e.g. to:
A1 | A5:id='EFGH' | A6 (that's grammar for "Sample 2" from my first message)

Then I generate and execute parser without any inline code, i.e. perform step 1 only - syntactical analysis. If it passes OK - then my xml-file has valid structure. If not - I'll be glad to get detailed error messages from your parser.

AFAIU, that's a correct approach, isn't it?


Regarding your examples, I'm somewhat confused. Why did you choose XML? there are a lot of XML parsers out there already, why would you want to create another one?
- I'm just not aware of XML parser which will suite my needs.
GeneralRe: How smart is your error reporting?
Herre Kuijpers
6:34 20 Aug '09  
I'm sorry but I still don't understand what you are trying to achieve.

My understanding so far:
1. you want to express a (E)BNF grammar with XML?
2. you want to check both the structure and contents of an XML file using your own grammar & parser?

if you can just read the XML with the standard DOM document (e.g. the one provided by the .Net framework), then can't you just simply walk through the DOM and check if the contents of each node is valid?

Herre

GeneralRe: How smart is your error reporting?
Yumashin Alex
9:40 20 Aug '09  
Sorry for my bad explanation. I'll try once more.

My task is to check that certain xml-files have expected structure and (to some extent) contents.
This expected structure/contents is originally described in words (see my very first message), but can be also expressed in some formalized form (plz see my previous message, the xml sample with "ExpectedFileFormat" root element). If xml-file does not pass this check, I need to describe WHY - e.g. "Group 'Z' is missing", "Excessive element 'B1' found", "Element 'A5' is missing" etc. And these error messages must be smart enough, which means that the check algorithm must be also smart - like the algorithm implemented inside "WinDiff" and similar utilities, when comparing strings "ABC123AZX" and "AZX" produces smart error "Position 1: group 'ABC123' is missing" instead of silly "Position 2: symbol 'B' is missing". That's it.

How can I solve this task? -

a). I cannot use xsd-schemas to do this, because the expected structure/contents (as one can easily see) cannot be described in terms of xsd-schemas. I can give further explanations here if needed.

b). I was able to find a way to use regular expressions to perform this check, but it only tells me "true/false" and - when "false" - does not tell me WHY the check hasn't been passed. So it's not a solution for me.

c). I can iterate through DOM (as You propose), but thus I'll always get silly error "Element 'A3' is missing" instead of smart error "Group 'Z' is missing" - when no any elements A3 and A4 are found in the xml-file being checked. The reason is that this simple iteration is not able to understand that "AZX" is a more preferrable match than just "A" (speaking of the sample in the 2nd paragraph of this posting).

My hope is that if I use syntactical analysis to solve my task, I will be able to get smart error messages. That's why I decided to look at Your parser! The idea is to: (1) programmatically convert xml-representation of expected structure/contents into a "grammar", i.e. a set of production rules written in EBNF (I tried to show it in my previous message - plz see "Start -> (ElementsChain)? EOF;" and further on), (2) programmatically convert xml-files (to be checked) so that the target format can be checked against this grammar (e.g. convert "Sample 2" from my very 1st message to a simple string "A1 | A5:id='EFGH' | A6"), (3) use Your parser to check grammar.

That's what I mean. I think this will work, but I'm not sure if Your parser is able to understand that "AZX" is a more preferrable match than just "A" Smile
GeneralRe: How smart is your error reporting?
Herre Kuijpers
0:50 22 Aug '09  
i still believe the easiest way to solve your problem is using a DOM parser and add your own logic as to what element is missing and what it's meaning is. (add semantics)

i don't see how this otherwise can be achieved with a one pass parser.

Herre

GeneralVery good job explaining and demonstrating
JasonShort
10:15 4 Aug '09  
I really enjoyed your article. You did a great job explaining and demonstrating the topics. It is rare to find such well laid out information. You should consider writing more articles!

Jason S Short, Ph.D.
VistaDB Software, Inc.

GeneralRe: Very good job explaining and demonstrating
Herre Kuijpers
22:46 4 Aug '09  
Hi Jason,
I appreciate your feedback. Unfortunately the lack of time prevents me from writing articles more frequently.
cheers,
Herre

Herre

NewsUsing TinyPG to parse Excel Formula
Herre Kuijpers
7:02 6 Jul '09  
For anyone interested in how to apply TinyPG in a more realistic setting, check out this article in which TinyPG is used to create the parser for Excel formulas by Chris Cavanagh.

http://chriscavanagh.wordpress.com/2009/02/19/custom-expressions-and-the-dlr-part-1/[^]

The artical and code on codeplex also shows how to implement the Abstract Syntax Tree derived from the ParseTree class. He defined a basic expression parser using the following grammar:

;
// Simple Excel grammar for TinyPG
// Copyright © Chris Cavanagh 2008
<% @TinyPG Namespace="Cjc.ExpressionEngine.Excel.Compiler" %>

PLUSMINUS -> @"(\+|-|&)";
MULTDIV -> @"\*|/";
BROPEN -> @"\(";
BRCLOSE -> @"\)";
COMMA -> @",";
COLON -> @"\:";
TRUE -> @"[Tt][Rr][Uu][Ee]";
FALSE -> @"[Ff][Aa][Ll][Ss][Ee]";
NULL -> @"[Nn][Uu][Ll][Ll]";
METHODNAME -> @"[a-zA-Z_][a-zA-Z0-9_]*(?=\()";
ASSIGNID -> @"[a-zA-Z_][a-zA-Z0-9_]*(?=\s?=\s?)";
CELLID -> @"([a-z]+|[A-Z]+)[0-9]+";
IDENTIFIER -> @"[a-zA-Z_][a-zA-Z0-9_]*";
INTEGER -> @"(\+|-)?[0-9]+";
NUMBER -> @"(\+|-)?[0-9]*\.[0-9]+";
STRING -> @"@?\""(\""\""|[^\""])*\""";
COMPARE -> @"
==|<=|>=|>|<|=|!=|<>";
EQUALS -> @"
=";
EOF -> @"
^$";

[Skip] WHITESPACE -> @"
\s+";

Start -> (Assignment|CompareExpr)? EOF;
Assignment -> ASSIGNID EQUALS CompareExpr;
CompareExpr -> AddExpr (COMPARE AddExpr)?;
AddExpr -> MultExpr (PLUSMINUS MultExpr)*;
MultExpr -> Atom (MULTDIV Atom)*;
Params -> CompareExpr? (COMMA CompareExpr)*;
Method -> METHODNAME BROPEN Params? BRCLOSE;
Identifier -> IDENTIFIER;
Range -> CELLID (COLON CELLID)?;
Atom -> TRUE | FALSE | NULL | INTEGER | NUMBER | STRING | BROPEN CompareExpr BRCLOSE | Range | Method | Identifier;


Herre

Generalapplying semantics
battlemodetwo
5:50 24 Apr '09  
how could I apply a semantics for this instance.

I have this bnf for declaration

Declaration -> DATATYPE IDENTIFIER EQUALS VALUE SEMICOL;

in C langauge this could be:

int c = 5;

If i would not apply any semantics, it would be possible to input this kind of codes:

int c = 5.05;
float d = 5;
char d = 5;
int c = 'f';

You get what I mean, another thing is I could assign variables that are not yet declared. Please give me a brief example on this matter.
Generalruntime generation and text highlighting performance
VincentHarink
9:38 6 Apr '09  
Verry good software.

I have two questions.

1) When I want to integrate the parser in my text editor then it would be nice when the user can make it's own rules which are stored in a text file. When running you can select which rules you want to use for the syntax highlighting.
For this to work the grammar compile should be at runtime.
Is this possible ?

2) When there is a lot of information in the richtextbox the performance is very slow, and when more than about 40k not working at all.
Is there a way to change the syntax highligthing so only the visible part of the richttextbox is handled ?

Thanks
GeneralString expression
egodefroy
14:09 24 Mar '09  
Hi Herre,
I'm using your tool with much pleasure!

If I use the production rule STRING -> @"@?\""(\""\""|[^\""])*\"""; as you mention in your example, then the corresponding token Text value for '"mystring"' is '"mystring"'. Shouldn't it be 'mystring' instead ?
If not, how do I convert the token Text value to the real string value in the general case ?
Thks
Eric
GeneralUsing dots inside and outside strings
NiloPaim
14:02 24 Feb '09  
Hi, Herre.

Consider the following lexical rules:

VERSION		-> @"version";
DOT -> @"\.";
STRING -> @"@?\""(\""\""|[^\""])*\""";
ANY_TEXT -> @"
[^\.]*";

And consider the following syntatic rule:

VERSION ( STRING | ANY_TEXT ) DOT

If I have a dot inside the string, it's not recognized by the syntatic rule.

version 1.
is ok.
version "1.00".
gave me syntax error.

Am I missing something? Confused

TIA
GeneralRe: Using dots inside and outside strings
Herre Kuijpers
3:43 25 Feb '09  
try the following

VERSION -> @"version";
DOT -> @"\.";
STRING -> @"@?\""(\""\""|[^\""])*\""";
ANY_TEXT -> @"[^.\s]*";

[Skip]
WS -> @"\s+";

Start -> VERSION ( STRING | ANY_TEXT ) DOT;

the problem is that the ANY_TEXT definition you used did not skip any whitespaces.
since the parser is greedy, it was matching ANY_TEXT on ' "1'.

hope this is what you needed.

regards,
Herre

Herre

GeneralRe: Using dots inside and outside strings
NiloPaim
13:24 25 Feb '09  
Hi, Herre.

If I do what you say, I just change the place of my problems. I'll show you an example. In Cobol, is perfectly valid the instruction:
author. nilo roberto c paim.
To parse it, I'd defined a rule as
AUTHOR DOT ANY_TEXT DOT
where ANY_TEXT is exactly what its name says: any text.

The end of the text is the final dot. If I add space to ANY_TEXT, just "nilo" on the example will be recognized as a valid token and everything else is considered as syntax errors.

In fact, I think that my biiiiig problem is:
Dot is a sentence terminator in Cobol, but it could appear on some other places successfully. A dot inside a string is perfectly valid. A dot inside a number (e.g., 3.14159) is valid too. How could I distinguish them?

I would like that TinyPG have some type of "priority" on using the lexical rules, understand? For instance, I could say that a STRING would be evaluated before a float number, and a float number would be evaluated before the terminator dot on a sentence.

Of course, I know that this kind of thing implies on some type of lookahead before TinyPG decides what rule accept.

How to do this?

Ah, I have made a workaround over some things using a pre-processor. Case insensitive on reserved words it's an example. I just apply a ToLower() on my input before send it to parser/scanner... Wink

Thanks for quick answer and congratulations. TinyPG is a wonderful tool.

Best regards from Brazil.

Nilo


Last Updated 1 Sep 2008 | Advertise | Privacy | Terms of Use | Copyright © CodeProject, 1999-2010