Introduction
@TinyPG stands for "a Tiny Parser Generator". This particular generator is an LL(1)
recursive descent parser generator. This means that instead of generating a
state machine out of a grammar like most compiler compilers, instead it will
generate source code directly; basically generating a method for each
non-terminal in the grammar. Terminals are expressed using .Net's powerful
regular expressions. To help the programmer create .Net regular expressions a
Regex tool is embedded in TinyPG. Grammars can be written using the extended BNF
notation. TinyGP v1.2 now allows you to generate a scanner, parser and parsetree file in either C# or VB code (!).
These can be compiled directly into your own projects. Additionally now it is possible to generate code for your own text highlighter
which you can use directly into your own text editor project. A simple example is added at the end of this article.
In this article I will not go into depth about compiler theory. Basic
understanding of compilers and grammars is required. For your reference I have
included a list of terms used in this article explained on Wikipedia:
grammar,
BNF,
terminal,
non-terminal,
LL(1),
lookahead,
recursive decent,
concrete syntax tree (CST)
, parse tree,
Nowadays, with the
availability of numerous compiler compilers, it becomes hard to think of a new name for a compiler compiler. 'Yet
Another Compiler Compiler' was taken so I decided to name this
tiny tool
after its many strengths:
- a powerful tiny utility in which to define the grammar for the new compiler,
allowing
- syntax and semantics checking of the grammar
- generating a tiny set of sources for the parser/scanner/parsetree (just 3
.cs or .vb files without any external dependencies)
- while each generated file remains clearly human readably and debuggable with
visual studio(!)
- including an expression evaluation tool which
- generates a traversable parse tree
- option to include c# codeblocks inside the grammar, adding immediate
functionality with just a few lines of code.
- including a tiny regular expression tool
- while trying to keep things as simple and tiny as possible.
Background
Since there are already a number of compiler compilers out there, you might
wonder why write another one? The reasons for writing this utility are:
- the fun factor, seriously, who doesn't want to write his/her own compiler
compiler!? 8-)
- the compiler compiler should be able to generate c# code
- the utility should be free, allowing developers to use the generated code for
whatever purpose (no restrictive licenses).
- the generated code should be readable and relatively easy to debug if needed.
- the generated code should be completely independent and not require any external
libraries to run (e.g. libraries from the compiler compiler itself !)
- source code available to allow developers to modify this utility as they like
- the tool should be free
- it should be possible to separate the semantics from the syntax (use
subclassing), or combine them (inline codeblocks)
Using the tool
I will explain the usage of the tool by means of two tiny tutorials. In these
tutorials we will start writing a tiny expression calculator. In the former we
will define the grammar for the expression evaluator; this will allow us to
parse expressions. In the latter tutorial we will add some codeblocks to
implement the semantics of the expression evaluator; after this step we will be
able to evaluate (=calculate) simple expressions.
Before we start the tutorial I want to explain a little about naming conventions
I prefer to use. Right now there are no clear conventions for naming. Also this
utility does not require you to use any specific convention. However for
readability also in the generated code I propose the following:
- For terminals use only upper case names. Also underscores are allowed,
e.g. NUMBER, IDENTIFIER, WHITESPACE or BRACKET_OPEN.
- For Non-terminals use Pascal casing. This corresponds to .Net standards. Since
from each non-terminal a method is generated, the method will then also be
Pascal cased. e.g. Start, Expression, MultiplyExpression
To write valid grammars you need to take the following rules into account:
- Each production rule must be terminated with a ; character.
- Start is a reserved word which indicates the starting symbol for the grammar
- It is allowed to include comments using either // to comment a line, or /*
... */ to comment a section
- When including codeblocks, the codeblock must be written between { ... }; Note
that the closing bracket must be followed by the semicolon immediately (no
whitespace) or it will result in errors.
Let's start the tutorial.
Simple expression calculator syntax
The goal of the expression evaluator will be to parse and evaluate simple numeric
expressions consisting of (integer) numbers, +,-, * and / symbols. To make it a
bit more fun we will also allow to use sub-expressions with the ( ) symbols.
E.g. a valid expression could be: 4*(24/2-5)+14 which should evaluate to 42.
This example is included in TinyPG, by opening the "simple expression1.tpg" (or the "simple expression1 vb.tpg" for the VB variant)
file.
Terminals & regular expressions
Since we have decided what symbols to allow for in the calculator we can already
start by defining the terminal symbols of our grammar using terminal production
rules and regular expressions:
NUMBER -> @"[0-9]+";
PLUSMINUS -> @"(\+|-)";
MULTDIV -> @"\*|/";
BROPEN -> @"\(";
BRCLOSE -> @"\)";
|
Terminals can be defined using .Net's Regex expression syntax (including the @
sign if it is so required). As you may have already guessed, the terminal
definitions will be directly used within the Regex by the generated parser.
Using .Net's Regex saves a lot of coding and keeps the scanner.cs very small and
easily readable.
So indeed, this may be a good opportunity to brush up on your
regular expression skills also. In order to play around with regular expressions
I have included the Regex tool within the utility. Just click on the 'Regex
tool'-tab and enter your regular expression. Any matches will be highlighted in
the text immediately. This way you can test if your regular expression is
matching the right symbols. At the end of this article I will include a number
of often used regular expressions that may be very useful if you want to support
those in your own language.
|
|
The TinyPG does not have any reserved or predefined terminal symbols. E.g. some
parser generators reserve the EOF terminal because it may be difficult to
express. Regex can almost cope with any kind of terminal symbol, including an
EOF symbol. An EOF symbol can be defined as follows:
EOF -> @"^$";
The ^ character indicates the regex will scan from the beginning of the text
string/file, the $ indicates that the regex should scan until the end of the
string/file. Because no characters where specified in between, this must be the
end of the file (or text/string).
In order to be less restrictive about the formatting of the expression, we would
like to allow whitespaces. However we do not want to check in the grammar for
whitespaces. In fact we would like to have the scanner simply ignore whitespaces
and continue scanning to the next symbol/token. This can be done by defining a
terminal production rule for whitespace and prefixing it with the [Skip]
attribute like so:
[Skip] WHITESPACE -> @"\s+";
[Skip] WHITESPACE -> @"\s+";
This indicates that any whitespace character will be skipped.
Non-terminals & production rules
Now that we have defined the terminals, let's continue to define the production
rules for the non-terminals.
TinyPG supports the extended BNF notation allowing the following symbols to be
used in the production rules: *, +, ?, (, ), |, and a whitespace. These have the
following meaning:
- * - the symbol or sub-rule can occur 0 or more times
-
+ - the symbol or sub-rule can occur 1 or more times
- ? - the symbol or sub-rule can occur 0 or 1
time.
- | - this defines a choice between 2 sub rules.
- whitespace - the symbol or sub-rules must
occur after each other
- ( ... ) - allows definition of a sub-rule.
The grammar must start with a Start non-terminal.
Hence the Start nonterminal is a reserved word. Let's define the grammar as
follows:
Start -> (AddExpr)? EOF;
AddExpr -> MultExpr (PLUSMINUS MultExpr)*;
MultExpr -> Atom (MULTDIV Atom)*;
Atom -> NUMBER | BROPEN AddExpr BRCLOSE;
The Start production rule will check if there is an AddExpr (optional), and then
expects an end of file (no other tokens are expected).
The AddExpr
will only add or substract MultExpr expressions. The MultExpr will multiply or
divide Atoms. By writing the grammar this way, the precedence of the * and /
symbols over the + and - symbols is defined explicitly. The Atom for now can be
either a number (integer) or another expression. So with only 4 simple
production rules you can already parse complicated expressions like:
(((5*8)/4+3*3-(7+2)) / 5)
Running the parser
Now press F6 to compile the grammar. The output pane should become visible
displaying again the grammar as it is internally represented, but also
displaying the 'First' symbols per non-terminal. The 'first' symbols are the
symbols that the parser will make its decision on for which non-terminal or
production rule to parse next. Also the generated c# code will be compiled
internally and can be run. If all goes well the 'Compilation successful.' should
display.
Press ctrl-e to open the Expression Evaluator pane. Now type in an expression
for your grammar to evaluate, e.g.
(((5*8)/4+3*3-(7+2)) / 5) and press F5 to run the parser. The expression will be
evaluated. 'Parse was successful' should be displayed now. If the parse was
successful the evaluator will continue to evaluate the resulting parse tree.
Because we have not implemented any logic behind the production rules, you will
get the following warning: 'Result: Could not interpret input; no semantics
implemented.'
Simple expression calculator semantics
Adding codeblocks
Now that we have a working grammar we can start adding semantics to the
production rules in terms of code blocks. Code blocks are snippets of c# (or VB) code
that are almost directly inserted into the generated parser. Before it is
inserted some variables are replaced first. Let's start with the first
production rule:
Start -> (AddExpr)? EOF { return $AddExpr; };
Notice the variable $AddExpr. $AddExpr corresponds to the value of the
non-terminal AddExpr. During code generation, the $AddExpr is replaced by a .Net
expression that will evaluate the AddExpr nonterminal. a $ variable is defined
for each terminal and non-terminal in the production rule. E.g. in this example
$EOF would also be a valid variable. Note that terminal and non-terminals always
return values of the type object. You will need to do your own explicit casting
if you want to do calculations. E.g. like in the following snippet:
AddExpr -> MultExpr (PLUSMINUS MultExpr)*
{
int Value = Convert.ToInt32($MultExpr);
int i = 1;
while ($MultExpr[i] != null) {
string sign = $PLUSMINUS[i-1].ToString();
if (sign == "+")
Value += Convert.ToInt32($MultExpr[i++]);
else
Value -= Convert.ToInt32($MultExpr[i++]);
}
return Value;
};
Notice that in this production rule the term MultExpr is defined 2x. So to which
value instance of MultExpr is the $MultExpr referring? Even more so, the latter
MultExpr can be repeated endlessly. To refer to a specific instance of the
MultExpr value, you can use indexers on the $ variable, e.g. $MultExpr[1] will
refer to the second defined instance of the MultExpr of the input expression. So
if we have for instance the expression '3+4*2-6' we will have 3 MultExpr
non-terminals: 3, 4*2 (=8) and 6. So $MultExpr[1] will evaluate to the value 8.
$MultExpr[3] however is not available and will therefore evaluate to null (= the
.Net null value).
So this code evaluates the first MultExpr ($MultExpr is short for $MultExpr[0])
and assigns it to Value. We know that this one always exists accoording to the
grammar. Then we loop through the following $MultExpr[i] and add or subtract
the $MultExpr[i] to the Value until the $MultExpr[i] evaluates to null. In order
to decide if a MultExpr should be added or subtracted we evaluate the
$PLUSMINUS token. Hence now we can actually calculate additions and
subtractions.
The same approach we can use for the MultExpr production rule. I will skip this
in this article but you view the code if you open the 'simple expression2.tpg'
file. Then last but not least there is the Atom production rule which can be
defined as follows:
Atom -> NUMBER | BROPEN AddExpr BRCLOSE
{ if ($NUMBER != null)
return $NUMBER;
else
return $AddExpr;
};
Because the Atom rule contains a choice between a NUMBER or a sub
expression, we also need to check this choice in the code. Therefore we check if
either of the sub rules is null. If not, we return this value.
Running the generated parser / compiler
Compile the grammar again by pressing F6. This should not result in any errors.
Then type in the expression
(((5*8)/4+3*3-(7+2)) / 5) and press F5. This time the expression should parse
successfully and the outcome is calculated and should return 'Result: 2'.
congratulations, you have written your first TinyPG compiler!
Highlighting your expressions
To make things more interesting v1.2 allows you to add text highlighting to your grammar. Adding Text highlighting is done in two steps:
- Add the
TextHighlighter directive to the grammar
- Add the
Color attribute to terminals you want to have highlighted
When the TextHighlighter directive is added, @TinyPG will generate also a TextHighlighter.cs (or .vb) file.
The generated TextHighlighter makes use of the generated parser/scanner to parse the input text and apply the color from the
Color attribute to any terminal it recognizes. E.g. the directives and terminals of the simple expression evaluator could look like this:
|
|
<% @TextHighlighter Generate="true" %>
[Color(255, 0, 0)] NUMBER -> @"[0-9]+";
[Color(0, 0, 255)] PLUSMINUS -> @"(\+|-)";
[Color(0, 0, 255)] MULTDIV -> @"\*|/";
[Color(0, 0, 255)] BROPEN -> @"\(";
[Color(0, 0, 255)] BRCLOSE -> @"\)";
When running expressions in the input pane, the expressions would be calculated and additionally highlighted in the input pane!
This makes it extra simple to write your own text highlighters
Additional remarks
Note 1: it is not allowed to include terminal characters immediately in the
production rules. This is allowed for some compiler compilers however I feel it
does not add to the readability of the generated code. So each terminal must be
defined explicitly as a regex expression. Non terminal production rules may only
refer to teminals or other non-terminals in a LL(1) manner. This means that the
parser will only be able to look ahead 1 token. When scanning and parsing an
input token always corresponds to a terminal symbol (if it does not, you have a
syntax error). So the parser will only need to look ahead 1 terminal to
decide which production rule to choose.
E.g. the following grammar will result in errors because it is not LL(1):
Start -> PLUSMINUS NUMBER | PLUSMINUS Expression;
The parser must choose the production (sub)rule based on 1 lookahead. Because
both subrules are starting with the PLUSMINUS terminal, the parser cannot
decide. TinyPG does not check grammars to be LL(1). It will simply
generate the code. However on compilation of the code you will run into
compilation errors. Luckely an LL(k) (k>1) rule can always be rewritten to LL(1)
format, like so:
Start -> PLUSMINUS ( NUMBER | Expression);
By rewriting the rule like above, the production rule is LL(1) and can now be
successfully generated into an LL(1) parser... or can it? The LL(k) problem is
perhaps slightly more complicated. What if Expression is defined as:
Expression -> NUMBER | IDENTIFIER;
Again the parser will have the same problem, when it encounters a NUMBER token,
should it choose the NUMBER rule of Start, or should it continue parsing an
Expression? So here also the problem occurs. This time it is more difficult to
solve the problem. In this case there is no easy solution but to rethink (partly
rewrite) your grammar.
Note 2: TinyPG will not detect errors inside
codeblocks itself. the .Net compiler will of course. This can lead to .Net
compilation errors that are hard to track and map on the grammar codeblocks. It
may be difficult to see what the issue is. the best way to debug this is to open
the source code in visual studio and try compile it.
Note 3: It is also possible to separate the
semantics from the syntax by not inserting codeblocks directly into the grammar.
TinyPG will generate 3 source code files, the scanner, the parser and the
parsetree. When parsing an input string successfully, the parser will return a
filled parsetree. Normally TinyPG will insert codeblocks directly into the
parsetree. The parsetree can then be evaluated separately. In this case we
create a subclass of the parsetree and insert our own code there (the methods to
implement can be overridden by the subclass). Then when calling the parser, you
supply it with a new instance of your own parsetree. The parser will then fill
this parsetree and return it again.
Of course an alternate manner is to simply evaluate the parsetree directly in
the code, by traversing the tree nodes. However somehow I feel this option is
less 'clean'.
Partial Context sensitive/ambiguous grammars
@TinyPG v1.2 now supports partial ambiguous grammars. Given the simple expression grammar, assume
we would like to make a distinction between FACTORs and TERMs. The problem is, both FACTORs and TERMs are numbers
and can be defined as:
[Color(255, 0, 0)] FACTOR_NUMBER -> @"[0-9]+";
[Color(0, 255, 0)] TERM_NUMBER -> @"[0-9]+";
This is typcically an ambiguous grammar because a number as input can match both symbols.
Unless if you define your grammar for instance only to expect a term. e.g.
Start -> TERM_NUMBER (MULTDIV) FACTOR_NUMBER;
The first input number is expected to be a TERM, the second number is expected to be a factor. So depending on the context (the rule
the parser is parsing), the scanner will interpret a number as a TERM or as a FACTOR respectively.
So the first number will be marked green, and the second in red.
Using the code
Once you have generated the Scanner, Parser, ParseTree and optionally the TextHighlighter classes and tested it with
TinyPG, you obviously now want to use the code in your own project. This can be
done be creating a new c# project with Visual Studio. Add the generated files to the
project and compile, just to make sure there are no errors. To call the parser,
use the following code:
#using TinyPG;
...
Scanner scanner = new Scanner();
Parser parser = new Parser(scanner);
TextHighlighter highlighter = new TextHighlighter(richTextbox, scanner, parser);
string input = "... your expression here ...";
ParseTree tree = parser.Parse(input);
Notice that a ParseTree object is returned. The parse tree contains the
structure of the input. If the syntax of the input is not correct the ParseTree
contains errors. You can check for errors by investigating the ParseTree.Errors
property.
Notice also that the TextHighlighter accepts a RichTextbox control. TextHighlighter will automatically start capturing its
events analyzing its content and updating the content of the RichTextbox control.
If all is well, you can go ahead and evaluate the parse tree:
object result = ParseTree.Eval(null);
Console.WriteLine("result: " + result.ToString());
Notice that the Eval(...) function returns a result of type object. During
evaluation you are free to decide on what the return type will be. This will
give you more freedom, however this also means you will have to cast types
explicitly. To display the result cast it to a string using the
result.ToString() method.
To conclude, this is all that is needed to build and implement your own grammar
and parser. The generated code does not have any external dependencies nor are any
additional libraries required in order to work with the code.
The tiny parser generator evolvement cycle
What I have always found intriguing about compiler compilers is that they are able to
compile their own grammar and hence generate a compiler from that which in this
case is a compiler compiler. Initially of course there is no compiler compiler
available yet. So how to build one? In this section will explain the evolvement
cycle as I have applied it for TinyPG as shown in the following graph:
 |
Step 1: define the grammar as input text to be parsed.
Step 2: derive a concrete parse tree by parsing the input grammar
Step 3: transform the parse tree to an abstract parse tree: the grammar tree.
step 4: the grammar tree contains all information about the grammar stored as a
tree.
step 4a: generate the grammar text from the grammar tree. This is a check to see
if the input grammar corresponds to the grammar in the grammar tree. If they do
not correspond, then most likely the transformation has gone wrong.
step4b: generate the parser c# source code
step 5: compile the sources into the actual parser.
next steps: take the generated grammar from step 4a and use it as input for the
parser, continue the cycle in step 2.
To bootstrap the whole process, start with creating the abstract grammar tree
manually (in code).
The most tricky parts are the Transformation and Generation processes. Once you
have that going, the rest is relatively easy. |
Using TinyPG Directives
Sometimes you want to be able to set some
additional parameters for the tool so it knows how and where to generate the
code, for instance specifying you want to generate C# or VB code.
For this purpose I included the option to insert meta information directly
into the grammar by means of directives. I was inspired by the way this
is handled by aspx pages using the <% ... %> tags. I decided this would be handy
and compact format which would be strict enough to allow only for some
parameters to be specified. Also this will be easy to extend at a later stage,
that is add more directives.
Notice that the syntax highlighting for codeblocks is now also implemented in v1.2. Codeblocks will be highlighted accoording to their
respective Language setting of the @TinyPG directive
Currently the following directives are supported: @TinyPG, @Parser, @Scanner and
@ParseTree and can be used as follows:
<% @TinyPG Namespace="MyNamespace" Language="C#" OutputPath="MyGrammarPath" TemplatePath="MyParserTemplates" %>
<% @Parser Generate="True" %>
<% @ParserTree Generate="False" %>
<% @Scanner Generate="True" %>
<% @TextHighlighter Generate="True" %>
It is required that the directives are defined before the grammar
implementation.
Some handy regular expressions.
Writing your own regular expressions not always easy. Specially not if you
want to match tokens that are often used also in programming languages. I have
summarized a few regular expressions here that can be very helpful.
Regex codeblock = new Regex(@"\s*\{[^\}]*\}([^};][^}]*\})*;");
Regex eof = new Regex(@"^$");
Regex whitespace = new Regex(@"\s+");
Regex regex_string = new Regex(@"@?\""(\""\""|[^\""])*\""");
//commentline will match with a single line of text that starts with // and scan it until the end of the line
//very handy if you want to support commenting by your parser
Regex commentline = new Regex(@"
Regex commentblock = new Regex(@"/\*[^*]*\*+(?:[^/*][^*]*\*+)*/", RegexOptions.Compiled);
That's it for now for the interesting regular expressions I found on the web or
wrote myself. If you have any interesting/complicated ones please drop me a
line.
Points of interest
Apart from the parser generator functionality, the TinyPG utility also contains
a number of additional components that may be interesting. The controls/feaures
that are new in version 1.2 are made bold:
- C# or VB code generation. I have seperated the generation code and templates inside the project and added support for
VB.Net on top of C#. It is now possible to create grammars to directly generate either C# or VB code. Also inline C# or respective VB
codeblocks are supported.
- Partial support for context sensitive/ambigous grammars. Instead of evaluating all possible terminal symbols as possible
input of the grammar, I have adapted to LookAhead function to check only look for the expected terminal symbols.
So if for example 2 terminals are defined to match the same input (this results in an ambiguous grammar), the parser still know which
terminal to look for, depending on the rule it is parsing.
- Code highlighting. Once you have a parsetree, it is relatively easy to highlight
both terminals and nonterminals. Because TinyPG v1.1 code highlighting becomes rather slow if
the text becomes too large, I decided to make the code highlighting asynchronious in v1.2. This seems
to be working rather well. However still when texts become too large (say over a 500 lines of code), highlighting
may take a little time to finish. Additionally I have included code highlighting of C# and VB inline codeblocks
(unfortunately no code completion yet).
- Syntax highlighting.Because EBNF notation is a language on itself, why
not have the syntax and semantics checked of the grammar as you type? Any
syntactic or semantic errors found will be highlighted directly in the code by
underline the text in red. Hovering over the error with the mouse will even
reveal a tooltip showing what the error is, just as you are used to in Visual
Studio. The TextMarker class is responsible for underlining the erroneous words.
That class can be easily reused in your own projects as it has no dependencies.
Add it to your project, wire it up with the RichtTextbox control and assign
erroneous words to it. Those will then be automatically marked.
- Context sensitive code completion. Depending on the section in which you
are typing, code completion will appear with the relevant verbs to complete your
typed word. Obviously this feature was also inspired by Visual Studio. This
feature is implemented in the AutoComplete form which has no dependencies and
can therefore also be reused. Just add it to your own project, wire it up with a
RichTextbox control, add the keywords and you are set to go. Turning
autocompletion on or off can be managed through the Enabled property of the
AutoCompletion control.
TabControlEx. I included an extention on the standard .Net TabControl
called TabControlEx. This one does render the tabs correctly when the control is
turned upside down unlike its superclass. The only problem this one has right
now is, that it does not render the tabs correctly when they are positioned on the
left or right side. - The RegexControl is a fully functional drop in to test your regular expressions.
This is not too exciting though, there are numerous of these tools out there.
- DockExtender.
I reused this code from another project I posted here on CodeProject. It will
allow you to drag, drop and dock the panels on the main window.
- HeaderLabel control. This is an extension on the label. The label is given a
gradient background color based on the currently selected Windows Theme. It is
also possible to activate or deactivate the label, giving it an active or
inactive background color. To top it off I even added the little close 'x' on
the label that will activate if you hover over it. Hence creating basically the
caption header much like that as used on a form.
- Parse tree. Once you are able to compile your grammar and parse your own
expressions, the parse tree will become available. When clicking on nodes in the
parse tree, it's corresponding tokens will be highlighted in the Expression
Evaluator pane. This is a way to browse through the tree and see which tokens
are parsed at what point in the parse tree. It may help you debug potential
mistakes made in the grammar.
All All in all even though I named this the Tiny Parser Generator, this has become
more than a tiny project. I have spent quite some effort in making this all work
together nicely and now I feel this is worth sharing with the community. Although
in a future release I would like to have additional functionility:
- Option to specify the namespace for the generated classes.
- Support for LL(k) (multiple look aheads), this makes it a bit easier to write
grammars; even though LL(1) will produce nicer code.
- Better code highlighting (will probably require partly rewriting the
RichtTextBox control). This could be a separate project on itself. Any help from
the community on this would be greatly appreciated!
- Highlighting of codeblocks! this will of course require
parsing c#. But hey, we have a parser generator available now! I just need to
find the grammar for c#... anyone?>. This one is still high on my priority list
but I am getting closer to implementing it.
- Generate to different languages. Because this will most likely require some
major reworking on some parts of the code I will stick with just c# for the time
being.
- Perhaps a nicer graphical display of the parse tree (e.g. something as is used
in Antlr).
- Display of a state machine like is done in Antlr. That's a nice feature and may
also make the production rule more clear.
- Better error handling and displaying. E.g. make it clickable to jump to the
position where the error occurred.
- Run the evaluator on a separate thread. If you insert faulty codeblocks (e.g.
endless loops) the tool will hang currently. By running it in a separate thread,
this can be controlled.
If you have any ideas for new features, comments or remarks please drop a note!
History
@TinyPG v1.0
Tiny Parser Generator Version 1.0 was released on 1st of August 2008.
This version includes the basic implementation for generating simple top down
parsers.
@TinyPG v1.1
Tiny Parser Generator Version 1.1 was released on 17th of August . Version 1.1 contains additional features, making the editor easier to use:
- Text and syntax highlighting
- Autocompletion
- Improved /revised grammar for the EBNF language
- Support for directives
- Improved FIRST algorithm
@TinyPG v1.2
Tiny Parser Generator Version 1.2 was released on 1st of September. Version 1.2 contains additional features, making the editor easier to use:
- Generation of C# and/or VB code (!)
- Allows for context partial sensitive/ambiguous grammars
- Codeblock highlighting for C# or VB code
- Attributes now allow parameters
- Color(R,G,B) attribute added
- Generation of texthighlighter code
- Asynchronious Text and syntax highlighting
Special thanks go out to William A. McKee for being actively involved with
@TinyPG, inspiring me to improve the tool further and help me revise the EBNF
grammar. I have implemented some of his ideas in @TinyPG v1.2,
including support for (partly) ambiguous grammars.
|
|
 |
 | Tiny PG with great man reza_arab | 20:35 2 Feb '10 |
|
 |
When I saw this software and code, I was very excited.thank you That have been present to share it with others. And congratulate to you due to this beautiful mind. please present test class associated with this project if you would like. thank you again.
|
|
|
|
 |
 | Tiny PG Parsing Queries Manhar Goindi | 1:25 6 Jan '10 |
|
 |
Hi,
Tiny PG is really a great tool.
We are using the TinyPG Parser and found it to be useful In generating C# code. However, we would like to know the following about this parser’s capabilities:
1. Is it possible to advance the parser from some input text position to skip some portion of the text and let it resume parsing from some new position in the input text? How can this be achieved? 2. Is it possible to delink the default TinyPG lexical analyzer from TinyPG Parser and link the TinyPG Parser to some custom Lexical Analyzer? How can this be achieved?
Thanks & Best Regards, Manhar Goindi
|
|
|
|
 |
|
 |
Hi,
thank you for your feedback. To answer your questions: 1. in short: no. the parser analyzes all input text. I suppose there are 2 options: 1.1 somehow embed the text to be skipped as part of the grammar. this way the text is not really skipped, the grammar is extended so the text will be parsed anyway. 1.2 if this is not possible, then you will need to create your own pre-processor that feeds the parser with seperate textblocks, and merge the resulting parsetrees.
2. the only way i think this is possible is to write your own scanner.cs file that implements the same classes and methods as the current generated scanner.cs file, which functions as a wrapper around the custom Lexical Analyzer.
I hope this clarifies some things for you. Regards, Herre
Herre
|
|
|
|
 |
 | Compiling with project files. captncraig | 11:20 5 Nov '09 |
|
 |
First off, awesome job with this. It is surprisingly powerful, and simple to use.
I am using the evaluation process to generate my own AST, using my custom AST classes. I am placing the parser files in the same namespace as my AST project, and they all build and run beautifully in visual studio In the TinyPG editor though, the compile step cannot find my AST files and so it says the compilation fails. This isn't a huge deal, but it means I cannot use the super sweet parse tree viewer without first commenting out my AST generating code. I feel like there should be a way to link my files in so that the compilation step can find them, but I'm not sure how. I could copy and paste them into the template file, but I hope it doesn't come to that.
|
|
|
|
 |
|
 |
Hi, It is true that the TinyPG compiler cannot include custom created (AST) classes. whenever TinyPG compiles the parser for internal use, it makes sure that the generated code implements specific interfaces (which are left out when generating output files). These interfaces are needed by TinyPG to run the dynamically generated parser within the tool. Right now it will require a number of code modifications to allow an AST in. Also it will require some metadata to have the TinyPG understand the structure of the AST, so it all becomes a lot more complex then.
Best way to go about it for now, is to create your own visual studio project including the AST code and have TinyPG generate the parser code. then build your own specific AST viewer (it really isn't that much code).
Herre
|
|
|
|
 |
|
 |
Thanks! I took a look at your ParseTreeViewer code and was able to get something like it to run in my application to view the CST. I like the idea to make one for my AST too, but that will be a little tougher. Thanks for all the help.
I did notice TinyPG runs into problems with some of my C# code I put in the evaluations. Namely when I used a collection initializer (IList x = new List{1,2,3,4,5}) The parser goes all funny and tells me productions are not defined and so forth. Its not a big deal at all, but I thought you should know.
|
|
|
|
 |
|
 |
hmm good point, could be an issue with the curly braces...
Herre
|
|
|
|
 |
 | Extra expressions in tree Lee Atkinson | 2:28 28 Oct '09 |
|
 |
Hi
Many thanks for an excellent application.
I want to be able to walk the parsetree generated from my own query language to generate SQL query. If I run the "Simple Expression Evaluator' sample, I get extra exressions detected that don't really exist. Is there any way to remove them? Below is an example. Looking at the grammar, I think I can see why it happens, but wonder if it's possible to prevent them from occuring, or do I need to ignore them when I'm walking the tree (e.g. if they only have one child node when they should have more).
Many thanks
Lee
Grammar:
<% @TinyPG Language="vb" %>
NUMBER -> @"[0-9]+"; PLUSMINUS -> @"(\+|-)"; MULTDIV -> @"\*|/"; BROPEN -> @"\("; BRCLOSE -> @"\)"; EOF -> @"^$";
[Skip] WHITESPACE -> @"\s+";
Start -> (AddExpr)? EOF; AddExpr -> MultExpr (PLUSMINUS MultExpr)*; MultExpr -> Atom (MULTDIV Atom)*; Atom -> NUMBER | BROPEN AddExpr BRCLOSE;
Expression: 1 + (2 * 3)
Tree:
Start -AddExpr --MultExpr --- extra? ---Atom ----NUMBER '1' --PLUSMINUS '+' --MultExpr --- extra? ---Atom ----BROPEN '(' ----AddExpr --- extra? -----MultExpr ------Atom -------NUMBER '2' ------MULTDIV '*' ------Atom -------NUMBER '3' ----BRCLOSE ')' -EOF
|
|
|
|
 |
|
 |
Hi Lee,
the 'extra' expressions are in fact there by design. What you get as a result is a so called Concrete Syntax Tree (CST). It shows all the intermediary steps including all tokens that may turn out to be redundant in the end (e.g. the brackets).
The point of a is that it is a direct mapping of the expression onto a tree structure. Therefore the exact input could theoretically also be generated again from the CST.
What you are looking for is the next step in the normal compiler process: the Abstract Syntax Tree (AST), which basically transforms the CST into a tree structure that is needed for your specific purpose only (e.g. it might remove the redundant tokens and nodes that you do not need).
In order to create an AST additional logic is required that is not within the scope of TinyPG. This will need to be implemented separately. So in order to evaluate your SQL expression, you should first convert the CST to a AST containing only the necessary SQL nodes and representations (e.g. tables, attributes, operations etc).
hope this helps. regards,
Herre
|
|
|
|
 |
 | Nice article el_marcondon | 7:35 21 Sep '09 |
|
 |
Got my 5.
However, I was trying to play with the 1.2 source code, and I noticed that it's missing the unit tests.
|
|
|
|
 |
 | How smart is your error reporting? Yumashin Alex | 9:25 19 Aug '09 |
|
 |
First, let me thank you for your excellent article and code. It's very nice realy! Second, let me ask you two questions:
1. Are you planning to improve your error messages? It will be very good to know the exact reason (and position) of the error.
2. Suppose we have the following grammar description (for an xml-file):
Element A1 (i.e. element with its "name" attribute set to "A1") is mandatory and must be the very first element in the file. Then goes an optional element A2 which must have its "id" attribute set to "ABCD". Then goes an optinal repetitive (min = 0, max = unbounded) group "Z" of elements A3 and A4, mandatory and optional correspondingly. Then goes a mandatory element A5 which must have its "id" attribute set to "EFGH". And then goes a mandatory element A6 which must be the last element in the file.
I'm not ready now to express this grammar in EBNF, but I think it's not very hard to do. Now let me show 2 samples of valid files (root element is omitted):
Sample 1:
<T name="A1" />
<T name="A2" id="ABCD" />
<T name="A3" />
<T name="A4" />
<T name="A3" />
<T name="A3" />
<T name="A4" />
<T name="A5" id="EFGH" />
<T name="A6" />
Sample 2:
<T name="A1" />
<T name="A5" id="EFGH" />
<T name="A6" />
I suppose your parser is quite capable to check these files against the given grammar. Now let me show a sample of an invalid file:
*** mandatory element A1 is missing *** <T name="A2" id="AB" /> *** attribute 'id' has wrong value *** <T name="A3" />
<T name="A4" />
<T name="B1" /> *** element B1 is excessive *** <T name="A6" /> *** element A6 must go after element A5, not before it *** <T name="A5" id="EFGH" />
Now my question: is it possible to get "smart" error messages - as those enclosed in "***" symbols? or your parser will report only "Element 'X' not found at its expected position"?
And a "subquestion": imagine that "group Z" (see my grammar description above) is mandatory, i.e. min = 1, not zero. And imagine that the invalid file has the following contents:
<T name="A1" />
<T name="A5" id="EFGH" />
<T name="A6" />
Will your parser say just "Element A3 not found at its expected position" (or smth. like that) - or it is capable to say "Group 'Z' is missing" (which is more correct and "smart" in this situation)?
Thank You.
|
|
|
|
 |
 | Re: How smart is your error reporting? Herre Kuijpers | 10:08 19 Aug '09 |
|
 |
hi Yumashin,
the answer in short: no, not out of the box.
the path from a grammar to a full blown compiler is a lengthy one. you need to distinguish between the following concepts: 1- syntactical analysis, this involves checking if the structure of the input is correct. the result is Concrete Syntax Tree (CST). 2- semantical analysis, this involves checking of the contents of the parsed elemements. for example type checking, and checking if certain content is allowed only in a certain context. This requires to transform the Concrete Parse Tree to an Abstract Syntax Tree (AST). 3- Once the generation, once the input is found to be correct, execute the AST, that is, generate the output based on the AST.
@TinyPG allows step 1 by default, given a correct grammar. to come to step 2 you will have to implement your own logic (e.g. code inline)
Regarding your examples, I'm somewhat confused. Why did you choose XML? there are a lot of XML parsers out there already, why would you want to create another one?
well anyway I hope it answers your questions. cheers
Herre
|
|
|
|
 |
|
 |
@TinyPG allows step 1 by default, given a correct grammar. to come to step 2 you will have to implement your own logic (e.g. code inline)
- well, my idea was to check my grammar (as described in the beginning of my first message) during step 1 only. I can dynamically generate grammar in EBNF based on some formalized representation of such description, can't I? E.g. imagine that my grammar is first formalized in the following way:
<ExpectedFileFormat>
<T name="A1" use="M" />
<T name="A2" use="O" id="ABCD" />
<Group name="Z" minCount="0" maxCount="unbounded" >
<T name="A3" use="M" />
<T name="A4" use="O" />
</Group>
<T name="A5" use="M" id="EFGH" />
<T name="A6" use="M" />
</ExpectedFileFormat>
I can convert it (with XSLT or in some other way) to the following representation (not sure that it is syntactically correct):
Start -> (ElementsChain)? EOF; ElementsChain -> A1 | A2 | GroupZ* | A5 | A6; A1 -> @"A1"; A2 -> @"(A2:id='ABCD')?"; A3 -> @"A3"; A4 -> @"(A4)?"; A5 -> @"(A5:id='EFGH')"; A6 -> @"A6"; GroupZ -> @"(" A3 | A4 ")";
Then I must prepare my xml-file (the one to be checked) by converting it to a string expected by this final grammar, e.g. to:
A1 | A5:id='EFGH' | A6 (that's grammar for "Sample 2" from my first message)
Then I generate and execute parser without any inline code, i.e. perform step 1 only - syntactical analysis. If it passes OK - then my xml-file has valid structure. If not - I'll be glad to get detailed error messages from your parser.
AFAIU, that's a correct approach, isn't it?
Regarding your examples, I'm somewhat confused. Why did you choose XML? there are a lot of XML parsers out there already, why would you want to create another one?
- I'm just not aware of XML parser which will suite my needs.
|
|
|
|
 |
|
 |
I'm sorry but I still don't understand what you are trying to achieve.
My understanding so far: 1. you want to express a (E)BNF grammar with XML? 2. you want to check both the structure and contents of an XML file using your own grammar & parser?
if you can just read the XML with the standard DOM document (e.g. the one provided by the .Net framework), then can't you just simply walk through the DOM and check if the contents of each node is valid?
Herre
|
|
|
|
 |
|
 |
Sorry for my bad explanation. I'll try once more.
My task is to check that certain xml-files have expected structure and (to some extent) contents. This expected structure/contents is originally described in words (see my very first message), but can be also expressed in some formalized form (plz see my previous message, the xml sample with "ExpectedFileFormat" root element). If xml-file does not pass this check, I need to describe WHY - e.g. "Group 'Z' is missing", "Excessive element 'B1' found", "Element 'A5' is missing" etc. And these error messages must be smart enough, which means that the check algorithm must be also smart - like the algorithm implemented inside "WinDiff" and similar utilities, when comparing strings "ABC123AZX" and "AZX" produces smart error "Position 1: group 'ABC123' is missing" instead of silly "Position 2: symbol 'B' is missing". That's it.
How can I solve this task? -
a). I cannot use xsd-schemas to do this, because the expected structure/contents (as one can easily see) cannot be described in terms of xsd-schemas. I can give further explanations here if needed.
b). I was able to find a way to use regular expressions to perform this check, but it only tells me "true/false" and - when "false" - does not tell me WHY the check hasn't been passed. So it's not a solution for me.
c). I can iterate through DOM (as You propose), but thus I'll always get silly error "Element 'A3' is missing" instead of smart error "Group 'Z' is missing" - when no any elements A3 and A4 are found in the xml-file being checked. The reason is that this simple iteration is not able to understand that "AZX" is a more preferrable match than just "A" (speaking of the sample in the 2nd paragraph of this posting).
My hope is that if I use syntactical analysis to solve my task, I will be able to get smart error messages. That's why I decided to look at Your parser! The idea is to: (1) programmatically convert xml-representation of expected structure/contents into a "grammar", i.e. a set of production rules written in EBNF (I tried to show it in my previous message - plz see "Start -> (ElementsChain)? EOF;" and further on), (2) programmatically convert xml-files (to be checked) so that the target format can be checked against this grammar (e.g. convert "Sample 2" from my very 1st message to a simple string "A1 | A5:id='EFGH' | A6"), (3) use Your parser to check grammar.
That's what I mean. I think this will work, but I'm not sure if Your parser is able to understand that "AZX" is a more preferrable match than just "A"
|
|
|
|
 |
|
 |
i still believe the easiest way to solve your problem is using a DOM parser and add your own logic as to what element is missing and what it's meaning is. (add semantics)
i don't see how this otherwise can be achieved with a one pass parser.
Herre
|
|
|
|
 |
 | Very good job explaining and demonstrating JasonShort | 10:15 4 Aug '09 |
|
 |
I really enjoyed your article. You did a great job explaining and demonstrating the topics. It is rare to find such well laid out information. You should consider writing more articles!
Jason S Short, Ph.D. VistaDB Software, Inc.
|
|
|
|
 |
|
 |
Hi Jason, I appreciate your feedback. Unfortunately the lack of time prevents me from writing articles more frequently. cheers, Herre
Herre
|
|
|
|
 |
 | Using TinyPG to parse Excel Formula Herre Kuijpers | 7:02 6 Jul '09 |
|
 |
For anyone interested in how to apply TinyPG in a more realistic setting, check out this article in which TinyPG is used to create the parser for Excel formulas by Chris Cavanagh.
http://chriscavanagh.wordpress.com/2009/02/19/custom-expressions-and-the-dlr-part-1/[^]
The artical and code on codeplex also shows how to implement the Abstract Syntax Tree derived from the ParseTree class. He defined a basic expression parser using the following grammar:
; <% @TinyPG Namespace="Cjc.ExpressionEngine.Excel.Compiler" %>
PLUSMINUS -> @"(\+|-|&)"; MULTDIV -> @"\*|/"; BROPEN -> @"\("; BRCLOSE -> @"\)"; COMMA -> @","; COLON -> @"\:"; TRUE -> @"[Tt][Rr][Uu][Ee]"; FALSE -> @"[Ff][Aa][Ll][Ss][Ee]"; NULL -> @"[Nn][Uu][Ll][Ll]"; METHODNAME -> @"[a-zA-Z_][a-zA-Z0-9_]*(?=\()"; ASSIGNID -> @"[a-zA-Z_][a-zA-Z0-9_]*(?=\s?=\s?)"; CELLID -> @"([a-z]+|[A-Z]+)[0-9]+"; IDENTIFIER -> @"[a-zA-Z_][a-zA-Z0-9_]*"; INTEGER -> @"(\+|-)?[0-9]+"; NUMBER -> @"(\+|-)?[0-9]*\.[0-9]+"; STRING -> @"@?\""(\""\""|[^\""])*\"""; COMPARE -> @"==|<=|>=|>|<|=|!=|<>"; EQUALS -> @"="; EOF -> @"^$";
[Skip] WHITESPACE -> @"\s+";
Start -> (Assignment|CompareExpr)? EOF; Assignment -> ASSIGNID EQUALS CompareExpr; CompareExpr -> AddExpr (COMPARE AddExpr)?; AddExpr -> MultExpr (PLUSMINUS MultExpr)*; MultExpr -> Atom (MULTDIV Atom)*; Params -> CompareExpr? (COMMA CompareExpr)*; Method -> METHODNAME BROPEN Params? BRCLOSE; Identifier -> IDENTIFIER; Range -> CELLID (COLON CELLID)?; Atom -> TRUE | FALSE | NULL | INTEGER | NUMBER | STRING | BROPEN CompareExpr BRCLOSE | Range | Method | Identifier;
Herre
|
|
|
|
 |
 | applying semantics battlemodetwo | 5:50 24 Apr '09 |
|
 |
how could I apply a semantics for this instance.
I have this bnf for declaration
Declaration -> DATATYPE IDENTIFIER EQUALS VALUE SEMICOL;
in C langauge this could be:
int c = 5;
If i would not apply any semantics, it would be possible to input this kind of codes:
int c = 5.05; float d = 5; char d = 5; int c = 'f';
You get what I mean, another thing is I could assign variables that are not yet declared. Please give me a brief example on this matter.
|
|
|
|
 |
 | runtime generation and text highlighting performance VincentHarink | 9:38 6 Apr '09 |
|
 |
Verry good software.
I have two questions.
1) When I want to integrate the parser in my text editor then it would be nice when the user can make it's own rules which are stored in a text file. When running you can select which rules you want to use for the syntax highlighting. For this to work the grammar compile should be at runtime. Is this possible ?
2) When there is a lot of information in the richtextbox the performance is very slow, and when more than about 40k not working at all. Is there a way to change the syntax highligthing so only the visible part of the richttextbox is handled ?
Thanks
|
|
|
|
 |
 | String expression egodefroy | 14:09 24 Mar '09 |
|
 |
Hi Herre, I'm using your tool with much pleasure!
If I use the production rule STRING -> @"@?\""(\""\""|[^\""])*\"""; as you mention in your example, then the corresponding token Text value for '"mystring"' is '"mystring"'. Shouldn't it be 'mystring' instead ? If not, how do I convert the token Text value to the real string value in the general case ? Thks Eric
|
|
|
|
 |
 | Using dots inside and outside strings NiloPaim | 14:02 24 Feb '09 |
|
 |
Hi, Herre.
Consider the following lexical rules:
VERSION -> @"version"; DOT -> @"\."; STRING -> @"@?\""(\""\""|[^\""])*\"""; ANY_TEXT -> @"[^\.]*";
And consider the following syntatic rule:
VERSION ( STRING | ANY_TEXT ) DOT
If I have a dot inside the string, it's not recognized by the syntatic rule.
version 1. is ok.
version "1.00". gave me syntax error.
Am I missing something?
TIA
|
|
|
|
 |
|
 |
try the following
VERSION -> @"version"; DOT -> @"\."; STRING -> @"@?\""(\""\""|[^\""])*\"""; ANY_TEXT -> @"[^.\s]*";
[Skip] WS -> @"\s+";
Start -> VERSION ( STRING | ANY_TEXT ) DOT;
the problem is that the ANY_TEXT definition you used did not skip any whitespaces. since the parser is greedy, it was matching ANY_TEXT on ' "1'.
hope this is what you needed.
regards, Herre
Herre
|
|
|
|
 |
|
 |
Hi, Herre.
If I do what you say, I just change the place of my problems. I'll show you an example. In Cobol, is perfectly valid the instruction: author. nilo roberto c paim. To parse it, I'd defined a rule as AUTHOR DOT ANY_TEXT DOT where ANY_TEXT is exactly what its name says: any text.
The end of the text is the final dot. If I add space to ANY_TEXT, just "nilo" on the example will be recognized as a valid token and everything else is considered as syntax errors.
In fact, I think that my biiiiig problem is: Dot is a sentence terminator in Cobol, but it could appear on some other places successfully. A dot inside a string is perfectly valid. A dot inside a number (e.g., 3.14159) is valid too. How could I distinguish them?
I would like that TinyPG have some type of "priority" on using the lexical rules, understand? For instance, I could say that a STRING would be evaluated before a float number, and a float number would be evaluated before the terminator dot on a sentence.
Of course, I know that this kind of thing implies on some type of lookahead before TinyPG decides what rule accept.
How to do this?
Ah, I have made a workaround over some things using a pre-processor. Case insensitive on reserved words it's an example. I just apply a ToLower() on my input before send it to parser/scanner...
Thanks for quick answer and congratulations. TinyPG is a wonderful tool.
Best regards from Brazil.
Nilo
|
|
|
|
 |
|
|
Last Updated 1 Sep 2008 |
Advertise |
Privacy |
Terms of Use |
Copyright ©
CodeProject, 1999-2010