Click here to Skip to main content
Click here to Skip to main content

Introduction to GOLD Parser

By , 26 May 2005
 

Introduction

GOLD Parser (see [1]) is a partial, open-source parsing system that can be used to parse existing programming languages, scripts or interpreted languages. It can even be used to develop your own programming language. GOLD Parser was developed by Devin Cook for his Masters Report in Computer Science at California State University, Sacramento. Devin still supports an active community through his web site and Yahoo! news groups.

There are many other fine articles on the Code Project web site about what a parser and lexer are (see [2]) and how they work. In this introductory article, I will focus, instead, on the basic features of the GOLD Parser system.

Background

The first time you learn about a parser or lexer, you are immediately linked to the classic Lex and YACC tools developed many years ago. You may also be referred to the open-source, Bison tool provided with the GNU distribution or ANTLR, a commercial parsing tool or Spirit, an open-source parsing system based on C++ templates. What's common to all these tools is that the code generated for parsing is always C or C++.

For those using .NET, there are really only two choices; Grammatica, which was recently ported from its original Java implementation to C# and GOLD Parser.

The GOLD Parsing System

The GOLD Parsing system consists of three phases;

Phase 1: the developer creates the grammar of the programming language, script or interpreter and compiles it into an intermediate, binary file using the GOLD Parser Builder tools.

Phase 2: the developer decides on a programming language to implement the parser. Besides the C# and VB.NET languages, implementations of the GOLD parser engine are provided for VB 6.0, Java, Delphi, Python and, of course, C and C++. The developer chooses the programming language template of choice from the the GOLD Parser Builder and generates a parser engine shell in that language. Note that some templates generate a more complete parser engine than others.

Phase 3: the developer takes each parsed token and, in the case of an interpreter, executes the command immediately or, in the case of a compiled language, begins to build an abstract syntax tree for later optimization, storage and/or execution of the input language.

Phase 1: Creating the Grammar

A grammar can be as simple as a basic calculator or as complex as the C# language or Structured Query Language (SQL). To get you started, the GOLD Parser Builder provides a wizard that helps you to identify the common features that your language might include such as line based or free-flowing, C++ or COBOL style identifiers, string literal representation and basic mathematical/comparison expressions. Clicking Finish, opens the generated grammar in the Grammar Editor window of the builder.

Like most other tools that convert grammars into working code, the input grammar allows you to enter meta data about the author, version, details about the grammar and the starting rule.

"Name" = 'Enter the name of the grammar'
"Author" = 'Enter your name'
"Version" = 'The version of the grammar and/or language'
"About" = 'A short description of the grammar'

"Start Rule" = <Start>

A grammar will typically include definitions for how to recognize a floating point number, hex integer or string while parsing. Regular expressions are used to build the definition for these terminals.

{ID Head} = {Letter} + [_]
{ID Tail} = {Alphanumeric} + [_]
{String Chars} = {Printable} - ["]

Identifier = {ID Head}{ID Tail}*
StringLiteral = '"' {String Chars}* '"'

GOLD has several built-in character-sets from which to build a terminal plus other examples on the web site.

The final and most interesting part of the grammar is the rules. Rules in a grammar are declared using Backus-Naur Form (BNF) statements. This notation consists of a series of 0 or more symbols where non-terminals are delimited by the angle brackets '<' and '>' and terminals are not delimited. For instance, the following declares the common if-statement.

<Statement> ::= if <Expression> then <Statements> end if

The symbols 'if', 'then', 'end', and 'if' are terminals and <Expression> and <Statements> are non-terminals. If you are declaring a series of rules that derive the same non-terminal (i.e. different versions of a rule), you can use a single pipe character '|' in the place of the rule's name and the "::=" symbol. The following declares a series of three different rules that define a 'Statement'.

<Statement> ::= if <Expression> then <Statements> end if
              | while <Expression> do <Statements> end while
              | for Id = <Range> loop <Statements> end for

Once the grammar is ready, it's time to compile it into a binary form. The Builder provides a simple 4-step, single button wizard to compile the grammar. Clicking the initial "1. Enter Grammar" button checks the grammar for syntax errors and warns of any unreachable rules or unused terminals. If all goes well the button caption changes to "2. Compute Parser". Clicking this button, computes the Look-ahead Left to Right Parsing (LALR) state tables. Again, if no errors are generated, the button caption changes to "3. Compute Tokenizer". Clicking this button computes the Deterministic Finite Automata (DFA) tables used by the tokenizer of the resulting parser engine. Finally, if there were no errors, the button's caption changes to "4. Save the Tables". Clicking this button allows the compiled LALR and DFA tables to be saved to a .CGT file for later use by the parser engine.

Phase 2: Choose a Language and Generate Parser

When the GOLD parsing system was initially released, the Visual Basic parser engine code for reading a .CGT file and parsing an input stream was included. Before long, contributors took it on themselves to port this code to their favorite programming language. Today, there are many language implementations, all available in source code form for inclusion in your project.

Implementing a parser is a coding intensive task. Manually typing in each symbol and rule constant can be tedious. Fortunately, the GOLD Parser Builder includes a template mechanism that provides a powerful approach to generating parser code. A simple pre-processor provides tags for emitting attributes of the grammar's symbols and rules. The tags can be used to create lists of constants, case statements, or whatever the engine developer needs. Templates are available for each parser engine language that has been contributed, so some of the work has already been done for you.

From within the Builder, select the Tools -> Create Skeleton Program menu option to see the list of templates available.

Clicking Create allows you to enter the name and extension of the file into which the code is generated. At this point, you can open the generated code in your favorite development environment and reference the corresponding parser engine component or include its source code.

Phase 3: Implementing the Grammar Specific Parser

Most language templates include some kind of function into which you pass the string or file to be parsed, and either through callbacks, delegates, overrides or inline code allow you to intercept each state as the input is being parsed. For example:

Public Sub DoParse(ByVal Source As String) 
  If Not Parser Then Parser = LoadParser("grammar.cgt")

  Parser.Source = Source
  Done = False 
  Do Until Done 
    Select Case Parser.Parse()
    Case LexicalError 
      'Cannot recognize token 
      Done = True 

    Case SyntaxError 
      'Expecting a different token 
      Done = True 

    Case Reduction 
      ' handle reduction of the token here
      With .CurrentReduction
        Select Case .ParentRule.TableIndex 
        Case 1 
          ' execute meaning of rule or create object representing the state
        Case 2
          '...
        End Select 
      End With 

    Case Accept ' CurrentReduction = the root node! 
      Done = True 
      Success = True 

    Case CommentError 
      'COMMENT ERROR! Unexpected end of file 
      Done = True 
    End Select 
  Loop 
End Sub

The function loads the .CGT file if it is not already and tells the parser about the the input file or string to be parsed. Then, in a loop, the parser parses the input and returns when something significant happens. There are distinct error conditions for lexical, syntax and comment errors. When a rule is encountered (reduction), the index and tokens are available for the current state. This is where you add code to either execute the meaning of the rule (in the case of a command language) or create an object representing the state for later optimization and/or execution (in the case of an interpreted language). When the end of input is encountered (accept), the loop breaks and the function returns. For an interpreted language, the developer can now traverse the tree of objects that was created and perform the meaning of the input grammar.

About the Code

The GOLD parser web-site provides a simple interpreted language to demonstrate how to implement a working parser. The sample is written in Visual Basic 6.0. Since I was interested in a solution for VB.NET, I took this code and created the attached VB.NET version. The demo includes a modified copy of the Klimstra C# Parser engine and a working version of the simple-interpreter sample, all compiled against .NET 1.1.

Simple Interpreter Grammar

The simple interpreter grammar allows basic variable declarations, execution flow, data input and output, and expression evaluation. The following grammar shows these basic rules:

<Statement> ::= display <Expression>
              | display <Expression> read ID
              | assign ID '=' <Expression>
              | while <Expression> do <Statements> end
              | if <Expression> then <Statements> end
              | if <Expression> then <Statements> else <Statements> end

This was compiled into the simple2.cgt file using the GOLD Parser Builder. Next, using the included Visual Basic .NET - Template Parser.pgt file, I used the GOLD Parser Builder to generate the file Template.vb. Then, I used the Visual Basic .NET - Sample Parser to generate the file Simple.vb.

I used Visual Studio 2003's Visual Basic Convert tool to convert the project to VB.NET, added a reference to the Klimstra Parser engine DLL, and included the Template and simple.vb files. Then, in Simple.vb, for each rule, I created the corresponding class for each rule. The following code snippet shows associating a class with a rule:

Public Class SimpleParser
   Inherits TemplateParser
   '...
   Protected Overrides Function _
            CreateRule_Statement_If_Then_Else_End(ByVal Tokens _
            As ArrayList) As Reduction
      Dim Obj As New SimpleIfStm
      Obj.IfClause = Tokens(1).Data
      Obj.ThenClause = Tokens(3).Data
      Obj.ElseClause = Tokens(5).Data
      return Obj
  End Function
  '...
End Class

The following code snippet shows the class that stores information about the rule and implements the method to execute the rule:

Private Class SimpleIfStm
  Inherits SimpleReduction
  Implements IContextMethod

  Public IfClause As IContextValue
  Public ThenClause As IContextMethod
  Public ElseClause As IContextMethod

  Public Function Execute() As Object Implements IContextMethod.Execute
    If IfClause.Value = "True" Then
      ThenClause.Execute()
    Else
      If Not ElseClause Is Nothing Then
        ElseClause.Execute()
      End If
    End If
  End Function
End Class

Finally, in the MainForm.vb file, in the method that handles the Parse button click event, the string containing the program to execute is passed into the parser as follows:

Parser.DoParse(New StringReader(txtProgram.Text))

If the input does not generate errors, the Execute button is enabled. The Click event handler for the Execute button simply takes the root node of the tree generated during parsing and calls the corresponding Execute method.

If Not Parser.CurrentReduction Is Nothing Then
  CType(Parser.CurrentReduction, IContextMethod).Execute()
End If

This walks the internal tree and calls each object's corresponding Execute or Value method depending on the rule.

The sample source code includes several example programs using the simple2 language grammar.

Points of Interest

I should mention that none of the templates provided with GOLD, take into consideration the lifecycle of developing a parser. When you are building a language, you typically never completely finish it before writing the parser. More typically, you get something basic working, then you can start to add language features. After code is generated from a template, it's easy to fall into the trap of adding your code into the generated file. However, the next time the grammar changes and you need to regenerate, you hopefully realize it would wipe out your code.

When developing the VB.NET version of the simple interpreter, I created two templates. The first template generates an abstract class with the DoParse function and must-override methods to handle each rule. The second template generates a concrete class that provides an overridden implementation for each rule and is where my custom code is added. The second template is typically only used once. This way, when the grammar changes, the first template updates the abstract definition without wiping out my changes and also makes it easy to identify any new overridable methods added to the parser.

Oh!, and in case you were wondering, GOLD stands for Grammar Oriented Language Developer. But I'm sure it has more to do with the Golden Gate Bridge in San Francisco near where Devin lives.

History

  • 26-05-2005 - Initial release.

Reference

  1. GOLD Parser web site
  2. Other Parser articles on Code Project.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Adrian_Moore
Web Developer
Canada Canada
Member
Adrian Moore is the Development Manager for the SCADA Vision system developed by ABB Inc in Calgary, Alberta.
 
He has been interested in compilers, parsers, real-time database systems and peer-to-peer solutions since the early 90's. In his spare time, he is currently working on a SQL parser for querying .NET DataSets (http://www.queryadataset.com).
 
Adrian is a Microsoft MVP for Windows Networking.

Sign Up to vote   Poor Excellent
Add a reason or comment to your vote: x
Votes of 3 or less require a comment

Comments and Discussions

 
You must Sign In to use this message board.
Search this forum  
    Spacing  Noise  Layout  Per page   
QuestionDid you implement an interpreter on your demo code?memberecualife26 Jul '11 - 20:10 
I would like to ask if you also implemented an interpreter to the the codes you attached and would like to know if you have a tutorial written explaining more on the vb codes you uploaded with the article
QuestionC Grammer?memberjulvr223 Jul '07 - 10:05 
I'm thinking of building some development tools, which among other things, anaylyze the content of C/CPP header files. To avoid reinventing the wheel, is the grammer for C/CPP files publicly available?
 
John

GeneralANTLR generates many different languagesmemberTriLogic220 Jun '07 - 4:22 
Someone has already mentioned that ANTLR is OpenSource. It is also true that the latest version of ANTLR generates lexers and parsers in multiple languages. There is an excellent visual development IDE written in java that makes it extremely easy to use.
GeneralVisual BNF parser generatormemberAdamSlosarski13 Mar '07 - 18:10 
Look at this parser with testing of grammars: www.intralogic.eu
 
Testing your code against performance will keep you running with good scalability and maintenance for years.
NTime.exe - the free tool for real developers of high scalability applications!

QuestionUpgrade to .NET 2.0?memberMichael Crandall2 Feb '07 - 11:50 
Adrian,
 
Great article. I would like to port it up to .NET 2.0 but don't know enough about your project or GOLD Parser. I would like to learn more about using GOLD Parser and your article looks like a great start. Would you be interested in assisting me in upgrading your project to .NET 2.0?
 
Thanks for your help in advance.
 
Mike.
Questiongrammar for SELECT statementmemberkavirajan4 Dec '06 - 0:24 
How can I write a grammar for the sql SELECT Statement?. I tried the SELECT statement grammar given in ANTLR.But GOLD is not accepting the grammar. Is there a diffrence between two notations?. Pls help me with the way to use it.

 
M.Michal
AnswerRe: grammar for SELECT statementmemberAdrian_Moore4 Dec '06 - 4:54 
My home page http://www.queryadataset.com[^]has a link at the bottom.
 
Ad.
GeneralRe: grammar for SELECT statementmemberkavirajan4 Dec '06 - 23:46 
That one is working fine. Thanks a lot
QuestionRe: grammar for SELECT statementmemberpeiman23 Jul '07 - 1:56 
Hi Adrian
I use ur grammer and it works fine but i cant modify it to use UNICODE or exetended character set for Indentifiers for example if SQL text is for tables name that are in Persian/Arabic...
 
in grammer code
 

{String Ch 1} = {All Valid} - ["]
{String Ch 2} = {All Valid} - ['']
{Id Ch Standard} = {Alphanumeric} + [_.]
{Id Ch Extended} = {Printable} - ['['] - [']'] + [_.]
{Hex Digit} = {Digit} + [abcdef]
{Id Tail} = {Alphanumeric} + [_]
{ID Name Chars} = {Printable} - ['['']']
 
StringLiteral = '"' ({String Ch 1} | '""')* '"' | ('N'|'n')? '' ({String Ch 2} | '''')* ''
IntegerLiteral = {Digit}+
RealLiteral = {Digit}* '.' {Digit}+ ( 'E' [+-]? {Digit}+ )? | {Digit}+ 'E' [+-]? {Digit}+
HexLiteral = '0x'{Hex Digit}+ [SIL]?
 
Identifier = (({Letter}|'#'|'@'|'_')('#')?{Id Ch Standard}*('.''*')? | '['{Id Ch Extended}+']'('.')?)+

changing Indetifier to this did not work
Identifier = (({Letter}|'#'|'@'|'_')('#')?{Id Ch Extended}*('.''*')? | '['{Id Ch Extended}+']'('.')?)+

 
your help is appreciated
 
thanks in advanceCool | :cool:
AnswerRe: grammar for SELECT statementmemberAdrian_Moore27 Jul '07 - 3:14 
Unicode identifiers are only supported using the [] extended syntax. Change the following in your grammar and it works as you need
 
Add:
 
{Id All Extended} = {All Valid} - ['['] - [']'] + [_.]
 
Change:
Identifier = (({Letter}|'#'|'@'|'_')('#')?{Id Ch Standard}*('.''*')? | '['{Id All Extended}*']'('.')?)+
 
Hope this helps
Ad.
GeneralRe: grammar for SELECT statementmemberpeiman29 Jul '07 - 0:03 
Hi Adrian
Thanks for your reply ,i am going to test it with Kessel engine http://www.kessels.com/Gold/index.html[^]
 
as i am beginner in this field i would like to know your opinion for this problem.
the problem is : GoldParser generates extremely large arrays for grammerCharset.
is there any way instead of checking against this charsets the engine checks for exclusions!
and Charset arrays were consisted from excluded chars.?
QuestionIndex out of range error in Parser.Parse()memberatismandal13 Sep '06 - 8:07 
I am getting this error - Index out of range error in Parser.Parse() - I checked the dump, it seems that the Lookaheadreader function is failing... Basically this happens when I have a number of line feeds in the text that is being parsed... Cannot figure out what the problem may be... Can anyone help?
GeneralGoldParser.dllmembernevski9 Jan '06 - 5:48 
Hi Adrian,
 
Do you have the engine source to rebuild the GoldParser.DLL? There's a version on Devin's website but it looks like the GoldParser.cs file is the wrong version as there should be two constructors (one for a string and one for a stream), however, that file only has the string constructor.
 
Thanks,
Neville
Generalvb.net interpretermembernevski14 Dec '05 - 3:45 
Hi Adrian, Do you have any code which uses the vb.net grammer? I'm looking for examples for how to implement various vb.net code items.
GeneralRe: vb.net interpretermemberAdrian_Moore14 Dec '05 - 9:39 
No. Sorry.
Generalcreating cdfgmemberUAU30 Nov '05 - 21:29 
hi,
 
i'm a student and i need to use a parser inorder to retrieve the CDFG(control-Data flow graph) of input file,is it possible to do this by gold parser?
QuestionCan you use GOLD to parse text files containing records?memberdonburke27 Nov '05 - 18:23 
I need to parse text files containing records. Files are from different sources in different formats (e.g. fixed length, delimeted, diff orders etc)
 
Is this possible? and better yet, is there a better way of doing it. I've been given a spec that tells me to do this..i'm not convinced. Any suggestions would be very appreciated.
 
Thanks in advance
 
Don
AnswerRe: Can you use GOLD to parse text files containing records?memberAdrian_Moore28 Nov '05 - 1:58 
While its possible to use GOLD to parse record based text files, its not very efficient. Have you looked for open-source implementations of doing this or even ODBC?
GeneralRe: Can you use GOLD to parse text files containing records?memberdonburke28 Nov '05 - 13:58 
Hi Adrian,
 
Thanks for the reply, thats what i thought too. But it turns out that my team leader wants me to build a custom language to use to parse file records, and create a grammar for the language that parses it.
 
For the different file formats, a different script will be written, for e.g.
 
RECORD_FORMAT = "Fixed_Length"
 
ID = REC.sub(1,4)
TYPE = if(ID = "1800") then REC.sub(4,10) else REC.sub(4,8)
 
Given a particular file format, the engine will be given the relative script like above, and then pass the script into tokens. WHich my c++ application will read and somehow figure out how its meant to parse records. The program may also need to do some transformations so thats why i think there are logic if statements.
 
I'm still coming to grasps with this myself. It seems over complicated. But do you think it can work?
 
Any input would really help me out.
GeneralRe: Can you use GOLD to parse text files containing records?memberGarth J Lancaster30 Nov '05 - 22:08 
Ive often thought about using 'this sort' of parser for the same thing - and it always seems to be inefficient/problematic .. what Im working on so far, which is not unique (ie, not my original idea), is an XML grammar describing the file (EDIFACT, various banking formats I use) and a hybrid parser that uses that XML grammar against the file ..
 
[edit/added]
If you're interested in this approach start with "Using XML with Legacy Business Applications" by Michael C Rawlins - comes with code (downloadable from his site)
[/edit]
 
'G'
 
-- modified at 4:10 Thursday 1st December, 2005
GeneralRe: Can you use GOLD to parse text files containing records?memberDave Bacher17 Mar '06 - 10:55 
This is an older post, I know.
 
We did this for 7+ years somewhere I worked. While it does work, I would suggest against it for longer term projects. The flexibility it gets you is useful, however every time someone decides that they "have" to have processing for some new format, you get to go in and hack in new support, and since you have an investment in scripts, you have to find a way to make the change without breaking all of them. It (generally) leads to kludges and ugliness 2 or 3 years down the line.
 
In my experience, it is best to consider if you are adding anything of value over a standard scripting language (on Windows, VB.NET, VBScript and JScript are standard languages you can easily use). If you aren't adding anything valuable, then you probably are better off using the standard scripting language than adding a new one.
 
Note that I am not saying don't ever use a custom scripting language; what I'm saying is if you can handle it with a standard scripting language, long term, it will be to everyone's benefit.
 
In this case (processing text files), its hard to see where you couldn't expose a StreamReader, a Name Value collection, and an event (callback) function and then use a standard scripting language. Doing this means less work (in terms of implementation) for the scripting language, and gets you more power during the processing of the file.
 
Again, I am not saying there aren't cases where it is beneficial or where you "need" to do this, where you can't live without it. But, you are ensuring you will pay more of a maintanence cost long term (by long term, I am talking 3 years+) than using a standard format. This cost includes end user training and support, internal training and support, bug fixes, etc.
 
There are times where these costs are necessary and unavoidable; unfortunately, some management people do not consider the costs involved in maintenance and support (or in the case of one of my previous employers, they are too busy actively defrauding their customers to be bothered with worrying about the long term cost of their decisions).
 
Doing what you want to do from C++, however, is very trivial.
 
You need a lexer and parser, which will convert "RECORD_FORMAT" = "FIXED_LENGTH" to a binary format. This is so you don't have to read the text, lex, parse every time.
 
It is usually best to create fixed length records for a number of reasons (I prefer to use rows in a database table for the binary format, with a column for each potential parameter), but primarily because it limits the per-token processing, at a small cost in space. The structure in C looks something like:
struct binaryToken
{
unsigned short int tokenID;
unsinged short int intParms[3];
unsigned char charParms[3][80];
}
 
This wastes some space - you can use a string table or whatever to recover that space, if you prefer. You create a "dispatch table." You set entry 1 to point at a function that can handle token 1, you set entry 2 to point at a function that can handle token 2, etc. You sit in a tight loop, you read one binaryToken from whereever you are storing it, then you pass it to the routine that handles it until you reach an "end" token or the file ends, whichever comes first.
 
A dispatch table is, typically, on the order of 10x faster than a switch/case statement, dependent on the number of tokens. The larger your grammar is, the faster a dispatch table approach is. The reason for using fixed size tokens is that you can preallocate the token structure, and use the same structure on every call (versus reloading after each token). On most modern processors, this means your structure will be in the on-processor cache, and life doesn't get any faster than that.
 
The other reason for using fixed size tokens is if you have to implement branching (which is necessary for if/then statements, etc.), you don't have to compute or depend on a byte offset. So if a month or year from now, you have to change the size of the token or any component of it, you'll have an easy way to convert. Also, if you use a grammar that requires one statement per line (like VB) vs. one that allows multiple statements/expressions, you can typically use the token offset to match the line number, so you have a line number for error messages.
 
Unless there is a compelling reason (file size, etc.) to use variable length tokens, I would avoid them entirely. If there is a compelling reason, consider using bit-extended integers (0-127 is stored in byte 1, if the high bit is set, then shift left 7 positions and or the next byte, until the high bit is 0), as this solves both most endian issues and all integer length issues.
 
A string table can be valuable in the fixed size token, because it reduces the size of each token (in my example, this would be a huge savings -- 240 bytes per token).
 
Keep in mind, if you need more than 3 parameters, you can always span to a second token. Smile | :)
GeneralKlimstra Parse Enginememberlarma17 Aug '05 - 3:10 
The whole parser topic is quite interesting. I downloaded source code for the Klimstra Parse Engine and (modified version) from the GOLD Parser website, both don't quite work with your source. There seems to be incompatible parameter type in the Parser constructor. The Engine source dated quite sometime ago, so I guess you could be using another engine. Is there a link to the latest parser engine?
 
Thanks
GeneralRe: Klimstra Parse Enginememberlarma18 Aug '05 - 4:19 
Thankyou very much for the VB engine.
GeneralGreat Article!sussAnonymous15 Jun '05 - 10:48 
I found the article very helpful and a very well written.
GeneralNatural Language ParsingmemberThomas Lykke Petersen9 Jun '05 - 2:06 
I am totally green when it comes to parsers, compilers and that stuff, but I am thinking of trying to learn about natural language parsing, which is ofcourse different from a programming language, but is it possible to use something like GOLD parser to build up semantix to parse a natural language????
 

 
Regards,
Thomas

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Rant Rant    Admin Admin   

Permalink | Advertise | Privacy | Mobile
Web03 | 2.6.130516.1 | Last Updated 27 May 2005
Article Copyright 2005 by Adrian_Moore
Everything else Copyright © CodeProject, 1999-2013
Terms of Use
Layout: fixed | fluid