Click here to Skip to main content
Click here to Skip to main content

Tokenizer and analyzer package supporting precedence prioritized rules

By , 1 Jan 2002
 

grammarIDE   Evaluation

Introduction

This library has its origin in some study project I am currently working on. All source code is covered by the LGPL - this means you can freely use the code in any project (commercial or non-commercial).

How it works

The tokenizer uses a hierarchical map structure to store the tokens, and should be as fast as the paragon lexx (at least n*O(lexx) :-). The analyzer does not have the restriction of being a LALR(1) analyzer (if used correctly), has some caching capabilities and allows resolution of literal tokens. It also supports a precedence prioritized rule set.

All projects make heavy use of the STL. Because the STL implementation shipped with MSVC6 is not the best option, you should download the STL Port 4.5.1 implementation freely available at www.stlport.org. I have now managed to get rid most of the C4786 warnings (hint: switch /FI).

How to setup the analyzer/tokenizer

A typical rule definition for a simple expression evaluator looks like this:

std::tstringstream init(
    "[seperators]\n"
    "200:+\n"       "201:-\n"
    "202:*\n"       "203:/\n"
    "204:^\n"       "205:;\n"
    "206:(\n"       "207:)\n"
    "' Whitespace tokens:\n"
    "0: \n"         "0:\t\n"
    "0:\\n\n"       "0:\\r\n"
    "[rules]\n"
    "300:numbers\n"
    "[grammar]\n"
    "401:{.expr}=100:{.expr}{$+}{.expr}\n"
    "402:{.expr}=100:{.expr}{$-}{.expr}\n"
    "403:{.expr}=99:{.expr}{$*}{.expr}\n"
    "404:{.expr}=99:{.expr}{$/}{.expr}\n"
    "405:{.expr}=98:{.expr}{$^}{.expr}\n"
    "406:{.expr}=0:{$(}{.expr}{$)}\n"
    "400:{.expr}=0:{!number}\n"
    "500:{.line}=0:{.expr}{$;}\n");

This initializes the tokenizer to recognize the separators '+', '-', '*', '/', '^', ';', '(', ')' and to skip the usual white space characters, and initializes the analyzer to recognize the math rules '+,*,-,/,^' and declares that '*' has a higher priority than '+', for example.

The interface class

Usually you have to deal only with one class, cxtPackage. This class exports all methods needed to access the library. Some of them are:

  • vSetInputStream() - sets the input stream :)
  • vSetDelimeterIDs() - can be used to set the end-of-statement tokens, in C++ this could be ';'
  • nReadUntilDelimeter() - parses the input stream until the next delimiting token is found or the end of the input stream is reached
  • papbCheckForRule() - analyzes the token stream for a given rule and returns a parse tree (if successful)
  • vRebalance() - rearranges the parse tree according to the precedence priority rules

Those are the most important ones. For more details please see the sample project or check the page http://www.subground.cc/devel which will have some minimal documentation on the classes available for download soon.

New: The grammar IDE

The new grammar IDE included in the complete download provides a environment to develop and test analyzer rulesets. It has some syntax-highlighting features, shows errors by marking the lines in the editor and has an integrated test environment to live-check the results of the ruleset. I have no documentation yet and the IDE is still early beta and it has some cosmetic bugs (for example, it is possible to insert via clipboard RTF-formatted text into the editor :-), but most of it is already usable.

The sample projects

There are two sample projects included in the complete download. One is an almost-empty sample application you can easily use to explore the library, and the other project, simpleCalc, shows how to use the library to build a simple expression evaluator in 200 lines. A step-by-step explanation on how the sample works is here.

How to use it in own projects

Make sure to insert the projects cxTokenizer, cxAnalyzer and cxtPackage in the workspace of your project. Adjust the dependencies of your project to depend on cxtPackage (which itself depends on cxAnalyzer, which in turn depends on cxTokenizer). If your project doesn't use the MFC, you have to include the file common.cpp which is located in the base directory of the project files in your project. If you have problems or questions, feel free to mail them to alexander-berthold@web.de. For the most recent updates, see also http://www.subground.cc/devel

Updates

  • 2002-01-02
    C++ (w/o templates and pre-processor) grammar is now included. See also here.
  • 2001-12-29
    Fixed small bug in analyzer. IDE download now includes a not yet complete C grammar.
  • 2001-12-27
    Updated Homepage-URL to ad-free host.
  • 2001-12-26
    Uploaded new release including the grammarIDE and some enhancements in the analyzer.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

A list of licenses authors might use can be found here

About the Author

Alexander Berthold
Web Developer
Germany Germany
Member
No Biography provided

Sign Up to vote   Poor Excellent
Add a reason or comment to your vote: x
Votes of 3 or less require a comment

Comments and Discussions

 
You must Sign In to use this message board.
Search this forum  
    Spacing  Noise  Layout  Per page   
QuestionSeriously, why not just use lex???memberMatt Wigdahl28 Dec '01 - 9:50 
Granted lex and yacc (really, flex and bison) produce pretty ugly code.   But you should never need to even look at that code.   In my MFC projects that need parsing I include the clean, understandable .l and .yy files in my project with custom build steps that invoke flex and bison on them.   The resultant files are included in the build without my ever needing to see them.  
 
The advantages of using the existing tools are:   Flex and bison are free, available today, extremely fast, portable and almost completely rock-stable; their behavior is well-defined and lots of people know how to use them.   Contrast this to your yacc replacement, which will be buggy, and will have non-standard syntax.
 
Really, we should all be happy that flex produces ugly code.   It's that static, ugly code that makes it so screamingly fast since the token map is implemented in compilable, optimizable, C code.   Your solution implements lexing dynamically which almost guarantees that it will be drastically slower than flex for equivalent tasks.

 
Matthew L. Wigdahl
Software Engineer
matt@scriptpro.com
AnswerRe: Seriously, why not just use lex???memberAlexander Berthold28 Dec '01 - 11:32 
Take a look at the file at http://www.subground.cc/devel/sample-expression.txt.
It consists of 13516 tokens and is 18KB large; the whole file is a large C++ expression.
 
Using the "complex-grammar.txt" grammar provided with the grammarIDE binary and testing for the rule ".uexpr" (a C++ expression rule), it takes on my machine (Athlon 1.33 Ghz, 512 MB DDR) approx. 79 msec for the tokenizing stage and 53 msec for the analyzing stage (standard Release build).
 
Memory requirements (with disabled preview window):
Before tokenizing/analyzing
the memory occupied is 4.564 KB, with the complete parse tree of the expression above in memory (after tokenizing/analyzing) the memory consumption is 6572 KB.
 
If you want to try, i can send you a slightly modified version which measures using QueryPerformanceCounter the time consumed for each step.
GeneralRe: Seriously, why not just use lex???memberMatt Wigdahl28 Dec '01 - 11:50 
I would be very interested in performing a comparison test using a complex grammar such as C++ with a very large dataset. I'm not sure that the data you provide is directly comparable, however, for the following reason:
 
When a scanner and parser are generated by flex and bison, the resulting code takes the input text and analyzes it with no assumptions as to what rule will end up as terminal. Your grammarIDE program seems to require me to specify what type of syntactic element I think the input text will be, which bypasses some analysis and is unrealistic for real-world applications. I tried to paste some arbitrary C++ code into the grammarIDE and analyze it, but except for your example and one variable declaration I could not get anything to resolve. Perhaps I did not specify the correct rule to evaluate.
 
Is your grammar complete? Do you have a large block of sample C++ representing an entire nontrivial program that you can feed through your grammar using a rule like ".program" and get it to parse? If so, I'd be interested in timing it against an equivalent flex/bison C++ grammar.
 
If you like, we can use a simpler grammar and a very large amount of input (say, parsing sentences out of the contents of Project Gutenberg) -- this might make the comparison more straightforward by rendering the respective inefficiencies of the different grammar implementations less of an issue.
 
Matthew L. Wigdahl
Software Engineer
matt@scriptpro.com
GeneralRe: Seriously, why not just use lex???memberAlexander Berthold28 Dec '01 - 23:02 
I have made a C grammar and for the pseudo-C code at http://www.subground.cc/devel/sample-pseudo-c-prog.txt it takes 80 msec to tokenize and 67 msec to analyze and build the parse tree (".globalscopeblock"). I corrected also a small bug, if you want to try the grammar, please download again the binary.
 
The idea of trying a large dataset is interesting. Maybe i can contact you via mail?
GeneralJust a correctionmemberGabriel Praino28 Dec '01 - 2:48 
Thank's for the code, good job!! Just a correction.
Mathematic equations are evaluated from right to left, not from left to right.
If the example equation shown above had been:
1-2*(3-4)+8
your program would have divided it as:
1-expr, where expr = 2*(3-4)+8
and this is wrong. It should be: expr+8, and then divide expr as: 1-expr2.
 
Good luck

GeneralRe: Just a correctionmemberAlexander Berthold28 Dec '01 - 3:32 
??? What are you talking about?
 
Try it ... Of course the associativity is defined by the grammar.
GeneralRe: Just a correctionmemberkuhx19808 Dec '03 - 1:36 
but when I do this:
1-2*(3-4)*8
 
I think it should executed as:
(3-4)
2*(3-4)
2*(3-4)*8
1-2*(3-4)*8
 
but it seem be executed as thisFrown | :(
(3-4)
1-2*(3-4)
1-2*(3-4)*8
 
can you help me? Confused | :confused:


GeneralRe: Just a correctionmemberTim Smith29 Dec '01 - 2:20 
Math equations are left to right.
 
1-2*(3-4)+8
 
Executed as:
 
(3-4)
2*(3-4)
1-2*(3-4)
1-2*(3-4)+8
 
Which is what it looks like his code is doing.
 

 
Tim Smith
Descartes Systems Sciences, Inc.
General'lexx' and 'yacc' and AntlrmemberPaul Selormey25 Dec '01 - 20:09 
Thanks for the great work.
Now, how do you compare the final result with 'lexx' and 'yacc'?
 
Have you tried Antlr? If yes, how do you compare this with your sources?
 
Best regards,
Paul.

 
Paul Selormey, Bsc (Elect Eng), MSc (Mobile Communication) is currently Windows open source developer in Japan, and open for programming contract anywhere!
GeneralRe: 'lexx' and 'yacc' and AntlrmemberAlexander Berthold25 Dec '01 - 23:52 
Hi, thanks for your comment.
Intentionally i did not try to get too deep into 'lexx' and 'yacc' before writing this library, so i do not really have experience using them.
 
But i think 'cxtPackage' is comparable to lexx and yacc. In fact, i am currently writing a C++ compiler built upon 'cxtPackage' and w/o templates i managed up to now to implement >70% of the language grammar using cxtPackage, even complex expressions like (int *(const *a[5])[3]) can be parsed without modifications.
I also never took a closer look at 'Antlr', but i will in the near future.
 
In terms of speed i'd say 'cxtPackage' could be even faster compared to lexx/yacc, but i have never tried this also.
 
Regards + Happy new year,
Alexander Berthold

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Rant Rant    Admin Admin   

Permalink | Advertise | Privacy | Mobile
Web01 | 2.6.130523.1 | Last Updated 2 Jan 2002
Article Copyright 2001 by Alexander Berthold
Everything else Copyright © CodeProject, 1999-2013
Terms of Use
Layout: fixed | fluid