Click here to Skip to main content
Click here to Skip to main content

A Regular Expression Tokenizer using the YARD Parser

, 12 Dec 2004
Rate this:
Please Sign up or sign in to vote.
A tokenizer using the YARD parser which can recognize regular expressions.

Introduction

This article presents a simple tokenizer using the YARD (Yet Another Recursive Descent) parser. We use the tokenizer to run through a simple text file and generate a list of words.

Tokenizers

A tokenizer creates a list of strings or tokens from the input according to some simple rules. Usually, tokenizers are expressed in terms of the delimiter (i.e., what separates the tokens, e.g., commas or white space). Instead, we present the YARD string tokenizer which allows us to parse any kind of token expressible as a regular expression.

The YARD StringTokenizer parser mechanics are very simple. It requires a regular expression description (called a rule) of the token to be passed as a template parameter. It provides a single function, void Parse(char const* pBegin, char const* pEnd) which causes the Tokenizer to generate a list of tokens according to the rules passed as a template argument. For instance:

  // p is a pointer to a null terminated string
  const char* pEnd = &(p[strlen(p)]);
  StringTokenizer<MatchWord> tknzr;
  tknzr.Parse(p, pEnd);
  OutputTokens(tknzr.begin(), tknzr.end());

In order to work, the StringTokenizer requires that we declare it with a rule which describes each token. The rule we are using is defined in the file "rules.hpp", as follows:

  typedef MatchCharRange<'a', 'z'> MatchLowerCaseLetter;
  typedef MatchCharRange<'A', 'Z'> MatchUpperCaseLetter;
  typedef re_or<MatchLowerCaseLetter, MatchUpperCaseLetter> MatchLetter;
  typedef re_or<MatchLetter, MatchChar<'\''> > MatchWordChar;
  typedef re_plus<MatchWordChar> MatchWord;

Notice that the rules are described in terms of other rules, combined using what are known as the regular expression operators. Briefly, they are:

  • re_opt<Rule_T> - Matches Rule_T 0 or 1 times.
  • re_star<Rule_T> - Matches Rule_T 0 or more times.
  • re_plus<Rule_T> - Matches Rule_T 1 or more times.
  • re_repeat<Rule_T, N> - Matches Rule_T exactly N times.
  • re_or<Rule_T0, Rule_T1> - Matches Rule_T0 or failing that matches Rule_T1.
  • re_and<Rule_T0, Rule_T1 > - Matches Rule_T0 and then matches Rule_T1.

The YARD StringTokenizer Class

Here is the source for the StringTokenizer class and the auxiliary classes:

  typedef std::pair<char const*, char const*> Token;
  typedef std::list<Token> TokenList;
  typedef TokenList::const_iterator TokenIter;

  template<typename Rules_T>
  struct StringTokenizer
  {
    void Parse(char const* pBegin, char const* pEnd)
    {
      ParserInputStream<char> stream(pBegin, pEnd);
      while (!stream.AtEnd()) {
        char const* pos = stream.GetPos();
        if (Rules_T::Accept(stream)) {
          mTkns.push_back(Token(pos, stream.GetPos()));
        }
        stream.GotoNext();
      }
    }
    TokenIter begin() { return mTkns.begin(); }
    TokenIter end() { return mTkns.end(); }
  private:
    TokenList mTkns;
  };

Outputting Tokens

Once we parse the text file, we still need to do something with the tokens. Here is a function which outputs the first 10 tokens as strings to the standard output stream:

  void OutputTokens(TokenIter iter, TokenIter end)
  {
    // outputs first 10 tokens as strings
    for (int i=0; (i < 10) && (iter != end); i++, iter++) {
      int n = iter->second - iter->first;
      std::string s(iter->first, 0, n);
      std::cout << s.c_str() << std::endl;
    }
  }

Summary

That is the entire library in a nutshell. A tokenizer is only the tip of the iceberg in terms of what you can do with the YARD library. Have fun experimenting with the parser, and you may be surprised at the kinds of tasks you can apply it to.

Postscript

It is worth making mention of the Boost Spirit library which has the capability to do some of the same things as the YARD parser. Writing a similar tokenizer using Spirit could be done as follows (Note: I am not an expert at Spirit, it is not a tool that can be learned in an afternoon):

  #include <boost/spirit/core.hpp>
  #include "../yard/tokenizer.hpp"

  using namespace boost::spirit;
  using namespace yard;

  TokenList* gpTkns;

  void StoreToken(char const* str, char const* end) {
    gpTkns->push_back(Token(str, end));
  }

  void SpiritTest(char const* str, TokenList* pTkns) {
    gpTkns = pTkns;
    rule<> word_r = (alpha_p >> *(alpha_p | ch_p('\'')));
    rule<> r = *(word_r[&StoreToken] | space_p);
    parse(str, r);
  }

The Spirit library is designed with a completely different approach and is much more complex than the YARD parser. On the other hand, it comes with more functionality out of the box, once you figure out how to use it. Anyway, the choice is yours, happy parsing.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

A list of licenses authors might use can be found here

About the Author

Christopher Diggins
Software Developer Autodesk
Canada Canada
This article was written by Christopher Diggins, a computer science nerd who currently works at Autodesk as an SDK specialist.
Follow on   Twitter   Google+   LinkedIn

Comments and Discussions

 
Generalextending the grammar PinmemberMaxim Khesin28-Sep-05 10:41 
GeneralNew Version of the YARD parser PinmemberChristopher Diggins30-Sep-05 16:49 
I am sorry Max, I simply don't know anymore what the answer would be concerning this posted version of YARD. It is quite old. I have been using a significantly improved version of the YARD parser which is not documented, but is part of the Heron download at http://www.heron-language.com
 
To achieve what you want using the latest version of YARD you could write:
 
template<typename T>
struct advance_if_not :
  bnf_and<
    bnf_not<
      T
    >,
    AnyChar
  >
{ }
 
I have no benchmarks for YARD, but for my own work (building a language translator and interpreter) it is exceedingly fast.
 
Thanks for the kind words, and I wish you the best of luck.
 
Christopher Diggins

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.

| Advertise | Privacy | Mobile
Web04 | 2.8.140709.1 | Last Updated 13 Dec 2004
Article Copyright 2004 by Christopher Diggins
Everything else Copyright © CodeProject, 1999-2014
Terms of Service
Layout: fixed | fluid