A Regular Expression Tokenizer using the YARD Parser

Christopher Diggins

4.72/5 (15 votes)

Dec 13, 2004

2 min read

51223

902

A tokenizer using the YARD parser which can recognize regular expressions.

Download source files - 68.9 Kb

Introduction

This article presents a simple tokenizer using the YARD (Yet Another Recursive Descent) parser. We use the tokenizer to run through a simple text file and generate a list of words.

Tokenizers

A tokenizer creates a list of strings or tokens from the input according to some simple rules. Usually, tokenizers are expressed in terms of the delimiter (i.e., what separates the tokens, e.g., commas or white space). Instead, we present the YARD string tokenizer which allows us to parse any kind of token expressible as a regular expression.

The YARD StringTokenizer parser mechanics are very simple. It requires a regular expression description (called a rule) of the token to be passed as a template parameter. It provides a single function, void Parse(char const* pBegin, char const* pEnd) which causes the Tokenizer to generate a list of tokens according to the rules passed as a template argument. For instance:

  // p is a pointer to a null terminated string
  const char* pEnd = &(p[strlen(p)]);
  StringTokenizer<MatchWord> tknzr;
  tknzr.Parse(p, pEnd);
  OutputTokens(tknzr.begin(), tknzr.end());

In order to work, the StringTokenizer requires that we declare it with a rule which describes each token. The rule we are using is defined in the file "rules.hpp", as follows:

  typedef MatchCharRange<'a', 'z'> MatchLowerCaseLetter;
  typedef MatchCharRange<'A', 'Z'> MatchUpperCaseLetter;
  typedef re_or<MatchLowerCaseLetter, MatchUpperCaseLetter> MatchLetter;
  typedef re_or<MatchLetter, MatchChar<'\''> > MatchWordChar;
  typedef re_plus<MatchWordChar> MatchWord;

Notice that the rules are described in terms of other rules, combined using what are known as the regular expression operators. Briefly, they are:

re_opt<Rule_T> - Matches Rule_T 0 or 1 times.
re_star<Rule_T> - Matches Rule_T 0 or more times.
re_plus<Rule_T> - Matches Rule_T 1 or more times.
re_repeat<Rule_T, N> - Matches Rule_T exactly N times.
re_or<Rule_T0, Rule_T1> - Matches Rule_T0 or failing that matches Rule_T1.
re_and<Rule_T0, Rule_T1 > - Matches Rule_T0 and then matches Rule_T1.

The YARD StringTokenizer Class

Here is the source for the StringTokenizer class and the auxiliary classes:

  typedef std::pair<char const*, char const*> Token;
  typedef std::list<Token> TokenList;
  typedef TokenList::const_iterator TokenIter;

  template<typename Rules_T>
  struct StringTokenizer
  {
    void Parse(char const* pBegin, char const* pEnd)
    {
      ParserInputStream<char> stream(pBegin, pEnd);
      while (!stream.AtEnd()) {
        char const* pos = stream.GetPos();
        if (Rules_T::Accept(stream)) {
          mTkns.push_back(Token(pos, stream.GetPos()));
        }
        stream.GotoNext();
      }
    }
    TokenIter begin() { return mTkns.begin(); }
    TokenIter end() { return mTkns.end(); }
  private:
    TokenList mTkns;
  };

Outputting Tokens

Once we parse the text file, we still need to do something with the tokens. Here is a function which outputs the first 10 tokens as strings to the standard output stream:

  void OutputTokens(TokenIter iter, TokenIter end)
  {
    // outputs first 10 tokens as strings
    for (int i=0; (i < 10) && (iter != end); i++, iter++) {
      int n = iter->second - iter->first;
      std::string s(iter->first, 0, n);
      std::cout << s.c_str() << std::endl;
    }
  }

Summary

That is the entire library in a nutshell. A tokenizer is only the tip of the iceberg in terms of what you can do with the YARD library. Have fun experimenting with the parser, and you may be surprised at the kinds of tasks you can apply it to.

Postscript

It is worth making mention of the Boost Spirit library which has the capability to do some of the same things as the YARD parser. Writing a similar tokenizer using Spirit could be done as follows (Note: I am not an expert at Spirit, it is not a tool that can be learned in an afternoon):

  #include <boost/spirit/core.hpp>
  #include "../yard/tokenizer.hpp"

  using namespace boost::spirit;
  using namespace yard;

  TokenList* gpTkns;

  void StoreToken(char const* str, char const* end) {
    gpTkns->push_back(Token(str, end));
  }

  void SpiritTest(char const* str, TokenList* pTkns) {
    gpTkns = pTkns;
    rule<> word_r = (alpha_p >> *(alpha_p | ch_p('\'')));
    rule<> r = *(word_r[&StoreToken] | space_p);
    parse(str, r);
  }

The Spirit library is designed with a completely different approach and is much more complex than the YARD parser. On the other hand, it comes with more functionality out of the box, once you figure out how to use it. Anyway, the choice is yours, happy parsing.