Parsing XML in C++ using the YARD Parser

Christopher Diggins
Rate me:
4.79/5 (23 votes)
21 Dec 20046 min read
87.7K
1.2K
Provides a set of tools for building XML parsers in C++ using the YARD recursive descent parser.
yard.zip
- html_test.hpp
- parser.hpp
- parser_input_stream.hpp
- re_ops.hpp
- rules.hpp
- scanner.hpp
- test.html
- test.txt
- test.xml
- tokenizer.hpp
- utils.hpp
- xml_grammar.hpp
- xml_test.hpp
- yard.cpp
- yard.hpp
- yard.vcproj
<h1>Parsing XML in C++ using the YARD Parser</h1>

<h2>Introduction</h2>

<p>The YARD parser is a generic recursive descent parser for C++. The YARD parser
can be used to parse any pattern that can be expressed as a context-free grammar (CFG).
This article uses version 0.2 of the YARD parser, as opposed to the previous article
I posted <a href="http://www.codeproject.com/cpp/yard-tokenizer.asp">A Regular Expression
Tokenizer using the YARD Parser</a>. There are a few minor syntactic changes, as well
as more features, such as support for user-defined semantic actions.
</p>

<p>My goal wasn't to write a complete XML parser, but rather provided a practical demo of the YARD
parser doing a real-world task which could be useful in some circumstances. If a programmer wants
a more complete version of the XML parser, they are of course free to do and are encouraged (but
not obliged) to share their modifications. This source code is entirely public domain.
</p>

<p><b>Note</b>: This code only works on Visual C++ 7.1 or better.
</p>

<h2>Context Free Grammars (CFG)</h2>

<p>A context-free grammar (CFG) is a way of expressing a pattern, like a regular expression (theoretical regular expressions,
not the Perl kind). In fact for every CFG there is a regular expression and vice versa. A CFG is typically expressed in some kind
of normal form, such as an EBNF, which expresses a CFG as a set of grammar productions. A CFG lends itself to the writing of
a tool known as a recursive descent parser (R-D parser). An example of an annotated CFG for XML can be found at
<a href="http://www.w3.org/TR/REC-xml/">http://www.w3.org/TR/REC-xml/</a>.
</p>

<h2>Grammar Productions</h2>

<p>A grammar is described typically as a series of grammar productions. There are the basic types of productions
when describing a CFG are:
</p>

<ul>
  <li><code>C ::== A</code> : Renaming</li>
  <li><code>C ::== AB</code> : Concatenation</li>
  <li><code>C ::== A | B</code> : Union</li>
  <li><code>C ::== A *</code> : Kleene star</li>
  <li><code>C ::== null</code> : Empty set match</li>
</ul>

<p>The notation used is a semi-formal syntax known as a BNF (Backus Naur Form)
Even though, these rules are sufficient for describing a CFG, more operations are often desirable for
convenience sake, such as:</p>

<ul>
  <li><code>C ::== A k</code> : The concatenation of <code>A</code> k times</li>
  <li><code>C ::== A ?</code> : <code>C ::== A | null</code>
  <li><code>C ::== A +</code> : <code>C ::== A A*</code></li>
</ul>

<p>These extended operations (and others) when used with a BNF are known as an EBNF (Extended Backus Naur Form).
</p>

<h2>The Parser</h2>

<p>The YARD parser works by taking a starting grammar production (called a rule in YARD)
and an input data sequence passed as a pair of iterators. The Parser function returns true or false
depending on whether the input data matches the grammar.
</p>

<pre>
  if (Parse&lt;xml_grammar::document, char, char const*>(pBuf, pEnd)) {
    puts("successfully parsed xml document");
  }
  else {
    puts("failed to parse xml document");
  }
 </pre>

<p>This is how most parsers work, and of course this in of itself isn't much use to anybody.
Like most other parsers, the YARD parser allows the definition of semantic actions.
</p>

<h2>Semantic Actions</h2>

<p>A semantic action is like an events or call-backs that is triggered at a specific time
during the parsing process. Most parsers requires semantic actions to be embedded directly
in the grammar itself, this is not the case with YARD.
YARD is very flexible, and nothing stops the industrious programmer from writing their own
rules which have embedded semantic actions.
</p>

<p>A semantic action in YARD is defined by creating a template specialization of the following
type:
</p>

<pre>
  template&lt;typename Iter_T, typename Rules_T>
  struct Actor {
    static void OnBefore(Iter_T pos) { }
    static void OnSuccess(Iter_T begin, Iter_T end) { }
    static void OnFailure(Iter_T pos) { }
  };</pre>

<p>Creating a template specialization means that an overloaded version of the type is defined for specific
template parameters. An example of a specialization is:
</p>

<pre>
  template&lt;>
  struct yard::Actor&lt;char const*, xml_grammar::STag> : public yard::BaseActor {
    static void OnBefore(char const* pos) { }
    static void OnFailure(char const* pos) { }
    static void OnSuccess(char const* begin, char const* end) {
      std::string s(begin, end - begin);
      s = "STag: " + s;
      puts(s.c_str());
    }
  };</pre>

<h2>YARD Rules</h2>

<p>When describing a grammar production for the YARD parser, it is called a rule. A rule is a type which matches the following:
</p>

<pre>
  struct rule {
    template&lt;typename Elem_T>
    static bool Accept(ParserInputStream&lt;Elem_T>& stream) {
      // ...
    }
  };</pre>

<p>A rule is a pattern matcher, which recognizes input and returns true or false depending on whether it is successful or not
</p>

<p>Most people are concerned with only matching text, so many of the predefined rules assume the <code>Elem_T</code> parameter to be
of type <code>char</code>. Some of the predefined text parsing rules found in <code>rules.hpp</code> are:
</p>

<pre>
  typedef MatchCharRange&lt;'a', 'z'> MatchLowerCaseLetter;
  typedef MatchCharRange&lt;'A', 'Z'> MatchUpperCaseLetter;
  typedef MatchCharRange&lt;'0', '9'> MatchNumber;
  typedef re_or&lt;MatchLowerCaseLetter, MatchUpperCaseLetter> MatchLetter;
  typedef re_or&lt;MatchLetter, MatchChar&lt;'\''> > MatchWordChar;
  typedef re_plus&lt;MatchWordChar> MatchWord;
  typedef re_or&lt;MatchLetter, MatchChar&lt;'_'> > MatchIdentFirstChar;
  typedef re_or&lt;MatchIdentFirstChar, MatchNumber> MatchIdentOtherChar;
  typedef re_and&lt;MatchIdentFirstChar, re_star&lt;MatchIdentOtherChar> > MatchIdent;</pre>

<p>These rules are defined in terms of a core set of text parsing rules:
</p>

  <pre>
  struct MatchChar { ... }
  struct MatchCharRange { ... }
  struct MatchString { ... }
  struct MatchStringNoCase { ... }
  struct MatchWSpace { ... }</pre>

<h3>Regular Expression Meta-Functions</h3>

<p>The key to YARD is the ability to combine rules using the following set of
regular expression operators:
</p>

<UL>
  <LI><CODE>re_opt&lt;Rule_T&gt;</CODE> - Matches <CODE>Rule_T</CODE> 0 or 1 times.
  <LI><CODE>re_star&lt;Rule_T&gt;</CODE> - Matches <CODE>Rule_T</CODE> 0 or more times.
  <LI><CODE>re_plus&lt;Rule_T&gt;</CODE> - Matches <CODE>Rule_T</CODE> 1 or more times.
  <LI><CODE>re_repeat&lt;Rule_T, N&gt;</CODE> - Matches <CODE>Rule_T</CODE> exactly N times.
  <LI><CODE>re_or&lt;Rule_T0, Rule_T1&gt;</CODE> - Matches <CODE>Rule_T0</CODE> or failing that matches <CODE>Rule_T1</CODE>.
  <LI><CODE>re_and&lt;Rule_T0, Rule_T1&gt;</CODE> - Matches <CODE>Rule_T0</CODE> and then matches <CODE>Rule_T1</CODE>. </LI>
  <LI><CODE>re_or3&lt;Rule_T0, Rule_T1, Ru&gt;</CODE> - Matches <CODE>Rule_T0</CODE> or failing that matches <CODE>Rule_T1</CODE>.
  <LI><CODE>re_and3&lt;Rule_T0, Rule_T1&gt;</CODE> - Matches <CODE>Rule_T0</CODE> and then matches <CODE>Rule_T1</CODE>. </LI>
  <LI><CODE>re_until&lt;Rule_T&gt;</CODE> - Matches everything up to and including <CODE>Rule_T</CODE>, fails if it reaches the end of input</LI>
</UL>

<p>These are called meta-functions, but really they are simply parameterized types. The reason we call this meta-programming is because
the parsing algorithm is expressed using the same technique as functional programming.
</p>

<h2>Under The Hood: The Pattern Matching Algorithms</h2>

<p>The YARD engine uses a brute force trial and error matching algorithm. There is a trade-off of some speed for ease of use and simplicity.
It was my goal to design the simplest possible generic R-D parser. The YARD parser is nonetheless sufficiently fast for
most purposes.
</p>

<h2>The XML Grammar</h3>

<p>The XML grammar that I use the YARD parser to read was lifted directly from
<a href="http://www.w3.org/TR/REC-xml/">http://www.w3.org/TR/REC-xml/</a>. Since this is more of a demo than an industrial strength
XML parser, I have cut several corners (i.e. left out features, and relaxed certain constraints),
at the same time, I have been more true to the grammar than many other so-called open-source "XML parsers". The naming of
the productions is taken from the official XML grammar.
</p>

<p>The grammar productions (YARD rules) are contained in the file <tt>xml_grammar.hpp</tt>. The starting production is <code>document</code>
which is at the bottom of the file. YARD grammars have to be read starting from the bottom of the file. This has to due with C++ compilation rules.
Another artifact of compilation order dependencies in C++ is that cyclical type references have to be broken using functions. You will notice
four functions at the bottom of the file: <code>AcceptElement()</code>, <code>AcceptComment()</code>, <code>AcceptCDSect()</code>, <code>AcceptPI()</code>
which are required because they represent recursive grammar productions.
</p>

<p>Here is a small example of the XML grammar:
</p>

<pre>
  struct PI : public
    re_and&lt;
      MatchString&lt;PI_begin_string>,
      re_until&lt;
        MatchString&lt;PI_end_string>
      >
    >
    { };

  struct Misc : public
    re_or3&lt;
      Comment,
      PI,
      S
    >
    { };

  struct prolog : public
    re_and3&lt;
      re_opt&lt;XMLDecl>,
      re_star&lt;Misc>,
      re_opt&lt;
        re_and&lt;
          doctypedecl,
          re_star&lt;Misc>
        >
      >
    >
    { };

  struct document : public
    re_and3&lt;
      prolog,
      element,
      re_star&lt;Misc>
    >
    { };
    </pre>

<h2>Using The XML Parser</h3>

<p>The XML parser is initiated and executed from the following function inside of yard.cpp:
</p>

<pre>
  void RunXmlTest()
  {
    puts("testing xml parser");
    char* pBuf = AllocFromFile("c:\\doc.xml");
    const char* pEnd = &(pBuf[strlen(pBuf)]);
    {
      TimeIt t;
      if (Parse&lt;xml_grammar::document, char, char const*>(pBuf, pEnd))
      {
        puts("parsed xml document");
      }
      else {
        puts("failed to parse xml document");
      }
    }
    free(pBuf);
  }
</pre>

<p>There appears to be something mysterious going on, because the parser doesn't apparently do anything. In fact the parser automatically
calls the semantic actions which are defined in the file <tt>xml_test.hpp</tt>. Semantic actions can be defined anywhere, and are
automatically associated with the parser, because they are defined as template specializations. There are two semantic actions defined,
which upon a successful match of the pattern <code>STag</code> or <code>ETag</code> output the text of the match to the standard output stream.
Here is the <code>STag</code> matching semantic action defined:
</p>

<pre>
  template&lt;>
  struct yard::Actor&lt;char const*, xml_grammar::STag>  {
    static void OnBefore(char const* pos) { }
    static void OnFailure(char const* pos) { }
    static void OnSuccess(char const* begin, char const* end) {
      std::string s(begin, end - begin);
      s = "STag: " + s;
      puts(s.c_str());
    }
  };
</pre>

<p><b>Note:</b> Because of the way the R-D parser works, <code>yard::Actor::OnSuccess(...)</code> will be triggered immediately upon a successful match,
even if a parent production ultimately fails.
</p>

<h2>About the Project</h2>

<p>The source project includes all of the work on the YARD parser up to the current moment, and runs three tests: it tests the the parser along with
the string tokenizer and the scanner which are part of YARD but are not discussed in this article.
</p>

<h2>Summary</h2>

<p>I have only just touched on the potential of the YARD parser, and R-D parsers in general. This also isn't even a complete XML parser,
but hopefully it will provide the motivated reader with enough information to go ahead and implement a more funtional and useful XML parser.
</p>
By viewing downloads associated with this article you agree to the Terms of Service and the article's licence.
If a file you wish to view isn't highlighted, and is a text file (not binary), please let us know and we'll add colourisation support for it.
License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.
A list of licenses authors might use can be found here
Written By
Christopher Diggins
Software Developer Ara 3D
Canada
I am the designer of the Plato programming language and I am the founder of Ara 3D. I can be reached via email at cdiggins@gmail.com
Parsing XML in C++ using the YARD Parser

License

Comments and Discussions