Click here to Skip to main content
15,861,125 members
Articles / Programming Languages / C++
Article

Parsing XML in C++ using the YARD Parser

Rate me:
Please Sign up or sign in to vote.
4.79/5 (23 votes)
21 Dec 20046 min read 86.6K   1.2K   39   12
Provides a set of tools for building XML parsers in C++ using the YARD recursive descent parser.

Introduction

The YARD parser is a generic recursive descent parser for C++. The YARD parser can be used to parse any pattern that can be expressed as a context-free grammar (CFG). This article uses version 0.2 of the YARD parser as opposed to the previous article I posted: A Regular Expression Tokenizer using the YARD Parser. There are a few minor syntactic changes, as well as more features, such as support for user-defined semantic actions.

My goal wasn't to write a complete XML parser, but rather to provide a practical demo of the YARD parser doing a real-world task which could be useful in some circumstances. If a programmer wants a more complete version of the XML parser, they are of course free to do and are encouraged (but not obliged) to share their modifications. This source code is entirely public domain.

Note: This code only works on Visual C++ 7.1 or better.

Context Free Grammars (CFG)

A context-free grammar (CFG) is a way of expressing a pattern, like a regular expression (theoretical regular expressions, not the Perl kind). In fact, for every regular expression and there is a CFG. A CFG is typically expressed in some kind of normal form, such as an EBNF, which expresses a CFG as a set of grammar productions. A CFG lends itself to the writing of a tool known as a recursive descent parser (R-D parser). An example of an annotated CFG for XML can be found here.

Grammar Productions

A grammar is described typically as a series of grammar productions. There are the basic types of productions when describing a CFG:

  • C ::== A - Renaming
  • C ::== AB - Concatenation
  • C ::== A | B - Union
  • C ::== A * - Kleene star
  • C ::== null - Empty set match

The notation used is a semi-formal syntax known as a BNF (Backus Naur Form). Even though these rules are sufficient for describing a CFG, more operations are often desirable for convenience sake, such as:

  • C ::== A k - The concatenation of A k times
  • C ::== A ? - Equivalent to C ::== A | null
  • C ::== A + - Equivalent to C ::== A A*

These extended operations (and others) when used with a BNF are known as an EBNF (Extended Backus Naur Form).

The Parser

The YARD parser works by taking a starting grammar production (called a rule in YARD) and an input data sequence passed as a pair of iterators. The Parser function returns true or false depending on whether the input data matches the grammar.

if (Parse<xml_grammar::document, char, char const*>(pBuf, pEnd)) {
  puts("successfully parsed xml document");
}
else {
  puts("failed to parse xml document");
}

This is how most parsers work, and of course, this in itself isn't much use to anybody. Like most other parsers, the YARD parser allows the definition of semantic actions.

Semantic Actions

A semantic action is like an event or call-back that is triggered at a specific time during the parsing process. Most parsers require semantic actions to be embedded directly in the grammar itself, this is not the case with YARD. YARD is very flexible, and nothing stops the industrious programmer from writing their own rules which have embedded semantic actions.

A semantic action in YARD is defined by creating a template specialization of the following type:

template<typename Iter_T, typename Rules_T>
struct Actor {
  static void OnBefore(Iter_T pos) { }
  static void OnSuccess(Iter_T begin, Iter_T end) { }
  static void OnFailure(Iter_T pos) { }
};

Creating a template specialization means that an overloaded version of the type is defined for specific template parameters. An example of a specialization is:

template<>
struct yard::Actor<char const*, xml_grammar::STag> : public yard::BaseActor {
  static void OnBefore(char const* pos) { }
  static void OnFailure(char const* pos) { }
  static void OnSuccess(char const* begin, char const* end) {
    std::string s(begin, end - begin);
    s = "STag: " + s;
    puts(s.c_str());
  }
};

YARD Rules

When describing a grammar production for the YARD parser, it is called a rule. A rule is a type which matches the following:

struct rule {
  template<typename Elem_T>
  static bool Accept(ParserInputStream<Elem_T>& stream) {
    // ...
  }
};

A rule is a pattern matcher, which recognizes input and returns true or false depending on whether it is successful or not.

Most people are concerned with only matching text, so many of the predefined rules assume the Elem_T parameter to be of type char. Some of the predefined text parsing rules found in rules.hpp are:

typedef MatchCharRange<'a', 'z'> MatchLowerCaseLetter;
typedef MatchCharRange<'A', 'Z'> MatchUpperCaseLetter;
typedef MatchCharRange<'0', '9'> MatchNumber;
typedef re_or<MatchLowerCaseLetter, MatchUpperCaseLetter> MatchLetter;
typedef re_or<MatchLetter, MatchChar<'\''> > MatchWordChar;
typedef re_plus<MatchWordChar> MatchWord;
typedef re_or<MatchLetter, MatchChar<'_'> > MatchIdentFirstChar;
typedef re_or<MatchIdentFirstChar, MatchNumber> MatchIdentOtherChar;
typedef re_and<MatchIdentFirstChar, re_star<MatchIdentOtherChar> > MatchIdent;

These rules are defined in terms of a core set of text parsing rules:

struct MatchChar { ... }
struct MatchCharRange { ... }
struct MatchString { ... }
struct MatchStringNoCase { ... }
struct MatchWSpace { ... }

Regular Expression Meta-Functions

The key to YARD is the ability to combine rules using the following set of regular expression operators:

  • re_opt<Rule_T> - Matches Rule_T 0 or 1 times.
  • re_star<Rule_T> - Matches Rule_T 0 or more times.
  • re_plus<Rule_T> - Matches Rule_T 1 or more times.
  • re_repeat<Rule_T, N> - Matches Rule_T exactly N times.
  • re_or<Rule_T0, Rule_T1> - Matches Rule_T0 or failing that matches Rule_T1.
  • re_and<Rule_T0, Rule_T1> - Matches Rule_T0 and then matches Rule_T1.
  • re_or3<Rule_T0, Rule_T1, Ru> - Matches Rule_T0 or failing that matches Rule_T1.
  • re_and3<Rule_T0, Rule_T1> - Matches Rule_T0 and then matches Rule_T1.
  • re_until<Rule_T> - Matches everything up to and including Rule_T, fails if it reaches the end of input.

These are called meta-functions, but really they are simply parameterized types. The reason we call this meta-programming is because the parsing algorithm is expressed using the same technique as functional programming.

Under The Hood: The Pattern Matching Algorithms

The YARD engine uses a brute force trial and error matching algorithm. There is a trade-off of some speed for ease of use and simplicity. It was my goal to design the simplest possible generic R-D parser. The YARD parser is nonetheless sufficiently fast for most purposes.

The XML Grammar

The XML grammar that I use the YARD parser to read was lifted directly from here. Since this is more of a demo than an industrial strength XML parser, I have cut several corners (i.e., left out features, and relaxed certain constraints), at the same time, I have been more true to the grammar than many other so-called open-source "XML parsers". The naming of the productions is taken from the official XML grammar.

The grammar productions (YARD rules) are contained in the file xml_grammar.hpp. The starting production is document which is at the bottom of the file. YARD grammars have to be read starting from the bottom of the file. This has to do with C++ compilation rules. Another artifact of compilation order dependencies in C++ is that cyclical type references have to be broken using functions. You will notice four functions at the bottom of the file: AcceptElement(), AcceptComment(), AcceptCDSect(), AcceptPI() which are required because they represent recursive grammar productions.

Here is a small example of the XML grammar:

struct PI : public
  re_and<
    MatchString<PI_begin_string>,
    re_until<
      MatchString<PI_end_string>
    >
  >
  { };

struct Misc : public
  re_or3<
    Comment,
    PI,
    S
  >
  { };

struct prolog : public
  re_and3<
    re_opt<XMLDecl>,
    re_star<Misc>,
    re_opt<
      re_and<
        doctypedecl,
        re_star<Misc>
      >
    >
  >
  { };

struct document : public
  re_and3<
    prolog,
    element,
    re_star<Misc>
  >
  { };

Using The XML Parser

The XML parser is initiated and executed from the following function inside of yard.cpp:

void RunXmlTest()
{
  puts("testing xml parser");
  char* pBuf = AllocFromFile("c:\\doc.xml");
  const char* pEnd = &(pBuf[strlen(pBuf)]);
  {
    TimeIt t;
    if (Parse<xml_grammar::document, char, char const*>(pBuf, pEnd))
    {
      puts("parsed xml document");
    }
    else {
      puts("failed to parse xml document");
    }
  }
  free(pBuf);
}

There appears to be something mysterious going on, because the parser doesn't apparently do anything. In fact, the parser automatically calls the semantic actions which are defined in the file xml_test.hpp. Semantic actions can be defined anywhere, and are automatically associated with the parser, because they are defined as template specializations. There are two semantic actions defined, which upon a successful match of the pattern STag or ETag, output the text of the match to the standard output stream. Here is the STag matching semantic action defined:

template<>
struct yard::Actor<char const*, xml_grammar::STag>  {
  static void OnBefore(char const* pos) { }
  static void OnFailure(char const* pos) { }
  static void OnSuccess(char const* begin, char const* end) {
    std::string s(begin, end - begin);
    s = "STag: " + s;
    puts(s.c_str());
  }
};

Note: Because of the way the R-D parser works, yard::Actor::OnSuccess(...) will be triggered immediately upon a successful match, even if a parent production ultimately fails.

About the Project

The source project includes all of the work on the YARD parser up to the current moment, and runs three tests: it tests the parser along with the string tokenizer and the scanner which are part of YARD but are not discussed in this article.

Summary

I have only just touched on the potential of the YARD parser, and R-D parsers, in general. This also isn't even a complete XML parser, but hopefully, it will provide the motivated reader with enough information to go ahead and implement a more functional and useful XML parser.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

A list of licenses authors might use can be found here


Written By
Software Developer Ara 3D
Canada Canada
I am the designer of the Plato programming language and I teach computer science at Bishop's University. I can be reached via email at cdiggins@gmail.com

Comments and Discussions

 
GeneralI appreciate your code! Pin
snoying28-Oct-12 16:58
snoying28-Oct-12 16:58 
GeneralMy vote of 5 Pin
scristian7131-Mar-11 6:43
scristian7131-Mar-11 6:43 
GeneralRule based source code instrumenation/code injection Pin
Lokanatha Reddy7-Mar-07 2:06
Lokanatha Reddy7-Mar-07 2:06 
GeneralYARD at SourceForge Pin
Christopher Diggins23-Dec-04 19:59
professionalChristopher Diggins23-Dec-04 19:59 
GeneralKick start for rookie Pin
DavidNohejl22-Dec-04 10:10
DavidNohejl22-Dec-04 10:10 
GeneralRe: Kick start for rookie Pin
Christopher Diggins22-Dec-04 10:29
professionalChristopher Diggins22-Dec-04 10:29 
GeneralW3C XML standards Pin
Nosheen Iqbal21-Dec-04 16:14
Nosheen Iqbal21-Dec-04 16:14 
GeneralRe: W3C XML standards Pin
Christopher Diggins21-Dec-04 16:45
professionalChristopher Diggins21-Dec-04 16:45 
GeneralPrecedence and associativity of operators Pin
Mircea Puiu21-Dec-04 7:54
Mircea Puiu21-Dec-04 7:54 
GeneralRe: Precedence and associativity of operators Pin
Christopher Diggins21-Dec-04 8:09
professionalChristopher Diggins21-Dec-04 8:09 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.