Click here to Skip to main content
Click here to Skip to main content
Go to top

Parse XML at SAX Speed without DOM or SAX

, 11 Dec 2011
Rate this:
Please Sign up or sign in to vote.
SAX performance with the readability of DOM like access.

Introduction

I have used several XML parsers over the last decade including Xerces and MS-XML. The two principal APIs are SAX and DOM. When you've worked with XML for a while, you'll discover that DOM code is very readable - particularly when written in a top-down style. You start from the document root element and work your way down to the data needed. In contrast, a SAX API presents a callback mechanism. A handful of different callbacks reflect the XML content as it is parsed from beginning to end. Your application's callback functions must manage some state information to know in what context it is getting a text, element, or attribute node. The result is tolerable but not nearly as readable as top-down DOM access. The advantage of SAX (and Expat) is performance. You process the XML as you parse it. The DOM approach essentially processes the XML twice. Once to create a DOM and again when you access the DOM to extract the desired data. The latter step requires the entire DOM object to be present in memory. In short, this is expensive in processing and memory.

With .NET, Microsoft introduced another type of XML parser called the XML Reader. This parser provides DOM-like access in a limited forward-only manner. This allows top-down access without the need to track state or consume large chunks of memory (for large XML files). The parser presented here was inspired by the .NET XML Reader. Wanting this functionality for C/C++, I looked around for a C/C++ implementation that offered the same semantics for processing XML data. Not finding one, I decided to create one.

Scope

For heavy duty work, I will use Xerces or MS-XML. However, for one native Windows project I needed an XML parser that did not require linking against a large library or creating a new COM dependency. Turns out explaining to users how to install MS-XML is a bit of a pain. Also, I did not want to deploy an application where the XML library is larger than the app. Zipped binaries of Xerces for x86 Windows are 18 megabytes. The app fit on a floppy disk. Remember floppy disks?

My XML needs are modest. I don't need CDATA, namespaces, XPath, DTDs, or XML schema. The API presented here allows faking namespace support but it does not try to resolve out-of-document references. The parser is limited to the content of the input stream. I only want to store and retrieve some XML data. If you need DTDs or validation, consider a much larger library. This limited scope has worked very well over the last three years.

Recursive Descent Parsing

Recursive descent parsing is an easy to understand and easy to maintain mechanism for parsing text. The approach here concentrates on a handful of utility functions that manage a read-ahead buffer. A cursor position tracks forward progress through the buffer. The functions return true when something is matched (and optionally consumed). The peek functions perform matching. The parse functions extend this to also consume buffer content. The functions return false when the buffer does not match. In this case, the buffer is left untouched (e.g., the cursor position does not change).

bool peekMatch(const char *text, size_t len);
bool peekMatch(const char text);
bool parseMatch(const char *text, size_t len);
bool parseMatch(const char text); 

The above functions are the foundation for the whole parser. The result is significant functionality in only 618 lines (as of this writing) of code.

Performance and Memory

The read-ahead buffer tries to read data from the stream in fixed chunks. The parser is currently designed such that no single parseable item can exceed the size of the read-ahead buffer. I don't embed binary blobs so this has not been an issue for me. If you run into a situation where the internal buffer is too small, you can alter the source code to make the buffer larger.

The library presented here is built around STL strings which aren't the fastest string objects. For higher performance, you may want to substitute a different string class. I haven't needed this myself - yet.

Input Stream

To decouple the parser from the various sources of XML data, an IInputStream interface was defined. Included here are IInputStream implementations for disk file and memory. There's not much to the interface, see:

virtual bool Read(unsigned char *pOctets, size_t iOctets, size_t &iRead);
virtual void Close(); 

As you can imagine, creating file and memory implementations are trivial. Implementing IInputStream for a socket, named pipe, or gzip compressed stream is left as an exercise for the reader.

XML Reader

The reader / parser accepts an IInputStream through the Open method. The underlying stream is assumed to be ready.

FileInputStream istream;
XML::Reader reader;
if ( istream.Open(path) && reader.open(&istream) )
{
    // do some parsing
}
reader.close();
istream.Close();

From there, you can start parsing. Parsing elements follows this loose outline:

  • Call readStartElement to parse the element (up to but not including any nested content or closing tag).
  • Use getAttribute zero or more times to retrieve attributes (must be done at this stage).
  • Parse nested content, if any.
  • Call readEndElement to advance to next node in the stream.

Suppose you expect a document level element named "Root". You can simultaneously consume the element and verify the element is indeed named "Root" with the following statement:

if ( reader.readStartElement("Root") )
{
    // parse rest of document.
    reader.readEndElement(false);
}

The parser won't advance if the root element is not named "Root". You can test for any of several possible document level elements. Notice the matching call to readEndElement. The paired calls are required for a matched set of opening / closing tags and for self-closing tags.

Here's a self-closing tag example to illustrate. I've included column numbers to make it easier to read.

         1         2
12345678901234567890123
<element key="value" />

readStartElement will consume all of the above text from column 1 up to and including column 20. Basically, that's everything but the element's "/>" tail. Notice that the attributes are also consumed at this stage. The element's attributes are temporarily retained up until the eventual call to readEndElement. The readEndElement call will consume columns 21 through 23.

Here's a matched opening / closing tag example to illustrate. I've included column numbers to make it easier to read.

         1         2         3
12345678901234567890123456789012345
<element key="value">text</element>

readStartElement will consume all of the above text from column 1 up to and including column 21. Basically, that's the element's opening tag. Notice that the attributes are also consumed at this stage. The element's attributes are temporarily retained up until another call to readStartElement or the eventual call to readEndElement. The readEndElement call will consume columns 26 through 35 - but only if you've made a call to readPCData to consume the text at columns 22 through 25.

To deal with nested content, write something like this:

if ( reader.readStartElement("parent") )
{
    reader.readPCData(text);
    while ( readStartElement("child") )
    {
    // do something with child attributes or possibly nested PCData.
    reader.readEndElement();
    }
    reader.readEndElement(false);
}

With the above example, I hope you can see how easy it is to express XML parsing in a top-down manner.

Note: Microsoft's MSXML parser requires testing for self-closing tags. In this library, the same mechanism accepts both open/closing tags and self-closing tags.

Suppose you have an XML fragment like so:

<Contact
    first="Michael"
    last="Jordan"
    city="Chicago"
    state="Illinois"
    postal="60612"
/>
<Contact
    first="Barbara"
    last="Jordan"
    city="Houston"
    state="Texas"
    postal="77002"
/>
<Contact
    first="Michael"
    last="Dunn"
    city="Toronto"
    state="Ontario"
    postal="M4R1K1"
/> 

Your objective is to parse the above into a C++ class like so:

class Contact
{
    std::string first, last, city, state, postal;
};

Here's one way to parse a consecutive run of "Contact" elements.

std::list<Contact *> list;

while ( reader.readStartElement("Contact") )
{
    Contact *p = new Contact();
    reader.getAttribute("first", p->first);
    reader.getAttribute("last", p->last);
    reader.getAttribute("city", p->city);
    reader.getAttribute("state", p->state);
    reader.getAttribute("postal", p->postal);

    // add to list of parsed contacts.
    list.push_back(p);
}

The getAttribute function returns true if the attribute was found. Suppose you want to enforce the presence of the first and last name attributes. You could do this as ...

std::list<Contact *> list;

while ( reader.readStartElement("Contact") )
{
    Contact copy;
    // got both first and last names?
    if ( reader.getAttribute("first", copy.first) && 
        reader.getAttribute("last", copy.last) )
    {
        // yes: parse the rest.
        reader.getAttribute("city", copy.city);
        reader.getAttribute("state", copy.state);
        reader.getAttribute("postal", copy.postal);

        // add to list of parsed contacts.
        list.push_back( new Contact(copy) );
    }
    else
    {
        // emit log trace or throw exception here.
    }
}

It is very common to see elements used like attributes. They have no attributes and only PCData text content. Suppose you have an XML fragment like so:

<Contact>
    <first>Michael</first>
    <last>Jordan</last>
    <city>Chicago</city>
    <state>Illinois</state>
    <postal>60612</postal>
</Contact>
<Contact>
    <first>Barbara</first>
    <last>Jordan</last>
    <city>Houston</city>
    <state>Texas</state>
    <postal>77002</postal>
</Contact>
<Contact>
    <first>Michael</first>
    <last>Dune</last>
    <city>Toronto</city>
    <state>Ontario</state>
    <postal>M4R1K1</postal>
</Contact>

readStringElement is tailor-made for parsing this type of data. With readStringElement, the above example becomes ...

std::list<Contact *> list;

while ( reader.readStartElement("Contact") )
{
    Contact copy;

    // got both first and last names?
    if ( reader.readStringElement("first", copy.first)
        reader.readStringElement("last", copy.last) )
    {
        // yes: parse the rest.
        reader.readStringElement("city", copy.city) );
        reader.readStringElement("state", copy.state) );
        reader.readStringElement("postal", copy.postal) );
        // add to list of parsed contacts.
        list.push_back( new Contact(copy) );
    }
    else
    {
        // emit log trace or throw exception here.
    }
}

What if you don't know the names of the elements and attributes? No worry. You can use the anonymous version of readStartElement to consume an element without knowing its name. The getElement method will retrieve the element's name. For unknown attributes, an enumAttributes function allows you to iterate over the list of attributes for the current element. You could, for example, write your own XML pretty-printer using these two functions.

The attribute values shown so far have been strings. What about other types like bool or int? You have the full power of the C standard library at your disposal. I make heavy use of strtol or the TCHAR aware version tcstol. Consider strtoul for unsigned 32 bit long integers. For smaller integer types like short and char, a simple typecast does the job. I usually use "0" and "1" for true/false values.

tstring strValue;
long width = 0;
short value = 0;
char rank = 0;
bool locked = false;

if ( reader.getAttribute("width", strValue) )
    width = tcstol(strValue.c_str(), NULL, 10);
if ( reader.getAttribute("value", strValue) )
    value = (short)tcstol(strValue.c_str(), NULL, 10);
if ( reader.getAttribute("rank", strValue) )
    rank = (char)tcstol(strValue.c_str(), NULL, 10);
if ( reader.getAttribute("locked", strValue) )
    locked = tcstol(strValue.c_str(), NULL, 10) != 0;

Schema Change Tolerance

Attributes discovered by the parser but not retrieved by getAttributes are quietly ignored. This means older code won't crash when presented with a new attribute.

You have some control over tolerating future addition of child nodes not anticipated in your current schema or design. The readEndElement function accepts a boolean flag. Set this flag to true to silently ignore any nested content (e.g. text or child elements). The getSkipped function alerts you to any discarded content.

At some point, schema changes force you to lose backwards compatibility. Be sure to include a version attribute in your document level element to cope with this.

Namespaces

As stated above, this library does not support namespaces. But this does not prevent you from parsing XML with namespace prefixes. A trick I use is to parse the elements anonymously and ignore the namespace prefix. A more robust technique is to parse the namespace declaration, and merge the prefix with your expected element names before calls to readStartElement. Again, I've scraped by without the need to properly deal with namespaces. I try not to implement something until there is an actual need.

Entities

Only five standard XML entities are recognized. Those five are:

  • &amp;
  • &apos;
  • &gt;
  • &lt;
  • &quot;

All others are passed through to your application.

Unicode

The parser only parses XML containing UTF-8 and ASCII text. It does not check for byte markers. The client API accepts TCHAR (#define'd as char or wchar_t) data when retrieving attribute values or PCData. Element names and attribute names are always specified as UTF-8 character strings (e.g., std::string and not std::wstring). The typdef tstring is defined as either std::string or std::wstring depending upon your application's Unicode settings. If your code must support both multibyte (MBCS or UTF-8) and Unicode (UTF-16) builds, I suggest using tstring. In the case of a Unicode build, the library will transcode UTF-8 to UTF-16 before passing the data to your application.

Linking to Your App

Separating functionality into separate DLLs is common. For the EXE and DLLs to work together, they must all be built with compatible compiler options and link against compatible versions of the C run-time library. Add one third party DLL and this quickly devolves to DLL hell. I've long since adopted the practice of static libraries and a single EXE. The optimizing linker is smart enough to omit unused library functions.

This library statically links to your application. Make sure your compile options are compatible. Check character encoding, run time library (multithreaded vs. single-threaded, and DLL vs. static CRT). Relative to DLL hell, this is the lesser of two evils. The Reader class resides under the XML namespace. If this collides with other symbols or namespaces, you may want to tweak the source files.

Concludes

If you have modest XML needs and like the idea of a forward only XML parser, give this little library a try! In a future installment, I'll show the XML writer companion class and later combine them into a Windows native SOAP client.

Future

Supporting DTDs, CDATA, and namespaces might be nice someday. I should add some getAttribute overrides for int, char, bool, long, and short and their unsigned counterparts. The XML::Reader class was recently split into separate classes for the read-ahead buffer with low level parsing primitives and high level XML API. The read-ahead parser can be leveraged for other things like JSON and CSV.

Terms

  • attribute: A key / value pair appearing inside an XML element; written as key="value"; following the element name and preceding the closing bracket '>'.
  • element: An XML object used to described structured content. An element may appear with or without attributes. It may appear as a self-closing tag or as a pair of opening and closing tags. Nested content may appear between the opening and closing tags.
  • opening tag: A tag marking the start of an XML element and its nested content, if any. The opening tag is written as <element [ attribute="value" ]* >. The text "element" is replaced with the actual element name. Nested content may follow.
  • closing tag: A tag marking the end of an XML element and its nested content. The closing tag is written as </element> where the text "element" is replaced with the actual element name.
  • self-closing: A tag marking an XML element with no nested content. A self-closing tag may have attributes. The syntax of a self-closing tag is <element [ key="value" ]* />. Notice the trailing '/'. The text "element" is replaced with the actual element name.
  • PCData: Parsed character data; free text that appears between opening and closing element tags. May contain XML entity references.
  • entity: An XML macro mechanism used for text substitution. Some examples are: &amp; &lt; &gt; &quot;

References

History

  • 10 December 2011 - Article first published.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Share

About the Author

unitrunker
Architect Unitrunker
United States United States
Rick is a long-time programmer. His one claim to fame is being known by first name at nearly every coffeehouse in town. He enjoys speaking in speed-talk settings on technical topics. A faded bumper sticker on his car reads: "I <br /> for XHTML."
 
You can find more info on my Github page.
Follow on   Twitter

Comments and Discussions

 
QuestionCode update Pinmemberunitrunker20-Oct-12 19:55 
QuestionDoesn't Xerces operate in SAX mode and how about BOOST Spirit? PinmemberPeteBarber10-Jan-12 21:25 
AnswerRe: Doesn't Xerces operate in SAX mode and how about BOOST Spirit? Pinmemberunitrunker14-Jan-12 15:43 
AnswerRe: Doesn't Xerces operate in SAX mode and how about BOOST Spirit? Pinmemberunitrunker14-Jan-12 15:58 
GeneralMy vote of 5 PinmemberMihai MOGA18-Dec-11 3:33 
GeneralMy vote of 5 PinmemberKjellKod.cc16-Dec-11 1:04 
QuestionStAX PinmemberKindDragon12-Dec-11 12:02 
QuestionGreat article Pinmembered welch12-Dec-11 5:03 
Questionnice effort PinmemberShakti Misra11-Dec-11 18:17 
AnswerRe: nice effort Pinmemberunitrunker11-Dec-11 18:54 
Thanks Misra (I love the feedback the codeproject community provides). RapidXML competes with the other DOM parsers.
AnswerRe: nice effort Pinmvpthatraja8-Jan-12 22:21 
QuestionWhy not use XMLite? PinmemberG Zizys11-Dec-11 8:09 
AnswerRe: Why not use XMLite? Pinmemberunitrunker11-Dec-11 8:53 
AnswerRe: Why not use XMLite? Pinmemberunitrunker11-Dec-11 9:41 
GeneralRe: Why not use XMLite? Pinmemberqisamuelzhang12-Dec-11 14:31 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.

| Advertise | Privacy | Mobile
Web04 | 2.8.140916.1 | Last Updated 11 Dec 2011
Article Copyright 2011 by unitrunker
Everything else Copyright © CodeProject, 1999-2014
Terms of Service
Layout: fixed | fluid