Introduction
I have used several XML parsers over the last decade including Xerces and
MS-XML. The two principal APIs are SAX and DOM.
When you've worked with XML for a while, you'll discover that DOM code is very readable - particularly when written in a
top-down style. You start from the document root element and work your way down to the data needed. In contrast, a SAX
API presents a callback mechanism. A handful of different callbacks reflect the XML content as it is parsed from
beginning to end. Your application's callback functions must manage some state information to know in what context it is
getting a text, element, or attribute node. The result is tolerable but not nearly as readable as top-down DOM access.
The advantage of SAX (and Expat) is performance. You process the XML as you parse it. The DOM approach essentially processes
the XML twice. Once to create a DOM and again when you access the DOM to extract the desired data. The latter step
requires the entire DOM object to be present in memory. In short, this is expensive in processing and memory.
With .NET, Microsoft introduced another type of XML parser called the XML Reader.
This parser provides DOM-like access in a limited forward-only manner. This allows top-down access without the need to track state or consume large chunks of memory
(for large XML files). The parser presented here was inspired by the .NET XML Reader. Wanting this functionality for C/C++, I
looked around for a C/C++ implementation that offered the same semantics for processing XML data. Not finding one, I decided to create one.
Scope
For heavy duty work, I will use Xerces or MS-XML. However, for one native Windows project I needed an XML parser that
did not require linking against a large library or creating a new COM dependency. Turns out explaining to users how to
install MS-XML is a bit of a pain. Also, I did not want to deploy an application where the XML library is larger than
the app. Zipped binaries of Xerces for x86 Windows are 18 megabytes. The app fit on a floppy disk. Remember floppy disks?
My XML needs are modest. I don't need CDATA, namespaces, XPath, DTDs, or XML schema. The API presented here allows
faking namespace support but it does not try to resolve out-of-document references. The parser is limited to the content
of the input stream. I only want to store and retrieve some XML data. If you need DTDs or validation, consider a much
larger library. This limited scope has worked very well over the last three years.
Recursive Descent Parsing
Recursive descent parsing is an easy to
understand and easy to maintain mechanism for parsing text. The approach here concentrates on a handful of utility
functions that manage a read-ahead buffer. A cursor position tracks forward progress through the buffer. The functions
return true when something is matched (and optionally consumed). The peek functions perform matching. The parse
functions extend this to also consume buffer content. The functions return false when the buffer does not match.
In this case, the buffer is left untouched (e.g., the cursor position does not change).
bool peekMatch(const char *text, size_t len);
bool peekMatch(const char text);
bool parseMatch(const char *text, size_t len);
bool parseMatch(const char text);
The above functions are the foundation for the whole parser. The result is significant functionality in only 618 lines (as of this writing) of code.
Performance and Memory
The read-ahead buffer tries to read data from the stream in fixed chunks. The parser is currently designed such that
no single parseable item can exceed the size of the read-ahead buffer. I don't embed binary blobs so this has not been
an issue for me. If you run into a situation where the internal buffer is too small, you can alter the source code to make the buffer larger.
The library presented here is built around STL strings which aren't the fastest string objects. For higher performance, you may want to substitute
a different string class. I haven't needed this myself - yet.
Input Stream
To decouple the parser from the various sources of XML data, an IInputStream interface was defined. Included here
are IInputStream implementations for disk file and memory. There's not much to the interface, see:
virtual bool Read(unsigned char *pOctets, size_t iOctets, size_t &iRead);
virtual void Close();
As you can imagine, creating file and memory implementations are trivial. Implementing IInputStream for a socket,
named pipe, or gzip compressed stream is left as an exercise for the reader.
XML Reader
The reader / parser accepts an IInputStream through the Open method. The underlying stream is assumed to be ready.
FileInputStream istream;
XML::Reader reader;
if ( istream.Open(path) && reader.open(&istream) )
{
}
reader.close();
istream.Close();
From there, you can start parsing. Parsing elements follows this loose outline:
- Call
readStartElement to parse the element (up to but not including any nested content or closing tag).
- Use
getAttribute zero or more times to retrieve attributes (must be done at this stage).
- Parse nested content, if any.
- Call
readEndElement to advance to next node in the stream.
Suppose you expect a document level element named "Root". You can simultaneously consume the element and verify the element
is indeed named "Root" with the following statement:
if ( reader.readStartElement("Root") )
{
reader.readEndElement(false);
}
The parser won't advance if the root element is not named "Root". You can test for any of several possible document
level elements. Notice the matching call to readEndElement. The paired calls are required for a matched set of opening / closing tags and for self-closing tags.
Here's a self-closing tag example to illustrate. I've included column numbers to make it easier to read.
1 2
12345678901234567890123
<element key="value" />
readStartElement will consume all of the above text from column 1 up to and including column 20. Basically, that's everything but the element's "/>" tail.
Notice that the attributes are also consumed at this stage. The element's attributes are temporarily retained up until the eventual call to readEndElement.
The readEndElement call will consume columns 21 through 23.
Here's a matched opening / closing tag example to illustrate. I've included column numbers to make it easier to read.
1 2 3
12345678901234567890123456789012345
<element key="value">text</element>
readStartElement will consume all of the above text from column 1 up to and including column 21. Basically, that's the element's opening tag.
Notice that the attributes are also consumed at this stage. The element's attributes are temporarily retained up until another call to readStartElement or the eventual call
to readEndElement. The readEndElement call will consume columns 26 through 35 - but only if you've made a call to readPCData
to consume the text at columns 22 through 25.
To deal with nested content, write something like this:
if ( reader.readStartElement("parent") )
{
reader.readPCData(text);
while ( readStartElement("child") )
{
reader.readEndElement();
}
reader.readEndElement(false);
}
With the above example, I hope you can see how easy it is to express XML parsing in a top-down manner.
Note: Microsoft's MSXML parser requires testing for self-closing tags. In this library, the same mechanism accepts both open/closing tags and self-closing tags.
Suppose you have an XML fragment like so:
<Contact
first="Michael"
last="Jordan"
city="Chicago"
state="Illinois"
postal="60612"
/>
<Contact
first="Barbara"
last="Jordan"
city="Houston"
state="Texas"
postal="77002"
/>
<Contact
first="Michael"
last="Dunn"
city="Toronto"
state="Ontario"
postal="M4R1K1"
/>
Your objective is to parse the above into a C++ class like so:
class Contact
{
std::string first, last, city, state, postal;
};
Here's one way to parse a consecutive run of "Contact" elements.
std::list<Contact *> list;
while ( reader.readStartElement("Contact") )
{
Contact *p = new Contact();
reader.getAttribute("first", p->first);
reader.getAttribute("last", p->last);
reader.getAttribute("city", p->city);
reader.getAttribute("state", p->state);
reader.getAttribute("postal", p->postal);
list.push_back(p);
}
The getAttribute function returns true if the attribute was found. Suppose you want to enforce the presence of the first and last name attributes.
You could do this as ...
std::list<Contact *> list;
while ( reader.readStartElement("Contact") )
{
Contact copy;
if ( reader.getAttribute("first", copy.first) &&
reader.getAttribute("last", copy.last) )
{
reader.getAttribute("city", copy.city);
reader.getAttribute("state", copy.state);
reader.getAttribute("postal", copy.postal);
list.push_back( new Contact(copy) );
}
else
{
}
}
It is very common to see elements used like attributes. They have no attributes and only PCData text content. Suppose you have an XML fragment like so:
<Contact>
<first>Michael</first>
<last>Jordan</last>
<city>Chicago</city>
<state>Illinois</state>
<postal>60612</postal>
</Contact>
<Contact>
<first>Barbara</first>
<last>Jordan</last>
<city>Houston</city>
<state>Texas</state>
<postal>77002</postal>
</Contact>
<Contact>
<first>Michael</first>
<last>Dune</last>
<city>Toronto</city>
<state>Ontario</state>
<postal>M4R1K1</postal>
</Contact>
readStringElement is tailor-made for parsing this type of data. With readStringElement, the above example becomes ...
std::list<Contact *> list;
while ( reader.readStartElement("Contact") )
{
Contact copy;
if ( reader.readStringElement("first", copy.first)
reader.readStringElement("last", copy.last) )
{
reader.readStringElement("city", copy.city) );
reader.readStringElement("state", copy.state) );
reader.readStringElement("postal", copy.postal) );
list.push_back( new Contact(copy) );
}
else
{
}
}
What if you don't know the names of the elements and attributes? No worry. You can use the anonymous version of readStartElement to consume an element without knowing its name.
The getElement method will retrieve the element's name. For unknown attributes, an enumAttributes function allows you to iterate over the list of attributes for the
current element. You could, for example, write your own XML pretty-printer using these two functions.
The attribute values shown so far have been strings. What about other types like bool or int? You have the full power of the C standard library at your disposal.
I make heavy use of strtol or the TCHAR aware version tcstol. Consider strtoul for unsigned 32 bit long integers. For smaller integer types
like short and char, a simple typecast does the job. I usually use "0" and "1" for true/false values.
tstring strValue;
long width = 0;
short value = 0;
char rank = 0;
bool locked = false;
if ( reader.getAttribute("width", strValue) )
width = tcstol(strValue.c_str(), NULL, 10);
if ( reader.getAttribute("value", strValue) )
value = (short)tcstol(strValue.c_str(), NULL, 10);
if ( reader.getAttribute("rank", strValue) )
rank = (char)tcstol(strValue.c_str(), NULL, 10);
if ( reader.getAttribute("locked", strValue) )
locked = tcstol(strValue.c_str(), NULL, 10) != 0;
Schema Change Tolerance
Attributes discovered by the parser but not retrieved by getAttributes are quietly ignored. This means older code won't crash when presented with a new attribute.
You have some control over tolerating future addition of child nodes not anticipated in your current schema or design. The readEndElement function accepts a boolean flag.
Set this flag to true to silently ignore any nested content (e.g. text or child elements). The getSkipped function alerts you to any discarded content.
At some point, schema changes force you to lose backwards compatibility. Be sure to include a version attribute in your document level element to cope with this.
Namespaces
As stated above, this library does not support namespaces. But this does not prevent you from parsing XML with
namespace prefixes. A trick I use is to parse the elements anonymously and ignore the namespace prefix. A more robust
technique is to parse the namespace declaration, and merge the prefix with your expected element names before calls to
readStartElement. Again, I've scraped by without the need to properly deal with namespaces. I try not to implement something until there is an actual need.
Entities
Only five standard XML entities are recognized. Those five are:
- &
- '
- >
- <
- "
All others are passed through to your application.
Unicode
The parser only parses XML containing UTF-8 and ASCII text. It does not check for byte markers. The client API accepts TCHAR (#define'd as char
or wchar_t) data when retrieving attribute values or PCData. Element names and attribute names are always specified as UTF-8 character strings (e.g., std::string and not
std::wstring). The typdef tstring is defined as either std::string or std::wstring depending upon your application's Unicode
settings. If your code must support both multibyte (MBCS or UTF-8) and Unicode (UTF-16) builds, I suggest using tstring. In the case of a Unicode build,
the library will transcode UTF-8 to UTF-16 before passing the data to your application.
Linking to Your App
Separating functionality into separate DLLs is common. For the EXE and DLLs to work together, they must all be built with compatible compiler options and link against compatible versions
of the C run-time library. Add one third party DLL and this quickly devolves to DLL hell. I've long since adopted the practice of static libraries and a single
EXE. The optimizing linker is smart enough to omit unused library functions.
This library statically links to your application. Make sure your compile options are compatible. Check character
encoding, run time library (multithreaded vs. single-threaded, and DLL vs. static CRT). Relative to DLL hell, this is
the lesser of two evils. The Reader class resides under the XML namespace. If this collides with other symbols or namespaces, you may want to tweak the source files.
Concludes
If you have modest XML needs and like the idea of a forward only XML parser, give this little library a try!
In a future installment, I'll show the XML writer companion class and later combine them into a Windows native SOAP client.
Future
Supporting DTDs, CDATA, and namespaces might be nice someday. I should add some getAttribute overrides for int,
char, bool, long, and short and their unsigned counterparts. The XML::Reader class was recently split into separate
classes for the read-ahead buffer with low level parsing primitives and high level XML API. The read-ahead parser can be leveraged for other things like JSON and CSV.
Terms
- attribute: A key / value pair appearing inside an XML element; written as key="value"; following the element name and preceding the closing bracket '>'.
- element: An XML object used to described structured content. An element may appear with or without attributes. It may appear as a self-closing tag or as a pair of opening and closing tags.
Nested content may appear between the opening and closing tags.
- opening tag: A tag marking the start of an XML element and its nested content, if any.
The opening tag is written as
<element [ attribute="value" ]* >. The text "element" is replaced with the actual element name. Nested content may follow.
- closing tag: A tag marking the end of an XML element and its nested content. The closing tag is written
as
</element> where the text "element" is replaced with the actual element name.
- self-closing: A tag marking an XML element with no nested content. A self-closing tag may have attributes.
The syntax of a self-closing tag is
<element [ key="value" ]* />. Notice the trailing '/'. The text "element" is replaced with the actual element name.
- PCData: Parsed character data; free text that appears between opening and closing element tags. May contain XML entity references.
- entity: An XML macro mechanism used for text substitution. Some examples are: & < > "
References
History
- 10 December 2011 - Article first published.