
Parsing XML using a C++ wrapper for SAX2

By nschan, 3 Aug 2005
A basic C++ wrapper framework for the MSXML SAX2 API is presented.

Introduction

As a GUI programmer, I prefer using the SAX2 API for my XML parsing needs because of its events-based usage model. However, the API set is large, COM-based (MS version), and requires you to work with wide character strings. I find it simpler to wrap the API and expose just the functionality that I need. In this article, I present a C++ wrapper framework that allows you to do basic XML parsing and validation using SAX2. The wrapper interfaces use pure C++/STL and the events-based model is retained. In addition, beyond just returning strings as they are parsed, I've included some basic XML data types in the wrapper layer to represent XML elements and attributes.

The attached demo is a VC6 workspace (TestXml.dsw) that consists of two separate projects. The first project is a Win32 static library (XmlSupport) that contains the wrapper framework classes. This project does not use MFC. The second project does use MFC and is a dialog-based test application (TestXmlSupport) that demonstrates how to use the wrapper classes. I've divided the code in this way to make it clearer which classes are intended for reuse and which are merely for demonstration. If you find the code useful, you can repackage the classes as you like, such as in a DLL. Note that the wrapper exposes only a small part of the SAX2 API set, but it's enough for what I want to demonstrate in this article.

Prerequisites

Support for SAX2 (the latest version of the Simple API for XML) is provided as part of the Microsoft XML Core Services (MSXML) SDK. I am using MSXML 4.0 SP2, which is a prerequisite for compiling the demo projects because they rely on functionality (such as schema validation using SAX2) that is only supported as of version 4.0. On my XP Pro machine, I initially had problems installing SP2, as I already had a partial install of MSXML 4.0. However, after I uninstalled that first (as recommended by Microsoft), I was able to re-install SP2 just fine. As I understand it, MSXML 4.0 can coexist with an MSXML 3.0 SDK installation.

Background

The SAX2 API is useful when you need to parse large XML documents because it does not require you to read in the entire file before returning parse results. When you initiate a parse with the SAX2 parser, XML element and attribute entities are returned as soon as they are encountered, while the document is still being processed serially. In addition, with SAX2, you can abort a parse at any time. For example, this allows you to stop parsing as soon as you've found a particular record that's stored in your XML file. The test application demonstrates this ability to abort.

To use the SAX2 API directly yourself, you typically begin by specifying an import directive to incorporate information from the MSXML COM type library. For example, add the following to an include file before using any SAX2 interfaces or types (note: these examples do not use smart pointers):

#import <msxml4.dll> raw_interfaces_only
using namespace MSXML2;

The SAX2 parser is encapsulated by the COM class, SAXXMLReader40. The parser interface it supports is ISAXXMLReader, which offers methods such as parseURL(). Below is the code I use to create an instance of the COM class, and at the same time, get a pointer to the parser interface (m_reader):

void CXmlParserImpl::CreateSaxReader()
{
    HRESULT hr = CoCreateInstance(
        __uuidof(SAXXMLReader40), 
        NULL, 
        CLSCTX_ALL, 
        __uuidof(ISAXXMLReader), 
        (void **)&m_reader);
    
    if ( SUCCEEDED(hr) )
    {
        ...
    }
}

To initiate a parse using the SAX reader, I simply call ISAXXMLReader::parseURL() and pass in a wide character string that contains an HTTP URL or the file path of an XML file. An HTTP URL can be used, for example, to specify an XML-based RSS feed from a website such as CNN.com.

HRESULT hr = m_reader->parseURL(wszURL);

Since SAX2 is events-based, you need to implement one or more event handler interfaces in order to receive parsing results or error notifications. For example, there is the ISAXContentHandler interface, which has virtual methods such as startElement() and endElement(). These methods are invoked by the SAX reader when it encounters the start or end of an XML element. In your application, you need to write a class that implements ISAXContentHandler and overrides methods such as startElement(). Then create an instance of your class and register the instance with the SAX reader. This registration is typically done immediately after creating the SAX reader:

void CXmlParserImpl::CreateSaxReader()
{
    ...
    if ( SUCCEEDED(hr) )
    { 
        // Set the content handler.
        m_contentHandler = new CSaxContentHandler;
        hr = m_reader->putContentHandler(m_contentHandler);
        if ( FAILED(hr) )
        {
            delete m_contentHandler;
            m_contentHandler = NULL;
        }
    }
}

Similarly, to receive parsing errors, you can write a class that implements the ISAXErrorHandler interface, and register that handler instance with the SAX reader as well. The ISAXErrorHandler interface has virtual methods such as error() and fatalError() that are called when errors in the input XML source are detected.

Below is an example XML file, test.xml, which is a slightly modified version of a sample file from the MSDN Technical article, "JumpStart for Creating a SAX2 Application with C++" (see References section for link). I've added comments beside each line in the code block below to indicate which of the relevant ISAXContentHandler methods are invoked as each line is processed by the SAX reader.

<?xml version="1.0" encoding="ISO-8859-1"?> ............... startDocument()
<root> .................................................... startElement()
 <PARTS> .................................................. startElement()
  <PART ID="ABC" Tag="cab"> ............................... startElement()
   <PARTNO>12345</PARTNO> ................................. startElement(), characters(), endElement()
   <DESCRIPTION>VIP - Very Important Part</DESCRIPTION> ... startElement(), characters(), endElement()
  </PART> ................................................. endElement()
  <PART ID="XYZ" Tag="zxy"> ............................... startElement()
   <PARTNO>5678</PARTNO> .................................. startElement(), characters(), endElement()
   <DESCRIPTION>LIP - Less Important Part</DESCRIPTION> ... startElement(), characters(), endElement()
  </PART> ................................................. endElement()
 </PARTS> ................................................. endElement()
</root> ................................................... endElement(), endDocument()

The C++ wrapper discussed in this article hides many of the coding details above, such as creating the COM class and optionally remapping wide character strings. However, the wrapper layer still uses an events-based approach, so understanding the above will help in learning how to use and even extend the wrapper.

The C++ wrapper layer

The classes that are exported by the C++ wrapper layer are described below. Note that these classes use char-based STL strings. For each of these classes, there is an equivalent class in the wrapper layer that uses wide-character (wchar_t) strings instead. This is discussed further in a later section.

  • CXmlAttribute: This is a data class that represents a single XML attribute. An XML attribute is basically a name-value pair of strings. For example, in the test.xml file listed earlier, ID="ABC" is an attribute where the attribute name is "ID" and the attribute value is "ABC".
  • CXmlElement: This is a data class that represents either a start element or an end element. It contains information such as the element name and a set of zero or more CXmlAttribute objects. For example, in the test.xml file listed earlier, <PART ID="ABC" Tag="cab"> is an XML element with element name "PART" and two attributes.
  • IXmlElementHandler: This is a pure abstract class that defines an interface for receiving/handling XML event notifications during parsing. One of your application classes should implement this interface and register itself with a CXmlParser instance.
    class IXmlElementHandler
    {
    public:
        // Handle XML content events during parsing.
        virtual void OnXmlStartElement(const CXmlElement& xmlElement) = 0;
        virtual void OnXmlElementData(const std::string& elementData, 
                                                          int depth) = 0;
        virtual void OnXmlEndElement(const CXmlElement& xmlElement) = 0;
    
        // Handle XML error events during parsing.
        virtual void OnXmlError(int line, int column, 
                const std::string& errorText, unsigned long errorCode) = 0;
        
        // Return true to stop parsing early.
        virtual bool OnXmlAbortParse(const CXmlElement& xmlElement) = 0;
    };
  • CXmlParser: This is the primary class in the wrapper layer. It's a concrete class that wraps the functionality of the SAX2 reader (parser). Below is the class definition for reference:
    class CXmlParser
    {
    public:
        CXmlParser();
        ~CXmlParser();
    
        // Is the parser available (e.g. was the COM class
        //                             created properly?).
        bool IsReady() const;
    
        // Attach XML events handler.
        void AttachElementHandler(IXmlElementHandler* pElementHandler);
        void DetachElementHandler();
    
        // Set parser feature options.
        bool SetParserFeature(const std::string& featureName, bool value);
        bool GetParserFeature(const std::string& featureName, 
                                          bool& value) const;
    
        // Add/remove XSD schemas for validation. The namespaceURI
        // can be an empty string.
        bool AddValidationSchema(const std::string& namespaceURI, 
                                     const std::string& xsdPath);
        bool RemoveValidationSchema(const std::string& namespaceURI);
    
        // Parse a local file path or an HTTP URL.
        bool Parse(const std::string& xmlPath);
    
    private:
        // Use the impl technique so we can hide the implementation
        // and not require wrapper clients to import MSXML types.
        // CXmlParserImpl uses wide-char strings natively.
        CXmlParserImpl* m_impl;
    };

You can find each of the above classes defined in the XmlSupport project. As mentioned earlier, this is a Win32 static library. To use the library in your own application, just include the XmlParser.h file, and link your project against XmlSupport.lib. Note that the run-time library I am using in my projects is Debug Multithreaded DLL (for Win32 Debug configuration) and Multithreaded DLL (for Win32 Release configuration). Check the project settings of your application (C/C++ tab, Code Generation category) to make sure your settings are compatible or else you will get linker errors.
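To make the handler interface more concrete, here is a minimal, self-contained sketch of an element handler that looks for one PART record and asks for the parse to stop once it has been read. It does not use MSXML or the XmlSupport library. DemoElement, IDemoElementHandler, DemoEvent, Replay() and SampleEvents() are hypothetical stand-ins written for this illustration: the accessors of the real CXmlElement are not shown in this article, and the point at which the real wrapper polls OnXmlAbortParse() is an assumption. The sketch uses C++11 features, so it would need adjusting for VC6.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for CXmlElement (the real accessors are not shown).
struct DemoElement {
    std::string name;
    std::map<std::string, std::string> attributes;
};

// Mirrors the shape of IXmlElementHandler, minus OnXmlError.
class IDemoElementHandler {
public:
    virtual ~IDemoElementHandler() {}
    virtual void OnXmlStartElement(const DemoElement& element) = 0;
    virtual void OnXmlElementData(const std::string& data, int depth) = 0;
    virtual void OnXmlEndElement(const DemoElement& element) = 0;
    virtual bool OnXmlAbortParse(const DemoElement& element) = 0;
};

// Captures the PARTNO of one PART and requests an abort once that PART closes.
class PartFinder : public IDemoElementHandler {
public:
    explicit PartFinder(const std::string& wantedId) : m_wantedId(wantedId) {}

    void OnXmlStartElement(const DemoElement& e) override {
        if (e.name == "PART") {
            auto it = e.attributes.find("ID");
            m_inWantedPart = (it != e.attributes.end() && it->second == m_wantedId);
        } else if (e.name == "PARTNO") {
            m_inPartNo = true;
        }
    }
    void OnXmlElementData(const std::string& data, int /*depth*/) override {
        if (m_inWantedPart && m_inPartNo) m_partNo += data;  // data may arrive in pieces
    }
    void OnXmlEndElement(const DemoElement& e) override {
        if (e.name == "PARTNO") {
            m_inPartNo = false;
        } else if (e.name == "PART" && m_inWantedPart) {
            m_found = true;
            m_inWantedPart = false;
        }
    }
    bool OnXmlAbortParse(const DemoElement&) override { return m_found; }

    const std::string& PartNo() const { return m_partNo; }

private:
    std::string m_wantedId;
    std::string m_partNo;
    bool m_inWantedPart = false;
    bool m_inPartNo = false;
    bool m_found = false;
};

// One recorded parser event; used only to drive the handler without MSXML.
struct DemoEvent {
    enum Kind { Start, Data, End } kind;
    DemoElement element;
    std::string text;
};

// Delivers events in order and polls OnXmlAbortParse after each start/end
// element (an assumption). Returns the number of events delivered.
std::size_t Replay(const std::vector<DemoEvent>& events, IDemoElementHandler& handler) {
    std::size_t delivered = 0;
    int depth = 0;
    for (const DemoEvent& ev : events) {
        ++delivered;
        if (ev.kind == DemoEvent::Data) {
            handler.OnXmlElementData(ev.text, depth);
            continue;
        }
        if (ev.kind == DemoEvent::Start) {
            ++depth;
            handler.OnXmlStartElement(ev.element);
        } else {
            handler.OnXmlEndElement(ev.element);
            --depth;
            assert(depth >= 0);  // start/end events must be balanced
        }
        if (handler.OnXmlAbortParse(ev.element)) break;
    }
    return delivered;
}

// Abbreviated event sequence for test.xml (DESCRIPTION elements omitted).
std::vector<DemoEvent> SampleEvents() {
    const DemoElement root{"root", {}}, parts{"PARTS", {}}, partNo{"PARTNO", {}};
    const DemoElement abc{"PART", {{"ID", "ABC"}, {"Tag", "cab"}}};
    const DemoElement xyz{"PART", {{"ID", "XYZ"}, {"Tag", "zxy"}}};
    return {
        {DemoEvent::Start, root, ""},       {DemoEvent::Start, parts, ""},
        {DemoEvent::Start, abc, ""},        {DemoEvent::Start, partNo, ""},
        {DemoEvent::Data, partNo, "12345"}, {DemoEvent::End, partNo, ""},
        {DemoEvent::End, abc, ""},
        {DemoEvent::Start, xyz, ""},        {DemoEvent::Start, partNo, ""},
        {DemoEvent::Data, partNo, "5678"},  {DemoEvent::End, partNo, ""},
        {DemoEvent::End, xyz, ""},
        {DemoEvent::End, parts, ""},        {DemoEvent::End, root, ""},
    };
}
```

Constructing PartFinder("ABC") and passing it to Replay(SampleEvents(), ...) stops after the first PART closes, so the second record is never delivered. With the real wrapper, the same handler logic would live in a class derived from IXmlElementHandler and be registered through CXmlParser::AttachElementHandler().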

Validation

Before discussing the test application, I will backtrack a bit and give some background information on validating XML files using SAX2. As of MSXML 4.0, SAX2 supports validation using XML schemas as defined in XSD files. An XML schema is like a grammar for describing XML instance documents. As an analogy, I like to think of it as a blueprint that tells you how to build a house and also allows you to check whether you've built it correctly. For example, for the test.xml file listed earlier, you can create a corresponding test.xsd file that defines a schema which can be used to detect errors such as someone using <PARTNUM> instead of <PARTNO> in the XML file, or <DESC> instead of <DESCRIPTION>. The SAX2 reader reports validation errors during parsing through the normal mechanism. You just need to attach an error handler that implements ISAXErrorHandler as discussed earlier.

By default, the SAX2 reader does not perform validation during parsing. To enable validation, you must turn on a "feature", which is like a boolean property that controls a particular parsing option. Using ISAXXMLReader directly:

HRESULT hr = m_reader->putFeature(L"schema-validation", VARIANT_TRUE);

Correspondingly, in the CXmlParser wrapper class, you can set features using the SetParserFeature() method.

bool result = m_xmlParser->SetParserFeature("schema-validation", true);

Once validation is enabled, the SAX reader must be able to find the XSD file corresponding to the XML file being parsed. There are two ways to do this. The first, and easiest way, is to have the XML file reference the location of the schema file. For example, the books.xml file from MSDN examples on validation contains an attribute specification near the top of the file that specifies books.xsd as the schema file to use.

xsi:schemaLocation="urn:books books.xsd"

Finally, there is another SAX reader feature, "use-schema-location", which needs to be enabled in order to validate using the schema location. This feature is enabled by default, but it can also be set explicitly:

bool result = m_xmlParser->SetParserFeature("use-schema-location", true);

The second way to associate an XML file with an XSD file for validation is to use a schema cache. A schema cache is basically a container of XSD file paths, each indexed by a key - the XML namespace. Take a look again at the schemaLocation attribute I showed earlier. The string "urn:books" is actually the namespace associated with the books.xsd schema file. When you add a new XSD path to the schema cache, you need to specify the namespace to use. The code below shows how the schema cache is created and associated with the SAX reader:

void CXmlParserImpl::CreateSchemaCache()
{
    if ( m_reader == NULL )
        return;

    HRESULT hr = CoCreateInstance(
        __uuidof(XMLSchemaCache40), 
        NULL, 
        CLSCTX_ALL, 
        __uuidof(IXMLDOMSchemaCollection2), 
        (void **)&m_schemaCache);

    if ( SUCCEEDED(hr) )
    {
        // Set the "schemas" property in the reader in order
        // to associate the schema cache with the reader.
        hr = m_reader->putProperty(L"schemas", 
                             _variant_t(m_schemaCache));
        if ( FAILED(hr) )
        {
            OutputDebugString("CXmlParserImpl::Create" 
                "SchemaCache(): putProperty(\"schemas\",...) failed\n");
        }
    }
}

Once the schema cache is created and registered with the SAX reader, you can add a schema:

hr = m_schemaCache->add(wszNamespaceURI, _variant_t(xsdPath.c_str()));

Or, remove a schema:

hr = m_schemaCache->remove(wszNamespaceURI);

Note that if you add the wrong schema (e.g., you specify a namespace that is not used by your XML file), validation won't work properly, even if you have the "schema-validation" feature enabled in the SAX reader. In this case, the SAX reader will report an error when it finishes parsing the root element: "Validate failed because the root element had no associated DTD/Schema". If an XML file does not use namespaces, you can use an empty string ("") for the namespace when adding/removing schemas. A proper validation error reported by the SAX reader looks something like: "Element content is invalid according to the DTD/Schema. Expecting: ...".
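The role of the namespace key can be illustrated with a small conceptual model. This is only a sketch of the idea of a namespace-indexed container, not how XMLSchemaCache40 is implemented; DemoSchemaCache and ResolveSchema() are hypothetical names, and the replace-on-add behavior is a choice made for this example.

```cpp
#include <cassert>
#include <map>
#include <string>

// Conceptual model only: maps a namespace URI to the path of its XSD file.
class DemoSchemaCache {
public:
    // Registers (or replaces) the schema for a namespace. An empty
    // namespace string is a valid key for documents without namespaces.
    void Add(const std::string& namespaceURI, const std::string& xsdPath) {
        assert(!xsdPath.empty());
        m_schemas[namespaceURI] = xsdPath;
    }

    // Returns true if a schema was registered for the namespace and removed.
    bool Remove(const std::string& namespaceURI) {
        return m_schemas.erase(namespaceURI) > 0;
    }

    // Looks up the schema registered for a namespace.
    bool Find(const std::string& namespaceURI, std::string& xsdPath) const {
        auto it = m_schemas.find(namespaceURI);
        if (it == m_schemas.end()) return false;
        xsdPath = it->second;
        return true;
    }

private:
    std::map<std::string, std::string> m_schemas;
};

// Mimics the situation described above: validation can only proceed when
// the namespace of the document's root element has a registered schema.
std::string ResolveSchema(const DemoSchemaCache& cache, const std::string& rootNamespace) {
    std::string path;
    if (!cache.Find(rootNamespace, path)) return "no schema for root element";
    return path;
}
```

In this model, adding books2.xsd under "urn:books" and then parsing a document whose root element has no namespace finds nothing, which corresponds to the "root element had no associated DTD/Schema" error even though the add itself succeeded.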

The TestXmlSupport application

The TestXmlSupport application demonstrates the use of the CXmlParser wrapper class. It's a dialog-based MFC application that allows you to choose an XML file, initiate a parse, and see the parsing results appear in a list box. The parsing results consist of "Log" messages and a printout of each XML element as it is received by the class that implements IXmlElementHandler. The article's screenshot shows the application after it has parsed an RSS feed from the CNN website.

The dialog has a validation section that allows you to experiment with various parser options. The Enable exhaustive errors checkbox corresponds to the "exhaustive-errors" feature. When enabled, the SAX reader will continue parsing even if it has found an error. This allows you to receive all of the errors in the file instead of just the first one encountered. The other two checkboxes correspond to the "schema-validation" and "use-schema-location" features which were discussed earlier. Below the row of checkboxes is a set of controls for adding and removing schemas. To use the controls, you type in a namespace string, select your XSD file, and then press the Add button to add the schema to the parser. A log message in the results list box will tell you whether the operation succeeded or not. Note that it is possible to leave the namespace field empty if your XML file does not use namespaces. However, I will mention again that even if the add operation succeeds, if the namespace you entered (blank or otherwise) does not actually match what your XML file is using, then validation will not work properly.

The bottom half of the dialog contains a list box that displays parsing results, along with options for controlling the parse. There is an edit box that allows you to specify a parsing delay in milliseconds. This tells the application to pause for a short period of time after each start element is encountered. Using a value of 500 milliseconds, for example, will slow down the parsing enough that the user has time to abort the parse. Otherwise, the parse will likely complete before you can press the Abort button. The Clear button clears the contents of the results list box.

Instead of embedding a lot of logic into the dialog class, I've moved most of its functionality into a helper class, CXmlTester. This is actually the class that uses the CXmlParser wrapper and also handles the XML parser events by implementing the IXmlElementHandler interface. For example, here is the CXmlTester implementation of the virtual OnXmlStartElement() method:

void CXmlTester::OnXmlStartElement(const CXmlElement& xmlElement)
{
    // Update results window.
    if ( m_resultsWnd != NULL )
    {
        m_resultsWnd->InsertString(-1, xmlElement.ToString().c_str());
    }

    // Check if we need to delay the parsing in between start elements.
    if ( m_parsingDelay > 0 )
    {
        // By pumping messages, we avoid freezing the GUI.
        // This gives the appearance of being multithreaded.
        PumpWindowsMessages();

        // Sleep for a duration in milliseconds.
        ::Sleep(m_parsingDelay);
    }
}

In the TestXmlSupport project, if you go to the FileView in the workspace window within VC6, you can see I've added a folder to the project called "XML Files". Here I've inserted some sample XML/XSD files that you can try out using the test application. These files are slightly modified versions taken from the MSDN documentation:

  • test.xml: Models a parts catalog.
  • books.xml/books.xsd: Models a catalog of books. Uses the default namespace (empty string). I've modified this from the MSDN original by changing the element name of the last book record to "mybook". This makes the XML file invalid according to books.xsd, so I can test that validation is working and that it detects the error I introduced.
  • books2.xml/books2.xsd: Models a set of books. Uses the namespace "urn:books". The XML file is also invalid according to the XSD file, so it can be used to test validation as well. The XML file uses the schemaLocation attribute.

Building an application model

Although CXmlTester exercises all of the methods in the CXmlParser wrapper class, it's not a practical example of why you want to parse an XML file in the first place. Typically, you want to do more than just re-display the XML file in a list box, or log parser events. Thus, I've provided a second class, CBookCatalog, in the test application which models a catalog of books. This class also uses CXmlParser in order to build up a collection of books and is designed to work with the books.xml file. In the dialog, when the user presses the Parse button, the CXmlTester instance is used to perform an initial parse. If that parse completes successfully, a CBookCatalog instance will attempt to build its catalog from the same XML file (thus parsing a second time). If the proper books.xml file was selected, the catalog should have 11 books in total. If not, the catalog will remain empty. If the catalog is built successfully, you can search for a particular book in the catalog by entering a book ID in the dialog and then pressing the Find Book button. The search result will be displayed in a simple message box.
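The general idea of building a model from parser events can be sketched without the wrapper library. The following is not the CBookCatalog source: DemoBook, DemoBookCatalog and BuildSampleCatalog() are hypothetical, the callbacks take plain strings instead of CXmlElement, and the element and attribute names ("book", "id", "author", "title") are assumed from the MSDN books.xml sample.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical record type; the real CBookCatalog may store more fields.
struct DemoBook {
    std::string id;
    std::string author;
    std::string title;
};

// Builds a catalog from start/data/end events using a small state machine.
class DemoBookCatalog {
public:
    // 'id' is the value of the element's id attribute, or "" if absent.
    void OnStartElement(const std::string& name, const std::string& id) {
        m_text.clear();  // discard whitespace seen before this element
        if (name == "book") {
            assert(!m_inBook);  // book records are not expected to nest
            m_current = DemoBook();
            m_current.id = id;
            m_inBook = true;
        }
    }

    // Character data may arrive in several pieces, so accumulate it.
    void OnElementData(const std::string& data) { m_text += data; }

    void OnEndElement(const std::string& name) {
        if (!m_inBook) return;  // ignore anything outside a <book> record
        if (name == "author") {
            m_current.author = m_text;
        } else if (name == "title") {
            m_current.title = m_text;
        } else if (name == "book") {
            m_books[m_current.id] = m_current;
            m_inBook = false;
        }
        m_text.clear();
    }

    // Returns the book with the given ID, or nullptr if it is not present.
    const DemoBook* Find(const std::string& id) const {
        auto it = m_books.find(id);
        return it == m_books.end() ? nullptr : &it->second;
    }

    std::size_t Size() const { return m_books.size(); }

private:
    std::map<std::string, DemoBook> m_books;
    DemoBook m_current;
    std::string m_text;
    bool m_inBook = false;
};

// Feeds three sample records; the last uses "mybook" and is therefore skipped.
DemoBookCatalog BuildSampleCatalog() {
    DemoBookCatalog catalog;
    const char* records[][4] = {
        {"book", "bk101", "Gambardella, Matthew", "XML Developer's Guide"},
        {"book", "bk102", "Ralls, Kim", "Midnight Rain"},
        {"mybook", "bk112", "Galos, Mike", "Visual Studio 7: A Comprehensive Guide"},
    };
    for (const auto& r : records) {
        catalog.OnStartElement(r[0], r[1]);
        catalog.OnStartElement("author", "");
        catalog.OnElementData(r[2]);
        catalog.OnEndElement("author");
        catalog.OnStartElement("title", "");
        catalog.OnElementData(r[3]);
        catalog.OnEndElement("title");
        catalog.OnEndElement(r[0]);
    }
    return catalog;
}
```

Because the handler only reacts to element names it recognizes, a record with an unexpected name is silently left out of the catalog in this sketch, which is one reason to validate against a schema before building the model.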

Wide-character string support

In my initial posting of the demo source code, the classes exported by the wrapper layer used char-based strings (e.g., std::string). In other words, there is a conversion that takes place in the wrapper layer from SAX strings (which are wide-character) to std::string. While this can be convenient for applications which are not using Unicode strings, feedback from CodeProject members correctly pointed out that this is non-standard and that, to be generic, the wrapper should support wide-character strings. Thus, I've added an alternate set of classes to the wrapper layer based on std::wstring. To avoid duplication of code, I used templates to parameterize the two XML data types:

// Typedefs for template specializations.
// Client code uses these typedefs instead of
// using CBasicXmlAttribute directly.
typedef CBasicXmlAttribute<char>    CXmlAttribute;
typedef CBasicXmlAttribute<wchar_t> CWXmlAttribute;

// Typedefs for template specializations.
// Client code uses these typedefs instead of
// using CBasicXmlElement directly.
typedef CBasicXmlElement<char>    CXmlElement;
typedef CBasicXmlElement<wchar_t> CWXmlElement;
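The body of CBasicXmlAttribute is not listed in this article, so the following is only a guess at what a class parameterized on the character type might look like. DemoBasicXmlAttribute and its members are hypothetical names chosen for this sketch.

```cpp
#include <cassert>
#include <string>

// Hypothetical attribute class parameterized on the character type, in the
// spirit of CBasicXmlAttribute (whose real members are not shown here).
template <typename CharT>
class DemoBasicXmlAttribute {
public:
    typedef std::basic_string<CharT> StringType;

    DemoBasicXmlAttribute(const StringType& name, const StringType& value)
        : m_name(name), m_value(value) {
        assert(!m_name.empty());  // attribute names are never empty in well-formed XML
    }

    const StringType& GetName() const { return m_name; }
    const StringType& GetValue() const { return m_value; }

    // Renders the attribute as name="value". Only single characters are
    // converted to CharT, so the same code serves char and wchar_t.
    StringType ToString() const {
        StringType out(m_name);
        out += CharT('=');
        out += CharT('"');
        out += m_value;
        out += CharT('"');
        return out;
    }

private:
    StringType m_name;
    StringType m_value;
};

// Client code would use the typedefs rather than the template directly.
typedef DemoBasicXmlAttribute<char>    DemoXmlAttribute;
typedef DemoBasicXmlAttribute<wchar_t> DemoWXmlAttribute;
```

The design choice this illustrates is that the class body is written once against std::basic_string<CharT>, and the two typedefs select the narrow or wide variant at compile time.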

I also added a new element handler interface, IWXmlElementHandler, which is used in conjunction with the wide-character version of the parser wrapper class, CWXmlParser. These changes do not affect the test application that I provided originally as none of the original class names were changed. However, in order to test the new parser wrapper, CWXmlParser, I decided to write a simple console application, TestConsole, that parses the books2.xml file a couple of times and prints out the elements as they are received.
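For readers wondering what a wide-to-narrow conversion involves, below is a portable sketch that converts a std::wstring to a UTF-8 encoded std::string. How the wrapper actually narrows SAX strings is not shown in this article, so this should not be read as its implementation; on Windows, the WideCharToMultiByte() API is a common alternative. NarrowToUtf8() and AppendUtf8() are hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Appends one Unicode code point to 'out' using UTF-8 encoding.
void AppendUtf8(std::string& out, std::uint32_t cp) {
    assert(cp <= 0x10FFFF);
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Converts a wide string to UTF-8. Works whether wchar_t holds UTF-16 code
// units (as on Windows) or full code points (as on most Unix systems).
// Unpaired surrogates are encoded as-is; a stricter converter would reject them.
std::string NarrowToUtf8(const std::wstring& wide) {
    std::string out;
    for (std::size_t i = 0; i < wide.size(); ++i) {
        std::uint32_t cp = static_cast<std::uint32_t>(wide[i]);
        // Combine a UTF-16 surrogate pair into a single code point.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < wide.size()) {
            const std::uint32_t low = static_cast<std::uint32_t>(wide[i + 1]);
            if (low >= 0xDC00 && low <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
                ++i;
            }
        }
        AppendUtf8(out, cp);
    }
    return out;
}
```

Plain ASCII names such as "PARTNO" come through unchanged, which is why a char-based interface is often good enough for element and attribute names, while element data in other scripts is where the wide-character classes matter.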

Summary

The wrapper framework presented here can be a starting point for adding XML parsing support to your own applications. I want to make it clear, though, that this is not a generic solution for all cases, as I have only wrapped a small subset of the SAX2 API. If you have more generic requirements, perhaps this article can help you come up with a better wrapper design. Or, if you are relatively new to the MSXML SDK, you can use this work to help you ramp up more quickly than by relying solely on the MSDN documentation. Another goal of my article was to provide an example of how to structure and organize classes for reusability. For example, the wrapper approach should help in applications where you need to parse more than one type of XML file. Future areas to look at include the MXXMLWriter COM class in SAX2, which provides for XML writing/generation.

References

  • MSDN Technical Article: "JumpStart for Creating a SAX2 Application with C++"

History

  • July 23, 2005
    • Initial revision.
  • July 24, 2005
    • Updated summary with some clarifications based on feedback from Mihai Nita.
  • July 29, 2005
    • As per my discussion with Martin Friedrich, added support for wide-character strings to the library. The existing dialog application is unchanged. Also added a new console application (TestConsole) for exercising the alternate set of wrapper classes.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

nschan
Web Developer
Canada

Comments and Discussions

 
Re: Some notes (Nemanja Trifunovic, 4-Aug-05 1:25)

nschan wrote:
It also sounds like it might be useful to define a generic C++ interface for an XML parser that is events-based, thus allowing one to swap parser implementations dynamically (from MSXML to other, etc).

There is Arabica.
Last Updated 4 Aug 2005
Article Copyright 2005 by nschan
Everything else Copyright © CodeProject, 1999-2014