Introduction
The Expat XML Parser is a fine and widely used event-based XML parser. One of
the nicer features of Expat is that its API can be used from C programs. Even
though many programmers use Expat in a C++ environment, the C-based API makes
it easy to export this API from a DLL.
However, the fact that Expat has a C-based API doesn't mean we have to live
without our C++ classes. Luckily, Expat was designed with the ability to be
augmented with classes.
(Definition: Event Based XML Parser - An XML parser
which invokes methods (a.k.a. events) when XML constructs are parsed.
This differs from the DOM (Document Object Model) style parsers that parse the
XML and then present the application with XML data in its logical hierarchical
format.)
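To make the event-based style concrete, here is a minimal sketch. It is not Expat and not a real XML parser: ScanEvents and EventSink are hypothetical names, and the scanner only understands plain <name> and </name> tags with text between them. The point is only that the scanner calls back into the application as each construct is found, instead of building a tree first.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical receiver of parse events (stand-in for an application).
struct EventSink
{
    std::vector <std::string> log;
    void OnStartElement (const std::string &name) { log.push_back ("start:" + name); }
    void OnEndElement (const std::string &name) { log.push_back ("end:" + name); }
    void OnCharacterData (const std::string &text) { log.push_back ("data:" + text); }
};

// Toy scanner: handles only <name>, </name>, and text. No attributes,
// comments, entities, or well-formedness checks. Returns false if a
// tag is opened with '<' but never closed with '>'.
inline bool ScanEvents (const std::string &xml, EventSink &sink)
{
    std::string::size_type pos = 0;
    while (pos < xml.size ())
    {
        if (xml [pos] == '<')
        {
            std::string::size_type close = xml.find ('>', pos);
            if (close == std::string::npos)
                return false;
            std::string tag = xml.substr (pos + 1, close - pos - 1);
            if (!tag.empty () && tag [0] == '/')
                sink.OnEndElement (tag.substr (1));
            else
                sink.OnStartElement (tag);
            pos = close + 1;
        }
        else
        {
            // Everything up to the next '<' is reported as character data.
            std::string::size_type next = xml.find ('<', pos);
            if (next == std::string::npos)
                next = xml.size ();
            sink.OnCharacterData (xml.substr (pos, next - pos));
            pos = next;
        }
    }
    return true;
}
```

A DOM style parser would instead return a tree once the whole input had been read, and the application would walk that tree afterwards.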
Design Rationale
The primary considerations when designing the Expat wrapper classes were
completeness, simplicity, and extensibility. For completeness, almost all
Expat API routines have been wrapped in the classes, even routines such as
XML_ExpatVersionInfo. For simplicity, the wrapper classes only wrap the Expat
API and provide no other features. For extensibility, the wrapper classes make
it easy to derive new classes that provide enhanced functionality.
Basics
The Expat wrappers consist of two classes: a template-based class
(CExpatImpl <class _T>) and a virtual-function-based class
(CExpat). Each class has features that lend themselves to specific
solutions.
The following table illustrates the relationship between the API and the two
classes.
| CExpat |
| CExpatImpl <class _T> |
| Expat C API |
The template class CExpatImpl <class _T> provides the base layer of
translation between C++ and the Expat C API. The benefit of the template
design is that if the application only needs a few of the Expat event
routines, then the code for the unused event routines is not compiled into the
final executable. Admittedly, the amount of space saved is minimal, but why
waste it?
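The mechanism behind this is commonly called the curiously recurring template pattern. The sketch below is not the actual CExpatImpl source: ParserBase, FireStartElement, FireEndElement, and CountingParser are hypothetical stand-ins, and TDerived plays the role of _T. It shows a base template calling a method of the derived class through a static_cast, with no virtual functions involved, and relies on the language rule that a member function of a class template is only instantiated if it is used.

```cpp
#include <cassert>

// Hypothetical stand-in for CExpatImpl <class _T>.
template <class TDerived>
class ParserBase
{
public:
    // Called when a start tag is seen. The cast is resolved at compile
    // time, so the call binds directly to TDerived::OnStartElement.
    void FireStartElement (const char *pszName)
    {
        static_cast <TDerived *> (this) ->OnStartElement (pszName);
    }

    // Never called in this example, so the compiler never instantiates
    // it, even though CountingParser has no OnEndElement method.
    void FireEndElement (const char *pszName)
    {
        static_cast <TDerived *> (this) ->OnEndElement (pszName);
    }
};

// Only needs start element events, so only that handler exists.
class CountingParser : public ParserBase <CountingParser>
{
public:
    CountingParser () : m_nStarts (0) {}
    void OnStartElement (const char *) { ++m_nStarts; }
    int m_nStarts;
};
```

Calling FireStartElement on a CountingParser increments m_nStarts. Adding a call to FireEndElement would fail to compile until CountingParser gained an OnEndElement method.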
The CExpat class is derived from the CExpatImpl <class _T> template
class. However, excluding the default constructor, the only methods
contained within this class are the event methods, declared as virtual
functions. CExpat is intended for situations where virtual functions are
preferable to templates.
Within reason, the two classes are interchangeable. A class derived from
CExpat could easily be modified to use CExpatImpl <class _T>, or vice versa,
without having to modify any other source. See the "Implementation Notes" for
more information about some implementation pitfalls with regard to more complex
derived classes.
For the rest of this document, only the CExpatImpl <class _T> class will
be discussed. As stated previously, the two wrapper classes are almost
100 percent interchangeable. Documenting both would be redundant.
Getting Started
The first step in using CExpatImpl <class _T> is deriving a new class that
will provide the application-specific implementation. Deriving a class is
required. As with Expat itself, a parser with no event routines would only
verify that the XML is well formed.
As a starting point, let us define an XML parser that reports when an
element begins, when it ends, and when character data is found within it.
class CMyXML : public CExpatImpl <CMyXML>
{
public:
    CMyXML ()
    {
    }

    void OnPostCreate ()
    {
        EnableStartElementHandler ();
        EnableEndElementHandler ();
        EnableCharacterDataHandler ();
    }

    void OnStartElement (const XML_Char *pszName, const XML_Char **papszAttrs)
    {
        printf ("We got a start element %s\n", pszName);
        return;
    }

    void OnEndElement (const XML_Char *pszName)
    {
        printf ("We got an end element %s\n", pszName);
        return;
    }

    void OnCharacterData (const XML_Char *pszData, int nLength)
    {
        printf ("We got %d bytes of data\n", nLength);
        return;
    }
};
The CMyXML::OnPostCreate method will be invoked by CExpatImpl <class _T>
after the Expat parser has been created. This provides an easy method of
enabling event routines. The CMyXML::OnStartElement,
CMyXML::OnEndElement, and CMyXML::OnCharacterData methods will be invoked by
Expat while the XML text is being parsed. These routines will not be
invoked unless they are enabled. The code inside CMyXML::OnPostCreate
enables the three event routines.
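The handlers above only print. As a sketch of what slightly more useful handler bodies might look like, the class below collects an element's attributes and text. TextCollector is a hypothetical name, and it is written as a plain class with a stand-in XML_Char typedef so that it compiles without Expat; in real use the methods would live in a class derived from CExpatImpl <class _T>. It relies on two Expat conventions: the attribute array alternates names and values and ends with a NULL pointer, and character data is not NUL terminated and may arrive in more than one call.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Stand-in so the sketch compiles without expat.h. Expat defines
// XML_Char itself (char in the default build).
typedef char XML_Char;

// Hypothetical handler bodies, shown outside of any wrapper class.
class TextCollector
{
public:
    void OnStartElement (const XML_Char * /*pszName*/, const XML_Char **papszAttrs)
    {
        m_strText.clear ();
        m_mapAttrs.clear ();
        // Attributes arrive as name, value, name, value, ..., NULL.
        for (int i = 0; papszAttrs [i] != NULL; i += 2)
            m_mapAttrs [papszAttrs [i]] = papszAttrs [i + 1];
    }

    void OnCharacterData (const XML_Char *pszData, int nLength)
    {
        // pszData is not NUL terminated, and one run of text may be
        // split across several calls, so append using the length.
        m_strText.append (pszData, nLength);
    }

    std::map <std::string, std::string> m_mapAttrs;
    std::string m_strText;
};
```

Appending rather than assigning in OnCharacterData matters because a single run of text can be reported in pieces.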
Creating a Parser
Now that we have a derived class, we can use it to create an Expat parser.
Creating the parser is very easy. First create an instance of the parser
class, then invoke the Create method.
The Create method has two arguments: the document encoding and the character
used to separate a namespace from a name. The encoding is the default
encoding that will be used while parsing the XML document unless an encoding is
specified in the XML document itself. The namespace separator is used
to separate the namespace from the name in calls such as OnStartElement.
For example, if the XML document contained the name SOAP_ENC:Envelope, the
SOAP_ENC prefix was defined as "http://schemas.xmlsoap.org/soap/envelope/",
and "#" was specified to Create, then OnStartElement would be invoked with the
string "http://schemas.xmlsoap.org/soap/envelope/#Envelope".
bool ParseSomeXML (LPCTSTR pszXMLText)
{
    CMyXML sParser;
    // Create returns false if the parser could not be created
    return sParser .Create ();
}
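When a namespace separator is in use, an event routine often wants the two halves of the name back. The helper below is a small sketch of one way to do that. SplitExpandedName is a hypothetical function, not part of the wrapper classes or of Expat, and it assumes the separator was chosen so that it cannot occur in an element name ("#" is such a character).

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical helper: splits "uri<sep>local" into (uri, local).
// A name with no separator is not in a namespace, so the URI part
// comes back empty.
inline std::pair <std::string, std::string>
SplitExpandedName (const std::string &strName, char chSeparator)
{
    // Search from the end: the local name cannot contain the
    // separator, but the namespace URI might.
    std::string::size_type nPos = strName.rfind (chSeparator);
    if (nPos == std::string::npos)
        return std::make_pair (std::string (), strName);
    return std::make_pair (strName.substr (0, nPos), strName.substr (nPos + 1));
}
```

For the SOAP example above, the first half is the namespace URI and the second half is "Envelope".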
Parsing a Simple Text String
Next, we actually need to send the XML document to the parser. There are
two different methods of sending the document to the XML parser, directly or by
internal buffers. The easier of the two is sending the data
directly to the parser. However, it is also just a bit slower.
To send a simple string to the parser, the application invokes the Parse
(LPCTSTR pszBuffer, int nLength = -1, bool fIsFinal = true) method. The
first argument is a pointer to the string of data to be parsed. A routine
has been defined for both ANSI and UNICODE strings. The second parameter
is the length of the string in characters (char or wchar_t depending on ANSI or
UNICODE). If nLength is less than zero, then the string pointed to by
pszBuffer must be NUL terminated, and the length will be determined from the
string. If nLength is greater than or equal to zero, then the string need not
be NUL terminated, and the length shouldn't include the NUL character if it
exists. The third parameter lets the XML parser know when there is no more
data. If the whole XML document can be contained within one simple string,
then fIsFinal can be set to true on the first call. Otherwise, fIsFinal should
remain false while there is more data to be parsed. Once all data has been
read in, Parse can be invoked with nLength set to zero and fIsFinal set to
true.
bool ParseSomeXML (LPCTSTR pszXMLText)
{
    CMyXML sParser;
    sParser .Create ();
    return sParser .Parse (pszXMLText);
}
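When the document does not arrive in one piece, the fIsFinal argument does the work. The following is a sketch of feeding a document in fixed-size pieces. ParseInChunks and RecordingParser are hypothetical names; the function is a template so it works with any class that has a Parse (const char *, int, bool) method with the meaning described above, which is assumed to include a CExpatImpl-derived class in an ANSI build. RecordingParser only records the calls so the pattern can be seen without Expat.

```cpp
#include <cassert>
#include <cstring>
#include <vector>

// Feeds pszText to the parser in pieces of at most nChunk characters.
template <class TParser>
bool ParseInChunks (TParser &sParser, const char *pszText, int nChunk)
{
    if (nChunk <= 0)
        return false;
    int nRemaining = (int) std::strlen (pszText);
    while (nRemaining > nChunk)
    {
        // More data follows, so fIsFinal stays false.
        if (!sParser .Parse (pszText, nChunk, false))
            return false;
        pszText += nChunk;
        nRemaining -= nChunk;
    }
    // Last piece (possibly empty): tell the parser the document is complete.
    return sParser .Parse (pszText, nRemaining, true);
}

// Hypothetical stand-in used only to show which calls are made.
struct RecordingParser
{
    std::vector <int> lengths;
    std::vector <bool> finals;
    bool Parse (const char *, int nLength, bool fIsFinal)
    {
        lengths.push_back (nLength);
        finals.push_back (fIsFinal);
        return true;
    }
};
```

Feeding a 12 character document in chunks of 5 produces three calls with lengths 5, 5, and 2, and only the last one has fIsFinal set to true.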
Parsing Using Internal Buffers
To reduce the number of extra memory copies, buffers internal to the Expat
parser can be used instead of passing data into the parser just to have the
Expat parser copy the data to internal buffers. Using internal buffers
takes three steps: requesting a buffer, reading data into the buffer, and
submitting the data to the parser.
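bool ParseSomeXML (LPCSTR pszFileName)
{
    CMyXML sParser;
    if (!sParser .Create ())
        return false;
    FILE *fp = fopen (pszFileName, "r");
    if (fp == NULL)
        return false;
    bool fSuccess = true;
    while (!feof (fp) && fSuccess)
    {
        // Step 1: request a buffer from the parser
        LPSTR pszBuffer = (LPSTR) sParser .GetBuffer (256);
        if (pszBuffer == NULL)
            fSuccess = false;
        else
        {
            // Step 2: read data directly into the parser's buffer
            int nLength = (int) fread (pszBuffer, 1, 256, fp);
            // Step 3: submit the data; the read that reaches the end of
            // the file is the final one, even if it returned some data
            fSuccess = !ferror (fp) &&
                sParser .ParseBuffer (nLength, feof (fp) != 0);
        }
    }
    fclose (fp);
    return fSuccess;
}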
As you can see, this method is more complicated than the other, but once you
modify the example in the previous section to read a file, the difference in
complexity is minimal.
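The same three steps can be applied to a C++ stream. The sketch below is one possible arrangement, not part of the wrapper classes: ParseStream and BufferRecorder are hypothetical names, and the template assumes only that the parser class exposes GetBuffer (int) returning a void pointer and ParseBuffer (int, bool), as used in the example above. BufferRecorder is a stand-in that lets the loop be exercised without Expat.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

template <class TParser>
bool ParseStream (TParser &sParser, std::istream &in, int nBufferSize = 256)
{
    for (;;)
    {
        // Step 1: request a buffer owned by the parser.
        char *pszBuffer = static_cast <char *> (sParser .GetBuffer (nBufferSize));
        if (pszBuffer == NULL)
            return false;
        // Step 2: read straight into it; gcount () is the number of
        // characters actually read.
        in.read (pszBuffer, nBufferSize);
        int nLength = static_cast <int> (in.gcount ());
        // A failure that is not simply end of input is an error.
        if (in.bad () || (in.fail () && !in.eof ()))
            return false;
        // Step 3: submit; the read that hits end of input is the final one.
        bool fIsFinal = in.eof ();
        if (!sParser .ParseBuffer (nLength, fIsFinal))
            return false;
        if (fIsFinal)
            return true;
    }
}

// Hypothetical stand-in for a parser with internal buffers.
struct BufferRecorder
{
    std::vector <char> buffer;
    std::string received;
    int nFinalCalls;
    BufferRecorder () : nFinalCalls (0) {}
    void *GetBuffer (int nSize) { buffer.resize (nSize); return buffer.data (); }
    bool ParseBuffer (int nLength, bool fIsFinal)
    {
        received.append (buffer.data (), nLength);
        if (fIsFinal)
            ++nFinalCalls;
        return true;
    }
};
```

Taking a std::istream keeps the function usable with files, string streams, and anything else that presents a stream interface.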
Working With Event Routines
Event routines provide the actual information about what has been parsed to the
application. The method names inside the CExpatImpl <class _T>
class have been selected to make it easy to know which routine applies to what
Expat event.
In Expat:
| Set the event handler routine | XML_Set[Event Name]Handler |
| Name of the event handler | Application specific |
In CExpatImpl <class _T>:
| Enable the event handler routine | Enable[Event Name]Handler |
| Name of the event handler | On[Event Name] |
| Name of the internal event handler | [Event Name]Handler |
So, if you wish to receive StartElement events, you define a method called
OnStartElement with the proper arguments and invoke EnableStartElementHandler
with true as the only argument. The event routine can later be
disabled by invoking EnableStartElementHandler again with false as the only
argument.
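To show how the three names might fit together, here is a self-contained sketch. It is not the CExpatImpl source, and it does not call Expat: FakeCParser stands in for the C side (one callback slot plus a user data pointer, which is how C callback APIs such as Expat's hand control back to the caller), and WrapperSketch and StartCounter are hypothetical classes. Enabling installs a static internal handler into the slot, disabling clears the slot, and the internal handler recovers the C++ object and forwards to On[Event Name].

```cpp
#include <cassert>
#include <cstddef>

// Stand-in for the C side: a callback slot and a user data pointer.
struct FakeCParser
{
    void *pUserData;
    void (*pfnStartElement) (void *pUserData, const char *pszName);
};

// Simulates the C parser meeting a start tag.
inline void FakeCParserEmitStart (FakeCParser &sParser, const char *pszName)
{
    if (sParser .pfnStartElement != NULL)
        sParser .pfnStartElement (sParser .pUserData, pszName);
}

template <class TDerived>
class WrapperSketch
{
public:
    WrapperSketch ()
    {
        m_sParser .pUserData = this;
        m_sParser .pfnStartElement = NULL;
    }

    // Enable[Event Name]Handler: installs or removes the internal handler.
    void EnableStartElementHandler (bool fEnable = true)
    {
        if (fEnable)
            m_sParser .pfnStartElement = &WrapperSketch::StartElementHandler;
        else
            m_sParser .pfnStartElement = NULL;
    }

    FakeCParser m_sParser;

private:
    // [Event Name]Handler: called by the C side. Recovers the C++
    // object from the user data and forwards to On[Event Name].
    static void StartElementHandler (void *pUserData, const char *pszName)
    {
        WrapperSketch *pBase = static_cast <WrapperSketch *> (pUserData);
        static_cast <TDerived *> (pBase) ->OnStartElement (pszName);
    }
};

class StartCounter : public WrapperSketch <StartCounter>
{
public:
    StartCounter () : m_nStarts (0) {}
    void OnStartElement (const char *) { ++m_nStarts; }
    int m_nStarts;
};
```

Until EnableStartElementHandler is invoked, the simulated parser has nothing to call, which matches the rule that event routines are not invoked unless they are enabled.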
The specifics of each of the event routines are beyond the scope of this
document. For more information about the events and the Expat parser
itself, see http://www.xml.com/pub/a/1999/09/expat/index.html.
Almost everything described in that article has a counterpart
of the same name in CExpatImpl <class _T>.
Implementation Notes
As stated earlier, there are some pitfalls applications will have to be aware of
when creating complex derived class hierarchies. Let us consider the example
of an XML parser consisting of two classes, CMyXMLBase and CMyXML. CMyXML
is derived from CMyXMLBase and CMyXMLBase is derived from one of the Expat
class wrappers.
Consider the case where the classes are derived from the CExpatImpl <class
_T> template class.
class CMyXMLBase : public CExpatImpl <CMyXMLBase>
{
public:
    CMyXMLBase ()
    {
    }

    void OnStartElement (const XML_Char *pszName, const XML_Char **papszAttrs)
    {
        return;
    }
};

class CMyXML : public CMyXMLBase
{
public:
    CMyXML ()
    {
    }

    void OnStartElement (const XML_Char *pszName, const XML_Char **papszAttrs)
    {
        return;
    }
};
In this case, the programmer expects CMyXML::OnStartElement to be invoked by
the Expat parser. However, due to the design of the CExpatImpl <class
_T> class, only the methods of the class specified in the template argument
list (here, CMyXMLBase) would be invoked. This is by design.
There are three different ways to fix this problem. The first would
be to declare OnStartElement as virtual in CMyXMLBase. The second
would be to derive CMyXMLBase from CExpat instead of CExpatImpl <class
_T>. The third requires changing CMyXMLBase from a
normal class to a template. This change provides CExpatImpl <class
_T> with the name of the class in which to locate the event routines.
template <class _T>
class CMyXMLBase : public CExpatImpl <_T>
{
public:
    CMyXMLBase ()
    {
    }

    void OnStartElement (const XML_Char *pszName, const XML_Char **papszAttrs)
    {
        return;
    }
};

class CMyXML : public CMyXMLBase <CMyXML>
{
public:
    CMyXML ()
    {
    }

    void OnStartElement (const XML_Char *pszName, const XML_Char **papszAttrs)
    {
        return;
    }
};
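The template arrangement can be checked with a small stand-alone sketch. ImplSketch, BaseHandler, and DerivedHandler are hypothetical stand-ins for CExpatImpl <class _T>, CMyXMLBase, and CMyXML, and the handlers return a string naming the class that ran. With the most derived class passed down as the template argument, a handler redefined in the derived class is the one invoked, and a handler the derived class does not define falls back to the base.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for CExpatImpl <class _T>.
template <class TDerived>
class ImplSketch
{
public:
    std::string DispatchStart ()
    {
        return static_cast <TDerived *> (this) ->OnStartElement ();
    }
    std::string DispatchEnd ()
    {
        return static_cast <TDerived *> (this) ->OnEndElement ();
    }
};

// The intermediate class is a template that forwards the final
// class name down to the wrapper.
template <class TDerived>
class BaseHandler : public ImplSketch <TDerived>
{
public:
    std::string OnStartElement () { return "BaseHandler"; }
    std::string OnEndElement () { return "BaseHandler"; }
};

class DerivedHandler : public BaseHandler <DerivedHandler>
{
public:
    // Replaces the base version; OnEndElement is inherited unchanged.
    std::string OnStartElement () { return "DerivedHandler"; }
};
```

Here DispatchStart reaches DerivedHandler::OnStartElement, while DispatchEnd finds the inherited BaseHandler version through ordinary name lookup.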
About the Author
Tim has been a professional programmer for way too long. He currently
works at a company he co-founded that specializes in data acquisition
software for industrial automation.