PugXML - A Small, Pugnacious XML Parser

Kristen Wegner

4.98/5 (147 votes)

Jan 12, 2003

24 min read

1413017

14219

Discussion of techniques for fast, robust, light-weight XML parsing.

Introduction

Presented is a small, fast, non-validating DOM XML parser, contained in a single header, having no dependencies other than the standard C libraries, and <iostream> (KERNEL32.DLL with WIN32). This XML parser segments a given string in situ (like strtok), performing scanning/tokenization, and parsing in a single pass. Preliminary analysis shows a best-case of 22 X86 CPU clock cycles average per input byte, and a worst-case of 108 CPU cycles, for an executable built using VC7 with the /Og /Ob2 /Oi /Ot /Oy compiler optimization flags. This may be further improved by modifying the parser to discard comments, processing instruction (PI), DOCTYPE, and CDATA sections, as well as using faster memory management routines, (notice how the Microsoft malloc, realloc and free implementations seem to be rather slow).

Why Write Another XML Parser?

Of the many XML parsers available, most present significant concerns, such as complexity, size, portability, etc. It is no small task to correctly implement XML-reading capabilities in a C or C++ application. Of the C, C++ and COM implementations one must gain familiarity with, and work with a large set of functions, structures, and/or classes. In addition, the basic function of many of these implementations is opaque, or obscured by a hairy call graph, so implementation-specific modification or optimization is not realistically possible.

As the use of XML gains ground in diverse industries, year after year, one finds people doing all kinds of bizarre things with XML that were perhaps never intended by the original working group (see for example: HumanML, XML markup for human emotions, etc.; and SuperX++, an XML-based programming language). The bottom line is that these diverse needs are not best met by one 'universal' XML parsing solution, but occasionally the requirements may indicate a customized approach.

The implementation presented here served as a study model for an XML-parsing algorithm that was later rewritten in assembly, for an embedded implementation. The intention here was to develop a lightweight, fully transparent XML parser, from which the essential process of parsing XML markup could be examined, and then pared down as needed. The source code and description may be used for didactic purposes, as a drop-in solution for light-weight DOM-style XML parsing in your own application, or may be used as a model or starting point for developing your own custom parser.

Approach

The PugXML parser performs string scanning, tokenization, parsing, and construction of the document tree structure in a single pass. A common approach in designing a parser would be to scan and tokenize the string in an initial pass, build the parse tree structure in the next pass, then validate in additional passes. While this approach has many advantages in terms of simplicity of implementation, error detection/tolerance/recovery, etc., it can easily result in a larger, slower, more opaque implementation, as the construction and maintenance of a token table requires additional memory and instruction steps. Eliminating the use of the tokenization pass, while significantly smaller, and faster, results in code that can be difficult to follow, as some goto statements are needed.

In our approach we avoid traversing the tree as it is being built, in order to improve performance, although this has certain drawbacks. The XML format is cleverly designed with redundancy and error-correction features. In developing a single-pass XML parser we rely heavily on these in order to minimize look-ahead and look-back scanning of the input string. A brief examination of the nature of XML markup will permit us to address this approach in further detail. The structure of an XML document is that of a tree, where each tree node or branch may have certain associated data. In our implementation, each branch may be one of the following standalone or hierarchical types:

Standalone Types

1.)  <!ATTLIST ...>
2.)  <!ELEMENT ...>
3.)  <!ENTITY ...>
4.)  <!NOTATION ...>
5.)  <?name ... ?>
6.)  <name ... />
7.)  <!--text-->
8.)  <![CDATA[text]]>
9.)  <![INCLUDE[text]]>
10.) text

Hierarchical

11.)  <!DOCTYPE [...]>
12.) <name ...>...</name>

Representation

All branches, and their attributes are represented using the structures:

XMLATTR is a name=value attribute structure. The *_insitu (in situ) flags keep track of whether the string is a segment of the original parse string (this helps with garbage collection).
XMLENTITY specifies the kind of branch, as in the above discussion of "standalone" and "hierarchical" types.
XMLBRANCH is a the branch itself, which may have zero or more attributes (pointers to XMLATTR structures), zero or more children (pointers to XMLBRANCH structures), an element name, data, and a classification. The *_space scalars keep track of how much pointer space is available in the array.

During the parsing process we maintain a cursor to our current branch (like a stack). Upon hitting a standalone region, we just append a new branch as a child of the cursor branch. Upon hitting an hierarchical region, we append a new branch as a child of the cursor branch, then set our cursor to this new branch (like a stack push). On leaving a hierarchical region (e.g. '</name>'), we set our cursor to its parent (like a stack pop). To save time, and avoid traversing the tree as we build it (such as verifying that we are popping the right branch), we ignore the end-tag name (e.g. '</name>'), and just step to the parent of the current branch. This should be OK if the document is well-formed, but may cause problems otherwise (for example, parsing HTML). Using this approach, the following malformed structure:

<example>
	<foo>Oops!</bar>
	<bar>Ouch!</foo>
	<strange>Huh?</>
	<goo><gaa></goo></gaa>
</example>

would parse as:

<example>
        <foo>Oops!</foo>
        <bar>Ouch!</bar>
        <strange>Huh?</strange>
        <goo><gaa /></goo>
</example>

This seems a half-way usable interpretation, however parsers checking for well-formedness would choke on it. Because this is not compliant behavior, the PugXML parser, and parsers taking a similar approach would be indicated for situations requiring the parsing of machine-generated, well-formed fully-compliant XML documents, and not human-generated documents. See W3C Compliance below.

The Inner Loop

In developing the PugXML parser, the intention was to try to condense everything into a single function with a single loop, removing all non-essential operations (within a reasonable margin). The PugXML parser inner loop may be understood most easily using the finite state machine (FSM) formalism, as depicted in the simplified diagram below. The parser functions as a giant while() loop on the condition that a string zero-terminator is not hit. Within this loop a number of states are possible. Depending upon the current state:

A new node is appended to the current node's list of children and the cursor is moved to this position (::GraftBranch()).
A new attribute is appended to the current node's list of attributes (::AddAttribute()).
The starting position of a string segment is stored in the current context (e.g. a node or attribute), then the end position is located and zero-terminated.
The cursor is moved to the current node's parent (pCursor = pCursor->parent).
Any time we hit a sequence we don't like, we continue, in which case we obliviously gobble up the string until we enter another valid segment. Note: if the document is very badly formed, we may end up rejecting the entire input.

Each vertex in the diagram below represents a character, or set of characters we expect to encounter (see the inline static BOOL Is*(TCHAR c) member functions of the CPugXmlParser class). Edges between vertices represent control flow as each successive character, or classified segment of characters is matched. The stippled arrows signify that you can get to the indicated vertex from just about anywhere.

Scanner/Parser State Diagram

Legend

#	Symbol, e.g.: `'<symbol '` -or- `'<... symbol='` -or- `'<... ...=symbol '`
$	Data segment; e.g. `'>data<'` -or- `'<... ...="data"'` -or- `'<![CDATA[data]]>'`
"	`'` -or- `"`
_	Whitespace
0	Null character
!	All other chars match themselves

String Segmentation Process

Below is a graphical depiction and narration of the incremental string segmentation process for a sample XML string, with reference to the above diagram. Each numbered rectangle represents the state of the input string as the parsing proceeds.

For the following string:

<example version="1.0">Text</example>

segmentation would proceed:

String Segmentation Process

such that, at:

A '<', then a symbol char is hit (see ::IsSymbol()) so we are now in the '<#' state. A new node is grafted onto the tree (pCursor= ::GraftBranch()). The position is stored in the LPTSTR XMLBRANCH::name member.
Whitespace is hit, so the parser is now in the '<#_' state. The character at this offset is set to null. We expect to hit an attribute, but we could just as well hit a '/' or a '>'.
A symbol char is hit so we assume an attribute ('<#_#') state. An attribute is appended to the XMLATTR** XMLBRANCH::attribute member. The string position is stored in the attribute char* XMLATTR::name member.
We scan ahead for a '=', ' ', '/', or '>'. Finding an '=', we expect attribute data to follow.
Finding a '"', we remember this quote character, then store the next character position in the LPTSTR XMLATTR::value member of the current attribute.
We now scan ahead until we hit the terminating '"', then mark it as null.
We have hit a '>', and there is following non-whitespace data so we graft a new node onto the tree (pCursor= ::GraftBranch()), store the following position in the LPTSTR XMLBRANCH::data member, then restore the cursor to the current node's parent (pCursor = pCursor->parent).
We now search ahead for a '<', then mark it as null, terminating the PCDATA segment. Finding a following '/', we are in a '</' state, so we step to the current node's parent (pCursor = pCursor->parent).

resulting in the document tree:

Usage

First and foremost, because the generated document tree structure is directly indexed to the input string, you must ensure that the input string will exist for as long as you wish to traverse the tree. In addition, the string is altered by the parsing process (it is zero-terminated for every element name, attribute name, PCDATA section, etc.), so you cannot expect to use it for anything after the parsing pass is completed. If you want to alter any string segment (e.g. element name, attribute value, etc.) the wrapper class, CPugXmlBranch , will allocate memory for that segment if the resulting string will not fit in the existing space. This can result in greater memory use.

Following is a listing for a minimal application using the GnatXMLParser. Note: See the demo project for more extensive examples.

#include <iostream>

#include "pugxml.h"


using namespace std;

int _tmain(int argc, _TCHAR* argv[])
{
	if(argc < 2) return 0; //Expect a filename argument.

	CPugXmlParser* pXml = new CPugXmlParser(); //Construct.

	pXml->ParseFile(argv[1]); //Parse the file.

	cout << pXml->GetRoot() << endl; //Output the resulting tree.

	delete pXml; //Destruct.

	return 0;
}

Classes (Note: See also 'pugxml.h' or the provided manual)

The PugXML parser uses only a limited set of classes, the methods of which are all implemented inline in order to shoehorn everything into a single header. The classes you need mainly be concerned with are CPugXmlParser, which provides an interface to the underlying parser function (static LPTSTR CPugXmlParser::Parse()), and CPugXmlBranch, which provides a lightweight interface to the generated document tree. If you would like to minimize code size, cut & paste the inline functions to a new .cpp file, and remove the inline declarations. If you would like to use the basic parsing facility in a C project, dispense with the classes, and copy the static member functions of CPugXmlParser, the defines, enums, structures, and the global functions, NewAttribute(), NewBranch(), GraftBranch(), AddAttribute(), and FreeTree() to your C implementation file. The entire implementation uses only the following externs, so it should be relatively easy to port:

malloc, memset, realloc, free, memcpy.
strlen, isalnum, strncpy (or their wide equivalents via <wtypes.h> and <tchar.h>).
strtol, strtod, strtok, strcpy (ditto, for CPugXmlBranch).
CreateFile, ReadFile, CloseHandle, ZeroMemory (for CPugXmlParser::ReadFileData(), could as easily be fopen, fread, fclose).
ostream (for CPugXmlBranch::Serialize() - wanton profligacy: who needs streams, when you have handles!).

CPugXmlBranch Class - Provides a light-weight wrapper for manipulating `XMLBRANCH` structures.

Public Member	Summary

Construction/Destruction

`CPugXmlBranch()`	Default constructor.
`CPugXmlBranch(XMLBRANCH* p)`	Construct, wrapping the given `XMLBRANCH` pointer.
`CPugXmlBranch(const CPugXmlBranch& r)`	Copy constructor.
`~CPugXmlBranch()`	Destructor.

Overloaded Operators

`operator XMLBRANCH*()`	Cast as `XMLBRANCH` pointer.
`operator LPVOID()`	Cast as `void` pointer.
`CPugXmlBranch& operator=(const CPugXmlBranch& r)`	Assign internal branch pointer, with no copy.
`BOOL operator==(const CPugXmlBranch& r)`	Compare internal branch pointer.
`CPugXmlBranch operator[](UINT_PTR i)`	Access the child at subscript.

Branch Classification

`BOOL IsNull()`	Branch pointer is not null.
`BOOL IsElement()`	Branch is an element.
`BOOL IsComment()`	Branch is a comment.
`BOOL IsPCDATA()`	Branch is `PCDATA`.
`BOOL IsCDATA()`	Branch is `CDATA`.
`BOOL IsPI()`	Branch is a processing instruction.
`BOOL IsDOCTYPE()`	Branch is `DOCTYPE`. If `TRUE`, parameter symbols will be stored as name-only attributes (e.g. `SYSTEM` or `PUBLIC`), and quoted sections as value-only attributes. If an inline DTD section is present, it will be stored in the `XMLBRANCH::data` member (access this using `GetData()` or `SetData()`). You can then parse this with a fresh `CPugXmlParser` to get access to the DTD entities (see `IsDTD*()`), or set the parser flags `PARSE_DOCTYPE\|PARSE_DTD`.
`BOOL IsDTD() BOOL IsDTD_ATTLIST() BOOL IsDTD_ELEMENT() BOOL IsDTD_ENTITY() BOOL IsDTD_NOTATION()`	Branch is a DTD element, e.g. `ATTLIST`, `ELEMENT`, or `ENTITY`. If `TRUE`, the name, e.g. '`<!ATTLIST name ...>`' will be stored in the `XMLBRANCH::name` member, and all other string up to the terminating '`>`' ('`...>`' as above), will be stored in the `XMLBRANCH::data` member. An `ENTITY` name of the form '`<!ENTITY % name...`', or of the form '`<!ENTITY %name...`' will be stored as '`%name`'. Further parsing of this data is up to you.
`BOOL IsNamed(LPCTSTR szName)`	Branch is named `szName`.
`BOOL IsRoot()`	Branch is tree root.

Member Inventory

`BOOL HasData()`	Branch has data (comment, CDATA or PCDATA).
`BOOL HasChildren()`	Branch has 1 or more children.
`BOOL HasAttributes()`	Branch has 1 or more attributes.
`BOOL HasSiblings()`	Branch has one or more siblings.
`BOOL HasName()`	Branch has a name.
`BOOL HasAttribute(LPCTSTR szName)`	Test if branch has an attribute named `szName`.

Member Accessors

`LPCTSTR GetName()`	Access pointer to branch name if any, else empty string.
`XMLENTITY GetType()`	Access branch entity type.
`LPCTSTR GetData()`	Access pointer to data if any, else empty string.
`UINT_PTR GetChildrenCount()`	Access branch's child count.
`CPugXmlBranch GetChildAt(UINT_PTR i)`	Access child branch at subscript as `CPugXmlBranch`.
`UINT_PTR GetAttributesCount()`	Access branch's attribute count.
`const XMLATTR* GetAttributeAt(UINT_PTR i)`	Access attribute at subscript if any, else empty attribute.
`UINT_PTR GetSiblingsCount()`	Access branch's sibling count (parent's child count).
`CPugXmlBranch GetSiblingAt(UINT_PTR i)`	Access sibling branch at subscript as `CPugXmlBranch`.
`CPugXmlBranch GetParent()`	Access branch's parent if any, else null `CPugXmlBranch`.
`LPCTSTR GetAttribute(LPCTSTR szName,LPCTSTR tDefault)`	Access attribute value as `LPCTSTR` for attribute named `szName`. If not found return `tDefault`.
`LONG GetAttribute(LPCTSTR szName,LONG tDefault)`	Access attribute value as `LONG` for attribute named `szName`. If not found return `tDefault`.
`DOUBLE GetAttribute(LPCTSTR szName,DOUBLE tDefault)`	Access attribute value as `DOUBLE` for attribute named `szName`. If not found return `tDefault`.
`BOOL GetAttribute(LPCTSTR szName,BOOL tDefault)`	Access attribute value as `BOOL` for attribute named `szName`. If not found return `tDefault`.

Name-To-Object Mapping

`XMLATTR* MapStringToAttributePtr(LPCTSTR szName)`	Brute force map an attribute name to a pointer to that attribute, if found.
`UINT_PTR MapStringToAttributeIndex(LPCTSTR szName)`	Brute force map an attribute name to the index of that attribute, if found.
`XMLBRANCH* MapStringToChildPtr(LPCTSTR szName)`	Brute force map a child name to a pointer to the first matching child, if found.
`UINT_PTR MapStringToChildIndex(LPCTSTR szName)`	Brute force map a child name to the index of the first matching child, if found.

Traversal Helpers

`CPugXmlBranch FindFirstElement(LPCTSTR szName)`	Recursively-implemented depth-first find the first matching child element having name `szName`. Use for shallow drill-downs. Returns branch, or `CPugXmlBranch(NULL)` if not found.
`CPugXmlBranch FindFirstBranch(XMLENTITY eType)`	Recursively-implemented depth-first find the first matching child branch having type `eType`. Use for shallow drill-downs. Returns branch, or `CPugXmlBranch(NULL)` if not found.
`CPugXmlBranch FindFirstElementWhere(LPCTSTR szName,LPCTSTR szAttr,LPCTSTR szValue)`	Recursively-implemented depth-first find the first matching child element having name `szName`, an attribute named `szAttr`, and value of said attribute being `szValue`. Returns branch, or `CPugXmlBranch(NULL)` if not found.
`CPugXmlBranch FindFirstElemData(LPCTSTR szName,LPCTSTR szData)`	Recursively-implemented depth-first find the first matching element also having matching PCDATA.
`CPugXmlBranch FindFirstElemAttr(LPCTSTR szName,LPCTSTR szAttr,LPCTSTR szValue)`	Recursively-implemented depth-first find the first matching element also having matching attribute.
`void FindAllElements(LPCTSTR szName,CPugXmlBranchArray& rFound)`	Find all elements having the given name.
`LPTSTR CompilePath(LPCTSTR szDelim)`	Compile an absolute branch directory-style path from root as a text string (e.g. '`/document/branch`').
`CPugXmlBranch FindByPath(LPCTSTR szPath,LPCTSTR szDelim)`	Search for a branch by directory-style path (e.g. '`/document/branch`' -or- '`../foo`' -or- '`./../bar` ', etc.).
`BOOL Traverse(CPugXmlFilter& rFilter)`	Recursively traverse the tree, receiving notification in `CPugXmlFilter`-derived class `OnBranch()` callback.
`BOOL MoveToRoot()`	Move the cursor (`_m_pRoot`) to the absolute root of the document tree.
`BOOL MoveToParent()`	Move the cursor (`_m_pRoot`) to the current branch's parent.
`BOOL MoveToSibling(UINT_PTR i)`	Move the cursor (`_m_pRoot`) to the current branch's sibling at subscript. Equivalent to `MoveToChild` following `MoveToParent`.
`BOOL MoveToFirstSibling(LPCTSTR szName)`	Move the cursor (`_m_pRoot`) to the current branch's sibling matching given name.
`BOOL MoveToNextSibling(LPCTSTR szName)`	Move to the current branch's next sibling by position and name.
`BOOL MoveToNextSibling()`	Move to the current branch's next sibling by position.
`BOOL MoveToChild(UINT_PTR i)`	Move the cursor (`_m_pRoot`) to the current branch's child at subscript.
`BOOL MoveToChild(LPCTSTR szName)`	Move the cursor (`_m_pRoot`) to the current branch's child matching given name.

Editorial Helpers

`void Create()`	Create a new, empty root, for de novo tree construction.
`static BOOL SetStringMember(LPTSTR* pDest,LPCTSTR szSrc,LPBOOL pInSitu)`	Set structure string member to given value.
`BOOL SetAttributeName(UINT_PTR i,LPCTSTR newVal)`	Set attribute name at subscript.
`BOOL SetAttributeName(LPCTSTR szName,LPCTSTR newVal)`	Set attribute name where name is now `szName`.
`BOOL SetAttributeValue(UINT_PTR i,LPCTSTR newVal)`	Set attribute value at subscript.
`BOOL SetAttributeValue(LPCTSTR szName,LPCTSTR newVal)`	Set attribute value to `newVal` where name is `szName`.
`BOOL SetAttributeValue(LPCTSTR szName,LONG newVal)`	Set attribute value to `newVal` where name is `szName`.
`BOOL SetAttributeValue(LPCTSTR szName,DOUBLE newVal)`	Set attribute value to `newVal` where name is `szName`.
`BOOL SetAttributeValue(LPCTSTR szName,BOOL newVal)`	Set attribute value to `newVal` where name is `szName`.
`BOOL SetName(LPCTSTR newVal)`	Set element name.
`BOOL SetData(LPCTSTR newVal)`	Set branch data (`PCDATA`, `CDATA`, or comment).
`BOOL DeleteAttributeAt(UINT_PTR i)`	Remove attribute at the given subscript.
`BOOL DeleteAttributeAt(LPCTSTR szName)`	Remove attribute having the given name.
`BOOL AddAttribute(LPCTSTR szName,LPCTSTR szValue)`	Append a new attribute to the branch list of attributes.
`BOOL AddAttribute(LPCTSTR szName,LONG lValue)`	Append a new attribute of type `LONG` to the branch list of attributes.
`BOOL AddAttribute(LPCTSTR szName,DOUBLE dValue)`	Append a new attribute of type `DOUBLE` to the branch list of attributes.
`BOOL AddAttribute(LPCTSTR szName,BOOL bValue)`	Append a new attribute of type `BOOL` to the branch list of attributes.
`XMLENTITY SetType(XMLENTITY eType)`	Set the current branch entity type.
`CPugXmlBranch AppendChild(XMLENTITY eType)`	Allocate & append a child branch of the given type at the end of the current branch array of children.
`CPugXmlBranch InsertChildAt(UINT_PTR i,XMLENTITY eType)`	Allocate & insert a child branch of the given type at subscript.
`BOOL DeleteChildAt(UINT_PTR i)`	Delete the child branch at the given subscript.

Output Helpers
`<hr noShade SIZE="1">`
`static void Serialize(ostream& rOs,CPugXmlIndent& rIndent,XMLBRANCH* pBranch,BOOL bLineBreaks)`	Stream output. Recursively writes the given `XMLBRANCH` structure to the given stream.
`void Serialize(ostream& rOs,TCHAR cIndentChar,BOOL bLineBreaks)`	Stream output. Recursively writes the internal `XMLBRANCH` structure to the given stream.
`friend ostream& operator<<(ostream& rOs,CPugXmlBranch& rBranch)`	Stream output operator. Wraps `Serialize`.

CPugXmlParser Class - Provides a high-level interface to the XML parser.

Public Member	Summary

Construction/Destruction

`CPugXmlParser(LONG lGrowSize,BOOL bAutoDelete)`	Default constructor.
`CPugXmlParser(LPTSTR s,LONG lGrowSize,BOOL bAutoDelete)`	Direct parse constructor.
`~CPugXmlParser()`	Destructor.

Accessors/Operators

`operator XMLBRANCH*()`	Cast as `XMLBRANCH` pointer to root.
`operator CPugXmlBranch()`	Cast as `CPugXmlBranch` (same as `GetRoot()`).
`CPugXmlBranch GetRoot()`	Returns the root wrapped by an `CPugXmlBranch`.
`void Clear()`	Clear any existing tree or string data.
`XMLBRANCH* Attach(XMLBRANCH* pRoot)`	Attach an externally-generated root to the parser.
`XMLBRANCH* Detach()`	Detach the current root from the parser.

Parsing Helpers

`LPTSTR Parse(LPTSTR s,DWORD dwOpts)`	Parse the given XML string in-situ, using the specified options. Returns any remaining string or null.
`BOOL ParseFile(LPCTSTR szPath,DWORD dwOpts)`	Load into memory and parse the contents of the file at the given path, using the specified options.

Utility Functions

`static LPTSTR Parse(register LPTSTR s,XMLBRANCH* pRoot,LONG lGrow,DWORD dwOpts)`	The parser inner loop. Parse the given XML string in-situ, over the given root, using the specified growth increment, and options. Returns any remaining string not parsed, or null. See Parser Options below.
`inline static void FreeTreeRecursive(XMLBRANCH* pRoot)`	Recursively free a tree.
`static BOOL ReadFileData(LPCTSTR szPath,LPTSTR*,LPDWORD,DWORD)`	Read data from the file at at the given path into a buffer.

Parser Options

`PARSE_MINIMAL`	No trim or normalize, no DTD, no PI, no CDATA, no comments.
`PARSE_PI`	Parse '`<?...?>`'.
`PARSE_DOCTYPE`	Parse '`<!DOCTYPE name ... [...]>`'
`PARSE_COMMENTS`	Parse '`<!--...-->`'
`PARSE_CDATA`	Parse '`<![CDATA[...]]>`' -or- '`<![INCLUDE[...]]>`'.
`PARSE_TRIM_PCDATA`	Trim '`>...<`'.
`PARSE_TRIM_ATTRIBUTE`	Trim '`foo="..."`'.
`PARSE_TRIM_CDATA`	Trim '`<![CDATA[...]]>`' -or- '`<![INCLUDE[...]]>`'.
`PARSE_TRIM_ENTITY`	Trim '`<!ENTITY name ...>`', etc.
`PARSE_TRIM_DOCTYPE`	Trim '`<!DOCTYPE name ... [...]>`'.
`PARSE_TRIM_COMMENT`	Trim '`<!--...-->`'.
`PARSE_NORMALIZE`	Normalize all that are flagged to be trimmed (e.g. translate multiple whitespace segments to single space char).
`PARSE_DTD`	If `PARSE_DOCTYPE` set, then parse whatever is in data member (' `[...]>`').
`PARSE_DTD_ONLY`	If `(PARSE_DOCTYPE\|PARSE_DTD)` set, then abort after parsing DTD. If using `CPugXmlParser::ParseFile()` access the final string position through `CPugXmlParser::GetParseFilePos()`. If using `CPugXmlParser::Parse()`, access the final string position as the value returned from the function call.
`PARSE_DEFAULT`	Parse all, trim and normalize.

Global Methods - Used primarily by the parser inner loop.

Method	Summary
`static BOOL StrCatGrow(LPTSTR* pLhs,LPCTSTR pRhs)`	Concatenate `pRhs` to `pLhs`, growing `pRhs` if neccessary.
`static BOOL StrWtrim(LPTSTR* s)`	In situ trim leading and trailing whitespace.
`static BOOL StrWnorm(LPTSTR* s)`	In situ trim leading and trailing whitespace, then convert all consecutive whitespace to a single space char.
`inline static XMLATTR* NewAttribute(void)`	Allocate & init an `XMLATTR` structure.
`inline static XMLBRANCH* NewBranch(XMLENTITY eType)`	Allocate & init an `XMLBRANCH` structure.
`inline static XMLBRANCH* GraftBranch(XMLBRANCH* pParent,LONG lGrow,XMLENTITY eType)`	Allocate & graft a new `XMLBRANCH` onto the given parent.
`inline static XMLATTR* AddAttribute(XMLBRANCH* pBranch,LONG lGrow)`	Allocate & append a new attribute to the given `XMLBRANCH`.
`inline static void FreeTree(XMLBRANCH* pRoot)`	Non-recursively free a tree, given its root (also frees the root).

Performance

Note: For all performance tests, the parser was configured with following flags set: (PARSE_PI|PARSE_DOCTYPE|PARSE_COMMENTS|PARSE_CDATA|PARSE_DTD).

Operating Efficiency

The "http://www.codeproject.com/datetime/ccputicker.asp">CCPUTicker class by J.M. McGuiness and P.J. Naughter was used to measure the elapsed count of non-exclusive Pentium clock cycles for the static Parse inner loop. This value was divided by the count input bytes. In this summary test, a set of seven well-formed XML documents were parsed, representing a random mix of sizes, features and complexity. Following is a table of these measurements, and two plots, Input Bytes/Elapsed CPU Cycles (c_n-c₀), and Input Bytes/Elapsed Milliseconds (t_n-t₀).

Note: See Test Platform below for the test platform specs.

Bytes	t_n-t₀	(t_n-t₀)/Bytes	c_n-c₀	(c_n-c₀)/Bytes
670	0	0.00000	34,707	51
29,504	0	0.00000	2,144,866	72
97,000	10	0.00010	2,711,453	27
161,000	20	0.00012	6,264,292	38
281,772	40	0.00014	15,667,883	55
1,054,981	170	0.00016	79,446,024	75
6,998,822	350	0.00005	157,064,600	22

An identical test performed using the MSXML 4.0, and .NET parsers showed a 4.5X increase in elapsed milliseconds.

Memory Overhead

In an additional test, the parser's memory overhead was measured as a function of input bytes, and peak memory usage. This gains cogency in the context of the input data markup-to-data ratio. In the table below, 'Saturated' markup refers to markup that has a high markup-to-data ratio, e.g.:

<a><b><c><d><e/></d></c></a></a>

or:

<a b="b" c="c" d="d" e="e">f</a>

'Sparse' markup refers to the opposite, e.g.:

<aaaaaaaaaaaaaaaaaaaaaaaaaaaaa/>

or:

<a>bbbbbbbbbbbbbbbbbbbbbbbbb</a>

and, 'Mixed' markup would be somewhere in between.

Bytes	Markup Type	Peak (Bytes)	Overhead
68,039	Saturated	1,798,144	26 X
281,772	Mixed	1,826,816	6 X
1,819,636	Mixed	8,724,480	5 X
1,958,486	Mixed	10,395,648	5 X
6,998,822	Sparse	16,158,720	2 X

As you can see, 'Saturated' markup has the highest memory overhead (understandably, since the parser must allocate many more XMLBRANCH or XMLATTR structures). Although it is not implemented in PugXML, it should be possible to take a random sampling of the input string by counting the markup character to data character ratio, and thereby estimate both memory overhead, and knowing the input string length, peak memory usage. This might prove useful for embedded implementations where one would desire to know a priori whether the potential document tree could fit into memory.

An identical test performed using the MSXML 4.0, and .NET parsers showed a memory overhead of approximately 3.5X that of the PugXML parser.

Interpretation

OK, so the PugXML parser is comparatively fast, and doesn't use absurd amounts of memory, but what does this really mean? By analogy, if you took your average gas guzzling SUV (MSXML 4.0), and stripped everything off the chassis, except for one seat, the steering wheel, and the engine, you could probably get the thing going really fast. The question you have to ask yourself is: would you drive your children to school in such a vehicle? You bet! (Note: For some serious performance in a fully-validating XML parser, take a look at RXP).

Test Platform

OS Name Microsoft Windows 2000
Version 5.0.2195 Service Pack 3, RC 3.136 Build 2195
Processor x86 Family 5 Model 9 Stepping 1 AuthenticAMD ~451 Mhz
Total Physical Memory 261,680 KB
Total Virtual Memory 1,024,820 KB

W3C Compliance

PugXML was designed to perform robust parsing of potentially malformed XML markup, and as such is non-compliant. Using a fully W3C-compliant parser adhering to the strictures of well-formedness, validity, DTDs, XSDs, etc., one often hits a document that simply cannot be parsed with significant manual editing. PugXML never rejects a document. PugXML will parse a wide variety of malformed fragments, and attempt to generate a well-formed document tree. Using the W3C Conformance Test Suite 20020606, PugXML will parse all well-formed (valid) stand-alone documents correctly (without entity expansion). Of the invalid tests, PugXML will, in 99% of the cases yield a well-formed document tree from which one can actually extract some data. Despite not being fully W3C compliant, PugXML could easily be used as a starting point to build a fully compliant parser (although it would be sure to bite you if you tried).

Known Problems

Entities: Although the DOCTYPE section and subsidiary DTD entities may be parsed using this parser, entities declared therein are not expanded in the document. In addition, build-in entities (e.g. '
') are not expanded however they are retained in whatever name, data or value section, so it would not be difficult to apply them after the document tree is constructed. Alternatively, you could first parse the DTD section, build an entity table, expand instances of the entities in the input string, then parse the input string as normally.
Character Encodings: Although the parser can be compiled and will run correctly with UNICODE or _MBCS defined, it does not understand character encoding as specified in the XML PI section. If you are attempting to parse a non-standard codepage, the parser may fail, or corrupt the data. Currently the parser will work correctly for UTF-8 (65001), and various flavors of Latin and Western European ISO/ASCII with _MBCS defined, and for UNICODE (1200) with UNICODE defined (but not both in the same build). If your requirements include the need to parse ad-hoc data that may use one or more different codepages, you might consider using another parser.
Recursion: Some functions such as the traversal helpers, and the static stream output function, CPugXmlBranch::Serialize, are implemented using recursion. In some cases, particularly if you are traversing a very deep tree, and depending on your compiler, you may encounter a stack overflow. It would be relatively easy to make safer non-recursive implementations (see the ::FreeTree function for an example of non-recursive traversal).
XML Namespaces: The parser does not tokenize '<ns:tag...' or '<... ns:attribute...'. If namespace prefixes are present they will be left as is, as part of the object name, e.g. 'ns:tag' or 'ns:attribute'.

Errata

Since the initial performance measurements, there have been some minor changes made to the static CPugXmlParser::Parse(), NewBranch() and GraftBranch() functions (verification of memory allocation, and the string normalization option). Until new tests can be performed it is hard to say whether or how these changes may impact the overall performance.

History

01/15/2003: Bug Fix: A memory exception at CPugXmlParser::~CPugXmlParser(), or CPugXmlParser::Clear(), when parser option PARSE_DTD_ONLY is not set. Note: The source download, but not the demo project, has been updated with this fix.
01/10/2003: Initial Release.