Presented is a small, fast, non-validating DOM XML parser, contained in a single header, having no dependencies other than the standard C libraries, and <iostream> (KERNEL32.DLL with WIN32). This XML parser segments a given string in situ (like strtok), performing scanning/tokenization, and parsing in a single pass. Preliminary analysis shows a best-case of 22 X86 CPU clock cycles average per input byte, and a worst-case of 108 CPU cycles, for an executable built using VC7 with the /Og /Ob2 /Oi /Ot /Oy compiler optimization flags. This may be further improved by modifying the parser to discard comments, processing instruction (PI), DOCTYPE, and CDATA sections, as well as using faster memory management routines, (notice how the Microsoft malloc, realloc and free implementations seem to be rather slow).
<iostream>
KERNEL32.DLL
strtok
/Og /Ob2 /Oi /Ot /Oy
DOCTYPE
CDATA
malloc
realloc
free
Of the many XML parsers available, most present significant concerns, such as complexity, size, portability, etc. It is no small task to correctly implement XML-reading capabilities in a C or C++ application. Of the C, C++ and COM implementations one must gain familiarity with, and work with a large set of functions, structures, and/or classes. In addition, the basic function of many of these implementations is opaque, or obscured by a hairy call graph, so implementation-specific modification or optimization is not realistically possible.
As the use of XML gains ground in diverse industries, year after year, one finds people doing all kinds of bizarre things with XML that were perhaps never intended by the original working group (see for example: HumanML, XML markup for human emotions, etc.; and SuperX++, an XML-based programming language). The bottom line is that these diverse needs are not best met by one 'universal' XML parsing solution, but occasionally the requirements may indicate a customized approach.
The implementation presented here served as a study model for an XML-parsing algorithm that was later rewritten in assembly, for an embedded implementation. The intention here was to develop a lightweight, fully transparent XML parser, from which the essential process of parsing XML markup could be examined, and then pared down as needed. The source code and description may be used for didactic purposes, as a drop-in solution for light-weight DOM-style XML parsing in your own application, or may be used as a model or starting point for developing your own custom parser.
The PugXML parser performs string scanning, tokenization, parsing, and construction of the document tree structure in a single pass. A common approach in designing a parser would be to scan and tokenize the string in an initial pass, build the parse tree structure in the next pass, then validate in additional passes. While this approach has many advantages in terms of simplicity of implementation, error detection/tolerance/recovery, etc., it can easily result in a larger, slower, more opaque implementation, as the construction and maintenance of a token table requires additional memory and instruction steps. Eliminating the use of the tokenization pass, while significantly smaller, and faster, results in code that can be difficult to follow, as some goto statements are needed.
goto
In our approach we avoid traversing the tree as it is being built, in order to improve performance, although this has certain drawbacks. The XML format is cleverly designed with redundancy and error-correction features. In developing a single-pass XML parser we rely heavily on these in order to minimize look-ahead and look-back scanning of the input string. A brief examination of the nature of XML markup will permit us to address this approach in further detail. The structure of an XML document is that of a tree, where each tree node or branch may have certain associated data. In our implementation, each branch may be one of the following standalone or hierarchical types:
1.) <!ATTLIST ...> 2.) <!ELEMENT ...> 3.) <!ENTITY ...> 4.) <!NOTATION ...> 5.) <?name ... ?> 6.) <name ... /> 7.) <!--text--> 8.) <![CDATA[text]]> 9.) <![INCLUDE[text]]> 10.) text
11.) <!DOCTYPE [...]> 12.) <name ...>...</name>
All branches, and their attributes are represented using the structures:
XMLATTR
*_insitu
XMLENTITY
XMLBRANCH
*_space
During the parsing process we maintain a cursor to our current branch (like a stack). Upon hitting a standalone region, we just append a new branch as a child of the cursor branch. Upon hitting an hierarchical region, we append a new branch as a child of the cursor branch, then set our cursor to this new branch (like a stack push). On leaving a hierarchical region (e.g. '</name>'), we set our cursor to its parent (like a stack pop). To save time, and avoid traversing the tree as we build it (such as verifying that we are popping the right branch), we ignore the end-tag name (e.g. '</name>'), and just step to the parent of the current branch. This should be OK if the document is well-formed, but may cause problems otherwise (for example, parsing HTML). Using this approach, the following malformed structure:
</name>
<example> <foo>Oops!</bar> <bar>Ouch!</foo> <strange>Huh?</> <goo><gaa></goo></gaa> </example>
would parse as:
<example> <foo>Oops!</foo> <bar>Ouch!</bar> <strange>Huh?</strange> <goo><gaa /></goo> </example>
This seems a half-way usable interpretation, however parsers checking for well-formedness would choke on it. Because this is not compliant behavior, the PugXML parser, and parsers taking a similar approach would be indicated for situations requiring the parsing of machine-generated, well-formed fully-compliant XML documents, and not human-generated documents. See W3C Compliance below.
In developing the PugXML parser, the intention was to try to condense everything into a single function with a single loop, removing all non-essential operations (within a reasonable margin). The PugXML parser inner loop may be understood most easily using the finite state machine (FSM) formalism, as depicted in the simplified diagram below. The parser functions as a giant while() loop on the condition that a string zero-terminator is not hit. Within this loop a number of states are possible. Depending upon the current state:
while()
::GraftBranch()
::AddAttribute()
pCursor = pCursor->parent
continue
Each vertex in the diagram below represents a character, or set of characters we expect to encounter (see the inline static BOOL Is*(TCHAR c) member functions of the CPugXmlParser class). Edges between vertices represent control flow as each successive character, or classified segment of characters is matched. The stippled arrows signify that you can get to the indicated vertex from just about anywhere.
inline static BOOL Is*(TCHAR c)
CPugXmlParser
'<symbol '
'<... symbol='
'<... ...=symbol '
'>data<'
'<... ...="data"'
'<![CDATA[data]]>'
'
"
Below is a graphical depiction and narration of the incremental string segmentation process for a sample XML string, with reference to the above diagram. Each numbered rectangle represents the state of the input string as the parsing proceeds.
For the following string:
<example version="1.0">Text</example>
segmentation would proceed:
such that, at:
'<'
::IsSymbol()
'<#'
pCursor= ::GraftBranch()
LPTSTR XMLBRANCH::name
'<#_'
'/'
'>'
'<#_#'
XMLATTR** XMLBRANCH::attribute
char* XMLATTR::name
'='
' '
'"'
LPTSTR XMLATTR::value
LPTSTR
XMLBRANCH::data
PCDATA
'</'
resulting in the document tree:
First and foremost, because the generated document tree structure is directly indexed to the input string, you must ensure that the input string will exist for as long as you wish to traverse the tree. In addition, the string is altered by the parsing process (it is zero-terminated for every element name, attribute name, PCDATA section, etc.), so you cannot expect to use it for anything after the parsing pass is completed. If you want to alter any string segment (e.g. element name, attribute value, etc.) the wrapper class, CPugXmlBranch , will allocate memory for that segment if the resulting string will not fit in the existing space. This can result in greater memory use.
CPugXmlBranch
Following is a listing for a minimal application using the GnatXMLParser. Note: See the demo project for more extensive examples.
#include <span class="code-keyword"><iostream> </span> #include <span class="code-string">"pugxml.h" </span> using namespace std; int _tmain(int argc, _TCHAR* argv[]) { if(argc < 2) return 0; //Expect a filename argument. CPugXmlParser* pXml = new CPugXmlParser(); //Construct. pXml->ParseFile(argv[1]); //Parse the file. cout << pXml->GetRoot() << endl; //Output the resulting tree. delete pXml; //Destruct. return 0; }
The PugXML parser uses only a limited set of classes, the methods of which are all implemented inline in order to shoehorn everything into a single header. The classes you need mainly be concerned with are CPugXmlParser, which provides an interface to the underlying parser function (static LPTSTR CPugXmlParser::Parse()), and CPugXmlBranch, which provides a lightweight interface to the generated document tree. If you would like to minimize code size, cut & paste the inline functions to a new .cpp file, and remove the inline declarations. If you would like to use the basic parsing facility in a C project, dispense with the classes, and copy the static member functions of CPugXmlParser, the defines, enums, structures, and the global functions, NewAttribute(), NewBranch(), GraftBranch(), AddAttribute(), and FreeTree() to your C implementation file. The entire implementation uses only the following externs, so it should be relatively easy to port:
static LPTSTR CPugXmlParser::Parse()
.cpp
inline
C
static
NewAttribute()
NewBranch()
GraftBranch()
AddAttribute()
FreeTree()
memset
memcpy
strlen
isalnum
strncpy
<wtypes.h>
<tchar.h>
strtol
strtod
strcpy
CreateFile
ReadFile
CloseHandle
ZeroMemory
CPugXmlParser::ReadFileData()
fopen
fread
fclose
ostream
CPugXmlBranch::Serialize()
CPugXmlBranch()
CPugXmlBranch(XMLBRANCH* p)
CPugXmlBranch(const CPugXmlBranch& r)
~CPugXmlBranch()
operator XMLBRANCH*()
operator LPVOID()
void
CPugXmlBranch& operator=(const CPugXmlBranch& r)
BOOL operator==(const CPugXmlBranch& r)
CPugXmlBranch operator[](UINT_PTR i)
BOOL IsNull()
BOOL IsElement()
BOOL IsComment()
BOOL IsPCDATA()
BOOL IsCDATA()
BOOL IsPI()
BOOL IsDOCTYPE()
TRUE
SYSTEM
PUBLIC
GetData()
SetData()
IsDTD*()
PARSE_DOCTYPE|PARSE_DTD
BOOL IsDTD()<br> BOOL IsDTD_ATTLIST()<br> BOOL IsDTD_ELEMENT()<br> BOOL IsDTD_ENTITY()<br> BOOL IsDTD_NOTATION()
ATTLIST
ELEMENT
ENTITY
<!ATTLIST name ...>
XMLBRANCH::name
>
...>
<!ENTITY % name...
<!ENTITY %name...
%name
BOOL IsNamed(LPCTSTR szName)
szName
BOOL IsRoot()
BOOL HasData()
BOOL HasChildren()
BOOL HasAttributes()
BOOL HasSiblings()
BOOL HasName()
BOOL HasAttribute(LPCTSTR szName)
LPCTSTR GetName()
XMLENTITY GetType()
LPCTSTR GetData()
UINT_PTR GetChildrenCount()
CPugXmlBranch GetChildAt(UINT_PTR i)
UINT_PTR GetAttributesCount()
const XMLATTR* GetAttributeAt(UINT_PTR i)
UINT_PTR GetSiblingsCount()
CPugXmlBranch GetSiblingAt(UINT_PTR i)
CPugXmlBranch GetParent()
LPCTSTR GetAttribute(LPCTSTR szName,LPCTSTR tDefault)
LPCTSTR
tDefault
LONG GetAttribute(LPCTSTR szName,LONG tDefault)
LONG
DOUBLE GetAttribute(LPCTSTR szName,DOUBLE tDefault)
DOUBLE
BOOL GetAttribute(LPCTSTR szName,BOOL tDefault)
BOOL
XMLATTR* MapStringToAttributePtr(LPCTSTR szName)
UINT_PTR MapStringToAttributeIndex(LPCTSTR szName)
XMLBRANCH* MapStringToChildPtr(LPCTSTR szName)
UINT_PTR MapStringToChildIndex(LPCTSTR szName)
CPugXmlBranch FindFirstElement(LPCTSTR szName)
CPugXmlBranch(NULL)
CPugXmlBranch FindFirstBranch(XMLENTITY eType)
eType
CPugXmlBranch FindFirstElementWhere(LPCTSTR szName,LPCTSTR szAttr,LPCTSTR szValue)
szAttr
szValue
CPugXmlBranch FindFirstElemData(LPCTSTR szName,LPCTSTR szData)
CPugXmlBranch FindFirstElemAttr(LPCTSTR szName,LPCTSTR szAttr,LPCTSTR szValue)
void FindAllElements(LPCTSTR szName,CPugXmlBranchArray& rFound)
LPTSTR CompilePath(LPCTSTR szDelim)
/document/branch
CPugXmlBranch FindByPath(LPCTSTR szPath,LPCTSTR szDelim)
../foo
./../bar
BOOL Traverse(CPugXmlFilter& rFilter)
CPugXmlFilter
OnBranch()
BOOL MoveToRoot()
_m_pRoot
BOOL MoveToParent()
BOOL MoveToSibling(UINT_PTR i)
MoveToChild
MoveToParent
BOOL MoveToFirstSibling(LPCTSTR szName)
BOOL MoveToNextSibling(LPCTSTR szName)
BOOL MoveToNextSibling()
BOOL MoveToChild(UINT_PTR i)
BOOL MoveToChild(LPCTSTR szName)
void Create()
static BOOL SetStringMember(LPTSTR* pDest,LPCTSTR szSrc,LPBOOL pInSitu)
BOOL SetAttributeName(UINT_PTR i,LPCTSTR newVal)
BOOL SetAttributeName(LPCTSTR szName,LPCTSTR newVal)
BOOL SetAttributeValue(UINT_PTR i,LPCTSTR newVal)
BOOL SetAttributeValue(LPCTSTR szName,LPCTSTR newVal)
newVal
BOOL SetAttributeValue(LPCTSTR szName,LONG newVal)
BOOL SetAttributeValue(LPCTSTR szName,DOUBLE newVal)
BOOL SetAttributeValue(LPCTSTR szName,BOOL newVal)
BOOL SetName(LPCTSTR newVal)
BOOL SetData(LPCTSTR newVal)
BOOL DeleteAttributeAt(UINT_PTR i)
BOOL DeleteAttributeAt(LPCTSTR szName)
BOOL AddAttribute(LPCTSTR szName,LPCTSTR szValue)
BOOL AddAttribute(LPCTSTR szName,LONG lValue)
BOOL AddAttribute(LPCTSTR szName,DOUBLE dValue)
BOOL AddAttribute(LPCTSTR szName,BOOL bValue)
XMLENTITY SetType(XMLENTITY eType)
CPugXmlBranch AppendChild(XMLENTITY eType)
CPugXmlBranch InsertChildAt(UINT_PTR i,XMLENTITY eType)
BOOL DeleteChildAt(UINT_PTR i)
<hr noShade SIZE="1">
static void Serialize(ostream& rOs,CPugXmlIndent& rIndent,XMLBRANCH* pBranch,BOOL bLineBreaks)
void Serialize(ostream& rOs,TCHAR cIndentChar,BOOL bLineBreaks)
friend ostream& operator<<(ostream& rOs,CPugXmlBranch& rBranch)
Serialize
CPugXmlParser(LONG lGrowSize,BOOL bAutoDelete)
CPugXmlParser(LPTSTR s,LONG lGrowSize,BOOL bAutoDelete)
~CPugXmlParser()
operator CPugXmlBranch()
GetRoot()
CPugXmlBranch GetRoot()
void Clear()
XMLBRANCH* Attach(XMLBRANCH* pRoot)
XMLBRANCH* Detach()
LPTSTR Parse(LPTSTR s,DWORD dwOpts)
BOOL ParseFile(LPCTSTR szPath,DWORD dwOpts)
static LPTSTR Parse(register LPTSTR s,XMLBRANCH* pRoot,LONG lGrow,DWORD dwOpts)
inline static void FreeTreeRecursive(XMLBRANCH* pRoot)
static BOOL ReadFileData(LPCTSTR szPath,LPTSTR*,LPDWORD,DWORD)
PARSE_MINIMAL
PARSE_PI
<?...?>
PARSE_DOCTYPE
<!DOCTYPE name ... [...]>
PARSE_COMMENTS
<!--...-->
PARSE_CDATA
<![CDATA[...]]>
<![INCLUDE[...]]>
PARSE_TRIM_PCDATA
>...<
PARSE_TRIM_ATTRIBUTE
foo="..."
PARSE_TRIM_CDATA
PARSE_TRIM_ENTITY
<!ENTITY name ...>
PARSE_TRIM_DOCTYPE
PARSE_TRIM_COMMENT
PARSE_NORMALIZE
PARSE_DTD
[...]>
PARSE_DTD_ONLY
(PARSE_DOCTYPE|PARSE_DTD)
CPugXmlParser::ParseFile()
CPugXmlParser::GetParseFilePos()
CPugXmlParser::Parse()
PARSE_DEFAULT
static BOOL StrCatGrow(LPTSTR* pLhs,LPCTSTR pRhs)
pRhs
pLhs
static BOOL StrWtrim(LPTSTR* s)
static BOOL StrWnorm(LPTSTR* s)
inline static XMLATTR* NewAttribute(void)
inline static XMLBRANCH* NewBranch(XMLENTITY eType)
inline static XMLBRANCH* GraftBranch(XMLBRANCH* pParent,LONG lGrow,XMLENTITY eType)
inline static XMLATTR* AddAttribute(XMLBRANCH* pBranch,LONG lGrow)
inline static void FreeTree(XMLBRANCH* pRoot)
Note: For all performance tests, the parser was configured with following flags set: (PARSE_PI|PARSE_DOCTYPE|PARSE_COMMENTS|PARSE_CDATA|PARSE_DTD).
(PARSE_PI|PARSE_DOCTYPE|PARSE_COMMENTS|PARSE_CDATA|PARSE_DTD)
The "http://www.codeproject.com/datetime/ccputicker.asp">CCPUTicker class by J.M. McGuiness and P.J. Naughter was used to measure the elapsed count of non-exclusive Pentium clock cycles for the static Parse inner loop. This value was divided by the count input bytes. In this summary test, a set of seven well-formed XML documents were parsed, representing a random mix of sizes, features and complexity. Following is a table of these measurements, and two plots, Input Bytes/Elapsed CPU Cycles (cn-c0), and Input Bytes/Elapsed Milliseconds (tn-t0).
"http://www.codeproject.com/datetime/ccputicker.asp">CCPUTicker
Parse
Note: See Test Platform below for the test platform specs.
670
0
0.00000
34,707
51
29,504
2,144,866
72
97,000
10
0.00010
2,711,453
27
161,000
20
0.00012
6,264,292
38
281,772
40
0.00014
15,667,883
55
1,054,981
170
0.00016
79,446,024
75
6,998,822
350
0.00005
157,064,600
22
An identical test performed using the MSXML 4.0, and .NET parsers showed a 4.5X increase in elapsed milliseconds.
In an additional test, the parser's memory overhead was measured as a function of input bytes, and peak memory usage. This gains cogency in the context of the input data markup-to-data ratio. In the table below, 'Saturated' markup refers to markup that has a high markup-to-data ratio, e.g.:
<a><b><c><d><e/></d></c></a></a>
<a b="b" c="c" d="d" e="e">f</a>
'Sparse' markup refers to the opposite, e.g.:
<aaaaaaaaaaaaaaaaaaaaaaaaaaaaa/>
<a>bbbbbbbbbbbbbbbbbbbbbbbbb</a>
and, 'Mixed' markup would be somewhere in between.
As you can see, 'Saturated' markup has the highest memory overhead (understandably, since the parser must allocate many more XMLBRANCH or XMLATTR structures). Although it is not implemented in PugXML, it should be possible to take a random sampling of the input string by counting the markup character to data character ratio, and thereby estimate both memory overhead, and knowing the input string length, peak memory usage. This might prove useful for embedded implementations where one would desire to know a priori whether the potential document tree could fit into memory.
An identical test performed using the MSXML 4.0, and .NET parsers showed a memory overhead of approximately 3.5X that of the PugXML parser.
OK, so the PugXML parser is comparatively fast, and doesn't use absurd amounts of memory, but what does this really mean? By analogy, if you took your average gas guzzling SUV (MSXML 4.0), and stripped everything off the chassis, except for one seat, the steering wheel, and the engine, you could probably get the thing going really fast. The question you have to ask yourself is: would you drive your children to school in such a vehicle? You bet! (Note: For some serious performance in a fully-validating XML parser, take a look at RXP).
OS Name Microsoft Windows 2000 Version 5.0.2195 Service Pack 3, RC 3.136 Build 2195 Processor x86 Family 5 Model 9 Stepping 1 AuthenticAMD ~451 Mhz Total Physical Memory 261,680 KB Total Virtual Memory 1,024,820 KB
PugXML was designed to perform robust parsing of potentially malformed XML markup, and as such is non-compliant. Using a fully W3C-compliant parser adhering to the strictures of well-formedness, validity, DTDs, XSDs, etc., one often hits a document that simply cannot be parsed with significant manual editing. PugXML never rejects a document. PugXML will parse a wide variety of malformed fragments, and attempt to generate a well-formed document tree. Using the W3C Conformance Test Suite 20020606, PugXML will parse all well-formed (valid) stand-alone documents correctly (without entity expansion). Of the invalid tests, PugXML will, in 99% of the cases yield a well-formed document tree from which one can actually extract some data. Despite not being fully W3C compliant, PugXML could easily be used as a starting point to build a fully compliant parser (although it would be sure to bite you if you tried).
UNICODE
_MBCS
CPugXmlBranch::Serialize
::FreeTree
<ns:tag...
<... ns:attribute...
ns:tag
ns:attribute
Since the initial performance measurements, there have been some minor changes made to the static CPugXmlParser::Parse(), NewBranch() and GraftBranch() functions (verification of memory allocation, and the string normalization option). Until new tests can be performed it is hard to say whether or how these changes may impact the overall performance.
static CPugXmlParser::Parse()
CPugXmlParser::~CPugXmlParser()
CPugXmlParser::Clear(),
This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.
A list of licenses authors might use can be found here
General News Suggestion Question Bug Answer Joke Rant Admin
Skills that self-taught computer programmers lack