MiniXML, a Fast Parser for XML Configuration Files






May 29, 2005
An article that presents a fast XML parser for accessing the configuration file.
Introduction
Many software projects need a small, fast and portable XML parser that does not tie them to a particular platform or to a dependency such as the COM interface of MSXML. I recently designed a fast cross-platform XML parser called MiniXML. It quickly parses XML data into a document tree and provides an intuitive interface for accessing the data held in that tree.
Design consideration
The main goal of this project is to build the document tree as fast as possible. To achieve it, I made the following design decisions:
- MiniXML scans the entire XML data only once while building the document tree.
- MiniXML minimizes string copying during tree building. Many other XML parsers whose source code I have read copy each parsed element name or attribute value into a newly created string and keep it in the new element object. I think this costs too much time and memory. MiniXML avoids these unnecessary copies through a carefully designed string class and tokenization scheme.
- Like many other XML configuration files, ours uses a much smaller subset of the XML standard. For example, CDATA sections are not needed in our configuration file. MiniXML skips tokenization of the XML constructs that are not required in order to save processing time. The BNF of our configuration file format is summarized below (Form 1):
Document ::= doctypedecl element
doctypedecl ::= '<?' any normal chars '?>'
element ::= EmptyElementTag | STag Content ETag
EmptyElementTag ::= '<' Name (S Attribute)* S '/>'
STag ::= '<' Name (S Attribute)* S '>'
Content ::= (element | CharData | PI | Comment)*
PI ::= '<?' any normal chars '?>'
Comment ::= '<!--' any normal chars '-->'
CharData ::= any normal chars
This rule set is much smaller than the official XML definition (EBNF for XML). MiniXML parses configuration files using this succinct rule set in order to minimize CPU and memory usage.
General design
The following diagram (Figure 1) shows the MiniXML class hierarchy:
The MiniXML implements two major tasks:
Parse the input XML data and build a document tree
Each node of the document tree is a CElement object. Each object has a CElement* m_pFistChild member and a pair of m_pPrevSibling and m_pNextSibling members. m_pFistChild points to the first child node, while the sibling nodes form a doubly linked list maintained through the m_pPrevSibling and m_pNextSibling members of each CElement object. This hand-built doubly linked list (used instead of an STL container) simplifies iteration in CElementIterator.
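The linkage described above can be sketched as follows. This is a minimal illustration, not MiniXML's code: the Node struct, the AddChild and ChildNames helpers, and the cached lastChild pointer are all assumptions made for the example. Only the first-child and previous/next-sibling pointers come from the article.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for CElement: only the tree linkage is modeled.
struct Node {
    std::string name;
    Node* parent = nullptr;
    Node* firstChild = nullptr;
    Node* lastChild = nullptr;    // assumption: cached tail for O(1) append
    Node* prevSibling = nullptr;
    Node* nextSibling = nullptr;

    explicit Node(std::string n) : name(std::move(n)) {}
};

// Append child to the end of parent's doubly linked sibling list.
void AddChild(Node& parent, Node& child) {
    assert(child.parent == nullptr);  // a node belongs to one parent only
    child.parent = &parent;
    child.prevSibling = parent.lastChild;
    child.nextSibling = nullptr;
    if (parent.lastChild)
        parent.lastChild->nextSibling = &child;
    else
        parent.firstChild = &child;
    parent.lastChild = &child;
}

// Walk the sibling list from the first child, collecting names in order.
std::vector<std::string> ChildNames(const Node& parent) {
    std::vector<std::string> names;
    for (const Node* p = parent.firstChild; p != nullptr; p = p->nextSibling)
        names.push_back(p->name);
    return names;
}
```

Because every node carries its own sibling pointers, an iterator only needs to hold a single node pointer and follow nextSibling, which is presumably why a hand-built list is convenient here.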
The constructor of CXmlConf starts the creation of the document tree. It creates the root CElement object and a CScanner object, then calls CElement's Parse function to build the document tree:
m_pScanner = new CScanner(pBuffer, pBuffer + buffersize);
m_pRoot = new CElement(NULL);
...
m_pRoot->Parse(m_pScanner);
...
The CElement::Parse function dispatches the parsing tasks to the associated CBaseParser objects according to the input token and the BNF rules defined in Form 1.
bool CElement::Parse(CScanner* pScan)
{
    CStagParser StagParser(this);
    CEtagParser EtagParser(this);

    // Parse the opening tag (also recognizes empty-element tags, PIs and comments)
    if (!StagParser.Parse(pScan))
        return false;

    // These forms have neither content nor an end tag
    if (StagParser.IsEmptyElementTag() ||
        StagParser.IsPITag() ||
        StagParser.IsCommentTag())
    {
        m_StringValue = StagParser.GetNameObj();
        if (m_pParent)
            m_pParent->AddChildElement(this);
        m_bValid = true;
        return true;
    }

    // Parse the content, then the end tag
    CContent contentParser(this);
    if (!contentParser.Parse(pScan))
        return false;
    if (!EtagParser.Parse(pScan))
        return false;

    // The element is valid only if the start and end tag names match
    if (StagParser.GetNameObj() == EtagParser.GetNameObj())
    {
        m_StringValue = StagParser.GetNameObj();
        if (m_pParent)
            m_pParent->AddChildElement(this);
        m_bValid = true;
    }
    return m_bValid;
}
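To make the control flow of Parse easier to follow, here is a stripped-down recursive-descent recognizer for the element rule of Form 1. It is a sketch under several assumptions: Cursor, ReadName, ParseElement and IsWellFormed are hypothetical names, attributes are skipped rather than parsed (so a '>' inside an attribute value is not handled), and PIs and comments are not handled at all. It builds no tree; it only checks that start and end tags nest and match.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical cursor over the input text; not MiniXML's CScanner.
struct Cursor {
    const std::string& s;
    std::size_t pos;
};

// Read a tag name made of letters, digits, '_' or '-'.
static std::string ReadName(Cursor& c) {
    std::size_t start = c.pos;
    while (c.pos < c.s.size() &&
           (std::isalnum(static_cast<unsigned char>(c.s[c.pos])) ||
            c.s[c.pos] == '_' || c.s[c.pos] == '-'))
        ++c.pos;
    return c.s.substr(start, c.pos - start);
}

// element ::= EmptyElementTag | STag Content ETag
// Returns true when exactly one well-formed element was consumed.
bool ParseElement(Cursor& c) {
    if (c.pos >= c.s.size() || c.s[c.pos] != '<') return false;
    ++c.pos;
    std::string openName = ReadName(c);
    if (openName.empty()) return false;

    // Skip the attributes: jump to the '>' that ends the start tag.
    std::size_t close = c.s.find('>', c.pos);
    if (close == std::string::npos) return false;
    bool isEmptyTag = (c.s[close - 1] == '/');
    c.pos = close + 1;
    if (isEmptyTag) return true;  // no content and no end tag

    // Content ::= (element | CharData)*
    while (c.pos < c.s.size()) {
        if (c.s[c.pos] != '<') { ++c.pos; continue; }                 // CharData
        if (c.pos + 1 < c.s.size() && c.s[c.pos + 1] == '/') break;   // end tag
        if (!ParseElement(c)) return false;                           // child
    }

    // ETag: '</' Name '>' and the name must match the start tag.
    if (c.s.compare(c.pos, 2, "</") != 0) return false;
    c.pos += 2;
    if (ReadName(c) != openName) return false;
    if (c.pos >= c.s.size() || c.s[c.pos] != '>') return false;
    ++c.pos;
    return true;
}

// Convenience wrapper: true when text is exactly one element.
bool IsWellFormed(const std::string& text) {
    Cursor c{text, 0};
    return ParseElement(c) && c.pos == text.size();
}
```

As in CElement::Parse, the empty-element case returns early, and the name comparison between the start tag and the end tag decides whether the element is accepted.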
Access the XML data from the document tree
The MiniXML access interface consists of the following three classes:
- CXmlConf, which acquires the XML data and initiates the parsing process to build a document tree.
- CElement, which is the core class for accessing all the XML data of an element.
- CElementIterator, which is an iterator class for accessing the sibling nodes of a given CElement* pointer.
For demonstration purposes, I wrote a CElement* Clone(CElement* pObj) function (in ElementClone.cpp) to show how to use the public member functions of the classes CElement and CElementIterator. The function returns a pointer to a CElement object that copies all the members and the sub-node tree structure of the CElement node pointed to by pObj. Given its low performance and high memory usage, this function is not a practical way to do real cloning. It is, however, a helpful example of how to use MiniXML's interface classes.
CElement* Clone(CElement* p)
{
    if (!p || !p->IsValid())
        return NULL;

    vector<char> ElementName;
    if (!p->GetElementName(ElementName))
        return NULL;
    CElement* retRoot = CElement::CreateNewElement(ElementName.begin());

    // Copy the attributes
    int AttrCount = p->GetAttributeCount();
    for (int i = 0; i < AttrCount; i++)
    {
        vector<char> AttrName, AttrValue;
        if (!p->GetAttributePairByIndex(i, AttrName, AttrValue))
        {
            retRoot->Delete();
            return NULL;
        }
        retRoot->AddAttributePair(AttrName.begin(), AttrValue.begin());
    }

    // Copy the char data
    vector<char> charData;
    if (p->GetCharData(charData))
        retRoot->SetCharData(charData);

    // Clone the child elements
    CElementIterator iter(p->GetFirstChild());
    while (iter.IsValid())
    {
        CElement* child = Clone(iter.GetElementPtr());
        if (child)
            retRoot->AddChildElement(child);
        ++iter;
    }
    return retRoot;
}
Using the code
The recommended way to use MiniXML is to create a CXmlConf object from an XML file or string, and then use the member functions of CElement and CElementIterator to walk through the resulting document tree. The ParseAndCloneTest function defined in Test.cpp shows the usage:
void ParseAndCloneTest()
{
    //
    // Create a CXmlConf object, clone it to the
    // element of pRoot, and output
    //
    CXmlConf xmlConf("sampleXML.xml");
    if (xmlConf)
    {
        // The clone function was defined above
        CElement* pRoot = xmlConf.Clone();
        cout << *pRoot;
        pRoot->Delete();
    }
    else
        cout << "ParseAndCloneTest Failed.";
}
In some cases, users may want to read or write a specific element of the document tree. MiniXML provides the function CXmlConf::GetRootElement, which returns a pointer to the first CElement that matches a given sequence of element names. For example, for the XML input:
<?xml version="1.0" encoding= "UTF-8" ?>
<Element1 attr="haha" attr2="haha2" attr3="hahah3">
    <SubElement1 Attr="Book" Attr2="Pen" Attr3="keyboard"/>
    <SubElement1 Attr="Book2" Attr2="Pen" Attr3="keyboard">
        <SubElement2>
            <SubElement2Sub attr="Beijing" attr2="ShangHai"> </SubElement2Sub>
            <SubElement2Sub attr="XiAn" attr2="NanJing"> </SubElement2Sub>
        </SubElement2>
    </SubElement1>
</Element1>
Users can call CXmlConf::GetRootElement("Element1.SubElement1.SubElement2.SubElement2Sub") to get a pointer to the first SubElement2Sub element, whose attr attribute is "Beijing". The following example, ElementModifyTest, gets a pointer to the second SubElement2Sub element, modifies it, and finally writes the document tree to the file "OutputXML.xml".
void ElementModifyTest()
{
    // This example shows how to modify an attribute of the 2nd
    // Element1->SubElement1->SubElement2->SubElement2Sub
    // from 'XiAn' to 'HuNan' and save the document to a valid XML file
    CXmlConf xmlConf("sampleXML.xml");
    if (xmlConf)
    {
        CElementIterator iter(xmlConf.GetRootElement(
            "Element1.SubElement1.SubElement2.SubElement2Sub"));
        if (iter.IsValid())
        {
            ++iter;
            if (iter.IsValid())
            {
                iter.GetElementPtr()->ModifyAttribute("attr", "HuNan");
                CElement* p = CElement::CreateNewElement("NewElement");
                p->AddAttributePair("Attribute", "hahahaha");
                p->AddAttributePair("Attribute2", "hahahaha2");
                iter.GetElementPtr()->AddChildElement(p);
            }
        }
        //
        // Output the result to a file
        //
        ofstream ofs("OutputXML.xml");
        ofs << xmlConf;
    }
}
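The article does not show how GetRootElement resolves a dotted name sequence, so the following is only one plausible way to do it. PathNode and FindByPath are hypothetical names, and the backtracking across same-named siblings is an assumption chosen so that the sample input works: the first SubElement1 there has no SubElement2 child, so the search has to move on to the second one.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical minimal tree node; MiniXML's CElement is richer.
struct PathNode {
    std::string name;
    std::vector<PathNode> children;
};

// Resolve a dotted path such as "A.B.C" against root.
// root itself must be named A; the search then descends into the first
// child named B whose subtree can resolve the remainder of the path.
// Returns nullptr when no element matches.
const PathNode* FindByPath(const PathNode& root, const std::string& path) {
    std::size_t dot = path.find('.');
    std::string head = path.substr(0, dot);   // whole path if there is no dot
    if (root.name != head) return nullptr;
    if (dot == std::string::npos) return &root;

    std::string rest = path.substr(dot + 1);
    for (const PathNode& child : root.children) {
        if (const PathNode* hit = FindByPath(child, rest))
            return hit;
    }
    return nullptr;
}
```

With a tree shaped like the sample input, the path "Element1.SubElement1.SubElement2.SubElement2Sub" resolves to the first SubElement2Sub under the second SubElement1, while a path starting with "Element" instead of "Element1" finds nothing.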
Points of interest
The class CStringValue is used throughout the MiniXML classes to hold string information such as element names, element char data, attribute names and attribute values. CStringValue offers two different ways to keep this information. During parsing, CStringValue does not copy the string found in the input XML data into its internal buffer. Instead, it stores the start and end addresses of the string in its members m_pBegin and m_pEnd. This avoids unnecessary buffer allocation and string copying during parsing. On the other hand, when users modify the document tree, for example by changing an attribute value, CStringValue behaves as a regular string class that keeps the new string in its internal buffer. The CStringValue class is defined in MiniParser.h.
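The two modes can be sketched roughly as follows. This is not the real CStringValue: the class name StringRef and the methods Assign, OwnsData, Length, Equals and ToString are assumptions made for illustration. Only the m_pBegin and m_pEnd members and the two behaviors (reference the input while parsing, own a copy after modification) come from the description above.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical sketch of a dual-mode string; not the real CStringValue.
class StringRef {
public:
    // Parsing mode: remember [begin, end) inside the caller's buffer.
    // Nothing is copied, so the buffer must outlive this object.
    StringRef(const char* begin, const char* end)
        : m_pBegin(begin), m_pEnd(end) {}

    // Copying is disabled: in owning mode the pointers refer to m_owned,
    // and a default copy would leave them pointing into the source object.
    StringRef(const StringRef&) = delete;
    StringRef& operator=(const StringRef&) = delete;

    // Editing mode: copy the new text into an owned buffer.
    void Assign(const std::string& text) {
        m_owned = text;
        m_pBegin = m_owned.data();
        m_pEnd = m_pBegin + m_owned.size();
        m_ownsData = true;
    }

    bool OwnsData() const { return m_ownsData; }

    std::size_t Length() const {
        return static_cast<std::size_t>(m_pEnd - m_pBegin);
    }

    // Compare against a C string without building a temporary copy.
    bool Equals(const char* text) const {
        std::size_t n = std::strlen(text);
        return n == Length() && std::memcmp(m_pBegin, text, n) == 0;
    }

    std::string ToString() const { return std::string(m_pBegin, m_pEnd); }

private:
    const char* m_pBegin;
    const char* m_pEnd;
    std::string m_owned;       // used only after Assign
    bool m_ownsData = false;
};
```

The trade-off of the non-owning mode is lifetime: the parsed input buffer has to stay alive and unmodified for as long as any element still refers to it.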