
MiniXML, a Fast Parser for the XML Configuration File

5 Oct 2005
This article presents a fast XML parser for reading and modifying configuration files.

Introduction

Many software projects need a small, fast, and portable XML parser that carries no platform dependency such as the COM interface of MSXML. I recently designed a fast cross-platform XML parser called MiniXML that quickly parses XML data into a document tree and provides an intuitive interface for accessing the data held in that tree.

Design consideration

The main goal of this project is to build the document tree as fast as possible. The following considerations guided the design:

  • MiniXML scans the entire XML data only once while building the document tree.
  • MiniXML minimizes string copying during the tree building process. I read the source code of many other XML parsers, and they copy each parsed element name or attribute value into a newly created string that is kept in the new element object. I think this costs too much time and memory. MiniXML avoids these unnecessary copies through a carefully designed string class and tokenization scheme.
  • Like many other XML configuration files, our required XML format is a much smaller subset of the XML standard. For example, the CDATA section is not required in our configuration file. MiniXML skips tokenization for the XML constructs it does not need in order to save processing time. I have summarized the BNF format of our configuration file as follows: (Form 1)
    Document        ::= doctypedecl element
    doctypedecl     ::= '<?' any normal chars '?>'
    element         ::= EmptyElementTag | STag Content ETag
    EmptyElementTag ::= '<' Name (S Attribute)* S '/>'
    STag            ::= '<' Name (S Attribute)* S '>'
    Content         ::= (element | CharData | PI | Comment)*
    PI              ::= '<?' any normal chars '?>'
    Comment         ::= '<!--' any normal chars '-->'
    CharData        ::= any normal chars

The above rule set is much smaller than the official XML definition (EBNF for XML). MiniXML parses configuration files using this succinct rule set in order to minimize CPU and memory usage.
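To make the reduced rule set concrete, here is a minimal sketch of a single-pass recognizer for Form 1. It is not MiniXML's code: the class SubsetChecker and the function IsValidSubset are hypothetical names, and several details are assumptions because Form 1 leaves them open. Attribute is taken to be Name '=' quoted value, ETag to be '</' Name '>', Name is simplified, and whitespace is treated as optional wherever the grammar writes S.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical single-pass recognizer for the Form 1 subset (not MiniXML code).
// m_pos only ever moves forward through the caller's text.
class SubsetChecker {
public:
    explicit SubsetChecker(const std::string& text) : m_text(text), m_pos(0) {}

    // Document ::= doctypedecl element
    bool Document() {
        SkipSpace();
        if (!SkipBlock("<?", "?>")) return false;
        SkipSpace();
        if (!Element()) return false;
        SkipSpace();
        return m_pos == m_text.size();
    }

private:
    bool AtEnd() const { return m_pos >= m_text.size(); }
    bool StartsWith(const char* s) const {
        return m_text.compare(m_pos, std::strlen(s), s) == 0;
    }
    bool Eat(const char* s) {
        if (!StartsWith(s)) return false;
        m_pos += std::strlen(s);
        return true;
    }
    // Consumes opener ... closer; shared by PI and Comment.
    bool SkipBlock(const char* opener, const char* closer) {
        if (!Eat(opener)) return false;
        std::size_t end = m_text.find(closer, m_pos);
        if (end == std::string::npos) return false;
        m_pos = end + std::strlen(closer);
        return true;
    }
    void SkipSpace() {
        while (!AtEnd() && std::isspace(static_cast<unsigned char>(m_text[m_pos]))) ++m_pos;
    }
    // Simplified Name: letters, digits, '_', '-', '.', ':'. Returns its length.
    std::size_t Name() {
        std::size_t start = m_pos;
        for (; !AtEnd(); ++m_pos) {
            unsigned char c = static_cast<unsigned char>(m_text[m_pos]);
            if (!std::isalnum(c) && c != '_' && c != '-' && c != '.' && c != ':') break;
        }
        return m_pos - start;
    }
    // (S Attribute)* S, with Attribute assumed to be Name '=' quoted-value.
    bool Attributes() {
        for (;;) {
            SkipSpace();
            if (AtEnd()) return false;
            if (m_text[m_pos] == '>' || m_text[m_pos] == '/') return true;
            if (Name() == 0) return false;
            SkipSpace();
            if (!Eat("=")) return false;
            SkipSpace();
            if (AtEnd()) return false;
            char quote = m_text[m_pos];
            if (quote != '"' && quote != '\'') return false;
            std::size_t close = m_text.find(quote, m_pos + 1);
            if (close == std::string::npos) return false;
            m_pos = close + 1;
        }
    }
    // element ::= EmptyElementTag | STag Content ETag
    bool Element() {
        if (!Eat("<")) return false;
        std::size_t nameStart = m_pos;
        std::size_t nameLen = Name();
        if (nameLen == 0 || !Attributes()) return false;
        if (Eat("/>")) return true;                      // EmptyElementTag
        if (!Eat(">") || !Content() || !Eat("</")) return false;
        // The end tag must repeat the start tag name; compare in place, no copy.
        if (m_text.compare(m_pos, nameLen, m_text, nameStart, nameLen) != 0) return false;
        m_pos += nameLen;
        SkipSpace();
        return Eat(">");
    }
    // Content ::= (element | CharData | PI | Comment)*, ended by an end tag.
    bool Content() {
        while (!AtEnd()) {
            if (StartsWith("</")) return true;
            if (StartsWith("<!--")) { if (!SkipBlock("<!--", "-->")) return false; }
            else if (StartsWith("<?")) { if (!SkipBlock("<?", "?>")) return false; }
            else if (m_text[m_pos] == '<') { if (!Element()) return false; }
            else ++m_pos;                                // CharData
        }
        return false;                                    // input ended inside an element
    }

    const std::string& m_text;   // caller keeps the text alive
    std::size_t m_pos;
};

// Convenience wrapper: true if text matches the Form 1 subset.
bool IsValidSubset(const std::string& text) { return SubsetChecker(text).Document(); }
```

Because Form 1 makes the leading '<? ... ?>' declaration mandatory, this sketch rejects input that starts directly with an element. A real parser would build tree nodes where the sketch only returns true or false.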

General design

The following diagram (Figure 1) shows the MiniXML class hierarchy:

[Figure 1: MiniXML class hierarchy]

The MiniXML implements two major tasks:

Parse the input XML data and build a document tree

Each node of the document tree is a CElement object. Each object has a CElement* m_pFistChild member and a pair of m_pPrevSibling and m_pNextSibling members. m_pFistChild points to the first child node, while the sibling nodes form a doubly linked list maintained through the m_pPrevSibling and m_pNextSibling members of each CElement object. Using a hand-built doubly linked list (instead of an STL container) keeps the iteration logic in CElementIterator simple.
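The linkage described above can be sketched as follows. This is only an illustration under stated assumptions: Node, AddChild and CountChildren are hypothetical names rather than MiniXML's, and the lastChild tail pointer is an addition made here so that appending is constant time; the article does not say whether CElement keeps one.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical stand-in for CElement: only the linkage fields are modeled.
struct Node {
    std::string name;
    Node* parent = nullptr;
    Node* firstChild = nullptr;
    Node* lastChild = nullptr;    // assumption: tail pointer for O(1) append
    Node* prevSibling = nullptr;
    Node* nextSibling = nullptr;

    explicit Node(std::string n) : name(std::move(n)) {}
};

// Appends child at the end of parent's sibling list.
void AddChild(Node& parent, Node& child) {
    child.parent = &parent;
    child.prevSibling = parent.lastChild;
    child.nextSibling = nullptr;
    if (parent.lastChild) parent.lastChild->nextSibling = &child;
    else parent.firstChild = &child;
    parent.lastChild = &child;
}

// Walks the sibling list the way an iterator would: start at the first
// child and follow nextSibling until it is null.
std::size_t CountChildren(const Node& parent) {
    std::size_t count = 0;
    for (const Node* p = parent.firstChild; p != nullptr; p = p->nextSibling) ++count;
    return count;
}
```

With this layout, an iterator only has to hold one node pointer: validity is a null check and advancing is a single pointer load.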

The constructor of CXmlConf starts the creation of the document tree. It creates a CScanner object and the root CElement object, then calls the root element's Parse function to build the document tree:

m_pScanner= new CScanner(pBuffer,pBuffer+buffersize);
m_pRoot=new CElement(NULL);
...
m_pRoot->Parse(m_pScanner);
...

The CElement::Parse function delegates the parsing tasks to the appropriate CBaseParser-derived objects according to the input token and the BNF rules defined in Form 1.

bool CElement::Parse(CScanner* pScan)
{
    CStagParser StagParser(this);
    CEtagParser EtagParser(this);

    // Every element begins with a start tag (or an empty, PI, or comment tag).
    if (!StagParser.Parse(pScan)) return false;

    // Tags that have no content and no end tag are complete at this point.
    if (StagParser.IsEmptyElementTag() ||
        StagParser.IsPITag() ||
        StagParser.IsCommentTag())
    {
        m_StringValue = StagParser.GetNameObj();
        if (m_pParent) m_pParent->AddChildElement(this);
        m_bValid = true;
        return true;
    }

    // Otherwise parse the content, then the end tag.
    CContent contentParser(this);
    if (!contentParser.Parse(pScan)) return false;
    if (!EtagParser.Parse(pScan)) return false;

    // The element is valid only if the start and end tag names match.
    if (StagParser.GetNameObj() == EtagParser.GetNameObj())
    {
        m_StringValue = StagParser.GetNameObj();
        if (m_pParent) m_pParent->AddChildElement(this);
        m_bValid = true;
    }
    return m_bValid;
}

Access the XML data from the document tree

The MiniXML access interface consists of the following three classes:

  • CXmlConf, which acquires the XML data and initiates the parsing process to build a document tree.
  • CElement, which is the core class to access all the XML data for an Element.
  • CElementIterator, which is an iterator class to access the sibling nodes of a given CElement* pointer.

For demonstration purposes, I wrote a CElement* Clone(CElement* p) function (in ElementClone.cpp) to show how to use the public member functions of the classes CElement and CElementIterator. The function returns a pointer to a new CElement object that copies all the members and the sub-node tree structure of the CElement node pointed to by p. Given its speed and memory usage, this function is not a practical way to do real cloning. However, it is a helpful example of how to use MiniXML's interface classes.

CElement* Clone(CElement* p)
{
    if (!p || !p->IsValid()) return NULL;

    // Copy the element name into a new, free-standing element.
    vector<char> ElementName;
    if (!p->GetElementName(ElementName)) return NULL;
    CElement* retRoot = CElement::CreateNewElement(ElementName.begin());

    // Copy every attribute name/value pair.
    int AttrCount = p->GetAttributeCount();
    for (int i = 0; i < AttrCount; i++)
    {
        vector<char> AttrName, AttrValue;
        if (!p->GetAttributePairByIndex(i, AttrName, AttrValue))
        {
            retRoot->Delete();
            return NULL;
        }
        retRoot->AddAttributePair(AttrName.begin(), AttrValue.begin());
    }

    // Copy the character data, if any.
    vector<char> charData;
    if (p->GetCharData(charData)) retRoot->SetCharData(charData);

    // Recursively clone the child elements.
    CElementIterator iter(p->GetFirstChild());
    while (iter.IsValid())
    {
        CElement* child = Clone(iter.GetElementPtr());
        if (child) retRoot->AddChildElement(child);
        ++iter;
    }
    return retRoot;
}

Using the code

The recommended way to use MiniXML is to create a CXmlConf object from an XML file or string, then use the member functions of CElement and CElementIterator to walk through the resulting document tree. The ParseAndCloneTest function defined in Test.cpp shows the usage:

void ParseAndCloneTest()
{
    //
    // Create a CXmlConf object, clone its document tree
    // into pRoot, and output the clone
    //
    CXmlConf xmlConf("sampleXML.xml");
    if (xmlConf)
    {
        // The clone function was defined above
        CElement* pRoot = xmlConf.Clone();
        if (pRoot)    // cloning can fail, so check before dereferencing
        {
            cout << *pRoot;
            pRoot->Delete();
        }
    }
    else cout << "ParseAndCloneTest Failed.";
}

In some cases, users may want to read or write a specific element of the document tree. MiniXML provides the function CXmlConf::GetRootElement, which returns a pointer to the first CElement that matches a dot-separated sequence of element names. For example, for the XML input:

<?xml version="1.0" encoding= "UTF-8" ?>
<Element1 attr="haha" attr2="haha2" attr3="hahah3">
<SubElement1 Attr="Book" Attr2="Pen" Attr3="keyboard"/>
<SubElement1 Attr="Book2" Attr2="Pen" Attr3="keyboard">
<SubElement2>
<SubElement2Sub attr="Beijing" attr2="ShangHai"> </SubElement2Sub>
<SubElement2Sub attr="XiAn" attr2="NanJing"> </SubElement2Sub>
</SubElement2>
</SubElement1>
</Element1>

Users can call CXmlConf::GetRootElement("Element1.SubElement1.SubElement2.SubElement2Sub") to get a pointer to the first SubElement2Sub element, the one whose attr attribute is "Beijing". The following example, ElementModifyTest, gets a pointer to the second SubElement2Sub element, changes its attr attribute, adds a new child element, and finally writes the document tree to the file "OutputXML.xml".

void ElementModifyTest()
{
    // This example shows how to change the attr attribute of the
    // second Element1->SubElement1->SubElement2->SubElement2Sub
    // element from 'XiAn' to 'HuNan', add a new child element to it,
    // and save the document tree to a valid XML file.
    CXmlConf xmlConf("sampleXML.xml");
    if (xmlConf)
    {
        CElementIterator iter(xmlConf.GetRootElement("Element1."
                           "SubElement1.SubElement2.SubElement2Sub"));
        if (iter.IsValid())
        {
            ++iter;    // advance to the second SubElement2Sub
            if (iter.IsValid())
            {
                iter.GetElementPtr()->ModifyAttribute("attr", "HuNan");
                CElement* p = CElement::CreateNewElement("NewElement");
                p->AddAttributePair("Attribute", "hahahaha");
                p->AddAttributePair("Attribute2", "hahahaha2");
                iter.GetElementPtr()->AddChildElement(p);
            }
        }

        //
        // Output the result to a file
        //
        ofstream ofs("OutputXML.xml");
        ofs << xmlConf;
    }
}
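One way a dotted element path such as "Element1.SubElement1.SubElement2.SubElement2Sub" could be resolved is sketched below. This is an assumption about the behavior, not a description of how GetRootElement is implemented: PathNode, SplitPath, Match and FindByPath are hypothetical names, and the sketch backtracks over siblings, because in the sample input the first SubElement1 is empty and only the second one contains a SubElement2.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical tree node; MiniXML's CElement is not modeled here.
struct PathNode {
    std::string name;
    std::vector<PathNode> children;
};

// Splits "A.B.C" into {"A", "B", "C"}.
std::vector<std::string> SplitPath(const std::string& path) {
    std::vector<std::string> parts;
    std::size_t start = 0;
    for (;;) {
        std::size_t dot = path.find('.', start);
        if (dot == std::string::npos) {
            parts.push_back(path.substr(start));
            return parts;
        }
        parts.push_back(path.substr(start, dot - start));
        start = dot + 1;
    }
}

// Returns the first node in document order that completes the path
// parts[depth..] starting at node, or nullptr if there is none.
const PathNode* Match(const PathNode& node,
                      const std::vector<std::string>& parts,
                      std::size_t depth) {
    if (depth >= parts.size() || node.name != parts[depth]) return nullptr;
    if (depth + 1 == parts.size()) return &node;
    for (const PathNode& child : node.children) {
        // Try every matching sibling, not just the first one.
        if (const PathNode* hit = Match(child, parts, depth + 1)) return hit;
    }
    return nullptr;
}

const PathNode* FindByPath(const PathNode& root, const std::string& path) {
    return Match(root, SplitPath(path), 0);
}
```

A lookup that stopped at the first SubElement1 would return nothing for the sample input, which is why the sketch keeps searching later siblings when an earlier branch fails.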

Points of interest

The class CStringValue is used by all the MiniXML classes to hold string information such as element names, element character data, attribute names and attribute values. CStringValue offers two different ways to keep a string. During parsing, CStringValue does not copy the string taken from the input XML data into its internal buffer. Instead, it stores the starting and ending addresses of the string in its members m_pBegin and m_pEnd. This avoids unnecessary buffer allocation and string copying during the parsing process. On the other hand, when users modify the document tree, for example by changing an attribute value, CStringValue behaves as a regular string class that uses its internal buffer to hold the new string. The CStringValue class is defined in MiniParser.h.
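The two storage modes can be illustrated with a small sketch. It is not the real CStringValue: the class name StringSlice and the methods Bind, Assign, OwnsBuffer and ToString are hypothetical, and only the member names m_pBegin and m_pEnd come from the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical dual-mode string; a sketch, not MiniXML's CStringValue.
class StringSlice {
public:
    // Parse mode: remember where the text lives in the input buffer. No copy.
    // The caller must keep that buffer alive while the slice is in use.
    void Bind(const char* begin, const char* end) {
        m_pBegin = begin;
        m_pEnd = end;
        m_owns = false;
        m_buffer.clear();
    }

    // Edit mode: copy the new text into the private buffer.
    void Assign(const std::string& text) {
        m_buffer = text;
        m_owns = true;
        m_pBegin = nullptr;
        m_pEnd = nullptr;
    }

    bool OwnsBuffer() const { return m_owns; }

    const char* Begin() const { return m_owns ? m_buffer.data() : m_pBegin; }

    std::size_t Size() const {
        return m_owns ? m_buffer.size()
                      : static_cast<std::size_t>(m_pEnd - m_pBegin);
    }

    // Materializes a copy only when the caller explicitly asks for one.
    std::string ToString() const {
        return Size() == 0 ? std::string() : std::string(Begin(), Size());
    }

private:
    const char* m_pBegin = nullptr;   // start of borrowed text (parse mode)
    const char* m_pEnd = nullptr;     // one past the end of borrowed text
    bool m_owns = false;              // true once Assign has copied text
    std::string m_buffer;             // private storage (edit mode)
};
```

The trade-off of the borrowed mode is lifetime: the slices are only valid while the original XML buffer exists, so the document object has to own that buffer for as long as the tree is used.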

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.



Written By
Architect
United States
Richard Lin is a senior software engineer in Silicon Valley.

Richard Lin was born in Beijing and came to the US in the fall of 1995. He began his software career in the Bay Area of California in 1997. He has worked on many interesting projects, including manufacturing testing systems, wireless AP firmware and applications, an email anti-virus system, and personal firewalls. He loves playing Go (WeiQi in Chinese) and soccer in his spare time. He has a beautiful wife and a cute daughter and enjoys his life in San Jose, California.

Comments and Discussions

 
Re: W3C compliance and other issues
zm10xn29, 6-Aug-06 21:30

This is untrue. In the real world, only a tiny piece of base XML needs to be working for interop. This has been proven again and again by many internet protocols through the 'conservative in what you send, liberal in what you accept' dictum.

The only reason for the existence of the majority of a 'perfect implementation' is finger pointing and test suites.



Robert