MiniXML, a Fast parser for the XML Configuration File

Richard Lin

May 29, 2005


An article that presents a fast XML parser for accessing the configuration file.

Download source files - 17.9 Kb

Introduction

Many software projects need a small, fast and portable XML parser that carries no platform dependency, such as the COM interface of MSXML. I recently designed a fast cross-platform XML parser called MiniXML. It quickly parses XML data into a document tree and provides an intuitive interface for accessing the data held in that tree.

Design consideration

Building the document tree as quickly as possible is the main goal of this project. To achieve it, I made the following design decisions:

MiniXML scans the entire XML data only once while building the document tree.
MiniXML minimizes string copying during tree building. Many other XML parsers whose source code I read copy each parsed element name or attribute value into a newly created string stored in the new element object, which I think costs too much time and memory. MiniXML avoids these copies through a purpose-built string class and tokenization scheme.
Like many other XML configuration files, our format needs only a small subset of the XML standard. For example, our configuration file does not require CDATA sections. MiniXML skips tokenization for the XML constructs we do not need in order to save processing time. The BNF of our configuration file format is summarized as follows (Form 1):
```
Document ::= doctypedecl element
doctypedecl ::=
element ::= EmptyElementTag | STag Content ETag
EmptyElementTag ::= '<' Name (S Attribute)* S '/>'
STag ::= '<' Name (S Attribute)* S '>'
Content ::= (element | CharData | PI | Comment)*
PI ::=
Comment ::= '
CharData ::= any normal chars
```

This rule set is much smaller than the official XML definition (EBNF for XML). MiniXML parses configuration files using this succinct rule set in order to minimize CPU and memory usage.
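To make the single-pass, copy-free idea concrete, here is a minimal sketch of scanning one start tag or empty-element tag in the spirit of the STag and EmptyElementTag rules of Form 1. It is not MiniXML code: `ParsedTag` and `ScanStartTag` are hypothetical names, and C++17's `std::string_view` stands in for the article's own string class. The scanner is also more lenient than Form 1, because it does not insist on white space between attributes.

```cpp
#include <cctype>
#include <cstddef>
#include <optional>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical result type: every field is a view into the input buffer,
// so no character data is copied while scanning.
struct ParsedTag {
    std::string_view name;
    std::vector<std::pair<std::string_view, std::string_view>> attributes;
    bool isEmptyElement = false;  // true for <Name ... />
};

inline bool IsSpaceChar(char c) {
    return std::isspace(static_cast<unsigned char>(c)) != 0;
}

// Simplified name characters (an assumption, not the full XML Name rule).
inline bool IsNameChar(char c) {
    return std::isalnum(static_cast<unsigned char>(c)) != 0 ||
           c == '_' || c == '-' || c == ':' || c == '.';
}

// Scans one tag at the front of `in`, visiting each character once.
// On success the consumed characters are removed from `in`;
// on failure `in` is left unchanged.
inline std::optional<ParsedTag> ScanStartTag(std::string_view& in) {
    std::size_t i = 0;
    const std::size_t n = in.size();
    auto skipSpace = [&] { while (i < n && IsSpaceChar(in[i])) ++i; };
    auto readName = [&] {
        const std::size_t start = i;
        while (i < n && IsNameChar(in[i])) ++i;
        return in.substr(start, i - start);
    };

    if (n == 0 || in[0] != '<') return std::nullopt;
    ++i;
    ParsedTag tag;
    tag.name = readName();
    if (tag.name.empty()) return std::nullopt;

    for (;;) {
        skipSpace();
        if (i >= n) return std::nullopt;              // unterminated tag
        if (in[i] == '>') { ++i; break; }
        if (in[i] == '/') {                           // expect "/>"
            if (i + 1 >= n || in[i + 1] != '>') return std::nullopt;
            tag.isEmptyElement = true;
            i += 2;
            break;
        }
        const std::string_view attrName = readName();
        if (attrName.empty()) return std::nullopt;
        skipSpace();
        if (i >= n || in[i] != '=') return std::nullopt;
        ++i;
        skipSpace();
        if (i >= n || (in[i] != '"' && in[i] != '\'')) return std::nullopt;
        const char quote = in[i++];
        const std::size_t valueStart = i;
        while (i < n && in[i] != quote) ++i;
        if (i >= n) return std::nullopt;              // missing closing quote
        tag.attributes.emplace_back(attrName,
                                    in.substr(valueStart, i - valueStart));
        ++i;                                          // skip closing quote
    }
    in.remove_prefix(i);
    return tag;
}
```

A caller would keep the whole file in one buffer, wrap it in a `std::string_view`, and call `ScanStartTag` repeatedly. The returned names and values stay valid only as long as that buffer does.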

General design

The following diagram (Figure 1) shows the MiniXML class hierarchy:

The MiniXML implements two major tasks:

Parse the input XML data and build a document tree

Each node of the document tree is a CElement object. Every object has a CElement* m_pFistChild member and a pair of members, m_pPrevSibling and m_pNextSibling. m_pFistChild points to the first child node, while the sibling nodes form a doubly linked list maintained through the m_pPrevSibling and m_pNextSibling members of each CElement object. Using a hand-built doubly linked list instead of an STL container simplifies iteration in CElementIterator.
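The layout described above can be sketched as follows. This is an illustration under assumptions rather than MiniXML's actual code: `Node`, `SiblingIterator` and `CountChildren` are hypothetical names, and the `lastChild` pointer is an extra that I assume here only to make appending O(1).

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical stand-in for CElement: one first-child pointer plus a
// doubly linked list of siblings.
struct Node {
    std::string name;
    Node* parent = nullptr;
    Node* firstChild = nullptr;
    Node* lastChild = nullptr;    // assumed extra, not mentioned in the article
    Node* prevSibling = nullptr;
    Node* nextSibling = nullptr;

    explicit Node(std::string n) : name(std::move(n)) {}
    Node(const Node&) = delete;
    Node& operator=(const Node&) = delete;

    // A node owns its children and deletes them recursively.
    ~Node() {
        Node* c = firstChild;
        while (c) {
            Node* next = c->nextSibling;
            delete c;
            c = next;
        }
    }

    // Takes ownership of `child` and appends it to the sibling list.
    Node* AddChild(Node* child) {
        child->parent = this;
        child->prevSibling = lastChild;
        child->nextSibling = nullptr;
        if (lastChild) lastChild->nextSibling = child;
        else firstChild = child;
        lastChild = child;
        return child;
    }
};

// Hypothetical iterator over a sibling list, in the spirit of
// CElementIterator: it only has to follow the sibling pointers.
class SiblingIterator {
public:
    explicit SiblingIterator(Node* start) : m_p(start) {}
    bool IsValid() const { return m_p != nullptr; }
    Node* Get() const { return m_p; }
    SiblingIterator& operator++() { if (m_p) m_p = m_p->nextSibling; return *this; }
    SiblingIterator& operator--() { if (m_p) m_p = m_p->prevSibling; return *this; }
private:
    Node* m_p;
};

// Counts the direct children of `n` by walking the sibling list.
inline std::size_t CountChildren(const Node& n) {
    std::size_t count = 0;
    for (SiblingIterator it(n.firstChild); it.IsValid(); ++it) ++count;
    return count;
}
```

Because each node carries its own links, the iterator needs no container state beyond a single pointer, which is the convenience the paragraph above refers to.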

The constructor of CXmlConf starts the creation of the document tree. It creates the root CElement object and a CScanner object, then calls CElement's Parse function to build the tree:

    m_pScanner= new CScanner(pBuffer,pBuffer+buffersize);
    m_pRoot=new CElement(NULL);
    ...
    m_pRoot->Parse(m_pScanner);
    ...

The CElement::Parse function dispatches the parsing work to the appropriate CBaseParser objects according to the input token and the BNF rules defined in Form 1.

    bool CElement::Parse(CScanner* pScan)
    {
        CStagParser StagParser(this);
        CEtagParser EtagParser(this);
        if (!StagParser.Parse(pScan)) return false;

        // Empty-element tags, processing instructions and comments
        // have no content and no end tag.
        if (StagParser.IsEmptyElementTag() ||
            StagParser.IsPITag() ||
            StagParser.IsCommentTag())
        {
            m_StringValue = StagParser.GetNameObj();
            if (m_pParent) m_pParent->AddChildElement(this);
            m_bValid = true;
            return true;
        }

        CContent contentParser(this);
        if (!contentParser.Parse(pScan)) return false;
        if (!EtagParser.Parse(pScan)) return false;

        // The element is valid only if the start and end tag names match.
        if (StagParser.GetNameObj() == EtagParser.GetNameObj())
        {
            m_StringValue = StagParser.GetNameObj();
            if (m_pParent) m_pParent->AddChildElement(this);
            m_bValid = true;
        }
        return m_bValid;
    }

Access the XML data from the document tree

The MiniXML access interface consists of the following three classes:

CXmlConf, which acquires the XML data and initiates the parsing process to build a document tree.
CElement, which is the core class to access all the XML data for an Element.
CElementIterator, which is an iterator class to access the sibling nodes of a given CElement* pointer.

For demonstration purposes, I wrote a CElement* Clone(CElement* pObj) function (in ElementClone.cpp) that shows how to use the public member functions of CElement and CElementIterator. The function returns a pointer to a new CElement object that copies all the members and the sub-node tree of the CElement node pointed to by pObj. Given its poor performance and memory usage, it is not a practical way to clone a tree, but it is a helpful example of how to use MiniXML's interface classes.

CElement* Clone(CElement* p)
{
    if (!p || !p->IsValid()) return NULL;

    vector<char> ElementName;
    if (!p->GetElementName(ElementName)) return NULL;
    CElement* retRoot = CElement::CreateNewElement(ElementName.begin());

    // Copy every attribute name/value pair
    int AttrCount = p->GetAttributeCount();
    for (int i = 0; i < AttrCount; i++)
    {
        vector<char> AttrName, AttrValue;
        if (!p->GetAttributePairByIndex(i, AttrName, AttrValue))
        {
            retRoot->Delete();
            return NULL;
        }
        retRoot->AddAttributePair(AttrName.begin(), AttrValue.begin());
    }

    // Copy the character data, if any
    vector<char> charData;
    if (p->GetCharData(charData)) retRoot->SetCharData(charData);

    // Recursively clone the child elements
    CElementIterator iter(p->GetFirstChild());
    while (iter.IsValid())
    {
        CElement* child = Clone(iter.GetElementPtr());
        if (child) retRoot->AddChildElement(child);
        ++iter;
    }
    return retRoot;
}

Using the code

The recommended way to use MiniXML is to create a CXmlConf object from an XML file or string, then use the member functions of CElement and CElementIterator to walk the resulting document tree. The ParseAndCloneTest function defined in Test.cpp shows this usage:

void ParseAndCloneTest()
{
    //
    // Create a CXmlConf object, clone its document tree
    // into pRoot, and output the clone
    //
    CXmlConf xmlConf("sampleXML.xml");
    if (xmlConf)    {
        // The clone function was defined above
        CElement* pRoot = xmlConf.Clone();
        if (pRoot) {    // guard against a failed clone
            cout << *pRoot;
            pRoot->Delete();
        }
    }
    else cout<<"ParseAndCloneTest Failed.";      
}

In some cases, users may want to read or modify a specific element of the document tree. MiniXML provides the function CXmlConf::GetRootElement, which returns a pointer to the first CElement that matches a dot-separated sequence of element names. For example, for the XML input:

<?xml version="1.0" encoding= "UTF-8" ?> 
<Element1 attr="haha" attr2="haha2" attr3="hahah3">
<SubElement1 Attr="Book" Attr2="Pen" Attr3="keyboard"/>
<SubElement1 Attr="Book2" Attr2="Pen" Attr3="keyboard">
<SubElement2>
<SubElement2Sub attr="Beijing" attr2="ShangHai"> </SubElement2Sub>
<SubElement2Sub attr="XiAn" attr2="NanJing"> </SubElement2Sub>
</SubElement2>
</SubElement1>
</Element1>

Users can call CXmlConf::GetRootElement("Element1.SubElement1.SubElement2.SubElement2Sub") to get a pointer to the first SubElement2Sub element, the one whose attr attribute is "Beijing". The following example, ElementModifyTest, gets a pointer to the second SubElement2Sub element, modifies it, and finally writes the document tree to the file "OutputXML.xml".

void ElementModifyTest()
{
    // This example shows how to
    // modify an attribute of the
    // 2nd Element1->SubElement1->SubElement2->SubElement2Sub
    // from 'XiAn' to 'HuNan'
    // and save the document tree to a valid XML file
    CXmlConf xmlConf("sampleXML.xml");
    if (xmlConf)
    {
        CElementIterator iter(xmlConf.GetRootElement("Element1."
                           "SubElement1.SubElement2.SubElement2Sub"));
        if (iter.IsValid())
        {
            ++iter;
              if (iter.IsValid())
            {
                 iter.GetElementPtr()->ModifyAttribute("attr","HuNan");
                CElement*p=CElement::CreateNewElement("NewElement");
                p->AddAttributePair("Attribute", "hahahaha");
                p->AddAttributePair("Attribute2", "hahahaha2");
                iter.GetElementPtr()->AddChildElement(p);
            }
        }

        //
        // Output the result to a file 
        //
        ofstream ofs("OutputXML.xml");
        ofs<<xmlConf;
    }
}
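The dotted-path lookup that GetRootElement performs can be pictured with the sketch below. The article does not say how MiniXML resolves the path, so this is one plausible approach, not its implementation: `PathNode` and `FindByPath` are hypothetical names. The sketch backtracks across siblings, because in the sample input the first SubElement1 has no children and the search must continue into the second one.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical minimal node: a name, a first child and a next sibling.
struct PathNode {
    std::string name;
    PathNode* firstChild = nullptr;
    PathNode* nextSibling = nullptr;
};

// Resolves a dot-separated path such as "Element1.SubElement1" against
// `first` and its following siblings. Returns the first node, in document
// order, that completes the whole path, or nullptr if there is none.
inline PathNode* FindByPath(PathNode* first, std::string_view path) {
    const std::size_t dot = path.find('.');
    const std::string_view segment = path.substr(0, dot);
    for (PathNode* p = first; p; p = p->nextSibling) {
        if (p->name != segment) continue;
        if (dot == std::string_view::npos) return p;   // last segment matched
        // Try the remaining path below this node; if that fails,
        // fall through and try the next sibling with the same name.
        if (PathNode* hit = FindByPath(p->firstChild, path.substr(dot + 1)))
            return hit;
    }
    return nullptr;
}
```

Once the first match is found, its later siblings can be reached by following the sibling links, which is what ElementModifyTest does with `++iter`.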

Points of interest

The class CStringValue is used throughout the MiniXML classes to hold string information such as element names, element character data, attribute names and attribute values. It stores strings in one of two ways. During parsing, CStringValue does not copy the string from the input XML data into an internal buffer; instead, it records the string's starting and ending addresses in its members m_pBegin and m_pEnd, which avoids allocating and copying string buffers while the tree is built. When users modify the document tree, for example by changing an attribute value, CStringValue behaves like a regular string class and keeps the new string in its own internal buffer. CStringValue is defined in MiniParser.h.
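The two storage modes can be sketched roughly as follows. This is a guess at the shape of such a class, not the contents of MiniParser.h: only the member names m_pBegin and m_pEnd come from the article, while `StringValueSketch`, `Assign`, `View`, `OwnsBuffer` and the remaining members are hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical dual-mode string: either a non-owning [begin, end) range
// inside the parsed input, or an owned copy created on modification.
class StringValueSketch {
public:
    StringValueSketch() = default;

    // Parse-time mode: remember where the text lives, copy nothing.
    StringValueSketch(const char* begin, const char* end)
        : m_pBegin(begin), m_pEnd(end) {}

    // Edit-time mode: copy the new text into an internal buffer.
    void Assign(std::string_view text) {
        m_owned = std::string(text);
        m_ownsBuffer = true;
        m_pBegin = nullptr;
        m_pEnd = nullptr;
    }

    bool OwnsBuffer() const { return m_ownsBuffer; }

    // Uniform read access, whichever mode is active.
    std::string_view View() const {
        if (m_ownsBuffer) return m_owned;
        if (!m_pBegin) return {};
        return std::string_view(m_pBegin,
                                static_cast<std::size_t>(m_pEnd - m_pBegin));
    }

    // Compares contents, not storage mode.
    friend bool operator==(const StringValueSketch& a,
                           const StringValueSketch& b) {
        return a.View() == b.View();
    }

private:
    const char* m_pBegin = nullptr;  // start of the text in the input buffer
    const char* m_pEnd = nullptr;    // one past the end of that text
    std::string m_owned;             // used only after Assign()
    bool m_ownsBuffer = false;
};
```

One consequence of the non-owning mode, in this sketch at least, is that the input buffer must outlive every value that still points into it.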