Simple C++ XML Parser

1 Oct 2010 · CPOL · 5 min read
A Simple C++ XML parser with only the basic functionality

Introduction

I wrote this article because I needed a basic, lightweight XML parser and could not find one on the internet that suited my needs.

The parsers out there are dauntingly complex and take a great deal of knowledge to understand. If you are not a seasoned C++ programmer, it is very hard to make sense of their code; if you are, then you have already written your own parser.

I wrote the parse function, then a set of classes for storing the parsed data, and an example of how to use it (an MFC dialog-based application with a tree view). The parser is very simple and has only the basic functionality it needs to work, no fancy stuff. It has some limitations:

  • The parser recognizes comments to some extent. In versions before October 1, 2010, a comment outside the root node generated an error (see History).
  • As of 29-September-2010, some CDATA support has been added. It is still limited (for example:
    C++
    //<![CDATA[
    will generate an error). Also, CDATA sections may exist only as node values. Extra handling can easily be added if needed, but given the purpose of this project I would prefer to leave it like this, so that the parse function does not become too complicated to modify.
  • No support for processing instructions in versions before October 1, 2010 (see History).
  • No support for DTDs and entities.
What is New

This project has been designed with simplicity in mind, so that the code base can be assimilated quickly and easily and extended with whatever extra functionality a specific application needs.

These classes provide only the basic functionality, which makes this the most lightweight parser of all those I could find: 500 lines of code for the parser (including some unused base64 functions, which can be removed if necessary) and another 500 for the binary tree construct. Any further functionality has to be added by you to fit your own needs.

It is very easy to understand and work with this code as a base for further development.

The parsing function is iterative and goes through the XML string only once, so the performance is quite satisfactory. The memory requirements are low: each object allocates only as much memory as it actually needs. There is plenty of room for improvement, but that is not the goal of this project.
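To illustrate what "iterative, single pass" can mean in practice, here is a minimal sketch. It is not the ParseString implementation from this project: TagsAreBalanced is a hypothetical helper that walks the string once, keeps the open element names on an explicit stack instead of recursing, and only checks that start and end tags match. It assumes no comments, CDATA sections, or processing instructions, and no '>' inside attribute values.

```cpp
#include <string>
#include <vector>

// Hypothetical helper, not part of the Cxml sources. Walks the input once,
// using an explicit stack instead of recursion, and reports whether every
// start tag is closed by a matching end tag.
bool TagsAreBalanced(const std::string& xml)
{
    std::vector<std::string> open;   // names of the currently open elements
    std::string::size_type pos = 0;

    while ((pos = xml.find('<', pos)) != std::string::npos) {
        const std::string::size_type end = xml.find('>', pos);
        if (end == std::string::npos)
            return false;            // unterminated tag

        const std::string tag = xml.substr(pos + 1, end - pos - 1);
        if (tag.empty())
            return false;            // "<>" is not a tag

        if (tag[0] == '/') {
            // End tag: it must match the most recently opened element.
            if (open.empty() || open.back() != tag.substr(1))
                return false;
            open.pop_back();
        } else if (tag[tag.size() - 1] != '/') {
            // Start tag (not self-closing): keep only the element name,
            // dropping any attributes that follow the first whitespace.
            open.push_back(tag.substr(0, tag.find_first_of(" \t\r\n")));
        }
        pos = end + 1;               // continue after this tag, never go back
    }
    return open.empty();
}
```

Because the position only ever moves forward, the cost grows linearly with the length of the input, which is the property the paragraph above relies on.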

How It Works

I personally think that nothing else needs to be said about this, because the code speaks for itself and has been designed to be easy to read and understand. If it is considered necessary, I can go into some detail on how it works and how it can be extended.

Using the Code

Three classes deal with the XML format (marked with a red dot in the class diagram): Cxml, CAttribute and CNode.

class_struct.JPG

The Cxml class is the workhorse of this project and contains the parse function:

C++
bool Cxml::ParseString(_TCHAR* szXML); 

Once parsed, the information needs to be stored in memory in a form that is easy to use, hence the CNode and CAttribute classes.

The CNode class has a tree-like structure, with a parent pointer and a list of children. It also contains a list of CAttribute objects.

In addition to these classes, there is a pair of Utils files (.h and .cpp) containing some utility functions.
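The shape described above can be sketched as follows. This is only an illustration of the layout (a parent pointer, a list of children, and a list of attributes); SketchNode, SketchAttribute, AddChild and Depth are hypothetical names and do not reflect the actual members of CNode or CAttribute.

```cpp
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for CAttribute: a simple name/value pair.
struct SketchAttribute {
    std::string name;
    std::string value;
};

// Hypothetical stand-in for CNode.
struct SketchNode {
    std::string name;
    std::string value;
    SketchNode* parent = nullptr;                       // non-owning back pointer
    std::vector<std::unique_ptr<SketchNode>> children;  // owned child nodes
    std::vector<SketchAttribute> attributes;

    // Creates a child, links it back to this node and returns a pointer to it.
    SketchNode* AddChild(const std::string& childName) {
        children.push_back(std::unique_ptr<SketchNode>(new SketchNode()));
        SketchNode* child = children.back().get();
        child->name = childName;
        child->parent = this;
        return child;
    }
};

// Counts the ancestors of a node by following the parent pointers upward.
int Depth(const SketchNode* node) {
    int depth = 0;
    for (; node->parent != nullptr; node = node->parent)
        ++depth;
    return depth;
}
```

The parent pointer is what makes upward navigation cheap, while the children list gives the downward direction that a tree view needs.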

How to Use It?

To use it, do the following:

  • Add
    C++
    #include "Cxml.h"
    to your project.

  • Create an instance of the class:
    C++
    Cxml *oxml = new Cxml();
  • Pass a pointer to the string containing the XML code to the parse function:
    C++
    oxml->ParseString(szXML); 

After ParseString returns, the structure of the XML is replicated in the class, and the XML root node can be retrieved with the...

C++
oxml->GetRootNode(); 

...call.

There is a peculiarity here. Because I adopted a "last in, first out" approach, the nodes are stored in the reverse of the order in which they appear in the original XML string.

A CNode object can be navigated using its public functions. Remember, though, that the GetNextChild() function advances an internal position counter, and I have not implemented a way to reset it.

The best way to understand the inner workings is to get the demo project and try it for yourself. You will need Visual Studio 2008 to compile the project as is, without reconstruction. If you do reconstruct it, remember that it has not been tested for Unicode.
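One way to live with both peculiarities (reverse storage order and a cursor that cannot be reset) is to drain the children once into your own container and reverse it. The sketch below models the described behaviour with a hypothetical SketchChildList class rather than the real CNode; in particular, returning nullptr when the children are exhausted is an assumption of this sketch, so check what the real GetNextChild() does before relying on it.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical model of the behaviour described above, not the real CNode.
class SketchChildList {
public:
    // Last in, first out: each new child is placed in front of the others.
    void Push(const std::string& name) { items_.insert(items_.begin(), name); }

    // Forward-only cursor with no reset. Returns nullptr once exhausted
    // (an assumption made for this sketch).
    const std::string* GetNextChild() {
        return next_ < items_.size() ? &items_[next_++] : nullptr;
    }

private:
    std::vector<std::string> items_;
    std::size_t next_ = 0;
};

// Drains the cursor once and returns the names in original document order.
std::vector<std::string> InDocumentOrder(SketchChildList& children) {
    std::vector<std::string> names;
    while (const std::string* name = children.GetNextChild())
        names.push_back(*name);
    std::reverse(names.begin(), names.end());
    return names;
}
```

After the single drain, the copy can be traversed as many times as needed, so the missing reset no longer matters.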

Download the project and extract it. Compile. Run it and press the load button.

exe_print_screen.JPG

Choose one of the XML files provided as examples. Click open.

tree_print_screen.JPG

This is the result for one of the XML files provided.

XML
<CATALOG>
...
    <PLANT>
        <COMMON>Snakeroot</COMMON>
        <BOTANICAL>Cimicifuga</BOTANICAL>
        <ZONE>Annual</ZONE>
        <LIGHT>Shade</LIGHT>
        <PRICE>$5.63</PRICE>
        <AVAILABILITY>071199</AVAILABILITY>
    </PLANT>
    <PLANT>
        <COMMON>Cardinal Flower</COMMON>
        <BOTANICAL>Lobelia cardinalis</BOTANICAL>
        <ZONE>2</ZONE>
        <LIGHT>Shade</LIGHT>
        <PRICE>$3.02</PRICE>
        <AVAILABILITY>022299</AVAILABILITY>
    </PLANT>
</CATALOG>

History

  • September 20, 2010
    • Added
  • September 21, 2010
    • Revised
  • September 22, 2010
    • Uploaded a new version of the project
      • Corrected a bug for multiple attributes
      • Added 'some' support for comments. Comments outside the root node will generate an error
  • September 28, 2010
    • Tested for Unicode, and added a Unicode XML file for testing
    • Removed the IntelliSense file from the project
    • Fixed the bug, found by Lorenzo Gatti, in which single quotes and double quotes were confused with the actual delimiter
  • September 29, 2010
    • Added run-time support for UNICODE character set! The program must be compiled using this character set:

    unicode.JPG

  • September 29, 2010
    • Some support for the CDATA sections has been added
    • I also discovered that the Parse function needs some refactoring, because it has become heavy and hard to understand.
  • October 1, 2010
    • Replaced all fixed size char arrays from the project with dynamically allocated ones.
    • Added support for processing instructions. All processing instructions are now treated as a special type of node, just like the comments.
    • Also comments can now exist outside the root-node of the XML since now the lowest level node is the XML_DOC node, not the root-node of the XML.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)



Comments and Discussions

 
My vote of 1 (xComaWhitex, 24-Sep-10 20:50)
Very poor designed class. You basically do not use anything really of C in your class. You use malloc instead of new (no reason for it). Why re-invent the wheel when there are other better xml parsers?