Click here to Skip to main content
15,867,308 members
Articles / Programming Languages / XML

A Simple XML Parser from Scratch

Rate me:
Please Sign up or sign in to vote.
4.38/5 (9 votes)
8 Jul 2011CPOL3 min read 42.9K   466   24   9
For performance reasons, I decided to write my own XML parser.

Introduction

This is an implementation of a subset of the XML classes provided in .NET. This may be useful since it does run faster than the .NET classes, although it has less features and has some bugs. If you know exactly what data you need from a document, this is a fast approach to pull it out. It is also useful if you don't have access to these .NET libraries for some reason such as if you are using silverlight or embedded devices, with limited libraries, XBox, windows phone 7 etc etc. 

If you've never used the .NET classes XmlDocument and XPathNavigator, you will need to know the select syntax. If you know, then feel free to skip this section. If not, first read this page.

Essentially, Select just lets you navigate through the trees of XML nodes. However, there are also attributes in XML such as ex.

XML
<shape><triangle sid="1/></shape> . 

The sid is an attribute and to select this, the path would be:

ex1)

SQL
Select("shape/triangle") 
Evaluate( "@sid" )

returns 1

ex2) (no attribute)

XML
<car><wheel>spinning</wheel></car> :
Select = "car"
evaluate "wheel" // no @ needed

returns "spinning"

OR the fastest way, but with no scope protection:

Just simply evaluate("@sid") with no select at all will search out the first occurrence @sid and MoveNext() does what you'd think. This example is not the same as .NET and obviously increases performance massively but you can only do it when you know the document.

Scope

When I say Scope detection, I really mean limiting the evaluation to the node scope. The .NET classes have protection and will return String.Empty in cases where the scope is wrong. This class disables it so you can optimize your search without specifying any scope. In reality, it doesn't really disable it, inherent in the parsing algorithm is a key difference. This algorithm uses searches by simple indexof(). It does not compile and cache anything except two integers begin and end index. This approach uses almost no memory and is in every case faster but in cases with ambiguous scope it CAN return the wrong data. Therefore, you need to know the scope is unique in your selects.

Usage

Loading up some values from a document is similar to .NET:

C#
public static void GetSingleValueFXmlDoc()
{
    string xml = "<car value=\"1\"><wheel sid=\"1\" bolts=\"4\">
	ddd</wheel><wheel sid=\"2\" bolts=\"4\">dddd</wheel></car>";

    DateTime dt = DateTime.Now;

    FXMLDocument doc = new FXMLDocument();
    doc.LoadXml(xml);
    doc.Select( "car/wheel" );
    string sid = doc.GetValueString("@sid"); // or doc.Evaluate(
    double df = doc.GetValueDouble("@bolts");
    
    string diff = DateTime.Now.Subtract(dt).ToString();
    Console.WriteLine("single values FXML Time\t\t={0}", diff);
} 

It's basically the same as .NET but no navigator. Here, we load the document, select the path in syntax the same as xpath navigator and call get value. There is also an evaluate method but getvalue just returns the type you want with a simple wrapper.

Reading all attributes in a selected node

C#
for (int i = 0; i < count; ++i)
{
    FXMLDocument doc = new FXMLDocument();
    doc.LoadXml(xml);
    doc.Select("car/wheel");

    foreach (FXMLDocument x in doc)
    {    
        string sid = x.GetValueString("@sid");
        double df = x.GetValueDouble("@bolts");
    }
}

How It Works

C#
   for (int i = 0; i < nodes.Length; i++)
225      {
226          string nodeStart = String.Empty;
227          string nodeEnd = String.Empty;
228  
229          if ((nodes.Length > i + 1) && nodes[i + 1].IndexOf("@") == 0)
230          {
231              nodeStart = string.Format("{0}=\"", nodes[i + 1].Replace("@", ""));
232              nodeEnd = "\"";
233              isAttribute = true;
234          }
235          else if ((nodes.Length == 1) && nodes[i].IndexOf("@") == 0)
236          {
237              nodeStart = string.Format("{0}=\"", nodes[i].Replace("@", ""));
238              nodeEnd = "\"";
239              isAttribute = true;
240          }
241          else
242          {
243              // not last node
244              nodeStart = string.Format("{0}{1}", "<", nodes[i]);
245              nodeEnd = string.Format("{0}{1}{2}", "</", nodes[i], ">");
246              isAttribute = false;
247          }
248  
249          GetBetweenExcludeTokens(xmlDocument, nodeStart, nodeEnd, ref begin, out end);
250  
251          if (isAttribute)
252          {
253              return; //xmlbuffer;
254          }
255          int indexNextNode = (xmlDocument.IndexOf(">", begin)) + 1; // +1 to include >
256          int indexofEquals = xmlDocument.IndexOf("=", begin);
257          if (indexofEquals != -1 && (indexofEquals < indexNextNode))
258          {
259              if (begin < indexNextNode)
260              {
261                  begin += indexNextNode - begin;
262              }
263              else
264              {
265                  begin += begin - indexNextNode;
266              }
267          }
268          else
269          {
270              if (xmlDocument[begin] == ('>'))
271              {
272                  ++begin;
273              }
274          }
275      }
276  }

This is the main loop of the entire process. Basically, it just loops through all the nodes of the select statement and gets the text between them and that's it. Of course, there is special handling for attributes which really makes what would have been an elegant system a real nightmare to handle. I can bet you that there were some arguments in the office at Microsoft the day some smart guy wanted attributes. In any case, GetBetweenExcludeTokens is doing the meat of the entire operation, everything else is just syntax support.

Points of Interest

This runs only around 30% faster than the .NET classes which for most people are not going to be worth it since it is not fully compatible. However in example GetSingleValueFXmlDoc, there is a secret to speed things up. Because we know the document contains only wheels->bolts ( bolts is unique), why bother selecting car/wheel, with this you can just call doc.Evaluate( @bolts ) without selecting anything which will boost performance by at least 100% over .NET classes. This was actually all it was intended for but, as I went I hit a lot of gotchas and this minimizes them slightly when you turn ValidateSyntax on.

My first version was just a static method which was probably better and I wish I saved it, however you can't MoveNext very nicely in a static implementation (although it's nice not to have to new it). You could modify getSubNodes to be static quite easily.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Software Developer (Senior)
Canada Canada
This member has not yet provided a Biography. Assume it's interesting and varied, and probably something to do with programming.

Comments and Discussions

 
GeneralMy vote of 1 Pin
Stephen Brannan8-Jul-11 11:41
Stephen Brannan8-Jul-11 11:41 
AnswerRe: My vote of 1 Pin
rj4527-Jul-11 9:56
rj4527-Jul-11 9:56 
GeneralRe: My vote of 1 Pin
Stephen Brannan27-Jul-11 10:34
Stephen Brannan27-Jul-11 10:34 
GeneralRe: My vote of 1 Pin
Stephen Brannan27-Jul-11 10:37
Stephen Brannan27-Jul-11 10:37 
GeneralRe: My vote of 1 Pin
rj4527-Jul-11 13:41
rj4527-Jul-11 13:41 
GeneralMy vote of 5 Pin
Member 414818817-Apr-11 19:34
Member 414818817-Apr-11 19:34 
Generalas long as it suits Pin
Huisheng Chen4-Apr-11 0:58
Huisheng Chen4-Apr-11 0:58 
GeneralRe: as long as it suits Pin
rj454-Apr-11 6:49
rj454-Apr-11 6:49 
GeneralRe: as long as it suits Pin
Huisheng Chen4-Apr-11 16:03
Huisheng Chen4-Apr-11 16:03 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.