Summary

Many readers of The Code Project are familiar with the various types of XML parsers in the .NET environment. This article series introduces a new XML processing model called VTD-XML to The Code Project community. It goes significantly beyond those traditional models by overcoming many of the technical challenges that hamper SOA and enterprise XML application development. The first part of this series demonstrates the benefits of VTD-XML as a parser with integrated XPath and as an indexer. The second part shows you how to benefit from VTD-XML's cutting, editing and modifying capabilities, and introduces the concept of "document-centric" XML processing. The third part shows you how to code your application in the C version of VTD-XML.

Introduction

VTD-XML is a suite of open-source XML processing technologies centered around a "non-extractive" XML processing technique called "Virtual Token Descriptor." It is cross-platform and available in C#, C and Java. The latest version is 2.2, which can be downloaded here. Depending on the perspective, VTD-XML can be viewed as one of the following:

Digging Into "Non-Extractive" Parsing

Let me quickly go over some of the new definitions introduced in the section above.

"Non-extractive" parsing means the XML text is kept intact in memory and un-decoded while tokens are represented exclusively using offsets and lengths (no string content copying). This is in contrast to "extractive" parsing (on which DOM, SAX and other old XML processing models are based), which allocates small memory blocks (a.k.a. strings) and copies into them the actual token content.
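To make the distinction concrete, here is a minimal sketch in plain C# that does not use the VTD-XML library. The `Token` struct and its members are hypothetical names chosen for illustration, and the document is held as a `string` for brevity, whereas the text above describes keeping the un-decoded XML in memory. An extractive parser would copy each token into a new string; the non-extractive style keeps only an offset and a length into the untouched document and compares or copies text on demand.

```csharp
using System;

// Hypothetical illustration only: not part of the VTD-XML API.
readonly struct Token
{
    public readonly int Offset;  // where the token starts in the document
    public readonly int Length;  // how many characters it spans

    public Token(int offset, int length) { Offset = offset; Length = length; }

    // Compare against a name in place, without allocating a new string.
    public bool Matches(string doc, string name) =>
        Length == name.Length &&
        string.CompareOrdinal(doc, Offset, name, 0, Length) == 0;

    // Copy the characters out only when the caller really needs a string.
    public string Materialize(string doc) => doc.Substring(Offset, Length);
}

static class NonExtractiveDemo
{
    public static void Main()
    {
        string doc = "<a><b>hello</b></a>";

        // Extractive style: the token content is copied into a new string.
        string extracted = doc.Substring(4, 1);

        // Non-extractive style: two integers describe the same token.
        Token b = new Token(4, 1);

        Console.WriteLine(extracted);            // b
        Console.WriteLine(b.Matches(doc, "b"));  // True
        Console.WriteLine(b.Materialize(doc));   // b
    }
}
```

The comparison in `Matches` never creates a string, which is the point of the technique: most tokens are only inspected, so copying each one up front is wasted work.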

"Virtual Token Descriptor" (VTD), whose layout is shown in Figure 1, is a binary encoding format extending the concept of "non-extractive" parsing to XML. A VTD record is a 64-bit integer that encodes the length, offset, nesting depth and type of an XML token. As of VTD-XML 2.2, the bit layout of a VTD record is further defined as follows:

[Figure 1. Bit Layout of a VTD Record]
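The idea of packing four fields into one 64-bit integer can be sketched as follows. The field widths and bit positions used here (a 30-bit offset, a 20-bit length, an 8-bit depth and a 4-bit type, with two unused bits) are assumptions made for illustration, because the figure itself is not reproduced in this text; the class and method names are hypothetical as well. Check the real layout against the VTD-XML documentation before relying on it.

```csharp
using System;

// Illustrative sketch only. Widths and positions are assumptions,
// and this is not the library's own code.
static class VtdRecordSketch
{
    const int LengthShift = 32; // assumed: bits 32-51 hold the token length
    const int DepthShift = 52;  // assumed: bits 52-59 hold the nesting depth
    const int TypeShift = 60;   // assumed: bits 60-63 hold the token type

    const long OffsetMask = (1L << 30) - 1; // assumed: bits 0-29 hold the offset
    const long LengthMask = (1L << 20) - 1;
    const long DepthMask = (1L << 8) - 1;
    const long TypeMask = (1L << 4) - 1;

    // Combine the four fields into a single 64-bit record.
    public static long Pack(int type, int depth, int length, int offset) =>
        ((type & TypeMask) << TypeShift) |
        ((depth & DepthMask) << DepthShift) |
        ((length & LengthMask) << LengthShift) |
        (offset & OffsetMask);

    // Each accessor shifts the field down and masks off its neighbors.
    public static int Type(long r) => (int)((r >> TypeShift) & TypeMask);
    public static int Depth(long r) => (int)((r >> DepthShift) & DepthMask);
    public static int Length(long r) => (int)((r >> LengthShift) & LengthMask);
    public static int Offset(long r) => (int)(r & OffsetMask);

    public static void Main()
    {
        long r = Pack(type: 0, depth: 1, length: 1, offset: 4);
        Console.WriteLine("offset=" + Offset(r) + " length=" + Length(r) +
                          " depth=" + Depth(r) + " type=" + Type(r));
        // offset=4 length=1 depth=1 type=0
    }
}
```

Because every record has the same fixed size, a whole document's worth of records fits in one array of 64-bit integers.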

Understand the Benefits of VTD-XML

Simply put, VTD-XML solves a significant number of XML processing issues in the enterprise, ranging from the obvious ones that you experience every day to the hidden ones that prevent you from taking your SOA project to the next level. Below is a brief discussion of some of those issues:

The following features highlight the benefits that VTD-XML brings to the table:

As you have probably guessed, VTD is the primary reason why VTD-XML is able to achieve all those feats simultaneously. A typical DOM parser allocates one unit of memory for each token in the XML input. This is costly both in memory (due to heap fragmentation) and in time (because of the sheer number of allocation requests). VTD-XML instead keeps a verbatim, unparsed copy of the XML in memory and generates VTD records that point into it to allow for simple navigation and access. Because parsing only reads the XML, the flexibility of per-token, variable-size allocation is not needed at this stage. Last, keep in mind that VTD-XML is technically a processing model rather than an API, and you can build your own API on top of a VTD-XML model.
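The allocation argument can be illustrated with a toy scanner. This is a hypothetical sketch, not a conforming XML parser and not the VTD-XML implementation: it assumes the input contains only plain start and end tags, and it uses its own simplified record packing. What it shows is that every start-tag name is recorded as one 64-bit value in a single growable list, so no per-token object or string is allocated during the scan.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical toy scanner: handles only <name> and </name> tags.
static class RecordBufferSketch
{
    // Toy packing (an assumption for readability):
    // depth in bits 48-63, length in bits 32-47, offset in bits 0-31.
    static long Pack(int depth, int offset, int length) =>
        ((long)depth << 48) | ((long)length << 32) | (uint)offset;

    public static int Depth(long r) => (int)(r >> 48);
    public static int Length(long r) => (int)((r >> 32) & 0xFFFF);
    public static int Offset(long r) => (int)(r & 0xFFFFFFFF);

    // Record each start-tag name as (depth, offset, length).
    public static List<long> Scan(string doc)
    {
        var records = new List<long>();
        int depth = -1;
        for (int i = 0; i < doc.Length; i++)
        {
            if (doc[i] != '<') continue;
            if (i + 1 < doc.Length && doc[i + 1] == '/')
            {
                depth--;            // end tag: step back up one level
                continue;
            }
            int start = i + 1;
            int end = doc.IndexOf('>', start);
            if (end < 0) break;     // malformed input: stop scanning
            depth++;
            records.Add(Pack(depth, start, end - start));
            i = end;
        }
        return records;
    }

    public static void Main()
    {
        string doc = "<a><b>hi</b><c>yo</c></a>";
        foreach (long r in Scan(doc))
        {
            // A string is created here only for display purposes.
            Console.WriteLine("depth=" + Depth(r) + " name=" +
                              doc.Substring(Offset(r), Length(r)));
        }
        // depth=0 name=a
        // depth=1 name=b
        // depth=1 name=c
    }
}
```

The list is backed by one array, so growing it costs an occasional resize rather than one heap allocation per token.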

Many articles have been written on various aspects of VTD-XML. They are available at "Links and presentation page". Also, if your browser has a Java plug-in installed, you can view this demo to help you understand the basic concept of non-extractive parsing.

A Typical Use Case

Right now, many applications suffer from serious performance issues when sending large, complex-structured XML documents across the enterprise messaging backbone (using ESB, MQ or BizTalk server). The streaming API-based approach is inherently less applicable because it cannot deal with document structure. If the application is coded in DOM, on the other hand, memory usage becomes an additional burden, forcing developers to split the documents into smaller ones before sending them.

With VTD-XML, there is more than one way to solve the problem. Because of its memory efficiency, random access and XPath support, VTD-XML in parsing mode allows your application to handle much larger documents at higher performance with less coding. In other words, the XML documents appear "smaller" with VTD processing.

Moreover, when you send the VTD index along with the XML text, the application at the receiving end can directly perform application logic (e.g. XPath queries, etc.) with zero parsing overhead, further enhancing throughput and reducing latency. Things get even better with VTD-XML when your applications start to modify the documents (to be discussed in the second part of this series).

The rest of this article demonstrates how to use VTD-XML to parse XML documents, run XPath queries against them and index them (both generating and loading an index). Before running the code samples, you need to download the VTD-XML project and the full version of its C# port.

Hello World!

This example shows you how to parse a file, manually navigate to a desired node and then print out its text content. In the input XML, the text node " hello world! " is nested two levels deep in the hierarchy.

<ns1:a xmlns:ns1="someURL">
   <ns1:b>   hello   world! </ns1:b>

</ns1:a>

The example first instantiates VTDGen and then calls parseFile() to parse the input document. After parsing, it obtains an instance of VTDNav with getNav(). The VTDNav object wraps the underlying XML hierarchy and contains a global cursor that the application moves by calling various flavors of toElement() and toElementNS(). Six constants determine the direction of navigation: ROOT, PARENT, FIRST_CHILD, LAST_CHILD, NEXT_SIBLING and PREV_SIBLING. Calling getText() returns either the index of the VTD record for the text node of the element at the cursor, or -1 if there is no such record. To print out the text content, the application converts that index to a string, first with toString() and then with toNormalizedString().

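using System;
using System.Collections.Generic;
using System.Text;
using com.ximpleware;
namespace example1
{
    class Hello_World
    {
        static void Main(string[] args)
        {
            VTDGen vg = new VTDGen();
            // The second argument turns namespace awareness on.
            if (vg.parseFile("test1.xml", true))
            {
                try
                {
                    VTDNav vn = vg.getNav();
                    // Move the cursor from the root element to its child <ns1:b>.
                    if (vn.toElementNS(VTDNav.FIRST_CHILD, "someURL", "b"))
                    {
                        // Index of the text node's VTD record, or -1 if none.
                        int i = vn.getText();
                        if (i != -1)
                        {
                            Console.WriteLine(vn.toString(i));
                            Console.WriteLine(vn.toNormalizedString(i));
                        }
                    }
                }
                catch (NavException e)
                {
                    // Report the failure instead of silently swallowing it.
                    Console.Error.WriteLine("Navigation failed: " + e.Message);
                }
            }
        }
    }
}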

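The output shows the difference between the strings produced by VTDNav's toString() and toNormalizedString(). The first keeps the whitespace of the text node, while the second strips the leading and trailing whitespace and collapses the run of spaces between the two words. This is one of the subtle parts of VTD-XML parsing.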

  hello   world!
hello world!
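The difference can be approximated in plain C#. The helper below is a hypothetical stand-in, not the library's implementation: it assumes that normalization means trimming leading and trailing whitespace and collapsing each internal run of whitespace into a single space, which is what the sample output suggests.

```csharp
using System;
using System.Text;

static class NormalizeSketch
{
    // Hypothetical helper with assumed behavior: trim both ends and
    // collapse each internal whitespace run into one space.
    public static string Normalize(string raw)
    {
        var sb = new StringBuilder(raw.Length);
        bool pendingSpace = false;
        foreach (char c in raw)
        {
            if (char.IsWhiteSpace(c))
            {
                // Remember the gap, but only if some text was already emitted.
                pendingSpace = sb.Length > 0;
                continue;
            }
            if (pendingSpace) sb.Append(' ');
            sb.Append(c);
            pendingSpace = false;
        }
        return sb.ToString();
    }

    public static void Main()
    {
        string raw = "   hello   world! ";
        Console.WriteLine("[" + raw + "]");            // [   hello   world! ]
        Console.WriteLine("[" + Normalize(raw) + "]"); // [hello world!]
    }
}
```

A trailing gap is never written because the pending space is only flushed when another non-whitespace character follows it.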

Running XPath Query

The second example shows you how to query the document using XPath. Below is the XML document:

<?xml version="1.0"?>
<purchaseOrder orderDate="1999-10-20">
    <items>
        <item partNum="872-AA">
            <productName>Lawnmower</productName>

            <quantity>1</quantity>
            <USPrice>148.95</USPrice>
            <comment>Confirm this is electric</comment>
        </item>

        <item partNum="872-AA">
            <productName>Lawnmower</productName>
            <quantity>1</quantity>
            <USPrice>148.95</USPrice>

            <comment>Confirm this is electric</comment>
        </item>
     </items>
</purchaseOrder>

To evaluate XPath queries, you need to instantiate AutoPilot. Call selectXPath() to set the XPath expression. Nesting evalXPath() in a while loop is the most common way to retrieve evaluated nodes, whose indices get returned one at a time. This is in contrast with both DOM and XPathNavigator, as both return the entire node set all at once. For further reading, please visit "Improve XPath performance with VTD-XML".

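using System;
using System.Collections.Generic;
using System.Text;
using com.ximpleware;
namespace example2
{
    class Program
    {
        static void Main(string[] args)
        {
            VTDGen vg = new VTDGen();
            int i;
            // The second argument turns namespace awareness off.
            if (vg.parseFile("test2.xml", false))
            {
                try
                {
                    VTDNav vn = vg.getNav();
                    AutoPilot ap = new AutoPilot(vn);
                    // A regular C# string literal cannot span lines,
                    // so the expression is concatenated from two parts.
                    ap.selectXPath(
                        "/purchaseOrder/items/item[@partNum=\"872-AA\"]" +
                        "/USPrice/text()");
                    // evalXPath() returns one node index per call, -1 when done.
                    while ((i = ap.evalXPath()) != -1)
                    {
                        Console.WriteLine(vn.toString(i));
                    }
                }
                catch (XPathParseException e)
                {
                    Console.Error.WriteLine("Invalid XPath: " + e.Message);
                }
                catch (XPathEvalException e)
                {
                    Console.Error.WriteLine("XPath evaluation failed: " + e.Message);
                }
                catch (NavException e)
                {
                    Console.Error.WriteLine("Navigation failed: " + e.Message);
                }
            }
        }
    }
}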

The output simply echoes the matching text nodes.

148.95
148.95
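The one-index-at-a-time pattern used by the while loop can be illustrated without the library. In the sketch below, `IndexCursor` is a hypothetical stand-in for an evaluator that hands back one match per call and -1 when it is exhausted. It filters a fixed array of candidate indices with a predicate rather than evaluating real XPath, so it only demonstrates the control flow, not the query engine.

```csharp
using System;

// Hypothetical stand-in for a pull-style evaluator; it does not parse XPath.
sealed class IndexCursor
{
    private readonly int[] candidates;
    private readonly Predicate<int> matches;
    private int position;

    public IndexCursor(int[] candidates, Predicate<int> matches)
    {
        this.candidates = candidates;
        this.matches = matches;
    }

    // Return the next matching index, or -1 when no matches remain.
    public int Next()
    {
        while (position < candidates.Length)
        {
            int candidate = candidates[position++];
            if (matches(candidate)) return candidate;
        }
        return -1;
    }
}

static class PullDemo
{
    public static void Main()
    {
        var cursor = new IndexCursor(new[] { 3, 8, 12, 17 }, n => n % 2 == 0);
        int index;
        // Same loop shape as the evalXPath() loop: stop at the -1 sentinel.
        while ((index = cursor.Next()) != -1)
        {
            Console.WriteLine(index); // prints 8, then 12
        }
    }
}
```

With this shape, the caller can stop after the first hit and no list of results is ever built, which is the contrast with APIs that return the entire node set at once.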

Index Writing

This example shows you how to write the index file for an XML document so that it does not have to be parsed again later. In the code below this is done by calling writeIndex() on the VTDGen instance once parsing has succeeded. The index file uses the .vxl extension (think VTD-XML); if you open input.vxl, you can actually read it.

using System;
using System.Collections.Generic;
using System.Text;
using com.ximpleware;
namespace example3
{
    public class writeIndex
    {
        public static void Main(string[] args)
        {
            VTDGen vg = new VTDGen();
            if (vg.parseFile("d:/C#_tutorial_by_code_examples/4/input.xml",true)){
                vg.writeIndex("d:/input.vxl");
            }
        }
    }
}

Index Loading

To load the index file, call loadIndex() on a VTDGen instance, as the code below does. It returns a VTDNav object with which the application can do any application-specific processing.

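using System;
using System.Collections.Generic;
using System.Text;
using com.ximpleware;
namespace example
{
    public class loadIndex
    {
        public static void Main(string[] args)
        {
            try
            {
                VTDGen vg = new VTDGen();
                // No parsing happens here; the records come from the index file.
                VTDNav vn = vg.loadIndex("input.vxl");
                // do whatever you want here
            }
            catch (IndexReadException e)
            {
                // Report the failure instead of silently swallowing it.
                Console.Error.WriteLine("Could not read index: " + e.Message);
            }
        }
    }
}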

Recap

DOM, SAX and streaming XML parsing have numerous technical problems, mostly caused by extractive parsing and excessive object creation. VTD-XML is faster, more memory-efficient and easier to use because it relies on non-extractive parsing to eliminate that object creation. However, this article has only given a glimpse of what the future of XML processing looks like. In the second article of this series, I will show you more features of VTD-XML that will take your breath away.

History

tx
Miss Green
12:34 8 Oct '09
tx a lot, i need this example for my exercise ;)

: : The Experience Is Te Best Teacher : :
Share Your Experience...
Thanks....

doesn't make a lot of sense
rich rendle
14:24 25 Mar '08  
Usability: isn't it easy to just have the xmlserializer convert the xml into business objects - as in 1 line of code? Then processing is done using an approach everyone is familiar with.

Parsing Performance: as you say, the forward only streaming api provides the fastest access. If you need this kind of performance just write a method to parse the xml using the streaming api and convert it into business objects.

Modification Efficiency: modifying business objects couldnt be easier nor less practiced in the real world.

The reality is xml processing that you propose lacks any strong type checking, is difficult to read by fellow team members, and would unlikely be adopted by an enterprise.

The only real advantage i see here for you would be in mem usage. CPU usage may be better but that would have to be proven. So in the rare cases where memory is a constraining factor you might get some bites.

Sorry if i sound negative, this is just the way i see it. Please correct me if i am wrong, happy to hear.

rich rendle code master

Re: doesn't make a lot of sense
Jimmy Zhang
6:54 26 Mar '08  
I had many different views on this ...

1. Usability: AFAIK, the popular XML processing models are DOM and SAX; VTD-XML is presented as a new option alongside those processing models... but if your favorite approach is data binding, our discussion should be within a different context... XML is used in so many different places that the usage/context is important...

2. Parsing Performance: Again, a forward-only API's performance comes at the expense of limited usability; can you run XPath, etc.? VTD-XML is different in that it outperforms SAX while delivering random access...

3. Modification efficiency: It is not about modifying biz objects... it is about modifying XML documents... I think there may be a misunderstanding on your part...

4. The reality I am aware of is that DOM, SAX and XmlReader (all of which VTD-XML compares itself to) are typeless parsers... this is hardly surprising because XML's true value is that it is primarily a *typeless* data format

I think that, based on your argument, you have been doing OO for a while... there may be some practical limitations that you need to be aware of that could help balance your thinking... the second part of this series has a discussion on this...

VTD-XML: XML Processing for the Future (Part II) http://www.codeproject.com/KB/cs/xml_processing_future.aspx

Your comments welcome!

Re: doesn't make a lot of sense
rich rendle
11:48 26 Mar '08  
1. Databinding is one such situation i am involved with. But mostly I am dealing with enterprise application integration via xml communication. This is both internal and with trading partners across the globe. To me the usage is irrelevant, the approach is the same.

xml -> business objects -> manipulation/processing -> xml.

2. The limited usability of the forward api's is solved through

xml -> forward reading api -> business objects -> manipulation/processing -> xml

3. I agree it's about modifying xml and not business objects, my point is that modifying xml is best done through business objects (or entities or whatever you want to call them - something that provides strong typing and easy to read code).

You want to be able to swap out your datasource with minimal effort for ultimate flexibility. If you use my approach then all you have to do is swap out the code that parses the xml into the business objects with say a sql result. The code that manipulates and processes the business objects is a separate service if you will that needs not be touched.

4. Agreed about typeless - which is the whole reason i argue my approach.

I will have to look at part II soon. Keep up the good work!

rich rendle code master

Re: doesn't make a lot of sense
Jimmy Zhang
9:21 28 Mar '08  
I think that u may be the right audience for the follow-up article on this one.. u may still not agree with me... but it may be a discussion well-worth having...

The central argument is whether XML documents are bits/bytes or trees of objects...

document centric XML processing treats XML docs as the former

oo xml processing treats XML docs as the latter...

each has its strengths and weaknesses.. so just make sure u know all the options available when making design decisions...

As to the usability part, I suspect that XPath + VTD-XML makes it easy enough that it is competitive with any oo-data binding approach ...

If your company do a lot of stuff with ESB, I suspect that the redundant de-serialization/re-serialization is the #1 technical problem

Let me know if it makes sense or not...

Re: doesn't make a lot of sense
rich rendle
13:09 31 Mar '08  
Jimmy Zhang wrote:
If your company do a lot of stuff with ESB, I suspect that the redundant de-serialization/re-serialization is the #1 technical problem

I would have to disagree. Serialization is far from ever being an issue even considered in places I have worked. Ok I take that back - there are some situations where it is an issue and in those cases XML is replaced with something more compact and efficient. But in general it is far from an issue considered - the reason being that it would comprise < .01% of the total processing done on any given day. XML serialization is either the beginning or the end of a relatively long process - with just a single serialize and single deserialize done on any single transaction. The processing could include anything from inserting/retrieving data from a database, calling a web service, calling an FTP Service, and so on. Serialization really isn't even a blip on the radar for me.

rich rendle code master

Re: doesn't make a lot of sense
Jimmy Zhang
10:42 3 Apr '08  
It really depends on a lot of factors... This paper
http://www.research.ibm.com/trl/people/mich/pub/200609_icws2006esbperf.pdf will tell you that ESB is a choke point... but again, you may just have a different use case...

The other drawback of doing data binding is that it could lead to a tightly-coupled system... if someone sends an extended version of the XML that doesn't fit the schema.. your system tends to break..

Below are some of my suggestions for achieving loose coupling... hope they can give you another point of view
1. Use document-literal style of Web Service
2. Use DOM/VTD-XML over HTTP
3. Don't assume a schema (I wrote an article called "Schemaless C#-XML data binding with VTD-XML")

Now it makes more sense
Lee Humphries
20:41 17 Mar '08  
I read your previous article about an XML processor on a chip and didn't really get it. With this it now makes a lot more sense.

Now if you could do the following I'd really be interested:
XSLT 1 and 2
XQuery
XSD (and a SOM equivalent)

Keep up the good work.

Re: Now it makes more sense
Jimmy Zhang
11:38 19 Mar '08  
Working on those... also xsd may come out first...

Re: Now it makes more sense
Lee Humphries
13:15 19 Mar '08  
I'd recommend doing the XSLT v1 as well. The reason is that the XSD spec has a few logical holes in it when it comes to dependencies within a document that it simply cannot test for. However once you combine it with schematron (which requires XSLT) you can plug all of those logical holes.

At the present stage the .Net compiled XSLT transform is by far and away the fastest XSLT processor I've come across, probably followed by MSXML v6 then v4 then the in-built processor in Stylus Studio, followed by Saxon. I don't really bother with any other XSLT processors apart from those. Unfortunately the Microsoft XSLT processors have a couple of quirks in their behaviour, stuff which the spec doesn't define one way or the other, so MS are just different from everyone else.

When you do get to the XSLT processor let me know. I've been working with XSLT since 2000 so I have a bit of experience and more than a few really hairy test cases. Using xsl:key and the Meunchian Method (I probably misspelt that) are key tricks for performance, those two definitely should be in your test cases.

Re: Now it makes more sense
Jimmy Zhang
13:23 19 Mar '08  
Sure, will definitely keep that in mind when doing XSLT... what is your contact info (etc.)? Do you mind joining my LinkedIn network... I can send u an invite if u like

Re: Now it makes more sense
Lee Humphries
14:03 19 Mar '08  
Do that - look for Lee Humphries in LinkedIn - currently working for Solomon Telekom Company Limited

Re: Now it makes more sense
Jimmy Zhang
14:52 19 Mar '08  
still needs email, or pls send me an invite crackeur@comcast.net

Re: Now it makes more sense
Lee Humphries
20:32 19 Mar '08  
Hi Jimmy,

You should have an invite now - maybe you'll want to delete your previous message so you don't get spammed - It seems we have the guys at ZapThink in common.

Regards, Lee


Last Updated 17 Apr 2008 | Copyright © CodeProject, 1999-2010