Skip to main content
Email Password   helpLost your password?

Summary

Many readers of The Code Project are familiar with various types of XML parsers in the .NET environment. This article series introduces a new XML processing model called VTD-XML to The Code Project community. It goes significantly beyond those traditional models by fundamentally overcoming many tough technical challenges hampering SOA and enterprise XML application development. The first part of this series demonstrates the benefits of VTD-XML as a parser with integrated XPath and as an indexer. The second part shows you how to benefit from VTD-XML's cutting, editing and modifying capabilities, as well as introduces the concept of "document-centric" XML processing. The third part of this series shows you how to code your application in a C version of VTD-XML.

Introduction

VTD-XML is a suite of open-source XML processing technologies centered around a "non-extractive" XML processing technique called "Virtual Token Descriptor." It is cross-platform and available in C#, C and Java. The latest version is 2.2, which can be downloaded here. Depending on the perspective, VTD-XML can be viewed as one of the following:

Digging Into "Non-Extractive" Parsing

Let me quickly go over some of the new definitions introduced in the section above.

"Non-extractive" parsing means the XML text is kept intact in memory and un-decoded while tokens are represented exclusively using offsets and lengths (no string content copying). This is in contrast to "extractive" parsing (on which DOM, SAX and other old XML processing models are based), which allocates small memory blocks (a.k.a. strings) and copies into them the actual token content.

"Virtual Token Descriptor" (VTD), whose layout is shown in Figure 1, is a binary encoding format extending the concept of "non-extractive" parsing to XML. A VTD record is a 64-bit integer that encodes the length, offset, nesting depth and type of an XML token. As of VTD-XML 2.2, the bit layout of a VTD record is further defined as follows:

vtd_layout.jpg

Figure 1. Bit Layout of a VTD Record

Understand the Benefits of VTD-XML

Simply put, VTD-XML fundamentally solves a significant number of XML processing related issues in enterprise, ranging from the obvious ones that you experience every day, to those hidden ones that prevent you from taking your SOA project to the next level. Below is a brief discussion of some of those issues:

To understand the benefits that VTD-XML brings to the table, below is the highlight of some of its features:

As you probably have guessed, VTD is the primary reason why VTD-XML is able to simultaneously achieve all those feats. A typical DOM parser allocates one unit of memory for each token in the XML input file tree. This is costly in both memory performance (due to heap fragmentation) and time because of the sheer quantity of allocation requests. VTD-XML simply stores a verbatim copy of the XML in-memory unparsed and then generates VTD records in front of it to allow for simple navigation and access. Because reading an XML file is by definition a read-only process, it makes sense that you need not have the flexibility of variable-allocation at this point in the parsing. Last, keep in mind that VTD-XML is technically a processing model rather than an API and you can build your own API on top of a VTD-XML model.

There are a lot of articles written on various aspects of VTD-XML. They are available at "Links and presentation page". Also if your browser has Java plug-in installed, you can view this demo to help you understand the basic concept of non-extractive parsing.

A Typical Use Case

Right now, many applications suffer from serious performance issues when sending large, complex-structured XML documents across your enterprise messaging backbone (using ESB, MQ or BizTalk server). The streaming API-based approach is inherently less applicable due to its inability to deal with structure. However, if the application is coded in DOM, then the memory usage is an additional burden, forcing developers to split the documents into smaller ones prior to the sending.

With VTD-XML, you don't just solve the problem. In fact, there is more than one way to solve the problem. Because of its memory efficiency, random access and XPath support, VTD-XML in parsing mode allows your application to handle much larger documents at higher performance with less coding. In other words, the XML documents appear "smaller" with VTD processing.

Moreover, when you send the VTD index along with the XML text, the application at the receiving end can directly perform application logic (e.g. XPath queries, etc.) with zero parsing overhead, further enhancing throughput and reducing latency. Things get even better with VTD-XML when your applications start to modify the documents (to be discussed in the second part of this series).

The rest of this article will demonstrate how to use VTD-XML to parse, run the XPath query and index (both generating and loading) XML documents. Before running those code samples, you need to download the VTD-XML project and download the full version of its C# port.

Hello World!

This example shows you how to parse a file, manually navigate to a desired node and then print out its text content. In the input XML, the text node " hello world! " is nested two-levels deep down the hierarchy.

<ns1:a xmlns:ns1="someURL">
   <ns1:b>   hello   world! </ns1:b>

</ns1:a>

The example first instantiates VTDGen and then calls parseFile() to parse the input document. After parsing, this example obtains an instance of VTDNav with getNav(). The VTDNav object wraps around the underlying XML hierarchy and contains a global cursor that the application can navigate by calling various flavors of toElement() and toElementNS(). There are six constants that determine the direction of navigation: ROOT, PARENT, FIRST_CHILD, LAST_CHILD, NEXT_SIBLING or PREV_SIBLING. Calling getText() returns either the index of its VTD record or -1 (corresponding to no such record). To print out the text content, the application converts the index of text node by first calling toString() and toNormalizedString().

using System;
using System.Collections.Generic;
using System.Text;
using com.ximpleware;
namespace example1
{
    class Hello_World
    {
        static void Main(string[] args)
        {
            VTDGen vg = new VTDGen();
            if (vg.parseFile("test1.xml", true))
            {
                try{
                    VTDNav vn = vg.getNav();
                    if (vn.toElementNS(VTDNav.FIRST_CHILD,"someURL","b")){
                        int i = vn.getText();
                            if (i!=-1){
                                Console.WriteLine(vn.toString(i));
                                Console.WriteLine(vn.toNormalizedString(i));
                            }
                    }
                }
                catch(NavException e){
                }
            }
        }
    }
}

The output shows the difference of the strings converted using VTDNav's toString() and toNormalizedString(). Please notice the differences between those two strings (which is the subtle part of VTD-XML parsing).

  hello   world!
hello world!

Running XPath Query

The second example shows you how to query the document using XPath. Below is the XML document:

<?xml version="1.0"?>
<purchaseOrder orderDate="1999-10-20">
    <items>
        <item partNum="872-AA">
            <productName>Lawnmower</productName>

            <quantity>1</quantity>
            <USPrice>148.95</USPrice>
            <comment>Confirm this is electric</comment>
        </item>

        <item partNum="872-AA">
            <productName>Lawnmower</productName>
            <quantity>1</quantity>
            <USPrice>148.95</USPrice>

            <comment>Confirm this is electric</comment>
        </item>
     </items>
</purchaseOrder>

To evaluate XPath queries, you need to instantiate AutoPilot. Call selectXPath() to set the XPath expression. Nesting evalXPath() in a while loop is the most common way to retrieve evaluated nodes, whose indices get returned one at a time. This is in contrast with both DOM and XPathNavigator, as both return the entire node set all at once. For further reading, please visit "Improve XPath performance with VTD-XML".

using System;
using System.Collections.Generic;
using System.Text;
using com.ximpleware;
namespace example2
{
         class Program
         {
                static void Main(string[] args)
                {
                        VTDGen vg = new VTDGen();
                        int i;
                        if (vg.parseFile("test2.xml", false))
                        {
                            try{
                                VTDNav vn = vg.getNav();
                                AutoPilot ap = new AutoPilot(vn);
                                ap.selectXPath(
                                    "/purchaseOrder/items/item[@partNum=\"
                                    872-AA\"]/USPrice/text()");
                                while ((i = ap.evalXPath())!=-1)
                                {
                                    Console.WriteLine(vn.toString(i));
                                }
                            }catch(NavException e){
                            }
                        }
                 }
        }
}

The output simply echoes the qualified text nodes.

148.95
148.95

Index Writing

This example shows you how to write the index file for an XML document to avoid repetitive parsing at a later time. This is mostly done by calling VTDNav's writeIndex() method. If you open input.vxl (think VTD-XML), you can actually read it.

using System;
using System.Collections.Generic;
using System.Text;
using com.ximpleware;
namespace example3
{
    public class writeIndex
    {
        public static void Main(string[] args)
        {
            VTDGen vg = new VTDGen();
            if (vg.parseFile("d:/C#_tutorial_by_code_examples/4/input.xml",true)){
                vg.writeIndex("d:/input.vxl");
            }
        }
    }
}

Index Loading

To load the index file, call loadIndex() of VTDNav. It returns a VTDNav object with which an application can do any application-specific processing.

using System;
using System.Collections.Generic;
using System.Text;
using com.ximpleware;
namespace example
{
    public class loadIndex
    {
        public static void Main(string[] args)
        {
            try
            {
                VTDGen vg = new VTDGen();
                VTDNav vn = vg.loadIndex("input.vxl");
                // do whatever you want here
            }
            catch (IndexReadException e)
            {
            }
        }
    }
}

Recap

DOM, SAX and streaming XML parsing have numerous technical problems, mostly caused by extractive parsing and excessive object creation. VTD-XML is faster, more memory-efficient and easier to use because it resorts to non-extractive parsing to eliminating object creation. However, this article only showed a glimpse of what the future of XML processing is like. In the second article of this series, I will show you more features of VTD-XML that will take your breath away.

History

You must Sign In to use this message board.
 
 
Per page   
 FirstPrevNext
Generaltx Pin
Miss Green
12:34 8 Oct '09  
Generaldoesn't make a lot of sense Pin
rich rendle
14:24 25 Mar '08  
GeneralRe: doesn't make a lot of sense Pin
Jimmy Zhang
6:54 26 Mar '08  
GeneralRe: doesn't make a lot of sense Pin
rich rendle
11:48 26 Mar '08  
GeneralRe: doesn't make a lot of sense Pin
Jimmy Zhang
9:21 28 Mar '08  
GeneralRe: doesn't make a lot of sense Pin
rich rendle
13:09 31 Mar '08  
GeneralRe: doesn't make a lot of sense Pin
Jimmy Zhang
10:42 3 Apr '08  
GeneralNow it makes more sense Pin
Lee Humphries
20:41 17 Mar '08  
GeneralRe: Now it makes more sense Pin
Jimmy Zhang
11:38 19 Mar '08  
GeneralRe: Now it makes more sense Pin
Lee Humphries
13:15 19 Mar '08  
GeneralRe: Now it makes more sense Pin
Jimmy Zhang
13:23 19 Mar '08  
GeneralRe: Now it makes more sense Pin
Lee Humphries
14:03 19 Mar '08  
GeneralRe: Now it makes more sense Pin
Jimmy Zhang
14:52 19 Mar '08  
GeneralRe: Now it makes more sense Pin
Lee Humphries
20:32 19 Mar '08  


Last Updated 17 Apr 2008 | Advertise | Privacy | Terms of Use | Copyright © CodeProject, 1999-2009