(untagged)

VTD-XML: XML Processing for the Future (Part I)

Jimmy Zhang

0.00/5 (No votes)

17 Apr 2008

Introduce VTD-XML, the future of XML processing

Summary

Many readers of The Code Project are familiar with various types of XML parsers in the .NET environment. This article series introduces a new XML processing model called VTD-XML to The Code Project community. It goes significantly beyond those traditional models by fundamentally overcoming many tough technical challenges hampering SOA and enterprise XML application development. The first part of this series demonstrates the benefits of VTD-XML as a parser with integrated XPath and as an indexer. The second part shows you how to benefit from VTD-XML's cutting, editing and modifying capabilities, as well as introduces the concept of "document-centric" XML processing. The third part of this series shows you how to code your application in a C version of VTD-XML.

Introduction

VTD-XML is a suite of open-source XML processing technologies centered around a "non-extractive" XML processing technique called "Virtual Token Descriptor." It is cross-platform and available in C#, C and Java. The latest version is 2.2, which can be downloaded here. Depending on the perspective, VTD-XML can be viewed as one of the following:

A "document-centric" XML parser
A native XML indexer or a file format that uses binary data to enhance the XML text
An incremental XML content modifier
An XML slicer/splitter/assembler
An XML editor/eraser
A way to port XML processing onto a chip

Digging Into "Non-Extractive" Parsing

Let me quickly go over some of the new definitions introduced in the section above.

"Non-extractive" parsing means the XML text is kept intact in memory and un-decoded while tokens are represented exclusively using offsets and lengths (no string content copying). This is in contrast to "extractive" parsing (on which DOM, SAX and other old XML processing models are based), which allocates small memory blocks (a.k.a. strings) and copies into them the actual token content.

"Virtual Token Descriptor" (VTD), whose layout is shown in Figure 1, is a binary encoding format extending the concept of "non-extractive" parsing to XML. A VTD record is a 64-bit integer that encodes the length, offset, nesting depth and type of an XML token. As of VTD-XML 2.2, the bit layout of a VTD record is further defined as follows:

Starting offset: 31 bits (b30 ~ b0) without namespace support, or 30 bits (b29~b0) with namespace support
Length: 20 bits (b51 ~ b32) maximum value is 2^20-1 = 1M -1
- For some token types:
  - Prefix length: 9 bits (b51~ b43) max value 511
  - Q-name length: 11 bits (b42 ~ b 32) max value 1023
Depth: 8 bits (b59~b52) max value is 2^8-1 = 255
Token type: 4 bits (b63~b60)

Figure 1. Bit Layout of a VTD Record

Understand the Benefits of VTD-XML

Simply put, VTD-XML fundamentally solves a significant number of XML processing related issues in enterprise, ranging from the obvious ones that you experience every day, to those hidden ones that prevent you from taking your SOA project to the next level. Below is a brief discussion of some of those issues:

Usability: Most developers find DOM (or DOM-like, such as XPathNavigator) processing flexible and easy to use because the entire XML tree resides in memory and you can either navigate the tree manually, or better, using XPath. But DOM-like parsing is CPU-intensive and consumes a lot of memory. Streaming XML API (e.g. XMLTextReader) and SAX.NET are forward-only API that are generally considered difficult to use (no random access, no XPath). This makes them ill-suited for processing complex-structured XML data. For further reading, see "Better, faster XML processing with VTD-XML".
Memory Usage: Random-access capabilities and XPath make XML programming easy and intuitive. However, enabling random access requires the XML tree to be loaded entirely in memory, incurring the cost of 5x~10x the amount of physical memory than the size of the XML text. Because of this, right now developers often can't use DOM to process large XML ranging from 10s to 100s of MB in size, forcing them to settle for far less usable SAX or Streaming API. Some readers may wonder if one could load portions of a tree in memory to performance random access. The problem with this approach is that if the desired portions of a tree aren't in memory, the cost of loading them from the hard disk is three or more orders of magnitudes slower than DRAM performance. For further reading, see "The performance woe of binary XML".
Parsing Performance: DOM (or DOM-like API, such as XPathNavigator) is known for being slow. However, once parsed, the performance of navigating the in-memory tree is quite fast. The raw performance of SAX and Streaming API, on the other hand, looks much faster on paper. In practice, though, applications based on SAX have to manually maintain the state information or scan the XML documents multiple times, often rendering the performance advantages over DOM all but insignificant. For further reading, see "Simplify XML processing with VTD-XML".
Indexing Support: Right now, if your applications repetitively read the same XML documents, they will have to perform repetitive and expensive parsing every time. A detailed discussion can be found in "Index XML documents with VTD-XML".
Modification Efficiency: Suppose you want to change a certain text value. Using editors such as VI, you can just move the cursor to the text and modify it in place, i.e. without touching the irrelevant parts of the document. This is called "incremental update." However, when your applications (based on DOM and SAX) make similar changes to XML, you find extractive parsing doing a lot of wasted work that makes no sense whatsoever. The first step is parsing (or de-serialization), in which you pick apart the XML documents into many string objects. Then you navigate to the text node and make the change. Finally, you put those string objects back together (i.e. re-serialization) into a new document. Why can't your application just navigate to the text node and make changes in place like normal text editing? More discussion can be found in the next article of the series.

To understand the benefits that VTD-XML brings to the table, below is the highlight of some of its features:

Parsing Performance: 5x~10x of DOM, even faster (typically 1.5x) than the max performance of SAX or XMLTextReader (with NULL content handler), but more importantly it is random-access capable with integrated XPath support.
Memory Usage: VTD-XML's tree representation typically takes about 1.3x~1.5x the size of the XML text, which is typically about 1/3~1/5 of the memory consumed by DOM.
Native Indexing: VTD-XML is the only general-purpose native XML indexer that eliminates the repetitive parsing, while retaining human-readability.
Incremental Update: VTD-XML is the only XML processing package supporting incremental update.

As you probably have guessed, VTD is the primary reason why VTD-XML is able to simultaneously achieve all those feats. A typical DOM parser allocates one unit of memory for each token in the XML input file tree. This is costly in both memory performance (due to heap fragmentation) and time because of the sheer quantity of allocation requests. VTD-XML simply stores a verbatim copy of the XML in-memory unparsed and then generates VTD records in front of it to allow for simple navigation and access. Because reading an XML file is by definition a read-only process, it makes sense that you need not have the flexibility of variable-allocation at this point in the parsing. Last, keep in mind that VTD-XML is technically a processing model rather than an API and you can build your own API on top of a VTD-XML model.

There are a lot of articles written on various aspects of VTD-XML. They are available at "Links and presentation page". Also if your browser has Java plug-in installed, you can view this demo to help you understand the basic concept of non-extractive parsing.

A Typical Use Case

Right now, many applications suffer from serious performance issues when sending large, complex-structured XML documents across your enterprise messaging backbone (using ESB, MQ or BizTalk server). The streaming API-based approach is inherently less applicable due to its inability to deal with structure. However, if the application is coded in DOM, then the memory usage is an additional burden, forcing developers to split the documents into smaller ones prior to the sending.

With VTD-XML, you don't just solve the problem. In fact, there is more than one way to solve the problem. Because of its memory efficiency, random access and XPath support, VTD-XML in parsing mode allows your application to handle much larger documents at higher performance with less coding. In other words, the XML documents appear "smaller" with VTD processing.

Moreover, when you send the VTD index along with the XML text, the application at the receiving end can directly perform application logic (e.g. XPath queries, etc.) with zero parsing overhead, further enhancing throughput and reducing latency. Things get even better with VTD-XML when your applications start to modify the documents (to be discussed in the second part of this series).

The rest of this article will demonstrate how to use VTD-XML to parse, run the XPath query and index (both generating and loading) XML documents. Before running those code samples, you need to download the VTD-XML project and download the full version of its C# port.

Hello World!

This example shows you how to parse a file, manually navigate to a desired node and then print out its text content. In the input XML, the text node " hello world! " is nested two-levels deep down the hierarchy.

<ns1:a xmlns:ns1="someURL">
   <ns1:b>   hello   world! </ns1:b>

</ns1:a>

The example first instantiates VTDGen and then calls parseFile() to parse the input document. After parsing, this example obtains an instance of VTDNav with getNav(). The VTDNav object wraps around the underlying XML hierarchy and contains a global cursor that the application can navigate by calling various flavors of toElement() and toElementNS(). There are six constants that determine the direction of navigation: ROOT, PARENT, FIRST_CHILD, LAST_CHILD, NEXT_SIBLING or PREV_SIBLING. Calling getText() returns either the index of its VTD record or -1 (corresponding to no such record). To print out the text content, the application converts the index of text node by first calling toString() and toNormalizedString().

using System;
using System.Collections.Generic;
using System.Text;
using com.ximpleware;
namespace example1
{
    class Hello_World
    {
        static void Main(string[] args)
        {
            VTDGen vg = new VTDGen();
            if (vg.parseFile("test1.xml", true))
            {
                try{
                    VTDNav vn = vg.getNav();
                    if (vn.toElementNS(VTDNav.FIRST_CHILD,"someURL","b")){
                        int i = vn.getText();
                            if (i!=-1){
                                Console.WriteLine(vn.toString(i));
                                Console.WriteLine(vn.toNormalizedString(i));
                            }
                    }
                }
                catch(NavException e){
                }
            }
        }
    }
}

The output shows the difference of the strings converted using VTDNav's toString() and toNormalizedString(). Please notice the differences between those two strings (which is the subtle part of VTD-XML parsing).

  hello   world!
hello world!

Running XPath Query

The second example shows you how to query the document using XPath. Below is the XML document:

<?xml version="1.0"?>
<purchaseOrder orderDate="1999-10-20">
    <items>
        <item partNum="872-AA">
            <productName>Lawnmower</productName>

            <quantity>1</quantity>
            <USPrice>148.95</USPrice>
            <comment>Confirm this is electric</comment>
        </item>

        <item partNum="872-AA">
            <productName>Lawnmower</productName>
            <quantity>1</quantity>
            <USPrice>148.95</USPrice>

            <comment>Confirm this is electric</comment>
        </item>
     </items>
</purchaseOrder>

To evaluate XPath queries, you need to instantiate AutoPilot. Call selectXPath() to set the XPath expression. Nesting evalXPath() in a while loop is the most common way to retrieve evaluated nodes, whose indices get returned one at a time. This is in contrast with both DOM and XPathNavigator, as both return the entire node set all at once. For further reading, please visit "Improve XPath performance with VTD-XML".

using System;
using System.Collections.Generic;
using System.Text;
using com.ximpleware;
namespace example2
{
         class Program
         {
                static void Main(string[] args)
                {
                        VTDGen vg = new VTDGen();
                        int i;
                        if (vg.parseFile("test2.xml", false))
                        {
                            try{
                                VTDNav vn = vg.getNav();
                                AutoPilot ap = new AutoPilot(vn);
                                ap.selectXPath(
                                    "/purchaseOrder/items/item[@partNum=\"
                                    872-AA\"]/USPrice/text()");
                                while ((i = ap.evalXPath())!=-1)
                                {
                                    Console.WriteLine(vn.toString(i));
                                }
                            }catch(NavException e){
                            }
                        }
                 }
        }
}

The output simply echoes the qualified text nodes.

148.95
148.95

Index Writing

This example shows you how to write the index file for an XML document to avoid repetitive parsing at a later time. This is mostly done by calling VTDNav's writeIndex() method. If you open input.vxl (think VTD-XML), you can actually read it.

using System;
using System.Collections.Generic;
using System.Text;
using com.ximpleware;
namespace example3
{
    public class writeIndex
    {
        public static void Main(string[] args)
        {
            VTDGen vg = new VTDGen();
            if (vg.parseFile("d:/C#_tutorial_by_code_examples/4/input.xml",true)){
                vg.writeIndex("d:/input.vxl");
            }
        }
    }
}

Index Loading

To load the index file, call loadIndex() of VTDNav. It returns a VTDNav object with which an application can do any application-specific processing.

using System;
using System.Collections.Generic;
using System.Text;
using com.ximpleware;
namespace example
{
    public class loadIndex
    {
        public static void Main(string[] args)
        {
            try
            {
                VTDGen vg = new VTDGen();
                VTDNav vn = vg.loadIndex("input.vxl");
                // do whatever you want here
            }
            catch (IndexReadException e)
            {
            }
        }
    }
}

Recap

DOM, SAX and streaming XML parsing have numerous technical problems, mostly caused by extractive parsing and excessive object creation. VTD-XML is faster, more memory-efficient and easier to use because it resorts to non-extractive parsing to eliminating object creation. However, this article only showed a glimpse of what the future of XML processing is like. In the second article of this series, I will show you more features of VTD-XML that will take your breath away.

History

14 February, 2008 -- Original version posted
1 March, 2008 -- Graphical VTD Layout added
14 March, 2008 -- A Java applet added that demonstrates the basic concept of VTD-XML parsing

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

A list of licenses authors might use can be found here