Index XML Documents with VTD+XML

Jimmy Zhang

Rate me:

4.68/5 (13 votes)

11 Jun 2008CPOL12 min read

80.1K

458

Introduces a simple, efficient, human-readable XML index called VTD+XML.

Download source code - 22.5 KB

Summary

Traditionally, DOM or SAX-based enterprise applications have to repeat CPU-intensive XML parsing when accessing the same documents multiple times. This article introduces a very simple, general-purpose, native XML index called VTD+XML that eliminates the need for repetitive parsing of those applications.

Avoid Repetitive XML Parsing with VTD-XML

As discussed in "Simplify XML processing with VTD-XML", to date, one of the underlying assumptions in XML application development is that an XML document must be parsed before anything else can be done with it. In other words, the processing logic of XML applications can't start without parsing. Frequently considered a threat to database performance, XML parsing is usually many times slower than other XML operations such as XPath evaluation. When those applications perform multiple read-only access to XML data that don't change very often, wouldn't it be nice to be able to eliminate the overhead of associated repetitive parsing?

With the native XML indexing feature of VTD-XML, you can do precisely that. VTDGen, the class encapsulating various parsing routines, now adds "readIndex(...)" and "writeIndex(...)". There are two exceptions related to indexing: indexWriteException and indexReadException.

Let me put those new methods into action, and show you how to turn on the indexing capability in your application. Consider the following XML document:

XML

<purchaseOrder orderDate="1999-10-21">
     <item partNum="872-AA">
       <productName>Lawnmower</productName>
       <quantity>1</quantity>
       <USPrice>148.95</USPrice>
     </item>
</purchaseOrder>

Below is a simple application named "printPrice.cs" that prints out the contents of the element "USPrice". Notice that it parses the XML file and then uses XPath to filter out the target nodes.

using com.ximpleware.*;
public class printPrice{
   public static void main(String args[]){
     VTDGen vg = new VTDGen();
     try{
       if (vg.parseFile("po.xml",true)){
         VTDNav vn = vg.getNav();
         AutoPilot ap = new AutoPilot(vn);
         ap.selectXPath("/purchaseOrder/item/USPrice/text()");
         int i=-1; 
         while((i=ap.evalXPath())!=-1){
           System.out.println(" USPrice ==> "+vn.toString(i));
         }
       } 
     }catch(Exception e){ 

     }
   }
}

A few changes are needed to add VTD-XML's new indexing capability to the C# code above. First, you need to read in the XML document, parse it, and then write out the indexed version of the same XML document. From that point onwards, your application can run an XPath query or processing logic directly on top of the index, saving the CPU cycles of parsing the XML document again. The following code snippets (named "genIndex.cs" and "accessIndex.cs", respectively) show you how to generate and access the index. Notice that, when executed sequentially, both applications produce the identical output as "printPrice.cs".

The first application (genIndex.cs) reads in "po.xml" and produces "po.vxl".

using System;
using com.ximpleware;

namespace genIndex
{
    class genIndex
    {
        static void Main(string[] args)
        {
            VTDGen vg = new VTDGen();
            try
            {
                if (vg.parseFile("d:/codeProject/app3/po.xml", true))
                {
                    vg.writeIndex("d:/codeProject/app3/po.vxl");
                }
            }
            catch (VTDException e)
            {
            }
        }
    }
}

The second application (accessIndex.java) loads "po.vxl" and filters the document using an XPath expression "/purchaseOrder/item/USPrice/text()".

Java

using System;
using com.ximpleware;

namespace accessIndex
{
    class accessIndex
    {
        static void Main(string[] args)
        {
            VTDGen vg = new VTDGen();
            try
            {
                VTDNav vn = vg.loadIndex("d:/codeProject/app3/po.vxl");
                AutoPilot ap = new AutoPilot(vn);
                ap.selectXPath("/purchaseOrder/item/USPrice/text()");
                int i = -1;
                while ((i = ap.evalXPath()) != -1)
                {
                    Console.WriteLine(" USPrice ==> " + vn.toString(i));
                }
            }
            catch (VTDException e)
            {
            }

        }
    }
}

VTD+XML in 30 Seconds

Allowing XML parsing to be decoupled from application logic, the key in the example above is the index file "po.vxl", which conforms to the VTD+XML spec. What is VTD+XML? Since VTD-XML's internal representation of XML infoset is inherently persistent, VTD+XML, as the name suggests, is simply a binary packaging format that combines VTD records, LCs entries, and XML into a single file. The detailed technical spec can be found here.

A Simple Example

This section gets down to the nitty-gritty of the specification by manually composing, byte-by-byte, a VTD+XML index. For the sake of simplicity, this example chooses to index a simple XML document containing a single child-less root element whose parsed representation doesn't have location cache entries. This example also assumes small-Endian byte order (as in C#) and UTF-8 document encoding (the default character set). The name space awareness is set to false.

The first four-byte word of the corresponding index file is 0x01028000, containing:

The VTD+XML version number (0x01) in the first byte
The character encoding format (0x02) in the second byte
The name space awareness, word length of LC entries in the last level, byte Endian-ness of the platform, and VTD version as encoded in various bit fields in the third byte (0x80)
The document depth (0x00, as the root element has no child)

The second four-byte word has the value of 0x00040001, containing:

The number of LC levels supported by the VTD-XML implementation in the upper 16 bits (0x0004 in big Endian)
The root element index value in the lower 16 bits (0x0001 in big Endian)

The next four four-byte words are reserved and set to zero.

The byte order of all the ensuing 32-bit or 64-bit words is platform-dependent, and specified in the third byte of the VTD+XML spec. The next eight-byte word indicate the size (in bytes) of the XML document, which equals seven in this example (0x0700000000000000). Immediately following (0x3C726F6F742F3E00) is the byte content of the XML, rounded up to an integer multiple of eight bytes by padding zero to the end.

The remaining part of the VTD+XML index consists of multiple adjacent segments, each containing an eight-byte word (0x0000000000000002 indicating the VTD record or LC entry count), followed by the actual content of the VTD records or LC entries. The first eight-byte word (0x020000000000000000) indicates that there are two VTD records that are 0x000000000000F0DF and 0x0100000004000000.

The remaining three eight-byte words all have the value of zero, indicating that the location caches in level one, two, and three have a zero entry in the VTD+XML index.

As the final output, the VTD+XML index for "<root/>" is 88-bytes long, and looks like the following hex:

Figure 1. Screen Capture of Bytes for "<root/>"

Benefits and Limitations

Because VTD+XML straightforwardly combines VTD and XML, it inherits all the benefits of VTD-XML parsing. When compared with existing XML indices (e.g., various pure-binary XML indices modeling labeled, ordered tree etc.), VTD+XML possesses many unique technical benefits:

General Purpose - Before VTD+XML, most native XML indices only optimized specific types (e.g., the axis) of XPath lookups. If an input query differs slightly from the index type, the query execution still has to resort to expensive parsing. Due to this limitation, many native XML databases today require users to create multiple indices, one for each input query type so users can benefit from those indices. The problem is that XML database applications usually serve many types of queries that are unpredictable and complex in nature, often rendering the benefits of indexing insignificant. In comparison, VTD+XML is the first index that completely eliminates the cost of XML parsing, and predictably speeds up any type of XPath query. It also works with namespaces exceptionally well.
Human Readable - VTD+XML is also the first human-readable XML index. You can actually open it in a text editor to examine the XML text. Figure 2 is what "po.vxl" looks like in "VI". More than just a nice property, VTD+XML's human-readability offers distinct advantages over pure binary indexing schemes. Everything else being equal, keeping XML in its original format avoids the processing cost of converting to and from binary formats. Moreover, what if your applications just want to modify the XML payload, such as inserting into it a chunk of XML text extracted out of another SOAP message? What's the point of converting XML to binary formats? In a service-oriented heterogeneous environment, maintaining XML in its original format automatically retains the openness and interoperability. It just seems to me that the only loss-less equivalent of XML is XML itself, no less.

Figure 2. VTD+XML Index's Natural Human-Readability

Superior Indexing Performance - The performance of generating a VTD+XML index is identical to the parsing performance of VTD-XML, since both are essentially the same operation viewed from two different angles. On a 1.7GHz Pentium machine, it's reasonable to expect a sustained indexing performance of 50 MB/s-70 MB/s.
Easy to Use - Usually, adding a couple of lines (loadIndex(...) and writeIndex(...) as seen in the previous example) to your existing VTD-XML code is all that's needed to enable VTD+XML in your applications.
Compact - The size of VTD+XML is usually about 30%-50% bigger than the size of the corresponding XML document. This is again consistent with the memory use of the VTD-XML processing model.
Platform Neutral - Just like XML, VTD+XML is designed to be platform-neutral in that it explicitly includes information about the byte Endian-ness of the platform on which the index is generated. Users of the C or C# version VTD-XML code can automatically recognize and make use of the index generated by the Java version.

At the same time, users of VTD+XML need to be aware of the following limitations:

Upper Limit on Document Size - The maximum XML document size supported by VTD+XML is 2GB without name space support. With name space, VTD+XML supports a maximum of 1GB.
Lack of Support for External Entities - VTD-XML currently supports five built-in entity references (<, >, &, ', and ") as defined in XML 1.0.

The Case Involving XML Content Update

Some of you may wonder: What if the subsequent XML operations involve content updates that shift the offset value? In general, those use cases often require the updated XML document to be re-indexed. And, for large XML documents, you may argue that the cost of re-indexing can be quite significant. However, there are actually several workarounds, all aimed at reducing, even eliminating, the cost of re-indexing.

The first workaround

Instead of creating the VTD+XML index for a single big XML document, split the XML document into multiple smaller ones, each of which is then indexed using VTD+XML. From this point on, you only need to regenerate a VTD+XML index for those "updated" XML fragments that are usually a lot smaller and, therefore, cheaper to re-index.

VTD+XML's editing capability lets you modify XML content without needing to regenerate the index. The code below makes use of the VTDNav class' new overWrite(...) to change the text node of "<root>good</root>" from "good" or "bad". If the new content is shorter or equal in length to that of the old content, the method "overWrite(...)" fills up the non-overlapping portion of the text with white spaces, and returns true. Otherwise, no change to the original content, and "overWrite(...)" returns false.

using System;
using System.Text;
using com.ximpleware;

namespace template
{
    class template
    {
        static void Main(string[] args)
        {
            VTDGen vg = new VTDGen();
            Encoding eg = System.Text.Encoding.GetEncoding("utf-8");
            if (vg.parseFile("d:/codeProject/app3/temp1.xml", true))
            {
                VTDNav vn = vg.getNav();
                int i = vn.getText();
                //print "good"
                Console.WriteLine("text ---> " + vn.toString(i));
                if (vn.overWrite(i,eg.GetBytes("bad")))
                {
                    //overwrite, if successful, returns true
                    //print "bad" here
                    Console.WriteLine("text ---> " + vn.toString(i));
                }
            }
        }
    }
}

Though simple, this "editing" feature actually has unexpected performance implications. Consider the database table design in which you specify the column width. You can now borrow the same technique for XML composition: by pre-serializing some extra spaces into text nodes or attribute values, you can make "in situ" updates to those nodes, and do so without regenerating the index. You can even pre-serialize, in an XML document, dummy elements containing text nodes or attribute values whose initial values are entirely white spaces. Those dummy elements serve as templates in anticipation of a future content update, as shown in the example below.

The template:

XML

<purchaseorder orderdate="" />
 <purchaseorder orderdate="" />
    <purchaseOrder orderDate=" ">
      <item partNum=" " >
         <productName> </productName>
         <quantity> </quantity>
         <USPrice> </USPrice>
      </item>
    </purchaseOrder>
    <quantity /></quantity />
    <usprice /> </usprice />
 </purchaseorder />
</purchaseorder />

After "stamping" in the data:

XML

<purchaseOrder orderDate="1999-10-21">
   <item partNum="872-AA" >
      <productName>Lawnmower </productName>
      <quantity>1 </quantity>
      <USPrice> 100 </USPrice>
   </item>
</purchaseOrder>

And, by the same token, the concept of XML content deletion deserves a bit of rethinking as well. Instead of physically deleting an XML element, you can disable the XML elements by making them "invisible" to your applications to achieve the same goal. The benefit: you again avoid the need to re-index. Notice that this plays favorably to XML's strength as a loose encoding data format. Below is an example of setting the value of the attribute "enable" of an element to make it "invisible".

Before:

XML

<purchaseorder orderdate="1999-10-21"" />
  <purchaseOrder orderDate="1999-10-21">
      <item partNum="872-AA" enable="1">
         <productName>Lawnmower</productName>
         <quantity>1</quantity>
         <USPrice>148.95</USPrice>
      </item>
  </purchaseOrder>
</purchaseorder />

After:

XML

<purchaseOrder orderDate="1999-10-21">
   <item partNum="872-AA" enable='0'>
     <productName>Lawnmower</productName>
     <quantity>1</quantity>
     <USPrice>148.95</USPrice>
   </item>
</purchaseOrder>

Application Scenarios

There are at least two different views to make sense of VTD+XML as a practical solution to real problems. The first is a traditional view of native XML indexing. Alternatively, you can think of VTD+XML as a binary data format backwards-compatible with XML.

Native XML Indexing

In this view, you simply use VTD+XML as the basis for native XML data stores that serve the backend data needs of XML/SOA applications. By saving it as a BLOB (Binary Large OBject) in a more traditional database table, you obtain the additional capabilities such as concurrency and data integrity and replication. Being vastly superior to the awkward shredding-based XML to relational data mapping, VTD+XML fits exceptionally well in a pure XML/SOA environment. Have a lot of XBRL (Extensible Business Reporting Language) documents, or those big GML (Geography Markup Language) files? VTD+XML should equip you with horsepower never before available.

Binary Enhanced XML

VTD+XML also naturally extends the core capabilities of XML by boosting its processing efficiency to a whole new level. In other words, as a wire format, XML now has it all: not only is it easy to learn, human-readable, interoperable, and loosely encoded by design, performance-wise, it also leads CORBA, DCOM, and RMI by a mile. When applied to XML pipelining, VTD+XML can potentially eliminate the repetitive parsing at each stage of the pipeline - an issue none of the existing XML pipeline specs (e.g., XProc and the XML pipeline definition language) address. If it takes too long for you to push large documents over your DOM-based ESB (Enterprise Services Bus), how does 100MB around a single second sound?

Performance Comparison

This section basically highlights the performance of index loading. For VTD+XML, the performance number is the latency for loading the index in memory and then running a single XPath. For DOM, the performance is the latency for DOM parsing plus running a single XPath (both Xalan and Jaxen). The results are normalized, so the ratios are shown. As you can see, when turning on index loading, the performance improvement can be very dramatic.

Figure 3. Relative Latency Comparison for XPath "/*/*/*[position() mod 2=0]

Figure 4. Relative Latency Comparison for XPath "/purchaseOrder/items/item[USPrice<100]"

Figure 5. Relative Latency Comparison for XPath "//item/comment"

Conclusion

This article has introduced the latest indexing feature of VTD+XML along with the latest benchmark numbers, showcasing the efficiency level it achieves. Prior to VTD+XML, an XML/SOA application written in DOM or SAX incurs the overhead of XML parsing, XPath evaluation and, optionally, content update. It's not uncommon that those overheads account for 80%-90% or more of the total CPU cycles of running the application. VTD+XML obliterates those overheads since there's not much overhead left to optimize. Using VTD+XML as a parser reduces XML parsing overhead by 5x-10x. Next, VTD+XML's incremental update uniquely eliminates the roundtrip overhead of updating the XML. Moreover, this article shows VTD+XML's innovative non-blocking, stateless XPath engine significantly outperforming Jaxen and Xalan. With the addition of the indexing capability, XML parsing has now become "optional". In other words, obstacles standing on the path to successful SOA have quietly disappeared. But, this is just another starting point. It probably won't be difficult to see that none of its benefits would exist if VTD+XML stuck with excessive object allocation like DOM. In the context of XML processing, pure OO modeling of an XML infoset (e.g., string and node objects) just doesn't appear the right thing to do in the first place. Like anything else, OO has its weaknesses. The problems (e.g., DOM and SAX's problems) arise when one chooses OO for the sake of choosing it, and stops questioning its sensibility. To me, knowing when not to use objects is equally, if not more, important. Derived from those weaknesses, constraints, and limitations, VTD+XML strives to be the simple, sensible answer to the problems.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Written By

Jimmy Zhang

Chief Technology Officer XimpleWare

United States

Jimmy Zhang is a cofounder of XimpleWare, a provider of high performance XML processing solutions. He has working experience in the fields of electronic design automation and Voice over IP for a number of Silicon Valley high-tech companies. He holds both a BS and MS from the department of EECS from U.C. Berkeley.

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.