Click here to Skip to main content
15,867,771 members
Articles / Mobile Apps

Programming XPath with VTD-XML

Rate me:
Please Sign up or sign in to vote.
3.93/5 (8 votes)
10 Jun 2010CPOL10 min read 72K   343   32   15
Learn to take advantage of VTD-XML's stateless XPath capability to achieve unmatched efficiency and flexibility

Introduction

Originally seen as an element for W3C's XSLT and XPointer, XPath has been gaining popularity as an effective tool for developing XML applications. A significant benefit of XPath is its simplicity: Instead of manually navigating the hierarchical data structure, you can use compact, "file-system"-like expressions to address any nodes in XML documents. However, most of existing XPath engines work with DOM trees or similar object models, which are slow to build and modify, and consume excessive amount of memory. This presents a dilemma for anyone looking to take advantage of XPath for SOA applications that either (1) are performance sensitive or (2) routinely deal with large XML documents.

My last several articles on VTD-XML (http://vtd-xml.sourceforge.net) as the next generation XML processing model that goes beyond DOM and SAX in performance, memory usage and ease of use. VTD-XML is simultaneously:

  • Memory-efficient: VTD-XML typically consumes an amount of memory somewhere between 1.3~1.5 times the size of the XML document, meaning it works very well with large XML documents.
  • High-performance: VTD-XML typically outperforms DOM parsers by 5~10 times, and it typically outperforms SAX parsers with null content handler by about 100%.
  • Easy to use: Applications written in VTD-XML are more compact and readable than those written in DOM or SAX.

What is VTD-XML's not-so-secret sauce? Unlike traditional XML parsers which create string-based tokens as the first step of parsing, VTD-XML internally uses linear buffers to store 64-bit integers containing the starting offsets and lengths of XML tokens, while keeping XML document intact and undecoded. All VTD-XML's benefits result, one way or the other, from this "non-extractive" tokenization. At the API level, VTD-XML consists of the following core classes:

  • VTDGen is the class that encapsulates the main parsing, index writing and index loading functions.
  • VTDNav exports a cursor-based API that contains the methods that navigate the XML hierarchy.
  • AutoPilot is the class that allows for document-order element traversal similar to Xerces' NodeIterator.

UnderStanding VTD-XML's XPath Capability

VTD-XML's XPath implemenation, which supports the full W3C XPath 1.0 spec, was introduced since version 1.0. It builds upon VTDNav's concept of cursor-based navigation. AutoPilot is the main class that exports all the XPath related methods. As described in one of the early articles, to manually navgiate an XML document's hiearchical structure, you obtain an instance of VTDNav and repeatedly call "toElement()" to move the cursor to various parts of the documents. Using XPath, you essentially ask AutoPilot to *automatically* move VTDNav's cursor to all the qualified nodes in the document.

AutoPilot's XPath-related methods are:

  • declareXPathNameSpace(...): This method binds a namespace prefix (used in an XPath expression) to a URL
  • selectXPath(...): This method compiles a XPath expression into internal representation.
  • evalXPath(...): This method moves the cursor to a qualified node in the node set.
  • evalXPathToBoolean(...), evalXPathToNumber(...), evalXpathToString(...): These three methods evalutes an XPath expression respectively to a boolean, a double and a string.
  • resetXPath(): Reset the internal state so the XPath can be re-used.
  • getExprString(): Call this method to verify the correctness of the compiled expression.

In addtion, two more exception classes are introduced for XPath:

  • XPathParseException: It is thrown when there is a syntax error in the XPath expression.
  • XPathEvalException: It is thrown when there is an exception condition during XPath evalution.

The most important innovation of VTD-XML's XPath implementation is its stateless, non-blocking evaluation of the node set. Unlike other XPath engines (e.g. XmlDocuemnt, XPathDocument JAXEN, XALAN, etc) that return the entire nodeset all at once, VTD-XML returns a qualified node immediately after the XPath engine finds a match. The key method is AutoPilot's evalXPath(), which moves the cursor to the next node (if there is at least one remaining in the nodeset) every time it is called. To illustrate the differences between those two styles of XPath evalution, let us look at the two short snippets written respectively in JAXEN and VTD-XML that execute the same application logic on "catalog.xml" as shown below.

XML
<CATALOG>
    <CD>
        <TITLE>Empire Burlesque</TITLE>
        <ARTIST>Bob Dylan</ARTIST>
        <COUNTRY>USA</COUNTRY>
        <COMPANY>Columbia</COMPANY>
        <PRICE>10.90</PRICE>
        <YEAR>1985</YEAR>
     </CD>
     <CD>
        <TITLE>Still got the blues</TITLE>
        <ARTIST>Gary More</ARTIST>
        <COUNTRY>UK</COUNTRY>
        <COMPANY>Virgin redords</COMPANY>
        <PRICE>10.20</PRICE>
        <YEAR>1990</YEAR>
     </CD>
     <CD>
        <TITLE>Hide your heart</TITLE>
        <ARTIST>Bonnie Tyler</ARTIST>
        <COUNTRY>UK</COUNTRY>
        <COMPANY>CBS Records</COMPANY>
        <PRICE>9.90</PRICE>
        <YEAR>1988</YEAR>
     </CD>
     <CD>
        <TITLE>Greatest Hits</TITLE>
        <ARTIST>Dolly Parton</ARTIST>
        <COUNTRY>USA</COUNTRY>
        <COMPANY>RCA</COMPANY>
        <PRICE>9.90</PRICE>
        <YEAR>1982</YEAR>
     </CD>
</CATALOG>

The XPathDocument based code is shown below:

C#
using System;
using System.Collections.Generic;
using System.Text;
using System.Xml;
using System.Xml.XPath;

namespace app0
{
    class Program
    {
        static void Main(string[] args)
        {
            XPathDocument myXPathDocument = new XPathDocument(args[0]);
            XPathNavigator myXPathNavigator = myXPathDocument.CreateNavigator();
            XPathExpression myXPathExpr = myXPathNavigator.Compile(
                "/CATALOG/CD/TITLE/text()");
            XPathNodeIterator myXPathNodeIterator = myXPathNavigator.Select(myXPathExpr);

            // iterate and print out the nodes
            while (myXPathNodeIterator.MoveNext())
                Console.WriteLine(myXPathNodeIterator.Current.Value);
        }
    }
}

The same application written in VTD-XML is shown below:

C#
using System;
using System.Collections.Generic;
using System.Text;
using com.ximpleware;
namespace app1
{
    class Program
    {
        static void Main(string[] args)
        {
            VTDGen vg = new VTDGen();
            int i;
            AutoPilot ap = new AutoPilot();
            ap.selectXPath("/CATALOG/CD/TITLE/text()");
            if (vg.parseFile("catalog.xml", false))
            {
                VTDNav vn = vg.getNav();
                ap.bind(vn);
                //XPath eval returns one node at a time
                while ((i = ap.evalXPath()) != -1)
                {
                    Console.WriteLine(" text ==> " +
                           vn.toString(i));
                }
                ap.resetXPath();
            }
        }
   }
}

The code sample shown above deals with only one XPath expression. If you deal with multiple XPath expressions, you will need to instantiate multiple AutoPilot objects, each of which corresponds to a different XPath expression. If you call selectXPath(...) multiple times, the AutoPilot object remembers only the last XPath expression. Once an AutoPilot object selects an XPath expression, you call evalXPath() which moves the cursor of the embedded VTDNav instance to the next available node and returns its VTD record index. If no more node remains in the node set, evalXPath() returns -1.

What if you want to execute same XPath query over multiple XML documents? The key, in this case, is to combine XPath reuse with delayed binding for AutoPilot. After executing a query on one XML document, you need to first bind (by calling bind(...) ) the same AutoPilot object to a new VTDNav object corresponding to a different XML document. After evaluating the XPath, you also need to call resetXPath() which resets the internal state of XPath so it is ready for reuse.

The following example shows you how apply two XPath expressions over two separate XML documents.

C#
using System;
using System.Collections.Generic;
using System.Text;
using com.ximpleware;
namespace app2
{
   class test {
     static void Main(string[] args)
     {
         VTDGen vg = new VTDGen();
         int i;
         AutoPilot ap = new AutoPilot();
         ap.selectXPath("/CATALOG/CD/TITLE/text()");
         AutoPilot ap2 = new AutoPilot();
         ap2.selectXPath("/CATALOG/CD[PRICE < 10]/TITLE/text()");

         // apply both XPath expressions to "catalog.xml"
         if (vg.parseFile("catalog.xml", false))
         {
             VTDNav vn = vg.getNav();
             ap.bind(vn);
             //XPath eval returns one node at a time
             while ((i = ap.evalXPath()) != -1)
             {
                 Console.WriteLine(" text ==> " +
                        vn.toString(i));
             }
             ap.resetXPath();

             ap2.bind(vn);
             while ((i = ap2.evalXPath()) != -1)
             {
                 Console.WriteLine(" text ==> " +
                        vn.toString(i));
             }
             ap2.resetXPath();
         }
         // apply both XPath expressions to a different file
         if (vg.parseFile("another.xml", false))
         {
             VTDNav vn = vg.getNav();
             ap.bind(vn);
             //XPath eval returns one node at a time
             while ((i = ap.evalXPath()) != -1)
             {
                 Console.WriteLine(" text ==> " +
                        vn.toString(i));
             }
             ap.resetXPath();

             ap2.bind(vn);
             while ((i = ap2.evalXPath()) != -1)
             {
                 Console.WriteLine(" text ==> " +
                        vn.toString(i));
             }
             ap2.resetXPath();
         }
     }
  } 
}

Evaluate XPath to String, Number and Boolean

XPath 1.0 also defines that an expression may be evaluated to a string, a double, or a boolean. VTD-XML's AutoPilot class contains evalXPathToString(), evalXPathToNumber() and evalXPathToBoolean(). Notice that those methods don't need resetXPath() in order to be reused as they perform reset internally.

Benefits

When comparing with other XPath engines, VTD-XML's non-blocking XPath evaluation approach has a number of benefits. Without sacraficing any inherent benefits of XPath, VTD-XML offers unsurpassed flexibility and puts more control in your hands. Consider the use case in which you want to use XPath to process a large XML document and the node set contains thousands of nodes. With other XPath engines, your application logic will have to wait until the entire node set to be returned before doing anything. With VTD-XML, you application logic can start as soon as the first node gets returned. If your application only processes a small number of nodes in the node set, or you have prior knowledge of the XML document (e.g. you know there is only one occurrence of the matching node at the beginning of a large XML file), why wait until the XPath finishes searching the entire document? In this case, other XPath engine's node evaluation unnecessarily blocks the program flow and seem more like the wrong approach! In other use cases, VTD-XML's non-blocking XPath can even be used to walk the nodes, which simply isn't possible with other types of XPath engines.

When combined with other benefits/features of VTD-XML, the comparison with DOM and SAX is about to get lopsided in a hurry. Regardless of the file sizes, whether it is parsing, updating or XPath evaluation (shown later), VTD-XML delivers the kind of across-the-board performance improvements never before thought possible by many. As the processing model of choice for your transactional SOA applications dealing with small to medium size XML files, VTD-XML will easily blow away DOM . And the benefit only gets bigger for large XML files. You literally moves from long, ugly, and unmaintainable SAX code to clean, short VTD-XML code, while enjoying the full power of VTD-XML's random access and a 10 to 50 times speedup along the way. If you regularly deal with large XML documents or XPath expressions returning huge node sets, VTD-XML is a must-have!

Common Usage Patterns

There are a few common patterns to follow in order to base your application on VTD-XML.

Embed evalXPath(..) in while's control statement

The most common pattern is to embed evalXPath() in the control statement of a while loop. So far all the code uses this pattern to navigate to the qualified nodes in the document. The return value of EvalXPath(), if equal to -1, indicates there is no node remaining in the node set. Otherwise, the return value equals to the VTD record index of the node selected by the XPath, to which the cursor of the VTDNav object (bound to the AutoPilot object) also points. The following two short code fragments are completely equivalent.

C#
int i;

ap.bind(vn);
ap.selectXPath(...); 
while( (i=ap.evalXPath())!=-1){ 

}
C#
int i;

ap.bind(vn); 
ap.selectXPath(...);

while(ap.evalXPath()!>=0){ 
i = vn.getCurrentIndex();

}

Combining with Manual Navigation

But sometimes you want to navigate the cursor a bit further within the while loop. To do that, the rule is that the code segment nested in the loop must not alter the node position. You can either move the cursor back to the original location, or use the VTDNav object's push() and pop() to enforce the rule. For the same XML data shown above, the first example, performing a little manual navigation in the while loop, moves the cursor back to the starting position before the next iteration.

C#
using System;
using System.Collections.Generic;
using System.Text;
using com.ximpleware;
namespace app3
{
    class Program
    {
        static void Main(string[] args)
        {
            VTDGen vg = new VTDGen();
            int i;
            AutoPilot ap = new AutoPilot();

            ap.selectXPath("/CATALOG/CD[PRICE < 10]");
            if (vg.parseFile("catalog.xml", false))
            {
                VTDNav vn = vg.getNav();
                ap.bind(vn);
                //XPath eval returns one node at a time
                while ((i = ap.evalXPath()) != -1)
                {// get to the first child
                    if (vn.toElement(VTDNav.FIRST_CHILD, "TITLE"))
                    {
                        int j = vn.getText();

                        if (j != -1)

                        Console.WriteLine(" text node ==>" + vn.toString(j));
                        vn.toElement(VTDNav.PARENT); // move the cursor back 
                    }
                }

                ap.resetXPath();
            }
        }
    }
}

The second example uses push() and pop(), which is slightly less efficient.

C#
using System;
using System.Collections.Generic;
using System.Text;
using com.ximpleware;
namespace app4
{
    class Program
    {
        static void Main(string[] args)
        {
            VTDGen vg = new VTDGen();
            int i;
            AutoPilot ap = new AutoPilot();
            ap.selectXPath("/CATALOG/CD[PRICE < 10]");
            if (vg.parseFile("catalog.xml", false))
            {
                VTDNav vn = vg.getNav();
                ap.bind(vn);
                //XPath eval returns one node at a time
                while ((i = ap.evalXPath()) != -1)
                {
                    vn.push();
                    // get to the first child
                    if (vn.toElement(VTDNav.FIRST_CHILD, "TITLE"))
                    {
                        int j = vn.getText();
                        if (j != -1)
                            Console.WriteLine(" text node ==>" + vn.toString(j));
                    }
                    vn.pop();
                }
                ap.resetXPath();
            }
        }
    }
}

Nested XPath support

If the navigation you want to perform is more complicated, you can in fact nest XPath queries. The above code re-written in XPath is shown below. The rule that ensures the consistency of the cursor position still applies.

C#
using System;
using System.Collections.Generic;
using System.Text;
using com.ximpleware;
namespace app6
{
    class Program
    {
        static void Main(string[] args)
        {
            VTDGen vg = new VTDGen();
            int i;
            AutoPilot ap = new AutoPilot();
            ap.selectXPath("/CATALOG/CD[PRICE < 10]");
            AutoPilot ap2 = new AutoPilot();
            ap2.selectXPath("TITLE/text()");
            if (vg.parseFile("catalog.xml", false))
            {
                VTDNav vn = vg.getNav();
                ap.bind(vn);
                ap2.bind(vn);
                //XPath eval returns one node at a time
                while ((i = ap.evalXPath()) != -1)
                {
                    vn.push();
                    // get to the first child 
                    int j;
                    while ((j = ap2.evalXPath()) != -1)
                    {
                        Console.WriteLine(" text node ==>" + vn.toString(j));
                    }
                    // doesn't forget this since the next iteration will reuse
                    // ap2's XPath!!! 
                    ap2.resetXPath();
                    vn.pop();
                }
                ap.resetXPath();
            }
        }
    }
}

Ad-Hoc tree walker mode

Of course, there is nothing preventing you from calling evalXPath() just to move the cursor to the desired destination(s) in the XML document.

C#
import com.ximpleware.*; 
public class vtdXPath5{
 public static void main(String args[]) throws Exception{
  VTDGen vg =new VTDGen();
  int i;
  AutoPilot ap = new AutoPilot(); 
  ap.selectXPath("/CATALOG/CD[PRICE < 10]"); 

  if (vg.parseFile("catalog.xml", false)){
   VTDNav vn = vg.getNav();
   ap.bind(vn);
    //XPath eval returns one node at a time
    if (ap.evalXPath()>=0){ 
      // do what ever you like here with vn 
    }    
  }
 }
}

XPath Evaluation Performance BenchMarks

This section will show you the VTD-XML's XPath evaluation performance and compare it with Xalan and JAXEN, both which works with Xerces DOM. The benchmark programs precompile a set of XPath expressions and then repetitively execute the queries over the same set of documents (three documents of similar structure, but differing in size, are chosen). The tables and charts below report average latencies of query executions. For a complete description and discussion of the test set up, methodology and code, please visit Latency Number

/*/*/*[position() mod 2 = 0]

 Jaxen (ms) JDK (ms)VTD-XML (ms)
po_small.xml 0.0395 1.123 0.00994
po_medium.xml 0.524 6.552 0.115
po_big.xml 8.812 94.535 1.292

/purchaseOrder/items/item[USPrice <100]

Jaxen (ms) JDK (ms) VTD-XML (ms)
po_small.xml 0.0857 1.204 0.048
po_medium.xml 1.498 9.381 0.854
po_big.xml 22.262 113.831 9.238

/*/*/*/quantity/text()

Jaxen (ms) JDK (ms) VTD-XML (ms)
po_small.xml 0.107 1.0756 0.0317
po_medium.xml 1.922 7.0268 0.537
po_big.xml 32.181 106.7 6.339

//item/comment

Jaxen (ms) JDK (ms) VTD-XML (ms)
po_small.xml 0.356 1.208 0.089
po_medium.xml 9.824 9.922 1.505
po_big.xml 197.319 138.45 17.825

//item/comment/../quantity

Jaxen (ms) JDK (ms) VTD-XML (ms)
po_small.xml 0.374 1.274 0.105
po_medium.xml 11.147 10.298 1.787
po_big.xml 204.714 140.06 21.06

/*/*/*/* [position() mod 2 = 0]

Jaxen (ms) JDK(ms) VTD-XML(ms)
po_small.xml 0.136 1.21 0.0353
po_medium.xml 2.7 8.216 0.604
po_big.xml 65.469 122.83 6.559

/purchaseOrder/items/item[USPrice > 100]/comment/text()

Jaxen (ms) JDK (ms) VTD-XML (ms)
po_small.xml 0.113 1.262 .06
po_medium.xml 2.082 9.154 1.061
po_big.xml 32.892 121.07 12.35

Graphical View

/*/*/*[position() mod 2 = 0]

/purchaseOrder/items/item[USPrice <100]

/*/*/*/quantity/text()

//item/comment

//item/comment/../quantity

/*/*/*/* [position() mod 2 = 0]

/purchaseOrder/items/item[USPrice > 100]/comment/text()

Conclusion

Hopefully this article has provided you with both an introduction to, and an overview of, VTD-XML's XPath implementation. When combined the other inherent benefits, VTD-XML empowers you to write clean, readable code and acheive peerless performance, efficiency and flexibility. Because of it, some of the most deeply entrenched perceptions and assumptions about XML programming are about to change. Consequently, the infrastructure built on top of XML, such as Web Services and SOA, is on the verge of positive changes as well. My next articles will show you how to use VTD-XML to significantly improve Java-XML databinding.

History

  • 7/29/08 Initial Article posted
  • 8/11/08 Added subsection covering EvalXPathToString(), EvalXPathToNumber(), EvalXPathToBoolean().
  • 6/9/10 Article edited

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Chief Technology Officer XimpleWare
United States United States
Jimmy Zhang is a cofounder of XimpleWare, a provider of high performance XML processing solutions. He has working experience in the fields of electronic design automation and Voice over IP for a number of Silicon Valley high-tech companies. He holds both a BS and MS from the department of EECS from U.C. Berkeley.

Comments and Discussions

 
SuggestionComparison with .NET XPathNavigator Pin
Elmue5-Jul-11 15:18
Elmue5-Jul-11 15:18 
QuestionNested fragment selection Pin
Patricia P22-Oct-09 22:46
Patricia P22-Oct-09 22:46 
AnswerRe: Nested fragment selection Pin
Jimmy Zhang23-Oct-09 19:38
Jimmy Zhang23-Oct-09 19:38 
QuestionHTML parser Pin
Bizounours14-Jun-09 21:10
Bizounours14-Jun-09 21:10 
AnswerRe: HTML parser Pin
Jimmy Zhang15-Jun-09 13:15
Jimmy Zhang15-Jun-09 13:15 
GeneralPerf Comparisons Pin
Stuart Dootson10-Jun-09 1:02
professionalStuart Dootson10-Jun-09 1:02 
GeneralRe: Perf Comparisons Pin
Jimmy Zhang10-Jun-09 7:05
Jimmy Zhang10-Jun-09 7:05 
GeneralRe: Perf Comparisons Pin
Stuart Dootson10-Jun-09 7:15
professionalStuart Dootson10-Jun-09 7:15 
GeneralRe: Perf Comparisons Pin
Jimmy Zhang10-Jun-09 16:42
Jimmy Zhang10-Jun-09 16:42 
GeneralGetting to attributes via XPath [modified] Pin
James H8-Jan-09 12:36
James H8-Jan-09 12:36 
GeneralRe: Getting to attributes via XPath Pin
James H8-Jan-09 13:13
James H8-Jan-09 13:13 
GeneralRe: Getting to attributes via XPath Pin
Stuart Dootson10-Jun-09 0:59
professionalStuart Dootson10-Jun-09 0:59 
GeneralRe: Getting to attributes via XPath Pin
Jimmy Zhang8-Jan-09 13:50
Jimmy Zhang8-Jan-09 13:50 
GeneralWrong referencial info Pin
Vitaly Tomilov30-Jul-08 0:10
Vitaly Tomilov30-Jul-08 0:10 
GeneralRe: Wrong referencial info Pin
Jimmy Zhang30-Jul-08 8:04
Jimmy Zhang30-Jul-08 8:04 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.