Summary

Traditionally, DOM- or SAX-based enterprise applications have to repeat CPU-intensive XML parsing when they access the same documents multiple times. This article introduces a very simple, general-purpose, native XML index called VTD+XML (http://vtd-xml.sourceforge.net/persistence.html) that eliminates the need for repetitive parsing in those applications.

Avoid Repetitive XML Parsing with VTD-XML

As discussed in "Simplify XML processing with VTD-XML," one of the underlying assumptions in XML application development to date is that an XML document must be parsed before anything else can be done with it. In other words, the processing logic of an XML application can't start until parsing completes. Frequently considered a threat to database performance, XML parsing is usually many times slower than other XML operations such as XPath evaluation. When those applications perform multiple read-only accesses to XML data that doesn't change very often, wouldn't it be nice to be able to eliminate the overhead of the associated repetitive parsing?

With the native XML indexing feature of VTD-XML, you can do precisely that. Let me put the indexing methods into action and show you how to turn on the indexing capability in your application. Consider the following XML document:

    <purchaseOrder orderDate="1999-10-21">
      <item partNum="872-AA">
        <productName>Lawnmower</productName>
        <quantity>1</quantity>
        <USPrice>148.95</USPrice>
      </item>
    </purchaseOrder>

Below is a simple application named "printPrice.cs" that prints out the content of the element "USPrice." Notice that it parses the XML file and then uses XPath to select the target nodes.
    using System;
    using com.ximpleware;

    public class printPrice
    {
        public static void Main(string[] args)
        {
            VTDGen vg = new VTDGen();
            try
            {
                // Parse po.xml with namespace awareness turned on.
                if (vg.parseFile("po.xml", true))
                {
                    VTDNav vn = vg.getNav();
                    AutoPilot ap = new AutoPilot(vn);
                    ap.selectXPath("/purchaseOrder/item/USPrice/text()");
                    int i = -1;
                    // evalXPath() returns -1 when there are no more matches.
                    while ((i = ap.evalXPath()) != -1)
                    {
                        Console.WriteLine(" USPrice ==> " + vn.toString(i));
                    }
                }
            }
            catch (Exception e)
            {
                // Report the failure instead of silently swallowing it.
                Console.WriteLine(e.Message);
            }
        }
    }

A few changes are needed to add VTD-XML's new indexing capability to the C# code above. First, you need to read in the XML document, parse it, and then write out the indexed version of the same XML document. From that point onward, your application can run XPath queries or other processing logic directly on top of the index, saving the CPU cycles of parsing the XML document again. The following code snippets (named "genIndex.cs" and "accessIndex.cs" respectively) show you how to generate and access the index. Notice that, when executed sequentially, the two applications produce the same output as "printPrice.cs."

The first application (genIndex.cs) reads in "po.xml" and produces "po.vxl."

    using System;
    using com.ximpleware;

    namespace genIndex
    {
        class genIndex
        {
            static void Main(string[] args)
            {
                VTDGen vg = new VTDGen();
                try
                {
                    // Parse the XML file, then write out the XML together with its index.
                    if (vg.parseFile("d:/codeProject/app3/po.xml", true))
                    {
                        vg.writeIndex("d:/codeProject/app3/po.vxl");
                    }
                }
                catch (VTDException e)
                {
                    Console.WriteLine(e.Message);
                }
            }
        }
    }

The second application (accessIndex.cs) loads "po.vxl" and filters the document using the XPath expression "/purchaseOrder/item/USPrice/text()."

    using System;
    using com.ximpleware;

    namespace accessIndex
    {
        class accessIndex
        {
            static void Main(string[] args)
            {
                VTDGen vg = new VTDGen();
                try
                {
                    // Load the index directly; no parsing takes place here.
                    VTDNav vn = vg.loadIndex("d:/codeProject/app3/po.vxl");
                    AutoPilot ap = new AutoPilot(vn);
                    ap.selectXPath("/purchaseOrder/item/USPrice/text()");
                    int i = -1;
                    while ((i = ap.evalXPath()) != -1)
                    {
                        Console.WriteLine(" USPrice ==> " + vn.toString(i));
                    }
                }
                catch (VTDException e)
                {
                    Console.WriteLine(e.Message);
                }
            }
        }
    }

VTD+XML in 30 Seconds
A Simple Example
The first four-byte word of the corresponding index file is 0x01028000 containing:
The second four-byte word has the value of 0x00040001 containing:
The next four four-byte words are reserved and set to zero. The byte order of all the ensuing 32-bit or 64-bit words is platform-dependent and is indicated by the third byte of the index, as defined in the VTD+XML spec. The next eight-byte word indicates the size (in bytes) of the XML document, which equals seven in this example (0x0700000000000000). Immediately following (0x3C726F6F742F3E00) is the byte content of the XML, rounded up to an integer multiple of eight bytes by padding zeros at the end.

The remaining part of the VTD+XML index consists of multiple adjacent segments, each containing an eight-byte word that gives the VTD record or LC (location cache) entry count, followed by the actual content of the VTD records or LC entries. The first eight-byte count word (0x0200000000000000) indicates that there are two VTD records, which are 0x000000000000F0DF and 0x0100000004000000. The remaining three eight-byte words all have the value of zero, indicating that the location caches at levels one, two, and three have no entries in this VTD+XML index. As the final output, the VTD+XML index for "<root/>" is 88 bytes long and looks like the following hex:
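The 88-byte total follows from adding up the pieces described above. The sketch below is only an illustration of that arithmetic, not part of the VTD-XML API: the class and method names are hypothetical, and it assumes that all three location caches are empty, as in this example, so that each contributes only its eight-byte count word.

```csharp
using System;

// Hypothetical helper for illustration only; not part of VTD-XML.
public static class VtdIndexSize
{
    // Round a byte count up to the next multiple of eight.
    public static long PadToEight(long byteCount)
    {
        return (byteCount + 7) / 8 * 8;
    }

    // Size in bytes of an index whose three location caches are empty.
    public static long WithEmptyLocationCaches(long xmlByteCount, long vtdRecordCount)
    {
        long header = 2 * 4 + 4 * 4;          // two four-byte words plus four reserved words
        long sizeWord = 8;                    // eight-byte XML document size
        long xml = PadToEight(xmlByteCount);  // XML bytes, zero-padded to a multiple of eight
        long vtd = 8 + 8 * vtdRecordCount;    // count word plus eight-byte VTD records
        long caches = 3 * 8;                  // three count words, each zero
        return header + sizeWord + xml + vtd + caches;
    }

    public static void Main()
    {
        // "<root/>" is 7 bytes long and yields 2 VTD records.
        Console.WriteLine(WithEmptyLocationCaches(7, 2)); // prints 88
    }
}
```

For "<root/>" this gives 24 + 8 + 8 + 24 + 24 = 88 bytes, matching the size stated above.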
Figure 1. Screen capture of bytes for "<root/>"

Benefits and Limitations

Because VTD+XML straightforwardly combines VTD and XML, it inherits all the benefits of VTD-XML parsing. Compared with existing XML indices (e.g., various pure-binary XML indices that model a labeled, ordered tree), VTD+XML offers many unique technical benefits:
At the same time, users of VTD+XML need to be aware of the following limitations:
The Case Involving XML Content Update

Some of you may wonder: What if the subsequent XML operations involve content updates that shift the offset values? In general, those use cases require the updated XML document to be re-indexed. And for large XML documents, you may argue that the cost of re-indexing can be quite significant. However, there are several workarounds, all aimed at reducing, or even eliminating, the cost of re-indexing.

The first workaround

Instead of creating the VTD+XML index for a single big XML document, split the XML document into multiple smaller ones, each of which is then indexed using VTD+XML. From this point on, you only need to regenerate a VTD+XML index for the "updated" XML fragments, which are usually a lot smaller and therefore cheaper to re-index.

The second workaround

VTD-XML's editing capability lets you modify XML content without needing to regenerate the index. The code below makes use of the VTDNav class's new "overWrite(...)" method to change the text node of "<root>good</root>" from "good" to "bad." If the new content is shorter than or equal in length to the old content, "overWrite(...)" fills up the non-overlapping portion of the text with white spaces and returns true. Otherwise, the original content is left unchanged and "overWrite(...)" returns false.

    using System;
    using System.Text;
    using com.ximpleware;

    namespace template
    {
        class template
        {
            static void Main(string[] args)
            {
                VTDGen vg = new VTDGen();
                Encoding eg = Encoding.GetEncoding("utf-8");
                if (vg.parseFile("d:/codeProject/app3/temp1.xml", true))
                {
                    VTDNav vn = vg.getNav();
                    // Index of the text node of the root element, or -1 if there is none.
                    int i = vn.getText();
                    if (i != -1)
                    {
                        // Prints "good".
                        Console.WriteLine("text ---> " + vn.toString(i));
                        // overWrite returns true if the new content fits.
                        if (vn.overWrite(i, eg.GetBytes("bad")))
                        {
                            // Prints "bad".
                            Console.WriteLine("text ---> " + vn.toString(i));
                        }
                    }
                }
            }
        }
    }

Though simple, this "editing" feature actually has unexpected performance implications.
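The padding rule described above can be illustrated without the library. The sketch below is a hypothetical stand-in that works on a plain byte array; it is not the VTD-XML implementation, and it assumes that the white-space padding goes after the new content. The names InPlaceEdit and TryOverwrite are made up for this example.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration only; it mimics the rule described
// above on a plain byte array and is not the VTD-XML implementation.
public static class InPlaceEdit
{
    // Overwrite doc[offset .. offset + length) with newContent and pad the
    // rest of the region with spaces (assumed placement: after the content).
    // Returns false and leaves doc untouched if newContent does not fit.
    public static bool TryOverwrite(byte[] doc, int offset, int length, byte[] newContent)
    {
        if (newContent.Length > length)
        {
            return false;
        }
        Array.Copy(newContent, 0, doc, offset, newContent.Length);
        for (int k = offset + newContent.Length; k < offset + length; k++)
        {
            doc[k] = (byte)' ';
        }
        return true;
    }

    public static void Main()
    {
        byte[] doc = Encoding.UTF8.GetBytes("<root>good</root>");
        int offset = 6;  // "good" starts right after the 6-byte "<root>" tag
        int length = 4;  // "good" is 4 bytes long
        bool ok = TryOverwrite(doc, offset, length, Encoding.UTF8.GetBytes("bad"));
        Console.WriteLine(ok + " " + Encoding.UTF8.GetString(doc));
        // prints: True <root>bad </root>
    }
}
```

Because the document keeps exactly the same length, every byte offset after the edited region stays where it was, which is why an index built before the edit remains valid.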
Consider the database table design in which you specify the column width. You can now borrow the same technique for XML composition: by pre-serializing some extra spaces into text nodes or attribute values, you can make "in situ" updates to those nodes and do so without regenerating the index. You can even pre-serialize, in an XML document, dummy elements containing text nodes or attribute values whose initial values are entirely white spaces. Those dummy elements serve as templates in anticipation of a future content update, as shown in the example below.

The template
After "stamping" in the data

    <purchaseOrder orderDate="1999-10-21">
      <item partNum="872-AA" >
        <productName>Lawnmower </productName>
        <quantity>1 </quantity>
        <USPrice> 100 </USPrice>
      </item>
    </purchaseOrder>

By the same token, the concept of XML content deletion deserves a bit of rethinking as well. Instead of physically deleting an XML element, you can disable it by making it "invisible" to your applications, which achieves the same goal. The benefit: you again avoid the need to re-index. Notice that this plays to XML's strength as a loosely encoded data format. Below is an example of setting the value of the attribute "enable" of an element to make it "invisible."

Before
After

    <purchaseOrder orderDate="1999-10-21">
      <item partNum="872-AA" enable='0'>
        <productName>Lawnmower</productName>
        <quantity>1</quantity>
        <USPrice>148.95</USPrice>
      </item>
    </purchaseOrder>

Application Scenarios

There are at least two different ways to make sense of VTD+XML as a practical solution to real problems. The first is the traditional view of native XML indexing. Alternatively, you can think of VTD+XML as a binary data format that is backwards-compatible with XML.

Native XML Indexing

In this view, you simply use VTD+XML as the basis for native XML data stores that serve the backend data needs of XML/SOA applications. By saving it as a BLOB (Binary Large OBject) in a more traditional database table, you gain additional capabilities such as concurrency, data integrity, and replication. Being vastly superior to the awkward shredding-based mapping of XML to relational data, VTD+XML fits exceptionally well in a pure XML/SOA environment. Have a lot of XBRL (Extensible Business Reporting Language) documents, or those big GML (Geography Markup Language) files? VTD+XML should equip you with horsepower never before available.

Binary Enhanced XML

VTD+XML also naturally extends the core capabilities of XML by boosting its processing efficiency to a whole new level. In other words, as a wire format, XML now has it all: not only is it easy to learn, human-readable, interoperable, and loosely encoded by design, performance-wise it also leads CORBA, DCOM, and RMI by a mile. When applied to XML pipelining, VTD+XML can potentially eliminate the repetitive parsing at each stage of the pipeline - an issue that none of the existing XML pipeline specs (e.g., XProc and the XML pipeline definition language) address. If it takes too long for you to push large documents over your DOM-based ESB (Enterprise Service Bus), how does 100 MB in around a single second sound?

Performance
Last Updated: 11 May 2008
Copyright 2008 by Jimmy Zhang. Everything else Copyright © CodeProject, 1999-2008.