Flat DOM

24 Oct 2005 | CPOL | 2 min read
A simpler way to process XML

Introduction

[Note -- The concept below and the supplied code also work for HTML.]

When processing XML documents, one usually has three choices:

  1. Use a stateless, event-driven technique such as SAX. This is fast and efficient but very low level, sitting just above raw parsing.
  2. Use a document object model (DOM) approach. This technique has the advantage of being standardized, but it is slow and memory intensive. Surprisingly, parsing an XML document into a hierarchical data structure is not as useful as it seems and still leaves a great deal of work before anything useful gets done.
  3. Use XSLT or XPath on top of the standard DOM. This technique is often considered the easiest to use, but it is the least efficient. In addition, many types of manipulation do not fit within these higher-level models, and dropping down to lower-level code is difficult.

We offer a fourth approach that runs at close to the efficiency of SAX and comes close to the ease of use of XPath, while still allowing us to drop down to a lower programming level with ease. We call our solution "Flat DOM" for reasons that will soon become obvious.

Very simply, we use SAX to build a list of key/value pairs. The key is the complete path to that particular entry, and the value is either the text content or the attribute value. In a key, the special character @ marks an attribute and #text marks a text segment; the values #START# and #END# mark where an element opens and closes. Sequential text segments are merged together for ease of processing. The example below illustrates what we mean.

If we take the XML document example below...

XML
<?xml version="1.0" encoding="UTF-8"?>
<content xmlns="http:XMLSerialization">
    <object id="2">
        <class flag="3" id="0" name="NSArray" suid="-3789592578296478260">
            <field name="objects" type="java.lang.Object[]"/>
        </class>
        <array field="objects" id="4" ignoreEDB="1" length="3" type="java.lang.Object[]">
            <string id="5">The Chestry Oak</string>
            <string id="6">A Tree for Peter</string>
            <string id="7">The White Stag</string>
        </array>
    </object>
</content>

... and process it into a Flat DOM, we get the following list of key/value pairs (shown as key: value):

/content/: #START#
/content/@xmlns: http:XMLSerialization
/content/object/: #START#
/content/object/@id: 2
/content/object/class/: #START#
/content/object/class/@flag: 3
/content/object/class/@id: 0
/content/object/class/@name: NSArray
/content/object/class/@suid: -3789592578296478260
/content/object/class/field/: #START#
/content/object/class/field/@name: objects
/content/object/class/field/@type: java.lang.Object[]
/content/object/class/field/: #END#
/content/object/class/: #END#
/content/object/array/: #START#
/content/object/array/@field: objects
/content/object/array/@id: 4
/content/object/array/@ignoreEDB: 1
/content/object/array/@length: 3
/content/object/array/@type: java.lang.Object[]
/content/object/array/string/: #START#
/content/object/array/string/@id: 5
/content/object/array/string/#text: The Chestry Oak
/content/object/array/string/: #END#
/content/object/array/string/: #START#
/content/object/array/string/@id: 6
/content/object/array/string/#text: A Tree for Peter
/content/object/array/string/: #END#
/content/object/array/string/: #START#
/content/object/array/string/@id: 7
/content/object/array/string/#text: The White Stag
/content/object/array/string/: #END#
/content/object/array/: #END#
/content/object/: #END#
/content/: #END#
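
To make the idea concrete, here is a minimal sketch of how a list like the one above could be built with the standard Java SAX API. It is not the code supplied with this article: the class name FlatDomSketch, the method flatten, and the choice to trim text and skip whitespace-only segments are all assumptions made for illustration. Attribute order follows whatever order the parser reports.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch; NOT the XMLVectorImp class shipped with the article.
class FlatDomSketch {

    /** Parses the XML text and returns one "path: value" string per entry. */
    static List<String> flatten(String xml) {
        List<String> entries = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private final StringBuilder path = new StringBuilder("/");
            private final StringBuilder text = new StringBuilder();

            // Emits the buffered text as one entry. SAX may deliver text in
            // several characters() calls, so the pieces are merged here.
            // Assumption: text is trimmed and whitespace-only runs are skipped.
            private void flushText() {
                String t = text.toString().trim();
                if (!t.isEmpty()) {
                    entries.add(path + "#text: " + t);
                }
                text.setLength(0);
            }

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                flushText();
                path.append(qName).append('/');
                entries.add(path + ": #START#");
                for (int i = 0; i < atts.getLength(); i++) {
                    entries.add(path + "@" + atts.getQName(i) + ": " + atts.getValue(i));
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                flushText();
                entries.add(path + ": #END#");
                // Remove the trailing "name/" segment from the current path.
                path.setLength(path.length() - qName.length() - 1);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        return entries;
    }

    public static void main(String[] args) {
        String xml = "<array length=\"1\"><string id=\"5\">The Chestry Oak</string></array>";
        for (String entry : flatten(xml)) {
            System.out.println(entry);
        }
    }
}
```

The handler keeps only the current path and a text buffer, so memory use beyond the output list stays constant regardless of document depth.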

Iterating through this list and extracting the data we need is now relatively trivial. We can match keys with regular expressions, or with more complex criteria, as needed.
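
As a rough illustration of that kind of extraction, the sketch below hard-codes a few of the key/value pairs from the listing and filters them with a regular expression on the key. The class FlatDomQuerySketch and the helper valuesWhere are hypothetical names used only for this example; they are not part of the supplied library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch: filters a flat key/value list by matching the key.
class FlatDomQuerySketch {

    // A few {path, value} pairs copied from the listing in this article.
    static final String[][] ENTRIES = {
        {"/content/object/array/", "#START#"},
        {"/content/object/array/@id", "4"},
        {"/content/object/array/@length", "3"},
        {"/content/object/array/string/", "#START#"},
        {"/content/object/array/string/@id", "5"},
        {"/content/object/array/string/#text", "The Chestry Oak"},
        {"/content/object/array/string/", "#END#"},
        {"/content/object/array/string/", "#START#"},
        {"/content/object/array/string/@id", "6"},
        {"/content/object/array/string/#text", "A Tree for Peter"},
        {"/content/object/array/string/", "#END#"},
        {"/content/object/array/string/", "#START#"},
        {"/content/object/array/string/@id", "7"},
        {"/content/object/array/string/#text", "The White Stag"},
        {"/content/object/array/string/", "#END#"},
        {"/content/object/array/", "#END#"},
    };

    /** Returns, in document order, the values whose whole path matches keyRe. */
    static List<String> valuesWhere(Pattern keyRe) {
        List<String> out = new ArrayList<>();
        for (String[] entry : ENTRIES) {
            if (keyRe.matcher(entry[0]).matches()) {
                out.add(entry[1]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Text of every <string> element, wherever it sits in the document.
        System.out.println(valuesWhere(Pattern.compile(".*/string/#text")));
        // The id attribute of every <string> element.
        System.out.println(valuesWhere(Pattern.compile(".*/string/@id")));
    }
}
```

Because every key carries its full path, one pattern is enough to select entries at any depth without walking a tree.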

We have chosen to take an interface and implementation approach for the code representing the structure. The reason for this is to allow for alternate representations of the list structure underneath. For example, there is a lot of repetition in the key part. We have written an implementation that stores the key parts separately in order to compress the representation in memory. We have also written an implementation for handling HTML. One could also write an implementation that stores and accesses the data from a file or a database. The interface is:

Java
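import java.util.regex.Pattern;

public interface XMLVector {
   // Number of key/value entries in the list
   int length();
   // Path (key) and value of the i-th entry
   String getPath(int i);
   String getValue(int i);
   // Indexes of the entries selected by a key or by a pattern on the key
   int[] getPosition(String key);
   int[] getPosition(Pattern keyRe);
   // Value lookup by key or by a pattern on the key
   String getValue(String key);
   String getValue(Pattern keyRe);
}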
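
To show why the repetition in the keys leaves room for a more compact representation, here is a small sketch that stores each distinct path once and keeps only an integer index per entry. It is a simplified stand-in, not the compressed implementation mentioned above (which stores the key parts separately). The class CompactPathStore and its add and distinctPathCount methods are hypothetical; only length, getPath, getValue and getPosition(Pattern) mirror names from the interface.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;
import java.util.stream.IntStream;

// Hypothetical sketch: each distinct path string is stored only once.
class CompactPathStore {
    private final List<String> distinctPaths = new ArrayList<>();
    private final Map<String, Integer> pathIds = new HashMap<>();
    private final List<Integer> entryPathIds = new ArrayList<>();
    private final List<String> values = new ArrayList<>();

    /** Appends one key/value entry, reusing the stored path if already seen. */
    void add(String path, String value) {
        Integer id = pathIds.get(path);
        if (id == null) {
            id = distinctPaths.size();
            distinctPaths.add(path);
            pathIds.put(path, id);
        }
        entryPathIds.add(id);
        values.add(value);
    }

    int length() { return values.size(); }

    String getPath(int i) { return distinctPaths.get(entryPathIds.get(i)); }

    String getValue(int i) { return values.get(i); }

    int distinctPathCount() { return distinctPaths.size(); }

    /** Indexes of all entries whose whole path matches keyRe. */
    int[] getPosition(Pattern keyRe) {
        // Run the regex once per distinct path rather than once per entry.
        boolean[] hit = new boolean[distinctPaths.size()];
        for (int p = 0; p < hit.length; p++) {
            hit[p] = keyRe.matcher(distinctPaths.get(p)).matches();
        }
        return IntStream.range(0, length())
                .filter(i -> hit[entryPathIds.get(i)])
                .toArray();
    }

    public static void main(String[] args) {
        CompactPathStore store = new CompactPathStore();
        String[] titles = {"The Chestry Oak", "A Tree for Peter", "The White Stag"};
        for (String title : titles) {
            store.add("/content/object/array/string/", "#START#");
            store.add("/content/object/array/string/#text", title);
            store.add("/content/object/array/string/", "#END#");
        }
        System.out.println(store.length() + " entries, "
                + store.distinctPathCount() + " distinct paths");
        for (int i : store.getPosition(Pattern.compile(".*#text"))) {
            System.out.println(store.getPath(i) + ": " + store.getValue(i));
        }
    }
}
```

A side benefit of this layout is that pattern lookups cost one regex match per distinct path, which matters for documents with many repeated elements.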

And below is an example of using the library to extract all of the attributes from an XML file:

Java
import java.io.File;
import java.util.regex.Pattern;

public class Main {
   public static void main(String[] args) throws Throwable {
      File f = new File("sample1.xml");
      XMLVector vec = new XMLVectorImp(f);
      // Print the vector itself (output depends on the implementation's toString)
      System.out.println(vec);
      // Extract all attributes for fun: attribute keys are the ones containing '@'
      int[] pos = vec.getPosition(Pattern.compile(".*@.*"));
      for (int i = 0; i < pos.length; ++i) {
         System.out.println(vec.getPath(pos[i]) + ": " + vec.getValue(pos[i]));
      }
   }
}

Hopefully, you will find this as useful and easy to use as I do. The code is written in Java, but the concept is very simple and it shouldn't be too difficult to translate it into C# and other languages.

History

  • 24th October, 2005: Initial post

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Web Developer
Canada
Software developer for Simple Software.

Currently developing a Java client application that allows you to view and interact with real-time web metrics sent from your web server.

Comments and Discussions

 
Nice Idea (Doron Barak, 24-Oct-05 18:13)
Re: Nice Idea (Ian Schumacher, 24-Oct-05 18:28)
Thanks. Yes, if you need to alter the XML then this is probably not appropriate.

Even when the structure is flattened there can still be a significant amount of coding required to do what you need such as mapping XML to objects, but I have found that it seemed to be much simpler to do. Extracting data, on the other hand, is made very easy in comparison.

Cool about your XML parser, that's very compact. I wrote an HTML parser that is included with the code, but it's not perfect. Still in about five lines of code I can load, parse, and screen scrape data from an HTML page and I have found this very useful from time to time.
