Using XPath to Query the Internet

Bruce Yang CL

Dec 9, 2013

CPOL

A way to query data from the Internet

Download XMLofInternet_01.zip - 2.2 KB

Introduction

Once upon a time, people used HTML to ship data, year after year. The Internet became the world's biggest data source, but unlike databases, which have SQL, the Internet lacks a query language. Fortunately, HTML closely resembles XML, and well-behaved pages can be translated to XML with little effort. XML has a query language, XPath, so if we present the Internet as one very big XML document, we can query the Internet and get whatever we want. The new face of the Internet may look like this:

<internet>
    <com>
        <website name="booksite">
            <www port="80">
                <html name="index">
                    <!-- html context -->
                    <head>
                        <title>Book Store</title>
                    </head>
                    <body>
                        <div>Welcome</div>
                    </body>
                </html>
                <html name="booklist/isbn777">
                    <head>
                        <title>CSharp Guide</title>
                    </head>
                    <body></body>
                </html>
                <html name="booklist/isbn888">
                    <head>
                        <title>Java Guide</title>
                    </head>
                    <body></body>
                </html>
            </www>
            <ftp></ftp>
        </website>
        <website name="candysite">
            <www port="80">
                <html name="index">
                    <head>
                        <title>Candy Store</title>
                    </head>
                    <body></body>
                </html>
                <html name="candylist/candy001">
                    <head>
                        <title>Chocolate</title>
                    </head>
                    <body></body>
                </html>
            </www>
        </website>
    </com>
    <net></net>
    <org></org>
</internet>

We can evaluate the XPath expression "//html/head/title//text()" to get the titles of all the pages on the Internet. Unfortunately, standard XPath tools can only query an actual XML document, not a virtual one. Reimplementing XPath to work on a virtual document is a big project and is not the intent of this article.

The following sections make XPath work in a simpler way: load all the pages of the websites, translate them to standard XML, and integrate them into a single document.

Using XPath in Java

First, we need to know how to use XPath. It is simple in Java: create an XPath instance, then call its 'evaluate' method.

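import javax.xml.parsers.*;
import javax.xml.xpath.*;
import org.w3c.dom.*;

// Parse the XML file into a DOM document
DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document doc = builder.parse("XMLOfInternet.xml");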
 
XPathFactory factory = XPathFactory.newInstance();
XPath xpath = factory.newXPath();
 
System.out.println("--ALL Websites--");
NodeList nodes = (NodeList) xpath.evaluate(
        "//html/head/title//text()", doc, XPathConstants.NODESET);
for (int i = 0; i < nodes.getLength(); i++) {
    System.out.println(nodes.item(i).getNodeValue());
}
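
The later listings call a query(xml, exp) helper whose body is not shown in this article. Below is a minimal sketch of what such a helper could look like, assuming it receives the XML as a string and returns the matched node values. The class name XPathQuery and the return type are my assumptions for illustration, and the attached file may do it differently.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathQuery {
    // Hypothetical helper: evaluates exp against an XML string
    // and returns the value of every matched node.
    public static List<String> query(String xml, String exp) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                    exp, doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getNodeValue());
            }
            return values;
        } catch (Exception e) {
            // Malformed XML or an invalid expression ends up here
            throw new IllegalArgumentException("Cannot evaluate: " + exp, e);
        }
    }

    public static void main(String[] args) {
        // A cut-down version of the sample document from the Introduction
        String xml = "<internet><com><website name=\"booksite\"><www port=\"80\">"
                + "<html name=\"index\"><head><title>Book Store</title></head><body/></html>"
                + "<html name=\"booklist/isbn777\"><head><title>CSharp Guide</title></head><body/></html>"
                + "</www></website></com></internet>";
        for (String title : query(xml, "//html/head/title//text()")) {
            System.out.println(title);
        }
    }
}
```

Returning a list instead of printing directly keeps the helper easy to reuse; the caller decides what to do with the titles.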

Evaluating Expression on Internet

In this article we don't want to reimplement XPath so that it can traverse the Internet. The other way to make XPath work is to load the HTML pages, translate them to XML, and integrate them into a single XML document.

String exp = "//website[@name='booksite']/www/html[starts-with(@name,'booklist')]/head/title//text()";
queryInternet(exp);

public static void queryInternet(String exp) {
    Website[] wbs = Website.load(new String[] { "candysite", "booksite" });
    String xml = makeXML(wbs);
    System.out.println( "NoIndex XML Length=" + xml.length() );
    query(xml, exp);
}

public static String makeXML(Website[] wbs) {
    StringBuilder sb = new StringBuilder("<internet><com>");
    for (Website wb : wbs) {
        sb.append("<website name=\"" + wb.name + "\"><www port=\"80\">");
        for (Page p : wb.pages) {
            sb.append("<html name=\"" + p.name + "\">");
            sb.append(HTMLtoXML(p.context));
            sb.append("</html>");
        }
        sb.append("</www></website>");
    }
    sb.append("</com></internet>");
    return sb.toString();
}

public static class Website {
    public Website(String _name) {
        this.name = _name;
    }
 
    public String name;
    public ArrayList<Page> pages = new ArrayList<Page>();
}

public static class Page {
    public Page() {
    }
 
    public Page(String _name, String _context) {
        this.name = _name;
        this.context = _context;
    }
 
    public String name;
    public String context;
}

The XPath expression above, "//website[@name='booksite']/www/html[starts-with(@name,'booklist')]/head/title//text()", gets the book titles from the book website. It doesn't look like SQL at all, but it is understandable.
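
makeXML calls HTMLtoXML(p.context) and wraps the result in its own <html name="..."> element, so the translation step has to return the content of the page without the outer <html> tag. The sketch below shows one way this could work using only the JDK. It assumes the page is already well-formed XHTML without a DOCTYPE; real-world HTML would first need an HTML-cleaning library, which is not shown. The names HtmlToXmlSketch and htmlToXml are hypothetical and not taken from the attached file.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class HtmlToXmlSketch {
    // Assumption: xhtml is well-formed and has a single root element.
    // Returns the child elements of the root (head, body) as an XML fragment.
    public static String htmlToXml(String xhtml) {
        try {
            Element root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)))
                    .getDocumentElement();
            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            // No <?xml ...?> header, because the fragment is embedded in a larger document
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            NodeList children = root.getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                // Skip whitespace and comments between head and body
                if (child.getNodeType() == Node.ELEMENT_NODE) {
                    serializer.transform(new DOMSource(child), new StreamResult(out));
                }
            }
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XHTML", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><head><title>Java Guide</title></head><body></body></html>";
        System.out.println(htmlToXml(page));
    }
}
```

Parsing and re-serializing also guarantees that whatever gets appended to the big document is well-formed, so one bad page fails early instead of breaking the whole XML.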

Making Indexes

Loading the Internet for every query is expensive, so we need local or cloud-hosted indexes to help XPath evaluation. In this sample I used an embedded database, iBoxDB, to build the index. If the expression matches a pattern that the index can serve, we build the XML for XPath from the local index; otherwise we load the pages from the Internet.

public static void indexedQueryInternet(String exp) {
    if (exp.contains("website[@name='booksite']")
            && exp.contains("html[starts-with(@name,'booklist')")) {
        Website[] wbs = Website.loadFromIndex("booksite", "booklist");
        String xml = makeXML(wbs);
        System.out.println("Indexed XML Length=" + xml.length());
        query(xml, exp);
    } else {
        queryInternet(exp);
    }
}

private static iBoxDB.LocalServer.AutoBox indexes;
public static Website[] loadFromIndex(String webName, String path) {
    makeIndex();
    Website wb = new Website(webName);
    for (Page p : indexes.select(Page.class,
            "from pages where webName=? & name>?", webName, path)) {
        wb.pages.add(p);
    }
    return new Website[] { wb };
}

private static void makeIndex() {
    if (indexes == null) {
        DB db = new DB();
        // tableName= "pages"
        // tableKey= "webName" + "name";
        db.ensureTable(FullPage.class, "pages", "webName", "name");
        indexes = db.open();
        if (indexes.selectCount("from pages") < 1) {
            Website[] wbs = Website.load(new String[] { "candysite",
                    "booksite" });
            for (Website wb : wbs) {
                for (Page p : wb.pages) {
                    indexes.insert("pages", new FullPage(wb.name,
                            p.name, p.context));
                }
            }
        }
    }
}

public static class FullPage extends Page {
    public FullPage() {
    }
 
    public FullPage(String _webName, String _name, String _context) {
        super(_name, _context);
        this.webName = _webName;
    }
 
    public String webName;
}
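
One detail of the index lookup deserves a note: the condition name>? is a lower-bound range scan, so it can return pages that sort after the prefix without starting with it ("index" sorts after "booklist", for example). That is harmless here because XPath applies starts-with again, but the index can do a tighter job. The sketch below illustrates the idea with a plain java.util.TreeMap instead of iBoxDB; the class and method names are hypothetical. It assumes page names never contain the character '\uffff'.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class PrefixRangeSketch {
    // Returns the values whose keys start with prefix.
    // In a sorted index those keys form the range [prefix, prefix + '\uffff').
    public static List<String> startsWith(TreeMap<String, String> index, String prefix) {
        String upperBound = prefix + Character.MAX_VALUE;
        return new ArrayList<>(
                index.subMap(prefix, true, upperBound, false).values());
    }

    public static void main(String[] args) {
        // Page name -> title, standing in for the "pages" table of one website
        TreeMap<String, String> pages = new TreeMap<>();
        pages.put("index", "Book Store");
        pages.put("booklist/isbn777", "CSharp Guide");
        pages.put("booklist/isbn888", "Java Guide");

        // Only the two booklist pages fall inside the range
        for (String title : startsWith(pages, "booklist")) {
            System.out.println(title);
        }
    }
}
```

With both a lower and an upper bound, the index returns only the pages that XPath would keep anyway, so the generated XML is as small as it can be.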

The results of the above procedures are as follows (see the attached file for details):

--ALL Websites--
Book Store
CSharp Guide
Java Guide
Candy Store
Chocolate
--Book Website--
CSharp Guide
Java Guide
NoIndex XML Length=585
--Results--
CSharp Guide
Java Guide
Indexed XML Length=361
--Results--
Java Guide
CSharp Guide

Points of Interest

This article showed one case of what XPath is capable of. Much of the world's data is unstructured, or structured in ways we don't yet understand, so we will need more and more NoSQL techniques to make the world computable, and XPath is one of the choices. The demonstrated code is short and simple because HTML is close to XML, so there is not much to do. Using XPath to query the data of several websites is easy, but querying the whole Internet would need a huge effort. Even so, SQL is not suited to querying the Internet, and XPath is the best way we have found.