Using XPath to Query the Internet

Bruce Yang cp, 10 Dec 2013, CPOL
A way to query data from the Internet

 Introduction  

For years, people have used HTML to ship data, and the Internet has become the biggest data source in the world. Unlike databases, which have SQL, the Internet lacks a query language. Fortunately, HTML is similar to XML and can usually be translated into XML, and XML already has a query language: XPath. If we picture the Internet as one very large XML document, we can query it with XPath and get whatever we want. This new face of the Internet might look like this:

<internet>
    <com>
        <website name="booksite">
            <www port="80">
                <html name="index">
                    <!-- html context -->
                    <head>
                        <title>Book Store</title>
                    </head>
                    <body>
                        <div>Welcome</div>
                    </body>
                </html>
                <html name="booklist/isbn777">
                    <head>
                        <title>CSharp Guide</title>
                    </head>
                    <body></body>
                </html>
                <html name="booklist/isbn888">
                    <head>
                        <title>Java Guide</title>
                    </head>
                    <body></body>
                </html>
            </www>
            <ftp></ftp>
        </website>
        <website name="candysite">
            <www port="80">
                <html name="index">
                    <head>
                        <title>Candy Store</title>
                    </head>
                    <body></body>
                </html>
                <html name="candylist/candy001">
                    <head>
                        <title>Chocolate</title>
                    </head>
                    <body></body>
                </html>
            </www>
        </website>
    </com>
    <net></net>
    <org></org>
</internet> 

We can evaluate the XPath expression "//html/head/title//text()" to get the titles of all pages on the Internet. Unfortunately, standard XPath tools can only query an actual XML document, not a virtual one. Reimplementing XPath to work on a virtual document is a big project and is not the intent of this article.

The following sections make XPath work in a simpler way: load all of a website's pages, translate them to standard XML, and integrate them into a single document.

Using XPath in Java  

First, we need to know how to use XPath. It is simple in Java: create an XPath instance, then call its 'evaluate' method.

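// Needs imports from javax.xml.parsers, javax.xml.xpath and org.w3c.dom.
DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document doc = builder.parse("XMLOfInternet.xml");
 
XPathFactory factory = XPathFactory.newInstance();
XPath xpath = factory.newXPath();
 
System.out.println("--ALL Websites--");
// Select the text nodes under the <title> of every page.
NodeList nodes = (NodeList) xpath.evaluate(
        "//html/head/title//text()", doc, XPathConstants.NODESET);
for (int i = 0; i < nodes.getLength(); i++) {
    System.out.println(nodes.item(i).getNodeValue());
}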
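The fragment above depends on a file on disk. As a minimal, self-contained sketch of the same call sequence, the class below parses a small XML string instead. The class name XPathDemo, the helper select, and the sample data are assumptions made for illustration and are not part of the attached source.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathDemo {
    // Hypothetical stand-in for XMLOfInternet.xml.
    static final String SAMPLE =
          "<internet><com><website name=\"booksite\"><www port=\"80\">"
        + "<html name=\"index\"><head><title>Book Store</title></head><body/></html>"
        + "<html name=\"booklist/isbn777\"><head><title>CSharp Guide</title></head><body/></html>"
        + "</www></website></com></internet>";

    // Parses the XML text and returns the value of every node matched by exp,
    // in document order.
    static List<String> select(String xml, String exp) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(exp, doc, XPathConstants.NODESET);
        List<String> values = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            values.add(nodes.item(i).getNodeValue());
        }
        return values;
    }

    public static void main(String[] args) throws Exception {
        // Prints "Book Store" and then "CSharp Guide".
        for (String title : select(SAMPLE, "//html/head/title//text()")) {
            System.out.println(title);
        }
    }
}
```

Returning a list rather than printing inside the helper keeps the XPath call easy to reuse for other expressions.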

Evaluating Expression on Internet   

In this article we don't want to reimplement XPath so that it can traverse the Internet. The other way to make XPath work is to load all of the Internet's HTML pages, then translate and integrate them into a single XML document.

String exp = "//website[@name='booksite']/www/html[starts-with(@name,'booklist')]/head/title//text()";
queryInternet(exp);

public static void queryInternet(String exp) {
    Website[] wbs = Website.load(new String[] { "candysite", "booksite" });
    String xml = makeXML(wbs);
    System.out.println( "NoIndex XML Length=" + xml.length() );
    query(xml, exp);
}

public static String makeXML(Website[] wbs) {
    StringBuilder sb = new StringBuilder("<internet><com>");
    for (Website wb : wbs) {
        sb.append("<website name=\"" + wb.name + "\"><www port=\"80\">");
        for (Page p : wb.pages) {
            sb.append("<html name=\"" + p.name + "\">");
            sb.append(HTMLtoXML(p.context));
            sb.append("</html>");
        }
        sb.append("</www></website>");
    }
    sb.append("</com></internet>");
    return sb.toString();
}

public static class Website {
    public Website(String _name) {
        this.name = _name;
    }
 
    public String name;
    public ArrayList<Page> pages = new ArrayList<Page>();
}

public static class Page {
    public Page() {
    }
 
    public Page(String _name, String _context) {
        this.name = _name;
        this.context = _context;
    }
 
    public String name;
    public String context;
}
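makeXML above calls HTMLtoXML, which is not listed in this article. One possible shape for it, sketched here under a strong assumption, is to parse the page as XML and serialize the children of its root element. The assumption is that the page is already well-formed XHTML without a DOCTYPE; real-world tag soup would need an HTML-tolerant parser instead. The names HtmlToXmlSketch, htmlToXml, and escapeAttr are hypothetical. escapeAttr is included because makeXML concatenates page names straight into attribute values, which would break the XML if a name contained a quote or an ampersand.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class HtmlToXmlSketch {
    // Returns the markup of the element children of the root <html> element,
    // ready to be wrapped in <html name="...">...</html> by a caller.
    // Assumption: the input is well-formed XHTML; anything else throws.
    static String htmlToXml(String xhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        Transformer serializer = TransformerFactory.newInstance().newTransformer();
        serializer.setOutputProperty(OutputKeys.METHOD, "xml");
        serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        Node child = doc.getDocumentElement().getFirstChild();
        for (; child != null; child = child.getNextSibling()) {
            // Only element children (typically head and body) are kept.
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                serializer.transform(new DOMSource(child), new StreamResult(out));
            }
        }
        return out.toString();
    }

    // Escapes a value for use inside a double-quoted XML attribute.
    static String escapeAttr(String value) {
        return value.replace("&", "&amp;")
                    .replace("<", "&lt;")
                    .replace("\"", "&quot;");
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><head><title>Book Store</title></head>"
                    + "<body><div>Welcome</div></body></html>";
        System.out.println("<html name=\"" + escapeAttr("index") + "\">"
                + htmlToXml(page) + "</html>");
    }
}
```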

The XPath expression above, "//website[@name='booksite']/www/html[starts-with(@name,'booklist')]/head/title//text()", gets the book titles from the book website. It doesn't look like SQL at all, but it is understandable.
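One way to read the book-title expression is to map its parts onto the clauses of a SQL query, as the comments in the sketch below do. The sketch also runs the same filter through the XPath count() function to show that an expression can return a number as well as a node set. The class name BookTitleQuery, its helpers, and the sample data are assumptions made for illustration.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class BookTitleQuery {
    // Hypothetical sample data shaped like the <internet> document.
    static final String SAMPLE = "<internet><com>"
        + "<website name=\"booksite\"><www port=\"80\">"
        + "<html name=\"index\"><head><title>Book Store</title></head></html>"
        + "<html name=\"booklist/isbn777\"><head><title>CSharp Guide</title></head></html>"
        + "<html name=\"booklist/isbn888\"><head><title>Java Guide</title></head></html>"
        + "</www></website>"
        + "<website name=\"candysite\"><www port=\"80\">"
        + "<html name=\"candylist/candy001\"><head><title>Chocolate</title></head></html>"
        + "</www></website>"
        + "</com></internet>";

    // Roughly "FROM pages WHERE site = 'booksite' AND name LIKE 'booklist%'":
    // the path picks the elements to scan, the [...] predicates filter them.
    static final String PAGES =
        "//website[@name='booksite']/www/html[starts-with(@name,'booklist')]";

    // Roughly "SELECT title": the trailing steps pick what to return.
    static NodeList titles(String xml) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return (NodeList) xpath.evaluate(PAGES + "/head/title//text()",
                new InputSource(new StringReader(xml)), XPathConstants.NODESET);
    }

    // Roughly "SELECT COUNT(*)": count() makes the expression return a number.
    static int pageCount(String xml) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        Double n = (Double) xpath.evaluate("count(" + PAGES + ")",
                new InputSource(new StringReader(xml)), XPathConstants.NUMBER);
        return n.intValue();
    }

    public static void main(String[] args) throws Exception {
        NodeList found = titles(SAMPLE);
        for (int i = 0; i < found.getLength(); i++) {
            System.out.println(found.item(i).getNodeValue());
        }
        System.out.println("pages=" + pageCount(SAMPLE));
    }
}
```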

Making Indexes   

Loading the Internet each time is expensive, so we need local or cloud-hosted indexes to help XPath evaluation. In this sample I used an embedded database, iBoxDB, to build the indexes. If the expression matches a known condition, we build the XML for XPath from the local index; otherwise we load the pages from the Internet.

public static void indexedQueryInternet(String exp) {
    if (exp.contains("website[@name='booksite']")
            && exp.contains("html[starts-with(@name,'booklist')")) {
        Website[] wbs = Website.loadFromIndex("booksite", "booklist");
        String xml = makeXML(wbs);
        System.out.println("Indexed XML Length=" + xml.length());
        query(xml, exp);
    } else {
        queryInternet(exp);
    }
}

private static iBoxDB.LocalServer.AutoBox indexes;
public static Website[] loadFromIndex(String webName, String path) {
    makeIndex();
    Website wb = new Website(webName);
    for (Page p : indexes.select(Page.class,
            "from pages where webName=? & name>?", webName, path)) {
        wb.pages.add(p);
    }
    return new Website[] { wb };
}

private static void makeIndex() {
    if (indexes == null) {
        DB db = new DB();
        // tableName= "pages"
        // tableKey= "webName" + "name";
        db.ensureTable(FullPage.class, "pages", "webName", "name");
        indexes = db.open();
        if (indexes.selectCount("from pages") < 1) {
            Website[] wbs = Website.load(new String[] { "candysite",
                    "booksite" });
            for (Website wb : wbs) {
                for (Page p : wb.pages) {
                    indexes.insert("pages", new FullPage(wb.name,
                            p.name, p.context));
                }
            }
        }
    }
}

public static class FullPage extends Page {
    public FullPage() {
    }
 
    public FullPage(String _webName, String _name, String _context) {
        super(_name, _context);
        this.webName = _webName;
    }
 
    public String webName;
} 

The results of the procedures above are shown below (see the attached file for details).

--ALL Websites--
Book Store
CSharp Guide
Java Guide
Candy Store
Chocolate
--Book Website--
CSharp Guide
Java Guide
NoIndex XML Length=585
--Results--
CSharp Guide
Java Guide
Indexed XML Length=361
--Results--
Java Guide
CSharp Guide  
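The lookup in loadFromIndex passes 'booklist' to the condition name>?. If that is an ordinary lexicographic comparison, as the syntax suggests, it returns every page name that sorts after 'booklist', including 'index'. This is harmless, because the XPath expression still applies starts-with to the generated XML, but the XML is larger than it needs to be. The sketch below is not iBoxDB code; it uses a hypothetical in-memory index built on java.util.TreeMap to show the difference between that condition and a true prefix range.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class PrefixIndexSketch {
    // Site name -> (page name -> page content), with page names kept sorted.
    private final Map<String, TreeMap<String, String>> sites = new HashMap<>();

    void put(String webName, String pageName, String content) {
        sites.computeIfAbsent(webName, k -> new TreeMap<>()).put(pageName, content);
    }

    // Mirrors a "webName = ? and name > ?" lookup: every page name that sorts
    // strictly after the bound, whether or not it starts with it.
    List<String> namesGreaterThan(String webName, String bound) {
        TreeMap<String, String> pages = sites.getOrDefault(webName, new TreeMap<>());
        return new ArrayList<>(pages.tailMap(bound, false).keySet());
    }

    // Tighter range: only names that start with the prefix. The upper bound
    // prefix + '\uffff' assumes no page name contains that character.
    List<String> namesWithPrefix(String webName, String prefix) {
        TreeMap<String, String> pages = sites.getOrDefault(webName, new TreeMap<>());
        return new ArrayList<>(
                pages.subMap(prefix, true, prefix + '\uffff', false).keySet());
    }

    public static void main(String[] args) {
        PrefixIndexSketch index = new PrefixIndexSketch();
        index.put("booksite", "index", "");
        index.put("booksite", "booklist/isbn777", "");
        index.put("booksite", "booklist/isbn888", "");
        // Prints [booklist/isbn777, booklist/isbn888, index]
        System.out.println(index.namesGreaterThan("booksite", "booklist"));
        // Prints [booklist/isbn777, booklist/isbn888]
        System.out.println(index.namesWithPrefix("booksite", "booklist"));
    }
}
```

Either range works as a pre-filter as long as the XPath expression is still evaluated afterwards; the prefix range simply hands it less XML to scan.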

Points of Interest  

This article showed one case of what XPath is capable of. Much of the world's data is unstructured, or structured in ways we don't yet understand, so we will need more and more NoSQL techniques to make the world computable, and XPath is one of the choices. The demonstrated code is short and simple because HTML is so close to XML that there is not much to do. Using XPath to query the data of several websites is easy, but querying the whole Internet would take a huge effort. Still, SQL is not well suited to querying the Internet, and XPath is the best way we can find.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Bruce Yang cp
Systems Engineer
China
Mixing technologies for fun.

Last Updated 10 Dec 2013
Article Copyright 2013 by Bruce Yang cp