
Using XPath to Query the Internet

10 Dec 2013 | CPOL | 2 min read
A way to query data from the Internet

 Introduction  

For years, people have used HTML to publish data, and the Internet has become the world's largest data source. Unlike a database, which has SQL, the Internet has no query language. Fortunately, HTML resembles XML: XHTML pages are already XML, and other pages can usually be converted to XML with a clean-up step. XML has a query language, XPath, so if we treat the Internet as one very large XML document, we can query it and extract whatever we want. Such a view of the Internet might look like this:

XML
<internet>
    <com>
        <website name="booksite">
            <www port="80">
                <html name="index">
                    <!-- html context -->
                    <head>
                        <title>Book Store</title>
                    </head>
                    <body>
                        <div>Welcome</div>
                    </body>
                </html>
                <html name="booklist/isbn777">
                    <head>
                        <title>CSharp Guide</title>
                    </head>
                    <body></body>
                </html>
                <html name="booklist/isbn888">
                    <head>
                        <title>Java Guide</title>
                    </head>
                    <body></body>
                </html>
            </www>
            <ftp></ftp>
        </website>
        <website name="candysite">
            <www port="80">
                <html name="index">
                    <head>
                        <title>Candy Store</title>
                    </head>
                    <body></body>
                </html>
                <html name="candylist/candy001">
                    <head>
                        <title>Chocolate</title>
                    </head>
                    <body></body>
                </html>
            </www>
        </website>
    </com>
    <net></net>
    <org></org>
</internet> 

Evaluating the XPath expression "//html/head/title//text()" against this document returns the title of every page on the Internet. Unfortunately, standard XPath tools can only query an actual XML document, not a virtual one. Reimplementing XPath to work on a virtual document is a big project and beyond the scope of this article.

The following sections take a simpler route to make XPath work: load all the pages of each website, translate them to standard XML, and merge them into a single document.
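The translation step is not shown in the article, so here is a minimal sketch of what it could look like. It assumes the page is already well-formed XHTML; the class name `HtmlToXmlSketch` and the method `htmlToXml` are hypothetical and only illustrate the idea of returning the children of the root `<html>` element so they can be wrapped in a new `<html name="...">` element. Real-world HTML that is not well-formed would make the parser throw, and would need an HTML clean-up library first.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class HtmlToXmlSketch {

    // Hypothetical helper: serializes the element children of the root <html>
    // element (typically <head> and <body>) as XML text.
    // Assumption: the input is well-formed XHTML; otherwise parse() throws.
    public static String htmlToXml(String page) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(page)));

        // Identity transformer used purely as a serializer.
        Transformer serializer = TransformerFactory.newInstance().newTransformer();
        serializer.setOutputProperty(OutputKeys.METHOD, "xml");
        serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        StringWriter out = new StringWriter();
        for (Node child = doc.getDocumentElement().getFirstChild();
                child != null; child = child.getNextSibling()) {
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                serializer.transform(new DOMSource(child), new StreamResult(out));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><head><title>Book Store</title></head>"
                + "<body><div>Welcome</div></body></html>";
        System.out.println(htmlToXml(page));
    }
}
```

Returning only the inner elements keeps the outer `<html>` tag free for the `name` attribute used in the combined document.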

Using XPath in Java  

First, we need to know how to use XPath. It is simple in Java: create an XPath instance and call its 'evaluate' method.

Java
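import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Inside a method that declares "throws Exception":
DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document doc = builder.parse("XMLOfInternet.xml");

XPathFactory factory = XPathFactory.newInstance();
XPath xpath = factory.newXPath();

System.out.println("--ALL Websites--");
// Select every text node below a page title, in document order.
NodeList nodes = (NodeList) xpath.evaluate(
        "//html/head/title//text()", doc, XPathConstants.NODESET);
for (int i = 0; i < nodes.getLength(); i++) {
    System.out.println(nodes.item(i).getNodeValue());
}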
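Besides node sets, 'evaluate' can also return a number or a string, which is handy for quick checks on the combined document. The sketch below is an illustration only: the class `XPathResultTypesSketch`, its helper methods, and the inlined sample document are assumptions made for the example, not part of the article's attached code.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class XPathResultTypesSketch {

    // Cut-down copy of the article's <internet> document, inlined for the demo.
    static final String XML =
          "<internet><com>"
        + "<website name=\"booksite\"><www port=\"80\">"
        + "<html name=\"index\"><head><title>Book Store</title></head><body/></html>"
        + "<html name=\"booklist/isbn777\"><head><title>CSharp Guide</title></head><body/></html>"
        + "</www></website>"
        + "<website name=\"candysite\"><www port=\"80\">"
        + "<html name=\"index\"><head><title>Candy Store</title></head><body/></html>"
        + "</www></website>"
        + "</com></internet>";

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Number of <html> pages in the document (XPath numbers come back as Double).
    public static int countPages(String xml) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        Double n = (Double) xpath.evaluate("count(//html)", parse(xml), XPathConstants.NUMBER);
        return n.intValue();
    }

    // Title of the first page of the named site, or "" when nothing matches.
    // Assumption: 'site' contains no double quote, since it is pasted into the expression.
    public static String firstTitle(String xml, String site) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        String exp = "string(//website[@name=\"" + site + "\"]//title)";
        return xpath.evaluate(exp, parse(xml)); // two-argument form returns a String
    }

    public static void main(String[] args) throws Exception {
        System.out.println(countPages(XML));
        System.out.println(firstTitle(XML, "candysite"));
    }
}
```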

Evaluating an Expression on the Internet

In this article we do not reimplement XPath so that it can traverse the Internet. Instead, we load the HTML pages, translate them to XML, and merge them into a single XML document that XPath can query.

Java
String exp = "//website[@name='booksite']/www/html[starts-with(@name,'booklist')]/head/title//text()";
queryInternet(exp);

public static void queryInternet(String exp) {
    Website[] wbs = Website.load(new String[] { "candysite", "booksite" });
    String xml = makeXML(wbs);
    System.out.println( "NoIndex XML Length=" + xml.length() );
    query(xml, exp);
}

public static String makeXML(Website[] wbs) {
    StringBuilder sb = new StringBuilder("<internet><com>");
    for (Website wb : wbs) {
        sb.append("<website name=\"" + wb.name + "\"><www port=\"80\">");
        for (Page p : wb.pages) {
            sb.append("<html name=\"" + p.name + "\">");
            sb.append(HTMLtoXML(p.context));
            sb.append("</html>");
        }
        sb.append("</www></website>");
    }
    sb.append("</com></internet>");
    return sb.toString();
}

public static class Website {
    public Website(String _name) {
        this.name = _name;
    }
 
    public String name;
    public ArrayList<Page> pages = new ArrayList<Page>();
}

public static class Page {
    public Page() {
    }
 
    public Page(String _name, String _context) {
        this.name = _name;
        this.context = _context;
    }
 
    public String name;
    public String context;
}

The XPath expression above, "//website[@name='booksite']/www/html[starts-with(@name,'booklist')]/head/title//text()", selects the titles of the books on the book website. It looks nothing like SQL, but it is still easy to read.
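The code above calls a `query(xml, exp)` helper that is not listed in the article. A minimal sketch of such a helper, under the assumption that it simply parses the XML string and collects the matching text nodes, could look like this. The class name `XPathQuerySketch` and the decision to return a list instead of printing are choices made for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathQuerySketch {

    // Cut-down copy of the combined document that makeXML would produce.
    static final String XML =
          "<internet><com>"
        + "<website name=\"booksite\"><www port=\"80\">"
        + "<html name=\"index\"><head><title>Book Store</title></head><body/></html>"
        + "<html name=\"booklist/isbn777\"><head><title>CSharp Guide</title></head><body/></html>"
        + "<html name=\"booklist/isbn888\"><head><title>Java Guide</title></head><body/></html>"
        + "</www></website>"
        + "<website name=\"candysite\"><www port=\"80\">"
        + "<html name=\"candylist/candy001\"><head><title>Chocolate</title></head><body/></html>"
        + "</www></website>"
        + "</com></internet>";

    // Hypothetical stand-in for the article's query(xml, exp):
    // evaluates exp against the XML text and returns the node values in document order.
    public static List<String> query(String xml, String exp) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(exp, doc, XPathConstants.NODESET);

        List<String> values = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            values.add(nodes.item(i).getNodeValue());
        }
        return values;
    }

    public static void main(String[] args) throws Exception {
        String exp = "//website[@name='booksite']/www/html"
                + "[starts-with(@name,'booklist')]/head/title//text()";
        for (String title : query(XML, exp)) {
            System.out.println(title);
        }
    }
}
```

The two predicates do the filtering: `@name='booksite'` picks the site and `starts-with(@name,'booklist')` keeps only the pages under the book list, so the index page and the candy site are skipped.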

Making Indexes   

Loading the Internet for every query is expensive, so we need local or cloud-hosted indexes to support XPath evaluation. This sample uses the embedded database iBoxDB to build the indexes. If the expression matches a condition that the index covers, we build the XML for XPath from the local index; otherwise we load the pages from the Internet.

Java
public static void indexedQueryInternet(String exp) {
    if (exp.contains("website[@name='booksite']")
            && exp.contains("html[starts-with(@name,'booklist')")) {
        Website[] wbs = Website.loadFromIndex("booksite", "booklist");
        String xml = makeXML(wbs);
        System.out.println("Indexed XML Length=" + xml.length());
        query(xml, exp);
    } else {
        queryInternet(exp);
    }
}

private static iBoxDB.LocalServer.AutoBox indexes;
public static Website[] loadFromIndex(String webName, String path) {
    makeIndex();
    Website wb = new Website(webName);
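    // Note: "name>?" reads as a range filter rather than a prefix test;
    // the XPath starts-with() predicate still does the exact match later.
    for (Page p : indexes.select(Page.class,
            "from pages where webName=? & name>?", webName, path)) {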
        wb.pages.add(p);
    }
    return new Website[] { wb };
}

private static void makeIndex() {
    if (indexes == null) {
        DB db = new DB();
        // tableName= "pages"
        // tableKey= "webName" + "name";
        db.ensureTable(FullPage.class, "pages", "webName", "name");
        indexes = db.open();
        if (indexes.selectCount("from pages") < 1) {
            Website[] wbs = Website.load(new String[] { "candysite",
                    "booksite" });
            for (Website wb : wbs) {
                for (Page p : wb.pages) {
                    indexes.insert("pages", new FullPage(wb.name,
                            p.name, p.context));
                }
            }
        }
    }
}

public static class FullPage extends Page {
    public FullPage() {
    }
 
    public FullPage(String _webName, String _name, String _context) {
        super(_name, _context);
        this.webName = _webName;
    }
 
    public String webName;
} 
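The `indexedQueryInternet` method above recognizes exactly one hard-coded site and path. One way to generalize the check, sketched here as an assumption rather than as the author's approach, is to pull the site name and the path prefix out of the expression with regular expressions. The class `IndexHintSketch` and the method `indexHint` are hypothetical names.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class IndexHintSketch {

    // Matches website[@name='...'] and captures the site name.
    private static final Pattern SITE =
            Pattern.compile("website\\[@name='([^']*)'\\]");

    // Matches html[starts-with(@name,'...')] and captures the path prefix.
    private static final Pattern PREFIX =
            Pattern.compile("html\\[starts-with\\(@name,\\s*'([^']*)'\\)\\]");

    // Returns {siteName, pathPrefix} when the expression contains both predicates,
    // or an empty Optional when the index cannot help and the pages must be loaded.
    public static Optional<String[]> indexHint(String exp) {
        Matcher site = SITE.matcher(exp);
        Matcher prefix = PREFIX.matcher(exp);
        if (site.find() && prefix.find()) {
            return Optional.of(new String[] { site.group(1), prefix.group(1) });
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String exp = "//website[@name='booksite']/www/html"
                + "[starts-with(@name,'booklist')]/head/title//text()";
        indexHint(exp).ifPresent(h -> System.out.println(h[0] + " / " + h[1]));
    }
}
```

With a hint in hand, the caller could pass both values to `loadFromIndex`; without one, it would fall back to `queryInternet`. This is pattern matching on the expression text, not a real XPath parser, so it only covers expressions written in this exact shape.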

The output of the sample program is shown below (see the attached file for details):

--ALL Websites--
Book Store
CSharp Guide
Java Guide
Candy Store
Chocolate
--Book Website--
CSharp Guide
Java Guide
NoIndex XML Length=585
--Results--
CSharp Guide
Java Guide
Indexed XML Length=361
--Results--
Java Guide
CSharp Guide  
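The indexed XML is shorter than the unindexed one because only part of the data is loaded. To illustrate how a sorted index can answer a "name starts with" lookup without scanning every page, here is a small in-memory sketch. It uses a `TreeMap` from the Java standard library in place of iBoxDB, so it does not reflect how iBoxDB works internally; the class `PrefixIndexSketch` and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PrefixIndexSketch {

    // One sorted map of pageName -> content per website.
    private final Map<String, TreeMap<String, String>> sites = new HashMap<>();

    public void put(String webName, String pageName, String content) {
        sites.computeIfAbsent(webName, k -> new TreeMap<>()).put(pageName, content);
    }

    // Returns the page names of one site that start with the given prefix.
    // The range [prefix, prefix + '\uffff') covers every such name, assuming
    // no page name has the character U+FFFF right after the prefix.
    public List<String> namesWithPrefix(String webName, String prefix) {
        TreeMap<String, String> pages = sites.get(webName);
        if (pages == null) {
            return new ArrayList<>();
        }
        String upperBound = prefix + Character.MAX_VALUE;
        return new ArrayList<>(pages.subMap(prefix, true, upperBound, false).keySet());
    }

    public static void main(String[] args) {
        PrefixIndexSketch index = new PrefixIndexSketch();
        index.put("booksite", "index", "<head/>");
        index.put("booksite", "booklist/isbn777", "<head/>");
        index.put("booksite", "booklist/isbn888", "<head/>");
        index.put("candysite", "candylist/candy001", "<head/>");
        System.out.println(index.namesWithPrefix("booksite", "booklist"));
    }
}
```

Because the range has both a lower and an upper bound, the "index" page is excluded here, whereas a lower bound alone would also return names that sort after the prefix.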

Points of Interest  

This article shows one thing XPath is capable of. Much of the world's data is unstructured, or structured in ways we do not yet understand, so we will need more and more NoSQL techniques to make it computable, and XPath is one of the options. The demonstration code is short and simple because well-formed HTML is close to XML, so little conversion work is needed. Using XPath to query the data of several websites is easy, but querying the entire Internet would take a huge effort. In any case, SQL is not suited to querying the Internet, and XPath is the best approach we have found.


License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Database Developer
China
Tech for fun.

Comments and Discussions

 
Question: Sorry, but you're wrong about HTML being just another XML. (Pete O'Hanlon, 9-Dec-13 0:41)
General: interesting idea... (Paolo Foti, 9-Dec-13 0:03)
