Mine Your Website's Data With a Private Custom Crawler

30 Aug 2012
Web pages provide a plethora of information and mineable data. Unfortunately, most of them are written in classic HTML rather than XML-based XHTML, so a strict XML parser cannot read them. Therefore, we decided to extend the ANKHOR XML parser to accept most HTML content.

With this extension, it is now quite simple to extract, for example, all <img> references from a web page and convert them into a table.

[Figure: httpxmlfilter]
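
Outside ANKHOR, the same idea can be sketched in a few lines of Python. This is only an illustration and not the ANKHOR operator graph: it assumes the standard library's lenient `html.parser` as a stand-in for the extended XML parser, and the names `ImgCollector` and `img_table` are made up for this example.

```python
from html.parser import HTMLParser


class ImgCollector(HTMLParser):
    """Collect one row per <img> tag; tolerates classic, non-XHTML markup."""

    def __init__(self):
        super().__init__()
        self.rows = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "img":
            attributes = dict(attrs)
            self.rows.append({"src": attributes.get("src"),
                              "alt": attributes.get("alt")})


def img_table(html_text):
    """Return the <img> references of a page as a list of row dictionaries."""
    parser = ImgCollector()
    parser.feed(html_text)
    parser.close()
    return parser.rows
```

Each row holds the `src` and `alt` attributes of one image, with `None` where an attribute is missing, so the result can be handed to any table or data frame library.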

I have created a simple web crawler for testing purposes that walks through all reachable documents on a given domain, starting at the root. It uses a while loop in which each iteration handles one access depth, i.e., the number of links followed from the root. A HEAD request is executed in parallel for all resources that are reachable at this level and have not been accessed in an earlier iteration.

Resources with type text/html are fetched and all references such as links, images or scripts extracted and merged into a list. The list of already visited resources is then removed from this list of references to form the list for the next iteration. The loop ends when no new resources are found. Only references to the same domain are followed to avoid crawling over foreign territory.
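
The loop described above could look roughly like the following Python sketch. It is not the ANKHOR implementation: `head` and `get` are hypothetical caller-supplied functions (so the example needs no network access), `head` is assumed to return a bare MIME type without a charset suffix, and the HEAD requests run sequentially here rather than in parallel.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class RefCollector(HTMLParser):
    """Collect href/src values from any tag (links, images, scripts)."""

    def __init__(self):
        super().__init__()
        self.refs = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.refs.append(value)


def crawl(root, head, get):
    """Walk one domain level by level, starting at root.

    head(url) -> (status, mime_type) and get(url) -> HTML text are
    assumptions of this sketch, supplied by the caller.
    Returns {url: {"status", "type", "depth", "out"}}.
    """
    domain = urlparse(root).netloc
    visited = {}
    frontier = [root]
    depth = 0
    while frontier:                       # ends when no new resources are found
        found = []
        for url in frontier:              # the article runs these HEADs in parallel
            status, mime_type = head(url)
            out = []
            if status == 200 and mime_type == "text/html":
                parser = RefCollector()
                parser.feed(get(url))
                parser.close()
                for ref in parser.refs:
                    target = urldefrag(urljoin(url, ref)).url
                    # Follow only references that stay on the same domain.
                    if urlparse(target).netloc == domain and target not in out:
                        out.append(target)
            visited[url] = {"status": status, "type": mime_type,
                            "depth": depth, "out": out}
            found.extend(out)
        # Remove everything already visited to form the next level.
        frontier = [u for u in dict.fromkeys(found) if u not in visited]
        depth += 1
    return visited
```

Passing the two fetch functions in keeps the traversal logic separate from the HTTP client, which also makes it easy to try the loop against a small in-memory site before pointing it at a real domain.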

The result is a table of all referenced resources of the domain.

[Figure: resourcetable1]

A reverse match is then executed to find all inbound references, which are later used to build a scatter plot of inbound and outbound connections. This is enough information to build four simple charts: three bar charts with the number of resources by status code, depth, and resource type, and the scatter plot of inbound and outbound connections.
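
As a rough illustration of the reverse match and the chart inputs, here is a Python sketch. The table layout is an assumption made for this example: a dictionary keyed by URL whose values carry `status`, `type`, `depth`, and `out` (the list of outbound references). The function name `summarize` is likewise hypothetical.

```python
from collections import Counter


def summarize(resources):
    """Derive the data behind the three bar charts and the scatter plot.

    resources: {url: {"status", "type", "depth", "out": [urls]}}
    (an assumed layout, not ANKHOR's table format).
    """
    # Reverse match: for every target, list the sources that reference it.
    inbound = {url: [] for url in resources}
    for source, info in resources.items():
        for target in info["out"]:
            inbound.setdefault(target, []).append(source)

    by_status = Counter(info["status"] for info in resources.values())
    by_depth = Counter(info["depth"] for info in resources.values())
    by_type = Counter(info["type"] for info in resources.values())

    # One (inbound count, outbound count) point per resource.
    scatter = {url: (len(inbound[url]), len(info["out"]))
               for url, info in resources.items()}
    return by_status, by_depth, by_type, scatter, inbound
```

Keeping the full `inbound` lists, rather than only their lengths, is what makes it possible to look up which pages link to a given resource, for example one that returned a 404.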

Here, we have the charts for two different domains.

[Figures: chartsankhor, inoutboundhorizon]

All data elements have tooltips, so it is possible to see which pages are responsible for e.g. 404 status codes.

[Figure: tooltip]

Another way to visualize the domain is using a balloon tree to display the crawling pattern. Easily accessible pages are close to the centre, while hard-to-find resources are in the leaves. The type of resource is colour coded and all nodes have their URL as tooltips.

[Figures: crawlingankhor, crawlinghorizon]

The graph-building operator is composed of four macros. The “Edges” macro generates all internal edges as a two-column table. The “ColorCodes” macro selects the color for the nodes based on the MIME type of the resource. The “Graph” macro builds the graph image and the “Scale800” macro scales the resulting graph to a width of 800 pixels.

[Figure: GraphBuildGraph]
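
To make the “Edges” step concrete, here is a minimal Python sketch of a two-column edge table. It is an assumption about what the macro produces, not its actual definition: the input is taken to be a dictionary keyed by URL whose values hold an `out` list of referenced URLs, and `edge_table` is a name invented for this example.

```python
def edge_table(resources):
    """Return internal edges as (source, target) rows, one per reference.

    resources: {url: {"out": [urls], ...}} (assumed layout).
    """
    rows = []
    for source, info in resources.items():
        for target in info["out"]:
            # Keep only edges whose target is itself part of the crawled set.
            if target in resources:
                rows.append((source, target))
    return rows
```

A plain list of pairs like this is the form most graph layout tools accept as input.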

The “ColorCodes” macro shows a frequently used concept of selecting elements into groups based on identical (or similar) values in one column. In this case, I use the mime type as the group indicator.

[Figure: ColorCodesGraph]

  • Identify Groups: Reduce the list of mime types to the unique values and sort them alphabetically.
  • Build Group ID: Generate a unique identifier for each group. In this case, I select a specific color using a palette-creating macro that picks the given number of colors from a standard palette.
  • Group Membership: Find the group that each member of the incoming list belongs to. An alternative would be to use the optional second output of the "uniquerows" operator that returns the mapping from source rows to unified rows, but it would need an additional translation step due to the sorting.
  • Assign ID: Select the ID based on the determined group membership.
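
The four steps above can be sketched in Python as follows. This mirrors the grouping concept only: `palette` is a stand-in that spaces hues evenly (the standard palette used by the ANKHOR macro is not specified here), `color_codes` is a hypothetical name, and every MIME type is assumed to be a string.

```python
import colorsys


def palette(n):
    """Pick n evenly spaced hues as '#rrggbb' strings (a stand-in palette)."""
    colors = []
    for i in range(n):
        r, g, b = colorsys.hsv_to_rgb(i / max(n, 1), 0.6, 0.9)
        colors.append("#{:02x}{:02x}{:02x}".format(
            round(r * 255), round(g * 255), round(b * 255)))
    return colors


def color_codes(mime_types):
    """Map each MIME type in the list to the color of its group."""
    groups = sorted(set(mime_types))                # Identify Groups
    ids = palette(len(groups))                      # Build Group ID
    index = {group: i for i, group in enumerate(groups)}
    membership = [index[m] for m in mime_types]     # Group Membership
    return [ids[i] for i in membership]             # Assign ID
```

Because membership is looked up against the sorted group list, no extra translation step is needed between the grouping and the color assignment.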

While I don't think it is wise to share an easy-to-(mis)use web crawler with everyone, I would be more than happy to share it with anybody who has a compelling idea for more interesting mining targets on their own website.

Author

Dr. Ulrich Sigmund

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Software Developer (Senior)
Germany