Automatic web scraping

Richard Penman, 9 Jan 2013 CPOL

I have been interested in automatic approaches to web scraping for a few years now. During university I created the SiteScraper library, which used training cases to automatically scrape webpages. This approach was particularly useful for scraping a website periodically because the model could automatically adapt when the structure was updated but the content remained static.

However, this approach is not helpful for me these days because most of my work involves scraping a website as a one-off. It is quicker to just specify the XPaths required than to collect and test the training cases.
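To illustrate what specifying the XPaths by hand looks like, here is a minimal sketch using only the Python standard library. The page snippet, the field names, and the `scrape` helper are all hypothetical, and `xml.etree.ElementTree` only supports a limited XPath subset and requires well-formed markup, so real webpages would normally need a more lenient HTML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical listing page, kept well-formed so the stdlib XML parser accepts it.
PAGE = """
<html><body>
  <div class="listing">
    <h1>Blue Cafe</h1>
    <p class="description">Small cafe near the station.</p>
    <ul class="reviews"><li>Great coffee</li><li>Friendly staff</li></ul>
  </div>
</body></html>
"""

# Field name -> hand-written XPath (assumed names, for illustration only).
XPATHS = {
    "name": ".//div[@class='listing']/h1",
    "description": ".//p[@class='description']",
    "reviews": ".//ul[@class='reviews']/li",
}


def scrape(html, xpaths):
    """Return the text of every element matched by each XPath."""
    root = ET.fromstring(html.strip())
    return {
        field: [element.text for element in root.findall(xpath)]
        for field, xpath in xpaths.items()
    }


record = scrape(PAGE, XPATHS)
print(record)
```

For a one-off scrape, the whole "model" is the `XPATHS` dictionary, which is why writing it by hand tends to be faster than preparing training cases.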

I would still like an automated approach to help me work more efficiently. Ideally I would have a solution that, when given a website URL:

  • crawls the website
  • organizes the webpages into groups that share the same template (a directory page will have a different HTML structure than a listing page)
  • treats the group with the most webpages as the listings
  • compares these listing webpages to find what is static (the template) and what changes
  • extracts the parts that change, which represent dynamic data such as description, reviews, etc.
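The grouping and comparison steps above could be sketched roughly as follows. This is a toy illustration under strong assumptions, not an existing library: crawling is skipped (the pages are given as a hypothetical `url -> html` dictionary), two pages count as sharing a template only if their text nodes sit at exactly the same sequence of tag paths, and all function names are made up for this example. Real listing pages, where for instance the number of reviews varies, would need a fuzzier similarity measure.

```python
from collections import defaultdict
from html.parser import HTMLParser


class StructureParser(HTMLParser):
    """Collects (tag path, text) for every non-empty text node."""

    VOID = {"br", "img", "hr", "input", "meta", "link"}

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open tags
        self.nodes = []  # (tag path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append(("/".join(self.stack), text))


def parse(html):
    parser = StructureParser()
    parser.feed(html)
    parser.close()
    return parser.nodes


def group_by_template(pages):
    """Group URLs whose text nodes appear at the same sequence of tag paths."""
    groups = defaultdict(list)
    for url, html in pages.items():
        signature = tuple(path for path, _ in parse(html))
        groups[signature].append(url)
    return list(groups.values())


def split_static_dynamic(pages, urls):
    """Compare pages from one template group, node by node."""
    parsed = [parse(pages[url]) for url in urls]
    static, dynamic = [], []
    for column in zip(*parsed):
        path = column[0][0]
        texts = [text for _, text in column]
        if len(set(texts)) == 1:
            static.append((path, texts[0]))  # same on every page: template
        else:
            dynamic.append((path, texts))    # varies between pages: data
    return static, dynamic


# Hypothetical crawl result: one directory page and three listing pages.
pages = {
    "/": "<html><body><ul><li><a>Cafes</a></li><li><a>Bars</a></li></ul></body></html>",
    "/cafe/1": "<html><body><h1>Blue Cafe</h1><p>Phone:</p><span>555-0101</span></body></html>",
    "/cafe/2": "<html><body><h1>Red Cafe</h1><p>Phone:</p><span>555-0102</span></body></html>",
    "/cafe/3": "<html><body><h1>Green Cafe</h1><p>Phone:</p><span>555-0103</span></body></html>",
}

listings = max(group_by_template(pages), key=len)  # largest group = listings
static, dynamic = split_static_dynamic(pages, listings)
print("template:", static)
print("data:", dynamic)
```

In this toy data the label "Phone:" is classified as template, while the names and phone numbers are classified as dynamic data. Using the exact tag-path sequence as the grouping key keeps the sketch short, at the cost of splitting pages that differ only in the number of repeated elements.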

Apparently this process of scraping data automatically is known as wrapper induction in academia. Unfortunately there do not seem to be any good open source solutions yet. The most commonly referenced one is Templatemaker, which is aimed at small text blocks and crashes in my test cases of real webpages. The author stopped development in 2007.

Some commercial groups have developed their own solutions, so this is certainly technically possible.

If I do not find an open source solution I plan to attempt building my own later this year.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Richard Penman

Australia

Last Updated 9 Jan 2013
Article Copyright 2013 by Richard Penman