Technical Blog

The SiteScraper module

13 Jan 2013, CPOL
Automatically scraping website data based on example cases.

A few years ago I developed the sitescraper library for automatically scraping website data based on example cases:

>>> from sitescraper import sitescraper
>>> ss = sitescraper()
>>> url = 'http://www.amazon.com/s/ref=nb_ss_gw?url=search-alias%3Daps&field-keywords=python&x=0&y=0'
>>> data = ["Amazon.com: python", ["Learning Python, 3rd Edition",
...   "Programming in Python 3: A Complete Introduction to the Python Language",
...   "Python in a Nutshell, Second Edition (In a Nutshell (O'Reilly))"]]
>>> ss.add(url, data)
>>> # we can add multiple example cases,
>>> # but this is a simple example so one will do (I generally use 3)
>>> # ss.add(url2, data2)
>>> ss.scrape('http://www.amazon.com/s/ref=nb_ss_gw?url=search-alias%3Daps&field-keywords=linux&x=0&y=0')
["Amazon.com: linux", [
    "A Practical Guide to Linux(R) Commands, Editors, and Shell Programming", 
    "Linux Pocket Guide", 
    "Linux in a Nutshell (In a Nutshell (O'Reilly))", 
    'Practical Guide to Ubuntu Linux (Versions 8.10 and 8.04), A (2nd Edition)', 
    'Linux Bible, 2008 Edition'
]]

See this paper for more info.

It was designed for scraping websites repeatedly over time, where the layout may change between scrapes. Unfortunately, I don't use it much these days because most of my projects are one-off scrapes.
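As a rough illustration of what learning from example cases can look like, here is a minimal sketch. It is not how sitescraper is implemented and does not use its API: `PathTextParser`, `extract`, and `ExampleScraper` are hypothetical names, the input is raw HTML strings rather than URLs, and the examples are a flat list rather than the nested structure shown above. The idea is simply to record the tag path where each example string was found, then return whatever text sits at those same paths on a new page.

```python
from html.parser import HTMLParser


class PathTextParser(HTMLParser):
    """Collect (path, text) pairs, where path is the chain of open tags."""

    # Void elements never get a closing tag, so keep them off the stack.
    VOID = {"br", "hr", "img", "input", "link", "meta"}

    def __init__(self):
        super().__init__()
        self.stack = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.VOID:
            cls = dict(attrs).get("class")
            self.stack.append(f"{tag}.{cls}" if cls else tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag, if there is one.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i].split(".")[0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.items.append(("/".join(self.stack), text))


def extract(html):
    """Return every non-empty text node in the page with its tag path."""
    parser = PathTextParser()
    parser.feed(html)
    parser.close()
    return parser.items


class ExampleScraper:
    """Hypothetical stand-in for illustration only, not the sitescraper API."""

    def __init__(self):
        self.paths = []

    def add(self, html, examples):
        # Remember the path of every text node that matches an example string.
        found = extract(html)
        for wanted in examples:
            for path, text in found:
                if text == wanted and path not in self.paths:
                    self.paths.append(path)

    def scrape(self, html):
        # Return the text found at any learned path, in document order.
        return [text for path, text in extract(html) if path in self.paths]


# Made-up pages standing in for two search result pages with the same layout.
train_html = (
    "<html><head><title>Search: python</title></head><body><ul>"
    "<li class='hit'>Learning Python</li>"
    "<li class='hit'>Python in a Nutshell</li>"
    "</ul><p>Sponsored</p></body></html>"
)
new_html = (
    "<html><head><title>Search: linux</title></head><body><ul>"
    "<li class='hit'>Linux Pocket Guide</li>"
    "<li class='hit'>Linux in a Nutshell</li>"
    "</ul><p>Sponsored</p></body></html>"
)

scraper = ExampleScraper()
scraper.add(train_html, ["Search: python", "Learning Python"])
result = scraper.scrape(new_html)
print(result)  # ['Search: linux', 'Linux Pocket Guide', 'Linux in a Nutshell']
```

Note that one example title is enough here to pick up every result on the new page, because all results share the same path. Exact tag paths like these break as soon as the markup changes, so in this sketch the only way to cope with a new layout is to call `add` again with fresh examples.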

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Richard Penman

Australia

Last Updated 13 Jan 2013
Article Copyright 2013 by Richard Penman