
How to teach yourself web scraping

9 Jan 2013
How to learn about web scraping.

I often get asked how to learn about web scraping. Here is my advice.

First, learn a popular high-level scripting language. A higher-level language lets you work and test ideas faster. You don't need a more efficient compiled language like C, because the bottleneck in web scraping is bandwidth rather than code execution. Pick a popular language so that there is already a community of people working on similar problems whose work you can reuse. I use Python, but Ruby or Perl would also be good choices.

The following advice assumes you want to use Python for web scraping. If you have some programming experience, I recommend working through the Dive Into Python book:

Make sure you learn all the details of the urllib2 module. Here are some additional good resources:
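A minimal sketch of the kind of download urllib2 handles, under a few assumptions: the helper names `build_request` and `download`, the `example-scraper/0.1` User-Agent string, and the example.com URL are all placeholders for illustration. On Python 3 the urllib2 module was merged into `urllib.request`, so the import falls back accordingly.

```python
# urllib2 is the Python 2 module name; Python 3 merged it into urllib.request.
try:
    from urllib.request import Request, urlopen  # Python 3
except ImportError:
    from urllib2 import Request, urlopen  # Python 2, as used in this article


def build_request(url, user_agent="example-scraper/0.1"):
    """Return a Request carrying a custom User-Agent header.

    The default user agent is a made-up placeholder.
    """
    return Request(url, headers={"User-Agent": user_agent})


def download(url, timeout=10):
    """Fetch a URL and return the raw response body as bytes."""
    response = urlopen(build_request(url), timeout=timeout)
    try:
        return response.read()
    finally:
        response.close()


# Usage (placeholder URL; substitute a page you are allowed to scrape):
# html = download("http://example.com/")
```

Setting the User-Agent explicitly is a design choice rather than a requirement: some sites respond differently to the library's default identifier.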

Learn about the HTTP protocol, which is how you will interact with websites.
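To make the protocol concrete, here is a rough sketch of what a request and a response look like as plain text. The function names and the sample response are invented for illustration, and the parser ignores details a real client must handle (chunked encoding, repeated headers, and so on).

```python
def build_get_request(host, path="/"):
    """Build the text of a bare HTTP/1.1 GET request."""
    lines = [
        "GET {0} HTTP/1.1".format(path),  # request line: method, path, version
        "Host: {0}".format(host),         # the one header HTTP/1.1 requires
        "Connection: close",
        "",                               # blank line ends the headers
        "",
    ]
    return "\r\n".join(lines)


def parse_response(raw):
    """Split a raw response into (status code, headers dict, body)."""
    head, _, body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    _version, status, _reason = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return int(status), headers, body


# A made-up response of the kind a server might send back.
SAMPLE_RESPONSE = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html\r\n"
    "Content-Length: 13\r\n"
    "\r\n"
    "<html></html>"
)

status, headers, body = parse_response(SAMPLE_RESPONSE)
print(status, headers["content-type"], body)
```

In practice a library such as urllib2 builds and parses these messages for you; the point of the sketch is only to show the status line, the headers, and the blank line that separates them from the body.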

Learn about regular expressions:
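As a small illustration, the sketch below pulls link targets out of an HTML fragment with Python's `re` module. The pattern, the `extract_links` name, and the sample markup are assumptions made up for this example; a pattern this simple will miss unquoted attributes and other variations found on real pages.

```python
import re

# Match the href value of an <a> tag, quoted with either " or '.
LINK_RE = re.compile(r'<a\s+[^>]*href=["\']([^"\']+)["\']', re.IGNORECASE)


def extract_links(html):
    """Return every href value found in <a> tags, in document order."""
    return LINK_RE.findall(html)


# Made-up sample markup.
SAMPLE_HTML = """
<ul>
  <li><a href="/page/1">First</a></li>
  <li><a class="next" href='/page/2'>Second</a></li>
</ul>
"""

print(extract_links(SAMPLE_HTML))
```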

Learn about XPath:
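For a first taste of XPath, the sketch below uses the standard library's `xml.etree.ElementTree`, which supports only a limited subset of XPath and requires well-formed markup. The sample document, class names, and `extract_products` helper are invented for illustration; real pages are rarely well-formed, so a more forgiving parser is usually needed.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed sample markup.
SAMPLE = """
<html><body>
  <div class="product"><span class="name">Widget</span><span class="price">9.99</span></div>
  <div class="product"><span class="name">Gadget</span><span class="price">19.50</span></div>
  <div class="advert"><span class="price">0.00</span></div>
</body></html>
"""


def extract_products(markup):
    """Return (name, price) pairs for every div with class 'product'."""
    root = ET.fromstring(markup.strip())
    products = []
    # ".//div[@class='product']" selects matching divs at any depth.
    for div in root.findall(".//div[@class='product']"):
        name = div.find("span[@class='name']").text
        price = float(div.find("span[@class='price']").text)
        products.append((name, price))
    return products


print(extract_products(SAMPLE))
```

The attribute predicate is what keeps the advert block out of the results, which is harder to express reliably with a regular expression.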

If necessary learn about JavaScript:

These Firefox extensions can make web scraping easier:

Some libraries that can make web scraping easier:

Some other resources:

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Richard Penman

Australia

Last Updated 9 Jan 2013
Article Copyright 2013 by Richard Penman