Typical web scraping job

23 Jan 2013
In this post I will clarify what I do by walking through a simple web scraping job I worked on.

A few months back a client asked me for a quote to get demographic data for every county and city in the US. I first checked around for an existing data set but did not find one, so I would need to scrape it from the official census website. I spent some time getting to know this website and found it followed a simple hierarchy, with navigation performed by selecting options from select boxes:

Overview page / state pages / county pages | city pages

I viewed the source of these webpages and found the content I was after embedded directly in the HTML, which meant it was defined statically rather than being loaded dynamically with JavaScript or AJAX. This would make scraping it more straightforward.

I emailed the client that the census website was small and easily navigable, and that I would be able to provide a CSV file of the data within 3 days. I would be willing to do this for US $200, with half deposited beforehand (by PayPal) and the remainder paid after they were satisfied with the results. The client was happy with this arrangement, so it was time to get started.

The first step was to get all the state page URLs from the select box. I could hardcode these URLs but I don't like grunt work, so I constructed a regular expression to extract them automatically.
This expression can also be used to extract all the county and city URLs from their respective select boxes, so now I have access to all the required URLs.
(Note that using regular expressions is generally a bad approach to web scraping, which I will expand on in a future post.)
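To make this concrete, here is a minimal sketch of the idea in Python. The markup and URLs below are hypothetical stand-ins for the census site's select boxes, not the actual pages:

```python
import re

# Hypothetical, simplified version of the kind of select box the
# census site used for navigation (the real option values differed).
html = """
<select name="state">
  <option value="/states/alabama.html">Alabama</option>
  <option value="/states/alaska.html">Alaska</option>
</select>
"""

# Pull every option value out of the select box
urls = re.findall(r'<option value="([^"]+)"', html)
print(urls)
```

The same pattern applied to the county and city pages yields their URLs too, since the site used the same select box markup throughout.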

There are many factors involved in crawling that I will leave to future posts and will now jump ahead to after I have downloaded the HTML and am ready to scrape data from it.

If you have a look at a sample census page, you will see that there is a lot of data. It would be too cumbersome to craft regular expressions that extracted each of these fields in a structured way, so I used XPaths instead. (I will also leave a detailed coverage of XPaths for a future post, but essentially they are a convenient method for selecting HTML nodes.)
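As a rough illustration of node selection, here is a sketch using the standard library's ElementTree, which supports a limited subset of XPath. The table markup is a hypothetical, simplified stand-in for a census data table; for the messy HTML of real pages a proper HTML parser such as lxml is the better tool:

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified census-style table (real pages are messier
# and would be parsed with lxml.html rather than ElementTree).
html = """
<table>
  <tr><td class="label">Population</td><td class="value">4779736</td></tr>
  <tr><td class="label">Households</td><td class="value">1883791</td></tr>
</table>
"""

tree = ET.fromstring(html)
data = {}
# './/tr' is an XPath expression selecting every table row
for row in tree.findall('.//tr'):
    cells = row.findall('td')
    data[cells[0].text] = int(cells[1].text)
print(data)
```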

Now I am on the home stretch. I combine these various parts together into a single script that iterates over the HTML pages, extracts the content with XPath, and writes out the result to a CSV file.
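The final writing step can be sketched like this with Python's csv module. The records below are hypothetical examples of what the extraction stage would produce, not the script's actual output:

```python
import csv
import io

# Hypothetical scraped records; in the real job each dict would come
# from applying the XPaths to one downloaded county or city page.
records = [
    {'name': 'Autauga County', 'state': 'Alabama', 'population': 54571},
    {'name': 'Baldwin County', 'state': 'Alabama', 'population': 182265},
]

# Write the records out as CSV (StringIO here; a file in practice)
output = io.StringIO()
writer = csv.DictWriter(output, fieldnames=['name', 'state', 'population'])
writer.writeheader()
writer.writerows(records)
print(output.getvalue())
```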



This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


About the Author

Richard Penman

Australia

Article Copyright 2013 by Richard Penman