In this post I will clarify what I do by walking through a simple web scraping job I worked on.
A few months back a client asked me for a quote to get demographic data for every county and city in the US. I first checked around for an existing data set but did not find one, so I would need to scrape it from the official census website. I spent some time getting to know this website and found it followed a simple hierarchy, with navigation performed through selecting options from select boxes:
Overview page → state pages → county pages / city pages
I emailed the client that the census website was small and easily navigable, and that I would be able to provide a CSV file of the data within 3 days. I was willing to do this for US $200, with half deposited beforehand (by PayPal) and the remainder after they were satisfied with the results. The client agreed to this arrangement, so it was time to get started.
The first step was to get all the state page URLs from the select box. I could hardcode these URLs but I don't like grunt work, so I constructed a regular expression to extract them automatically.
This expression can also be used to extract all the county and city URLs from their respective select boxes, so now I have access to all the required URLs.
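A regex for this job can be sketched as follows. The markup and URLs below are hypothetical stand-ins for the real census pages, but the idea is the same: pull the value attribute out of each option tag in a select box.

```python
import re

# Hypothetical snippet of a navigation select box; the real census
# markup and URLs will differ.
html = """
<select name="state">
  <option value="/states/alabama.html">Alabama</option>
  <option value="/states/alaska.html">Alaska</option>
</select>
"""

# Non-greedy match on the value attribute of each option tag.
urls = re.findall(r'<option value="(.*?)"', html)
print(urls)  # ['/states/alabama.html', '/states/alaska.html']
```

The same pattern works unchanged on the county and city select boxes, since they share the option-tag structure.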
(Note that using regular expressions is generally a bad approach to web scraping, which I will expand on in a future post.)
There are many factors involved in crawling that I will leave to future posts, so I will jump ahead to the point where I have downloaded the HTML and am ready to scrape data from it.
If you have a look at a sample census page, you will see that there is a lot of data. It would be too cumbersome to craft regular expressions that extracted each of these fields in a structured way, so I used XPaths instead. (I will also leave a detailed coverage of XPaths for a future post, but essentially an XPath is a convenient way to select HTML nodes.)
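To give a flavour of this, here is a minimal sketch using the subset of XPath supported by Python's standard library ElementTree (a full engine such as lxml would be used in practice). The table fragment and field names are invented for illustration, not taken from the actual census pages.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment of a county page; the real markup differs.
page = ET.fromstring("""
<table id="data">
  <tr><td class="label">Population</td><td class="value">10000</td></tr>
  <tr><td class="label">Median age</td><td class="value">35.4</td></tr>
</table>
""")

# Select the label and value cells by their class attribute.
labels = [td.text for td in page.findall('.//td[@class="label"]')]
values = [td.text for td in page.findall('.//td[@class="value"]')]
record = dict(zip(labels, values))
print(record)  # {'Population': '10000', 'Median age': '35.4'}
```

Each field is addressed by its position in the document structure rather than by a brittle text pattern, which is what makes XPath far more pleasant than regular expressions for this kind of structured extraction.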
Now I am on the home stretch. I combine these various parts together into a single script that iterates over the HTML pages, extracts the content with XPath, and writes out the results to a CSV file.
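The CSV-writing end of such a script can be sketched like this. The records and the `write_rows` helper are hypothetical stand-ins for whatever the XPath step actually produces.

```python
import csv

def write_rows(records, path):
    """Write a list of per-place dicts (one per county or city) to CSV."""
    fieldnames = list(records[0])  # column order taken from the first record
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(records)

# Invented example records, shaped like the extracted data might be.
records = [
    {"place": "Example County", "population": "10000"},
    {"place": "Example City", "population": "2500"},
]
write_rows(records, "census.csv")
```

Using `csv.DictWriter` means the column order is fixed once and every extracted record lands in the right column, even if some fields are missing on a given page (after setting a suitable `restval`).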