
Web scraping with Python Part-I

3 Mar 2017 · CPOL
Scrape websites with Python 3

IMPORTANT:
This article has been improved a great deal, and a Selenium tutorial has been added: filling in entry boxes, clicking buttons and automating many other things are now covered. However, my English is not strong, so please excuse the grammar and punctuation. I have made this entire article (source) open source on GitHub under the name source.html, so if anyone is interested in helping format it, please do. Thank you. Also, one of the comments says this is a basic article, and that is now becoming a false statement.

Introduction

This article is meant for learning web scraping using the various libraries available for Python. If you are comfortable with Python you can follow this article easily; it is a complete guide that starts from scratch.

Note: I stick with the 3.x version, which guarantees it will stay usable in the future.

Background

For those who have never heard of web scraping, consider this situation:

if a person wants to print two numbers in a console/terminal in Python, he or she will use something like this,

print("1 2")

So what if he or she wants to print about 10 numbers? Well, he or she can use a loop, as in the sketch below. Now we can come to our situation: if a website contains information about a person and you want it in an Excel sheet, what do you do? You copy the person's info and add their contact details and other data into several rows. But what do you do when there is info about 1000 people? Well, you have to code a bot to do the work.
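A minimal sketch of that looping idea (plain Python, nothing scraping-specific yet):

# Printing ten numbers with a loop instead of ten print statements
for number in range(1, 11):
    print(number)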

There are various libraries available for Python; I will try to explain all the important pieces needed to become a fully fledged web scraper.

Using the Libraries

Using the default urllib.request library

Python has its own web scraping module, which may not be the easiest option for advanced scraping but is useful for basic scraping. There is a library named requests which is a better and more stable alternative, so I will cover more in requests than here.

OK, open up your favourite Python editor and import this library:

[code1]

import urllib.request

and type the following code,

# SWAMI KARUPPASWAMI THUNNAI
import urllib.request
source = urllib.request.urlopen("http://www.codeproject.com")
print(source)

Here urllib.request.urlopen fetches the web page. Now when we execute this program we will get something like this:

Image 1

Let's have a closer look at this output: we got a response object at a memory address. http.client.HTTPResponse is a class, and urlopen returned an instance of it, so printing the object only shows its type and location in memory, not the page itself. To see what the response actually contains, we can unpack the response object inside print, which prints each line of the body. Modify the print statement above to:

print(*source)

Now when you execute the code you will see something like this:

Image 2

You may ask: what is this? Well, this is the HTML source code of the web link you requested.

On seeing this, a common thought arises for everyone: oh yeah! I got the HTML code, now I can use regular expressions to get what I want :) But you should not; there is a dedicated parsing library available for this, which I will explain later.

[code2]

The same can be achieved without unpacking: just replace *source with source.read().
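A minimal sketch of that variant, using the same example URL as above; note that read() returns bytes, so the decode step is something I have added here for readability:

# SWAMI KARUPPASWAMI THUNNAI
import urllib.request

source = urllib.request.urlopen("http://www.codeproject.com")
# read() returns the raw response body as bytes; decode it to get a printable string
html = source.read().decode("utf-8", errors="replace")
print(html)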

Scraping images

OK, now how do we scrape images using Python? Let's take a website; we can use this very site, CodeProject (not for commercial use). Open CodeProject's home page and you will see their company logo. Right click the logo and select View Image, and you will see this:

Image 3

OK, now grab the web link: https://codeproject.global.ssl.fastly.net/App_Themes/CodeProject/Img/logo250x135.gif

Note the last letters, which give the extension of the image, gif in our case. Now we can save it to our disk.

Python's urllib.request module (you can see it in request.py) has a function named urlretrieve which saves a file locally from the network, and we are going to use it to save our image.

[code 3]

# SWAMI KARUPPASWAMI THUNNAI
import urllib.request
# Syntax : urllib.request.urlretrieve(arg1,arg2)
# arg1 = web url
# arg2 = path to be saved
source = urllib.request.urlretrieve("https://codeproject.global.ssl.fastly.net/App_Themes/CodeProject/Img/logo250x135.gif","our.gif")

The above code will save the image to the location where the Python file is located. The first argument is the URL and the second is the file name; refer to the syntax in the code.
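As a small aside (not shown in the original code), urlretrieve also returns a tuple of the local file name and the response headers, which can be handy for checking what was actually saved; a hedged sketch:

# SWAMI KARUPPASWAMI THUNNAI
import urllib.request

# urlretrieve returns (local_filename, headers) - useful for a quick sanity check
filename, headers = urllib.request.urlretrieve(
    "https://codeproject.global.ssl.fastly.net/App_Themes/CodeProject/Img/logo250x135.gif",
    "our.gif")
print(filename)                       # path the image was saved to
print(headers.get("Content-Type"))    # e.g. image/gif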

That's it for urllib; now we move on to requests. This matters because I found requests more stable, and everything which can be done with urllib can be done with requests.

Requests

Requests does not come with Python; you need to install it. To do that, just run the pip command below.

PIP COMMAND:
pip install requests

Or you can use the regular method of installing from source; I leave that to you. Try importing requests to check whether it has been installed successfully:

import requests

You should not get any errors when executing this import statement.

we will try this code,

[code 4]

# SWAMI KARUPPASWAMI THUNNAI
import requests
request = requests.get("https://www.codeproject.com")
print(request)

Here we will have a look at line 3 (counting from 1). In this line, request is a variable and requests is the module, which has a function named get to which we pass our web link as an argument. This code will generate this output:

Image 4

This is nothing but an HTTP status code which says the request was successful.

[code 5]: modify the variable in the print statement to request.content; the output will be the content of the web page, which is nothing but the HTML source.
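In other words, [code 5] might look like this, with the same URL as before:

# SWAMI KARUPPASWAMI THUNNAI
import requests

request = requests.get("https://www.codeproject.com")
# .content is the raw body of the response - the HTML source of the page
print(request.content)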

What is a user-agent?

In networking, when data is transmitted from source to destination it is split up into smaller chunks called packets; that is a simple definition of packets on the internet. Alongside the data, every HTTP request also carries headers with information about the client and the destination. We will only analyse the headers which are useful for web scraping, and the user-agent is one of them.

I will show you why this is important. First we will fire up our own server and make it listen on the local machine, IP 127.0.0.1, port 1000. Here, instead of connecting to CodeProject, we will connect to this server using http://127.0.0.1:1000, and you will see something like this on the server:

Image 5
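If you want to reproduce this test yourself, here is a minimal sketch of such a local server, built on Python's standard http.server module (the article does not show the server code, so this is just one possible way to see the incoming headers):

# A tiny local server that prints the headers of every incoming request,
# so we can see which user-agent the client sends.
from http.server import BaseHTTPRequestHandler, HTTPServer


class HeaderEchoHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # self.headers holds the request headers, including User-Agent
        print(self.headers)
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"OK")


server = HTTPServer(("127.0.0.1", 1000), HeaderEchoHandler)
server.serve_forever()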

Let's have a closer look at the message. When we connect to the server using [code 6] (it is the same as code 4, I've just changed the destination address), you will find the user-agent reported as python-requests, along with its version and other details.

This user-agent reveals that the request comes from a machine and not from a human, so some websites will block you from scraping. So what do we do now?

Changing the user-agent

This is our code 6,

# SWAMI KARUPPASWAMI THUNNAI
import requests
request = requests.get("http://127.0.0.1:1000")
print(request.content)

We will add custom headers to the above code.

Open up this Wikipedia page on user agents: https://en.wikipedia.org/wiki/User_agent

You will find an example user agent there; for your convenience, here is the example from that page:

User agent present in wikipedia example:
Mozilla/5.0 (iPad; U; CPU OS 3_2_1 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Mobile/7B405

A Python dictionary is used for adding the user agent: the key is User-Agent and the value is any user agent string; as an example we will take the one above.

So our code will be something like this, [code 7]:

agent = {'User-Agent': 'Mozilla/5.0 (iPad; U; CPU OS 3_2_1 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Mobile/7B405'}

OK, now we can pass this dictionary while making the request to change the user agent:

request = requests.get("http://127.0.0.1:1000",headers=agent) #see the additional argument named headers
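Putting the pieces together, the whole of [code 7] might look like this (the URL and the user-agent string are simply the examples used above):

# SWAMI KARUPPASWAMI THUNNAI
import requests

# Spoof the user-agent so the request no longer identifies itself as python-requests
agent = {'User-Agent': 'Mozilla/5.0 (iPad; U; CPU OS 3_2_1 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Mobile/7B405'}
request = requests.get("http://127.0.0.1:1000", headers=agent)
print(request.content)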

On execution you will see that the user-agent has changed from python-requests to the Mozilla string above. How do I know? See the screenshot below.

Image 6

What have we done so far? We have only fetched the page source using different methods, so now we will gather the data. Let's get started!

Library 3 : Beautifulsoup: pip install beautifulsoup4

So what is Beautiful Soup, is it a scraping library? Actually, BeautifulSoup is a parsing library, used to parse HTML.

HTML? Yes, HTML. Remember, all of the methods above are used to get the page source, which is nothing but HTML.

TARGET SCRAPING WEBSITE: https://www.yellowpages.com.au/search/listings?clue=Restaurants&locationClue=&lat=&lon=&selectedViewMode=list

IMPORTANT: I am using this website just as an example for educational purposes, nothing more. I've chosen it because it has a clear layout with pagination, which makes it the best example for scraping other sites that you have permission for. Be warned: scraping may lead to penalties if misused. Don't involve me :)

OK, let's get into the topic:

IMPORTANT:

Basic HTML tags which are most used for website scraping:

Tags (pocket hints):
<title> </title>          => adds a title to the web page
<p> </p>                  => paragraph
<a href="someLink"> </a>  => link
<h(x)> </h(x)>            => heading tags
and some other tags like div (a container) and so on. This is not an HTML tutorial anyway.

[How to know we are on the safe side of scraping - identifying whether the site is allowing us to scrape]:

The structure of the website is something like this,

 

Image 7

What we are going to do is scrape all the bold names (e.g. Royal India Restaurant, which can be seen in the picture).

STEPS: Right click one of the bold names on the website (Royal India Restaurant) and select Inspect Element. You will see something like this:

Image 8

So we have got the relevant HTML tag. Have a look at it and you will find something like this:

<a class="class name" .... Here a means a link, as I explained in the pocket hints. So the bold names are links which belong to a class named "listing-name". So can you guess now how to get all the bold names?

Answer: scraping all the links which belong to this class name will give us the names of all those restaurants.

Alright, we will write a script to scrape all the links first. To get the HTML source I am going to use requests, and to parse the HTML I am going to use BeautifulSoup.

You might think the following code will display the page content:

[code 9]

# SWAMI KARUPPASWAMI THUNNAI
import requests
from bs4 import BeautifulSoup


if __name__=="__main__":
    req = requests.get("https://www.yellowpages.com.au/search/listings?clue=Restaurants&locationClue=&lat=&lon=&selectedViewMode=list")
    #req.content = html page source and we are using the html parser
    soup = BeautifulSoup(req.content,"html.parser")
    print(soup)

NO, this will not display the page source; the output will be something like this:

Quote:

We value the quality of content provided to our customers, and to maintain this, we would like to ensure real humans are accessing our information.

...

<form action="/dataprotection" method="post" name="captcha">

Why did this happen?

This page appears when online data protection services detect requests coming from your computer network which appear to be in violation of our website's terms of use.

 

I told you that in the real world, scraping requests coming from python-requests get blocked. Of course we are all violating their terms and conditions here, but the block can be bypassed easily by adding a user agent. I have added the user agent in [code 9], and when you run it, the code will work and we will get the page source. So we have now established that we would be violating their terms and conditions, and we should not scrape any further. I end this example here, showing only the scraped names from page 1 of the website.

Below is how such protection can be bypassed - for educational purposes only.

so now our modified [code 9]:

# SWAMI KARUPPASWAMI THUNNAI
import requests
from bs4 import BeautifulSoup


if __name__=="__main__":
    agent = {'User-Agent': 'Mozilla/5.0 (iPad; U; CPU OS 3_2_1 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Mobile/7B405'}
    req = requests.get("https://www.yellowpages.com.au/search/listings?clue=Restaurants&locationClue=&lat=&lon=&selectedViewMode=list",headers=agent)
    #req.content = html page source and we are using the html parser
    soup = BeautifulSoup(req.content,"html.parser")
    for i in soup.find_all("a",class_="listing-name"):
        print(i.text)

will yield this,

Image 9

I have ended the scraping here and the website has not been scraped any further. I strongly recommend you do the same so that no one is affected.

One complete scraping example to get familiar with scraping

Target site: https://www.yelp.com/search?find_desc=Restaurants&find_loc=San+Francisco%2C+CA&ns=1

Before getting into scraping this website, I would like to explain the general layouts which may be seen on websites. Once the layout is identified, we can code accordingly.

1. Information in one long lengthy page:

If this is our case, then it is easy: we write a script which scrapes that single page alone.

2. Pagination:

If a website has a pagination layout the web site will have multiple pages like page1, page2, page3 and so on.

Our example website does have a pagination layout. Open the target site https://www.yelp.com/search?find_desc=Restaurants&find_loc=San+Francisco%2C+CA&ns=1

and scroll down, and you will see something like this:

Image 10

So that is pagination. In this case we need to write a script that goes to every single page and scrapes the information; I will explain more about scraping pagination below.

3. AJAX spinner: we need to use Selenium to get the job done for these types of websites, and I will also explain how to use Selenium further on in this article.

Explanation of scraping pagination for the above link: https://www.yelp.com/search?find_desc=Restaurants&find_loc=San+Francisco%2C+CA&ns=1

What we are going to do in this case is scrape all the available page links first (see the image above). You will find something like 1, 2, 3, ..., Next >. All of those are links (<a> tags in HTML), but don't scrape those numbered links: if you do, you will get the links for pages 1 to 9 but not for any further pages, because the Next > link hides the links beyond them.

Run the below code and see the output [code 10]:

# Scrapes the numbered page links only, so it covers just the first 9 pages
import requests
from bs4 import BeautifulSoup


def Scrape(weblink):
    r = requests.get(weblink)
    soup = BeautifulSoup(r.content, "html.parser")
    # each numbered page link belongs to this class
    for i in soup.find_all("a", class_="available-number pagination-links_anchor"):
        print("https://www.yelp.com" + i.get("href"))
        print(i.text)

You will only get output for the first 9 pages, so in order to get the links of all pages, we are going to scrape the Next link instead.

Go to the page and inspect the Next link:

Image 11

You will find that the Next link belongs to a class named "u-decoration-none next pagination-links_anchor".

Scraping that link gives you the link to the next page: if you are scraping page 1 it gives the link for page 2, and if you are scraping page 2 it gives the link for page 3. Does that make sense?

RECURSION...! :)

def scrape(weblink):
    r = requests.get(weblink)
    soup = BeautifulSoup(r.content, "html.parser")
    # Do some scraping for the current page here
    for i in soup.find_all("a", class_="u-decoration-none next pagination-links_anchor"):
        print("https://www.yelp.com" + i.get("href"))
        scrape("https://www.yelp.com" + i.get("href"))

Now we can do whatever we want.

We will scrape the names of all restaurants as an example:

def scrape(weblink):
    print(weblink)
    r = requests.get(weblink)
    soup = BeautifulSoup(r.content,"html.parser")
    for i in soup.find_all("a",class_="biz-name js-analytics-click"):
        print(i.text)
    for i in soup.find_all("a",class_="u-decoration-none next pagination-links_anchor"):
        print("https://www.yelp.com"+i.get("href"))
        scrape("https://www.yelp.com"+i.get("href"))

This will give output something like this:

Quote:

https://www.yelp.com/search?find_desc=Restaurants&find_loc=San+Francisco%2C+CA&ns=1
Extreme Pizza
B Patisserie
Cuisine of Nepal
ABV
Southern Comfort Kitchen
Buzzworks
Frances
The Morris
Tacorea
No No Burger
August 1 Five
https://www.yelp.com/search?find_desc=Restaurants&find_loc=San+Francisco%2C+CA&start=10
https://www.yelp.com/search?find_desc=Restaurants&find_loc=San+Francisco%2C+CA&start=10
Extreme Pizza
Gary Danko
Italian Homemade Company
Nopa
Sugarfoot
Big Rec Taproom
El Farolito
Hogwash
Loló
Kebab King
Paprika
https://www.yelp.com/search?find_desc=Restaurants&find_loc=San+Francisco%2C+CA&start=20

You will find "Extreme Pizza" repeated twice: it is nothing but a sponsored advertisement for that restaurant, which is displayed first on every page. We can write a script to skip the first entry; a simple conditional will do, and any beginner could manage it. A sketch is shown below.
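One hedged way to do that is to skip the first match on each page; this sketch just builds on the recursive code above and is not from the original article:

# SWAMI KARUPPASWAMI THUNNAI
import requests
from bs4 import BeautifulSoup


def scrape(weblink):
    print(weblink)
    r = requests.get(weblink)
    soup = BeautifulSoup(r.content, "html.parser")
    names = soup.find_all("a", class_="biz-name js-analytics-click")
    # names[0] is the sponsored advertisement shown first on every page, so skip it
    for i in names[1:]:
        print(i.text)
    for i in soup.find_all("a", class_="u-decoration-none next pagination-links_anchor"):
        scrape("https://www.yelp.com" + i.get("href"))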

Ultimate Selenium Guide

So far we have used several libraries for some fairly basic scraping; now we are going to use web drivers for complete browser automation, and this is going to be really interesting to watch...

The best way to install Selenium is by downloading the source from https://pypi.python.org/pypi/selenium (or simply running pip install selenium).

OK, after installing Selenium, test that it works by importing the webdriver:

from selenium import webdriver

You should not get any error when this line is executed. If there are no errors then we can get started: as the import shows, we are going to use webdrivers for automating the browser.

OK, there are various web drivers available out there which do the very same task, but I am going to cover only two:

  1. chrome driver: for real world scraping
  2. phantomjs: for headless scraping

Download chrome driver: https://sites.google.com/a/chromium.org/chromedriver/downloads

Download PhantomJs: http://phantomjs.org/download.html

Chrome Driver

We will see how to use the chrome driver here. Once you have tested that chromedriver is installed properly, do the following. First of all, it is advisable to place chromedriver in one fixed location, e.g. C:\\chromedriver.exe, because this avoids keeping copies everywhere: if you place it next to your Python scripts it will work fine, but then every separate project needs its own copy of chromedriver, which leads to a lot of nuisance.

OK, now have a look at code 11; this code will open Google.

[CODE 11]

from selenium import webdriver
# we are going to use the Chrome Driver so we have used Chrome
browser = webdriver.Chrome("E:\\chromedriver.exe")
#Get the website
browser.get("https://www.google.com")

The get function will open the web link which is passed as an argument. Now we will open and close the browser; in order to close the browser we use the close() method.

Syntax for closing the browser:
browser.close()
So, adding browser.close() at the end of the code will close the browser.
 

You will find that the browser window is closed but the webdriver itself is not; that can be done by using the browser.quit() method. [code 12]
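A minimal sketch of what [code 12] might look like, reusing the same example driver path and URL:

# SWAMI KARUPPASWAMI THUNNAI
from selenium import webdriver

browser = webdriver.Chrome("E:\\chromedriver.exe")
browser.get("https://www.google.com")
# close() only closes the current browser window;
# quit() also shuts down the webdriver session itself
browser.quit()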

Understanding ID, name and css_selectors

ID

The id global attribute defines a unique identifier (ID) which must be unique in the whole document. Its purpose is to identify the element when linking (using a fragment identifier), scripting, or styling (with CSS). reference :MDN

See the sample ID here:

ID

OK, now we will try to click a button using an ID!

See the sample code project page

Here you will find the CANCEL button. Right click on the cancel button and inspect the element, and you will find something like this:

CANCEL BUTTON

Once we click the cancel button, it is going to redirect us to CodeProject's homepage, so we are going to use Selenium to automate the process. OK, let's get started!

#SWAMI KARUPPASWAMI THUNNAI

from selenium import webdriver

if __name__=="__main__":
    browser = webdriver.Chrome("E:\\chromedriver.exe")
    #get the url
    browser.get("https://www.codeproject.com/Questions/ask.aspx")
    #click the cancel button using id
    cancel_button = browser.find_element_by_id("ctl00_ctl00_MC_AMC_PostEntry_Cancel")
    cancel_button.click()

Have a look at this line [cancel_button = browser.find_element_by_id("ctl00_ctl00_MC_AMC_PostEntry_Cancel")]: we are finding the element using its ID and then clicking it. After that we do nothing, so once the button has been clicked, the URL redirects to the homepage because we clicked the cancel button.

Understanding name tags

Now, open up Google and start inspecting the search panel, and you will find something like this:

Google Screenshot

IMPORTANT:
So far we have opened up a page, clicked a button and got the page source. Now we are going to fill in an entry box, so we need to pay more attention from here on, since we are moving from basics to advanced!

Import Keys in order to send the keystrokes:

from selenium.webdriver.common.keys import Keys

OK, we will see an example of how we are going to automate a Google search.

# SWAMI KARUPPASWAMI THUNNAI
# CODE-14
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
browser = webdriver.Chrome("E:\\chromedriver.exe")
browser.get("https://www.google.com")
name = browser.find_element_by_name("q")
keyword = "Codeproject"
#Use send_keys to send the keywords
# NOTE: Do not use the webdriver like in here browser.send_keys("something")
# Webdriver does not have that kind of attribute
# Use the actual variable which is used to find the element
# in our case it is "name"
name.send_keys(keyword)

So, send_keys(arg) takes the keyword as an argument and types it into the entry box.
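If you also want the search to actually be submitted, one option (my own addition, not shown in the article) is to send the RETURN key after the keyword, which is exactly what the Keys import is for:

# Hypothetical extension of code 14: press Enter in the search box to submit the query
name.send_keys(Keys.RETURN)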

In the next upcoming tutorials I will cover how to use inspectors, use css_selectors, log into websites, etc.

Points of Interest

The final message I would like to leave you with is this:

I have not broken any rules on any of the sites and I have not disobeyed any website's terms and conditions, and I ask readers to use this knowledge wisely, for good, and not for exploiting websites, which I do not encourage.

Further to come in next articles:

Kindly take the survey; I will cover these topics if this article turns out to be a successful one.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)
