
Web scraping with Python Part-I

3 Mar 2017 · CPOL
Scrape websites with Python 3

IMPORTANT:
This article has been improved a great deal, and a Selenium tutorial has been added: filling in entry boxes, clicking buttons and automating many other things are now covered. However, my English is not strong, so please excuse the grammar and punctuation. I have made this entire article (source) open source on GitHub under the name source.html, so if anyone is interested in helping format it, please do. Thank you. Also, one of the comments says this is a basic article, and that is now becoming a false statement.

Introduction

This article is meant for learning web scraping using the various libraries available for Python. If you are comfortable with Python you can follow this article easily; it is a complete guide that starts from scratch.

Note: I stick with the 3.x version, which guarantees it will stay usable in the future.

Background

For those who have never heard of web scraping, consider this situation:

if a person wants to print two numbers in a console/terminal in Python, he or she will use something like this,

print("1 2")

So what if he or she wants to print about 10 numbers? Well, he or she can use a loop, as in the sketch below. Now we can come to our situation: if a website contains information about a person and you want it in an Excel sheet, what do you do? You copy the person's info and add their contact details and other data into several rows. But what do you do when there is info about 1000 people? Well, you have to code a bot to do the work.
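A minimal sketch of that looping idea (plain Python, nothing scraping-specific yet):

# Printing ten numbers with a loop instead of ten print statements
for number in range(1, 11):
    print(number)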

There are various libraries available for Python; I will try to explain all the important pieces needed to become a fully fledged web scraper.

Using the Libraries

Using the default urllib.request library

Python has its own web scraping module, which may not be the easiest option for advanced scraping but is useful for basic scraping. There is a library named requests which is a better and more stable alternative, so I will cover more in requests than here.

OK, open up your favourite Python editor and import this library:

[code1]

import urllib.request

and type the following code,

# SWAMI KARUPPASWAMI THUNNAI
import urllib.request
source = urllib.request.urlopen("http://www.codeproject.com")
print(source)

Here urllib.request.urlopen fetches the web page. Now when we execute this program we will get something like this:

Image 1

Let's have a closer look at this output: we got a response object at a memory address. http.client.HTTPResponse is a class, and urlopen returned an instance of it, so printing the object only shows its type and location in memory, not the page itself. To see what the response actually contains, we can unpack the response object inside print, which prints each line of the body. Modify the print statement above to:

print(*source)

Now when you execute the code you will see something like this:

Image 2

You may ask: what is this? Well, this is the HTML source code of the web link you requested.

On seeing this, a common thought arises for everyone: oh yeah! I got the HTML code, now I can use regular expressions to get what I want :) But you should not; there is a dedicated parsing library available for this, which I will explain later.

[code2]

The same can be achieved without unpacking: just replace *source with source.read().
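A minimal sketch of that variant, using the same example URL as above; note that read() returns bytes, so the decode step is something I have added here for readability:

# SWAMI KARUPPASWAMI THUNNAI
import urllib.request

source = urllib.request.urlopen("http://www.codeproject.com")
# read() returns the raw response body as bytes; decode it to get a printable string
html = source.read().decode("utf-8", errors="replace")
print(html)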

Scraping images

OK, now how do we scrape images using Python? Let's take a website; we can use this very site, CodeProject (not for commercial use). Open CodeProject's home page and you will see their company logo. Right click the logo and select View Image, and you will see this:

Image 3

OK, now grab the web link: https://codeproject.global.ssl.fastly.net/App_Themes/CodeProject/Img/logo250x135.gif

Note the last letters, which give the extension of the image, gif in our case. Now we can save it to our disk.

Python's urllib.request module (you can see it in request.py) has a function named urlretrieve which saves a file locally from the network, and we are going to use it to save our image.

[code 3]

# SWAMI KARUPPASWAMI THUNNAI
import urllib.request
# Syntax : urllib.request.urlretrieve(arg1,arg2)
# arg1 = web url
# arg2 = path to be saved
source = urllib.request.urlretrieve("https://codeproject.global.ssl.fastly.net/App_Themes/CodeProject/Img/logo250x135.gif","our.gif")

The above code will save the image to the location where the Python file is located. The first argument is the URL and the second is the file name; refer to the syntax in the code.
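As a small aside (not shown in the original code), urlretrieve also returns a tuple of the local file name and the response headers, which can be handy for checking what was actually saved; a hedged sketch:

# SWAMI KARUPPASWAMI THUNNAI
import urllib.request

# urlretrieve returns (local_filename, headers) - useful for a quick sanity check
filename, headers = urllib.request.urlretrieve(
    "https://codeproject.global.ssl.fastly.net/App_Themes/CodeProject/Img/logo250x135.gif",
    "our.gif")
print(filename)                       # path the image was saved to
print(headers.get("Content-Type"))    # e.g. image/gif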

That's it for urllib; now we move on to requests. This matters because I found requests more stable, and everything which can be done with urllib can be done with requests.

Requests

Requests does not come with Python; you need to install it. To do that, just run the pip command below.

PIP COMMAND:
pip install requests

Or you can use the regular method of installing from source; I leave that to you. Try importing requests to check whether it has been installed successfully:

import requests

You should not get any errors when executing this import statement.

we will try this code,

[code 4]

# SWAMI KARUPPASWAMI THUNNAI
import requests
request = requests.get("https://www.codeproject.com")
print(request)

Here we will have a look at line 3 (counting from 1). In this line, request is a variable and requests is the module, which has a function named get to which we pass our web link as an argument. This code will generate this output:

Image 4

This is nothing but an HTTP status code which says the request was successful.

[code 5]: modify the variable in the print statement to request.content; the output will be the content of the web page, which is nothing but the HTML source.
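In other words, [code 5] might look like this, with the same URL as before:

# SWAMI KARUPPASWAMI THUNNAI
import requests

request = requests.get("https://www.codeproject.com")
# .content is the raw body of the response - the HTML source of the page
print(request.content)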

What is a user-agent?

In networking, when data is transmitted from source to destination it is split up into smaller chunks called packets; that is a simple definition of packets on the internet. Alongside the data, every HTTP request also carries headers with information about the client and the destination. We will only analyse the headers which are useful for web scraping, and the user-agent is one of them.

I will show you why this is important. First we will fire up our own server and make it listen on the local machine, IP 127.0.0.1, port 1000. Here, instead of connecting to CodeProject, we will connect to this server using http://127.0.0.1:1000, and you will see something like this on the server:

Image 5
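If you want to reproduce this test yourself, here is a minimal sketch of such a local server, built on Python's standard http.server module (the article does not show the server code, so this is just one possible way to see the incoming headers):

# A tiny local server that prints the headers of every incoming request,
# so we can see which user-agent the client sends.
from http.server import BaseHTTPRequestHandler, HTTPServer


class HeaderEchoHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # self.headers holds the request headers, including User-Agent
        print(self.headers)
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"OK")


server = HTTPServer(("127.0.0.1", 1000), HeaderEchoHandler)
server.serve_forever()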

Let's have a closer look at the message. When we connect to the server using [code 6] (it is the same as code 4, I've just changed the destination address), you will find the user-agent reported as python-requests, along with its version and other details.

This user-agent reveals that the request comes from a machine and not from a human, so some websites will block you from scraping. So what do we do now?

Changing the user-agent

This is our code 6,

# SWAMI KARUPPASWAMI THUNNAI
import requests
request = requests.get("http://127.0.0.1:1000")
print(request.content)

We will add custom headers to the above code.

Open up this Wikipedia page on user agents: https://en.wikipedia.org/wiki/User_agent

You will find an example user agent there; for your convenience, here is the example from that page:

User agent present in wikipedia example:
Mozilla/5.0 (iPad; U; CPU OS 3_2_1 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Mobile/7B405

A Python dictionary is used for adding the user agent: the key is User-Agent and the value is any user agent string; as an example we will take the one above.

So our code will be something like this, [code 7]:

agent = {'User-Agent': 'Mozilla/5.0 (iPad; U; CPU OS 3_2_1 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Mobile/7B405'}

OK, now we can pass this dictionary while making the request to change the user agent:

request = requests.get("http://127.0.0.1:1000",headers=agent) #see the additional argument named headers
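Putting the pieces together, the whole of [code 7] might look like this (the URL and the user-agent string are simply the examples used above):

# SWAMI KARUPPASWAMI THUNNAI
import requests

# Spoof the user-agent so the request no longer identifies itself as python-requests
agent = {'User-Agent': 'Mozilla/5.0 (iPad; U; CPU OS 3_2_1 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Mobile/7B405'}
request = requests.get("http://127.0.0.1:1000", headers=agent)
print(request.content)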

On execution you will see that the user-agent has changed from python-requests to the Mozilla string above. How do I know? See the screenshot below.

Image 6

What have we done so far? We have only fetched the page source using different methods, so now we will gather the data. Let's get started!

Library 3 : Beautifulsoup: pip install beautifulsoup4

So what is Beautiful Soup, is it a scraping library? Actually, BeautifulSoup is a parsing library, used to parse HTML.

HTML? Yes, HTML. Remember, all of the methods above are used to get the page source, which is nothing but HTML.

TARGET SCRAPING WEBSITE: https://www.yellowpages.com.au/search/listings?clue=Restaurants&locationClue=&lat=&lon=&selectedViewMode=list

IMPORTANT: I am using this website just as an example for educational purposes, nothing more. I've chosen it because it has a clear layout with pagination, which makes it the best example for scraping other sites that you have permission for. Be warned: scraping may lead to penalties if misused. Don't involve me :)

OK, let's get into the topic:

IMPORTANT:

Basic HTML tags which are most used for website scraping:

Tags (pocket hints):
<title> </title>          => adds a title to the web page
<p> </p>                  => paragraph
<a href="someLink"> </a>  => link
<h(x)> </h(x)>            => heading tags
and some other tags like div (a container) and so on. This is not an HTML tutorial anyway.

[How to know we are on the safe side of scraping - identifying whether the site is allowing us to scrape]:

The structure of the website is something like this,

 

Image 7

What we are going to do is scrape all the bold names (e.g. Royal India Restaurant, which can be seen in the picture).

STEPS: Right click one of the bold names on the website (Royal India Restaurant) and select Inspect Element. You will see something like this:

Image 8

So we have got the relevant HTML tag. Have a look at it and you will find something like this:

<a class="class name" .... Here a means a link, as I explained in the pocket hints. So the bold names are links which belong to a class named "listing-name". So can you guess now how to get all the bold names?

Answer: scraping all the links which belong to this class name will give us the names of all those restaurants.

Alright, we will write a script to scrape all the links first. To get the HTML source I am going to use requests, and to parse the HTML I am going to use BeautifulSoup.

You might think the following code will display the page content:

[code 9]

# SWAMI KARUPPASWAMI THUNNAI
import requests
from bs4 import BeautifulSoup


if __name__=="__main__":
    req = requests.get("https://www.yellowpages.com.au/search/listings?clue=Restaurants&locationClue=&lat=&lon=&selectedViewMode=list")
    #req.content = html page source and we are using the html parser
    soup = BeautifulSoup(req.content,"html.parser")
    print(soup)

NO, this will not display the page source; the output will be something like this:

Quote:

We value the quality of content provided to our customers, and to maintain this, we would like to ensure real humans are accessing our information.

...

<form action="/dataprotection" method="post" name="captcha">

Why did this happen?

This page appears when online data protection services detect requests coming from your computer network which appear to be in violation of our website's terms of use.

 

I told you that in the real world, scraping requests coming from python-requests get blocked. Of course we are all violating their terms and conditions here, but the block can be bypassed easily by adding a user agent. I have added the user agent in [code 9], and when you run it, the code will work and we will get the page source. So we have now established that we would be violating their terms and conditions, and we should not scrape any further. I end this example here, showing only the scraped names from page 1 of the website.

Below is how such protection can be bypassed - for educational purposes only.

so now our modified [code 9]:

# SWAMI KARUPPASWAMI THUNNAI
import requests
from bs4 import BeautifulSoup


if __name__=="__main__":
    agent = {'User-Agent': 'Mozilla/5.0 (iPad; U; CPU OS 3_2_1 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Mobile/7B405'}
    req = requests.get("https://www.yellowpages.com.au/search/listings?clue=Restaurants&locationClue=&lat=&lon=&selectedViewMode=list",headers=agent)
    #req.content = html page source and we are using the html parser
    soup = BeautifulSoup(req.content,"html.parser")
    for i in soup.find_all("a",class_="listing-name"):
        print(i.text)

will yield this,

Image 9

I have ended the scraping here and the website has not been scraped any further. I strongly recommend you do the same so that no one is affected.

One complete scraping example to get familiar with scraping

Target site: https://www.yelp.com/search?find_desc=Restaurants&find_loc=San+Francisco%2C+CA&ns=1

Before getting into scraping this website, I would like to explain the general layouts which may be seen on websites. Once the layout is identified, we can code accordingly.

1. Information in one long lengthy page:

If this is our case, then it is easy: we write a script which scrapes that single page alone.

2. Pagination:

If a website has a pagination layout the web site will have multiple pages like page1, page2, page3 and so on.

Our example website does have a pagination layout. Open the target site https://www.yelp.com/search?find_desc=Restaurants&find_loc=San+Francisco%2C+CA&ns=1

and scroll down, and you will see something like this:

Image 10

So that is pagination. In this case we need to write a script that goes to every single page and scrapes the information; I will explain more about scraping pagination below.

3. AJAX spinner: we need to use Selenium to get the job done for these types of websites, and I will also explain how to use Selenium further on in this article.

Explanation of scraping pagination for the above link: https://www.yelp.com/search?find_desc=Restaurants&find_loc=San+Francisco%2C+CA&ns=1

What we are going to do in this case is scrape all the available page links first (see the image above). You will find something like 1, 2, 3, ..., Next >. All of those are links (<a> tags in HTML), but don't scrape those numbered links: if you do, you will get the links for pages 1 to 9 but not for any further pages, because the Next > link hides the links beyond them.

Run the below code and see the output [code 10]:

# Scrapes the numbered page links only, so it covers just the first 9 pages
import requests
from bs4 import BeautifulSoup


def Scrape(weblink):
    r = requests.get(weblink)
    soup = BeautifulSoup(r.content, "html.parser")
    # each numbered page link belongs to this class
    for i in soup.find_all("a", class_="available-number pagination-links_anchor"):
        print("https://www.yelp.com" + i.get("href"))
        print(i.text)

You will only get output for the first 9 pages, so in order to get the links of all pages, we are going to scrape the Next link instead.

Go to the page and inspect the Next link:

Image 11

You will find that the Next link belongs to a class named "u-decoration-none next pagination-links_anchor".

Scraping that link gives you the link to the next page: if you are scraping page 1 it gives the link for page 2, and if you are scraping page 2 it gives the link for page 3. Does that make sense?

RECURSION...! :)

def scrape(weblink):
    r = requests.get(weblink)
    soup = BeautifulSoup(r.content, "html.parser")
    # Do some scraping for the current page here
    for i in soup.find_all("a", class_="u-decoration-none next pagination-links_anchor"):
        print("https://www.yelp.com" + i.get("href"))
        scrape("https://www.yelp.com" + i.get("href"))

Now we can do whatever we want.

We will scrape the names of all restaurants as an example:

def scrape(weblink):
    print(weblink)
    r = requests.get(weblink)
    soup = BeautifulSoup(r.content,"html.parser")
    for i in soup.find_all("a",class_="biz-name js-analytics-click"):
        print(i.text)
    for i in soup.find_all("a",class_="u-decoration-none next pagination-links_anchor"):
        print("https://www.yelp.com"+i.get("href"))
        scrape("https://www.yelp.com"+i.get("href"))

This will give output something like this:

Quote:

https://www.yelp.com/search?find_desc=Restaurants&find_loc=San+Francisco%2C+CA&ns=1
Extreme Pizza
B Patisserie
Cuisine of Nepal
ABV
Southern Comfort Kitchen
Buzzworks
Frances
The Morris
Tacorea
No No Burger
August 1 Five
https://www.yelp.com/search?find_desc=Restaurants&find_loc=San+Francisco%2C+CA&start=10
https://www.yelp.com/search?find_desc=Restaurants&find_loc=San+Francisco%2C+CA&start=10
Extreme Pizza
Gary Danko
Italian Homemade Company
Nopa
Sugarfoot
Big Rec Taproom
El Farolito
Hogwash
Loló
Kebab King
Paprika
https://www.yelp.com/search?find_desc=Restaurants&find_loc=San+Francisco%2C+CA&start=20

You will find "Extreme Pizza" repeated twice: it is nothing but a sponsored advertisement for that restaurant, which is displayed first on every page. We can write a script to skip the first entry; a simple conditional will do, and any beginner could manage it. A sketch is shown below.
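One hedged way to do that is to skip the first match on each page; this sketch just builds on the recursive code above and is not from the original article:

# SWAMI KARUPPASWAMI THUNNAI
import requests
from bs4 import BeautifulSoup


def scrape(weblink):
    print(weblink)
    r = requests.get(weblink)
    soup = BeautifulSoup(r.content, "html.parser")
    names = soup.find_all("a", class_="biz-name js-analytics-click")
    # names[0] is the sponsored advertisement shown first on every page, so skip it
    for i in names[1:]:
        print(i.text)
    for i in soup.find_all("a", class_="u-decoration-none next pagination-links_anchor"):
        scrape("https://www.yelp.com" + i.get("href"))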

Ultimate Selenium Guide

So far we have used several libraries for some fairly basic scraping; now we are going to use web drivers for complete browser automation, and this is going to be really interesting to watch...

The best way to install Selenium is by downloading the source from https://pypi.python.org/pypi/selenium (or simply running pip install selenium).

OK, after installing Selenium, test that it works by importing the webdriver:

from selenium import webdriver

You should not get any error when this line is executed. If there are no errors then we can get started: as the import shows, we are going to use webdrivers for automating the browser.

OK, there are various web drivers available out there which do the very same task, but I am going to cover only two:

  1. chrome driver: for real world scraping
  2. phantomjs: for headless scraping

Download chrome driver: https://sites.google.com/a/chromium.org/chromedriver/downloads

Download PhantomJs: http://phantomjs.org/download.html

Chrome Driver

We will see how to use the chrome driver here. Once you have tested that chromedriver is installed properly, do the following. First of all, it is advisable to place chromedriver in one fixed location, e.g. C:\\chromedriver.exe, because this avoids keeping copies everywhere: if you place it next to your Python scripts it will work fine, but then every separate project needs its own copy of chromedriver, which leads to a lot of nuisance.

OK, now have a look at code 11; this code will open Google.

[CODE 11]

from selenium import webdriver
# we are going to use the Chrome Driver so we have used Chrome
browser = webdriver.Chrome("E:\\chromedriver.exe")
#Get the website
browser.get("https://www.google.com")

The get function will open the web link which is passed as an argument. Now we will open and close the browser; in order to close the browser we use the close() method.

Syntax for closing the browser:
browser.close()
So, adding browser.close() at the end of the code will close the browser.
 

You will find that the browser window is closed but the webdriver itself is not; that can be done by using the browser.quit() method. [code 12]
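A minimal sketch of what [code 12] might look like, reusing the same example driver path and URL:

# SWAMI KARUPPASWAMI THUNNAI
from selenium import webdriver

browser = webdriver.Chrome("E:\\chromedriver.exe")
browser.get("https://www.google.com")
# close() only closes the current browser window;
# quit() also shuts down the webdriver session itself
browser.quit()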

Understanding ID, name and css_selectors

ID

The id global attribute defines a unique identifier (ID) which must be unique in the whole document. Its purpose is to identify the element when linking (using a fragment identifier), scripting, or styling (with CSS). reference :MDN

See the sample ID here:

ID

OK, now we will try to click a button using an ID!

See the sample code project page

Here you will find the CANCEL button. Right click on the cancel button and inspect the element, and you will find something like this:

CANCEL BUTTON

Once we click the cancel button, it is going to redirect us to CodeProject's homepage, so we are going to use Selenium to automate the process. OK, let's get started!

#SWAMI KARUPPASWAMI THUNNAI

from selenium import webdriver

if __name__=="__main__":
    browser = webdriver.Chrome("E:\\chromedriver.exe")
    #get the url
    browser.get("https://www.codeproject.com/Questions/ask.aspx")
    #click the cancel button using id
    cancel_button = browser.find_element_by_id("ctl00_ctl00_MC_AMC_PostEntry_Cancel")
    cancel_button.click()

Have a look at this line [cancel_button = browser.find_element_by_id("ctl00_ctl00_MC_AMC_PostEntry_Cancel")]: we are finding the element using its ID and then clicking it. After that we do nothing, so once the button has been clicked, the URL redirects to the homepage because we clicked the cancel button.

Understanding name tags

Now, open up Google and start inspecting the search panel, and you will find something like this:

Google Screenshot

IMPORTANT:
So far we have opened up a page, clicked a button and got the page source. Now we are going to fill in an entry box, so we need to pay more attention from here on, since we are moving from basics to advanced!

Import Keys in order to send the keystrokes:

from selenium.webdriver.common.keys import Keys

OK, we will see an example of how we are going to automate a Google search.

# SWAMI KARUPPASWAMI THUNNAI
# CODE-14
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
browser = webdriver.Chrome("E:\\chromedriver.exe")
browser.get("https://www.google.com")
name = browser.find_element_by_name("q")
keyword = "Codeproject"
#Use send_keys to send the keywords
# NOTE: Do not use the webdriver like in here browser.send_keys("something")
# Webdriver does not have that kind of attribute
# Use the actual variable which is used to find the element
# in our case it is "name"
name.send_keys(keyword)

So, send_keys(arg) takes the keyword as an argument and types it into the entry box.
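If you also want the search to actually be submitted, one option (my own addition, not shown in the article) is to send the RETURN key after the keyword, which is exactly what the Keys import is for:

# Hypothetical extension of code 14: press Enter in the search box to submit the query
name.send_keys(Keys.RETURN)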

In the next upcoming tutorials I will cover how to use inspectors, use css_selectors, log into websites, etc.

Points of Interest

The final message I would like to leave you with is this:

I have not broken any rules on any of the sites and I have not disobeyed any website's terms and conditions, and I ask readers to use this knowledge wisely, for good, and not for exploiting websites, which I do not encourage.

Further to come in next articles:

Kindly take the survey; I will cover these topics if this article turns out to be a successful one.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)
