Crawling with threads

Richard Penman, 13 Jan 2013, CPOL

The bottleneck for web scraping is generally network I/O: the time spent waiting for webpages to download. This delay can be reduced by downloading multiple webpages concurrently in separate threads.

Here are examples of both approaches, sequential and concurrent:

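# a list of 100 webpage URLs to download
urls = [...]

# first try downloading sequentially
import urllib
for url in urls:
    # download each page in turn (Python 2 API; on Python 3 use urllib.request.urlopen)
    urllib.urlopen(url).read()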

# now try concurrently
import sys
from webscraping import download
num_threads = int(sys.argv[1])
download.threaded_get(urls=urls, delay=0, num_threads=num_threads, 
    read_cache=False, write_cache=False) # disable cache
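
For readers who do not have the webscraping package installed, the same idea can be sketched with only the Python 3 standard library. This is an illustration under assumptions, not the code used for this article: the names `fetch` and `threaded_fetch` are hypothetical, and a thread pool from `concurrent.futures` stands in for whatever `download.threaded_get` does internally.

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def fetch(url, timeout=10):
    """Download one URL and return its body as bytes, or None on a network error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except OSError:
        # URLError, HTTPError and socket timeouts are all subclasses of OSError
        return None


def threaded_fetch(urls, num_threads=10, fetch=fetch):
    """Download urls concurrently with num_threads workers.

    Results are returned in the same order as the input URLs.
    """
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # pool.map hands each URL to a free worker thread and preserves order
        return list(pool.map(fetch, urls))
```

Calling `threaded_fetch(urls, num_threads=10)` returns one entry per URL. The `fetch` argument can be swapped for a different download function, which also makes the sketch easy to try without touching the network. The results that follow refer to the scripts above, not to this sketch.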

Here are the results:

$ time python
$ time python 10
$ time python 100

As expected, threading the downloads makes a big difference. You may have noticed that the time saved is not linearly proportional to the number of threads. That is primarily because my web server struggles to keep up with all the requests. When crawling websites with threads, be careful not to overload their web server by downloading too fast. Otherwise the website will become slower for other users and your IP risks being blacklisted.
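
One way to avoid overloading a server is to enforce a minimum delay between requests to the same domain, shared across all threads. The examples above pass `delay=0` to `threaded_get`, which presumably controls something similar in the webscraping library. The sketch below is a standalone illustration in standard-library Python 3; the `Throttle` class and its `wait` method are hypothetical names, not part of any library.

```python
import threading
import time
from urllib.parse import urlparse


class Throttle:
    """Enforce a minimum delay between requests to the same domain."""

    def __init__(self, delay):
        self.delay = delay            # seconds between requests to one domain
        self.next_allowed = {}        # domain -> earliest time of the next request
        self.lock = threading.Lock()  # protects next_allowed across threads

    def wait(self, url):
        """Block until it is polite to request url, then return."""
        domain = urlparse(url).netloc
        with self.lock:
            now = time.monotonic()
            start = max(now, self.next_allowed.get(domain, now))
            # reserve this time slot so other threads queue up behind it
            self.next_allowed[domain] = start + self.delay
        # sleep outside the lock so requests to other domains are not held up
        if start > now:
            time.sleep(start - now)
```

Each worker thread would call `throttle.wait(url)` just before downloading `url`. Requests to different domains are not delayed by each other, so a crawl spread over many sites still benefits from threading.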


This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


About the Author

Richard Penman

Australia

Last Updated 13 Jan 2013
Article Copyright 2013 by Richard Penman