How to use proxies

Richard Penman, 9 Jan 2013

Proxies are often necessary when web scraping because some websites limit how many pages each visitor, typically identified by IP address, can download. When you route requests through proxies, they appear to come from multiple users, so the chance of being blocked is reduced.

Most people first try collecting proxies from one of the various free lists and then get frustrated when the proxies stop working. Free lists are unreliable because so many people use them. If scraping is more than a hobby, it is a better use of your time to rent proxies from a provider such as packetflip, USA proxies, or proxybonanza.

Each proxy will have the format login:password@IP:port. The login details and port are optional. Here are some examples:

  • bob:eakej34@66.12.121.140:8000
  • 219.66.12.12
  • 219.66.12.14:8080
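
To make the format concrete, here is a minimal sketch that splits a proxy string into its parts. The names `Proxy` and `parse_proxy` are hypothetical helpers introduced for illustration (they are not part of the webscraping library), the sketch is written for Python 3, and it assumes IPv4 addresses like the examples above.

```python
from typing import NamedTuple, Optional


class Proxy(NamedTuple):
    """Hypothetical container for the parts of a proxy string."""
    login: Optional[str]
    password: Optional[str]
    host: str
    port: Optional[int]


def parse_proxy(proxy: str) -> Proxy:
    """Split 'login:password@IP:port' into parts; login details and port may be absent."""
    login = password = None
    # split on the last '@' so an '@' inside the password does not break parsing
    credentials, at, address = proxy.rpartition('@')
    if at:
        login, _, password = credentials.partition(':')
    # assumes an IPv4 address or hostname, so the only ':' separates host and port
    host, colon, port = address.partition(':')
    return Proxy(login, password or None, host, int(port) if colon else None)


for example in ['bob:eakej34@66.12.121.140:8000', '219.66.12.12', '219.66.12.14:8080']:
    print(parse_proxy(example))
```

A missing login, password, or port comes back as `None`, which mirrors the optional parts of the format.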

With the webscraping library you can then use the proxies like this:

from webscraping import download

# proxies: list of proxy strings in the format above; user_agent and url: strings you supply
D = download.Download(proxies=proxies, user_agent=user_agent)
html = D.get(url)

The above script will download content through a random proxy from the given list. Here is a standalone version, written for Python 2 since it relies on urllib2:

import urllib  # needed for urllib.urlencode below
import urllib2
import gzip
import random
import StringIO
  
def fetch(url, data=None, proxies=None, user_agent='Mozilla/5.0'):
    """Download the content at this url and return the content, or None on error"""
    opener = urllib2.build_opener()  
    if proxies:  
        # download through a random proxy from the list  
        proxy = random.choice(proxies)  
        if url.lower().startswith('https://'):  
            opener.add_handler(urllib2.ProxyHandler({'https' : proxy}))  
        else:  
            opener.add_handler(urllib2.ProxyHandler({'http' : proxy}))  
      
    # submit these headers with the request  
    headers =  {'User-agent': user_agent, 'Accept-encoding': 'gzip', 'Referer': url}  
      
    if isinstance(data, dict):  
        # need to post this data  
        data = urllib.urlencode(data)  
    try:  
        response = opener.open(urllib2.Request(url, data, headers))  
        content = response.read()  
        if response.headers.get('content-encoding') == 'gzip':  
            # data came back gzip-compressed so decompress it            
            content = gzip.GzipFile(fileobj=StringIO.StringIO(content)).read()  
    except Exception, e:  
        # so many kinds of errors are possible here so just catch them all  
        print 'Error: %s %s' % (url, e)  
        content = None  
    return content
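
The standalone version above targets Python 2. For readers on Python 3, the following is a rough port of the same idea using `urllib.request`; treat it as a sketch rather than the author's code. The helpers `proxy_handler_for` and `decode_body` and the 30-second timeout are assumptions added for illustration. The handler is passed to `build_opener` directly so that it takes the place of the default `ProxyHandler`.

```python
import gzip
import random
import urllib.parse
import urllib.request


def proxy_handler_for(url, proxy):
    """Return a ProxyHandler that routes this url's scheme through the proxy."""
    scheme = 'https' if url.lower().startswith('https://') else 'http'
    return urllib.request.ProxyHandler({scheme: proxy})


def decode_body(content, content_encoding):
    """Decompress the body when the server sent it gzip-compressed."""
    if content_encoding == 'gzip':
        return gzip.decompress(content)
    return content


def fetch(url, data=None, proxies=None, user_agent='Mozilla/5.0'):
    """Download the content at this url and return it as bytes, or None on error."""
    handlers = []
    if proxies:
        # download through a random proxy from the list
        handlers.append(proxy_handler_for(url, random.choice(proxies)))
    opener = urllib.request.build_opener(*handlers)

    # submit these headers with the request
    headers = {'User-agent': user_agent, 'Accept-encoding': 'gzip', 'Referer': url}

    if isinstance(data, dict):
        # need to post this data; Python 3 expects the body as bytes
        data = urllib.parse.urlencode(data).encode('utf-8')
    try:
        request = urllib.request.Request(url, data, headers)
        # the timeout is an assumption: dead proxies can otherwise hang for a long time
        with opener.open(request, timeout=30) as response:
            content = decode_body(response.read(),
                                  response.headers.get('Content-Encoding'))
    except Exception as e:
        # mirrors the original: many kinds of errors are possible, so catch them all
        print('Error: %s %s' % (url, e))
        content = None
    return content
```

As in the original, a failed download is reported and returns `None`, so a caller can retry with another call, which picks a new random proxy.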

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Richard Penman

Australia

Article Copyright 2013 by Richard Penman