Using Google Translate to crawl a website

13 Jan 2013, CPOL

I wrote previously about using Google Cache to crawl a website. Sometimes Google Cache does not include a webpage, so it helps to have backup options.

One option is Google Translate, which lets you translate a webpage into another language. If you set the source language to one you know the page is not written in (e.g. Dutch), no translation takes place and you simply get back the original content.
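To make the idea concrete, here is a minimal sketch of building such a request URL. The endpoint and the `sl`, `tl`, and `u` query parameters are assumptions based on the publicly visible Google Translate URL format, not taken from the webscraping library, and `gtrans_url` is a hypothetical helper name. Google may change or retire this format at any time.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2


def gtrans_url(url, source_lang='nl', target_lang='en'):
    """Build a Google Translate URL that wraps the given page.

    The source language defaults to Dutch ('nl'), deliberately wrong for
    an English page, so the content should come back untranslated.
    The endpoint and parameter names are assumptions (see lead-in).
    """
    query = urlencode([
        ('sl', source_lang),  # claimed source language
        ('tl', target_lang),  # target language
        ('u', url),           # page to fetch, percent-encoded by urlencode
    ])
    return 'http://translate.google.com/translate?' + query


print(gtrans_url('http://example.com/faq'))
```

The sketch only constructs the URL. Fetching it and extracting the original page from the response is a separate step.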

I added functions to the webscraping library to download a URL via Google Cache and via Google Translate. Here is an example:

from webscraping import download, xpath

D = download.Download()
url = ''
html1 = D.get(url)  # download directly
html2 = D.gcache_get(url)  # download via Google Cache
html3 = D.gtrans_get(url)  # download via Google Translate
# print the title extracted from each copy of the page
for html in (html1, html2, html3):
    print(xpath.get(html, '//title'))

This example downloads the same webpage directly, via Google Cache, and via Google Translate. Then it parses the title to show the same webpage has been downloaded. The output when run is:

Frequently asked questions | webscraping
Frequently asked questions | webscraping
Frequently asked questions | webscraping

The same title was extracted from each source, which shows that the correct result was downloaded from Google Cache and Google Translate.
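Since the point of these alternative sources is to have a backup when one of them fails, the three downloads can also be tried in order. The sketch below is one possible way to do that. `first_success` is a hypothetical helper, not part of the webscraping library, and the stand-in fetchers exist only so the example runs without network access.

```python
def first_success(url, fetchers):
    """Try each (name, fetch) pair in order.

    Returns (name, html) for the first fetcher that returns non-empty
    content, or (None, None) if every fetcher fails.
    """
    for name, fetch in fetchers:
        try:
            html = fetch(url)
        except Exception:
            # Treat any error from a source as "not available" and move on
            html = None
        if html:
            return name, html
    return None, None


# Stand-in fetchers for illustration only
def direct(url):
    raise IOError('site unreachable')

def cache(url):
    return ''  # page missing from the cache

def translate(url):
    return '<html><title>Example</title></html>'

source, html = first_success('http://example.com/', [
    ('direct', direct),
    ('cache', cache),
    ('translate', translate),
])
print(source)
```

With the library from the example above, the fetcher list would presumably be `[('direct', D.get), ('cache', D.gcache_get), ('translate', D.gtrans_get)]`.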


This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


About the Author

Richard Penman

Australia

Comments and Discussions

My vote of 5
rahman_tanzilur01, 13-Jan-13 10:57
Article Copyright 2013 by Richard Penman