Scraping JavaScript webpages with Webkit

Richard Penman, 15 Jan 2013 (CPOL)

In the previous post I covered how to tackle JavaScript-based websites with Chickenfoot. Chickenfoot is great but not perfect because it:

  1. requires me to program in JavaScript rather than my beloved Python (with all its great libraries)
  2. is slow because I have to wait for Firefox to render the entire webpage
  3. is somewhat buggy and has a small user/developer community, mostly at MIT

An alternative solution that addresses all these points is WebKit, the open source browser engine used most famously in Apple's Safari browser. WebKit has now been ported to the Qt framework and can be used through Qt's Python bindings, such as PyQt4.

Here is a simple class that renders a webpage (including executing any JavaScript) and then exposes the final HTML as a string, ready to be saved to a file:

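import sys
from PyQt4.QtCore import QUrl
from PyQt4.QtGui import QApplication
from PyQt4.QtWebKit import QWebPage

class Render(QWebPage):
  """Load a URL in WebKit and block until the page has finished loading."""
  def __init__(self, url):
    # A QApplication must exist before any other Qt object is created
    self.app = QApplication(sys.argv)
    QWebPage.__init__(self)
    self.loadFinished.connect(self._loadFinished)
    self.mainFrame().load(QUrl(url))
    # Run the Qt event loop until _loadFinished() calls quit()
    self.app.exec_()

  def _loadFinished(self, result):
    # result is False if the load failed; the frame is stored either way
    self.frame = self.mainFrame()
    self.app.quit()

url = 'http://webscraping.com'
r = Render(url)
# HTML of the page after its JavaScript has run
html = r.frame.toHtml()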
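
To keep the rendered page around, the HTML string can be written to disk. The following is a minimal sketch, assuming Python 3, where toHtml() returns an ordinary str. The helper name save_html, the file name page.html and the sample string standing in for r.frame.toHtml() are all hypothetical and not part of the original code:

```python
import io
import os
import tempfile

def save_html(html, path):
    """Write an HTML string to path, encoded as UTF-8."""
    with io.open(path, 'w', encoding='utf-8') as f:
        f.write(html)

# Stand-in for the string returned by r.frame.toHtml()
sample_html = u'<html><body><p>caf\u00e9</p></body></html>'

# Write to a throwaway directory so the sketch has no side effects elsewhere
out_path = os.path.join(tempfile.mkdtemp(), 'page.html')
save_html(sample_html, out_path)

# Read the file back to confirm the content survived the round trip
with io.open(out_path, encoding='utf-8') as f:
    round_trip = f.read()
```

Writing with an explicit UTF-8 encoding avoids depending on the platform default, which matters as soon as the page contains non-ASCII text.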

I can then analyze the resulting HTML with my standard Python tools, such as the webscraping module.
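
As one illustration of this analysis step, here is a rough sketch that pulls the link targets out of an HTML string. It uses the standard library's html.parser (Python 3) rather than the webscraping module, and the LinkExtractor class and the sample HTML are hypothetical stand-ins for the real rendered page:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag that has one."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)

# Stand-in for the HTML returned by the Render class
sample_html = (
    '<html><body>'
    '<a href="/blog">Blog</a> '
    '<a href="http://example.com/">Example</a> '
    '<a name="top">anchor without href</a>'
    '</body></html>'
)

parser = LinkExtractor()
parser.feed(sample_html)
links = parser.links
```

Because the HTML was captured after the JavaScript ran, links that the page injects dynamically would show up here as well, which a plain download of the page source would miss.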

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Richard Penman

Australia

Last Updated 15 Jan 2013
Article Copyright 2013 by Richard Penman