Click here to Skip to main content
Click here to Skip to main content

Scraping multiple JavaScript webpages with webkit

By , 9 Jan 2013
 

I made an earlier post about using webkit to process the JavaScript in a webpage so you can access the resulting HTML. A few people asked how to apply this to multiple webpages, so here it is:

import sys
from PyQt4.QtCore import *
from PyQt4.QtGui import *
from PyQt4.QtWebKit import *

class Render(QWebPage):  
  def __init__(self, urls):  
    self.app = QApplication(sys.argv)  
    QWebPage.__init__(self)  
    self.loadFinished.connect(self._loadFinished)  
    self.urls = urls  
    self.data = {} # store downloaded HTML in a dict  
    self.crawl()  
    self.app.exec_()  
      
  def crawl(self):  
    if self.urls:  
      url = self.urls.pop(0)  
      print 'Downloading', url  
      self.mainFrame().load(QUrl(url))  
    else:  
      self.app.quit()  
        
  def _loadFinished(self, result):  
    frame = self.mainFrame()  
    url = str(frame.url().toString())  
    html = frame.toHtml()  
    self.data[url] = html  
    self.crawl()  
  
urls = ['http://webscraping.com', 'http://webscraping.com/blog']  
r = Render(urls)  
print r.data.keys()

This is a simple solution that will keep all HTML in memory, which is not practical for large crawls. For large crawls you should save the results to disk. I use the pdict module for this.

Source code is available at my bitbucket account.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Richard Penman
Australia Australia
Member
No Biography provided

Sign Up to vote   Poor Excellent
Add a reason or comment to your vote: x
Votes of 3 or less require a comment

Comments and Discussions

 
Hint: For improved responsiveness ensure Javascript is enabled and choose 'Normal' from the Layout dropdown and hit 'Update'.
You must Sign In to use this message board.
Search this forum  
    Spacing  Noise  Layout  Per page   
-- There are no messages in this forum --
Permalink | Advertise | Privacy | Mobile
Web04 | 2.6.130523.1 | Last Updated 9 Jan 2013
Article Copyright 2013 by Richard Penman
Everything else Copyright © CodeProject, 1999-2013
Terms of Use
Layout: fixed | fluid