
Scraping JavaScript based web pages with Chickenfoot

16 Jan 2013

The data from most webpages can be scraped by simply downloading the HTML and then parsing out the desired content. However, some webpages load their content dynamically with JavaScript after the page loads, so the desired data is not found in the original HTML. This is usually done for legitimate reasons, such as making the page load faster, but in some cases it is designed solely to inhibit scrapers.
This can make scraping a little tougher, but not impossible.

The easiest case is when the content is stored in JavaScript structures that are inserted into the DOM at page load. The content is still embedded in the HTML, but it needs to be scraped from the JavaScript code rather than from the HTML tags.
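For this case, a regular expression over the raw HTML is often enough. Here is a minimal sketch in Python: the page, the `products` variable name, and the JSON structure are all hypothetical, but the pattern of locating a JavaScript assignment and parsing its value applies generally.

```python
import json
import re

def extract_js_data(html, var_name):
    """Extract a JSON object assigned to a JavaScript variable in raw HTML."""
    # match e.g.:  var products = {...};
    pattern = r'var\s+' + re.escape(var_name) + r'\s*=\s*(\{.*?\});'
    match = re.search(pattern, html, re.DOTALL)
    if match:
        return json.loads(match.group(1))
    return None

html = '<script>var products = {"name": "Widget", "price": 9.99};</script>'
print(extract_js_data(html, 'products'))  # {'name': 'Widget', 'price': 9.99}
```

This only works when the embedded structure is valid JSON; JavaScript object literals with unquoted keys or trailing commas would need a more tolerant parser.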

A trickier case is when websites encode their content in the HTML and then use JavaScript to decode it on page load. It is possible to convert such functions into Python and run them over the downloaded HTML, but often an easier and quicker alternative is to execute the original JavaScript. One tool for this is the Firefox Chickenfoot extension. Chickenfoot provides a Firefox panel where you can execute arbitrary JavaScript code within a webpage and across multiple webpages. It also comes with a number of high-level functions to make interaction and navigation easier.
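To illustrate the "convert the decoder into Python" approach, here is a hedged sketch. It assumes a hypothetical site that hides email addresses with JavaScript's `atob()` (Base64) and decodes them on page load; the Python mirror of that decoder is just as short.

```python
import base64

def decode_content(encoded):
    """Python equivalent of a hypothetical atob()-based JavaScript decoder."""
    return base64.b64decode(encoded).decode('utf-8')

# simulate what the site would embed in its HTML
encoded = base64.b64encode(b'contact@example.com').decode('ascii')
print(decode_content(encoded))  # contact@example.com
```

Real sites use more elaborate schemes, and reimplementing a long or obfuscated decoder by hand is where this approach breaks down and executing the original JavaScript becomes the quicker option.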

To get a feel for Chickenfoot, here is an example that crawls a website:

// crawl the given website URL recursively to the given depth
function crawl(website, max_depth, links) {
  if (!links) {
    // first call: initialize the set of visited links and load the start page
    links = {};
    go(website);
    links[website] = 1;
  }

  // TODO: insert code to act on the current webpage here

  if (max_depth > 0) {
    // iterate over the links on the current page
    for (var link = find("link"); link.hasMatch; link = link.next) {
      var url = link.element.href;
      if (!links[url] && url.indexOf(website) == 0) {
        // unvisited link on the same domain
        go(url);
        links[url] = 1;
        crawl(website, max_depth - 1, links);
      }
    }
  }
  // return to the previous page and wait for it to finish loading
  back(); wait();
}

This is part of a script I built on my Linux machine for a client on Windows, and it worked fine for both of us. To find out more about Chickenfoot, check out their video.

Chickenfoot is a useful weapon in my web scraping arsenal, particularly for quick jobs with a low to medium amount of data. For larger websites there is a more suitable alternative, which I will cover in the next post.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Richard Penman

Australia
