Screen scraping using YQL and AJAX





A simple application to scrape HTML data in JSON format.
Introduction
Web scraping is often seen in a negative light in web development, but in some cases it is very important. jQuery is greatly helpful for cross-domain scraping, and plenty of examples are available. Yahoo Query Language (YQL) has also had a strong influence on web scraping. This article provides a basic overview of web scraping using jQuery and YQL. To present the data I have also used Mustache as an HTML template, so I will give a short overview of Mustache.js here as well.
Background
If you are familiar with HTML you will understand this article easily. Basic knowledge of JavaScript and AJAX will do. Don’t worry, I have added some useful links to make things clear.
YQL Overview
YQL has a SQL-like syntax and can be used to work with different APIs. YQL is popular because of its fast response. A detailed overview can be found here
To get HTML off a page, YQL has different methods called ‘Tables’. The most commonly highlighted ones are given below with examples…
In my simple project I’ll be using the html table.
Querying With YQL Console
Now, let’s become familiar with the YQL Console. In the following example we are going to scrape data from gsmarena. We’ll search gsmarena by phone manufacturer’s name and scrape the result. So, let’s go to gsmarena and look at the data we are going to fetch, as follows:
- Search using any phone manufacturer’s name, for example, ‘nokia’.
- We now have the search result page. Copy the page’s URL (http://www.gsmarena.com/results.php3?sQuickSearch=yes&sName=nokia).
- Now, let’s go to the YQL Console and run a query to get the result for that URL. (You need to be logged in to Yahoo!)
- Go to the YQL Console.
- In the text box write the query: select * from html where url="http://www.gsmarena.com/results.php3?sQuickSearch=yes&sName=nokia"
- To get the data in JSON format, select JSON and press the ‘TEST’ button.
- That’s all, the result is in the result box. Don’t worry if you do not understand the result; it has scraped the full page (http://www.gsmarena.com/results.php3?sQuickSearch=yes&sName=nokia) in JSON format.
- But we do not need the full page; we only want the results generated by the search. To do this, we need the XPath of the result div.
- To get the XPath, right-click on the page (http://www.gsmarena.com/results.php3?sQuickSearch=yes&sName=nokia) and copy the XPath as shown in the picture.
- Now go to the YQL Console and run the query again with the XPath added: select * from html where url="http://www.gsmarena.com/results.php3?sQuickSearch=yes&sName=nokia" and xpath='/html/body/div/div[2]/div[2]/div/div[2]/div[2]'. Now the result contains only the search result’s div.
- Lastly, to access the YQL API using AJAX we need the ‘REST QUERY’. Simply copy the ‘REST QUERY’.
To know more about XPath you can follow this w3schools tutorial.
If we look at the ‘REST QUERY’ closely we will find three parts:
http://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20html%20where%20url%3D%22http%3A%2F%2Fwww.
gsmarena.com%2Fresults.php3%3FsQuickSearch%3Dyes%26sName%3Dnokia%22%20and%20xpath%3D'%2Fhtml%2Fbody%2F
div%2Fdiv%5B2%5D%2Fdiv%5B2%5D%2Fdiv%2Fdiv%5B2%5D%2Fdiv%5B2%5D'&format=json&diagnostics=true&callback=
• The URL to access the YQL API (http://query.yahooapis.com/v1/public/yql?)
• Our full query. It is URL-encoded here because a URI must not contain whitespace, and characters such as ", /, ? and & would otherwise be read as part of the outer URL.
(q=select%20*%20from%20html%20where%20url%3D%22http%3A%2F%2Fwww.gsmarena.com%2Fresults.php3%3
FsQuickSearch%3Dyes%26sName%3Dnokia%22%20and%20xpath%3D'%2Fhtml%2Fbody%2Fdiv%2Fdiv%5B2%5D%2Fdiv%5B2%5D%2Fdiv%2Fdiv%5B2%5D%2Fdiv%5B2%5D')
• The data format and other options (&format=json&diagnostics=true&callback=)
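The three parts described above can be checked with a few lines of JavaScript. This is only a sketch: splitRestQuery is a hypothetical helper written for this illustration, not part of jQuery or YQL, and it assumes that no parameter value contains an unencoded & (which holds for the REST query shown above).

```javascript
// The REST query copied from the YQL Console, joined back into one string.
var restQuery =
  "http://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20html%20where%20url%3D%22http%3A%2F%2Fwww." +
  "gsmarena.com%2Fresults.php3%3FsQuickSearch%3Dyes%26sName%3Dnokia%22%20and%20xpath%3D'%2Fhtml%2Fbody%2F" +
  "div%2Fdiv%5B2%5D%2Fdiv%5B2%5D%2Fdiv%2Fdiv%5B2%5D%2Fdiv%5B2%5D'&format=json&diagnostics=true&callback=";

// Hypothetical helper: split a REST query into endpoint, decoded query and options.
function splitRestQuery(rest) {
  var questionMark = rest.indexOf("?");
  var endpoint = rest.slice(0, questionMark + 1);
  var params = {};
  rest.slice(questionMark + 1).split("&").forEach(function (pair) {
    var eq = pair.indexOf("=");
    // decodeURIComponent turns %20, %22, %2F ... back into readable characters
    params[pair.slice(0, eq)] = decodeURIComponent(pair.slice(eq + 1));
  });
  return { endpoint: endpoint, query: params.q, options: params };
}

var parts = splitRestQuery(restQuery);
console.log(parts.endpoint);       // part 1: the YQL API URL
console.log(parts.query);          // part 2: the readable YQL statement
console.log(parts.options.format); // part 3: the output format
```

Going the other way, encodeURIComponent applied to the readable query should give back the encoded q value, which is what the article's own code relies on later.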
Requirements
jQuery: In this project jQuery is used for two purposes…
- To run the AJAX request that gets data from YQL.
- To display data using jQuery Cycle. To get this feature, just add this script:
<script src="//cdnjs.cloudflare.com/ajax/libs/jquery.cycle/2.9999.8/jquery.cycle.all.min.js" type="text/javascript"></script>
For both purposes I am using jQuery v1.9.1. You can download the latest version from jQuery.
- json2.js: It is important to know that the data we get from YQL using AJAX is in JSON format. json2.js is very helpful for handling this JSON data. Download json2.js from here and include it in your project. To know more about JSON you can go here
- jsonpath-0.8.0.js: When we want a specific result from the received JSON data, we need to query among the divs, tables, etc. jsonpath serves this purpose. Get the latest version of jsonpath from here and include it in your project.
- mustache.js: The last requirement for this project is mustache.js. Mustache is a “logic-less” template syntax. It is very helpful for decoupling HTML markup from data. Mustache is implemented in different languages: Ruby, JavaScript, Python, PHP, Perl, Objective-C, Java, .NET, Android, C++, Go, Lua, Scala, etc. Mustache.js is the JavaScript implementation. Get mustache.js from here; this one is a helpful tutorial for Mustache.
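To make the idea of a “logic-less” template concrete, here is a tiny stand-in that only replaces {{name}} style tags with HTML-escaped values. It is not Mustache.js and covers none of its other features (sections, partials and so on); renderSimple, escapeHtml, the template and the field names name and link are all hypothetical and exist only for this illustration.

```javascript
// Escape the characters that are unsafe inside HTML text and attributes.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

// Hypothetical stand-in for a template engine: swaps {{key}} for view[key].
function renderSimple(template, view) {
  return template.replace(/\{\{\s*(\w+)\s*\}\}/g, function (match, key) {
    var value = view[key];
    // Missing values render as an empty string
    return value === undefined || value === null ? "" : escapeHtml(value);
  });
}

// The markup lives in the template; the data lives in a plain object.
var phoneTemplate =
  '<div class="phone"><strong>{{name}}</strong> <a href="{{link}}">details</a></div>';
var phoneHtml = renderSimple(phoneTemplate, {
  name: "Nokia <Lumia>",
  link: "nokia_lumia.php"
});
console.log(phoneHtml);
```

With the real library, the corresponding call in this article is Mustache.to_html(template, view), which also escapes values placed in double-brace tags.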
Using the code
Now our environment is ready to work with. If you face any trouble, arrange the scripts as follows:
<script src="scripts/jquery-1.9.1.min.js" type="text/javascript"></script>
<!--jquery cycle library to slide the results-->
<script src="//cdnjs.cloudflare.com/ajax/libs/jquery.cycle/2.9999.8/jquery.cycle.all.min.js" type="text/javascript"></script>
<!--mustache.js template for javascript -->
<script src="scripts/mustache.js" type="text/javascript"></script>
<!--json2.js is to work with json data -->
<script src="scripts/json2.js" type="text/javascript"></script>
<!--jsonpath is helpful for querying json data -->
<script src="scripts/jsonpath-0.8.0.js" type="text/javascript"></script>
So, our task is simple. We’ll take the manufacturer’s name as user input, scrape the result from www.gsmarena.com using the YQL API, and receive the result by performing an AJAX request. So, let’s start…
- To take the input, we need a text box. A button click event fires a JavaScript function named GetResult(). A div is also used to hold the entire result. So, the HTML markup is as follows:
<body>
    <input id="valueText" type="text" />
    <button type="button" onclick="GetResult()">Get Result</button>
    <div id="speakerbox" style="float:left">
        <a href="#" id="prev_btn">«</a>
        <a href="#" id="next_btn">»</a>
        <div id="carousel"></div>
    </div>
</body>
- The JavaScript GetResult() function fetches the scraped data using AJAX. First, get the user input from the text box:
var item = $('#valueText').val();
Then we simply append the user input to the query:
var query = "SELECT * FROM html WHERE url=" + '"' +
"http://www.gsmarena.com/results.php3?sQuickSearch=yes&sName=" + item + '"' +
" and xpath='/html/body/div/div[2]/div[2]/div/div[2]/div[2]'";
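The request URL also carries a cacheBuster value in its _nocache parameter, used here for simplicity (the definition of cacheBuster is not shown in this listing). The REST query for the AJAX URL is:
var url = 'http://query.yahooapis.com/v1/public/yql?q=' +
    encodeURIComponent(query) + '&format=json&_nocache=' + cacheBuster;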
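The two steps above (building the query from user input and building the request URL) can be sketched as small functions. Everything named here is hypothetical: buildQuery, cacheBucket and buildYqlUrl are illustration helpers, and the time-bucket cache buster is only an assumption about what cacheBuster might hold, since its definition does not appear in the article. The sketch also encodes the user input before placing it in the target URL, which the code above does not do.

```javascript
// XPath of the result div, as copied earlier in the article.
var RESULT_XPATH = "/html/body/div/div[2]/div[2]/div/div[2]/div[2]";

// Hypothetical helper: build the YQL statement for one manufacturer name.
function buildQuery(item) {
  // Encode the input so spaces, quotes or & cannot break the target URL
  // or close the surrounding double-quoted YQL string early.
  var target = "http://www.gsmarena.com/results.php3?sQuickSearch=yes&sName=" +
    encodeURIComponent(item);
  return 'SELECT * FROM html WHERE url="' + target + '"' +
    " and xpath='" + RESULT_XPATH + "'";
}

// Assumed cache-buster scheme: a number that changes once per time window,
// so identical requests inside the window share one URL.
function cacheBucket(nowMs, windowMinutes) {
  return Math.floor(nowMs / (windowMinutes * 60 * 1000));
}

// Hypothetical helper: assemble the full request URL.
function buildYqlUrl(item, nowMs) {
  return "http://query.yahooapis.com/v1/public/yql?q=" +
    encodeURIComponent(buildQuery(item)) +
    "&format=json&_nocache=" + cacheBucket(nowMs, 20);
}

console.log(buildQuery("nokia"));
console.log(buildYqlUrl("nokia", Date.now()));
```

A 20-minute window is an arbitrary choice for the sketch; a shorter window gives fresher results at the cost of more requests.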
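The request is sent as JSONP, with wxCallback as the callback that receives the data:
// Callback that YQL's JSONP response will call with the result
window['wxCallback'] = function (data) {
    console.log(data);
    ParseData(data); // To show the result
};
$.ajax({
    url: url,
    dataType: 'jsonp',
    cache: true,
    jsonpCallback: 'wxCallback'
});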
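How a JSONP response reaches the callback can be shown without any network access. The following is a simulation only: pageGlobals, responseBody and the payload are made up for the illustration, and a real browser would load the response through a script element rather than new Function.

```javascript
// Simulated JSONP exchange; nothing here touches the network.
var received = null;

// The page registers a callback under a fixed, known name.
var pageGlobals = {
  wxCallback: function (data) {
    received = data;
  }
};

// A JSONP endpoint answers with JavaScript source that calls the callback.
// The payload below is a made-up example, not real YQL output.
var responseBody =
  'wxCallback({"query":{"count":1,"results":{"div":{"ul":{"li":[{"p":"Nokia"}]}}}}});';

// The browser would execute the response as a script; here we evaluate it
// with the callback in scope to show the same effect.
new Function("wxCallback", responseBody)(pageGlobals.wxCallback);

console.log(received.query.count);
```

By default jQuery generates a new callback name and adds a timestamp parameter to each JSONP request; setting jsonpCallback to a fixed name together with cache: true keeps the URL stable, so repeated requests can be served from cache.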
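When the response arrives, the callback calls the ParseData(data) function with the data as its parameter. This function simply parses the data and shows it inside the carousel using a Mustache template, which is read from the element with the id speakerstpl.
function ParseData(data) {
    // Select every 'li' of the result list; jsonPath returns false when nothing matches
    var result = jsonPath(data, "$.query.results[*].ul.li[*]") || [];
    $('#carousel').empty();
    // Read the template once instead of on every iteration
    var template = $('#speakerstpl').html();
    var html = "";
    for (var i = 0; i < result.length; i++) {
        html += Mustache.to_html(template, result[i]);
    }
    $('#carousel').append(html);
    // Slide through the rendered results with jQuery Cycle
    $('#carousel').cycle({
        fx: 'fade',
        pause: 1,
        next: '#next_btn',
        prev: '#prev_btn',
        speed: 500,
        timeout: 10000
    });
}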
Here, jsonPath finds the specific ‘li’ elements that contain the data, using the provided query.
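For readers who want to see what that path expression selects, here is a plain-JavaScript sketch of the same walk (query, then results, then each child, then ul, then each li). It is a swapped-in alternative, not the jsonPath library, and extractListItems and toArray are hypothetical names. It also assumes two things the article does not confirm: that results can be null when nothing matches, and that a single li may arrive as an object instead of an array. The sample data is invented.

```javascript
// Wrap a single value in an array; treat null/undefined as "no items".
function toArray(value) {
  if (value === undefined || value === null) {
    return [];
  }
  return Array.isArray(value) ? value : [value];
}

// Hypothetical helper: collect every li under query.results.<child>.ul
function extractListItems(data) {
  var results = data && data.query ? data.query.results : null;
  if (!results) {
    return []; // nothing matched the XPath, or the response was malformed
  }
  var items = [];
  Object.keys(results).forEach(function (key) {
    toArray(results[key]).forEach(function (node) {
      if (node && node.ul) {
        items = items.concat(toArray(node.ul.li));
      }
    });
  });
  return items;
}

// Invented response in the general shape the path expression expects.
var sampleResponse = {
  query: {
    count: 1,
    results: { div: { ul: { li: [{ p: "Nokia A" }, { p: "Nokia B" }] } } }
  }
};
console.log(extractListItems(sampleResponse).length);
```

Returning an empty array for every failure case means the rendering loop needs no special handling when a search finds nothing.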
Points of Interest
No doubt, web scraping is an interesting job to do. It is much more interesting with YQL, I think.
History
- 28th September, 2013.