
Create Your Own Web Scraper Using node.js and Get Data in JSON Format

13 Dec 2015, CPOL, 3 min read

Want to make your own scraper to scrape any data from any website and return it in JSON format so you can use it anywhere you like? If yes, then you are in the right place.

In this tip, I will show you how to scrape a website for the data you want using node.js and return it in JSON format, which you can then use, for example, to build an app that runs on live data from the internet.

I will be using Windows 10 x64 and Visual Studio 2015 for this tip, and will scrape a news website.

  • First of all, set up the IDE: go to https://nodejs.org/en/download/ and download the node.js pre-built installer. For me, that is the Windows 64-bit installer.

  • After installing it, open Visual Studio and create a new project: Templates > JavaScript > Node.js > Basic Node.js Express 4 Application.

  • Now add two npm packages to the project: ‘request’ and ‘cheerio’. (From a command prompt in the project folder, npm install request cheerio does the same.)

  • Uninstall ‘jade’ by right-clicking it, as we don’t need it here: I host my JSON on an Azure cloud service, where jade throws an exception. If you consume the JSON directly in your application or host it with another service, you don’t have to uninstall jade.
  • Now go to app.js and comment out lines 14 and 15, the ones that configure the views, as we are not using ‘Views’.

  • Also comment out app.use('/', routes);
  • Change app.use('/users', users); to app.use('/', users);
  • Now go to users.js, where the main work happens. First of all, require ‘cheerio’ and ‘request’ at the top of the file: var cheerio = require('cheerio'); and var request = require('request');

  • Create a variable that holds the URL of the page to scrape:
    JavaScript
    var url = "http://www.thenews.com.pk/CitySubIndex.aspx?ID=14";
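  • Modify the router.get() function as follows:
    JavaScript
    router.get('/', function (req, res) {
        request(url, function (error, response, body) {
            // Only scrape when the page was fetched successfully.
            if (!error && response.statusCode === 200) {
                var data = scrapeDataFromHtml(body);
                return res.send(data);
            }
            // Otherwise log the problem and still answer the client,
            // so the request does not hang.
            console.log(error || 'Unexpected status code: ' + response.statusCode);
            res.status(502).send({ error: 'Could not fetch the source page' });
        });
    });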

  • Here comes the main and most difficult part: writing the logic that scrapes the website. You have to customize this function for your website and the data you want to fetch. Let’s open the website in a browser and work out the logic.

  • I want to scrape the following data: each news headline, its description, and the link to the full story. This data changes dynamically, and I want to fetch the latest version.

  • To fetch this data, I have to study the page’s DOM so I can write jQuery-style selectors for it.

  • I made a DOM tree so I can write the logic to traverse it easily.

  • The nodes marked in red are the ones I have to reach in a loop to access the data from the website.

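  • I will write a function named scrapeDataFromHtml as follows:
    JavaScript
    var scrapeDataFromHtml = function (html) {
        var data = {};               // collected items, keyed 1, 2, 3, ...
        var $ = cheerio.load(html);  // parse the page into a jQuery-like API
        var j = 1;
        // Each news item sits in its own div with this class.
        $('div.DetailPageIndexBelowContainer').each(function () {
            var a = $(this);
            // href attribute of the first grandchild (link to the full story)
            var fullNewsLink = a.children().children().attr("href");
            // text of the first child (the headline)
            var headline = a.children().first().text().trim();
            // text of the last great-grandchild (the description)
            var description = a.children().children().children().last().text();
            var metadata = {
                headline: headline,
                description: description,
                fullNewsLink: fullNewsLink
            };
            data[j] = metadata;
            j++;
        });
        return data;
    };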
  • This function selects every ‘div’ with the class ‘DetailPageIndexBelowContainer’ and walks its children to read the ‘fullNewsLink’, ‘headline’ and ‘description’. It stores these values in an object called ‘metadata’, then adds that object to a second object called ‘data’ under the next numeric key on each iteration, so in the end ‘data’ can be returned and sent as JSON. If you only want one thing from a website, you don’t need the loop or the second object: traverse directly to the element and return the single object.
  • Now run it and check the output.
  • And yes! It runs perfectly and returns the required data in JSON format.
  • PS: If the site I am using as an example removes the page or changes its layout, CSS files or class names, we will no longer get the desired result, and you will have to write new logic. The approach explained here, studying the DOM tree and traversing it, still applies to any website.
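
One way to notice such a layout change early is to check the scraped object before sending it. The following is only a minimal sketch, not part of the original tip: the helper name looksLikeValidScrape and the sample values are hypothetical, and it assumes the output shape produced by scrapeDataFromHtml above (an object keyed 1, 2, 3, ... whose items carry headline, description and fullNewsLink).

```javascript
// Hypothetical helper (not part of the original tip): decides whether the
// object returned by scrapeDataFromHtml looks like a successful scrape.
function looksLikeValidScrape(data) {
    var keys = Object.keys(data);
    if (keys.length === 0) {
        return false; // the selector matched nothing
    }
    // Every item needs a non-empty headline and a link.
    return keys.every(function (key) {
        var item = data[key];
        return Boolean(item && item.headline && item.fullNewsLink);
    });
}

// Sample inputs shaped like the scraper's output (all values are made up).
var good = {
    1: { headline: 'Sample headline', description: 'Sample text', fullNewsLink: '/story/1' }
};
var afterLayoutChange = {}; // nothing matched the class selector
var brokenTraversal = {
    1: { headline: '', description: '', fullNewsLink: undefined }
};

console.log(looksLikeValidScrape(good));              // true
console.log(looksLikeValidScrape(afterLayoutChange)); // false
console.log(looksLikeValidScrape(brokenTraversal));   // false
```

In the route handler, a check like this could run right after scrapeDataFromHtml(body), answering with an error status instead of an empty object when it returns false.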

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


