Web Browser Automation with Selenium using Node.js

Sam__Khan

Apr 29, 2018

CPOL

Automated Craigslist parsing with Selenium using Node.js

Download source code - 4.3 KB

Introduction

Selenium is a suite of tools that enables the automation of web browsers across multiple platforms. It is widely used in the automated testing of websites/webapps, but its usage is not limited to testing: other frequent, boring, repetitive and time-consuming web activities can and should also be automated.

This is a cut-to-the-chase post on how to use one of Selenium's components, i.e., WebDriver to automate the given use case. Pray continue.

Use Case

Get for-sale Honda Civic ads posted on LA’s Craigslist and furnish the related information in a spreadsheet:

Navigate to the Craigslist Los Angeles page (https://losangeles.craigslist.org/)
Click on the “cars+trucks” link, under the “For sale” section
On the next page, click on the “BY-OWNER ONLY” link
On the next page, in the “MAKE AND MODEL” textbox, enter “Honda civic”; a link will appear, click on it
- The main search page will display 120 ads.
- Go over top 5 of them and fetch these fields:
  - Title
  - Transmission
  - Fuel
  - Odometer
  - Ad link
- Furnish the fetched information in a spreadsheet (the number of ads processed can be changed by slightly tweaking the code; the code comments point out the location where this number can be changed)
- Once the top 5 ads are processed, save the spreadsheet

Technologies Used

Selenium Webdriver: It helps in the automation of browsers (Chrome, Firefox, Internet Explorer, Safari, etc.). It drives the browser natively as the user would on his/her own system. For this implementation, I have chosen the Firefox (geckodriver) webdriver.
Node.js: The programming language of choice here is JavaScript and the runtime is Node.js.
Exceljs: Read/Write/Create/Manipulate Excel spreadsheets using this utility.

Setting Up

Installing geckodriver:
```
$ npm install -g geckodriver
```
package.json: It already has exceljs and selenium-webdriver specified as dependencies
Installing the dependencies listed in package.json:
```
$ npm install
```
To run:
```
$ node app.js
```

Code Overview

The initialization block does the usual setup: it creates the Selenium objects (webdriver, By, until, firefox, firefoxOptions and the driver) along with the excel object (by requiring the 'exceljs' module).

/*
    Initializing and building the selenium webdriver with firefox options
    along with the exceljs object that will later be used to create the 
    spreadsheet
*/

const webdriver = require('selenium-webdriver'),
    By = webdriver.By,
    until = webdriver.until;

const firefox = require('selenium-webdriver/firefox');

const firefoxOptions = new firefox.Options();

/*
    Path to FF bin
*/
firefoxOptions.setBinary('/Applications/Firefox.app/Contents/MacOS/firefox-bin');
/*
    Uncomment the following line to enable headless browsing
*/
//firefoxOptions.headless();


const driver = new webdriver.Builder()
    .forBrowser('firefox')
    .setFirefoxOptions(firefoxOptions)
    .build();

const excel = require('exceljs')
/*
    End of initialization
*/

Note: To enable headless browsing (no browser window spawning when this option is turned on), uncomment the following line:

/*
    Uncomment the following line to enable headless browsing
*/
//firefoxOptions.headless();
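
If you would rather not edit the source to switch modes, one option is to decide at launch time. The following is a minimal sketch and not part of the original project: the HEADLESS environment variable and the firefoxArgsFromEnv helper are hypothetical names, and the usage shown in the trailing comment assumes a selenium-webdriver version whose firefox.Options provides addArguments (Firefox itself accepts the -headless command-line flag).

```javascript
/*
    Hypothetical helper (not part of the original app): returns the extra
    Firefox command-line arguments implied by the environment.
    HEADLESS is an assumed variable name, chosen for illustration only.
*/
function firefoxArgsFromEnv(env) {
    return env.HEADLESS === '1' ? ['-headless'] : []
}

/*
    Assumed usage inside app.js, right after firefoxOptions is created:

    const extraArgs = firefoxArgsFromEnv(process.env)
    if (extraArgs.length > 0) {
        firefoxOptions.addArguments(...extraArgs)
    }
*/
```

With something like this in place, starting the app as HEADLESS=1 node app.js would run Firefox without a window, while a plain node app.js would keep the visible browser.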

The rest of the code has three async methods in total:

getcarlinks

The following method retrieves the ad links on the first results page (120 of them) and returns them in an array. The logical breakdown of the function is as follows:

LA Craigslist main page ->
cars+trucks ->
By-Owner Only ->
auto make model = "honda civic"
On the main search page, collect all the car ad links and return them in an array

Source code:

/*
    The following method retrieves the ad links on the first page, 120 of them
    LA Craigslist main page -> 
        cars+trucks -> 
        By-Owner Only -> 
        auto make model = "honda civic"
*/
async function getcarlinks() {

    await driver.get('https://losangeles.craigslist.org/')
    await driver.findElement(By.linkText('cars+trucks')).click()
    await driver.findElement(By.linkText('BY-OWNER ONLY')).click()
    await driver.findElement(By.name('auto_make_model')).sendKeys('honda civic')
    /*
        It's important to note here that when the string "honda civic" is 
        entered in the auto_make_model textbox, it turns into a link that needs 
        to be clicked in order for the honda civic specific ads page to load. 
        The following call waits (up to 50 seconds) for that link to appear 
        and then clicks it
    */
    await driver.wait(until.elementLocated(By.linkText('honda civic')), 50000)
        .then(
            elem => elem.click()
        )
    
    /*
        class 'result-info' helps in retrieving all those webelements that contain 
        the car ad link
    */
    let elems = await driver.findElements(By.className('result-info'))
    /*
        further parsing of the webelements to obtain the anchor ('a') tags
    */
    let linktagarr = await Promise.all(elems.map(
        async anelem => await anelem.findElements(By.tagName('a'))
    ))

    /*
        parse the actual links off the anchor tags into an array and return 
        the array
    */
    return await Promise.all(
        linktagarr.map(
            async anhref => await anhref[0].getAttribute('href')
        )
    )
}
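
The last two steps of getcarlinks assume that every 'result-info' element contains at least one anchor; if one did not, anhref[0] would be undefined and the call would throw. A more defensive variant is sketched below. firstHrefs is a hypothetical helper, not part of the original source. It relies only on the findElements and getAttribute methods already used above, and the locator is passed in as a parameter so the helper itself has no direct dependency on Selenium.

```javascript
/*
    Hypothetical helper: for each container element, resolve the href of
    its first anchor. Containers without any anchor, and anchors whose
    href attribute is missing (null), are left out of the result.
    In the app, 'anchorLocator' would be By.tagName('a').
*/
async function firstHrefs(containers, anchorLocator) {
    const hrefs = await Promise.all(containers.map(async container => {
        const anchors = await container.findElements(anchorLocator)
        return anchors.length > 0 ? anchors[0].getAttribute('href') : null
    }))
    return hrefs.filter(href => href !== null)
}

/*
    Assumed usage at the end of getcarlinks:

    let elems = await driver.findElements(By.className('result-info'))
    return firstHrefs(elems, By.tagName('a'))
*/
```

The trade-off is that a malformed result is silently skipped rather than stopping the run, which may or may not be what you want when the page layout changes.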

processlinks

This method:

Is passed the car links array as obtained by the function above (getcarlinks)
Sets up a new workbook
Adds a new worksheet to the workbook, named 'CL Links Sheet'
These columns are added to the worksheet: Sr Num, Title, Transmission, Fuel, Odometer and link to the car's ad page
For each of the first 5 links in the links array (processing all 120 links would take a long time; the limit can be changed to whichever number is deemed feasible), it does the following:
- Increments the sr (Sr Num) field in the spreadsheet
- 'Gets' the given link
- Inside each ad page, looks for these: title, transmission, fuel and odometer (the link itself is also recorded)
- Adds a new row with the fetched/furnished info
After processing the given links, it saves the spreadsheet with this name: output.xlsx
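
The five-ad cap described above comes down to a single number in the code. As a sketch of one way to make it explicit (MAX_ADS and takeFirst are hypothetical names, not found in the original source), the links array could be trimmed before the loop starts instead of being checked on every iteration:

```javascript
/*
    Hypothetical constant: how many ads to process. Raise it (up to the
    120 links found on the first results page) to cover more ads.
*/
const MAX_ADS = 5

/*
    Returns at most 'limit' items from the front of 'links' without
    modifying the original array.
*/
function takeFirst(links, limit = MAX_ADS) {
    return links.slice(0, limit)
}
```

Inside processlinks, looping over takeFirst(links) would then visit only the first five ad pages, and the cap lives in one clearly named place.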

Source code:

/*
    The following method:
    - Is passed a car links array
    - Sets up a new workbook
    - Adds a new worksheet to the workbook, named 'CL Links Sheet'
    - These columns are added to the worksheet: Sr Num, Title, Transmission, 
        Fuel, Odometer and link to the car's ad page
    - for each link in the links array all the way till 5 elements (otherwise 
    the app will take a long time to process all the 120 links, this setting 
    can be changed however to whichever number is deemed feasible), it does the 
    following:
        - Increments the sr (Sr Num) field in the spreadsheet 
        - 'gets' the given link
        - Inside each ad page, look for these: title, transmission, Fuel,  
            Odometer and the link
        - Add a new row with the fetched/furnished info
    - After processing the given links, it saves the spreadsheet with this 
        name: output.xlsx
    
*/

async function processlinks(links) {
    /* 
        init workbook, worksheet and the columns
    */
    const workbook = new excel.Workbook()
    let worksheet = workbook.addWorksheet('CL Links Sheet')
    worksheet.columns = [
        { header: 'Sr Num', key: 'sr', width: 5 },
        { header: 'Title', key: 'title', width: 25 },
        { header: 'Transmission', key: 'transmission', width: 25 },
        { header: 'Fuel', key: 'fuel', width: 25 },
        { header: 'Odometer', key: 'odometer', width: 25 },
        { header: 'link', key: 'link', width: 150 }
    ]

    /*
        end init
    */

    for (let [index, link] of links.entries()) {
        /*
            The following if condition limits the number of links to be processed.
            If removed, the loop will process all 120 links
        */
        if (index < 5) {
            let row = {}
            row.sr = ++index
            row.link = link
            await driver.get(link)
            let elems = await driver.findElements(By.className('attrgroup'))
            /*
                There are only two elements/sections that match the 'attrgroup' 
                className search criterion: the first one contains the title 
                info and the other contains the info related to the remaining 
                fields: transmission, fuel and odometer (the ad's link is 
                already known from the links array).
                As there are always going to be two attrgroup elements, 
                I have directly used the elems indexes rather than applying a 
                loop to iterate over the array
            */
            if (elems.length === 2) {
                /*
                    fetching row.title from elems[0]
                */
                row.title = await elems[0].findElement(By.tagName('span')).getText()
                /*
                    gathering the remaining spans from elems[1] index. These 
                    span tags contain the pieces of information we are looking for
                */
                let otherspans = await elems[1].findElements(By.tagName('span'))

                /*
                    Looping over each span and fetching the values associated with 
                    transmission, fuel, odometer and the link
                */
                for (const aspan of otherspans) {
                    let text = await aspan.getText()
                    /*
                        An example of the given span's text:
                            Odometer: 16000
                        The value is the piece after ':'.
                        The following regex captures that piece; match() 
                        returns null when the text contains no ':', in 
                        which case the span is skipped
                    */
                    let aspanmatch = text.match(/:(.*)/)
                    if (aspanmatch === null) {
                        continue
                    }
                    let aspanval = aspanmatch[1].trim()
                    if (text.toUpperCase().includes('TRANSMISSION')) {
                        row.transmission = aspanval
                    }
                    else if (text.toUpperCase().includes('FUEL')) {
                        row.fuel = aspanval
                    }
                    else if (text.toUpperCase().includes('ODOMETER')) {
                        row.odometer = aspanval
                    }
                }
            }
            /*
                The given row is now furnished. It's time to add it to the 
                worksheet
            */
            worksheet.addRow(row).commit()
        }
    }
    /*
        All the rows in the worksheet are now furnished. Save the workbook now
    */
    await workbook.xlsx.writeFile('output.xlsx')
}
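
The label/value handling inside the loop can also be pulled out into small pure functions, which makes it easy to try against sample strings without opening a browser. The sketch below is an illustration under stated assumptions: parseAttribute and fillRow are hypothetical names, and the sample strings follow the 'Odometer: 16000' shape quoted in the code comments rather than text captured from a live ad. Unlike the loop above, which checks whether the text contains a keyword, this version matches the label exactly.

```javascript
/*
    Splits a 'label: value' string at the first ':' and returns the two
    trimmed parts (label lower-cased), or null when there is no ':'.
*/
function parseAttribute(text) {
    const separator = text.indexOf(':')
    if (separator === -1) {
        return null
    }
    return {
        label: text.slice(0, separator).trim().toLowerCase(),
        value: text.slice(separator + 1).trim()
    }
}

/*
    Copies the fields the spreadsheet cares about from a list of span
    texts into the given row object; other labels are ignored.
*/
function fillRow(row, spanTexts) {
    const wanted = ['transmission', 'fuel', 'odometer']
    for (const text of spanTexts) {
        const attr = parseAttribute(text)
        if (attr && wanted.includes(attr.label)) {
            row[attr.label] = attr.value
        }
    }
    return row
}
```

In processlinks, the span texts would first be collected with getText() and then handed to fillRow(row, texts) in one call.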

startprocessing

This function chains getcarlinks and processlinks by calling them in sequence (each call is awaited, so the second starts only after the first has resolved). It is called to start the app; in other words, it's the entry point function:

Source code:

/*
    The following method chains the getcarlinks and processlinks methods 
    by calling them in a sequence (each call is awaited, so processlinks 
    starts only after getcarlinks has resolved)
*/

async function startprocessing() {
    try {

        let carlinks = await getcarlinks();
        await processlinks(carlinks);
        console.log('Finished processing')
        await driver.quit()
    }
    catch (err) {
        console.log('Exception occurred while processing, details are: ', err)
        await driver.quit()
    }
}

/*
    Starting the engines 
*/
startprocessing()
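
The entry point calls driver.quit() in both the try and the catch branch. As an alternative sketch (runWithCleanup is a hypothetical helper, not part of the original source), the same guarantee can be expressed once with a finally block. The task and the cleanup step are passed in as functions, so the helper can be exercised without a browser.

```javascript
/*
    Hypothetical helper: runs 'task', logs any error it throws, and calls
    'cleanup' exactly once whether or not the task succeeded.
    Resolves to true on success and false on failure.
*/
async function runWithCleanup(task, cleanup, log = console.log) {
    try {
        await task()
        log('Finished processing')
        return true
    }
    catch (err) {
        log('Exception occurred while processing, details are: ', err)
        return false
    }
    finally {
        await cleanup()
    }
}

/*
    Assumed usage in app.js:

    runWithCleanup(
        async () => processlinks(await getcarlinks()),
        () => driver.quit()
    )
*/
```

The boolean result could be used, for example, to set a non-zero exit code when the run fails.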

There you have it. You can download the attached source code to test this app and extend its functionality to better suit your requirements. You can also find the code on my GitHub page.

Important Links

Selenium website: https://www.seleniumhq.org/
Webdriver docs: https://www.seleniumhq.org/docs/03_webdriver.jsp
Webdriver JS GitHub page: https://github.com/SeleniumHQ/selenium/wiki/WebDriverJs
Exceljs on npm: https://www.npmjs.com/package/exceljs
Headless mode, MDN docs: https://developer.mozilla.org/en-US/Firefox/Headless_mode
Mozilla Geckodriver GitHub page: https://github.com/mozilla/geckodriver
This project can also be found on my GitHub repo: https://github.com/xeektech/samplenodeprojects/tree/master/craigslistparser