Web Browser Automation with Selenium using Node.js
Automated Craigslist parsing with Selenium using Node.js
Introduction
Selenium is a suite of tools that enables the automation of web browsers across multiple platforms. It is widely used for the automated testing of websites and web apps, but its usage is not limited to testing: other frequent, boring, repetitive and time-consuming web activities can and should be automated too.
This is a cut-to-the-chase post on how to use one of Selenium's components, WebDriver, to automate the use case described below. Pray continue.
Use Case
Get for-sale Honda Civic ads posted on LA’s Craigslist and furnish the related information in a spreadsheet:
- Navigate to Craigslist Los Angeles page (https://losangeles.craigslist.org/)
- Click on the “cars+trucks” link, under the “For sale” section
- On the next page, click on the “BY-OWNER ONLY” link
- On the next page, enter “Honda civic” in the “MAKE AND MODEL” textbox; a link will appear, click on it
- The main search page will display 120 ads.
- Go over top 5 of them and fetch these fields:
- Title
- Transmission
- Fuel
- Odometer
- Ad link
- Furnish the fetched information in a spreadsheet (the number of ads processed, 5 here, can be changed by slightly tweaking the code; the code comments point out where)
- Once the top 5 ads are processed, save the spreadsheet
Technologies Used
- Selenium Webdriver: It helps in the automation of browsers (Chrome, Firefox, Internet Explorer, Safari, etc.). It drives the browser natively as the user would on his/her own system. For this implementation, I have chosen the Firefox (geckodriver) webdriver.
- Node.js: The programming language of choice here is JavaScript and the runtime is Node.js.
- Exceljs: Read/Write/Create/Manipulate Excel spreadsheets using this utility.
Setting Up
- Installing geckodriver:
$ npm install -g geckodriver
- package.json: It already has exceljs and selenium-webdriver specified
- Installing the dependencies listed in package.json:
$ npm install
- To run:
$ node app.js
Code Overview
The initialization block does the usual setup: it creates the selenium objects (webdriver, By, until, firefox, firefoxOptions and the driver) along with the excel object (by requiring the 'exceljs' module).
/*
Initializing and building the selenium webdriver with firefox options
along with the exceljs object that will later be used to create the
spreadsheet
*/
const webdriver = require('selenium-webdriver'),
  By = webdriver.By,
  until = webdriver.until;
const firefox = require('selenium-webdriver/firefox');
const firefoxOptions = new firefox.Options();
/*
Path to FF bin
*/
firefoxOptions.setBinary('/Applications/Firefox.app/Contents/MacOS/firefox-bin');
/*
Uncomment the following line to enable headless browsing
*/
//firefoxOptions.headless();
const driver = new webdriver.Builder()
  .forBrowser('firefox')
  .setFirefoxOptions(firefoxOptions)
  .build();
const excel = require('exceljs');
/*
End of initialization
*/
Note: To enable headless browsing (no browser window spawning when this option is turned on), uncomment the following line:
/*
Uncomment the following line to enable headless browsing
*/
//firefoxOptions.headless();
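If you would rather switch headless mode on and off without editing the file, one option is to read the setting from the environment. The following is only a minimal sketch: the HEADLESS variable name and the firefoxArgs helper are my own assumptions and are not part of the app. It passes Firefox's -headless command-line flag through the options object's addArguments() method instead of calling headless().

```javascript
// Hypothetical helper: decide which extra Firefox arguments to pass,
// based on an environment variable named HEADLESS (an assumed name).
function firefoxArgs(env) {
  return env.HEADLESS === '1' ? ['-headless'] : [];
}

// Assumed usage inside the initialization block shown earlier:
//   firefoxOptions.addArguments(...firefoxArgs(process.env));
// and then start the app with:
//   $ HEADLESS=1 node app.js
```

Keeping the decision in a small pure function means the same app.js can be run with or without a visible browser window.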
The rest of the code has three async methods in total:
getcarlinks
This method retrieves the ad links on the first results page (120 of them) and returns them in an array. The function breaks down logically as follows:
- LA Craigslist main page ->
- cars+trucks ->
- By-Owner Only ->
- auto make model = "honda civic"
- On the main search page, collect all the car ad links and return them in an array
Source code:
/*
The following method retrieves the ad links on the first page, 120 of them
LA Craigslist main page ->
cars+trucks ->
By-Owner Only ->
auto make model = "honda civic"
*/
async function getcarlinks() {
  await driver.get('https://losangeles.craigslist.org/')
  await driver.findElement(By.linkText('cars+trucks')).click()
  await driver.findElement(By.linkText('BY-OWNER ONLY')).click()
  await driver.findElement(By.name('auto_make_model')).sendKeys('honda civic')
  /*
  It's important to note that when the string "honda civic" is entered in
  the auto_make_model textbox, it turns into a link that needs to be
  clicked in order for the honda civic specific ads page to load. The
  following call waits (up to 50 seconds) for that link to appear and
  then clicks it
  */
  const civiclink = await driver.wait(until.elementLocated(By.linkText('honda civic')), 50000)
  await civiclink.click()
  /*
  class 'result-info' helps in retrieving all those webelements that contain
  the car ad link
  */
  let elems = await driver.findElements(By.className('result-info'))
  /*
  further parsing of the webelements to obtain the anchor ('a') tags
  */
  let linktagarr = await Promise.all(elems.map(
    anelem => anelem.findElements(By.tagName('a'))
  ))
  /*
  parse the actual links off the anchor tags into an array and return
  the array
  */
  return Promise.all(
    linktagarr.map(
      anhref => anhref[0].getAttribute('href')
    )
  )
}
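The last two steps of getcarlinks assume that every 'result-info' element contains at least one anchor. As a hedged variation, the sketch below skips elements without an anchor instead of failing on anhref[0]. The collectHrefs name and its anchorLocator parameter are hypothetical; the elements and the locator are passed in so that the helper does not depend on a live browser session.

```javascript
// Sketch of a more defensive link-collection step.
// `elems` are the 'result-info' elements; `anchorLocator` would be
// By.tagName('a') in the real app.
async function collectHrefs(elems, anchorLocator) {
  const hrefs = await Promise.all(
    elems.map(async (elem) => {
      const anchors = await elem.findElements(anchorLocator);
      // Skip result blocks that contain no anchor instead of throwing
      return anchors.length > 0 ? anchors[0].getAttribute('href') : null;
    })
  );
  // Drop the skipped entries (and anchors that have no href attribute)
  return hrefs.filter((href) => href !== null);
}

// Assumed usage at the end of getcarlinks():
//   return collectHrefs(elems, By.tagName('a'));
```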
processlinks
This method:
- Is passed the car links array as obtained by the function above (getcarlinks)
- Sets up a new workbook
- Adds a new worksheet to the workbook, named 'CL Links Sheet'
- These columns are added to the worksheet: Sr Num, Title, Transmission, Fuel, Odometer and link to the car's ad page
- For each link in the links array, up to the first 5 (processing all 120 links would take a long time; the limit can be changed to whichever number is deemed feasible), it does the following:
  - Sets the sr (Sr Num) field of the row to the next serial number
  - 'gets' the given link
  - Inside each ad page, looks for these: title, transmission, fuel and odometer
  - Adds a new row with the fetched info and the ad's link
- After processing the given links, it saves the spreadsheet with this name: output.xlsx
Source code:
/*
The following method:
- Is passed a car links array
- Sets up a new workbook
- Adds a new worksheet to the workbook, named 'CL Links Sheet'
- These columns are added to the worksheet: Sr Num, Title, Transmission,
Fuel, Odometer and link to the car's ad page
- for each link in the links array all the way till 5 elements (otherwise
the app will take a long time to process all the 120 links, this setting
can be changed however to whichever number is deemed feasible), it does the
following:
- Increments the sr (Sr Num) field in the spreadsheet
- 'gets' the given link
- Inside each ad page, look for these: title, transmission, Fuel,
Odometer and the link
- Add a new row with the fetched/furnished info
- After processing the given links, it saves the spreadsheet with this
name: output.xlsx
*/
async function processlinks(links) {
  /*
  init workbook, worksheet and the columns
  */
  const workbook = new excel.Workbook()
  let worksheet = workbook.addWorksheet('CL Links Sheet')
  worksheet.columns = [
    { header: 'Sr Num', key: 'sr', width: 5 },
    { header: 'Title', key: 'title', width: 25 },
    { header: 'Transmission', key: 'transmission', width: 25 },
    { header: 'Fuel', key: 'fuel', width: 25 },
    { header: 'Odometer', key: 'odometer', width: 25 },
    { header: 'link', key: 'link', width: 150 }
  ]
  /*
  end init
  */
  for (let [index, link] of links.entries()) {
    /*
    The following if condition limits the number of links to be processed.
    If removed, the loop will process all 120 links
    */
    if (index < 5) {
      let row = {}
      row.sr = index + 1
      row.link = link
      await driver.get(link)
      let elems = await driver.findElements(By.className('attrgroup'))
      /*
      There are only two elements/sections that match the 'attrgroup'
      className search criterion: the first one contains the title
      info and the other contains the info related to the remaining
      fields: transmission, fuel and odometer.
      As there are always going to be two attrgroup elements, I have
      used the elems indexes directly rather than applying a loop to
      iterate over the array
      */
      if (elems.length === 2) {
        /*
        fetching row.title from elems[0]
        */
        row.title = await elems[0].findElement(By.tagName('span')).getText()
        /*
        gathering the remaining spans from elems[1] index. These
        span tags contain the pieces of information we are looking for
        */
        let otherspans = await elems[1].findElements(By.tagName('span'))
        /*
        Looping over each span and fetching the values associated with
        transmission, fuel and odometer
        */
        for (const aspan of otherspans) {
          let text = await aspan.getText()
          /*
          An example of the given span's text:
          Odometer: 16000
          The value is the piece after ':'.
          The following regex separates the value from the complete
          string; spans without a ':' are skipped
          */
          let aspanval = text.match(/(?<=:).*/)
          if (aspanval === null) continue
          let value = aspanval[0].trim()
          if (text.toUpperCase().includes('TRANSMISSION')) {
            row.transmission = value
          }
          else if (text.toUpperCase().includes('FUEL')) {
            row.fuel = value
          }
          else if (text.toUpperCase().includes('ODOMETER')) {
            row.odometer = value
          }
        }
      }
      /*
      The given row is now furnished. It's time to add it to the
      worksheet
      */
      worksheet.addRow(row).commit()
    }
  }
  /*
  All the rows in the worksheet are now furnished. Save the workbook now.
  writeFile returns a promise, so it is awaited to make sure the file is
  written before this function returns
  */
  await workbook.xlsx.writeFile('output.xlsx')
}
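The "label: value" parsing inside the loop can also be pulled out into small pure functions, which makes it easy to try on sample strings without opening a browser. This is a minimal sketch under my own assumptions: parseAttribute and fillRow are hypothetical names, and unlike the substring checks used in processlinks, this version matches the label exactly.

```javascript
// Hypothetical helper: split "odometer: 16000" into a key and a value.
function parseAttribute(text) {
  const pos = text.indexOf(':');
  if (pos === -1) return null; // not a "label: value" span
  return {
    key: text.slice(0, pos).trim().toLowerCase(),
    value: text.slice(pos + 1).trim()
  };
}

// Copy only the fields the spreadsheet has columns for into the row.
function fillRow(row, spanTexts) {
  const wanted = ['transmission', 'fuel', 'odometer'];
  for (const text of spanTexts) {
    const attr = parseAttribute(text);
    if (attr && wanted.includes(attr.key)) {
      row[attr.key] = attr.value;
    }
  }
  return row;
}

// Sample strings made up for illustration; only fuel and odometer are kept
console.log(fillRow({}, ['fuel: gas', 'odometer: 16000', 'title status: clean']));
```

In processlinks, the texts would come from calling getText() on each span in otherspans before handing them to fillRow.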
startprocessing
This function chains getcarlinks and processlinks by calling them in sequence (async/await chains the underlying promises, so processlinks starts only after getcarlinks has resolved). It is called to start the app; in other words, it is the entry point function:
Source code:
/*
The following method chains the getcarlinks and processlinks methods
by calling them in a sequence (async/await chains the underlying
promises under the hood)
*/
async function startprocessing() {
  try {
    let carlinks = await getcarlinks();
    await processlinks(carlinks);
    console.log('Finished processing')
    await driver.quit()
  }
  catch (err) {
    console.log('Exception occurred while processing, details are: ', err)
    await driver.quit()
  }
}
/*
Starting the engines
*/
startprocessing()
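The entry point calls driver.quit() in both the try and the catch branch. One possible alternative, sketched below with hypothetical names (runWithCleanup, and a steps object holding the two functions), moves the cleanup into a finally block so the browser is closed once on either path. The driver and the steps are injected, so the flow can be exercised with stubs.

```javascript
// Sketch: run the two steps and always close the browser exactly once.
async function runWithCleanup(driver, steps) {
  try {
    const carlinks = await steps.getcarlinks();
    await steps.processlinks(carlinks);
    console.log('Finished processing');
    return true;
  } catch (err) {
    console.log('Exception occurred while processing, details are: ', err);
    return false;
  } finally {
    // Runs on both the success and the failure path
    await driver.quit();
  }
}

// Assumed usage in app.js:
//   runWithCleanup(driver, { getcarlinks, processlinks });
```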
There you have it. You can download the attached source code to test this app and extend its functionality to better suit your requirements. You can also find the code on my GitHub page.
Important Links
- Selenium website: https://www.seleniumhq.org/
- Webdriver docs: https://www.seleniumhq.org/docs/03_webdriver.jsp
- Webdriver JS GitHub page: https://github.com/SeleniumHQ/selenium/wiki/WebDriverJs
- Exceljs on npm: https://www.npmjs.com/package/exceljs
- Headless mode, MDN docs: https://developer.mozilla.org/en-US/Firefox/Headless_mode
- Mozilla Geckodriver GitHub page: https://github.com/mozilla/geckodriver
- This project can also be found on my GitHub repo: https://github.com/xeektech/samplenodeprojects/tree/master/craigslistparser