
Web Scraping Made Easy with Bright Data’s Web Scraper IDE

10 Feb 2023 · CPOL · 9 min read
Bright Data’s IDE includes pre-made scraping functions, built-in sophisticated unblocking proxy infrastructure, browser scripting in JavaScript, debugging, and several ready-to-use scraping templates for popular websites.

This article is a sponsored article. Articles such as these are intended to provide you with information on products and services that we consider useful and of value to developers.

Building a web scraper on your own can be challenging and time-consuming. While obtaining structured data from REST or GraphQL APIs can be straightforward, scraping unstructured data from web pages is more tedious and comes with many challenges. It can be difficult to extract the right data, and the code you write may break if the website you are scraping makes any changes to its structure.

On top of this, many sophisticated websites use anti-bot and anti-scraping precautions to prevent ‘unwanted’ web traffic.

To overcome such challenges, it is necessary to use an advanced and comprehensive automated web scraping tool. Bright Data offers such a solution: it simplifies the process of scraping structured and unstructured data, with excellent reliability and proxy services designed to overcome anti-bot and anti-scraping measures.

While you can use Bright Data’s services separately, you can also reach for the Bright Data Web Scraper IDE. It combines the best parts of Bright Data into an interface that developers will feel at home with and can be productive within minutes.

Not only is this a cloud solution you can use directly from your browser, but the process of scraping data with the Web Scraper IDE is also significantly more sophisticated than a typical web scraper's. It is built on unblocking proxy infrastructure, meaning you can scrape and collect data without the need to worry about the anti-bot, anti-scraping, or IP blacklisting measures that some of the major websites have in place.

On top of this, there are predefined code templates for scraping popular websites, such as Amazon and Twitter. This makes the Web Scraper IDE an attractive option for those who are looking for a ready-made web scraping solution without the need for a background in programming.

If you have a basic understanding of JavaScript, you can leverage the Web Scraper IDE to build your own custom scraper. There are plenty of ready-made JavaScript functions available to speed up the process. You could even take one of the pre-existing templates and customize it using the various API commands available (a short sketch after this list shows how a few of them can be combined), such as:

  • country(code) to use a device in a specific country
  • emulate_device(device) to emulate a specific phone or tablet
  • navigate(url) to open a URL in the headless browser
  • wait_network_idle() to wait for outstanding requests to finish
  • wait_page_idle() to wait until no further DOM requests are being made
  • click(selector) to click a specific element
  • type(selector, text) to enter text into an input field
  • scroll_to(selector) to scroll to an element so it’s visible
  • solve_captcha() to solve any CAPTCHAs displayed
  • parse() to parse the page data
  • collect() to add data to the dataset

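To give a sense of how these commands fit together, here is a minimal sketch of custom interaction code. The URL, selectors, and search term are illustrative placeholders rather than values from a real template:

country('us')                             // route requests through a peer in the US
navigate('https://www.example.com')       // illustrative URL, not a real template target
wait_page_idle()                          // wait until the page has finished loading
type('input[name="q"]', 'web scraping')   // illustrative selector and query
click('button[type="submit"]')            // submit the search form
wait_network_idle()                       // let the resulting requests finish
collect(parse())                          // parse the page and add the data to the dataset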
Let’s look at the steps required to get up and running.

Web Scraping Using Bright Data’s Web Scraper IDE

Step 1: Set Up an Account

First, go to Brightdata.com and click on ‘Start Free Trial’. Then fill out the form to sign up for Bright Data. Note: you will be required to verify your email.

Image 1

Step 2: Access the Web Scraper IDE

Once you have signed up, click on the ‘User Dashboard’ button in the top corner. You will then be presented with the following screen:

Image 2

Select ‘View data products’ from the ‘Datasets & Web Scraper IDE’ column.

You will then see the following screen:

Image 3

Select ‘My scrapers’ from the tab at the top.

You will then see the following screen:

Image 4

From here, click the ‘Develop a web scraper (IDE)’ button. This will present the following popup modal:

Image 5

From here, you can choose a ready-made template, with options such as an ‘Amazon product page description’ scraper and a ‘Twitter hashtag search’ scraper, or you can choose to ‘start from scratch’ if you prefer.

I’m going to try using the Google SERP collector template. After selecting this option, the Web Scraper IDE loads in the browser:

Image 6

From here, the code is all set up and ready to use. But one thing you will want to adjust is the input variable. This is how you pass in query parameters to the template. In the image below, we can see the input variable being used on line 1, line 4, and line 6.

Image 7
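For illustration, a template line that consumes the input might look roughly like this (the keyword field name is hypothetical; the real template defines its own fields):

navigate('https://www.google.com/search?q=' + encodeURIComponent(input.keyword))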

Theoretically, we could add our own code to this template, or replace the input values with hardcoded ones. But let’s keep the template as it is and supply our values in the place designed for them.

So let’s scroll down the page a little, where we will see the following:

Image 8

This is where we can adjust our values.

Note: You can see that, besides the ‘Input’ tab, there are other tabs available. ‘Output’ will show you the output after running the ‘Preview’. The other tabs are similar to what you would see if you opened the ‘Developer Tools’ in a browser such as Chrome.

I’m interested in analysing query data for ‘developer marketing’, and in seeing whether there are any differences in the SERPs between the US and the UK. So I adjust the parameters like so:

Image 9
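Conceptually, this amounts to two input sets, roughly like the following (the field names and country codes are illustrative; the template defines the exact fields it expects):

[
  {"keyword": "developer marketing", "country": "US"},
  {"keyword": "developer marketing", "country": "GB"}
]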

And then when you’re ready to run the code, press the ‘Preview’ button.

The code executes and you can see a live browser window that loads up the Google website to make a search query. After it completes, I see the following data in the ‘Output’ tab.

Image 10

In my case, I was particularly interested in obtaining the order of organic search results, which the Web Scraper IDE was able to parse and collect. I can now choose to download this data in JSON format, run more queries, make more comparisons, etc.

If you press the eyeball icon (on the right of the image above), a modal opens that makes it a bit easier to view the collected data.

The great thing about all of this is that the hard part of building the scraper was already taken care of, which meant I could spend more time being productive and analysing data.

Note: I experienced one instance where the Browser disconnected after timing out. When this happened, I simply pressed the ‘Preview’ button again and everything worked fine.

Image 11

And just like that, we have scraped data that is now available for us to download, store, and use as we see fit.

More Details About the Bright Data Web Scraper IDE

The IDE is built using JavaScript and allows you to write code as needed. It offers functions and commands that extend JavaScript, making web scraping easier and providing a more efficient development experience by alleviating common pain points.

The IDE itself consists of three main parts, which are:

1. Interaction code

This is where we set up the web scraper and write the code that interacts with the website. It is also where you should filter out navigation to unwanted pages.

With a template, most of these settings usually stay the same, but if you’re crafting your own script, the IDE offers in-depth and insightful comments on each option, outlining its function and the reasons why you might want to modify it.

For example, if I wanted to change the data being collected by the ‘Interaction code’, I might change the code from:

collect(parse())

To:

let data = parse()

collect({links: data.organic})

This would then filter the output to only include the results from data.organic. And because this is JavaScript, you can make any of the adjustments you could typically do when writing JS code.
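Taking that one step further, and still assuming that data.organic is an array as in the template, ordinary JavaScript methods can trim the output down, for example to the first five organic results:

let data = parse()

// keep only the first five organic results (assumes data.organic is an array)
collect({links: data.organic.slice(0, 5)})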

Image 12

There’s a ‘help’ button in the corner of the ‘Interaction code’ section that you can click on to open up a modal full of advice. It shows you all of the available commands, how to find element selectors, and even how to use jQuery-like expressions.

Image 13

‘Interaction code’ help modal

2. Parser code

This is where you can parse the HTML results of the interactions you have performed. While the code is already set up and ready to use, you can make adjustments if you want to change the way the data is collected.

In other words, the Interaction code chooses what data to collect, while the Parser code determines how the collected data is manipulated.

Again, we also have a ‘help’ button we can click on to find out more about the available commands for the ‘Parser code’.

Image 14

‘Parser code’ help modal
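To give a flavour of what Parser code can look like, here is a minimal sketch that assumes the parser returns a plain object and uses the jQuery-style selectors the help describes; the selectors are made up and will not match the real Google SERP template:

// build an object from the page using illustrative selectors
return {
  organic: $('div.result').toArray().map((el) => ({
    title: $(el).find('h3').text().trim(),
    link: $(el).find('a').attr('href'),
  })),
}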

3. User interaction

Located at the bottom of the Web Scraper IDE, we can access the following tabs:

Input: This is where we manage the inputs that our ‘Interaction code’ uses.

Image 15

Run Log: We can check the run logs to see the status of a job.

Image 16

Browser Console: This is the built-in browser console, which displays any errors that occur during runtime.

Image 17

Note: You can still use your console functions for debugging — for example, console.log() — but console outputs will appear in the ‘Run log’ tab.

Browser Network: We can use this to monitor all the requests passed through the IDE.

Image 18

Last Error: This is where we can see whether any errors occurred during the scraping task.

Image 19

Output: We can see the scraped result and download the data set.

Image 20

Advantages of Web Scraping with Bright Data’s Web Scraper IDE

  • The IDE can be easily accessed through Bright Data’s website, allowing for immediate use without the need for in-house data collection infrastructure.
  • It is built on Bright Data’s proxy infrastructure, providing scalability and 99.9% accuracy.
  • The IDE uses an unlocker infrastructure to effectively bypass CAPTCHA-based, IP-based, and device-fingerprint-based blocks.
  • It provides pre-made templates for various data-rich websites and includes useful functions and commands built on JavaScript, making it easy to write a scraper without the need for extensive coding.
  • Unlike traditional scraping methods such as DOM parsing or regex, the IDE supports browser emulation out of the box, enabling data extraction from dynamic websites without the need to configure browser automation tools such as Selenium.
  • Bright Data’s pre-defined scraping templates are managed and updated automatically, eliminating the manual code updates that a home-grown scraper needs whenever a target website changes.
  • Bright Data is dedicated to compliance with data protection regulations such as GDPR and CCPA.
  • Bright Data offers 24/7 live support, eliminating the need to spend time scouring the internet for half-baked solutions to any issues we might be facing.

Conclusion

As websites continue to implement measures to block bots, hackers, and content thieves, scraping data has become increasingly challenging.

On top of this, building your own tool to scrape non-static sites becomes more and more complex. Dynamic websites such as Twitter require JavaScript to run in order to generate their content, and web scraping tools such as Cheerio are not designed to handle this out of the box.

Bright Data’s Web Scraper IDE offers an affordable solution to this problem by providing immediate access to an efficient, reliable, and adaptable tool, where you pay only for the data successfully extracted. And if you don’t feel like getting your hands dirty, you can even request custom datasets by specifying which website you want data from and which data points you need.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Founder, In Plain English
United States
Sunil is the Founder of In Plain English, a group of programming publications that aim to make education more accessible.
