Webscraping with C# - point and scrape!

AJSON, 7 Apr 2016
Automate your web scrapes - build a point-and-click web-scrape engine in JavaScript and C#

Introduction

This article is part three of a multi-part series.

Part one - How to web scrape using C#
Part two - Web crawling using .net - concepts
Part three - Web scraping with C# - point and scrape! (this article)
Part four - Web crawling using .net - example code (to follow)

Background

I have practiced the art of web scraping for quite a while, and mostly carry out the task by hand. I have seen some commercial offerings that offer a quicker and easier way to pull data from web pages that is, literally, point and click. This is useful not only for saving time for us poor coders, but also for users who are not coders yet still need to get data from a web page (without annoying the coders, of course!). This article is a short introduction to what is needed to put such an engine together, and highlights some techniques for building a point-and-click web-scrape/crawl engine.

There is enough in this article to get you started working on something yourself, and I intend to revisit it later with working code once that is completed.

 

Point and click engine

Putting together most things is usually one part brain power and joining the dots, and one part building on the shoulders of those who have gone before us - this project is no different.

The workflow is pretty basic - and a few commercial outfits have done this already, including Kimono Labs. They were acquired and closed their product, so it's no use to us directly, but we can learn a lot from it! ...


Step 1 - highlight/select elements to scrape

The first thing needed is a method for dynamically selecting/identifying, in the browser, the HTML elements that contain the data we want to scrape. This is generally done as a browser extension/plugin.
Overall, it's a pretty simple thing to do; there are examples here and here. 'Selector Gadget' is also a good example to look at.

To get repeating elements, like in the Kimono screenshot below, you just need to look at the element selected, then look around its parent/siblings for patterns of elements that repeat, and make a guess (letting the user correct things as you go). In this example, you can see that I have clicked, in the browser, on the title of one of my articles, and the code has magically auto-selected what it thinks are the other article titles.
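One simple way to turn a clicked element into a pattern that matches its repeating siblings is to take the element's XPath and drop the positional index from its last step. A minimal sketch (the function name and approach are my own illustration, not code from any of the tools mentioned):

```javascript
// Generalize an element's XPath so it matches repeated siblings:
// strip the positional predicate from the last step that has one, e.g.
//   /html/body/div[2]/ul/li[3]/a  ->  /html/body/div[2]/ul/li/a
// which then selects every article title in the list, not just the
// one the user clicked on.
function generalizeXPath(xpath) {
  // Remove only the final [n] predicate; earlier indexes (like div[2])
  // are kept so we stay inside the same parent container.
  return xpath.replace(/\[\d+\](?!.*\[\d+\])/, '');
}

console.log(generalizeXPath('/html/body/div[2]/ul/li[3]/a'));
// -> /html/body/div[2]/ul/li/a
```

A real extension would also need to build the XPath from the clicked DOM node in the first place (walking `parentNode` and counting same-tag siblings), and let the user deselect any false matches.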



The concept above is just repeated for the other fields/blocks of data on a page you want to scrape, and saved into a template. The key to knowing what to scrape is to grab the XPath of the elements you want. Sure, this can be a bit involved at times, but it's worth it in the long run. Learn a bit more about XPath here. Once you have the XPath of one or more elements, you can use the techniques demonstrated in my introduction to web scraping article to scrape data from them with an XPath query.

The following diagram shows, for example, how you might store a template in XML for the scrape of the 'title' div above.
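In case the diagram doesn't come through, here is a sketch of what such an XML template could look like. The element names (url, elements, element, xpath) are assumptions, chosen to line up with the XML.url and XML.elements.element[n].xpath accessors used in the code snippet below:

```xml
<template>
  <url>https://www.example.com/articles</url>
  <elements>
    <element>
      <name>title</name>
      <xpath>//div[@class='title']/a</xpath>
    </element>
  </elements>
</template>
```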

 

Borrowing some code from the previous article on web scraping, and based on the XML example above, this is how you would then pull all of the 'title' data from the above page into a list variable called Titles:
 

// Navigate to the URL stored in the template
WebPage PageResult = Browser.NavigateToPage(new Uri(XML.url));
// Select every node matching the stored XPath, e.g. all article titles
var Titles = PageResult.Html.SelectNodes(XML.elements.element[n].xpath);



Step 2 - The Scrape-flow

You need to tell your engine how to get at both the page the data is on, and where the data on the page is. This is not the same as the selecting in Step 1. What I refer to here is the things that bring the data to the page - let's say I had 100 articles, but the page only showed 30 at a time. In this case you need to let your engine know that it needs to:

  1. Go to page
  2. Find elements, scrape
  3. Go to NEXT page (and how to do it)
  4. Rinse, repeat, until last page

To make this happen, you need to let the engine know how to navigate. For paged data, this involves identifying:

  • start page
  • end page
  • rows per page
  • prev/next links
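As a rough sketch of how a stored scrape-flow could drive the engine, the pagination settings above can be expanded into the list of page URLs to visit. The flow object's shape and the query-string parameter names are assumptions for illustration; real sites page in all sorts of ways:

```javascript
// Expand a stored scrape-flow into the full list of page URLs to visit.
// Each URL would then be fed in turn to the element scrape from Step 1.
function pagesToVisit(flow) {
  const urls = [];
  for (let page = flow.startPage; page <= flow.endPage; page++) {
    urls.push(`${flow.baseUrl}?page=${page}&rows=${flow.rowsPerPage}`);
  }
  return urls;
}

const urls = pagesToVisit({
  baseUrl: 'https://www.example.com/articles',
  startPage: 1,
  endPage: 4,      // 100 articles at 30 rows per page -> 4 pages
  rowsPerPage: 30,
});
console.log(urls.length); // -> 4
```

On sites that only expose a 'next' link rather than numbered pages, the loop would instead follow the stored prev/next selector until it no longer matches anything - that is the "rinse, repeat, until last page" case.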


 

Step 3 - Schedule and scrape!

Ok, the last piece of the puzzle is putting it all together so you can point and click at what you want to scrape, and then schedule the scrape to happen on a regular basis. Depending on your needs, here are some useful tools that might assist you along the way:

Quartz .net scheduler

This is an extremely robust timer/scheduler framework. It is widely used, easy to implement, and a far better approach to scheduling things in code than using and abusing the ubiquitous timer class. You can implement schedules as simple as 'every Tuesday' or 'once, at this specific time', or build quite complex beasts using the built-in CRON trigger methods.

Here are some examples:
 

  • 0 15 10 ? * * - Fire at 10:15am every day
  • 0 0/5 14 * * ? - Fire every 5 minutes starting at 2pm and ending at 2:55pm, every day
  • 0 15 10 ? * 6L 2002-2005 - Fire at 10:15am on every last Friday of every month during the years 2002, 2003, 2004 and 2005
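To make the field layout of those expressions explicit, here is a small sketch that splits a Quartz-style cron expression into its named positions. Quartz uses a 6- or 7-field format (seconds, minutes, hours, day-of-month, month, day-of-week, optional year), which is one field more than classic Unix cron:

```javascript
// Split a Quartz cron expression into named fields:
//   sec min hour day-of-month month day-of-week [year]
function parseQuartzCron(expr) {
  const names = ['seconds', 'minutes', 'hours', 'dayOfMonth',
                 'month', 'dayOfWeek', 'year'];
  const fields = {};
  expr.trim().split(/\s+/).forEach((part, i) => { fields[names[i]] = part; });
  return fields;
}

console.log(parseQuartzCron('0 15 10 ? * 6L 2002-2005').dayOfWeek);
// -> 6L   ("L" = last, so 6L means the last Friday of the month)
```

This only labels the fields - actually expanding values like 0/5 or 6L into fire times is what Quartz's CronTrigger does for you.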

Pretty powerful stuff!
 

JQuery-cron builder

If your user interface is on the web, this jQuery plugin may come in useful. It gives the user an easy interface to generate/select schedule times without having to know how to speak cron!


The job of this final step is simply to execute a scrape process against the stored templates at a pre-determined scheduled time. Getting something basic up and running fast is easy - the fun starts when you have to work on building it out. Watch this space :)
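Tying it together, the scheduler's job at each tick reduces to "find the templates that are due, and run them". A minimal sketch - the template shape is an assumption, and a real implementation would lean on Quartz .NET triggers rather than hand-rolled timestamps:

```javascript
// Given stored templates with a nextRunAt timestamp, pick the ones that
// are due to be scraped now. The caller would run each due template's
// scrape-flow, then compute its next run time from the cron schedule.
function dueTemplates(templates, now) {
  return templates.filter(t => t.nextRunAt <= now);
}

const due = dueTemplates(
  [
    { name: 'article-titles', nextRunAt: 100 },
    { name: 'prices', nextRunAt: 500 },
  ],
  200 // "current time" for the example
);
console.log(due.map(t => t.name)); // -> [ 'article-titles' ]
```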

 

Summary

That completes the basics of this article, and should be enough to get you started coding!
The next update will provide some working code you can implement and build on.

So remember:

1 - Select into template
2 - Identify the scrape-flow
3 - Schedule and scrape!

I have attached an example project of dynamic selection in the browser, taken from one of the links above, to get you started.


Finally - If you liked this article please give it a vote above!!

History

Version 1 - 7th April 2016

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


About the Author

AJSON
Engineer
United Kingdom United Kingdom

Article Copyright 2016 by AJSON