Web Scraping Library (Fully .NET)

FrankNight

Jul 16, 2015

GPL3

This is just another web scraper written fully in .NET but finally without the use of mshtml!

Introduction

Searching and collecting data published on web sites has always been a long and boring manual task. With this project, I try to give you a tool that helps automate some of these tasks and save the results in an ordered way.
It is simply another web scraper written for the Microsoft .NET Framework (C# and VB.NET), but finally without the use of Microsoft's mshtml parser!
I often use this light version because it is simple to customize and to include in new projects.

These are the components that it is made of:

A parser object, gkParser, that uses Jamietre's version of HtmlParserSharp (https://github.com/jamietre/HtmlParserSharp) and provides navigation functions
A ScrapeBot object, gkScrapeBot, that provides functions to search, extract and purge data
Some helper classes to speed up the development of database operations

Architecture

The search and extraction method requires the HTML to be transformed into XML, even if it is not well formed. This makes it simpler to locate data inside a web page. The base architecture, then, focuses on making this transformation and executing queries on the resulting XML document.

The Parser class includes all functions to navigate and parse. In this version, parsing is limited only to HTML and JSON.
When navigation functions return a successful response, you have an XML DOM representation of the web page.

At this point, another object, the ScrapeBot, can execute queries and extract and purge the desired data with XPath syntax.
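To illustrate the idea of querying a page once it is an XML DOM, here is a minimal sketch that uses only the standard System.Xml classes. The FirstText helper and the sample markup are hypothetical and are not part of the library; the XML string stands in for what a parser would produce from a real page.

```vbnet
Imports System
Imports System.Xml

Module XPathSketch
    ' Hypothetical helper (not part of gkScrapeBot): returns the trimmed text,
    ' or the value of an attribute, of the first node matching the XPath query.
    Function FirstText(ByVal doc As XmlDocument, ByVal xpath As String, _
                       Optional ByVal attrib As String = "") As String
        Dim node As XmlNode = doc.SelectSingleNode(xpath)
        If node Is Nothing Then Return ""
        If attrib <> "" Then
            ' GetAttribute returns an empty string when the attribute is missing
            Dim el As XmlElement = TryCast(node, XmlElement)
            If el Is Nothing Then Return ""
            Return el.GetAttribute(attrib)
        End If
        Return node.InnerText.Trim()
    End Function

    Sub Main()
        ' Stand-in for the XML image of a web page
        Dim doc As New XmlDocument()
        doc.LoadXml("<HTML><BODY><DIV class='price'> 12,50 </DIV>" & _
                    "<A href='/next'>Next</A></BODY></HTML>")

        Console.WriteLine(FirstText(doc, "//DIV[@class='price']")) ' prints 12,50
        Console.WriteLine(FirstText(doc, "//A", "href"))           ' prints /next
    End Sub
End Module
```

Tag names are upper case in the sample because the XPath queries in this article use upper-case names, and XPath is case sensitive.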

The gkScrapeBot is the main object you will use in your project. It already uses the Parser for you. It provides some wrappers to navigation functions, some useful query functions and some other functions to extract and purge data.

Let’s take a look inside these objects.

The Parser Class (gkParser)

This component is written in VB.NET and uses Jamietre's version of the port of the Validator.nu parser (http://about.validator.nu/htmlparser/).

Why not the Microsoft parser? Ah, OK. I want to spend a little time writing about this painful choice. ;-)

The first version of this project was made using mshtml. This is why I decided to change:
First: my intent was to use this component windowless. There are many documents on the web about using mshtml, but none of them is official from Microsoft, and a lot of them are about users' troubles… The few useful documents from Microsoft are dated 1999 (the walkall example from the inet SDK)! It works, but I quickly found its limitations.
Second: I then started coding in .NET, based on the walkall example. After overcoming the COM interop difficulties, I found that mshtml is able to make only GET requests. And POST? … Somewhere, Microsoft writes that it is possible to customize the request process by implementing an interface and writing some callback functions… NO. It doesn't work!
Third: I need to control the download of linked documents, JavaScript, images, CSS, … Oh, yes. Microsoft writes about this. It writes that you have total control over this… NO!
I used Wireshark to see what my process was downloading, and this feature didn't work. I saw it work only when hosted by the heavy MS WebBrowser component.
Then: I understood that Microsoft does not like developers using its parser.

The Component

Navigation functions are implemented with the use of WebRequest and WebResponse classes, and the HTML parser is implemented using the object HtmlParserSharp.SimpleHtmlParser. The Navigate method is the only public function used to make both GET and POST requests. It has 4 overloads to permit different behavior.

Public Sub Navigate(ByVal url As String)
Public Sub Navigate(ByVal url As String, ByVal dontParse As Boolean)
Public Sub Navigate(ByVal url As String, ByVal postData As String)
Public Sub Navigate(ByVal url As String, ByVal postData As String, ByVal dontParse As Boolean)

It's not easy to create a class that fully implements all navigation features. The one I've created implements basic cookie management and doesn't fully implement the HTTPS protocol.
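As a rough sketch of what GET and POST navigation with basic cookie handling can look like on top of WebRequest and WebResponse: the SimpleNavigator class and its Fetch method below are hypothetical names, not the gkParser source, and the sketch assumes a UTF-8 response and form-encoded POST data.

```vbnet
Imports System
Imports System.IO
Imports System.Net
Imports System.Text

' Hypothetical helper, not the gkParser implementation.
Public Class SimpleNavigator
    ' One container shared by all requests, so cookies set by a response
    ' are sent back automatically with the following requests.
    Private ReadOnly cookies As New CookieContainer()

    ' Makes a GET request, or a POST request when postData is supplied,
    ' and returns the response body as a string.
    Public Function Fetch(ByVal url As String, _
                          Optional ByVal postData As String = Nothing) As String
        Dim req As HttpWebRequest = CType(WebRequest.Create(url), HttpWebRequest)
        req.CookieContainer = cookies

        If postData IsNot Nothing Then
            Dim body As Byte() = Encoding.UTF8.GetBytes(postData)
            req.Method = "POST"
            req.ContentType = "application/x-www-form-urlencoded"
            req.ContentLength = body.Length
            Using s As Stream = req.GetRequestStream()
                s.Write(body, 0, body.Length)
            End Using
        End If

        Using resp As HttpWebResponse = CType(req.GetResponse(), HttpWebResponse)
            ' Assumption: the body is UTF-8 (the StreamReader default)
            Using reader As New StreamReader(resp.GetResponseStream())
                Return reader.ReadToEnd()
            End Using
        End Using
    End Function
End Class
```

Calling Fetch on a login page and then calling it again with postData on the same instance sends the session cookie back with the second request, which is the behavior a cookie-based login needs.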

All methods are synchronous: when they return, an XML DOM document is ready.
After a web request gets a success response, the class checks the content type and instantiates the correct parser.
Jamietre's parser returns a very formal XML, too formal for our purpose. Moreover, some web pages are very large and complex, so it is useful to have a smaller XML. For this reason, I implemented an algorithm that filters tags and attributes: you can instruct the parser to consider only the desired tags and attributes and to exclude the undesired ones.
The following two properties control this behavior:

Public Property ExcludeInstructions() As String
Public Property IncludeInstructions() As String

'default values example
p_tag2ExcInstruction = "SCRIPT|META|LINK|STYLE"
p_tag2IncInstruction = "A:href|IMG:src,alt|INPUT:type,value,name"

With this feature, you can customize the result XML and make it easier to understand and to teach the bot.
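The article does not describe the filtering algorithm itself, so the following is only a sketch of one possible interpretation, applied to an already built XmlDocument. It assumes that tags named in the exclude string are removed together with their content, and that tags named in the include string keep only the attributes listed for them. ParseInclude and ApplyFilter are hypothetical names.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Xml

Module FilterSketch
    ' Parses "A:href|IMG:src,alt" into a map: tag name -> allowed attribute names.
    Function ParseInclude(ByVal spec As String) As Dictionary(Of String, HashSet(Of String))
        Dim result As New Dictionary(Of String, HashSet(Of String))(StringComparer.OrdinalIgnoreCase)
        For Each part As String In spec.Split(New Char() {"|"c}, StringSplitOptions.RemoveEmptyEntries)
            Dim pieces As String() = part.Split(":"c)
            Dim attrs As New HashSet(Of String)(StringComparer.OrdinalIgnoreCase)
            If pieces.Length > 1 Then
                For Each a As String In pieces(1).Split(","c)
                    attrs.Add(a.Trim())
                Next
            End If
            result(pieces(0).Trim()) = attrs
        Next
        Return result
    End Function

    ' Assumed semantics: drop excluded tags, trim attributes of included tags.
    Sub ApplyFilter(ByVal doc As XmlDocument, ByVal excludeSpec As String, ByVal includeSpec As String)
        Dim excluded As New HashSet(Of String)(excludeSpec.Split("|"c), StringComparer.OrdinalIgnoreCase)
        Dim included As Dictionary(Of String, HashSet(Of String)) = ParseInclude(includeSpec)

        ' Copy the elements first, so nodes are not removed while the query result is being walked
        Dim all As New List(Of XmlElement)
        For Each n As XmlNode In doc.SelectNodes("//*")
            all.Add(DirectCast(n, XmlElement))
        Next

        For Each el As XmlElement In all
            If excluded.Contains(el.Name) Then
                If el.ParentNode IsNot Nothing Then el.ParentNode.RemoveChild(el)
            ElseIf included.ContainsKey(el.Name) Then
                Dim allowed As HashSet(Of String) = included(el.Name)
                For i As Integer = el.Attributes.Count - 1 To 0 Step -1
                    If Not allowed.Contains(el.Attributes(i).Name) Then
                        el.Attributes.RemoveAt(i)
                    End If
                Next
            End If
        Next
    End Sub

    Sub Main()
        Dim doc As New XmlDocument()
        doc.LoadXml("<HTML><HEAD><SCRIPT>x()</SCRIPT></HEAD>" & _
                    "<BODY><A href='/p' onclick='t()'>P</A></BODY></HTML>")
        ApplyFilter(doc, "SCRIPT|META|LINK|STYLE", "A:href|IMG:src,alt|INPUT:type,value,name")
        ' The SCRIPT element and the onclick attribute are gone; href is kept
        Console.WriteLine(doc.OuterXml)
    End Sub
End Module
```

The real component may well filter while parsing rather than afterwards; filtering the finished document is just the simplest way to show the effect of the two instruction strings.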

The Scraper

The other main class is the gkScrapeBot. This is the class you have to use.
It uses the gkParser to navigate, to get the XML to analyze and to extract data from it.
It implements helper functions to meet these requirements:

'
'Navigation functions:
'
'Makes a simple GET request and returns the XML image of the entire HTML page
Public Sub Navigate(ByVal url As String)
'Makes a GET request, looks for the subel element id and
' returns only the HTML contained in that element
Public Sub Navigate(ByVal url As String, ByVal subel As String)
'As above, and waits the given number of milliseconds
Public Sub Navigate(ByVal url As String, ByVal subel As String, ByVal wait As Integer)

'Makes a POST request and returns the XML image of the entire HTML page
Public Sub Post(ByVal url As String, ByVal postData As String)
'Makes a POST request, looks for the subel element id and
' returns only the HTML contained in that element
Public Sub Post(ByVal url As String, ByVal postData As String, ByVal subel As String)
'As above, and waits the given number of milliseconds
Public Sub Post(ByVal url As String, ByVal postData As String, _
                ByVal subel As String, ByVal wait As Integer)

'
' XPATH Search functions
'
Public Function GetNode_byXpath(ByVal xpath As String, _
   Optional ByRef relNode As XmlNode = Nothing, _
   Optional ByVal Attrib As String = "") As XmlNode
Public Function GetNodes_byXpath(ByVal xpath As String, _
   Optional ByRef relNode As XmlNode = Nothing, _
   Optional ByVal Attrib As String = "") As XmlNodeList
Public Function GetText_byXpath(ByVal xpath As String, _
   Optional ByRef relNode As XmlNode = Nothing, _
   Optional ByVal Attrib As String = "") As String
Public Function GetValue_byXpath(ByVal xpath As String, _
   Optional ByRef relNode As XmlNode = Nothing, _
   Optional ByVal Attrib As String = "") As String
Public Function GetHtml_byXpath(ByVal xpath As String, _
   Optional ByRef relNode As XmlNode = Nothing) As String

Look at the example below to see it in action.

How to Use: Test Project Included

Warning: scraping is often forbidden by a web site's policy.
Before scraping, you need to be sure that the target site's policy permits it.

I assume that you know how a web site works (URLs, request methods and parameters, ...). I use the developer tools provided by browsers both to discover all the parameters and requests sent to the server, and to navigate the HTML tree.

Let's see it in action:
The test project, included in the download package, shows you how to get product details from an online shop: https://testscrape.gekoproject.com
I chose this example because it uses the key features of the scraper: cookie management and POST requests for the login phase, and node exploring and database facilities to get and store the extracted data:

Products are not visible to guest users. Only registered users can view products and prices.
The login process is based on cookies. So, first of all, we simply need to navigate to the site to obtain the cookie.

'Navigate to homepage and get cookies. 
url = "https://testscrape.gekoproject.com/index.php/author-login"
bot.Navigate(url)

In the login page, inside the form, there are two strings that must be posted back to the server to send a login request successfully.

'Then look for two parameters useful to login
token1 = bot.GetText_byXpath("//DIV[@class='login']//INPUT[@type='hidden'][1]", , "value")
token2 = bot.GetText_byXpath("//DIV[@class='login']//INPUT[@type='hidden'][2]", , "name")

'Now log in with username and password
url = "https://testscrape.gekoproject.com/index.php/author-login?task=user.login"
data = "username=" & USER & "&password=" & PASS & "&return=" & token1 & "&" & token2 & "=1"
bot.Post(url, data)
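The article does not say whether the bot URL-encodes postData itself. If it does not, values containing characters such as "&", "=" or "+" need to be escaped before they are concatenated. A minimal sketch using the standard Uri.EscapeDataString follows; BuildFormBody is a hypothetical helper and the field values are made up.

```vbnet
Imports System
Imports System.Collections.Generic

Module FormBodySketch
    ' Hypothetical helper: joins name/value pairs into a URL-encoded form body.
    Function BuildFormBody(ByVal fields As IEnumerable(Of KeyValuePair(Of String, String))) As String
        Dim parts As New List(Of String)
        For Each f As KeyValuePair(Of String, String) In fields
            parts.Add(Uri.EscapeDataString(f.Key) & "=" & Uri.EscapeDataString(f.Value))
        Next
        Return String.Join("&", parts)
    End Function

    Sub Main()
        ' Made-up values; the real tokens come from the login form
        Dim fields As New List(Of KeyValuePair(Of String, String)) From {
            New KeyValuePair(Of String, String)("username", "demo"),
            New KeyValuePair(Of String, String)("password", "p&ss=word"),
            New KeyValuePair(Of String, String)("return", "aW5kZXgucGhw+/=")
        }
        ' prints username=demo&password=p%26ss%3Dword&return=aW5kZXgucGhw%2B%2F%3D
        Console.WriteLine(BuildFormBody(fields))
    End Sub
End Module
```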

If all goes right, you are redirected to the user page, and you can check this by reading the "Registered Date" information:

mytext = bot.GetText_byXpath("//DT[contains(.,'Registered Date')]/following-sibling::DD[1]")
Console.WriteLine("User {0}, Registered Date: {1}", USER, mytext.Trim)

Once you are logged in, you can navigate to the products listing page and start scraping data.

In the example, only the data of the first page is scraped, but you can repeat the task for each page in the pager.

Below is the code to retrieve a list of products and their attributes:

Dim url As String
Dim name As String
Dim desc As String
Dim price_str As String
Dim price As Double
Dim img_path As String

'Navigate to front-end store 
url = "https://testscrape.gekoproject.com/index.php/front-end-store"
bot.Navigate(url)

'find all product "div"
Dim ns As XmlNodeList = bot.GetNodes_byXpath("//DIV[@class='row']//DIV[contains(@class, 'product ')]")
If ns.Count > 0 Then

  'Write to a XML file
  Dim writer As XmlWriter = Nothing

  'Create an XmlWriterSettings object with the correct options.
  Dim settings As XmlWriterSettings = New XmlWriterSettings()
  settings.Indent = True
  settings.IndentChars = (ControlChars.Tab)
  settings.OmitXmlDeclaration = True

  writer = XmlWriter.Create("data.xml", settings)
  writer.WriteStartElement("products")

  '********************
  ' Main scraping loop
  '********************
  For Each n As XmlNode In ns

    'Find and collect data using relative xpath syntax
    name = bot.GetText_byXpath(".//DIV[@class='vm-product-descr-container-1']/H2", n)
    desc = bot.GetText_byXpath(".//DIV[@class='vm-product-descr-container-1']/P", n)
    desc = gkScrapeBot.FriendLeft(desc, 50)
    img_path = bot.GetText_byXpath(".//DIV[@class='vm-product-media-container']//IMG", n, "src")
    price_str = bot.GetText_byXpath(".//DIV[contains(@class,'PricesalesPrice')]", n)
    'Reset the price so a product without one does not reuse the previous value
    price = 0
    If price_str <> "" Then
      price = gkScrapeBot.GetNumberPart(price_str, ",")
    End If

    '
    'write xml product element
    '
    writer.WriteStartElement("product")
    writer.WriteElementString("name", name)
    writer.WriteElementString("description", desc)
    writer.WriteElementString("price", XmlConvert.ToString(price))
    writer.WriteElementString("image", img_path)
    writer.WriteEndElement()

    '
    'Insert data into DB
    '
    db.CommantType = DBCommandTypes.INSERT
    db.Table = "Articles"
    db.Fields("Name") = name
    db.Fields("Description") = desc
    db.Fields("Price") = price
    Dim ra As Integer = db.Execute()
    If ra = 1 Then
      Console.WriteLine("Inserted new article: {0}", name)
    End If

  Next

  writer.WriteEndElement()
  writer.Flush()
  writer.Close()

End If

Conclusion

I hope that this project will help you in collecting data from the web.
I know that it's not simple to discover how a web site works, especially if it makes heavy use of JavaScript to make async requests.
So this project won't be a solution for all websites. If you need something more than this project, you can contact me by leaving a comment below, and be sure you are authorized to scrape. ;-)

Happy scraping!

Updates

28-06-2019
- Updated to target .NET Framework 4.7.2 and VS 2019
- Test Project was updated to work with new Test Site https://testscrape.gekoproject.com
- Improved features and bug fixes
16-07-2015
- Fixed a permission error on the demo site that caused a runtime exception while running the test project