Download source code - 1.54 MB

Introduction (What is screen scraping?)

Parsing HTML from a website is also known as screen scraping. It is the process of accessing information on an external website (the information must be public data) and processing it as required. For instance, to get the average rating of the Nokia Lumia 1020 across different websites, we can scrape the rating from each site and calculate the average in our code. In general, whatever you can see as “Public Data” as an ordinary “User”, you can also scrape easily with HTML Agility Pack (HAP).
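As a rough sketch of the rating example, the snippet below pulls one rating out of each page's HTML and averages them. It assumes, purely for illustration, that every site marks its rating as <span class="rating">; real sites will differ, and the RatingAverager class and its method names are hypothetical.

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using HtmlAgilityPack;

public static class RatingAverager
{
    // Extracts the first <span class="rating"> value from one page's HTML.
    // The "rating" class name is an assumption made for this sketch.
    public static double? ExtractRating(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        var node = document.DocumentNode.SelectSingleNode("//span[@class='rating']");
        if (node == null)
        {
            return null; // this page has no rating in the expected place
        }

        double rating;
        bool parsed = double.TryParse(node.InnerText.Trim(), NumberStyles.Float,
                                      CultureInfo.InvariantCulture, out rating);
        return parsed ? rating : (double?)null;
    }

    // Averages the ratings that could be extracted and ignores pages without one.
    // Returns 0 when no page yielded a rating.
    public static double AverageRating(IEnumerable<string> pages)
    {
        var ratings = pages.Select(ExtractRating)
                           .Where(r => r.HasValue)
                           .Select(r => r.Value)
                           .ToList();
        return ratings.Count == 0 ? 0.0 : ratings.Average();
    }
}
```

Pages without a recognisable rating are skipped rather than counted as zero, so one site changing its markup does not drag the average down.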

Background

Previously it was harder to scrape a website because the whole DOM was downloaded as a single string. Working with strings was no pleasure: you had to find individual nodes by iterating through the text and matching tags and attributes against your requirements. The tooling has gradually improved, and the HtmlAgilityPack library now makes the job much easier. This article gives a simple demonstration of how to get started with HAP.

You need a basic knowledge of programming and should be able to write code in C# and ASP.NET.

How it works

Before HTML Agility Pack we had to combine several built-in .NET Framework classes to pull the HTML out of a website. Now we no longer have to juggle those classes; instead we use the HAP library and let it do the task for us.

It’s pretty simple. Your code will make an HTTP request to the server and parse/store the returned HTML.

First, HAP builds a DOM view of the parsed HTML of a particular page. After that, a few lines of code let you walk through the DOM and select the nodes you want. Using an XPath expression, HAP can also give you a specific node and its attributes. HAP also includes a class for downloading a remote page.
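To make the XPath part concrete, here is a minimal sketch that parses an HTML string held in memory, selects a single node with an XPath expression and reads its attributes. The NodeInspector class and the GetAttributes method are names invented for this illustration; LoadHtml, SelectSingleNode and the Attributes collection come from HAP.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class NodeInspector
{
    // Parses the HTML, finds the first node matching the XPath expression and
    // returns its attributes as name/value pairs (null when nothing matches).
    public static Dictionary<string, string> GetAttributes(string html, string xpath)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        HtmlNode node = document.DocumentNode.SelectSingleNode(xpath);
        if (node == null)
        {
            return null; // no node matched the expression
        }

        var attributes = new Dictionary<string, string>();
        foreach (HtmlAttribute attribute in node.Attributes)
        {
            attributes[attribute.Name] = attribute.Value;
        }
        return attributes;
    }
}
```

For example, calling GetAttributes with the expression //a[@id='docs'] returns the id, href and any other attributes of the anchor whose id is docs, if the page contains one.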

Let's get started

In this example we'll parse all the links of a particular web page and display them in our own page using HtmlAgilityPack. So let's start:

Run Visual Studio 2012
Go to File -> New -> Project, select Web under the Visual C# templates and, on the right side, select ASP.NET Empty Web Application. Name the application HTMLAgilityPackStartUp and click OK.

In Solution Explorer, right-click References within your project and click Manage NuGet Packages.

The Manage NuGet Packages window will appear. Click the search bar on the right side and search for HtmlAgilityPack. It will list the libraries and extensions related to this name. In the middle pane of the window you'll find the HtmlAgilityPack library in the first row. Click Install. It will download and install the library into your project; when the installation finishes, click OK.

Check if the library has been added to the References.

Now let's add a new Web Form: name it MyScraper.aspx and click Add.

Edit the MyScraper.aspx file as follows, adding two labels, a text box and a button, and save it.

<%@ Page Language="C#" AutoEventWireup="true" CodeBehind="MyScraper.aspx.cs" Inherits="HTMLAgilityPackStartUp.MyScraper" %>
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head runat="server">
    <title></title>
</head>
<body>
    <form id="form1" runat="server">
        <div>
            <asp:Label ID="InputHereLabel" runat="server" Text="Input Here" />
            <asp:TextBox ID="InputTextBox" runat="server" /> <br />
            <asp:Button ID="ClickMeButton" runat="server" Text="Click Me" OnClick="ClickMeButton_Click" /> <br /> <br />
            <asp:Label ID="OutputLabel" runat="server" />
        </div>
    </form>
</body>
</html>

Switch to the Design view of the MyScraper.aspx page, click the ClickMeButton button, press F4 to open the Properties window, click the Events tab and then double-click the Click action. This creates a handler in MyScraper.aspx.cs.

We want to get all the links on a particular page of a website. To do that, add the using directive using HtmlAgilityPack; to MyScraper.aspx.cs and update the ClickMeButton_Click handler as follows:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Web;
using System.Web.UI;
using System.Web.UI.WebControls;
using HtmlAgilityPack; 

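namespace HTMLAgilityPackStartUp
{
    public partial class MyScraper : System.Web.UI.Page
    {
        protected void Page_Load(object sender, EventArgs e)
        {
        }

        protected void ClickMeButton_Click(object sender, EventArgs e)
        {
            // Download the page at the URL typed into the text box and parse it.
            var getHtmlWeb = new HtmlWeb();
            var document = getHtmlWeb.Load(InputTextBox.Text);

            // Select every <a> element; SelectNodes returns null when nothing matches.
            var aTags = document.DocumentNode.SelectNodes("//a");
            int counter = 1;
            if (aTags != null)
            {
                foreach (var aTag in aTags)
                {
                    // GetAttributeValue falls back to an empty string for anchors without
                    // an href, where Attributes["href"].Value would throw on a null attribute.
                    OutputLabel.Text += counter + ". " + aTag.InnerHtml + " - " + aTag.GetAttributeValue("href", string.Empty) + "\t" + "<br />";
                    counter++;
                }
            }
        }
    }
}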
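The handler above appends raw InnerHtml to the label, so any markup inside a scraped link is rendered in our own page, and because it uses += the label can keep growing across postbacks while view state is enabled. One possible variation, sketched below, builds the whole list first and HTML-encodes the scraped values. LinkFormatter and FormatLinks are hypothetical names for this illustration; WebUtility.HtmlEncode comes from System.Net.

```csharp
using System.Collections.Generic;
using System.Net;
using System.Text;

public static class LinkFormatter
{
    // Builds the numbered "text - href" list as one HTML string.
    // Key is the link text and Value is the href. Both are HTML-encoded so that
    // markup scraped from the remote page is shown as text, not rendered.
    public static string FormatLinks(IEnumerable<KeyValuePair<string, string>> links)
    {
        var builder = new StringBuilder();
        int counter = 1;
        foreach (var link in links)
        {
            builder.Append(counter).Append(". ")
                   .Append(WebUtility.HtmlEncode(link.Key))
                   .Append(" - ")
                   .Append(WebUtility.HtmlEncode(link.Value))
                   .Append("<br />");
            counter++;
        }
        return builder.ToString();
    }
}
```

In the click handler the result would be assigned with OutputLabel.Text = LinkFormatter.FormatLinks(links); so that each click replaces the previous output instead of adding to it.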

Explaining the code (MyScraper.aspx.cs)

var getHtmlWeb = new HtmlWeb();

Create an instance of HtmlWeb so that we can use its Load method to fetch the DOM of a URL.

var document = getHtmlWeb.Load(InputTextBox.Text);

The Load method downloads the page and returns its DOM as an HtmlDocument.
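Since the URL comes straight from a text box, it may be worth checking it before handing it to Load. The sketch below is one way to do that, assuming we only want to accept absolute http and https addresses. PageLoader, TryGetHttpUri and LoadPage are hypothetical helper names; Uri.TryCreate and HtmlWeb.Load are the real calls.

```csharp
using System;
using HtmlAgilityPack;

public static class PageLoader
{
    // Returns true only for absolute http:// or https:// addresses.
    public static bool TryGetHttpUri(string input, out Uri uri)
    {
        Uri candidate;
        bool isAbsolute = Uri.TryCreate((input ?? string.Empty).Trim(),
                                        UriKind.Absolute, out candidate);
        if (isAbsolute &&
            (candidate.Scheme == Uri.UriSchemeHttp || candidate.Scheme == Uri.UriSchemeHttps))
        {
            uri = candidate;
            return true;
        }

        uri = null;
        return false;
    }

    // Downloads and parses the page, or returns null when the input is not
    // a usable web address. Network errors still surface as exceptions.
    public static HtmlDocument LoadPage(string input)
    {
        Uri uri;
        if (!TryGetHttpUri(input, out uri))
        {
            return null;
        }
        return new HtmlWeb().Load(uri.AbsoluteUri);
    }
}
```

The handler could then show a short message in OutputLabel when LoadPage returns null, instead of letting an invalid address raise an exception.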

var aTags = document.DocumentNode.SelectNodes("//a");

With the DOM available as an HtmlDocument, we use DocumentNode and its SelectNodes method to select all nodes of a particular tag (in our case, the “a” tag).

int counter = 1;

We use a counter to number the links we find.

if (aTags != null)
{
    foreach (var aTag in aTags)
    {
       OutputLabel.Text += counter + ". " + aTag.InnerHtml + " - " + 
         aTag.GetAttributeValue("href", string.Empty) + "\t" + "<br />";
       counter++;
    }
}

Then we check whether any nodes were found. If they were, we loop through the nodes and print the InnerHtml of each <a> tag together with the value of its “href” attribute, incrementing the counter each time we process a node.
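The null check matters because SelectNodes returns null, not an empty collection, when nothing matches. The following sketch wraps the same steps in a small helper that works on an HTML string, so the two edge cases (no links at all, and an <a> tag without an href) are easy to try out. LinkExtractor and ExtractLinks are hypothetical names used only here.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class LinkExtractor
{
    // Returns one (link text, href) pair per <a> element in the HTML.
    // The list is empty when the page contains no <a> elements.
    public static List<KeyValuePair<string, string>> ExtractLinks(string html)
    {
        var links = new List<KeyValuePair<string, string>>();

        var document = new HtmlDocument();
        document.LoadHtml(html);

        // SelectNodes gives null when no node matches the expression.
        var aTags = document.DocumentNode.SelectNodes("//a");
        if (aTags == null)
        {
            return links;
        }

        foreach (var aTag in aTags)
        {
            // An anchor without an href yields an empty string here.
            string href = aTag.GetAttributeValue("href", string.Empty);
            links.Add(new KeyValuePair<string, string>(aTag.InnerText.Trim(), href));
        }
        return links;
    }
}
```

If anchors without an href are of no interest, the XPath expression //a[@href] selects only the ones that have the attribute.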

After that, press F6 to build the project and F5 to run it. Enter https://htmlagilitypack.codeplex.com/ as the input and click Click Me. The page will list the InnerHtml and href of every <a> tag found at that URL.

So, this is how we used HtmlAgilityPack with ASP.NET to parse DOM elements from a web page. You can parse any tag and traverse the document through ParentNode, ChildNodes and many other members of the built-in classes. HAP gives you a lot of flexibility in walking through the DOM and selecting the elements you want to work with. Hope you like it.
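As a small illustration of traversal, the sketch below walks upwards from a node through ParentNode and lists the element children of a node through ChildNodes. DomWalker and its two methods are hypothetical helpers written for this article; ParentNode, ChildNodes, NodeType and Name are HAP members.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class DomWalker
{
    // Returns the names of the enclosing elements, nearest first,
    // by following ParentNode until the document node is reached.
    public static List<string> AncestorNames(HtmlNode node)
    {
        var names = new List<string>();
        for (var current = node.ParentNode;
             current != null && current.NodeType == HtmlNodeType.Element;
             current = current.ParentNode)
        {
            names.Add(current.Name);
        }
        return names;
    }

    // Returns the direct element children of a node,
    // skipping text and comment nodes.
    public static List<HtmlNode> ElementChildren(HtmlNode node)
    {
        return node.ChildNodes
                   .Where(child => child.NodeType == HtmlNodeType.Element)
                   .ToList();
    }
}
```

For a link inside a list item of a <ul> menu, AncestorNames gives li followed by ul, and ElementChildren on the <ul> node gives its <li> items.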

Tips

Go to https://htmlagilitypack.codeplex.com/ and follow the forum and discussions. At the time of writing, the documentation tab offers no help, so there is room for you to contribute to this open source project. Try to go through all the features and functionality HAP provides: the more you know about it, the easier scraping will be.
