Web Scraping using Regular Expression and HtmlAgility

Eric M. H. Goh

Apr 2, 2018

Apache

Web Scraping using Regular Expression and HtmlAgility for Data Mining

Introduction

Data science is a growing field. Under the CRISP-DM model and other data mining process models, we need to collect data before we can mine knowledge from it or conduct predictive analysis. Data collection can involve data scraping, which includes web scraping (HTML to text) as well as image-to-text and video-to-text conversion. Once the data is in text format, we usually apply text mining techniques to extract knowledge.

In this post, I am going to introduce you to web scraping. I developed Just Another Web Scraper (JAWS) to download a webpage from a URL and then extract text from it using regular expressions or HtmlAgilityPack.

I have included the source code for all of the JAWS features, but this article explains only text extraction with regular expressions and HtmlAgilityPack.

Downloading the Webpage

To download the webpage, we must import the System.Net and System.IO namespaces:

using System.Net;
using System.IO;

Then create the WebClient object:

WebClient web = new WebClient();

We can then download the webpage to a temp file:

web.DownloadFile(url, Directory.GetCurrentDirectory() + "/temp.html");

To load the downloaded webpage into richTextBox1:

// Read the saved page line by line and append it to the rich text box.
// The using block closes the reader even if an exception is thrown.
using (StreamReader sr = new StreamReader(Directory.GetCurrentDirectory() + "/temp.html"))
{
    string line;
    while ((line = sr.ReadLine()) != null)
    {
        richTextBox1.Text += "\n" + line;
    }
}
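
The download and load steps above can also be combined into a single helper. The following is a minimal sketch, not the JAWS source: the class name PageDownloader and the methods TempPath and Fetch are hypothetical names introduced for illustration, and it assumes a .NET version in which WebClient is still available.

```csharp
using System;
using System.IO;
using System.Net;

// Hypothetical helper, not part of JAWS: downloads a page to temp.html
// in the current directory and returns its contents as one string.
public static class PageDownloader
{
    // Build the temp file path with Path.Combine instead of string concatenation.
    public static string TempPath()
    {
        return Path.Combine(Directory.GetCurrentDirectory(), "temp.html");
    }

    public static string Fetch(string url)
    {
        string path = TempPath();
        // WebClient is IDisposable, so release it once the download finishes.
        using (WebClient web = new WebClient())
        {
            web.DownloadFile(url, path);
        }
        // Read the whole file in one call rather than appending line by line.
        return File.ReadAllText(path);
    }
}
```

In a form, this could be used as richTextBox1.Text = PageDownloader.Fetch(url);, which sets the text once instead of growing it inside a loop.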

Extracting Text using Regular Expression

To extract text using regular expressions, import the following namespace:

using System.Text.RegularExpressions;

To search the text in richTextBox1 using the regular expression stored in the regularExp variable:

MatchCollection matches = Regex.Matches(richTextBox1.Text, regularExp, RegexOptions.Singleline);

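regularExp can hold a pattern such as "<title>\s*(.+?)\s*</title>", which matches the page title. In C# source, write the pattern as a verbatim string (prefix the literal with @) so that the backslashes are not treated as escape sequences.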

Then append the matches to richTextBox1:

// m.Value is the full match, including the surrounding tags.
foreach (Match m in matches) {
    richTextBox1.Text += m.Value;
}
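
Because m.Value contains the whole match, the output above still includes the <title> tags. A minimal sketch of reading only the captured group is shown below; the class name TitleExtractor and the method ExtractGroup are hypothetical names used for illustration and are not part of JAWS.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of JAWS.
public static class TitleExtractor
{
    // Returns the text captured by group 1 for every match of the pattern.
    public static List<string> ExtractGroup(string html, string pattern)
    {
        var results = new List<string>();
        // Singleline lets '.' match newlines, so a title split across lines still matches.
        foreach (Match m in Regex.Matches(html, pattern, RegexOptions.Singleline))
        {
            // m.Value would be the whole match, tags included;
            // Groups[1] is only the part inside the parentheses.
            results.Add(m.Groups[1].Value);
        }
        return results;
    }
}
```

For example, calling ExtractGroup with the pattern @"<title>\s*(.+?)\s*</title>" on a page whose title element is <title> Example Page </title> yields a single entry, Example Page, with the tags and surrounding whitespace removed.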

Extracting Text using HtmlAgility

To extract text using HtmlAgilityPack, add a reference to the HtmlAgilityPack library (available as a NuGet package) and import its namespace:

using HtmlAgilityPack;

Load the downloaded HTML file into an HtmlAgilityPack HtmlDocument object:

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.Load(Directory.GetCurrentDirectory() + "/temp.html");

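Extract data from the HTML:

HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(mFromTextBox.Text);
if (nodes != null) { // SelectNodes returns null when nothing matches
    foreach (HtmlNode n in nodes) {
        richTextBox1.Text += n.InnerHtml;
    }
}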

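mFromTextBox.Text holds the XPath expression used for extraction, for example "//body". Note that InnerHtml returns the markup inside each matched node; use InnerText instead to get only the text.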
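
The XPath extraction can also be wrapped in a helper that returns plain text. The following is a sketch under stated assumptions, not the JAWS implementation: XPathExtractor and ExtractText are hypothetical names, the HtmlAgilityPack package must be installed, and the HTML is parsed from a string with LoadHtml rather than from temp.html.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper, not part of JAWS. Requires the HtmlAgilityPack package.
public static class XPathExtractor
{
    // Returns the plain text of every node matched by the XPath expression.
    public static List<string> ExtractText(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // parse from a string instead of a file

        var results = new List<string>();
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes == null)
        {
            return results; // SelectNodes returns null when nothing matches
        }
        foreach (HtmlNode n in nodes)
        {
            // InnerText drops the tags; DeEntitize turns entities such as &amp;
            // back into characters.
            results.Add(HtmlEntity.DeEntitize(n.InnerText).Trim());
        }
        return results;
    }
}
```

In a file that also imports System.Windows.Forms, qualify the type as HtmlAgilityPack.HtmlDocument, as the code earlier in this article does, because Windows Forms defines its own HtmlDocument class.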