Webscraping using Regular Expression and HtmlAgility

2 Apr 2018, Apache License 2.0
Web Scraping using Regular Expression and HtmlAgility for Data Mining

Introduction

Data science is a growing field. According to the CRISP-DM model and other data mining process models, data must be collected before we can mine knowledge from it and conduct predictive analysis. Data collection can involve data scraping, which includes web scraping (HTML to text) as well as image-to-text and video-to-text conversion. Once the data is in text format, we usually apply text mining techniques to extract knowledge from it.

In this post, I am going to introduce you to web scraping. I developed Just Another Web Scraper (JAWS) to download a webpage from a URL and then extract text from it using regular expressions or the Html Agility Pack.

The source code for all of the JAWS features is included. In this article, I am going to explain only the text extraction using regular expressions and the Html Agility Pack.

Downloading the Webpage

To download the webpage, we must import the System.Net and System.IO namespaces:

using System.Net;
using System.IO;

Then create the WebClient object:

WebClient web = new WebClient();

We can then download the webpage to a temp file:

web.DownloadFile(url, Directory.GetCurrentDirectory() + "/temp.html");

To load the downloaded webpage into richTextBox1:

// The using block closes the reader even if reading fails.
using (StreamReader sr = new StreamReader(Directory.GetCurrentDirectory() + "/temp.html"))
{
    // Read the whole file at once instead of appending line by line.
    richTextBox1.Text += sr.ReadToEnd();
}
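The download and load steps above can be combined into one small helper. This is only a sketch, not the JAWS implementation: the class name PageDownloader and its method names are hypothetical, and Path.Combine is used here in place of the string concatenation shown above. Note that WebClient is marked obsolete in recent .NET versions in favor of HttpClient, although it still works.

```csharp
using System;
using System.IO;
using System.Net;

// Hypothetical helper, not part of JAWS.
public static class PageDownloader
{
    // Builds the temp file path with Path.Combine so the separator
    // is handled correctly on each platform.
    public static string GetTempPath(string fileName = "temp.html")
    {
        return Path.Combine(Directory.GetCurrentDirectory(), fileName);
    }

    // Downloads the page at url to the temp file and returns its text.
    public static string DownloadAndRead(string url)
    {
        string path = GetTempPath();
        using (WebClient web = new WebClient())
        {
            web.DownloadFile(url, path);
        }
        return File.ReadAllText(path);
    }
}
```

In a form, the result could then be assigned directly, for example richTextBox1.Text = PageDownloader.DownloadAndRead(url);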

Extracting Text using Regular Expression

To extract text using regular expressions, import the following namespace:

using System.Text.RegularExpressions;

To search the text in richTextBox1 using the regular expression stored in the regularExp variable:

MatchCollection matches = Regex.Matches(richTextBox1.Text, regularExp, RegexOptions.Singleline);

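regularExp can hold a pattern such as @"<title>\s*(.+?)\s*</title>". Written as a C# verbatim string (with the @ prefix), the backslashes do not need to be escaped.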

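Then append the results to richTextBox1. Note that m.Value is the entire match, including the surrounding tags; use m.Groups[1].Value to get only the captured text:

foreach (Match m in matches) {
    richTextBox1.Text += m.Value;
}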
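As a minimal sketch of the regular expression step, the helper below runs a pattern against an HTML string and returns only the text captured by the first group, rather than the whole match. The class name TitleExtractor and the method ExtractGroup are hypothetical and are not part of JAWS; the pattern and the RegexOptions.Singleline option are the ones used in this article.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of JAWS.
public static class TitleExtractor
{
    // Returns the text captured by group 1 for every match of pattern in html.
    public static List<string> ExtractGroup(string html, string pattern)
    {
        var results = new List<string>();
        // Singleline lets '.' match newlines, so a tag's content may span lines.
        foreach (Match m in Regex.Matches(html, pattern, RegexOptions.Singleline))
        {
            results.Add(m.Groups[1].Value);
        }
        return results;
    }
}
```

For example, calling TitleExtractor.ExtractGroup("<title> Hello </title>", @"<title>\s*(.+?)\s*</title>") should return a list with the single entry "Hello", without the tags or the surrounding whitespace.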

Extracting Text using HTMLAgility

To extract text using the Html Agility Pack, install the HtmlAgilityPack NuGet package and import its namespace:

using HtmlAgilityPack; 

Load the HTML file into an HtmlAgilityPack HtmlDocument object. The fully qualified name avoids a clash with the HtmlDocument class in System.Windows.Forms:

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.Load(Directory.GetCurrentDirectory() + "/temp.html");

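Extract data from the HTML with an XPath query:

HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(mFromTextBox.Text);
if (nodes != null) { // SelectNodes returns null when nothing matches
    foreach (HtmlNode n in nodes) {
        richTextBox1.Text += n.InnerHtml;
    }
}

mFromTextBox.Text contains the XPath expression used for extraction, for example "//body". Note that InnerHtml returns the markup inside each matched node; use InnerText to get only the text.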
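The XPath extraction can also be sketched without a form or a temp file. The helper below parses an HTML string with LoadHtml and returns the inner text of each matching node. The class name XPathExtractor and the method ExtractText are hypothetical and not part of JAWS; the sketch assumes the HtmlAgilityPack NuGet package is referenced.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper, not part of JAWS.
public static class XPathExtractor
{
    // Returns the trimmed inner text of every node matching xpath,
    // or an empty list when nothing matches.
    public static List<string> ExtractText(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // parse from a string instead of a file

        var results = new List<string>();
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes == null)
        {
            return results; // SelectNodes returns null when nothing matches
        }
        foreach (HtmlNode n in nodes)
        {
            results.Add(n.InnerText.Trim());
        }
        return results;
    }
}
```

Returning an empty list instead of null keeps the calling code simple: a foreach over the result is safe even when the XPath expression matches nothing.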

License

This article, along with any associated source code and files, is licensed under The Apache License, Version 2.0.

About the Author

Eric M. H. Goh
Founder SVBook Pte. Ltd.
Singapore
Eric Goh is a data scientist, software engineer, adjunct faculty member and entrepreneur with years of experience in multiple industries. His varied career includes data science, data and text mining, natural language processing, machine learning, intelligent system development, and engineering product design. He founded SVBook Pte. Ltd. and extended it with DSTK.Tech and EMHAcademy.com. DSTK.Tech is where Eric develops his own DSTK data science software (public version). Eric also published “Learn R for Applied Statistics” with Apress, and has published books through LeanPub, Google Books, Amazon Kindle, and SVBook Pte. Ltd. He teaches the content at EMHAcademy.com, Udemy, SkillShare, BitDegree, and Simpliv, and has developed 28 courses and 7 advanced certificates. Eric is also an adjunct faculty member at universities and institutions.

Eric Goh has led teams on various industrial projects, including the advanced product code classification system project, which automates Singapore Customs' trade facilitation process, and Nanyang Technological University's data science projects, where he developed his own DSTK data science software (NTU version). He has years of experience in C#, Java, C/C++, SPSS Statistics and Modeller, SAS Enterprise Miner, R, Python, Excel, Excel VBA, etc. He won the Tan Kah Kee Young Inventors' Merit Award 2007 and was a Shortlisted Entry for the TelR Data Mining Challenge.

Eric holds a Masters of Technology degree from the National University of Singapore, an Executive MBA degree from U21Global (currently GlobalNxt) and IGNOU, a Graduate Diploma in Mechatronics from A*STAR SIMTech (a national research institute located in Nanyang Technological University), Coursera Specialization Certificate in Business Statistics and Analysis (Excel) from Rice University, IBM Data Science Professional Certificate (Python, SQL), and Coursera Verified Certificate in R Programming from Johns Hopkins University. He possessed a Bachelor of Science degree in Computing fr
