Posted 2 Apr 2018
Licenced Apache

Webscraping using Regular Expression and HtmlAgility

Web Scraping using Regular Expression and HtmlAgility for Data Mining

Introduction

Data science is a growing field. According to the CRISP-DM model and other data mining models, we need to collect data before we can mine knowledge from it and conduct predictive analysis. Data collection can involve data scraping, which includes web scraping (HTML to text) as well as image-to-text and video-to-text conversion. Once the data is in text format, we usually apply text mining techniques to extract knowledge from it.

In this post, I introduce web scraping. I developed Just Another Web Scraper (JAWS) to download a web page from a URL and then extract text from it using either regular expressions or Html Agility Pack.

The source code for all of JAWS's features is included. This article explains only the text extraction using regular expressions and Html Agility Pack.

Downloading the Webpage

To download the web page, we must import the System.Net and System.IO namespaces:

using System.Net;
using System.IO;

Then create the WebClient object:

WebClient web = new WebClient();

We can then download the webpage to a temp file:

web.DownloadFile(url, Directory.GetCurrentDirectory() + "/temp.html");
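The download step can be wrapped in a small helper. The following is a minimal sketch, not the JAWS source: the class name PageDownloader and the method name DownloadToTemp are hypothetical, and it assumes the classic WebClient API shown above (newer .NET versions mark WebClient as obsolete in favor of HttpClient).

```csharp
using System;
using System.IO;
using System.Net;

public static class PageDownloader
{
    // Hypothetical helper: downloads the page at 'url' into temp.html in the
    // current directory and returns the full path of the saved file.
    public static string DownloadToTemp(string url)
    {
        string path = Path.Combine(Directory.GetCurrentDirectory(), "temp.html");

        // WebClient is IDisposable, so release it as soon as the download ends.
        using (WebClient web = new WebClient())
        {
            web.DownloadFile(url, path);
        }
        return path;
    }
}
```

A call such as PageDownloader.DownloadToTemp("https://example.com/") (the URL is only illustrative) would save the page and return the location of temp.html. Path.Combine is used here so the path separator is chosen by the platform rather than hard-coded.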

To load the downloaded web page into richTextBox1:

// Read the saved page line by line and append each line to the rich text box.
// The using block closes the reader even if an exception is thrown.
using (StreamReader sr = new StreamReader(Directory.GetCurrentDirectory() + "/temp.html"))
{
    string line;
    while ((line = sr.ReadLine()) != null)
    {
        richTextBox1.Text += "\n" + line;
    }
}
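Appending to richTextBox1.Text once per line rebuilds the control's text on every iteration. As an alternative sketch (not the JAWS code), the whole file can be read in one call with File.ReadAllText and assigned to the control once. The class name PageLoader and the method name LoadPage are hypothetical.

```csharp
using System.IO;

public static class PageLoader
{
    // Hypothetical helper: returns the entire contents of the saved page.
    public static string LoadPage(string path)
    {
        // File.ReadAllText opens the file, reads it to the end, and closes it.
        return File.ReadAllText(path);
    }
}
```

In the form, this could be used as richTextBox1.Text = PageLoader.LoadPage(Directory.GetCurrentDirectory() + "/temp.html");. Unlike the line-by-line loop, this replaces the existing text instead of appending to it.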

Extracting Text using Regular Expression

To extract text using regular expressions, import the following namespace:

using System.Text.RegularExpressions;

To search the text in richTextBox1 with the regular expression stored in the regularExp variable:

MatchCollection matches = Regex.Matches(richTextBox1.Text, regularExp, RegexOptions.Singleline);

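regularExp can hold a pattern such as "<title>\s*(.+?)\s*</title>", which matches the page title and captures its text without the surrounding whitespace. In C# source, write the pattern as a verbatim string (prefixed with @) so that the backslashes are not treated as escape sequences.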

Then display the results in richTextBox1:

foreach (Match m in matches) {
    // m.Value is the whole match, tags included; m.Groups[1].Value is the captured text only.
    richTextBox1.Text += m.Value;
}
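To make the difference between the whole match and the captured group concrete, here is a small self-contained sketch. The class name TitleExtractor, the method name ExtractGroup, and the sample HTML are hypothetical; the pattern is the one given above.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TitleExtractor
{
    // Hypothetical helper: returns the text captured by group 1 of every match.
    public static List<string> ExtractGroup(string html, string pattern)
    {
        var results = new List<string>();

        // Singleline lets '.' match newlines, so a tag pair that spans
        // several lines is still found.
        foreach (Match m in Regex.Matches(html, pattern, RegexOptions.Singleline))
        {
            results.Add(m.Groups[1].Value);
        }
        return results;
    }
}
```

For the input "<html><head><title>\n  Example Page\n</title></head></html>" and the pattern @"<title>\s*(.+?)\s*</title>", ExtractGroup returns a single item, "Example Page", with the tags and surrounding whitespace removed.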

Extracting Text using Html Agility Pack

To extract text using Html Agility Pack, import the following namespace:

using HtmlAgilityPack; 

Load the HTML file into an HtmlAgilityPack HtmlDocument object:

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.Load(Directory.GetCurrentDirectory() + "/temp.html");

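Extract data from the HTML:

HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(mFromTextBox.Text);
if (nodes != null) { // SelectNodes returns null when nothing matches
    foreach (HtmlNode n in nodes) {
        richTextBox1.Text += n.InnerHtml;
    }
}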

mFromTextBox.Text contains the XPath expression used for extraction, for example "//body".
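As a further illustration of XPath-based extraction, the sketch below collects the href attribute of every link in a page. It is not part of JAWS: the class name LinkExtractor, the method name ExtractLinks, and the XPath "//a[@href]" are choices made for this example, and it assumes the HtmlAgilityPack NuGet package is referenced. It parses from a string with LoadHtml so that it does not depend on temp.html.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class LinkExtractor
{
    // Hypothetical helper: returns the href value of every <a> element.
    public static List<string> ExtractLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // parse from a string instead of a file

        var links = new List<string>();
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
        {
            return links; // SelectNodes returns null when nothing matches
        }
        foreach (HtmlNode a in anchors)
        {
            links.Add(a.GetAttributeValue("href", string.Empty));
        }
        return links;
    }
}
```

In a Windows Forms project, HtmlDocument can be ambiguous with System.Windows.Forms.HtmlDocument, which is one reason to use the fully qualified HtmlAgilityPack.HtmlDocument name there.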

License

This article, along with any associated source code and files, is licensed under The Apache License, Version 2.0

About the Author

Eric M. H. Goh
Founder SVBook
Singapore
Eric Goh is a data scientist, software engineer, adjunct faculty member and entrepreneur with years of experience in multiple industries. His varied career includes data science, data and text mining, natural language processing, machine learning, intelligent system development, and engineering product design. He founded SVBook and extended it with DSTK.Tech and EMHAcademy. DSTK.Tech is where Eric develops his own DSTK data science software. Eric has also published 5 books on LeanPub and SVBook, and teaches their content on Udemy and EMHAcademy. In his free time, Eric is also an adjunct faculty member at University of the People.

Eric Goh has led teams on various industrial projects, including the advanced product code classification system project, which automates Singapore Customs' trade facilitation process, and Nanyang Technological University's data science projects, where he developed his own DSTK data science software. He has years of experience in C#, Java, C/C++, SPSS Statistics and Modeler, SAS Enterprise Miner, R, Python, Excel, Excel VBA, etc. He won the Tan Kah Kee Young Inventors' Merit Award and was a shortlisted entry for the TelR Data Mining Challenge.

He holds a Master of Technology degree from the National University of Singapore, an Executive MBA degree from U21Global (currently GlobalNxt) and IGNOU, a Graduate Diploma in Mechatronics from A*STAR SIMTech (a national research institute located in Nanyang Technological University), and a Coursera Specialization Certificate in Business Statistics and Analysis from Rice University. He earned a Bachelor of Science degree in Computing from the University of Portsmouth after National Service. He is also an AIIM Certified Business Process Management Master (BPMM), a GSTF Certified Big Data Science Analyst (CBDSA), and an IES Certified Lecturer.

Specialties: Data Science, Text Mining, Social Network Analysis, Natural Language Processing, Machine Learning, Software Engineering, Mechatronics, Business.

Article Copyright 2018 by Eric M. H. Goh