Web Scraping in ASP.NET with Regular Expression Matching and XML Transformation
A demo web scraping ASP.NET application utilizing Regular Expression matching and XML transformation.
Introduction
Web scraping is a useful technique for extracting information from websites, and sometimes it is the best or even the only way to get the data you need. This article demonstrates the structure of an ASP.NET web scraping application. The sample application retrieves information about Ontario's Cabinet Ministers from the Ontario Premier's website and converts it to a customized layout. The original web page URL is http://www.premier.gov.on.ca/team/default.asp and it looks like this:
The output web page looks like this:
Background
Web scraping can be implemented in many ways, but few ASP.NET examples are available online. This sample application is the result of my exploring different ways to implement web scraping in ASP.NET.
Features
- Flexible: Web page parsing is done through Regular Expression matching, which makes the code easier to change whenever the source web page changes.
- Extensible: Since the intermediate data is XML, it's easy to switch to another output format. For example, instead of outputting to a web page, the last step can write to a file or be exposed as a Web Service for other applications.
- Efficient: XSLT is used to transform the XML data into an HTML page. This makes coding very easy, since an HTML page template can be converted into an XSLT file without creating server controls in ASP.NET code.
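As a rough illustration of the "output to a file" variation mentioned above, the following sketch saves an XmlDocument to disk with indentation. The class name MinisterXmlOutput and the method SaveToFile are hypothetical and are not part of the sample application; only standard System.Xml calls are used.

```csharp
using System.Text;
using System.Xml;

public static class MinisterXmlOutput
{
    // Hypothetical helper: writes the scraped XML to a file
    // instead of rendering it to a web page.
    public static void SaveToFile(XmlDocument doc, string path)
    {
        XmlWriterSettings settings = new XmlWriterSettings();
        settings.Indent = true;            // human-readable output
        settings.Encoding = Encoding.UTF8;

        using (XmlWriter writer = XmlWriter.Create(path, settings))
        {
            doc.Save(writer);
        }
    }
}
```

Because the rest of the pipeline only produces XML, swapping the final step this way would not require changes to the scraping code.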
How it works
The following diagram shows the application structure and data flow of the process:
- First, view the source of the source web page and compare it with our output web page. One block of HTML is identified for each minister.
- To extract this block and put it into an XmlDocument object, Regular Expression matching is used.
- The initial XML schema is then converted to the custom schema XML format using xmlOCM.xslt.
- Next, another XSLT file, htmlOCM.xslt, is used to transform the output XML into the HTML format of the output web page.
- Lastly, to make the content change dynamically based on the selection in the left side tab, JavaScript is used to show the corresponding content.
The listings below follow the same order as these steps.
The following block is identified for each minister in the source web page:
<div class="grid_3 noborder center">
<a href="biography.asp?MPPID=8"><img
src="http://www.premier.gov.on.ca/photos/team/ChrisBentley.jpg"
width="144" height="171" alt="Chris Bentley's Biography" /></a>
</div>
<div class="grid_3">
<h3>
<a href="biography.asp?MPPID=8"
title="Chris Bentley's Biography">Chris Bentley</a>
</h3>
<p>
Attorney General <br /> Minister of Aboriginal Affairs
<br />
MPP London West
</p>
<ul>
<li><a href="http://ontario.ca/MAG">Attorney General</a></li>
<li><a href="http://ontario.ca/MAA">Aboriginal Affairs</a></li>
</ul>
</div>
Here is the function to receive the source HTML stream and parse it to generate the source XML data:
using System.IO;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;
using System.Xml;

// srcDoc (an XmlDocument) and xmlelemRoot (its root <Ministers> element) are
// assumed to be fields of the containing class; they are not shown here.
public void ConstructXMLDoc()
{
    // Request the page from the source website.
    const string strUrl = "http://www.premier.gov.on.ca/team/default.asp?Lang=EN";
    string pageContent;
    using (WebClient webClient = new WebClient())
    {
        byte[] reqHTML = webClient.DownloadData(strUrl);
        pageContent = Encoding.UTF8.GetString(reqHTML);
    }

    // Regular expression matching one minister's block. It is written against
    // the page with line breaks and tabs removed, which is why "<img" runs
    // straight into "src" and ".jpg\"" into "width".
    Regex r = new Regex("<div class=\"grid_3 noborder center\">" +
        "<a href=\"biography.asp\\?MPPID=[0-9]+\"><imgsrc=\"http://" +
        "www.premier.gov.on.ca/photos/team/[A-Za-z]+" +
        ".jpg\"width=\"144\" height=\"171\" alt=\"[A-Za-z .]+'s " +
        "Biography\" /></a></div><div class=\"grid_3\"><h3>" +
        "<a href=\"biography.asp\\?MPPID=[0-9]+\"title=\"[A-Za-z .]+'s " +
        "Biography\">[A-Za-z .]+</a></h3><p>[A-Za-z .,’'-]+" +
        "(<br />[A-Za-z .,’'-]+)+</p><ul>(<li><a href=\"http://" +
        "[A-Za-z.]+.ca/[0-9A-Za-z./&=;\\?]+\">[0-9A-Za-z .,’;" +
        "&#-]+</a></li>)+</ul></div>");

    pageContent = pageContent.Replace("\r", "").Replace("\n", "").Replace("\t", "");
    MatchCollection mcl = r.Matches(pageContent);

    // Loop through each minister to construct the source XML.
    foreach (Match ml in mcl)
    {
        // Restore the spaces between attributes that were lost when the
        // line breaks were removed, so the fragment is well-formed XML.
        string xmlNode = ml.Groups[0].Value.Replace("imgsrc", "img src").Replace(
            "width", " width").Replace("title", " title").Replace("\\\"", "\"");
        using (XmlReader xmlReader = XmlReader.Create(new StringReader(
            "<Minister>" + xmlNode + "</Minister>")))
        {
            xmlelemRoot.AppendChild(srcDoc.ReadNode(xmlReader));
        }
    }
}
After this step, the output XML looks like this:
<Ministers>
<Minister>
<div class="grid_3 noborder center">
<a href="biography.asp?MPPID=75">
<img
src="http://www.premier.gov.on.ca/photos/team/SophiaAggelonitis.jpg"
width="144" height="171" alt="Sophia Aggelonitis's Biography" />
</a>
</div>
<div class="grid_3">
<h3>
<a href="biography.asp?MPPID=75"
title="Sophia Aggelonitis's Biography">Sophia Aggelonitis</a>
</h3>
<p>Minister of Revenue<br />Minister
Responsible for Seniors<br />MPP Hamilton Mountain</p>
<ul>
<li>
<a href="http://ontario.ca/OSS">Seniors’ Secretariat</a>
</li>
<li>
<a href="http://ontario.ca/revenue">Revenue</a>
</li>
</ul>
</div>
</Minister>
</Ministers>
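To make the match-then-wrap technique easier to try without hitting the live site, here is a minimal offline sketch. It uses a deliberately simplified, hypothetical pattern that matches only the heading link, and the class name MinisterBlockParser is an assumption for illustration. It also assumes, like the article's code, that attributes are separated by a line break and a tab in the source HTML.

```csharp
using System.IO;
using System.Text.RegularExpressions;
using System.Xml;

public static class MinisterBlockParser
{
    // Simplified, hypothetical pattern: an <h3> heading that contains one link.
    // It targets the flattened HTML, so there is no space before "title".
    private static readonly Regex HeadingPattern = new Regex(
        "<h3><a href=\"[^\"]+\"title=\"[^\"]+\">[^<]+</a></h3>");

    public static XmlDocument Parse(string html)
    {
        // Same normalization as in the article: drop CR, LF, and tabs.
        string flat = html.Replace("\r", "").Replace("\n", "").Replace("\t", "");

        XmlDocument doc = new XmlDocument();
        XmlElement root = doc.CreateElement("Ministers");
        doc.AppendChild(root);

        foreach (Match m in HeadingPattern.Matches(flat))
        {
            // Restore the space lost when the line break before "title" was removed.
            string fragment = m.Value.Replace("\"title=", "\" title=");
            using (XmlReader reader = XmlReader.Create(
                new StringReader("<Minister>" + fragment + "</Minister>")))
            {
                root.AppendChild(doc.ReadNode(reader));
            }
        }
        return doc;
    }
}
```

Each match is wrapped in a Minister element and parsed as XML, so a fragment that is not well-formed will raise an XmlException rather than silently producing bad data.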
The initial XML schema contains unnecessary information, and its format is not convenient for transforming directly into our destination HTML page, so an extra step is added here. xmlOCM.xslt is used to convert the initial XML schema to our custom schema XML format. This way, we can wrap the previous steps and this step into a Web Service so that multiple destination web pages can use the same Web Service to get the customized XML. The service could also be expanded to scrape more than one source web page. Other uses include retrieving the scraped data from the Web Service for data comparison, research, or analysis.
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:msxsl="urn:schemas-microsoft-com:xslt" exclude-result-prefixes="msxsl">
<xsl:output method="xml" indent="yes"/>
<xsl:template match ="/">
<Ministers>
<xsl:apply-templates select ="Ministers/Minister"/>
</Ministers>
</xsl:template>
<xsl:template match ="Ministers/Minister">
<Minister>
<Image>
<xsl:attribute name="href">
<xsl:value-of select="div[1]/a/@href"/>
</xsl:attribute>
<xsl:attribute name="src">
<xsl:value-of select="div[1]/a/img/@src"/>
</xsl:attribute>
<xsl:attribute name="alt">
<xsl:value-of select="div[1]/a/img/@alt"/>
</xsl:attribute>
</Image>
<Name>
<xsl:attribute name="href">
<xsl:value-of select="div[2]/h3/a/@href"/>
</xsl:attribute>
<xsl:attribute name="title">
<xsl:value-of select="div[2]/h3/a/@title"/>
</xsl:attribute>
<xsl:value-of select="div[2]/h3/a"/>
</Name>
<Ministries>
<xsl:for-each select="div[2]/ul/li">
<Ministry>
<xsl:attribute name="href">
<xsl:value-of select="a/@href"/>
</xsl:attribute>
<xsl:value-of select="a"/>
</Ministry>
</xsl:for-each>
</Ministries>
</Minister>
</xsl:template>
</xsl:stylesheet>
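The article does not show the C# code that applies this stylesheet. One possible way to do it, assuming the standard XslCompiledTransform class is used, is sketched below. The class name XsltRunner and the method Transform are hypothetical.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Xsl;

public static class XsltRunner
{
    // Hypothetical helper: applies the stylesheet stored at xsltPath
    // to sourceXml and returns the transformed output as a string.
    public static string Transform(XmlDocument sourceXml, string xsltPath)
    {
        XslCompiledTransform xslt = new XslCompiledTransform();
        xslt.Load(xsltPath);

        using (StringWriter output = new StringWriter())
        {
            // No stylesheet parameters are needed, so the argument list is null.
            xslt.Transform(sourceXml, null, output);
            return output.ToString();
        }
    }
}
```

The same helper could be called twice, first with xmlOCM.xslt and then with htmlOCM.xslt, by loading the first result into a new XmlDocument before the second call.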
After this step, the output XML looks like this:
<Ministers>
<Minister>
<Image href="biography.asp?MPPID=75"
src="http://www.premier.gov.on.ca/photos/team/SophiaAggelonitis.jpg"
alt="Sophia Aggelonitis's Biography" />
<Name href="biography.asp?MPPID=75"
title="Sophia Aggelonitis's Biography">Sophia Aggelonitis</Name>
<Ministries>
<Ministry href="http://ontario.ca/OSS">Seniors' Secretariat</Ministry>
<Ministry href="http://ontario.ca/revenue">Revenue</Ministry>
</Ministries>
</Minister>
</Ministers>
The second XSLT file, htmlOCM.xslt, transforms this custom XML into the HTML of the output web page:
<?xml version="1.0" ?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:msxsl="urn:schemas-microsoft-com:xslt"
xmlns:user="http://mydomain.com/myname">
<xsl:template match="/">
<div id="wrapper">
<div id="header">
<br />
<h1>
Ontario Cabinet Ministers
</h1>
<div>
<a id="top" name="top"></a>
</div>
</div>
<div id="right_column">
<xsl:for-each select="Ministers/Minister">
<div class="row" id="minister{position()}"
style="display:none;">
<div class="grid_3 noborder center">
<xsl:for-each select="Image">
<a
href="http://www.premier.gov.on.ca/team/{@href}"
target="_blank">
<img alt="{@alt}" src="{@src}"
width="144" height="171" />
</a>
</xsl:for-each>
</div>
<div class="grid_3">
<xsl:for-each select="Name">
<h2>
<a title="{@title}"
href="http://www.premier.gov.on.ca/team/{@href}"
target="_blank">
<xsl:value-of select="." />
</a>
</h2>
</xsl:for-each>
<ul>
<xsl:for-each select="Ministries/Ministry">
<li>
<a href="{@href}" target="_blank">
<xsl:value-of select="." />
</a>
</li>
</xsl:for-each>
</ul>
</div>
</div>
</xsl:for-each>
</div>
<div id="left_column">
<div class="leftnav">
<h2 class="header">
<a
href="http://www.premier.gov.on.ca/team/default.asp?Lang=EN#"
rel="homemenu" shape="rect"
target="_blank">Cabinet</a>
</h2>
<ul id="homemenu" class="menu">
<xsl:for-each select="Ministers/Minister">
<li id="li{position()}" class="li-inactive"
onclick="showDiv('{position()}');">
<xsl:value-of select="Name[1]"/>
</li>
</xsl:for-each>
</ul>
</div>
</div>
</div>
</xsl:template>
</xsl:stylesheet>
The following JavaScript shows the content for the minister selected in the left side tab and hides the previously selected one:
<script language="javascript" type="text/javascript">
var selectedID = "1";
function showDiv(id)
{
document.getElementById("minister" + selectedID).style.display = "none";
document.getElementById("li" + selectedID).className = 'li-inactive';
selectedID = id;
document.getElementById("minister" + selectedID).style.display = "";
document.getElementById("li" + selectedID).className = 'li-active';
}
showDiv(selectedID);
</script>
Files
- OCMWebApp_src.zip contains all the source files for this sample application. It's written in C# with Visual Studio 2010, targeting .NET 4.
History
- April 20, 2011 - First release.