Web Scraping in ASP.NET with Regular Expression Matching and XML Transformation

Song_Gao, 21 Apr 2011, CPOL
A demo web scraping ASP.NET application utilizing Regular Expression matching and XML transformation.

Introduction

Web scraping is a useful technique for extracting information from websites, and sometimes it is the best or even the only way to get the data you need. This article demonstrates the structure of an ASP.NET web scraping application. The sample application retrieves information about the Ontario Cabinet Ministers from the Ontario Premier's website and converts it to a customized layout. The original web page URL is http://www.premier.gov.on.ca/team/default.asp and it looks like this:

Ontario Cabinet Ministers

The output web page looks like this:

Output Minister page

Background

Web scraping can be implemented in many ways, but not many ASP.NET examples are available online. This example application is the result of my exploring different ways to implement web scraping in ASP.NET.

Features

  1. Flexible: Web page parsing is done through regular expression matching, which makes the code easier to change whenever the source web page changes.
  2. Extensible: Since XML is used as the intermediate format, it's easy to switch to another output format. For example, instead of outputting to a web page, we can change the last step to write to a file or expose the data through a Web Service for other applications.
  3. Efficient: XSLT is used to transform the XML data into an HTML page. This makes coding very easy, since an HTML page template can be converted into an XSLT file without having to create server controls in ASP.NET code.

How it works

The following diagram shows the application structure and data flow of the process:

Structure and data flow

  1. First, view the source of the source web page and compare it with our output web page. The following block can be identified for each minister:
  2. <div class="grid_3 noborder center">
    <a href="biography.asp?MPPID=8"><img
    src="http://www.premier.gov.on.ca/photos/team/ChrisBentley.jpg"
    width="144" height="171" alt="Chris Bentley's Biography" /></a>
    </div>
    <div class="grid_3">
    <h3>
    <a href="biography.asp?MPPID=8"
    title="Chris Bentley's Biography">Chris Bentley</a>
    </h3>
    <p>
    Attorney General <br /> Minister of Aboriginal Affairs
    <br />
    MPP London West
    </p>
    <ul>
    <li><a href="http://ontario.ca/MAG">Attorney General</a></li>
    <li><a href="http://ontario.ca/MAA">Aboriginal Affairs</a></li>
    </ul>
    </div>
  3. To extract this block and put it into an XmlDocument object, a regular expression is used.
  4. Here is the function that downloads the source HTML and parses it to generate the source XML data:

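    // Needs: using System.IO; using System.Net; using System.Text;
    //        using System.Text.RegularExpressions; using System.Xml;
    // srcDoc (XmlDocument) and xmlelemRoot (its root <Ministers> element) are
    // not declared here; presumably they are fields of the containing class.
    public void ConstructXMLDoc()
    {
        // request page from source website and decode it as UTF-8
        const string strUrl = "http://www.premier.gov.on.ca/team/default.asp?Lang=EN";
        string pageContent;
        using (WebClient webClient = new WebClient())
        {
            byte[] reqHTML = webClient.DownloadData(strUrl);
            pageContent = Encoding.UTF8.GetString(reqHTML);
        }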
    
        // use regular expression to find matching data portion
        // (line breaks and tabs are stripped from the page below, which is why
        // the pattern contains fused tokens such as "imgsrc"; a hyphen placed
        // last in a character class is a literal hyphen)
        Regex r = new Regex("<div class=\"grid_3 noborder center\">" + 
           "<a href=\"biography.asp\\?MPPID=[0-9]+\"><imgsrc=\"http://" + 
           "www.premier.gov.on.ca/photos/team/[A-Za-z]+" + 
           ".jpg\"width=\"144\" height=\"171\" alt=\"[A-Za-z .]+'s " + 
           "Biography\" /></a></div><div class=\"grid_3\"><h3>" + 
           "<a href=\"biography.asp\\?MPPID=[0-9]+\"title=\"[A-Za-z .]+'s " + 
           "Biography\">[A-Za-z .]+</a></h3><p>[A-Za-z .,’'-]+" + 
           "(<br />[A-Za-z .,’'-]+)+</p><ul>(<li><a href=\"http://" + 
           "[A-Za-z.]+.ca/[0-9A-Za-z./&=;\\?]+\">[0-9A-Za-z .,’;" + 
           "&#-]+</a></li>)+</ul></div>");
        pageContent = pageContent.Replace("\r", "").Replace("\n", "").Replace("\t", "");
        MatchCollection mcl = r.Matches(pageContent);
    
        // loop through each minister to construct the source XML
        foreach (Match ml in mcl)
        {
            string xmlNode = ml.Groups[0].Value.Replace("imgsrc", "img src").Replace(
              "width", " width").Replace("title", " title").Replace("\\\"", "\"");
    
            XmlReader xmlReader = XmlReader.Create(new StringReader(
              "<Minister>" + xmlNode + "</Minister>"));
    
            xmlelemRoot.AppendChild(srcDoc.ReadNode(xmlReader));
        }
    }

    After this step, the output XML looks like this:

    <Ministers>
        <Minister>
            <div class="grid_3 noborder center">
                <a href="biography.asp?MPPID=75">
                  <img 
                    src="http://www.premier.gov.on.ca/photos/team/SophiaAggelonitis.jpg" 
                    width="144" height="171" alt="Sophia Aggelonitis's Biography" />
                </a>
            </div>
            <div class="grid_3">
                <h3>
                    <a href="biography.asp?MPPID=75" 
                      title="Sophia Aggelonitis's Biography">Sophia Aggelonitis</a>
                </h3>
                <p>Minister of Revenue<br />Minister 
                  Responsible for Seniors<br />MPP Hamilton Mountain</p>
                <ul>
                    <li>
                        <a href="http://ontario.ca/OSS">Seniors’ Secretariat</a>
                    </li>
                    <li>
                        <a href="http://ontario.ca/revenue">Revenue</a>
                    </li>
                </ul>
            </div>
        </Minister>
    </Ministers>
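
The wrap-and-append step used in ConstructXMLDoc can be tried in isolation. The following is a minimal sketch, not the article's code: the class name MinisterScrapeSketch, the BuildDocument helper, and the cut-down pattern are assumptions made for illustration. The pattern matches only the `<h3>` heading block, and it uses `\s` to tolerate line breaks instead of stripping them from the page first.

```csharp
using System;
using System.IO;
using System.Text.RegularExpressions;
using System.Xml;

public static class MinisterScrapeSketch
{
    // Hypothetical, cut-down pattern: one <h3> heading block per minister.
    // \s* and \s+ absorb line breaks, so the HTML can be matched as downloaded.
    private static readonly Regex HeadingBlock = new Regex(
        "<h3>\\s*<a\\s+href=\"biography\\.asp\\?MPPID=[0-9]+\"\\s+" +
        "title=\"[^\"]*\">[^<]+</a>\\s*</h3>");

    // Wraps every match in <Minister> and appends it to a <Ministers> root.
    public static XmlDocument BuildDocument(string html)
    {
        XmlDocument doc = new XmlDocument();
        XmlElement root = doc.CreateElement("Ministers");
        doc.AppendChild(root);

        foreach (Match m in HeadingBlock.Matches(html))
        {
            // The matched markup is well-formed XML, so it can be parsed as-is.
            using (XmlReader reader = XmlReader.Create(
                new StringReader("<Minister>" + m.Value + "</Minister>")))
            {
                root.AppendChild(doc.ReadNode(reader));
            }
        }
        return doc;
    }

    public static void Main()
    {
        // Sample input modelled on the block shown in step 1.
        string html =
            "<p>ignored</p><h3>\n<a href=\"biography.asp?MPPID=8\"\n" +
            "title=\"Chris Bentley's Biography\">Chris Bentley</a>\n</h3>";

        XmlDocument doc = BuildDocument(html);
        Console.WriteLine(doc.SelectNodes("/Ministers/Minister").Count);
        Console.WriteLine(doc.SelectSingleNode("/Ministers/Minister/h3/a").InnerText);
    }
}
```

This only works because the matched HTML happens to be well-formed XML. Markup with unclosed tags or unescaped ampersands would make XmlReader throw, so a real page may need cleanup before this step.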
  5. Convert the initial XML schema to the custom XML schema.
  6. The initial XML contains unnecessary information, and its structure is awkward to transform directly into the destination HTML page, so an extra step is added here: xmlOCM.xslt converts the initial XML into our custom XML format. This way, the previous steps and this step can be wrapped into a Web Service, so that multiple destination web pages can use the same service to get the customized XML. The set of source web pages could also be expanded to more than one. Other uses include pulling the scraped data from the Web Service for data comparison, research, or analysis.

    <?xml version="1.0" encoding="utf-8"?>
    <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:msxsl="urn:schemas-microsoft-com:xslt" exclude-result-prefixes="msxsl">
        <xsl:output method="xml" indent="yes"/>
    
        <xsl:template match ="/">
    
            <Ministers>
                <xsl:apply-templates select ="Ministers/Minister"/>
            </Ministers>
    
        </xsl:template>
    
        <xsl:template match ="Ministers/Minister">
    
            <Minister>
                <Image>
                    <xsl:attribute name="href">
                        <xsl:value-of select="div[1]/a/@href"/>
                    </xsl:attribute>
    
                    <xsl:attribute name="src">
                        <xsl:value-of select="div[1]/a/img/@src"/>
                    </xsl:attribute>
    
                    <xsl:attribute name="alt">
                        <xsl:value-of select="div[1]/a/img/@alt"/>
                    </xsl:attribute>
                </Image>
    
                <Name>
                    <xsl:attribute name="href">
                        <xsl:value-of select="div[2]/h3/a/@href"/>
                    </xsl:attribute>
    
                    <xsl:attribute name="title">
                        <xsl:value-of select="div[2]/h3/a/@title"/>
                    </xsl:attribute>
    
                    <xsl:value-of select="div[2]/h3/a"/>
                </Name>
    
                <Ministries>
                    <xsl:for-each select="div[2]/ul/li">
                        <Ministry>
                            <xsl:attribute name="href">
                                <xsl:value-of select="a/@href"/>
                            </xsl:attribute>
    
                            <xsl:value-of select="a"/>
                        </Ministry>
                    </xsl:for-each>
                </Ministries>
    
            </Minister>
    
        </xsl:template>
    
    </xsl:stylesheet>

    After this step, the output XML looks like this:

    <Ministers>
        <Minister>
            <Image href="biography.asp?MPPID=75" 
                src="http://www.premier.gov.on.ca/photos/team/SophiaAggelonitis.jpg" 
                alt="Sophia Aggelonitis's Biography" />
            <Name href="biography.asp?MPPID=75" 
                title="Sophia Aggelonitis's Biography">Sophia Aggelonitis</Name>
            <Ministries>
                <Ministry href="http://ontario.ca/OSS">Seniors' Secretariat</Ministry>
                <Ministry href="http://ontario.ca/revenue">Revenue</Ministry>
            </Ministries>
        </Minister>
    </Ministers>
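
The article does not show the C# code that applies the stylesheet. One way to do it in .NET is with XslCompiledTransform, sketched below under stated assumptions: the class name XsltSketch, the Transform helper, and the cut-down stylesheet and input are hypothetical stand-ins, and the sample application may load xmlOCM.xslt differently.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Xsl;

public static class XsltSketch
{
    // Applies the stylesheet text to the XML text and returns the result.
    public static string Transform(string xml, string xslt)
    {
        XslCompiledTransform transform = new XslCompiledTransform();
        using (XmlReader styleReader = XmlReader.Create(new StringReader(xslt)))
        {
            transform.Load(styleReader);
        }

        using (XmlReader input = XmlReader.Create(new StringReader(xml)))
        using (StringWriter output = new StringWriter())
        {
            transform.Transform(input, null, output);
            return output.ToString();
        }
    }

    public static void Main()
    {
        // Cut-down stand-in for xmlOCM.xslt: keeps only the minister's name.
        string xslt =
            "<xsl:stylesheet version=\"1.0\" " +
            "xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">" +
            "<xsl:output method=\"xml\" omit-xml-declaration=\"yes\"/>" +
            "<xsl:template match=\"/\">" +
            "<Ministers><xsl:apply-templates select=\"Ministers/Minister\"/></Ministers>" +
            "</xsl:template>" +
            "<xsl:template match=\"Minister\">" +
            "<Name href=\"{div/h3/a/@href}\"><xsl:value-of select=\"div/h3/a\"/></Name>" +
            "</xsl:template>" +
            "</xsl:stylesheet>";

        // Cut-down stand-in for the initial XML produced by the regex step.
        string xml =
            "<Ministers><Minister><div><h3>" +
            "<a href=\"biography.asp?MPPID=75\">Sophia Aggelonitis</a>" +
            "</h3></div></Minister></Ministers>";

        Console.WriteLine(Transform(xml, xslt));
    }
}
```

The same helper would serve for the final HTML step, since the second stylesheet only changes what the transform emits. Loading the stylesheet from a file path with `transform.Load(path)` is another option when the .xslt file is deployed with the site.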
  7. Finally, another XSLT file, htmlOCM.xslt, is used to transform the custom XML into the HTML of the output web page:
  8. <?xml version="1.0" ?>
    <xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:msxsl="urn:schemas-microsoft-com:xslt"
    xmlns:user="http://mydomain.com/myname">
    
        <xsl:template match="/">
    
            <div id="wrapper">
                <div id="header">
                    <br />
                    <h1>
                        Ontario Cabinet Ministers
                    </h1>
                    <div>
                        <a id="top" name="top"></a>
                    </div>
                </div>
    
                <div id="right_column">
                    <xsl:for-each select="Ministers/Minister">
                        <div class="row" id="minister{position()}" 
                                 style="display:none;">
                        <div class="grid_3 noborder center">
                            <xsl:for-each select="Image">
                                <a 
                                 href="http://www.premier.gov.on.ca/team/{@href}" 
                                 target="_blank">
                                    <img alt="{@alt}" src="{@src}" 
                                           width="144" height="171" />
                                </a>
                            </xsl:for-each>
                        </div>
                        <div class="grid_3">
                            <xsl:for-each select="Name">
                                <h2>
                                    <a title="{@title}" 
                                      href="http://www.premier.gov.on.ca/team/{@href}" 
                                      target="_blank">
                                        <xsl:value-of select="." />
                                    </a>
                                </h2>
                            </xsl:for-each>
                            <ul>
                                <xsl:for-each select="Ministries/Ministry">
                                    <li>
                                        <a href="{@href}" target="_blank">
                                            <xsl:value-of select="." />
                                        </a>
                                    </li>
                                </xsl:for-each>
                            </ul>
                        </div>
                    </div>
                    </xsl:for-each>
                </div>
                
                <div id="left_column">
                    <div class="leftnav">
                      <h2 class="header">
                        <a 
                          href="http://www.premier.gov.on.ca/team/default.asp?Lang=EN#" 
                          rel="homemenu" shape="rect"
                          target="_blank">Cabinet</a>
                      </h2>
                      <ul id="homemenu" class="menu">
                          <xsl:for-each select="Ministers/Minister">
                              <li id="li{position()}" class="li-inactive" 
                                      onclick="showDiv('{position()}');">
                                  <xsl:value-of select="Name[1]"/>
                              </li>
                          </xsl:for-each>
                      </ul>
                    </div>
                </div>
            </div>
    
        </xsl:template>
    
    </xsl:stylesheet>
  9. Lastly, to make the displayed content change with the selection in the left-side menu, JavaScript is used to show the corresponding minister's block:
  10. <script language="javascript" type="text/javascript">
    
        var selectedID = "1";
    
        function showDiv(id) 
        {
            document.getElementById("minister" + selectedID).style.display = "none";
            document.getElementById("li" + selectedID).className = 'li-inactive';
    
            selectedID = id;
            document.getElementById("minister" + selectedID).style.display = "";
            document.getElementById("li" + selectedID).className = 'li-active';
        }
    
        showDiv(selectedID);
    
    </script>

Files

  • OCMWebApp_src.zip contains all the source files for this sample application. It's written in C# with Visual Studio 2010, targeting .NET 4.

History

  • April 20, 2011 - First release.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL).

About the Author

Song_Gao
Software Developer (Senior) ThinData
Canada
Song is a senior .NET developer specializing in both Web and Windows applications. He is an MCTS (Microsoft Certified Technology Specialist) for Windows Applications, Web Applications and Distributed Applications.
 
He currently lives in Toronto and likes to travel.
 
You may contact Song through his email: song_gao@hotmail.com

Last Updated 21 Apr 2011
Article Copyright 2011 by Song_Gao
Everything else Copyright © CodeProject, 1999-2014