Screen Scraping with C# for ASP.NET

1 Mar 2002 | CDDL | 2 min read
Using C# to scrape content from a third-party site and present it on an ASP.NET web page

Sample Image - weather.jpg

Introduction

This project is not earth-shattering or revolutionary; it is simply a way for me to come to terms with ASP.NET and C# development and, hopefully, to share some knowledge and ideas with others.

This project began with the need to create an intranet portal that contained, among other things, the local weather forecast. The forecast information was to look just like a local TV station’s web site. Since I could not use their site directly, and paying for a service to provide the information was not an option, I decided that screen scraping the local TV site would be a good solution. I also decided this would be a good introduction to the .NET world, so I used ASP.NET with C# as the coding language.

WARNING

It should go without saying that screen scraping is not the best solution in many cases. You are completely at the mercy of the third-party site: if the layout changes, you must rework your solution. It may also raise legal questions about your right to use someone else’s work.

Details

The first step in the design was to call up the providing site, http://www.pittsburgh.com/partners/wpxi/weather/ in this case, and look at the HTML to find the information needed. In my case, I was able to search for the heading:

HTML
<B>Current Conditions for Pittsburgh</B>

The weather information was found in two tables so it was just a matter of searching the HTML text and extracting the tables. I could then pass this as the innerHTML content for a table on my webpage.

HTML
<TABLE id="Table1" width="100%" border="0">
<TR>
    <TD align="middle" colSpan="2"><STRONG>Local Weather Forecast</STRONG></TD>
</TR>
<TR>
    <TD><%=GetWeather()%></TD>
    <TD><%=GetForecast()%></TD>
</TR>
<TR>
    <TD align="middle" colSpan="2">information provided by WPXI</TD>
</TR>
</TABLE>

Acquire the HTML

Using the .NET library, it was easy to acquire the HTML from the site. As can be seen, we just need to create a WebRequest, get its WebResponse, and feed the response stream into an instance of StreamReader. From there I read through it to remove the empty lines and assign the result to a string. I'm using the StringBuilder.Append method instead of appending to the string directly, based on the recommendation of Charles Petzold in Programming Microsoft Windows with C#, where he demonstrates that using StringBuilder is 1000x faster than appending to a string.

C#
// Open the requested URL
WebRequest req = WebRequest.Create(strURL);

// Get the stream from the returned web response
StreamReader stream = new StreamReader(req.GetResponse().GetResponseStream());

// Accumulate the page text in a StringBuilder
System.Text.StringBuilder sb = new System.Text.StringBuilder();
string strLine;
// Read the stream a line at a time and place each one
// into the stringbuilder
while( (strLine = stream.ReadLine()) != null )
{
    // Ignore blank lines
    if(strLine.Length > 0 )
        sb.Append(strLine);
}
// Finished with the stream so close it now
stream.Close();

// Cache the streamed site now so it can be used
// without reconnecting later
m_strSite = sb.ToString();
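
The same steps can also be wrapped in a small helper. What follows is only a minimal sketch, not the article's actual class: the names `PageFetcher`, `ReadNonBlankLines` and `FetchPage` are hypothetical. It assumes you want the response and reader disposed even when reading fails, so it uses `using` blocks in place of the explicit `Close()` call. The line-filtering loop is split out so it can be exercised against any `TextReader` without a network connection.

```csharp
using System.IO;
using System.Net;
using System.Text;

// Hypothetical helper class; the names are illustrative only.
public static class PageFetcher
{
    // Copies every non-empty line from the reader into one string.
    // As in the loop above, line breaks are not preserved.
    public static string ReadNonBlankLines(TextReader reader)
    {
        StringBuilder sb = new StringBuilder();
        string strLine;
        while ((strLine = reader.ReadLine()) != null)
        {
            // Ignore blank lines
            if (strLine.Length > 0)
                sb.Append(strLine);
        }
        return sb.ToString();
    }

    // Downloads the page at strURL and returns its text without blank lines.
    public static string FetchPage(string strURL)
    {
        WebRequest req = WebRequest.Create(strURL);
        // The using blocks dispose the response and reader even on failure
        using (WebResponse resp = req.GetResponse())
        using (StreamReader stream = new StreamReader(resp.GetResponseStream()))
        {
            return ReadNonBlankLines(stream);
        }
    }
}
```

With a helper like this, the caching step would become `m_strSite = PageFetcher.FetchPage(strURL);`.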

Extract the tables

After the text has been acquired, it is simply a matter of extracting and returning the substring that holds the tables. To fix the relative paths of the images, I run the substring through another method that inserts the absolute path before returning it.

C#
private string FindWeatherTable()
{
    int nIndexStart = 0;
    int nIndexEnd = 0;
    int nIndex = 0;

    try
    {
        // This phrase tells us where to start looking for the information
        // If it is found start looking for the first beginning table tag
        if( (nIndex = Find("Current Conditions for Pittsburgh", 0)) > 0 )
        {
            nIndexStart = Find("<TABLE", nIndex);
            if(nIndexStart > 0 )
            {
                // Need to find the second end table tag
                nIndex = Find("</TABLE>", nIndex);
                if(nIndex > 0 )
                {
                    // Add 1 to the index so we don't find the same 
                    // tag as above
                    nIndexEnd = Find("</TABLE>", nIndex+1);
                    if(nIndexEnd > 0 )
                        nIndexEnd += 8; // Include the characters in the tag
                }
            }
        }
        // Extract and return the substring containing the table we want
        // after correcting the img src elements
        return CorrectImgPath(m_strSite.Substring(nIndexStart, 
                              nIndexEnd - nIndexStart));
    }
    catch(Exception e)
    {
        return e.Message;
    }
}

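private string CorrectImgPath(string s)
{
    int nIndex = 0;
    try
    {
        // Absolute path to insert
        string strInsert = "http://www.pittsburgh.com";
        // Find any and all images and insert the absolute path.
        // Start at index 0 so short or empty input cannot throw and
        // an image path near the start of the text is not skipped.
        while( (nIndex = s.IndexOf("/images/", nIndex)) >= 0 )
        {
            s = s.Insert(nIndex, strInsert);
            // Move past the inserted prefix and the start of this match
            // so the same path is not found again
            nIndex += strInsert.Length + 1;
        }
        return s;
    }
    catch(Exception e)
    {
        return e.Message;
    }
}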
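
The search-and-extract idea can be tried on its own against a sample string. This is only a sketch under stated assumptions: the `TableScraper` class and the `ExtractTables` method are hypothetical names, not part of the article's code. It uses `String.IndexOf` directly in place of the `Find` helper, matches the upper-case tags case-sensitively, and returns an empty string when the marker or a tag is missing.

```csharp
using System;

// Hypothetical helper class; the names are illustrative only.
public static class TableScraper
{
    // Returns the text from the first <TABLE after the marker through the
    // tableCount-th </TABLE> that follows it, or "" if anything is missing.
    public static string ExtractTables(string html, string marker, int tableCount)
    {
        const string open = "<TABLE";
        const string close = "</TABLE>";

        // Locate the heading that tells us where to start looking
        int markerIndex = html.IndexOf(marker, StringComparison.Ordinal);
        if (markerIndex < 0) return "";

        // Locate the first opening table tag after the heading
        int start = html.IndexOf(open, markerIndex, StringComparison.Ordinal);
        if (start < 0) return "";

        // Walk forward over the requested number of closing table tags
        int end = start;
        for (int i = 0; i < tableCount; i++)
        {
            int found = html.IndexOf(close, end, StringComparison.Ordinal);
            if (found < 0) return "";
            end = found + close.Length; // Include the characters in the tag
        }
        return html.Substring(start, end - start);
    }
}
```

Calling `TableScraper.ExtractTables(m_strSite, "Current Conditions for Pittsburgh", 2)` would cover the two-table case described above. An index of 0 counts as a valid match here because `IndexOf` reports "not found" as -1.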

Conclusion

The complete site used ADO.NET to connect to a SQL Server database and provide viewers with schedule and appointment information as well as corporate information. Viewers could also add events to their calendars. For simplicity, I chose not to include these features in this sample. I just wanted to share a beginner's exploration of C# and ASP.NET to give others some ideas.

License

This article, along with any associated source code and files, is licensed under The Common Development and Distribution License (CDDL).


