
Get page HTML from a URL using WebClient, strip HTML using Regex, and export a list of anchors to Excel or XML

6 Nov 2012 | CPOL | 2 min read
Get a page's HTML using the System.Net.WebClient class of .NET, strip HTML using Regex, and export a list to Excel or XML.

Introduction 

This article addresses a common developer requirement: finding the links on a page of another website, or getting the HTML of any web page (in an internal project or on an external website). It covers how to get a page's HTML using the System.Net.WebClient class of .NET, how to strip a particular HTML tag using Regex, and how to export a list to Excel or XML.

Background 

Over the past few days, in forum discussions, I found several developers asking about topics like these:

(I) How to get a page's HTML, anchor tags, or div content from a URL, or from web pages whose code they cannot access?

(II) How to export a list or collection to Excel or XML and download it?

(III) How to strip a particular tag, or strip HTML in general?

Based on these requirements, I have combined the solutions and discuss each topic here according to my findings.

Using the code 

I have created two projects: a class library, and a web project that uses this library.

First, create a class to store the values read from a URL and to export, like this:

C#
[Serializable]
public class AnchorValues
{
    public string Name { get; set; }
    public string Url { get; set; }
}  

The WebClient class provides common methods for sending data to or receiving data from any local, intranet, or Internet resource identified by a URI.

The WebClient class uses the WebRequest class to provide access to resources. WebClient instances can access data with any WebRequest descendant registered with the WebRequest.RegisterPrefix method. Learn more on MSDN: http://msdn.microsoft.com/en-us/library/system.net.webclient%28v=vs.80%29.aspx

Then create a method that gets the HTML from any URL using the System.Net.WebClient class, like this:

C#
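protected string GetString(string url)
{
    // The using blocks dispose the client, stream, and reader once the page is read.
    // Encoding.Default is the system's ANSI code page; use Encoding.UTF8 for UTF-8 pages.
    using (WebClient wc = new WebClient())
    using (Stream resStream = wc.OpenRead(url))
    using (StreamReader sr = new StreamReader(resStream, System.Text.Encoding.Default))
    {
        return sr.ReadToEnd();
    }
}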
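As a usage sketch, the same download can also be written with WebClient.DownloadString, which manages the stream and reader internally. The class name PageFetcher, the UTF-8 encoding, and the temporary file standing in for a web page are assumptions made here so the example runs without network access. In a real project you would pass an http or https URL.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

public static class PageFetcher // hypothetical helper name, not part of the article's download
{
    // Downloads the resource at the given URL and returns its content as a string.
    public static string GetString(string url)
    {
        using (WebClient wc = new WebClient())
        {
            // Assumption: the page is UTF-8; change this if the site uses another encoding.
            wc.Encoding = Encoding.UTF8;
            return wc.DownloadString(url);
        }
    }

    public static void Main()
    {
        // A local file stands in for a web page so the sketch runs offline.
        string path = Path.GetTempFileName();
        File.WriteAllText(path,
            "<html><body><a href=\"/home\">Home</a></body></html>",
            new UTF8Encoding(false));

        // WebClient also accepts file:// URIs, so the same method works here.
        string html = GetString(new Uri(path).AbsoluteUri);
        Console.WriteLine(html);

        File.Delete(path);
    }
}
```

WebClient throws a WebException when the request fails, so callers may want to catch it around GetString.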

Get the anchor tags from the HTML and store them in a collection using Regex.

The System.Text.RegularExpressions namespace contains the Regex class, which is used to build and evaluate regular expressions. Regex offers both static and instance methods for matching a pattern against a string: IsMatch() tests whether a string matches a regular expression, and Matches() returns the collection of all matches.

Learn more about Regex on MSDN: http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regex.aspx. The anchors can be collected like this:

C#
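List<AnchorValues> _list = new List<AnchorValues>();

// The href value must be quoted; \1 matches the same quote character that opened it,
// so a '?' or '|' inside the URL does not cut the captured value short.
string anchorPattern = @"<a\b[^>]*?\bhref\s*=\s*([""'])(?<url>.*?)\1[^>]*>(?<name>.*?)</a>";
// Singleline lets the anchor text span line breaks.
Regex regex = new Regex(anchorPattern, RegexOptions.IgnoreCase | RegexOptions.Singleline);
MatchCollection matches = regex.Matches(html); // html: the page source returned by GetString

foreach (Match mt in matches)
{
    AnchorValues obj = new AnchorValues();
    obj.Name = mt.Groups["name"].Value; // inner HTML of the anchor
    obj.Url = mt.Groups["url"].Value;   // value of the href attribute
    _list.Add(obj);
}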
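The extraction step can be tried on its own with a small console sketch. It also strips nested tags from the anchor text, since a captured name may contain markup such as <b>. The class name AnchorExtractor, the StripHtml helper, and the sample HTML are assumptions for illustration. The pattern used here requires a quoted href, which is stricter than real-world HTML allows, so treat it as a starting point rather than a full HTML parser.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public class AnchorValues
{
    public string Name { get; set; }
    public string Url { get; set; }
}

public static class AnchorExtractor // hypothetical name, for illustration only
{
    // Assumes a quoted href; \1 closes the value with the same quote that opened it.
    private static readonly Regex AnchorRegex = new Regex(
        @"<a\b[^>]*?\bhref\s*=\s*([""'])(?<url>.*?)\1[^>]*>(?<name>.*?)</a>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Matches any single opening or closing tag, e.g. <b> or </b>.
    private static readonly Regex TagRegex = new Regex(@"<[^>]+>");

    // Removes all tags from a fragment and trims surrounding whitespace.
    public static string StripHtml(string html)
    {
        return TagRegex.Replace(html, string.Empty).Trim();
    }

    // Collects one AnchorValues per <a href="..."> element found in the HTML.
    public static List<AnchorValues> Extract(string html)
    {
        var list = new List<AnchorValues>();
        foreach (Match m in AnchorRegex.Matches(html))
        {
            list.Add(new AnchorValues
            {
                Name = StripHtml(m.Groups["name"].Value),
                Url = m.Groups["url"].Value
            });
        }
        return list;
    }

    public static void Main()
    {
        string html = "<p><a href=\"/search?q=1\">Search <b>now</b></a> " +
                      "<A class='x' HREF='http://example.com/'>Example</A></p>";

        foreach (AnchorValues a in Extract(html))
        {
            Console.WriteLine(a.Name + " -> " + a.Url);
        }
        // prints:
        // Search now -> /search?q=1
        // Example -> http://example.com/
    }
}
```

For pages with malformed or script-generated markup, a dedicated HTML parser is likely to be more reliable than a regular expression.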

Finally, export the list as Excel or XML, depending on the user's choice; alternatively, the library can return an IDictionary object without exporting to any file format.

C#
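// Option 1: export the list as an XML file download.
HttpContext.Current.Response.Clear();
HttpContext.Current.Response.Buffer = true;
HttpContext.Current.Response.ContentType = "application/xml";
HttpContext.Current.Response.Charset = "";
HttpContext.Current.Response.AppendHeader("Content-Disposition", "attachment; filename=AnchorFile.xml");

HttpContext.Current.Response.Write(SerializeToXML(source));
HttpContext.Current.Response.Flush();
// Response.End() is omitted because it throws ThreadAbortException.
HttpContext.Current.Response.SuppressContent = true; // keep later page output out of the download
HttpContext.Current.ApplicationInstance.CompleteRequest();

// Option 2: export the list as an Excel file (an HTML table served with the Excel content type).
HttpContext.Current.Response.Clear();
HttpContext.Current.Response.Buffer = true;
HttpContext.Current.Response.ContentType = "application/vnd.ms-excel";
HttpContext.Current.Response.Charset = "";
HttpContext.Current.Response.AppendHeader("Content-Disposition", "attachment; filename=AnchorFile.xls");

// Render the bound grid into a string, then write that string to the response.
StringWriter oStringWriter = new StringWriter();
HtmlTextWriter oHtmlTextWriter = new HtmlTextWriter(oStringWriter);
GridView1.DataSource = source;
GridView1.DataBind();
GridView1.RenderControl(oHtmlTextWriter);
HttpContext.Current.Response.Write(oStringWriter.ToString());
HttpContext.Current.Response.Flush();
HttpContext.Current.Response.SuppressContent = true; // keep later page output out of the download
HttpContext.Current.ApplicationInstance.CompleteRequest();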
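The SerializeToXML helper called in the XML export is not listed in this article. One possible way to write it, sketched here with XmlSerializer, is shown below. The class name AnchorXml, the List<AnchorValues> parameter type, and the sample data are assumptions; the version in the downloadable source may differ.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml.Serialization;

public class AnchorValues
{
    public string Name { get; set; }
    public string Url { get; set; }
}

public static class AnchorXml // hypothetical name, for illustration only
{
    // Serializes the list to an XML string; one possible shape for SerializeToXML.
    public static string SerializeToXML(List<AnchorValues> source)
    {
        XmlSerializer serializer = new XmlSerializer(typeof(List<AnchorValues>));
        using (StringWriter writer = new StringWriter())
        {
            serializer.Serialize(writer, source);
            return writer.ToString();
        }
    }

    public static void Main()
    {
        var source = new List<AnchorValues>
        {
            new AnchorValues { Name = "Example", Url = "http://example.com/?a=1&b=2" }
        };

        // The serializer escapes the '&' in the URL as &amp; in the output.
        Console.WriteLine(SerializeToXML(source));
    }
}
```

Because StringWriter works with UTF-16 strings, the XML declaration produced this way states utf-16. If the file must declare another encoding, serializing to a stream with an explicit encoding is one alternative.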

Points of Interest

One additional point is worth noting: when we export a list using the HttpContext.Current.Response object from a class, Response.End throws the exception "Thread was being aborted." To avoid it, end the request with this call instead:

C#
HttpContext.Current.ApplicationInstance.CompleteRequest(); 

History

This is a quick solution to a few requirements, assembled into one article. I will update it with more description soon.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Software Developer (Senior) ERICSSON INDIA GLOBAL SERVICES PVT. LTD
India
MCPD 3.5 in 2011
Working as a Senior .NET Developer/Integration Engineer for the last six years
