
Get page HTML from URL using WebClient, Strip HTML using Regex, export a list of Anchors into Excel or XML.

SoumenBanerjee, 6 Nov 2012
Get a page's HTML using the System.Net.WebClient class of .NET, strip HTML using Regex, and export a list into Excel or XML.

Introduction 

In this article I have tried to solve a very common developer requirement: finding the links on another website's page, or getting the HTML of any web page (internal project or external website). It covers how to get a page's HTML using the System.Net.WebClient class of .NET, how to pull out a particular HTML tag using Regex, and how to export a list into Excel or XML.

Background 

Over the past few days I have taken part in several forum discussions where developers asked about topics such as:

(I) How to get a page's HTML, anchor tags, or div content from a URL, including pages whose source code they cannot access?

(II) How to export a list or collection to Excel or XML and download it?

(III) How to strip a particular tag, or strip HTML altogether?

Based on these requirements, I have combined the solutions and discuss each topic here according to my findings.

Using the code 

I have created two projects: a class library, and a web project that uses this library.

First, create a class to store the values read from a URL and to export, like this:

[Serializable]
public class AnchorValues
{
    public string Name { get; set; }
    public string Url { get; set; }
}  

The WebClient class provides common methods for sending data to or receiving data from any local, intranet, or Internet resource identified by a URI.

The WebClient class uses the WebRequest class to provide access to resources, so WebClient instances can access data through any registered WebRequest descendant. Learn more on MSDN: http://msdn.microsoft.com/en-us/library/system.net.webclient%28v=vs.80%29.aspx

Then add a method that gets the HTML from any URL using the System.Net.WebClient class, like this:

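// Requires: using System.IO; using System.Net;
protected string GetString(string url)
{
    // Dispose the client, stream, and reader once the page has been read.
    using (WebClient wc = new WebClient())
    using (Stream resStream = wc.OpenRead(url))
    // Encoding.Default is the system code page; pass another encoding if the page needs it.
    using (StreamReader sr = new StreamReader(resStream, System.Text.Encoding.Default))
    {
        return sr.ReadToEnd();
    }
}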
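
When no stream handling is needed, WebClient.DownloadString returns the response body as a string in a single call. The following is a minimal sketch under the assumption that the target page is UTF-8 encoded; the names PageFetcher and GetPageHtml are hypothetical and exist only for this illustration.

```csharp
using System;
using System.Net;
using System.Text;

public static class PageFetcher
{
    // Hypothetical helper: downloads the resource at 'url' and returns it as text.
    // Assumes UTF-8 content; change Encoding if the target page uses another charset.
    public static string GetPageHtml(string url)
    {
        using (WebClient wc = new WebClient())
        {
            wc.Encoding = Encoding.UTF8;
            return wc.DownloadString(url);
        }
    }
}
```

Because WebClient resolves any URI scheme that WebRequest supports, the same helper can also read a local file through a file:// URI, which is handy for trying it out without network access.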

Get the anchor tags from the HTML and store them in a collection using Regex.

The System.Text.RegularExpressions namespace contains the Regex class, which is used to build and evaluate regular expressions. Regex offers both static and instance methods for matching a pattern against a string: IsMatch() tests whether a string matches the pattern, and Matches() returns the collection of all matches.

Learn more about Regex on MSDN (http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regex.aspx), and use it like this:

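List<AnchorValues> _list = new List<AnchorValues>();

// ([""']) captures the opening quote and \1 requires the same closing quote, so a '?' in a
// query string does not end the URL early. Unquoted href values are not matched.
string anchorPattern = @"<a\b[^>]*?\bhref\s*=\s*([""'])(?<url>.*?)\1[^>]*>(?<name>.*?)</a>";
// Singleline lets '.' match newlines, so anchors that span several lines are found.
Regex regex = new Regex(anchorPattern, RegexOptions.IgnoreCase | RegexOptions.Singleline);
MatchCollection matches = regex.Matches(html);

foreach (Match mt in matches)
{
    AnchorValues obj = new AnchorValues();
    obj.Name = mt.Groups["name"].Value;
    obj.Url = mt.Groups["url"].Value;
    _list.Add(obj);
}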
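
The text between <a> and </a> often contains nested markup such as <b> or <img>, which is where stripping HTML with Regex comes in. The sketch below is one possible way to combine both steps in a standalone class; AnchorParser, Extract, and StripTags are hypothetical names, the quoted-href pattern is an assumption about the input, and AnchorValues is repeated only so the sample compiles on its own. Regular expressions handle simple, well-formed markup like this, but they are not a full HTML parser.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public class AnchorValues
{
    public string Name { get; set; }
    public string Url { get; set; }
}

public static class AnchorParser
{
    // Matches <a ... href="..."> or <a ... href='...'>; \1 requires the same closing quote.
    private static readonly Regex AnchorRegex = new Regex(
        @"<a\b[^>]*?\bhref\s*=\s*([""'])(?<url>.*?)\1[^>]*>(?<name>.*?)</a>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Matches anything that looks like an HTML tag, e.g. <b>, </span>, <img src="x">.
    private static readonly Regex TagRegex = new Regex(@"<[^>]+>");

    // Removes all tags from the given fragment and trims surrounding whitespace.
    public static string StripTags(string html)
    {
        return TagRegex.Replace(html, string.Empty).Trim();
    }

    // Returns one AnchorValues entry per quoted-href anchor found in the HTML.
    public static List<AnchorValues> Extract(string html)
    {
        List<AnchorValues> list = new List<AnchorValues>();
        foreach (Match m in AnchorRegex.Matches(html))
        {
            list.Add(new AnchorValues
            {
                Name = StripTags(m.Groups["name"].Value),
                Url = m.Groups["url"].Value
            });
        }
        return list;
    }
}
```

For example, calling AnchorParser.Extract on the fragment <a href="/about"><b>About</b> us</a> yields a single entry whose Url is "/about" and whose Name is "About us".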

Finally, export the list as Excel or XML depending on the user's choice; the list can also be returned as an IDictionary object without exporting to any file format.

// Export as XML
HttpContext.Current.Response.Clear();
HttpContext.Current.Response.Buffer = true;
HttpContext.Current.Response.ContentType = "application/xml";
HttpContext.Current.Response.Charset = "";
HttpContext.Current.Response.AppendHeader("Content-Disposition", "attachment; filename=AnchorFile.xml");

HttpContext.Current.Response.Write(SerializeToXML(source));
HttpContext.Current.Response.Flush();
// CompleteRequest() ends the request without the ThreadAbortException that Response.End() throws.
HttpContext.Current.ApplicationInstance.CompleteRequest();

// Export as Excel: the rendered GridView markup is sent with an Excel content type
HttpContext.Current.Response.Clear();
HttpContext.Current.Response.Buffer = true;
HttpContext.Current.Response.ContentType = "application/vnd.ms-excel";
HttpContext.Current.Response.Charset = "";
HttpContext.Current.Response.AppendHeader("Content-Disposition", "attachment; filename=AnchorFile.xls");

// The writers capture the HTML that the grid renders.
StringWriter oStringWriter = new StringWriter();
HtmlTextWriter oHtmlTextWriter = new HtmlTextWriter(oStringWriter);
GridView1.DataSource = source;
GridView1.DataBind();
GridView1.RenderControl(oHtmlTextWriter);
HttpContext.Current.Response.Write(oStringWriter.ToString());
HttpContext.Current.Response.Flush();
HttpContext.Current.ApplicationInstance.CompleteRequest();
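
The SerializeToXML helper called in the XML export is not listed in this article. One plausible way to write it, shown here only as a sketch, is with XmlSerializer from System.Xml.Serialization; the AnchorExport class name and the List<AnchorValues> parameter type are assumptions, and AnchorValues is repeated so the sample compiles on its own.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml.Serialization;

[Serializable]
public class AnchorValues
{
    public string Name { get; set; }
    public string Url { get; set; }
}

public static class AnchorExport
{
    // Hypothetical implementation of SerializeToXML: turns the list into an XML string.
    public static string SerializeToXML(List<AnchorValues> source)
    {
        XmlSerializer serializer = new XmlSerializer(typeof(List<AnchorValues>));
        using (StringWriter writer = new StringWriter())
        {
            // Produces an <ArrayOfAnchorValues> root with one <AnchorValues> element per item.
            serializer.Serialize(writer, source);
            return writer.ToString();
        }
    }
}
```

Note that a plain StringWriter makes the XML declaration state encoding="utf-16", so it may be worth checking that this agrees with the encoding the response is actually sent in.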

Points of Interest

One additional point is worth noting: when we export a list using the HttpContext.Current.Response object from a class, Response.End() throws a ThreadAbortException with the message "Thread was being aborted." To avoid this, we can use the following instead:

HttpContext.Current.ApplicationInstance.CompleteRequest(); 

History

This is a quick write-up that brings the solutions to a few requirements together in one article; I will update it with more description soon.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

SoumenBanerjee
Software Developer (Senior) ERICSSON INDIA GLOBAL SERVICES PVT. LTD
India
MCPD 3.5 in 2011
Working as a Senior Dot Net Developer/Integration Engineer for the last six years

Last Updated 6 Nov 2012
Article Copyright 2012 by SoumenBanerjee