Posted 5 Nov 2012

Get page HTML from URL using WebClient, Strip HTML using Regex, export a list of Anchors into Excel or XML

SoumenBanerjee, 6 Nov 2012
Get page HTML using the System.Net.WebClient class of .NET, strip HTML using Regex, and export a list to Excel or XML.


In this article I try to address a very common developer requirement: finding the links on a page of another website, or getting the HTML of any web page (from an internal project or an external website). The article covers how to get page HTML using the System.Net.WebClient class of .NET, how to strip a particular HTML tag using Regex, and how to export a list to Excel or XML.


Over the past few days I took part in forum discussions where several developers asked about topics such as:

(I) How to get the page HTML, anchor tags, or div content from a URL, or from web pages whose code they cannot access?

(II) How to export a list or collection to Excel or XML and download it?

(III) How to strip a particular tag, or strip HTML altogether?

Based on these requirements, I have combined the solutions here and discuss each topic according to my findings.

Using the code 

I have created two projects: a class library, and a web project that uses this library.

First, create a class to store the values read from a URL and later exported, like this:

// Holds the text and the target of one anchor tag.
public class AnchorValues
{
    public string Name { get; set; }
    public string Url { get; set; }
}

The WebClient class provides common methods for sending data to or receiving data from any local, intranet, or Internet resource identified by a URI.

The WebClient class uses the WebRequest class to provide access to resources. WebClient instances can access data with any WebRequest. Learn more about it on MSDN.

Then create another class to get the HTML from any URL using the System.Net.WebClient class, like this:

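// Requires: using System.IO; using System.Net;
protected string GetString(string url)
{
    // Open the resource and read the whole response as text.
    // The using blocks dispose the client, stream, and reader even if reading fails.
    using (WebClient wc = new WebClient())
    using (Stream resStream = wc.OpenRead(url))
    using (StreamReader sr = new StreamReader(resStream, System.Text.Encoding.Default))
    {
        string ContentHtml = sr.ReadToEnd();
        return ContentHtml;
    }
}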
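
As a variation on the method above, here is a minimal sketch that calls WebClient.DownloadString with an explicit encoding instead of reading the stream by hand. The class name PageFetcher and the choice of UTF-8 are assumptions made for illustration; they are not part of the article's library.

```csharp
using System;
using System.Net;
using System.Text;

// Hypothetical helper, shown only to illustrate DownloadString.
public static class PageFetcher
{
    // Downloads the resource at the given URL and returns it as a string.
    // UTF-8 is an assumption; a page served in another encoding needs a different value.
    public static string GetString(string url)
    {
        using (var wc = new WebClient())
        {
            wc.Encoding = Encoding.UTF8;
            return wc.DownloadString(url);
        }
    }
}
```

A call such as PageFetcher.GetString("https://example.com/") would then return the page source as one string, ready to be passed to the Regex step.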

Get the anchor tags from the HTML and store them in a collection using Regex.

The System.Text.RegularExpressions namespace contains the Regex class, which is used to build and evaluate regular expressions. The Regex class offers both static and instance methods for matching regular expressions against strings: IsMatch() tests whether a string matches a regular expression, and Matches() returns the collection of all matches.
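
To make these methods concrete, here is a small sketch that uses IsMatch(), Matches(), and Replace() with one deliberately simple tag pattern; Replace() is one way to do the HTML stripping mentioned in the introduction. The class name RegexBasics and the pattern itself are assumptions for illustration. The pattern does not handle comments, scripts, or a ">" inside an attribute value.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, not part of the article's library.
public static class RegexBasics
{
    // "<", then one or more characters that are not ">", then ">".
    private const string TagPattern = "<[^>]+>";

    // True when the input contains at least one tag-like substring.
    public static bool ContainsTag(string input)
    {
        return Regex.IsMatch(input, TagPattern);
    }

    // Number of tag-like substrings found by Matches().
    public static int CountTags(string input)
    {
        return Regex.Matches(input, TagPattern).Count;
    }

    // Removes every tag-like substring and keeps only the text between them.
    public static string StripTags(string input)
    {
        return Regex.Replace(input, TagPattern, string.Empty);
    }
}
```

For the input "<b>Hi</b> there", StripTags keeps only the text "Hi there".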

Learn more about Regex on MSDN. The anchors can be collected like this:

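List<AnchorValues> _list = new List<AnchorValues>();

// Named groups: "url" captures the href value, "name" captures the anchor's inner HTML.
// The closing-quote class is [""'] so that a '?' or '|' inside a URL does not end the capture early.
string initialURL = @"<a.*?href=([""'])?(?<url>.*?)[""'].*?>(?<name>.*?)</a>";
Regex regex = new Regex(initialURL, RegexOptions.IgnorePatternWhitespace | RegexOptions.IgnoreCase);
MatchCollection matches = regex.Matches(html);

foreach (Match mt in matches)
{
    AnchorValues obj = new AnchorValues();
    obj.Name = mt.Result("${name}");
    obj.Url = mt.Result("${url}");
    _list.Add(obj); // keep each anchor in the collection
}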
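
The extraction step can also be packaged as a self-contained helper. The sketch below is a variation, not the article's exact code: the class name AnchorExtractor is hypothetical, and the pattern is an assumed stricter form that captures the opening quote and requires the same quote (\1) to close the href value. Like any regex over HTML, it is only an approximation and will miss unquoted href values.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public class AnchorValues
{
    public string Name { get; set; }
    public string Url { get; set; }
}

// Hypothetical helper class, shown for illustration only.
public static class AnchorExtractor
{
    // Group 1 is the opening quote; "url" and "name" are named groups.
    // Singleline lets "." span line breaks inside an anchor.
    private static readonly Regex AnchorRegex = new Regex(
        @"<a\b[^>]*?href\s*=\s*([""'])(?<url>.*?)\1[^>]*>(?<name>.*?)</a>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns one AnchorValues item per anchor tag found in the HTML.
    public static List<AnchorValues> Extract(string html)
    {
        var list = new List<AnchorValues>();
        foreach (Match mt in AnchorRegex.Matches(html))
        {
            list.Add(new AnchorValues
            {
                Name = mt.Groups["name"].Value,
                Url = mt.Groups["url"].Value
            });
        }
        return list;
    }
}
```

Reading the groups through mt.Groups["url"].Value is equivalent in effect to the mt.Result("${url}") call used above; either form works with named groups.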

Finally, export the list as Excel or XML, depending on the user's choice; alternatively, the list can be returned as an IDictionary object without exporting to any file format.

// XML export: response headers for a file download
HttpContext.Current.Response.Buffer = true;
HttpContext.Current.Response.ContentType = "application/XML";
HttpContext.Current.Response.Charset = "";
HttpContext.Current.Response.AppendHeader("Content-Disposition", "attachment; filename=AnchorFile.xml");

// Excel export: response headers for a file download
HttpContext.Current.Response.Buffer = true;
HttpContext.Current.Response.ContentType = "application/";
HttpContext.Current.Response.Charset = "";
HttpContext.Current.Response.AppendHeader("Content-Disposition", "attachment; filename=AnchorFile.xls");

// Bind the list to the grid used for the Excel output
GridView1.DataSource = source;
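
The headers above only describe the download; the body still has to be produced from the list. The following is a minimal sketch of one way to build that body as a string, assuming element names (Anchors, Anchor, Name, Url) and a helper class AnchorExport that do not come from the article. The XML is built with XElement, and the Excel-oriented output is a plain HTML table, a form that Excel can typically open when it is served with an .xls file name.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Text;
using System.Xml.Linq;

public class AnchorValues
{
    public string Name { get; set; }
    public string Url { get; set; }
}

// Hypothetical helper class, shown for illustration only.
public static class AnchorExport
{
    // Builds <Anchors><Anchor><Name/><Url/></Anchor>...</Anchors>.
    // XElement escapes characters such as '&' on its own.
    public static string ToXml(IEnumerable<AnchorValues> anchors)
    {
        var root = new XElement("Anchors",
            anchors.Select(a => new XElement("Anchor",
                new XElement("Name", a.Name),
                new XElement("Url", a.Url))));
        return root.ToString();
    }

    // Builds an HTML table with one row per anchor; values are HTML-encoded.
    public static string ToHtmlTable(IEnumerable<AnchorValues> anchors)
    {
        var sb = new StringBuilder("<table><tr><th>Name</th><th>Url</th></tr>");
        foreach (var a in anchors)
        {
            sb.Append("<tr><td>").Append(WebUtility.HtmlEncode(a.Name))
              .Append("</td><td>").Append(WebUtility.HtmlEncode(a.Url))
              .Append("</td></tr>");
        }
        return sb.Append("</table>").ToString();
    }
}
```

In the web project, the returned string could then be passed to HttpContext.Current.Response.Write after the matching headers have been set.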

Points of Interest

One additional point is worth noting. When we export a list using the HttpContext.Current.Response object from a class, we get the exception "Thread was being aborted." because of Response.End. A commonly documented way to avoid it is to call HttpContext.Current.ApplicationInstance.CompleteRequest() instead of Response.End.



This is a quick solution to a few requirements, assembled into one article; I will update it with more description soon.


This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


About the Author

India
MCPD 3.5 in 2011
Working as a Senior .NET Developer/Integration Engineer for the last six years

Last Updated 6 Nov 2012
Article Copyright 2012 by SoumenBanerjee