Posted 15 Dec 2012

Simple Web Scraper

Rumman92, 14 Mar 2013
A simple web scraper that loads only the readable contents of a website.


A web scraper is a tool that loads the contents of a website. Since it downloads the raw markup of the whole page, I prefer to format the result so that it is readable.

Using the code

You can use this code in a console application or in a Windows/web application. I used a console application since this is an introductory example.

In the console application, add the following namespaces:  

using System.Net; // to handle internet operations
using System.IO; // to use streams
using System.Text.RegularExpressions; // To format the loaded data 

Loading the content

Create the WebRequest and WebResponse objects.

WebRequest request = WebRequest.Create(url); // url = address of the page to load
WebResponse response = request.GetResponse();

Create a StreamReader over the response stream, read the whole response into a string variable, and then close the reader.

StreamReader sr = new StreamReader(response.GetResponseStream());
string result = sr.ReadToEnd();
sr.Close(); // closing the reader also closes the underlying response stream

To view the unformatted result, simply write it on the console.
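
The loading steps above can be gathered into one small program. This is a minimal sketch, not the article's original listing: the class name PageLoader and the helpers ReadAll and Load are hypothetical names chosen for illustration, and the sketch assumes the page is UTF-8 encoded. The using blocks make sure the response and the reader are closed even if reading fails.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

static class PageLoader
{
    // Hypothetical helper: reads an entire stream as text.
    // Assumes UTF-8, which is not guaranteed for every website.
    public static string ReadAll(Stream stream)
    {
        using (StreamReader reader = new StreamReader(stream, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }

    // Hypothetical helper: downloads the raw, unformatted content at the given address.
    public static string Load(string url)
    {
        WebRequest request = WebRequest.Create(url);
        using (WebResponse response = request.GetResponse())
        {
            return ReadAll(response.GetResponseStream());
        }
    }

    static void Main(string[] args)
    {
        if (args.Length == 0)
        {
            Console.WriteLine("usage: PageLoader <url>");
            return;
        }
        // Print the unformatted result on the console.
        Console.WriteLine(Load(args[0]));
    }
}
```

Pass the address of the page as the first command-line argument. GetResponse throws a WebException when the request fails, so a real application would probably wrap the call to Load in a try/catch block.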


Formatting the result 

To format the result, we will use the static Replace() method of the Regex class.

result = Regex.Replace(result, "<script.*?</script>", "",
  RegexOptions.Singleline | RegexOptions.IgnoreCase); // Remove scripts
result = Regex.Replace(result, "<style.*?</style>", "",
  RegexOptions.Singleline | RegexOptions.IgnoreCase); // Remove embedded stylesheets
result = Regex.Replace(result, "<!--.*?-->", "",
  RegexOptions.Singleline); // Remove HTML comments (before tags, since comments may contain markup)
result = Regex.Replace(result, "<!.*?>", "",
  RegexOptions.Singleline); // Remove the doctype declaration
result = Regex.Replace(result, "</?[a-z][a-z0-9]*[^<>]*>", "",
  RegexOptions.IgnoreCase); // Remove HTML tags, including upper-case ones
result = Regex.Replace(result, @"\s+", " "); // Collapse runs of whitespace into a single space

Now print the results on screen.
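
The formatting steps can also be wrapped in a single method and tried on a small sample. This is a sketch under a few assumptions: HtmlText and Strip are hypothetical names, the sample markup is made up for the demonstration, tags are replaced with a space instead of an empty string so that neighbouring words do not run together, and the call to WebUtility.HtmlDecode (which turns entities such as &amp;amp; back into characters) is an extra step that the article itself does not perform.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

static class HtmlText
{
    // Hypothetical helper: reduces an HTML document to its readable text.
    public static string Strip(string html)
    {
        const RegexOptions opts = RegexOptions.Singleline | RegexOptions.IgnoreCase;

        string text = Regex.Replace(html, "<script.*?</script>", "", opts); // scripts
        text = Regex.Replace(text, "<style.*?</style>", "", opts);          // embedded stylesheets
        text = Regex.Replace(text, "<!--.*?-->", "", opts);                 // comments
        text = Regex.Replace(text, "<!.*?>", "", opts);                     // doctype
        text = Regex.Replace(text, "</?[a-z][a-z0-9]*[^<>]*>", " ", opts);  // tags -> space
        text = WebUtility.HtmlDecode(text);                                 // &amp; -> &
        return Regex.Replace(text, @"\s+", " ").Trim();                     // collapse whitespace
    }

    static void Main()
    {
        // Made-up sample input, not taken from a real site.
        string sample =
            "<!DOCTYPE html><html><head><style>p{color:red}</style></head>" +
            "<body><!-- note --><P>Hello &amp; welcome</P>" +
            "<script>alert(1)</script></body></html>";

        Console.WriteLine(Strip(sample)); // prints "Hello & welcome"
    }
}
```

Regular expressions are adequate for a quick readable dump like this one, but they do not understand nesting or malformed markup, so an HTML parser is likely the better choice when the structure of the page matters.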



In this update, I match the loaded content against a specific pattern. I use it to find and list the URLs in the loaded content, but you can choose your own pattern to match.

Using the code

What I am focusing on here is pattern matching. To match a pattern, specify it as a regular expression. I am using it to extract the list of URLs referenced in the page. Strictly speaking, the pattern below matches host names that end in one of a few common top-level domains:

string pattern=@"\b[a-zA-Z0-9\-\.]+\.(com|org|net|mil|edu|COM|ORG|NET|MIL|EDU)\b";

Now create a Regex object with the given pattern as its parameter.

Regex r = new Regex(pattern);

To find every occurrence of the pattern, we will use the Matches() method. Iterate through the matches and print each one on the screen.

foreach (Match m in r.Matches(result)) // result = content loaded by the scraper, before formatting
{
    Console.WriteLine(m.Value);
}

The list of matched URLs is printed on the screen. The same approach works for any other pattern you want to match.
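
A page usually mentions the same host many times, so it can be handy to print each match only once. The following is a minimal sketch, assuming the same pattern as above but written with RegexOptions.IgnoreCase instead of listing the upper-case endings; HostFinder, Find, and the sample content are hypothetical and exist only for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class HostFinder
{
    // Same idea as the article's pattern; IgnoreCase covers COM, ORG, and so on.
    static readonly Regex HostPattern = new Regex(
        @"\b[a-z0-9\-\.]+\.(com|org|net|mil|edu)\b", RegexOptions.IgnoreCase);

    // Hypothetical helper: returns each matched host once, in order of first appearance.
    public static List<string> Find(string content)
    {
        var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        var hosts = new List<string>();
        foreach (Match m in HostPattern.Matches(content))
        {
            if (seen.Add(m.Value)) // Add returns false for a host already seen
            {
                hosts.Add(m.Value);
            }
        }
        return hosts;
    }

    static void Main()
    {
        // Made-up content standing in for the unformatted result of the scraper.
        string content =
            "<a href=\"http://www.example.com/a\">x</a> " +
            "<a href=\"http://docs.example.org\">y</a> www.example.com";

        foreach (string host in Find(content))
        {
            Console.WriteLine(host);
        }
    }
}
```

For the sample content, this prints www.example.com and docs.example.org, each on its own line.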


This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Article Copyright 2012 by Rumman92