
Simple Web Scraper

14 Mar 2013, CPOL, 1 min read
A simple web scraper that loads only the readable contents of a website.

Introduction

A web scraper is a tool that loads the contents of a web page. Because it downloads the raw markup of the page, I prefer to format the result so that it is readable.

Using the code

You can use this code in a console application or in a Windows or web application. I used a console application because this is an introductory example.

In the console application, add the following namespaces:  

C#
using System.Net; // to handle internet operations
using System.IO; // to use streams
using System.Text.RegularExpressions; // To format the loaded data 

Loading the content

Create the WebRequest and WebResponse objects.

C#
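string url = "http://www.google.com"; // the page to load
WebRequest request = WebRequest.Create(url); // returns an HttpWebRequest for http:// URLs
WebResponse response = request.GetResponse(); // sends the request and waits for the reply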

Create a StreamReader to read the response stream, save its contents in a string variable, and close the reader.

C#
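StreamReader sr = new StreamReader(response.GetResponseStream());
string result = sr.ReadToEnd(); // the whole page as one string
sr.Close(); // closing the reader also closes the underlying response stream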
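The loading steps above can also be wrapped in a small helper. The following is only a sketch: the names PageLoader, ReadAll, and Load are hypothetical and not part of the original code. It assumes you want the reader and the response released even when an exception is thrown (hence the using blocks) and that returning null on a WebException is acceptable for your application.

```csharp
using System;
using System.IO;
using System.Net;

// Hypothetical helper, not part of the original article.
static class PageLoader
{
    // Reads an entire stream as text; disposing the reader also closes the stream.
    public static string ReadAll(Stream stream)
    {
        using (StreamReader reader = new StreamReader(stream))
        {
            return reader.ReadToEnd();
        }
    }

    // Downloads the page at the given URL; returns null if a WebException occurs
    // (for example, the host cannot be resolved or the server returns an error status).
    public static string Load(string url)
    {
        try
        {
            WebRequest request = WebRequest.Create(url);
            using (WebResponse response = request.GetResponse())
            {
                return ReadAll(response.GetResponseStream());
            }
        }
        catch (WebException ex)
        {
            Console.Error.WriteLine("Request failed: " + ex.Message);
            return null;
        }
    }
}
```

With this helper, the loading step becomes a single call such as string result = PageLoader.Load("http://www.google.com"); followed by a null check.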

To view the unformatted result, simply write it on the console.

C#
Console.WriteLine(result); 

Formatting the result 

To format the result, we will use the static methods of the Regex class.

C#
result = Regex.Replace(result, "<script.*?</script>", "", 
  RegexOptions.Singleline | RegexOptions.IgnoreCase); // Remove scripts
result = Regex.Replace(result, "<style.*?</style>", "", 
  RegexOptions.Singleline | RegexOptions.IgnoreCase); // Remove inline stylesheets
result = Regex.Replace(result, "<!--.*?-->", "", RegexOptions.Singleline); // Remove HTML comments
result = Regex.Replace(result, "<![^>]*>", ""); // Remove the doctype
result = Regex.Replace(result, "</?[a-z][a-z0-9]*[^<>]*>", "", 
  RegexOptions.IgnoreCase); // Remove HTML tags, including upper-case ones such as <DIV>
result = Regex.Replace(result, @"\s+", " "); // Collapse each run of whitespace into one space
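For reuse, the replacements can be packaged into one method. This is a minimal sketch, not a full HTML parser: the names HtmlText and Strip are hypothetical. It also assumes that tags should be matched case-insensitively, that runs of whitespace should collapse to a single space, and that the result should be trimmed. Regular expressions handle simple pages reasonably well but can be confused by unusual markup, for example a ">" inside an attribute value.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original article.
static class HtmlText
{
    // Returns the readable text of an HTML string by deleting markup with regular expressions.
    public static string Strip(string html)
    {
        RegexOptions blockOptions = RegexOptions.Singleline | RegexOptions.IgnoreCase;

        string text = Regex.Replace(html, "<script.*?</script>", "", blockOptions); // scripts
        text = Regex.Replace(text, "<style.*?</style>", "", blockOptions);          // stylesheets
        text = Regex.Replace(text, "<!--.*?-->", "", RegexOptions.Singleline);      // comments
        text = Regex.Replace(text, "<![^>]*>", "");                                 // doctype
        text = Regex.Replace(text, "</?[a-z][a-z0-9]*[^<>]*>", "",
            RegexOptions.IgnoreCase);                                               // remaining tags
        text = Regex.Replace(text, @"\s+", " ");                                    // whitespace runs
        return text.Trim();
    }
}
```

For example, HtmlText.Strip("<p>Hello,\n  <b>world</b>!</p>") returns "Hello, world!".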

Now print the results on screen.

C#
Console.WriteLine(result); 

Update

In this update, I match the loaded content against a specific pattern. I use it to extract the URLs found in the loaded content, but you can choose your own pattern.

Using the code

The focus here is pattern matching. To match a pattern, specify it as a regular expression. I am using it to extract the list of URLs in the page (strictly speaking, host names ending in one of a few common top-level domains), so the pattern is:

C#
string pattern=@"\b[a-zA-Z0-9\-\.]+\.(com|org|net|mil|edu|COM|ORG|NET|MIL|EDU)\b";

Now create a Regex object with the given pattern as its parameter.

C#
Regex r = new Regex(pattern);

To find every occurrence of the pattern, we use the Matches() method. Iterate over the matches and print each one on the screen.

C#
// result = content loaded from the website by the scraper, without formatting
foreach (Match m in r.Matches(result))
{
    Console.WriteLine(m.Value);
}

The list of URLs found in the page is printed on the screen. You can use the same approach to match other patterns.
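To try other patterns without repeating the loop, the matching step can be made generic. The sketch below is one possible arrangement, with hypothetical names (PatternFinder, FindAll, HostPattern); only the pattern string itself comes from the article.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original article.
static class PatternFinder
{
    // The host-name pattern used in the article.
    public const string HostPattern =
        @"\b[a-zA-Z0-9\-\.]+\.(com|org|net|mil|edu|COM|ORG|NET|MIL|EDU)\b";

    // Returns every substring of text that matches the pattern, in order of appearance.
    public static List<string> FindAll(string text, string pattern)
    {
        List<string> found = new List<string>();
        foreach (Match m in Regex.Matches(text, pattern))
        {
            found.Add(m.Value);
        }
        return found;
    }
}
```

Calling PatternFinder.FindAll(result, PatternFinder.HostPattern) reproduces the loop above. Passing a different pattern, for instance @"\b\d{4}\b" for four-digit numbers (an illustrative choice), reuses the same method.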

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Help desk / Support
India
