
Simple Web Scraper

Rumman92, 14 Mar 2013
A simple web scraper that loads only the readable contents of a website.

Introduction

A web scraper is a tool that downloads the contents of a web page. Because the raw download contains all of the page's markup, I prefer to format it so that it is readable.

Using the code

You can use this code in a console application or in a Windows or web application. I used a console application because this is an introductory example.

In the console application, add the following namespaces:  

using System.Net; // to handle internet operations
using System.IO; // to use streams
using System.Text.RegularExpressions; // To format the loaded data 

Loading the content

Create the WebRequest and WebResponse objects.

string url = "http://www.google.com"; // the page to load
WebRequest request = WebRequest.Create(url);
WebResponse response = request.GetResponse();

Create a StreamReader to read the response stream, store its contents in a string variable, and then close the reader.

StreamReader sr = new StreamReader(response.GetResponseStream());
string result = sr.ReadToEnd(); // the complete HTML of the page
sr.Close();
response.Close(); // release the underlying connection

To view the unformatted result, simply write it on the console.

Console.WriteLine(result); 
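
The loading steps above can be combined into a single helper. What follows is a minimal sketch rather than the article's original code: the class name PageLoader and the methods ReadAll and Download are hypothetical names chosen for illustration. It wraps the response and the reader in using blocks so that both are closed even if an exception is thrown.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

public static class PageLoader
{
    // Reads an entire stream into a string (hypothetical helper).
    public static string ReadAll(Stream stream)
    {
        using (StreamReader reader = new StreamReader(stream))
        {
            return reader.ReadToEnd();
        }
    }

    // Downloads the raw HTML of a page (hypothetical helper).
    public static string Download(string url)
    {
        WebRequest request = WebRequest.Create(url);
        using (WebResponse response = request.GetResponse())
        {
            return ReadAll(response.GetResponseStream());
        }
    }

    public static void Main()
    {
        // ReadAll is shown on an in-memory stream so the example runs without network access.
        byte[] bytes = Encoding.UTF8.GetBytes("<html><body>Hello</body></html>");
        Console.WriteLine(ReadAll(new MemoryStream(bytes)));

        // To load a live page instead:
        // Console.WriteLine(Download("http://www.google.com"));
    }
}
```

Splitting ReadAll from Download is only a design choice for this sketch. It lets the stream-reading part be tried on local data before any request is made.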

Formatting the result 

To format the result, we will use the static Replace method of the Regex class.

result = Regex.Replace(result, "<script.*?</script>", "", 
  RegexOptions.Singleline | RegexOptions.IgnoreCase); // Remove scripts
result = Regex.Replace(result, "<style.*?</style>", "", 
  RegexOptions.Singleline | RegexOptions.IgnoreCase); // Remove inline stylesheets          
result = Regex.Replace(result, "</?[a-z][a-z0-9]*[^<>]*>", "",
  RegexOptions.IgnoreCase); // Remove HTML tags, including uppercase ones
result = Regex.Replace(result, "<!--.*?-->", "", RegexOptions.Singleline); // Remove HTML comments
result = Regex.Replace(result, "<!.*?>", "", RegexOptions.Singleline); // Remove Doctype
result = Regex.Replace(result, "\\s+", " "); // Collapse runs of whitespace into a single space

Now print the formatted result on screen.

Console.WriteLine(result); 
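
To see the formatting step in isolation, it can be tried on a small hard-coded page. This is a sketch under stated assumptions: the class name HtmlCleaner, the method StripHtml, and the sample HTML are all hypothetical and not part of the original article. The helper applies the same sequence of replacements, matches tags case-insensitively, collapses runs of whitespace, and trims the result.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HtmlCleaner
{
    // Returns the readable text of an HTML string (hypothetical helper).
    public static string StripHtml(string html)
    {
        RegexOptions block = RegexOptions.Singleline | RegexOptions.IgnoreCase;
        string text = Regex.Replace(html, @"<script.*?</script>", "", block);   // scripts
        text = Regex.Replace(text, @"<style.*?</style>", "", block);            // stylesheets
        text = Regex.Replace(text, @"</?[a-z][a-z0-9]*[^<>]*>", "",
            RegexOptions.IgnoreCase);                                           // tags
        text = Regex.Replace(text, @"<!--.*?-->", "", RegexOptions.Singleline); // comments
        text = Regex.Replace(text, @"<!.*?>", "", RegexOptions.Singleline);     // doctype
        text = Regex.Replace(text, @"\s+", " ");                                // whitespace runs
        return text.Trim();
    }

    public static void Main()
    {
        // Sample input invented for this example.
        string html = "<html><head><style>p{color:red}</style></head>"
            + "<body><!-- note --><P>Hello <b>world</b></P>\n"
            + "<script>alert(1);</script></body></html>";
        Console.WriteLine(StripHtml(html)); // prints "Hello world"
    }
}
```

Regular expressions handle simple pages like this one, but they are not a full HTML parser, so unusual markup may leave stray fragments behind.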

Update

In this update, I match the loaded content against a specific pattern. I use it to find and list the URLs in the loaded content, but you can substitute any pattern of your own.

Using the code

The focus here is pattern matching. To match a pattern, specify it as a regular expression. I use it to extract the list of URLs referenced by the page. The pattern below matches host names that end in one of a few common top-level domains:

string pattern=@"\b[a-zA-Z0-9\-\.]+\.(com|org|net|mil|edu|COM|ORG|NET|MIL|EDU)\b";

Now create a Regex object with the given pattern as its parameter.

Regex r = new Regex(pattern);

To find every occurrence of the pattern, we will use the Matches() method. Iterate through the matches and print each one on the screen.

// result holds the content loaded by the scraper, before formatting
foreach (Match m in r.Matches(result))
{
    Console.WriteLine(m.Value);
}

The matched URLs are printed on screen. The same approach works for any other pattern.
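
The matching step can be tried on a fixed string before it is pointed at a downloaded page. The following is a minimal sketch: the class name UrlFinder, the method FindHosts, and the sample text are hypothetical and chosen only for illustration. It uses the article's pattern unchanged. Note that this pattern captures host names such as www.google.com, not the scheme or the path of a full URL.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class UrlFinder
{
    // The pattern from the article: host names ending in a common top-level domain.
    private const string Pattern =
        @"\b[a-zA-Z0-9\-\.]+\.(com|org|net|mil|edu|COM|ORG|NET|MIL|EDU)\b";

    // Returns every match of the pattern, in order of appearance (hypothetical helper).
    public static List<string> FindHosts(string content)
    {
        List<string> hosts = new List<string>();
        foreach (Match m in Regex.Matches(content, Pattern))
        {
            hosts.Add(m.Value);
        }
        return hosts;
    }

    public static void Main()
    {
        // Sample content invented for this example.
        string content = "<a href=\"http://www.google.com\">Google</a> "
            + "<a href=\"http://docs.example.net/page\">Docs</a> "
            + "contact: admin@example.org";

        foreach (string host in FindHosts(content))
        {
            Console.WriteLine(host);
        }
        // prints:
        // www.google.com
        // docs.example.net
        // example.org
    }
}
```

The third line of output comes from an e-mail address, which shows that the pattern matches any host name in the text, whether or not it belongs to a link.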

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Rumman92
Help desk / Support
India

Article Copyright 2012 by Rumman92