Simple Web Scraper

Dec 15, 2012

CPOL

A simple web scraper that loads only the readable contents of a website.

Introduction

A web scraper is a tool that loads the contents of a website. Because it downloads the raw markup of a page, I prefer to format the result to make it readable.

Using the code

You can use this code in a console application or in a Windows or web application. I used a console application because this is an introductory example.

In the console application, add the following namespaces:  

using System.Net; // to handle internet operations
using System.IO; // to use streams
using System.Text.RegularExpressions; // To format the loaded data 

Loading the content

Create the WebRequest and WebResponse objects.

string url = "http://www.google.com"; // address of the page to load
WebRequest request = WebRequest.Create(url);
WebResponse response = request.GetResponse();

Create a StreamReader to read the response stream, store the content in a string variable, and then close the reader.

StreamReader sr = new StreamReader(response.GetResponseStream());
string result = sr.ReadToEnd(); // read the whole response body as text
sr.Close();
response.Close(); // release the connection
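The loading steps above could also be wrapped in a small helper that disposes the response and the reader even when an exception is thrown. This is only a sketch: the names `PageLoader`, `ReadAll`, and `Load` are hypothetical and not part of the original code, and it assumes the page is UTF-8 encoded. On newer .NET versions `WebRequest` is marked obsolete in favor of `HttpClient`; the sketch keeps the API used in this article.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

// Hypothetical helper, not part of the original article.
static class PageLoader
{
    // Reads a whole stream as text. Kept separate so it can be tried
    // without a network call (assumes UTF-8 content).
    public static string ReadAll(Stream stream)
    {
        using (StreamReader reader = new StreamReader(stream, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }

    // Downloads the raw markup of a page. The using block closes the
    // response on every exit path, including exceptions.
    public static string Load(string url)
    {
        WebRequest request = WebRequest.Create(url);
        using (WebResponse response = request.GetResponse())
        {
            return ReadAll(response.GetResponseStream());
        }
    }
}
```

With this in place, `string result = PageLoader.Load(url);` would replace the separate request, response, and reader statements.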

To view the unformatted result, simply write it on the console.

Console.WriteLine(result); 

Formatting the result 

To format the result, we will use the static Replace() method of the Regex class.

result = Regex.Replace(result, "<script.*?</script>", "", 
  RegexOptions.Singleline | RegexOptions.IgnoreCase); // Remove scripts
result = Regex.Replace(result, "<style.*?</style>", "", 
  RegexOptions.Singleline | RegexOptions.IgnoreCase); // Remove inline stylesheets
result = Regex.Replace(result, "</?[a-z][a-z0-9]*[^<>]*>", "", 
  RegexOptions.IgnoreCase); // Remove HTML tags, including upper-case ones such as <DIV>
result = Regex.Replace(result, "<!--.*?-->", "", RegexOptions.Singleline); // Remove HTML comments
result = Regex.Replace(result, "<!.*?>", "", RegexOptions.Singleline); // Remove the doctype
result = Regex.Replace(result, "[\t\r\n]", " "); // Replace tabs and line breaks with spaces

Now print the results on screen.

Console.WriteLine(result); 
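The formatting steps can also be collected into one reusable method. The sketch below is a variant under stated assumptions, not the article's code: the names `HtmlText` and `ToPlainText` are hypothetical, tags are replaced with a space so that adjacent words do not fuse, and two extra steps are added that the code above does not perform, namely decoding entities with `WebUtility.HtmlDecode` and collapsing runs of whitespace.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original article.
static class HtmlText
{
    public static string ToPlainText(string html)
    {
        RegexOptions opts = RegexOptions.Singleline | RegexOptions.IgnoreCase;

        string text = Regex.Replace(html, "<script.*?</script>", "", opts); // scripts
        text = Regex.Replace(text, "<style.*?</style>", "", opts);          // inline stylesheets
        text = Regex.Replace(text, "<!--.*?-->", "", opts);                 // HTML comments
        text = Regex.Replace(text, "<![^>]*>", "", opts);                   // doctype
        // Replace each tag with a space so neighboring words stay separated.
        text = Regex.Replace(text, "</?[a-z][a-z0-9]*[^<>]*>", " ", opts);

        text = WebUtility.HtmlDecode(text);      // extra step: &amp; becomes &
        text = Regex.Replace(text, @"\s+", " "); // extra step: collapse whitespace runs
        return text.Trim();
    }
}
```

Regular expressions are adequate for a quick readable dump like this one, but they do not parse HTML reliably; a dedicated HTML parser is the safer choice when the extracted text has to be accurate.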

Update

In this update, I match the loaded content against a specific pattern. I use it to extract the URLs found in the loaded content. However, you can choose your own pattern to match.

Using the code

The focus here is pattern matching. To match a pattern, specify it as a regular expression. I use it to extract the list of URLs found in the page; the pattern below matches host names that end in one of a few common top-level domains:

string pattern=@"\b[a-zA-Z0-9\-\.]+\.(com|org|net|mil|edu|COM|ORG|NET|MIL|EDU)\b";

Now create a Regex object with the given pattern as its parameter.

Regex r = new Regex(pattern);

To find the pattern in the content, we will use the Matches() method. Iterate over the matches and print each one on the screen.

// result = content loaded from the website by the scraper, before formatting
foreach (Match m in r.Matches(result))
{
    Console.WriteLine(m.Value);
}

A list of the URLs (host names) found in the content will be printed on screen. You can use the same approach to match other patterns.
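Because the same host usually appears many times in a page, it may be useful to print each one only once. The following is a minimal sketch under assumptions: `HostExtractor` and `FindHosts` are hypothetical names, the list of top-level domains is the same as in the pattern above, and `RegexOptions.IgnoreCase` stands in for the duplicated upper-case alternatives.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original article.
static class HostExtractor
{
    // Same top-level domains as the article's pattern; IgnoreCase covers COM, ORG, etc.
    private static readonly Regex HostPattern = new Regex(
        @"\b[a-z0-9\-\.]+\.(com|org|net|mil|edu)\b", RegexOptions.IgnoreCase);

    // Returns each matched host name once, in order of first appearance.
    public static List<string> FindHosts(string content)
    {
        HashSet<string> seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        List<string> hosts = new List<string>();
        foreach (Match m in HostPattern.Matches(content))
        {
            if (seen.Add(m.Value)) // Add returns false for a host already seen
            {
                hosts.Add(m.Value);
            }
        }
        return hosts;
    }
}
```

Calling `HostExtractor.FindHosts(result)` on the unformatted content and printing each entry gives the same kind of list as the loop above, without repeats.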