A Web Spider Library in C#

By Uwe Keim

An article about a spider library to grab websites and store them locally.
Windows, .NET, ASP.NET, Visual Studio, WebForms, Dev
Posted:10 Sep 2006
Updated:18 Sep 2007

[Image: ZetaWebSpider.png]
Don't fear, it's just a web spider ;-)

Introduction

Today, while looking through some older code, I came across a set of classes I wrote at the beginning of this year for a customer project.

The classes implement a basic web spider (also called "web robot" or "web crawler") to grab web pages (including resources like images and CSS), download them locally and adjust any resource hyperlinks to point to the locally downloaded resources.
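The link-adjustment step described above can be illustrated with a small sketch. It is not the library's code: the names `LinkRewriter`, `Rewrite` and `localNames` are hypothetical, and a regular expression stands in for the real HTML parsing, so it only handles double-quoted `href` and `src` attributes.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the WebSpider library.
public static class LinkRewriter
{
    // Matches double-quoted href="..." and src="..." attributes only.
    private static readonly Regex LinkAttribute = new Regex(
        @"(?<prefix>\b(?:href|src)\s*=\s*"")(?<url>[^""]*)(?<suffix>"")",
        RegexOptions.IgnoreCase);

    // localNames maps an absolute URL to the name of its downloaded copy.
    public static string Rewrite(
        string html, Uri pageUrl, IDictionary<string, string> localNames)
    {
        return LinkAttribute.Replace(html, delegate(Match m)
        {
            Uri absolute;
            string local;

            // Resolve relative links against the page URL before the lookup.
            if (Uri.TryCreate(pageUrl, m.Groups["url"].Value, out absolute) &&
                localNames.TryGetValue(absolute.AbsoluteUri, out local))
            {
                return m.Groups["prefix"].Value + local + m.Groups["suffix"].Value;
            }

            // Links without a local copy are left untouched.
            return m.Value;
        });
    }
}
```

In this sketch, a link is rewritten only when a local copy is known, so hyperlinks to pages that were never downloaded keep pointing at the original website.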

While this is not a full-featured article with the detailed explanations I usually like to write, I still want to put the code online with this short article. Maybe some readers can take a few ideas from this code and use it as a starting point for their own projects.

Overview

The classes support both synchronous and asynchronous download of web pages and accept several options, such as the hyperlink depth to follow and proxy settings.
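How a hyperlink-depth option bounds a crawl can be sketched as a breadth-first traversal. This is only an illustration under assumptions: `DepthLimitedCrawl`, `Visit` and `getLinks` are hypothetical names, the link lookup is passed in as a delegate instead of doing any network access, and a depth of 0 is assumed to mean "the start page only".

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch; not the WebSpider library's implementation.
public static class DepthLimitedCrawl
{
    // Returns the URLs in the order they would be visited.
    // getLinks supplies the hyperlinks found on a page (assumed to be given).
    public static List<string> Visit(
        string startUrl, int maxDepth, Func<string, IEnumerable<string>> getLinks)
    {
        List<string> visited = new List<string>();
        HashSet<string> seen = new HashSet<string>();
        Queue<KeyValuePair<string, int>> pending =
            new Queue<KeyValuePair<string, int>>();

        seen.Add(startUrl);
        pending.Enqueue(new KeyValuePair<string, int>(startUrl, 0));

        while (pending.Count > 0)
        {
            KeyValuePair<string, int> current = pending.Dequeue();
            visited.Add(current.Key);

            // Pages at the maximum depth are visited, but their links are not followed.
            if (current.Value >= maxDepth)
            {
                continue;
            }

            foreach (string link in getLinks(current.Key))
            {
                // HashSet.Add returns false for URLs that were already queued.
                if (seen.Add(link))
                {
                    pending.Enqueue(
                        new KeyValuePair<string, int>(link, current.Value + 1));
                }
            }
        }

        return visited;
    }
}
```

The set of already seen URLs keeps pages that link to each other from being processed twice.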

Each downloaded resource gets a new file name based on the hash code of its original URL. I did this to keep things simple (for me as the programmer).
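The idea of deriving a local file name from the URL can be sketched as follows. The article does not say which hash is used, so this sketch swaps in SHA-256 as an assumption, because it yields the same name on every run. `LocalFileNames` and `FromUrl` are hypothetical names, as is the choice to keep the original file extension.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper for illustration; not part of the WebSpider library.
public static class LocalFileNames
{
    // Builds a file name from the first 8 bytes of the SHA-256 digest of the
    // URL (16 hex characters) plus the extension of the URL path, if any.
    public static string FromUrl(Uri url)
    {
        using (SHA256 sha = SHA256.Create())
        {
            byte[] digest = sha.ComputeHash(Encoding.UTF8.GetBytes(url.AbsoluteUri));

            StringBuilder name = new StringBuilder();
            for (int i = 0; i < 8; i++)
            {
                name.Append(digest[i].ToString("x2"));
            }

            // Keeping the extension (an assumption) lets local viewers
            // still recognize the file type.
            return name.ToString() + Path.GetExtension(url.AbsolutePath);
        }
    }
}
```

Because the name depends only on the URL, every page that references the same resource ends up pointing at the same local file.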

To parse a document I am using the SGMLReader DLL from the GotDotNet website.

Also, since I didn't need it for the project I wrote it for, the library does not honor "robots.txt", throttle its requests, or provide other such features.

Using the code

The download for this article contains the library ("WebSpider") and a console test application ("WebSpiderTest"). The test application is short and should be easy to understand.

Basically, you create an instance of the WebSiteDownloaderOptions class, configure several parameters, create an instance of the WebSiteDownloader class, optionally connect event handlers, and then tell the instance to process the given URL either synchronously or asynchronously.
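The shape of that workflow can be sketched with stand-in classes. Everything here is hypothetical: `SketchOptions`, `SketchDownloader` and all of their members are invented for illustration and are not the members of WebSiteDownloaderOptions or WebSiteDownloader, so please look at the WebSpiderTest project for the real calls.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical stand-in for an options class; member names are assumptions.
public sealed class SketchOptions
{
    public Uri StartUrl;
    public string DestinationFolder;
    public int MaximumLinkDepth;
}

// Hypothetical stand-in for a downloader class; it downloads nothing.
public sealed class SketchDownloader
{
    private readonly SketchOptions options;

    // Raised before a URL is processed, so callers can report progress.
    public event Action<Uri> ProcessingUrl;

    public SketchDownloader(SketchOptions options)
    {
        this.options = options;
    }

    // Synchronous variant: returns when processing has finished.
    public void Process()
    {
        Action<Uri> handler = ProcessingUrl;
        if (handler != null)
        {
            handler(options.StartUrl);
        }
        // A real downloader would fetch and store the page here.
    }

    // Asynchronous variant: runs the same work on a thread-pool thread.
    public Task ProcessAsync()
    {
        return Task.Run(() => Process());
    }
}
```

A caller would fill in a `SketchOptions` instance, pass it to the `SketchDownloader` constructor, attach a handler to `ProcessingUrl`, and then call either `Process()` or `ProcessAsync()`.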

History

  • 2007-09-17: Fixed several issues.
  • 2006-09-10: Initial release of the article.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Uwe Keim


Member
Uwe has been programming since 1989, with experience in Assembler, C++, MFC and lots of web and database work, and now uses ASP.NET and C# extensively, too. He also teaches programming to students at the local university.

In his free time, he goes climbing, running and mountain biking. You can watch him programming most of the day (and probably night).

Occupation: Software Developer
Company: zeta software GmbH
Location: Germany

Copyright 2006 by Uwe Keim