Licence CPOL
First Posted 9 Sep 2006

A Web Spider Library in C#

By Uwe Keim | 18 Sep 2007 | Article
An article about a spider library to grab websites and store them locally

Sample Image - ZetaWebSpider.png

Don't fear, it's just a web spider ;-)

Introduction

Today, while looking through some older code, I came across a set of classes I wrote at the beginning of this year for a customer project.

The classes implement a basic web spider (also called "web robot" or "web crawler") to grab web pages (including resources like images and CSS), download them locally and adjust any resource hyperlinks to point to the locally downloaded resources.
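
The link-adjustment step can be illustrated with a minimal sketch. It is not the library's own code: the class name `LinkRewriter`, the regular expression and the dictionary of already downloaded resources are assumptions made for this example, and the library itself parses documents with SGMLReader rather than with a regular expression.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the WebSpider library.
public static class LinkRewriter
{
    // Matches href="..." or src="..." with either double or single quotes.
    private static readonly Regex LinkAttribute = new Regex(
        @"\b(?<attr>href|src)\s*=\s*(?<quote>[""'])(?<url>.*?)\k<quote>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Replaces every link that points to an already downloaded resource
    // with the local file name stored for it in localNames.
    public static string Rewrite(
        string html, Uri pageUri, IDictionary<Uri, string> localNames)
    {
        return LinkAttribute.Replace(html, match =>
        {
            Uri absolute;
            string localName;

            // Resolve relative links against the URL of the containing page.
            if (Uri.TryCreate(pageUri, match.Groups["url"].Value, out absolute)
                && localNames.TryGetValue(absolute, out localName))
            {
                string quote = match.Groups["quote"].Value;
                return match.Groups["attr"].Value + "=" + quote + localName + quote;
            }

            // Not downloaded: leave the link untouched.
            return match.Value;
        });
    }
}
```

Links that were not downloaded stay as they are, so the saved page still points to the original site for those. The sketch does not decode HTML entities inside attribute values, which a real implementation would have to handle.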

This is not the full-featured article with detailed explanations that I usually like to write, but I still want to put the code online. Maybe some readers can take ideas from this code and use it as a starting point for their own projects.

Overview

The classes support both synchronous and asynchronous download of web pages and accept options such as the hyperlink depth to follow and proxy settings.
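
To show what a hyperlink-depth limit means in practice, here is a small sketch of a breadth-first traversal that stops following links below a given depth. The names `DepthLimitedCrawl`, `Visit` and `getLinks` are made up for this example and are not taken from the library; the link lookup is passed in as a delegate so the sketch runs without network access.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the WebSpider library.
public static class DepthLimitedCrawl
{
    // Returns the URLs in the order they would be processed.
    // Depth 0 is the start page; pages at maximumDepth are still
    // processed, but their links are no longer followed.
    public static List<string> Visit(
        string startUrl,
        int maximumDepth,
        Func<string, IEnumerable<string>> getLinks)
    {
        var visited = new HashSet<string> { startUrl };
        var order = new List<string>();
        var queue = new Queue<KeyValuePair<string, int>>();
        queue.Enqueue(new KeyValuePair<string, int>(startUrl, 0));

        while (queue.Count > 0)
        {
            KeyValuePair<string, int> current = queue.Dequeue();
            order.Add(current.Key);

            if (current.Value >= maximumDepth)
            {
                continue;
            }

            foreach (string link in getLinks(current.Key))
            {
                // HashSet.Add returns false for URLs seen before,
                // which prevents endless loops on circular links.
                if (visited.Add(link))
                {
                    queue.Enqueue(
                        new KeyValuePair<string, int>(link, current.Value + 1));
                }
            }
        }

        return order;
    }
}
```

With a link graph a → b, c and b → d, a maximum depth of 1 would process a, b and c, while a depth of 2 would also reach d.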

Each downloaded resource gets a new file name based on the hash code of its original URL. I did this to keep the implementation simple for me as the programmer.
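
A hash-based file name could be derived roughly as follows. This is a sketch under assumptions, not the library's code: the class name `LocalFileNamer` is hypothetical, and it uses a 32-bit FNV-1a hash instead of `string.GetHashCode()` because the latter is not guaranteed to return the same value across .NET versions or processes.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of the WebSpider library.
public static class LocalFileNamer
{
    // 32-bit FNV-1a hash over the UTF-8 bytes of the text.
    // Chosen here only because it is simple and deterministic.
    public static uint Fnv1a(string text)
    {
        const uint offsetBasis = 2166136261;
        const uint prime = 16777619;

        uint hash = offsetBasis;
        foreach (byte b in Encoding.UTF8.GetBytes(text))
        {
            unchecked
            {
                hash ^= b;
                hash *= prime;
            }
        }
        return hash;
    }

    // Builds a name of eight hexadecimal digits plus the given extension.
    // The caller supplies the extension, for example from the content type.
    public static string GetLocalFileName(Uri uri, string extension)
    {
        return Fnv1a(uri.AbsoluteUri).ToString("X8") + extension;
    }
}
```

The same URL always maps to the same file name, so a resource referenced from several pages is stored only once. Two different URLs can collide on a 32-bit hash, which is part of the trade-off of such a simple scheme.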

To parse a document, I am using the SGMLReader DLL from the GotDotNet website.

Also, because the project I wrote it for did not need them, the library does not honor "robots.txt" and offers no throttling or similar features.

Using the Code

The download for this article contains the library ("WebSpider") and a testing console application ("WebSpiderTest"). The testing application is short and should be easy to understand.

Basically, you create an instance of the WebSiteDownloaderOptions class, configure several parameters, create an instance of the WebSiteDownloader class, optionally connect event handlers and then tell the instance to process the given URL either synchronously or asynchronously.
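
The general shape of that workflow is sketched below with stand-in classes. Everything in this block is hypothetical: `SketchOptions`, `SketchDownloader`, their properties, events and methods are invented for illustration and are not the members of WebSiteDownloaderOptions or WebSiteDownloader. The asynchronous variant uses `Task.Run`, which is also an assumption and not necessarily what the library does.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical stand-in for an options class; member names are invented.
public class SketchOptions
{
    public Uri DownloadUri { get; set; }
    public string DestinationFolder { get; set; }
    public int MaximumLinkDepth { get; set; }
}

// Event data carrying the URL that is being processed.
public class UriEventArgs : EventArgs
{
    public UriEventArgs(Uri uri) { Uri = uri; }
    public Uri Uri { get; }
}

// Hypothetical stand-in for a downloader class; member names are invented.
public class SketchDownloader
{
    private readonly SketchOptions options;

    public event EventHandler<UriEventArgs> ProcessingUrl;
    public event EventHandler Completed;

    public SketchDownloader(SketchOptions options)
    {
        this.options = options;
    }

    // Synchronous processing: returns when everything is done.
    public void Process()
    {
        // A real spider would download the page and follow its links here.
        ProcessingUrl?.Invoke(this, new UriEventArgs(options.DownloadUri));
        Completed?.Invoke(this, EventArgs.Empty);
    }

    // Asynchronous processing: runs the same work on a thread pool thread.
    public Task ProcessAsync()
    {
        return Task.Run(() => Process());
    }
}

// Usage, following the steps described above:
//
//   var options = new SketchOptions
//   {
//       DownloadUri = new Uri("http://www.example.com/"),
//       DestinationFolder = @"C:\Temp\Spider",
//       MaximumLinkDepth = 1
//   };
//   var downloader = new SketchDownloader(options);
//   downloader.ProcessingUrl += (s, e) => Console.WriteLine(e.Uri);
//   downloader.Process();               // synchronous
//   downloader.ProcessAsync().Wait();   // asynchronous
```

For the real class and member names, the WebSpiderTest console application in the download is the reference to follow.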

History

  • 2007-09-17: Fixed several issues
  • 2006-09-10: Initial release of the article

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Uwe Keim

Chief Technology Officer
Zeta Producer Desktop CMS
Germany

Uwe has been programming since 1989, with experience in Assembler, C++, MFC and lots of web and database stuff, and now uses ASP.NET and C# extensively, too. He has also taught programming to students at the local university.
 
In his free time, he goes climbing, running and mountain biking. You can watch him programming most of the day (and probably night).
 
Comments and Discussions

  • General: doesn't work correctly in indexof: based sites (member Arpit sharma, 4:27 2 Apr '11)
  • General: Problem In SGML (member Arpit sharma, 4:23 27 Dec '10)
  • General: Issue following links (member StubbsPKS, 6:07 26 Oct '10)
  • Answer: Re: Issue following links (member StubbsPKS, 7:31 26 Oct '10)
  • General: Re: Issue following links (mvp Uwe Keim, 8:41 26 Oct '10)
  • General: My vote of 2 (member daveauld, 23:51 19 Jun '10)
  • General: Re: My vote of 2 (member Kevin Yochum, 7:59 9 Aug '10)
  • General: Re: My vote of 2 (mvp Uwe Keim, 19:39 9 Aug '10)
  • General: 3k (member songmei.lv@163.com, 0:03 7 Dec '09)
  • General: .STATE File (member Member 441033811:46 31 Oct '09)
  • General: My vote of 1 (member babakzawari, 0:23 20 Oct '09)
  • General: Web Spider Issue (member Member 47472422:17 2 Sep '09)
  • General: Re: Web Spider Issue (sitebuilder Uwe Keim, 22:29 2 Sep '09)
  • General: xml site map (member Rohit_kakria, 1:51 10 Aug '09)
  • General: Just Links [modified] (member Sosyopat, 10:18 30 Jul '09)
  • General: Re: Just Links (sitebuilder Uwe Keim, 19:22 30 Jul '09)
  • General: The remote server returned an error: (403) Forbidden (member kakakkakka, 1:00 21 Jul '09)
  • General: Re: The remote server returned an error: (403) Forbidden (sitebuilder Uwe Keim, 1:20 21 Jul '09)
  • General: question plz (member naroqueen, 19:22 22 May '09)
  • General: nice spider and nice code (member psyhf, 20:03 21 Mar '09)
  • Question: I am getting an unhandled exception while running the test project (member Member 4711824, 9:47 3 Jun '08)
Failed to replace URI '/script/Ann/ServeHTML.aspx?C=False&id=6233' with URI '30FC637E.html' in HTML text '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
 
Above is the message I get when I tried to run the test project.
 
Please advice as to how can I make the test project run without errors.
  • General: Exception Running Test (member cornix4, 9:49 30 Apr '08)
  • General: so good (member okzhuce, 20:18 28 Apr '08)
  • General: Re: so good (sitebuilder Uwe Keim, 20:43 28 Apr '08)
  • General: hello (member Member 451146414:48 18 Feb '08)

Last Updated 19 Sep 2007
Article Copyright 2006 by Uwe Keim
Everything else Copyright © CodeProject, 1999-2012