License: The Code Project Open License (CPOL)
A Web Spider Library in C#
By Uwe Keim
An article about a spider library to grab websites and store them locally.
Windows, .NET, ASP.NET, Visual Studio, WebForms, Dev

Don't fear, it's just a web spider ;-)
Today, while looking through some older code, I came across a set of classes I wrote at the beginning of this year for a customer project.
The classes implement a basic web spider (also called "web robot" or "web crawler") to grab web pages (including resources like images and CSS), download them locally and adjust any resource hyperlinks to point to the locally downloaded resources.
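The link adjustment described above can be illustrated with a minimal sketch. It is not the library's code: the class name `LinkRewriter` and the `localNames` map are hypothetical, and a regular expression over double-quoted `href`/`src` attributes stands in for the real HTML parsing, which is a simplification that would miss single-quoted or unquoted attributes.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, NOT part of the WebSpider library.
public static class LinkRewriter
{
    // Matches href="..." and src="..." (double-quoted values only).
    private static readonly Regex LinkAttribute = new Regex(
        "(?<attr>\\b(?:href|src)\\s*=\\s*)\"(?<url>[^\"]*)\"",
        RegexOptions.IgnoreCase);

    // pageUrl:    URL the HTML was downloaded from, used to resolve relative links.
    // localNames: assumed map from absolute resource URL to local file name.
    public static string Rewrite(string html, Uri pageUrl,
                                 IDictionary<string, string> localNames)
    {
        return LinkAttribute.Replace(html, m =>
        {
            Uri absolute;
            if (!Uri.TryCreate(pageUrl, m.Groups["url"].Value, out absolute))
                return m.Value; // not a resolvable URL, keep as-is

            string local;
            if (!localNames.TryGetValue(absolute.AbsoluteUri, out local))
                return m.Value; // resource was not downloaded, keep as-is

            // Point the attribute at the locally stored copy.
            return m.Groups["attr"].Value + "\"" + local + "\"";
        });
    }
}
```

Links to resources that are missing from the map are left untouched, so a page that was only partially downloaded still points at the original web addresses for the rest.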
While this is not a full-featured article with the detailed explanations I usually like to write, I still want to put the code online with this short text. Maybe some readers can still take ideas from this code and use it as a starting point for their own projects.
The classes support both synchronous and asynchronous download of web pages and accept several options, such as the hyperlink depth to follow and proxy settings.
The downloaded resources get their own new file names, based on the hash code of the original URL. I did this to keep things simple (for me as the programmer).
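As an illustration of hash-based file names, here is a minimal sketch. It is an assumption about how such a scheme could look, not the library's implementation: the helper name `LocalFileNames.FromUrl` is hypothetical, and it swaps in a SHA-256 digest so that the same URL yields the same name on every run, which `string.GetHashCode()` does not guarantee across processes or runtime versions.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper, NOT part of the WebSpider library.
public static class LocalFileNames
{
    public static string FromUrl(Uri url)
    {
        using (SHA256 sha = SHA256.Create())
        {
            // Hash the full absolute URL so that different URLs get different names.
            byte[] digest = sha.ComputeHash(Encoding.UTF8.GetBytes(url.AbsoluteUri));

            // The first 8 bytes as hex (16 characters) keep the name short.
            StringBuilder name = new StringBuilder();
            for (int i = 0; i < 8; i++)
                name.Append(digest[i].ToString("x2"));

            // Keep the original extension so the file type stays recognizable.
            // Assumption: resources without an extension are treated as HTML.
            string extension = Path.GetExtension(url.AbsolutePath);
            if (string.IsNullOrEmpty(extension))
                extension = ".html";

            return name.ToString() + extension;
        }
    }
}
```

A call such as `LocalFileNames.FromUrl(new Uri("http://example.com/images/logo.png"))` returns a 16-character hex name ending in `.png`. Truncating the digest keeps names readable at the cost of a small collision risk.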
To parse a document I am using the SGMLReader DLL from the GotDotNet website.
Also, since I didn't need it for the project I wrote, the library does not honor "robots.txt", throttle its requests or offer other such features.
The download for this article contains the library ("WebSpider") and a testing console application ("WebSpiderTest"). The testing application is rather short and should be easy to understand.
Basically, you create an instance of the WebSiteDownloaderOptions class, configure several parameters, create an instance of the WebSiteDownloader class, optionally connect event handlers and then tell the instance to process the given URL either synchronously or asynchronously.
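The workflow above might look roughly like the following sketch. The real member names of WebSiteDownloaderOptions and WebSiteDownloader are not shown in this article, so the sketch uses hypothetical stand-in types (`SketchOptions`, `SketchDownloader`) whose members are all assumptions. The `Fetch` delegate replaces the actual HTTP download, so the example runs without network access, and `Task.Run` is merely one possible way to offer an asynchronous start.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

// Hypothetical stand-ins, NOT the library's WebSiteDownloaderOptions/WebSiteDownloader.
public sealed class SketchOptions
{
    public Uri StartUrl;             // URL to start from
    public int MaximumLinkDepth;     // 0 = start page only
    public Func<Uri, string> Fetch;  // stands in for the HTTP download; null = not found
}

public sealed class SketchDownloader
{
    private static readonly Regex Href =
        new Regex("href\\s*=\\s*\"([^\"]*)\"", RegexOptions.IgnoreCase);
    private readonly SketchOptions options;

    public event Action<Uri> ProcessingUrl; // raised before each page is fetched

    public SketchDownloader(SketchOptions options) { this.options = options; }

    // Synchronous start: breadth-first walk limited by MaximumLinkDepth.
    public void Process()
    {
        var visited = new HashSet<string>();
        var queue = new Queue<KeyValuePair<Uri, int>>();
        visited.Add(options.StartUrl.AbsoluteUri);
        queue.Enqueue(new KeyValuePair<Uri, int>(options.StartUrl, 0));
        while (queue.Count > 0)
        {
            KeyValuePair<Uri, int> item = queue.Dequeue();
            Action<Uri> handler = ProcessingUrl;
            if (handler != null) handler(item.Key);

            string html = options.Fetch(item.Key);
            if (html == null || item.Value >= options.MaximumLinkDepth) continue;

            foreach (Match m in Href.Matches(html))
            {
                Uri next;
                if (Uri.TryCreate(item.Key, m.Groups[1].Value, out next)
                    && visited.Add(next.AbsoluteUri))
                    queue.Enqueue(new KeyValuePair<Uri, int>(next, item.Value + 1));
            }
        }
    }

    // Asynchronous start: the same work on a thread-pool thread.
    public Task ProcessAsync() { return Task.Run(() => Process()); }
}

public static class Demo
{
    // Returns the URLs in the order they were processed.
    public static List<string> Run()
    {
        // In-memory "website" used instead of real HTTP requests.
        var site = new Dictionary<string, string>
        {
            { "http://example.com/", "<a href=\"a.html\">A</a>" },
            { "http://example.com/a.html", "<a href=\"b.html\">B</a>" },
            { "http://example.com/b.html", "<a href=\"c.html\">C</a>" }
        };

        var options = new SketchOptions();
        options.StartUrl = new Uri("http://example.com/");
        options.MaximumLinkDepth = 1;
        options.Fetch = url =>
        {
            string html;
            return site.TryGetValue(url.AbsoluteUri, out html) ? html : null;
        };

        var processed = new List<string>();
        var downloader = new SketchDownloader(options);
        downloader.ProcessingUrl += url => processed.Add(url.AbsoluteUri);

        downloader.Process();                // synchronous
        // downloader.ProcessAsync().Wait(); // asynchronous alternative
        return processed;
    }
}
```

Calling `Demo.Run()` from a `Main` method processes only the start page and `a.html`, because with a link depth of 1 the links found on `a.html` are no longer followed.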
Last Updated: 18 Sep 2007
Copyright 2006 by Uwe Keim. Everything else Copyright © CodeProject, 1999-2009