What is it?
DomainWalker is an object that discovers the domains reachable from a URL. Unlike traditional crawlers and site downloaders, which identify every reachable URL on a page,
DomainWalker explores a subset of the World Wide Web's topology by targeting root URLs only.
DomainWalker guarantees that its walk will complete in a finite amount of time by ensuring that duplicate domains are never crawled.
DomainWalker is an example of a WebResourceProvider and uses my StringParser utility class, both of which are published elsewhere at this site. As an aside, the demo application shows how to spin off a worker thread from a GUI and have it update the GUI in a safe manner. This is done by having the app respond to events fired by the worker thread.
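The termination guarantee described above comes from never crawling the same domain twice. The following is a minimal sketch of that idea only, not DomainWalker's actual source: the class name, the in-memory Links table (a stand-in for fetching a page and extracting the domains it links to), and the assumption that the start URL sits at depth 0 are all hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch of a walk that visits each domain at most once.
public static class VisitedSetWalkSketch
{
    // Stand-in for downloading a page and extracting the domains it links to.
    static readonly Dictionary<string, string[]> Links = new Dictionary<string, string[]>
    {
        { "a.example", new[] { "b.example", "c.example" } },
        { "b.example", new[] { "a.example", "c.example" } }, // links back to a.example
        { "c.example", new[] { "a.example" } }
    };

    // Returns the domains in the order they were first discovered.
    public static List<string> Walk(string startDomain, int maxDepth)
    {
        var visited = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        var order = new List<string>();
        Visit(startDomain, 0, maxDepth, visited, order);
        return order;
    }

    static void Visit(string domain, int depth, int maxDepth,
                      HashSet<string> visited, List<string> order)
    {
        // HashSet.Add returns false for a domain that was already seen,
        // so a cycle in the link graph cannot cause endless recursion.
        if (depth > maxDepth || !visited.Add(domain)) return;
        order.Add(domain);

        string[] children;
        if (!Links.TryGetValue(domain, out children)) return;
        foreach (string child in children)
            Visit(child, depth + 1, maxDepth, visited, order);
    }

    public static void Main()
    {
        foreach (string domain in Walk("a.example", 3))
            Console.WriteLine(domain);
    }
}
```

Because the set of visited domains only grows and each domain is expanded once, the walk is bounded by the number of distinct domains within the depth limit, even when sites link back to each other.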
How do I use it?
Use DomainWalker by initializing it, calling its Walk() method, and getting its results.
- Initialize the DomainWalker
// Initialize the DomainWalker
DomainWalker dw = new DomainWalker();
dw.StartUrl = "www.ravib.com";
dw.MaxDepth = 3;
- Do the walk
// Do walk
dw.Walk();
- Get the results
// Get results
Hashtable domainTree = dw.DomainTree;
printHashTableAsTree(domainTree); // left as an exercise to the reader
Getting DomainWalker's results
Get DomainWalker's results by accessing its DomainTree property at the end of the walk and/or responding to the OnNotifyUrlBeingTraversed event.
DomainWalker's result is a tree of discovered domains obtained from the object's DomainTree property. The tree is actually a nested Hashtable, where each collection of child nodes is stored in a new Hashtable.
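One way to print such a nested Hashtable as an indented tree is sketched below. The exact layout of DomainTree is not spelled out here, so the sketch assumes that each key is a domain name and each value is either a child Hashtable or null; the class and method names are hypothetical.

```csharp
using System;
using System.Collections;
using System.Text;

// Hypothetical sketch: renders a nested Hashtable as an indented tree.
public static class DomainTreePrinterSketch
{
    // Assumed layout: key = domain name, value = Hashtable of children (or null).
    public static string Render(Hashtable tree)
    {
        var sb = new StringBuilder();
        Render(tree, 0, sb);
        return sb.ToString();
    }

    static void Render(Hashtable tree, int depth, StringBuilder sb)
    {
        if (tree == null) return;

        // Hashtable enumeration order is unspecified, so sort the keys
        // to get stable output.
        var keys = new ArrayList(tree.Keys);
        keys.Sort();

        foreach (object key in keys)
        {
            // Two spaces of indentation per tree level.
            sb.Append(' ', depth * 2).Append(key).Append('\n');
            Render(tree[key] as Hashtable, depth + 1, sb);
        }
    }

    public static void Main()
    {
        var children = new Hashtable { { "b.example", null }, { "c.example", new Hashtable() } };
        var root = new Hashtable { { "a.example", children } };
        Console.Write(Render(root));
    }
}
```

Sorting the keys is a choice made for readability; it does not reflect the order in which the domains were discovered.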
It may be more convenient to get at DomainWalker's results by being notified every time a new URL is discovered. This is done by subscribing to the object's OnNotifyUrlBeingTraversed event and is the approach taken by the demo app. Domain discovery notifications are received by registering an OnNotifyUrlBeingTraversed delegate, which has the following signature:
/// Notifies an observer when a url is about to be traversed.
/// <param name="strParentUrl">The parent url (may be null).</param>
/// <param name="strUrlBeingTraversed">The url being traversed.</param>
/// <param name="nCurrentDepth">Current traversal depth.</param>
/// <param name="nDomains">Number of domains discovered so far.</param>
/// <param name="tsElapsed">Time elapsed since start of crawl.</param>
public delegate void OnNotifyUrlBeingTraversed
The demo app responds to the OnNotifyUrlBeingTraversed event by adding strUrlBeingTraversed to a list box. The string is indented by a number of spaces proportional to nCurrentDepth. Other useful information, such as the elapsed walk time (tsElapsed), is displayed in a label control.
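A handler along these lines could look like the sketch below. It is not the demo app's code: the delegate is redeclared under a hypothetical name with parameter types inferred from the documented name prefixes (str, n, ts), a List<string> and a string field stand in for the list box and label, and two spaces per depth level is an arbitrary choice.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Stand-in declaration; the parameter types are assumptions inferred from
// the documented parameter names, not copied from DomainWalker.
public delegate void UrlBeingTraversedHandler(
    string strParentUrl, string strUrlBeingTraversed,
    int nCurrentDepth, int nDomains, TimeSpan tsElapsed);

public static class TraversalDisplaySketch
{
    // Stand-ins for the demo app's list box and label controls.
    public static readonly List<string> Lines = new List<string>();
    public static string Status = "";

    // Indents the URL by two spaces per depth level.
    public static string FormatEntry(string url, int depth)
    {
        return new string(' ', Math.Max(0, depth) * 2) + url;
    }

    // Matches UrlBeingTraversedHandler. In a real GUI, these updates would
    // have to be marshalled onto the GUI thread before touching controls.
    public static void OnUrlBeingTraversed(
        string strParentUrl, string strUrlBeingTraversed,
        int nCurrentDepth, int nDomains, TimeSpan tsElapsed)
    {
        Lines.Add(FormatEntry(strUrlBeingTraversed, nCurrentDepth));
        Status = nDomains + " domains, "
               + tsElapsed.TotalSeconds.ToString("F1", CultureInfo.InvariantCulture) + " s";
    }

    public static void Main()
    {
        UrlBeingTraversedHandler handler = OnUrlBeingTraversed;
        handler(null, "a.example", 0, 1, TimeSpan.FromSeconds(0.5));
        handler("a.example", "b.example", 1, 2, TimeSpan.FromSeconds(1.25));
        foreach (string line in Lines) Console.WriteLine(line);
        Console.WriteLine(Status);
    }
}
```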
DomainWalker also fires the OnNotifyWalkCompleted event at the end of a walk. The OnNotifyWalkCompleted delegate has the following signature:
/// Notifies an observer when the walk has completed.
/// <param name="nDomains">Number of domains discovered.</param>
/// <param name="tsElapsed">Time taken to complete crawl.</param>
public delegate void OnNotifyWalkCompleted
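To illustrate how a caller might run the walk on a worker thread and wait for the completion event, here is a small sketch. FakeWalker is a hypothetical stand-in that models only a completion event; the delegate name and its parameter types (inferred from the documented names nDomains and tsElapsed) are assumptions, and a console program waits on a ManualResetEvent where a GUI would instead update its controls.

```csharp
using System;
using System.Threading;

// Stand-in declaration; parameter types are inferred from the documented names.
public delegate void WalkCompletedHandler(int nDomains, TimeSpan tsElapsed);

// Hypothetical stand-in for DomainWalker; only the completion event is modelled.
public class FakeWalker
{
    public event WalkCompletedHandler WalkCompleted;

    public void Walk()
    {
        DateTime start = DateTime.UtcNow;
        int domains = 3; // pretend three domains were discovered

        // Copy the event to a local so a late unsubscribe cannot null it mid-call.
        WalkCompletedHandler handler = WalkCompleted;
        if (handler != null) handler(domains, DateTime.UtcNow - start);
    }
}

public static class WalkCompletedSketch
{
    // Starts the walk on a worker thread and blocks until the event fires.
    public static int RunAndWait()
    {
        var walker = new FakeWalker();
        var done = new ManualResetEvent(false);
        int result = -1;

        walker.WalkCompleted += delegate(int nDomains, TimeSpan tsElapsed)
        {
            result = nDomains; // runs on the worker thread
            done.Set();
        };

        var worker = new Thread(walker.Walk);
        worker.IsBackground = true;
        worker.Start();

        done.WaitOne();
        return result;
    }

    public static void Main()
    {
        Console.WriteLine(RunAndWait());
    }
}
```

The handler runs on the worker thread, which is why a GUI subscriber has to hand any control updates over to the GUI thread rather than performing them directly.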
Revision history
- 22 Jan 2006
- DomainWalkerForm now uses delegates to ensure controls are accessed from the GUI thread. (Thanks, Birgir K!)
- Added missing .resx file to project.
- Upgraded project to VS2005.
- 15 Jan 2006