What is it?
DomainWalker is an object that discovers the domains reachable from a URL. Unlike traditional crawlers and site downloaders, which identify every reachable URL on a page, DomainWalker explores a subset of the World Wide Web's topology by targeting root URLs only. DomainWalker guarantees that its walk completes in a finite amount of time by ensuring that no domain is crawled more than once.
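One way to picture the duplicate check is a breadth-first walk that reduces every link to its root URL and records each root in a set. The sketch below is not DomainWalker's source: `DomainWalkSketch`, `RootOf`, the `links` dictionary (a stand-in for downloading and parsing pages), the breadth-first order, and the example hosts are all assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class DomainWalkSketch
{
    // Reduces a URL to its root, e.g. "http://a.example/x/y" -> "http://a.example/".
    public static string RootOf(string url)
    {
        Uri uri = new Uri(url);
        return uri.GetLeftPart(UriPartial.Authority) + "/";
    }

    // Walks outward from startUrl, visiting each root at most once and
    // stopping at maxDepth. 'links' is a hypothetical stand-in for fetching
    // a page and extracting its hyperlinks.
    public static List<string> Walk(string startUrl, int maxDepth,
                                    IDictionary<string, string[]> links)
    {
        HashSet<string> visited = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        List<string> order = new List<string>();
        Queue<KeyValuePair<string, int>> queue = new Queue<KeyValuePair<string, int>>();

        string startRoot = RootOf(startUrl);
        visited.Add(startRoot);
        queue.Enqueue(new KeyValuePair<string, int>(startRoot, 0));

        while (queue.Count > 0)
        {
            KeyValuePair<string, int> item = queue.Dequeue();
            order.Add(item.Key);
            if (item.Value >= maxDepth) continue;      // depth limit reached

            string[] outgoing;
            if (!links.TryGetValue(item.Key, out outgoing)) continue;

            foreach (string link in outgoing)
            {
                string root = RootOf(link);
                // Add returns false for a root already seen, so no root is queued twice.
                if (visited.Add(root))
                    queue.Enqueue(new KeyValuePair<string, int>(root, item.Value + 1));
            }
        }
        return order;
    }

    public static void Main()
    {
        Dictionary<string, string[]> links = new Dictionary<string, string[]>
        {
            { "http://a.example/", new[] { "http://b.example/page", "http://c.example/" } },
            { "http://b.example/", new[] { "http://a.example/again", "http://c.example/x" } },
            { "http://c.example/", new[] { "http://a.example/" } }
        };
        foreach (string root in Walk("http://a.example/index.html", 3, links))
            Console.WriteLine(root);
    }
}
```

Because every root enters the queue at most once, the loop runs at most once per distinct domain, which is why the walk terminates even when sites link back to each other.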
DomainWalker is an example of a WebResourceProvider and uses my StringParser utility class, both of which are published elsewhere on this site. As an aside, the demo application shows how to spin off a worker thread from a GUI and have it update the GUI safely. This is done by having the app respond to events fired by the worker thread.
How do I use it?
You use DomainWalker by initializing it, calling its Walk() method, and getting its results.
- Initialize the DomainWalker instance
DomainWalker dw = new DomainWalker();
dw.StartUrl = "www.ravib.com";
dw.MaxDepth = 3;
- Do the walk
dw.Walk();
- Get the results
Hashtable domainTree = dw.DomainTree;
printHashTableAsTree(domainTree);
Getting DomainWalker's results
You retrieve DomainWalker's results by accessing its DomainTree property at the end of the walk and/or responding to the OnNotifyUrlBeingTraversed event.
DomainTree property
DomainWalker's result is a tree of discovered domains obtained from the object's DomainTree property. The tree is actually a nested Hashtable, where each collection of child nodes is stored in a new Hashtable.
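A nested Hashtable of this kind can be turned into an indented listing with a short recursive helper. The sketch below is one possible shape for such a helper, not the article's own code; it assumes that each key is a domain string and each value is the Hashtable of that domain's children (or null for a leaf), which the article does not spell out. `DomainTreePrinter` and `Render` are hypothetical names.

```csharp
using System;
using System.Collections;
using System.Text;

public static class DomainTreePrinter
{
    // Appends one line per domain, indenting children by two spaces per level.
    // Assumption: keys are domain strings, values are child Hashtables or null.
    public static void Render(Hashtable tree, int depth, StringBuilder output)
    {
        foreach (DictionaryEntry entry in tree)
        {
            output.Append(' ', depth * 2).Append(entry.Key).Append('\n');
            Hashtable children = entry.Value as Hashtable;
            if (children != null)
                Render(children, depth + 1, output);   // recurse into child nodes
        }
    }

    public static void Main()
    {
        // Hand-built tree standing in for a walker's DomainTree result.
        Hashtable level1 = new Hashtable();
        level1["b.example"] = new Hashtable();
        Hashtable root = new Hashtable();
        root["a.example"] = level1;

        StringBuilder output = new StringBuilder();
        Render(root, 0, output);
        Console.Write(output.ToString());
    }
}
```

Hashtable does not preserve insertion order, so siblings may be listed in any order; sort the keys first if a stable listing matters.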
OnNotifyUrlBeingTraversed event
It may be more convenient to get at DomainWalker's results by being notified every time a new URL is discovered. This is done by subscribing to the object's OnNotifyUrlBeingTraversed event and is the approach taken by the demo app. Domain discovery notifications are received by registering an OnNotifyUrlBeingTraversed delegate, which has the following signature:
public delegate void OnNotifyUrlBeingTraversed
(string strParentUrl,
string strUrlBeingTraversed,
int nCurrentDepth,
int nDomains,
TimeSpan tsElapsed);
The demo app responds to the OnNotifyUrlBeingTraversed event by adding strUrlBeingTraversed to a list box. The string is indented by a number of spaces proportional to nCurrentDepth. Other useful information, such as the elapsed walk time (tsElapsed), is displayed in a label control.
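A handler along these lines can be sketched without any GUI code. In the example below the delegate declaration is copied from the signature above; everything else is an assumption for illustration: `TraversalLog` and its `Lines` and `Status` members stand in for the demo's list box and label, the indent width of two spaces per level is arbitrary, the depth is assumed to start at 0, and the delegate is invoked by hand rather than by a real walker.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Delegate signature as given in the article.
public delegate void OnNotifyUrlBeingTraversed(
    string strParentUrl, string strUrlBeingTraversed,
    int nCurrentDepth, int nDomains, TimeSpan tsElapsed);

public class TraversalLog
{
    private const int SpacesPerLevel = 2;                      // assumed indent width
    public readonly List<string> Lines = new List<string>();   // stands in for the list box
    public string Status = "";                                 // stands in for the label

    // Matches the delegate signature, so it can be registered as a handler.
    public void HandleUrlBeingTraversed(string strParentUrl, string strUrlBeingTraversed,
                                        int nCurrentDepth, int nDomains, TimeSpan tsElapsed)
    {
        string indent = new string(' ', nCurrentDepth * SpacesPerLevel);
        Lines.Add(indent + strUrlBeingTraversed);
        Status = nDomains + " domains, "
               + tsElapsed.TotalSeconds.ToString("F1", CultureInfo.InvariantCulture)
               + " s elapsed";
    }

    public static void Main()
    {
        TraversalLog log = new TraversalLog();
        OnNotifyUrlBeingTraversed handler = log.HandleUrlBeingTraversed;

        // Simulated notifications; a real walker would raise these itself.
        handler(null, "http://a.example/", 0, 1, TimeSpan.FromSeconds(0.5));
        handler("http://a.example/", "http://b.example/", 1, 2, TimeSpan.FromSeconds(1.2));

        foreach (string line in log.Lines)
            Console.WriteLine(line);
        Console.WriteLine(log.Status);
    }
}
```

In a Windows Forms app the notification arrives on the worker thread, so a handler that touches controls would first marshal the call to the GUI thread, for example by checking `Control.InvokeRequired` and calling `Control.Invoke`.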
OnNotifyWalkCompleted event
DomainWalker also fires the OnNotifyWalkCompleted event at the end of a walk. The OnNotifyWalkCompleted delegate has the following signature:
public delegate void OnNotifyWalkCompleted
(int nDomains,
TimeSpan tsElapsed);
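The completion notification is what lets the caller learn that a walk running on another thread has finished. The sketch below shows that hand-off in isolation, under stated assumptions: the delegate declaration is copied from the signature above, while `WalkCompletionSketch`, `RunAndWait`, and the `walk` callback (a stand-in for starting a real walk and raising the event) are hypothetical.

```csharp
using System;
using System.Threading;

// Delegate signature as given in the article.
public delegate void OnNotifyWalkCompleted(int nDomains, TimeSpan tsElapsed);

public static class WalkCompletionSketch
{
    // Runs 'walk' on a background thread and blocks until it reports completion.
    // 'walk' is a hypothetical stand-in for a walker that calls the supplied
    // delegate when it is done.
    public static int RunAndWait(Action<OnNotifyWalkCompleted> walk)
    {
        int domains = -1;
        using (ManualResetEvent done = new ManualResetEvent(false))
        {
            OnNotifyWalkCompleted onCompleted = delegate(int nDomains, TimeSpan tsElapsed)
            {
                domains = nDomains;   // runs on the worker thread
                done.Set();           // signal the waiting thread
            };

            Thread worker = new Thread(() => walk(onCompleted));
            worker.IsBackground = true;
            worker.Start();

            done.WaitOne();           // wait for the completion notification
            worker.Join();            // let the worker exit before disposing the handle
        }
        return domains;
    }

    public static void Main()
    {
        // Simulated walk that "finds" 5 domains after a short delay.
        int found = RunAndWait(completed =>
        {
            Thread.Sleep(50);
            completed(5, TimeSpan.FromMilliseconds(50));
        });
        Console.WriteLine("Walk completed: " + found + " domains");
    }
}
```

Blocking like this suits a console program. A GUI would instead return immediately and use the completion handler to re-enable its controls on the GUI thread.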
Revision History
- 22 Jan 2006
- Corrected DomainWalkerForm delegates to ensure controls are accessed from the GUI thread. (Thanks, Birgir K!)
- Added missing .resx file to project.
- Upgraded project to VS2005.
- 15 Jan 2006
- Initial version.