
Domain Walker

By Ravi Bhavnani, 22 Jan 2006
 

What is it?

DomainWalker in action

DomainWalker is an object that discovers domains reachable from a URL.  Unlike traditional crawlers and site downloaders, which follow every reachable URL on a page, DomainWalker explores a subset of the World Wide Web's topology by targeting root URLs only.  DomainWalker guarantees that its walk completes in a finite amount of time by never crawling the same domain twice.
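
The two ideas in that paragraph (reduce each link to its root URL, and never visit a root twice) can be sketched roughly as follows. This is only an illustration, not DomainWalker's actual code: the class and method names (DomainDedupSketch, RootUrl, NewRoots) are hypothetical, and the use of System.Uri and a HashSet is an assumption about one reasonable way to do it.

```csharp
using System;
using System.Collections.Generic;

static class DomainDedupSketch
{
    // Reduces a URL to its root (scheme + host),
    // e.g. "http://www.example.com/a/b" becomes "http://www.example.com".
    public static string RootUrl(string url)
    {
        // Assumption: a URL without a scheme is treated as plain http.
        if (!url.Contains("://"))
            url = "http://" + url;
        return new Uri(url).GetLeftPart(UriPartial.Authority);
    }

    // Returns the roots found in 'links' that are not yet in 'visited',
    // and records them in 'visited' so they are never returned again.
    public static List<string> NewRoots(IEnumerable<string> links, HashSet<string> visited)
    {
        var fresh = new List<string>();
        foreach (string link in links)
        {
            string root = RootUrl(link);
            if (visited.Add(root))   // Add returns false for a duplicate
                fresh.Add(root);
        }
        return fresh;
    }
}
```

Because every root enters the visited set at most once, and a depth limit bounds how far the walk descends, a walk built this way cannot loop forever.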

DomainWalker is an example of a WebResourceProvider and uses my StringParser utility class, both of which are published elsewhere at this site.  As an aside, the demo application shows how to spin off a worker thread from a GUI and have it update the GUI in a safe manner.  This is done by having the app respond to events fired by the worker thread.

How do I use it?

You use DomainWalker by initializing it, calling its walk() method, and getting its results.

  1. Initialize the DomainWalker instance
    // Initialize the DomainWalker
    DomainWalker dw = new DomainWalker();
    dw.StartUrl = "www.ravib.com";
    dw.MaxDepth = 3;
  2. Do the walk
    // Do walk
    dw.walk();
  3. Get the results
    // Get results
    Hashtable domainTree = dw.DomainTree;
    printHashTableAsTree (domainTree);   // left as an exercise to the reader

Getting DomainWalker's results

You retrieve DomainWalker's results by accessing its DomainTree property at the end of the walk and/or responding to the OnNotifyUrlBeingTraversed event.

DomainTree property

DomainWalker's result is a tree of discovered domains obtained from the object's DomainTree property. The tree is actually a nested Hashtable, where each collection of child nodes is stored in a new Hashtable.

Domain tree retrieved by DomainWalker
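
One way to print such a nested Hashtable as an indented tree is sketched below. It assumes that each key is a domain string and each value is a Hashtable holding that domain's children (possibly empty or null); the article does not spell out the exact layout, so treat that as an assumption. The names DomainTreePrinter and Format are hypothetical.

```csharp
using System;
using System.Collections;
using System.Text;

static class DomainTreePrinter
{
    // Renders the tree as text, one domain per line,
    // indented two spaces per level of nesting.
    public static string Format(Hashtable tree)
    {
        var sb = new StringBuilder();
        Append(sb, tree, 0);
        return sb.ToString();
    }

    static void Append(StringBuilder sb, Hashtable node, int depth)
    {
        if (node == null)
            return;
        foreach (DictionaryEntry entry in node)
        {
            sb.Append(' ', depth * 2).Append(entry.Key).Append('\n');
            // Assumption: the value is the Hashtable of child domains.
            Append(sb, entry.Value as Hashtable, depth + 1);
        }
    }
}
```

A call such as Console.Write(DomainTreePrinter.Format(dw.DomainTree)) would then print the walk's result. Note that a Hashtable does not preserve insertion order, so siblings may appear in any order.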

OnNotifyUrlBeingTraversed event

It may be more convenient to get at DomainWalker's results by being notified every time a new URL is discovered. This is done by subscribing to the object's OnNotifyUrlBeingTraversed event, which is the approach taken by the demo app. Domain discovery notifications are received by registering an OnNotifyUrlBeingTraversed delegate, which has the following signature:

  /// <summary>
  /// Notifies an observer when a url is about to be traversed.
  /// </summary>
  /// <param name="strParentUrl">The parent url (may be null).</param>
  /// <param name="strUrlBeingTraversed">The url being traversed.</param>
  /// <param name="nCurrentDepth">Current traversal depth.</param>
  /// <param name="nDomains">Number of domains discovered so far.</param>
  /// <param name="tsElapsed">Time elapsed since start of crawl.</param>
  public delegate void OnNotifyUrlBeingTraversed
    (string strParentUrl,
     string strUrlBeingTraversed,
     int nCurrentDepth,
     int nDomains,
     TimeSpan tsElapsed);

The demo app responds to the OnNotifyUrlBeingTraversed event by adding strUrlBeingTraversed to a list box. The string is indented by a number of spaces proportional to nCurrentDepth. Other useful information, such as the elapsed walk time (tsElapsed), is displayed in a label control.
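
A handler along those lines might look like the sketch below. It is not the demo app's code: the delegate is redeclared only to keep the example self-contained, TraversalLog is a hypothetical stand-in for the form (a list of strings instead of a list box, a string instead of a label), and the two-spaces-per-level indent is an assumption.

```csharp
using System;
using System.Collections.Generic;

// Redeclared here so the sketch compiles on its own; same signature as above.
public delegate void OnNotifyUrlBeingTraversed
    (string strParentUrl,
     string strUrlBeingTraversed,
     int nCurrentDepth,
     int nDomains,
     TimeSpan tsElapsed);

class TraversalLog
{
    public readonly List<string> Lines = new List<string>();  // stands in for the list box
    public string Status = "";                                // stands in for the label

    public void HandleUrl(string strParentUrl, string strUrlBeingTraversed,
                          int nCurrentDepth, int nDomains, TimeSpan tsElapsed)
    {
        // Indent in proportion to the traversal depth.
        Lines.Add(new string(' ', nCurrentDepth * 2) + strUrlBeingTraversed);
        Status = nDomains + " domains, elapsed " + tsElapsed;
    }
}
```

In a real Windows Forms handler the same body would have to run on the GUI thread, which is the subject of the "Breaks with GUI updates" discussion further down.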

OnNotifyWalkCompleted event

DomainWalker also fires the OnNotifyWalkCompleted event at the end of a walk. The OnNotifyWalkCompleted delegate has the following signature:

  /// <summary>
  /// Notifies an observer when the walk has completed.
  /// </summary>
  /// <param name="nDomains">Number of domains discovered.</param>
  /// <param name="tsElapsed">Time taken to complete crawl.</param>
  public delegate void OnNotifyWalkCompleted
    (int nDomains,
     TimeSpan tsElapsed);

Revision History

  • 22 Jan 2006
    • Corrected DomainWalkerForm delegates to ensure controls are accessed from the GUI thread. (Thanks, Birgir K!)
    • Added missing .resx file to project.
    • Upgraded project to VS2005.
  • 15 Jan 2006
    • Initial version.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL).

About the Author

Ravi Bhavnani
Technical Lead
Canada
Ravi Bhavnani is an ardent fan of Microsoft technologies who loves building Windows apps, especially PIMs, system utilities, and things that go bump on the Internet. During his career, Ravi has developed expert systems, desktop imaging apps, marketing automation software, EDA tools, a platform to help people find, analyze and understand information, trading software for institutional investors and advanced data visualization solutions. He currently works for a company that provides enterprise workforce management solutions to large clients.
 
His interests include the .NET framework, reasoning systems, financial analysis and algorithmic trading, NLP, CHI and UI design. Ravi holds a BS in Physics and Math and an MS in Computer Science and was a Microsoft MVP (C++ and C# in 2006 and 2007). He is also the co-inventor of 2 patents on software security and generating data visualization dashboards. His claim to fame is that he crafted CodeProject's "joke" forum post icon.
 
Ravi's biggest fear is that one day he might actually get a life, although the chances of that happening seem extremely remote.


Comments and Discussions

 
General | My vote of 5 | member cmptr_kemist | 16-Aug-11 17:32
Excellent!
General | Thank you and Found a bug | member Phebous | 3-Apr-08 10:59
Hello Ravi,
 
I first of all want to thank you for all of this wonderful work that you have done here. You have saved me countless hours of reinventing the wheel!
 
Second, I found a bug in the code. When you make a request to a webserver and you get a timeout on the response, the application will crash. The problem happens in the WebResourceProvider class. The problem happens when you attempt to parse the web response to find the http result code and push the results into m_httpStatusCode. Here is the code:
 
  Private Sub getContent(ByVal url As String)
    ...
    Try
      m_tmFetchTime = DateTime.Now
      resp = DirectCast(req.GetResponse(), HttpWebResponse)
    Catch exc As Exception
      If TypeOf exc Is WebException Then
        Dim webExc As WebException = TryCast(exc, WebException)
        m_strError = webExc.Message
        Dim expr As New System.Text.RegularExpressions.Regex("\d+")
        m_httpStatusCode = expr.Match(webExc.Message).ToString
        m_strError = webExc.Message
      End If
    ...
 
As you can see here, if there is no number in webExc.Message, then an empty string is assigned to m_httpStatusCode. This causes things to go astray. Here is what I recommend:
 
    Try
      m_tmFetchTime = DateTime.Now
      resp = DirectCast(req.GetResponse(), HttpWebResponse)
    Catch exc As Exception
      If TypeOf exc Is WebException Then
        Dim webExc As WebException = TryCast(exc, WebException)
        m_strError = webExc.Message
        Dim expr As New System.Text.RegularExpressions.Regex("\d+")
        If expr.Match(webExc.Message).ToString = "" Then
          ' No status code in the message (e.g. a timeout): fall back to 503
          m_httpStatusCode = 503
        Else
          m_httpStatusCode = expr.Match(webExc.Message).ToString
        End If
      End If
 
This will catch the empty string and assign 503 instead (see http://en.wikipedia.org/wiki/List_of_HTTP_status_codes#5xx_Server_Error); otherwise it will assign the number found in the message to m_httpStatusCode.
 
Once again, thank you for your work on this! It has saved me lots of time!
 
Regards,
 
Phebous...
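
As a possible alternative to parsing the exception message, the status code can usually be read from the exception itself: WebException.Response holds an HttpWebResponse when the server actually answered, and is null for failures such as timeouts. The sketch below is written in C# rather than VB and is not part of WebResourceProvider; the names HttpStatusSketch and StatusCodeOf are hypothetical, and the 503 fallback simply mirrors the suggestion above.

```csharp
using System.Net;

static class HttpStatusSketch
{
    // Returns the HTTP status code carried by the exception, or 'fallback'
    // when no HTTP response was received (for example on a timeout).
    public static int StatusCodeOf(WebException exc, int fallback)
    {
        var resp = exc.Response as HttpWebResponse;
        return resp != null ? (int)resp.StatusCode : fallback;
    }
}
```

A caller in the catch block could then write something like m_httpStatusCode = HttpStatusSketch.StatusCodeOf(webExc, 503), converting to a string if the field requires it.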
General | Re: Thank you and Found a bug | member Ravi Bhavnani | 30-Aug-08 11:57
Thanks! And sorry for the ridiculously tardy response! :O
 
/ravi
 
My new year resolution: 2048 x 1536
Home | Articles | My .NET bits | Freeware
ravib(at)ravib(dot)com

General | Re: Thank you and Found a bug | member Phebous | 30-Aug-08 15:02
Ravi Bhavnani wrote:
Thanks! And sorry for the ridiculously tardy response!

 
It is better to thank late than not at all!
 
Best Wishes,
 
Phebous...
Question | What about robots.txt | member mariusco | 20-Dec-06 5:55
Does it handle robots.txt?
Answer | Re: What about robots.txt | member Ravi Bhavnani | 20-Dec-06 6:03
No, but that's a nice feature suggestion.
 
/ravi
 
This is your brain on Celcius
Home | Music | Articles | Freeware | Trips
ravib(at)ravib(dot)com

General | need your help! | member beyondwm2004 | 11-Apr-06 4:38
Hello, I am a Chinese student (Sady). My English is not very good, but I would still like to ask for your help. Thank you in advance!
 
I want to build a project:
"Extracting Semistructured Information from the Web"
 
                   +page3
                   |
         +page2----+page3
         |
         |         +page3
         |         |
  page1--+page2----+page3
         |
         |         +page3
         |         |
         +page2----+page3
 
First: I extract the links (page2) from page1.
 
Second: I want to extract the links (page3) from each page2, but I don't know how to do that.
This is my project:
http://www.sxtrain.com.cn/webinfo.rar
 
I would appreciate any suggestions.
Thank you very much!
Email:zhongliang72@gmail.com
General | Re: need your help! | member Ravi Bhavnani | 12-Apr-06 4:52
See my StringParser class to help you do this. Specifically, take a look at the getLinks() method.
 
Hope this helps!
 
/ravi
 
My new year's resolution: 2048 x 1536
Home | Music | Articles | Freeware | Trips
ravib(at)ravib(dot)com
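
For readers without StringParser at hand, the link-extraction step can be approximated with a regular expression, as in the rough sketch below. This is not how getLinks() is implemented; LinkExtractorSketch and ExtractLinks are hypothetical names, and a regex like this handles only simple quoted href attributes, not arbitrary HTML.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class LinkExtractorSketch
{
    // Matches href="..." or href='...' and captures the quoted value.
    static readonly Regex Href =
        new Regex(@"href\s*=\s*[""']([^""']+)[""']", RegexOptions.IgnoreCase);

    // Returns the href values found in the given HTML, in document order.
    public static List<string> ExtractLinks(string html)
    {
        var links = new List<string>();
        foreach (Match m in Href.Matches(html))
            links.Add(m.Groups[1].Value);
        return links;
    }
}
```

The two-level walk in the question then amounts to calling ExtractLinks on page1's HTML, downloading each resulting page2, and calling ExtractLinks again on each of those to obtain the page3 links.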
General | Still the Same Problems | member David777 | 8-Apr-06 9:47

This is a very nice project, if it would only run in VS 2005. The source that gets downloaded is dated 1/15/2006, but the revision history says that it was updated on 1/22/2006.
 
This is a nice application- thank you for sharing it.
 
David Roh
General | Re: Still the Same Problems | member Ravi Bhavnani | 9-Apr-06 5:01
Thanks for your comments!
 
The missing DomainWalkerForm.resx has been added to the source .zip. I didn't tweak the article updated date since this is a fix to the package of files and not a code or content change.
 
/ravi
 
General | The Same Problem | member beyondwm2004 | 26-Mar-06 21:39
:-D
hi
i'm trying your app, but:
 
The item 'DomainWalkerForm.resx' does not exist in the project directory. It may have been moved, renamed or deleted.
 
bye
 
Sady-chinese
General | Re: The Same Problem | member Ravi Bhavnani | 27-Mar-06 0:53
Ooops! Sorry about that. I'll upload a new project tonight.
 
Thanks,
 
/ravi
 
General | Re: The Same Problem | member Ravi Bhavnani | 9-Apr-06 4:59
Updated. Sorry for the delay!
 
/ravi
 
General | Nice work | member HatemMostafa | 25-Feb-06 1:18
Thanks
General | Re: Nice work | member Ravi Bhavnani | 25-Feb-06 3:58
Thanks, Hatem! [Rose]
 
/ravi
 
Question | Where is the exe ? | member NinjaCross | 23-Jan-06 3:52
Hi.
In the "Binaries only" zip you didn't add the exe, only the DLLs.
Btw, nice work :)
 
--
NinjaCross
www.ninjacross.com
Answer | Re: Where is the exe ? | member Ravi Bhavnani | 23-Jan-06 5:41
Gak! The .zips got messed up - will update them tonight. Thanks for the heads up!
 
/ravi
 
Answer | Re: Where is the exe ? | member Ravi Bhavnani | 25-Jan-06 11:52
Done!
 
/ravi
 
General | Breaks with GUI updates | member Birgir K | 19-Jan-06 9:04
You should always remember to check a control's InvokeRequired property before updating it from threads.
 
For example, your OnNotifyWalkCompletedHandler and OnNotifyUrlBeingTraversedHandler are not guaranteed to work as is, since they update the UI from a non-UI thread.
 
There are plenty of articles on this on the web, but as an example, here is how OnNotifyWalkCompletedHandler (renamed here to OnNotifyWalkCompleted) should have been written:
 
  private delegate void OnNotifyWalkCompletedHandler(int nDomains, TimeSpan tsElapsed);

  /// <summary>
  /// Notifies an observer when the walk has completed.
  /// </summary>
  /// <param name="nDomains">Number of domains discovered.</param>
  /// <param name="tsElapsed">Time taken to complete crawl.</param>
  private void OnNotifyWalkCompleted
    (int nDomains,
     TimeSpan tsElapsed)
  {
    // Re-invoke this method on the UI thread if called from another thread
    if (this.InvokeRequired)
    {
      this.Invoke(new OnNotifyWalkCompletedHandler(OnNotifyWalkCompleted), nDomains, tsElapsed);
      return;
    }

    ... // code as originally from source since we are on the UI thread now
  }
 

General | Re: Breaks with GUI updates | member Ravi Bhavnani | 19-Jan-06 9:29
The delegate is executed by the non-GUI thread that runs DomainWalker.walk() which uses BeginInvoke(). Is this incorrect?
 
Thanks,
 
/ravi
 
General | Re: Breaks with GUI updates | member Birgir K | 19-Jan-06 12:45
Well, no - you need to make sure that any action that touches on ui components is executed on the thread that the component was created on.
 
There are quite a few articles on this, check this for example:
 
http://www.codeproject.com/csharp/begininvoke.asp
General | Re: Breaks with GUI updates | member Ravi Bhavnani | 20-Jan-06 2:46
Birgir K wrote:
any action that touches on ui components is executed on the thread that the component was created on.

 
Sorry if I'm being dense, but isn't that what's happening? The delegate (specified by the GUI) is in fact being executed on its thread (and not on the thread that's spun off) because DomainWalker executes the delegate using BeginInvoke().
 
/ravi
 
General | Re: Breaks with GUI updates | member Birgir K | 20-Jan-06 14:27
Hi Ravi,
 
Actually, I hadn't even looked at DomainWalker.cs - I downloaded and ran your source via VS 2005. VS 2005 has a great new addition - it breaks in debug mode when you try to do illegal cross-thread calls - and that is how I ran into the problems I described earlier.
 
What you are doing in DomainWalker.cs is another issue altogether. The event invocation there can be simplified from:
 
  if (urlTraversalNotifications != null)
    foreach (OnNotifyUrlBeingTraversed callback in urlTraversalNotifications.GetInvocationList())
      callback.BeginInvoke (strParentUrl, m_strStartUrl, nCurrentDepth, allDomains.Count, tsElapsed, null, null);
 
to
 
  if (urlTraversalNotifications != null)
    urlTraversalNotifications(strParentUrl, m_strStartUrl, nCurrentDepth, allDomains.Count, tsElapsed);
 
And in DomainWalkerForm.cs, in walkThreadProc (which is your worker thread (m_walkThread) - not the UI thread!) you are subscribing to that event.
 
So when you handle the event, which is still going to be done in m_walkThread, you need to make sure that all ui tasks are handled by the UI thread - which you do by calling .InvokeRequired and .BeginInvoke (and optionally .EndInvoke) or simply .Invoke on a control created in the UI thread.
 
So, for example:
 
  private delegate void OnNotifyWalkCompletedHandler(int nDomains, TimeSpan tsElapsed);
  private void OnNotifyWalkCompleted(int nDomains, TimeSpan tsElapsed)
  {
    if (this.InvokeRequired)
    {
      this.Invoke(new OnNotifyWalkCompletedHandler(OnNotifyWalkCompleted), nDomains, tsElapsed);
      return;
    }
    // Walk thread is no longer valid
    m_walkThread = null;
 
    // Reset GUI when walk has completed
    btnExit.Enabled = true;
    btnWalk.Text = "Walk";
    listDomains.SelectedIndex = 0;
    lblLevel.Visible = false;
    progBarLevel.Visible = false;
  }
 
hope that clears it up for you,
 
best regards,
Biggi
General | Re: Breaks with GUI updates | member Ravi Bhavnani | 21-Jan-06 3:44
Many thanks for your reply, Birgir! I guess it's time for me to tear the shrink wrap off my VS2005 package. :)
 
I'll update the article with the appropriate fixes shortly. Thanks again!
 
/ravi
 
General | Re: Breaks with GUI updates | member Ravi Bhavnani | 22-Jan-06 13:35
Article updated.  Thanks again!
 
/ravi
 
General | file missing in zip | member roberto galbiati | 18-Jan-06 21:55
hi
i'm trying your app, but:
 
The item 'DomainWalkerForm.resx' does not exist in the project directory. It may have been moved, renamed or deleted.
 
it's declared in DomainWalkerDemo.csproj, but not in source download.
 
can you do something?
 
bye
 
Roberto
Answer | Re: file missing in zip | member Ravi Bhavnani | 19-Jan-06 2:22
Hi Roberto, the .resx file is generated when you rebuild the solution. I just downloaded the source code to a new folder and was able to build and run the application successfully.
 
/ravi
 
General | Re: file missing in zip | member roberto galbiati | 19-Jan-06 3:31
hi ravi.
 
Not in my VS.
I've used this trick: http://www.knowdotnet.com/articles/rebuildresx.html
 
Now it's OK.
 
roberto
General | Re: file missing in zip | member Ravi Bhavnani | 19-Jan-06 5:42
Ah. You're absolutely right - I'll add it. Thanks for bringing this to my attention!
 
/ravi
 
General | Re: file missing in zip | member Ravi Bhavnani | 22-Jan-06 13:34
.resx file added to project.
 
/ravi
 

Last Updated 22 Jan 2006
Article Copyright 2006 by Ravi Bhavnani
Everything else Copyright © CodeProject, 1999-2013