Comments and Discussions

|
Hello Ravi,
I first of all want to thank you for all of this wonderful work that you have done here. You have saved me countless hours of reinventing the wheel!
Second, I found a bug in the code. When a request to a web server times out, the application crashes. The problem is in the WebResourceProvider class, where the exception message is parsed to find the HTTP result code, which is then stored in m_httpStatusCode. Here is the code:
Private Sub getContent(ByVal url As String)
    ...
    Try
        m_tmFetchTime = DateTime.Now
        resp = DirectCast(req.GetResponse(), HttpWebResponse)
    Catch exc As Exception
        If TypeOf exc Is WebException Then
            Dim webExc As WebException = TryCast(exc, WebException)
            m_strError = webExc.Message
            Dim expr As New System.Text.RegularExpressions.Regex("\d+")
            m_httpStatusCode = expr.Match(webExc.Message).ToString
            m_strError = webExc.Message
        End If
    ...
As you can see, if webExc.Message contains no number (as happens with a timeout), an empty string is assigned to m_httpStatusCode, and that is what leads to the crash. Here is what I recommend:
Try
    m_tmFetchTime = DateTime.Now
    resp = DirectCast(req.GetResponse(), HttpWebResponse)
Catch exc As Exception
    If TypeOf exc Is WebException Then
        Dim webExc As WebException = TryCast(exc, WebException)
        m_strError = webExc.Message
        Dim expr As New System.Text.RegularExpressions.Regex("\d+")
        If expr.Match(webExc.Message).ToString = "" Then
            ' No number in the message (e.g. a timeout): fall back to 503
            m_httpStatusCode = 503
        Else
            ' Use the number found in the message as the status code
            m_httpStatusCode = expr.Match(webExc.Message).ToString
        End If
    End If
This catches the empty string and assigns 503 to m_httpStatusCode (see http://en.wikipedia.org/wiki/List_of_HTTP_status_codes#5xx_Server_Error); otherwise it assigns the number found in the message.
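For C# users, the same idea might look like the sketch below. It is an illustration only, not code from the article: `HttpStatusHelper` and `StatusCodeFrom` are hypothetical names, and the 503 fallback simply mirrors the suggestion above. It first asks the exception for the server's response (which carries the real status code when the server did answer) and only then falls back to scanning the message text.

```csharp
using System.Net;
using System.Text.RegularExpressions;

public static class HttpStatusHelper
{
    // Hypothetical helper, not part of the article's WebResourceProvider.
    // The fallback of 503 is a convention taken from the post above, not
    // something the server actually sent.
    public static int StatusCodeFrom(WebException exc, int fallback = 503)
    {
        // When the server did answer, the real status code is on the response.
        HttpWebResponse resp = exc.Response as HttpWebResponse;
        if (resp != null)
            return (int)resp.StatusCode;

        // Otherwise look for a three-digit number in the message text.
        Match m = Regex.Match(exc.Message, @"\b\d{3}\b");
        int code;
        if (m.Success && int.TryParse(m.Value, out code))
            return code;

        // No response and no number (for example, a timeout): use the fallback.
        return fallback;
    }
}
```

Returning an integer in every case means the caller never has to deal with an empty string, which is the failure described above.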
Once again, thank you for your work on this! It has saved me lots of time!
Regards,
Phebous...

|
Thanks! And sorry for the ridiculously tardy response!
/ravi

|
Ravi Bhavnani wrote: Thanks! And sorry for the ridiculously tardy response!
It is better to thank late than not at all!
Best Wishes,
Phebous...

|
Does it handle robots.txt?

|
No, but that's a nice feature suggestion.
/ravi

|
Hello, I am a Chinese student (sady). My English is poor, but I would still like to ask for your help *_*. Thank you in advance!
I am working on a project:
"Extracting Semistructured Information from the Web"
                 +page3
                 |
       +page2----+page3
       |
       |         +page3
       |         |
page1--+page2----+page3
       |
       |         +page3
       |         |
       +page2----+page3
First: I extract the links (page2) from page1.
Second: I want to extract the links (page3) from each page2,
but I don't know how to do that.
This is my project:
http://www.sxtrain.com.cn/webinfo.rar
I would appreciate any suggestions!
Thank you very much!
Email:zhongliang72@gmail.com

|
See my StringParser class to help you do this. Specifically, take a look at the getLinks() method.
Hope this helps!
/ravi
My new year's resolution: 2048 x 1536
Home | Music | Articles | Freeware | Trips
ravib(at)ravib(dot)com
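As a rough illustration of the page1 to page2 to page3 walk asked about above, here is a minimal C# sketch. Everything in it is an assumption made for the example: `TwoLevelCrawl`, `ExtractLinks` and `Walk` are hypothetical names, `ExtractLinks` is a naive regex stand-in for StringParser's getLinks() (whose signature is not shown in this thread), links are assumed to be absolute, and the caller supplies the `fetch` function that downloads a page.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TwoLevelCrawl
{
    // Naive href extractor (hypothetical stand-in for StringParser.getLinks()).
    public static List<string> ExtractLinks(string html)
    {
        var links = new List<string>();
        const string pattern = "href\\s*=\\s*[\"']([^\"'#]+)[\"']";
        foreach (Match m in Regex.Matches(html, pattern, RegexOptions.IgnoreCase))
            links.Add(m.Groups[1].Value);
        return links;
    }

    // Follows links breadth first, starting at startUrl (depth 0) and going
    // no deeper than maxDepth. Returns each discovered URL with its depth.
    public static Dictionary<string, int> Walk(
        string startUrl, int maxDepth, Func<string, string> fetch)
    {
        var depthOf = new Dictionary<string, int> { { startUrl, 0 } };
        var queue = new Queue<string>();
        queue.Enqueue(startUrl);

        while (queue.Count > 0)
        {
            string url = queue.Dequeue();
            int depth = depthOf[url];
            if (depth >= maxDepth)
                continue; // pages at the last level are recorded, not fetched

            foreach (string link in ExtractLinks(fetch(url)))
            {
                if (depthOf.ContainsKey(link))
                    continue; // already seen: avoids loops between pages
                depthOf[link] = depth + 1;
                queue.Enqueue(link);
            }
        }
        return depthOf;
    }
}
```

Calling `Walk(page1Url, 2, fetch)` would record page1 at depth 0, the page2 links at depth 1 and the page3 links at depth 2, without downloading the page3 pages themselves.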

|
This is a very nice project, if it would only run in VS 2005. The source that gets downloaded is dated 1/15/2006, but the update note says that it was updated on 1/22/2006.
This is a nice application. Thank you for sharing it.
David Roh

|
Thanks for your comments!
The missing DomainWalkerForm.resx has been added to the source .zip. I didn't tweak the article updated date since this is a fix to the package of files and not a code or content change.
/ravi

|
hi
i'm trying your app, but:
The item 'DomainWalkerForm.resx' does not exist in the project directory. It may have been moved, renamed or deleted.
bye
Sady-chinese

|
Ooops! Sorry about that. I'll upload a new project tonight.
Thanks,
/ravi

|
Hi.
In the "Binaries only" zip you didn't add the exe, only the dlls.
Btw, nice work
--
NinjaCross
www.ninjacross.com

|
Gak! The .zips got messed up - will update them tonight. Thanks for the heads up!
/ravi

|
You should always remember to check a control's InvokeRequired property before updating it from threads.
For example, your OnNotifyWalkCompletedHandler and OnNotifyUrlBeingTraversedHandler are not guaranteed to work as is, since they update the UI from a non-UI thread.
There are plenty of articles on this on the web, but as an example, here is how OnNotifyWalkCompletedHandler (renamed here to OnNotifyWalkCompleted) should have been written:
private delegate void OnNotifyWalkCompletedHandler(int nDomains, TimeSpan tsElapsed);

/// <summary>
/// Notifies an observer when the walk has completed.
/// </summary>
/// <param name="nDomains">Number of domains discovered.</param>
/// <param name="tsElapsed">Time taken to complete crawl.</param>
private void OnNotifyWalkCompleted
    (int nDomains,
     TimeSpan tsElapsed)
{
    // Called from a non-UI thread: re-run this method on the UI thread
    if (this.InvokeRequired)
    {
        this.Invoke(new OnNotifyWalkCompletedHandler(OnNotifyWalkCompleted), nDomains, tsElapsed);
        return;
    }
    ... // code as originally from source since we are on the UI thread now
}

|
The delegate is executed by the non-GUI thread that runs DomainWalker.walk() which uses BeginInvoke(). Is this incorrect?
Thanks,
/ravi

|
Well, no - you need to make sure that any action that touches on ui components is executed on the thread that the component was created on.
There are quite a few articles on this, check this for example:
http://www.codeproject.com/csharp/begininvoke.asp

|
Birgir K wrote: any action that touches on ui components is executed on the thread that the component was created on.
Sorry if I'm being dense, but isn't that what's happening? The delegate (specified by the GUI) is in fact being executed on its thread (and not on the thread that's spun off) because DomainWalker executes the delegate using BeginInvoke().
/ravi

|
Hi Ravi,
Actually, I hadn't even looked at DomainWalker.cs. I downloaded and ran your source via VS 2005. VS 2005 has a great new addition: it breaks in debug mode when you try to do illegal cross-thread calls, and that is how I ran into the problems I described earlier.
What you are doing in DomainWalker.cs is another issue altogether. The event invocation you do there can be simplified from:
if (urlTraversalNotifications != null)
    foreach (OnNotifyUrlBeingTraversed callback in urlTraversalNotifications.GetInvocationList())
        callback.BeginInvoke (strParentUrl, m_strStartUrl, nCurrentDepth, allDomains.Count, tsElapsed, null, null);
to
if (urlTraversalNotifications != null)
    urlTraversalNotifications(strParentUrl, m_strStartUrl, nCurrentDepth, allDomains.Count, tsElapsed);
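Two things are worth keeping in mind about the direct call. It runs the subscribers synchronously on the thread that raises the event, whereas a delegate's BeginInvoke() queues them to a thread-pool thread. And if a subscriber can be removed from another thread, the usual C# idiom is to copy the event to a local variable before the null check. A minimal sketch, in which `Walker`, `UrlBeingTraversed` and `NotifyUrlBeingTraversed` are hypothetical names standing in for the article's own types:

```csharp
using System;

public class Walker
{
    // Hypothetical event, standing in for urlTraversalNotifications.
    public event Action<string, int> UrlBeingTraversed;

    public void NotifyUrlBeingTraversed(string url, int depth)
    {
        // Copy to a local first: if the last subscriber is removed on another
        // thread after the null check, the local still holds a valid delegate.
        Action<string, int> handler = UrlBeingTraversed;
        if (handler != null)
            handler(url, depth); // runs synchronously on the calling thread
    }
}
```

Because the handlers still run on the calling (worker) thread, any handler that touches a control needs the InvokeRequired check described in this thread.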
And in DomainWalkerForm.cs, in walkThreadProc (which is your worker thread (m_walkThread) - not the UI thread!) you are subscribing to that event.
So when you handle the event, which is still going to be done in m_walkThread, you need to make sure that all ui tasks are handled by the UI thread - which you do by calling .InvokeRequired and .BeginInvoke (and optionally .EndInvoke) or simply .Invoke on a control created in the UI thread.
So, for example:
private delegate void OnNotifyWalkCompletedHandler(int nDomains, TimeSpan tsElapsed);
private void OnNotifyWalkCompleted(int nDomains, TimeSpan tsElapsed)
{
    if (this.InvokeRequired)
    {
        this.Invoke(new OnNotifyWalkCompletedHandler(OnNotifyWalkCompleted), nDomains, tsElapsed);
        return;
    }
    // Walk thread is no longer valid
    m_walkThread = null;
    // Reset GUI when walk has completed
    btnExit.Enabled = true;
    btnWalk.Text = "Walk";
    listDomains.SelectedIndex = 0;
    lblLevel.Visible = false;
    progBarLevel.Visible = false;
}
hope that clears it up for you,
best regards,
Biggi

|
Many thanks for your reply, Birgir! I guess it's time for me to tear the shrink wrap off my VS2005 package.
I'll update the article with the appropriate fixes shortly. Thanks again!
/ravi

|
hi
i'm trying your app, but:
The item 'DomainWalkerForm.resx' does not exist in the project directory. It may have been moved, renamed or deleted.
it's declared in DomainWalkerDemo.csproj, but not in source download.
can you do something?
bye
Roberto

|
Hi Roberto, the .resx file is generated when you rebuild the solution. I just downloaded the source code to a new folder and was able to build and run the application successfully.
/ravi

|
hi ravi.
not my VS.
I've used this trick: http://www.knowdotnet.com/articles/rebuildresx.html
now it's ok
roberto

|
Ah. You're absolutely right - I'll add it. Thanks for bringing this to my attention!
/ravi
|
An object that allows you to explore the topology of the internet.
| Type | Article |
| Licence | CPOL |
| First Posted | 15 Jan 2006 |
| Views | 68,334 |
| Bookmarked | 50 times |