 |
|
 |
Hi, you developed an awesome app.
When I used it on "Index of" style websites, it parsed the first page of the given URL and then moved on to the home page of the site.
But I want to explore the site only from the given URL, not from the home page.
Please help me change this for my use.
|
|
|
|
 |
|
 |
Hi Uwe Keim, your project is very nice.
Your spider is not extracting the proper links at the URL "http://downloadz.midnight-labs.org".
Please send a solution to my e-mail: Arpit-sharma@hotmail.com
Thank you in advance.
|
|
|
|
 |
|
 |
First let me say that this webspider is wonderful. One issue that I am facing is that many of the links I am searching for are of the format:
<a href="javascript:if(confirm('http://someaddress.goes.here.com'))window.location='http://someaddress.goes.here.com'" tppabs="http://someaddress.goes.here.com" target="_blank" class="STYLE13">Link Text</a>
This gets converted into H341DC.html (or the like), but that file obviously does not exist. This causes the parser that I run over the captured files to report a broken link. I am at a loss as to where I should look to change this behavior. I am currently poking around any references to LinkElements, but I haven't really turned anything up yet.
|
|
|
|
 |
|
 |
I seem to have figured out a workaround.
File: ResourceInformation.cs, line 41. I've added:
    try
    {
        if (uri.Scheme.Contains("javascript"))
        {
            _baseUri = null;
            baseUri = null;
        }
        else
        {
            _baseUri = baseUri;
        }
    }
    catch (Exception ex)
    {
        _baseUri = baseUri;
    }
Same file, in the CleanupURL function, line 416:
    url = Regex.Replace(
        url,
        @"(javascript)(.*?)(http|https|ftp)(\://)(.*)",
        "$3$4$5",
        RegexOptions.Singleline);
It may not be the fanciest, but it works for my purposes. The effect is that the Spider does not change the format of the URL, and I am now able to strip off the javascript portion and find the actual URL.
If what I've done will break the program in some unforeseen way, please let me know and I'll see what I can do to work around that as well!
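For anyone who only needs the target address out of such a link, the idea can also be sketched on its own, outside the spider. This is a minimal illustration under stated assumptions: JavaScriptLinkHelper and ExtractTarget are hypothetical names and are not part of Zeta.WebSpider. Unlike the Regex.Replace call in my workaround, this variant stops at the closing quote, so the trailing script text is dropped as well.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only; not part of Zeta.WebSpider.
public static class JavaScriptLinkHelper
{
    // Matches the first absolute http/https/ftp URL in a string and stops
    // at a quote, whitespace, or closing parenthesis.
    private static readonly Regex EmbeddedUrl = new Regex(
        @"(?:https?|ftp)://[^'""\s)]+",
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

    // For "javascript:" hrefs, returns the first URL embedded in the script.
    // Any other href is returned trimmed but otherwise unchanged.
    // Returns an empty string when the input is blank or no URL is found.
    public static string ExtractTarget(string href)
    {
        if (string.IsNullOrWhiteSpace(href))
        {
            return string.Empty;
        }

        string trimmed = href.Trim();
        if (!trimmed.StartsWith("javascript:", StringComparison.OrdinalIgnoreCase))
        {
            return trimmed;
        }

        Match match = EmbeddedUrl.Match(trimmed);
        return match.Success ? match.Value : string.Empty;
    }
}
```

Passing the href from the link format I quoted in my earlier comment to ExtractTarget gives back http://someaddress.goes.here.com, while an ordinary href such as page.html passes through unchanged.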
|
|
|
|
 |
|
 |
Glad you managed to get it working. As long as it works for you, I wouldn't worry about whether it breaks the code or not.
|
|
|
|
 |
|
 |
There is not enough discussion in the article or code snippets to make me want to download the project files to see if they are of any interest. More article content required.
|
|
|
|
 |
|
 |
I've rated your comment 2. You have got to be kidding, right? All of this code is free for the taking and you're whining because there's not enough documentation. Why are you on CodeProject if you're looking for professional documentation? Uwe states right up front in his post that the article and code aren't documented to his usual standards, but he's putting the code out there anyway. For that, thanks, Uwe.
|
|
|
|
 |
|
 |
Thank you very much for your kind words, Kevin!
|
|
|
|
 |
|
 |
I want to write it for a bs object based on this lib.
|
|
|
|
 |
|
 |
Hello Uwe,
I have a problem using this. When I run the Test app and it says 'Finished' I just get a STATE file. How do I continue using the pages I've downloaded?
Cheers
|
|
|
|
 |
|
 |
Hi,
Thanks for posting the information on the Web Spider. I have downloaded the source code and run it. It generated a file with the extension "STATE".
Could you please let me know how to view this file?
Thanks,
Shabber.
|
|
|
|
 |
|
 |
The spider saves its current state in this file so that it can continue when it is restarted.
The content is not meant to be read by humans, but of course you can run and debug the application to see how the content is written and retrieved.
Cheers
Uwe
|
|
|
|
 |
|
 |
Hi,
Thanks for the good stuff.
I have used it and it is working nicely.
I want to explore all dynamic links on a website, like:
www.xyz.com/products.aspx?id=2
www.xyz.com/products.aspx?id=3
www.xyz.com/products.aspx?id=4
How can I detect the dynamic URLs? The query string of a product page may also include the name of the product.
Can you give me a way to implement this?
Thanks and regards,
Rohit
|
|
|
|
 |
|
 |
Hi.
How can I download just the unique links on a page? I only need the links on the page, and I'll keep them in a database.
The code is awesome but too complex for me. How can I prevent it from downloading the content? Is this difficult?
And how can I define the type of link (e.g. /x/james-bond/12313, with a regex)?
thanks.
modified on Thursday, July 30, 2009 4:57 PM
|
|
|
|
 |
|
 |
A good way to start would be to search the project files for "link" or "href" or similar.
Then set appropriate breakpoints at the locations you found. Start running inside the debugger and see what you can get in the breakpoint locations (variables, members, etc.).
|
|
|
|
 |
|
 |
Hi there,
I am getting this error.
Can you please provide more info?
Thanks,
Chetan J.
|
|
|
|
 |
|
 |
I would love to - if you could give me more context information, please.
Thanks
Uwe
|
|
|
|
 |
|
 |
Hello Keim, your code is great! But when I run it and it finishes crawling, I can't find the HTML pages themselves in the download folder. I only find the .STATE file... I don't know why...
|
|
|
|
 |
|
 |
nice spider, nice code, thanks
|
|
|
|
 |
|
 |
Failed to replace URI '/script/Ann/ServeHTML.aspx?C=False&id=6233' with URI '30FC637E.html' in HTML text '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
Above is the message I get when I try to run the test project.
Please advise how I can make the test project run without errors.
|
|
|
|
 |
|
 |
Hi,
I am getting an exception when I run the test app that you included. Exception details:
System.ApplicationException was unhandled by user code
Message="Failed to replace URI '/script/Ann/ServeThirdParty.aspx?p=728x90&attrs=&r=1536964' with URI 'BC40D607.html' in HTML text 'r\n\r\n\r\n<html>\r\n\r
0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0'."
Source="Zeta.WebSpider"
StackTrace:
at Zeta.WebSpider.Spider.ResourceRewriter.ReplaceLinks(String textContent, UriResourceInformation uriInfo) in C:\Work\WebSpider\WebSpider\Spider\ResourceRewriter.cs:line 97
at Zeta.WebSpider.Spider.WebSiteDownloader.ProcessUrl(DownloadedResourceInformation uriInfo, Int32 depth) in C:\Work\WebSpider\WebSpider\Spider\WebSiteDownloader.cs:line 441
at Zeta.WebSpider.Spider.WebSiteDownloader.Process() in C:\Work\WebSpider\WebSpider\Spider\WebSiteDownloader.cs:line 107
at Zeta.WebSpider.Spider.WebSiteDownloader.processAsyncBackgroundWorker_DoWork(Object sender, DoWorkEventArgs e) in C:\Work\WebSpider\WebSpider\Spider\WebSiteDownloader.cs:line 231
at System.ComponentModel.BackgroundWorker.OnDoWork(DoWorkEventArgs e)
at System.ComponentModel.BackgroundWorker.WorkerThreadStart(Object argument)
Is there something I am not doing right?
I am trying to experiment with your code to see if I can use it to crawl some data and store it in a database.
Thanks.
|
|
|
|
 |
|
 |
Does the project have a specification?
|
|
|
|
 |