Hello Uwe,
I have a problem using this. When I run the Test app and it says 'Finished' I just get a STATE file. How do I continue using the pages I've downloaded?
Cheers
|
Hi,
Thanks for posting the information about Web Spider. I have downloaded the source code and executed it. It generated a file with the extension "STATE".
Will you please let me know how to view this file?
Thanks,
Shabber.
|
The spider saves its current state in that file so that it can continue where it left off when restarted.
The content is not meant to be read by humans, but of course you can run and debug the application to see how the content is written and retrieved.
Cheers Uwe
|
Hi,
Thanks for the good stuff.
I have used it and it works nicely.
I want to explore all dynamic links on the website, like
www.xyz.com/products.aspx?id=2 www.xyz.com/products.aspx?id=3 www.xyz.com/products.aspx?id=4
How can I detect the dynamic URLs? The query string may also include the name of the product.
Can you give me a way to implement this?
Thanks and regards, Rohit
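One possible way to recognize such dynamic URLs is to check for a non-empty query string and split it into its parameters. The following is only a minimal sketch using the standard System.Uri class; DynamicUrlHelper, IsDynamic and ParseQuery are hypothetical names and are not part of Zeta.WebSpider.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; not part of Zeta.WebSpider.
public static class DynamicUrlHelper
{
    // Treats an absolute URL as "dynamic" when it carries a non-empty query string.
    // Uri.Query includes the leading '?', so a length of 1 means "no parameters".
    public static bool IsDynamic(Uri uri)
    {
        return uri.IsAbsoluteUri && uri.Query.Length > 1;
    }

    // Splits "?id=2&name=foo" into key/value pairs (percent-decoded).
    // Expects an absolute URI; '+' is not converted to a space here.
    public static Dictionary<string, string> ParseQuery(Uri uri)
    {
        var result = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        string query = uri.Query.TrimStart('?');
        foreach (string pair in query.Split(new[] { '&' }, StringSplitOptions.RemoveEmptyEntries))
        {
            string[] parts = pair.Split(new[] { '=' }, 2);
            string key = Uri.UnescapeDataString(parts[0]);
            string value = parts.Length > 1 ? Uri.UnescapeDataString(parts[1]) : string.Empty;
            result[key] = value; // a repeated key keeps its last value
        }
        return result;
    }
}
```

With this sketch, DynamicUrlHelper.IsDynamic(new Uri("http://www.xyz.com/products.aspx?id=2")) is true, and ParseQuery on the same URI yields an "id" entry with the value "2". A crawler could use the parsed parameters to decide which product pages to follow.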
|
Hi.
How can I download just the unique links on a page? I only need the links, and I'll keep them in a database. The code is awesome but too complex for me. How can I prevent it from downloading the content? Is this difficult?
And how can I define the type of link to collect (e.g. /x/james-bond/12313, with a regex)?
thanks.
modified on Thursday, July 30, 2009 4:57 PM
|
A good way to start would be to search the project files for "link" or "href" or similar.
Then set appropriate breakpoints at the locations you found. Start running inside the debugger and see what you can get in the breakpoint locations (variables, members, etc.).
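If only the links are needed, a standalone approach might look like the sketch below. It is not taken from the WebSpider source: LinkExtractor and ExtractUniqueLinks are hypothetical names, the href regex is deliberately naive, and the caller is assumed to have fetched the HTML already.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of Zeta.WebSpider.
public static class LinkExtractor
{
    // Naive matcher for quoted href attributes; a real HTML parser is more robust.
    private static readonly Regex HrefPattern = new Regex(
        @"href\s*=\s*[""'](?<url>[^""']+)[""']",
        RegexOptions.IgnoreCase);

    // Returns the distinct http/https links in the HTML whose path matches pathFilter.
    // Pass null as pathFilter to keep every link.
    public static List<Uri> ExtractUniqueLinks(string html, Uri baseUri, Regex pathFilter)
    {
        var seen = new HashSet<string>(StringComparer.Ordinal);
        var links = new List<Uri>();
        foreach (Match m in HrefPattern.Matches(html))
        {
            Uri absolute;
            if (!Uri.TryCreate(baseUri, m.Groups["url"].Value, out absolute)) continue;
            if (absolute.Scheme != Uri.UriSchemeHttp && absolute.Scheme != Uri.UriSchemeHttps) continue;
            if (pathFilter != null && !pathFilter.IsMatch(absolute.AbsolutePath)) continue;

            // Drop the #fragment so that "/page" and "/page#top" count as one link.
            string key = absolute.GetLeftPart(UriPartial.Query);
            if (seen.Add(key)) links.Add(new Uri(key));
        }
        return links;
    }
}
```

For links shaped like /x/james-bond/12313, a filter such as new Regex(@"^/x/[a-z0-9-]+/\d+$") could be passed as pathFilter. The returned list can then be written to a database without saving any page content.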
|
Hello Keim, your code is great! But when I run it and it finishes crawling, I can't find the HTML pages themselves in the download folder. I only find the .STATE file, and I don't know why.
|
Failed to replace URI '/script/Ann/ServeHTML.aspx?C=False&id=6233' with URI '30FC637E.html' in HTML text '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
Above is the message I get when I try to run the test project.
Please advise how I can make the test project run without errors.
|
Hi,
I am getting an exception when I run the test app that you included. Exception details:
System.ApplicationException was unhandled by user code
Message="Failed to replace URI '/script/Ann/ServeThirdParty.aspx?p=728x90&attrs=&r=1536964' with URI 'BC40D607.html' in HTML text 'r\n\r\n\r\n<html>\r\n\r [...] \0\0\0 (long run of \0 characters trimmed)'."
Source="Zeta.WebSpider"
StackTrace:
at Zeta.WebSpider.Spider.ResourceRewriter.ReplaceLinks(String textContent, UriResourceInformation uriInfo) in C:\Work\WebSpider\WebSpider\Spider\ResourceRewriter.cs:line 97
at Zeta.WebSpider.Spider.WebSiteDownloader.ProcessUrl(DownloadedResourceInformation uriInfo, Int32 depth) in C:\Work\WebSpider\WebSpider\Spider\WebSiteDownloader.cs:line 441
at Zeta.WebSpider.Spider.WebSiteDownloader.Process() in C:\Work\WebSpider\WebSpider\Spider\WebSiteDownloader.cs:line 107
at Zeta.WebSpider.Spider.WebSiteDownloader.processAsyncBackgroundWorker_DoWork(Object sender, DoWorkEventArgs e) in C:\Work\WebSpider\WebSpider\Spider\WebSiteDownloader.cs:line 231
at System.ComponentModel.BackgroundWorker.OnDoWork(DoWorkEventArgs e)
at System.ComponentModel.BackgroundWorker.WorkerThreadStart(Object argument)
Is there something I am not doing right?
I am trying to experiment with your code to see if I can use it to crawl some data and store it in a database.
Thanks.
|
I downloaded and found the project to build right away and run with no problems. It could prove very useful.
However, I have one question for you. I tried setting LinkDepth to both 0 and 1, because I'm only trying to grab one page and its necessary resources. Either way it crawls all over the place and never stops!
Is this option implemented and am I using it correctly?
Regards,
Aaron
|
Just uploaded a new version.
I had several issues in a project of mine, too. This version works as expected for my project.
Please try again with this version.
|
Thank you for the code update. Even though I don't have a client proxy enabled, I'm still getting a proxy exception from the application.
"The remote server returned an error: (407) Proxy Authentication Required."
As far as I can tell this happens even before it has started returning any content. Any ideas?
Kind regards
Aaron
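A 407 without an explicitly configured proxy can occur when .NET picks up a system-wide or auto-detected proxy. The sketch below shows two common workarounds using System.Net. It assumes that the spider issues its requests through WebRequest/HttpWebRequest and does not set its own proxy, which is not confirmed anywhere in this thread; ProxySetup and its methods are hypothetical names.

```csharp
using System.Net;

// Hypothetical helper for illustration; call one of these once at startup,
// before the spider issues its first request.
public static class ProxySetup
{
    // Option 1: send the logged-in user's credentials to the default proxy.
    public static void UseDefaultProxyCredentials()
    {
        IWebProxy proxy = WebRequest.DefaultWebProxy;
        if (proxy != null)
        {
            proxy.Credentials = CredentialCache.DefaultCredentials;
        }
    }

    // Option 2: skip the proxy entirely for requests created afterwards.
    public static void BypassProxy()
    {
        WebRequest.DefaultWebProxy = null;
    }
}
```

Which option fits depends on the network: the first suits a corporate proxy that requires authentication, the second a machine where the detected proxy should not be used at all.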
|
I am having the same problem: when I start in a subdirectory, the spider crawls up to the root of the site. I tried both options.MaximumLinkDepth = 1; and options.MaximumLinkDepth = 0;
I was trying to crawl 1 page only: http://sdlookup.com/MLS-076069531-1216_Gertrude_St_San_Diego_CA_92110
and it crawled up to sdlookup.com and eventually crashed with a "bad request" error.
Am I doing something wrong?
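For reference, this is roughly the behavior one would expect from a maximum link depth: depth 0 visits only the start page, depth 1 also visits the pages it links to, and so on. The sketch below is a generic breadth-first traversal and not the WebSpider implementation; DepthLimitedCrawl is a hypothetical name, and getLinks stands in for "download the page and extract its links".

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration of a depth limit; not the Zeta.WebSpider code.
public static class DepthLimitedCrawl
{
    // Returns the URLs in the order they are visited.
    public static List<string> Crawl(
        string startUrl, int maxDepth, Func<string, IEnumerable<string>> getLinks)
    {
        var visited = new HashSet<string>(StringComparer.Ordinal) { startUrl };
        var order = new List<string>();
        var queue = new Queue<KeyValuePair<string, int>>();
        queue.Enqueue(new KeyValuePair<string, int>(startUrl, 0));

        while (queue.Count > 0)
        {
            KeyValuePair<string, int> current = queue.Dequeue();
            order.Add(current.Key);

            // At the depth limit the page is still visited, but its links are not followed.
            if (current.Value >= maxDepth) continue;

            foreach (string link in getLinks(current.Key))
            {
                // The visited set keeps each URL from being queued twice.
                if (visited.Add(link))
                {
                    queue.Enqueue(new KeyValuePair<string, int>(link, current.Value + 1));
                }
            }
        }
        return order;
    }
}
```

In this sketch a crawl from a subdirectory page never reaches the site root unless some page within the allowed depth links to it. Restricting the crawl to a URL prefix would need an extra check before a link is queued.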
|
When I use certain URLs, I get an exception in the ResourceParser.DoExtractLinks method, in DoExtractLinks( XmlReader xml, UriResourceInformation uriInfo ).
It seems that you treat HTML as XML, but some HTML is not a well-formed document. I am not sure about this.
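The general issue can be reproduced outside the library: a plain XmlReader rejects HTML that is not well-formed XML. The sketch below only demonstrates that effect with the standard System.Xml classes; WellFormedCheck is a hypothetical name, and how WebSpider actually builds its XmlReader (a later post mentions an SgmlReaderDll.dll reference) is not shown here.

```csharp
using System.IO;
using System.Xml;

// Hypothetical helper for illustration; not part of Zeta.WebSpider.
public static class WellFormedCheck
{
    // Returns true when the markup can be read end to end as a well-formed XML document.
    public static bool IsWellFormedXml(string markup)
    {
        // Ignore a DOCTYPE instead of rejecting it, since HTML pages usually start with one.
        var settings = new XmlReaderSettings { DtdProcessing = DtdProcessing.Ignore };
        try
        {
            using (XmlReader reader = XmlReader.Create(new StringReader(markup), settings))
            {
                while (reader.Read()) { } // walk every node; malformed input throws
            }
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }
}
```

For example, "<html><body><p>ok</p></body></html>" passes, while "<html><body><p>one<br>two</body></html>" fails because the br element is never closed. Pages like the second one are common on the web, which is why HTML is usually run through a tolerant parser before being handled as XML.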
|
Did you know that when launching the WebSpider.csproj an error is generated?
It reads:
Unable to read the project file 'WebSpider.csproj'. The file c:\[project path]\WebSpider.csproj is not a valid project file. The project file is missing the 'VisualStudioProject' section.
Vic
Edit: The same goes for the Test Project.
|
Please use the "WebSpider.sln" instead.
But I found I was missing a reference ("SgmlReaderDll.dll"), which I now included in an updated download.
Thanks for your feedback!