
I need to programmatically save a complete webpage (given a URL and a folder path), using any web browser engine (Internet Explorer, Google Chrome, Firefox) or an API from anywhere.


The idea is to avoid re-coding a web browser's URL request feature in any language.
The resulting folder must contain the sample.html file and the resources folder (~sample_resource).


Using the CDO library (the IE engine) it is possible to save a URL to a .mht file, but this does not preserve several .css, .php and .js references, so it is not that good.
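For reference, that .mht approach can look roughly like this from C# (a hedged sketch using late-bound COM; it assumes the CDO.Message component is registered, and that the numeric constants 0 and 2 correspond to cdoSuppressNone and adSaveCreateOverWrite):

using System;

class MhtSaver
{
    // Hedged sketch: ask CDO (the IE engine) to build an MHTML snapshot of a URL.
    // Assumes the CDO.Message COM component is available on the machine.
    static void SaveAsMht(string url, string mhtPath)
    {
        Type cdoType = Type.GetTypeFromProgID("CDO.Message");
        dynamic msg = Activator.CreateInstance(cdoType);
        msg.CreateMHTMLBody(url, 0, "", "");      // 0 = cdoSuppressNone (assumed)
        msg.GetStream().SaveToFile(mhtPath, 2);   // 2 = adSaveCreateOverWrite (assumed)
    }
}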


I am looking for advice, a link, an API reference, code - anything to finally settle the web-scraping topic on the forums.

Thanks

Comments
[no name] 13-Oct-12 15:06pm    
Okay, and? Did you have some sort of a question or problem?
Member 8089652 13-Oct-12 15:08pm    
Yes, I am looking for any help, any code, in any language, as long as I can avoid coding a web scraper myself.
[no name] 13-Oct-12 15:11pm    
Okay so use the WebClient class to request the data. Then simply save it.
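For the raw HTML that would be roughly (a minimal sketch; the URL and target path are placeholders):

using System.IO;
using System.Net;

class Program
{
    static void Main()
    {
        // Downloads only the raw HTML of the page - none of its linked resources.
        using (var client = new WebClient())
        {
            string html = client.DownloadString("http://example.com/sample.html");
            File.WriteAllText(@"C:\saved\sample.html", html);
        }
    }
}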
Member 8089652 13-Oct-12 15:15pm    
This will save only the HTML string, not the resources used by the page (.css, .js, images, ...).
[no name] 13-Oct-12 15:20pm    
So in other words, you want the server side code that you can't get anyway?

1 solution

Well, it is quite simple, and thus not so simple. The basic idea is that you need an HTTP client. You have one in .NET: http://msdn.microsoft.com/en-us/library/system.net.http.httpclient.aspx[^]. But that is only part of it: you still need to take care of possible authentication, cookies, and possibly some other things that happen in the browser before you get your page. It really depends on how dynamic your needs are. One thing you still have to do if you want the whole page: you have to parse the HTML code and also download the linked resources - images, CSS, JavaScript files, and maybe other media too. This latter part can be a little complicated, since the content you get via a page is not always statically linked. You might get a somewhat empty page, because the actual content is generated via JavaScript and downloaded in the background with AJAX. So you can run into trouble that browsers already handle for you.
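To illustrate the idea (fetch the page, then pull out and download the statically linked resources), here is a rough C# sketch using HttpClient and a naive regular expression. It is only a sketch: it ignores authentication, cookies, link rewriting and JavaScript-generated content, and a real implementation should use a proper HTML parser. The sample.html and sample_resource names follow the question; the URL and folder in Main are placeholders.

using System;
using System.IO;
using System.Net.Http;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

class PageSaver
{
    // Rough sketch: fetch the HTML, then try to download statically linked resources.
    static async Task SavePageAsync(string url, string folderPath)
    {
        Directory.CreateDirectory(folderPath);
        string resourceFolder = Path.Combine(folderPath, "sample_resource");
        Directory.CreateDirectory(resourceFolder);

        using (var client = new HttpClient())
        {
            string html = await client.GetStringAsync(url);
            File.WriteAllText(Path.Combine(folderPath, "sample.html"), html);

            // Naive extraction of src/href attributes; a proper HTML parser is more robust.
            var matches = Regex.Matches(html, @"(?:src|href)\s*=\s*[""']([^""']+)[""']",
                                        RegexOptions.IgnoreCase);
            foreach (Match m in matches)
            {
                Uri resourceUri;
                if (!Uri.TryCreate(new Uri(url), m.Groups[1].Value, out resourceUri))
                    continue;

                string fileName = Path.GetFileName(resourceUri.LocalPath);
                if (string.IsNullOrEmpty(fileName))
                    continue;

                try
                {
                    byte[] data = await client.GetByteArrayAsync(resourceUri);
                    File.WriteAllBytes(Path.Combine(resourceFolder, fileName), data);
                }
                catch (HttpRequestException)
                {
                    // Skip anything that cannot be downloaded (broken links, etc.).
                }
            }
        }
    }

    static void Main()
    {
        // Example usage with placeholder values.
        SavePageAsync("http://example.com/", @"C:\saved").GetAwaiter().GetResult();
    }
}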

You can get over part of these problems with an already existing mirroring tool, like WGet[^] or curl[^] - both also available for Windows. These can simply be called from another application too. Google might give you several other mirroring tools, but none of them will save the dynamically generated page - more precisely, I don't know of any that does.
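If calling an external tool is acceptable, shelling out to wget is probably the least work. A sketch, assuming wget.exe is on the PATH (check the wget manual for the exact options; --page-requisites, --convert-links and --adjust-extension are the ones relevant to saving a complete page):

using System.Diagnostics;

class WgetMirror
{
    // Sketch: let wget fetch the page plus its images/CSS/JS into folderPath.
    static void SavePage(string url, string folderPath)
    {
        var psi = new ProcessStartInfo
        {
            FileName = "wget",
            Arguments = "--page-requisites --convert-links --adjust-extension " +
                        "--directory-prefix=\"" + folderPath + "\" \"" + url + "\"",
            UseShellExecute = false,
            CreateNoWindow = true
        };
        using (var process = Process.Start(psi))
        {
            process.WaitForExit();
        }
    }
}

curl can be driven the same way, although it does not fetch page requisites on its own.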
 
Comments
Member 8089652 13-Oct-12 15:28pm    
Thanks Zoltán. I already checked wget on Ubuntu, but its rather uncontrolled downloads require post-download checking, moving and re-linking of files. Anyway, it's a last option to try.
I'll read the curl documentation. Does it have recursive download for resource files?
Zoltán Zörgő 13-Oct-12 15:33pm    
I don't know what you mean by "recursive download for resource files". Consult its documentation.
Here is a feature list: http://curl.haxx.se/docs/comparison-table.html - other tools are mentioned there as well.
I think these are the most complete tools available in this area. And there is also a library version for .NET.
Member 8089652 13-Oct-12 15:35pm    
OK, I'll work on that, thanks
Member 8089652 13-Oct-12 15:38pm    
Great table.
Recursive download means downloading everything linked from the current HTML: other URLs and/or .css, .js files, etc.
curl doesn't have it, but there are a few other tools that do.
Zoltán Zörgő 13-Oct-12 15:44pm    
Yes, I noticed. I have never used it recursively :)
