
I need to programmatically save a complete webpage (given a URL and a folder path), using any web browser engine (Internet Explorer, Google Chrome, Firefox) or an API from anywhere.


The idea is to avoid re-coding a web browser's URL request feature in any language.
The resulting folder must contain the sample.html file and the resources folder (~sample_resource).


Using the CDO library (the IE engine) it is possible to save a URL to a .mht file, but this does not preserve several .css, .php and .js references, so it is not that good.
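For reference, that .mht approach can look roughly like this from C# (a hedged sketch using late-bound COM; it assumes the CDO.Message component is registered, and that the numeric constants 0 and 2 correspond to cdoSuppressNone and adSaveCreateOverWrite):

using System;

class MhtSaver
{
    // Hedged sketch: ask CDO (the IE engine) to build an MHTML snapshot of a URL.
    // Assumes the CDO.Message COM component is available on the machine.
    static void SaveAsMht(string url, string mhtPath)
    {
        Type cdoType = Type.GetTypeFromProgID("CDO.Message");
        dynamic msg = Activator.CreateInstance(cdoType);
        msg.CreateMHTMLBody(url, 0, "", "");      // 0 = cdoSuppressNone (assumed)
        msg.GetStream().SaveToFile(mhtPath, 2);   // 2 = adSaveCreateOverWrite (assumed)
    }
}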


I am looking for advice, a link, an API reference, code - anything to finally settle the web-scraping topic on the forums.

Thanks

Comments
[no name] 13-Oct-12 15:06pm    
Okay, and? Did you have some sort of a question or problem?
Member 8089652 13-Oct-12 15:08pm    
Yes, I am looking for any help, any code, in any language, as long as I can avoid coding a web scraper myself.
[no name] 13-Oct-12 15:11pm    
Okay so use the WebClient class to request the data. Then simply save it.
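For the raw HTML that would be roughly (a minimal sketch; the URL and target path are placeholders):

using System.IO;
using System.Net;

class Program
{
    static void Main()
    {
        // Downloads only the raw HTML of the page - none of its linked resources.
        using (var client = new WebClient())
        {
            string html = client.DownloadString("http://example.com/sample.html");
            File.WriteAllText(@"C:\saved\sample.html", html);
        }
    }
}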
Member 8089652 13-Oct-12 15:15pm    
This will save only the HTML string, not the resources used by the page (.css, .js, images, ...).
[no name] 13-Oct-12 15:20pm    
So in other words, you want the server side code that you can't get anyway?

1 solution

Well, it is quite simple, and thus not so simple. The basic idea is that you need an HTTP client. You have one in .NET: http://msdn.microsoft.com/en-us/library/system.net.http.httpclient.aspx[^]. But that is only part of it: you still need to take care of possible authentication, cookies, and possibly some other things that happen in the browser before you get your page. It really depends on how dynamic your needs are. One thing you still have to do if you want the whole page: you have to parse the HTML code and also download the linked resources - images, CSS, JavaScript files, and maybe other media too. This latter part can be a little complicated, since the content you get via a page is not always statically linked. You might get a somewhat empty page, because the actual content is generated via JavaScript and downloaded in the background with AJAX. So you can run into trouble that browsers already handle for you.
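To illustrate the idea (fetch the page, then pull out and download the statically linked resources), here is a rough C# sketch using HttpClient and a naive regular expression. It is only a sketch: it ignores authentication, cookies, link rewriting and JavaScript-generated content, and a real implementation should use a proper HTML parser. The sample.html and sample_resource names follow the question; the URL and folder in Main are placeholders.

using System;
using System.IO;
using System.Net.Http;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

class PageSaver
{
    // Rough sketch: fetch the HTML, then try to download statically linked resources.
    static async Task SavePageAsync(string url, string folderPath)
    {
        Directory.CreateDirectory(folderPath);
        string resourceFolder = Path.Combine(folderPath, "sample_resource");
        Directory.CreateDirectory(resourceFolder);

        using (var client = new HttpClient())
        {
            string html = await client.GetStringAsync(url);
            File.WriteAllText(Path.Combine(folderPath, "sample.html"), html);

            // Naive extraction of src/href attributes; a proper HTML parser is more robust.
            var matches = Regex.Matches(html, @"(?:src|href)\s*=\s*[""']([^""']+)[""']",
                                        RegexOptions.IgnoreCase);
            foreach (Match m in matches)
            {
                Uri resourceUri;
                if (!Uri.TryCreate(new Uri(url), m.Groups[1].Value, out resourceUri))
                    continue;

                string fileName = Path.GetFileName(resourceUri.LocalPath);
                if (string.IsNullOrEmpty(fileName))
                    continue;

                try
                {
                    byte[] data = await client.GetByteArrayAsync(resourceUri);
                    File.WriteAllBytes(Path.Combine(resourceFolder, fileName), data);
                }
                catch (HttpRequestException)
                {
                    // Skip anything that cannot be downloaded (broken links, etc.).
                }
            }
        }
    }

    static void Main()
    {
        // Example usage with placeholder values.
        SavePageAsync("http://example.com/", @"C:\saved").GetAwaiter().GetResult();
    }
}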

You can get over part of these problems with an already existing mirroring tool, like WGet[^] or curl[^] - both also available for Windows. These can simply be called from another application too. Google might give you several other mirroring tools, but none of them will save the dynamically generated page - more precisely, I don't know of any that does.
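If calling an external tool is acceptable, shelling out to wget is probably the least work. A sketch, assuming wget.exe is on the PATH (check the wget manual for the exact options; --page-requisites, --convert-links and --adjust-extension are the ones relevant to saving a complete page):

using System.Diagnostics;

class WgetMirror
{
    // Sketch: let wget fetch the page plus its images/CSS/JS into folderPath.
    static void SavePage(string url, string folderPath)
    {
        var psi = new ProcessStartInfo
        {
            FileName = "wget",
            Arguments = "--page-requisites --convert-links --adjust-extension " +
                        "--directory-prefix=\"" + folderPath + "\" \"" + url + "\"",
            UseShellExecute = false,
            CreateNoWindow = true
        };
        using (var process = Process.Start(psi))
        {
            process.WaitForExit();
        }
    }
}

curl can be driven the same way, although it does not fetch page requisites on its own.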
 
Comments
Member 8089652 13-Oct-12 15:28pm    
Thanks Zoltán. I already checked wget on Ubuntu, but its rather uncontrolled downloads require post-download checking, moving and re-linking of files. Anyway, it's a last option to try.
I'll read the curl documentation. Does it have recursive download for resource files?
Zoltán Zörgő 13-Oct-12 15:33pm    
I don't know what you mean by "recursive download for resource files". Consult its documentation.
Here is a feature list: http://curl.haxx.se/docs/comparison-table.html - other tools are mentioned there as well.
I think these are the most complete tools available in this area. And there is also a library version for .NET.
Member 8089652 13-Oct-12 15:35pm    
OK, I'll work on that, thanks
Member 8089652 13-Oct-12 15:38pm    
Great table.
Recursive download means downloading everything linked from the current HTML: other URLs and/or .css, .js files, etc.
curl doesn't have it, but there are a few other tools that do.
Zoltán Zörgő 13-Oct-12 15:44pm    
Yes, I noticed. I have never used it recursively :)
