It seems that no one has actually written code to save the contents of a page, including images, as rendered by the .NET WebBrowser control. And the ShowSaveAsDialog provided by the WebBrowser control sucks: it doesn't return the filename, and it doesn't even work - try saving a Google page with a search filled in, and you get just the Google home page; it doesn't save the page with the parameters specified.

Now, of course there's this helpful approach (from Stack Overflow):

So my approach revised would be:

Use System.Net.HttpWebRequest to get the main HTML document as a string or stream (easy).
Load this into an HtmlAgilityPack document, where you can now easily query the document to get lists of all image elements, stylesheet links, etc.
Then make a separate web request for each of these files and save them to a subdirectory.
Finally, update all relevant links in the main page to point to the items in the subdirectory (see the sketch below).
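Something along those lines is what I have in mind - just a rough sketch, assuming HtmlAgilityPack is referenced, using WebClient rather than raw HttpWebRequest for brevity, covering only images, stylesheets and scripts, and glossing over error handling and filename collisions:

using System;
using System.IO;
using System.Net;
using HtmlAgilityPack;

class PageSaver
{
    public static void Save(string url, string htmlPath)
    {
        // Resources go into a "files" subdirectory next to the saved page.
        string filesDir = Path.Combine(Path.GetDirectoryName(Path.GetFullPath(htmlPath)), "files");
        Directory.CreateDirectory(filesDir);

        var baseUri = new Uri(url);
        using (var client = new WebClient())
        {
            // 1. Get the main HTML document as a string.
            string html = client.DownloadString(url);

            // 2. Load it into an HtmlAgilityPack document and query for resources.
            var doc = new HtmlDocument();
            doc.LoadHtml(html);

            SaveAndRewrite(doc, client, baseUri, filesDir, "//img[@src]", "src");
            SaveAndRewrite(doc, client, baseUri, filesDir, "//link[@rel='stylesheet'][@href]", "href");
            SaveAndRewrite(doc, client, baseUri, filesDir, "//script[@src]", "src");

            // 4. Write out the main page with its links rewritten.
            doc.Save(htmlPath);
        }
    }

    static void SaveAndRewrite(HtmlDocument doc, WebClient client, Uri baseUri,
                               string filesDir, string xpath, string attr)
    {
        var nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes == null)
            return;                                       // no matching elements

        foreach (var node in nodes)
        {
            // Resolve the (possibly relative) reference against the page URI.
            var resourceUri = new Uri(baseUri, node.GetAttributeValue(attr, ""));

            // Derive a local file name from the last path segment.
            string name = Path.GetFileName(resourceUri.LocalPath);
            if (string.IsNullOrEmpty(name))
                continue;
            string localPath = Path.Combine(filesDir, name);

            // 3. Make a separate request for the resource and save it.
            try { client.DownloadFile(resourceUri.AbsoluteUri, localPath); }
            catch (WebException) { continue; }            // skip what can't be fetched

            // Point the element at the saved copy.
            node.SetAttributeValue(attr, "files/" + name);
        }
    }
}

Since the page is fetched with the full URL, query string and all, a Google search results page would come back as requested rather than as the bare home page.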


But what amazes me is that no one has posted code that does this, at least not that I can find.

Argh. Why is working with WebBrowser such a PITA? Anyways, if someone has some code for saving a web page without using ShowSaveAsDialog, please point me in the right direction.

Marc
1 solution

This problem is not actually related to the browser control, or even to a browser itself. You need a browser to render a Web page, but to save a Web page, you don't really need to render it. You need to use the well-known technique of Web scraping:
http://en.wikipedia.org/wiki/Web_scraping

Please see my past answers for further detail:
get specific data from web page,
How to get the data from another site.

If you do all that, you can derive a class from the WebBrowser control class (it is not sealed) and add this functionality if you wish; but this is not the root of the problem.

—SA
 
Comments
Kenneth Haugland 2-May-13 13:23pm    
I think he has some feedback for you...
Marc Clifton 2-May-13 14:14pm    
Oh, I see what you mean. I'm not used to this forum format!
Marc Clifton 2-May-13 14:17pm    
I'm aware of that, but as the first link states, "If this is a HTML document, you will need to parse it," and what surprises me is that I can't find any code samples for doing that. And yes, I'm aware that this is WebBrowser-control independent; all I need is the HTML, which the control gives me access to.
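For example (assuming a WinForms WebBrowser field named webBrowser), once DocumentCompleted has fired I can just do this:

private void webBrowser_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    // The HTML source of the loaded page, ready to hand off to a parser.
    string html = webBrowser.DocumentText;
}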

Marc
Sergey Alexandrovich Kryukov 2-May-13 14:47pm    
OK, very good. I actually suggested a way to parse it. Apparently, this is unavoidable. Actually, it depends on how deep you need to scrape, as it can eventually lead anywhere on the whole Web. If you only want to scrape the URIs used to render a single page, that will only include the URIs loaded as pictures, JavaScript files and the contents of frames. Some downloads could give random or unpredictable results, and some won't be reachable at all (for example, media streams). If you need just this, the approach could be HTML spying: collecting all the URIs the page requests, along the lines of the sketch below.
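A rough sketch of that idea - assuming a WinForms WebBrowser field named webBrowser whose page has finished loading, and ignoring frames - could look like this:

using System;
using System.Collections.Generic;
using System.Windows.Forms;

static class ResourceUriCollector
{
    // Walk the rendered DOM and collect the addresses of images, scripts
    // and stylesheets; each of these can then be downloaded separately.
    public static List<Uri> Collect(WebBrowser webBrowser)
    {
        var uris = new List<Uri>();
        HtmlDocument doc = webBrowser.Document;
        if (doc == null)
            return uris;

        foreach (HtmlElement img in doc.Images)
            Add(uris, img.GetAttribute("src"));
        foreach (HtmlElement script in doc.GetElementsByTagName("script"))
            Add(uris, script.GetAttribute("src"));
        foreach (HtmlElement link in doc.GetElementsByTagName("link"))
            Add(uris, link.GetAttribute("href"));

        return uris;
    }

    // Keep only values that resolve to absolute URIs; skip empty attributes.
    static void Add(List<Uri> uris, string value)
    {
        Uri uri;
        if (!string.IsNullOrEmpty(value) &&
            Uri.TryCreate(value, UriKind.Absolute, out uri))
            uris.Add(uri);
    }
}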
—SA
