I have this code that loads the URL I specify directly into a string and then renders that string in a WebBrowser control (as if it had been loaded from a file).
But after rendering (as with loading from a saved file), the page loses some of its content and behaviour: some divs, tables, text colors, images, alignments, formatting, the functionality of some controls on the page, the text encoding, and so on.
So, how do I load a saved web page without destroying whatever the page's encoding is?


Here is the code:

C#
string url = "http://www.lexilogos.com/clavier/russkij.htm";
string url_ = "http://www.w3schools.com/jsref/jsref_replace.asp";
WebClient webClient = new WebClient();
string response = webClient.DownloadString(url_); // download the page markup into a string
webBrowser1.DocumentText = response;              // render the downloaded markup in the control

1 solution

There is a problem with Russian and some other Web sites which makes this not solvable in a really reliable way for certain pages or sites. It is not too bad: a good Web browser can even auto-detect the encoding, but that is not reliable. First of all, it does not guarantee correct rendering without some trial and error. In pathological cases there is ambiguity, so it is impossible to be 100% sure about the encoding.

You see, a page is merely an array of bytes. When you save it, you save it as it is. The encoding of the page is derived from three different sources:

  1. For Unicode UTFs, an HTML file can start with a BOM. See:
    http://unicode.org/, http://unicode.org/faq/utf_bom.html. The BOM is not required, because of the two other ways. If a BOM is used, it should not contradict the charset declared in the HTML text or in the HTTP header; see below (a small detection sketch for this and the next method follows the list).
  2. An HTML "HTTP equivalent". It is placed under the <head> tag and looks like this:
    HTML
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  3. HTTP headers. The charset information can come in the Content-Type HTTP header.
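
For illustration, here is a minimal sketch (the class and method names are just placeholders) of how methods #1 and #2 could be checked on a downloaded byte array; when it returns null, only the HTTP header of method #3 is left to consult:

C#
using System;
using System.Text;
using System.Text.RegularExpressions;

static class CharsetSniffer
{
    // Try method #1 (BOM), then method #2 (charset declaration in the markup).
    // Returns null when neither is present.
    public static Encoding TryDetect(byte[] html)
    {
        // 1. Byte order mark
        if (html.Length >= 3 && html[0] == 0xEF && html[1] == 0xBB && html[2] == 0xBF)
            return Encoding.UTF8;
        if (html.Length >= 2 && html[0] == 0xFF && html[1] == 0xFE)
            return Encoding.Unicode;            // UTF-16 LE
        if (html.Length >= 2 && html[0] == 0xFE && html[1] == 0xFF)
            return Encoding.BigEndianUnicode;   // UTF-16 BE

        // 2. charset= declaration; the markup itself is ASCII,
        //    so decoding a prefix as ASCII is safe for this purpose.
        string head = Encoding.ASCII.GetString(html, 0, Math.Min(html.Length, 4096));
        Match match = Regex.Match(head, @"charset\s*=\s*[""']?([A-Za-z0-9_\-]+)", RegexOptions.IgnoreCase);
        if (match.Success)
        {
            try { return Encoding.GetEncoding(match.Groups[1].Value); }
            catch (ArgumentException) { /* unknown charset name */ }
        }
        return null; // fall back to the HTTP header or a default
    }
}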


Now, some Russian sites use obsolete encodings like Windows CP-1251 or KOI8-R. This is really bad, but it can be acceptable if the charset is correctly prescribed in the file and if the page's languages are only Russian plus basic Latin (no European characters beyond ASCII). In this case, all no-nonsense browsers render the page correctly. The problem appears when none of the three ways is used, or when they contradict each other. Unfortunately, some sites are like that, even these days. Too bad.
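
To see why a correctly prescribed charset matters, here is a tiny illustration (the byte values are just an example): the very same bytes read as two different Cyrillic strings depending on which legacy charset is assumed, and nothing in the bytes themselves disambiguates them.

C#
using System;
using System.Text;

// The same six bytes decode to different Cyrillic text in each charset.
byte[] sample = { 0xEF, 0xF0, 0xE8, 0xE2, 0xE5, 0xF2 };
Console.WriteLine(Encoding.GetEncoding("windows-1251").GetString(sample)); // привет
Console.WriteLine(Encoding.GetEncoding("koi8-r").GetString(sample));       // ОПХБЕР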

Now we come to the problem of the saved files. Some Web sites use only method #3. How do they do it? Usually, an HTTP server has an option like "default charset". If a file has no indication of the charset, the HTTP response comes with an automatically generated header carrying that default charset. This is really bad, as it does not allow for several languages unless the charset is a Unicode UTF (practically, only UTF-8 should be used), but the creators of such sites think they need to save some disk space and traffic at the expense of this limitation; and they really do save some, as Unicode always takes a bit more space, even the most economical UTF-8. Still, this works while you simply view the Web page online.
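
As a sketch of what your program can do while the page is still online (adapting the snippet from the question; the UTF-8 fallback is just an assumption): download the raw bytes, read the charset from the Content-Type header the server sends, and decode with that instead of trusting DownloadString.

C#
using System;
using System.Net;
using System.Text;

WebClient webClient = new WebClient();
byte[] raw = webClient.DownloadData("http://www.lexilogos.com/clavier/russkij.htm");

// Method #3: the server may declare the charset only in the HTTP header,
// e.g. "text/html; charset=windows-1251".
Encoding encoding = Encoding.UTF8; // assumed fallback
string contentType = webClient.ResponseHeaders["Content-Type"];
if (contentType != null)
{
    int index = contentType.IndexOf("charset=", StringComparison.OrdinalIgnoreCase);
    if (index >= 0)
    {
        string name = contentType.Substring(index + "charset=".Length).Trim(' ', '"', ';');
        try { encoding = Encoding.GetEncoding(name); }
        catch (ArgumentException) { /* unknown charset name, keep the fallback */ }
    }
}

webBrowser1.DocumentText = encoding.GetString(raw); // render with the declared charset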

The problems happen when one saves the file on a local disk. If only method #3 is used, the charset information is lost. Sorry, blame the authors of those sites. I would recommend identifying such situations and fixing the saved file by adding an "http-equiv" tag in <head>, as in method #2.
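
A rough sketch of that fix, assuming you already know which charset the HTTP header used to declare (the file path and charset name here are only placeholders):

C#
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

string path = @"C:\saved\page.htm"; // placeholder: the locally saved page
string charset = "windows-1251";    // placeholder: what the HTTP header declared

// Read the file with the known charset, inject an http-equiv tag
// right after <head> (method #2), and save the file again.
Encoding encoding = Encoding.GetEncoding(charset);
string html = File.ReadAllText(path, encoding);
string tag = "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=" + charset + "\" />";
html = Regex.Replace(html, @"<head[^>]*>", m => m.Value + "\r\n" + tag, RegexOptions.IgnoreCase);
File.WriteAllText(path, html, encoding);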

This is not just a problem of Russian sites, of course, but I have observed it mostly on Russian sites. These days the situation is getting better.

—SA
 