Some Russian Web sites (and others) have a problem that cannot be solved in a fully reliable way for certain pages. It is not too bad: a good Web browser can even auto-detect the encoding, but auto-detection is not reliable. First of all, it does not guarantee correct rendering without some trial and error. In pathological cases the bytes are ambiguous, so it is impossible to be 100% sure about the encoding.
You see, a page is merely an array of bytes. When you save it, you save it as it is. The encoding of the page is derived from three different sources:
- For Unicode UTFs, an HTML file can start with a BOM. See:
http://unicode.org/, http://unicode.org/faq/utf_bom.html. The BOM is not required, because the two other methods exist. If a BOM is used, it should not contradict the charset declared in the HTML text or in the HTTP header, see below.
- HTTP equivalent. It is placed inside the <head> element and looks like this:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
- HTTP headers. The charset information can come in the HTTP header.
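To make the first two sources more concrete, here is a minimal Python sketch that looks for a BOM and for a charset declared in a meta tag, working directly on the raw bytes of a page. The function names `charset_from_bom` and `charset_from_meta` and the 4096-byte scan limit are my own assumptions for illustration; a real browser follows a more elaborate algorithm.

```python
import codecs
import re

# BOM signatures. UTF-32 LE must be checked before UTF-16 LE,
# because its BOM starts with the same two bytes.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Finds charset=... inside a <meta> tag. This covers the http-equiv form
# as well as the shorter <meta charset="..."> form.
_META_CHARSET = re.compile(
    rb"<meta[^>]*?charset\s*=\s*[\"']?\s*([A-Za-z0-9_.:-]+)",
    re.IGNORECASE,
)


def charset_from_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None if there is no BOM."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


def charset_from_meta(data: bytes, scan_limit: int = 4096):
    """Return the charset declared in a <meta> tag near the top of the page, or None.

    Only works for ASCII-compatible encodings, where the tag itself is readable
    as ASCII bytes (scan_limit is an arbitrary value chosen for this sketch).
    """
    match = _META_CHARSET.search(data[:scan_limit])
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()
```

If both functions return None for a saved file, the page relied on the HTTP header alone, and that information is no longer in the file.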
Now, some Russian sites use obsolete encodings like Windows CP-1251 or KOI8-R. This is really bad, but it can be acceptable if the charset is correctly declared in the file and if the page languages are only Russian plus basic Latin (no European characters beyond ASCII). In this case, all sensible browsers render the page correctly. The problem appears when none of the three methods is used, or when they contradict each other. Unfortunately, some sites are like that, even these days. Too bad.
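To see why an undeclared single-byte encoding is ambiguous, consider this small Python sketch. The same bytes decode without any error as both CP-1251 and KOI8-R, so nothing in the data itself says which reading is right. The sample word is just my own example.

```python
# A Russian word encoded as Windows CP-1251.
text = "Привет"
raw = text.encode("cp1251")

# Both decodings succeed, because every byte value is a valid character
# in both single-byte encodings.
as_cp1251 = raw.decode("cp1251")  # the original word
as_koi8r = raw.decode("koi8_r")   # valid Cyrillic letters, but a meaningless word

print(as_cp1251)
print(as_koi8r)
```

A browser can only guess here, for example from letter frequencies, and that is why auto-detection can fail on short or unusual texts.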
Now we come to the problem of saved files. Some Web sites use only method #3. How do they do it? Usually, an HTTP server has an option like "default charset". If a file has no indication of the charset, the server automatically generates an HTTP header with the default charset. This is really bad, as it does not allow for several languages unless the charset is a Unicode UTF (in practice, only UTF-8 should be used). The creators of such sites think they need to save some disk space and traffic at the expense of this limitation; they really do save some, as Unicode always takes a bit more space, even the most economical UTF-8. Still, this works when you simply view the Web page online.
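As a rough illustration of method #3, this is how the charset could be pulled out of a Content-Type header value in Python. I use the standard `email.message.Message` class only as a convenient parser for the header parameters; the helper name `charset_from_content_type` is hypothetical.

```python
from email.message import Message


def charset_from_content_type(header_value: str):
    """Return the lower-cased charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()


# Typical value sent by a server configured with a default charset:
print(charset_from_content_type("text/html; charset=windows-1251"))
# A header without a charset parameter gives None:
print(charset_from_content_type("text/html"))
```

This value exists only in the HTTP response, not in the page body, so it is the one to note down before saving the page.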
The problems happen when one saves the file to a local disk. If only method #3 is used, the charset information is lost. Sorry, blame the authors of those sites. I would recommend identifying such situations and fixing the saved file by adding an "http-equiv" tag in <head>, as in method #2.
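Such a fix could look roughly like the following Python sketch. It assumes that you already know the right charset (for example, from the HTTP header seen while the page was online) and that the encoding is ASCII-compatible, as CP-1251, KOI8-R and UTF-8 are. The function name `add_charset_meta` is my own.

```python
import re

# Opening <head> tag, with or without attributes (does not match <header>).
_HEAD_OPEN = re.compile(rb"<head(\s[^>]*)?>", re.IGNORECASE)
# Any existing <meta> tag that already declares a charset.
_HAS_CHARSET = re.compile(rb"<meta[^>]*charset\s*=", re.IGNORECASE)


def add_charset_meta(data: bytes, charset: str) -> bytes:
    """Insert an http-equiv charset declaration right after <head>.

    The page is handled as raw bytes, so its text is never re-encoded.
    If a charset is already declared, or there is no <head> tag at all,
    the data is returned unchanged.
    """
    if _HAS_CHARSET.search(data):
        return data
    match = _HEAD_OPEN.search(data)
    if match is None:
        return data
    meta = (
        '<meta http-equiv="Content-Type" content="text/html; charset=%s" />'
        % charset
    ).encode("ascii")
    return data[: match.end()] + b"\n" + meta + data[match.end():]
```

To use it, read the saved file in binary mode, pass the bytes through `add_charset_meta`, and write the result back in binary mode as well.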
This is not just a problem of Russian sites, of course, but I observed it mostly on Russian sites. These days, the situation is getting better.
—SA