What you "know" is not true!
First, how many bytes a character takes is not a pure characteristic of the language. And of course, there is no such thing as a "normal" language. Core Unicode does not define how many bytes each character has; it defines the set of
code points: a correspondence between a character as a cultural phenomenon, abstracted from its concrete glyph, and a set of integer values understood in the mathematical sense, abstracted from their computer representation.
Encodings called UTFs define how to represent each code point in bytes. Only UTF-32 uses a fixed 4 bytes per character. Byte-oriented UTF-8 uses an interesting algorithm in which a character takes 1, 2, 3 or 4 bytes, with the sequence length encoded in the leading byte. And UTF-16 is not a 16-bit code (!): a character can take either 16 or 32 bits, because code points outside the
Basic Multilingual Plane (BMP) are expressed as a surrogate pair, two 16-bit code units. Also, the UTF-16 and UTF-32 encodings can be little-endian or big-endian.
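If you want to see this for yourself, here is a minimal Python 3 sketch; the sample characters are my own choice, picked to cover 1-, 2-, 3- and 4-byte UTF-8 sequences plus a non-BMP code point, and the "-le" codec variants are used so the byte-order mark does not inflate the counts:

# How many bytes does the same character take in each UTF?
for ch in "A", "\u00E9", "\u20AC", "\U0001F600":  # A, é, €, an emoji outside the BMP
    print(
        f"U+{ord(ch):04X}:",
        f"UTF-8 = {len(ch.encode('utf-8'))},",
        f"UTF-16 = {len(ch.encode('utf-16-le'))},",
        f"UTF-32 = {len(ch.encode('utf-32-le'))} bytes",
    )

The emoji (U+1F600) comes out as 4 bytes even in UTF-16: that is the surrogate pair at work.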
Now, about "normal" languages. Which language do you want to consider. American English perhaps? All expressed in ASCII, code points, 0 to 127, right? Think again! It depends on what you consider a "language". How about fully-fledged punctuation used in this language? Consider, for example, correct typography for dash and quotation marks:
—, – “ ”
. Try to type them in your keyboard. The code points are 0x2013, 0x2014, 0x201C and 0x201D. Try to squeeze them in one byte — good luck!
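A quick way to convince yourself, again a Python 3 sketch (the sample string is my own):

# Typographically correct English is already outside ASCII.
quoted = "\u201CHello\u201D \u2013 see?"
try:
    quoted.encode("ascii")
except UnicodeEncodeError as err:
    print("Not ASCII:", err)

for ch in "\u2013", "\u2014", "\u201C", "\u201D":
    print(f"U+{ord(ch):04X} takes {len(ch.encode('utf-8'))} bytes in UTF-8")

Each of these four punctuation characters needs 3 bytes in UTF-8, so even "plain English" text is not a one-byte-per-character affair once the typography is correct.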
See http://unicode.org/ and http://unicode.org/faq/utf_bom.html.
Please don't make false statements; understand things yourself first.
—SA