There is no "multibyte character" concept in .NET. .NET strings use the UTF-16LE Unicode encoding, which is composed of 16-bit code units. A single 16-bit unit is not enough to represent all Unicode code points, only the BMP (Basic Multilingual Plane), which covers code points from 0 to 0xFFFF, inclusive.
What about Unicode code points above the BMP? They are supported only at the level of strings, not individual characters. This is a bit hard to explain. In UTF-16, such code points are represented by
surrogate pairs. Only the UTF-16 encodings use them, but a special range of code points is reserved for surrogates and must not be used for any "real" characters. Each 16-bit unit of a surrogate pair is not a real code point by itself; it merely occupies a position in the code space that does not correspond to any real character. At the level of the string, the pair is treated as a single character (for example, it is correctly rendered on screen as one glyph). This support was introduced in one of the Windows 2000 service packs.
At the level of characters, there are no 4-byte characters. This means that if you traverse a string character by character, some of those characters may not be "real" characters, so such traversal should be avoided, at least for languages using above-BMP code points. You should never break up a surrogate pair, but the
char
type lets you do exactly that, which can cause problems.
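To illustrate, here is a minimal sketch; U+1D11E (MUSICAL SYMBOL G CLEF) is just an arbitrary above-BMP example I picked:

```csharp
using System;

class SurrogateDemo
{
    static void Main()
    {
        // U+1D11E lies above the BMP, so in a .NET string it is
        // stored as a surrogate pair: two char values, one character.
        string clef = char.ConvertFromUtf32(0x1D11E);

        Console.WriteLine(clef.Length);                   // 2 code units
        Console.WriteLine(char.IsHighSurrogate(clef[0])); // True
        Console.WriteLine(char.IsLowSurrogate(clef[1]));  // True

        // The safe way to recover the real code point from the pair:
        int codePoint = char.ConvertToUtf32(clef, 0);
        Console.WriteLine(codePoint.ToString("X"));       // 1D11E
    }
}
```

Note that neither `clef[0]` nor `clef[1]` is a valid character on its own; only the pair taken together means anything.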
To address these problems, the type
System.Text.Encoding
was designed. You can serialize any string to an array of bytes (not characters!) using one of the
GetBytes
methods, or deserialize an array of bytes into a string using the method
GetString
.
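A sketch of the round trip (UTF-8 here is just an example; whichever encoding you pick determines the byte layout):

```csharp
using System;
using System.Text;

class RoundTrip
{
    static void Main()
    {
        string original = "Büro"; // 'ü' takes two bytes in UTF-8

        byte[] bytes = Encoding.UTF8.GetBytes(original);   // string -> bytes
        string restored = Encoding.UTF8.GetString(bytes);  // bytes -> string

        Console.WriteLine(bytes.Length);          // 5 bytes for 4 characters
        Console.WriteLine(restored == original);  // True
    }
}
```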
To solve your problem, you need to know what is represented by your array. What you show is not a "string array"; it is an integer array. For integers, there is no such thing as "hex" or "decimal"; those are only display formats. If this is an array of 16-bit integers and each element represents a UTF-16 code unit (a character or half of a surrogate pair), you can probably serialize it into an array of bytes and then deserialize that into a string using
System.Text.Encoding.GetString
. Correct serialization depends on the endianness (little-endian or big-endian) of the array. Where did you get it? Does it represent any sensible string? You can try it anyway. If you face a problem, post a valid data sample and I'll take a look.
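For instance, if your array really holds UTF-16 code units, the conversion might look like this (a sketch; the sample values are made up, and Encoding.Unicode expects the bytes in little-endian order, so use Encoding.BigEndianUnicode for the other byte order):

```csharp
using System;
using System.Text;

class UnitsToString
{
    static void Main()
    {
        // Hypothetical input: 16-bit values, one UTF-16 code unit each.
        // 0xD834/0xDD1E is the surrogate pair for U+1D11E.
        ushort[] units = { 0x0048, 0x0069, 0xD834, 0xDD1E };

        // Serialize the 16-bit units into bytes, low byte first,
        // to match Encoding.Unicode (UTF-16LE).
        byte[] bytes = new byte[units.Length * 2];
        for (int i = 0; i < units.Length; i++)
        {
            bytes[2 * i]     = (byte)(units[i] & 0xFF);
            bytes[2 * i + 1] = (byte)(units[i] >> 8);
        }

        string result = Encoding.Unicode.GetString(bytes);
        Console.WriteLine(result.Length); // 4 char values, 3 real characters
    }
}
```

Packing the bytes explicitly, instead of block-copying the array, keeps the result independent of the byte order of the machine the code runs on.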
See:
http://unicode.org/,
http://unicode.org/faq/utf_bom.html,
http://msdn.microsoft.com/en-us/library/system.text.encoding.aspx.
—SA