The answers provided before won't work beyond the BMP (Basic Multilingual Plane).
Unfortunately, the support for such characters in .NET is somewhat limited; more exactly, there is full support, but it is indirect. In brief, the type System.String is a self-consistent type with full Unicode support, but System.Char is not: in the set of all possible values of this type, not every value represents a character: some correspond to an undefined code point, some to a character, and some to the high or low member of a surrogate pair: https://en.wikipedia.org/wiki/UTF-16#U.2B10000_to_U.2B10FFFF[^].
Here is the .NET trick: internally, strings use the encoding UTF-16LE (please see the UTF-16 link above). As long as you consider strings, not characters, everything works right. Some characters use one 16-bit word, but some, above the BMP, use two such words, a surrogate pair.
Be very careful: the property Length, say, gives you the number of 16-bit words, not the number of characters. The actual number of characters may be less than the number of memory words.
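Here is a small illustration of this point (a minimal sketch; any character above the BMP would behave the same way):
// U+10330, GOTHIC LETTER AHSA, is above the BMP,
// so it occupies two 16-bit words in a string:
string s = "\U00010330";
System.Console.WriteLine(s.Length); // 2: two UTF-16 words, one surrogate pair
// one way to count "real" characters (text elements):
System.Console.WriteLine(
    new System.Globalization.StringInfo(s).LengthInTextElements); // 1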
But if you extract a value of the type System.Char, it might not be a character.
Here is what you can do. As soon as you need a character's code point, not surrogate values, you have to use strings only. In particular, to get the N-th character from a string, use a substring of length 1 (the second parameter of Substring): https://msdn.microsoft.com/en-us/library/aka44szs(v=vs.110).aspx[^]. This way, even single characters should be represented as strings, not as instances of System.Char.
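For example (a minimal sketch; note that, as explained below, a surrogate check is still needed before interpreting such a one-word substring):
string s = "A\U00010330B";
string first = s.Substring(0, 1);  // "A", a complete character
string second = s.Substring(1, 1); // only the high surrogate of U+10330,
                                   // not a valid character on its own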
Take the character (char, this time) at index 0 and check whether it is the high member of a surrogate pair. If it is, take the two char values at indices 0 and 1 and use System.Char.ConvertToUtf32: https://msdn.microsoft.com/en-us/library/wdh8k14a%28v=vs.110%29.aspx[^].
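A minimal sketch of this check:
string character = "\U00010330"; // one character above the BMP
int codePoint;
if (System.Char.IsHighSurrogate(character[0]))
    codePoint = System.Char.ConvertToUtf32(character[0], character[1]);
else
    codePoint = (int)character[0];
System.Console.WriteLine("{0:x}", codePoint); // 10330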
Alternatively, you can directly get the UTF-32 representation of a character by its index in a string: https://msdn.microsoft.com/en-us/library/z2ys180b%28v=vs.110%29.aspx[^].
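This overload takes the string and the index (a minimal sketch):
string s = "A\U00010330";
// the index must not point at a low surrogate,
// otherwise System.ArgumentException is thrown:
int codePoint = System.Char.ConvertToUtf32(s, 1); // 0x10330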
Now, you need to know that the arithmetic values of the 32-bit words in the UTF-32LE encoding are exactly the values of the Unicode code point numbers. In .NET, you immediately get code point values. In all other cases, that is, when a single char is not a member of a surrogate pair, simply type-cast it to uint; that will be your code point value.
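For example:
char c = '\u03bb'; // GREEK SMALL LETTER LAMDA, inside the BMP
uint codePoint = (uint)c; // 0x3bb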
[EDIT]
You can understand all the steps from the work-around C# code sample shown below. I take some code point values, convert them to a .NET string, then inspect each "real" Unicode character in the string, get its code point, and output the calculated code points:
// code points to test: three above the BMP, two Greek letters, two ASCII letters:
System.UInt32[] codePoints = new uint[] {
    0x10056, 0x10057, 0x10058,
    0x03bb, 0x03bc,
    0x41, 0x42,
};

// lay out the code points as raw UTF-32LE bytes
// (System.BitConverter uses the platform byte order,
// little-endian on all practical .NET platforms):
byte[] utf32Data = new byte[codePoints.Length * sizeof(uint)];
for (int index = 0; index < codePoints.Length; ++index) {
    byte[] character = System.BitConverter.GetBytes(codePoints[index]);
    System.Array.Copy(character, 0, utf32Data, index * sizeof(uint), character.Length);
}

// decode the UTF-32 data into a regular .NET (UTF-16) string:
string value = new string(System.Text.Encoding.UTF32.GetChars(utf32Data));

// walk the string and recover the code point of each "real" character:
System.Text.StringBuilder sb = new System.Text.StringBuilder("code points: ");
for (int index = 0; index < value.Length; ++index) {
    char[] character;
    char word = value[index];
    if (System.Char.IsHighSurrogate(word))
        // in a well-formed string, a high surrogate is always followed by a low one:
        character = new char[] { word, value[index + 1], };
    else if (System.Char.IsLowSurrogate(word))
        continue; // already consumed together with its high surrogate
    else
        character = new char[] { word };
    int codePoint;
    if (character.Length > 1)
        codePoint = System.Char.ConvertToUtf32(character[0], character[1]);
    else
        codePoint = (int)character[0];
    sb.Append(string.Format("{0:x} ", codePoint));
}
System.Console.WriteLine(sb.ToString());
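If everything works as described, the expected output is: code points: 10056 10057 10058 3bb 3bc 41 42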
This code sample is not the most efficient. Instead, I tried to show each step clearly.
[END EDIT]
You should understand Unicode code points correctly: they are just ordinary numbers assigned to characters. They are values in the pure mathematical sense, totally abstracted from their computer representations. Likewise, characters are pure cultural entities, fully abstracted from both computer representations and details like their glyphs, fonts, and so on. This correspondence between numbers and characters is the core of Unicode. All computer-related details are defined by the UTFs.
—SA