Hello,

I want to get the Unicode representation of a char using C#. For example, for "A" the Unicode representation is "U+0041".

TIA
The answers provided before won't work beyond the BMP (Basic Multilingual Plane).

Unfortunately, the support for such characters in .NET is somewhat limited; more exactly, there is full support, but it is indirect. In brief, the type System.String is a self-consistent type with full Unicode support, but System.Char is not: in the set of all possible values of this type, not every value represents a character. Some correspond to an undefined code point, some to a character, and some to the high or low member of a surrogate pair: https://en.wikipedia.org/wiki/UTF-16#U.2B10000_to_U.2B10FFFF[^].

Here is the .NET trick: internally, strings use the encoding UTF-16LE (please see the UTF-16 link above). As long as you work with strings, not characters, everything works correctly. Some characters use one 16-bit word, but characters above the BMP use two such words, a surrogate pair.

Be very careful: for example, the property Length gives you the number of 16-bit words, not the number of characters. The actual number of characters may be less than the number of memory words.
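To illustrate this point (a minimal sketch of mine; the sample string and the use of System.Globalization.StringInfo are my own illustration, not part of the original answer):

```csharp
using System;
using System.Globalization;

class LengthDemo
{
    static void Main()
    {
        // U+10400 lies above the BMP, so it is stored
        // as a surrogate pair of two 16-bit words:
        string s = "A\U00010400";

        // Length counts 16-bit words: one for 'A', two for U+10400:
        Console.WriteLine(s.Length); // 3

        // StringInfo counts text elements instead of 16-bit words:
        Console.WriteLine(new StringInfo(s).LengthInTextElements); // 2
    }
}
```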

But if you extract a value of the type System.Char, it might not be a character.

Here is what you can do. As soon as you need a character code point, not surrogate values, you have to use strings only. In particular, to get the N-th character from a string, use a substring of length 1 (the second parameter): https://msdn.microsoft.com/en-us/library/aka44szs(v=vs.110).aspx[^].

This way, even single characters should be represented as strings, not as instances of System.Char.

Take the character (char, this time) at index 0 and check whether it is part of a surrogate pair (System.Char.IsHighSurrogate). If it is, take the two char values at indices 0 and 1 and use System.Char.ConvertToUtf32:
https://msdn.microsoft.com/en-us/library/wdh8k14a%28v=vs.110%29.aspx[^].

Alternatively, you can directly get UTF-32 representation of a character by its index in a string: https://msdn.microsoft.com/en-us/library/z2ys180b%28v=vs.110%29.aspx[^].
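Both overloads can be demonstrated on the same above-BMP character (a minimal sketch; the sample code point is my own choice):

```csharp
using System;

class ConvertDemo
{
    static void Main()
    {
        string s = "\U00010056"; // a character above the BMP

        // Overload taking a string and the index of a 16-bit word:
        int cp1 = char.ConvertToUtf32(s, 0);

        // Overload taking the two members of a surrogate pair:
        int cp2 = char.ConvertToUtf32(s[0], s[1]);

        Console.WriteLine("U+{0:X} U+{1:X}", cp1, cp2); // U+10056 U+10056
    }
}
```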

Now, you need to know that the arithmetic values of the 32-bit words in the UTF-32LE encoding are exactly the Unicode code point numbers, so in .NET you immediately get code point values.

In all other cases, use the single char and type-cast it to uint; that will be your code point value.
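These steps can be condensed into a small helper that answers the original question directly (a sketch of mine; the method name CodePointAt is hypothetical, not from the original answer):

```csharp
using System;

class CodePointDemo
{
    // Returns the code point starting at the given 16-bit word index,
    // formatted in the conventional "U+XXXX" notation.
    // char.ConvertToUtf32(string, int) handles surrogate pairs for us.
    static string CodePointAt(string s, int index)
    {
        int codePoint = char.ConvertToUtf32(s, index);
        return string.Format("U+{0:X4}", codePoint);
    }

    static void Main()
    {
        Console.WriteLine(CodePointAt("A", 0));          // U+0041
        Console.WriteLine(CodePointAt("\U00010056", 0)); // U+10056
    }
}
```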

[EDIT]

You can see all the steps in the work-around C# code sample shown below. I take some code point values, convert them to a .NET string, then inspect each "real" Unicode character in the string, get its code point, and output the calculated code points:
C#
System.UInt32[] codePoints = new uint[] {
    // above BMP,
    // from: http://www.unicode.org/charts/PDF/U10000.pdf:
    0x10056, 0x10057, 0x10058,
    // Greek labda and mu: in BMP, but outside ASCII:
    0x03bb, 0x03bc,
    // ASCII, Latin A, B:
    0x41, 0x42,
};

// serialize it into array of bytes:
byte[] utf32Data = new byte[codePoints.Length * sizeof(uint)];
for (int index = 0; index < codePoints.Length; ++index) {
    byte[] character = System.BitConverter.GetBytes(codePoints[index]);
    System.Array.Copy(character, 0, utf32Data, index * sizeof(uint), character.Length);
}

// get string out of UTF32 data:
string value = new string(System.Text.Encoding.UTF32.GetChars(utf32Data));

// calculate and output code points:
System.Text.StringBuilder sb = new System.Text.StringBuilder("code points: ");
for (int index = 0; index < value.Length; ++index) {
    char[] character; // one or two 16-bit words is a character
    char word = value[index]; // a 16-bit word, not really a character
    if (System.Char.IsHighSurrogate(word)) {
        character = new char[] { word, value[index + 1], };
    } else if (System.Char.IsLowSurrogate(word))
        continue;
    else
        character = new char[] { word };
    int codePoint;
    if (character.Length > 1)
        codePoint = System.Char.ConvertToUtf32(character[0], character[1]);
    else
        codePoint = (int)character[0];
    sb.Append(string.Format("{0:x} ", codePoint));
}
System.Console.WriteLine(sb.ToString());

This code sample is not the most efficient. Instead, I tried to show each step clearly.
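As a side note of mine, not part of the original answer: on newer platforms (.NET Core 3.0 and later), System.Text.Rune enumerates code points directly, which collapses the loop above into a few lines. This assumes a runtime that provides Rune:

```csharp
using System;
using System.Text;

class RuneDemo
{
    static void Main()
    {
        // Same mix as above: ASCII, Greek, and an above-BMP character:
        string value = "AB\u03BB\U00010056";

        // Each Rune is one Unicode code point, surrogate pairs included:
        foreach (Rune rune in value.EnumerateRunes())
            Console.Write("{0:x} ", rune.Value); // 41 42 3bb 10056
    }
}
```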

[END EDIT]

You should understand Unicode code points correctly: they are just ordinary numbers assigned to characters. They are values in the pure mathematical sense, totally abstracted from their computer representations. Likewise, characters are pure cultural entities, fully abstracted from both computer representations and details like glyphs, fonts, and so on. This correspondence between numbers and characters is the core of Unicode; all computer-related details are defined by the UTFs.

—SA
C#
private string CharToUnicodeFormat(char c)
{
    return string.Format(@"U+{0:x4}", (int)c);
}

private char UnicodeFormatToChar(string ucf)
{
    return Convert.ToChar(Convert.ToInt32(ucf.Substring(2),16));
}
Keep in mind that the creators of C# give you many tools to deal with the internal Unicode representation, such as the literal '\u0000', as opposed to the "extended" formal Unicode syntax "U+0000"; backslash-u is the syntax for an "escaped" Unicode character.
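A quick round trip with the two helpers above might look like this (a self-contained sketch; the helpers are reproduced so the sample compiles on its own):

```csharp
using System;

class RoundTripDemo
{
    static string CharToUnicodeFormat(char c)
    {
        return string.Format("U+{0:x4}", (int)c);
    }

    static char UnicodeFormatToChar(string ucf)
    {
        return Convert.ToChar(Convert.ToInt32(ucf.Substring(2), 16));
    }

    static void Main()
    {
        string ucf = CharToUnicodeFormat('A');
        Console.WriteLine(ucf);                      // U+0041
        Console.WriteLine(UnicodeFormatToChar(ucf)); // A
    }
}
```

Note that char-based helpers like these only cover the BMP; for code points above U+FFFF, see Solution 1.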
Comments
Afzaal Ahmad Zeeshan 6-Dec-15 14:29pm    
Enough for an answer! 5ed.
That is the value of the character variable in memory. All you need to do is cast it to an integer and display it.
Comments
Member 12070468 6-Dec-15 5:07am    
It doesn't work:
char a = 'A';
int cc = Convert.ToInt32(a);

I got 65, while it's supposed to be U+0041???
Richard MacCutchan 6-Dec-15 8:08am    
Of course it works, hex 41 is decimal 65. You need to set your print command to display the value in the correct base.
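To make this exchange concrete (my own illustration): the conversion already yields the right number; only the output formatting differs.

```csharp
using System;

class BaseDemo
{
    static void Main()
    {
        char a = 'A';
        int cc = Convert.ToInt32(a);
        Console.WriteLine(cc);             // 65 (the same value in decimal)
        Console.WriteLine("U+{0:X4}", cc); // U+0041 (formatted as hexadecimal)
    }
}
```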

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)