The answers provided before won't work beyond the BMP (Basic Multilingual Plane).
Unfortunately, the support for such characters in .NET is somewhat limited; more exactly, there is full support, but it is indirect. In brief, the type System.String is a self-consistent type with full Unicode support, but System.Char is not: in the set of all possible values of this type, not every value represents a character: some correspond to an undefined code point, some to a character, and some to the high or low member of a surrogate pair: https://en.wikipedia.org/wiki/UTF-16#U.2B10000_to_U.2B10FFFF[^].
Here is the .NET trick: internally, strings use the encoding UTF-16LE (please see the UTF-16 link above). As long as you consider strings, not characters, everything works right. Some characters use one 16-bit word, but some, above the BMP, use two such words, a surrogate pair.
Be very careful: the property Length, say, gives you the number of 16-bit words, not the number of characters. The actual number of characters may be less than the number of memory words.
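Here is a small illustration of this point (a minimal sketch; any character above the BMP would behave the same way):
// U+10330, GOTHIC LETTER AHSA, is above the BMP,
// so it occupies two 16-bit words in a string:
string s = "\U00010330";
System.Console.WriteLine(s.Length); // 2: two UTF-16 words, one surrogate pair
// one way to count "real" characters (text elements):
System.Console.WriteLine(
    new System.Globalization.StringInfo(s).LengthInTextElements); // 1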
But if you extract a value of the type System.Char, it might not be a character.
Here is what you can do. As soon as you need a character's code point, not surrogate values, you have to use strings only. In particular, to get the N-th character from a string, use a substring of length 1 (the second parameter of Substring): https://msdn.microsoft.com/en-us/library/aka44szs(v=vs.110).aspx[^]. This way, even single characters should be represented as strings, not as instances of System.Char.
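For example (a minimal sketch; note that, as explained below, a surrogate check is still needed before interpreting such a one-word substring):
string s = "A\U00010330B";
string first = s.Substring(0, 1);  // "A", a complete character
string second = s.Substring(1, 1); // only the high surrogate of U+10330,
                                   // not a valid character on its own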
Take the character (char, this time) at index 0 and check whether it is the high member of a surrogate pair. If it is, take the two char values at indices 0 and 1 and use System.Char.ConvertToUtf32: https://msdn.microsoft.com/en-us/library/wdh8k14a%28v=vs.110%29.aspx[^].
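A minimal sketch of this check:
string character = "\U00010330"; // one character above the BMP
int codePoint;
if (System.Char.IsHighSurrogate(character[0]))
    codePoint = System.Char.ConvertToUtf32(character[0], character[1]);
else
    codePoint = (int)character[0];
System.Console.WriteLine("{0:x}", codePoint); // 10330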
Alternatively, you can directly get the UTF-32 representation of a character by its index in a string: https://msdn.microsoft.com/en-us/library/z2ys180b%28v=vs.110%29.aspx[^].
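This overload takes the string and the index (a minimal sketch):
string s = "A\U00010330";
// the index must not point at a low surrogate,
// otherwise System.ArgumentException is thrown:
int codePoint = System.Char.ConvertToUtf32(s, 1); // 0x10330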
Now, you need to know that the arithmetic values of the 32-bit words in the UTF-32LE encoding are exactly the values of the Unicode code point numbers. In .NET, you immediately get code point values. In all other cases, that is, when a single char is not a member of a surrogate pair, simply type-cast it to uint; that will be your code point value.
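For example:
char c = '\u03bb'; // GREEK SMALL LETTER LAMDA, inside the BMP
uint codePoint = (uint)c; // 0x3bb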
[EDIT]
You can understand all the steps from the work-around C# code sample shown below. I take some code point values, convert them to a .NET string, then inspect each "real" Unicode character in the string, get its code point, and output the calculated code points:
// code points to test: three above the BMP, two Greek letters, two ASCII letters:
System.UInt32[] codePoints = new uint[] {
    0x10056, 0x10057, 0x10058,
    0x03bb, 0x03bc,
    0x41, 0x42,
};

// lay out the code points as raw UTF-32LE bytes
// (System.BitConverter uses the platform byte order,
// little-endian on all practical .NET platforms):
byte[] utf32Data = new byte[codePoints.Length * sizeof(uint)];
for (int index = 0; index < codePoints.Length; ++index) {
    byte[] character = System.BitConverter.GetBytes(codePoints[index]);
    System.Array.Copy(character, 0, utf32Data, index * sizeof(uint), character.Length);
}

// decode the UTF-32 data into a regular .NET (UTF-16) string:
string value = new string(System.Text.Encoding.UTF32.GetChars(utf32Data));

// walk the string and recover the code point of each "real" character:
System.Text.StringBuilder sb = new System.Text.StringBuilder("code points: ");
for (int index = 0; index < value.Length; ++index) {
    char[] character;
    char word = value[index];
    if (System.Char.IsHighSurrogate(word))
        // in a well-formed string, a high surrogate is always followed by a low one:
        character = new char[] { word, value[index + 1], };
    else if (System.Char.IsLowSurrogate(word))
        continue; // already consumed together with its high surrogate
    else
        character = new char[] { word };
    int codePoint;
    if (character.Length > 1)
        codePoint = System.Char.ConvertToUtf32(character[0], character[1]);
    else
        codePoint = (int)character[0];
    sb.Append(string.Format("{0:x} ", codePoint));
}
System.Console.WriteLine(sb.ToString());
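If everything works as described, the expected output is: code points: 10056 10057 10058 3bb 3bc 41 42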
This code sample is not the most efficient. Instead, I tried to show each step clearly.
[END EDIT]
You should understand Unicode code points correctly: they are just ordinary numbers assigned to characters. They are values in the pure mathematical sense, totally abstracted from their computer representations. Likewise, characters are pure cultural entities, fully abstracted from both computer representations and details like their glyphs, fonts, and so on. This correspondence between numbers and characters is the core of Unicode. All computer-related details are defined by the UTFs.
—SA