There is no "multibyte character" concept in .NET. .NET strings use the UTF-16LE Unicode encoding, which is composed of 16-bit code units. A single 16-bit unit is not enough to represent all Unicode code points, only the BMP (Basic Multilingual Plane), which covers code points from 0 to 0xFFFF, inclusive.
What about Unicode code points above the BMP? They are supported only at the level of strings, not individual characters. This is a bit hard to explain. In UTF-16, such code points are represented by
surrogate pairs. Only the UTF-16 encodings use them, but a special range of code points is reserved for surrogates and must not be used for any "real" characters. Each 16-bit unit of a surrogate pair is not a real code point by itself; it merely occupies a position in the code space that does not correspond to any real character. At the level of the string, the pair is treated as a single character (for example, it is correctly rendered on screen as one glyph). This support was introduced in one of the Windows 2000 service packs.
At the level of characters, there are no 4-byte characters. This means that if you traverse a string character by character, some of those characters may not be "real" characters, so such traversal should be avoided, at least for languages using above-BMP code points. You should never break up a surrogate pair, but the
char
type lets you do exactly that, which can cause problems.
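To illustrate, here is a minimal sketch; U+1D11E (MUSICAL SYMBOL G CLEF) is just an arbitrary above-BMP example I picked:

```csharp
using System;

class SurrogateDemo
{
    static void Main()
    {
        // U+1D11E lies above the BMP, so in a .NET string it is
        // stored as a surrogate pair: two char values, one character.
        string clef = char.ConvertFromUtf32(0x1D11E);

        Console.WriteLine(clef.Length);                   // 2 code units
        Console.WriteLine(char.IsHighSurrogate(clef[0])); // True
        Console.WriteLine(char.IsLowSurrogate(clef[1]));  // True

        // The safe way to recover the real code point from the pair:
        int codePoint = char.ConvertToUtf32(clef, 0);
        Console.WriteLine(codePoint.ToString("X"));       // 1D11E
    }
}
```

Note that neither `clef[0]` nor `clef[1]` is a valid character on its own; only the pair taken together means anything.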
To address these problems, the type
System.Text.Encoding
was designed. You can serialize any string to an array of bytes (not characters!) using one of the
GetBytes
methods, or deserialize an array of bytes into a string using the method
GetString
.
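A sketch of the round trip (UTF-8 here is just an example; whichever encoding you pick determines the byte layout):

```csharp
using System;
using System.Text;

class RoundTrip
{
    static void Main()
    {
        string original = "Büro"; // 'ü' takes two bytes in UTF-8

        byte[] bytes = Encoding.UTF8.GetBytes(original);   // string -> bytes
        string restored = Encoding.UTF8.GetString(bytes);  // bytes -> string

        Console.WriteLine(bytes.Length);          // 5 bytes for 4 characters
        Console.WriteLine(restored == original);  // True
    }
}
```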
To solve your problem, you need to know what is represented by your array. What you show is not a "string array"; it is an integer array. For integers, there is no such thing as "hex" or "decimal"; those are only display formats. If this is an array of 16-bit integers and each element represents a UTF-16 code unit (a character or half of a surrogate pair), you can probably serialize it into an array of bytes and then deserialize that into a string using
System.Text.Encoding.GetString
. Correct serialization depends on the endianness (little-endian or big-endian) of the array. Where did you get it? Does it represent any sensible string? You can try it anyway. If you face a problem, post a valid data sample and I'll take a look.
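For instance, if your array really holds UTF-16 code units, the conversion might look like this (a sketch; the sample values are made up, and Encoding.Unicode expects the bytes in little-endian order, so use Encoding.BigEndianUnicode for the other byte order):

```csharp
using System;
using System.Text;

class UnitsToString
{
    static void Main()
    {
        // Hypothetical input: 16-bit values, one UTF-16 code unit each.
        // 0xD834/0xDD1E is the surrogate pair for U+1D11E.
        ushort[] units = { 0x0048, 0x0069, 0xD834, 0xDD1E };

        // Serialize the 16-bit units into bytes, low byte first,
        // to match Encoding.Unicode (UTF-16LE).
        byte[] bytes = new byte[units.Length * 2];
        for (int i = 0; i < units.Length; i++)
        {
            bytes[2 * i]     = (byte)(units[i] & 0xFF);
            bytes[2 * i + 1] = (byte)(units[i] >> 8);
        }

        string result = Encoding.Unicode.GetString(bytes);
        Console.WriteLine(result.Length); // 4 char values, 3 real characters
    }
}
```

Packing the bytes explicitly, instead of block-copying the array, keeps the result independent of the byte order of the machine the code runs on.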
See:
http://unicode.org/,
http://unicode.org/faq/utf_bom.html,
http://msdn.microsoft.com/en-us/library/system.text.encoding.aspx.
—SA