Hello,

I am Lakshman.

I have a string array of hexadecimal values.
For example: {0x0, 0x31, 0xef, .....}

I want to identify the single-byte and multibyte characters in it.

How can I do this?

Thanks in advance.

Regards,

Lakshman

1 solution

There is no "multibyte character" concept in .NET. .NET characters use the UTF-16LE Unicode encoding, which is built from 16-bit code units. Sixteen bits are not enough to represent all Unicode code points; a single char covers only the BMP (Basic Multilingual Plane), which holds code points from 0 to 0xFFFF, inclusive.

What about the other Unicode code points, those above the BMP? They are supported only at the level of strings, not individual characters. This is pretty hard to explain. In UTF-16, such code points are represented by surrogate pairs. Only the UTF-16 encodings use them, but Unicode dedicates a special range of code points (U+D800 to U+DFFF) to surrogates, and this range must not be used for any "real" code points. Each 16-bit word of a surrogate pair is not a real code point; it only occupies the position of a code point that does not exist as a real character. At the level of a string, a pair is treated as a single character (for example, it is correctly rendered on screen as one glyph). This support was introduced in one of the Windows 2000 service packs.
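The surrogate arithmetic described above can be sketched in a few lines. This is a Python illustration rather than .NET code, and the function name `to_surrogate_pair` is made up for the example; it only shows how one code point above the BMP maps onto two 16-bit words taken from the reserved range.

```python
def to_surrogate_pair(code_point):
    """Split a code point above the BMP into (high, low) UTF-16 surrogates.

    Hypothetical helper for illustration; not part of any library.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP need a surrogate pair")
    offset = code_point - 0x10000       # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits    -> U+D800..U+DBFF
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits -> U+DC00..U+DFFF
    return high, low


high, low = to_surrogate_pair(0x1F600)
print(hex(high), hex(low))  # prints: 0xd83d 0xde00
```

Neither 0xD83D nor 0xDE00 is a character on its own; only the pair, read together, identifies U+1F600.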

At the level of characters, there are no 4-byte characters. This means that if you traverse a string character by character, some of the char values you get may not be "real" characters but halves of a surrogate pair, so such traversal should be avoided, at least for languages that use code points above the BMP. You should not break surrogate pairs, but the char type lets you do so, which can cause problems.
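If the goal is to tell which 16-bit words stand alone and which belong to a pair, a traversal that never splits a pair could look roughly like the sketch below. It is written in Python as an illustration and assumes the input is a list of 16-bit UTF-16 code units; `group_code_units` is a hypothetical name. In .NET, checks such as char.IsHighSurrogate and char.IsLowSurrogate play the role of the range tests used here.

```python
def group_code_units(units):
    """Group 16-bit UTF-16 code units into (code_point, unit_count) tuples.

    Hypothetical helper: unit_count is 1 for a BMP character and 2 for a
    surrogate pair. An unpaired surrogate raises ValueError.
    """
    result = []
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # High surrogate: it must be followed by a low surrogate.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                low = units[i + 1]
                code_point = 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
                result.append((code_point, 2))
                i += 2
                continue
            raise ValueError("unpaired high surrogate at index %d" % i)
        if 0xDC00 <= unit <= 0xDFFF:
            raise ValueError("unpaired low surrogate at index %d" % i)
        result.append((unit, 1))  # ordinary BMP code unit
        i += 1
    return result


# Made-up sample: 'A', a surrogate pair for U+1F600, then '1'.
print(group_code_units([0x0041, 0xD83D, 0xDE00, 0x0031]))
# prints: [(65, 1), (128512, 2), (49, 1)]
```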

To address these problems, the type System.Text.Encoding was designed. You can serialize any string to an array of bytes (not characters!) using the GetBytes methods, or deserialize an array of bytes into a string using the GetString method.
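The string-to-bytes round trip can be illustrated as follows. The sketch uses Python's `str.encode` and `bytes.decode` as stand-ins; in .NET the counterparts would be Encoding.GetBytes and Encoding.GetString. The sample text is made up for the example.

```python
# Sample text: one ASCII letter, one accented letter, one code point above the BMP.
text = "A\u00e9\U0001F600"

# Serialize to bytes as little-endian UTF-16.
data = text.encode("utf-16-le")
print(len(data))  # prints: 8  (2 + 2 bytes for the BMP characters, 4 for the pair)

# Deserialize the bytes back into a string.
restored = data.decode("utf-16-le")
print(restored == text)  # prints: True
```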

To solve your problem, you need to know what your array represents. What you show is not a "string array"; it is an integer array. For integers, there is no such thing as "hex" or "decimal"; those are only notations. If this is an array of 16-bit integers and each element represents a character or a surrogate, you can probably serialize it into an array of bytes and then deserialize that into a string using System.Text.Encoding.GetString. Correct serialization depends on the "endianness" (little-endian or big-endian) of the array. Where did you get it? Does it represent any sensible string? You can try it anyway. If you run into a problem, post a valid data sample and I'll take a look.
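Here is a rough Python sketch of that serialize-then-deserialize step, showing why the byte order has to match. It assumes the array holds 16-bit code units; the sample values and the helper name `units_to_bytes` are invented for illustration and are not the data from the question. In .NET, Encoding.Unicode (little-endian) and Encoding.BigEndianUnicode would be the matching encodings.

```python
import struct


def units_to_bytes(units, big_endian=False):
    """Pack 16-bit integers into bytes with the chosen byte order.

    Hypothetical helper; each value must fit in 0..0xFFFF.
    """
    fmt = (">" if big_endian else "<") + str(len(units)) + "H"
    return struct.pack(fmt, *units)


units = [0x0048, 0x0069]  # made-up sample: the code units of "Hi"

le_bytes = units_to_bytes(units)                    # 48 00 69 00
be_bytes = units_to_bytes(units, big_endian=True)   # 00 48 00 69

print(le_bytes.decode("utf-16-le"))  # prints: Hi
print(be_bytes.decode("utf-16-be"))  # prints: Hi

# Decoding with the wrong byte order does not fail, it just yields other characters.
wrong = le_bytes.decode("utf-16-be")
print([hex(ord(c)) for c in wrong])  # prints: ['0x4800', '0x6900']
```

If the decoded text looks like unrelated CJK or symbol characters, a byte-order mismatch is one of the first things to check.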

See:
http://unicode.org/
http://unicode.org/faq/utf_bom.html
http://msdn.microsoft.com/en-us/library/system.text.encoding.aspx

—SA
 
This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


