As I understand it, .Net represents 32-bit characters using a pair (or "surrogate pair") of 16-bit characters. However, I haven't been able to find any functions that treat such a pair as a single character. For example, Windows Forms can display this surrogate pair as a single character:
C#
// This displays the character as I expect.
MessageBox.Show(char.ConvertFromUtf32(int.Parse("2A601", NumberStyles.HexNumber)));

However, when I get the length of that string, it is 2 (I would expect it to be 1):
C#
// Shows 2 rather than 1.
MessageBox.Show(char.ConvertFromUtf32(int.Parse("2A601", NumberStyles.HexNumber)).Length.ToString());

Also, when I get the first character, some block character is shown rather than the character I expect:
C#
// Shows �� rather than 𪘁.
MessageBox.Show(char.ConvertFromUtf32(int.Parse("2A601", NumberStyles.HexNumber)).Substring(0, 1));

FYI, you may need something installed to see the special characters above, but you should get the point even if you can't see them.
Basically, I would like to know if there are any string functions to handle surrogate pairs properly (e.g., index them correctly, count them as a single character rather than two). Or, if I'm looking at the concept of surrogate pairs wrong, feel free to correct me.
Comments
AspDotNetDev 16-Feb-11 17:50pm    
Note that the editor is messing something up and the two blocks shown after "Shows " (in the code comment) should only be one block.
Sergey Alexandrovich Kryukov 16-Feb-11 18:29pm    
Good question, my 5.
--SA

"As I understand it, .Net represents 32-bit characters using a pair (or "surrogate pair") of 16-bit characters"

.Net uses UTF-16, and you may find these interesting:
http://www.unicode.org/notes/tn12/

and http://www.yoda.arachsys.com/csharp/unicode.html

Libraries like ICU (http://site.icu-project.org/) take surrogate pairs into account, using an iterator approach, while .Net seems to treat UTF-16 as if it were UCS-2, with every 16-bit char counted as a character. I suspect that the underlying OS features implement and use UTF-16 more in line with the standard.

As SAKryukov mentions, UnicodeEncoding actually takes these things into account, but the usual practise is to consider only the length of the string in chars. That tends to work out nicely anyway, unless you are doing character-by-character processing.
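For character-by-character processing, the iterator approach mentioned above can be approximated in .NET with the static helpers on System.Char. The following is only a minimal sketch; the class and method names (CodePointWalker, EnumerateCodePoints) are made up for illustration and are not part of the framework.

```csharp
using System;
using System.Collections.Generic;

public static class CodePointWalker {

    // Yields one int per Unicode code point, consuming two chars
    // whenever a valid surrogate pair is found.
    public static IEnumerable<int> EnumerateCodePoints(string text) {
        for (int i = 0; i < text.Length; i++) {
            if (char.IsSurrogatePair(text, i)) {
                yield return char.ConvertToUtf32(text, i);
                i++; // skip the low surrogate that was just consumed
            } else {
                // BMP character; an orphaned surrogate is passed through as-is
                yield return text[i];
            }
        }
    }
}
```

With the string from the question placed between two ASCII letters, this yields three values, although the string's Length is 4.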

To get more than a box, you need to use a font that supports the characters you want to display.

Regards
Espen Harlinn
 
Comments
Sergey Alexandrovich Kryukov 16-Feb-11 21:56pm    
Oh, very useful reading and library!
Thank you, my 5.

Now, you are certainly right about the usual practice: working with character indexing can break integrity at the level of code points. In practice, this is not a problem if you work with proper care. For example, you can get an invalid substring of a string S if S contains a character above the BMP, because you can cut it in the middle of that character. This happens if you take a randomly positioned slice. Such randomness can be avoided. For example, if you split a string by some delimiter(s), the result is always valid.

--SA
Espen Harlinn 17-Feb-11 6:44am    
Thank you SAKryukov!
Sergey Alexandrovich Kryukov 16-Feb-11 21:57pm    
I suggest OP accepts this as an answer, too.
--SA
JF2015 17-Feb-11 0:56am    
Good additional answer. 5+
Espen Harlinn 17-Feb-11 6:44am    
Thank you JF2015!
There are a number of issues here. You do not need to add support for surrogate pairs yourself; the OS supports them automatically (Windows 2000 needs a tweak to support them; later versions of Windows come with surrogate support built in).

The notion of a surrogate pair is relevant only to the two UTF-16 encodings (UTF-16LE and UTF-16BE). UTF-32 represents characters beyond the BMP (Basic Multilingual Plane) directly, and UTF-8 represents them through its own multi-byte algorithm. In application memory, UTF-16LE is used, and the char type does not really represent a Unicode code point: some code points are represented as two chars, as you correctly point out, so some care is needed when indexing characters (see below).
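To make the pair encoding concrete, here is the UTF-16 arithmetic written out as a small sketch. The helper name SurrogateMath.ToSurrogatePair is hypothetical; in real code the built-in char.ConvertFromUtf32 does the same job.

```csharp
using System;

public static class SurrogateMath {

    // Splits a code point above the BMP into its high and low surrogate,
    // following the UTF-16 encoding rules.
    public static char[] ToSurrogatePair(int codePoint) {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF)
            throw new ArgumentOutOfRangeException("codePoint");
        int offset = codePoint - 0x10000;               // 20-bit value
        char high = (char)(0xD800 + (offset >> 10));    // top 10 bits
        char low = (char)(0xDC00 + (offset & 0x3FF));   // bottom 10 bits
        return new char[] { high, low };
    }
}
```

For U+2A601 the offset is 0x1A601, so the high surrogate is 0xD800 + 0x69 = 0xD869 and the low surrogate is 0xDC00 + 0x201 = 0xDE01.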

One can use characters above the BMP in the UI directly, without any re-coding. The text should be placed in XML resources. Because an XML file can declare UTF-8 as its encoding, anyone can type such text directly using any editor capable of saving data in UTF-8 format:

XML
<?xml version="1.0" encoding="utf-8"?>


The XML file is embedded as a resource in the .NET assembly; at run time, the text is loaded and converted into the UTF-16 in-memory representation, with code points above the BMP represented as surrogate pairs. In principle, such text can even be entered in C# code as hard-coded string literals, but I would strongly recommend avoiding that. Hard-coded string literals, even ASCII-only ones, are best avoided in code, with rare exceptions.

The biggest concern is deployment of fonts implementing code point ranges above BMP. From what I know, no such fonts are bundled with Windows. However, I tested Unicode implementation above BMP using some Open Source fonts and had no problems with them.

The mixed-size nature of character strings is reflected in the members of the abstract class System.Text.Encoding. For example, look at the following methods of this class: GetByteCount, GetBytes, GetCharCount, GetChars. They reflect the fact that there is no one-to-one correspondence between bytes and chars: GetByteCount and GetBytes accept a string or char[] parameter as input, while GetCharCount and GetChars accept a byte[].
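As a quick illustration of that lack of one-to-one correspondence, the sketch below measures the same string in chars and in bytes under three encodings. The wrapper class EncodingSizes is an invented name; the Encoding members it calls are the standard ones.

```csharp
using System;
using System.Text;

public static class EncodingSizes {

    // Returns { chars, UTF-8 bytes, UTF-16 bytes, UTF-32 bytes } for the text.
    public static int[] Measure(string text) {
        return new int[] {
            text.Length,
            Encoding.UTF8.GetByteCount(text),
            Encoding.Unicode.GetByteCount(text), // UTF-16LE
            Encoding.UTF32.GetByteCount(text)
        };
    }
}
```

For "a" followed by U+2A601 this gives 3 chars, 5 bytes in UTF-8 (1 + 4), 6 bytes in UTF-16 (2 + 4) and 8 bytes in UTF-32 (4 + 4), even though the text has only two code points.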

There is no direct access to indexing by code point, though. I would guess this is because such information is rarely used and requires a lot of redundant data (see below). Controls process surrogate pairs automatically. If necessary, anyone can build such an index in code. To do that, one needs to create a separate index map, represented, for example, as an array of integers.

Traverse the string's "characters" (in the .NET sense, not code points) in a loop, and examine every character using the predicates (static methods) System.Char.IsLowSurrogate(char), IsHighSurrogate or IsSurrogate, incrementing the code point index accordingly: by 1 per "real" character (representing a code point by itself) and by 1 per two "surrogate" characters representing a surrogate pair.
Once you have the index map, you can index a string by code points and use other functions with code point semantics.

The implementation would look like this (not tested):

C#
public class CodePointIndexer {

    public CodePointIndexer(string value) {
        this.value = value;
        indexMap = BuildCodePointMap(value);
    } //CodePointIndexer

    public string Value { get { return this.value; } }

    public char[] this[int index] { //may throw out-of-range exception
        get {
            int charIndex = this.indexMap[index]; //position of the code point's first char
            if (System.Char.IsSurrogatePair(value, charIndex))
                return new char[] { value[charIndex], value[charIndex + 1] };
            else
                return new char[] { value[charIndex] }; //BMP character (or orphaned surrogate)
        } //get this as code point
    } //this

    String value;
    int[] indexMap;

    #region implementation

    static int[] BuildCodePointMap(string source) {
        if (source == null) return null;
        if (source.Length < 1) return new int[] { };
        System.Collections.Generic.List<int> list =
            new System.Collections.Generic.List<int>();
        //one entry per code point: the index of its first char in source
        for (int charIndex = 0; charIndex < source.Length; charIndex++) {
            list.Add(charIndex);
            //a valid pair occupies two chars: skip its low surrogate
            if (System.Char.IsSurrogatePair(source, charIndex))
                charIndex++;
        } //loop
        return list.ToArray();
    } //BuildCodePointMap

    #endregion implementation

} //class CodePointIndexer


Sorry if I did not list a comprehensive set of the relevant .NET APIs; working above the BMP is quite an exotic requirement. At the same time, the methods I have already mentioned are enough to implement any Unicode computing task.

—SA
 
Comments
Espen Harlinn 16-Feb-11 18:31pm    
Great, my 5 :)
Sergey Alexandrovich Kryukov 16-Feb-11 19:06pm    
Thank you very much; sample code added.
--SA
AspDotNetDev 16-Feb-11 19:21pm    
Thanks for that. I learned a few new terms and that index map is a good idea (I was wondering to myself how one might deal with indexing considering the variable length caused by surrogate pairs). Kind of a shame that some UTF-32 type of string isn't used in .Net to fully handle code points. But what you recommend seems like a fair enough work around. +5 and accepted as answer. Thanks again!
Sergey Alexandrovich Kryukov 17-Feb-11 1:32am    
Oh, by the way, here is yet another idea for working with all code points above the BMP, which will work faster: implement UTF-32 in memory and convert from UTF-16 and back. Code points within the BMP are simply copied; other code points come as surrogate pairs, which are converted to 32-bit values according to the UTF-16 spec. All complex calculations are done in the 32-bit representation, which is converted to UTF-16 only for final presentation in the UI, that's it.
--SA
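The UTF-32 idea from the comment above could look roughly like this. It is a sketch only: Utf32Buffer and its two methods are names invented here, and the code assumes well-formed input, since char.ConvertToUtf32 throws an ArgumentException on an orphaned surrogate.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class Utf32Buffer {

    // UTF-16 string -> one int per code point, so indexing is O(1).
    public static int[] ToCodePoints(string text) {
        List<int> result = new List<int>(text.Length);
        for (int i = 0; i < text.Length; i++) {
            int codePoint = char.ConvertToUtf32(text, i); // throws on orphaned surrogates
            result.Add(codePoint);
            if (codePoint > 0xFFFF) i++; // the code point came from a pair: skip its second char
        }
        return result.ToArray();
    }

    // Code points -> UTF-16 string, for final presentation in the UI.
    public static string FromCodePoints(int[] codePoints) {
        StringBuilder builder = new StringBuilder(codePoints.Length);
        foreach (int codePoint in codePoints)
            builder.Append(char.ConvertFromUtf32(codePoint));
        return builder.ToString();
    }
}
```

The trade-off is memory: every character costs four bytes, in exchange for direct indexing and slicing by code point.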
Sergey Alexandrovich Kryukov 16-Feb-11 19:39pm    
You're welcome.

It's my pleasure to get some interesting Question, thank you for that, and for accepting my answer.
Maybe you can find/develop something better or in addition; in this case please share.
I think you have good understanding of the topic.

May I ask why you are using such "exotic" Unicode ranges? I tried all that when I was preparing a publication on Unicode support for "The Delphi Magazine" (printed in 2005), developing a method of using Borland's ANSI-only (at that time) VCL library (that was really tricky!). I asked my Chinese and Arabic colleagues for help with testing, but above the BMP I had to test by myself (Cuneiform, etc.) and found some fonts. Where do you get such fonts these days?

Good luck, call again.
--SA
If your character is beyond the BMP (and 2A601 is > 0xFFFF, i.e. decimal 173569), then your string contains a high surrogate as well as a low surrogate that together encode your code point. This means that TWO elements, i.e. TWO 16-bit words, encode ONE character. Length always returns the number of array elements, not the number of characters/code points! This is because a code point greater than 0xFFFF appears in a UTF-16 stream as a dword, and since a high and a low surrogate are TWO words, a length of 2 is appropriate. Length means "number of elements" in an array, not, as you expect, "CharCount" or "CodepointCount".

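There is a class called StringInfo (in System.Globalization) that should do the job you are looking for. It checks for surrogate pairs (and hopefully skips orphaned surrogates) and reports the number of text elements, not array elements, so a surrogate pair counts as one. Note that a text element can also be a base character together with its combining characters, so the count is not always the number of code points. Try it.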
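A minimal sketch of StringInfo applied to the cases from the question. The wrapper class TextElementDemo is a made-up name; LengthInTextElements and SubstringByTextElements are the actual StringInfo members.

```csharp
using System;
using System.Globalization;

public static class TextElementDemo {

    // Counts text elements; a surrogate pair counts as one.
    public static int CountTextElements(string text) {
        return new StringInfo(text).LengthInTextElements;
    }

    // Returns the first text element whole, never half of a surrogate pair.
    // Throws ArgumentOutOfRangeException for an empty string.
    public static string FirstTextElement(string text) {
        return new StringInfo(text).SubstringByTextElements(0, 1);
    }
}
```

For the string built from U+2A601, CountTextElements returns 1 where Length returns 2, and FirstTextElement returns both chars of the pair instead of the broken half that Substring(0, 1) produces.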

If the control that you want to display the code point in is surrogate-aware, it will decode the code point encoded in the high and low surrogate pair and query the configured font for the glyph. Be sure you have configured a font that has the proper glyph for your code point (e.g. Arial Unicode MS has many glyphs but not all).

kind regards,
yb
 
Comments
AspDotNetDev 17-Feb-14 20:46pm    
Excellent answer! The StringInfo class is exactly what I'm looking for!
yetibrain 18-Feb-14 7:50am    
Great!
yetibrain 18-Feb-14 12:46pm    
Take care with StringInfo's LengthInTextElements property! Usually it works fine: when you have surrogate pairs in your string, the string's Length will always be higher than LengthInTextElements. But if there are orphaned surrogates (just a high or low surrogate on its own and NOT part of a pair), this property counts them anyway, even though orphaned surrogates are NOT characters. This is weird, because surrogates are NOT characters; that's why they are called surrogates. The property should not count orphaned surrogates as text elements; rather, the class should count orphaned low and high surrogates separately and provide properties accordingly.
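Since StringInfo offers no such property, one possible workaround is to count the orphaned surrogates yourself. This is just a sketch under that assumption; SurrogateAudit.CountOrphanedSurrogates is a hypothetical helper, not a framework API.

```csharp
using System;

public static class SurrogateAudit {

    // Counts surrogate chars that are not part of a valid high+low pair.
    public static int CountOrphanedSurrogates(string text) {
        int orphans = 0;
        for (int i = 0; i < text.Length; i++) {
            if (char.IsSurrogatePair(text, i)) {
                i++; // valid pair: skip its low surrogate
                continue;
            }
            if (char.IsSurrogate(text[i]))
                orphans++; // a lone high or low surrogate
        }
        return orphans;
    }
}
```

A non-zero result tells you that the string is malformed and that LengthInTextElements includes elements that are not real characters.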

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


