See more: C# VB.NET Unicode
As I understand it, .Net represents 32-bit characters using a pair (or "surrogate pair") of 16-bit characters. However, I haven't been able to find any functions which deal with these pairs as a single character. For example, Windows forms is capable of displaying this surrogate pair as a single character:
// This displays the character as I expect.
MessageBox.Show(char.ConvertFromUtf32(int.Parse("2A601", NumberStyles.HexNumber)));
However, when I get the length of that string, it is 2 (I would expect it to be 1):
// Shows 2 rather than 1.
MessageBox.Show(char.ConvertFromUtf32(int.Parse("2A601", NumberStyles.HexNumber)).Length.ToString());
Also, when I get the first character, some block character is shown rather than the character I expect:
// Shows �� rather than 𪘁.
MessageBox.Show(char.ConvertFromUtf32(int.Parse("2A601", NumberStyles.HexNumber)).Substring(0, 1));
FYI, you may need something installed to see the special characters above, but you should get the point even if you can't see them.
Basically, I would like to know if there are any string functions to handle surrogate pairs properly (e.g., index them correctly, count them as a single character rather than two). Or, if I'm looking at the concept of surrogate pairs wrong, feel free to correct me.
Posted 16-Feb-11 12:48pm
AspDotNetDev191.5K
Comments
AspDotNetDev at 16-Feb-11 17:50pm
   
Note that the editor is messing something up and the two blocks shown after "Shows " (in the code comment) should only be one block.
SAKryukov at 16-Feb-11 18:29pm
   
Good question, my 5.
--SA

Solution 1

"As I understand it, .Net represents 32-bit characters using a pair (or "surrogate pair") of 16-bit characters"
 
.NET uses UTF-16, and you may find this interesting:
http://www.unicode.org/notes/tn12/
 
and this: http://www.yoda.arachsys.com/csharp/unicode.html
 
Libraries like http://site.icu-project.org/ take surrogate pairs into account, using an iterator approach, while .NET seems to treat UTF-16 as UCS-2 (fixed-width 16-bit code units). I suspect that the underlying OS features implement and use UTF-16 more in line with the standard.
 
As SAKryukov mentions, UnicodeEncoding actually takes these things into account, but the usual practice seems to be to consider only the length of the string. That tends to work out nicely anyway, unless you are doing character-by-character processing.
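 
For character-by-character processing, an iterator in the spirit of the ICU approach can be sketched with the built-in System.Char helpers. This is only a minimal sketch: the class and method names are made up for illustration, and yielding an unpaired surrogate as its own value (instead of throwing) is an assumption about what the caller wants.

```csharp
using System;
using System.Collections.Generic;

public static class CodePointWalker
{
    // Yields each Unicode code point of a UTF-16 string, reading a
    // well-formed surrogate pair as one value. An unpaired surrogate
    // is yielded as its own 16-bit value (an assumption, see above).
    public static IEnumerable<int> CodePoints(string text)
    {
        for (int i = 0; i < text.Length; i++)
        {
            if (char.IsSurrogatePair(text, i))
            {
                yield return char.ConvertToUtf32(text, i);
                i++; // skip the low surrogate that was just consumed
            }
            else
            {
                yield return text[i];
            }
        }
    }
}
```

Enumerating "a\U0002A601b" this way should give three values, the middle one being 0x2A601, even though the string's Length is 4.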
 
To get more than a box, you need to use a font that supports the characters you want to display.
 
Regards
Espen Harlinn
Comments
SAKryukov at 16-Feb-11 21:56pm
   
Oh, very useful reading and library!
Thank you, my 5.
 
Now, you are certainly right about the usual practice: working with character indexing can break integrity at the level of code points. In practice, it is not a problem if you work with proper care. For example, you can get an invalid substring of a string S if S contains a character above the BMP, because you can cut it in the middle of this character. That happens if you take a randomly positioned slice. Such randomness can be avoided. For example, if you split a string by some delimiter(s), the result will always be valid.
 
--SA
Espen Harlinn at 17-Feb-11 6:44am
   
Thank you SAKryukov!
SAKryukov at 16-Feb-11 21:57pm
   
I suggest OP accepts this as an answer, too.
--SA
JF2015 at 17-Feb-11 0:56am
   
Good additional answer. 5+
Espen Harlinn at 17-Feb-11 6:44am
   
Thank you JF2015!

Solution 2

There are a number of issues here. There is no need to add support for surrogate pairs yourself: the OS supports them automatically (Windows 2000 needs a tweak to support them; later versions of Windows come with surrogate support built in).
 
The notion of a surrogate pair is relevant only to the two UTF-16 encodings (UTF-16LE and UTF-16BE); UTF-32 represents characters beyond the BMP (Basic Multilingual Plane) directly, and UTF-8 represents them with its own multi-byte algorithm. In application memory, UTF-16LE is used, and the character type does not really represent a Unicode code point: some code points are represented as two characters, as you correctly point out, so some care is needed to index characters; see below.
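 
To make the surrogate arithmetic concrete, here is a minimal sketch of how a code point above U+FFFF is split into a UTF-16 pair. The class and method names are made up for illustration; in real code char.ConvertFromUtf32 already does this, and the sketch should agree with it.

```csharp
using System;

public static class SurrogateMath
{
    // Splits a code point above U+FFFF into its UTF-16 high and low surrogates.
    public static char[] ToSurrogates(int codePoint)
    {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF)
            throw new ArgumentOutOfRangeException("codePoint");
        int offset = codePoint - 0x10000;              // 20 significant bits remain
        char high = (char)(0xD800 + (offset >> 10));   // top 10 bits
        char low = (char)(0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new[] { high, low };
    }
}
```

For U+2A601 this gives the pair D869 DE01, which is why the string in the question has a Length of 2.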
 
One can use characters above the BMP in the UI directly, without any re-coding. The text should be placed in XML resources. As XML files can declare UTF-8 encoding, anyone can type such text directly using any editor capable of saving data in UTF-8 format:
 
<?xml version="1.0" encoding="utf-8"?>
 
The XML file will be embedded as a resource in the .NET assembly; at run time, the text will be loaded and converted into the UTF-16 memory representation, with the code points above the BMP represented as surrogate pairs. In principle, such UTF-8 text can even be entered in C# code in the form of hard-coded string literals, but I would strongly recommend avoiding it. Any hard-coded string literals, even ASCII-only ones, are best avoided in code, with rare exceptions.
 
The biggest concern is deployment of fonts implementing code point ranges above BMP. From what I know, no such fonts are bundled with Windows. However, I tested Unicode implementation above BMP using some Open Source fonts and had no problems with them.
 
The mixed-size nature of a character string is reflected in the members of the abstract class System.Text.Encoding. For example, look at the following methods of this class: GetByteCount, GetBytes, GetCharCount, GetChars. They reflect the fact that there is no one-to-one correspondence between bytes and chars: GetByteCount and GetBytes accept a string or char[] parameter as input, while GetCharCount and GetChars accept a byte[].
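 
A quick way to see that mismatch is to compare the encoded sizes of the same text. The following is just a sketch with a made-up helper name; it only calls Encoding.GetByteCount on the three standard encodings.

```csharp
using System;
using System.Text;

public static class EncodingSizes
{
    // Returns the encoded size in bytes of the text, in the order
    // UTF-8, UTF-16LE, UTF-32 (no byte order mark is counted).
    public static int[] ByteCounts(string text)
    {
        return new[]
        {
            Encoding.UTF8.GetByteCount(text),
            Encoding.Unicode.GetByteCount(text), // Encoding.Unicode is UTF-16LE
            Encoding.UTF32.GetByteCount(text),
        };
    }
}
```

For the single character U+2A601, all three counts should be 4 bytes, while string.Length reports 2. For plain "a" the counts are 1, 2 and 4.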
 
There is no direct access to code point indexing, though. I would guess this is because the information is rarely needed and requires a lot of redundant data (see below). Controls process surrogate pairs automatically. If necessary, anyone can build such an index in code. To do that, one needs to create a separate index map represented by an index set, for example, as an array of integers.
 
Traverse the string's "characters" (in the .NET sense, not code points) in a loop and examine every character using the predicates (static methods) System.Char.IsLowSurrogate(char), IsHighSurrogate or IsSurrogate, advancing the code point index accordingly: by 1 per "real" character (representing a code point) or by 1 per two "surrogate" characters representing a surrogate pair.
Once you have the index map, you can index a string by code points and use other functions with code point semantics.
 
The implementation would look like this (not tested):
 
public class CodePointIndexer {
 
    public CodePointIndexer(string value) {
        this.value = value;
        indexMap = BuildCodePointMap(value);
    } //CodePointIndexer

    public string Value { get { return this.value; } }
 
    // Returns one char for a BMP code point (or an unpaired surrogate)
    // and two chars for a well-formed surrogate pair.
    public char[] this[int index] { //may throw out-of-range exception
        get {
            int charIndex = this.indexMap[index]; // UTF-16 index where the code point starts
            char start = value[charIndex];
            if (IsPairAt(value, charIndex))
                return new char[] { start, value[charIndex + 1] };
            else
                return new char[] { start };
        } //get this as code point
    } //this

    string value;
    int[] indexMap;
 
    #region implementation
 
    static bool IsPairAt(string source, int charIndex) {
        return charIndex + 1 < source.Length
            && System.Char.IsHighSurrogate(source[charIndex])
            && System.Char.IsLowSurrogate(source[charIndex + 1]);
    } //IsPairAt

    // Maps each code point index to the index of its first char in source.
    static int[] BuildCodePointMap(string source) {
        if (source == null) return null;
        if (source.Length < 1) return new int[] { };
        System.Collections.Generic.List<int> list =
            new System.Collections.Generic.List<int>();
        for (int currentIndex = 0; currentIndex < source.Length; currentIndex++) {
            list.Add(currentIndex);
            // skip the low half of a well-formed surrogate pair
            if (IsPairAt(source, currentIndex)) currentIndex++;
        } //loop
        return list.ToArray();
    } //BuildCodePointMap

    #endregion implementation
 
} //class CodePointIndexer

 
Sorry if I did not list a comprehensive set of relevant .NET APIs; working above the BMP is quite an exotic requirement. At the same time, the methods I already mentioned are enough to implement any Unicode computing task.
 
—SA
Comments
Espen Harlinn at 16-Feb-11 18:31pm
   
Great, my 5 :)
SAKryukov at 16-Feb-11 19:06pm
   
Thank you very much; sample code added.
--SA
AspDotNetDev at 16-Feb-11 19:21pm
   
Thanks for that. I learned a few new terms and that index map is a good idea (I was wondering to myself how one might deal with indexing considering the variable length caused by surrogate pairs). Kind of a shame that some UTF-32 type of string isn't used in .Net to fully handle code points. But what you recommend seems like a fair enough work around. +5 and accepted as answer. Thanks again!
SAKryukov at 17-Feb-11 1:32am
   
Oh, by the way, here is yet another idea for handling all code points above the BMP, which will work faster: implement UTF-32 in memory and convert from UTF-16 and back. Code points within the BMP are copied as is; other code points come as surrogate pairs, which you convert to 32-bit values according to the UTF-16 spec. All complex calculations are done in the 32-bit representation, which is converted to UTF-16 only for final presentation in the UI; that's it.
--SA
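 
The UTF-32 round trip described in this comment could look roughly like the sketch below. The class and method names are hypothetical, and copying an unpaired surrogate through unchanged is an assumption made here so that the round trip never loses data.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class Utf32Buffer
{
    // UTF-16 string -> array of 32-bit code points.
    public static int[] Decode(string text)
    {
        var codePoints = new List<int>(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            if (char.IsSurrogatePair(text, i))
            {
                codePoints.Add(char.ConvertToUtf32(text[i], text[i + 1]));
                i++; // the low surrogate has been consumed
            }
            else
            {
                codePoints.Add(text[i]); // BMP char (or unpaired surrogate) copied as is
            }
        }
        return codePoints.ToArray();
    }

    // Array of 32-bit code points -> UTF-16 string.
    public static string Encode(int[] codePoints)
    {
        var builder = new StringBuilder(codePoints.Length);
        foreach (int codePoint in codePoints)
        {
            if (codePoint >= 0 && codePoint <= 0xFFFF)
                builder.Append((char)codePoint);
            else
                builder.Append(char.ConvertFromUtf32(codePoint)); // throws if out of range
        }
        return builder.ToString();
    }
}
```

With this in place, an operation such as reversing a string can be done on the int[] with Array.Reverse and encoded back without splitting any pair.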
SAKryukov at 16-Feb-11 19:39pm
   
You're welcome.
 
It's my pleasure to get some interesting Question, thank you for that, and for accepting my answer.
Maybe you can find/develop something better or in addition; in this case please share.
I think you have good understanding of the topic.
 
May I ask why you are using such "exotic" Unicode ranges? I tried all that when I was preparing a publication on Unicode support for "The Delphi Magazine" (printed in 2005), developing a method of using Borland's VCL library, which was ANSI-only at that time (that was really tricky!). I asked my Chinese and Arabic colleagues to help with testing, but above the BMP I had to test by myself (Cuneiform, etc.) and found some fonts. Where do you get such fonts these days?
 
Good luck, call again.
--SA
AspDotNetDev at 17-Feb-11 0:19am
   
Actually, the reason is fairly boring. I recently posted an alternate to a tip/trick for how to determine if a string is a palindrome. I decided to use Substring() rather than the subscript to get the char because I was under the impression that Substring() handled surrogate pairs. However, after testing, I found out that wasn't the case, so I wanted to know more about how something so seemingly simple is best handled. Most (or even all) algorithms I've seen that check for palindromes will fail in the case that the string contains a surrogate pair.
 
Though the extra info you have given me will probably be of use when I make the Chinese version of the website I am working on. I'm not extremely familiar with character encodings and such, so every bit of extra info related to them helps.
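 
For the palindrome case, one possible approach is to compare text elements instead of chars. The sketch below uses StringInfo.GetTextElementEnumerator from System.Globalization; the class and method names are made up, and the comparison is ordinal and case-sensitive, which is an assumption rather than a requirement of the original tip.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class PalindromeCheck
{
    // Compares text elements from both ends, so a surrogate pair
    // is treated as one unit instead of two separate chars.
    public static bool IsPalindrome(string text)
    {
        var elements = new List<string>();
        TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(text);
        while (enumerator.MoveNext())
            elements.Add(enumerator.GetTextElement());

        for (int i = 0, j = elements.Count - 1; i < j; i++, j--)
        {
            if (!string.Equals(elements[i], elements[j], StringComparison.Ordinal))
                return false;
        }
        return true;
    }
}
```

A char-by-char check on a string consisting of just U+2A601 compares the high surrogate with the low surrogate and reports "not a palindrome"; the version above sees a single element and accepts it.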
SAKryukov at 17-Feb-11 2:08am
   
Ha! That's funny. Do you speak Chinese? Mandarin? Are code ranges above the BMP used in Chinese? I thought not.
--SA
AspDotNetDev at 17-Feb-11 12:47pm
   
Nope, I only speak English, but I'm trying to learn Spanish. It will be quite a few years before I even consider trying to learn Chinese.
SAKryukov at 17-Feb-11 19:56pm
   
:-) It's funny that my little daughter started to learn a bit of Mandarin for fun, so I remember a few words :-) So far I speak two.
--SA
JF2015 at 17-Feb-11 0:56am
   
Very detailed answer. Have another 5
SAKryukov at 17-Feb-11 1:15am
   
Thank you.
--SA

Solution 3

If your character is beyond the BMP (and 2A601 is > 0xFFFF, i.e. decimal 173569), then your string will contain a high surrogate as well as a low surrogate that together encode your code point. This means that TWO elements, i.e. TWO 16-bit words, encode ONE character. Length always returns the number of array elements, not the number of characters/code points! This is because a code point greater than 0xFFFF appears in a UTF-16 stream as a dword. Because a high and a low surrogate are TWO words, a length of 2 is appropriate. Length means "number of elements" in an array, not, as you expect, "CharCount" or "CodepointCount".
 
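There is a class called StringInfo (in System.Globalization) that should do the job you are looking for. It treats a surrogate pair as a single text element (and hopefully skips orphaned surrogates) and returns the number of text elements, not array elements. Note that a text element can also be a base character together with its combining characters, so the count is not strictly a code point count. Try it.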
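 
As a rough sketch of how that looks in practice, the two helpers below wrap StringInfo.LengthInTextElements and StringInfo.SubstringByTextElements. The wrapper class and method names are made up for illustration; only the StringInfo members are real API.

```csharp
using System;
using System.Globalization;

public static class TextElements
{
    // Number of text elements; a surrogate pair counts once.
    public static int Count(string text)
    {
        return new StringInfo(text).LengthInTextElements;
    }

    // First text element; throws ArgumentOutOfRangeException for an empty string.
    public static string First(string text)
    {
        return new StringInfo(text).SubstringByTextElements(0, 1);
    }
}
```

Applied to the string from the question, Count should return 1 where Length returns 2, and First should return the whole character where Substring(0, 1) returns only the high surrogate.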
 
If the control you want to display the code point with is surrogate-aware, it will decode the code point encoded in the high/low surrogate pair and query the configured font for the glyph. Be sure you have configured a font that has the proper glyph for your code point (e.g. Arial Unicode MS has many glyphs but not all).
 
kind regards,
yb
Comments
AspDotNetDev at 17-Feb-14 20:46pm
   
Excellent answer! The StringInfo class is exactly what I'm looking for!
yetibrain at 18-Feb-14 7:50am
   
Great!
yetibrain at 18-Feb-14 12:46pm
   
Take care with StringInfo's LengthInTextElements property! Usually it works fine: when you have surrogate pairs within your string, the string's Length will always be higher than LengthInTextElements. But if there are orphaned surrogates (just a high or low surrogate on its own and NOT part of a pair), the property counts them anyway. This is weird, because surrogates are NOT characters; that's why they are called surrogates. It would be better if orphaned surrogates were not counted as text elements; rather, the class should count orphaned low and high surrogates separately and provide properties for them.
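 
Since StringInfo offers no such property, orphaned surrogates can be counted by hand before trusting the element count. This is a minimal sketch with a made-up class and method name, built only on char.IsSurrogatePair and char.IsSurrogate.

```csharp
using System;

public static class SurrogateAudit
{
    // Counts surrogate code units that are not part of a
    // well-formed high/low pair.
    public static int CountOrphans(string text)
    {
        int orphans = 0;
        for (int i = 0; i < text.Length; i++)
        {
            if (char.IsSurrogatePair(text, i))
                i++;            // well-formed pair, skip its low half
            else if (char.IsSurrogate(text[i]))
                orphans++;      // lone high or lone low surrogate
        }
        return orphans;
    }
}
```

If you need the number of text elements without the orphans, one option is to subtract this count from LengthInTextElements, or to reject the input when the count is not zero.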

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Web02 | 2.8.141216.1 | Last Updated 18 Feb 2014
Copyright © CodeProject, 1999-2014
All Rights Reserved. Terms of Service

CodeProject, 503-250 Ferrand Drive Toronto Ontario, M3C 3G8 Canada +1 416-849-8900 x 100