I have an input string that contains composite Unicode characters, like:
 
"Leppӓnen" == "\x004c\x0065\x0070\x0070\x04d3\x006e\x0065\x006e"
 
I want to convert this to use the precomposed characters, i.e.:
 
"Leppänen" == "\u004C\u0065\u0070\u0070\u00E4\u006E\u0065\u006E"
 
I have tried:
 
- String.Normalize() and String.Normalize(NormalizationForm)
- kernel32.dll!WideCharToMultiByte(...)
 
My last resort will be writing a method to manually look for the normalized versions of these characters and substitute the precomposed characters, but I was hoping there was a framework or Win32 function to do this.
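For reference, here is what is going on with these two code points. The sketch below uses Python's `unicodedata` module purely for illustration (the rules come from the Unicode character database itself, not from any particular framework): both characters decompose canonically to a base letter plus U+0308 COMBINING DIAERESIS, but the base letters belong to different scripts, so no normalization form will ever map one string onto the other.

```python
import unicodedata

cyrillic_form = "Lepp\u04d3nen"   # contains U+04D3, Cyrillic small a with diaeresis
latin_form    = "Lepp\u00e4nen"   # contains U+00E4, Latin small a with diaeresis

# Both decompose to <base letter> + U+0308, but the bases differ by script:
print(unicodedata.decomposition("\u04d3"))   # "0430 0308" (Cyrillic base)
print(unicodedata.decomposition("\u00e4"))   # "0061 0308" (Latin base)

# Consequently, no normalization form equates the two strings:
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    assert unicodedata.normalize(form, cyrillic_form) != \
           unicodedata.normalize(form, latin_form)

# NFC does, however, recompose a genuinely decomposed sequence:
decomposed = "Lepp\u0061\u0308nen"           # 'a' followed by combining diaeresis
assert unicodedata.normalize("NFC", decomposed) == latin_form
```

This is why Normalize() appears to "do nothing" here: it is working as specified, but canonical equivalence never crosses the Cyrillic/Latin script boundary.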
 
If you have no idea what I'm talking about, see: http://en.wikipedia.org/wiki/Unicode_equivalence
To see the character sets I'm talking about, see: http://en.wikibooks.org/wiki/Unicode/Character_reference/0000-0FFF
Posted 31-Aug-11 8:48am
Edited 1-Sep-11 3:33am
Comments
SAKryukov at 31-Aug-11 23:16pm
Interesting question on this boring topic. One note: I don't think your example is correct in its numeric part. These two variants can hardly be an example of composite vs. precomposed, simply because they formally have the same number of code points.
I tried the two forms in a text editor -- they are rendered correctly, but not recognized as equal (maybe I should have used a different comparison). Sorry, I don't know how to automate the conversion with any ready-to-use methods. By the way, why do you need to convert them to the precomposed form? -- just curious, I've never faced such a problem.
--SA
Yvan Rodrigues at 1-Sep-11 9:00am
You're right, technically the first one is not composite, but rather a 2-byte precomposed form, whereas the second is a 1-byte precomposed form (I didn't want to make the topic even MORE boring, but you made me :). The first form is the "standard normalized" version; the second is the "legacy normalized" version. Many Western European accented characters have a legacy normalized form. These were created so that 8-bit systems had a fighting chance of easily interpreting most Unicode characters used in the West.
 
They are the same character, but you are correct that the framework (and most Unicode implementations) is not great at evaluating all forms of equivalence.
 
The reason I need this is that I'm using the iTextSharp library to generate some PDFs and these characters only render correctly if I use the 8-bit form. I believe this is actually the font's fault, not iTextSharp's, but most commercially available fonts don't seem to render the first form of the character correctly -- they just omit it.
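The "last resort" substitution table mentioned in the question can be quite small in practice. Here is a sketch in Python (the mapping table is hypothetical and covers only the characters shown; extend it for your actual input data). It also shows why the Latin form survives a single-byte code page while the Cyrillic lookalike does not, which is the root of the PDF/font problem:

```python
# Hypothetical lookalike table: Cyrillic precomposed letters mapped to
# their Latin precomposed equivalents (extend as needed).
CYRILLIC_TO_LATIN = {
    "\u04d3": "\u00e4",  # Cyrillic a with diaeresis -> Latin a with diaeresis
    "\u04d1": "\u0103",  # Cyrillic a with breve     -> Latin a with breve
}

def to_latin_precomposed(text: str) -> str:
    """Replace known Cyrillic lookalikes with Latin precomposed letters."""
    return text.translate(str.maketrans(CYRILLIC_TO_LATIN))

name = "Lepp\u04d3nen"
fixed = to_latin_precomposed(name)
assert fixed == "Lepp\u00e4nen"

fixed.encode("latin-1")          # succeeds: U+00E4 exists in Latin-1
try:
    name.encode("latin-1")       # fails: U+04D3 has no single-byte form
except UnicodeEncodeError:
    pass
```

The same table-driven approach translates directly to a `Dictionary<char, char>` plus a `StringBuilder` loop in C#.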

1 solution


Solution 1

You can try the WideCharToMultiByte function from unmanaged Windows code. Reference: http://msdn.microsoft.com/en-us/library/dd374130%28v=vs.85%29.aspx
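What the WC_COMPOSITECHECK flag does can be approximated portably: compose any base-plus-combining sequences (NFC), then encode to the target single-byte code page. A sketch in Python for illustration (assuming Windows-1252 as the target code page); note that, like WideCharToMultiByte, this composes 'a' + U+0308 into U+00E4 but will not turn the Cyrillic U+04D3 into the Latin U+00E4:

```python
import unicodedata

def compose_and_encode(text: str, codepage: str = "cp1252") -> bytes:
    """Compose combining sequences, then encode to a single-byte code page."""
    return unicodedata.normalize("NFC", text).encode(codepage)

# Decomposed input ('a' followed by a combining diaeresis) comes out
# as the single precomposed byte 0xE4 in Windows-1252:
data = compose_and_encode("Lepp\u0061\u0308nen")
assert data == b"Lepp\xe4nen"
```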
Comments
Yvan Rodrigues at 1-Sep-11 8:51am
Yeah, I tried that (see above), but if I passed CP_UTF8 and WC_COMPOSITECHECK I would get Windows error 78 (bad parameters). I also tried other code pages like 1200, 1201, and 65001, but they all resulted in the same error.

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)
