Have you ever required better control over how non-printable characters are displayed? Would you like to remove or replace non-printable characters? Would you like to do this for Unicode text? Recently, I ran into these problems.
I expected an easy solution, but found none. Windows GDI provides an unmanaged function
GetFontUnicodeRanges that does a lot of the work. I wrote a C# wrapper class,
FontGlyphSet, to do the rest.
FontGlyphSet class provides methods for replacing or removing characters that do not have representations (glyphs) in a specified font. The remainder of the article describes this class and how it was implemented. The class is freely available for use in both commercial and non-commercial applications, providing the copyright notice is retained.
Recently, I needed to remove or replace non-printable characters in one of my applications. Initially, I considered doing what I have always done...treating characters outside the range of ASCII 32 (space) through ASCII 126 (tilde) as non-printable.
The very term "non-printable" is a bit of a misnomer that doesn't apply well to modern systems. In a world where printing is increasingly rare, Unicode dominates, and a dizzying array of fonts are available, it is hopelessly outdated.
What I really needed was a way to remove or replace characters that lacked a glyph in the font I was using to draw. So, armed with an updated vocabulary, I set out to search for a solution.
I started out with high hopes that the
Font class would provide something like a
bool HasGlyph(thisChar) method. But, no such luck. The search continued.
After a little more hunting around MSDN, it seemed that two unmanaged, Windows GDI functions
SelectObject could be combined to get the information I needed.
The first of these acquires a data structure,
GLYPHSET, that includes an array of
WCRANGE structures. Each element of the array, describes a range of characters that have representations (or glyphs).
So, now it was time to build a friendly wrapper class.
The Sample Program
The sample program is easy to use. Click on the "Remove" button, and the program will remove all of the non-printable characters in the corresponding text box. Click on the "Replace" button, and non-printable characters will be replaced by the character specified as a "Replacement character". You can experiment by editing the text or changing the replacement character. Clicking the "Font" button, allows you to try out different fonts. The "Reset" button will set the text and replacement character back to their original values.
The following image shows what happens after the "Remove" and "Replace" buttons are clicked...
The next paragraph contains a brief description of the features of the sample text. If you are unfamiliar with Unicode, don't panic; some explanation and references are provided later in the article.
In the sample text, used by default, the "ã" in the word "amanhã" is actually formed from two characters: an "a" followed by a combining character for the tilde U+0303. Following this, the Unicode code point U+10000 is encoded as a surrogate pair, U+D800 and U+DC00. The other non-printable characters are all control characters with values less than U+0020 (the space character). Interestingly, in .NET 1.1, the
TextBox control appears to incorrectly render the surrogate pair as two separate characters.
Using the Code
The class is also easy to use. First, it is necessary to create a
FontGlyphSet object from the
Font used to draw the text...
FontGlyphSet = new FontGlyphSet(Font);
When replacing or removing characters with
FontGlyphSet, it is important to enable a performance enhancement. It uses approximately 8K of memory, but speeds up the
Contains method (used by every other method). If you do not do this, the code will still work, but it may take longer to search through all of separate ranges for a specific character.
FontGlyphSet.IsFastContains = true;
Then, to replace or remove characters just use the
myString = FontGlyphSet.ReplaceAbsent(myString, '.');
myString = FontGlyphSet.RemoveAbsent(myString);
FontGlyphSet class includes the following public methods...
bool Contains(char charValue);
string RemoveAbsent(string inText);
int RemoveAbsent(char  inText, int inStart, int inLength,
char  outText, int outStart);
string ReplaceAbsent(string inText, char replacement);
int ReplaceAbsent(char  inText, int inStart, int inLength,
char  outText, int outStart, char replacement);
FontGlyphSet class includes the following public properties...
FontRange  Ranges;
The first step is to import
SelectObject so that they can be used within the class...
private static extern uint GetFontUnicodeRanges (
IntPtr hdc, IntPtr lpgs);
private static extern IntPtr SelectObject (
IntPtr hdc, IntPtr hgdiobj);
The next step is to get the information from these unmanaged functions. To do this, get a device context, associate the font with it, invoke
GetFontUnicodeRanges to get the size of the
GLYPHSET structure, and then call the same function a second time to get the data for the structure...
Graphics g = Graphics.FromHwnd(IntPtr.Zero);
IntPtr hdc = g.GetHdc();
IntPtr hFont = font.ToHfont();
IntPtr savedFont = SelectObject(hdc, hFont);
uint size = GetFontUnicodeRanges(hdc, IntPtr.Zero);
IntPtr glyphSetData = Marshal.AllocHGlobal((int) size);
After this, the
Marshal.ReadInt32 methods are used to get the data from the structure. The data is loaded into an array of
FontRange objects. The
FontRange class is another class included with this solution. However, since it is unlikely one would call it directly, I will leave its documentation to the code comments.
Following this, because many of the resources are unmanaged, there is a bit of clean up required...
The removal and replacement of characters is pretty obvious with a couple of small exceptions. While debugging, I discovered that some fonts have quite a lot of character ranges. I became concerned that the lookup time to test if a character exists in one of these ranges might be prohibitive.
So, I added the
IsFastContains property. By building a
BitArray with bits for each of the 65,536 possible character values, the lookup time could be reduced considerably.
Another small problem arises with the way strings are encoded internally. They are encoded as a series of 16-bit words. The coding scheme is called UTF-16. The most commonly used characters fall in a range known as the BMP (Basic Multi-Lingual Plane). Each of these characters have a value less than 65,536 and can be encoded in a single word.
However, Unicode also allows for characters with values greater than 65,535. These values are encoded as surrogate pairs. Each 16-bit word of the pair contains half of the value. All such characters are treated as non-printable by
The problem arises during replacement of these characters. Since the Unicode character actually takes up two 16-bit words in the string, it is important to replace both words of this surrogate pair with a single character. Changes were made to
ReplaceAbsent to properly handle this situation.
Removal of characters is less problematic. Each word of the surrogate pair uses a range of values that uniquely identifies it as such (0xD800-0xDBFF and 0xDC00-0xDFFF). According to ISO/IEC 10646 these values should not be associated with valid code points. For this reason, one would not expect glyphs for these characters in a properly designed font. The code simply depends on the fact that both words of the surrogate pair will be removed independently as a consequence of their values.
FontGlyphSet class that I have implemented works sufficiently for my needs. It is certainly an improvement over the technique of removing/replacing characters outside the printable ASCII range. However, the Unicode character set is very large and my time is very limited. So, it is quite likely that the class may need further refinements. I will do my best to keep the code here up-to-date with any new discoveries.
While a long time reader, this is my first submission to Code Project. I hope others find it helpful.
GetFontUnicodeRanges (MSDN), http://msdn2.microsoft.com/en-us/library/ms533944.aspx
Mapping of Unicode characters (Wikipedia), http://en.wikipedia.org/wiki/Mapping_of_Unicode_characters
GLYPHSET (MSDN), http://msdn2.microsoft.com/en-us/library/ms533995.aspx
SelectObject (MSDN), http://msdn2.microsoft.com/en-us/library/ms533272.aspx
Unicode (Wikipedia), http://en.wikipedia.org/wiki/Unicode
UTF-16 (Wikipedia), http://en.wikipedia.org/wiki/UTF-16
WCRANGE (MSDN), http://msdn2.microsoft.com/en-us/library/ms534022.aspx