Stripping Accents from Latin Characters: A Foray into Unicode Normalization

Evan Stein

Mar 20, 2006

Apache

How to turn accented characters into ASCII for search purposes.

Flattened French!

Introduction

We've come a long way from the world of 7-bit ASCII. In the beginning, the language of the PC was most definitely English, with a smattering of hard-to-find characters for a few Western European languages. Now the PC works in just about any language. The Unicode Consortium did some stunning work reconciling all of the many disparate standards, and Unicode is now part of the PC's basic plumbing. The .NET framework uses Unicode from the ground up, and its programming languages keep improving the ways they handle it. Representing language will never be completely trouble-free or automatic, but this is as good as it's ever been.

Now that people can spell their names correctly and write in their own languages, the opposite problem comes up – how to normalize all the different (correct) spellings. The first question is: Why would you do such a crazy thing after people have gone through all the trouble to put in the correct accents? There's only one admissible answer, but that one answer is a very important one: "Searching".

Divided by a Common Alphabet

The Latin alphabet is used for many languages. In order to adapt it to individual languages, letters had to be altered and diacritics added. While the basic Latin alphabet has only 26 letters, there are around 900 variations. It's less of a problem when a program works in one language alone, but when a program coordinates multiple languages, something has to be done. For instance, a French keyboard doesn't have an "a" with a tilde, but the French customer service person still needs to find the Portuguese client whose name has that character. It's great to show your customers respect, but you'll still annoy them if you lose their data.

What we need to do is to "normalize" the data, so that we're comparing apples to apples. And to do that, we need to find a basis that everyone has in common. So, as far as we've come with Unicode, it looks like ASCII isn't quite dead yet. ASCII – the new common denominator! What we want to do is to strip out the diacritics, come up with an ASCII string, and use the result for searching. That way, everyone has an equal chance of finding their data.

This technique is purposefully "lossy", so we don't want to replace the correct values with simpler ones. Rather, we want to use these values alongside the originals, internally, out of the end-user's sight. Behind-the-scenes isn't a bad thing, since what we're about to do to the text may cause the end-users to worry. For instance, in real German, the name "Händel" resolves to "Haendel". By stripping diacritics, it resolves to "Handel". No, it's not real German, but that's not our goal.

How to Lose Your Accent in a Hurry, and Find Yourself in the Process

Our aim is to be language-independent, and to get the same data out as the data that went in, without special keyboards. This is a practical issue, not a scholarly one. The good news is that in true democratic fashion, every language gets fractured equally, and also … we have the original text anyway. Nothing's lost, and data is found.

There are two cardinal rules in searching:

1. The search expression must match the data being searched. This means that you should use the same function to prepare the search expression as you use on the data, or else you won't find what you're looking for. Consistency wins over everything, especially correctness.
2. Avoid doing expensive conversions at runtime. Unless it's unavoidable, you don't want to go through an entire table, convert each value, and then compare it. The lights will go dim! The best place to do your thinking is at design time, and a "search value" column alongside the well-spelled column will make life easy. This also enables you to use normal database indexes for searching, which really makes a difference.

Since you only need to calculate values when they change, the best place to normalize strings is at data entry time. During data entry, most of the clock cycles are spent waiting for the user to type something, so there are very few calculations at this stage that are too expensive. Besides stripping the accents, you can also convert everything to upper or lower case, which solves the more mundane issues of normalization.
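The two rules can be sketched roughly as follows. This is an illustration only: `CustomerIndex`, `SearchKey`, `Add`, and `Find` are names made up for the example, the dictionary stands in for an indexed "search value" column, and `SearchKey` is a deliberately crude placeholder (it simply drops everything that isn't ASCII) for the functions developed later in the article.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical sketch: an in-memory stand-in for a table that has a
// well-spelled column plus an indexed "search value" column.
public static class CustomerIndex
{
    // search value -> original spellings that share it
    private static readonly Dictionary<string, List<string>> bySearchKey =
        new Dictionary<string, List<string>>();

    // Rule 1: the ONE function used for both stored data and search text.
    // Crude placeholder: decompose, keep 7-bit ASCII only, upper-case it.
    public static string SearchKey(string text)
    {
        StringBuilder key = new StringBuilder(text.Length);
        foreach (char ch in text.Normalize(NormalizationForm.FormKD))
        {
            if (ch < 128)
                key.Append(char.ToUpperInvariant(ch));
        }
        return key.ToString();
    }

    // Rule 2: compute the search value once, at data entry time.
    public static void Add(string name)
    {
        string key = SearchKey(name);
        List<string> names;
        if (!bySearchKey.TryGetValue(key, out names))
        {
            names = new List<string>();
            bySearchKey[key] = names;
        }
        names.Add(name);
    }

    // At search time, prepare the expression the same way, then do a
    // plain keyed lookup. No stored value is converted here.
    public static List<string> Find(string typed)
    {
        List<string> names;
        return bySearchKey.TryGetValue(SearchKey(typed), out names)
            ? names
            : new List<string>();
    }
}
```

With this in place, `CustomerIndex.Add("Jo\u00E3o")` stores "João" under the key `JOAO`, and `CustomerIndex.Find("joao")` hands back the original spelling without the user ever typing a tilde.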

Decomposition

To co-opt a stupid joke, "Händel's not composing any more. Now he's decomposing." If Mr. Händel wants the umlaut out of his name, that's exactly what he'll need to do – decompose. Unicode has a concept of composition, which means that we combine several simple characters to get a single complex character. Unicode has the opposite concept as well. You can view a complex character as one character, or, you can view it as the combination of several simple ones. "Decomposition" is what we need to get back the simple ASCII characters we're looking for.

There's a bit of complexity, and there are multiple flavors of decomposition. All of this is covered in a set of scholarly papers by the Unicode Consortium (www.unicode.org). To cut to the chase, we want the most granular form of decomposition. The other thing we need to know is that the main characters (the ASCII characters, that is), are the most significant, and are guaranteed to come at the beginning of the decomposed sequence. It turns out that our work is easy.
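To see what decomposition actually produces, a small helper along these lines can list the pieces of a decomposed string. `DecompositionDemo` and `CodeUnits` are names invented for this illustration; `string.Normalize` and `NormalizationForm.FormKD` are the framework calls that Option One below relies on.

```csharp
using System;
using System.Text;

public static class DecompositionDemo
{
    // Returns the UTF-16 code units of the FormKD (compatibility
    // decomposition) of the text, formatted like "U+0061 U+0308".
    public static string CodeUnits(string text)
    {
        string decomposed = text.Normalize(NormalizationForm.FormKD);
        StringBuilder result = new StringBuilder();
        foreach (char ch in decomposed)
        {
            if (result.Length > 0)
                result.Append(' ');
            result.Append("U+").Append(((int)ch).ToString("X4"));
        }
        return result.ToString();
    }
}
```

For "ä", `DecompositionDemo.CodeUnits("\u00E4")` gives `U+0061 U+0308`: the ASCII "a" comes first, followed by the combining diaeresis.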

Two approaches

I'll present two approaches here. The easiest approach is to use the built-in string normalization function, string.Normalize, which is new in .NET 2.0. This function closely parallels the normalization functions in Java, and is a great addition to the framework. The idea is to take each character of a string, put it through the strongest decomposition, and then extract the ASCII characters. There are three things we need to be aware of:

Some characters, like the digraph "dz", break down to a "d" and a "z", so we have to test for multiple ASCII characters in a loop.
Other characters, like the combined "AE" character, don't break down any further. The Unicode Consortium made some hard choices, and couldn't possibly please all the people all of the time. For our purposes, then, we have the choice of keeping the character as-is, or straining it out because it isn't an ASCII character. The code below keeps it, because its normalized form is still a single character. You may have a different opinion.
If there are embedded characters from other sets, such as Greek or Chinese, we want to leave them alone completely. They're not ASCII, they'll never be ASCII, and we can only cause mischief if we decompose them. We can run non-ASCII characters through transliteration routines to make them ASCII, but that's a story for another day.
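The three situations can be told apart with a quick check like the one below. It is only a sketch for exploring individual characters; `DecompositionCases` and `Classify` are hypothetical names, and the method assumes it is handed an ordinary character rather than half of a surrogate pair.

```csharp
using System;
using System.Text;

public static class DecompositionCases
{
    // Reports which of the three situations a single character falls into.
    // Assumes c is not a surrogate (Normalize rejects unpaired surrogates).
    public static string Classify(char c)
    {
        string decomposed = c.ToString().Normalize(NormalizationForm.FormKD);

        int asciiCount = 0;
        foreach (char part in decomposed)
        {
            if (part < 128)
                asciiCount++;
        }

        if (decomposed.Length == 1)
            return "stays a single character";
        if (asciiCount == 0)
            return "decomposes, but not to ASCII";
        return "decomposes to " + asciiCount + " ASCII character(s)";
    }
}
```

The digraph "ǳ" (`'\u01F3'`) reports two ASCII characters, "æ" (`'\u00E6'`) stays a single character, and the Greek "ά" (`'\u03AC'`) decomposes without yielding any ASCII at all.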

Option One

// Requires "using System.Text;" for StringBuilder and NormalizationForm.
private string LatinToAscii(string InString)
{
    StringBuilder newString = new StringBuilder(InString.Length);

    for (int i = 0; i < InString.Length; i++)
    {
        // Keep surrogate pairs together and pass unpaired surrogates
        // through untouched: Normalize() throws on half a pair.
        int length = char.IsSurrogatePair(InString, i) ? 2 : 1;
        string original = InString.Substring(i, length);
        i += length - 1;
        if (length == 1 && char.IsSurrogate(original[0]))
        {
            newString.Append(original);
            continue;
        }

        string charString = original.Normalize(NormalizationForm.FormKD);
        // If the character doesn't decompose, leave it as-is
        if (charString.Length == 1)
            newString.Append(charString);
        else
        {
            int charsCopied = 0;
            foreach (char ch in charString)
            {
                // If the char is 7-bit ASCII, add
                if (ch < 128)
                {
                    newString.Append(ch);
                    charsCopied++;
                }
            }
            /* If we've decomposed non-ASCII, give it back
             * in its entirety, since we only mean to decompose
             * Latin chars.
            */
            if (charsCopied == 0)
                newString.Append(original);
        }
    }
    return newString.ToString();
}

The advantage of this code is that it's short, simple, and largely built-in, so this is the code to use if your needs are simple and the output doesn't cause you any trouble. You should test the output, of course, before putting the code into production!

One thing to note is that the decomposition here is very conservative. The Unicode Consortium had nobler things in mind than the cheap-and-nasty job that we're doing here. For instance, the combined character "æ" stays combined, since that character has its own identity. Similarly, the Scandinavian "Ø" stays as-is, since to Scandinavians, it's not an "O with a stroke" – it's an "Ø". However, basic accents are taken care of, and we have a working function.

If none of the drawbacks bother you (after testing!), consider the job done. If you need to take care of other characters, obviously, we need to do some more work. You could adapt the code to test specific characters which don't decompose, and if you only have a few exceptions, that's easy. If you have strong opinions (and a lot of them), the code will need some organization. You could either use a switch/case statement (which could get bulky), or a collection such as a hash table.
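For a handful of exceptions, a small lookup consulted before normalization may be all that's needed. The sketch below shows one way to arrange that. The class name, the method name, and the substitutions themselves ("æ" to "ae", "ø" to "o", "ß" to "ss") are choices made for this example; they are opinions, not part of Unicode and not taken from the Option Two table.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class AccentStripper
{
    // Hand-picked replacements for characters Unicode does not decompose.
    // These particular mappings are assumptions; adjust them to taste.
    private static readonly Dictionary<char, string> exceptions =
        new Dictionary<char, string>
        {
            { '\u00E6', "ae" },  // æ
            { '\u00C6', "AE" },  // Æ
            { '\u00F8', "o" },   // ø
            { '\u00D8', "O" },   // Ø
            { '\u00DF', "ss" }   // ß
        };

    public static string Strip(string text)
    {
        StringBuilder result = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            string replacement;
            if (exceptions.TryGetValue(c, out replacement))
            {
                result.Append(replacement);   // our own opinion wins
                continue;
            }
            if (char.IsSurrogate(c))
            {
                result.Append(c);             // leave non-BMP text alone
                continue;
            }

            // Otherwise fall back to decomposition, keeping ASCII parts.
            int lengthBefore = result.Length;
            foreach (char part in c.ToString().Normalize(NormalizationForm.FormKD))
            {
                if (part < 128)
                    result.Append(part);
            }
            // Nothing ASCII came out: give the character back unchanged.
            if (result.Length == lengthBefore)
                result.Append(c);
        }
        return result.ToString();
    }
}
```

With these mappings, `AccentStripper.Strip("\u00C6r\u00F8sk\u00F8bing")` turns "Ærøskøbing" into "AEroskobing", while ordinary accents such as the one in "Händel" still go through normalization.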

Option Two takes the hash table approach, and loads the entire Latin character set with its decomposition values. Rather than figuring out the decomposition at runtime, we decide what it is at compile time, and then simply look up the value. We suffer a bit when we load the table, and we gain a bit when we do the lookups. The table is static, and it's only loaded once.

Option Two is a much more labor-intensive solution, though, of course, the labor has already been done, and it's presented to you here. If you have strong opinions and lots of them, this is the approach to take. And if you haven't upgraded to Visual Studio 2005, this is the only approach to take, since you don't have normalization yet. Option Two is more complete, and can also be tweaked if you don't like the output as it is.

The table was built as follows:

The Unicode database (which is a text file) was loaded into Oracle, with composition sequences.
A program was written to write a program, specifying Unicode characters and their most granular conversions. Composite characters can contain other composites, so a bit of iteration was necessary.
Where 100%-correct Unicode wasn't indicated, the values were hand-corrected using PDF files from the Unicode Consortium. Substitutions for the weirdest characters were done by appearance.
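The iteration mentioned in the second step can be pictured with a short recursive expansion. This is not the generator that was actually used (that program isn't shown here); it is a sketch that assumes the single-step decompositions have already been read into a dictionary keyed by code point. `DecompositionExpander`, `Expand`, and the `steps` parameter are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

public static class DecompositionExpander
{
    // Fully expands a code point using a map of single-step
    // decompositions, following composites that contain other composites.
    public static List<int> Expand(int codePoint, IDictionary<int, int[]> steps)
    {
        List<int> result = new List<int>();
        int[] parts;
        if (!steps.TryGetValue(codePoint, out parts))
        {
            result.Add(codePoint);   // nothing further to break down
            return result;
        }
        foreach (int part in parts)
            result.AddRange(Expand(part, steps));
        return result;
    }
}
```

For example, U+01D6 ("ǖ") decomposes in one step to U+00FC ("ü") plus a combining macron, and U+00FC in turn decomposes to "u" plus a combining diaeresis. Given those two entries, `Expand(0x01D6, steps)` yields U+0075, U+0308, U+0304, with the ASCII letter at the front.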

The code for this approach is even simpler than Option One. The first time the function is used, a static Hashtable is loaded with values. Making the data static means that we only have to load it once. Otherwise our performance would be atrocious.

After the initial load, every character gets a lookup. We check whether the item is in the Hashtable first, because the Hashtable indexer returns null for a missing key rather than raising an error, and the character would silently drop out of the result. If the item isn't in our table, we return it as-is, since it's outside our domain anyway.

In theory, this code should be faster, both for the static data and the pre-calculated mappings, though it seems to average about the same as Option One. In any case, performance shouldn't be a huge issue, since we're using the code intelligently ;-)

Option Two

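// Requires "using System.Text;" for StringBuilder.
// mCharacterTable (a static Hashtable) and InitializeCharacterTable()
// are not shown in this listing.
public static string LatinToAscii(string InString)
{
    StringBuilder returnString = new StringBuilder(InString.Length);

    // Load the lookup table on first use only
    if (mCharacterTable == null)
        InitializeCharacterTable();

    for (int i = 0; i < InString.Length; i++)
    {
        string ch = InString.Substring(i, 1);
        // Characters outside the table pass through unchanged
        if (!mCharacterTable.Contains(ch))
            returnString.Append(ch);
        else
            returnString.Append(mCharacterTable[ch]);
    }
    return returnString.ToString();
}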
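The listing depends on a table and a loader that aren't reproduced above. As a rough picture of the shape they could take, here is a cut-down sketch with a four-entry table. The class name `LatinTableSketch` is made up, the entries are illustrative only (the "æ" mapping in particular is a matter of opinion), and the real table covers the whole Latin set.

```csharp
using System;
using System.Collections;
using System.Text;

public static class LatinTableSketch
{
    // Loaded once, on first use, and shared by every call afterwards.
    private static Hashtable mCharacterTable;

    // Illustrative entries only; a real table would hold hundreds.
    private static void InitializeCharacterTable()
    {
        Hashtable table = new Hashtable();
        table.Add("\u00E4", "a");    // ä
        table.Add("\u00E9", "e");    // é
        table.Add("\u00E6", "ae");   // æ (an assumed substitution)
        table.Add("\u01F3", "dz");   // ǳ
        mCharacterTable = table;
    }

    public static string LatinToAscii(string inString)
    {
        if (mCharacterTable == null)
            InitializeCharacterTable();

        StringBuilder result = new StringBuilder(inString.Length);
        foreach (char c in inString)
        {
            string key = c.ToString();
            // The Hashtable indexer returns null for a missing key,
            // so fall back to the original character in that case.
            object mapped = mCharacterTable[key];
            result.Append(mapped == null ? key : (string)mapped);
        }
        return result.ToString();
    }
}
```

Here a single indexer call replaces the separate `Contains` check, with the null result handled explicitly. Note that the lazy load is not synchronized, so a multi-threaded program would want to fill the table in a static constructor or behind a lock.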

Conclusion

I hope all of this is useful. All of our differences are so much easier to celebrate once we find we have something in common - even if it's only ASCII!