
Removing Diacritics from Strings

By OriginalGriff, 25 Jun 2012
 

Introduction

One of the problems I encountered a while ago was that, when searching text, "A" will not match "À", "Á", "Ä", "Â" or indeed any other character that includes a diacritic (the printer's term for the little accent marks that sit above some characters in many languages). This means that text entered by a native German speaker may not match the same text entered by a native English speaker. This can be a pain, and it limits the usefulness of the search.
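To make the problem concrete, here is a minimal sketch of the mismatch. The class and method names (MismatchDemo, OrdinalMatch) are made up for illustration, and the accented characters are written as \u escapes so the result does not depend on the source file's encoding.

```csharp
using System;

public static class MismatchDemo
{
    // Hypothetical helper: a case-insensitive comparison that knows nothing about accents.
    public static bool OrdinalMatch(string a, string b)
    {
        return string.Equals(a, b, StringComparison.OrdinalIgnoreCase);
    }

    public static void Main()
    {
        Console.WriteLine(OrdinalMatch("A", "a"));       // True: only the case differs
        Console.WriteLine(OrdinalMatch("A", "\u00C0"));  // False: U+00C0 is "A with grave"

        // Searching for "A" inside "Ärger" finds nothing, even when case is ignored.
        Console.WriteLine("\u00C4rger".IndexOf("A", StringComparison.OrdinalIgnoreCase)); // -1
    }
}
```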

I was reminded of this when I answered a QA question on writing a regex to cope with international names...

Background

I did not write this code; it is taken (as described in the code comments) from Michael Kaplan's blog. All I did was respace it and convert it to an extension method. However, I felt it needed a wider audience than it was getting, and should be somewhere it can be found more easily by a search.

I am not going to try to describe how it works, as the original blog does that in more detail than I'd want to go into! (And probably with a lot more accuracy...)

Using the Code

Include the code in a static class of your own, or download the source and add it to your project.

// Requires these directives at the top of the file:
//     using System; using System.Globalization; using System.Text;

/// <summary>
/// Remove diacritics from a string.
/// This converts accented characters to nonaccented, which means it is
/// easier to search for matching data with or without such accents.
/// This code is from Michael Kaplan's blog:
///    http://blogs.msdn.com/b/michkap/archive/2007/05/14/2629747.aspx
/// Respaced and converted to an extension method.
/// </summary>
/// <example>
///    aàáâãäåçc
/// is converted to
///    aaaaaaacc
/// </example>
/// <param name="s">The string to process. Must not be null.</param>
/// <returns>A copy of the string with all non-spacing marks removed.</returns>
public static String RemoveDiacritics(this String s)
    {
    // FormD splits each accented character into a base character
    // followed by its combining (non-spacing) marks.
    String normalizedString = s.Normalize(NormalizationForm.FormD);
    StringBuilder stringBuilder = new StringBuilder();

    for (int i = 0; i < normalizedString.Length; i++)
        {
        Char c = normalizedString[i];
        // Keep everything except the combining marks.
        if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            {
            stringBuilder.Append(c);
            }
        }

    // Recompose whatever is left into the usual composed form.
    return stringBuilder.ToString().Normalize(NormalizationForm.FormC);
    }
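
For a quick look at what the first Normalize call does, without going into the detail the original blog covers, the following sketch prints the code units of one accented character before and after decomposition. DecompositionDemo and Describe are names invented for this illustration; only Normalize and CharUnicodeInfo.GetUnicodeCategory come from the method above.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class DecompositionDemo
{
    // Hypothetical helper: lists each UTF-16 code unit as "U+XXXX:Category".
    public static string Describe(string text)
    {
        StringBuilder sb = new StringBuilder();
        foreach (char c in text)
        {
            if (sb.Length > 0) sb.Append(' ');
            sb.AppendFormat("U+{0:X4}:{1}", (int)c, CharUnicodeInfo.GetUnicodeCategory(c));
        }
        return sb.ToString();
    }

    public static void Main()
    {
        string composed = "\u00E9"; // "e with acute" as a single character
        string decomposed = composed.Normalize(NormalizationForm.FormD);

        Console.WriteLine(Describe(composed));   // U+00E9:LowercaseLetter
        Console.WriteLine(Describe(decomposed)); // U+0065:LowercaseLetter U+0301:NonSpacingMark
    }
}
```

Once the accent is a separate NonSpacingMark character, the loop in RemoveDiacritics can simply skip it and keep the plain "e".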

You can then use the method as you would any string extension method:

string match = tbUserInput.Text.ToLower().RemoveDiacritics();
if (string.IsNullOrWhiteSpace(match))
    {
    ...
    }
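
As a rough sketch of how this might fit into an actual search, the example below folds both the stored names and the query before comparing them. SearchSketch, ToSearchKey and Find are hypothetical names, the sample names are made up, and the extension method is repeated in compact form only so that the block compiles on its own. It uses ToLowerInvariant rather than ToLower so that the result does not depend on the current culture; that is my assumption about what a search would want, not something the original code requires.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Text;

public static class SearchSketch
{
    // Same technique as RemoveDiacritics above, repeated so this sketch is self-contained.
    public static string RemoveDiacritics(this string s)
    {
        var sb = new StringBuilder();
        foreach (char c in s.Normalize(NormalizationForm.FormD))
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
                sb.Append(c);
        }
        return sb.ToString().Normalize(NormalizationForm.FormC);
    }

    // Hypothetical helper: the form in which two strings are compared.
    public static string ToSearchKey(string s)
    {
        return s.ToLowerInvariant().RemoveDiacritics();
    }

    // Returns the candidates that contain the query once case and accents are ignored.
    public static List<string> Find(IEnumerable<string> candidates, string query)
    {
        string key = ToSearchKey(query);
        return candidates.Where(c => ToSearchKey(c).Contains(key)).ToList();
    }

    public static void Main()
    {
        var names = new[] { "Ren\u00E9", "Rene", "Zo\u00EB", "Bj\u00F6rn" };
        foreach (string hit in Find(names, "rene"))
            Console.WriteLine(hit);
    }
}
```

With these sample names, searching for "rene" should return both spellings of René and nothing else.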

History

  • Original version

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

OriginalGriff
CEO
Wales
Member
Born at an early age, he grew older. At the same time, his hair grew longer, and was tied up behind his head.
Has problems spelling the word "the".
Invented the portable cat-flap.
Currently, has not died yet. Or has he?


Comments and Discussions

 
General: My vote of 5 (member Bernard Chayer, 16 Mar '13 - 7:20)
Very useful, thanks.

General: My vote of 5 (member Matthew Searles, 5 Jul '12 - 17:29)
Love it, thanks for sharing :)

Suggestion: While it is technically correct, it may still fail... (member Andreas Gieriet, 27 Jun '12 - 0:35)
I'm a native German speaker, more precisely Swiss-German (which is only spoken, not written; we learn to read and write German at school). German has the "Umlaut" characters äöüÄÖÜ (there is no German Umlaut on i or e, though).

Instead of the Umlaut, one may choose to write "ae", "oe" or "ue", etc. This defeats a search that relies on removing the non-spacing marks (see also http://en.wikipedia.org/wiki/%C3%84).

Unfortunately, the reverse does not hold: not every "ae", "oe" or "ue" stands for "ä", "ö" or "ü".

If you want to make an even more robust search solution, you might need to do the following:
1) remove the diacritics from the text
2) remove all vowels
3) convert everything to one case
4) if the texts match on consonants only, calculate how well the vowels correlate (number of matching vowels versus total number of vowels).

This might result in better matches, but maybe it's not worth the effort...
 
Cheers
Andi
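
The four steps Andi suggests could be sketched roughly as follows. This is only one possible reading of the suggestion: the comment does not say how the vowels should be compared, so the position-by-position comparison in Score, the choice of "aeiou" as the vowel set, and the value -1 for "consonants differ" are all assumptions of this sketch, as are the names ConsonantMatchSketch, Fold, Skeleton and VowelsOf.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Text;

public static class ConsonantMatchSketch
{
    private const string Vowels = "aeiou"; // assumption: 'y' is treated as a consonant

    // Steps 1 and 3: remove the diacritics and convert to one case.
    public static string Fold(string s)
    {
        var sb = new StringBuilder();
        foreach (char c in s.ToLowerInvariant().Normalize(NormalizationForm.FormD))
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
                sb.Append(c);
        }
        return sb.ToString().Normalize(NormalizationForm.FormC);
    }

    // Step 2: what is left of a folded string once the vowels are removed.
    public static string Skeleton(string folded)
    {
        return new string(folded.Where(c => Vowels.IndexOf(c) < 0).ToArray());
    }

    // The vowels of a folded string, in order.
    public static string VowelsOf(string folded)
    {
        return new string(folded.Where(c => Vowels.IndexOf(c) >= 0).ToArray());
    }

    // Step 4: returns -1 if the consonants differ, otherwise the share (0..1)
    // of vowels that agree when compared position by position.
    public static double Score(string a, string b)
    {
        string fa = Fold(a), fb = Fold(b);
        if (Skeleton(fa) != Skeleton(fb)) return -1.0;

        string va = VowelsOf(fa), vb = VowelsOf(fb);
        int total = Math.Max(va.Length, vb.Length);
        if (total == 0) return 1.0; // no vowels on either side: nothing to disagree on

        int same = 0;
        for (int i = 0; i < Math.Min(va.Length, vb.Length); i++)
        {
            if (va[i] == vb[i]) same++;
        }
        return (double)same / total;
    }

    public static void Main()
    {
        Console.WriteLine(Score("M\u00FCller", "Mueller")); // same consonants, most vowels agree
        Console.WriteLine(Score("M\u00FCller", "Miller"));  // same consonants, fewer vowels agree
        Console.WriteLine(Score("M\u00FCller", "Meier"));   // different consonants
    }
}
```

A caller would still have to pick a threshold for the score, and as Andi says, it may not be worth the effort for every application.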


Last Updated 25 Jun 2012
Article Copyright 2012 by OriginalGriff
Everything else Copyright © CodeProject, 1999-2013