Remove Diacritical Marks in a Unicode String

Alain Rist, 29 Nov 2010
With a helper CharMap class using VC2010 C++0x implementation

This contribution comes from this forum question[^], and my inefficient answer[^].

So we want to remove some diacritical marks[^] in a Unicode string, for instance change occurrences of àáảãạăằắẳẵặâầấẩẫậ to plain a, with the help of C++0x[^] as implemented in VC2010.

For that, let's define a C array of const wchar_t* strings. In each string, the first character is the replacement character and the following ones are the characters to replace:

const wchar_t* pchangers[] =
{
L"aàáảãạăằắẳẵặâầấẩẫậ",
L"AÀÁẢÃẠĂẰẮẲẴẶÂẦẤẨẪẬ",
L"OÒÓỎÕỌÔỒỐỔỖỘƠỜỚỞỠỢ",
L"EÈÉẺẼẸÊỀẾỂỄỆ",
L"UÙÚỦŨỤƯỪỨỬỮỰ",
L"IÌÍỈĨỊ",
L"YỲÝỶỸỴ",
L"DĐ",
L"oòóỏõọôồốổỗộơờớởỡợ",
L"eèéẻẽẹêềếểễệ",
L"uùúủũụưừứửữự",
L"iìíỉĩị",
L"yỳýỷỹỵ",
L"dđ"
};
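To make the table layout concrete, here is a minimal sketch of a lookup that reads the array directly. The helper name replacement_for is hypothetical and not part of the original code. It scans every entry for each character, which is simple but slow for long strings, so it only illustrates the "first character replaces the rest" convention.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>

// Hypothetical helper (not from the article): looks wc up by scanning the table.
// Each entry is "replacement character followed by the characters to replace".
wchar_t replacement_for(const wchar_t* const* table, std::size_t count, wchar_t wc)
{
    assert(table != nullptr);
    if (wc == L'\0')
        return wc; // wcschr would otherwise match the terminating null
    for (std::size_t i = 0; i < count; ++i)
    {
        const wchar_t* entry = table[i];
        if (entry[0] == L'\0')
            continue; // empty entry: nothing to replace
        // Search only the characters after the first one.
        if (std::wcschr(entry + 1, wc) != nullptr)
            return entry[0];
    }
    return wc; // not listed: keep the character unchanged
}
```

Called as replacement_for(pchangers, sizeof pchangers / sizeof pchangers[0], wc), it returns the plain letter for a listed character and wc itself otherwise.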
The following CharMap class is constructed from a std::vector of such strings and uses it to populate its std::map<wchar_t, wchar_t> charmap member: every character after the first becomes a key, and the first character of the same string is its value:
#include <algorithm>
#include <iterator>
#include <map>
#include <string>
#include <utility>
#include <vector>

class CharMap
{
    // Maps each character to replace to its replacement character.
    std::map<wchar_t, wchar_t> charmap;
public:
    // Note: std::vector<const std::wstring> was accepted by VC2010 but is not
    // valid standard C++, so the elements are plain std::wstring here.
    explicit CharMap(const std::vector<std::wstring>& changers)
    {
        std::for_each(changers.begin(), changers.end(), [&](const std::wstring& changer) {
            if (changer.empty())
                return; // nothing to map, and begin() + 1 would be invalid
            // Key: each character after the first. Value: the first character.
            std::transform(changer.begin() + 1, changer.end(), std::inserter(charmap, charmap.end()),
                [&](wchar_t wc) { return std::make_pair(wc, changer[0]); });
        });
    }
    // Returns a copy of in with every mapped character replaced.
    std::wstring operator()(const std::wstring& in) const
    {
        std::wstring out(in.length(), L'\0');
        std::transform(in.begin(), in.end(), out.begin(), [&](wchar_t wc) -> wchar_t {
            auto it = charmap.find(wc);
            return it == charmap.end() ? wc : it->second;
        });
        return out;
    }
};  // class CharMap
CharMap::operator()(const std::wstring& in) builds a std::wstring out of the same length as in. Each character of in that is a key of charmap is written to out as its replacement character, every other character is copied unchanged, and out is returned.

Now let's put it to work:

#include <iostream>

std::vector<std::wstring> changers(pchangers, pchangers + sizeof pchangers / sizeof pchangers[0]);

int main()
{
    std::wcout << CharMap(changers)(L" người mình.mp3 ") << std::endl;
    return 0;
}
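For readers less used to lambdas, the same idea can be sketched with plain loops. This is an illustrative variant under stated assumptions, not the author's code: build_charmap and strip_marks are hypothetical names, the table is read directly from the C array, and the replacement is done in place instead of returning a new string.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>

// Hypothetical free functions; the names are not from the article.

// Builds the lookup map: key = character to replace, value = its replacement.
std::map<wchar_t, wchar_t> build_charmap(const wchar_t* const* table, std::size_t count)
{
    assert(table != nullptr);
    std::map<wchar_t, wchar_t> charmap;
    for (std::size_t i = 0; i < count; ++i)
    {
        const std::wstring changer(table[i]);
        // Start at 1: index 0 holds the replacement character itself.
        for (std::size_t j = 1; j < changer.size(); ++j)
            charmap.insert(std::make_pair(changer[j], changer[0])); // an existing key is kept
    }
    return charmap;
}

// Replaces every mapped character of text in place.
void strip_marks(const std::map<wchar_t, wchar_t>& charmap, std::wstring& text)
{
    for (std::size_t i = 0; i < text.size(); ++i)
    {
        std::map<wchar_t, wchar_t>::const_iterator it = charmap.find(text[i]);
        if (it != charmap.end())
            text[i] = it->second;
    }
}
```

With the article's table this would be used as build_charmap(pchangers, sizeof pchangers / sizeof pchangers[0]), followed by strip_marks on the string to clean. Because std::map::insert does not overwrite an existing key, a character listed twice keeps its first mapping.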

Quite a demonstration of the power of C++0x, isn't it?

If you have pasting problems with Unicode strings, download the full code CharMap.zip (1 KB).

cheers,
AR

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Alain Rist

France

Comments and Discussions

Randy Walles, 29-Jun-11 22:30: It really works. Thanks!
Alain Rist, 28-Jun-11 23:12: The sample pchangers does not include all possible cases. If...
Randy Walles, 28-Jun-11 21:50: I've downloaded full code archive, but it didn't help me. Ma...