Remove Diacritical Marks in a Unicode String

4.56/5 (4 votes)

Nov 28, 2010

CPOL

With a helper CharMap class, using the C++0x features implemented in VC2010

This contribution grew out of a forum question and my inefficient answer to it.

We want to remove some diacritical marks from a Unicode string, for instance change every occurrence of àáảãạăằắẳẵặâầấẩẫậ to a plain a, with the help of C++0x as implemented in VC2010.

For that, let's define a C array of const wchar_t* strings, where the first character of each string is the replacement character and the remaining ones are the characters to replace:

// This CODE cannot get formatted by the CP editor
const wchar_t* pchangers[] = { L"aàáảãạăằắẳẵặâầấẩẫậ", L"AÀÁẢÃẠĂẰẮẲẴẶÂẦẤẨẪẬ", L"OÒÒÓỎÕỌÔỒỐỔỖỘƠỜỚỞỠỢ", L"EÈÉẺẼẸÊỀẾỂỄỆ", L"UÙÚỦŨỤƯỪỨỬỮỰ", L"IÌÍỈĨỊ", L"YỲÝỶỸỴ", L"DĐ", L"oòóỏõọôồốổỗộơờớởỡợ", L"eèéẻẽẹêềếểễệ", L"uùúủũụưừứửữự", L"iìíỉĩị", L"yỳýỷỹỵ", L"dđ" };
// END CODE
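To make the layout of one entry concrete, here is a minimal sketch that splits an entry into its replacement character and the characters it replaces. The helper name split_changer is hypothetical and not part of the original code, and the accented characters are written as universal character names only to keep the snippet ASCII.

```cpp
#include <string>
#include <utility>

// Hypothetical helper (not in the original article): splits one entry into
// (replacement character, characters to replace).
// Assumes the entry holds at least one character.
std::pair<wchar_t, std::wstring> split_changer(const std::wstring& changer)
{
    // First character is the replacement; the rest are the targets.
    return std::make_pair(changer[0], changer.substr(1));
}
```

For example, split_changer(L"a\u00E0\u00E1") yields L'a' as the replacement and a two-character string holding à and á as the targets.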
The following CharMap class is constructed from a std::vector<std::wstring> of such strings and uses it to populate its std::map<wchar_t, wchar_t> charmap member: the keys are the characters after the first one, and the value for each of them is the first character:
#include <map>
#include <vector>
#include <string>
#include <algorithm>
#include <iterator>
#include <utility>
class CharMap
{
    // Maps each character to replace to its replacement character.
    std::map<wchar_t, wchar_t> charmap;
public:
    // The element type is non-const: standard containers cannot hold const elements.
    CharMap(const std::vector<std::wstring>& changers)
    {
        std::for_each(changers.begin(), changers.end(), [&](const std::wstring& changer){
            if (changer.empty())
                return;  // nothing to map, and begin() + 1 would be invalid
            // Key: each character after the first; value: the first character.
            std::transform(changer.begin() + 1, changer.end(), std::inserter(charmap, charmap.end()), [&](wchar_t wc){
                return std::make_pair(wc, changer[0]);});
        });
    }
    // Does not modify the map, so it can be called on a const CharMap.
    std::wstring operator()(const std::wstring& in) const
    {
        std::wstring out(in.length(), L'\0');
        std::transform(in.begin(), in.end(), out.begin(), [&](wchar_t wc) ->wchar_t {
            auto it = charmap.find(wc);
            return it == charmap.end() ? wc : it->second;});
        return out;
    }
};  // class CharMap
CharMap::operator() builds a std::wstring out of the same length as in: every character of in that has an entry in charmap is replaced by its mapped character, every other character is copied unchanged, and out is returned.

Now let's put it to work:

#include <iostream>

std::vector<std::wstring> changers(pchangers, pchangers + sizeof pchangers / sizeof pchangers[0]);
int main()
{
// This CODE cannot get formatted by the CP editor
std::wcout << CharMap(changers)(L" người mình.mp3 ") << std::endl;
// END unformatted CODE
    return 0;
}

Quite a demonstration of the power of C++0x, isn't it?

If you have pasting problems with Unicode strings, download the full code CharMap.zip (1 KB).
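
If the accented literals do not survive copy and paste, one possible workaround is to spell them as universal character names (\uXXXX), which keeps the source file pure ASCII. The sketch below is only an illustration under that assumption: it covers four variants of a, and the names a_changer, make_charmap and strip_marks are hypothetical, not part of the downloadable code.

```cpp
#include <map>
#include <string>

// Hypothetical ASCII-only entry: 'a' followed by U+00E0 (a with grave),
// U+00E1 (a with acute), U+00E3 (a with tilde) and U+0103 (a with breve).
const wchar_t* const a_changer = L"a\u00E0\u00E1\u00E3\u0103";

// Builds the target -> replacement map from one entry
// (first character = replacement, remaining characters = targets).
std::map<wchar_t, wchar_t> make_charmap(const std::wstring& changer)
{
    std::map<wchar_t, wchar_t> charmap;
    for (std::wstring::size_type i = 1; i < changer.size(); ++i)
        charmap[changer[i]] = changer[0];
    return charmap;
}

// Replaces every mapped character of the input; other characters are kept.
std::wstring strip_marks(const std::map<wchar_t, wchar_t>& charmap, std::wstring in)
{
    for (std::wstring::size_type i = 0; i < in.size(); ++i) {
        std::map<wchar_t, wchar_t>::const_iterator it = charmap.find(in[i]);
        if (it != charmap.end())
            in[i] = it->second;
    }
    return in;
}
```

With this table, strip_marks(make_charmap(a_changer), L"b\u00E0n") returns L"ban", and a string without any mapped character comes back unchanged.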

cheers, AR