Remove Diacritical Marks in a Unicode String

29 Nov 2010, CPOL
With a helper CharMap class, using the VC2010 implementation of C++0x

This contribution comes from a forum question, and my inefficient answer to it.


So we want to remove some diacritical marks in a Unicode string, for instance change occurrences of àáảãạăằắẳẵặâầấẩẫậ to plain a, with the help of C++0x as implemented in VC2010.


For that, let's define a C array of const wchar_t* strings, each with the replacement character first, followed by the characters to replace:


// This CODE cannot get formatted by the CP editor
const wchar_t* pchangers[] =
{
L"aàáảãạăằắẳẵặâầấẩẫậ",
L"AÀÁẢÃẠĂẰẮẲẴẶÂẦẤẨẪẬ",
L"OÒÓỎÕỌÔỒỐỔỖỘƠỜỚỞỠỢ",
L"EÈÉẺẼẸÊỀẾỂỄỆ",
L"UÙÚỦŨỤƯỪỨỬỮỰ",
L"IÌÍỈĨỊ",
L"YỲÝỶỸỴ",
L"DĐ",
L"oòóỏõọôồốổỗộơờớởỡợ",
L"eèéẻẽẹêềếểễệ",
L"uùúủũụưừứửữự",
L"iìíỉĩị",
L"yỳýỷỹỵ",
L"dđ"
};
// END CODE
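
To make the row convention concrete, here is a minimal sketch of how a single row could be read as (character to replace, replacement) pairs. The helper name expand_row is hypothetical and not part of the original code; it only illustrates the layout of the table.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not in the article): expands one row of the table
// into (character to replace, replacement) pairs.
// Precondition: the row holds at least its replacement character.
std::vector<std::pair<wchar_t, wchar_t>> expand_row(const std::wstring& row)
{
    assert(!row.empty());
    std::vector<std::pair<wchar_t, wchar_t>> pairs;
    const wchar_t replacement = row[0];          // first character of the row
    for (std::size_t i = 1; i < row.size(); ++i) // every following character
        pairs.push_back(std::make_pair(row[i], replacement));
    return pairs;
}
```

For the last row of the table, expand_row(L"dđ") would yield the single pair (đ, d).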

The following CharMap class is constructed from a std::vector<std::wstring> of such strings and uses it to populate its std::map<wchar_t, wchar_t> charmap member, whose keys are the characters after the first and whose values are the first character:
#include <algorithm>
#include <iterator>
#include <map>
#include <string>
#include <utility>
#include <vector>

class CharMap
{
    // Maps each character to replace to its replacement character.
    std::map<wchar_t, wchar_t> charmap;
public:
    // Each changer holds the replacement character followed by the characters it replaces.
    explicit CharMap(const std::vector<std::wstring>& changers)
    {
        std::for_each(changers.begin(), changers.end(), [&](const std::wstring& changer) {
            if (changer.empty())
                return; // nothing to map
            std::transform(changer.begin() + 1, changer.end(), std::inserter(charmap, charmap.end()),
                [&](wchar_t wc) { return std::make_pair(wc, changer[0]); });
        });
    }
    // Returns a copy of in with every mapped character replaced.
    std::wstring operator()(const std::wstring& in) const
    {
        std::wstring out(in.length(), L'\0');
        std::transform(in.begin(), in.end(), out.begin(), [&](wchar_t wc) -> wchar_t {
            auto it = charmap.find(wc);
            return it == charmap.end() ? wc : it->second;
        });
        return out;
    }
};  // class CharMap

CharMap::operator() builds a std::wstring out of the same length as in, replaces every character of in that has an entry in charmap with its replacement character, copies all other characters unchanged, and returns out.

Now let's put it to work:

#include <iostream>

std::vector<std::wstring> changers(pchangers, pchangers + sizeof pchangers / sizeof pchangers[0]);

int main()
{
    std::wcout << CharMap(changers)(L" người mình.mp3 ") << std::endl;
    return 0;
}
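
As a cross-check of the sample above, the following sketch redoes the same lookup with plain loops and an abridged three-row table. The names remove_marks and sample_result are hypothetical and not part of the original code. The sketch assumes the input uses precomposed characters (one wchar_t per accented letter), and it writes them as universal character names so the source stays ASCII-only, which may also help if pasting Unicode literals is a problem.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical free function (not the article's class): builds the lookup map
// from rows of "replacement + characters to replace" and applies it to in.
std::wstring remove_marks(const std::vector<std::wstring>& rows, const std::wstring& in)
{
    std::map<wchar_t, wchar_t> charmap;
    for (std::size_t r = 0; r < rows.size(); ++r)
        for (std::size_t i = 1; i < rows[r].size(); ++i)
            charmap.insert(std::make_pair(rows[r][i], rows[r][0])); // first mapping wins

    std::wstring out(in);
    for (std::size_t i = 0; i < out.size(); ++i)
    {
        std::map<wchar_t, wchar_t>::const_iterator it = charmap.find(out[i]);
        if (it != charmap.end())
            out[i] = it->second; // mapped: use the replacement
        // unmapped characters (spaces, '.', digits, plain letters) are left alone
    }
    return out;
}

// Reproduces the article's sample string with an abridged table.
// \u01B0 = ư, \u1EDD = ờ, \u00EC = ì.
std::wstring sample_result()
{
    std::vector<std::wstring> rows;
    rows.push_back(L"u\u01B0");
    rows.push_back(L"o\u1EDD");
    rows.push_back(L"i\u00EC");
    const std::wstring out = remove_marks(rows, L" ng\u01B0\u1EDDi m\u00ECnh.mp3 ");
    assert(out == L" nguoi minh.mp3 ");
    return out;
}
```

Characters stored in decomposed form (a base letter followed by combining marks) have no entry in such a table and would pass through unchanged.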

Quite a demonstration of the power of C++0x, isn't it?


If you have pasting problems with Unicode strings, download the full code CharMap.zip (1 KB).


cheers,
AR

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)



