
Remove Diacritical Marks in a Unicode String

By Alain Rist, 29 Nov 2010

This contribution comes from this forum question[^] and my inefficient answer[^].

So we want to remove some diacritical marks[^] in a Unicode string, for instance change occurrences of àáảãạăằắẳẵặâầấẩẫậ to plain a, with the help of C++0x[^] as implemented in VC2010.

For that let's define a C array of const wchar_t* with the first character being the replacement character and the next ones being the characters to replace:

const wchar_t* pchangers[] =
{
    L"aàáảãạăằắẳẵặâầấẩẫậ",
    L"AÀÁẢÃẠĂẰẮẲẴẶÂẦẤẨẪẬ",
    L"OÒÓỎÕỌÔỒỐỔỖỘƠỜỚỞỠỢ",
    L"EÈÉẺẼẸÊỀẾỂỄỆ",
    L"UÙÚỦŨỤƯỪỨỬỮỰ",
    L"IÌÍỈĨỊ",
    L"YỲÝỶỸỴ",
    L"DĐ",
    L"oòóỏõọôồốổỗộơờớởỡợ",
    L"eèéẻẽẹêềếểễệ",
    L"uùúủũụưừứửữự",
    L"iìíỉĩị",
    L"yỳýỷỹỵ",
    L"dđ"
};
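To make the table layout concrete, here is a minimal sketch of how a single entry can be read: the character at index 0 is the replacement, and every later character is a key that maps to it. The names expand_entry and check_expand_entry are hypothetical and not part of the original code; the accented character is written as a universal character name so that the snippet does not depend on the source file encoding.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not in the article's code): lists the
// (character to replace, replacement) pairs encoded by one table entry.
std::vector<std::pair<wchar_t, wchar_t> > expand_entry(const std::wstring& entry)
{
    std::vector<std::pair<wchar_t, wchar_t> > pairs;
    // Index 0 holds the replacement, so the keys start at index 1.
    for (std::size_t i = 1; i < entry.size(); ++i)
        pairs.push_back(std::make_pair(entry[i], entry[0]));
    return pairs;
}

// Small self-check on the shortest entry of the table.
void check_expand_entry()
{
    // L"d\u0111" is L"dđ" spelled with a universal character name.
    const std::vector<std::pair<wchar_t, wchar_t> > pairs = expand_entry(L"d\u0111");
    assert(pairs.size() == 1);
    assert(pairs[0].first == L'\u0111');   // key: the character to replace
    assert(pairs[0].second == L'd');       // value: its replacement
    assert(expand_entry(L"x").empty());    // a lone replacement maps nothing
}
```

Calling check_expand_entry() from a test or from main exercises the helper; an entry with only one character produces no pairs at all.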
The following CharMap class is constructed from a std::vector<std::wstring> of such strings and uses it to populate its std::map<wchar_t, wchar_t> charmap member: the keys are the characters after the first one, and each value is the first character of the string the key came from:
#include <map>
#include <vector>
#include <string>
#include <algorithm>
#include <iterator>
#include <utility>
class CharMap
{
    std::map<wchar_t, wchar_t> charmap;
public:
    // The element type is std::wstring, not const std::wstring: the standard
    // containers require a non-const element type.
    CharMap(const std::vector<std::wstring>& changers)
    {
        std::for_each(changers.begin(), changers.end(), [&](const std::wstring& changer){
            if (changer.empty())
                return; // nothing to map, and begin() + 1 would be invalid
            // Map every character after the first to the first character.
            std::transform(changer.begin() + 1, changer.end(), std::inserter(charmap, charmap.end()), [&](wchar_t wc){
                return std::make_pair(wc, changer[0]);});
        });
    }
    // Returns a copy of in with every mapped character replaced.
    std::wstring operator()(const std::wstring& in) const
    {
        std::wstring out(in.length(), L'\0');
        std::transform(in.begin(), in.end(), out.begin(), [&](wchar_t wc) ->wchar_t {
            auto it = charmap.find(wc);
            return it == charmap.end() ? wc : it->second;});
        return out;
    }
};  // class CharMap
CharMap::operator() builds a std::wstring out of the same length as in: every character of in that has an entry in charmap is replaced by its replacement character, every other character is copied unchanged, and out is returned.
 

Now let's put it to work:

#include <iostream>
    
std::vector<std::wstring> changers(pchangers, pchangers + sizeof pchangers / sizeof pchangers[0]);
int main()
{
    std::wcout << CharMap(changers)(L" người mình.mp3 ") << std::endl;
    return 0;
}

Quite a demonstration of the power of C++0x, isn't it?

If you have pasting problems with Unicode strings, download the full code CharMap.zip (1 KB).
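If downloading the archive is not an option, one possible workaround for pasting or encoding problems is to spell the accented characters as universal character names (\uXXXX), which keeps the source file pure ASCII. The sketch below is only an illustration under that assumption: escaped_changers is a deliberately shortened table, and build_escaped_map, replace_with and check_escaped_table are hypothetical names that do not appear in the original code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>

// Hypothetical, shortened table with the same layout as pchangers:
// first character = replacement, following characters = keys.
const wchar_t* const escaped_changers[] =
{
    L"a\u00E0\u00E1\u00E2\u00E3",   // a: à á â ã
    L"i\u00EC\u00ED",               // i: ì í
    L"d\u0111"                      // d: đ
};

// Builds the lookup map from the escaped table (assumed helper name).
std::map<wchar_t, wchar_t> build_escaped_map()
{
    std::map<wchar_t, wchar_t> m;
    const std::size_t count = sizeof escaped_changers / sizeof escaped_changers[0];
    for (std::size_t i = 0; i < count; ++i)
    {
        const wchar_t* entry = escaped_changers[i];
        if (*entry == L'\0')
            continue;                       // skip empty entries
        for (const wchar_t* p = entry + 1; *p != L'\0'; ++p)
            m.insert(std::make_pair(*p, entry[0]));
    }
    return m;
}

// Same lookup-with-fallback step as CharMap::operator(), as a free function.
std::wstring replace_with(const std::map<wchar_t, wchar_t>& m, std::wstring text)
{
    std::transform(text.begin(), text.end(), text.begin(), [&m](wchar_t wc) -> wchar_t {
        auto it = m.find(wc);
        return it == m.end() ? wc : it->second;
    });
    return text;
}

// Self-check: L"\u0111\u00E0 m\u00ECnh" is "đà mình" written with escapes.
void check_escaped_table()
{
    const std::map<wchar_t, wchar_t> m = build_escaped_map();
    assert(replace_with(m, L"\u0111\u00E0 m\u00ECnh") == L"da minh");
}
```

The full table can be converted the same way, one escape per accented character. The trade-off is readability: the comments next to each entry are then the only place where the actual characters are visible.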

cheers,
AR

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Alain Rist

France

Comments and Discussions

 
It really works. thanks! (Randy Walles, 29-Jun-11 22:30)
The sample pchangers does not include all possible cases. If... (Alain Rist, 28-Jun-11 23:12)
I've downloaded full code archive, but it didn't help me. Ma... (Randy Walles, 28-Jun-11 21:50)
Re: The sample pchangers does not include all possible cases. If... (Alain Rist, 28-Jun-11 23:13)
Message Removed (_beauw_, 29-Nov-10 18:13)
Re: Very Useful (Alain Rist, 29-Nov-10 23:58)

Last Updated 30 Nov 2010
Article Copyright 2010 by Alain Rist