Remove Diacritical Marks in a Unicode String






With a helper CharMap class, using the VC2010 implementation of C++0x
This contribution comes from this forum question[^] and my inefficient answer[^].
The goal is to remove some diacritical marks[^] from a Unicode string, for instance changing every occurrence of àáảãạăằắẳẵặâầấẩẫậ to a plain a, with the help of C++0x[^] as implemented in VC2010.
For that, let's define a C array of const wchar_t* strings, in which the first character of each string is the replacement character and the remaining ones are the characters to replace:
const wchar_t* pchangers[] = {
    L"aàáảãạăằắẳẵặâầấẩẫậ",
    L"AÀÁẢÃẠĂẰẮẲẴẶÂẦẤẨẪẬ",
    L"OÒÓỎÕỌÔỒỐỔỖỘƠỜỚỞỠỢ",
    L"EÈÉẺẼẸÊỀẾỂỄỆ",
    L"UÙÚỦŨỤƯỪỨỬỮỰ",
    L"IÌÍỈĨỊ",
    L"YỲÝỶỸỴ",
    L"DĐ",
    L"oòóỏõọôồốổỗộơờớởỡợ",
    L"eèéẻẽẹêềếểễệ",
    L"uùúủũụưừứửữự",
    L"iìíỉĩị",
    L"yỳýỷỹỵ",
    L"dđ"
};

The following CharMap class is constructed from a std::vector<std::wstring> of such strings and uses it to populate its std::map<wchar_t, wchar_t> charmap member, where the keys are the characters after the first and the value is the first character:
#include <algorithm>
#include <iterator>
#include <map>
#include <string>
#include <utility>
#include <vector>

class CharMap
{
    std::map<wchar_t, wchar_t> charmap;

public:
    // Each changer string is: replacement character, then the characters it replaces.
    CharMap(const std::vector<std::wstring>& changers)
    {
        std::for_each(changers.begin(), changers.end(), [&](const std::wstring& changer) {
            if (changer.empty())
                return; // nothing to map, and begin() + 1 would be invalid
            std::transform(changer.begin() + 1, changer.end(),
                           std::inserter(charmap, charmap.end()),
                           [&](wchar_t wc) { return std::make_pair(wc, changer[0]); });
        });
    }

    // Returns a copy of in with every mapped character replaced.
    std::wstring operator()(const std::wstring& in) const
    {
        std::wstring out(in.length(), L'\0');
        std::transform(in.begin(), in.end(), out.begin(), [&](wchar_t wc) -> wchar_t {
            auto it = charmap.find(wc);
            return it == charmap.end() ? wc : it->second;
        });
        return out;
    }
}; // class CharMap

CharMap::operator() constructs a std::wstring out from in, changing each character of in that has an entry in charmap to its replacement character, and returns out.

Now let's put it to work:

#include <iostream>

std::vector<std::wstring> changers(pchangers, pchangers + sizeof pchangers / sizeof pchangers[0]);

int main()
{
    std::wcout << CharMap(changers)(L" người mình.mp3 ") << std::endl;
    return 0;
}
Quite a demonstration of the power of C++0x, isn't it?
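For comparison, here is a minimal sketch of the same logic written with plain loops instead of lambdas, which may make it easier to see what the nested std::for_each / std::transform calls do. The function names buildCharMap and stripMarks are hypothetical and are not part of the CharMap class above. The sketch assumes the same layout for the changer strings: replacement character first, characters to replace after it.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: builds the same table as the CharMap constructor.
std::map<wchar_t, wchar_t> buildCharMap(const std::vector<std::wstring>& changers)
{
    std::map<wchar_t, wchar_t> charmap;
    for (std::size_t i = 0; i < changers.size(); ++i) {
        const std::wstring& changer = changers[i];
        // changer[0] is the replacement; every later character maps to it.
        // An empty string simply contributes nothing.
        for (std::size_t j = 1; j < changer.size(); ++j)
            charmap.insert(std::make_pair(changer[j], changer[0]));
    }
    return charmap;
}

// Hypothetical helper: does what CharMap::operator() does.
std::wstring stripMarks(const std::map<wchar_t, wchar_t>& charmap,
                        const std::wstring& in)
{
    std::wstring out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::map<wchar_t, wchar_t>::const_iterator it = charmap.find(in[i]);
        // Unmapped characters are copied through unchanged.
        out += (it == charmap.end()) ? in[i] : it->second;
    }
    return out;
}
```

With changers holding L"aàáâ" and L"eèé", stripMarks(buildCharMap(changers), L"déjà vu") gives L"deja vu". Like the std::inserter used in the class, std::map::insert keeps the first mapping when a character appears twice.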
If you have pasting problems with Unicode strings, download the full code CharMap.zip (1 KB).
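If the pasting problem is the source file's encoding, one possible workaround is to spell the accented characters as universal character names (\uXXXX), so that the source stays pure ASCII. The sketch below shows only three of the changer strings; the function name escapedChangers is hypothetical, and the code points are the standard Unicode values for the characters named in the comments.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: a few changer strings written with \uXXXX escapes.
// Layout is unchanged: replacement character first, then the characters to replace.
std::vector<std::wstring> escapedChangers()
{
    std::vector<std::wstring> changers;
    // d, then U+0111 (d with stroke)
    changers.push_back(L"d\u0111");
    // D, then U+0110 (D with stroke)
    changers.push_back(L"D\u0110");
    // i, then U+00EC, U+00ED, U+1EC9, U+0129, U+1ECB
    // (i with grave, acute, hook above, tilde, dot below)
    changers.push_back(L"i\u00EC\u00ED\u1EC9\u0129\u1ECB");
    return changers;
}
```

The resulting vector can be passed to the CharMap constructor just like changers above.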
cheers, AR