
Convert Between std::string and std::wstring, UTF-8 and UTF-16

20 May 2007
How to safely convert STL strings between Unicode formats

Introduction

I needed to convert between UTF-8 encoded std::string and UTF-16 encoded std::wstring. I found some conversion functions for native C strings, but they leave the memory handling to the caller, which is not nice in modern times.

The best converter is probably the one from unicode.org. Here is a wrapper around this one which converts the STL strings.

Unlike other articles, this one has no further dependencies and does not introduce yet another string class: it only converts STL strings, and that's it. It is also better than the widely found...

std::wstring widestring(sourcestring.begin(), sourcestring.end()); 

... which only works for ASCII text.
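
To see why, here is a minimal sketch (the helper name WidenBytewise is made up for this illustration). The iterator-range constructor turns every char into one wchar_t, so a multi-byte UTF-8 sequence ends up as several unrelated wide characters instead of one.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper, for illustration only: widens byte by byte,
// exactly like the iterator-range constructor shown above.
std::wstring WidenBytewise(const std::string& source)
{
    return std::wstring(source.begin(), source.end());
}
```

WidenBytewise("abc") gives L"abc", because ASCII bytes equal their code points. The letter "é" (U+00E9) is stored in UTF-8 as the two bytes 0xC3 0xA9, so WidenBytewise("\xC3\xA9") returns a string of length 2 rather than the single wide character U+00E9.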

Source

The header goes like this:

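#ifndef UTFCONVERTER_H
#define UTFCONVERTER_H

#include <string>

namespace UtfConverter
{
    // UTF-8 to the wide encoding (UTF-16 or UTF-32, depending on sizeof(wchar_t)).
    std::wstring FromUtf8(const std::string& utf8string);
    // Wide encoding (UTF-16 or UTF-32) back to UTF-8.
    std::string ToUtf8(const std::wstring& widestring);
}

#endif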

I guess this is simple and easy enough to use.

Here is the source code:

#include "stdafx.h"
#include "UtfConverter.h"
#include "ConvertUTF.h"
#include <stdexcept>

namespace UtfConverter
{
    // The casts below assume that UTF16 (or UTF32) from ConvertUTF.h has
    // the same size as wchar_t in the branch that uses it.
    std::wstring FromUtf8(const std::string& utf8string)
    {
        // Converting UTF-8 never yields more code units than input bytes,
        // so the byte count is a safe upper bound for the output length.
        size_t widesize = utf8string.length();
        if (sizeof(wchar_t) == 2)
        {
            wchar_t* widestringnative = new wchar_t[widesize+1];
            const UTF8* sourcestart = reinterpret_cast<const UTF8*>(utf8string.c_str());
            const UTF8* sourceend = sourcestart + widesize;
            UTF16* targetstart = reinterpret_cast<UTF16*>(widestringnative);
            UTF16* targetend = targetstart + widesize; // last slot is kept for the terminator
            ConversionResult res = ConvertUTF8toUTF16
                (&sourcestart, sourceend, &targetstart, targetend, strictConversion);
            if (res != conversionOK)
            {
                delete [] widestringnative;
                throw std::runtime_error("UTF-8 to UTF-16 conversion failed");
            }
            *targetstart = 0;
            std::wstring resultstring(widestringnative);
            delete [] widestringnative;
            return resultstring;
        }
        else if (sizeof(wchar_t) == 4)
        {
            // +1 so that the terminator written below stays inside the buffer.
            wchar_t* widestringnative = new wchar_t[widesize+1];
            const UTF8* sourcestart = reinterpret_cast<const UTF8*>(utf8string.c_str());
            const UTF8* sourceend = sourcestart + widesize;
            UTF32* targetstart = reinterpret_cast<UTF32*>(widestringnative);
            UTF32* targetend = targetstart + widesize;
            ConversionResult res = ConvertUTF8toUTF32
                (&sourcestart, sourceend, &targetstart, targetend, strictConversion);
            if (res != conversionOK)
            {
                delete [] widestringnative;
                throw std::runtime_error("UTF-8 to UTF-32 conversion failed");
            }
            *targetstart = 0;
            std::wstring resultstring(widestringnative);
            delete [] widestringnative;
            return resultstring;
        }
        else
        {
            throw std::runtime_error("Unsupported wchar_t size");
        }
        return L"";
    }

    std::string ToUtf8(const std::wstring& widestring)
    {
        size_t widesize = widestring.length();

        if (sizeof(wchar_t) == 2)
        {
            // A UTF-16 code unit needs at most 3 UTF-8 bytes; +1 for the terminator.
            size_t utf8size = 3 * widesize + 1;
            char* utf8stringnative = new char[utf8size];
            const UTF16* sourcestart =
                reinterpret_cast<const UTF16*>(widestring.c_str());
            const UTF16* sourceend = sourcestart + widesize;
            UTF8* targetstart = reinterpret_cast<UTF8*>(utf8stringnative);
            UTF8* targetend = targetstart + utf8size - 1; // last byte is kept for the terminator
            ConversionResult res = ConvertUTF16toUTF8
                (&sourcestart, sourceend, &targetstart, targetend, strictConversion);
            if (res != conversionOK)
            {
                delete [] utf8stringnative;
                throw std::runtime_error("UTF-16 to UTF-8 conversion failed");
            }
            *targetstart = 0;
            std::string resultstring(utf8stringnative);
            delete [] utf8stringnative;
            return resultstring;
        }
        else if (sizeof(wchar_t) == 4)
        {
            // A UTF-32 code unit needs at most 4 UTF-8 bytes; +1 for the terminator.
            size_t utf8size = 4 * widesize + 1;
            char* utf8stringnative = new char[utf8size];
            const UTF32* sourcestart =
                reinterpret_cast<const UTF32*>(widestring.c_str());
            const UTF32* sourceend = sourcestart + widesize;
            UTF8* targetstart = reinterpret_cast<UTF8*>(utf8stringnative);
            UTF8* targetend = targetstart + utf8size - 1; // last byte is kept for the terminator
            ConversionResult res = ConvertUTF32toUTF8
                (&sourcestart, sourceend, &targetstart, targetend, strictConversion);
            if (res != conversionOK)
            {
                delete [] utf8stringnative;
                throw std::runtime_error("UTF-32 to UTF-8 conversion failed");
            }
            *targetstart = 0;
            std::string resultstring(utf8stringnative);
            delete [] utf8stringnative;
            return resultstring;
        }
        else
        {
            throw std::runtime_error("Unsupported wchar_t size");
        }
        return "";
    }
}
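
The buffer sizes in ToUtf8 (3 bytes per UTF-16 code unit, 4 bytes per UTF-32 code unit) follow from how long a code point can get in UTF-8. The following sketch illustrates the arithmetic; Utf8Length and Utf16Length are hypothetical helpers written for this example and are not part of the unicode.org converter.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of UTF-8 bytes needed for one code point.
// Assumes a valid Unicode scalar value (at most 0x10FFFF, not a surrogate).
std::size_t Utf8Length(unsigned long codepoint)
{
    if (codepoint < 0x80)    return 1; // ASCII
    if (codepoint < 0x800)   return 2;
    if (codepoint < 0x10000) return 3; // rest of the Basic Multilingual Plane
    return 4;                          // supplementary planes
}

// Hypothetical helper: number of UTF-16 code units for one code point.
std::size_t Utf16Length(unsigned long codepoint)
{
    return codepoint < 0x10000 ? 1 : 2; // 2 means a surrogate pair
}
```

A character inside the Basic Multilingual Plane is one UTF-16 unit and at most 3 UTF-8 bytes. A character outside it is two UTF-16 units but only 4 UTF-8 bytes, which is still below 3 bytes per unit. In UTF-32 every character is one unit and never needs more than 4 bytes.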

How To Do It Better

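Here's another version that avoids new and delete by writing directly into the string buffer. This is well defined since C++11, which guarantees that std::basic_string stores its characters contiguously. C++03 did not formally guarantee it, although common implementations stored strings contiguously anyway.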

#include "stdafx.h"
#include "UtfConverter.h"
#include "ConvertUTF.h"
#include <stdexcept>

namespace UtfConverter
{
    std::wstring FromUtf8(const std::string& utf8string)
    {
        size_t widesize = utf8string.length();
        if (widesize == 0)
        {
            return std::wstring(); // nothing to convert
        }
        if (sizeof(wchar_t) == 2)
        {
            // The output never has more code units than the input has bytes.
            std::wstring resultstring;
            resultstring.resize(widesize, L'\0');
            const UTF8* sourcestart = reinterpret_cast<const UTF8*>(utf8string.c_str());
            const UTF8* sourceend = sourcestart + widesize;
            UTF16* targetbegin = reinterpret_cast<UTF16*>(&resultstring[0]);
            UTF16* targetstart = targetbegin;
            UTF16* targetend = targetstart + widesize;
            ConversionResult res = ConvertUTF8toUTF16
                (&sourcestart, sourceend, &targetstart, targetend, strictConversion);
            if (res != conversionOK)
            {
                throw std::runtime_error("UTF-8 to UTF-16 conversion failed");
            }
            // Trim to the code units actually written.
            resultstring.resize(targetstart - targetbegin);
            return resultstring;
        }
        else if (sizeof(wchar_t) == 4)
        {
            std::wstring resultstring;
            resultstring.resize(widesize, L'\0');
            const UTF8* sourcestart = reinterpret_cast<const UTF8*>(utf8string.c_str());
            const UTF8* sourceend = sourcestart + widesize;
            UTF32* targetbegin = reinterpret_cast<UTF32*>(&resultstring[0]);
            UTF32* targetstart = targetbegin;
            UTF32* targetend = targetstart + widesize;
            ConversionResult res = ConvertUTF8toUTF32
                (&sourcestart, sourceend, &targetstart, targetend, strictConversion);
            if (res != conversionOK)
            {
                throw std::runtime_error("UTF-8 to UTF-32 conversion failed");
            }
            // Trim to the code units actually written.
            resultstring.resize(targetstart - targetbegin);
            return resultstring;
        }
        else
        {
            throw std::runtime_error("Unsupported wchar_t size");
        }
        return L"";
    }

    std::string ToUtf8(const std::wstring& widestring)
    {
        size_t widesize = widestring.length();
        if (widesize == 0)
        {
            return std::string(); // nothing to convert
        }

        if (sizeof(wchar_t) == 2)
        {
            // A UTF-16 code unit needs at most 3 UTF-8 bytes.
            size_t utf8size = 3 * widesize;
            std::string resultstring;
            resultstring.resize(utf8size, '\0');
            const UTF16* sourcestart =
                reinterpret_cast<const UTF16*>(widestring.c_str());
            const UTF16* sourceend = sourcestart + widesize;
            UTF8* targetbegin = reinterpret_cast<UTF8*>(&resultstring[0]);
            UTF8* targetstart = targetbegin;
            UTF8* targetend = targetstart + utf8size;
            ConversionResult res = ConvertUTF16toUTF8
                (&sourcestart, sourceend, &targetstart, targetend, strictConversion);
            if (res != conversionOK)
            {
                throw std::runtime_error("UTF-16 to UTF-8 conversion failed");
            }
            // Trim to the bytes actually written.
            resultstring.resize(targetstart - targetbegin);
            return resultstring;
        }
        else if (sizeof(wchar_t) == 4)
        {
            // A UTF-32 code unit needs at most 4 UTF-8 bytes.
            size_t utf8size = 4 * widesize;
            std::string resultstring;
            resultstring.resize(utf8size, '\0');
            const UTF32* sourcestart =
                reinterpret_cast<const UTF32*>(widestring.c_str());
            const UTF32* sourceend = sourcestart + widesize;
            UTF8* targetbegin = reinterpret_cast<UTF8*>(&resultstring[0]);
            UTF8* targetstart = targetbegin;
            UTF8* targetend = targetstart + utf8size;
            ConversionResult res = ConvertUTF32toUTF8
                (&sourcestart, sourceend, &targetstart, targetend, strictConversion);
            if (res != conversionOK)
            {
                throw std::runtime_error("UTF-32 to UTF-8 conversion failed");
            }
            // Trim to the bytes actually written.
            resultstring.resize(targetstart - targetbegin);
            return resultstring;
        }
        else
        {
            throw std::runtime_error("Unsupported wchar_t size");
        }
        return "";
    }
}
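
The technique of writing into a string's own storage can be tried in isolation, without the Unicode converter. In this sketch, CopyWithoutSpaces is a made-up stand-in for a C-style routine that fills a caller-supplied buffer, and RemoveSpaces is an equally hypothetical wrapper around it. Note the final resize, which trims the string to the bytes actually written so that the result carries no padding.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical C-style routine: copies [source, sourceend) to target,
// skipping spaces, and returns one past the last byte written.
// It never writes more bytes than it reads.
char* CopyWithoutSpaces(const char* source, const char* sourceend, char* target)
{
    for (; source != sourceend; ++source)
    {
        if (*source != ' ')
        {
            *target++ = *source;
        }
    }
    return target;
}

// Hypothetical wrapper: lets the routine write straight into the result.
std::string RemoveSpaces(const std::string& input)
{
    if (input.empty())
    {
        return std::string(); // avoid touching the buffer of an empty string
    }
    std::string result(input.size(), '\0'); // upper bound for the output
    char* begin = &result[0];               // contiguous storage since C++11
    char* end = CopyWithoutSpaces(input.data(), input.data() + input.size(), begin);
    result.resize(static_cast<std::size_t>(end - begin)); // drop unused tail
    return result;
}
```

RemoveSpaces("a b c") returns "abc" with a length of 3. Without the last resize it would have length 5 and end in two NUL characters.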

How to Use It

Just add it to your project. Download the Unicode converter from unicode.org (the ConvertUTF.h header included above, together with its implementation file) and add that to the project, too. It should just work.

Of course, you can throw whatever exceptions you like upon failure.
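
For instance, a dedicated exception type could carry the converter's result code. This is only a sketch: the class name ConversionError and the function ThrowOnFailure are invented here, and the enum is a local stand-in that is assumed to mirror ConversionResult from ConvertUTF.h so that the example compiles on its own. Real code would include the header instead.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Local stand-in for ConversionResult from ConvertUTF.h (assumed to have
// these four values); real code would include the header instead.
enum ConversionResult { conversionOK, sourceExhausted, targetExhausted, sourceIllegal };

// Hypothetical exception type that remembers why the conversion failed.
class ConversionError : public std::runtime_error
{
public:
    explicit ConversionError(ConversionResult result)
        : std::runtime_error(Describe(result)), result_(result) {}

    ConversionResult result() const { return result_; }

private:
    static const char* Describe(ConversionResult result)
    {
        switch (result)
        {
        case sourceExhausted: return "UTF conversion failed: truncated input";
        case targetExhausted: return "UTF conversion failed: output buffer too small";
        case sourceIllegal:   return "UTF conversion failed: malformed input";
        default:              return "UTF conversion failed";
        }
    }

    ConversionResult result_;
};

// Throws if the converter reported anything other than success.
void ThrowOnFailure(ConversionResult result)
{
    if (result != conversionOK)
    {
        throw ConversionError(result);
    }
}
```

A caller can then catch ConversionError and inspect result() to tell malformed input apart from a buffer that was too small.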

I must admit I have only tested it with a 2-byte wchar_t, so the 4-byte branch is untested.

Comments are welcome.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


About the Author

rh_

Germany

Last Updated 21 May 2007
Article Copyright 2007 by rh_