Convert between std::string and std::wstring, UTF-8 and UTF-16

By rh_

How to safely convert STL strings between Unicode formats
C++, Windows, Visual Studio, STL, Dev

Posted: 9 Feb 2007
Updated: 20 May 2007

Introduction

I needed to convert between UTF-8 encoded std::string and UTF-16 encoded std::wstring. Here and there I found conversion functions for native C strings, but these leave the memory handling to the caller, which is not nice in modern times.

The best converter is probably the one from unicode.org. Here is a wrapper around it that converts STL strings.

Unlike other articles, this one has no further dependencies and does not introduce yet another string class; it only converts STL strings, and that's it. It is also better than the widely found

std::wstring widestring(sourcestring.begin(), sourcestring.end()); 

which only works for ASCII text, because it widens each byte on its own instead of decoding multi-byte UTF-8 sequences.
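
To see where the iterator-range constructor goes wrong, here is a minimal sketch. The helper names NaiveWiden, AsciiSurvives and NaiveLengthOfEAcute are made up for this illustration. The input "\xC3\xA9" is the two-byte UTF-8 encoding of U+00E9; a real conversion would produce one wide character, while the byte-wise copy produces two.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: widens a string byte by byte, exactly like the
// iterator-range constructor does. No UTF-8 decoding takes place.
std::wstring NaiveWiden(const std::string& source)
{
    return std::wstring(source.begin(), source.end());
}

// ASCII survives because every character is a single byte below 0x80.
bool AsciiSurvives()
{
    return NaiveWiden("abc") == L"abc";
}

// "\xC3\xA9" encodes the single code point U+00E9, but the naive copy
// turns each of the two bytes into a separate wide character.
std::size_t NaiveLengthOfEAcute()
{
    return NaiveWiden("\xC3\xA9").size();
}
```

The exact values of the two wide characters depend on whether char is signed on the platform, which is one more reason not to rely on this shortcut.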

Source

The header goes like this:

#ifndef UTFCONVERTER_H
#define UTFCONVERTER_H

#include <string>

namespace UtfConverter
{
    std::wstring FromUtf8(const std::string& utf8string);
    std::string ToUtf8(const std::wstring& widestring);
}

#endif
 

I guess this is simple and easy enough to use.

Here is the source code:

#include "stdafx.h"

#include <stdexcept>
#include <string>

#include "UtfConverter.h"

#include "ConvertUTF.h"


namespace UtfConverter
{

    std::wstring FromUtf8(const std::string& utf8string)
    {
        // Decoding never yields more wide characters than there are UTF-8 bytes.
        size_t widesize = utf8string.length();
        if (sizeof(wchar_t) == 2)
        {
            wchar_t* widestringnative = new wchar_t[widesize+1];
            const UTF8* sourcestart = reinterpret_cast<const UTF8*>(utf8string.c_str());
            const UTF8* sourceend = sourcestart + widesize;
            UTF16* const targetbegin = reinterpret_cast<UTF16*>(widestringnative);
            UTF16* targetstart = targetbegin;
            UTF16* targetend = targetstart + widesize;
            ConversionResult res = ConvertUTF8toUTF16(&sourcestart, sourceend, &targetstart, targetend, strictConversion);
            if (res != conversionOK)
            {
                delete [] widestringnative;
                throw std::runtime_error("FromUtf8: conversion failed");
            }
            *targetstart = 0;
            // Pass the length explicitly so that embedded nulls do not truncate the result.
            std::wstring resultstring(widestringnative, static_cast<size_t>(targetstart - targetbegin));
            delete [] widestringnative;
            return resultstring;
        }
        else if (sizeof(wchar_t) == 4)
        {
            // One extra element for the terminator written below.
            wchar_t* widestringnative = new wchar_t[widesize+1];
            const UTF8* sourcestart = reinterpret_cast<const UTF8*>(utf8string.c_str());
            const UTF8* sourceend = sourcestart + widesize;
            UTF32* const targetbegin = reinterpret_cast<UTF32*>(widestringnative);
            UTF32* targetstart = targetbegin;
            UTF32* targetend = targetstart + widesize;
            ConversionResult res = ConvertUTF8toUTF32(&sourcestart, sourceend, &targetstart, targetend, strictConversion);
            if (res != conversionOK)
            {
                delete [] widestringnative;
                throw std::runtime_error("FromUtf8: conversion failed");
            }
            *targetstart = 0;
            std::wstring resultstring(widestringnative, static_cast<size_t>(targetstart - targetbegin));
            delete [] widestringnative;
            return resultstring;
        }
        else
        {
            throw std::runtime_error("FromUtf8: unsupported wchar_t size");
        }
        return L"";
    }

    std::string ToUtf8(const std::wstring& widestring)
    {
        size_t widesize = widestring.length();

        if (sizeof(wchar_t) == 2)
        {
            // Worst case: 3 bytes per UTF-16 code unit, plus the terminator.
            size_t utf8size = 3 * widesize + 1;
            char* utf8stringnative = new char[utf8size];
            const UTF16* sourcestart = reinterpret_cast<const UTF16*>(widestring.c_str());
            const UTF16* sourceend = sourcestart + widesize;
            UTF8* const targetbegin = reinterpret_cast<UTF8*>(utf8stringnative);
            UTF8* targetstart = targetbegin;
            UTF8* targetend = targetstart + utf8size - 1;
            ConversionResult res = ConvertUTF16toUTF8(&sourcestart, sourceend, &targetstart, targetend, strictConversion);
            if (res != conversionOK)
            {
                delete [] utf8stringnative;
                throw std::runtime_error("ToUtf8: conversion failed");
            }
            *targetstart = 0;
            std::string resultstring(utf8stringnative, static_cast<size_t>(targetstart - targetbegin));
            delete [] utf8stringnative;
            return resultstring;
        }
        else if (sizeof(wchar_t) == 4)
        {
            // Worst case: 4 bytes per UTF-32 code point, plus the terminator.
            size_t utf8size = 4 * widesize + 1;
            char* utf8stringnative = new char[utf8size];
            const UTF32* sourcestart = reinterpret_cast<const UTF32*>(widestring.c_str());
            const UTF32* sourceend = sourcestart + widesize;
            UTF8* const targetbegin = reinterpret_cast<UTF8*>(utf8stringnative);
            UTF8* targetstart = targetbegin;
            UTF8* targetend = targetstart + utf8size - 1;
            ConversionResult res = ConvertUTF32toUTF8(&sourcestart, sourceend, &targetstart, targetend, strictConversion);
            if (res != conversionOK)
            {
                delete [] utf8stringnative;
                throw std::runtime_error("ToUtf8: conversion failed");
            }
            *targetstart = 0;
            std::string resultstring(utf8stringnative, static_cast<size_t>(targetstart - targetbegin));
            delete [] utf8stringnative;
            return resultstring;
        }
        else
        {
            throw std::runtime_error("ToUtf8: unsupported wchar_t size");
        }
        return "";
    }
}
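
The factors 3 and 4 in the utf8size calculations come from the worst-case length of a UTF-8 sequence. The sketch below spells that reasoning out; Utf8Length and MaxUtf8Size are illustrative names and not part of ConvertUTF.

```cpp
#include <cstddef>

// Number of bytes UTF-8 needs for one Unicode code point.
std::size_t Utf8Length(unsigned long codepoint)
{
    if (codepoint < 0x80)    return 1;   // ASCII
    if (codepoint < 0x800)   return 2;
    if (codepoint < 0x10000) return 3;   // rest of the Basic Multilingual Plane
    return 4;                            // supplementary planes, up to U+10FFFF
}

// Worst-case UTF-8 size (without terminator) for a wide string of the given
// length. wcharsize stands for sizeof(wchar_t) on the target platform.
std::size_t MaxUtf8Size(std::size_t widesize, std::size_t wcharsize)
{
    // UTF-16: a single code unit holds a BMP code point (at most 3 bytes);
    // a surrogate pair is 2 units for 4 bytes, so only 2 bytes per unit.
    // UTF-32: one unit per code point, at most 4 bytes.
    return (wcharsize == 2 ? 3 : 4) * widesize;
}
```

In the other direction no factor is needed, because a UTF-8 sequence never decodes to more wide characters than it has bytes.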
  

How to do it better

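Here's another version that avoids new and delete by writing directly into the string's buffer, which also means nothing can leak if an exception is thrown. Writing through &resultstring[0] relies on the string storing its characters contiguously. The standard guarantees this only since C++11, although the common implementations have always worked that way. After the conversion the string must be shrunk to the number of characters actually written.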

#include "stdafx.h"

#include <stdexcept>
#include <string>

#include "UtfConverter.h"

#include "ConvertUTF.h"


namespace UtfConverter
{

    std::wstring FromUtf8(const std::string& utf8string)
    {
        size_t widesize = utf8string.length();
        if (sizeof(wchar_t) == 2)
        {
            std::wstring resultstring;
            // The extra element keeps the string non-empty, so &resultstring[0] is always valid.
            resultstring.resize(widesize+1, L'\0');
            const UTF8* sourcestart = reinterpret_cast<const UTF8*>(utf8string.c_str());
            const UTF8* sourceend = sourcestart + widesize;
            UTF16* const targetbegin = reinterpret_cast<UTF16*>(&resultstring[0]);
            UTF16* targetstart = targetbegin;
            UTF16* targetend = targetstart + widesize;
            ConversionResult res = ConvertUTF8toUTF16(&sourcestart, sourceend, &targetstart, targetend, strictConversion);
            if (res != conversionOK)
            {
                throw std::runtime_error("FromUtf8: conversion failed");
            }
            // Shrink to the number of code units actually written.
            resultstring.resize(static_cast<size_t>(targetstart - targetbegin));
            return resultstring;
        }
        else if (sizeof(wchar_t) == 4)
        {
            std::wstring resultstring;
            resultstring.resize(widesize+1, L'\0');
            const UTF8* sourcestart = reinterpret_cast<const UTF8*>(utf8string.c_str());
            const UTF8* sourceend = sourcestart + widesize;
            UTF32* const targetbegin = reinterpret_cast<UTF32*>(&resultstring[0]);
            UTF32* targetstart = targetbegin;
            UTF32* targetend = targetstart + widesize;
            ConversionResult res = ConvertUTF8toUTF32(&sourcestart, sourceend, &targetstart, targetend, strictConversion);
            if (res != conversionOK)
            {
                throw std::runtime_error("FromUtf8: conversion failed");
            }
            resultstring.resize(static_cast<size_t>(targetstart - targetbegin));
            return resultstring;
        }
        else
        {
            throw std::runtime_error("FromUtf8: unsupported wchar_t size");
        }
        return L"";
    }

    std::string ToUtf8(const std::wstring& widestring)
    {
        size_t widesize = widestring.length();

        if (sizeof(wchar_t) == 2)
        {
            // Worst case: 3 bytes per UTF-16 code unit.
            size_t utf8size = 3 * widesize + 1;
            std::string resultstring;
            resultstring.resize(utf8size, '\0');
            const UTF16* sourcestart = reinterpret_cast<const UTF16*>(widestring.c_str());
            const UTF16* sourceend = sourcestart + widesize;
            UTF8* const targetbegin = reinterpret_cast<UTF8*>(&resultstring[0]);
            UTF8* targetstart = targetbegin;
            UTF8* targetend = targetstart + utf8size;
            ConversionResult res = ConvertUTF16toUTF8(&sourcestart, sourceend, &targetstart, targetend, strictConversion);
            if (res != conversionOK)
            {
                throw std::runtime_error("ToUtf8: conversion failed");
            }
            // Shrink to the number of bytes actually written.
            resultstring.resize(static_cast<size_t>(targetstart - targetbegin));
            return resultstring;
        }
        else if (sizeof(wchar_t) == 4)
        {
            // Worst case: 4 bytes per UTF-32 code point.
            size_t utf8size = 4 * widesize + 1;
            std::string resultstring;
            resultstring.resize(utf8size, '\0');
            const UTF32* sourcestart = reinterpret_cast<const UTF32*>(widestring.c_str());
            const UTF32* sourceend = sourcestart + widesize;
            UTF8* const targetbegin = reinterpret_cast<UTF8*>(&resultstring[0]);
            UTF8* targetstart = targetbegin;
            UTF8* targetend = targetstart + utf8size;
            ConversionResult res = ConvertUTF32toUTF8(&sourcestart, sourceend, &targetstart, targetend, strictConversion);
            if (res != conversionOK)
            {
                throw std::runtime_error("ToUtf8: conversion failed");
            }
            resultstring.resize(static_cast<size_t>(targetstart - targetbegin));
            return resultstring;
        }
        else
        {
            throw std::runtime_error("ToUtf8: unsupported wchar_t size");
        }
        return "";
    }
}
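
One thing to watch when writing into the string buffer: the string keeps whatever length resize() gave it, so it has to be shrunk to the number of characters actually written, or it will carry trailing '\0' characters. The following self-contained sketch shows the pattern without ConvertUTF. CopyWithoutSpaces is a made-up stand-in that follows the same pointer-advancing convention as the ConvertUTF functions, and StripSpaces is a hypothetical caller.

```cpp
#include <cstddef>
#include <string>

// Stand-in for a C-style converter: writes into [*targetstart, targetend)
// and advances *targetstart past the last element written. Here it simply
// copies the input while skipping spaces.
void CopyWithoutSpaces(const char* source, const char* sourceend,
                       char** targetstart, char* targetend)
{
    while (source != sourceend && *targetstart != targetend)
    {
        if (*source != ' ')
        {
            *(*targetstart)++ = *source;
        }
        ++source;
    }
}

std::string StripSpaces(const std::string& input)
{
    if (input.empty())
    {
        return std::string();
    }
    std::string result(input.size(), '\0');   // worst case: nothing is removed
    char* const targetbegin = &result[0];
    char* target = targetbegin;
    CopyWithoutSpaces(input.data(), input.data() + input.size(),
                      &target, targetbegin + result.size());
    // Without this resize the string would keep its worst-case length.
    result.resize(static_cast<std::size_t>(target - targetbegin));
    return result;
}
```

Allocating for the worst case and shrinking afterwards costs some memory for a moment, but it keeps the conversion to a single pass.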


How to use it

Just add the header and the source file to your project. Download the Unicode converter (ConvertUTF.h and ConvertUTF.c) from http://www.unicode.org/Public/PROGRAMS/CVTUTF/ and add that to the project, too. It should just work.

Of course you can throw whatever exception you like upon failure.
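
As one possibility, a dedicated exception type lets callers tell conversion failures apart from other errors. This is only a sketch: the class name ConversionError and its ResultCode member are hypothetical, and the result code is stored as a plain int so that the example does not depend on ConvertUTF.h.

```cpp
#include <stdexcept>
#include <string>

namespace UtfConverter
{
    // Hypothetical exception type for failed conversions.
    class ConversionError : public std::runtime_error
    {
    public:
        ConversionError(const std::string& function, int resultcode)
            : std::runtime_error(function + ": conversion failed"),
              resultcode_(resultcode)
        {
        }

        // The ConversionResult value reported by the converter, as an int.
        int ResultCode() const { return resultcode_; }

    private:
        int resultcode_;
    };
}
```

Inside the wrapper one would then write something like throw ConversionError("FromUtf8", res); and catch either ConversionError or std::runtime_error at the call site.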

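I must admit I have tested it only with a 2-byte wchar_t. Note that the 4-byte branch also assumes that the UTF32 type declared in ConvertUTF.h is exactly as wide as wchar_t.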
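
The two wchar_t sizes differ most visibly for characters outside the Basic Multilingual Plane: with a 2-byte wchar_t such a character occupies two elements of the std::wstring (a surrogate pair), with a 4-byte wchar_t only one. A small sketch, using the made-up helper EncodeWide and assuming its argument is a valid code point:

```cpp
#include <string>

// Illustrative helper: stores one code point in a std::wstring the way the
// platform's wchar_t expects it.
std::wstring EncodeWide(unsigned long codepoint)
{
    std::wstring result;
    if (sizeof(wchar_t) == 2 && codepoint >= 0x10000)
    {
        // UTF-16: split into a high and a low surrogate.
        unsigned long offset = codepoint - 0x10000;
        result += static_cast<wchar_t>(0xD800 + (offset >> 10));
        result += static_cast<wchar_t>(0xDC00 + (offset & 0x3FF));
    }
    else
    {
        // UTF-32, or a BMP code point in UTF-16: one element is enough.
        result += static_cast<wchar_t>(codepoint);
    }
    return result;
}
```

This is why widestring.length() counts code units rather than characters on Windows, and why the buffer calculations in the converter work with code units throughout.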

Comments are welcome.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.


About the Author

rh_



Location: Germany

Copyright 2007 by rh_