UTF16 to UTF8 to UTF16 simple CString based conversion

16 May 2008, CPOL
Use CString to convert between UTF8 and UTF16.

Introduction

To convert strings between UTF8 and UTF16 (as well as other encodings), the Windows API provides the MultiByteToWideChar and WideCharToMultiByte functions. These functions work with raw, null-terminated char and wchar_t strings. Working with such strings requires manual memory management, and if you use the functions extensively, your code can quickly become cluttered. That's why I decided to wrap these two functions for use with the more coder-friendly CString types.

The conversion functions

UTF16toUTF8

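CStringA UTF16toUTF8(const CStringW& utf16)
{
   CStringA utf8;
   // First call: query the required size in bytes, including the terminating null.
   int len = WideCharToMultiByte(CP_UTF8, 0, utf16, -1, NULL, 0, 0, 0);
   if (len>1)
   { 
      // GetBuffer(len-1) reserves len-1 characters plus room for the terminating null.
      char *ptr = utf8.GetBuffer(len-1);
      // Second call: convert directly into the CString buffer.
      if (ptr) WideCharToMultiByte(CP_UTF8, 0, utf16, -1, ptr, len, 0, 0);
      utf8.ReleaseBuffer();
   }
   return utf8;
}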
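
To see what a UTF-16 to UTF-8 conversion produces without calling the Windows API, here is a minimal portable sketch in standard C++. The name Utf16ToUtf8Portable is hypothetical and not part of the article's code. Replacing unpaired surrogates with U+FFFD is an assumption of this sketch, not a statement about how WideCharToMultiByte treats them.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not from the article): encodes UTF-16 code units as
// UTF-8 bytes using only the standard library.
std::string Utf16ToUtf8Portable(const std::u16string& utf16)
{
    std::string utf8;
    utf8.reserve(utf16.size() * 3);  // a UTF-16 code unit needs at most 3 bytes

    for (std::size_t i = 0; i < utf16.size(); ++i)
    {
        char32_t cp = utf16[i];

        if (cp >= 0xD800 && cp <= 0xDBFF)  // high surrogate
        {
            const bool hasLow = i + 1 < utf16.size() &&
                                utf16[i + 1] >= 0xDC00 && utf16[i + 1] <= 0xDFFF;
            if (hasLow)
            {
                // Combine the surrogate pair into one code point.
                cp = 0x10000 + ((cp - 0xD800) << 10) + (utf16[i + 1] - 0xDC00);
                ++i;
            }
            else
            {
                cp = 0xFFFD;  // assumption: unpaired surrogate becomes U+FFFD
            }
        }
        else if (cp >= 0xDC00 && cp <= 0xDFFF)
        {
            cp = 0xFFFD;  // assumption: stray low surrogate becomes U+FFFD
        }

        if (cp < 0x80)
        {
            utf8 += static_cast<char>(cp);
        }
        else if (cp < 0x800)
        {
            utf8 += static_cast<char>(0xC0 | (cp >> 6));
            utf8 += static_cast<char>(0x80 | (cp & 0x3F));
        }
        else if (cp < 0x10000)
        {
            utf8 += static_cast<char>(0xE0 | (cp >> 12));
            utf8 += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            utf8 += static_cast<char>(0x80 | (cp & 0x3F));
        }
        else
        {
            utf8 += static_cast<char>(0xF0 | (cp >> 18));
            utf8 += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            utf8 += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            utf8 += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }

    assert(utf8.size() <= utf16.size() * 3);
    return utf8;
}
```

For example, Utf16ToUtf8Portable(u"\u00F2") yields the two bytes 0xC3 0xB2, the UTF-8 form of "ò".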

UTF8toUTF16

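CStringW UTF8toUTF16(const CStringA& utf8)
{
   CStringW utf16;
   // First call: query the required size in wide characters, including the terminating null.
   int len = MultiByteToWideChar(CP_UTF8, 0, utf8, -1, NULL, 0);
   if (len>1)
   { 
      // GetBuffer(len-1) reserves len-1 characters plus room for the terminating null.
      wchar_t *ptr = utf16.GetBuffer(len-1);
      // Second call: convert directly into the CString buffer.
      if (ptr) MultiByteToWideChar(CP_UTF8, 0, utf8, -1, ptr, len);
      utf16.ReleaseBuffer();
   }
   return utf16;
}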
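
The reverse direction can be sketched the same way in portable standard C++. The name Utf8ToUtf16Portable is hypothetical and not part of the article's code. Emitting U+FFFD for malformed input (stray or missing continuation bytes, overlong forms, encoded surrogates, values above U+10FFFF) is a choice made for this sketch and is not claimed to match MultiByteToWideChar in every case.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not from the article): decodes UTF-8 bytes into UTF-16
// code units using only the standard library.
std::u16string Utf8ToUtf16Portable(const std::string& utf8)
{
    std::u16string utf16;
    utf16.reserve(utf8.size());  // never more code units than input bytes

    std::size_t i = 0;
    while (i < utf8.size())
    {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        char32_t cp = 0xFFFD;     // default for an invalid lead byte
        std::size_t extra = 0;    // continuation bytes expected after the lead
        char32_t minimum = 0;     // smallest code point allowed for this length

        if (lead < 0x80)                       { cp = lead; }
        else if (lead >= 0xC2 && lead <= 0xDF) { cp = lead & 0x1F; extra = 1; minimum = 0x80; }
        else if ((lead & 0xF0) == 0xE0)        { cp = lead & 0x0F; extra = 2; minimum = 0x800; }
        else if (lead >= 0xF0 && lead <= 0xF4) { cp = lead & 0x07; extra = 3; minimum = 0x10000; }
        ++i;

        // Consume continuation bytes of the form 10xxxxxx.
        std::size_t k = 0;
        for (; k < extra && i < utf8.size(); ++k, ++i)
        {
            const unsigned char c = static_cast<unsigned char>(utf8[i]);
            if ((c & 0xC0) != 0x80)
                break;  // not a continuation byte; it is re-read as a lead byte
            cp = (cp << 6) | (c & 0x3F);
        }

        const bool truncated = (k != extra);
        const bool outOfRange = extra > 0 &&
            (cp < minimum || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF));
        if (truncated || outOfRange)
            cp = 0xFFFD;  // assumption: malformed input becomes U+FFFD

        if (cp < 0x10000)
        {
            utf16 += static_cast<char16_t>(cp);
        }
        else
        {
            // Split code points above the BMP into a surrogate pair.
            cp -= 0x10000;
            utf16 += static_cast<char16_t>(0xD800 + (cp >> 10));
            utf16 += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
        }
    }

    assert(utf16.size() <= utf8.size());
    return utf16;
}
```

The reserve call relies on the fact that every iteration consumes at least one byte and emits two code units only after consuming four bytes, so the output never has more code units than the input has bytes.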

Using the code

Using the two helper functions is straightforward. Note, however, that they are only useful if your project is set to use the UNICODE character set. They also require Visual Studio 7.1 or later: Visual Studio 6.0 does not provide CStringA and CStringW, so the code will not compile there. The following snippet shows a usage example:

CStringW utf16(L"òèçùà12345"); // wide literal, so the text does not pass through the ANSI code page
CStringA utf8 = UTF16toUTF8(utf16);
CStringW utf16_2 = UTF8toUTF16(utf8);

History

After a comment by Ivo Beltchev, I decided to change the functions as he suggested. Initially, I designed the functions like this:

CStringA UTF16toUTF8(const CStringW& utf16)
{
  LPSTR pszUtf8 = NULL;
  CStringA utf8("");

  if (utf16.IsEmpty()) 
    return utf8; //empty input string

  size_t nLen16 = utf16.GetLength();
  size_t nLen8 = 0;

  if ((nLen8 = WideCharToMultiByte (CP_UTF8, 0, utf16, nLen16, 
                                    NULL, 0, 0, 0) + 2) == 2)
    return utf8; //conversion error!

  pszUtf8 = new char [nLen8];
  if (pszUtf8)
  {
    memset (pszUtf8, 0x00, nLen8);
    WideCharToMultiByte(CP_UTF8, 0, utf16, nLen16, pszUtf8, nLen8, 0, 0);
    utf8 = CStringA(pszUtf8);
  }

  delete [] pszUtf8;
  return utf8; //utf8 encoded string
}

CStringW UTF8toUTF16(const CStringA& utf8)
{
  LPWSTR pszUtf16 = NULL;
  CStringW utf16("");
  
  if (utf8.IsEmpty()) 
    return utf16; //empty input string

  size_t nLen8 = utf8.GetLength();
  size_t nLen16 = 0;

  if ((nLen16 = MultiByteToWideChar (CP_UTF8, 0, utf8, nLen8, NULL, 0)) == 0)
    return utf16; //conversion error!

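  pszUtf16 = new wchar_t[nLen16 + 1]; // one extra element for the terminating null
  if (pszUtf16)
  {
    wmemset (pszUtf16, 0x00, nLen16 + 1);
    MultiByteToWideChar (CP_UTF8, 0, utf8, nLen8, pszUtf16, nLen16);
    utf16 = CStringW(pszUtf16);
  }

  delete [] pszUtf16; // free the temporary buffer, not the CString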
  return utf16; //utf16 encoded string
}

These functions work just as well, but the newer versions shown at the top of the article are shorter and avoid the temporary buffer. Thanks to Ivo for the observation!

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Software Developer (Senior)
Romania

Comments and Discussions

 
Question: There no need to adjust len for GetBuffer! (Theo Buys, 15-Mar-17 3:02)
Bug: Small bug ... (Tomice, 14-Jan-15 22:26)
Question: Consider using CA2T and CT2A (kanalbrummer, 24-Nov-13 3:11)
Answer: Re: Consider using CA2T and CT2A (Theo Buys, 13-Apr-15 5:15)
Question: Traditional Chinese characters aren’t being read from network stream (Balaji1982, 22-Mar-12 17:53)
General: My vote of 3 (Dezhi Zhao, 13-Jan-11 4:55)
General: Even more elegant! (Elmue, 23-Aug-08 10:37)
Hello

Your new version is much better but still not optimal.

Why do you call WideCharToMultiByte twice?
It is not necessary to let the Windows API do the entire conversion twice.
(Not the best performance!)

You know in advance that the UTF-8 string (in bytes) will NEVER be longer than 4 times the length of the UTF-16 string (in wide characters):

CStringA UTF16toUTF8(const CStringW& utf16)
{
   CStringA utf8;
   int len = utf16.GetLength() *4;
   char *ptr = utf8.GetBuffer(len);
   if (ptr) WideCharToMultiByte(CP_UTF8, 0, utf16, -1, ptr, len, 0, 0);
   utf8.ReleaseBuffer();
   return utf8;
}


When you pass a string as return value from a function it will ALWAYS be copied to a new instance.

So
CString X = UTF16toUTF8(Y);

will create a new CString instance and copy the content from the returned string utf8 to the new string X and destroy utf8 afterwards.
utf8 is a local variable which cannot leave the function.

This happens every time you return local variables from a function: they are copied!

So even if the buffer of utf8 was much too big the string utf8 will be deleted at the moment when the function exits!
The string X will be created with the size required to hold the string data no matter what buffer size utf8 had before.
If utf8 had a buffer of 10000 bytes but the string data in it were only 10 characters long, then X would allocate a buffer only a little bigger than 10 bytes.

So don't worry about a too big buffer!
Calling WideCharToMultiByte() twice is COMPLETELY useless!!
_______________________________________


You also know in advance that the UTF-16 string (in wide characters) will NEVER be longer than the UTF-8 string (in bytes):

CStringW UTF8toUTF16(const CStringA& utf8)
{
   CStringW utf16;
   int len = utf8.GetLength();
   WCHAR *ptr = utf16.GetBuffer(len);
   // GetBuffer(len) also reserves room for the terminating null, so the output size is len + 1.
   if (ptr) MultiByteToWideChar(CP_UTF8, 0, utf8, -1, ptr, len + 1);
   utf16.ReleaseBuffer();
   return utf16;
}



P.S.
The expert writes the simplest code.
The beginner writes the most complicated code.
These 3 versions of the same thing show that very clearly!

Elmü
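
The size bounds used in this comment can be checked without the Windows API. The sketch below counts how many UTF-8 bytes a UTF-16 string needs. A code unit below the surrogate range needs 1 to 3 bytes, and a surrogate pair (two code units) needs 4 bytes, so the total never exceeds 3 bytes per code unit and the factor of 4 is a safe overestimate. The name Utf8LengthOfUtf16 is hypothetical, and counting an unpaired surrogate as 3 bytes (as if replaced by U+FFFD) is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not from the article): returns the number of UTF-8
// bytes needed to encode the given UTF-16 string.
std::size_t Utf8LengthOfUtf16(const std::u16string& utf16)
{
    std::size_t bytes = 0;
    for (std::size_t i = 0; i < utf16.size(); ++i)
    {
        const char16_t unit = utf16[i];
        const bool isHigh = unit >= 0xD800 && unit <= 0xDBFF;
        const bool nextIsLow = i + 1 < utf16.size() &&
                               utf16[i + 1] >= 0xDC00 && utf16[i + 1] <= 0xDFFF;

        if (isHigh && nextIsLow)
        {
            bytes += 4;  // surrogate pair: two code units, four bytes
            ++i;
        }
        else if (unit < 0x80)
        {
            bytes += 1;  // ASCII
        }
        else if (unit < 0x800)
        {
            bytes += 2;
        }
        else
        {
            bytes += 3;  // rest of the BMP; assumption: unpaired surrogate counts as U+FFFD
        }
    }

    // Upper bound: at most 3 bytes per UTF-16 code unit.
    assert(bytes <= utf16.size() * 3);
    return bytes;
}
```

For the article's sample text, five accented letters followed by five digits, the count is 5 × 2 + 5 × 1 = 15 bytes for 10 code units.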
General: Re: Even more elegant! [modified] (John Paul Pirau, 4-Sep-08 2:25)
General: Yes it is more elegant and it works! (Elmue, 9-Sep-08 16:05)
General: Even more elegant? (Theo Buys, 15-Mar-17 5:40)
General: Re: Even more elegant? (Elmue, 15-Mar-17 9:31)
General: Re: Even more elegant? (Theo Buys, 16-Mar-17 7:25)
General: Re: Even more elegant? (Elmue, 17-Mar-17 9:42)
General: Re: Even more elegant? (Theo Buys, 18-Mar-17 13:02)
Question: Since when CStringA is UTF-8? (Shao Voon Wong, 19-May-08 22:13)
Answer: Re: Since when CStringA is UTF-8? (John Paul Pirau, 19-May-08 23:03)
Answer: Re: Since when CStringA is UTF-8? (Nemanja Trifunovic, 29-May-08 4:02)
General: Re: Since when CStringA is UTF-8? (John Paul Pirau, 2-Jun-08 23:26)
General: Re: Since when CStringA is UTF-8? (Theo Buys, 13-Apr-15 5:35)
General: No need for a temporary buffer (Ivo Beltchev, 17-May-08 8:48)
General: Re: No need for a temporary buffer (John Paul Pirau, 19-May-08 2:35)
Question: Re: No need for a temporary buffer (guoxuran, 21-Jun-09 17:54)
