
UTF16 to UTF8 to UTF16 simple CString based conversion

16 May 2008
Use CString to convert between UTF8 and UTF16.

Introduction

To convert strings between UTF8 and UTF16 (as well as other encodings), the Windows API provides the MultiByteToWideChar and WideCharToMultiByte functions. These functions operate on raw, null-terminated char and wchar_t strings. Working with such strings requires some manual memory management, and if you call the functions in many places, your code quickly becomes cluttered. That's why I decided to wrap these two functions for use with the more coder-friendly CString types.

The conversion functions

UTF16toUTF8

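CStringA UTF16toUTF8(const CStringW& utf16)
{
   CStringA utf8;
   // First call: query the required size in bytes, including the terminating null.
   int len = WideCharToMultiByte(CP_UTF8, 0, utf16, -1, NULL, 0, NULL, NULL);
   if (len > 1)
   {
      // GetBuffer(n) reserves room for n characters plus the terminating null.
      char *ptr = utf8.GetBuffer(len - 1);
      // Second call: convert directly into the CString's own buffer.
      int written = ptr ? WideCharToMultiByte(CP_UTF8, 0, utf16, -1, ptr, len, NULL, NULL) : 0;
      // Keep the converted text, or leave the string empty if the conversion failed.
      utf8.ReleaseBuffer(written > 0 ? written - 1 : 0);
   }
   return utf8;
}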

UTF8toUTF16

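CStringW UTF8toUTF16(const CStringA& utf8)
{
   CStringW utf16;
   // First call: query the required size in wide characters, including the terminating null.
   int len = MultiByteToWideChar(CP_UTF8, 0, utf8, -1, NULL, 0);
   if (len > 1)
   {
      // GetBuffer(n) reserves room for n characters plus the terminating null.
      wchar_t *ptr = utf16.GetBuffer(len - 1);
      // Second call: convert directly into the CString's own buffer.
      int written = ptr ? MultiByteToWideChar(CP_UTF8, 0, utf8, -1, ptr, len) : 0;
      // Keep the converted text, or leave the string empty if the conversion failed.
      utf16.ReleaseBuffer(written > 0 ? written - 1 : 0);
   }
   return utf16;
}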
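
Both functions rely on the same "measure first, then convert" pattern. The following is only a portable sketch of that pattern, not the Windows API: EncodeUtf8 and ToUtf8 are hypothetical names, std::u16string and std::string stand in for CStringW and CStringA, and the real WideCharToMultiByte differs in details such as flags, error reporting, and counting the terminating null when the input length is -1.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for WideCharToMultiByte(CP_UTF8, ...).
// With out == nullptr it only measures. Returns the number of UTF-8 bytes
// needed (or written), or 0 if the output buffer is too small.
std::size_t EncodeUtf8(const char16_t* in, std::size_t inLen,
                       char* out, std::size_t outCap)
{
    std::size_t n = 0;
    for (std::size_t i = 0; i < inLen; ++i) {
        char32_t cp = in[i];
        const bool high = cp >= 0xD800 && cp <= 0xDBFF;
        const bool low = cp >= 0xDC00 && cp <= 0xDFFF;
        if (high && i + 1 < inLen && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Combine a surrogate pair into one code point above U+FFFF.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (high || low) {
            cp = 0xFFFD; // unpaired surrogate: substitute U+FFFD
        }

        unsigned char b[4];
        std::size_t count;
        if (cp < 0x80) {
            b[0] = static_cast<unsigned char>(cp);
            count = 1;
        } else if (cp < 0x800) {
            b[0] = static_cast<unsigned char>(0xC0 | (cp >> 6));
            b[1] = static_cast<unsigned char>(0x80 | (cp & 0x3F));
            count = 2;
        } else if (cp < 0x10000) {
            b[0] = static_cast<unsigned char>(0xE0 | (cp >> 12));
            b[1] = static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3F));
            b[2] = static_cast<unsigned char>(0x80 | (cp & 0x3F));
            count = 3;
        } else {
            b[0] = static_cast<unsigned char>(0xF0 | (cp >> 18));
            b[1] = static_cast<unsigned char>(0x80 | ((cp >> 12) & 0x3F));
            b[2] = static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3F));
            b[3] = static_cast<unsigned char>(0x80 | (cp & 0x3F));
            count = 4;
        }
        if (out) {
            if (n + count > outCap) return 0; // buffer too small
            for (std::size_t k = 0; k < count; ++k)
                out[n + k] = static_cast<char>(b[k]);
        }
        n += count;
    }
    return n;
}

// Wrapper in the spirit of UTF16toUTF8: measure, size the string, then
// convert straight into the string's own storage.
std::string ToUtf8(const std::u16string& s)
{
    std::string result;
    const std::size_t len = EncodeUtf8(s.data(), s.size(), nullptr, 0);
    if (len > 0) {
        result.resize(len);
        EncodeUtf8(s.data(), s.size(), &result[0], result.size());
    }
    return result;
}
```

As in the CString versions, the wrapper hides the sizing call and the buffer handling, so callers only ever see string objects.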

Using the code

Using the two helper functions is straightforward. Note, however, that they are only useful if your project is set to use the UNICODE character set. They also require Visual Studio 7.1 or later: Visual Studio 6.0 does not provide CStringA and CStringW, so the code will not compile there. The following snippet shows a usage example:

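// The L prefix makes this a wide literal, so the accented characters
// are not converted through the ANSI code page first.
CStringW utf16(L"òèçùà12345");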
CStringA utf8 = UTF16toUTF8(utf16);
CStringW utf16_2 = UTF8toUTF16(utf8);
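
To get a feel for what the round trip does to the sample string, the sketch below counts the UTF8 bytes a UTF16 string needs. It is a portable illustration only: Utf8ByteCount is a hypothetical helper, std::u16string stands in for CStringW, and the input is assumed to be well-formed (every surrogate belongs to a valid pair).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts the UTF-8 bytes needed for a UTF-16 string without converting it.
// Assumes well-formed input: every surrogate is half of a valid pair.
std::size_t Utf8ByteCount(const std::u16string& s)
{
    std::size_t bytes = 0;
    for (char16_t unit : s) {
        if (unit < 0x80)
            bytes += 1; // ASCII
        else if (unit < 0x800)
            bytes += 2; // e.g. accented Latin letters
        else if (unit >= 0xD800 && unit <= 0xDFFF)
            bytes += 2; // half of a surrogate pair (4 bytes per pair)
        else
            bytes += 3; // rest of the Basic Multilingual Plane
    }
    return bytes;
}

// The sample string from the usage example, written with escapes so the
// result does not depend on the source file encoding.
inline std::u16string SampleString()
{
    return u"\u00F2\u00E8\u00E7\u00F9\u00E0" u"12345";
}
```

For the sample, Utf8ByteCount(SampleString()) is 15: two bytes for each of the five accented letters plus one byte for each of the five digits. No UTF16 code unit ever needs more than three UTF8 bytes, so three times the string length is always a safe upper bound.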

History

After a comment by Ivo Beltchev, I decided to change the functions as he suggested. Initially, I designed the functions like this:

CStringA UTF16toUTF8(const CStringW& utf16)
{
  LPSTR pszUtf8 = NULL;
  CStringA utf8("");

  if (utf16.IsEmpty()) 
    return utf8; //empty input string

  size_t nLen16 = utf16.GetLength();
  size_t nLen8 = 0;

  if ((nLen8 = WideCharToMultiByte (CP_UTF8, 0, utf16, nLen16, 
                                    NULL, 0, 0, 0) + 2) == 2)
    return utf8; //conversion error!

  pszUtf8 = new char [nLen8];
  if (pszUtf8)
  {
    memset (pszUtf8, 0x00, nLen8);
    WideCharToMultiByte(CP_UTF8, 0, utf16, nLen16, pszUtf8, nLen8, 0, 0);
    utf8 = CStringA(pszUtf8);
  }

  delete [] pszUtf8;
  return utf8; //utf8 encoded string
}

CStringW UTF8toUTF16(const CStringA& utf8)
{
  LPWSTR pszUtf16 = NULL;
  CStringW utf16("");
  
  if (utf8.IsEmpty()) 
    return utf16; //empty input string

  size_t nLen8 = utf8.GetLength();
  size_t nLen16 = 0;

  if ((nLen16 = MultiByteToWideChar (CP_UTF8, 0, utf8, nLen8, NULL, 0)) == 0)
    return utf16; //conversion error!

  pszUtf16 = new wchar_t[nLen16];
  if (pszUtf16)
  {
    wmemset (pszUtf16, 0x00, nLen16);
    MultiByteToWideChar (CP_UTF8, 0, utf8, nLen8, pszUtf16, nLen16);
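    //the buffer holds exactly nLen16 characters and no terminating null,
    //so pass the length explicitly
    utf16 = CStringW(pszUtf16, static_cast<int>(nLen16));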
  }

  delete [] pszUtf16;
  return utf16; //utf16 encoded string
}

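These original functions produce the same results, but the current versions shown at the top of the article are shorter and slightly more efficient, because they convert directly into the CString buffer instead of going through a temporary one. Thanks to Ivo for the observation!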

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

John Paul Pirau
Software Developer (Senior)
Romania

Comments and Discussions

 
Bug: Small bug ... (Tomice, 14-Jan-15 23:26)
Question: Consider using CA2T and CT2A (kanalbrummer, 24-Nov-13 4:11)
Answer: Re: Consider using CA2T and CT2A (Theo Buys, 13-Apr-15 6:15)
Question: Traditional Chinese characters aren’t being read from network stream (Member 8648508, 22-Mar-12 18:53)
General: My vote of 3 (Dezhi Zhao, 13-Jan-11 5:55)
General: Even more elegant! (Elmue, 23-Aug-08 11:37)
General: Re: Even more elegant! [modified] (John Paul Pirau, 4-Sep-08 3:25)
General: Yes it is more elegant and it works! (Elmue, 9-Sep-08 17:05)

Hello

> ..both are unicode. You should use UTF16 and UTF8 respectively to avoid confusion

You are wrong.
UTF is NOT Unicode.
In Unicode you can display ANY of the possible 65535 characters, for example Japanese, directly and without any conversion.
Not so in UTF:
UTF uses 7 bits for each ASCII character and a complicated REPLACEMENT for all characters above ASCII 127, which can NOT be displayed with 7 bits.

You write:
> while an UTF8 has a variable length of 1 to 4 bytes for each character.

This is exactly THE SAME as what I wrote!
Obviously you did not understand it, so I will explain it again:
Each Unicode character is represented by 1 up to 4 UTF8 characters.
So a Unicode string of 10 Unicode characters is represented by 10, 11, 12, ..., 38, 39 or 40 UTF8 characters, but never more than 40.
So a buffer of 40 bytes will ALWAYS be enough for the conversion of a 10 character Unicode string.
What is so difficult in understanding this?
You don't have to call the API twice.
You know in ADVANCE that a buffer of 4 bytes for each Unicode character is ALWAYS big enough!

(The class CString cares about the terminating null character, you can ignore it in your code)

> If this parameter is set to 0, the function returns the required buffer size for lpMultiByteStr and makes no use of the output parameter itself.

Ohhh.
How good that you can read!
And you think I did not read that ?

If the API function does not write to an output buffer, does that mean that the function does nothing?
Where does the return value come from?
What do you think the Windows API does internally to obtain the size of the buffer?
Do you think that the API knows, just by looking at the address of your Unicode string, how much buffer is required to convert it?
I suppose you don't assume that the API is magic!
Do you think that the microprocessor knows with one command what buffer size is required?
Obviously the API has to analyze the string character by character, up to the last character, before it can tell you the required size.
And this unnecessary work is not required,
because you already know IN ADVANCE what buffer length will ALWAYS be big enough: 4 times the number of Unicode characters.

> you must call this function twice

I'm sorry that you did not understand anything! :(

> where is len defined here?

OK. Oversights happen to everybody.
I fixed that.
But this little flaw does not mean that any word of what I wrote is wrong!

The code in my posting above is 100% correct.
And please, if you still have not understood what I explained the second time, don't expect that I will explain it a third time!

You can simply try it:
Use my code with a string like, for example, "ABCDE".
Convert it to UTF8 and look at what buffer size it requires.

Then try the same with five Japanese, Chinese, Greek or Russian characters.
Look at what buffer sizes are required.
Look at the variables in the debugger of your compiler.
Then you will understand it. (I hope so)

Elmue

P.S.
There are many many years of learning before a beginner becomes an expert!

Question: Since when CStringA is UTF-8? (Wong Shao Voon, 19-May-08 23:13)
Answer: Re: Since when CStringA is UTF-8? (John Paul Pirau, 20-May-08 0:03)
Answer: Re: Since when CStringA is UTF-8? (Nemanja Trifunovic, 29-May-08 5:02)
General: Re: Since when CStringA is UTF-8? (John Paul Pirau, 3-Jun-08 0:26)
General: Re: Since when CStringA is UTF-8? (Theo Buys, 13-Apr-15 6:35)
General: No need for a temporary buffer (Ivo Beltchev, 17-May-08 9:48)
General: Re: No need for a temporary buffer (John Paul Pirau, 19-May-08 3:35)
Question: Re: No need for a temporary buffer (guoxuran, 21-Jun-09 18:54)

Last Updated 16 May 2008
Article Copyright 2008 by John Paul Pirau
Everything else Copyright © CodeProject, 1999-2015