
Upgrading an STL-based application to use Unicode.

Taka Muraoka, 9 Jul 2013
Problems that developers will face when upgrading an STL-based application to use Unicode and how to solve them.

Introduction

I recently upgraded a reasonably large program to use Unicode instead of single-byte characters. Apart from a few legacy modules, I had dutifully used the t- functions and wrapped all my string literals and character constants in _T() macros, safe in the knowledge that when it came time to switch to Unicode, all I had to do was define UNICODE and _UNICODE and everything would Just Work (tm).

Man, was I ever wrong :((

So, I write this article as therapy for the past two weeks of work and in the hope that it will maybe save others some of the pain and misery I have endured. Sigh...

The basics

In theory, writing code that can be compiled using either single- or double-byte characters is straightforward. I was going to write a section on the basics but Chris Maunder has already done it. The techniques he describes are widely known so we'll just get right on to the meat of this article.

Wide file I/O

There are wide versions of the usual stream classes and it is easy to define t-style macros to manage them:

#ifdef _UNICODE
    #define tofstream wofstream 
    #define tstringstream wstringstream
    // etc...
#else 
    #define tofstream ofstream 
    #define tstringstream stringstream
    // etc...
#endif // _UNICODE

And you would use them like this:

tofstream testFile( "test.txt" ) ; 
testFile << _T("ABC") ;

Now, you would expect the above code to produce a 3-byte file when compiled using single-byte characters and a 6-byte file when using double-byte. Except you don't. You get a 3-byte file for both. WTH is going on?!

It turns out that the C++ standard dictates that wide-streams are required to convert double-byte characters to single-byte when writing to a file. So in the example above, the wide string L"ABC" (which is 6 bytes long) gets converted to a narrow string (3 bytes) before it is written to the file. And if that wasn't bad enough, how this conversion is done is implementation-dependent.

I haven't been able to find a definitive explanation of why things were specified like this. My best guess is that a file, by definition, is considered to be a stream of (single-byte) characters and allowing stuff to be written 2-bytes at a time would break that abstraction. Right or wrong, this causes serious problems. For example, you can't write binary data to a wofstream because the class will try to narrow it first (usually failing miserably) before writing it out.
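The narrowing is easy to observe for yourself. The following is a minimal sketch in plain standard C++ (no t-macros); the helper name wideStreamFileSize and the scratch file name used with it are made up for this illustration. It writes a wide string through a std::wofstream and reports how many bytes ended up on disk.

```cpp
#include <cassert>
#include <cstdio>
#include <fstream>
#include <string>

// Hypothetical helper: writes `text` to `path` through a std::wofstream and
// returns the resulting file size in bytes (or -1 if it can't be reopened).
// The scratch file is deleted again before returning.
long wideStreamFileSize(const char* path, const std::wstring& text)
{
    {
        std::wofstream out(path,
            std::ios::out | std::ios::binary | std::ios::trunc);
        out << text;
    } // the stream is flushed and closed here

    // Reopen as a narrow binary stream, positioned at the end, to get the size.
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    if (!in)
        return -1;
    const long size = static_cast<long>(in.tellg());
    in.close();
    std::remove(path);
    return size;
}
```

With the default "C" locale, wideStreamFileSize("wide_demo.tmp", L"ABC") comes back as 3, not 3 * sizeof(wchar_t): each wide character was converted to a single narrow one on its way to the file.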

This was particularly problematic for me because I have a lot of functions that look like this:

void outputStuff( tostream& os )
{
    // output stuff to the stream
    os << ....
}

which would work fine (i.e. it streamed out wide characters) if you passed in a tstringstream object but gave weird results if you passed in a tofstream (because everything was getting narrowed).

Wide file I/O: the solution

Stepping through the STL in the debugger (what joy!) revealed that wofstream invokes a std::codecvt object to narrow the output data just before it is written out to the file. std::codecvt objects are responsible for converting strings from one character set to another and C++ requires that two be provided as standard: one that converts chars to chars (i.e. effectively does nothing) and one that converts wchar_ts to chars. This latter one was the one that was causing me so much grief.

The solution: write a new codecvt-derived class that converts wchar_ts to wchar_ts (i.e. do nothing) and attach it to the wofstream object. When the wofstream tried to convert the data it was writing out, it would invoke my new codecvt object that did nothing and the data would be written out unchanged.

A bit of poking around on Google Groups turned up some code written by P. J. Plauger (the author of the STL that ships with MSVC) but I had problems getting it to compile with STLport 4.5.3. This is the version I finally hacked together:

#include <locale>

// nb: MSVC6+Stlport can't handle "std::"
// appearing in the NullCodecvtBase typedef.
using std::codecvt ; 
typedef codecvt < wchar_t , char , mbstate_t > NullCodecvtBase ;

class NullCodecvt
    : public NullCodecvtBase
{

public:
    typedef wchar_t _E ;
    typedef char _To ;
    typedef mbstate_t _St ;

    explicit NullCodecvt( size_t _R=0 ) : NullCodecvtBase(_R) { }

protected:
    // nb: the parameter types must match the base class exactly, 
    // otherwise these functions won't override the virtuals.
    virtual result do_in( _St& _State ,
                   const _To* _F1 , const _To* _L1 , const _To*& _Mid1 ,
                   _E* _F2 , _E* _L2 , _E*& _Mid2
                   ) const
    {
        return noconv ;
    }
    virtual result do_out( _St& _State ,
                   const _E* _F1 , const _E* _L1 , const _E*& _Mid1 ,
                   _To* _F2 , _To* _L2 , _To*& _Mid2
                   ) const
    {
        return noconv ;
    }
    virtual result do_unshift( _St& _State , 
            _To* _F2 , _To* _L2 , _To*& _Mid2 ) const
    {
        return noconv ;
    }
    virtual int do_length( _St& _State , const _To* _F1 , 
           const _To* _L1 , size_t _N2 ) const _THROW0()
    {
        return (_N2 < (size_t)(_L1 - _F1)) ? _N2 : _L1 - _F1 ;
    }
    virtual bool do_always_noconv() const _THROW0()
    {
        return true ;
    }
    virtual int do_max_length() const _THROW0()
    {
        return 2 ;
    }
    virtual int do_encoding() const _THROW0()
    {
        return 2 ;
    }
} ;

You can see that the functions that are supposed to do the conversions actually do nothing and return noconv to indicate that.

The only thing left to do is instantiate one of these and connect it to the wofstream object. Using MSVC, you are supposed to use the (non-standard) _ADDFAC() macro to imbue objects with a locale, but it didn't want to work with my new NullCodecvt class so I ripped out the guts of the macro and wrote a new one that did:

#define IMBUE_NULL_CODECVT( outputFile ) \
{ \
    NullCodecvt* pNullCodecvt = new NullCodecvt ; \
    locale loc = locale::classic() ; \
    loc._Addfac( pNullCodecvt , NullCodecvt::id, NullCodecvt::_Getcat() ) ; \
    (outputFile).imbue( loc ) ; \
}

So, the example code given above that didn't work properly can now be written like this:

tofstream testFile ;
IMBUE_NULL_CODECVT( testFile ) ;
testFile.open( "test.txt" , ios::out | ios::binary ) ; 
testFile << _T("ABC") ;

It is important that the file stream object be imbued with the new codecvt object before it is opened. The file must also be opened in binary mode. If it isn't, every time the stream sees a wide character that has the value 10 in its high or low byte, it will perform CR/LF translation, which is definitely not what you want. If you really want a CR/LF sequence, you will have to insert it explicitly using "\r\n" instead of std::endl.
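The _Addfac() call is specific to the library that ships with MSVC. As a rough sketch of how the same attachment might look using only standard names, the locale constructor that takes a facet pointer can install a replacement codecvt. PassThroughCodecvt and makePassThroughLocale are hypothetical names invented for this example, not part of the original code.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>
#include <locale>

typedef std::codecvt<wchar_t, char, std::mbstate_t> WideCodecvt;

// Hypothetical stand-in for NullCodecvt, written against the standard
// codecvt interface only.
class PassThroughCodecvt : public WideCodecvt
{
public:
    explicit PassThroughCodecvt(std::size_t refs = 0) : WideCodecvt(refs) {}

protected:
    result do_out(std::mbstate_t&,
                  const wchar_t* from, const wchar_t*, const wchar_t*& fromNext,
                  char* to, char*, char*& toNext) const override
    {
        fromNext = from;   // nothing consumed
        toNext = to;       // nothing produced
        return noconv;
    }
    result do_in(std::mbstate_t&,
                 const char* from, const char*, const char*& fromNext,
                 wchar_t* to, wchar_t*, wchar_t*& toNext) const override
    {
        fromNext = from;
        toNext = to;
        return noconv;
    }
    bool do_always_noconv() const noexcept override { return true; }
};

// Builds a copy of the classic locale whose wchar_t/char codecvt facet is
// replaced by the pass-through one. The locale takes ownership of the facet.
inline std::locale makePassThroughLocale()
{
    return std::locale(std::locale::classic(), new PassThroughCodecvt);
}
```

It would be used as testFile.imbue( makePassThroughLocale() ), still before the file is opened. What a file stream actually writes once its facet reports noconv is up to the library; the behaviour described in this article is what I saw with MSVC, and it is not guaranteed to carry over to other standard library implementations, so check the output file if you try this elsewhere.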

wchar_t problems

wchar_t is the type that is used for wide characters and is defined like this:

typedef unsigned short wchar_t ;

Unfortunately, because it is a typedef instead of a real C++ type, defining it like this has one serious flaw: you can't overload on it. Look at the following code:

TCHAR ch = _T('A') ;
tcout << ch << endl ;

Using narrow strings, this does what you would expect: print out the letter A. Using wide strings, it prints out 65. The compiler decides that you are streaming out an unsigned short and prints it out as a numeric value instead of a wide character. Aaargh!!! There is no solution for this other than going through your entire code base, looking for instances where you stream out individual characters and fixing them. I wrote a little function to make it a little more obvious what was going on:

#ifdef _UNICODE
    // NOTE: Can't stream out wchar_t's - convert to a string first!
    inline std::wstring toStreamTchar( wchar_t ch ) 
            { return std::wstring(&ch,1) ; }
#else 
    // NOTE: It's safe to stream out narrow char's directly.
    inline char toStreamTchar( char ch ) { return ch ; }
#endif // _UNICODE    

TCHAR ch = _T('A') ;
tcout << toStreamTchar(ch) << endl ;
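To see both behaviours side by side without needing a compiler where wchar_t is a typedef, here is a small sketch in standard C++. The cast to unsigned short imitates what the typedef effectively does to overload resolution; the two function names are made up for this example.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Imitates the typedef case: the stream only sees an unsigned short,
// so it formats the value as a number.
std::wstring streamAsUnsignedShort(wchar_t ch)
{
    std::wostringstream os;
    os << static_cast<unsigned short>(ch);
    return os.str();
}

// The workaround: wrap the character in a one-character string first.
std::wstring streamAsString(wchar_t ch)
{
    std::wostringstream os;
    os << std::wstring(1, ch);
    return os.str();
}
```

streamAsUnsignedShort(L'A') yields L"65" while streamAsString(L'A') yields L"A", which is exactly the difference toStreamTchar() papers over.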

Wide exception classes

Most C++ programs will be using exceptions to handle error conditions. Unfortunately, std::exception is defined like this:

class std::exception
{
    // ...
    virtual const char *what() const throw() ;
} ;

and can only handle narrow error messages. I only ever throw exceptions that I have defined myself or std::runtime_error, so I wrote a wide version of std::runtime_error like this:

class wruntime_error
    : public std::runtime_error
{

public:                 // --- PUBLIC INTERFACE ---

// constructors:
                        wruntime_error( const std::wstring& errorMsg ) ;
// copy/assignment:
                        wruntime_error( const wruntime_error& rhs ) ;
    wruntime_error&     operator=( const wruntime_error& rhs ) ;
// destructor:
    virtual             ~wruntime_error() ;

// exception methods:
    const std::wstring& errorMsg() const ;

private:                // --- DATA MEMBERS ---

// data members:
    std::wstring        mErrorMsg ; ///< Exception error message.
    
} ;

#ifdef _UNICODE
    #define truntime_error wruntime_error
#else 
    #define truntime_error runtime_error
#endif // _UNICODE

/* -------------------------------------------------------------------- */

wruntime_error::wruntime_error( const wstring& errorMsg )
    : runtime_error( toNarrowString(errorMsg) )
    , mErrorMsg(errorMsg)
{
    // NOTE: We give the runtime_error base the narrow version of the 
    //  error message. This is what will get shown if what() is called.
    //  The wruntime_error inserter or errorMsg() should be used to get 
    //  the wide version.
}

/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -  */

wruntime_error::wruntime_error( const wruntime_error& rhs )
    : runtime_error( toNarrowString(rhs.errorMsg()) )
    , mErrorMsg(rhs.errorMsg())
{
}

/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -  */

wruntime_error&
wruntime_error::operator=( const wruntime_error& rhs )
{
    // copy the wruntime_error
    runtime_error::operator=( rhs ) ; 
    mErrorMsg = rhs.mErrorMsg ; 

    return *this ; 
}

/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -  */

wruntime_error::~wruntime_error()
{
}

/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -  */

const wstring& wruntime_error::errorMsg() const { return mErrorMsg ; }

(toNarrowString() is a little helper function that converts a wide string to a narrow string; it is given below.) wruntime_error simply keeps a copy of the wide error message itself and gives a narrow version to its std::runtime_error base in case somebody calls what(). Exception classes that I define myself, I modified to look like this:

class MyExceptionClass : public std::truntime_error
{
public:
    MyExceptionClass( const std::tstring& errorMsg ) : 
                            std::truntime_error(errorMsg) { } 
} ;

The final problem was that I had lots and lots of code that looked like this:

try
{
    // do something...
}
catch( exception& xcptn )
{
    tstringstream buf ;
    buf << _T("An error has occurred: ") << xcptn ; 
    AfxMessageBox( buf.str().c_str() ) ;
}

where I had defined an inserter for std::exception like this:

tostream&
operator<<( tostream& os , const exception& xcptn )
{
    // insert the exception
    // NOTE: toTstring() converts a string to a tstring - defined below
    os << toTstring( xcptn.what() ) ;

    return os ;
}

The problem is that my inserter called what() which only returns the narrow version of the error message. But if the error message contains foreign characters, I'd like to see them in the error dialog! So I rewrote the inserter to look like this:

tostream&
operator<<( tostream& os , const exception& xcptn )
{
    // insert the exception
    if ( const wruntime_error* p = 
            dynamic_cast<const wruntime_error*>(&xcptn) )
        os << p->errorMsg() ; 
    else 
        os << toTstring( xcptn.what() ) ;

    return os ;
}

Now it detects if it has been given a wide exception class and if so, streams out the wide error message. Otherwise it falls back to using the standard (narrow) error message. Even though I might exclusively use truntime_error-derived classes in my app, this latter case is still important since the STL or other third-party libraries might throw a std::exception-derived error.
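The whole pattern can be condensed into a portable sketch, shown here under some loud assumptions: WideError, asciiOnly and describe are hypothetical names, the narrowing is a crude ASCII-only substitute for toNarrowString(), and the fallback widening assumes what() returns plain ASCII.

```cpp
#include <cassert>
#include <exception>
#include <stdexcept>
#include <string>

// Crude stand-in for toNarrowString(): anything outside ASCII becomes '?'.
inline std::string asciiOnly(const std::wstring& w)
{
    std::string s;
    s.reserve(w.size());
    for (wchar_t ch : w)
        s += (static_cast<unsigned long>(ch) < 0x80) ? static_cast<char>(ch) : '?';
    return s;
}

// Same idea as wruntime_error: keep the wide message, hand a narrow
// version to the std::runtime_error base for what().
class WideError : public std::runtime_error
{
public:
    explicit WideError(const std::wstring& msg)
        : std::runtime_error(asciiOnly(msg)), wideMsg_(msg) {}
    const std::wstring& errorMsg() const { return wideMsg_; }
private:
    std::wstring wideMsg_;
};

// Same idea as the rewritten inserter: prefer the wide message if the
// exception carries one, otherwise widen what() (assumed to be ASCII).
inline std::wstring describe(const std::exception& e)
{
    if (const WideError* w = dynamic_cast<const WideError*>(&e))
        return w->errorMsg();
    const std::string narrow = e.what();
    return std::wstring(narrow.begin(), narrow.end());
}
```

Catching by std::exception& and calling describe() keeps the accented characters for a WideError and still produces something sensible for a plain std::runtime_error thrown by somebody else's code.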

Other miscellaneous problems

  • Q100639: If you are writing an MFC app using Unicode, you need to specify wWinMainCRTStartup as your entry point (in the Link page of your Project Options).
  • Many Windows functions accept a buffer to return their results in. The buffer size is usually specified in characters, not bytes. So while the following code will work fine when compiled using single-byte characters:
    // get our EXE name 
    TCHAR buf[ _MAX_PATH+1 ] ; 
    GetModuleFileName( NULL , buf , sizeof(buf) ) ;

    it is wrong for double-byte characters. The call to GetModuleFileName() needs to be written like this:

    GetModuleFileName( NULL , buf , sizeof(buf)/sizeof(TCHAR) ) ;
  • If you are processing a file byte-by-byte, you need to test for WEOF, not EOF.
  • HttpSendRequest() accepts a string that specifies additional headers to attach to an HTTP request before it is sent. ANSI builds accept a string length of -1 to mean that the header string is NULL-terminated. Unicode builds require the string length to be explicitly provided. Don't ask me why.
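For the buffer-size item in the list above, one way to avoid repeating the sizeof arithmetic at every call site is a small template that returns the element count of an array. This is only a sketch; elementCount is a name invented here.

```cpp
#include <cassert>
#include <cstddef>

// Returns the number of elements in a built-in array, whatever the
// element type is. Passing a pointer instead of an array fails to compile.
template <typename T, std::size_t N>
constexpr std::size_t elementCount(T (&)[N]) noexcept
{
    return N;
}
```

For a wchar_t buf[260], elementCount(buf) is 260 while sizeof(buf) is 260 * sizeof(wchar_t), so passing elementCount(buf) as the buffer length is right in both narrow and wide builds.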

Miscellaneous useful stuff

Finally, some little helper functions that you might find useful if you are doing this kind of work.

extern std::wstring toWideString( const char* pStr , int len=-1 ) ; 
inline std::wstring toWideString( const std::string& str )
{
    return toWideString(str.c_str(),str.length()) ;
}
inline std::wstring toWideString( const wchar_t* pStr , int len=-1 )
{
    return (len < 0) ? pStr : std::wstring(pStr,len) ;
}
inline std::wstring toWideString( const std::wstring& str )
{
    return str ;
}
extern std::string toNarrowString( const wchar_t* pStr , int len=-1 ) ; 
inline std::string toNarrowString( const std::wstring& str )
{
    return toNarrowString(str.c_str(),str.length()) ;
}
inline std::string toNarrowString( const char* pStr , int len=-1 )
{
    return (len < 0) ? pStr : std::string(pStr,len) ;
}
inline std::string toNarrowString( const std::string& str )
{
    return str ;
}

#ifdef _UNICODE
    inline TCHAR toTchar( char ch )
    {
        return (wchar_t)ch ;
    }
    inline TCHAR toTchar( wchar_t ch )
    {
        return ch ;
    }
    inline std::tstring toTstring( const std::string& s )
    {
        return toWideString(s) ;
    }
    inline std::tstring toTstring( const char* p , int len=-1 )
    {
        return toWideString(p,len) ;
    }
    inline std::tstring toTstring( const std::wstring& s )
    {
        return s ;
    }
    inline std::tstring toTstring( const wchar_t* p , int len=-1 )
    {
        return (len < 0) ? p : std::wstring(p,len) ;
    }
#else 
    inline TCHAR toTchar( char ch )
    {
        return ch ;
    }
    inline TCHAR toTchar( wchar_t ch )
    {
        return (ch >= 0 && ch <= 0xFF) ? (char)ch : '?' ;
    } 
    inline std::tstring toTstring( const std::string& s )
    {
        return s ;
    }
    inline std::tstring toTstring( const char* p , int len=-1 )
    {
        return (len < 0) ? p : std::string(p,len) ;
    }
    inline std::tstring toTstring( const std::wstring& s )
    {
        return toNarrowString(s) ;
    }
    inline std::tstring toTstring( const wchar_t* p , int len=-1 )
    {
        return toNarrowString(p,len) ;
    }
#endif // _UNICODE

/* -------------------------------------------------------------------- */

wstring 
toWideString( const char* pStr , int len )
{
    ASSERT_PTR( pStr ) ; 
    ASSERT( len >= 0 || len == -1 , _T("Invalid string length: ") << len ) ; 

    // always work with an explicit length: passing -1 to the API makes it 
    // count (and write) the terminating NUL as well, which would not fit 
    // in the buffer we allocate below (nb: strlen() needs <string.h>)
    if ( len == -1 )
        len = static_cast<int>( strlen(pStr) ) ; 
    if ( len == 0 )
        return L"" ;

    // figure out how many wide characters we are going to get 
    int nChars = MultiByteToWideChar( CP_ACP , 0 , pStr , len , NULL , 0 ) ; 
    if ( nChars <= 0 )
        return L"" ;

    // convert the narrow string to a wide string 
    // nb: slightly naughty to write directly into the string like this
    wstring buf ;
    buf.resize( nChars ) ; 
    MultiByteToWideChar( CP_ACP , 0 , pStr , len , &buf[0] , nChars ) ; 

    return buf ;
}

/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -  */

string 
toNarrowString( const wchar_t* pStr , int len )
{
    ASSERT_PTR( pStr ) ; 
    ASSERT( len >= 0 || len == -1 , _T("Invalid string length: ") << len ) ; 

    // always work with an explicit length: passing -1 to the API makes it 
    // count (and write) the terminating NUL as well, which would not fit 
    // in the buffer we allocate below (nb: wcslen() needs <wchar.h>)
    if ( len == -1 )
        len = static_cast<int>( wcslen(pStr) ) ; 
    if ( len == 0 )
        return "" ;

    // figure out how many narrow characters we are going to get 
    int nChars = WideCharToMultiByte( CP_ACP , 0 , 
             pStr , len , NULL , 0 , NULL , NULL ) ; 
    if ( nChars <= 0 )
        return "" ;

    // convert the wide string to a narrow string
    // nb: slightly naughty to write directly into the string like this
    string buf ;
    buf.resize( nChars ) ;
    WideCharToMultiByte( CP_ACP , 0 , pStr , len , 
          &buf[0] , nChars , NULL , NULL ) ; 

    return buf ; 
}
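If writing straight into the string's storage (the "slightly naughty" bit above) makes you nervous, the conversion can go through a scratch std::vector first. The sketch below shows the idea portably by swapping in the C runtime's std::mbstowcs() for MultiByteToWideChar(), so it follows the current C locale rather than CP_ACP; widenViaVector is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Converts a narrow string to a wide one via a temporary vector buffer.
// Uses std::mbstowcs(), so the result depends on the current C locale and
// conversion stops at the first embedded NUL. Returns an empty string if
// the input is not valid in that locale.
std::wstring widenViaVector(const std::string& narrow)
{
    // Worst case is one wide character per input byte, plus the terminator.
    std::vector<wchar_t> buf(narrow.size() + 1);

    const std::size_t n =
        std::mbstowcs(buf.data(), narrow.c_str(), buf.size());
    if (n == static_cast<std::size_t>(-1))
        return std::wstring();

    return std::wstring(buf.data(), n);
}
```

The extra copy costs a little, but the string object is only ever constructed from data that has already been converted.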

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Taka Muraoka
Awasu
Australia

Comments and Discussions

 
Re: Is there an app/script to add _T() macro? (Roger Bamforth, 28-Jul-11 7:04)
I agree entirely that the process can't be automated but I have a 300,000 line project to convert so I need all the help I can get! I am running ToUnicode.exe on all the files in each sub-project and then diffing each file to see what it actually did.
 
For example, we have a lot of low-level code talking to electronics via RS232 etc. so I have to catch all the places where people have used char when they meant BYTE, that char has been changed to TCHAR and a binary buffer has suddenly become twice the size it should be.
 
Worse is where people have tried to think about a possible future Unicode version and have helpfully used TCHAR, but they really meant char (or BYTE) so the code breaks in the Unicode version and there's no difference in the two files to give me a clue.
 
Of course it's not just adding _T(), there's changing all the string handling functions (e.g. strcpy() to _tcscpy() etc). ToUnicode.exe does that as well. Plus it outputs several useful warnings: if you hook it up as an external tool in Visual Studio you can get the warnings in the output window and then step through them using F8, just like compiler errors. I've found that very helpful.
Regards
 
- Roger

Article Copyright 2003 by Taka Muraoka