
Upgrading an STL-based application to use Unicode.

9 Jul 2013, CPOL, 7 min read
Problems that developers will face when upgrading an STL-based application to use Unicode and how to solve them.

Introduction

I recently upgraded a reasonably large program to use Unicode instead of single-byte characters. Apart from a few legacy modules, I had dutifully used the t- functions and wrapped all my string literals and character constants in _T() macros, safe in the knowledge that when it came time to switch to Unicode, all I had to do was define UNICODE and _UNICODE and everything would Just Work (tm).

Man, was I ever wrong :((

So, I write this article as therapy for the past two weeks of work and in the hope that it will maybe save others some of the pain and misery I have endured. Sigh...

The basics

In theory, writing code that can be compiled using single- or double-byte characters is straightforward. I was going to write a section on the basics but Chris Maunder has already done it. The techniques he describes are widely known so we'll just get right on to the meat of this article.

Wide file I/O

There are wide versions of the usual stream classes and it is easy to define t-style macros to manage them:

#ifdef _UNICODE
    #define tofstream wofstream 
    #define tstringstream wstringstream
    // etc...
#else 
    #define tofstream ofstream 
    #define tstringstream stringstream
    // etc...
#endif // _UNICODE

And you would use them like this:

tofstream testFile( "test.txt" ) ; 
testFile << _T("ABC") ;

Now, you would expect the above code to produce a 3-byte file when compiled using single-byte characters and a 6-byte file when using double-byte. Except you don't. You get a 3-byte file for both. WTH is going on?!

It turns out that the C++ standard requires wide file streams to convert wide characters to narrow ones when writing to a file. So in the example above, the wide string L"ABC" (which is 6 bytes long when wchar_t is 2 bytes) gets converted to a narrow string (3 bytes) before it is written to the file. And if that wasn't bad enough, how this conversion is done is implementation-dependent.

I haven't been able to find a definitive explanation of why things were specified like this. My best guess is that a file, by definition, is considered to be a stream of (single-byte) characters and allowing stuff to be written 2 bytes at a time would break that abstraction. Right or wrong, this causes serious problems. For example, you can't write binary data to a wofstream because the class will try to narrow it first (usually failing miserably) before writing it out.
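
To see the narrowing for yourself, here is a minimal sketch in standard C++. The helper name bytesWrittenByWideStream and the file name are my own choices for illustration, and the check assumes the default "C" locale, where plain ASCII characters narrow to one byte each.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>

// Writes text through a std::wofstream, then reads the file back as raw
// bytes and returns how many bytes were actually stored.
std::size_t bytesWrittenByWideStream(const std::wstring& text, const char* path)
{
    assert(path != nullptr);
    {
        std::wofstream out(path, std::ios::out | std::ios::binary);
        out << text;
    }   // leaving the scope closes (and flushes) the stream

    std::ifstream in(path, std::ios::in | std::ios::binary);
    const std::string bytes((std::istreambuf_iterator<char>(in)),
                            std::istreambuf_iterator<char>());
    return bytes.size();
}

// Expected behaviour, written as an assertion.
void checkWideStreamNarrows()
{
    // Three wide characters go in, but only three bytes reach the file.
    assert(bytesWrittenByWideStream(L"ABC", "wide_stream_demo.txt") == 3);
}
```

Calling checkWideStreamNarrows() passes regardless of how large wchar_t is, which is exactly the surprise described above.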

This was particularly problematic for me because I have a lot of functions that look like this:

void outputStuff( tostream& os )
{
    // output stuff to the stream
    os << ....
}

which would work fine (i.e. it streamed out wide characters) if you passed in a tstringstream object but gave weird results if you passed in a tofstream (because everything was getting narrowed).

Wide file I/O: the solution

Stepping through the STL in the debugger (what joy!) revealed that wofstream invokes a std::codecvt object to narrow the output data just before it is written out to the file. std::codecvt objects are responsible for converting strings from one character set to another and C++ requires that two be provided as standard: one that converts chars to chars (i.e. effectively does nothing) and one that converts wchar_ts to chars. This latter one was the one that was causing me so much grief.

The solution: write a new codecvt-derived class that converts wchar_ts to wchar_ts (i.e. do nothing) and attach it to the wofstream object. When the wofstream tried to convert the data it was writing out, it would invoke my new codecvt object that did nothing and the data would be written out unchanged.

A bit of poking around on Google Groups turned up some code written by P. J. Plauger (the author of the STL that ships with MSVC) but I had problems getting it to compile with STLport 4.5.3. This is the version I finally hacked together:

#include <locale>

// nb: MSVC6+Stlport can't handle "std::"
// appearing in the NullCodecvtBase typedef.
using std::codecvt ; 
typedef codecvt < wchar_t , char , mbstate_t > NullCodecvtBase ;

class NullCodecvt
    : public NullCodecvtBase
{

public:
    typedef wchar_t _E ;
    typedef char _To ;
    typedef mbstate_t _St ;

    explicit NullCodecvt( size_t _R=0 ) : NullCodecvtBase(_R) { }

protected:
    virtual result do_in( _St& _State ,
                   const _To* _F1 , const _To* _L1 , const _To*& _Mid1 ,
                   _E* _F2 , _E* _L2 , _E*& _Mid2
                   ) const
    {
        // nothing is consumed or produced
        _Mid1 = _F1 ; _Mid2 = _F2 ;
        return noconv ;
    }
    virtual result do_out( _St& _State ,
                   const _E* _F1 , const _E* _L1 , const _E*& _Mid1 ,
                   _To* _F2 , _To* _L2 , _To*& _Mid2
                   ) const
    {
        // nothing is consumed or produced
        _Mid1 = _F1 ; _Mid2 = _F2 ;
        return noconv ;
    }
    virtual result do_unshift( _St& _State , 
            _To* _F2 , _To* _L2 , _To*& _Mid2 ) const
    {
        _Mid2 = _F2 ;
        return noconv ;
    }
    virtual int do_length( _St& _State , const _To* _F1 , 
           const _To* _L1 , size_t _N2 ) const _THROW0()
    {
        return (_N2 < (size_t)(_L1 - _F1)) ? _N2 : _L1 - _F1 ;
    }
    virtual bool do_always_noconv() const _THROW0()
    {
        return true ;
    }
    virtual int do_max_length() const _THROW0()
    {
        return 2 ;
    }
    virtual int do_encoding() const _THROW0()
    {
        return 2 ;
    }
} ;

You can see that the functions that are supposed to do the conversions actually do nothing and return noconv to indicate that.

The only thing left to do is instantiate one of these and connect it to the wofstream object. Using MSVC, you are supposed to use the (non-standard) _ADDFAC() macro to imbue objects with a locale, but it didn't want to work with my new NullCodecvt class so I ripped out the guts of the macro and wrote a new one that did:

#define IMBUE_NULL_CODECVT( outputFile ) \
{ \
    NullCodecvt* pNullCodecvt = new NullCodecvt ; \
    locale loc = locale::classic() ; \
    loc._Addfac( pNullCodecvt , NullCodecvt::id, NullCodecvt::_Getcat() ) ; \
    (outputFile).imbue( loc ) ; \
}

So, the example code given above that didn't work properly can now be written like this:

tofstream testFile ;
IMBUE_NULL_CODECVT( testFile ) ;
testFile.open( "test.txt" , ios::out | ios::binary ) ; 
testFile << _T("ABC") ;

It is important that the file stream object be imbued with the new codecvt object before it is opened. The file must also be opened in binary mode. If it isn't, every time the file sees a wide character that has the value 10 in its high or low byte, it will perform CR/LF translation which is definitely not what you want. If you really want a CR/LF sequence, you will have to insert it explicitly using "\r\n" instead of std::endl.
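
For what it's worth, the same idea can be sketched without _Addfac(), using the standard std::locale constructor that takes a facet pointer. Be aware that this is a variation, not the code above: how a file buffer reacts to noconv on a wide stream can differ between standard library implementations, so this hypothetical RawWideCodecvt facet does the copying itself in do_out() and returns ok. The class and helper names are made up for this example, and the bytes are written in native byte order at sizeof(wchar_t) bytes per character (2 on Windows, commonly 4 elsewhere).

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <cwchar>
#include <fstream>
#include <iterator>
#include <locale>
#include <string>

// Hypothetical facet (not from the article): writes each wchar_t to the
// external buffer as its raw bytes, in native byte order.
class RawWideCodecvt : public std::codecvt<wchar_t, char, std::mbstate_t>
{
public:
    explicit RawWideCodecvt(std::size_t refs = 0)
        : std::codecvt<wchar_t, char, std::mbstate_t>(refs) {}

protected:
    result do_out(std::mbstate_t&,
                  const wchar_t* from, const wchar_t* fromEnd, const wchar_t*& fromNext,
                  char* to, char* toEnd, char*& toNext) const override
    {
        fromNext = from;
        toNext = to;
        while (fromNext != fromEnd) {
            if (static_cast<std::size_t>(toEnd - toNext) < sizeof(wchar_t))
                return partial;   // out of room; the stream will call again
            std::memcpy(toNext, fromNext, sizeof(wchar_t));
            toNext += sizeof(wchar_t);
            ++fromNext;
        }
        return ok;
    }
    bool do_always_noconv() const noexcept override { return false; }
    int do_encoding() const noexcept override { return static_cast<int>(sizeof(wchar_t)); }
    int do_max_length() const noexcept override { return static_cast<int>(sizeof(wchar_t)); }
};

// Writes text through a wofstream imbued with RawWideCodecvt and returns
// the number of bytes that ended up in the file.
std::size_t bytesWrittenRaw(const std::wstring& text, const char* path)
{
    assert(path != nullptr);
    {
        std::wofstream out;
        // Imbue before opening; the locale takes ownership of the facet.
        out.imbue(std::locale(std::locale::classic(), new RawWideCodecvt));
        out.open(path, std::ios::out | std::ios::binary);
        out << text;
    }
    std::ifstream in(path, std::ios::in | std::ios::binary);
    const std::string bytes((std::istreambuf_iterator<char>(in)),
                            std::istreambuf_iterator<char>());
    return bytes.size();
}

// Expected behaviour, written as an assertion.
void checkRawWideOutput()
{
    assert(bytesWrittenRaw(L"ABC", "raw_wide_demo.bin") == 3 * sizeof(wchar_t));
}
```

Only the output direction is handled here; reading such a file back through a wifstream would need a matching do_in(), which this sketch leaves out.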

wchar_t problems

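wchar_t is the type that is used for wide characters. In MSVC 6 (the compiler I was using), it is defined like this:

typedef unsigned short wchar_t ;

Unfortunately, because it is a typedef instead of a real C++ type, defining it like this has one serious flaw: you can't overload on it, since the compiler cannot tell a wchar_t apart from an unsigned short. (Standard C++ makes wchar_t a distinct built-in type, so compilers that follow the standard here don't have this problem.) Look at the following code: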

TCHAR ch = _T('A') ;
tcout << ch << endl ;

Using narrow strings, this does what you would expect: it prints out the letter A. Using wide strings, it prints out 65. The compiler decides that you are streaming out an unsigned short and prints it out as a numeric value instead of a wide character. Aaargh!!! There is no solution for this other than going through your entire code base, looking for instances where you stream out individual characters and fixing them. I wrote a little function to make it a little more obvious what was going on:

#ifdef _UNICODE
    // NOTE: Can't stream out wchar_t's - convert to a string first!
    inline std::wstring toStreamTchar( wchar_t ch ) 
            { return std::wstring(&ch,1) ; }
#else 
    // NOTE: It's safe to stream out narrow char's directly.
    inline char toStreamTchar( char ch ) { return ch ; }
#endif // _UNICODE    

TCHAR ch = _T('A') ;
tcout << toStreamTchar(ch) << endl ;
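
The effect is easy to reproduce on any compiler by streaming an unsigned short into a wide stream. In this sketch, LegacyWideChar is a hypothetical stand-in for the typedef'd wchar_t and the helper names are made up for the example. On compilers where wchar_t is a distinct built-in type, streaming a real wchar_t prints the character as expected.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Stand-in for a compiler where wchar_t is only a typedef (hypothetical name).
typedef unsigned short LegacyWideChar;

// Streams a single value into a wide string stream and returns the text.
template <typename Ch>
std::wstring streamed(Ch ch)
{
    std::wostringstream os;
    os << ch;
    return os.str();
}

// The workaround: wrap the character in a one-character wide string first.
std::wstring streamedAsString(LegacyWideChar ch)
{
    std::wostringstream os;
    os << std::wstring(1, static_cast<wchar_t>(ch));
    return os.str();
}

// Expected behaviour, written as assertions.
void checkCharStreaming()
{
    const LegacyWideChar legacy = 65;           // the code for 'A'
    assert(streamed(legacy) == L"65");          // treated as a number
    assert(streamed(L'A') == L"A");             // native wchar_t: a character
    assert(streamedAsString(legacy) == L"A");   // the workaround
}
```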

Wide exception classes

Most C++ programs will be using exceptions to handle error conditions. Unfortunately, std::exception is defined like this:

class std::exception
{
    // ...
    virtual const char *what() const throw() ;
} ;

and can only handle narrow error messages. I only ever throw exceptions that I have defined myself or std::runtime_error, so I wrote a wide version of std::runtime_error like this:

class wruntime_error
    : public std::runtime_error
{

public:                 // --- PUBLIC INTERFACE ---

// constructors:
                        wruntime_error( const std::wstring& errorMsg ) ;
// copy/assignment:
                        wruntime_error( const wruntime_error& rhs ) ;
    wruntime_error&     operator=( const wruntime_error& rhs ) ;
// destructor:
    virtual             ~wruntime_error() ;

// exception methods:
    const std::wstring& errorMsg() const ;

private:                // --- DATA MEMBERS ---

// data members:
    std::wstring        mErrorMsg ; ///< Exception error message.
    
} ;

#ifdef _UNICODE
    #define truntime_error wruntime_error
#else 
    #define truntime_error runtime_error
#endif // _UNICODE

/* -------------------------------------------------------------------- */

wruntime_error::wruntime_error( const wstring& errorMsg )
    : runtime_error( toNarrowString(errorMsg) )
    , mErrorMsg(errorMsg)
{
    // NOTE: We give the runtime_error base the narrow version of the 
    //  error message. This is what will get shown if what() is called.
    //  The wruntime_error inserter or errorMsg() should be used to get 
    //  the wide version.
}

/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -  */

wruntime_error::wruntime_error( const wruntime_error& rhs )
    : runtime_error( toNarrowString(rhs.errorMsg()) )
    , mErrorMsg(rhs.errorMsg())
{
}

/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -  */

wruntime_error&
wruntime_error::operator=( const wruntime_error& rhs )
{
    // copy the wruntime_error
    runtime_error::operator=( rhs ) ; 
    mErrorMsg = rhs.mErrorMsg ; 

    return *this ; 
}

/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -  */

wruntime_error::~wruntime_error()
{
}

/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -  */

const wstring& wruntime_error::errorMsg() const { return mErrorMsg ; }

(toNarrowString() is a little helper function that converts a wide string to a narrow string and is given below). wruntime_error simply keeps a copy of the wide error message itself and gives a narrow version to the base std::exception in case somebody calls what(). Exception classes that I define myself, I modified to look like this:

class MyExceptionClass : public std::truntime_error
{
public:
    MyExceptionClass( const std::tstring& errorMsg ) : 
                            std::truntime_error(errorMsg) { } 
} ;

The final problem was that I had lots and lots of code that looked like this:

try
{
    // do something...
}
catch( exception& xcptn )
{
    tstringstream buf ;
    buf << _T("An error has occurred: ") << xcptn ; 
    AfxMessageBox( buf.str().c_str() ) ;
}

where I had defined an inserter for std::exception like this:

tostream&
operator<<( tostream& os , const exception& xcptn )
{
    // insert the exception
    // NOTE: toTstring() converts a string to a tstring - defined below
    os << toTstring( xcptn.what() ) ;

    return os ;
}

The problem is that my inserter called what() which only returns the narrow version of the error message. But if the error message contains foreign characters, I'd like to see them in the error dialog! So I rewrote the inserter to look like this:

tostream&
operator<<( tostream& os , const exception& xcptn )
{
    // insert the exception
    if ( const wruntime_error* p = 
            dynamic_cast<const wruntime_error*>(&xcptn) )
        os << p->errorMsg() ; 
    else 
        os << toTstring( xcptn.what() ) ;

    return os ;
}

Now it detects if it has been given a wide exception class and if so, streams out the wide error message. Otherwise it falls back to using the standard (narrow) error message. Even though I might exclusively use truntime_error-derived classes in my app, this latter case is still important since the STL or other third-party libraries might throw a std::exception-derived error.
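
Stripped of the copy/assignment boilerplate, the whole pattern fits in a few lines. The following is a minimal sketch, not the exact classes shown above: WideError, narrowLossy() and describe() are hypothetical names, and narrowLossy() is a crude stand-in for toNarrowString() that simply replaces anything outside ASCII with '?'.

```cpp
#include <cassert>
#include <exception>
#include <sstream>
#include <stdexcept>
#include <string>

// Crude stand-in for toNarrowString(): keeps ASCII, replaces the rest with '?'.
inline std::string narrowLossy(const std::wstring& wide)
{
    std::string narrow;
    for (wchar_t c : wide)
        narrow += (static_cast<unsigned long>(c) < 0x80) ? static_cast<char>(c) : '?';
    return narrow;
}

// Keeps the wide message and hands a narrow copy to the base class for what().
class WideError : public std::runtime_error
{
public:
    explicit WideError(const std::wstring& msg)
        : std::runtime_error(narrowLossy(msg)), wideMsg_(msg) {}
    const std::wstring& errorMsg() const { return wideMsg_; }
private:
    std::wstring wideMsg_;
};

// Returns the wide message if there is one, else the widened what() text.
std::wstring describe(const std::exception& e)
{
    std::wostringstream os;
    if (const WideError* wide = dynamic_cast<const WideError*>(&e))
        os << wide->errorMsg();
    else
        os << e.what();   // a wide stream widens each narrow char
    return os.str();
}

// Expected behaviour, written as assertions.
void checkWideError()
{
    const WideError wide(L"caf\u00e9 closed");
    assert(describe(wide) == L"caf\u00e9 closed");        // wide text survives
    assert(std::string(wide.what()) == "caf? closed");     // what() is lossy
    assert(describe(std::runtime_error("plain")) == L"plain");
}
```

Since both the base class and the std::wstring member are copyable, the compiler-generated copy constructor, assignment operator and destructor should be enough for a class like this.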

Other miscellaneous problems

  • Q100639: If you are writing an MFC app using Unicode, you need to specify wWinMainCRTStartup as your entry point (in the Link page of your Project Options).
  • Many Windows functions accept a buffer to return their results in. The buffer size is usually specified in characters, not bytes. So while the following code will work fine when compiled using single-byte characters:
    // get our EXE name 
    TCHAR buf[ _MAX_PATH+1 ] ; 
    GetModuleFileName( NULL , buf , sizeof(buf) ) ;

    it is wrong for double-byte characters. The call to GetModuleFileName() needs to be written like this:

    GetModuleFileName( NULL , buf , sizeof(buf)/sizeof(TCHAR) ) ;
  • If you are reading a file character-by-character using the wide functions (e.g. fgetwc()), you need to test for WEOF, not EOF.
  • HttpSendRequest() accepts a string that specifies additional headers to attach to an HTTP request before it is sent. ANSI builds accept a string length of -1 to mean that the header string is null-terminated. Unicode builds require the string length to be explicitly provided. Don't ask me why.
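
One way to make the buffer-size-in-characters rule hard to get wrong is to let the compiler work out the element count from the array type. A minimal sketch, assuming a C++11 compiler; elementCount is a name made up for this example:

```cpp
#include <cassert>
#include <cstddef>

// Number of elements in a fixed-size array. Passing a pointer instead of
// an array does not compile, which catches a common mistake.
template <typename T, std::size_t N>
constexpr std::size_t elementCount(T (&)[N]) noexcept
{
    return N;
}

// Expected behaviour, written as assertions.
void checkElementCount()
{
    wchar_t wideBuf[261];
    char narrowBuf[261];
    assert(sizeof(wideBuf) > 261);              // size in bytes: too big
    assert(elementCount(wideBuf) == 261);       // size in characters
    assert(elementCount(narrowBuf) == 261);
    assert(sizeof(wideBuf) / sizeof(wideBuf[0]) == elementCount(wideBuf));
}
```

With a helper like this, the call above could be written as GetModuleFileName( NULL , buf , elementCount(buf) ), casting the result to DWORD if your compiler warns about the conversion.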

Miscellaneous useful stuff

Finally, some little helper functions that you might find useful if you are doing this kind of work.

extern std::wstring toWideString( const char* pStr , int len=-1 ) ; 
inline std::wstring toWideString( const std::string& str )
{
    return toWideString(str.c_str(),str.length()) ;
}
inline std::wstring toWideString( const wchar_t* pStr , int len=-1 )
{
    return (len < 0) ? pStr : std::wstring(pStr,len) ;
}
inline std::wstring toWideString( const std::wstring& str )
{
    return str ;
}
extern std::string toNarrowString( const wchar_t* pStr , int len=-1 ) ; 
inline std::string toNarrowString( const std::wstring& str )
{
    return toNarrowString(str.c_str(),str.length()) ;
}
inline std::string toNarrowString( const char* pStr , int len=-1 )
{
    return (len < 0) ? pStr : std::string(pStr,len) ;
}
inline std::string toNarrowString( const std::string& str )
{
    return str ;
}

#ifdef _UNICODE
    inline TCHAR toTchar( char ch )
    {
        return (wchar_t)ch ;
    }
    inline TCHAR toTchar( wchar_t ch )
    {
        return ch ;
    }
    inline std::tstring toTstring( const std::string& s )
    {
        return toWideString(s) ;
    }
    inline std::tstring toTstring( const char* p , int len=-1 )
    {
        return toWideString(p,len) ;
    }
    inline std::tstring toTstring( const std::wstring& s )
    {
        return s ;
    }
    inline std::tstring toTstring( const wchar_t* p , int len=-1 )
    {
        return (len < 0) ? p : std::wstring(p,len) ;
    }
#else 
    inline TCHAR toTchar( char ch )
    {
        return ch ;
    }
    inline TCHAR toTchar( wchar_t ch )
    {
        return (ch >= 0 && ch <= 0xFF) ? (char)ch : '?' ;
    } 
    inline std::tstring toTstring( const std::string& s )
    {
        return s ;
    }
    inline std::tstring toTstring( const char* p , int len=-1 )
    {
        return (len < 0) ? p : std::string(p,len) ;
    }
    inline std::tstring toTstring( const std::wstring& s )
    {
        return toNarrowString(s) ;
    }
    inline std::tstring toTstring( const wchar_t* p , int len=-1 )
    {
        return toNarrowString(p,len) ;
    }
#endif // _UNICODE

/* -------------------------------------------------------------------- */

wstring 
toWideString( const char* pStr , int len )
{
    ASSERT_PTR( pStr ) ; 
    ASSERT( len >= 0 || len == -1 , _T("Invalid string length: ") << len ) ; 

    // work with an explicit length so the terminating null is never counted
    // nb: strlen() needs <cstring>
    if ( len == -1 )
        len = static_cast<int>( strlen(pStr) ) ; 
    if ( len == 0 )
        return L"" ;

    // figure out how many wide characters we are going to get 
    int nChars = MultiByteToWideChar( CP_ACP , 0 , pStr , len , NULL , 0 ) ; 
    if ( nChars <= 0 )
        return L"" ; // the conversion failed

    // convert the narrow string to a wide string 
    // nb: slightly naughty to write directly into the string like this
    wstring buf ;
    buf.resize( nChars ) ; 
    MultiByteToWideChar( CP_ACP , 0 , pStr , len , &buf[0] , nChars ) ; 

    return buf ;
}

/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -  */

string 
toNarrowString( const wchar_t* pStr , int len )
{
    ASSERT_PTR( pStr ) ; 
    ASSERT( len >= 0 || len == -1 , _T("Invalid string length: ") << len ) ; 

    // work with an explicit length so the terminating null is never counted
    // nb: wcslen() needs <cwchar>
    if ( len == -1 )
        len = static_cast<int>( wcslen(pStr) ) ; 
    if ( len == 0 )
        return "" ;

    // figure out how many narrow characters we are going to get 
    int nChars = WideCharToMultiByte( CP_ACP , 0 , 
             pStr , len , NULL , 0 , NULL , NULL ) ; 
    if ( nChars <= 0 )
        return "" ; // the conversion failed

    // convert the wide string to a narrow string
    // nb: slightly naughty to write directly into the string like this
    string buf ;
    buf.resize( nChars ) ;
    WideCharToMultiByte( CP_ACP , 0 , pStr , len , 
          &buf[0] , nChars , NULL , NULL ) ; 

    return buf ; 
}
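
If you need the same kind of helpers away from the Win32 API, something similar can be sketched with the standard C conversion functions. This is only an illustration under some assumptions: the names widenWithCLocale() and narrowWithCLocale() are hypothetical, the result depends on the current C locale (with the default "C" locale only plain ASCII can be relied on), and conversion stops at the first embedded null character.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>
#include <string>
#include <vector>

// Narrow -> wide using the current C locale. Returns an empty string if the
// input is empty or contains an invalid multibyte sequence.
std::wstring widenWithCLocale(const std::string& narrow)
{
    std::mbstate_t state = std::mbstate_t();
    const char* src = narrow.c_str();
    // With a null destination, mbsrtowcs() only reports the required length.
    const std::size_t n = std::mbsrtowcs(nullptr, &src, 0, &state);
    if (n == static_cast<std::size_t>(-1) || n == 0)
        return std::wstring();

    std::vector<wchar_t> buf(n + 1);            // room for the terminator
    state = std::mbstate_t();
    src = narrow.c_str();
    std::mbsrtowcs(buf.data(), &src, buf.size(), &state);
    return std::wstring(buf.data(), n);
}

// Wide -> narrow using the current C locale. Returns an empty string if the
// input is empty or contains a character the locale cannot represent.
std::string narrowWithCLocale(const std::wstring& wide)
{
    std::mbstate_t state = std::mbstate_t();
    const wchar_t* src = wide.c_str();
    const std::size_t n = std::wcsrtombs(nullptr, &src, 0, &state);
    if (n == static_cast<std::size_t>(-1) || n == 0)
        return std::string();

    std::vector<char> buf(n + 1);               // room for the terminator
    state = std::mbstate_t();
    src = wide.c_str();
    std::wcsrtombs(buf.data(), &src, buf.size(), &state);
    return std::string(buf.data(), n);
}

// Expected behaviour for plain ASCII, written as assertions.
void checkCLocaleConversions()
{
    assert(widenWithCLocale("ABC") == L"ABC");
    assert(narrowWithCLocale(L"ABC") == "ABC");
    assert(widenWithCLocale("").empty());
    assert(narrowWithCLocale(L"").empty());
}
```

Unlike the Win32 versions, which always use the ANSI code page (CP_ACP), these follow whatever locale was selected with setlocale(), so the two sets of helpers are not interchangeable for non-ASCII text.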

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Awasu
Australia