Conversion to Unicode (C++, Microsoft, UTF-16, Native Windows)

Question

SEEKING ADVICE REGARDING CONVERSION TO UNICODE

WHY I AM ADOPTING UNICODE SO LATE

[Feel free to skip to "The Situation", below, but if you do, please don't flame me for "waiting so long".]

I'm a former professional software engineer and programmer.

I was "away" from programming for many years due to medical considerations.

I "left" just as the shift to unicode became mainstream.

Lately, I've begun to tinker with programming.

Since my "return to programming", I've continued to work only with MBCS builds.

[Due to my condition, any conversion to Unicode will be a lot bigger deal than were I healthy.]

After my "return" (such as it is), I read about unicode.

It seemed clear to me that UTF-8 was the "way to go".

I assumed that UTF-8 had been adopted, pretty much universally.

I was very surprised (and disappointed) to learn that Microsoft had "gone with" UTF-16.

MY CODING SITUATION

My issues are with conversion to unicode in the following environment:

Language: C++
Target Platform: Native Windows
Development Tool: Visual Studio 2013

I have many lines of C++ source code (hundreds of thousands, almost certainly) in both libraries and applications.

Additionally, I make use of third-party libraries (for which I also have the source code). Some of these are relatively current, as I have done some work (MBCS) since my "return to programming".

THE MAIN PROBLEM WITH UNICODE BUILDS

A) Every interaction with the Windows API requires UTF-16 strings.

B) However, the third-party libraries I use accept char*.

For every single "string type" variable, I must decide whether it should be "char" or "TCHAR" (or string or wstring).

For every string literal, I must decide whether it needs to be A) macro wrapped, B) prefixed, C) neither. [Basically, native char*, or UTF-16.]

This is the case not only for new code, but for existing code (existing code probably contains over 100,000 instances of such variables and literals).
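As a rough illustration of the three choices for a literal, here is a minimal sketch. `MY_UNICODE`, `my_tchar` and `MY_T` are hypothetical stand-ins for `_UNICODE`, `TCHAR` and `_T()` from `<tchar.h>`, used only so the sketch compiles without Windows headers; the variable names are also made up.

```cpp
#include <cstring>
#include <cwchar>

// Hypothetical stand-ins for _UNICODE, TCHAR and _T() (assumption: they mimic
// the mechanism of <tchar.h>, they are not the real macros).
#define MY_UNICODE 1

#if MY_UNICODE
typedef wchar_t my_tchar;
#define MY_T(s) L##s          // pastes the L prefix onto the literal
#else
typedef char my_tchar;
#define MY_T(s) s
#endif

// (C) neither: stays char in every build, e.g. for a char*-based library.
const char* const kNarrow = "config.ini";

// (B) prefixed: always wide, e.g. for a wide-character Windows function.
const wchar_t* const kWide = L"config.ini";

// (A) macro wrapped: wide or narrow depending on the build setting.
const my_tchar* const kBuildDependent = MY_T("config.ini");
```

With `MY_UNICODE` set to 1, the macro-wrapped literal has the same type and contents as the prefixed one; set to 0, it matches the plain one.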

[Ignore for now, that I want to be able to do both MBCS and unicode builds. That's not the issue. Pretend I'll be doing only unicode builds.]

For every single function or method call, I must ensure that the expected string types are passed.

The code base consists of 4 "categories": A) Windows API, B) Third-party API/code, C) my libraries, D) my applications.

Strings are being "passed around all over the place" within a code category and between them.

It would be ever so much better if all functions (Windows API, my code, third-party code) used the same string/character type.

However, I can't very well convert the third-party libraries to use TCHAR, _T(), etc.

I'd have to do that every time a new version of a library was released.

[Incidentally, one library I use is Boost.]

In addition to having to choose the "right" string type for every variable and literal, it appears that I'll have to add a great number of string-conversion calls to my code, to convert continually between UTF-16 and UTF-8/ASCII.
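The conversion calls in question could look roughly like the sketch below. The helper names `Utf8ToWide` and `WideToUtf8` are hypothetical. The sketch uses `std::wstring_convert` from the standard library, which is available in Visual Studio 2013 but deprecated since C++17; on Windows, `MultiByteToWideChar` with `CP_UTF8` and `WideCharToMultiByte` are the native alternatives.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helpers. Assumes UTF-16 code units are stored in wchar_t,
// as they are on Windows.
std::wstring Utf8ToWide(const std::string& utf8)
{
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> conv;
    return conv.from_bytes(utf8);   // throws std::range_error on malformed input
}

std::string WideToUtf8(const std::wstring& wide)
{
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> conv;
    return conv.to_bytes(wide);     // throws std::range_error on malformed input
}
```

Each call allocates a new string, which is the inefficiency at issue when such conversions are scattered throughout the code.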

It seems to me that Microsoft created a programming nightmare by going with UTF-16.

It further seems to me that they avoided a relatively minor inconvenience to themselves (to maintain DLL compatibility), by inflicting an absurdly high cost on non-Microsoft developers.

Am I missing something?

If Microsoft had just "gone with" UTF-8, UTF-8 could be passed "everywhere". The whole nightmare would not exist. It would have been simplicity itself.

[Yes, some third-party code (at that time) might not have dealt correctly with multi-byte UTF-8 characters. However, it could have been updated to do so.]

How do I approach the conversion?

A) Use a UTF-16 "infrastructure", and write a wrapper around every third-party library?

B) Use a UTF-8 "infrastructure", and write a wrapper around the Windows API?

For example, a CEditUTF8 class, as a drop-in replacement for CEdit.

C) Decide, individually, for every string variable and literal, the appropriate type, and add ad-hoc calls to string conversion functions all over my code?
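Option (B) might be sketched as follows: the code base holds UTF-8 in `std::string` everywhere, and conversion happens only inside thin wrappers at the operating-system boundary. `FakeSetCaptionW` is a made-up stand-in for a wide-character Windows call such as `SetWindowTextW`, so that the sketch builds without `<windows.h>`; `SetCaptionUtf8` is likewise a hypothetical name.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Stand-in for a wide-character OS call (assumption: not a real API).
// It only records the text it was given.
std::wstring g_lastCaption;
void FakeSetCaptionW(const wchar_t* text) { g_lastCaption = text; }

// Hypothetical UTF-8 facing wrapper: callers never see wchar_t.
void SetCaptionUtf8(const std::string& utf8)
{
    // Convert once, at the boundary, then hand UTF-16 to the OS call.
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> conv;
    FakeSetCaptionW(conv.from_bytes(utf8).c_str());
}
```

Under this design, application code and the char*-based third-party libraries exchange the same string type, and only the wrapper layer needs to know about UTF-16.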

It strikes me that Microsoft made a phenomenally selfish decision by going with UTF-16 (which should have been rejected by the developer community).

Maybe I'm wrong.

Maybe I'm missing something.

It appears to me that even if there were no third-party library issue, the decision to "go with" UTF-16 greatly complicates source code.

The third-party compatibility issue makes the situation at least 10 times worse.

For new code, nearly half my programming effort will be dealing with string-type issues.

In addition to "mixed" strings making code far messier, there is inefficiency due to conversions.

If I adopt solution (C), converting existing code may require inspection of every use of the "char" keyword, every use of string, etc.

Yet another consideration: I prefer to make my library code portable, if possible (to isolate OS-dependent code). This string issue complicates the writing of portable code.

Now, even code that is completely OS-independent must contain "Microsoftisms", to address string/character type issues.

It all could have been so simple.

COMMENTS? SUGGESTIONS? DERISION? LAUGHS? INSULTS? COMMISERATION? AGREEMENT?

P.S.

I wrote a utility to wrap string literals with _T() **. However, it can't determine which literals to wrap (it's not smart enough to parse the source code, and determine the types of the variables to which the literal is being assigned). Even if it could, it wouldn't solve the problems of variable types, or conversion calls.

** This wasn't done via regular expression search and replace:

Firstly, there were too many exceptions (include directives, various macros, embedded double quotes, comments - both // and /**/ - and various other "special cases").

Secondly, Visual Studio search and replace is bugged: Search and replace IN FILES does not work if regular expressions are enabled (I have used SED to get around this).

Posted 14-Apr-14 7:28am

gokings

Updated 11-Jan-18 9:58am

v4


This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Michael Haephrati · Answer 1 · 2018-01-11T09:58:00

Here are most of the steps you need to take when converting an old project, that is, one whose Character Set (Project -> Properties -> General -> Character Set) is "Use Multi-Byte Character Set" or "Not Set".

First, set this attribute to "Use Unicode Character Set".

1. You will then have to change any hardcoded string from "string" to L"string" or _T("string").
2. In the places where Unicode needs to be used, you will have to replace:

C++

char something[64];

into

C++

wchar_t something[64];

3. For these Unicode strings, you will have to change the string manipulation functions to their Unicode equivalents. For example, instead of:

C++

strcpy(str1,str2);

use

C++

wcscpy(str1,str2);

and instead of

C++

strcmp(str1,"")

use

C++

wcscmp(str1,L"")

and so on...
If you are not sure, note that a web search for, say, "strcpy unicode" will take you to the MSDN page that shows the Unicode equivalent of the old function.
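Putting steps 1 to 3 together, a small function might change as in the following sketch. The function names and the buffer-size parameter are made up for illustration; only the `str...` to `wcs...` substitutions come from the steps above.

```cpp
#include <cstddef>
#include <cstring>
#include <cwchar>

// Before (MBCS build): copies src into dst unless src is empty or too long.
bool CopyIfNotEmptyA(char* dst, std::size_t dstCount, const char* src)
{
    if (std::strcmp(src, "") == 0 || std::strlen(src) >= dstCount)
        return false;
    std::strcpy(dst, src);
    return true;
}

// After (Unicode build): char -> wchar_t, "" -> L"", str* -> wcs*.
// dstCount is measured in wchar_t elements, not bytes.
bool CopyIfNotEmptyW(wchar_t* dst, std::size_t dstCount, const wchar_t* src)
{
    if (std::wcscmp(src, L"") == 0 || std::wcslen(src) >= dstCount)
        return false;
    std::wcscpy(dst, src);
    return true;
}
```

Buffer sizes need the same attention as the function names: a count that used to mean bytes now has to mean characters. Visual C++ may also warn about `strcpy` and `wcscpy`; `strcpy_s` and `wcscpy_s` are the bounds-checked variants in its runtime library.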

References:

Unicode Programming Summary

strcpy, wcscpy, _mbscpy