SEEKING ADVICE REGARDING CONVERSION TO UNICODE


WHY I AM ADOPTING UNICODE SO LATE

[Feel free to skip to "The Situation", below, but if you do, please don't flame me for "waiting so long".]

I'm a former professional software engineer and programmer.

I was "away" from programming for many years due to medical considerations.

I "left" just as the shift to Unicode became mainstream.

Lately, I've begun to tinker with programming.

Since my "return to programming", I've continued to work only with MBCS builds.

[Due to my condition, any conversion to Unicode will be a lot bigger deal than were I healthy.]

After my "return" (such as it is), I read about Unicode.

It seemed clear to me that UTF-8 was the "way to go".

I assumed that UTF-8 had been adopted, pretty much universally.

I was very surprised (and disappointed) to learn that Microsoft had "gone with" UTF-16.



MY CODING SITUATION


My issues are with conversion to Unicode in the following environment:

Language: C++
Target Platform: Native Windows
Development Tool: Visual Studio 2013

I have many lines of C++ source code (hundreds of thousands, almost certainly) in both libraries and applications.

Additionally, I make use of third-party libraries (for which I also have the source code). Some of these are relatively current, as I have done some work (MBCS) since my "return to programming".


THE MAIN PROBLEM WITH UNICODE BUILDS


A) Every interaction with the Windows API requires UTF-16 strings.

B) However, the third-party libraries I use accept char*.


For every single "string type" variable, I must decide whether it should be "char" or "TCHAR" (or string or wstring).

For every string literal, I must decide whether it needs to be A) macro wrapped, B) prefixed, C) neither. [Basically, native char*, or UTF-16.]

This is the case not only for new code, but for existing code (existing code probably contains over 100,000 instances of such variables and literals).
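To make the three literal choices concrete, here is a minimal sketch. MY_T, my_tchar and MY_UNICODE_BUILD are hypothetical stand-ins for _T(), TCHAR and the _UNICODE define from <tchar.h>, used only so the sketch compiles without the Windows headers. The variable names are illustrative as well.

```cpp
#include <cstring>
#include <cwchar>

// Assumption: this sketch models a Unicode build. Set to 0 for an MBCS build.
#define MY_UNICODE_BUILD 1

// Hypothetical stand-ins for TCHAR and _T() from <tchar.h>.
#if MY_UNICODE_BUILD
typedef wchar_t my_tchar;
#define MY_T(s) L##s
#else
typedef char my_tchar;
#define MY_T(s) s
#endif

// A) Macro wrapped: the type follows the build setting.
const my_tchar* const kCaption = MY_T("Settings");

// B) Prefixed: always wide, whatever the build setting.
const wchar_t* const kApiText = L"Settings";

// C) Neither: always narrow, e.g. for a library that accepts char*.
const char* const kLibraryText = "Settings";

// Length in characters of the macro-wrapped literal, whichever type it has.
std::size_t caption_length() {
#if MY_UNICODE_BUILD
    return std::wcslen(kCaption);
#else
    return std::strlen(kCaption);
#endif
}
```

Only form A changes type between builds. Forms B and C stay fixed, which is why each literal has to be inspected individually.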

[Ignore for now that I want to be able to do both MBCS and Unicode builds. That's not the issue. Pretend I'll be doing only Unicode builds.]


For every single function or method call, I must ensure that the expected string types are passed.


The code base consists of 4 "categories": A) Windows API, B) Third-party API/code, C) my libraries, D) my applications.


Strings are being "passed around all over the place" within a code category and between them.


It would be ever so much better if all functions (Windows API, my code, third-party code) used the same string/character type.


However, I can't very well convert the third-party libraries to use TCHAR, _T(), etc.

I'd have to do that every time a new version of a library was released.

[Incidentally, one library I use is Boost.]

In addition to having to choose the "right" string type for every variable and literal, it appears that I'll have to add a great number of string-conversion calls to my code, continually converting between UTF-16 and UTF-8/ASCII.

It seems to me that Microsoft created a programming nightmare by going with UTF-16.

It further seems to me that they avoided a relatively minor inconvenience to themselves (to maintain DLL compatibility), by inflicting an absurdly high cost on non-Microsoft developers.

Am I missing something?

If Microsoft had just "gone with" UTF-8, UTF-8 could be passed "everywhere". The whole nightmare would not exist. It would have been simplicity itself.

[Yes, some third-party code (at that time) might not have dealt correctly with multi-byte UTF-8 characters. However, it could have been updated to do so.]



How do I approach the conversion?


A) Use a UTF-16 "infrastructure", and write a wrapper around every third-party library?

B) Use a UTF-8 "infrastructure", and write a wrapper around the Windows API?

For example, a CEditUTF8 class, as a drop-in replacement for CEdit.

C) Decide, individually, for every string variable and literal, the appropriate type, and add ad-hoc calls to string conversion functions all over my code?
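For option (B), the piece that any such wrapper would need at the Windows boundary is a UTF-8 to UTF-16 conversion. The following is only a portable sketch of the mechanics: utf8_to_utf16 is a hypothetical helper name, malformed input is replaced with U+FFFD instead of being rejected, and overlong encodings are not detected. It assumes a compiler with C++11 char16_t support, which may mean something newer than Visual Studio 2013. On Windows itself one would more likely call MultiByteToWideChar with CP_UTF8 and store the result in a std::wstring.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: decode UTF-8 bytes into UTF-16 code units.
// Simplifications: malformed sequences become U+FFFD, and overlong
// encodings are not detected.
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;
        int extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)               { cp = lead;        extra = 0; }
        else if ((lead >> 5) == 0x06)  { cp = lead & 0x1F; extra = 1; }
        else if ((lead >> 4) == 0x0E)  { cp = lead & 0x0F; extra = 2; }
        else if ((lead >> 3) == 0x1E)  { cp = lead & 0x07; extra = 3; }
        else                           { cp = 0xFFFD;      extra = 0; }  // stray byte
        ++i;
        for (; extra > 0; --extra) {
            if (i >= in.size() ||
                (static_cast<unsigned char>(in[i]) & 0xC0) != 0x80) {
                cp = 0xFFFD;  // truncated or broken sequence
                break;
            }
            cp = (cp << 6) | (static_cast<unsigned char>(in[i]) & 0x3F);
            ++i;
        }
        // Surrogate values and out-of-range values are not valid code points.
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) cp = 0xFFFD;
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Code points above the BMP are split into a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

A wrapper class along the lines of CEditUTF8 would call a helper like this just before handing text to the API, and the reverse conversion when reading text back, so the rest of the code base could stay in char-based UTF-8.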


It strikes me that Microsoft made a phenomenally selfish decision by going with UTF-16 (which should have been rejected by the developer community).


Maybe I'm wrong.


Maybe I'm missing something.


It appears to me that even if there were no third-party library issue, the decision to "go with" UTF-16 greatly complicates source code.

The third-party compatibility issue makes the situation at least 10 times worse.

For new code, nearly half my programming effort will be dealing with string-type issues.

In addition to "mixed" strings making code far messier, there is inefficiency due to conversions.

If I adopt solution (C), converting existing code may require inspection of every use of the "char" keyword, every use of string, etc.


Yet another consideration: I prefer to make my library code portable, if possible (to isolate OS-dependent code). This string issue complicates the writing of portable code.

Now, even code that is completely OS-independent must contain "Microsoftisms" to address string/character type issues.


It all could have been so simple.




COMMENTS? SUGGESTIONS? DERISION? LAUGHS? INSULTS? COMMISERATION? AGREEMENT?


P.S.

I wrote a utility to wrap string literals with _T() **. However, it can't determine which literals to wrap (it's not smart enough to parse the source code, and determine the types of the variables to which the literal is being assigned). Even if it could, it wouldn't solve the problems of variable types, or conversion calls.


** This wasn't done via regular expression search and replace:


Firstly, there were too many exceptions (include directives, various macros, embedded double quotes, comments - both // and /**/ - and various other "special cases").

Secondly, Visual Studio search and replace is bugged: Search and replace IN FILES does not work if regular expressions are enabled (I have used SED to get around this).
Updated 11-Jan-18 9:58am

1 solution

Here are most of the steps you need to take when converting an old project, that is, one whose Character Set (in Project -> Properties -> General -> Character Set) is defined as
Use Multi-Byte Character Set

or
Not Set

First, set this attribute to Use Unicode Character Set.

1. You will then have to change every hardcoded string from "string" to L"string" or _T("string").
2. Wherever a string needs to be Unicode, you will have to replace:
C++
char something[];

into
C++
wchar_t something[];

3. For these Unicode strings, you will have to change the string manipulation functions to their Unicode equivalents. For example, instead of:
C++
strcpy(str1,str2);
use
C++
wcscpy(str1,str2);
and instead of
C++
strcmp(str1,"")
use
C++
wcscmp(str1,L"")
and so on...
If you are not sure which function to use, searching for something like "strcpy unicode" will lead you to the MSDN page where the Unicode equivalent of the old function is shown.
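Steps 2 and 3 above can be sketched together in one small function. This is only an illustration: copy_and_check_nonempty and the buffer size of 32 are made up for the example, and the comments show the narrow code each line would replace. Note that wcscpy checks bounds no more than strcpy does, and Visual C++ may warn and suggest wcscpy_s instead.

```cpp
#include <cwchar>

// Illustrative only: copies a wide string into a fixed buffer and reports
// whether the copy is non-empty, using the wide counterparts of
// strlen, strcpy and strcmp.
bool copy_and_check_nonempty(const wchar_t* source) {
    wchar_t something[32];                        // was: char something[32];
    if (std::wcslen(source) >= 32) return false;  // was: strlen(source)
    std::wcscpy(something, source);               // was: strcpy(something, source);
    return std::wcscmp(something, L"") != 0;      // was: strcmp(something, "") != 0
}
```

The literal passed by the caller has to change too, for example copy_and_check_nonempty(L"hello") instead of "hello".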

References:

Unicode Programming Summary

strcpy, wcscpy, _mbscpy
 
Comments
Richard MacCutchan 12-Jan-18 4:39am    
FOUR years too late.
Michael Haephrati 12-Jan-18 5:08am    
Apparently there is no expiration date to questions and when I looked for unanswered questions, I got here...
Richard MacCutchan 12-Jan-18 5:09am    
True, but it is unlikely that the OP is still waiting for an answer. It is better to list all questions so you generally only see the currently active ones.
Richard Deeming 12-Jan-18 14:01pm    
Normally, I'd agree. But this question had no prior answers, and even if the OP has moved on, this solution might help someone else who finds the question.

Obligatory XKCD. :)
Michael Haephrati 12-Jan-18 5:15am    
Thanks for the tip. I didn't see any indication this question is inactive. I do hope my answer will serve others who may be looking for the same issue...

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


