Introduction

You've undoubtedly seen all these various string types. In Part I, I will cover the three types of character encodings. It is crucial that you understand how the encoding schemes work. Even if you already know that a string is an array of characters, read this part. Once you've learned this, it will be clearer how the various string classes are related.

In Part II, I will describe the string classes themselves, when to use which ones, and how to convert among them.

The basics of characters - ASCII, DBCS, Unicode

All string classes eventually boil down to a C-style string, and C-style strings are arrays of characters, so I'll first cover the character types. There are three encoding schemes and three character types.

The first scheme is the single-byte character set, or SBCS. In this encoding scheme, all characters are exactly one byte long. ASCII is an example of an SBCS. A single zero byte marks the end of an SBCS string.

The second scheme is the multi-byte character set, or MBCS. An MBCS encoding contains some characters that are one byte long, and others that are more than one byte long. The MBCS schemes used in Windows contain two character types, single-byte characters and double-byte characters. Since the largest multi-byte character used in Windows is two bytes long, the term double-byte character set, or DBCS, is commonly used in place of MBCS.

In a DBCS encoding, certain values are reserved to indicate that they are part of a double-byte character. For example, in the Shift-JIS encoding (a commonly-used Japanese scheme), values 0x81-0x9F and 0xE0-0xFC mean "this is a double-byte character, and the next byte is part of this character." Such values are called "lead bytes," and are always greater than 0x7F. The byte following a lead byte is called the "trail byte." In DBCS, the trail byte can be any non-zero value. Just as in SBCS, the end of a DBCS string is marked by a single zero byte.

The third scheme is Unicode.
Unicode is an encoding standard in which, for the purposes of this guide, all characters are two bytes long. Unicode characters are sometimes called wide characters because they are wider (use more storage) than single-byte characters. Note that Unicode is not considered an MBCS: the distinguishing feature of an MBCS encoding is that characters are of different lengths. A Unicode string is terminated by two zero bytes (the encoding of the value 0 in a wide character).

Single-byte characters are the Latin alphabet, accented characters, and graphics defined in the ASCII standard and DOS operating system. Double-byte characters are used in East Asian and Middle Eastern languages. Unicode is used in COM and internally in Windows NT.

You're certainly already familiar with single-byte characters. Unicode characters are held in the wchar_t type, and Unicode character and string literals are written with an L prefix:

    wchar_t wch = L'1';             // 2 bytes, 0x0031
    const wchar_t* wsz = L"Hello";  // 12 bytes, 6 wide characters

How characters are stored in memory

Single-byte strings are stored one character after the next, with a single zero byte marking the end of the string. The Unicode version of the same string stores each character as two bytes, with the character 0x0000 (the Unicode encoding of zero) marking the end.

[Figure: memory layout of a sample string in its single-byte and Unicode forms]

DBCS strings look like SBCS strings at first glance, but we will see later that there are subtleties that make a difference when using string manipulating functions and traversing through the string with a pointer.

[Figure: memory layout of a sample DBCS string that contains the double-byte character "ni"]

Keep in mind that the value of "ni" is not interpreted as a single two-byte number. Its lead byte and trail byte, in that order, together encode the character.

Using string handling functions

We've all seen the C string functions, the strxxx() family. Microsoft also added versions to their CRT (C runtime library) that operate on DBCS strings, the _mbsxxx() functions.

Let's look at a typical string to illustrate the need for the different versions of the string handling functions.
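Going back to our Unicode string:

[Figure: the bytes of the Unicode string as stored in memory]

Because x86 CPUs are little-endian, the value 0x0042 is stored in memory as the bytes 42 00. A function written for single-byte strings would treat that second byte as the end of the string.

That covers single-byte versus wide-character functions.

Traversing and indexing into strings properly

Since most of us grew up using SBCS strings, we're used to using the ++ and -- operators on a pointer, or an array index, to move through a string. However, you must break those habits for your code to work properly when it encounters DBCS strings. There are two rules for traversing through a DBCS string using a pointer. Breaking these rules will cause almost all of your DBCS-related bugs.

1. Don't traverse forwards using ++ unless you check for lead bytes along the way.
2. Never traverse backwards using --.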
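Pointer movement that respects lead bytes can be sketched in portable C++ as follows. This is only an illustration under stated assumptions: the helper names (IsSjisLeadByte, SjisNext, SjisPrev, SjisCharCount) are hypothetical, and the lead-byte test hard-codes the Shift-JIS ranges quoted earlier rather than asking the operating system for the active code page.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: Shift-JIS lead-byte ranges as quoted in the text
// (0x81-0x9F and 0xE0-0xFC). Other DBCS code pages use other ranges.
inline bool IsSjisLeadByte(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// Step forward by one character: two bytes when p points at a lead byte
// followed by a non-zero trail byte, otherwise one byte.
// At the terminating zero the pointer does not move.
inline const char* SjisNext(const char* p)
{
    assert(p != nullptr);
    const unsigned char b = static_cast<unsigned char>(*p);
    if (b == 0)
        return p;
    if (IsSjisLeadByte(b) && p[1] != '\0')
        return p + 2;
    return p + 1;
}

// Count characters (not bytes) by always stepping with SjisNext().
inline std::size_t SjisCharCount(const char* s)
{
    assert(s != nullptr);
    std::size_t count = 0;
    for (const char* p = s; *p != '\0'; p = SjisNext(p))
        ++count;
    return count;
}

// Find the start of the character just before cur. A trail byte cannot be
// recognized by looking at it alone, so this walks forward from the start
// of the string and remembers the last character boundary before cur.
inline const char* SjisPrev(const char* start, const char* cur)
{
    assert(start != nullptr && cur != nullptr && start <= cur);
    const char* prev = start;
    for (const char* p = start; p < cur && *p != '\0'; p = SjisNext(p))
        prev = p;
    return prev;
}
```

In a real Windows program you would normally rely on existing DBCS-aware routines such as CharNext() and CharPrev(), or the CRT's _mbsinc() and _mbsdec(), instead of hand-written helpers like these, because they take the current code page into account.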
I'll illustrate rule 2 first, since it's easy to find a non-contrived example of code that breaks it. Say you have a program that stores a config file in its own directory, and you keep the install directory in the registry. At runtime, you read the install directory, tack on the config filename, and try to read it.

Now, imagine this is your code that constructs the filename:

    bool GetConfigFileName ( char* pszName, size_t nBuffSize )
    {
        char szConfigFilename[MAX_PATH];

        // Read install dir from registry... we'll assume it succeeds.

        // Add on a backslash if it wasn't present in the registry value.
        // First, get a pointer to the terminating zero.
        char* pLastChar = strchr ( szConfigFilename, '\0' );

        // Now move it back one character.
        pLastChar--;

        if ( *pLastChar != '\\' )
            strcat ( szConfigFilename, "\\" );

        // Add on the name of the config file.
        strcat ( szConfigFilename, "config.bin" );

        // If the caller's buffer is big enough, return the filename.
        if ( strlen ( szConfigFilename ) >= nBuffSize )
            return false;
        else
        {
            strcpy ( pszName, szConfigFilename );
            return true;
        }
    }

This is very defensive code, yet it will break with particular DBCS characters. To see why, suppose a Japanese
user gets hold of your program and changes the install directory to one whose name ends in a double-byte character with a trail byte of 0x5C.

[Figure: the bytes of the install directory string, with the last two bytes highlighted]

So what went wrong? Look at the last two bytes of the string. The value of the backslash character is 0x5C, and that is also the value of the trail byte of the final double-byte character. When GetConfigFileName() moves its pointer back one byte, it lands on that trail byte, decides the path already ends in a backslash, and does not append one, so the resulting filename is wrong.

The correct way to traverse backwards is to use functions that are aware of DBCS characters and move the pointer the correct number of bytes. Here is the correct code; the key change is the pointer movement done by CharPrev():

    bool FixedGetConfigFileName ( char* pszName, size_t nBuffSize )
    {
        char szConfigFilename[MAX_PATH];

        // Read install dir from registry... we'll assume it succeeds.

        // Add on a backslash if it wasn't present in the registry value.
        // First, get a pointer to the terminating zero.
        char* pLastChar = _mbschr ( szConfigFilename, '\0' );

        // Now move it back one double-byte character.
        pLastChar = CharPrev ( szConfigFilename, pLastChar );

        if ( *pLastChar != '\\' )
            _mbscat ( szConfigFilename, "\\" );

        // Add on the name of the config file.
        _mbscat ( szConfigFilename, "config.bin" );

        // If the caller's buffer is big enough, return the filename.
        // The buffer size is in bytes, so compare against the byte length.
        if ( strlen ( szConfigFilename ) >= nBuffSize )
            return false;
        else
        {
            _mbscpy ( pszName, szConfigFilename );
            return true;
        }
    }

This fixed function uses the CharPrev() API to move pLastChar back one character, which may be two bytes long.

You can probably imagine a way to break rule 1 now. For example, you might validate a filename entered by the
user by looking for multiple occurrences of a particular character. If you step through the string one byte at a time, a trail byte that happens to have the same value is counted as a match.

Related to rule 2 is this one about using array indexes:

2a. Never calculate an index into a string using subtraction.

Code that breaks this rule is very similar to code that breaks rule 2. For example, if pLastChar were set this way:

    char* pLastChar = &szConfigFilename [strlen(szConfigFilename) - 1];

it would break in exactly the same situations, because subtracting 1 in the index expression is equivalent to moving backwards 1 byte, which breaks rule 2.

Back to strxxx() versus _mbsxxx()

It should be clear now why the _mbsxxx() functions are necessary: the strxxx() functions treat every byte as a character and know nothing about lead and trail bytes.

MBCS and Unicode in the Win32 API

The two sets of APIs

Although you might never have noticed, every API and message in Win32 that deals with strings has two versions.
One version accepts MBCS strings, and the other Unicode strings. For example, there is no API called SetWindowText(); instead, there are SetWindowTextA() and SetWindowTextW(). The A version takes MBCS strings and the W version takes Unicode strings.

When you build a Windows program, you can elect to use either the MBCS or Unicode APIs. If you've used the VC AppWizards and never touched the preprocessor settings, you've been using the MBCS versions all along. So how is it that we can write "SetWindowText" when there isn't an API by that name? The winuser.h header file contains some #defines that select one version or the other:

    BOOL WINAPI SetWindowTextA ( HWND hWnd, LPCSTR lpString );
    BOOL WINAPI SetWindowTextW ( HWND hWnd, LPCWSTR lpString );

    #ifdef UNICODE
    #define SetWindowText  SetWindowTextW
    #else
    #define SetWindowText  SetWindowTextA
    #endif

When building for the MBCS APIs, UNICODE is not defined, so the preprocessor sees:

    #define SetWindowText  SetWindowTextA

and replaces calls to SetWindowText() with calls to SetWindowTextA().

So, if you want to switch to using the Unicode APIs by default, you can go to the preprocessor settings and remove the _MBCS symbol, then add UNICODE and _UNICODE. However, you will run into a problem if your strings are plain char arrays. Consider this code:

    HWND hwnd = GetSomeWindowHandle();
    char szNewText[] = "we love Bob!";

    SetWindowText ( hwnd, szNewText );

After the preprocessor replaces "SetWindowText" with "SetWindowTextW", the code becomes:

    HWND hwnd = GetSomeWindowHandle();
    char szNewText[] = "we love Bob!";

    SetWindowTextW ( hwnd, szNewText );

See the problem here? We're passing a single-byte string to a function that takes a Unicode string. The first solution to this problem is to use #ifdef around the definition of the string variable:

    HWND hwnd = GetSomeWindowHandle();
    #ifdef UNICODE
    wchar_t szNewText[] = L"we love Bob!";
    #else
    char szNewText[] = "we love Bob!";
    #endif

    SetWindowText ( hwnd, szNewText );

You can probably imagine the headache you'd get having to do that around every string in your code. The solution to this is TCHAR.

TCHAR to the rescue!
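A definition of TCHAR looks like this:

    #ifdef UNICODE
    typedef wchar_t TCHAR;
    #else
    typedef char TCHAR;
    #endif

So a TCHAR is a char in MBCS builds, and a wchar_t in Unicode builds. There is also a macro, _T(), to deal with the L prefix needed for Unicode string literals:

    #ifdef UNICODE
    #define _T(x) L##x
    #else
    #define _T(x) x
    #endif

The ## is a preprocessor operator that pastes the two arguments together. Whenever you have a string literal in your code, wrap it in _T() and it will get the L prefix in Unicode builds:

    TCHAR szNewText[] = _T("we love Bob!");

Just as there are macros to hide the A and W versions of the APIs, there are macros to use in place of the strxxx() and _mbsxxx() string functions: the _tcsxxx() names declared in tchar.h.

String and TCHAR typedefs

Since the Win32 API documentation lists functions by their common names (for example, "SetWindowText"), all strings are given in terms of TCHARs.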
When to use TCHAR and Unicode

So, after all this, you're probably wondering, "So why would I use Unicode? I've gotten by with plain char strings."

The vast majority of Unicode APIs are not implemented on Windows 9x, so if you intend your program to be run on 9x, you'll have to stick with the MBCS APIs. (There is a relatively new library from Microsoft called the Microsoft Layer for Unicode that lets you use Unicode on 9x; however, I have not tried it myself yet, so I can't comment on how well it works.)

However, since NT uses Unicode for everything internally, you will speed up your program by using the Unicode APIs. Every time you pass a string to an MBCS API, the operating system converts the string to Unicode and calls the corresponding Unicode API. If a string is returned, the OS has to convert the string back. While this conversion process is (hopefully) highly optimized to make as little impact as possible, it is still a speed penalty that is avoidable.

NT allows very long filenames (longer than the normal limit of MAX_PATH characters), but only through the Unicode APIs.

Finally, with the end of the Windows 9x line, MS seems to be doing away with the MBCS APIs; some newer APIs have only a Unicode version.

And even if you don't go with Unicode builds now, you should definitely always use TCHAR and its associated macros.