I'm trying to read a Unicode file. The characters are added to the CString szData, but the carriage-return/line-feed pair is not.
C++
BYTE buffer[3];
BYTE* pBuf = buffer;
BOOL bRead = TRUE;
CString szData;
FILE *fp = NULL;
_wfopen_s(&fp, (LPCTSTR)szFileName, _T("rb"));
while(bRead)
{
    bRead = fread(pBuf, sizeof(BYTE), 2, fp);
    if(*pBuf == '\n')
    {
        szData += "\r\n";
    }
    else
        szData += *pBuf;
}
fclose(fp);
szaText.Add(szData);
Posted
Updated 4-May-13 7:25am

I think you are on the confused side about Unicode, like many other programmers. The Unicode character set is nothing more than a table of about a million characters. Since the range of a char is just 0..255 and the range of a wchar_t is 0..65535, it's obvious that you can store a Unicode character in neither a char nor a wchar_t. You need at least 32 bits to have the range to encode one Unicode character (code point) with one integer. For this reason, if you want to use one integer to store any Unicode character, you have to use UTF-32, an encoding that uses no tricks: in UTF-32, one uint32 is one index into the Unicode table. Period.

In practice, however, UTF-32 is rarely used because it wastes a lot of memory, especially with languages that use mostly ASCII characters. Because of this, UTF-8 and UTF-16 are more widespread, but in UTF-8 and UTF-16 one integer (uint8 or uint16) alone isn't necessarily an index into the Unicode table. In UTF-8, any byte bigger than 127 means that this byte and the next few bytes (4 bytes at most in total) together store the bits that form an index into the large Unicode table (http://en.wikipedia.org/wiki/UTF-8[^]). In UTF-16 it is likewise possible that two wchar_ts together form an index (high and low surrogate pairs: 0xD800-0xDFFF, https://en.wikipedia.org/wiki/UTF-16[^]).

For this reason, some operations on UTF-8 and UTF-16 encoded strings are not efficient. For example, strlen() and wcslen() return the number of chars and wchar_ts in the string instead of the actual number of Unicode characters (which can be smaller, because of the multi-unit sequences I mentioned). Indexing a Unicode character in the string is also inefficient. In many cases, though, these operations are not required, and some other operations, such as concatenation, are efficient with these UTF encodings as well.

Often you are not really interested in the encoding of the string or in the individual Unicode characters in it, so you can handle the string as a big bunch of binary data. In fact, many programs just load strings from some localization database/file and use them to display text on the screen. Only the text renderer has to decode the UTF-encoded binary data into a sequence of Unicode characters, and it needs just a simple iterator that retrieves the Unicode characters from the UTF data from left to right. That can be done efficiently with both UTF-8 and UTF-16, and you don't even have to care about it if you are using, for example, the Windows DrawText().

Of course, you may want to generate strings "procedurally" in the program, but that is an easier task. Many operations allow you to treat the string as a plain sequence of chars or wchar_ts, which makes your work easier. For example, if you are searching for the next newline in a UTF-8 string, you can process it as a sequence of chars, because all bytes of a multi-byte UTF-8 sequence are bigger than 127, so you can safely search for the next chr(10) without actually interpreting the Unicode characters (the multi-byte and multi-wchar_t UTF-8/16 stuff) in the encoded string. The same is true for all ASCII characters (<128); this comes in handy, for example, in an XML parser, in which the special characters are ASCII (<>&").

UTF-16 or UTF-8? You can hide this as an implementation detail in your own string class and change it later as you like, or you can make it platform dependent. On Linux, UTF-8 is the way to go, but you can use UTF-8 even on Windows to store data in memory and convert to UTF-16 on the fly when you call a Windows function that requires a UTF-16 string. Many make the mistake of calling ANSI Windows functions with UTF-8 data. You know: almost every Windows function that receives a string parameter has three names, e.g. DrawTextA(), DrawTextW(), and DrawText(), which is just a macro defined to either the A or the W version. On NT-based Windows, the A functions just convert the input string to UTF-16 using the current Windows locale and then call the W version, so don't make the mistake of calling A functions with UTF-8 strings. It will work as long as the string contains only ASCII characters (<128), but it won't work with any special chars! On Windows, always call the W functions directly with UTF-16 strings: either store your strings as null-terminated UTF-16, or store UTF-8 and give your string class a converter method that returns a temporary UTF-16 copy.

The conclusion is that you can simply read/write text from/to files as binary data; the encoding matters only when someone starts processing that binary data as a sequence of Unicode characters. Even if you read in the text file as one big chunk of UTF-encoded binary data, you can easily split it into lines (along the chr(10) integers) without processing the actual Unicode characters on those lines, and you can easily process a localization file whose lines are key=value pairs without caring about UTF at all, because all you have to do is split each line into two parts along an ASCII character ('=').

Another interesting thing is that not all byte sequences (binary data) can be interpreted as a valid UTF-8 or UTF-16 string! It is worth validating a string when you read it from a file. I usually validate strings at runtime only in debug builds, to keep the release builds fast; in some rare cases you may need runtime validation even in release builds.

EDIT: Of course, if you want to use the standard library to detect the actual encoding of the file and convert it to the format your program uses (for example UTF-16), then my comments are just details that help you understand what's going on. A text file can store text in several formats. Usually the first few (2-5) bytes of the file form a special sequence that indicates the encoding of the text that follows. This is called the BOM (Byte Order Mark), and it isn't shown by modern text editors (use a hex editor to check): http://en.wikipedia.org/wiki/Byte_order_mark[^]
Note that a BOM at the beginning of the file isn't required, but without one a text editor may have a hard time guessing the format (sometimes it's impossible).

If you create the data files for your program yourself, then you can use a fixed format even without a BOM. We often use UTF-8 without a BOM here, and our program allows no other format.
Comments
Kenneth Haugland 5-May-13 12:51pm
   
That's what I call an answer! 5'ed, and I did some highlight edits as well, if you don't mind :-)
pasztorpisti 5-May-13 12:55pm
   
I was too lazy to highlight stuff; you made it more attractive! Thank you!
H.Brydon 5-May-13 23:08pm
   
Whew! +5 for the effort!
pasztorpisti 6-May-13 4:54am
   
Thank you!
"\r\n" is an ASCII string, not Unicode. Also, why are you reading BYTEs rather than WCHARs?

You should use fgetws()[^] to read Unicode.

See also Handling simple text files in C/C++[^].
Check the encoding of your file. First, is it really UTF-16? UTF-8 is more common.
If it is UTF-16, then you should be reading wide characters from the file, as Richard has said, comparing each one with the 16-bit value of '\n' (_T('\n') or 0x000A), and adding the wide characters to your string.

If, on the other hand, it is UTF-8, then certain byte sequences will need to be converted using one of the multi-byte to wide character functions, specifying an encoding of UTF-8.
In this case it may be that you simply miss the '\n' because it happens to land in the second byte.
Also, even when you do see the '\n', you lose information, because you ignore the second byte you read.
Read one byte at a time, add the bytes to a normal 8-bit character string, and when you find the end of line, add the "\r\n" and convert the string to wide characters using a conversion function (such as mbstowcs).

In summary: you need to find out how the file is actually encoded. Saying it is "Unicode" doesn't mean anything, as there are several ways to encode Unicode in a file. UTF-8 is the most common encoding (e.g. on websites). One reason is that the file is likely to be much smaller (and definitely will be if all the characters are below U+0800, which covers all European scripts).

Regards,
Ian.
Comments
Roger65 4-May-13 16:52pm
   
The file was created using _O_U16TEXT; I'm trying to convert it to plain ASCII.
I'm 74 and some days....... I didn't need the 0D0A pair, just to Add the terminated CString to the CStringArray.

BYTE buffer[3];
BYTE* pBuf = buffer;
BOOL bRead = TRUE;
CString szData;
FILE *fp = NULL;
_wfopen_s(&fp, (LPCTSTR)szFileName, _T("rb"));
while(bRead)
{
    bRead = fread(pBuf, sizeof(BYTE), 2, fp);
    if(*pBuf == 0x00A)
    {
        szaText.Add(szData);
        szData = _T("");
    }
    else
        szData += *pBuf;
}
fclose(fp);
   
Comments
Richard MacCutchan 5-May-13 9:21am
   
You are making this more difficult than it needs to be, and also less reliable. Have a look at my article, which explains how to check the source encoding and convert Unicode to ANSI correctly.
Roger65 5-May-13 13:14pm
   
Richard, what article? Does it have a name, or do I have to guess?

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)