Once upon a time, a text file was just a simple file. But it's not that easy anymore. A new line can be written in three different ways: Windows/DOS uses characters 13 and 10, Macintosh uses just character 13, and Unix uses just character 10. Why it is like that has always puzzled me. Different character sets make reading and writing text files harder, so I'm glad that we have Unicode to use instead. But as you might know, writing text in Unicode can be done in several ways...
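The three conventions are easy to tell apart mechanically. As a standalone sketch (not part of CTextFileDocument), a classifier that looks at the first line break in a buffer could look like this:

```cpp
#include <string>

// Classify the newline convention of a text buffer by its first line break:
// CR+LF (13, 10) = Windows/DOS, lone CR (13) = Macintosh, lone LF (10) = Unix.
enum Newline { NL_NONE, NL_WINDOWS, NL_MAC, NL_UNIX };

Newline detect_newline(const std::string& text)
{
    for (std::string::size_type i = 0; i < text.size(); ++i)
    {
        if (text[i] == '\r')                          // character 13
            return (i + 1 < text.size() && text[i + 1] == '\n')
                   ? NL_WINDOWS : NL_MAC;             // 13+10, or lone 13
        if (text[i] == '\n')                          // character 10
            return NL_UNIX;
    }
    return NL_NONE;                                   // no line break found
}
```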
I wrote CTextFileDocument because I thought it was too complicated to write and read files using Unicode characters. I also wanted the class to handle ordinary 8-bit files. Since version 1.20, different code-pages are supported when reading/writing ASCII-files.
The encodings the classes can read and write are:
- ASCII: simple 8-bit files (different code-pages are supported).
- UTF-8: Unicode encoded as 8-bit units; a character can be written in one, two or three bytes.
- Unicode, big-endian (UNI16_BE): every character is written in two bytes, most significant byte first.
- Unicode, little-endian (UNI16_LE): every character is written in two bytes, least significant byte first.
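As a standalone illustration of the UTF-8 case (this helper is not part of the classes), a code point up to U+FFFF turns into one, two or three bytes like this:

```cpp
#include <string>

// Encode one Unicode code point (up to U+FFFF) as UTF-8. This is why a
// character in a UTF-8 file can take one, two or three bytes.
std::string utf8_encode(unsigned int cp)
{
    std::string out;
    if (cp < 0x80)                        // one byte: plain ASCII
        out += static_cast<char>(cp);
    else if (cp < 0x800)                  // two bytes
    {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else                                  // three bytes
    {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For example, 'A' (U+0041) stays one byte, the alfa from the writing example below (U+03B1) takes two, and the Korean character 이 (U+C774) takes three.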
Most of the code on this page is for Windows/MFC, but the code should work on other platforms as well. The only major difference is that code-pages are only supported on Windows; on other platforms, you should use setlocale to specify which code-page to use. It's not necessary to use MFC on Windows.
CTextFileDocument consists of three classes:
- CTextFileBase: the base class for the other two classes.
- CTextFileWrite: use this to write files.
- CTextFileRead: use this to read files.
There are some useful member functions in the base class:
TEXTENCODING GetEncoding() const;
void SetUnknownChar(const char unknown);
bool IsDataLost() const;
void SetCodePage(const UINT codepage);
UINT GetCodePage() const;
static void ConvertCharToWstring(const char* from, wstring& to,
                                 UINT codepage = CP_ACP);
static void ConvertWcharToString(const wchar_t* from, string& to,
                                 UINT codepage = CP_ACP,
                                 bool* datalost = NULL, char unknownchar = 0);
The first five functions are the most important ones, and I hope that what they do is obvious. The last two are needed when working with different code-pages.
Writing files is very easy. The public functions are:
class CTextFileWrite : public CTextFileBase
{
public:
    CTextFileWrite(const FILENAMECHAR* filename, TEXTENCODING type = ASCII);
    CTextFileWrite(CFile* file, TEXTENCODING type = ASCII);

    void Write(const char* text);
    void Write(const wchar_t* text);
    void Write(const string& text);
    void Write(const wstring& text);

    CTextFileWrite& operator << (const char wc);
    CTextFileWrite& operator << (const char* text);
    CTextFileWrite& operator << (const string& text);
    CTextFileWrite& operator << (const wchar_t wc);
    CTextFileWrite& operator << (const wchar_t* text);
    CTextFileWrite& operator << (const wstring& text);
};
As you can see, you can use both char and wchar_t strings to write the text (CString is no problem either). Example:
CTextFileWrite myfile(_T("myfile.txt"), CTextFileWrite::UTF_8);  // filename and encoding are illustrative

myfile << "Using 8-bit characters as input";
myfile << L"Using 16-bit characters. The following character is alfa: \x03b1";

CString temp = _T("Using CString.");
myfile << temp;

Quite easy, isn't it?
Reading files isn't much more complicated. The public member functions are:
class CTextFileRead : public CTextFileBase
{
public:
    CTextFileRead(const FILENAMECHAR* filename);

    bool ReadLine(string& line);
    bool ReadLine(wstring& line);
    bool ReadLine(CString& line);

    bool Read(string& all, const string newline = "\r\n");
    bool Read(wstring& all, const wstring newline = L"\r\n");
    bool Read(CString& all, const CString newline = _T("\r\n"));

    bool Eof() const;
};
The ReadLine function reads a single line. Example 1:

CTextFileRead myfile(_T("myfile.txt"));  // filename is illustrative

CString line;
myfile.ReadLine(line);

CString encoding;
if(myfile.GetEncoding() == CTextFileRead::ASCII)
    encoding = _T("ASCII");
else if(myfile.GetEncoding() == CTextFileRead::UNI16_BE)
    encoding = _T("UNI16_BE");
else if(myfile.GetEncoding() == CTextFileRead::UNI16_LE)
    encoding = _T("UNI16_LE");
else if(myfile.GetEncoding() == CTextFileRead::UTF_8)
    encoding = _T("UTF_8");

MessageBox( CString(_T("Text encoding: ")) + encoding );
MessageBox( line );
If you want to read the whole file, use a Read function instead. Example 2:

CTextFileRead myfile(_T("myfile.txt"));  // filename is illustrative

CString alltext;
myfile.Read(alltext);

MessageBox( alltext );
If you are using Document/View, you probably want to save and read your files in the Serialize function. A problem with this is that you can't close the CArchive object; if you do, you will get an ASSERT error. So instead of the constructors where you specify the file name, use the constructors that take a CFile pointer. When you do this, the file will not be closed when the object is deleted. The following sample is derived from CEditView; instead of the original code, which only reads ASCII files, this will read Unicode as well:
void CTextFileDemo2Doc::Serialize(CArchive& ar)
{
    if (ar.IsStoring())
    {
        // Use CTextFileWrite::ASCII instead if you want an 8-bit file.
        CTextFileWrite file(ar.GetFile(), CTextFileWrite::UTF_8);
        CString allText;
        // ...fill allText with the text from the view...
        file << allText;
    }
    else
    {
        CTextFileRead file(ar.GetFile());
        CString allText;
        file.Read(allText);
        // ...hand allText over to the view...
        if (file.IsDataLost())
            AfxMessageBox(_T("Data was lost when the file was read!"));
    }
}
I hope that most of the code you have seen so far is quite straightforward to use. It's a little bit more difficult when you want to work with different code-pages (or "character sets"; I don't understand the difference).
Before Unicode, there was the problem of how to represent characters used in some parts of the world (a-z wasn't enough). For example, we who live in Sweden like the character 'å'. The character 'å' can be found in code-page 437, where it has character code 134. However, 'å' also exists in code-page 1252, but there it has character code 229! Does it sound complicated? Wait, it gets worse!
In some countries, like Korea, more complicated characters are used. Here, a single byte is too small for all the characters, so to make it possible to represent them all, some characters must use two bytes. Code-page 949 has lots of multi-byte characters, like this one: 이 (code: C0CC = U+C774) (don't worry if you can't see the character). That character is represented by two bytes (192 and 204). If you open an ASCII-file that uses this character in Notepad with code-page 949, you will see the character correctly. But if you do the same thing with code-page 1252 instead, you will see two characters ("ÀÌ").
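To see why the same two bytes come out as "ÀÌ", here is a small standalone sketch (not part of the classes): a single-byte code-page maps each byte to one character, and in code-page 1252 the bytes 0xC0 and 0xCC happen to map straight to the code points 'À' (U+00C0) and 'Ì' (U+00CC):

```cpp
#include <string>

// Interpret raw bytes with a single-byte code-page. In the 0xA0-0xFF range,
// code-page 1252 matches Latin-1, so each byte maps to the code point with
// the same numeric value.
std::wstring as_cp1252(const unsigned char* bytes, std::size_t n)
{
    std::wstring out;
    for (std::size_t i = 0; i < n; ++i)
        out += static_cast<wchar_t>(bytes[i]);  // one byte -> one character
    return out;
}
```

Feeding it the bytes 192 and 204 gives a two-character string, while a code-page 949 reader would treat the very same bytes as the single character 이.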
It is obviously quite hard to handle all the different code-pages; that's why Unicode was invented. The idea of Unicode is that only one character set is used and every character has the same size (no more multi-byte solutions are necessary).
So Unicode is great, but we still need to deal with files that use different code-pages. CTextFileDocument does this for you if you define which code-page to use (if you don't, it uses the system's default code-page, which mostly works well).
If you read an ASCII-file into a Unicode string (such as a wstring, or a CString when _UNICODE is defined), the string is converted using the code-page you have selected. The same thing happens, in the other direction, if you write a Unicode string to an ASCII-file.
Remember that the string will not be converted if you read/write an ASCII-file to/from a non-Unicode string. I will show later how to convert from one code-page to another.
When you convert a Unicode string to a multi-byte string, it can happen that some characters can't be converted. These characters are by default replaced with a question mark ('?'), but you can change this by calling SetUnknownChar(). If you want to know whether this has happened, call IsDataLost().
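As a rough standalone sketch of this behaviour (this is not the class's actual implementation; the target code-page here is simply Latin-1 for illustration):

```cpp
#include <string>

// Narrow a Unicode string to an 8-bit string, pretending the target
// code-page is Latin-1 (code points 0-255). Characters that don't fit are
// replaced (like SetUnknownChar()) and a flag is raised (like IsDataLost()).
std::string narrow_lossy(const std::wstring& from, bool* datalost,
                         char unknownchar = '?')
{
    std::string to;
    if (datalost) *datalost = false;
    for (std::size_t i = 0; i < from.size(); ++i)
    {
        if (from[i] <= 0xFF)
            to += static_cast<char>(from[i]);
        else
        {
            to += unknownchar;                // replacement character
            if (datalost) *datalost = true;   // some data was lost
        }
    }
    return to;
}
```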
CTextFileDocument uses some Windows APIs to convert strings: MultiByteToWideChar and WideCharToMultiByte. When these functions are used, the code-page of the multi-byte string must be defined. By default, CTextFileDocument uses CP_ACP, which means the system default code-page is used. If you want to use another code-page, call SetCodePage().
When you set which code-page to use, you must be sure that the code-page exists. Do this by calling IsValidCodePage(). To see which code-page your system is using, call GetACP().
To see all the code-pages your system supports, you could do this:

BOOL CALLBACK EnumCodePagesProc(LPTSTR lpCodePageString)
{
    cout << "Code-page: " << lpCodePageString << endl;
    return TRUE;   // keep enumerating
}
...
EnumSystemCodePages(EnumCodePagesProc, CP_INSTALLED);
OK, enough talk about code-pages; here is an example. The following code reads an ASCII-file (with code-page 437) into a Unicode string. Then it creates a new ASCII-file and writes the string with code-page 1252.
This is how you should do it if you want to convert a string from one code-page to another: convert the multi-byte string to a Unicode string, and then convert the Unicode string to a multi-byte string. If you don't want to write the string to a file, you could use ConvertCharToWstring and ConvertWcharToString, which are found in CTextFileBase.
CTextFileRead reader("ascii-437.txt");   // input filename is illustrative
reader.SetCodePage(437);

wstring alltext;
reader.Read(alltext);

CTextFileWrite writer("ascii-1252.txt", CTextFileBase::ASCII);
writer.SetCodePage(1252);
writer << alltext;
As I said before, it should be possible to use CTextFileDocument on platforms other than Windows. If you do this, you must know that code-pages are handled slightly differently: instead of calling SetCodePage, you should call setlocale to define which code-page to use. The following code does the same thing as the last example, but should work on every platform (I hope):
cout << setlocale(LC_ALL, ".437") << endl;

CTextFileRead reader("ascii-437.txt");   // input filename is illustrative
wstring alltext;
reader.Read(alltext);

CTextFileWrite writer("ascii-1252.txt", CTextFileBase::ASCII);
cout << setlocale(LC_ALL, ".1252") << endl;
writer << alltext;
About the code
CTextFileDocument was originally written to use MFC, but now it's more platform-independent. To make this possible, there are some #defines in the code. The most important one is PEK_TX_TECHLEVEL, which defines which features to use. You shouldn't need to think about this; the code should define it correctly automatically. The table below explains the differences:
PEK_TX_TECHLEVEL = 0: Used on non-Windows platforms. Uses fstream internally to read and write files. If you want to change the code-page, call setlocale.
PEK_TX_TECHLEVEL = 1: Used on Windows if you don't use MFC. Calls the Windows API directly to read and write files. If something can't be read or written, a CTextFileException is thrown. Code-pages are supported. Unicode in filenames is supported.
PEK_TX_TECHLEVEL = 2: Used if you are using MFC. Uses CFile internally to read and write files. If data can't be read or written, CFile will throw an exception. Code-pages are supported. Unicode in filenames is supported. CString is supported.
Points of interest
Even if the classes are quite simple, they have been very useful to me. They have all the features I want, so I don't miss anything important. However, it would be nice if they supported more encodings, like UTF-32; maybe I'll add this in the future. The performance is quite good, but if you know some way to make it faster, let me know.
One thing that would probably improve performance is increasing the value of BUFFSIZE (defined in CTextFileBase). Another is improving the code in CTextFileRead::GuessCharacterCount, which should return the number of characters in the file. Currently this only works if you are using MFC; otherwise it returns 1 MB. GuessCharacterCount is only used when Read is called, not when ReadLine is called.
How many bytes is a wchar_t? That is compiler-dependent, and I think it could give me some problems in the future. On Windows, wchar_t is two bytes, but I think that on Unix, four bytes are used. Currently this is not a problem, but if I add support for UTF-32 (four bytes for every character), some problems may occur.
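A quick standalone check of this (the exact size depends on your compiler):

```cpp
#include <cstddef>

// wchar_t is 2 bytes with Microsoft compilers (UTF-16 code units) and
// typically 4 bytes on Unix-like systems (UTF-32). Code that assumes one
// fixed size is not portable.
std::size_t wchar_size()
{
    return sizeof(wchar_t);
}
```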
You may also wonder why IsOpen() isn't a const function. I think it should be, but that is impossible: the reason is that fstream::is_open() is not const (well, it is in my VC6, but not in standard C++). Why it is like this is a mystery to me.
The classes expect the files to have a "byte order mark" (BOM) in the first bytes of the file; these bytes tell which encoding is used. The first two bytes of a big-endian file are 0xFE and 0xFF; in a little-endian file, the order is reversed (0xFF, 0xFE). If the encoding is UTF-8, the first three bytes are 0xEF, 0xBB and 0xBF. If no BOM is found, the file is treated as an ASCII file.
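A standalone sketch of this detection, using the BOM byte values from the Unicode convention (the enum names mirror the TEXTENCODING values used earlier, but the code is independent of the classes):

```cpp
#include <cstddef>

// Guess the encoding of a file from its first bytes (its BOM, if any).
enum Encoding { ENC_ASCII, ENC_UTF8, ENC_UNI16_BE, ENC_UNI16_LE };

Encoding detect_bom(const unsigned char* b, std::size_t n)
{
    if (n >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF)
        return ENC_UTF8;                     // UTF-8: 0xEF 0xBB 0xBF
    if (n >= 2 && b[0] == 0xFE && b[1] == 0xFF)
        return ENC_UNI16_BE;                 // big-endian: 0xFE 0xFF
    if (n >= 2 && b[0] == 0xFF && b[1] == 0xFE)
        return ENC_UNI16_LE;                 // little-endian: 0xFF 0xFE
    return ENC_ASCII;                        // no BOM: treat as ASCII
}
```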
You may wonder why I call these classes CTextFileDocument. The simple reason is that the name CTextFile was already taken... It was quite annoying to find that out just a couple of minutes before I wanted to upload the article.
And finally, thank you to all of you who have commented on and found bugs (and created fixes) in the code. These classes have improved a lot thanks to this.
- 21 May, 2005 - Version 1.22.
- Reading a line before reading everything could add an extra line break, fixed.
- A member variable wasn't always initialized, could cause problems when reading single lines, fixed.
- A smarter/easier algorithm is used when reading single lines.
- 10 April, 2005 - Version 1.21. If it was not possible to open a file in techlevel 1, IsOpen returned a bad result. Fixed.
- 15 January, 2005 - Version 1.20
- Fix: Fixed some problems when converting multi-byte string to Unicode, and vice versa.
- Improved conversion routines. It's now possible to define which code-page to use.
- It's now possible to set which character to use when it's not possible to convert a Unicode character to a multi-byte character.
- It's now possible to see if data was lost during conversion.
- Better support for other platforms, it's no longer necessary to use MFC in Windows.
- Fix: Reading very small files (1 byte) failed.
- 26 December, 2004 - Version 1.13
- Fix 1: If the first line in a file is empty, that line is ignored.
- Fix 2: Problems when converting multi-byte characters to wide characters and vice versa.
- 17 October, 2004 - Version 1.12. A minor memory leak when opening a file failed. Fixed.
- 28 August, 2004 - Version 1.11. WriteEndl() didn't work correctly when writing ASCII files. Fixed.
- 13 August, 2004 - Version 1.1. I'm sorry about the quick update. I have rewritten some parts of the code, so it's now a lot quicker than the previous version.
- 12 August, 2004 - Initial version.