Introduction
Once upon a time, a text file was just a simple file. But it is not that easy anymore. New lines can be written in three different ways: Windows/DOS uses characters 13 and 10 (CR LF), classic Macintosh uses just character 13 (CR), and Unix uses character 10 (LF). Why it is like that has always puzzled me. Different character sets make reading and writing text files harder, so I'm glad that we have Unicode to use instead. But as you might know, text in Unicode can be written in several ways...
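To make the three newline conventions concrete, here is a minimal sketch that rewrites all of them as a single '\n'. NormalizeNewlines is a hypothetical helper written for this illustration; it is not part of CTextFileDocument.

```cpp
#include <cassert>
#include <string>

// Convert Windows (CR LF), classic Macintosh (CR) and Unix (LF) line
// endings to a single '\n'. Hypothetical helper, not a member of
// CTextFileDocument.
std::string NormalizeNewlines(const std::string& text)
{
    std::string result;
    result.reserve(text.size());
    for (std::string::size_type i = 0; i < text.size(); ++i)
    {
        if (text[i] == '\r')
        {
            // CR LF counts as one line break: skip the LF that follows.
            if (i + 1 < text.size() && text[i + 1] == '\n')
                ++i;
            result += '\n';
        }
        else
        {
            result += text[i];
        }
    }
    return result;
}
```

A lone CR and a CR LF pair each become one line break, so a file that mixes conventions still comes out with consistent line endings.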
Text encoding
I wrote CTextFileDocument because I thought it was too complicated to write and read files using Unicode characters. I also wanted the class to handle ordinary 8-bit files. Since version 1.20, different codepages are supported when reading/writing ASCII-files.
The encodings the classes can read and write are:
CTextFileBase::ASCII | Simple 8-bit files (different codepages are supported). |
CTextFileBase::UTF_8 | Unicode encoded 8-bit files. A character in the Basic Multilingual Plane (U+0000 to U+FFFF) is written in one, two or three bytes. |
CTextFileBase::UNI16_BE | Unicode, big-endian. Every character is written in two bytes. Most significant byte is written first. |
CTextFileBase::UNI16_LE | Unicode, little-endian. Every character is written in two bytes. Least significant byte is written first. |
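The byte layouts in the table can be sketched as follows. EncodeUtf8 and EncodeUtf16 are hypothetical helpers for illustration only, not the conversion routines used inside the classes. They assume the input is a single code point in the range U+0000 to U+FFFF and do not check for the surrogate range.

```cpp
#include <cassert>
#include <string>

// Encode one code point in the range U+0000..U+FFFF as UTF-8.
// Hypothetical helper, not part of CTextFileDocument.
std::string EncodeUtf8(unsigned int cp)
{
    assert(cp <= 0xFFFF);
    std::string out;
    if (cp < 0x80)
    {
        // One byte: 0xxxxxxx
        out += static_cast<char>(cp);
    }
    else if (cp < 0x800)
    {
        // Two bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else
    {
        // Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Encode one code point in the range U+0000..U+FFFF as two bytes,
// most significant byte first (big-endian) or last (little-endian).
std::string EncodeUtf16(unsigned int cp, bool bigEndian)
{
    assert(cp <= 0xFFFF);
    const char high = static_cast<char>(cp >> 8);
    const char low = static_cast<char>(cp & 0xFF);
    std::string out;
    out += bigEndian ? high : low;
    out += bigEndian ? low : high;
    return out;
}
```

For example, the Greek letter alpha (U+03B1) becomes the two bytes 0xCE 0xB1 in UTF-8, 0x03 0xB1 in big-endian and 0xB1 0x03 in little-endian.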
Most of the code I use on this page is for Windows/MFC, but the code should work on other platforms as well. The only major difference is that code-pages are only supported in Windows. On other platforms, you should use setlocale to specify which code-page to use. It's not necessary to use MFC on Windows.
Structure
CTextFileDocument consists of three classes:
CTextFileBase | This is the base class for the other two classes. |
CTextFileWrite | Use this to write files. |
CTextFileRead | Use this to read files. |
There are some useful member functions in the base class:
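class CTextFileBase
{
public:
    CTextFileBase();
    ~CTextFileBase();
    // Is the file open?
    int IsOpen();
    // Close the file
    void Close();
    // Which encoding the file uses
    TEXTENCODING GetEncoding() const;
    // Character used when a Unicode character can't be converted
    void SetUnknownChar(const char unknown);
    // Was data lost during a conversion?
    bool IsDataLost() const;
    void ResetDataLostFlag();
    // Code-page used when converting to/from multi-byte strings
    void SetCodePage(const UINT codepage);
    UINT GetCodePage() const;
    // Conversion helpers that work without a file
    static void ConvertCharToWstring(const char* from,
        wstring &to, UINT codepage=CP_ACP);
    static void ConvertWcharToString(const wchar_t* from,
        string &to, UINT codepage=CP_ACP,
        bool* datalost=NULL, char unknownchar=0);
};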
The first five functions are the most important ones, and I hope that what they do is obvious. The rest is needed when working with different code-pages.
Writing files
Writing files is very easy. The public functions are:
class CTextFileWrite : public CTextFileBase
{
public:
    // Open a file by name, or write to an already opened CFile
    CTextFileWrite(const FILENAMECHAR* filename, TEXTENCODING type=ASCII);
    CTextFileWrite(CFile* file, TEXTENCODING type=ASCII);
    // Write text
    void Write(const char* text);
    void Write(const wchar_t* text);
    void Write(const string& text);
    void Write(const wstring& text);
    CTextFileWrite& operator << (const char wc);
    CTextFileWrite& operator << (const char* text);
    CTextFileWrite& operator << (const string& text);
    CTextFileWrite& operator << (const wchar_t wc);
    CTextFileWrite& operator << (const wchar_t* text);
    CTextFileWrite& operator << (const wstring& text);
    // Write a new line
    void WriteEndl();
};
As you see, you use char or wchar_t to write the text (CString is no problem). Example:
CTextFileWrite myfile(_T("samplefile.txt"),
CTextFileWrite::UTF_8 );
ASSERT(myfile.IsOpen());
myfile << "Using 8 bit characters as input";
myfile.WriteEndl();
myfile << L"Using 16-bit characters. The following character is alfa: \x03b1";
myfile.WriteEndl();
CString temp = _T("Using CString.");
myfile << temp;
Quite easy, isn't it :-).
Reading files
Reading files isn't much more complicated. The public member functions are:
class CTextFileRead : public CTextFileBase
{
public:
    // Open a file by name, or read from an already opened CFile
    CTextFileRead(const FILENAMECHAR* filename);
    CTextFileRead(CFile* file);
    // Read a single line
    bool ReadLine(string& line);
    bool ReadLine(wstring& line);
    bool ReadLine(CString& line);
    // Read the whole file; lines are separated by 'newline'
    bool Read(string& all, const string newline="\r\n");
    bool Read(wstring& all, const wstring newline=L"\r\n");
    bool Read(CString& all, const CString newline=_T("\r\n"));
    // Has the end of the file been reached?
    bool Eof() const;
};
The ReadLine function reads a single line. Example 1:
CTextFileRead myfile(_T("samplefile.txt"));
ASSERT(myfile.IsOpen());
CString encoding;
if(myfile.GetEncoding() == CTextFileRead::ASCII)
    encoding = _T("ASCII");
else if(myfile.GetEncoding() == CTextFileRead::UNI16_BE)
    encoding = _T("UNI16_BE");
else if(myfile.GetEncoding() == CTextFileRead::UNI16_LE)
    encoding = _T("UNI16_LE");
else if(myfile.GetEncoding() == CTextFileRead::UTF_8)
    encoding = _T("UTF_8");
MessageBox( CString(_T("Text encoding: ")) + encoding );
while(!myfile.Eof())
{
    CString line;
    myfile.ReadLine(line);
    MessageBox( line );
}
If you want to read the whole file, use a Read function instead. Example 2:
CTextFileRead myfile(_T("samplefile.txt"));
ASSERT(myfile.IsOpen());
CString alltext;
myfile.Read(alltext);
MessageBox( alltext );
Document/View
If you are using Document/View, you probably want to save and read your files in the Serialize function. A problem with this is that you can't close the CArchive object; if you do, you will get an ASSERT error. So instead of using the constructors that take a file name, use the constructors that take a CFile pointer. When you do this, the file will not be closed when the object is destroyed. The following sample is derived from CEditView, and instead of using the original code that only reads ASCII files, this will read Unicode as well:
void CTextFileDemo2Doc::Serialize(CArchive& ar)
{
    if(ar.IsStoring())
    {
#ifndef _UNICODE
        CTextFileWrite file(ar.GetFile(), CTextFileWrite::ASCII);
#else
        CTextFileWrite file(ar.GetFile(), CTextFileWrite::UTF_8);
#endif
        CString allText;
        ((CEditView*)m_viewList.GetHead())->GetWindowText(allText);
        file << allText;
    }
    else
    {
        CTextFileRead file(ar.GetFile());
        CString allText;
        file.Read(allText);
        if(file.IsDataLost())
            MessageBox( AfxGetMainWnd()->m_hWnd,
                        _T("Data was lost when the file was read!"),
                        NULL,
                        MB_ICONWARNING|MB_OK);
        BOOL bResult =
            ::SetWindowText(((CEditView*)m_viewList.GetHead())->GetSafeHwnd(),
                            allText);
        if (!bResult ||
            ((CEditView*)m_viewList.GetHead())->GetWindowTextLength()
                < (int)allText.GetLength())
            AfxThrowMemoryException();
    }
}
That's it!
Code-pages/Character sets
I hope that most of the code you have seen so far is quite straightforward to use. It's a little bit more difficult when you want to work with different code-pages (or "character sets", I don't understand the difference).
Before Unicode, there was a problem how to represent characters that were used in some parts of the world (a-z wasn't enough). For example, we who live in Sweden like the character 'å'. The character 'å' could be found in code-page 437. There, it has the ASCII-code 134. However, 'å' also exists in code-page 1252, but there it has the ASCII-code 229! Does it sound complicated? Wait, it's getting worse!
In some other countries, more complicated characters are used, like in Korea. Here, the ASCII-table is too small for all characters, so to make it possible to represent all characters, it is necessary to use two bytes for some characters. Code-page 949 has lots of multi-byte characters, like this one: 이 (code: C0CC=U+C774) (don't worry if you can't see the character). That character is represented by two bytes (192 and 204). If you open an ASCII-file that is using this character, in Notepad, and you are using code-page 949, you will see the character correctly. But if you are doing the same thing but you are using code-page 1252 instead, you will see two characters ("ÀÌ").
It is obviously quite hard to handle all the different code-pages; that's why Unicode was invented. In Unicode, the idea is that only one character set should be used and that every character should have the same size (no more multi-byte solutions are necessary).
So Unicode is great, but we still need to deal with files that use different code-pages. CTextFileDocument does this for you if you define which code-page to use (if you don't, it will use the code-page used by the system, and that mostly works well).
If you read an ASCII-file to a Unicode-string (like wstring, or CString if _UNICODE is defined), the string will be converted by using the code-page you have selected. The same thing happens (but in the other direction) if you write a Unicode-string to an ASCII-file.
Remember that the string will not be converted if you read/write an ASCII-file to/from a non-Unicode string. I will show later how to convert from one code-page to another.
When you convert a Unicode-string to a multi-byte string, it could happen that some characters can't be converted. These characters are by default replaced with a question mark ('?'), but you can change this by calling SetUnknownChar(). If you want to know if this has happened, call IsDataLost().
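The replace-and-flag behaviour can be illustrated with a small portable sketch. It does not use the class or the Windows APIs. Instead it assumes the target is ISO-8859-1 (Latin-1), chosen here only because the code points U+0000 to U+00FF map directly to the byte values 0 to 255. ToLatin1 is a hypothetical helper, not a function in CTextFileDocument.

```cpp
#include <cassert>
#include <string>

// Convert a wide string to ISO-8859-1 (Latin-1). Characters above U+00FF
// cannot be represented and are replaced with unknownChar; if dataLost is
// not a null pointer, it is set to true when that happens.
// Hypothetical helper for illustration, not part of CTextFileDocument.
std::string ToLatin1(const std::wstring& from, char unknownChar, bool* dataLost)
{
    std::string to;
    to.reserve(from.size());
    for (std::wstring::size_type i = 0; i < from.size(); ++i)
    {
        const unsigned long cp = static_cast<unsigned long>(from[i]);
        if (cp <= 0xFF)
        {
            // Representable: the code point is the byte value.
            to += static_cast<char>(cp);
        }
        else
        {
            // Not representable: substitute and remember the loss.
            to += unknownChar;
            if (dataLost)
                *dataLost = true;
        }
    }
    return to;
}
```

Converting the text "aåα" this way keeps 'a' and 'å' but replaces the Greek alpha with the unknown character and sets the flag, which is the situation IsDataLost() reports.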
Some Windows-APIs
CTextFileDocument is using some APIs in Windows to convert strings: MultiByteToWideChar and WideCharToMultiByte. When these functions are used, the code-page of the multi-byte string must be defined. By default, CTextFileDocument is using CP_ACP, which means that the system default code-page should be used. If you want to use another code-page, call SetCodePage.
When you set which code-page to use, you must be sure that the code-page exists. Do this by calling IsValidCodePage.
To see which code-page your system is using, call GetACP().
To list all code-pages that your system supports, you could do this:
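// The callback is defined first so that it is known when it is passed to
// EnumSystemCodePagesA. The ANSI ("A") version is used explicitly so that
// the string can be written to cout even when _UNICODE is defined.
BOOL CALLBACK EnumCodePagesProc(LPSTR lpCodePageString)
{
    cout << "Code-page: " << lpCodePageString << endl;
    return TRUE; // continue the enumeration
}
void ListCodePages()
{
    EnumSystemCodePagesA(&EnumCodePagesProc, CP_SUPPORTED);
}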
Example 1
OK, enough talk about code-pages; here is an example. The following code reads an ASCII-file (with code-page 437) into a Unicode-string. Then it creates a new ASCII-file and writes the string with code-page 1252.
This is how to convert a string from one code-page to another: convert the multi-byte string to a Unicode-string, and then convert the Unicode-string to a multi-byte string. If you don't want to write the string to a file, you could use ConvertCharToWstring and ConvertWcharToString, which are found in CTextFileBase.
CTextFileRead reader("ascii-437.txt");
reader.SetCodePage(437);
wstring alltext;
reader.Read(alltext);
reader.Close();
CTextFileWrite writer("ascii-1252.txt", CTextFileBase::ASCII);
writer.SetCodePage(1252);
writer << alltext;
if(writer.IsDataLost())
{
    // Some characters could not be represented in code-page 1252.
}
writer.Close();
Example 1b
As I said before, it should be possible to use CTextFileDocument on platforms other than Windows. If you do this, you must know that code-pages are handled slightly differently. Instead of calling SetCodePage, you should call setlocale to define which code-page to use. The following code does the same thing as the last example, but will work on every platform (I hope ;-)):
CTextFileRead reader("ascii-437.txt");
cout << setlocale(LC_ALL, ".437") << endl;
wstring alltext;
reader.Read(alltext);
reader.Close();
CTextFileWrite writer("ascii-1252.txt", CTextFileBase::ASCII);
cout << setlocale(LC_ALL, ".1252") << endl;
writer << alltext;
if(writer.IsDataLost())
{
    // Some characters could not be represented in code-page 1252.
}
writer.Close();
About the code
CTextFileDocument was originally written to use MFC, but now it's more platform-independent. To make this possible, there are some #defines in the code. The most important one is PEK_TX_TECHLEVEL, which defines which features to use. You don't need to think about this, though; the code should define it correctly automatically. The table below explains the differences:
PEK_TX_TECHLEVEL = 0 | This is used if you are running on a non-Windows platform. This uses fstream internally to read and write files. If you want to change codepage, you should call setlocale. |
PEK_TX_TECHLEVEL = 1 | This is used on Windows if you don't use MFC. This calls Windows API directly to read and write files. If something couldn't be read/written, a CTextFileException is thrown. Codepages are supported. Unicode in filenames is supported. |
PEK_TX_TECHLEVEL = 2 | This is used if you are using MFC. This uses CFile internally to read and write files. If data can't be read/written, CFile will throw an exception. Codepages are supported. Unicode in filenames is supported. CString is supported. |
Points of interest
Even if the classes are quite simple, they have been very useful to me. They have all features I want, so I don't miss anything important. However, it would be nice if it supported more encodings, like UTF-32. Maybe I'll add this in the future. The performance is quite good, but if you know some way to get it faster, let me know :-).
One thing that would probably improve the performance is increasing the value of BUFFSIZE (defined in CTextFileBase). Another thing is making the code in CTextFileRead::GuessCharacterCount better. This should return the number of characters in the file. Currently, this only works if you are using MFC, otherwise it will return 1 MB. GuessCharacterCount is only used when Read is called, so it's not used when ReadLine is called.
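As a rough sketch of how such an estimate could be made from the file size alone, without MFC: EstimateCharacterCount and the Encoding enum below are hypothetical stand-ins, not the class's actual implementation. The sketch assumes the file size in bytes is already known and that Unicode files start with a byte order mark. For the 8-bit encodings, the result is only an upper bound, which is still enough for reserving space in a string.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical standalone enum; the names mirror the encodings in the article.
enum Encoding { ASCII, UTF_8, UNI16_BE, UNI16_LE };

// Estimate the number of characters in a file from its size in bytes.
// Hypothetical helper, not the real GuessCharacterCount.
std::size_t EstimateCharacterCount(std::size_t fileBytes, Encoding encoding)
{
    switch (encoding)
    {
    case UNI16_BE:
    case UNI16_LE:
        // Two-byte byte order mark, then two bytes for every character.
        return fileBytes >= 2 ? (fileBytes - 2) / 2 : 0;
    case UTF_8:
        // Three-byte byte order mark, then one or more bytes for every
        // character, so the remaining byte count is an upper bound.
        return fileBytes >= 3 ? fileBytes - 3 : 0;
    default:
        // 8-bit file: at most one character per byte (fewer if the
        // code-page uses multi-byte characters).
        return fileBytes;
    }
}
```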
How many bytes is a wchar_t? That is compiler dependent, and I think that could give me some problems in the future. In Windows, wchar_t is two bytes, but I think that on most Unix systems, four bytes are used. Currently, this is not a problem, but if I add support for UTF-32 (four bytes for every character), some problems may occur.
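The size can be checked on any compiler with a few lines of standard C++. WcharBytes and WcharHoldsAnyCodePoint are hypothetical helper names used only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>

// Number of bytes in a wchar_t with the current compiler.
std::size_t WcharBytes()
{
    return sizeof(wchar_t);
}

// True when a single wchar_t can hold every Unicode code point
// (the highest one is U+10FFFF), which is what UTF-32 support would need.
bool WcharHoldsAnyCodePoint()
{
    const unsigned long maxValue =
        static_cast<unsigned long>(std::numeric_limits<wchar_t>::max());
    return maxValue >= 0x10FFFFUL;
}
```

With a two-byte wchar_t, the second function returns false, so characters outside the 16-bit range would need two wchar_t values each.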
Why isn't IsOpen() a const function? I think it should be, but that is impossible. The reason for this is that fstream::is_open() is not const (well, it is in my VC6 but not in standard C++). Why it is like this is a mystery for me.
The classes expect that the files have a "byte order mark" (BOM) in the first bytes of the file. These bytes tell which encoding is used. The first two bytes in a "big endian" file are 0xFE and 0xFF; in a "little endian" file, the order is reversed (0xFF and 0xFE). If the encoding is UTF-8, the first three bytes are 0xEF, 0xBB and 0xBF. If no BOM is found, the file is treated as an ASCII file.
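A minimal sketch of this kind of detection is shown below. It checks for the standard Unicode byte order marks: 0xEF 0xBB 0xBF for UTF-8, 0xFE 0xFF for big-endian and 0xFF 0xFE for little-endian. DetectEncoding and the TextEncoding enum are hypothetical names for this illustration; the real classes may do the check differently.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical standalone enum for this sketch.
enum TextEncoding { ENC_ASCII, ENC_UTF_8, ENC_UNI16_BE, ENC_UNI16_LE };

// Guess the encoding from the first bytes of a file.
// Hypothetical helper, not part of CTextFileDocument.
TextEncoding DetectEncoding(const unsigned char* data, std::size_t size)
{
    // UTF-8 byte order mark: EF BB BF
    if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
        return ENC_UTF_8;
    // Big-endian byte order mark: FE FF
    if (size >= 2 && data[0] == 0xFE && data[1] == 0xFF)
        return ENC_UNI16_BE;
    // Little-endian byte order mark: FF FE
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        return ENC_UNI16_LE;
    // No byte order mark: treat the file as an 8-bit file.
    return ENC_ASCII;
}
```

Note that a UTF-8 file saved without a BOM is reported as an 8-bit file by this approach.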
You may wonder why I call these classes CTextFileDocument. The simple reason for this is that the name CTextFile was already taken... It was quite annoying to find that out just a couple of minutes before I wanted to upload the article :-).
And finally, thank you all of you who have commented and found bugs (and created fixes) to the code. These classes have been improved a lot, thanks to this.
History
- 21 May, 2005 - Version 1.22.
- Reading a line before reading everything could add an extra line break, fixed.
- A member variable wasn't always initialized, could cause problems when reading single lines, fixed.
- A smarter/easier algorithm is used when reading single lines.
- 10 April, 2005 - Version 1.21. If it was not possible to open a file in techlevel 1, IsOpen returned a bad result. Fixed.
- 15 January, 2005 - Version 1.20
- Fix: Fixed some problems when converting multi-byte string to Unicode, and vice versa.
- Improved conversion routines. It's now possible to define which code-page to use.
- It's now possible to set which character to use when it's not possible to convert a Unicode character to a multi-byte character.
- It's now possible to see if data was lost during conversion.
- Better support for other platforms, it's no longer necessary to use MFC in Windows.
- Fix: Reading very small files (1 byte) failed.
- 26 December, 2004 - Version 1.13
- Fix 1: If the first line in a file is empty, that line is ignored.
- Fix 2: Problems when converting multi-byte characters to wide characters and vice versa.
- 17 October, 2004 - Version 1.12. A minor memory leak when open file failed, fixed.
- 28 August, 2004 - Version 1.11. WriteEndl() didn't work correctly when writing ASCII files. Fixed.
- 13 August, 2004 - Version 1.1. I'm sorry about the quick update. I have rewritten some part of the code, so now it's a lot quicker than the previous version.
- 12 August, 2004 - Initial version.
PEK is one of the millions of programmers that sometimes program so hard that he forgets how to sleep (this is especially true when he has more important things to do). He thinks that there are not enough donuts in the world. He likes when his programs work as they should, but dislikes when his programs are more clever than he is.