CTextFileDocument

PEK

Aug 12, 2004

Ms-RL

CTextFileDocument lets you write and read text files with different encodings (ASCII, UTF-8, Unicode 16 little/big endian are supported).

Sample Image - textfilesample.gif

Introduction
Text encoding
Structure
Writing files
Reading files
Document/View
Code-pages/Character sets
About the code
Links
Points of interest
History

Introduction

Once upon a time, a text file was just a simple file. It is not that easy anymore. New lines can be written in three different ways: Windows/DOS uses characters 13 and 10, Macintosh uses just character 13, and Unix uses character 10. Why it is like that has always puzzled me. Different character sets make reading and writing text files harder, so I'm glad that we have Unicode to use instead. But as you might know, text in Unicode can be written in several ways...
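
The three line-break styles can all be handled in one pass. The following is a minimal sketch of that idea in standard C++. SplitLines is a hypothetical helper written for this illustration; it is not part of CTextFileDocument and is not necessarily how CTextFileRead::ReadLine works.

```cpp
#include <string>
#include <vector>

// Split text into lines, accepting "\r\n" (Windows/DOS),
// "\r" (classic Macintosh) and "\n" (Unix) as line breaks.
std::vector<std::string> SplitLines(const std::string& text)
{
    std::vector<std::string> lines;
    std::string current;

    for (std::string::size_type i = 0; i < text.size(); ++i)
    {
        const char c = text[i];
        if (c == '\r' || c == '\n')
        {
            // Treat the pair 13, 10 as a single line break
            if (c == '\r' && i + 1 < text.size() && text[i + 1] == '\n')
                ++i;
            lines.push_back(current);
            current.clear();
        }
        else
        {
            current += c;
        }
    }

    // Keep the last line if the text doesn't end with a line break
    if (!current.empty())
        lines.push_back(current);

    return lines;
}
```

Calling SplitLines("a\r\nb\rc\nd") gives four lines, whichever convention each break uses.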

Text encoding

I wrote CTextFileDocument because I thought it was too complicated to write and read files using Unicode characters. I also wanted the class to handle ordinary 8-bit files. Since version 1.20, different codepages are supported when reading/writing ASCII-files.

The encodings the classes can read and write are:

`CTextFileBase::ASCII`	Simple 8-bit files (different codepages are supported).
`CTextFileBase::UTF_8`	Unicode encoded 8-bit files. A character could be written in one, two or three bytes.
`CTextFileBase::UNI16_BE`	Unicode, big-endian. Every character is written in two bytes. Most significant byte is written first.
`CTextFileBase::UNI16_LE`	Unicode, little-endian. Every character is written in two bytes. Least significant byte is written first.
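
To make the byte layouts in the table concrete, here is a small sketch that encodes a single 16-bit character in each Unicode encoding. EncodeUtf8 and EncodeUtf16 are hypothetical names used only for this illustration (they are not functions in CTextFileDocument), and the sketch assumes the character value is at most 0xFFFF.

```cpp
#include <string>

// Encode one 16-bit Unicode character as UTF-8 (one, two or three bytes).
// Assumption: ch <= 0xFFFF.
std::string EncodeUtf8(unsigned int ch)
{
    std::string out;
    if (ch < 0x80)
    {
        out += static_cast<char>(ch);                        // 0xxxxxxx
    }
    else if (ch < 0x800)
    {
        out += static_cast<char>(0xC0 | (ch >> 6));          // 110xxxxx
        out += static_cast<char>(0x80 | (ch & 0x3F));        // 10xxxxxx
    }
    else
    {
        out += static_cast<char>(0xE0 | (ch >> 12));         // 1110xxxx
        out += static_cast<char>(0x80 | ((ch >> 6) & 0x3F)); // 10xxxxxx
        out += static_cast<char>(0x80 | (ch & 0x3F));        // 10xxxxxx
    }
    return out;
}

// Encode one 16-bit character as two bytes, big- or little-endian.
std::string EncodeUtf16(unsigned int ch, bool bigEndian)
{
    const char high = static_cast<char>((ch >> 8) & 0xFF);
    const char low  = static_cast<char>(ch & 0xFF);

    std::string out;
    out += bigEndian ? high : low;   // first byte written
    out += bigEndian ? low : high;   // second byte written
    return out;
}
```

For the Greek letter alpha (U+03B1), EncodeUtf8 produces the bytes CE B1, EncodeUtf16 with bigEndian set produces 03 B1, and without it B1 03.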

Most of the code I use on this page is for Windows/MFC, but the code should work on other platforms as well. The only major difference is that code-pages are only supported in Windows. On other platforms, you should use setlocale to specify which code-page to use. It's not necessary to use MFC on Windows.

Structure

CTextFileDocument consists of three classes:

`CTextFileBase`	This is the base class for the other two classes.
`CTextFileWrite`	Use this to write files.
`CTextFileRead`	Use this to read files.

There are some useful member functions in the base class:

class CTextFileBase
{
public:
    CTextFileBase();
    ~CTextFileBase();

    //Is the file open?
    int IsOpen();

    //Close the file
    void Close();

    //Return the encoding of the file (ASCII, UNI16_BE, UNI16_LE or UTF_8);
    TEXTENCODING GetEncoding() const;
    
    //Set which character that should be used when converting
    //Unicode->multi byte and an unknown character is found ('?' is default)
    void SetUnknownChar(const char unknown);

    //Returns true if data was lost
    //(happens when converting Unicode->multi byte string and an unmappable
    //character is found).
    bool IsDataLost() const;

    //Reset the data lost flag
    void ResetDataLostFlag();

    //Set codepage to use when working with non-Unicode strings
    void SetCodePage(const UINT codepage);

    //Get codepage to use when working with non-Unicode strings
    UINT GetCodePage() const;

    //Convert char* to wstring
    static void ConvertCharToWstring(const char* from, 
                   wstring &to, UINT codepage=CP_ACP);

    //Convert wchar_t* to string
    static void ConvertWcharToString(const wchar_t* from, 
                    string &to, UINT codepage=CP_ACP,
                    bool* datalost=NULL, char unknownchar=0);

};

The first five functions are the most important ones, and I hope that what they do is obvious. The rest is needed when working with different code-pages.

Writing files

Writing files is very easy. The public functions are:

class CTextFileWrite : public CTextFileBase
{
public:
    CTextFileWrite(const FILENAMECHAR* filename, TEXTENCODING type=ASCII);
    CTextFileWrite(CFile* file, TEXTENCODING type=ASCII);

    //Write routines
    void Write(const char* text);
    void Write(const wchar_t* text);
    void Write(const string& text);
    void Write(const wstring& text);


    CTextFileWrite& operator << (const char wc);
    CTextFileWrite& operator << (const char* text);
    CTextFileWrite& operator << (const string& text);

    CTextFileWrite& operator << (const wchar_t wc);
    CTextFileWrite& operator << (const wchar_t* text);
    CTextFileWrite& operator << (const wstring& text);

    //Write new line (two characters, 13 and 10)
    void WriteEndl();
};

As you see, you use char or wchar_t to write the text (CString is no problem). Example:

//Create file. Use UTF-8 to encode the file
CTextFileWrite myfile(_T("samplefile.txt"), 
            CTextFileWrite::UTF_8 );

ASSERT(myfile.IsOpen());

//Write some text
myfile << "Using 8 bit characters as input";
myfile.WriteEndl();
myfile << L"Using 16-bit characters. The following character is alpha: \x03b1";
myfile.WriteEndl();
CString temp = _T("Using CString.");
myfile << temp;

Quite easy, isn't it :-).

Reading files

Reading files isn't much more complicated. The public member functions are:

class CTextFileRead : public CTextFileBase
{
public:
    CTextFileRead(const FILENAMECHAR* filename);
    CTextFileRead(CFile* file);

    //Reading functions. Returns false if eof.
    bool ReadLine(string& line);
    bool ReadLine(wstring& line);
    bool ReadLine(CString& line);

    bool Read(string& all, const string newline="\r\n");
    bool Read(wstring& all, const wstring newline=L"\r\n");
    bool Read(CString& all, const CString newline=_T("\r\n"));

    //End of file?
    bool Eof() const;
};

The ReadLine function reads a single line. Example 1:

CTextFileRead myfile(_T("samplefile.txt"));

ASSERT(myfile.IsOpen());

CString encoding;

if(myfile.GetEncoding() == CTextFileRead::ASCII)
    encoding = _T("ASCII");
else if(myfile.GetEncoding() == CTextFileRead::UNI16_BE)
    encoding = _T("UNI16_BE");
else if(myfile.GetEncoding() == CTextFileRead::UNI16_LE)
    encoding = _T("UNI16_LE");
else if(myfile.GetEncoding() == CTextFileRead::UTF_8)
    encoding = _T("UTF_8");

MessageBox( CString(_T("Text encoding: ")) + encoding );

while(!myfile.Eof())
{
    CString line;
    myfile.ReadLine(line);

    MessageBox( line );
}

If you want to read the whole file, use a Read function instead. Example 2:

CTextFileRead myfile(_T("samplefile.txt"));

ASSERT(myfile.IsOpen());

CString alltext;
myfile.Read(alltext);
MessageBox( alltext );

Document/View

If you are using Document/View, you probably want to save and read your files in the Serialize function. A problem with this is that you must not close the file that belongs to the CArchive object; if you do, you will get an assertion failure. So instead of using the constructors that take a file name, use the constructors that take a CFile pointer. When you do this, the file will not be closed when the object is destroyed. The following sample is derived from CEditView, and unlike the original code that only reads ASCII files, it reads Unicode as well:

void CTextFileDemo2Doc::Serialize(CArchive& ar)
{
    if(ar.IsStoring())
    {
#ifndef _UNICODE
        //Save in ASCII if not unicode version
        CTextFileWrite file(ar.GetFile(), CTextFileWrite::ASCII);
#else
        //Save in UTF-8 in unicode version
        CTextFileWrite file(ar.GetFile(), CTextFileWrite::UTF_8);
#endif

        CString allText;

        ((CEditView*)m_viewList.GetHead())->GetWindowText(allText);

        file << allText;
    }
    else
    {
        CTextFileRead file(ar.GetFile());

        //Read text
        CString allText;

        file.Read(allText);
        
        //Data may be lost when the file is read. This happens when the
        //file is using Unicode, but your program doesn't.
        if(file.IsDataLost())
            MessageBox( AfxGetMainWnd()->m_hWnd,
                    _T("Data was lost when the file was read!"),
                    NULL,
                    MB_ICONWARNING|MB_OK);

        //Set text 
        BOOL bResult = 
          ::SetWindowText(((CEditView*)m_viewList.GetHead())->GetSafeHwnd(), 
          allText);

        // make sure that SetWindowText was successful
        if (!bResult || 
          ((CEditView*)m_viewList.GetHead())->GetWindowTextLength() 
          < (int)allText.GetLength())
            AfxThrowMemoryException();
    }
}

That's it!

Code-pages/Character sets

I hope that most of the code you have seen so far is quite straightforward to use. It's a little bit more difficult when you want to work with different code-pages (or "character sets", I don't understand the difference).

Before Unicode, it was a problem to represent characters that are used in some parts of the world (a-z wasn't enough). For example, we who live in Sweden like the character 'å'. The character 'å' can be found in code-page 437, where it has the character code 134. However, 'å' also exists in code-page 1252, but there it has the character code 229! Does it sound complicated? Wait, it's getting worse!

In some other countries, more complicated characters are used, like in Korea. Here, the ASCII-table is too small for all characters, so to make it possible to represent all characters, it is necessary to use two bytes for some characters. Code-page 949 has lots of multi-byte characters, like this one: 이 (code: C0CC=U+C774) (don't worry if you can't see the character). That character is represented by two bytes (192 and 204). If you open an ASCII-file that is using this character, in Notepad, and you are using code-page 949, you will see the character correctly. But if you are doing the same thing but you are using code-page 1252 instead, you will see two characters ("ÀÌ").

It is obviously quite hard to handle all the different code-pages, and that's why Unicode was invented. In Unicode, the idea is that only one character set should be used and that every character should have the same size (no more multi-byte solutions are necessary).

So Unicode is great, but we still need to deal with files that use different code-pages. CTextFileDocument does this for you if you define which code-page to use (if you don't, it will use the code-page used by the system and that mostly works well).

If you read an ASCII-file to a Unicode-string (like wstring or CString if _UNICODE is defined), the string will be converted by using the code-page you have selected. The same thing happens (but in the other way) if you write a Unicode-string to an ASCII-file.

Remember that the string will not be converted if you read/write an ASCII-file to/from a non-Unicode string. I will show later how to convert from one code-page to another.

When you convert a Unicode-string to a multi-byte string, it can happen that some characters can't be converted. These characters are by default replaced with a question mark ('?'), but you can change this by calling SetUnknownChar(). If you want to know whether this has happened, call IsDataLost().
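
The replace-and-flag behaviour can be sketched in portable C++. This is an illustration only, under a loud assumption: the target is ISO-8859-1 (Latin-1), where character codes 0 to 255 map one-to-one, rather than a real Windows code-page. ToLatin1 is a hypothetical helper and not the library's ConvertWcharToString.

```cpp
#include <string>

// Convert a wide string to an 8-bit string.
// Assumption: the target is ISO-8859-1 (Latin-1), so characters 0..255
// are copied as-is. Anything above 255 can't be represented; it is
// replaced by 'unknown' and 'dataLost' is set to true.
std::string ToLatin1(const std::wstring& from, char unknown, bool& dataLost)
{
    std::string to;
    to.reserve(from.size());
    dataLost = false;

    for (std::wstring::size_type i = 0; i < from.size(); ++i)
    {
        const unsigned long code = static_cast<unsigned long>(from[i]);
        if (code <= 0xFF)
        {
            to += static_cast<char>(code);
        }
        else
        {
            to += unknown;      // unmappable character
            dataLost = true;
        }
    }
    return to;
}
```

With '?' as the unknown character, the wide string L"a\u03B1z" becomes "a?z" and the flag is set, because the Greek alpha doesn't exist in Latin-1.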

Some Windows-APIs

CTextFileDocument uses two Windows APIs to convert strings: MultiByteToWideChar and WideCharToMultiByte. When these functions are used, the code-page of the multi-byte string must be defined. By default, CTextFileDocument uses CP_ACP, which means that the system default code-page is used. If you want to use another code-page, call SetCodePage.

When you set which code-page to use, you must be sure that the code-page exists. Do this by calling IsValidCodePage.

To see which code-page your system is using, call GetACP().

To see all code-pages that your system supports, you could do this:

//The callback must be declared before it is used
BOOL CALLBACK EnumCodePagesProc(LPTSTR lpCodePageString)
{
  //In a Unicode build, LPTSTR is wchar_t*, so use wcout instead of cout
  cout << "Code-page: " << lpCodePageString << endl;
  return TRUE;
}

void ListCodePages()
{
  EnumSystemCodePages(&EnumCodePagesProc, CP_SUPPORTED);
}

Example 1

OK, enough talk about code-pages, here is an example. The following code reads an ASCII-file (with code-page 437) into a Unicode-string. Then it creates a new ASCII-file and writes the string with code-page 1252.

This is the way to convert a string from one code-page to another: convert the multi-byte string to a Unicode-string, and then convert the Unicode-string back to a multi-byte string. If you don't want to write the string to a file, you can use ConvertCharToWstring and ConvertWcharToString, which are found in CTextFileBase.

//Make file reader. Read the file "ascii-437.txt"
CTextFileRead reader("ascii-437.txt");

//Define which code-page to use when we read the file
//437 is very often used in DOS.
reader.SetCodePage(437);

//Read everything to a Unicode-string
wstring alltext;
reader.Read(alltext);

//Close file
reader.Close();


//Now we create a new ASCII-file
CTextFileWrite writer("ascii-1252.txt", CTextFileBase::ASCII);

//Set which code-page to use.
//1252 is very often used in Windows
writer.SetCodePage(1252);

//Do the writing...
writer << alltext;

//Was data lost when the Unicode-string was converted to
//code-page 1252?
if(writer.IsDataLost())
{
  //Do something...
}

//Close the file
writer.Close();
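
The two-step idea (multi-byte to Unicode, then Unicode to multi-byte) can also be shown without any Windows API. The sketch below is an illustration under stated assumptions: the table holds only three Swedish letters, whereas a real code-page table covers every byte, and all names (Mapping, kTable, Cp437ToUnicode, UnicodeToCp1252, Cp437ToCp1252) are hypothetical and not part of CTextFileDocument.

```cpp
#include <cstddef>
#include <string>

// One row per character: its byte in each code-page and its Unicode value.
// Assumption: only three letters are listed to keep the sketch short.
struct Mapping
{
    unsigned char cp437;
    unsigned char cp1252;
    wchar_t       unicode;
};

static const Mapping kTable[] = {
    { 0x84, 0xE4, 0x00E4 }, // a with diaeresis
    { 0x86, 0xE5, 0x00E5 }, // a with ring (134 in 437, 229 in 1252)
    { 0x94, 0xF6, 0x00F6 }  // o with diaeresis
};
static const std::size_t kTableSize = sizeof(kTable) / sizeof(kTable[0]);

// Step 1: code-page 437 -> Unicode. Bytes below 128 are plain ASCII;
// other bytes that are missing from the table become '?'.
std::wstring Cp437ToUnicode(const std::string& from)
{
    std::wstring to;
    for (std::string::size_type i = 0; i < from.size(); ++i)
    {
        const unsigned char b = static_cast<unsigned char>(from[i]);
        wchar_t ch = (b < 0x80) ? static_cast<wchar_t>(b) : L'?';
        for (std::size_t j = 0; j < kTableSize; ++j)
            if (kTable[j].cp437 == b)
                ch = kTable[j].unicode;
        to += ch;
    }
    return to;
}

// Step 2: Unicode -> code-page 1252, with the same '?' fallback.
std::string UnicodeToCp1252(const std::wstring& from)
{
    std::string to;
    for (std::wstring::size_type i = 0; i < from.size(); ++i)
    {
        const unsigned long code = static_cast<unsigned long>(from[i]);
        char b = (code < 0x80) ? static_cast<char>(code) : '?';
        for (std::size_t j = 0; j < kTableSize; ++j)
            if (static_cast<unsigned long>(kTable[j].unicode) == code)
                b = static_cast<char>(kTable[j].cp1252);
        to += b;
    }
    return to;
}

// Both steps combined: Unicode is the pivot between the two code-pages.
std::string Cp437ToCp1252(const std::string& from)
{
    return UnicodeToCp1252(Cp437ToUnicode(from));
}
```

Going through Unicode means each code-page only needs one table to and from Unicode, instead of one table for every pair of code-pages.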

Example 1b

As I said before, it should be possible to use CTextFileDocument on platforms other than Windows. If you do this, you must know that code-pages are handled slightly differently. Instead of calling SetCodePage, you should call setlocale to define which code-page to use. The following code does the same thing as the last example, but will work on every platform (I hope ;-)):

//Make file reader. Read the file "ascii-437.txt"
CTextFileRead reader("ascii-437.txt");

//Define which code-page to use when we read the file
//437 is very often used in DOS.
//NOTE: setlocale returns NULL if the locale couldn't be set.
//If it does, you have probably tried to use
//a code-page that your system doesn't support
const char* readLocale = setlocale(LC_ALL, ".437");
cout << (readLocale ? readLocale : "(locale not supported)") << endl;

//Read everything to a Unicode-string
wstring alltext;
reader.Read(alltext);

//Close file
reader.Close();


//Now we create a new ASCII-file
CTextFileWrite writer("ascii-1252.txt", CTextFileBase::ASCII);

//Set which code-page to use.
//1252 is very often used in Windows
const char* writeLocale = setlocale(LC_ALL, ".1252");
cout << (writeLocale ? writeLocale : "(locale not supported)") << endl;

//Do the writing...
writer << alltext;

//Was data lost when the Unicode-string was converted to
//code-page 1252?
if(writer.IsDataLost())
{
  //Do something...
}

//Close the file
writer.Close();

About the code

CTextFileDocument was originally written to use MFC, but now it's more platform-independent. To make this possible, there are some #defines in the code. The most important one is PEK_TX_TECHLEVEL, which defines which features to use. But you need not think about this, the code should automatically define this correctly. The table below explains the differences:

PEK_TX_TECHLEVEL = 0
This is used if you are running on a non-Windows platform. This uses fstream internally to read and write files. If you want to change codepage, you should call setlocale.
PEK_TX_TECHLEVEL = 1
This is used on Windows if you don't use MFC. This calls Windows API directly to read and write files. If something couldn't be read/written, a CTextFileException is thrown. Codepages are supported. Unicode in filenames is supported.
PEK_TX_TECHLEVEL = 2
This is used if you are using MFC. This uses CFile internally to read and write files. If data can't be read/written, CFile will throw an exception. Codepages are supported. Unicode in filenames is supported. CString is supported.

Points of interest

Even if the classes are quite simple, they have been very useful to me. They have all the features I want, so I don't miss anything important. However, it would be nice if they supported more encodings, like UTF-32. Maybe I'll add this in the future. The performance is quite good, but if you know some way to make it faster, let me know :-).

One thing that would probably improve the performance is increasing the value of BUFFSIZE (defined in CTextFileBase). Another thing is making the code in CTextFileRead::GuessCharacterCount better. This should return the number of characters in the file. Currently, this only works if you are using MFC, otherwise it will return 1 MB. GuessCharacterCount is only used when Read is called, so it's not used when ReadLine is called.
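
One possible way to estimate the character count without MFC is to derive it from the file size and the encoding. The sketch below is only an assumption about how such an estimate could look; it is not the code in CTextFileRead::GuessCharacterCount, and SketchEncoding and GuessCharacterCountFromSize are hypothetical names. The result is an upper bound, which is enough for reserving space before reading.

```cpp
#include <cstddef>

// Hypothetical stand-in for the library's encoding enum
enum SketchEncoding { SK_ASCII, SK_UTF_8, SK_UNI16_BE, SK_UNI16_LE };

// Estimate (as an upper bound) how many characters a file contains,
// given its size in bytes and its encoding.
std::size_t GuessCharacterCountFromSize(std::size_t fileSizeInBytes,
                                        SketchEncoding encoding)
{
    switch (encoding)
    {
    case SK_UNI16_BE:
    case SK_UNI16_LE:
        // Two-byte BOM, then two bytes per character
        return fileSizeInBytes >= 2 ? (fileSizeInBytes - 2) / 2 : 0;

    case SK_UTF_8:
        // Three-byte BOM, then at least one byte per character
        return fileSizeInBytes >= 3 ? fileSizeInBytes - 3 : 0;

    default:
        // 8-bit file: at most one character per byte
        return fileSizeInBytes;
    }
}
```

The file size itself still has to come from somewhere, for example by seeking to the end of an fstream, which is left out here.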

How many bytes is a wchar_t? That is compiler dependent, and I think that could give me some problems in the future. In Windows, wchar_t is two bytes, but I think that in Unix, four bytes are used. Currently, this is not a problem, but if I add support for UTF-32 (four bytes for every character), some problems may occur.
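
To illustrate why the size of wchar_t matters, here is a sketch that turns a wstring into 16-bit units regardless of the platform. It is an assumption about how such code could look, not something CTextFileDocument does; ToUtf16Units is a hypothetical helper. With a four-byte wchar_t, a character above 0xFFFF no longer fits in two bytes and has to be split into a UTF-16 surrogate pair.

```cpp
#include <string>
#include <vector>

// Convert a wstring to 16-bit units, whatever sizeof(wchar_t) is.
std::vector<unsigned short> ToUtf16Units(const std::wstring& text)
{
    std::vector<unsigned short> units;
    for (std::wstring::size_type i = 0; i < text.size(); ++i)
    {
        unsigned long code = static_cast<unsigned long>(text[i]);
        if (sizeof(wchar_t) > 2 && code > 0xFFFF)
        {
            // Only possible with a four-byte wchar_t:
            // split the character into a surrogate pair
            code -= 0x10000;
            units.push_back(static_cast<unsigned short>(0xD800 + (code >> 10)));
            units.push_back(static_cast<unsigned short>(0xDC00 + (code & 0x3FF)));
        }
        else
        {
            // The character fits in a single 16-bit unit
            units.push_back(static_cast<unsigned short>(code));
        }
    }
    return units;
}
```

On a platform with a two-byte wchar_t, the first branch is never taken and every element is copied unchanged.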

Why isn't IsOpen() a const function? I think it should be, but that is impossible. The reason for this is that fstream::is_open() is not const (well, it is in my VC6 but not in standard C++). Why it is like this is a mystery to me.

The classes expect the files to start with a "byte order mark" (BOM), which tells what encoding is used. The first two bytes in a "big endian" file are 0xFE and 0xFF; in a "little endian" file, the order is reversed (0xFF and 0xFE). If the encoding is UTF-8, the first three bytes are 0xEF, 0xBB and 0xBF. If no BOM is found, the file is treated as an ASCII file.
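
BOM detection can be sketched in a few lines. The byte values are the ones defined by the Unicode standard: FE FF for 16-bit big-endian, FF FE for 16-bit little-endian, and EF BB BF for UTF-8. DetectedEncoding and DetectEncoding are hypothetical names for this illustration, not the library's own code.

```cpp
#include <cstddef>

// Hypothetical stand-in for the library's encoding enum
enum DetectedEncoding { DET_ASCII, DET_UTF_8, DET_UNI16_BE, DET_UNI16_LE };

// Look at the first bytes of a file and return the encoding.
// 'bomSize' receives the number of bytes to skip before the text starts.
DetectedEncoding DetectEncoding(const unsigned char* data, std::size_t size,
                                std::size_t& bomSize)
{
    if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
    {
        bomSize = 3;
        return DET_UTF_8;
    }
    if (size >= 2 && data[0] == 0xFE && data[1] == 0xFF)
    {
        bomSize = 2;
        return DET_UNI16_BE;   // most significant byte first
    }
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE)
    {
        bomSize = 2;
        return DET_UNI16_LE;   // least significant byte first
    }

    // No BOM found: treat the file as an ordinary 8-bit file
    bomSize = 0;
    return DET_ASCII;
}
```

A caller would read the first three bytes of the file, pass them to DetectEncoding, and then continue reading from offset bomSize.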

You may wonder why I call these classes CTextFileDocument. The simple reason for this is that the name CTextFile was already taken... It was quite annoying to find that out just a couple of minutes before I wanted to upload the article :-).

And finally, thank you to all of you who have commented on the code and found bugs (and created fixes). These classes have been improved a lot thanks to this.

History

21 May, 2005 - Version 1.22.
- Reading a line before reading everything could add an extra line break, fixed.
- A member variable wasn't always initialized, could cause problems when reading single lines, fixed.
- A smarter/easier algorithm is used when reading single lines.
10 April, 2005 - Version 1.21. If it was not possible to open a file in techlevel 1, IsOpen returned a bad result. Fixed.
15 January, 2005 - Version 1.20
- Fix: Fixed some problems when converting multi-byte string to Unicode, and vice versa.
- Improved conversion routines. It's now possible to define which code-page to use.
- It's now possible to set which character to use when it's not possible to convert a Unicode character to a multi-byte character.
- It's now possible to see if data was lost during conversion.
- Better support for other platforms, it's no longer necessary to use MFC in Windows.
- Fix: Reading very small files (1 byte) failed.
26 December, 2004 - Version 1.13
- Fix 1: If the first line in a file is empty, that line is ignored.
- Fix 2: Problems when converting multi-byte characters to wide characters and vice versa.
17 October, 2004 - Version 1.12. A minor memory leak when open file failed, fixed.
28 August, 2004 - Version 1.11. WriteEndl() didn't work correctly when writing ASCII files. Fixed.
13 August, 2004 - Version 1.1. I'm sorry about the quick update. I have rewritten some part of the code, so now it's a lot quicker than the previous version.
12 August, 2004 - Initial version.