Introduction

Once upon a time, a text file was just a simple file. But it is not that easy anymore. New lines can be written in three different ways: Windows/DOS uses characters 13 and 10, Macintosh uses just character 13, and Unix uses character 10. Why it is like that has always puzzled me. Different character sets make reading and writing text files harder, so I'm glad that we have Unicode to use instead. But as you might know, text can be written in Unicode in several ways...

Text encoding

I wrote The encodings the classes can read and write are:
Most of the code I use on this page is for Windows/MFC, but the code should work on other platforms as well. The only major difference is that code-pages are only supported on Windows. On other platforms, you should use

Structure
There are some useful member functions in the base class:

    class CTextFileBase
    {
    public:
        CTextFileBase();
        ~CTextFileBase();

        //Is the file open?
        int IsOpen();

        //Close the file
        void Close();

        //Return the encoding of the file (ASCII, UNI16_BE, UNI16_LE or UTF_8)
        TEXTENCODING GetEncoding() const;

        //Set which character should be used when converting
        //Unicode->multi byte and an unknown character is found ('?' is default)
        void SetUnknownChar(const char unknown);

        //Returns true if data was lost
        //(happens when converting a Unicode->multi byte string and an
        //unmappable character is found).
        bool IsDataLost() const;

        //Reset the data lost flag
        void ResetDataLostFlag();

        //Set the code-page to use when working with non-Unicode strings
        void SetCodePage(const UINT codepage);

        //Get the code-page used when working with non-Unicode strings
        UINT GetCodePage() const;

        //Convert char* to wstring
        static void ConvertCharToWstring(const char* from, wstring &to,
                                         UINT codepage=CP_ACP);

        //Convert wchar_t* to string
        static void ConvertWcharToString(const wchar_t* from, string &to,
                                         UINT codepage=CP_ACP,
                                         bool* datalost=NULL,
                                         char unknownchar=0);
    };

The first five functions are the most important ones, and I hope that what they do is obvious. The rest are needed when working with different code-pages.

Writing files

Writing files is very easy.
The public functions are:

    class CTextFileWrite : public CTextFileBase
    {
    public:
        CTextFileWrite(const FILENAMECHAR* filename, TEXTENCODING type=ASCII);
        CTextFileWrite(CFile* file, TEXTENCODING type=ASCII);

        //Write routines
        void Write(const char* text);
        void Write(const wchar_t* text);
        void Write(const string& text);
        void Write(const wstring& text);

        CTextFileWrite& operator << (const char wc);
        CTextFileWrite& operator << (const char* text);
        CTextFileWrite& operator << (const string& text);

        CTextFileWrite& operator << (const wchar_t wc);
        CTextFileWrite& operator << (const wchar_t* text);
        CTextFileWrite& operator << (const wstring& text);

        //Write new line (two characters, 13 and 10)
        void WriteEndl();
    };

As you see, you use

    //Create file. Use UTF-8 to encode the file
    CTextFileWrite myfile(_T("samplefile.txt"), CTextFileWrite::UTF_8);

    ASSERT(myfile.IsOpen());

    //Write some text
    myfile << "Using 8 bit characters as input";
    myfile.WriteEndl();
    myfile << L"Using 16-bit characters. The following character is alfa: \x03b1";
    myfile.WriteEndl();

    CString temp = _T("Using CString.");
    myfile << temp;

Quite easy, isn't it :-).

Reading files

Reading files isn't much more complicated. The public member functions are:

    class CTextFileRead : public CTextFileBase
    {
    public:
        CTextFileRead(const FILENAMECHAR* filename);
        CTextFileRead(CFile* file);

        //Reading functions. Return false if eof.
        bool ReadLine(string& line);
        bool ReadLine(wstring& line);
        bool ReadLine(CString& line);

        bool Read(string& all, const string newline="\r\n");
        bool Read(wstring& all, const wstring newline=L"\r\n");
        bool Read(CString& all, const CString newline=_T("\r\n"));

        //End of file?
        bool Eof() const;
    };

The

    CTextFileRead myfile(_T("samplefile.txt"));
    ASSERT(myfile.IsOpen());

    CString encoding;

    if(myfile.GetEncoding() == CTextFileRead::ASCII)
        encoding = _T("ASCII");
    else if(myfile.GetEncoding() == CTextFileRead::UNI16_BE)
        encoding = _T("UNI16_BE");
    else if(myfile.GetEncoding() == CTextFileRead::UNI16_LE)
        encoding = _T("UNI16_LE");
    else if(myfile.GetEncoding() == CTextFileRead::UTF_8)
        encoding = _T("UTF_8");

    MessageBox( CString(_T("Text encoding: ")) + encoding );

    while(!myfile.Eof())
    {
        CString line;
        myfile.ReadLine(line);

        MessageBox( line );
    }

If you want to read the whole file, use a

    CTextFileRead myfile(_T("samplefile.txt"));
    ASSERT(myfile.IsOpen());

    CString alltext;
    myfile.Read(alltext);

    MessageBox( alltext );

Document/View

If you are using Document/View, you probably want to save and read your files in the

    void CTextFileDemo2Doc::Serialize(CArchive& ar)
    {
        if(ar.IsStoring())
        {
    #ifndef _UNICODE
            //Save in ASCII if not unicode version
            CTextFileWrite file(ar.GetFile(), CTextFileWrite::ASCII);
    #else
            //Save in UTF-8 in unicode version
            CTextFileWrite file(ar.GetFile(), CTextFileWrite::UTF_8);
    #endif

            CString allText;
            ((CEditView*)m_viewList.GetHead())->GetWindowText(allText);
            file << allText;
        }
        else
        {
            CTextFileRead file(ar.GetFile());

            //Read text
            CString allText;
            file.Read(allText);

            //Data may be lost when the file is read. This happens when the
            //file is using Unicode, but your program doesn't.
            if(file.IsDataLost())
                MessageBox( AfxGetMainWnd()->m_hWnd,
                            _T("Data was lost when the file was read!"),
                            NULL,
                            MB_ICONWARNING|MB_OK);

            //Set text
            BOOL bResult = ::SetWindowText(
                ((CEditView*)m_viewList.GetHead())->GetSafeHwnd(), allText);

            //Make sure that SetWindowText was successful
            if (!bResult ||
                ((CEditView*)m_viewList.GetHead())->GetWindowTextLength() < (int)allText.GetLength())
                AfxThrowMemoryException();
        }
    }

That's it!

Code-pages/Character sets

I hope that most of the code you have seen so far is quite straightforward to use.
It's a little bit more difficult when you want to work with different code-pages (or "character sets", I don't understand the difference). Before Unicode, there was the problem of how to represent characters that were used in some parts of the world (a-z wasn't enough). For example, we who live in Sweden like the character 'å'. The character 'å' can be found in code-page 437. There, it has the code 134. However, 'å' also exists in code-page 1252, but there it has the code 229!

Does it sound complicated? Wait, it's getting worse! In some other countries, such as Korea, more complicated characters are used. There, a table of 256 codes is too small for all characters, so to make it possible to represent all of them, it is necessary to use two bytes for some characters. Code-page 949 has lots of multi-byte characters, like this one: 이 (code: C0CC=U+C774) (don't worry if you can't see the character). That character is represented by two bytes (192 and 204). If you open an ASCII-file that is using this character in Notepad, and you are using code-page 949, you will see the character correctly. But if you do the same thing using code-page 1252 instead, you will see two characters ("ÀÌ").

It is obviously quite hard to handle all the different code-pages, and that's why Unicode was invented. In Unicode, the idea is that only one character set should be used and that every character should have the same size (no more multi-byte solutions are necessary). So Unicode is great, but we still need to deal with files that use different code-pages. If you read an ASCII-file to a Unicode-string (like Remember that the string will not be converted if you read/write an ASCII-file to/from a non-Unicode string. I will show later what to do if you want to convert from one code-page to another.

When you convert a Unicode-string to a multi-byte string, it can happen that some characters cannot be converted.
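The route described above (multi-byte string to Unicode, then Unicode to a multi-byte string in another code-page) can be sketched in portable C++ without the classes. This is only a sketch under stated assumptions: the two lookup tables are tiny hand-written excerpts of code-pages 437 and 1252, and `DecodeCp437`, `EncodeCp1252` and the table names are hypothetical helpers for illustration, not part of `CTextFileBase`. On Windows the real work would normally be done by the system conversion routines instead.

```cpp
#include <map>
#include <string>

// Partial, hand-written tables for illustration only. The real code-pages
// define many more byte values than the handful listed here.
static const std::map<unsigned char, char32_t> kCp437ToUnicode = {
    {0x84, 0x00E4},  // ä
    {0x86, 0x00E5},  // å (code 134 in code-page 437)
    {0x94, 0x00F6},  // ö
    {0xE0, 0x03B1},  // α (has no equivalent in code-page 1252)
};

static const std::map<char32_t, unsigned char> kUnicodeToCp1252 = {
    {0x00E4, 0xE4},  // ä
    {0x00E5, 0xE5},  // å (code 229 in code-page 1252)
    {0x00F6, 0xF6},  // ö
};

// Step 1: multi-byte (code-page 437) -> Unicode code points.
// Bytes below 0x80 are plain ASCII; bytes missing from the excerpt become
// U+FFFD (the Unicode replacement character).
std::u32string DecodeCp437(const std::string& bytes) {
    std::u32string out;
    for (unsigned char b : bytes) {
        if (b < 0x80) {
            out.push_back(b);
            continue;
        }
        auto it = kCp437ToUnicode.find(b);
        out.push_back(it != kCp437ToUnicode.end() ? it->second
                                                  : char32_t(0xFFFD));
    }
    return out;
}

// Step 2: Unicode code points -> multi-byte (code-page 1252).
// Code points that cannot be represented are replaced with unknownChar,
// and dataLost is set to true so the caller can react.
std::string EncodeCp1252(const std::u32string& text, bool& dataLost,
                         char unknownChar = '?') {
    std::string out;
    for (char32_t cp : text) {
        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
            continue;
        }
        auto it = kUnicodeToCp1252.find(cp);
        if (it != kUnicodeToCp1252.end()) {
            out.push_back(static_cast<char>(it->second));
        } else {
            out.push_back(unknownChar);
            dataLost = true;
        }
    }
    return out;
}
```

With these helpers, the byte 134 ('å' in code-page 437) comes out as byte 229 in code-page 1252, while the byte 0xE0 ('α' in code-page 437) cannot be represented in code-page 1252, so it is written as the replacement character and the data-lost flag is raised.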
Characters that cannot be converted are by default replaced with a question mark ('?'), but you can change this by calling

Some Windows-APIs
When you set which code-page to use, you must be sure that the code-page exists. Do this by calling To see which code-page your system is using, call To see all code-pages that your system is using, you could do this:

    //The callback must be declared before it is passed to EnumSystemCodePages
    BOOL CALLBACK EnumCodePagesProc(LPTSTR lpCodePageString)
    {
        cout << "Code-page: " << lpCodePageString << endl;
        return TRUE;
    }

    void ListCodePages()
    {
        EnumSystemCodePages(&EnumCodePagesProc, CP_SUPPORTED);
    }

Example 1

OK, enough talk about code-pages, here is an example. The following code reads an ASCII-file (with code-page 437) to a Unicode-string. Then it creates a new ASCII-file and writes the string with code-page 1252. This is what you should do if you want to convert a string from one code-page to another: convert the multi-byte string to a Unicode-string, and then convert the Unicode-string to a multi-byte string. If you don't want to write the string to a file, you could use

    //Make file reader. Read the file "ascii-437.txt"
    CTextFileRead reader("ascii-437.txt");

    //Define which code-page to use when we read the file
    //437 is very often used in DOS.
    reader.SetCodePage(437);

    //Read everything to a Unicode-string
    wstring alltext;
    reader.Read(alltext);

    //Close file
    reader.Close();

    //Now we create a new ASCII-file
    CTextFileWrite writer("ascii-1252.txt", CTextFileBase::ASCII);

    //Set which code-page to use.
    //1252 is very often used in Windows
    writer.SetCodePage(1252);

    //Do the writing...
    writer << alltext;

    //Was data lost when the Unicode-string was converted to
    //code-page 1252?
    if(writer.IsDataLost())
    {
        //Do something...
    }

    //Close the file
    writer.Close();

Example 1b

As I said before, it should be possible to use

    //Make file reader. Read the file "ascii-437.txt"
    CTextFileRead reader("ascii-437.txt");

    //Define which code-page to use when we read the file
    //437 is very often used in DOS.
    //NOTE: Make sure setlocale doesn't return an empty
    //string. If it does, you have probably tried to use
    //a code-page that your system doesn't support
    cout << setlocale(LC_ALL, ".437") << endl;

    //Read everything to a Unicode-string
    wstring alltext;
    reader.Read(alltext);

    //Close file
    reader.Close();

    //Now we create a new ASCII-file
    CTextFileWrite writer("ascii-1252.txt", CTextFileBase::ASCII);

    //Set which code-page to use.
    //1252 is very often used in Windows
    cout << setlocale(LC_ALL, ".1252") << endl;

    //Do the writing...
    writer << alltext;

    //Was data lost when the Unicode-string was converted to
    //code-page 1252?
    if(writer.IsDataLost())
    {
        //Do something...
    }

    //Close the file
    writer.Close();
Points of interest

Even if the classes are quite simple, they have been very useful to me. They have all the features I want, so I don't miss anything important. However, it would be nice if they supported more encodings, like UTF-32. Maybe I'll add this in the future.

The performance is quite good, but if you know some way to make it faster, let me know :-). One thing that probably should improve the performance is increasing the value of How many bytes are a Why isn't

The classes expect the files to have a "byte order mark" (BOM) in their first bytes. These bytes tell which encoding is used. The first two bytes in a "big endian" file are 0xFE and 0xFF; in a "little endian" file, the order is reversed (0xFF and 0xFE). If the encoding is UTF-8, the first three bytes are 0xEF, 0xBB and 0xBF. If no BOM is found, the file is treated as an ASCII file.

You may wonder why I call these classes And finally, thank you to all of you who have commented and found bugs (and created fixes) in the code. The classes have been improved a lot thanks to this.
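The byte order mark check described under Points of interest can be sketched as follows. This is a minimal sketch, not the code the classes actually use: `DetectEncoding` and the `TextEncoding` enum are hypothetical names that only mirror the encoding names used in this article, and the function looks at nothing but the first bytes of the file.

```cpp
#include <cstddef>
#include <string>

// Hypothetical enum mirroring the encoding names used in this article.
enum class TextEncoding { ASCII, UNI16_BE, UNI16_LE, UTF_8 };

// Inspect the first bytes of a file and report which encoding the BOM
// implies. bomLength receives the number of bytes to skip before the
// actual text starts (0 when no BOM is present).
TextEncoding DetectEncoding(const std::string& head, std::size_t& bomLength) {
    auto byteAt = [&head](std::size_t i) {
        return static_cast<unsigned char>(head[i]);
    };

    // UTF-8: EF BB BF
    if (head.size() >= 3 && byteAt(0) == 0xEF && byteAt(1) == 0xBB &&
        byteAt(2) == 0xBF) {
        bomLength = 3;
        return TextEncoding::UTF_8;
    }
    // UTF-16 big endian: FE FF
    if (head.size() >= 2 && byteAt(0) == 0xFE && byteAt(1) == 0xFF) {
        bomLength = 2;
        return TextEncoding::UNI16_BE;
    }
    // UTF-16 little endian: FF FE
    if (head.size() >= 2 && byteAt(0) == 0xFF && byteAt(1) == 0xFE) {
        bomLength = 2;
        return TextEncoding::UNI16_LE;
    }
    // No BOM found: treat the file as ASCII, as the classes do.
    bomLength = 0;
    return TextEncoding::ASCII;
}
```

A reader would call this on the first few bytes it has read and then skip `bomLength` bytes before decoding the rest. A file without a BOM falls through to ASCII, which matches the behaviour described above but also means that a BOM-less UTF-8 file is not recognized as such.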