
A UTF-16 Class for Reading and Writing Unicode Files

By Jeffrey Walton, Jordan Walters, 15 Jul 2009
 

Introduction

As Unicode becomes more popular, programmers will find themselves performing more file-based operations using Unicode. Currently, familiar MFC classes such as CFile and CStdioFile do not properly handle reading and writing of a Unicode file. The class presented here addresses the need to read and write files as UTF-16 Unicode files.

Using the Code

During construction or with the use of the Open() member function, the class will examine the first two bytes of the file after appropriate size checking. The byte order mark (BOM) U+FEFF appears on disk as the bytes 0xFF, 0xFE in a little-endian file, or as 0xFE, 0xFF in a big-endian file; either sequence indicates the file is UTF-16 encoded. If this is the case, m_bIsUnicode is set to TRUE. If neither sequence is present, the class performs a CStdioFile::Seek( 0, CFile::begin ) to return the consumed bytes.

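// Read the first two bytes of the file as one WCHAR
//   (the file size has already been checked)
CStdioFile::Read( &wcBOM, sizeof( WCHAR ) );

// BOM in the expected byte order - no swapping needed
if( wcBOM == UNICODE_BOM ) {

    m_bIsUnicode   = TRUE;
    m_bByteSwapped = FALSE;
}
// Reversed BOM - each WCHAR must be byte swapped
else if( wcBOM == UNICODE_RBOM ) {

    m_bIsUnicode   = TRUE;
    m_bByteSwapped = TRUE;
}

// Not a BOM mark - treat it as an ANSI file
//   and defer to CStdioFile...
if( FALSE == m_bIsUnicode ) {

    CStdioFile::Seek( 0, CFile::begin );
}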
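
For readers without MFC at hand, the same check can be sketched in standard C++ on a raw byte buffer. This is only an illustration: the names Utf16Order and DetectUtf16Bom are hypothetical and are not part of CUTF16File.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical names for illustration; not part of CUTF16File.
enum class Utf16Order { None, LittleEndian, BigEndian };

// Inspect the first two bytes of a buffer for a UTF-16 byte order mark.
// U+FEFF is stored as FF FE in a little-endian file and as FE FF in a
// big-endian file.
inline Utf16Order DetectUtf16Bom(const std::uint8_t* data, std::size_t size)
{
    if (data == nullptr || size < 2) {
        return Utf16Order::None;          // too short to hold a BOM
    }
    if (data[0] == 0xFF && data[1] == 0xFE) {
        return Utf16Order::LittleEndian;
    }
    if (data[0] == 0xFE && data[1] == 0xFF) {
        return Utf16Order::BigEndian;
    }
    return Utf16Order::None;              // no BOM: treat as an ANSI file
}
```

Comparing individual bytes rather than a 16-bit value keeps the sketch independent of the byte order of the machine it runs on.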

ReadString(...) works as follows: if m_bIsUnicode is FALSE, the class defers to the corresponding CStdioFile::ReadString(...) overload. If the file is UTF-16 encoded, CUTF16File::ReadString( CString& rString ) draws from an internal accumulator until a "\r" or "\n" is encountered. The CUTF16File::ReadString( LPWSTR lpsz, UINT nMax ) overload duplicates the behavior of CStdioFile::ReadString(). See the underlying comment from fgets().
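
The line extraction can be sketched roughly as follows in standard C++. The helper name ReadLineFromAccumulator is hypothetical, the accumulator is modeled as a std::list of char16_t, and treating "\r\n" as a single terminator is an assumption of this sketch rather than documented behavior of the class.

```cpp
#include <cassert>
#include <list>
#include <string>

// Hypothetical helper, not part of CUTF16File.
// Moves characters from the front of the accumulator into `line` until a
// line terminator is found. Returns false only when nothing was available.
inline bool ReadLineFromAccumulator(std::list<char16_t>& accumulator,
                                    std::u16string& line)
{
    line.clear();
    if (accumulator.empty()) {
        return false;                     // end of data
    }
    while (!accumulator.empty()) {
        const char16_t ch = accumulator.front();
        accumulator.pop_front();
        if (ch == u'\n') {
            break;                        // terminator is consumed, not stored
        }
        if (ch == u'\r') {
            // Assumption: "\r\n" counts as one line terminator.
            if (!accumulator.empty() && accumulator.front() == u'\n') {
                accumulator.pop_front();
            }
            break;
        }
        line.push_back(ch);
    }
    return true;
}
```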

The read above is accomplished through an accumulator, an STL list of WCHARs. When filling the accumulator, byte swapping occurs if a big-endian stream (BOM bytes 0xFE, 0xFF on disk) is encountered.
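
A minimal sketch of filling such an accumulator, in standard C++, might look like the following. The names SwapBytes and FillAccumulator are hypothetical, char16_t stands in for WCHAR, and ignoring a trailing odd byte is an assumption of the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <list>

// Hypothetical helper: exchange the two bytes of a UTF-16 code unit.
inline char16_t SwapBytes(char16_t unit)
{
    return static_cast<char16_t>(((unit & 0x00FF) << 8) |
                                 ((unit & 0xFF00) >> 8));
}

// Hypothetical helper: append raw file bytes to the accumulator as UTF-16
// code units. Each pair of bytes is first combined as little-endian (the
// native order on Windows); if `byteSwapped` is true the file is
// big-endian, so every unit is swapped. A trailing odd byte is ignored.
inline void FillAccumulator(std::list<char16_t>& accumulator,
                            const std::uint8_t* bytes, std::size_t count,
                            bool byteSwapped)
{
    for (std::size_t i = 0; i + 1 < count; i += 2) {
        char16_t unit = static_cast<char16_t>(bytes[i] | (bytes[i + 1] << 8));
        if (byteSwapped) {
            unit = SwapBytes(unit);
        }
        accumulator.push_back(unit);
    }
}
```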

Writing to a file is accomplished by extending the normal function to WriteString( LPCTSTR lpsz, BOOL bAsUnicode ). If bAsUnicode is FALSE, CStdioFile handles the ANSI conversion internally, so CUTF16File simply yields to CStdioFile. If bAsUnicode is TRUE, the class writes the BOM (if the file position is 0) and then calls CFile::Write(...).
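
The "BOM only at position 0" rule can be illustrated with a standard C++ stream. This is a sketch under stated assumptions, not the class's implementation: WriteUtf16Le is a hypothetical name, and the output is always little-endian.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical helper, not part of CUTF16File.
// Writes `text` as UTF-16LE. The BOM (FF FE) is emitted only when the
// stream is still at position 0, so appended writes do not repeat it.
inline bool WriteUtf16Le(std::ostream& out, const std::u16string& text)
{
    if (out.tellp() == std::ostream::pos_type(0)) {
        const char bom[2] = { '\xFF', '\xFE' };
        out.write(bom, 2);
    }
    for (char16_t unit : text) {
        const char bytes[2] = {
            static_cast<char>(unit & 0xFF),         // low byte first
            static_cast<char>((unit >> 8) & 0xFF)   // then high byte
        };
        out.write(bytes, 2);
    }
    return static_cast<bool>(out);
}
```

With a std::ostringstream (or a std::ofstream opened with std::ios::binary), the first call produces the BOM followed by the text, and later calls append text only.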

The driver program opens two files on the hard drive, writes out both Unicode and ANSI text files, then reads the files back in. It then uses OutputDebugString(...) to write messages to the debugger's output window.

CUTF16File output1( L"unicode_write.txt",
                    CFile::modeWrite | CFile::modeCreate );
output1.WriteString( L"Hello World from Unicode land!", TRUE );
output1.Close();

...

CString szInput;
CUTF16File input1( L"unicode_write.txt", CFile::modeRead );
input1.ReadString( szInput );

Figure 1 is the result of writing a test file with the provided driver program. Notice that the BOM bytes are swapped on the disk: the value 0xFEFF is stored low byte first, as 0xFF, 0xFE.

Figure 1: Result of test program.

Figure 2 examines a similar file created with Notepad on Windows 2000 while saving the file as Unicode.

Figure 2: A Unicode sample created in Notepad.
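
To inspect the leading bytes of such a file without a hex editor, a small formatter like the one below can be used. It is a sketch only; HexDump is a hypothetical helper and is not part of the article's download.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: format raw bytes as space-separated uppercase hex.
inline std::string HexDump(const std::string& bytes)
{
    std::string out;
    char buf[4];
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        // Go through unsigned char so bytes >= 0x80 are not sign-extended.
        std::snprintf(buf, sizeof buf, "%02X",
                      static_cast<unsigned>(static_cast<unsigned char>(bytes[i])));
        if (i != 0) {
            out += ' ';
        }
        out += buf;
    }
    return out;
}
```

For a little-endian UTF-16 file that starts with the letter H, the first four bytes format as FF FE 48 00.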

Additional Reading

  • http://www.unicode.org/
  • International Programming for Microsoft Windows by D. Schmitt, ISBN 1-57231-956-9
  • Programming Windows with MFC by J. Prosise, ISBN 1-57231-695-0
  • Programming Server-Side Applications for Microsoft Windows 2000 by J. Richter and J. Clark, ISBN 0-73560-753-2

Revisions

  • 10 Feb 2005 Original release
  • 23 Dec 2006 Added Jordan Walters' improvements and bug fixes
  • 23 Dec 2006 Added Jordan Walters as an author
  • 17 Sep 2008 Fixed long-standing bug in 2nd constructor
  • 13 Jul 2009 Corrected handling of Unicode characters. If the UNICODE/_UNICODE project settings are specified, writing ANSI still produces a Unicode output file.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL).

About the Authors

Jeffrey Walton
Systems / Hardware Administrator
United States

Jordan Walters
Software Developer (Senior)
United Kingdom
OK, it's about time I updated this profile. I still live near 'Beastly' Eastleigh in Hampshire, England. However, I have recently been granted a permanent migration visa to Australia, so if you're a potential employer from down under and like the look of me, please get in touch.
Still married - just, still with just a son and daughter. But they are now 8 and 7 respectively, and when together they have the energy of a nuclear bomb.
I worked at Teleca UK for over 8.5 years (but have now moved to TikitTFB) and have done loads of different things. Heavily involved with MFC, SQL and C#. The latest is ASP.NET with C# and JavaScript. Moving away from Trolltech Qt3 and 4.
Jordan.



Last Updated 15 Jul 2009
Article Copyright 2005 by Jeffrey Walton, Jordan Walters