A UTF-16 Class for Reading and Writing Unicode Files

15 Jul 2009 · CPOL
A UTF-16 class derived from CStdioFile for reading and writing Unicode files

Introduction

As Unicode becomes more popular, programmers will find themselves performing more file-based operations on Unicode text. Familiar MFC classes such as CFile and CStdioFile do not properly handle reading and writing of a Unicode file. The class presented here addresses the need to read and write files as UTF-16 Unicode files.

Using the Code

During construction, or when Open() is called, the class checks that the file is at least two bytes long and then examines its first two bytes. If they form a byte order mark (BOM), the file is UTF-16 encoded and m_bIsUnicode is set to TRUE. On disk, a little-endian BOM is the byte sequence 0xFF, 0xFE, which an Intel CPU reads as the WCHAR value 0xFEFF. The reversed sequence 0xFE, 0xFF marks a big-endian file and additionally sets m_bByteSwapped to TRUE. If no BOM is present, the class calls CStdioFile::Seek(0, CFile::begin) to return the consumed bytes and treats the file as ANSI.

C++
CStdioFile::Read( &wcBOM, sizeof( WCHAR ) );

if( wcBOM == UNICODE_BOM ) {

    m_bIsUnicode   = TRUE;
    m_bByteSwapped = FALSE;
}

if( wcBOM == UNICODE_RBOM ) {

    m_bIsUnicode   = TRUE;
    m_bByteSwapped = TRUE;
}

// Not a BOM mark - treat it as an ANSI file
//   and defer to CStdioFile...
if( FALSE == m_bIsUnicode ) {

    CStdioFile::Seek( 0, CFile::begin );
}
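
The same check can be expressed on raw bytes, independent of MFC and of the host's byte order. The following is only a minimal sketch: `Utf16Bom` and `DetectUtf16Bom` are hypothetical names introduced for illustration and are not part of CUTF16File.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical names for illustration; not part of CUTF16File.
enum class Utf16Bom { None, LittleEndian, BigEndian };

// Inspect the first two bytes of a buffer exactly as they appear on disk.
inline Utf16Bom DetectUtf16Bom(const std::uint8_t* data, std::size_t size)
{
    // A BOM is two bytes, so anything shorter cannot carry one.
    if (data == nullptr || size < 2) { return Utf16Bom::None; }

    if (data[0] == 0xFF && data[1] == 0xFE) { return Utf16Bom::LittleEndian; }
    if (data[0] == 0xFE && data[1] == 0xFF) { return Utf16Bom::BigEndian; }

    // No BOM: the caller would rewind and treat the file as ANSI.
    return Utf16Bom::None;
}
```

A caller would map `LittleEndian` to m_bIsUnicode = TRUE, `BigEndian` to m_bIsUnicode = TRUE with m_bByteSwapped = TRUE, and `None` to the Seek-back path.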

ReadString(...) works as follows. If m_bIsUnicode is FALSE, the class simply returns the result of the corresponding CStdioFile::ReadString(...) call. If the file is UTF-16 encoded, CUTF16File::ReadString(CString& rString) draws characters from an internal accumulator until a "\r" or "\n" is encountered; the line ending is not stored in rString. The CUTF16File::ReadString( LPWSTR lpsz, UINT nMax ) overload duplicates the behavior of CStdioFile::ReadString(), which in turn follows fgets(): the line ending is kept and reading stops after nMax - 1 characters. See the fgets() comment in the source.
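
The line-splitting step of the CString overload can be sketched without MFC. This is an illustration under stated assumptions, not the class's actual code: `PopLine` is a hypothetical helper, `char16_t` stands in for WCHAR, and the sketch deliberately treats only a mixed "\r\n" or "\n\r" pair as a single line ending.

```cpp
#include <cassert>
#include <list>
#include <string>

// Hypothetical helper: pop one line from a list of UTF-16 code units.
// Returns false only when the accumulator was already empty on entry.
inline bool PopLine(std::list<char16_t>& acc, std::u16string& line)
{
    line.clear();
    if (acc.empty()) { return false; }

    while (!acc.empty())
    {
        const char16_t c = acc.front();
        acc.pop_front();

        if (c == u'\r' || c == u'\n')
        {
            // Swallow the other half of a "\r\n" or "\n\r" pair, if present.
            // The character is peeked first so that the start of the next
            // line is never consumed by mistake.
            if (!acc.empty() && acc.front() != c &&
                (acc.front() == u'\r' || acc.front() == u'\n'))
            {
                acc.pop_front();
            }
            break;
        }

        line.push_back(c);   // the line ending itself is not stored
    }
    return true;
}
```

Peeking at the front of the list before popping is the design point here: a lone "\n" line ending then leaves the first character of the following line untouched.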

The read described above is served from an accumulator, which is an STL list of WCHARs. When the accumulator is filled, each WCHAR is byte swapped if the stream is big-endian (BOM bytes 0xFE, 0xFF on disk).
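
Filling the accumulator can be sketched as follows. `FillAccumulator` is a hypothetical helper, and the `bigEndian` flag is assumed to play the role of m_bByteSwapped; the sketch assembles each code unit from individual bytes, so it does not depend on the byte order of the machine it runs on.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <list>

// Hypothetical helper: decode raw file bytes into 16-bit code units and
// append them to the accumulator. A trailing odd byte is ignored.
inline void FillAccumulator(std::list<char16_t>& acc,
                            const std::uint8_t* bytes, std::size_t size,
                            bool bigEndian)
{
    for (std::size_t i = 0; i + 1 < size; i += 2)
    {
        const unsigned first  = bytes[i];
        const unsigned second = bytes[i + 1];

        // Big-endian stores the high byte first, little-endian the low byte.
        const unsigned unit = bigEndian ? ((first << 8) | second)
                                        : ((second << 8) | first);

        acc.push_back(static_cast<char16_t>(unit));
    }
}
```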

Writing is handled by an extended WriteString(LPCTSTR lpsz, BOOL bAsUnicode ). If bAsUnicode is FALSE, CUTF16File simply yields to CStdioFile, which handles the ANSI conversion internally. If bAsUnicode is TRUE, the class first writes the BOM (if the file position is 0) and then calls CFile::Write(...).
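
The Unicode write path can be illustrated with an in-memory buffer standing in for the file. This is a sketch under assumptions: `AppendUtf16Le` is a hypothetical helper, an empty vector stands in for "file position is 0", and the output is always little-endian.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: append text as UTF-16LE to an in-memory "file",
// writing the BOM first when nothing has been written yet.
inline void AppendUtf16Le(std::vector<std::uint8_t>& out,
                          const std::u16string& text)
{
    if (out.empty())
    {
        // U+FEFF stored low byte first, as it appears in a hex editor.
        out.push_back(0xFF);
        out.push_back(0xFE);
    }

    for (char16_t unit : text)
    {
        out.push_back(static_cast<std::uint8_t>(unit & 0xFF));  // low byte
        out.push_back(static_cast<std::uint8_t>(unit >> 8));    // high byte
    }
}
```

Calling the helper twice on the same buffer writes the BOM only once, which mirrors the position check described above.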

The demo driver program creates two files on the hard drive, one Unicode and one ANSI text file, and then reads both back in. It uses OutputDebugString(...) to write messages to the debugger's output window.

C++
CUTF16File output1( L"unicode_write.txt", CFile::modeWrite |
CFile::modeCreate );
output1.WriteString( L"Hello World from Unicode land!", TRUE );
output1.Close();

...

CString szInput;
CUTF16File input1( L"unicode_write.txt", CFile::modeRead );
input1.ReadString( szInput );

Figure 1 shows the file written by the provided driver program. Notice that the BOM bytes appear swapped on disk: the value 0xFEFF is stored low byte first, as 0xFF, 0xFE.

Figure 1: Result of test program.

Figure 2 examines a similar file created with Notepad on Windows 2000 while saving the file as Unicode.

Figure 2: A Unicode sample created in Notepad.

Additional Reading

  • http://www.unicode.org/
  • International Programming for Microsoft Windows by D. Schmitt, ISBN 1-57231-956-9
  • Programming Windows with MFC by J. Prosise, ISBN 1-57231-695-0
  • Programming Server-Side Applications for Microsoft Windows 2000 by J. Richter and J. Clark, ISBN 0-73560-753-2

Revisions

  • 10 Feb 2005 Original release
  • 23 Dec 2006 Added Jordan Walters' improvements and bug fixes
  • 23 Dec 2006 Added Jordan Walters as an author
  • 17 Sep 2008 Fixed long-standing bug in 2nd constructor
  • 13 Jul 2009 Correct handling of Unicode characters. If UNICODE/_UNICODE project settings specified, writing ANSI still produces a Unicode output file.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Systems / Hardware Administrator
United States

Written By
Software Developer (Senior)
United Kingdom
Ok, it's about time I updated this profile. I still live near 'Beastly' Eastleigh in Hampshire, England. However I have recently been granted a permanent migration visa to Australia - so if you're a potential employer from down under and like the look of me, please get in touch.
Still married - just, still with just a son and daughter. But they are now 8 and 7 resp and when together they have the energy of a nuclear bomb.
I worked at Teleca UK for over 8.5 years (but have now moved to TikitTFB) and have done loads of different things. Heavily involved with MFC, SQL, C#, The latest is ASP.NET with C# and Javascript. Moving away from Trolltech Qt3 and 4.
Jordan.

Comments and Discussions

 
General: My vote of 5
Richchu15-Nov-10 20:40
General: Performance Issue with ReadUnicodeString(CString&)
Chris Meech30-Jul-09 6:25
General: Re: Performance Issue with ReadUnicodeString(CString&)
Chris Meech30-Jul-09 8:25
General: Class fails to read russian symbol 'K'
sunrizer10425-Mar-09 2:55
General: Re: Class fails to read russian symbol 'K'
sunrizer10425-Mar-09 3:21
General: Re: Class fails to read russian symbol 'K'
Jeffrey Walton25-Mar-09 3:51
General: Re: Class fails to read russian symbol 'K' [modified]
sunrizer10425-Mar-09 4:04
General: Re: Class fails to read russian symbol 'K'
sunrizer10425-Mar-09 6:01
General: Re: Class fails to read russian symbol 'K'
Jordan Walters25-Mar-09 11:20
Answer: Re: Class fails to read russian symbol 'K'
SkyKnight12-Jul-09 17:53
Answer: Re: Class fails to read russian symbol 'K'
SkyKnight13-Jul-09 6:37
Answer: Re: Class fails to read russian symbol 'K'
Jordan Walters13-Jul-09 12:58
General: Fix the bug about cann't call GetLength() after call CUTF16File(LPCTSTR lpszFileName, UINT nOpenFlags) or Open(...)
beniii0114-Aug-08 21:19
General: Re: Fix the bug about cann't call GetLength() after call CUTF16File(LPCTSTR lpszFileName, UINT nOpenFlags) or Open(...)
Jordan Walters17-Sep-08 11:38
General: Fix a bug in Revised viesion
beniii0114-Aug-08 19:37
General: Re: Fix a bug in Revised viesion
Jordan Walters17-Sep-08 11:39
Question: it desnt work well in vc6.0?
beniii0114-Aug-08 16:46
General: there is a problem! using this member function BOOL CUTF16File::ReadString( CString& rString ) can`t read character "会"
cdpc22-Jan-08 21:31
Question: Re: there is a problem! using this member function BOOL CUTF16File::ReadString( CString& rString ) can`t read character "会"
Jordan Walters25-Apr-08 6:53
General: I have a problem
cdpc22-Jan-08 21:26
General: Text getting truncated? [modified]
jimwillsher26-Jun-06 7:24
General: Re: Text getting truncated?
Jordan Walters31-Aug-06 11:40
General: New Version for non-Unicode builds [modified]
Jordan Walters14-Dec-05 9:45
Hello everybody.
A few months ago I said that I'd made some changes that allowed the code to work with non-Unicode builds. I made a couple of other mods I believe to get it fully working.
I invited the author to contact me so that he could post the new version up - and he did. But he apparently had problems with CodeProject themselves and it never got done.
Every now and then I get emails from people asking for my new version. I don't mind this but of course there is a delay in my replies. If you're like me, and you look for something you want it now, not after the time it takes to write a request and get an email reply.
So I'm pasting the contents of the header and implementation files in this message so you can just go ahead and copy it straight off.......

1. First UTF16File.h

// UTF16File.h: interface for the CUTF16File class.
//
// Version 5.0, 2 February 2004.
//
// Jeffrey Walton
//
//	Modified by Jordan Walters 27/04/2005 to work with non-Unicode
//	builds as well.
//
//////////////////////////////////////////////////////////////////////

#if !defined(AFX_UTF16File_H__32BEF8AC_25E0_482F_8B00_C40775BCDB81__INCLUDED_)
#define AFX_UTF16File_H__32BEF8AC_25E0_482F_8B00_C40775BCDB81__INCLUDED_

#if _MSC_VER > 1000
#pragma once
#endif // _MSC_VER > 1000

#pragma warning(push, 3)
#include <list>
#pragma warning(pop)


//
// Under a hex editor, file[0] = 0xFF
//                     file[1] = 0xFE
//
// for a proper UTF-16 BOM
//
// This is different than the in-memory
//   representation of: mem[0] = 0xFE
//                      mem[1] = 0xFF
//
// on an Intel CPU
//
const unsigned char UNICODE_BOM[2]				= {0xFF, 0xFE};
const unsigned char UNICODE_RBOM[2]				= {0xFE, 0xFF};

const INT ACCUMULATOR_CHAR_COUNT				= 2048;

class CUTF16File: public CStdioFile
{
public:
	
	CUTF16File();
	CUTF16File(LPCTSTR lpszFileName, UINT nOpenFlags);

	virtual BOOL	Open(LPCTSTR lpszFileName, UINT nOpenFlags, CFileException* pError = NULL);
	virtual BOOL	ReadString(CString& rString);
    virtual LPTSTR  ReadString(LPTSTR lpsz, UINT nMax);
	virtual VOID	WriteString(LPCTSTR lpsz, BOOL bAsUnicode = FALSE);

    virtual LONG    Seek(LONG lOff, UINT nFrom);

    BOOL            IsUnicodeFile() { return m_bIsUnicode; }

protected:

	BOOL            ReadUnicodeString(CString& szString);
    LPTSTR          ReadUnicodeString(LPTSTR lpsz, UINT nMax);

    virtual VOID    WriteANSIString(LPCSTR lpsz);
    virtual VOID    WriteUnicodeString(LPCWSTR lpsz);

	BOOL m_bIsUnicode;
    BOOL m_bByteSwapped;

private:

	BOOL LoadAccumulator();

    std::list<WCHAR> m_Accumulator;
	DWORD	m_dwCurrentFilePointer;
};

#endif // !defined(AFX_UTF16File_H__32BEF8AC_25E0_482F_8B00_C40775BCDB81__INCLUDED_)



2. Second UTF16File.cpp

// UTF16File.cpp: implementation of the CUTF16File class.
//
// Version 5.0, 2 February 2004.
//
// Jeffrey Walton
//
//	Modified by Jordan Walters 27/04/2005 to work with non-Unicode
//	builds as well.
//
//////////////////////////////////////////////////////////////////////

#include "stdafx.h"
#include "UTF16File.h"
#include <atlconv.h>

#ifdef _DEBUG
#undef THIS_FILE
static char THIS_FILE[]=__FILE__;
#define new DEBUG_NEW
#endif

//////////////////////////////////////////////////////////////////////
// Construction/Destruction
//////////////////////////////////////////////////////////////////////

CUTF16File::CUTF16File(): CStdioFile(),
	m_bIsUnicode(FALSE),
	m_bByteSwapped(FALSE),
	m_dwCurrentFilePointer(0)
{
}

CUTF16File::CUTF16File(LPCTSTR lpszFileName, UINT nOpenFlags) :
	CStdioFile(lpszFileName, nOpenFlags), 
	m_bIsUnicode(FALSE),
	m_bByteSwapped(FALSE),
	m_dwCurrentFilePointer(0)
{
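	// Must be unsigned: a plain (signed) char holding 0xFF or 0xFE would
	// never compare equal to the unsigned BOM constants.
	unsigned char uchBOM[2] = {0};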

	// We only need the BOM check if reading.
	if(CFile::modeWrite == (nOpenFlags & CFile::modeWrite)) { return; }

	// BOM is two bytes
	if(CFile::GetLength() < 2) { return; }

	m_dwCurrentFilePointer += CStdioFile::Read(reinterpret_cast<LPVOID>(uchBOM), 2);

	if(uchBOM[0] == UNICODE_BOM[0] &&  uchBOM[1] == UNICODE_BOM[1])
	{
		m_bIsUnicode   = TRUE;
		m_bByteSwapped = FALSE;
	} 

	if(uchBOM[0] == UNICODE_RBOM[0] &&  uchBOM[1] == UNICODE_RBOM[1])
	{
		m_bIsUnicode   = TRUE;
		m_bByteSwapped = TRUE;
	}

	// Not a BOM mark - its an ANSI file
	//   so punt to CStdioFile...
	if(FALSE == m_bIsUnicode)
	{
		m_dwCurrentFilePointer = 0;
		CStdioFile::Seek(0, CFile::begin);
	}

	m_Accumulator.clear();
}

BOOL CUTF16File::Open(LPCTSTR lpszFileName, UINT nOpenFlags, CFileException* pError /*=NULL*/)
{
	BOOL bResult = FALSE;

	unsigned char uchBOM[3] = {0};

	bResult = CStdioFile::Open(lpszFileName, nOpenFlags, pError);

	// We only need the BOM check if reading.
	if(CFile::modeWrite == (nOpenFlags & CFile::modeWrite)) { return bResult; }

	// BOM is two bytes
	if(CFile::GetLength() < 2) { return bResult; }

	if(TRUE == bResult)
	{
		m_dwCurrentFilePointer += CStdioFile::Read(reinterpret_cast<LPVOID>(uchBOM), 2);

		if(uchBOM[0] == UNICODE_BOM[0] &&  uchBOM[1] == UNICODE_BOM[1])
		{
			m_bIsUnicode   = TRUE;
			m_bByteSwapped = FALSE;
		} 

		if(uchBOM[0] == UNICODE_RBOM[0] &&  uchBOM[1] == UNICODE_RBOM[1])
		{
			m_bIsUnicode   = TRUE;
			m_bByteSwapped = TRUE;
		}

		// Not a BOM mark - its an ANSI file
		//   so punt to CStdioFile...
		if(FALSE == m_bIsUnicode)
		{
			m_dwCurrentFilePointer = 0;
			CStdioFile::Seek( 0, CFile::begin );
		}
	}

	m_Accumulator.clear();

	return bResult;
}

BOOL CUTF16File::ReadString( CString& rString )
{
	if(TRUE == m_bIsUnicode)
	{
		return ReadUnicodeString(rString);
	}

	return CStdioFile::ReadString(rString);
}

LPTSTR CUTF16File::ReadString(LPTSTR lpsz, UINT nMax)
{
	if(TRUE == m_bIsUnicode)
	{
		return ReadUnicodeString(lpsz, nMax);
	}

	return CStdioFile::ReadString(lpsz, nMax);
}

BOOL CUTF16File::ReadUnicodeString(CString& rString)
{
	USES_CONVERSION;

	BOOL bRead = FALSE;

	WCHAR c[2] = {0};

	rString.Empty();

	LoadAccumulator();

	while(FALSE == m_Accumulator.empty())
	{
		bRead = TRUE;

		c[0] = m_Accumulator.front();

		m_Accumulator.pop_front();

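		if(L'\r' == c[0] || L'\n' == c[0])
		{
			// Count the line ending that was just read.
			m_dwCurrentFilePointer += 2;

			// Consume the second half of a "\r\n" (or "\n\r") pair only if it
			// is really there, so that an empty accumulator is never popped and
			// the first character of the next line is never lost.
			if(TRUE == m_Accumulator.empty()) { LoadAccumulator(); }

			if(FALSE == m_Accumulator.empty() &&
			   m_Accumulator.front() != c[0] &&
			   (L'\r' == m_Accumulator.front() || L'\n' == m_Accumulator.front()))
			{
				m_Accumulator.pop_front();
				m_dwCurrentFilePointer += 2;
				Seek(m_dwCurrentFilePointer, CFile::begin);
			}
			break;
		}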

		m_dwCurrentFilePointer += 2;
		rString += W2T(c);

		if(TRUE == m_Accumulator.empty())
		{
			LoadAccumulator();
		}
	}

	return bRead;
}

/***
*char *fgets(string, count, stream) - input string from a stream
*
*Purpose:
*       get a string, up to count-1 chars or '\n', whichever comes first,
*       append '\0' and put the whole thing into string. the '\n' IS included
*       in the string. if count<=1 no input is requested. if EOF is found
*       immediately, return NULL. if EOF found after chars read, let EOF
*       finish the string as '\n' would.
*
***/

LPTSTR CUTF16File::ReadUnicodeString( LPTSTR lpsz, UINT nMax )
{
	USES_CONVERSION;

	BOOL bRead = FALSE;

	LPTSTR p = lpsz;
	WCHAR c[2] = {0};

	ASSERT(lpsz != NULL);
	ASSERT(AfxIsValidAddress(lpsz, nMax));
	ASSERT(m_pStream != NULL);

	if(nMax <= 1) { return lpsz; }

	LoadAccumulator();

	while(FALSE == m_Accumulator.empty() && --nMax)
	{
		bRead = TRUE;        

		c[0] = m_Accumulator.front();
		m_dwCurrentFilePointer += 2;
		*p++ = *(W2T(c));

		m_Accumulator.pop_front();

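		if(L'\r' == c[0] || L'\n' == c[0])
		{
			// The line ending itself was already counted above. Consume the
			// second half of a "\r\n" (or "\n\r") pair only if it is really
			// there, so that an empty accumulator is never popped and the
			// first character of the next line is never lost.
			if(TRUE == m_Accumulator.empty()) { LoadAccumulator(); }

			if(FALSE == m_Accumulator.empty() &&
			   m_Accumulator.front() != c[0] &&
			   (L'\r' == m_Accumulator.front() || L'\n' == m_Accumulator.front()))
			{
				m_Accumulator.pop_front();
				m_dwCurrentFilePointer += 2;
				Seek(m_dwCurrentFilePointer, CFile::begin);
			}
			break;
		}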

		if(TRUE == m_Accumulator.empty())
		{
			LoadAccumulator();
		}
	}

	*p = L'\0';

	return TRUE == bRead ? lpsz : NULL;
}

VOID CUTF16File::WriteString( LPCTSTR lpsz, BOOL bAsUnicode /*= FALSE */ )
{
	USES_CONVERSION;
	
	if(TRUE == bAsUnicode)
	{
		WriteUnicodeString(T2W(lpsz));
	}
	else
	{
		WriteANSIString(lpsz);
	}
}

BOOL CUTF16File::LoadAccumulator()
{
	BYTE cbBuffer[ACCUMULATOR_CHAR_COUNT * sizeof(WCHAR)];

	UINT uCount = CStdioFile::Read(cbBuffer, ACCUMULATOR_CHAR_COUNT * sizeof(WCHAR));

	WCHAR* pwszBuffer = reinterpret_cast<WCHAR*>(cbBuffer);

	for(UINT i = 0; i < uCount / 2; i++)
	{
		WCHAR c = *pwszBuffer++;

		if(TRUE == m_bByteSwapped)
		{
			BYTE b1 = BYTE(c >> 8);   // high order
			BYTE b2 = BYTE(c & 0xFF); // low order

			c = WCHAR(b1 | (b2 << 8));
		}

		m_Accumulator.push_back(c);
	}

	return 0 == uCount;
}

LONG CUTF16File::Seek(LONG lOff, UINT nFrom)
{
	LONG lResult = CStdioFile::Seek(lOff, nFrom);
	m_dwCurrentFilePointer = CStdioFile::Seek(0, CFile::current);

	m_Accumulator.clear();

//	LoadAccumulator();

	// Should there be a test here to set fp = 2 if Unicode,
	//  and the user asks for CFile::begin???

	return lResult;
}

VOID CUTF16File::WriteANSIString( LPCSTR lpsz )
{
	CStdioFile::WriteString(lpsz);
}

VOID CUTF16File::WriteUnicodeString(LPCWSTR lpsz)
{
	if(0 == CFile::GetPosition())
	{
		CFile::Write(UNICODE_BOM, sizeof(UNICODE_BOM));
	}

	CFile::Write(lpsz, wcslen(lpsz) * sizeof(WCHAR));
}


/////////////////////////////////////////////////////////////////////////////

That's it. Hope you find this useful.

Jordan

Ashes to ashes, DOS to DOS!

-- modified at 4:55 Sunday 28th May, 2006
Answer: Re: New Version for non-Unicode builds
robosport27-May-06 21:17
General: Re: New Version for non-Unicode builds
Jeffrey Walton23-Dec-06 9:05
