CStdioFile-derived class for multibyte and Unicode reading and writing

By David Pritchard, 19 Jul 2007
 

[Figure: demo app screenshot]

Introduction

This is a class derived from CStdioFile which transparently handles the reading and writing of Unicode text files as well as ordinary multibyte text files.

The code compiles as both multibyte and Unicode. In Unicode, multibyte files will be read and their content converted to Unicode using the current code page. In multibyte compilations, Unicode files will be read and converted to multibyte text.

A file is identified as Unicode only if it begins with the Unicode byte order mark (0xFEFF). A file without the mark may still be Unicode, but checking for the mark is the only method I use here. Feel free to suggest improvements.
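As a rough illustration of what a byte order mark check involves, here is a minimal sketch in portable C++. The names BomKind and DetectBom are hypothetical and are not part of CStdioFileEx; the sketch also recognises big-endian and UTF-8 marks, which the class itself may or may not do.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical names for illustration; they are not part of CStdioFileEx.
enum class BomKind { None, Utf16LittleEndian, Utf16BigEndian, Utf8 };

// Inspect the first bytes of a file's content and report which byte order
// mark, if any, they form. U+FEFF is stored as FF FE in little-endian
// UTF-16, FE FF in big-endian UTF-16, and EF BB BF in UTF-8.
BomKind DetectBom(const unsigned char* data, std::size_t size)
{
    assert(data != nullptr || size == 0);

    if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
        return BomKind::Utf8;
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        return BomKind::Utf16LittleEndian;
    if (size >= 2 && data[0] == 0xFE && data[1] == 0xFF)
        return BomKind::Utf16BigEndian;
    return BomKind::None;
}
```

A result of BomKind::None only means that no mark was found; as noted above, the file could still be Unicode.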

By default, the class writes multibyte files, but can optionally write Unicode.

Background

The ability to transparently handle both multibyte and Unicode seems such a fundamental requirement that I was sure something similar would already be on offer, and yet nothing turned up. Did I miss something?

I needed it for a translation tool I wrote, and knocked together an implementation that was good enough for my needs. This is little more than a cleaned-up version of that, so expect bugs and all manner of deficiencies. I have, though, tested the demo app with the basic combinations (Unicode files in a multibyte compilation, Unicode-Unicode, Multibyte-Unicode, and Multibyte-Multibyte), and they all seem to work.

Using the code

The use of the class is pretty simple. It overrides three functions of CStdioFile: Open(), ReadString() and WriteString(). To write a Unicode file, add the flag CStdioFileEx::modeWriteUnicode to the flags when calling the Open() function.

In other respects, usage is identical to CStdioFile.

To find out if a file you have opened is Unicode, you can call IsFileUnicodeText().

To get the number of characters in the file, you can call GetCharCount(). This is unreliable for multibyte/UTF-8, however.
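To see why a character count is hard to get right for UTF-8, note that one character can occupy anywhere from one to four bytes, so a count based on byte length alone will overshoot. The sketch below is an illustration only: CountUtf8CodePoints is a hypothetical helper, not part of CStdioFileEx, and it assumes well-formed UTF-8 input.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper, not part of CStdioFileEx. Counts Unicode code points
// in a UTF-8 byte string by counting every byte that is not a continuation
// byte (continuation bytes have the bit pattern 10xxxxxx).
// Assumes the input is well-formed UTF-8.
std::size_t CountUtf8CodePoints(const std::string& bytes)
{
    std::size_t count = 0;
    for (unsigned char b : bytes)
    {
        if ((b & 0xC0) != 0x80)
            ++count;
    }
    assert(count <= bytes.size()); // never more code points than bytes
    return count;
}
```

For example, the string "caf\xC3\xA9" is five bytes long but holds four code points.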

An example of writing in Unicode:

// Test writing
CStdioFileEx fileWriteUnicode;

if (fileWriteUnicode.Open(_T("c:\\testwrite_unicode.txt"), 
    CFile::modeCreate | CFile::modeWrite | CStdioFileEx::modeWriteUnicode))
{
    fileWriteUnicode.WriteString(_T("Unicode test file\n"));
    fileWriteUnicode.WriteString(_T("Writing data\n"));
    fileWriteUnicode.Close();
}

You can now also specify the code page for multibyte file reading or writing. Simply call SetCodePage() before a read to tell CStdioFileEx which code page the file is encoded in, or before a write to tell it which code page you want the file written in. Specifying CP_UTF8 as the code page allows you to read or write UTF-8 files.
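On Windows, converting UTF-8 bytes to wide characters is normally left to MultiByteToWideChar() with CP_UTF8. To show roughly what such a conversion does, here is a portable sketch that decodes UTF-8 into UTF-16 code units by hand. Utf8ToUtf16 is a hypothetical helper and not the class's own code; it replaces malformed or truncated sequences with U+FFFD and, for brevity, does not reject overlong forms or encoded surrogates.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper, not part of CStdioFileEx: decodes UTF-8 bytes into
// UTF-16 code units. Malformed or truncated sequences become U+FFFD.
// Simplification: overlong forms and encoded surrogates are not rejected.
std::u16string Utf8ToUtf16(const std::string& in)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size())
    {
        const unsigned char lead = static_cast<unsigned char>(in[i++]);
        char32_t cp = 0;
        int extra = 0; // number of continuation bytes expected

        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else { out.push_back(0xFFFD); continue; } // invalid lead byte

        bool ok = true;
        for (int k = 0; k < extra; ++k)
        {
            if (i >= in.size() ||
                (static_cast<unsigned char>(in[i]) & 0xC0) != 0x80)
            {
                ok = false; // sequence is truncated or interrupted
                break;
            }
            cp = (cp << 6) | (static_cast<unsigned char>(in[i++]) & 0x3F);
        }
        if (!ok || cp > 0x10FFFF) { out.push_back(0xFFFD); continue; }

        if (cp < 0x10000)
        {
            out.push_back(static_cast<char16_t>(cp));
        }
        else
        {
            // Code points above the Basic Multilingual Plane need a
            // surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    // UTF-16 never needs more code units than the UTF-8 input has bytes.
    assert(out.size() <= in.size());
    return out;
}
```

The final assertion reflects a useful sizing rule: an output buffer with as many UTF-16 code units as there are input bytes is always large enough for this direction of conversion.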

The demo app is a dialog which opens a file, tells you whether it's Unicode or not and how many characters it contains, and shows the first fifteen lines from it. In the last couple of iterations I've added the option to convert a Unicode file to multibyte and a multibyte file to Unicode, plus a combo box to specify the code page when reading.

As of v1.6, there is no limitation on the length of the line that can be read in any mode (Multibyte/Unicode, Unicode/Multibyte, etc.).

I'd love to hear of people's experiences with it, as well as reports of bugs, problems, improvements, etc.

Oh, and if I've accidentally included something offensive in the demo dialog, let me know. My Arabic and Chinese are not all that good.

History

  • v1.0 - Posted 14 May 2003
  • v1.1 - 23 August 2003. Incorporated fixes from Dennis Jeryd
  • v1.2 - 06 January 2005. Fixed garbage at end of file bug (Howard J Oh)
  • v1.3 - 19 February 2005. Howard J Oh's fix mysteriously failed to make it into the last release. Improved the test program. Fixed miscellaneous bugs
    Very important: In this release, ANSI files written in ANSI are no longer written using WriteString. This means \n will no longer be "interpreted" as \r\n. What you write is what you get
  • v1.4 - 26 February 2005. Fixed submission screw-up
  • v1.5 - 18 November 2005. Code page can be specified for reading and writing (inc. UTF-8). Multibyte buffers properly calculated. Fix from Andy Goodwin
  • v1.6 - 19 July 2007. Major rewrite: Maximum line length restriction removed; Use of strlen/lstrlen eliminated. Conversion functions always used to calculate required buffers; \r or \n characters no longer lost; BOM writing now optional; UTF-8 reading and writing works properly; systematic tests are now included with the demo project

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

David Pritchard
Software Developer (Senior)
Spain
I'm originally from Leek, Staffordshire in the UK, but I now work as a C++/MFC developer in Madrid, Spain.
 
I followed an erratic study/career path from German to a PhD in something resembling political science and linguistics, eventually ending up in IT.
 
I'm still finding bustling streets, warm nights, beer and vitamin D a pretty heady combination.


Comments and Discussions

 
ggets.cpp bug (David Pritchard, 9 Dec '12 - 1:44)
Someone has drawn my attention to a probable bug. In ggets.cpp,
 
delete buffer
 
should be
 
delete[] buffer
 
I'll give it a test tomorrow. Yeah, I know, I have to update the demo project. :)
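For readers unfamiliar with the rule behind this report: memory allocated with new[] must be released with delete[], and using plain delete on it is undefined behaviour. The sketch below shows the correct pairing and a std::vector alternative. Both functions are hypothetical illustrations; the actual code in ggets.cpp is not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical functions for illustration; the real ggets.cpp is not shown
// in the article.

// Manual management: memory obtained with new[] must be released with
// delete[]. Using plain delete here would be undefined behaviour.
std::string CopyWithRawBuffer(const char* text, std::size_t length)
{
    assert(text != nullptr || length == 0);
    char* buffer = new char[length + 1];
    if (length > 0)
        std::memcpy(buffer, text, length);
    buffer[length] = '\0';
    std::string result(buffer, length);
    delete[] buffer; // matches new char[...]
    return result;
}

// Alternative: std::vector releases its storage automatically, so there is
// no delete/delete[] pairing to get wrong.
std::string CopyWithVector(const char* text, std::size_t length)
{
    assert(text != nullptr || length == 0);
    std::vector<char> buffer(text, text + length);
    return std::string(buffer.begin(), buffer.end());
}
```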
Last Updated 19 Jul 2007
Article Copyright 2003 by David Pritchard