
Simple class to read and write a UTF-8 encoded file

Sven Axelsson, 8 Jul 2004
A class derived from CStdioFile to read and write a UTF-8 encoded file.


This will certainly look like a very thin article, but it's all in the source :^)

I looked around quite a bit, both here at Code Project and elsewhere, expecting that someone had already posted such a class. I couldn't find one, so here is my own quick hack to solve the problem.


The CStdioFile_UTF8 class was originally written as a step towards making Dan Goodson's excellent TodoList program support Unicode. I'm posting it here in case someone else finds it useful.

Using the code

Use the class as a drop-in replacement for MFC's CStdioFile. The class overrides the ReadString and WriteString functions to convert the text on its way in and out. It also provides the functions ReadBOM and WriteBOM to handle an optional byte order mark at the start of the file.
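To make the byte order mark handling concrete, here is a minimal sketch in portable C++. It is not the class's ReadBOM/WriteBOM code (that is in the download); `bomLength` and `withBom` are hypothetical names chosen for illustration. The only fact relied on is that the UTF-8 BOM is the three bytes EF BB BF.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// The UTF-8 byte order mark: U+FEFF encoded as three bytes.
static const unsigned char kUtf8Bom[3] = {0xEF, 0xBB, 0xBF};

// Hypothetical helper: returns how many leading bytes to skip,
// i.e. 3 if the buffer starts with a UTF-8 BOM and 0 otherwise.
std::size_t bomLength(const char* data, std::size_t size) {
    assert(data != nullptr || size == 0);
    if (size < 3) return 0;
    for (std::size_t i = 0; i < 3; ++i) {
        if (static_cast<unsigned char>(data[i]) != kUtf8Bom[i]) return 0;
    }
    return 3;
}

// Hypothetical helper: returns `text` with a BOM in front,
// leaving it unchanged if one is already present.
std::string withBom(const std::string& text) {
    if (bomLength(text.data(), text.size()) == 3) return text;
    std::string out(reinterpret_cast<const char*>(kUtf8Bom), 3);
    out += text;
    return out;
}
```

A reader would typically call something like `bomLength` once, right after opening the file, and skip that many bytes before reading the first line.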

If _UNICODE is defined, the class converts between the UTF-16 strings used internally and the UTF-8 text stored in the file. If the symbol is not defined, the class behaves exactly like its parent class, CStdioFile.
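For readers unfamiliar with what such a conversion involves, the following is a rough, portable C++ sketch of UTF-16 to UTF-8 and back. It is an illustration under stated assumptions, not the code in the download: the function names are hypothetical, `std::u16string` stands in for the UTF-16 strings a _UNICODE build uses, and malformed input is replaced with U+FFFD. The decoder does not reject overlong forms or encoded surrogates.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

static const char16_t kReplacement = 0xFFFD;  // U+FFFD REPLACEMENT CHARACTER

// Append one code point (at most 0x10FFFF) as 1 to 4 UTF-8 bytes.
static void appendUtf8(std::string& out, char32_t cp) {
    assert(cp <= 0x10FFFF);
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// UTF-16 -> UTF-8. Surrogate pairs are combined into one code point;
// an unpaired surrogate is written as U+FFFD.
std::string utf16ToUtf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = kReplacement;
        }
        appendUtf8(out, cp);
    }
    return out;
}

// UTF-8 -> UTF-16. A byte that cannot start or continue a sequence is
// replaced with U+FFFD and decoding resumes at the next byte.
std::u16string utf8ToUtf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else { out += kReplacement; ++i; continue; }
        bool ok = (i + extra < in.size());
        for (std::size_t k = 1; ok && k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) ok = false;
            else cp = (cp << 6) | (c & 0x3F);
        }
        if (!ok || cp > 0x10FFFF) { out += kReplacement; ++i; continue; }
        if (cp >= 0x10000) {  // needs a surrogate pair
            cp -= 0x10000;
            out += static_cast<char16_t>(0xD800 + (cp >> 10));
            out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
        } else {
            out += static_cast<char16_t>(cp);
        }
        i += extra + 1;
    }
    return out;
}
```

On Windows, the same job is usually delegated to WideCharToMultiByte and MultiByteToWideChar with the CP_UTF8 code page rather than written by hand.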


  • 9-Jul-2004: Initial version.


This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.



About the Author

Sven Axelsson
Web Developer
Sweden

Comments and Discussions

Compiler Error (d00_ape, 30-Aug-04 0:10)
Re: Compiler Error (Uwe Sedlack, 11-Mar-05 7:12)
CStdioFileEx (David Pritchard, 14-Jul-04 12:51)
half-baked conversion (umeca74, 13-Jul-04 20:51)
Re: half-baked conversion (Sven Axelsson, 16-Jul-04 1:42)
Re: half-baked conversion (umeca74, 19-Jul-04 4:01)
I don't have a perfect solution for broken sequences, but here's something that works:
* read a chunk of, say, 10000 bytes; the start is definitely good, i.e. it won't break any UTF-8 sequence
* before converting, find the boundary of the last complete character; if it falls short of 10000, move the file pointer back accordingly (essential for the next chunk to be read correctly)
To locate this boundary I take advantage of the behaviour of MultiByteToWideChar, which doesn't emit any characters for broken sequences. I take the last 10 bytes (offsets 9990 - 9999) and translate them; then I back off by one byte, retranslate, and compare the translated string lengths; I continue like that until the translated length is smaller than it was for the full 10 bytes. Now you know where the last complete character ends.
Presumably a more intelligent option would be to read the last few bytes and decipher them yourself, but that would assume knowledge of lead bytes and the structure of UTF-8 -- expertise that I sadly lack :)
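The "decipher them yourself" option mentioned above can be sketched in a few lines of portable C++. This is an illustration only, not code from the article or from the poster; `completeUtf8Length` is a hypothetical name. It relies on two properties of UTF-8: continuation bytes always have the bit pattern 10xxxxxx, and the lead byte says how long the sequence is (0xxxxxxx = 1 byte, 110xxxxx = 2, 1110xxxx = 3, 11110xxx = 4).

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: for a chunk that starts on a character boundary,
// returns how many leading bytes form complete UTF-8 sequences. The caller
// would move the file pointer back by (size - result) bytes before reading
// the next chunk.
std::size_t completeUtf8Length(const char* data, std::size_t size) {
    assert(data != nullptr || size == 0);
    // Step back over at most 3 trailing continuation bytes (10xxxxxx)
    // to reach the lead byte of the last sequence.
    std::size_t i = size;
    std::size_t back = 0;
    while (i > 0 && back < 3 &&
           (static_cast<unsigned char>(data[i - 1]) & 0xC0) == 0x80) {
        --i;
        ++back;
    }
    if (i == 0) return size;  // no lead byte in range; nothing sensible to trim

    unsigned char lead = static_cast<unsigned char>(data[i - 1]);
    std::size_t need;  // total sequence length announced by the lead byte
    if (lead < 0x80)                need = 1;
    else if ((lead & 0xE0) == 0xC0) need = 2;
    else if ((lead & 0xF0) == 0xE0) need = 3;
    else if ((lead & 0xF8) == 0xF0) need = 4;
    else return size;  // malformed tail: leave it for the converter to handle

    std::size_t have = size - (i - 1);  // bytes present, lead byte included
    return (have < need) ? i - 1 : size;
}
```

Compared with retranslating the tail repeatedly, this inspects at most four bytes and does not depend on how the conversion function treats incomplete sequences.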


Last Updated 9 Jul 2004
Article Copyright 2004 by Sven Axelsson
Everything else Copyright © CodeProject, 1999-2015