Introduction

Surprisingly, neither the C++ runtime library nor the Win32 Platform SDK provides any routines to read and write Unicode text files, so, when I needed some, I had to write my own. There are three reasons why you might choose to use these routines over others you can find out there on the Internet and elsewhere here on CodeProject: performance, performance and performance. And convenience. You can read and write any or all of ANSI, UTF-8, UTF-16 little-endian and UTF-16 big-endian files with no code changes on your part. These routines are not reliant on MFC in any way, so you can use them in any C++ project you like.

Just What are Unicode Text Files, Anyway?

Unicode text files come in three flavours: UTF-16 little-endian, UTF-16 big-endian, and UTF-8. There are also, unfortunately, three different conventions for delimiting lines: DOS/Windows (CRLF), Unix (LF only), and Mac (CR only). EZUTF handles all three types of file encoding (as well as ANSI files) and two of the three types of line delimiters. It cannot read CR-delimited files (although it can write them).

UTF-16 files store each character as a 16-bit unit (two bytes), which is why there are little-endian and big-endian variants. Little-endian files store the characters least-significant-byte first whereas big-endian files do the reverse. It is possible to tell which type of file you are reading because, by convention, UTF-16 files contain a two-byte marker - called a BOM (byte order mark) - at the start of the file which differs between the two formats. For little-endian files it is 0xFF, 0xFE whereas for big-endian files it is - you guessed it - the reverse (0xFE, 0xFF).

UTF-8 files are rather nifty, in that they can encode the entire UTF-16 character set but are half the size of UTF-16 files if you are storing only ASCII characters (i.e. character codes below 128). They are also highly portable between different systems and this is how multi-lingual Web pages travel around, in case you were curious. UTF-8 files store characters as sequences of 1, 2, 3 or 4 bytes depending on the character code in question, with ASCII characters always being encoded as 1 byte and most other Latin (and other western) characters as 2 bytes. On the other hand, Chinese and Japanese characters encode as 3 bytes, so if you are storing a lot of these, UTF-16 might be a better choice. There are no byte-ordering issues with UTF-8 files, thankfully, and the 4 byte sequences are rarely seen in Windows apps because they encode characters which lie outside the 16-bit range (code points above 0xFFFF), which UTF-16 (which is what Windows uses internally) can only represent as a pair of 16-bit units.
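To make the 1, 2, 3 or 4 byte rule concrete, here is a minimal sketch of the standard UTF-8 encoding of a single code point. The function name encode_utf8 is made up for this illustration and is not part of EZUTF, whose own conversion code may well be organised differently.

```cpp
#include <cassert>
#include <string>

// Encode one Unicode code point as a UTF-8 byte sequence (1 to 4 bytes).
// encode_utf8 is a hypothetical helper for illustration, not an EZUTF routine.
// Surrogate values (0xD800-0xDFFF) are not valid code points and are not
// given any special treatment here.
std::string encode_utf8(char32_t cp)
{
    assert(cp <= 0x10FFFF);                       // upper limit of Unicode
    std::string out;
    if (cp < 0x80) {                              // ASCII: 1 byte
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {                      // 2 bytes
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {                    // 3 bytes (e.g. Chinese, Japanese)
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                                      // 4 bytes (above 0xFFFF)
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For example, 'A' (0x41) stays as the single byte 0x41, whereas e-acute (0xE9) becomes the two bytes 0xC3, 0xA9.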

A UTF-8 file can also be identified as such, as by convention it starts with the sequence 0xEF, 0xBB, 0xBF, which differs from the BOMs used for both types of UTF-16 encoding (this marker is optional, though, and some UTF-8 files are written without it). The encoding of UTF-16 characters as UTF-8 byte sequences is not particularly complicated and is described in more detail here. Alternatively, just take a look at the code, which is commented (a bit).
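The three markers described above are easy to test for. The following is a rough sketch only: the names Encoding, BomInfo and detect_bom are invented for this example and are not EZUTF identifiers, and EZUTF's own detection code may differ.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical names for illustration; EZUTF uses its own TF_xxx constants.
enum class Encoding { Unknown, Utf8, Utf16LE, Utf16BE };

struct BomInfo {
    Encoding    encoding;     // what the marker says, or Unknown if none
    std::size_t bom_length;   // number of bytes to skip before the text
};

// Inspect the first few bytes of a file and report which BOM, if any, is there.
BomInfo detect_bom(const unsigned char *data, std::size_t size)
{
    assert(data != nullptr || size == 0);
    if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
        return { Encoding::Utf8, 3 };
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        return { Encoding::Utf16LE, 2 };
    if (size >= 2 && data[0] == 0xFE && data[1] == 0xFF)
        return { Encoding::Utf16BE, 2 };
    return { Encoding::Unknown, 0 };   // no BOM: the caller has to decide
}
```

Note that a file with no BOM comes back as Unknown rather than ANSI, because the absence of a marker does not prove anything about the encoding.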

Finally, just for completeness, ANSI files store each character as a single byte, and hence can only represent character codes of 255 and below. To properly understand an ANSI file, you have to know which code page was used when it was written, although a lot of software simply assumes Windows-1252.

What Does EZUTF Do?

EZUTF provides a set of high-performance routines to read and write all of the types of text files described above without the application having to do any of the necessary translations itself. It can also handle both DOS/Windows (CRLF) and Unix (LF) line-delimiters, but not Mac (CR-only). When a file is opened for reading, EZUTF can be instructed to read the BOM (if any) and hence deduce the file encoding. Alternatively, you can force EZUTF to use a particular encoding to avoid, for example, erroneously treating an ANSI file which happens to start with 0xFF, 0xFE as a UTF-16 file (which would be disastrous). When a file is opened for writing, you tell EZUTF what encoding and line delimiters you want to use and it will take care of all the details, including writing out a BOM at the start of the file.

No 'seek' functionality is provided, but EZUTF can append data to the end of an existing file. In this case, no BOM is written out unless the file was initially empty (or did not exist at all).

Using the Code

The entire public API is wrapped in a single class: TextFile. A TextFile object can be opened and closed, and can read and/or write either lines or single characters.

Reading Files

Here, typically, is how you would open a file for reading:

TextFile *tf = new TextFile;
int result = tf->Open (L"MyFile.txt", TF_READ);

When a file is opened for reading in this way, EZUTF will read the BOM in the file, if present, and deduce the file encoding from it. If you want to know what that is, you can use the following (after you have opened the file):

// TF_ANSI, TF_UTF16LE, TF_UTF16BE or TF_UTF8
int file_encoding = tf->GetFileEncoding ();

Alternatively, if you know that you are opening an ANSI file, you would be wiser to use...

TextFile *tf = new TextFile;
int result = tf->Open (L"MyFile.txt", TF_READ, TF_ANSI);

... as this avoids any danger of interpreting the file as Unicode by mistake.

To read lines from a file, you do something like this:

TCHAR *line_buf = NULL;    // must start out as NULL; TextFile allocates it
int result;
while ((result = tf->ReadLine (NULL, &line_buf)) >= 0)
{
    // do something; the line just read from the file is in line_buf
}

free_block (line_buf);     // the caller frees the buffer when done

Note that any line delimiter is stripped from the line before it is returned and that line_buf is allocated from within TextFile, not by the caller. This is to handle varying line lengths without having to allocate a buffer on each call. The caller must initialise line_buf to NULL and is responsible for freeing it when done (by calling free_block ()). If you fail to initialise line_buf to NULL, your program will die a horrible death, and if you fail to pass it to free_block () when you are done with it, you will have a memory leak. The pointer returned in line_buf remains valid until you pass it to another TextFile routine (or free it). The initial NULL parameter is for optionally returning 'data lost' in ANSI builds, where Unicode to ANSI translations are required within the TextFile class (see WideCharToMultiByte in the Platform SDK docs).
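To illustrate what 'stripping the line delimiter' means for the two conventions EZUTF reads, here is a small stand-alone sketch that splits in-memory text into lines. It is not EZUTF code: split_lines is a hypothetical helper, and it works on a std::wstring rather than a file purely to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Split text into lines, accepting both CRLF and LF as delimiters and
// stripping them from the result. A lone CR is NOT treated as a delimiter,
// mirroring the limitation described in the article.
// split_lines is a hypothetical helper, not part of EZUTF.
std::vector<std::wstring> split_lines(const std::wstring &text)
{
    std::vector<std::wstring> lines;
    std::size_t start = 0;
    while (start < text.size()) {
        std::size_t end = text.find(L'\n', start);
        const bool last = (end == std::wstring::npos);   // no trailing LF
        if (last)
            end = text.size();
        assert(start <= end);
        std::size_t len = end - start;
        if (!last && len > 0 && text[end - 1] == L'\r')
            --len;                                       // drop the CR of a CRLF
        lines.emplace_back(text, start, len);
        start = end + 1;
    }
    return lines;
}
```

Given "one\r\ntwo\n\nthree", this yields the four lines "one", "two", "" and "three".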

Writing Files

To open a file for writing, you must specify the encoding you want to use, like so:

TextFile *tf = new TextFile;
int result = tf->Open (L"MyFile.txt", TF_WRITE, TF_UTF8);

Then to write out a line, you would do this:

int result = tf->WriteString (NULL, L"This is a string");
if (result >= 0)
    result = tf->WriteChar (NULL, '\n');

Of course, if the line you are writing out is already terminated with a newline (\n) character, you can skip the call to WriteChar (). The initial NULL parameters are for optionally returning 'data lost' when writing to ANSI files, where Unicode to ANSI translations are required within the TextFile class (see WideCharToMultiByte in the Platform SDK docs).

If, like me, you are a fan of fprintf, you can write out formatted data like so:

int n_bottles = 10;
int result = tf->FormatString
    (NULL, L"There are %d green bottles, standing on the wall.\n", n_bottles);

Please note that I have not provided support for streams as I do not use them, but adding them would not be difficult and if someone would care to, I will gladly roll their changes into the master sources.

Reading and Writing Unix Files

When reading files, Unix-style (LF-only) line delimiters are handled automatically, i.e. you can just open the file in the normal way and then call ReadLine () as described above. To write out a file using Unix line delimiters, you can do:

TextFile *tf = new TextFile;
int result = tf->Open (L"MyFile.txt", TF_WRITE, TF_UTF8 | TF_UNIX);

Writing out a \n character will then write just an LF to the file, rather than a CRLF sequence.
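The effect of the two modes can be modelled in a few lines. This is only a sketch of the behaviour as described in this article, not EZUTF's actual implementation, and translate_newlines is a name made up for the purpose.

```cpp
#include <cassert>
#include <string>

// Model of newline handling on output: in DOS/Windows mode every '\n' is
// written as CR LF, in Unix mode it is written unchanged.
// translate_newlines is a hypothetical helper, not an EZUTF routine.
std::wstring translate_newlines(const std::wstring &text, bool unix_style)
{
    std::wstring out;
    out.reserve(text.size() + 16);
    for (wchar_t ch : text) {
        if (ch == L'\n' && !unix_style)
            out += L'\r';              // insert the CR automatically
        out += ch;
    }
    assert(out.size() >= text.size()); // translation never shrinks the text
    return out;
}
```

One consequence of this model: if the caller passes "\r\n" in DOS/Windows mode, the output contains "\r\r\n", so lines should be terminated with a plain "\n" and the rest left to the library.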

Error Handling and HPSLib

All TextFile methods return an integer, and if an error has occurred this will be negative. End of file also returns a negative value - TF_EOF - so test for this first. To retrieve a string describing the error, call GetLastErrorString (). This works in a similar way to GetLastError (), but returns a pointer to an internal buffer (per thread) containing a user-friendly error message (e.g. 'Could not open file xyz, error blah'). The pointer returned is valid until you call TextFile again (or SetLastErrorString ()) from within the same thread. Alternatively, you can call GetLastError () in the usual way and report error conditions in whatever way you choose.

Performance

EZUTF is fast! If you have the need for speed, these are the routines for you. Reading a UTF-8 file some 100MB / 2,500,000 lines in size takes under a second on my AMD Athlon 64 3000+, once the file is in the cache. Copying the same file takes about 7 seconds, about the same time as it takes to do a binary copy, although there is considerably more CPU overhead.

By contrast, loading the same file into Notepad takes around 45 seconds, and loading it into Visual Studio 2005 2-3 seconds (which is actually pretty good; I was impressed). These figures refer to the release build - the debug build is a good deal slower.

HPSLib, and Miscellanea

EZUTF is built on top of an in-house library modestly entitled HPSLib. I have provided a minimal subset of this - in files hpslib.cpp, hpsutils.h, hpslib.rc and hpslib.hr - which provides enough functionality for the TextFile class to operate as designed. You will need to include these in any project where you use the TextFile class, or you might elect to copy the text strings from hpslib.rc (there are only 4 of them) across into your own *.rc file.

The demo app is a console app and expects to find a file called ezutf_test_input.txt in the current directory, which it copies to ezutf_test_output.txt. If you want to step through the code, build the debug version.

Newcomers to C++ might be interested in the use made of templates, virtual functions and inline functions in the implementation. Personally, I use templates rarely, but when you need 'em, you need 'em. More methods should probably be private.

History

Memory files.
tonyvsuk
4:11 22 Sep '09  
Many thanks for this class, it has already saved me a lot of hassle and time.

How easy would it be to extend this class to read a memory file?

I have an archive which I read from, so I have a BYTE buffer containing the file contents. I don't know the encoding of the file before I open it.

Up till now I have been using CMemFile and CArchive (I am in the process of converting my app to unicode), but these do not behave very well compared to this class, I would love to make even better use of it than I am now.

Thanks in advance for any advice,

Tony.
Re: Memory files.
Paul Sanders (AlpineSoft)
6:20 22 Sep '09  
I think the easiest thing would be to write the data to a temporary file. Presumably it's not very large if it's in memory. Check out GetTempPath, GetTempFileName and the FILE_ATTRIBUTE_TEMPORARY parameter to CreateFile.

If you prefer to have a crack at hacking the source code, you could replace the calls to ReadFile with calls to a new abstract class that either reads from file or returns bytes from your data buffer as appropriate. You would also need to replace a couple of SetFilePointer calls - EZUTF 'backs up' after failing to read a BOM.

HTH - Paul.


Re: Memory files.
tonyvsuk
6:33 22 Sep '09  
Thanks Paul.

I had already got to the stage where I realised it was the readfile function I needed to intercept.

I'll have a go and report back.

Tony.
Re: Memory files.
Paul Sanders (AlpineSoft)
6:55 22 Sep '09  
OK, good luck!


Missing functions
MemoC73
22:28 21 Sep '08  
Hi,

Really good work.
I am missing some functions in your c++ routines, like Seek, SeekToBegin and SeekToEnd. Maybe a function like GetCountOfLines would be also very nice.

Thanks

Best regards

Mehmet
Re: Missing functions
Paul Sanders (AlpineSoft)
23:02 21 Sep '08  
Hi,

Thank you for your comments. When you say 'Seek', I assume you mean in a file you are reading, rather than one you are writing. This was deliberately omitted to simplify the implementation, but it is probably not a lot of work to implement it (but not for a file you are writing; that would be more work). SeekToBegin can of course be emulated by reopening the file. SeekToEnd has no obvious use that I can see. All you will subsequently read is EOF.

To implement GetCountOfLines would mean reading through the entire file, so you might as well do just that. The overhead of calling ReadLine repeatedly to do this is very low.

Due to pressure of work, I have no immediate plans to update the code but I will bear your request in mind next time I do. If you plan to have a go yourself, you will need to:
- call SetFilePointer on the underlying HFILE
- discard the contents of EZUTF's buffer

For Seek to be of any use, you will also need to implement a Tell function (I assume you want to seek back to some remembered file location). This might usefully return the file position of the start of the line just read (perhaps as an option), rather than the current file location. Implementing this is a little more tricky as you will need to take account of where EZUTF has got to in its internal buffer, but if you get into the code a little bit it should not be too hard.

There, almost done :) Sorry I can't be of more help.

PS: Poor man's seek:

1. Keep track of the line number you are at as you read through the file.
2. To seek to line n, reopen the file and read (and throw away) n-1 lines

If your files are small and seeking is infrequent, this might just be good enough. EZUTF is pretty fast and data recently read from disk will be cached by Windows.
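The two steps above could be sketched as follows. This uses a plain std::istream as a stand-in for an open TextFile (rewinding the stream plays the part of reopening the file), and the names seek_to_line and nth_line are invented for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Position 'in' so that the next line read is line n (1-based), by going back
// to the start and reading (and throwing away) n-1 lines.
// Returns false if the input has fewer than n-1 lines.
// Hypothetical helper; with EZUTF you would reopen the file and call ReadLine.
bool seek_to_line(std::istream &in, std::size_t n)
{
    assert(n >= 1);
    in.clear();                 // clear any earlier EOF state
    in.seekg(0);                // stands in for "reopen the file"
    std::string discard;
    for (std::size_t i = 1; i < n; ++i)
        if (!std::getline(in, discard))
            return false;
    return true;
}

// Convenience wrapper for trying the idea out on in-memory text.
// Returns an empty string if line n does not exist.
std::string nth_line(const std::string &text, std::size_t n)
{
    std::istringstream in(text);
    std::string line;
    if (seek_to_line(in, n) && std::getline(in, line))
        return line;
    return std::string();
}
```

The cost is proportional to n for every seek, which is why this only makes sense for small files or infrequent seeks.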


Re: Missing functions
MemoC73
22:24 22 Sep '08  
Hi,

Yes, you are right. The 'seek' functions I meant were only for reading.
SeekToEnd makes no sense.
I know that there are simple ways to implement SeekToBegin and GetCountOfLines by reopening the file, and I am using them in my code.

I like your routines, because they really work with UTF16 files. I have checked some other classes and routines on CodeProject for reading UTF16 files, but they were either too complicated or did not work.

So I want to give you some ideas for extending your routines.

But now I have a strange thing. I read a UTF16 file and write it without any changes to another file. When I compare the file size of the original file and the copied file, I see that the copied file is bigger. Here are the values:
Original file: 382.302 Bytes
Copied file: 397.360 Bytes
My function only reads the file using ReadLine, writes each line with WriteString and adds \r\n. That's all. When I compare the files with UltraCompare, I get no differences.

regards
Mehmet
Re: Missing functions
Paul Sanders (AlpineSoft)
23:18 22 Sep '08  
Hi,

The difference must lie in the line breaks. The original file is probably only using a single delimiter, probably \n. Did it come from a Unix-based system by any chance? And Sherlock Holmes here deduces that the file has 15,058 lines in it, give or take.

And thanks for your input. If and when I revise the source code I will see what I can do.

Vote for the article?


Re: Missing functions
Paul Sanders (AlpineSoft)
23:09 24 Sep '08  
Hello again,

While dropping off to sleep last night I realised why your output file is larger than your input file. When writing out line breaks, use "\n" rather than "\r\n". EZUTF inserts a '\r' automatically. The file you are writing out at the moment will have "\r\r\n" at the end of each line.

Hope this helps.


Reading/Writing Text Buffers instead of Lines
Gautam Jain
0:50 3 Aug '08  
Hello,

Good work. Thanks.

I have strings which include CRLF already. Can I use your functions (WriteString) to write to UTF8 text files? I don't want new lines to be added. Just write the text that I provide.

Similarly, I want to read whole file (all lines) in to a single string. Is it possible with your class?

Thanks.

Regards,
Gautam Jain
http://www.conceptworld.com

Re: Reading/Writing Text Buffers instead of Lines
Paul Sanders (AlpineSoft)
1:47 3 Aug '08  
Hi,

I'm not quite sure what it is you want to do, but WriteString writes out the characters in the string verbatim, so if there are CRLF characters in the string they will get written to the file. But if you don't want that, it is easy enough to remove them from the string beforehand surely.

When reading, EZUTF reads the input a line at a time. To read the entire file in one go I would use CreateFile, GetFileSize and ReadFile.

Hope this helps.


Memory waste
axelriet
18:08 1 Aug '08  
I just eyeballed the code and I stopped at line 36 in textfile.cpp: why blindly allocate 2x the string length instead of using WideCharToMultiByte() itself to compute the required size, by passing 0 in the cbMultiByte parameter? Most of the time, you allocate 2x too much memory.

Besides that, I wonder about the custom heap allocator. Is this really needed? Using a system-provided one (e.g. HeapCreate/HeapAlloc/HeapSetInformation...) can go a long way towards making your code thread safe.

Cheers,
Axel
Re: ...and High Performance
yarp
18:42 1 Aug '08  
Yes, I'm wondering the same about that custom allocator.
Besides I don't see any "High Performance" stuff in your code. You are just using CreateFile with normal parameters. As you say it takes 1s to read a file, but only once it has been loaded in the cache.
Anyway your class is still very interesting for reading Unicode files but I wouldn't call it High Performance.

Yarp
http://www.senosoft.com/

Re: ...and High Performance
Paul Sanders (AlpineSoft)
2:10 2 Aug '08  
Well, you can't read raw binary data any faster than calling CreateFile / ReadFile (obviously), although using a large buffer size can help.

But that's not where the time goes. When reading text files, and in particular UTF-8, time is spent finding end-of-line characters, converting encoded character sequences and allocating memory to return the line of text to the caller. EZUTF optimises all of these, as my (minimal) benchmarks show.

Please do not thoughtlessly criticize what you do not understand. People on this site put a lot of work into their articles. And if you can find a faster implementation, I'd like to know about it.


Re: ...and High Performance [modified]
yarp
21:56 3 Aug '08  
I'm sorry if you were offended. I understand you put a lot of time into writing your class and your article. But criticism is positive sometimes.
CreateFile with normal parameters (like you did) has to load the file in cache before actually doing the read phase - so with large files it will be slow (by large I mean 15+ MB). You can optimize file access a bit with sequential read (since you don't use Seek you can do that), and you can optimize it even more with a memory mapped file, which is what High Performance is (imho). That's what I meant when writing this.

Here's an implementation of Sequential read:

CreateFile(psz,
           GENERIC_READ,
           0,
           NULL,
           OPEN_EXISTING,
           FILE_FLAG_NO_BUFFERING |
           FILE_FLAG_SEQUENTIAL_SCAN,
           NULL);

In that case you must read into a buffer whose size is a multiple of your disk sector size:

if (GetDiskFreeSpace("C:\\",    // root path; the trailing backslash is required
                     &dwSectorPerCluster,
                     &dwBytesPerSector,
                     &dwNumberOfFreeClusters,
                     &dwTotalNumberOfClusters))
{
    // round 1MB down to a whole number of sectors
    DWORD dwCntSectors = (1048576 / dwBytesPerSector);
    dwBufferSize = dwCntSectors * dwBytesPerSector;
}
else
    dwBufferSize = 1048576; // 1MB

m_pBuffer = (char*)::VirtualAlloc(NULL, dwBufferSize, MEM_COMMIT, PAGE_READWRITE);

...

VirtualFree(m_pBuffer, 0, MEM_RELEASE);


I don't have a Memory Mapped file example - that's the reason why I got interested in your post - but there are some on CP.
btw, as I said, your Unicode stuff is interesting enough :P

Yarp
http://www.senosoft.com/

modified on Monday, August 4, 2008 3:06 AM

Re: ...and High Performance [modified]
Paul Sanders (AlpineSoft)
0:55 4 Aug '08  
Hi,

I'm not offended, just a little irritated by the tone of your original post, and I feel duty bound to point out the errors in what you say. I prefer to be asked why, rather than told why not, if you like.

Windows does *not* pre-load large files into cache. It buffers recently read data, which is not the same thing at all - if you open a large file and read 1 byte of it, Windows will read and buffer just that 1 byte (well, almost).

Using FILE_FLAG_NO_BUFFERING is normally counter-productive as it disables Windows' caching and read-ahead mechanisms. As for FILE_FLAG_SEQUENTIAL_SCAN, I don't think it actually does anything in practice (it has never had any effect in any of my tests), but I could be wrong about that. It would certainly do no harm to pass it to CreateFile, so I guess I should do so. It is ineffective when used with FILE_FLAG_NO_BUFFERING though, as its purpose is to read-ahead and cache data and FILE_FLAG_NO_BUFFERING explicitly turns caching off.

As I said before, the key to getting good file read performance is to pass a large buffer size to ReadFile. This reduces seek time and latency overheads as these cost you every time you perform a read operation, so fewer read operations = less overhead. EZUTF actually uses a rather small buffer size by default (4k) which I probably should change, but you can pass an optional parameter to TextFile::Open to increase it.

As for memory mapped files, they are not really relevant here. They certainly hold no advantages. Personally, I use them when a data structure on disk can be directly mapped to the same data structure in memory (we use them for our 'waveform' display, specifically). Behind the scenes, Windows uses them for loading EXEs and DLLs which, unless they have to be relocated, contain a verbatim image of the executable code. This means that a program can start up without having to be completely read in from disk. Clever, eh?

Anyway, thanks for posting. You have picked up a few things I should have done and I always enjoy a good debate.


modified on Monday, August 4, 2008 7:46 AM

Re: Memory waste
Paul Sanders (AlpineSoft)
2:07 2 Aug '08  
Hi,

I do it for performance reasons. Calling WideCharToMultiByte is expensive. The amount of memory wasted is trivial as we are only allocating a buffer for a single line of text at this point.

As for the custom heap allocator, it is just a wrapper round new. I do it so that I can use the CRT debug functions to detect memory leaks (Hnew records the file and line number of the allocation in debug builds). There are no multi-threading implications of doing this. The reason why the code is not threadsafe is that internal buffer manipulations need to be serialised in a multi-threaded environment. This would not be hard to do but, as I have no need of it, I did not bother. See my response to IanLo's comment if this is important to you.


Re: Memory waste
axelriet
10:39 2 Aug '08  
Paul Sanders (AlpineSoft) wrote:
I do it for performance reasons. Calling WideCharToMultiByte is expensive.


Okay, now consider this: I compile an ANSI program and I need to convert whatever input encoding to UTF-8, very legitimate, right? With your class, I'd just call SetCodePage(CP_UTF8) and it would work, right?

Not. Your hardcoded 2x allocation just turned from an innocent memory waste into what will appear as an intermittent failure in production: some input lines will vanish for no apparent reason. A debug build won't help as the assert() will not fire.

You'd have caught it had you checked for WideCharToMultiByte() returning 0 and called GetLastError(), which would have returned ERROR_INSUFFICIENT_BUFFER when the input text expanded to three or more bytes per character, but you did not.

Call WideCharToMultiByte() to find out the output buffer length (and check return values).

Just my 0.02 Euro-cents.

Cheers,
Axel
Re: Memory waste
Paul Sanders (AlpineSoft)
1:40 3 Aug '08  
OK, so maybe it should be 3x :) Users of this class, please be warned and make this change to the source code if you need to. I will update the master sources next time around.

But the point about performance remains. And I would like to reiterate that the waste of memory is trivial and calling WideCharToMultiByte is *not* the right thing to do in this context. Anyway, we are arguing about how many angels can dance on the head of a pin: who uses ANSI builds anymore, particularly when handling UTF-8? I certainly don't, and I only included ANSI support for the sake of completeness.

And a little aside about asserts: I leave them in, even in production code (and generate a dump file if one fires, read up on minidumps in the Platform SDK). This has saved me a lot of grief over the years as errors of thinking of this nature (and let's face it, we all make them) are trapped before they can have any knock-on effects.
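On the 2x versus 3x question, the arithmetic for UTF-8 specifically can be checked without calling WideCharToMultiByte at all. The sketch below counts the UTF-8 bytes needed for a UTF-16 string; utf8_length is a hypothetical helper, not part of EZUTF, and the bound it demonstrates applies to UTF-8 only (other code pages are not covered here).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Number of bytes needed to store a UTF-16 string as UTF-8.
// Per 16-bit unit: below 0x80 -> 1 byte, below 0x800 -> 2 bytes, rest of the
// 16-bit range -> 3 bytes. A surrogate pair (two units) -> 4 bytes in total.
// An unpaired surrogate is counted as 3 bytes (an assumption of this sketch).
std::size_t utf8_length(const std::u16string &s)
{
    std::size_t bytes = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t u = s[i];
        if (u < 0x80) {
            bytes += 1;
        } else if (u < 0x800) {
            bytes += 2;
        } else if (u >= 0xD800 && u <= 0xDBFF && i + 1 < s.size() &&
                   s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            bytes += 4;         // surrogate pair: one character above 0xFFFF
            ++i;                // skip the low surrogate
        } else {
            bytes += 3;
        }
    }
    assert(bytes <= 3 * s.size());   // 3 bytes per 16-bit unit is the worst case
    return bytes;
}
```

So a buffer of 3 bytes per input character (plus a terminator) can never be too small for UTF-8 output, whereas 2 bytes per character is too small as soon as a line is mostly, say, Chinese text.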


Assert on explicit read of a UTF-8 file
flector
13:39 31 Jul '08  
Can EZUTF read non-BOM UTF-8 files?

I need to read greek language UTF-8 html files that don't have a BOM. The current approach is to read the whole file, look for an upcase "CHARSET=UTF-8," then re-read it as a UTF-8 if it's found. Alas, it's not working.

This generates an assert (file read opens seem to require encoding == 0 or TF_ANSI):
some_file.Open (filename, TF_READ, TF_UTF8);

Any thoughts?
Re: Assert on explicit read of a UTF-8 file [modified]
Paul Sanders (AlpineSoft)
5:27 1 Aug '08  
Whoops, there seems to have been a little cut-and-paste error there. Just noticed. I guess it's way too late to provide any kind of useful reply now, but there's probably a way to do what you want if you are still interested. If you are, let me know and I will look into it.


Thread safe?
Ianlo
19:30 24 Apr '08  
Hi Paul,

I was wondering if this implementation is thread safe? I am using Intel TBB to do some parallel reading of the files.

Thanks!
Ian
Re: Thread safe? [modified]
Paul Sanders (AlpineSoft)
0:14 25 Apr '08  
Hi,

No, TextFile isn't threadsafe I'm afraid, but it should be sufficient for you to write a little wrapper function that claims a critical section before calling ReadLine and then releases it after ReadLine returns. If you also want to call ReadChar, write a wrapper function for that too, using the same critical section of course, and likewise for WriteLine and WriteChar, if needed. You should use the same critical section for all four.

To get the maximum concurrency, you will probably want to use a separate critical section for each open file. I would therefore derive a class from TextFile, declare the critical section as private data, initialise it in the constructor, destroy it in the destructor and override ReadLine and ReadChar (and WriteLine and WriteChar, if required) as described above. A bit tedious perhaps, but it should only take an hour or two to put it together.
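As a rough illustration of the derived-class approach, here is a portable sketch. It is not real EZUTF code: LineSource is a made-up stand-in for TextFile (with a simplified ReadLine signature and return values invented for the example), and std::mutex takes the place of a Win32 critical section.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Stand-in for TextFile, serving lines from memory. NOT the real EZUTF class.
class LineSource {
public:
    explicit LineSource(std::vector<std::string> lines)
        : lines_(std::move(lines)) {}
    virtual ~LineSource() = default;

    // Returns 0 on success, -1 at end of input (values made up for this sketch).
    virtual int ReadLine(std::string &out)
    {
        assert(next_ <= lines_.size());
        if (next_ == lines_.size())
            return -1;
        out = lines_[next_++];
        return 0;
    }

private:
    std::vector<std::string> lines_;
    std::size_t next_ = 0;
};

// Derived class that serialises access: one lock per open file, held for the
// duration of each call and released automatically on return.
class LockedLineSource : public LineSource {
public:
    using LineSource::LineSource;

    int ReadLine(std::string &out) override
    {
        std::lock_guard<std::mutex> guard(lock_);
        return LineSource::ReadLine(out);
    }

private:
    std::mutex lock_;   // plays the role of the private critical section
};
```

Any other methods that touch the internal buffer (ReadChar and the write routines) would need the same treatment, using the same lock.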

Having said all of which, if there is sufficient interest I would be happy to make the master sources threadsafe, but I am a little busy at the moment. If and when I do so, you can just throw your wrapper class away and use TextFile direct.


modified on Monday, May 5, 2008 4:21 PM

Re: Thread safe?
Ianlo
9:33 8 Jul '08  
Hi Paul,

Thanks for the pointers and help! :)

Ian
Nice, but it's a pity that it's limited to Windows
Tage Lejon
7:09 19 Mar '08  
It would be much better if you had implemented your excellent classes in ANSI C++, without
using Windows-specific APIs and types such as CreateFile() and HANDLE.

That would extend its usability a great deal!


Last Updated 1 Aug 2008 | Copyright © CodeProject, 1999-2010