Unicode and ANSI file I/O, line by line

Jaroslav Klima


4 Nov 2005, CPL, 10 min read


This article describes a very simple library which provides functions for reading and writing lines of text from/to a file on disk, supporting both ANSI and Unicode.

Download library + example (source included) - 40.6 Kb

Introduction

Most applications need to store data on the hard disk, in one form or another. The most common form is a text file, where every piece of information is represented by one or more lines of text. The file parser reads the file line by line into strings and passes the strings to whatever takes care of them afterwards.

Creating such a parser seems very easy, but there is one major problem: the file cannot easily be in Unicode format. Neither the CRT nor the STL offers a simple, obvious way of writing Unicode text to a file. Furthermore, even in ANSI, the only really simple way of writing and reading lines of text that I know of is using STL streams, which some people (e.g. me :)) may not like very much.

For these reasons, I have decided to write a small library that will provide functions for writing and reading lines of text to/from files on disk, in both ANSI and Unicode, and will be as easy to use as possible.

In this article, I will describe the process of creating the library step by step and try to explain all the not-so-obvious aspects of it. The difficulty level is set to "beginner / haven't used C++ much", so please take this into consideration.

Design considerations

When writing any piece of code, the first thing we need to ask ourselves is what our goal is. In this case, what we want to end up with is a set of functions with two basic purposes:

Write a line of text into a file.
Read a line of text from the file.

The key words in the above statements are "file", "text" and "line". By "file", I refer to a file on the hard disk, as we all know it. By "text", I mean a sequence of characters. We will not interpret the characters in any way (as numbers, bools, etc.). By "line", I refer to a piece of text terminated by a "line break", a special character used to indicate the end of a line. How this character is stored in the file depends on the character encoding (SBCS, MBCS, Unicode; there is a great article about character encodings here on CodeProject).

First, we need to decide on the format of our files. For our purposes, the file will contain lines of text in either (single-byte) ANSI, or Unicode character encoding. Nothing more, nothing less, no BOMs or anything else. The goal of our library is not to produce correct Unicode files, but merely to store characters from a string (single-byte ANSI or Unicode) in a file and then read them back exactly the same as before. Note that you should never mix ANSI and Unicode in one file, or use ANSI functions on a Unicode file or vice-versa, unless you know exactly what you are doing.

To handle files, we will use standard C functions from stdio.h. Other options would include Win32 API (not portable) and STL streams (overkill for our purpose).

This should be enough for the design, let's do the actual implementation. We will start with writing the function.

Implementing the LineToFile() function

The purpose of the LineToFile() function is to write text from a string into a text file. The two obvious arguments of this function would be the string and the text file. The return value should indicate success or failure of the function:

bool LineToFile(FILE* f, const std::string& s);

Simple enough. The only interesting part here is that the string argument is passed by a constant reference - this means that the function will not create its own copy of the string to work with (which saves time and memory), but will have read-only access to the original.

The problem with this declaration is that it will work for ANSI strings only. We will need to make a second function for Unicode strings. This is where function overloading comes in handy. With function overloading, we can create multiple functions with the same name, but different parameters. When compiling the code, the compiler will decide which function to call based on its parameters, and (provided that the functions have the same purpose) the programmer does not have to remember multiple function names for the same operation on different types of arguments. We will now create two overloads for the LineToFile() function, one for ANSI and one for Unicode:
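As a small illustration of overload resolution, separate from the library itself, the sketch below defines two functions with the same name that differ only in the string type they accept. BytesToWrite() and OverloadDemo() are hypothetical names invented for this example; they are not part of the library.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper, not part of the library: how many bytes the characters
// of a string occupy in memory. One overload per string type.
std::size_t BytesToWrite(const std::string& s)
{
    return s.size() * sizeof(char);
}

std::size_t BytesToWrite(const std::wstring& s)
{
    return s.size() * sizeof(wchar_t);
}

// The compiler picks the overload from the type of the argument.
void OverloadDemo()
{
    std::string  narrow = "abc";
    std::wstring wide   = L"abc";

    assert(BytesToWrite(narrow) == 3);                   // sizeof(char) is always 1
    assert(BytesToWrite(wide)   == 3 * sizeof(wchar_t)); // wchar_t size depends on the platform
}
```

The caller writes BytesToWrite(x) in both cases and never has to remember a second function name, which is the same convenience we want for LineToFile().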

bool LineToFile(FILE* f, const std::string& s);
bool LineToFile(FILE* f, const std::wstring& s);

This is better, now we can pass both an ANSI string and a Unicode string as the arguments and the appropriate function will get called.

Let's take a look at how the functions are implemented:

bool LineToFile(FILE* f, const std::string& s)
{
    // write the string to the file
    size_t n = fwrite(s.c_str(), sizeof(char), s.size(), f);

    // write line break to the file
    fputc('\n', f);

    // return whether the write operation was successful
    // (all s.size() characters must have been written)
    return (n == s.size());
}

bool LineToFile(FILE* f, const std::wstring& s)
{
    // write the string to the file
    size_t n = fwrite(s.c_str(), sizeof(wchar_t), s.size(), f);
            
    // write line break to the file
    fputwc(L'\n', f);
            
    // return whether the write operation was successful
    return (n == s.size());
}

This is fairly simple. We just take the string's buffer and copy its contents to the file using fwrite(), followed by the appropriate line break character (using fputc() for ANSI and fputwc() for Unicode). Note that the return value reflects only the fwrite() call; the result of writing the line break is not checked. The ANSI and Unicode versions differ only in the data types and functions they use, so from now on (for the sake of the length of this article) I will describe only one of the overloads. You can always download the source code to see the other.

The string parameter passed to the functions can be either an STL string, a zero-terminated string, or a string constant. This is possible because when you pass the function a zero-terminated string, a temporary std::string with the appropriate content is created and used for the function call. (This has certain drawbacks, which I discuss in the "Advanced" section of this article.)

Now let's implement the reading function...

Implementing the LineFromFile() function

Once again, the function will have two overloads - one for ANSI and the other for Unicode strings. The return value will once again indicate success or failure and the parameters will be the file to read from and a string variable to hold the resultant line of text.

bool LineFromFile(FILE* f, std::string& s);
bool LineFromFile(FILE* f, std::wstring& s);

Notice that this time the reference to the string variable is not constant. This means that the argument has to be a real std::string (or std::wstring) variable, not a temporary or a string constant, which is exactly what we want, because the function stores its result there. The function is implemented as follows:

bool LineFromFile(FILE* f, std::wstring& s)
{
    // reset string
    s.clear();

    // read one char at a time
    while (true)
    {
        // read char (wint_t can hold every wchar_t value plus WEOF)
        wint_t c = fgetwc(f);

        // check for EOF
        if (c == WEOF) return false;

        // check for EOL
        if (c == L'\n') return true;

        // append this character to the string
        s += static_cast<wchar_t>(c);
    }
}

This is also very straightforward. We read the input file one character at a time (using fgetwc(), or fgetc() in the ANSI version) and append each character to the string variable, except in the following cases:

If the character we have just read is a line break, the end of the current line was reached. The function returns true and the string variable contains the line of text we have just read.
If the read returns the end-of-file (EOF) value, the function returns false, which indicates that the end of file was reached. EOF is not a character stored in the file; it is a special value that fgetc() and fgetwc() return when there is nothing more to read (or when a read error occurs). The end of file should always be preceded by a line break, and it is, if the file was written using our LineToFile() functions. This means that when the function returns false, the string variable should be empty. However, if the function returns false and the end of file was not preceded by a line break, the string variable will contain everything that the function has read before the end of file.
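The two cases above can be sketched for the ANSI overload as well. This is my reading of what the narrow version looks like, mirroring the wide one shown earlier; CountLines() is a hypothetical helper added only to show how a caller might deal with a last line that has no line break.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Narrow-character counterpart of the function above (a sketch).
bool LineFromFile(std::FILE* f, std::string& s)
{
    s.clear();
    while (true)
    {
        // fgetc() returns an int so that EOF can be told apart from any char
        int c = std::fgetc(f);

        if (c == EOF)  return false;   // end of file (or read error)
        if (c == '\n') return true;    // end of the current line

        s += static_cast<char>(c);
    }
}

// Hypothetical helper: counts the lines left in the file, including a last
// line that is not terminated by a line break.
std::size_t CountLines(std::FILE* f)
{
    assert(f != nullptr);

    std::size_t count = 0;
    std::string line;
    while (LineFromFile(f, line))
        ++count;

    // false was returned, but some text was read before the end of file
    if (!line.empty())
        ++count;

    return count;
}
```

A caller that only loops while LineFromFile() returns true silently drops an unterminated last line, so checking whether the string is empty after the loop is worth considering when the file may come from another program.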

Using the functions

Using the functions described above is very easy. The only thing you need to do is open a file in binary mode (because we don't want any translation) using fopen() and call the functions with the right parameters. To write lines into a text file, and then read them back to memory, you would:

// prepare some strings
std::string s1 = "string 1";
const char* s2 = "string 2";

// open the file for writing
// (fopen() returns NULL on failure, so real code should check f before using it)
FILE* f = fopen("file.dat", "wb");

// write strings
LineToFile(f, s1);
LineToFile(f, s2);
LineToFile(f, "string 3");

// reopen the file for reading
fclose(f);
f = fopen("file.dat", "rb");

// read all lines
std::string strLine;
while (LineFromFile(f, strLine))
{
  // do whatever with strLine
}

// close the file
fclose(f);

Extending functionality

When you look at the LineToFile() and LineFromFile() functions, you will notice that they use two constants - a line break character and an EOF character. What happens if we change those constants to something else?

Consider, for example, having a space character instead of a line break and a line break character instead of the EOF. Now we can read not a whole text file, one line at a time, but one whole line one word at a time (and repeat it for the whole file if we want). And the only thing we had to do was change two constants!

Changing the value of the two constants can be very useful, as illustrated above, so why not give the user the option to change it? We will change the function declarations as follows:

bool LineToFile(FILE* f, const std::string&  s, int    eol =  '\n');
bool LineToFile(FILE* f, const std::wstring& s, wint_t eol = L'\n');

bool LineFromFile(FILE* f, std::string&  s, int    eol =  '\n', int    eof =  EOF);
bool LineFromFile(FILE* f, std::wstring& s, wint_t eol = L'\n', wint_t eof = WEOF);

If the assignment operators in the argument declarations look unfamiliar, these are called default arguments. When you give an argument a default value, the caller can choose whether or not to specify a value for it. If not, the default value is used. This way, all of the following function calls are valid:

LineToFile(f, myString);
LineToFile(f, myString, '\t');
LineFromFile(f, myString, ' ', '\n');

We have managed to maintain the simplicity of the function calls while giving the user more control when they want it.

The function bodies will not change very much: instead of the constant '\n', we will use the argument eol, and instead of the constant EOF (or WEOF), the argument eof. This should be clear enough, but if you need to see the actual function bodies, look at the source files.
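For readers who prefer to see it spelled out, here is a minimal sketch of the ANSI reading function with the two extra arguments, used to read a line one word at a time. It is not copied from the library's source files. The additional check for the real EOF is an assumption of this sketch, added so the loop cannot run forever when eof is replaced by another value, and WordsOfNextLine() is a hypothetical helper.

```cpp
#include <cassert>
#include <cstdio>
#include <string>
#include <vector>

// Sketch of the narrow reading function with configurable terminators.
// `eol` ends the current piece of text; `eof` ends the whole sequence.
bool LineFromFile(std::FILE* f, std::string& s, int eol = '\n', int eof = EOF)
{
    s.clear();
    while (true)
    {
        int c = std::fgetc(f);

        // Assumption of this sketch: the real end of file always stops the
        // loop, even when `eof` was replaced by another value.
        if (c == eof || c == EOF) return false;
        if (c == eol) return true;

        s += static_cast<char>(c);
    }
}

// Hypothetical helper: splits the next line of the file into words.
std::vector<std::string> WordsOfNextLine(std::FILE* f)
{
    assert(f != nullptr);

    std::vector<std::string> words;
    std::string word;

    // a space ends a word, a line break ends the sequence of words
    while (LineFromFile(f, word, ' ', '\n'))
        words.push_back(word);

    // the last word is ended by the line break (or end of file), not a space
    if (!word.empty())
        words.push_back(word);

    return words;
}
```

Calling WordsOfNextLine() repeatedly walks through the file line by line, and a call such as LineFromFile(f, s) with no extra arguments still behaves like the original version.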

Advanced

As I have mentioned before, there is one problem with using a char* or a string constant with the LineToFile() function. It works just fine, but...

A temporary std::string variable has to be created, with the appropriate content, which is then used in the function. The problem is, the creation of this temporary variable is not necessary and can possibly even fail, because the new variable has to (sometimes) create its own buffer to hold the string data.

To avoid the unnecessary allocation, we can write separate overloads of the LineToFile() function for zero-terminated strings. These functions would look like this:

bool LineToFile(FILE* f, const wchar_t* const s, 
                wint_t eol = L'\n', size_t length = static_cast<size_t>(-1))
{
    // check if the pointer is valid
    if (!s)
    {
        return false;
    }

    // calculate the string's length if the caller did not supply it
    // (size_t is unsigned, so the "not supplied" marker is compared as size_t)
    if (length == static_cast<size_t>(-1))
    {
        length = wcslen(s);
    }

    // write the string to the file
    size_t n = fwrite(s, sizeof(wchar_t), length, f);

    // write line break to the file
    fputwc(eol, f);

    // return whether the write operation was successful
    return (n == length);
}

Notice the length argument. This can be used if we don't want to waste time calculating the length of the string again, or if we want to write only a part of the string.
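To make the effect of the length argument concrete, the following sketch shows an ANSI counterpart of the function above writing a whole string and then only its first five characters. The names NO_LENGTH and ReadAll() are hypothetical and exist only for this example; the library itself may spell the "length not given" marker differently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <string>

// Hypothetical name for the "length not given" marker, i.e. size_t(-1).
const std::size_t NO_LENGTH = static_cast<std::size_t>(-1);

// Narrow-character sketch of the zero-terminated overload.
bool LineToFile(std::FILE* f, const char* s, int eol = '\n',
                std::size_t length = NO_LENGTH)
{
    // check if the pointer is valid
    if (!s) return false;

    // calculate the length only when the caller did not supply one
    if (length == NO_LENGTH) length = std::strlen(s);

    // write the characters, then the line break
    std::size_t n = std::fwrite(s, sizeof(char), length, f);
    std::fputc(eol, f);

    return n == length;
}

// Hypothetical helper for the demonstration: returns the whole file content.
std::string ReadAll(std::FILE* f)
{
    assert(f != nullptr);
    std::rewind(f);

    std::string content;
    for (int c = std::fgetc(f); c != EOF; c = std::fgetc(f))
        content += static_cast<char>(c);

    return content;
}
```

With this sketch, LineToFile(f, "hello world") stores the full text, while LineToFile(f, "hello world", '\n', 5) stores only "hello", each followed by a line break. The caller is responsible for passing a length that does not exceed the actual string.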

Now that we have separate functions for zero-terminated strings, we can abandon the bodies of the old LineToFile() functions and use them only as interfaces to our new functions:

inline bool LineToFile(FILE* f, const std::string& s, int eol = '\n')
{
    return LineToFile(f, s.c_str(), eol, s.size());
}

inline bool LineToFile(FILE* f, const std::wstring& s, wint_t eol = L'\n')
{
    return LineToFile(f, s.c_str(), eol, s.size());
}

Notice that I have declared those two functions inline. This is only a hint to the compiler, but in this case it will typically save a couple of assembly instructions in the generated code and make it run a little bit faster (by not creating a separate "function" just to call the other overload of LineToFile()).

Once again, we have improved the library without adding any undesired complexity.

Closing notes

This library doesn't attempt to produce files conforming to the Unicode standard. It is also (generally) not able to load Unicode files generated by a different program. The only purpose of this library is to provide a simple interface for storing and restoring Unicode and ANSI strings in files. Having said this, the Unicode files created with this library are readable in Notepad and probably other Unicode-aware text editors.

Credits

I would like to give credit to Mr. John R. Shaw for a couple of very good suggestions, mainly about the things described in the "Advanced" section of this article. See the discussion at the bottom of this page for details.

History

26 Oct 2005 - Major update, most of the article was rewritten from scratch.
25 Oct 2005 - LineToFile() now accepts a const std::string& instead of an std::string.
23 Oct 2005 - Added optional arguments for customizing EOL and EOF.
20 Oct 2005 - Added note about special characters in strings.
19 Oct 2005 - Initial release.