
Unification of Text and Binary File APIs

By , 26 Nov 2013

Introduction

For C/C++ programmers, the standard ways to write and read files are the C file API and C++ file streams. Because of their cryptic and unintuitive class and function names, I have always found C++ file streams too difficult to use, whereas the C file API is not type-safe. In this article, I introduce my own type-safe file library, which is based on the C file API and unifies the text file and binary file APIs in an almost seamless way. Some differences remain between the text and binary APIs where it makes absolutely no sense to make them similar. The library is meant for writing and reading structured data, that is, integers, booleans, floats, strings and so on. It can also write and read unstructured data (for example, C++ source files) through its lower-level classes, but that is not the focus of the library or of this article. This article is meant to teach readers how to easily access structured data in files.

For the .NET people who chanced upon this article: you can stop reading now. This article is about native C++, not .NET. I tried to write a C# version of my file library but failed, because C# does not allow the developer to keep a copy of a passed-by-reference POD argument after the method returns.

Text File Usage

In this section, we look at how to write and read text files. Let us begin by learning how to write an integer and a double to a text file.

using namespace Elmax;

xTextWriter writer;
std::wstring file = L"Unicode.txt";
if(writer.Open(file, FT_UNICODE, NEW))
{
    int i = 25698;
    double d = 1254.69;
    writer.Write(L"{0},{1}", i, d);
    writer.Close();
}

The code above tries to open a new Unicode file and, upon success, writes an integer and a double value and closes the file. The other text file types supported are ASCII, Big Endian Unicode and UTF-8. Though not shown in the code, the user should check the boolean return value of Write. xTextWriter delegates its file work to AsciiWriter, UnicodeWriter, BEUnicodeWriter and UTF8Writer. Likewise, xTextReader delegates its file work to AsciiReader, UnicodeReader, BEUnicodeReader and UTF8Reader. These file writers write the BOM on their first write, while the readers read the BOM automatically if it is present. For readers who are not familiar with the term, BOM is an acronym for byte order mark: a Unicode character used to signal the endianness (byte order) of a text file or stream. The BOM is optional, but it is generally accepted as good practice to write it. The reader might ask why I wrote a Unicode file library instead of picking one up from CodeProject. I decided to write my own Unicode file classes because most of those featured on CodeProject make use of the MFC CStdioFile class, which does not work on other platforms. Let us now look at how to read the same data we have just written.

using namespace Elmax;

xTextReader reader;
std::wstring file = L"Unicode.txt";
if(reader.Open(file))
{
    if(reader.IsEOF()==false)
    {
        int i2 = 0;
        double d2 = 0.0;

        StrtokStrategy strat(L",");
        reader.SetSplitStrategy(&strat);
        size_t totalRead = reader.ReadLine(i2, d2); // i2 = 25698 and d2 = 1254.69
    }
    reader.Close();
}

The reader opens the same file and sets its text split strategy. In this case, it is set to use strtok with a comma as the delimiter. Other split strategies include Boost and Regex, but it is highly recommended that users choose strtok because it is fast. We have seen how to write and read an integer and a double. Writing and reading strings is no different, but special care must be taken with delimiters which may appear inside the string. That means we must escape the string when writing and unescape it when reading. There is a function, ReplaceAll, in the StrUtil class which users can use to escape and unescape their strings. Note: the split strategy no longer applies for version 2.0.2, which uses streams internally: you need not set a split strategy, but you must call SetDelimiter instead.
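The escaping step can be sketched as follows. This is a minimal illustration and not the library's code: the ReplaceAll signature, the EscapeField and UnescapeField helpers and the "&amp;" / "&comma;" escape sequences are all assumptions made for the example.

```cpp
#include <cassert>
#include <string>

// Replace every occurrence of 'from' in 'text' with 'to'.
// Hypothetical helper; the real StrUtil::ReplaceAll may look different.
std::wstring ReplaceAll(std::wstring text, const std::wstring& from, const std::wstring& to)
{
    if (from.empty())
        return text;
    std::size_t pos = 0;
    while ((pos = text.find(from, pos)) != std::wstring::npos)
    {
        text.replace(pos, from.size(), to);
        pos += to.size(); // continue after the inserted text
    }
    return text;
}

// Escape the escape character first so that unescaping stays unambiguous.
std::wstring EscapeField(const std::wstring& s)
{
    return ReplaceAll(ReplaceAll(s, L"&", L"&amp;"), L",", L"&comma;");
}

// Undo EscapeField by reversing the replacements in the opposite order.
std::wstring UnescapeField(const std::wstring& s)
{
    return ReplaceAll(ReplaceAll(s, L"&comma;", L","), L"&amp;", L"&");
}

void DemoEscape()
{
    const std::wstring original = L"Smith, John & Sons";
    const std::wstring escaped = EscapeField(original);
    assert(escaped.find(L',') == std::wstring::npos); // safe to store in a comma-separated line
    assert(UnescapeField(escaped) == original);
}
```

Escaping the ampersand before the comma matters: without it, a string that already contains the text "&comma;" would be turned into a comma on the way back.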

There is an overloaded Open function which takes the Unicode file type as an additional parameter. However, it will always respect the BOM if it detects one. Only in the absence of a BOM will xTextReader open the file according to the Unicode file type which the user specified.
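To make the BOM check concrete, here is a small sketch of how the first bytes of a file can be classified. It is not the library's implementation: the Bom enum and the DetectBom and BomLength functions are names invented for this example, and UTF-32 marks are deliberately ignored.

```cpp
#include <cassert>
#include <cstddef>

// Encodings that can be told apart by their byte order mark (illustrative names).
enum class Bom { None, Utf8, Utf16LE, Utf16BE };

// Inspect the first bytes of a file and report which BOM, if any, is present.
Bom DetectBom(const unsigned char* data, std::size_t size)
{
    if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
        return Bom::Utf8;
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        return Bom::Utf16LE;
    if (size >= 2 && data[0] == 0xFE && data[1] == 0xFF)
        return Bom::Utf16BE;
    return Bom::None;
}

// Number of bytes to skip before the actual text starts.
std::size_t BomLength(Bom bom)
{
    switch (bom)
    {
    case Bom::Utf8:    return 3;
    case Bom::Utf16LE: return 2;
    case Bom::Utf16BE: return 2;
    default:           return 0;
    }
}

void DemoDetectBom()
{
    const unsigned char le[] = { 0xFF, 0xFE, 0x41, 0x00 }; // BOM followed by "A"
    assert(DetectBom(le, sizeof(le)) == Bom::Utf16LE);
    assert(BomLength(Bom::Utf16LE) == 2);
}
```

When DetectBom returns Bom::None, a reader would fall back to the file type passed by the caller, which matches the behaviour described above.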

Binary File Usage

Writing a binary file is similar to writing a text file, except that the user does not have to write delimiters between the data.

using namespace Elmax;

xBinaryWriter writer;
std::wstring file = L"Binary.bin";
if(writer.Open(file))
{
    int i = 25698;
    double d = 1254.69;
    writer.Write(i, d);
    writer.Close();
}

Write returns the number of values successfully written. As shown below, reading is almost the same as writing.

using namespace Elmax;

xBinaryReader reader;
std::wstring file = L"Binary.bin";
if(reader.Open(file))
{
    if(reader.IsEOF()==false)
    {
        int i2 = 0;
        double d2 = 0.0;
        size_t totalRead = reader.Read(i2, d2); // i2 = 25698 and d2 = 1254.69
    }
    reader.Close();
}

Writing a string in binary usually involves writing the string length first; before reading the string back, we need to read the length and allocate the array.

using namespace Elmax;

xBinaryWriter writer;
std::wstring file = GetTempPath(L"Binary.bin");
if(writer.Open(file))
{
    std::string str = "Coding Monkey";
    double d = 1254.69;
    writer.Write(str.size(), str, d);
    writer.Close();
}

xBinaryReader reader;
if(reader.Open(file))
{
    if(reader.IsEOF()==false)
    {
        size_t len = 0;
        double d2 = 0.0;
        StrArray arr;
        size_t totalRead = reader.Read(len);

        totalRead = reader.Read(arr.MakeArray(len), d2);

        std::string str2 = arr.GetPtr(); // str2 contains "Coding Monkey"
    }
    reader.Close();
}

We use StrArray to read a char array. We read its length first and use the length to allocate the array through the MakeArray method. It is also possible to read the length and make the array in the same call, using DeferredMake. Unlike MakeArray, DeferredMake does not allocate the array immediately: the allocation is delayed until it is the array's turn to be read from the file. DeferredMake captures the address of len, so when len is updated with the length read from the file, the array sees the new value as well. See below.

xBinaryReader reader;
if(reader.Open(file))
{
    if(reader.IsEOF()==false)
    {
        size_t len = 0;
        double d2 = 0.0;
        StrArray arr;
        size_t totalRead = reader.Read(len, arr.DeferredMake(len), d2);

        std::string str2 = arr.GetPtr(); // str2 contains "Coding Monkey"
    }
    reader.Close();
}
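The deferred allocation idea can be sketched with the plain C file API. This is only an illustration of the technique: DeferredBuffer, WriteString and ReadString are hypothetical names, and the sketch uses a fixed 32-bit length prefix rather than size_t.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Buffer whose size is not known until an earlier field has been read.
// It stores the address of the length variable and allocates on demand.
class DeferredBuffer
{
public:
    explicit DeferredBuffer(const std::uint32_t& len) : m_len(&len) {}

    // Allocate now, using whatever value the length variable currently holds.
    char* Make()
    {
        m_data.assign(*m_len, '\0');
        return m_data.data();
    }

    std::string ToString() const { return std::string(m_data.begin(), m_data.end()); }

private:
    const std::uint32_t* m_len;
    std::vector<char> m_data;
};

// Write a 32-bit length prefix followed by the characters of 'text'.
bool WriteString(std::FILE* fp, const std::string& text)
{
    const std::uint32_t len = static_cast<std::uint32_t>(text.size());
    return std::fwrite(&len, sizeof(len), 1, fp) == 1
        && std::fwrite(text.data(), 1, len, fp) == len;
}

// Read the length, then let the buffer size itself just before it is filled.
bool ReadString(std::FILE* fp, std::string& out)
{
    std::uint32_t len = 0;
    DeferredBuffer buf(len);          // len is still 0 at this point
    if (std::fread(&len, sizeof(len), 1, fp) != 1)
        return false;
    char* dest = buf.Make();          // the allocation sees the updated len
    if (std::fread(dest, 1, len, fp) != len)
        return false;
    out = buf.ToString();
    return true;
}

void DemoDeferred()
{
    std::FILE* fp = std::tmpfile();   // scratch file, removed on close
    assert(fp != nullptr);
    const bool written = WriteString(fp, "Coding Monkey");
    std::rewind(fp);
    std::string text;
    const bool read = ReadString(fp, text);
    assert(written && read && text == "Coding Monkey");
    std::fclose(fp);
}
```

The buffer is constructed before the length is known, which is what allows it to appear in the same argument list as the length in the library's single Read call.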

It is possible to write a structure as a flat array. This is, however, not advisable, as different platforms may pad an unknown number of bytes between structure members for performance reasons. For portability, it is recommended to write out every structure member rather than writing the structure as a flat array. If you still want to do it, then specify no padding for your structure (below).

#pragma pack(push, 1) // exact fit - no padding
struct MyStruct
{
  char b; 
  int a; 
};
#pragma pack(pop)
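The effect of padding, and the member-by-member alternative, can be seen in a short sketch. The Padded and Packed structs and the WriteMembers function are made up for this example; note that the sizes and byte order of the members themselves still depend on the platform.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

struct Padded
{
    char b;
    int a;
};

#pragma pack(push, 1) // exact fit - no padding
struct Packed
{
    char b;
    int a;
};
#pragma pack(pop)

// Serialize member by member: the layout in the buffer no longer depends on
// how the compiler pads the struct in memory.
std::size_t WriteMembers(const Padded& s, unsigned char* out)
{
    std::size_t n = 0;
    std::memcpy(out + n, &s.b, sizeof(s.b)); n += sizeof(s.b);
    std::memcpy(out + n, &s.a, sizeof(s.a)); n += sizeof(s.a);
    return n;
}

void DemoPadding()
{
    // With 1-byte packing the struct is exactly the sum of its members.
    assert(sizeof(Packed) == sizeof(char) + sizeof(int));
    // Without packing the compiler is free to insert padding after 'b'.
    assert(sizeof(Padded) >= sizeof(Packed));

    const Padded s = { 'x', 25698 };
    unsigned char buf[sizeof(char) + sizeof(int)];
    const std::size_t used = WriteMembers(s, buf);
    assert(used == sizeof(buf));
}
```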

WStrArray is available to read wchar_t arrays. However, writing a std::wstring and reading it back with WStrArray is not recommended if you want to keep your file format portable across different OSes, because the size of wchar_t on Windows differs from its size on Linux and Mac OSX. We will explore this issue in a later section. Note: the text file API does not have this problem because conversions are in place to handle it automatically. The workaround, if the user needs to write Unicode strings, is to write UTF-8 strings. Another option is to use the BaseArray class to write 16-bit strings. There are 2 types of 16-bit encoding for Unicode, namely UCS-2 and UTF-16. A UCS-2 unit is always 16 bits, so UCS-2 can represent only the code points up to U+FFFF (the Basic Multilingual Plane). UTF-16 can encode all Unicode code points, but a code point may take one or two 16-bit words. For some use cases, UCS-2 is sufficient to store the text of the chosen language. UTF-16 is able to store everything that is Unicode, but the tradeoff is the conversion time and the need to take note of the potential difference in text length before and after conversion.

xBinaryWriter and xBinaryReader also provide Seek and GetCurrPos for file seeking (a common operation in binary file parsing).

Code Design

xTextWriter and xTextReader make use of DataType and DataTypeRef respectively to do the conversion between data types and strings. Basically, this library depends on the implicit conversion of Plain Old Data (POD) to DataType objects to work. xTextWriter has many overloaded Write and WriteLine functions which differ in the number of DataType parameters. WriteLine simply adds a linefeed (LF) after writing the string. The Write below has 5 DataType parameters.

bool xTextWriter::Write( const wchar_t* fmt, DataType D1, DataType D2, DataType D3, DataType D4, DataType D5 )
{
    if(pWriter!=NULL)
    {
        std::wstring str = StrUtilRef::Format(fmt, D1, D2, D3, D4, D5);
        return pWriter->Write(str);
    }

    return false;
}

DataType consists of many overloaded constructors which convert the Plain Old Data (POD) to a string and store it in a string member (m_str).

namespace Elmax
{
class DataType
{
public:
    ~DataType(void);

    DataType( int i );

    DataType( unsigned int ui );

    DataType( const ELMAX_INT64& i64 );

    DataType( const unsigned ELMAX_INT64& ui64 );

    DataType( float f );

    DataType( const double& d );

    DataType( const std::string& s );

    DataType( const std::wstring& ws );

    DataType( const char* pc );

    DataType( const wchar_t* pwc );

    DataType( char c );

    DataType( unsigned char c );

    DataType( wchar_t wc );

    std::wstring& ToString() { return m_str; }

protected:
    std::wstring m_str;
};
} // namespace Elmax
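How the implicit conversion carries a fixed-arity Write can be sketched in a few lines. TextValue, FormatValues and the two-argument Format below are hypothetical stand-ins for DataType and StrUtilRef::Format, reduced to a handful of types; the real classes are more complete.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Stand-in for DataType: every converting constructor turns its argument
// into text straight away.
class TextValue
{
public:
    TextValue(int i) : m_str(std::to_wstring(i)) {}
    TextValue(unsigned int ui) : m_str(std::to_wstring(ui)) {}
    TextValue(const wchar_t* s) : m_str(s) {}
    TextValue(const std::wstring& s) : m_str(s) {}

    const std::wstring& ToString() const { return m_str; }

private:
    std::wstring m_str;
};

// Replace "{0}", "{1}", ... in fmt with the matching value.
std::wstring FormatValues(std::wstring fmt, const std::vector<TextValue>& values)
{
    for (std::size_t i = 0; i < values.size(); ++i)
    {
        const std::wstring token = L"{" + std::to_wstring(i) + L"}";
        std::size_t pos = 0;
        while ((pos = fmt.find(token, pos)) != std::wstring::npos)
        {
            fmt.replace(pos, token.size(), values[i].ToString());
            pos += values[i].ToString().size();
        }
    }
    return fmt;
}

// Two-argument overload in the pre-C++11 style: callers pass plain values and
// the compiler builds the TextValue temporaries for them.
std::wstring Format(const wchar_t* fmt, TextValue v1, TextValue v2)
{
    std::vector<TextValue> values;
    values.push_back(v1);
    values.push_back(v2);
    return FormatValues(fmt, values);
}

void DemoFormat()
{
    assert(Format(L"{0},{1}", 25698, L"abc") == L"25698,abc");
}
```

Each extra parameter count needs another overload of Format, which is the repetition that the variadic template version removes.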

Here is the C++11 variadic template version of Write, which supports an arbitrary number of arguments. You need to download and install the Visual C++ Compiler November 2012 CTP to compile the code. Note: there is much less code when we do not have to write all those overloaded functions.

bool Write( const wchar_t* str )
{
    if(pWriter!=nullptr)
    {
        return pWriter->Write(std::wstring(str));
    }

    return false;
}

template<typename... Args>
bool Write( const wchar_t* fmt, Args&... args )
{
    std::wstring str = StrUtilRef::Format(std::wstring(fmt), 0, args...);

    if(pWriter!=nullptr)
    {
        return pWriter->Write(str);
    }

    return false;
}
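The pack expansion that replaces the overloads can be illustrated on its own. The sketch below is an assumption-laden simplification: ToText and FormatAll are invented names, conversion goes through std::wostringstream rather than DataType, and only the first occurrence of each placeholder is replaced.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Convert one value to text with a wide string stream.
template<typename T>
std::wstring ToText(const T& value)
{
    std::wostringstream out;
    out << value;
    return out.str();
}

// Variadic front end: the pack expansion calls ToText once per argument,
// so a single template covers every argument count.
template<typename... Args>
std::wstring FormatAll(std::wstring fmt, const Args&... args)
{
    const std::vector<std::wstring> parts = { ToText(args)... };
    for (std::size_t i = 0; i < parts.size(); ++i)
    {
        const std::wstring token = L"{" + std::to_wstring(i) + L"}";
        const std::size_t pos = fmt.find(token);
        if (pos != std::wstring::npos)
            fmt.replace(pos, token.size(), parts[i]);
    }
    return fmt;
}

void DemoFormatAll()
{
    assert(FormatAll(L"{0},{1}", 25698, 1254.69) == L"25698,1254.69");
}
```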

As mentioned earlier, xTextReader makes use of DataTypeRef to do the conversion from string to Plain Old Data (POD). xTextReader has 10 overloaded Read and ReadLine functions which differ only in the number of DataTypeRef parameters. The ReadLine shown below has 5 DataTypeRef parameters.

size_t xTextReader::ReadLine( DataTypeRef D1, DataTypeRef D2, DataTypeRef D3, DataTypeRef D4,
    DataTypeRef D5 )
{
    if(pReader!=NULL)
    {
        std::wstring text;
        bool b = pReader->ReadLine(text);

        if(b)
        {
            StrUtilRef strUtil;
            strUtil.SetSplitStrategy(m_pSplitStrategy);

            return strUtil.Split(text.c_str(), D1, D2, D3, D4, D5);
        }
    }

    return 0;
}

size_t StrUtilRef::Split( const std::wstring& StrToExtract, DataTypeRef& D1, DataTypeRef& D2, DataTypeRef& D3, 
    DataTypeRef& D4, DataTypeRef& D5 )
{
    std::vector<DataTypeRef*> vecDTR;
    vecDTR.push_back(&D1);
    vecDTR.push_back(&D2);
    vecDTR.push_back(&D3);
    vecDTR.push_back(&D4);
    vecDTR.push_back(&D5);

    assert( m_pSplitStrategy );
    return m_pSplitStrategy->Extract( StrToExtract, vecDTR );
}

size_t StrtokStrategy::Extract( 
    const std::wstring& StrToExtract, 
    std::vector<Elmax::DataTypeRef*> vecDTR )
{
    std::vector<std::wstring> vecSplit;
    const size_t size = StrToExtract.size()+1;
    wchar_t* pszToExtract = new wchar_t[size];
    wmemset( pszToExtract, 0, size );
    Wcscpy( pszToExtract, StrToExtract.c_str(), size );

    wchar_t *pszContext = 0;
    wchar_t *pszSplit = 0;
    pszSplit = wcstok( pszToExtract, m_sDelimit.c_str() );

    while( NULL != pszSplit )
    {
        size_t len = wcslen(pszSplit);
        if(pszSplit[len-1]==65535&&vecSplit.size()==vecDTR.size()-1) // bug workaround: wcstok_s/wcstok will put 65535 at the back of last string.
            pszSplit[len-1] = L'\0';

        vecSplit.push_back(std::wstring( pszSplit ) );

        pszSplit = wcstok( NULL, m_sDelimit.c_str() );
    }

    delete [] pszToExtract;

    size_t fail = 0;
    for( size_t i=0; i<vecDTR.size(); ++i )
    {
        if( i < vecSplit.size() )
        {
            if( false == vecDTR[i]->ConvStrToType( vecSplit[i] ) )
                ++fail;
        }
        else
            break;
    }

    return vecSplit.size()-fail;
}

DataTypeRef keeps a big union to store the address of each POD parameter as the destination for the result.

namespace Elmax
{
class DataTypeRef
{
public:
    ~DataTypeRef(void);

    union UNIONPTR
    {
        int* pi;
        unsigned int* pui;
        short* psi;
        unsigned short* pusi;
        ELMAX_INT64* pi64;
        unsigned ELMAX_INT64* pui64;
        float* pf;
        double* pd;
        std::string* ps;
        std::wstring* pws;
        char* pc;
        unsigned char* puc;
        wchar_t* pwc;
    };

    enum DTR_TYPE
    {
        DTR_INT,
        DTR_UINT,
        DTR_SHORT,
        DTR_USHORT,
        DTR_INT64,
        DTR_UINT64,
        DTR_FLOAT,
        DTR_DOUBLE,
        DTR_STR,
        DTR_WSTR,
        DTR_CHAR,
        DTR_UCHAR,
        DTR_WCHAR
    };

    DataTypeRef( int& i )                    { m_ptr.pi = &i;       m_type = DTR_INT;   }

    DataTypeRef( unsigned int& ui )          { m_ptr.pui = &ui;     m_type = DTR_UINT;  }

    DataTypeRef( short& si )                 { m_ptr.psi = &si;     m_type = DTR_SHORT; }

    DataTypeRef( unsigned short& usi )       { m_ptr.pusi = &usi;   m_type = DTR_USHORT;}

    DataTypeRef( ELMAX_INT64& i64 )          { m_ptr.pi64 = &i64;   m_type = DTR_INT64; }

    DataTypeRef( unsigned ELMAX_INT64& ui64 ){ m_ptr.pui64 = &ui64; m_type = DTR_UINT64;}

    DataTypeRef( float& f )                  { m_ptr.pf = &f;       m_type = DTR_FLOAT; }

    DataTypeRef( double& d )                 { m_ptr.pd = &d;       m_type = DTR_DOUBLE;}

    DataTypeRef( std::string& s )            { m_ptr.ps = &s;       m_type = DTR_STR;   }

    DataTypeRef( std::wstring& ws )          { m_ptr.pws = &ws;     m_type = DTR_WSTR;  }

    DataTypeRef( char& c )                   { m_ptr.pc = &c;       m_type = DTR_CHAR;  }

    DataTypeRef( unsigned char& uc )         { m_ptr.puc = &uc;     m_type = DTR_UCHAR; }

    DataTypeRef( wchar_t& wc )               { m_ptr.pwc = &wc;     m_type = DTR_WCHAR; }

    bool ConvStrToType( const std::string& Str );

    bool ConvStrToType( const std::wstring& Str );

    DTR_TYPE m_type;

    UNIONPTR m_ptr;
};
} // namespace Elmax
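The pointer-plus-tag idea is easier to follow in a cut-down form. FieldRef and ReadFields below are hypothetical names that cover only three types and parse with std::istringstream; they illustrate the pattern and are not the library's conversion code.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Cut-down DataTypeRef: remembers where the result goes and which type it has.
class FieldRef
{
public:
    FieldRef(int& i)         : m_type(Int)    { m_ptr.pi = &i; }
    FieldRef(double& d)      : m_type(Double) { m_ptr.pd = &d; }
    FieldRef(std::string& s) : m_type(Str)    { m_ptr.ps = &s; }

    // Convert the text and store the result through the saved pointer.
    bool Assign(const std::string& text)
    {
        std::istringstream in(text);
        switch (m_type)
        {
        case Int:    return static_cast<bool>(in >> *m_ptr.pi);
        case Double: return static_cast<bool>(in >> *m_ptr.pd);
        case Str:    *m_ptr.ps = text; return true;
        }
        return false;
    }

private:
    enum Type { Int, Double, Str };
    union Ptr { int* pi; double* pd; std::string* ps; };

    Type m_type;
    Ptr m_ptr;
};

// Split 'line' on 'delim' and hand each piece to the matching destination.
// Returns the number of fields converted successfully.
std::size_t ReadFields(const std::string& line, char delim, std::vector<FieldRef> dests)
{
    std::istringstream in(line);
    std::string piece;
    std::size_t ok = 0;
    for (std::size_t i = 0; i < dests.size() && std::getline(in, piece, delim); ++i)
    {
        if (dests[i].Assign(piece))
            ++ok;
    }
    return ok;
}

void DemoReadFields()
{
    int i = 0;
    std::string s;
    const std::size_t n = ReadFields("25698,abc", ',', { i, s });
    assert(n == 2 && i == 25698 && s == "abc");
}
```

Because the constructors take non-const references, the caller's variables can be listed directly and are filled in place.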

Below is the C++11 variadic template version, which calls ReadArg. The first ReadArg is the base function which terminates the recursion of its variadic sibling. Please note that this is not recursion in the traditional sense, because the function is not actually calling itself: it is calling a different function with the same name but a different number of arguments.

void ReadArg(std::vector<DataTypeRef*>& vec)
{
}

template<typename T, typename... Args>
void ReadArg(std::vector<DataTypeRef*>& vec, T& t, Args&... args)
{
    vec.push_back(new DataTypeRef(t));
    ReadArg(vec, args...);
}

template<typename... Args>
size_t Read( size_t len, Args&... args )
{
    if(pReader!=nullptr)
    {
        std::wstring text;
        bool b = pReader->Read(text, len);

        if(b)
        {
            std::vector<DataTypeRef*> vec;
            ReadArg(vec, args...);

            size_t ret = m_pSplitStrategy->Extract(text, vec);

            for(size_t i=0; i<vec.size(); ++i)
            {
                delete vec[i];
            }

            vec.clear();

            return ret;
        }
    }

    return 0;
}
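The peel-off-one-argument pattern can also be written without new and delete. The following sketch swaps in std::function setters stored by value instead of heap-allocated DataTypeRef objects; Setter, Collect and ReadLineInto are names made up for the example.

```cpp
#include <cassert>
#include <functional>
#include <sstream>
#include <string>
#include <vector>

typedef std::function<bool(const std::string&)> Setter;

// Base case: no arguments left, so the recursion ends here.
inline void Collect(std::vector<Setter>&)
{
}

// Peel off the first argument, store a setter for it, then handle the rest.
template<typename T, typename... Args>
void Collect(std::vector<Setter>& setters, T& first, Args&... rest)
{
    setters.push_back([&first](const std::string& text) -> bool {
        std::istringstream in(text);
        return static_cast<bool>(in >> first);
    });
    Collect(setters, rest...);
}

// Split 'line' on 'delim' and assign the pieces to the arguments in order.
// Returns the number of arguments that were converted successfully.
template<typename... Args>
std::size_t ReadLineInto(const std::string& line, char delim, Args&... args)
{
    std::vector<Setter> setters;
    Collect(setters, args...);

    std::istringstream in(line);
    std::string piece;
    std::size_t ok = 0;
    for (std::size_t i = 0; i < setters.size() && std::getline(in, piece, delim); ++i)
    {
        if (setters[i](piece))
            ++ok;
    }
    return ok;
}

void DemoReadLineInto()
{
    int i = 0;
    double d = 0.0;
    const std::size_t n = ReadLineInto("25698,2.5", ',', i, d);
    assert(n == 2 && i == 25698 && d == 2.5);
}
```

The setters capture references to the caller's variables, so they must not outlive the call; here they are local to ReadLineInto, which keeps that safe.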

xBinaryWriter makes use of BinaryTypeRef. The overloaded Write functions differ in the number of parameters. xBinaryWriter has no WriteLine function. The Write function shown below has 2 BinaryTypeRef parameters.

size_t xBinaryWriter::Write( BinaryTypeRef D1, BinaryTypeRef D2 )
{
    size_t totalWritten = 0;
    if(fp!=NULL)
    {
        if(D1.m_type != BinaryTypeRef::DTR_STR && D1.m_type != BinaryTypeRef::DTR_WSTR && D1.m_type != BinaryTypeRef::DTR_BASEARRAY)
        {
            size_t len = fwrite(D1.GetAddress(), D1.size, 1, fp);
            if(len==1)
                ++totalWritten;
        }
        else
        {
            size_t len = fwrite(D1.GetAddress(), D1.elementSize, D1.arraySize, fp);
            if(len==D1.arraySize)
                ++totalWritten;
        }

        if(D2.m_type != BinaryTypeRef::DTR_STR && D2.m_type != BinaryTypeRef::DTR_WSTR && D2.m_type != BinaryTypeRef::DTR_BASEARRAY)
        {
            size_t len = fwrite(D2.GetAddress(), D2.size, 1, fp);
            if(len==1)
                ++totalWritten;
        }
        else
        {
            size_t len = fwrite(D2.GetAddress(), D2.elementSize, D2.arraySize, fp);
            if(len==D2.arraySize)
                ++totalWritten;
        }

    }

    if(totalWritten != 2)
    {
        errNum = ELMAX_WRITE_ERROR;
        err = StrUtil::Format(L"{0}: Less than 2 elements are written! ({1} elements written)", GetErrorMsg(errNum), totalWritten);
        if(enableException)
            throw std::runtime_error(StrUtil::ConvToString(err));
    }

    return totalWritten;
}

BinaryTypeRef keeps a union to store the address of the POD. No conversion to text is necessary: the POD is written as is into the binary file.

namespace Elmax
{
class BinaryTypeRef
{
public:
    ~BinaryTypeRef(void);

    union UNIONPTR
    {
        const int* pi;
        const unsigned int* pui;
        const short* psi;
        const unsigned short* pusi;
        const ELMAX_INT64* pi64;
        const unsigned ELMAX_INT64* pui64;
        const float* pf;
        const double* pd;
        std::string* ps;
        const std::wstring* pws;
        const char* pc;
        const unsigned char* puc;
        const wchar_t* pwc;
        const char* arr;
    };

    enum DTR_TYPE
    {
        DTR_INT,
        DTR_UINT,
        DTR_SHORT,
        DTR_USHORT,
        DTR_INT64,
        DTR_UINT64,
        DTR_FLOAT,
        DTR_DOUBLE,
        DTR_STR,
        DTR_WSTR,
        DTR_CHAR,
        DTR_UCHAR,
        DTR_WCHAR,
        DTR_BASEARRAY
    };

    BinaryTypeRef( const int& i )                     { m_ptr.pi = &i; m_type = DTR_INT; size=sizeof(i); }

    BinaryTypeRef( const unsigned int& ui )           { m_ptr.pui = &ui; m_type = DTR_UINT; size=sizeof(ui); }

    BinaryTypeRef( const short& si )                  { m_ptr.psi = &si; m_type = DTR_SHORT; size=sizeof(si); }

    BinaryTypeRef( const unsigned short& usi )        { m_ptr.pusi = &usi; m_type = DTR_USHORT; size=sizeof(usi); }

    BinaryTypeRef( const ELMAX_INT64& i64 )           { m_ptr.pi64 = &i64; m_type = DTR_INT64; size=sizeof(i64); }

    BinaryTypeRef( const unsigned ELMAX_INT64& ui64 ) { m_ptr.pui64 = &ui64; m_type = DTR_UINT64; size=sizeof(ui64); }

    BinaryTypeRef( const float& f )                   { m_ptr.pf = &f; m_type = DTR_FLOAT; size=sizeof(f); }

    BinaryTypeRef( const double& d )                  { m_ptr.pd = &d; m_type = DTR_DOUBLE; size=sizeof(d); }

    BinaryTypeRef( std::string& s )                   { m_ptr.ps = &s; m_type = DTR_STR; elementSize=sizeof(char);size=s.length(); 
                                                            arraySize=s.length();}

    BinaryTypeRef( const std::wstring& ws )           { m_ptr.pws = &ws; m_type = DTR_WSTR; elementSize=sizeof(wchar_t);
                                                            size=ws.length()*sizeof(wchar_t); arraySize=ws.length();}

    BinaryTypeRef( const char& c )                    { m_ptr.pc = &c; m_type = DTR_CHAR; size=sizeof(c); }

    BinaryTypeRef( const unsigned char& uc )          { m_ptr.puc = &uc; m_type = DTR_UCHAR; size=sizeof(uc); }

    BinaryTypeRef( const wchar_t& wc )                { m_ptr.pwc = &wc; m_type = DTR_WCHAR; size=sizeof(wc); }

    BinaryTypeRef( const BaseArray& arr )             { m_ptr.arr = arr.GetPtr(); m_type = DTR_BASEARRAY; 
                                                            size=arr.GetTotalSize(); elementSize=arr.GetElementSize(); 
                                                            arraySize=arr.GetArraySize(); }
    char* GetAddress();

    DTR_TYPE m_type;

    UNIONPTR m_ptr;

    size_t size;

    size_t elementSize;

    size_t arraySize;
};
} // namespace Elmax

This is the C++11 variadic template version of the binary Write. The first Write is the base function which stops the recursive calls. It also makes use of the BinaryTypeRef class.

size_t Write()
{
    return 0;
}

template<typename T, typename... Args>
size_t Write( T t, Args... args )
{
    BinaryTypeRef dt(t);

    size_t totalWritten = 0;
    if(fp!=nullptr)
    {
        if(dt.m_type != BinaryTypeRef::DTR_STR && dt.m_type != BinaryTypeRef::DTR_WSTR && dt.m_type != BinaryTypeRef::DTR_BASEARRAY)
        {
            size_t len = fwrite(dt.GetAddress(), dt.size, 1, fp);
            if(len==1)
                ++totalWritten;
        }
        else
        {
            size_t len = fwrite(dt.GetAddress(), dt.elementSize, dt.arraySize, fp);
            if(len==dt.arraySize)
                ++totalWritten;
        }

    }

    return totalWritten + Write(args...);
}

Lastly, we come to xBinaryReader. xBinaryReader makes use of BinaryTypeReadRef to do data conversion. Like xTextReader, xBinaryReader has overloaded Read functions to do its work, but it has no ReadLine.

size_t xBinaryReader::Read( BinaryTypeReadRef D1, BinaryTypeReadRef D2 )
{
    size_t totalRead = 0;
    if(fp!=NULL)
    {
        if(D1.m_type != BinaryTypeReadRef::DTR_STRARRAY && D1.m_type != BinaryTypeReadRef::DTR_WSTRARRAY && D1.m_type != BinaryTypeReadRef::DTR_BASEARRAY)
        {
            size_t cnt = fread(D1.GetAddress(), D1.size, 1, fp);
            if(cnt==1)
                ++totalRead;
        }
        else
        {
            D1.DeferredMake();
            size_t cnt = fread(D1.GetAddress(), D1.elementSize, D1.arraySize, fp);
            if(cnt == D1.arraySize)
                ++totalRead;
        }

        if(D2.m_type != BinaryTypeReadRef::DTR_STRARRAY && D2.m_type != BinaryTypeReadRef::DTR_WSTRARRAY && D2.m_type != BinaryTypeReadRef::DTR_BASEARRAY)
        {
            size_t cnt = fread(D2.GetAddress(), D2.size, 1, fp);
            if(cnt==1)
                ++totalRead;
        }
        else
        {
            D2.DeferredMake();
            size_t cnt = fread(D2.GetAddress(), D2.elementSize, D2.arraySize, fp);
            if(cnt==D2.arraySize)
                ++totalRead;
        }

    }

    if(totalRead != 2)
    {
        errNum = ELMAX_READ_ERROR;
        err = StrUtil::Format(L"{0}: Less than 2 elements are read! ({1} elements read)", GetErrorMsg(errNum), totalRead);
        if(enableException)
            throw std::runtime_error(StrUtil::ConvToString(err));
    }

    return totalRead;
}

For simplicity, I do not show the BinaryTypeReadRef class here: the code is quite complicated because it supports DeferredMake of the array class.

This is the C++11 variadic template version of the binary Read. As with the binary Write before, the first function is the base function which ends the recursive calls. Like the previous Read, it too makes use of BinaryTypeReadRef.

size_t Read()
{
    return 0;
}

template<typename T, typename... Args>
size_t Read( T& t, Args&... args )
{
    BinaryTypeReadRef dt(t);
    size_t totalRead = 0;
    if(fp!=nullptr)
    {
        if(dt.m_type != BinaryTypeReadRef::DTR_STRARRAY && dt.m_type != BinaryTypeReadRef::DTR_WSTRARRAY && dt.m_type != BinaryTypeReadRef::DTR_BASEARRAY)
        {
            size_t cnt = fread(dt.GetAddress(), dt.size, 1, fp);
            if(cnt==1)
                ++totalRead;
        }
        else
        {
            dt.DeferredMake();
            size_t cnt = fread(dt.GetAddress(), dt.elementSize, dt.arraySize, fp);
            if(cnt == dt.arraySize)
                ++totalRead;
        }

    }

    return totalRead + Read(args...);
}
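The variadic binary Write and Read can be tried out in isolation with a stripped-down pair of free functions. This is a sketch under simplifying assumptions: WriteAll and ReadAll are hypothetical names, they take the FILE pointer as a parameter, and they handle only trivially copyable values (no strings or arrays).

```cpp
#include <cassert>
#include <cstdio>
#include <type_traits>

// Base cases: nothing left to write or read.
inline std::size_t WriteAll(std::FILE*) { return 0; }
inline std::size_t ReadAll(std::FILE*) { return 0; }

// Write the first value as raw bytes, then recurse on the remaining ones.
// Returns how many values were written completely.
template<typename T, typename... Rest>
std::size_t WriteAll(std::FILE* fp, const T& first, const Rest&... rest)
{
    static_assert(std::is_trivially_copyable<T>::value,
                  "only plain data can be written byte for byte");
    const std::size_t written =
        (fp != nullptr && std::fwrite(&first, sizeof(T), 1, fp) == 1) ? 1 : 0;
    return written + WriteAll(fp, rest...);
}

// Read the first value as raw bytes, then recurse on the remaining ones.
// Returns how many values were read completely.
template<typename T, typename... Rest>
std::size_t ReadAll(std::FILE* fp, T& first, Rest&... rest)
{
    static_assert(std::is_trivially_copyable<T>::value,
                  "only plain data can be read byte for byte");
    const std::size_t got =
        (fp != nullptr && std::fread(&first, sizeof(T), 1, fp) == 1) ? 1 : 0;
    return got + ReadAll(fp, rest...);
}

void DemoRoundTrip()
{
    std::FILE* fp = std::tmpfile();   // scratch file, removed on close
    assert(fp != nullptr);

    const int i = 25698;
    const double d = 1254.69;
    const std::size_t written = WriteAll(fp, i, d);
    assert(written == 2);

    std::rewind(fp);
    int i2 = 0;
    double d2 = 0.0;
    const std::size_t read = ReadAll(fp, i2, d2);
    assert(read == 2 && i2 == i && d2 == d);
    std::fclose(fp);
}
```

As in the listings above, a failed value does not stop the remaining ones from being attempted, so the caller should compare the returned count with the number of arguments.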

Porting to Linux

When I was writing the Windows code, I took special care to separate the Windows and non-Windows code with a _MICROSOFT macro. The _WIN32 macro is not used because MinGW defines it as well. The main difference between the Windows and non-Windows code at that point was that, on Windows, a linefeed ("\n") is converted to a carriage return and linefeed pair ("\r\n") during file writing, and the reverse is applied during file reading; on non-Windows platforms, a linefeed ("\n") remains a linefeed and no conversion is done.
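The linefeed conversion described above amounts to two small string transformations. ToCrLf and FromCrLf are illustrative names, not functions of the library, and the sketch assumes the input to ToCrLf contains bare linefeeds only.

```cpp
#include <cassert>
#include <string>

// Expand every "\n" to "\r\n", as done when writing on Windows.
std::string ToCrLf(const std::string& text)
{
    std::string out;
    out.reserve(text.size());
    for (std::size_t i = 0; i < text.size(); ++i)
    {
        if (text[i] == '\n')
            out += '\r';
        out += text[i];
    }
    return out;
}

// Collapse every "\r\n" back to "\n", as done when reading on Windows.
std::string FromCrLf(const std::string& text)
{
    std::string out;
    out.reserve(text.size());
    for (std::size_t i = 0; i < text.size(); ++i)
    {
        if (text[i] == '\r' && i + 1 < text.size() && text[i + 1] == '\n')
            continue; // drop the CR, keep the LF that follows
        out += text[i];
    }
    return out;
}

void DemoNewlines()
{
    const std::string unixText = "one\ntwo\n";
    const std::string winText = ToCrLf(unixText);
    assert(winText == "one\r\ntwo\r\n");
    assert(FromCrLf(winText) == unixText);
}
```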

I downloaded and installed Orwell Dev-C++ to test my code with MinGW and GCC on Windows. Orwell Dev-C++ is a continuation of the (currently inactive) Bloodshed Dev-C++ and comes bundled with MinGW and a fairly recent GCC 4.6.x. During compilation, Orwell Dev-C++ complained about unavailable secure C functions (typically those whose names end with _s), such as _itow_s. So I changed them to the non-secure versions for the non-Windows implementation, while the Windows implementation still uses the secure versions. Dev-C++ also complained that it could not find a std::exception constructor which takes a string. It turned out that std::exception is meant to be derived from and not used directly. I changed the uses of std::exception to proper exception types, such as logic_error, runtime_error and so on. With these changes done, I assumed most of my Linux work was done. I estimated that, excluding the time to learn G++ and write the makefile, it would take me at most 1 hour to get the code working. That was when I found out I had grossly underestimated the time it would take me to resolve the errors on Ubuntu Linux 12.04.

After I converted the Orwell Dev-C++ makefile to work on Ubuntu Linux and GCC 4.6.3, the first error G++ reported was that it did not understand the include paths. So I changed the backslashes to forward slashes.

#include "..\\..\\Common\\Common.h"

The above path is changed to the one below.

#include "../../Common/Common.h"

This was an easy change, though I had to update most of the 66 source files. The next G++ complaint was that it could not find the data conversion functions (typically those whose names start with an underscore), such as _ultow. It turned out that the Microsoft conversion functions were not standard after all. I had to use stringstream to replace _ultow and its cousins. All compilation errors were resolved at this point, and I ran the unit tests. They crashed at the first Unicode test! Upon some investigation, I discovered, to my dismay, that the size of wchar_t on Linux and Mac OSX is 4 bytes instead of 2 bytes! That meant all the wchar_t related functions did not work correctly on Linux and Mac OSX. It was clearly a showstopper! It took me 3 laborious days to implement the UTF-16 conversions and handle all the instances where the wchar_t size was 4; Unicode files are essentially UTF-16 files. On Windows, UTF-16 is supported natively. On Ubuntu Linux, I have to convert the 4-byte wchar_t (UTF-32) to UTF-16 before writing to a Unicode file. The reverse conversion applies during reading.
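The UTF-32 to UTF-16 step can be sketched with the C++11 character types. This is not the library's conversion code: Utf32ToUtf16 and Utf16ToUtf32 are names chosen for the example, they use std::u32string and std::u16string instead of wchar_t, and they do not validate lone surrogates or code points above U+10FFFF.

```cpp
#include <cassert>
#include <string>

// Encode UTF-32 code points as UTF-16 code units.
std::u16string Utf32ToUtf16(const std::u32string& in)
{
    std::u16string out;
    for (std::size_t i = 0; i < in.size(); ++i)
    {
        char32_t cp = in[i];
        if (cp < 0x10000)
        {
            out += static_cast<char16_t>(cp);                     // fits in one unit
        }
        else
        {
            cp -= 0x10000;
            out += static_cast<char16_t>(0xD800 + (cp >> 10));    // high surrogate
            out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));  // low surrogate
        }
    }
    return out;
}

// Decode UTF-16 code units back into UTF-32 code points.
std::u32string Utf16ToUtf32(const std::u16string& in)
{
    std::u32string out;
    for (std::size_t i = 0; i < in.size(); ++i)
    {
        const char32_t unit = in[i];
        const bool isHigh = unit >= 0xD800 && unit <= 0xDBFF;
        if (isHigh && i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF)
        {
            const char32_t low = in[++i];
            out += static_cast<char32_t>(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00));
        }
        else
        {
            out += unit;
        }
    }
    return out;
}

void DemoUtf16()
{
    std::u32string text;
    text += U'A';
    text += static_cast<char32_t>(0x1F600); // outside the Basic Multilingual Plane
    const std::u16string utf16 = Utf32ToUtf16(text);
    assert(utf16.size() == 3);              // 1 unit + 1 surrogate pair
    assert(utf16[1] == 0xD83D && utf16[2] == 0xDE00);
    assert(Utf16ToUtf32(utf16) == text);
}
```

The demo also shows the length difference mentioned earlier: two code points become three 16-bit units.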

If you are interested in running the Linux tests, you can run the commands below to build the library (FileLib.a) and the test application (UnitTest.exe) and execute it.

cd FileLib
cd FileIO
make all
cd ..
cd PreVS2012UnitTest
make all
./UnitTest.exe

In total, there are 55 unit tests for Windows and 65 unit tests for Linux. Whenever I made a change or fixed a bug for either OS, I ran the unit tests for both to make sure I had not broken anything on either side.

Porting to Clang

Clang 3.1 on Ubuntu 12.04 is able to compile the library using the GCC 4.7 standard library. However, Clang compilation failed on Mac OSX 10.8 because it could not find an overloaded constructor with a size_t parameter. size_t is an unsigned 32-bit integer on 32-bit platforms and an unsigned 64-bit integer on 64-bit platforms. Apparently, Clang on Mac OSX sees size_t as a distinct type from the integer types that already have constructors. Adding that constructor unconditionally failed on the Microsoft compiler, which complained that a similar constructor already existed, so the fix is to hide it under an __APPLE__ check.

#ifdef __APPLE__
    DataType( size_t ui );
#endif

In order to compile under Clang successfully, remove the Microsoft-specific files, such as stdafx.h and WinOperation.h/cpp, and the Boost files, such as BoostStrategy.h/cpp and RegExStrategy.h/cpp. For unit testing, LinuxUnitTest.cpp can be used.

Streams

Version 2.0.2 of the text file library (the variadic template version, not the 1.0.x C++98 version) uses custom streams internally, so users can write non-intrusive insertion and extraction operators for arbitrary data types, including enums. The istream and ostream classes use Boost lexical_cast to perform the data conversion, so they should perform better than STL stringstream. With istream, there is no need to set the split strategy when reading, but the delimiter needs to be specified through SetDelimiter. Your overloaded << and >> operators can use the same delimiter as the rest of the file or a different one. Let us first look at overloading with the same delimiter as the rest of the file format.

This is the structure, MyStruct.

struct MyStruct
{
    int a;
    int b;
};

These are the overloaded <<, >> operators placed in your source files.

Elmax::ostream operator <<(Elmax::ostream& os, const MyStruct& val)
{
    os << val.a;
    os << L",";
    os << val.b;
    os << L",";

    return os;
}

Elmax::istream operator >>(Elmax::istream& is, MyStruct& val)
{
    is >> val.a;
    is >> val.b;

    return is;
}

Now we can write and read MyStruct objects as follows.

// Writing
xTextWriter writer;
std::wstring file = L"...";
writer.Open(file, FT_UTF8, NEW);

int i = 25698;
double d = 1254.5;
MyStruct my = { 22, 33 };
writer.Write(L"{0},{1},{2}", i, my, d);
writer.Close();

// Reading
xTextReader reader;
reader.Open(file);
int i2 = 0;
double d2 = 0.0;
MyStruct my2 = { 0, 0 };

// do not set split strategy but set delimiters instead.
reader.SetDelimiter(L",");
size_t totalRead = reader.ReadLine(i2, my2, d2);

In the next example, we use the pipe, "|", as the delimiter for our structure while the rest of the document uses commas.

struct DiffDelimiterStruct
{
    int a;
    float b;
};

Elmax::ostream operator <<(Elmax::ostream& os, const DiffDelimiterStruct& val)
{
    os << val.a;
    os << L"|";
    os << val.b;
    os << L"|";

    return os;
}

Elmax::istream operator >>(Elmax::istream& is, DiffDelimiterStruct& val)
{
    std::wstring old_delimiter = is.set_delimiter(L"|");

    is >> val.a;
    is >> val.b;

    is.set_delimiter(old_delimiter);

    return is;
}

As shown below, writing and reading are the same as in the previous example.

// Writing
xTextWriter writer;
std::wstring file = L"...";
writer.Open(file, FT_UTF8, NEW);
int i = 25698;
double d = 1254.5;
DiffDelimiterStruct my = { 22, 33 };
writer.Write(L"{0},{1},{2}", i, my, d);
writer.Close();

// Reading
xTextReader reader;
reader.Open(file);

int i2 = 0;
double d2 = 0.0;
DiffDelimiterStruct my2 = { 0, 0 };

reader.SetDelimiter(L",");
size_t totalRead = reader.ReadLine(i2, my2, d2);
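The delimiter switching used by the extraction operator can be modelled with a tiny token reader. Everything here is an assumption made for illustration: FieldStream, NextToken and PipeStruct are invented, narrow strings are used, and tokens are read strtok-style (leading delimiters are skipped), which may differ from how Elmax::istream treats the trailing delimiter.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical token reader; it only mirrors the set_delimiter idea.
class FieldStream
{
public:
    FieldStream(const std::string& text, const std::string& delimiter)
        : m_text(text), m_delim(delimiter), m_pos(0) {}

    // Install a new delimiter and return the previous one.
    std::string set_delimiter(const std::string& delimiter)
    {
        const std::string old = m_delim;
        m_delim = delimiter;
        return old;
    }

    // Skip leading delimiters, then take characters up to the next one.
    std::string NextToken()
    {
        const std::size_t start = m_text.find_first_not_of(m_delim, m_pos);
        if (start == std::string::npos)
        {
            m_pos = m_text.size();
            return std::string();
        }
        std::size_t end = m_text.find_first_of(m_delim, start);
        if (end == std::string::npos)
            end = m_text.size();
        m_pos = (end < m_text.size()) ? end + 1 : end;
        return m_text.substr(start, end - start);
    }

    template<typename T>
    FieldStream& operator>>(T& value)
    {
        std::istringstream in(NextToken());
        in >> value;
        return *this;
    }

private:
    std::string m_text;
    std::string m_delim;
    std::size_t m_pos;
};

struct PipeStruct { int a; float b; };

// Switch to "|" for the struct's own fields, then restore the caller's delimiter.
FieldStream& operator>>(FieldStream& is, PipeStruct& val)
{
    const std::string old = is.set_delimiter("|");
    is >> val.a >> val.b;
    is.set_delimiter(old);
    return is;
}

void DemoFieldStream()
{
    FieldStream fs("25698,22|33|,1254.5", ",");
    int i = 0;
    PipeStruct my = { 0, 0.0f };
    double d = 0.0;
    fs >> i >> my >> d;
    assert(i == 25698 && my.a == 22 && my.b == 33.0f && d == 1254.5);
}
```

Restoring the old delimiter before returning is what lets the struct sit in the middle of a comma-separated line without the caller knowing how it is laid out.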

Caveat

This is a list of issues that users need to be aware of when using this file library.

  • Do not use the size_t type in binary files: size_t is a 32-bit unsigned integer on 32-bit platforms and a 64-bit unsigned integer on 64-bit platforms. The automatic promotion to 64 bits on a 64-bit OS is sometimes desirable in memory, but it is wrong for a file format. When a field is 32 bits wide in the binary format, it should remain 32 bits wide in the file on every platform.
  • Non-Windows implementations use fopen: Windows provides the _wfopen function to open a file with a Unicode name. Unfortunately, Linux and GCC (or rather the C Standard Library) do not have such a function, and neither the C nor the C++ Standard says how to open a file with a Unicode name. The workaround on other platforms is that when the user is about to open a file whose name contains Unicode code points (> 255), the application should copy the file to an ASCII name and open that copy instead.
  • Put the file code in try/catch: The exceptions that the library can throw are logic_error, runtime_error, overflow_error and underflow_error. Exceptions are enabled by default. Although exceptions can be disabled through the EnableException function, they will still be thrown when there are data conversion errors. These are considered serious errors because the file could be corrupted, so silent failure is not acceptable. When exceptions are disabled, the user has to check the return value of each function call and call GetLastError.
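The first and last caveats can be combined in a small sketch. The helper names ToFileCount and TryToFileCount are hypothetical and are not part of the file library; the sketch only assumes C++11 and the standard overflow_error mentioned above. It narrows a size_t to a fixed 32-bit width before the value goes into a binary file, and reports the failure instead of truncating silently.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of the library): narrow a size_t to the
// fixed 32-bit width used in the file, throwing instead of truncating.
inline std::uint32_t ToFileCount(std::size_t n)
{
    if (n > std::numeric_limits<std::uint32_t>::max())
        throw std::overflow_error("count does not fit in 32 bits");
    return static_cast<std::uint32_t>(n);
}

// Hypothetical caller: wraps the conversion in try/catch, returns true on
// success and reports a failure through 'error'.
inline bool TryToFileCount(std::size_t n, std::uint32_t& out, std::string& error)
{
    try
    {
        out = ToFileCount(n);
        return true;
    }
    catch (const std::exception& e) // catch by const reference
    {
        error = e.what();
        return false;
    }
}
```

The value returned by ToFileCount is always 4 bytes wide, so the file layout stays the same whether the program is built for 32 bits or 64 bits.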

Data Portability

Up to now, we have talked mainly about source code portability. Let us discuss some data portability issues. We did not encounter any data issues because the platforms used are all based on Intel x86. Other platforms may have a different endianness (little endian versus big endian). The file format should have a field that stores the byte ordering, not unlike the TIFF image format, and the reader should flip the bytes as needed. Because alignment differs between platforms, it is best to write out individual struct members instead of writing the struct as a flat array. Do not use size_t, as its size depends on the processor width (32 bits versus 64 bits). Not all platforms use two's complement for negative numbers; ones' complement or sign-magnitude could be used, and you may need to store that information as well. If you rely on -1 having all bits set (e.g., 0xFFFF), you are better off using ~0 for portability. Enums may have different values and data sizes. You can assign numerical values and force the enum to be a certain size. However, it is recommended to use a switch-case instead of casting the enum to an integer; switch-case works for both plain enums and C++11 enum class.

enum MYCOLORS
{
    RED = 0,
    YELLOW = 1,
    ....
    NO_USED = 0xFFFFFFFF // force the enum to be 4 bytes wide
};
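The switch-case recommendation can be sketched as follows. The Color enum and the ToFileCode and FromFileCode functions are hypothetical names made up for this illustration, and the one-byte codes are an assumed file format, not something the library prescribes.

```cpp
#include <cstdint>
#include <stdexcept>

// Hypothetical enum for illustration; the enumerator values are left to
// the compiler because they never reach the file directly.
enum class Color { Red, Yellow, Green };

// Map each enumerator to an explicit, fixed-width code for the file.
inline std::uint8_t ToFileCode(Color c)
{
    switch (c)
    {
    case Color::Red:    return 0;
    case Color::Yellow: return 1;
    case Color::Green:  return 2;
    }
    throw std::logic_error("unknown Color enumerator");
}

// Map a code read from the file back to the enum, rejecting bad data.
inline Color FromFileCode(std::uint8_t code)
{
    switch (code)
    {
    case 0:  return Color::Red;
    case 1:  return Color::Yellow;
    case 2:  return Color::Green;
    default: throw std::runtime_error("invalid color code in file");
    }
}
```

With this mapping, reordering or renumbering the enumerators in the source code does not change what is stored in the file.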

For floating point portability, we should check for IEEE 754 compliance using numeric_limits<float>::is_iec559. IEC 60559 is synonymous with the IEEE 754 standard for floating point; it is also sometimes referred to as IEC 559.
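The floating point check, together with the byte flipping discussed earlier in this section, could look like the sketch below. All function names here are hypothetical and not taken from the library. The sketch assumes C++11 (for static_assert and <cstdint>) and a 32-bit float.

```cpp
#include <cstdint>
#include <cstring>
#include <limits>

// Refuse to compile on platforms whose float is not a 32-bit IEEE 754 type.
static_assert(std::numeric_limits<float>::is_iec559,
              "float is not IEEE 754 (IEC 60559)");
static_assert(sizeof(float) == sizeof(std::uint32_t), "float is not 32 bits");

// Copy the bit pattern of a float into a 32-bit integer and back.
inline std::uint32_t FloatToBits(float f)
{
    std::uint32_t bits = 0;
    std::memcpy(&bits, &f, sizeof bits);
    return bits;
}

inline float BitsToFloat(std::uint32_t bits)
{
    float f = 0.0f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

// Detect the byte order of the machine the program is running on.
inline bool HostIsLittleEndian()
{
    const std::uint32_t probe = 1;
    unsigned char first = 0;
    std::memcpy(&first, &probe, 1);
    return first == 1;
}

// Reverse the four bytes of a 32-bit value.
inline std::uint32_t SwapBytes32(std::uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | (v << 24);
}

// Convert a raw 32-bit value from the file to host order, given the
// byte-order flag stored in the file header.
inline std::uint32_t FileToHost32(std::uint32_t raw, bool fileIsLittleEndian)
{
    return (fileIsLittleEndian == HostIsLittleEndian()) ? raw : SwapBytes32(raw);
}
```

A float would be read as a raw 32-bit value, passed through FileToHost32 and then through BitsToFloat, so the same file can be read on either kind of machine.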

Preventing Memory Leak

The original source code uploaded with this article was tested with Visual Studio 2010/2012 and Valgrind (Linux) and showed no memory leaks during correct program operation. There could still be leaks when an exception was thrown, because the deallocation was then never called. Another problem was that the exceptions were allocated on the heap and not freed in the catch handler (an oversight). All of this has been rectified: every array now uses Resource Acquisition Is Initialization (RAII) to free its memory, and an exception, if thrown, is now allocated on the stack.
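The idea behind the fix can be shown with a minimal sketch. It is not the library's actual code; FillBuffer and CopyWithRaii are hypothetical names. The array is owned by a std::vector, so it is freed on every exit path, and the exception is thrown by value and caught by reference, so there is nothing to delete in the handler.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical step that may fail: copy text into a fixed-size buffer.
inline void FillBuffer(std::vector<wchar_t>& buffer, const std::wstring& text)
{
    if (text.size() > buffer.size())
        throw std::runtime_error("text does not fit in buffer"); // thrown by value
    for (std::size_t i = 0; i < text.size(); ++i)
        buffer[i] = text[i];
}

// The vector releases its memory in its destructor, whether the function
// returns normally or leaves the try block through an exception.
inline bool CopyWithRaii(const std::wstring& text, std::size_t capacity,
                         std::string& error)
{
    std::vector<wchar_t> buffer(capacity); // replaces new[] / delete[]
    try
    {
        FillBuffer(buffer, text);
        return true;
    }
    catch (const std::runtime_error& e) // caught by reference, nothing to free
    {
        error = e.what();
        return false;
    }
}
```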

Future Direction

There are plans to move the library to C++11 features like variadic templates, nullptr and move semantics. It is much clearer to use standard integer types like uint32_t as opposed to unsigned int. A preliminary C++11 version is already available for download. The C++98 version will still be maintained on a different Git branch.

The table below shows the number of lines of code (loc) for each class in C++98 and C++11. The percentage of loc removed by applying C++11 variadic templates is greater than 50%.

Class           C++98 loc   C++11 loc   Reduced by %
xTextWriter     437         186         57.4%
xTextReader     636         259         59.3%
xBinaryWriter   1067        180         83.1%
xBinaryReader   1123        181         83.9%
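To give a feel for where the reduction comes from, here is a minimal sketch of a variadic function that accepts any number of arguments, where C++98 needs one overload per argument count. The Format and CollectArgs functions are hypothetical and are not the library's implementation. The sketch only handles single-digit placeholders such as {0} to {9}.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Base case: no more arguments to convert.
inline void CollectArgs(std::vector<std::wstring>&) {}

// Recursive case: convert the first argument to text, then the rest.
template <typename T, typename... Rest>
void CollectArgs(std::vector<std::wstring>& out, const T& first, const Rest&... rest)
{
    std::wostringstream oss;
    oss << first;
    out.push_back(oss.str());
    CollectArgs(out, rest...);
}

// Hypothetical formatter: replaces {N} with the text of the N-th argument.
template <typename... Args>
std::wstring Format(const std::wstring& fmt, const Args&... args)
{
    std::vector<std::wstring> texts;
    CollectArgs(texts, args...);

    std::wstring result;
    for (std::size_t i = 0; i < fmt.size(); ++i)
    {
        // Look for a single-digit placeholder of the form {N}.
        if (fmt[i] == L'{' && i + 2 < fmt.size() && fmt[i + 2] == L'}' &&
            fmt[i + 1] >= L'0' && fmt[i + 1] <= L'9')
        {
            const std::size_t index = static_cast<std::size_t>(fmt[i + 1] - L'0');
            if (index < texts.size())
            {
                result += texts[index];
                i += 2; // skip past "N}"
                continue;
            }
        }
        result += fmt[i]; // ordinary character, or a placeholder with no argument
    }
    return result;
}
```

A call such as Format(L"{0},{1}", i, d) works for two arguments, and the same template covers three, four or more without any additional overloads.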

Points of Interest

The reader may or may not have noticed the Elmax namespace used in the code snippets. As anyone would have guessed, the file library is for the future cross-platform Elmax XML Library, but why include a binary file API as well? The reason is that there will be a version of Elmax which can save XML in binary form. Let us briefly recall the Elmax syntax for writing a value to an XML element.

using namespace Elmax;
Element elem;

elem[L"Price"] = 30.65f;
elem[L"Qty"] = 1200;

As the reader can see from the above sample code, the Elmax element is aware of the data type before it converts the data to textual form. By using the data type information, Elmax can build a metadata section about the XML. The metadata can be kept separate from the Binary XML or embedded inside it. If the XML contains mainly recurring elements, the metadata can be concise and small. However, if the XML file consists of free-form XML like SOAP XML, HTML or XAML, the metadata can be quite big relative to the Binary XML. A binary file has the advantage of being fast because the data-type conversion from textual form is out of the picture.

Demo

I have modified an old OpenGL demo to read a binary file to showcase the file library. Set the global variable g_bLoadBinary according to which file type you want the demo to load. Please note that the OpenGL code is not cross-platform and only runs on Windows. Previously, I uploaded an OpenGL demo for another article. Since I only have access to an NVidia graphics card, I was not aware that the code did not run correctly on Intel graphics chipsets. This demo should not have the same problem. Please let me know if you have any problem running the OpenGL demo. The demo is written in OpenGL 2.0. An OpenGL 4.0 version is being developed for a future OpenGL article. Stay tuned if you are interested in OpenGL 4.0!

This is the wood clip model that is loaded. The model was made with the very old Milkshape shareware.

This is the screenshot of the demo.

Conclusion

In this article, we have seen a new file API which makes writing and reading structured data intuitive and productive. By keeping the text and binary APIs similar, the user can maintain both file formats with minimal effort. The file library will be used by the new Elmax XML library to save textual and binary XML files. The XML work is an ongoing effort, and its estimated date of completion is unknown. The source code is currently hosted on GitHub.

Compilers Tested

  • Microsoft Visual C++ 8.0, 9.0, 10.0 and 11.0
  • MinGW 4.7.x
  • GCC 4.6 and 4.7 (Ubuntu 12.04)
  • Clang 3.1 (Ubuntu and Mac OSX 10.8)

Nuget

The Elmax C++ File Library is available on the NuGet Gallery for VS2010 and VS2012! Remember to update your NuGet to the latest version, 2.5, first.

Reference

  • Write Portable Code by Brian Hook

History

  • 2013-11-26 : Added Streams section. The source code is updated to use Boost lexical_cast.
  • 2013-10-31 : Added Data Portability discussion. Important! Please read.
  • 2013-05-04 : Changed the file open functions not to throw exception because file open failure is common error, not exceptional error. Added Nuget section.
  • 2013-01-02 : Added a table to show the lines of code reduced after changing to C++11 variadic template.
  • 2012-12-23 : Added C++11 variadic versions of the functions
  • 2012-12-14 : Added Preventing Memory Leak section
  • 2012-12-12 : Added Clang support
  • 2012-09-25 : Initial Release

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Wong Shao Voon
Software Developer McGraw-Hill Financial
Singapore Singapore

Currently into areas like 3D graphics and application security. Hoping to revisit the cryptography and design pattern topics if time permits.

Article Copyright 2012 by Wong Shao Voon