For C/C++ programmers, the standard ways to write and read files are the C file API and C++ file streams. Because of their cryptic and unintuitive class and function names, I have always found C++ file streams too difficult to use, whereas the C file API is not type-safe. In this article, I introduce my own type-safe file library, which is based on the C file API and unifies the text file and binary file APIs in an almost seamless way. There are still some differences between the text and binary APIs where it makes absolutely no sense to make them similar. The library is meant for writing and reading structured data, that is, integers, booleans, floats, strings and so on. It can also write and read unstructured data (for example, C++ source files) with its lower-level classes, but that is not the focus of the library or of this article. This article is meant to teach readers how to easily access structured data in files.
For .NET developers who chanced upon this article: it is about native C++, not .NET, so you can stop reading now. I did try to write a C# version of my file library, but I failed because C# does not allow the developer to keep a copy of a passed-by-reference POD argument after the method returns.
In this section, we look at how to write and read text files. Let us begin by learning how to write an integer and a double to a text file.
using namespace Elmax;

xTextWriter writer;
std::wstring file = L"Unicode.txt";
if(writer.Open(file, FT_UNICODE, NEW))
{
    int i = 25698;
    double d = 1254.69;
    writer.Write(L"{0},{1}", i, d);
    writer.Close();
}
The code above tries to open a new Unicode file and, upon success, writes an integer and a double value and closes the file. The other text file types supported are ASCII, Big Endian Unicode and UTF-8. Though not shown in the code, the user should check the boolean return value of Write. xTextWriter delegates its file work to AsciiWriter, UnicodeWriter, BEUnicodeWriter and UTF8Writer. Likewise, xTextReader delegates its file work to AsciiReader, UnicodeReader, BEUnicodeReader and UTF8Reader. These file writers write the BOM on their first write, while the readers read the BOM automatically if it is present. For readers who are not familiar with the term, BOM stands for byte order mark: a Unicode character used to signal the endianness (byte order) of a text file or stream. The BOM is optional, but writing it is generally accepted as good practice. The reader might ask why I wrote a Unicode file library instead of picking one from CodeProject. I decided to write my own Unicode file classes because most of those featured on CodeProject make use of the MFC CStdioFile class, which does not work on other platforms. Let us now look at how to read the same data we have just written.
using namespace Elmax;

xTextReader reader;
std::wstring file = L"Unicode.txt";
if(reader.Open(file))
{
    if(reader.IsEOF()==false)
    {
        int i2 = 0;
        double d2 = 0.0;
        StrtokStrategy strat(L",");
        reader.SetSplitStrategy(&strat);
        size_t totalRead = reader.ReadLine(i2, d2);
        // i2 = 25698 and d2 = 1254.69
    }
    reader.Close();
}
The reader opens the same file and sets its text split strategy. In this case, it is set to use strtok with a comma as the delimiter. Other split strategies include Boost and Regex, but strtok is highly recommended because it is fast. We have seen how to write and read an integer and a double. Writing and reading strings is no different, but special care must be taken with delimiters that may appear inside the string. That means we must escape the string when writing and unescape it when reading. The ReplaceAll function in the StrUtil class can be used to escape and unescape strings.
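The library's ReplaceAll is not shown in this article, so the following is only a minimal sketch of the escape-and-unescape idea. ReplaceAllSketch, EscapeComma and UnescapeComma are hypothetical helpers written for illustration, and the "&comma;" and "&amp;" tokens are an arbitrary choice, not the library's convention.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for the library's ReplaceAll; the real signature may differ.
std::wstring ReplaceAllSketch(std::wstring text, const std::wstring& from, const std::wstring& to)
{
    if (from.empty()) return text;
    std::size_t pos = 0;
    while ((pos = text.find(from, pos)) != std::wstring::npos)
    {
        text.replace(pos, from.size(), to);
        pos += to.size(); // skip the replacement so it is never rescanned
    }
    return text;
}

// Escape the escape character first, then the delimiter.
std::wstring EscapeComma(const std::wstring& s)
{
    return ReplaceAllSketch(ReplaceAllSketch(s, L"&", L"&amp;"), L",", L"&comma;");
}

// Undo the two replacements in the reverse order.
std::wstring UnescapeComma(const std::wstring& s)
{
    return ReplaceAllSketch(ReplaceAllSketch(s, L"&comma;", L","), L"&amp;", L"&");
}
```

An escaped string contains no comma, so it can be written as one field of a comma-delimited line and unescaped after the line has been split.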
There is an overloaded Open function which takes the Unicode file type as an additional parameter. The reader always respects the BOM if one is present; only in the absence of a BOM does xTextReader open the file according to the Unicode file type the user specified.
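How the library detects the BOM is not shown here. As a rough sketch of the rule just described, the following checks the leading bytes for the three BOMs that match the supported file types and falls back to the caller's choice otherwise. BomKind, DetectBom and ResolveEncoding are names invented for this example.

```cpp
#include <cstddef>
#include <string>

enum class BomKind { None, Utf8, Utf16LittleEndian, Utf16BigEndian };

// Inspect the first bytes of a file's contents and report which BOM, if any, is present.
BomKind DetectBom(const std::string& bytes)
{
    auto at = [&](std::size_t i) { return static_cast<unsigned char>(bytes[i]); };
    if (bytes.size() >= 3 && at(0) == 0xEF && at(1) == 0xBB && at(2) == 0xBF)
        return BomKind::Utf8;
    if (bytes.size() >= 2 && at(0) == 0xFF && at(1) == 0xFE)
        return BomKind::Utf16LittleEndian;
    if (bytes.size() >= 2 && at(0) == 0xFE && at(1) == 0xFF)
        return BomKind::Utf16BigEndian;
    return BomKind::None;
}

// A detected BOM wins; the caller's choice is only used when no BOM is found.
BomKind ResolveEncoding(const std::string& bytes, BomKind userChoice)
{
    const BomKind found = DetectBom(bytes);
    return found != BomKind::None ? found : userChoice;
}
```

The sketch does not distinguish a UTF-32 BOM, since UTF-32 files are not among the file types listed above.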
Writing a binary file is similar to writing a text file, except that the user does not have to write delimiters between the data.
using namespace Elmax;

xBinaryWriter writer;
std::wstring file = L"Binary.bin";
if(writer.Open(file))
{
    int i = 25698;
    double d = 1254.69;
    writer.Write(i, d);
    writer.Close();
}
Write returns the number of values successfully written. As shown below, reading is very similar to writing.
using namespace Elmax;

xBinaryReader reader;
std::wstring file = L"Binary.bin";
if(reader.Open(file))
{
    if(reader.IsEOF()==false)
    {
        int i2 = 0;
        double d2 = 0.0;
        size_t totalRead = reader.Read(i2, d2);
        // i2 = 25698 and d2 = 1254.69
    }
    reader.Close();
}
Writing a string in binary usually involves writing the string length first; before reading the string back, we need to read the length and allocate the array.
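Independent of the library, the length-prefix layout can be sketched with an in-memory byte buffer standing in for the file. WriteString and ReadString are hypothetical helpers. Using a fixed 32-bit length in native byte order is an assumption of this sketch, not something the library prescribes.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Append a 32-bit length followed by the raw characters (no terminating null).
void WriteString(std::vector<char>& out, const std::string& s)
{
    const std::uint32_t len = static_cast<std::uint32_t>(s.size());
    const char* p = reinterpret_cast<const char*>(&len);
    out.insert(out.end(), p, p + sizeof(len));
    out.insert(out.end(), s.begin(), s.end());
}

// Read the length first, then exactly that many characters.
// Returns false if the buffer ends before the string is complete.
bool ReadString(const std::vector<char>& in, std::size_t& pos, std::string& s)
{
    std::uint32_t len = 0;
    if (pos > in.size() || in.size() - pos < sizeof(len)) return false;
    std::memcpy(&len, in.data() + pos, sizeof(len));
    pos += sizeof(len);
    if (in.size() - pos < len) return false;
    s.assign(in.data() + pos, len);
    pos += len;
    return true;
}
```

With the library, the same pattern looks as follows.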
using namespace Elmax;

xBinaryWriter writer;
std::wstring file = GetTempPath(L"Binary.bin");
if(writer.Open(file))
{
    std::string str = "Coding Monkey";
    double d = 1254.69;
    writer.Write(str.size(), str, d);
    writer.Close();
}

xBinaryReader reader;
if(reader.Open(file))
{
    if(reader.IsEOF()==false)
    {
        size_t len = 0;
        double d2 = 0.0;
        StrArray arr;
        size_t totalRead = reader.Read(len);
        totalRead = reader.Read(arr.MakeArray(len), d2);
        std::string str2 = arr.GetPtr();
        // str2 contains "Coding Monkey"
    }
    reader.Close();
}
We use StrArray to read a char array. We read its length first and use the length to allocate the array through the MakeArray method. It is possible to read the length and make the array in the same call, using DeferredMake. Unlike MakeArray, DeferredMake does not allocate the array immediately: the allocation is delayed until it is the array's turn to be read from the file. DeferredMake captures the address of len, so by the time the array is allocated, len already holds the length that was just read. See below.
It is possible to write a structure as an array. This is, however, not advisable, as different platforms may pad an unknown number of bytes between structure members for performance reasons. For portability, it is recommended to write out every structure member rather than writing the structure as a flat array.
xBinaryReader reader;
if(reader.Open(file))
{
    if(reader.IsEOF()==false)
    {
        size_t len = 0;
        double d2 = 0.0;
        StrArray arr;
        size_t totalRead = reader.Read(len, arr.DeferredMake(len), d2);
        std::string str2 = arr.GetPtr();
        // str2 contains "Coding Monkey"
    }
    reader.Close();
}
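The deferred allocation shown with DeferredMake above can be illustrated with a small stand-alone class. DeferredCharArray is a hypothetical miniature written for this explanation; the library's StrArray is more elaborate and its internals are not shown in this article.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical miniature of the idea behind StrArray::DeferredMake.
class DeferredCharArray
{
public:
    DeferredCharArray() : m_pLen(nullptr) {}

    // Remember WHERE the length lives, not its current value.
    DeferredCharArray& DeferredMake(std::size_t& len)
    {
        m_pLen = &len;
        return *this;
    }

    // Called when it is this argument's turn to be read:
    // allocate using whatever value the length holds by then.
    void Make()
    {
        if (m_pLen != nullptr)
            m_buf.assign(*m_pLen + 1, '\0'); // +1 keeps the buffer null-terminated
    }

    std::size_t Capacity() const { return m_buf.size(); }
    char* GetPtr() { return m_buf.data(); }

private:
    const std::size_t* m_pLen;
    std::vector<char> m_buf;
};
```

Because only the address is stored, the length variable must outlive the read call, which is the case when it is a local variable declared before the call.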
WStrArray is available to read a wchar_t array. However, writing a std::wstring and reading it with WStrArray is not recommended if you want to keep your file format portable across different OSes, because the size of wchar_t differs between Windows, Linux and Mac OSX. We will explore this issue in a later section. Note: the text file API does not have this problem, as conversions are in place to handle it automatically. The workaround, if the user needs to write Unicode strings, is to write UTF-8 strings. Another option is to use the BaseArray class to write 16-bit strings. There are two 16-bit encodings for Unicode, namely UCS-2 and UTF-16. A UCS-2 unit is always 16 bits and can only represent 97% of Unicode. UTF-16 can encode all Unicode code points, but a code point may take one or two 16-bit words. For some use cases, UCS-2 is sufficient to store text in the language of choice. UTF-16 can store everything that is Unicode, but the tradeoff is the conversion time and the need to account for the potential difference in text length before and after conversion.
xBinaryWriter and xBinaryReader also provide Seek and GetCurrPos for file seeking (a common operation in binary file parsing).
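The signatures of Seek and GetCurrPos are not given in this article. Since the library is built on the C file API, the underlying operations are presumably fseek and ftell, which the sketch below uses directly on a temporary file. ReadSecondInt is a hypothetical helper.

```cpp
#include <cstdio>

// Write two ints to a temporary file, then jump straight to the second one and read it back.
// Returns true and stores the value in 'second' on success.
bool ReadSecondInt(int first, int secondIn, int& second)
{
    std::FILE* fp = std::tmpfile(); // opened for binary update
    if (fp == nullptr) return false;

    bool ok = std::fwrite(&first, sizeof(int), 1, fp) == 1
           && std::fwrite(&secondIn, sizeof(int), 1, fp) == 1;

    // The current position after two writes is the total size written so far.
    const long endPos = std::ftell(fp);
    ok = ok && endPos == static_cast<long>(2 * sizeof(int));

    // Seek from the start of the file to skip the first int.
    ok = ok && std::fseek(fp, static_cast<long>(sizeof(int)), SEEK_SET) == 0;
    ok = ok && std::fread(&second, sizeof(int), 1, fp) == 1;

    std::fclose(fp);
    return ok;
}
```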
xTextWriter and xTextReader make use of DataType and DataTypeRef respectively to convert between data types and strings. Basically, this library depends on the implicit conversion of Plain Old Data (POD) to a DataType object to work. xTextWriter has many overloaded Write and WriteLine functions which differ by the number of DataType parameters. WriteLine simply adds a linefeed (LF) after writing the string. The Write below has 5 DataType parameters.
bool xTextWriter::Write( const wchar_t* fmt, DataType D1, DataType D2,
    DataType D3, DataType D4, DataType D5 )
{
    if(pWriter!=NULL)
    {
        std::wstring str = StrUtilRef::Format(fmt, D1, D2, D3, D4, D5);
        return pWriter->Write(str);
    }
    return false;
}
DataType consists of many overloaded constructors which convert the Plain Old Data (POD) to a string and store it in a string member (m_str).
namespace Elmax
{
class DataType
{
public:
    ~DataType(void);

    DataType( int i );
    DataType( unsigned int ui );
    DataType( const ELMAX_INT64& i64 );
    DataType( const unsigned ELMAX_INT64& ui64 );
    DataType( float f );
    DataType( const double& d );
    DataType( const std::string& s );
    DataType( const std::wstring& ws );
    DataType( const char* pc );
    DataType( const wchar_t* pwc );
    DataType( char c );
    DataType( unsigned char c );
    DataType( wchar_t wc );

    std::wstring& ToString() { return m_str; }

protected:
    std::wstring m_str;
};
}
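To make the implicit conversion concrete, here is a minimal stand-alone sketch of the same idea. TextValue and Format2 are hypothetical names; using string streams for the conversion is an assumption of this example and not necessarily what DataType and StrUtilRef::Format do internally.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical miniature of DataType: each converting constructor stores the text form.
class TextValue
{
public:
    TextValue(int i) { std::wostringstream os; os << i; m_str = os.str(); }
    TextValue(double d) { std::wostringstream os; os << d; m_str = os.str(); }
    TextValue(const wchar_t* s) : m_str(s) {}
    TextValue(const std::wstring& s) : m_str(s) {}
    const std::wstring& ToString() const { return m_str; }
private:
    std::wstring m_str;
};

// Replace {0} and {1} in fmt. The arguments convert implicitly at the call site,
// so the function itself only ever sees TextValue objects.
std::wstring Format2(std::wstring fmt, TextValue a, TextValue b)
{
    const std::wstring values[2] = { a.ToString(), b.ToString() };
    const std::wstring tokens[2] = { L"{0}", L"{1}" };
    for (int i = 0; i < 2; ++i)
    {
        std::size_t pos = 0;
        while ((pos = fmt.find(tokens[i], pos)) != std::wstring::npos)
        {
            fmt.replace(pos, tokens[i].size(), values[i]);
            pos += values[i].size();
        }
    }
    return fmt;
}
```

The type safety comes from overload resolution: a type without a matching constructor fails to compile instead of being misread at run time, as can happen with a printf format string.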
Here is the C++11 variadic template version of Write, which supports an arbitrary number of arguments. You need to download and install the Visual C++ Compiler November 2012 CTP to compile the code. Note: there is much less code, since all those overloaded functions no longer have to be written.
bool Write( const wchar_t* str )
{
    if(pWriter!=nullptr)
    {
        return pWriter->Write(std::wstring(str));
    }
    return false;
}

template<typename... Args>
bool Write( const wchar_t* fmt, Args&... args )
{
    std::wstring str = StrUtilRef::Format(std::wstring(fmt), 0, args...);
    if(pWriter!=nullptr)
    {
        return pWriter->Write(str);
    }
    return false;
}
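The variadic StrUtilRef::Format called above is not listed in this article. The following sketch shows one way such a function could peel off one argument per call while counting the placeholder index, which would explain the 0 passed as the starting index. FormatFrom is a hypothetical name and the stream-based conversion is an assumption.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Base case: no arguments left, so the text is complete.
inline std::wstring FormatFrom(std::wstring fmt, int /*index*/) { return fmt; }

// Replace every "{index}" with the first argument, then continue with the rest at index + 1.
template<typename T, typename... Args>
std::wstring FormatFrom(std::wstring fmt, int index, const T& first, const Args&... rest)
{
    std::wostringstream token, value;
    token << L'{' << index << L'}';
    value << first;
    const std::wstring t = token.str();
    const std::wstring v = value.str();

    std::size_t pos = 0;
    while ((pos = fmt.find(t, pos)) != std::wstring::npos)
    {
        fmt.replace(pos, t.size(), v);
        pos += v.size(); // do not rescan the inserted value
    }
    return FormatFrom(fmt, index + 1, rest...);
}
```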
As mentioned earlier, xTextReader makes use of DataTypeRef to convert from string to Plain Old Data (POD). xTextReader has 10 overloaded Read and ReadLine functions which differ only by the number of DataTypeRef parameters. The ReadLine shown below has 5 DataTypeRef parameters.
size_t xTextReader::ReadLine( DataTypeRef D1, DataTypeRef D2, DataTypeRef D3,
    DataTypeRef D4, DataTypeRef D5 )
{
    if(pReader!=NULL)
    {
        std::wstring text;
        bool b = pReader->ReadLine(text);
        if(b)
        {
            StrUtilRef strUtil;
            strUtil.SetSplitStrategy(m_pSplitStrategy);
            return strUtil.Split(text.c_str(), D1, D2, D3, D4, D5);
        }
    }
    return 0;
}

size_t StrUtilRef::Split( const std::wstring& StrToExtract,
    DataTypeRef& D1, DataTypeRef& D2, DataTypeRef& D3,
    DataTypeRef& D4, DataTypeRef& D5 )
{
    std::vector<DataTypeRef*> vecDTR;
    vecDTR.push_back(&D1);
    vecDTR.push_back(&D2);
    vecDTR.push_back(&D3);
    vecDTR.push_back(&D4);
    vecDTR.push_back(&D5);

    assert( m_pSplitStrategy );
    return m_pSplitStrategy->Extract( StrToExtract, vecDTR );
}

size_t StrtokStrategy::Extract( const std::wstring& StrToExtract,
    std::vector<Elmax::DataTypeRef*> vecDTR )
{
    std::vector<std::wstring> vecSplit;
    const size_t size = StrToExtract.size()+1;
    wchar_t* pszToExtract = new wchar_t[size];
    wmemset( pszToExtract, 0, size );
    Wcscpy( pszToExtract, StrToExtract.c_str(), size );

    wchar_t *pszContext = 0;
    wchar_t *pszSplit = 0;
    pszSplit = wcstok( pszToExtract, m_sDelimit.c_str() );

    while( NULL != pszSplit )
    {
        size_t len = wcslen(pszSplit);
        // bug workaround: wcstok_s/wcstok will put 65535 at the back of last string.
        if(pszSplit[len-1]==65535&&vecSplit.size()==vecDTR.size()-1)
            pszSplit[len-1] = L'\0';

        vecSplit.push_back(std::wstring( pszSplit ) );
        pszSplit = wcstok( NULL, m_sDelimit.c_str() );
    }

    delete [] pszToExtract;

    size_t fail = 0;
    for( size_t i=0; i<vecDTR.size(); ++i )
    {
        if( i < vecSplit.size() )
        {
            if( false == vecDTR[i]->ConvStrToType( vecSplit[i] ) )
                ++fail;
        }
        else
            break;
    }

    return vecSplit.size()-fail;
}
DataTypeRef keeps a big union to store the address of each POD parameter as a destination for result.
namespace Elmax
{
class DataTypeRef
{
public:
    ~DataTypeRef(void);

    union UNIONPTR
    {
        int* pi;
        unsigned int* pui;
        short* psi;
        unsigned short* pusi;
        ELMAX_INT64* pi64;
        unsigned ELMAX_INT64* pui64;
        float* pf;
        double* pd;
        std::string* ps;
        std::wstring* pws;
        char* pc;
        unsigned char* puc;
        wchar_t* pwc;
    };

    enum DTR_TYPE
    {
        DTR_INT, DTR_UINT, DTR_SHORT, DTR_USHORT, DTR_INT64, DTR_UINT64,
        DTR_FLOAT, DTR_DOUBLE, DTR_STR, DTR_WSTR, DTR_CHAR, DTR_UCHAR, DTR_WCHAR
    };

    DataTypeRef( int& i )                    { m_ptr.pi = &i;       m_type = DTR_INT;    }
    DataTypeRef( unsigned int& ui )          { m_ptr.pui = &ui;     m_type = DTR_UINT;   }
    DataTypeRef( short& si )                 { m_ptr.psi = &si;     m_type = DTR_SHORT;  }
    DataTypeRef( unsigned short& usi )       { m_ptr.pusi = &usi;   m_type = DTR_USHORT; }
    DataTypeRef( ELMAX_INT64& i64 )          { m_ptr.pi64 = &i64;   m_type = DTR_INT64;  }
    DataTypeRef( unsigned ELMAX_INT64& ui64 ){ m_ptr.pui64 = &ui64; m_type = DTR_UINT64; }
    DataTypeRef( float& f )                  { m_ptr.pf = &f;       m_type = DTR_FLOAT;  }
    DataTypeRef( double& d )                 { m_ptr.pd = &d;       m_type = DTR_DOUBLE; }
    DataTypeRef( std::string& s )            { m_ptr.ps = &s;       m_type = DTR_STR;    }
    DataTypeRef( std::wstring& ws )          { m_ptr.pws = &ws;     m_type = DTR_WSTR;   }
    DataTypeRef( char& c )                   { m_ptr.pc = &c;       m_type = DTR_CHAR;   }
    DataTypeRef( unsigned char& uc )         { m_ptr.puc = &uc;     m_type = DTR_UCHAR;  }
    DataTypeRef( wchar_t& wc )               { m_ptr.pwc = &wc;     m_type = DTR_WCHAR;  }

    bool ConvStrToType( const std::string& Str );
    bool ConvStrToType( const std::wstring& Str );

    DTR_TYPE m_type;
    UNIONPTR m_ptr;
};
}
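ConvStrToType is declared above but its body is not shown. As a rough two-type sketch of how a tagged union of pointers can route each token to the caller's variable, consider the following. ValueRef and SplitInto are hypothetical, and parsing with std::wistringstream is an assumption of this example.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical two-type miniature of DataTypeRef.
class ValueRef
{
private:
    enum Type { T_INT, T_DOUBLE };
    union Ptr { int* pi; double* pd; };

public:
    ValueRef(int& i)    : m_type(T_INT)    { m_ptr.pi = &i; }
    ValueRef(double& d) : m_type(T_DOUBLE) { m_ptr.pd = &d; }

    // Parse the text into whichever variable was captured; false if parsing fails.
    bool ConvStrToType(const std::wstring& text)
    {
        std::wistringstream is(text);
        switch (m_type)
        {
        case T_INT:    return static_cast<bool>(is >> *m_ptr.pi);
        case T_DOUBLE: return static_cast<bool>(is >> *m_ptr.pd);
        }
        return false;
    }

private:
    Type m_type;
    Ptr m_ptr;
};

// Split on a single delimiter and feed each token to its destination.
// Returns the number of tokens converted successfully.
std::size_t SplitInto(const std::wstring& line, wchar_t delim, ValueRef a, ValueRef b)
{
    ValueRef* refs[2] = { &a, &b };
    std::wistringstream is(line);
    std::wstring token;
    std::size_t converted = 0;
    for (int i = 0; i < 2 && std::getline(is, token, delim); ++i)
        if (refs[i]->ConvStrToType(token)) ++converted;
    return converted;
}
```

The constructors take non-const references, so only a real variable of a supported type can be passed as a destination; a literal or a temporary does not compile.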
The C++11 variadic template version below calls ReadArg. The first ReadArg is the base function which terminates the recursion of its variadic sibling. Please note that this is not recursion in the traditional sense, because the function is not actually calling itself: it is calling a different function with the same name but a different number of arguments.
void ReadArg(std::vector<DataTypeRef*>& vec)
{
}

template<typename T, typename... Args>
void ReadArg(std::vector<DataTypeRef*>& vec, T& t, Args&... args)
{
    vec.push_back(new DataTypeRef(t));
    ReadArg(vec, args...);
}

template<typename... Args>
size_t Read( size_t len, Args&... args )
{
    if(pReader!=nullptr)
    {
        std::wstring text;
        bool b = pReader->Read(text, len);
        if(b)
        {
            std::vector<DataTypeRef*> vec;
            ReadArg(vec, args...);
            size_t ret = m_pSplitStrategy->Extract(text, vec);
            for(size_t i=0; i<vec.size(); ++i)
            {
                delete vec[i];
            }
            vec.clear();
            return ret;
        }
    }
    return 0;
}
xBinaryWriter makes use of BinaryTypeRef. The overloaded Write functions differ by the number of parameters. xBinaryWriter has no WriteLine function. The Write function shown below has 2 BinaryTypeRef parameters.
size_t xBinaryWriter::Write( BinaryTypeRef D1, BinaryTypeRef D2 )
{
    size_t totalWritten = 0;
    if(fp!=NULL)
    {
        if(D1.m_type != BinaryTypeRef::DTR_STR && D1.m_type != BinaryTypeRef::DTR_WSTR
            && D1.m_type != BinaryTypeRef::DTR_BASEARRAY)
        {
            size_t len = fwrite(D1.GetAddress(), D1.size, 1, fp);
            if(len==1)
                ++totalWritten;
        }
        else
        {
            size_t len = fwrite(D1.GetAddress(), D1.elementSize, D1.arraySize, fp);
            if(len==D1.arraySize)
                ++totalWritten;
        }

        if(D2.m_type != BinaryTypeRef::DTR_STR && D2.m_type != BinaryTypeRef::DTR_WSTR
            && D2.m_type != BinaryTypeRef::DTR_BASEARRAY)
        {
            size_t len = fwrite(D2.GetAddress(), D2.size, 1, fp);
            if(len==1)
                ++totalWritten;
        }
        else
        {
            size_t len = fwrite(D2.GetAddress(), D2.elementSize, D2.arraySize, fp);
            if(len==D2.arraySize)
                ++totalWritten;
        }
    }

    if(totalWritten != 2)
    {
        errNum = ELMAX_WRITE_ERROR;
        err = StrUtil::Format(L"{0}: Less than 2 elements are written! ({1} elements written)",
            GetErrorMsg(errNum), totalWritten);
        if(enableException)
            throw std::runtime_error(StrUtil::ConvToString(err));
    }

    return totalWritten;
}
BinaryTypeRef keeps a union to store the address of the POD. No conversion to text is necessary: the POD is written as it is into the binary file.
namespace Elmax
{
class BinaryTypeRef
{
public:
    ~BinaryTypeRef(void);

    union UNIONPTR
    {
        const int* pi;
        const unsigned int* pui;
        const short* psi;
        const unsigned short* pusi;
        const ELMAX_INT64* pi64;
        const unsigned ELMAX_INT64* pui64;
        const float* pf;
        const double* pd;
        std::string* ps;
        const std::wstring* pws;
        const char* pc;
        const unsigned char* puc;
        const wchar_t* pwc;
        const char* arr;
    };

    enum DTR_TYPE
    {
        DTR_INT, DTR_UINT, DTR_SHORT, DTR_USHORT, DTR_INT64, DTR_UINT64,
        DTR_FLOAT, DTR_DOUBLE, DTR_STR, DTR_WSTR, DTR_CHAR, DTR_UCHAR,
        DTR_WCHAR, DTR_BASEARRAY
    };

    BinaryTypeRef( const int& i ) { m_ptr.pi = &i; m_type = DTR_INT; size=sizeof(i); }
    BinaryTypeRef( const unsigned int& ui ) { m_ptr.pui = &ui; m_type = DTR_UINT; size=sizeof(ui); }
    BinaryTypeRef( const short& si ) { m_ptr.psi = &si; m_type = DTR_SHORT; size=sizeof(si); }
    BinaryTypeRef( const unsigned short& usi ) { m_ptr.pusi = &usi; m_type = DTR_USHORT; size=sizeof(usi); }
    BinaryTypeRef( const ELMAX_INT64& i64 ) { m_ptr.pi64 = &i64; m_type = DTR_INT64; size=sizeof(i64); }
    BinaryTypeRef( const unsigned ELMAX_INT64& ui64 ) { m_ptr.pui64 = &ui64; m_type = DTR_UINT64; size=sizeof(ui64); }
    BinaryTypeRef( const float& f ) { m_ptr.pf = &f; m_type = DTR_FLOAT; size=sizeof(f); }
    BinaryTypeRef( const double& d ) { m_ptr.pd = &d; m_type = DTR_DOUBLE; size=sizeof(d); }
    BinaryTypeRef( std::string& s ) { m_ptr.ps = &s; m_type = DTR_STR;
        elementSize=sizeof(char); size=s.length(); arraySize=s.length(); }
    BinaryTypeRef( const std::wstring& ws ) { m_ptr.pws = &ws; m_type = DTR_WSTR;
        elementSize=sizeof(wchar_t); size=ws.length()*sizeof(wchar_t); arraySize=ws.length(); }
    BinaryTypeRef( const char& c ) { m_ptr.pc = &c; m_type = DTR_CHAR; size=sizeof(c); }
    BinaryTypeRef( const unsigned char& uc ) { m_ptr.puc = &uc; m_type = DTR_UCHAR; size=sizeof(uc); }
    BinaryTypeRef( const wchar_t& wc ) { m_ptr.pwc = &wc; m_type = DTR_WCHAR; size=sizeof(wc); }
    BinaryTypeRef( const BaseArray& arr ) { m_ptr.arr = arr.GetPtr(); m_type = DTR_BASEARRAY;
        size=arr.GetTotalSize(); elementSize=arr.GetElementSize(); arraySize=arr.GetArraySize(); }

    char* GetAddress();

    DTR_TYPE m_type;
    UNIONPTR m_ptr;
    size_t size;
    size_t elementSize;
    size_t arraySize;
};
}
This is the C++11 variadic template binary Write version. The first Write is the base function which stops the recursive calls. It also makes use of the BinaryTypeRef class.
size_t Write()
{
    return 0;
}

template<typename T, typename... Args>
size_t Write( T t, Args... args )
{
    BinaryTypeRef dt(t);
    size_t totalWritten = 0;
    if(fp!=nullptr)
    {
        if(dt.m_type != BinaryTypeRef::DTR_STR && dt.m_type != BinaryTypeRef::DTR_WSTR
            && dt.m_type != BinaryTypeRef::DTR_BASEARRAY)
        {
            size_t len = fwrite(dt.GetAddress(), dt.size, 1, fp);
            if(len==1)
                ++totalWritten;
        }
        else
        {
            size_t len = fwrite(dt.GetAddress(), dt.elementSize, dt.arraySize, fp);
            if(len==dt.arraySize)
                ++totalWritten;
        }
    }

    return totalWritten + Write(args...);
}
Lastly, we have come to xBinaryReader. xBinaryReader makes use of BinaryTypeReadRef to do data conversion. Like xTextReader, xBinaryReader has overloaded Read to do its work but it has no ReadLine.
size_t xBinaryReader::Read( BinaryTypeReadRef D1, BinaryTypeReadRef D2 )
{
    size_t totalRead = 0;
    if(fp!=NULL)
    {
        if(D1.m_type != BinaryTypeReadRef::DTR_STRARRAY && D1.m_type != BinaryTypeReadRef::DTR_WSTRARRAY
            && D1.m_type != BinaryTypeReadRef::DTR_BASEARRAY)
        {
            size_t cnt = fread(D1.GetAddress(), D1.size, 1, fp);
            if(cnt==1)
                ++totalRead;
        }
        else
        {
            D1.DeferredMake();
            size_t cnt = fread(D1.GetAddress(), D1.elementSize, D1.arraySize, fp);
            if(cnt == D1.arraySize)
                ++totalRead;
        }

        if(D2.m_type != BinaryTypeReadRef::DTR_STRARRAY && D2.m_type != BinaryTypeReadRef::DTR_WSTRARRAY
            && D2.m_type != BinaryTypeReadRef::DTR_BASEARRAY)
        {
            size_t cnt = fread(D2.GetAddress(), D2.size, 1, fp);
            if(cnt==1)
                ++totalRead;
        }
        else
        {
            D2.DeferredMake();
            size_t cnt = fread(D2.GetAddress(), D2.elementSize, D2.arraySize, fp);
            if(cnt==D2.arraySize)
                ++totalRead;
        }
    }

    if(totalRead != 2)
    {
        errNum = ELMAX_READ_ERROR;
        err = StrUtil::Format(L"{0}: Less than 2 elements are read! ({1} elements read)",
            GetErrorMsg(errNum), totalRead);
        if(enableException)
            throw std::runtime_error(StrUtil::ConvToString(err));
    }

    return totalRead;
}
For simplicity, I do not show the BinaryTypeReadRef class here because the code is quite complicated as it supports DeferredMake of the array class.
This is the C++11 variadic template version of the binary Read. As with the binary Write before, the first function is the base function which ends the recursive calls. Like the previous Read, it too makes use of BinaryTypeReadRef.
size_t Read()
{
    return 0;
}

template<typename T, typename... Args>
size_t Read( T& t, Args&... args )
{
    BinaryTypeReadRef dt(t);
    size_t totalRead = 0;
    if(fp!=nullptr)
    {
        if(dt.m_type != BinaryTypeReadRef::DTR_STRARRAY && dt.m_type != BinaryTypeReadRef::DTR_WSTRARRAY
            && dt.m_type != BinaryTypeReadRef::DTR_BASEARRAY)
        {
            size_t cnt = fread(dt.GetAddress(), dt.size, 1, fp);
            if(cnt==1)
                ++totalRead;
        }
        else
        {
            dt.DeferredMake();
            size_t cnt = fread(dt.GetAddress(), dt.elementSize, dt.arraySize, fp);
            if(cnt == dt.arraySize)
                ++totalRead;
        }
    }

    return totalRead + Read(args...);
}
When I was writing the Windows code, I took special care to separate the Windows and non-Windows code with a _MICROSOFT macro. The _WIN32 macro is not used because MinGW defines it as well. The main difference between the Windows and non-Windows code at that point is that, on Windows, a linefeed ("\n") is converted to a carriage return and linefeed pair ("\r\n") during file writing, and the reverse is applied during file reading. On non-Windows platforms, a linefeed ("\n") remains a linefeed: no conversion is done.
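The conversion itself is simple to state in code. The sketch below is not taken from the library: ToCrLf and FromCrLf are hypothetical helpers, and ToCrLf assumes its input uses bare linefeeds only.

```cpp
#include <cstddef>
#include <string>

// Writing on Windows: every "\n" becomes "\r\n".
// Assumes the input does not already contain "\r\n" pairs.
std::string ToCrLf(const std::string& text)
{
    std::string out;
    out.reserve(text.size());
    for (char c : text)
    {
        if (c == '\n') out += '\r';
        out += c;
    }
    return out;
}

// Reading on Windows: every "\r\n" becomes "\n"; a lone '\r' is left untouched.
std::string FromCrLf(const std::string& text)
{
    std::string out;
    out.reserve(text.size());
    for (std::size_t i = 0; i < text.size(); ++i)
    {
        if (text[i] == '\r' && i + 1 < text.size() && text[i + 1] == '\n')
            continue; // drop the carriage return of a CR LF pair
        out += text[i];
    }
    return out;
}
```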
I downloaded and installed Orwell Dev-C++ to test my code on MinGW and GCC on Windows. Orwell Dev-C++ is a continuation of the (currently inactive) Bloodshed Dev-C++. It comes bundled with MinGW and a fairly recent GCC 4.6.x. During compilation, Orwell Dev-C++ complained about unavailable secure C functions (typically those whose names end with _s), such as _itow_s. So I changed them to the non-secure versions for the non-Windows implementation, while the Windows implementation still uses the secure versions. Dev-C++ also complained that it could not find a std::exception constructor which takes a string. It turned out that std::exception is meant to be derived from and not used directly. I changed the uses of std::exception to proper exception types, such as logic_error, runtime_error and so on. With these changes done, I assumed most of my Linux work was done. I estimated that, excluding the time to learn G++ and write the makefile, it would take me at most 1 hour to get the code working. That was when I found out I had grossly underestimated the time it would take me to resolve the errors on Ubuntu Linux 12.04.
After converting the Orwell Dev-C++ makefile to work on Ubuntu Linux and GCC 4.6.3, the first error G++ reported was that it did not understand the include paths. So I changed the backslashes to forward slashes.
#include "..\\..\\Common\\Common.h"
The above path is changed to below.
#include "../../Common/Common.h"
This was an easy change, though I had to update most of the 66 source files. The next G++ complaint was that it could not find the data conversion functions (typically those whose names start with an underscore), such as _ultow. It turned out that Microsoft's standard conversion functions were not standard after all. I had to use stringstream to replace _ultow and its cousins. All compilation errors were resolved at this point, so I ran the unit tests. They crashed at the first Unicode test! Upon some investigation, I discovered, to my dismay, that the size of wchar_t on Linux and Mac OSX is 4 bytes instead of 2 bytes! That meant all the wchar_t related functions did not work correctly on Linux and Mac OSX. It was clearly a showstopper! It took me 3 laborious days to implement UTF-16 conversions and handle all the instances where the wchar_t size was 4. Unicode files are essentially UTF-16 files. On Windows, UTF-16 is supported natively. On Ubuntu Linux, I have to convert the 4-byte wchar_t (UTF-32) to UTF-16 before writing to a Unicode file. The reverse conversion applies during reading.
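The library's own conversion routines are not listed in this article. The sketch below shows the standard UTF-32 to UTF-16 mapping and its reverse using the C++11 types std::u32string and std::u16string in place of wchar_t. ToUtf16 and ToUtf32 are hypothetical names, and the input is assumed to be well formed (no lone surrogates, no code points above U+10FFFF).

```cpp
#include <cstddef>
#include <string>

// UTF-32 to UTF-16: code points above U+FFFF are split into a surrogate pair.
std::u16string ToUtf16(const std::u32string& in)
{
    std::u16string out;
    for (char32_t cp : in)
    {
        if (cp <= 0xFFFF)
        {
            out += static_cast<char16_t>(cp);
        }
        else
        {
            cp -= 0x10000;
            out += static_cast<char16_t>(0xD800 + (cp >> 10));   // high surrogate
            out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF)); // low surrogate
        }
    }
    return out;
}

// UTF-16 to UTF-32: a high surrogate followed by a low surrogate is merged into one code point.
std::u32string ToUtf32(const std::u16string& in)
{
    std::u32string out;
    for (std::size_t i = 0; i < in.size(); ++i)
    {
        char32_t cp = in[i];
        const bool isHigh = cp >= 0xD800 && cp <= 0xDBFF;
        if (isHigh && i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF)
        {
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i; // the low surrogate has been consumed
        }
        out += cp;
    }
    return out;
}
```

The converted text can be longer than the original when counted in units, which is the length difference that has to be accounted for when buffers are sized.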
If you are interested in running the Linux tests, you can run the commands below to build the library (FileLib.a) and the test application (UnitTest.exe) and execute it.
cd FileLib
cd FileIO
make all
cd ..
cd PreVS2012UnitTest
make all
./UnitTest.exe
In total, there are 55 unit tests for Windows and 65 unit tests for Linux. Whenever I made a change or fixed a bug for either OS, I ran the unit tests for both to make sure I had not broken anything on either side.
Clang 3.1 on Ubuntu 12.04 is able to compile the library using the GCC 4.7 standard library. However, Clang compilation failed on Mac OSX 10.8 because it could not find an overloaded constructor with a size_t parameter. size_t is an unsigned 32-bit integer on 32-bit platforms and an unsigned 64-bit integer on 64-bit platforms. Apparently, Clang on Mac OSX sees size_t as a distinct type from the integer types that already have constructors. Adding that constructor unconditionally failed on the Microsoft compiler, which complained that a similar constructor already exists, so the fix is to hide it under the __APPLE__ check.
#ifdef __APPLE__
    DataType( size_t ui );
#endif
In order to compile under Clang successfully, remove Microsoft specific files, like stdafx.h, WinOperation.h/cpp and Boost files like BoostStrategy.h/cpp and RegExStrategy.h/cpp. For unit testing, LinuxUnitTest.cpp can be used.
This is a list of issues that the users need to be aware of when using this file library.
overflow_error
underflow_error
EnableException
GetLastError
The original source code uploaded with this article was tested with Visual Studio 2010/2012 and Valgrind (Linux) and showed no memory leaks during correct program operation. There could be leaks, however, when an exception was thrown and the deallocation code was skipped. Another problem was that the exceptions were allocated on the heap and not freed in the catch handler (an oversight). All of this has been rectified: every array now uses Resource Acquisition Is Initialization (RAII) to free its memory, and an exception, if thrown, is now allocated on the stack.
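The two fixes can be illustrated in a few lines. This is a generic sketch and not the library's code: a std::vector stands in for the RAII array wrapper, and ThrowsAndCleansUp is a hypothetical function.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Returns true when the exception thrown by value is caught by reference
// with its message intact.
bool ThrowsAndCleansUp()
{
    try
    {
        // RAII: the vector's destructor frees the buffer during stack unwinding,
        // so no delete[] can be skipped by the throw below.
        std::vector<wchar_t> buffer(64);
        buffer[0] = L'x';

        // Thrown by value (no 'new'), so the catch handler has nothing to delete.
        throw std::runtime_error("write failed");
    }
    catch (const std::exception& e)
    {
        return std::string(e.what()) == "write failed";
    }
}
```

Had the exception been thrown with new, a handler catching by reference would never see it, and a handler catching the pointer would have to remember to delete it.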
There are plans to move the library to C++11 features like variadic templates, nullptr and move semantics. It is much clearer to use standard integer types like uint32_t as opposed to unsigned int. A preliminary C++11 version is already available for download. The C++98 version will still be maintained on a different Git branch.
The table below shows the number of lines of code (LOC) for each class in C++98 and C++11. Applying C++11 variadic templates reduces the LOC by more than 50%.
The reader may or may not have noticed the Elmax namespace used in the code snippets. As anyone would have guessed, the file library is for the future cross-platform Elmax XML Library, but why include a binary file API as well? The reason is that there will be a version of Elmax which can save XML in binary form. Let us briefly recall the Elmax syntax for writing a value to an XML element.
using namespace Elmax;

Element elem;
elem[L"Price"] = 30.65f;
elem[L"Qty"] = 1200;
As the reader can see from the sample code above, an Elmax element is aware of the data type before it converts the data to textual form. By using the data type information, Elmax can build a metadata section about the XML. The metadata can be kept separate or embedded inside the Binary XML. If the XML contains mainly recurring elements, the metadata can be concise and small. However, if the XML file consists of free-form XML like SOAP XML, HTML or XAML, the metadata can be quite big relative to the Binary XML. A binary file has the advantage of being fast because the data-type conversion from textual form is out of the picture.
I have modified an old OpenGL demo to read a binary file to showcase the file library. Set the global variable g_bLoadBinary according to which file type you want the demo to load. Please note that the OpenGL code is not cross-platform and only runs on Windows. Previously, I uploaded an OpenGL demo for another article. Since I only have access to an NVidia graphics card, I was not aware that the code did not run correctly on Intel graphics chipsets. This demo should not have the same problem. Please let me know if you have any problem running the OpenGL demo. The demo is written in OpenGL 2.0. An OpenGL 4.0 version is being developed for a future OpenGL article. Stay tuned if you are interested in OpenGL 4.0!
This is the wood clip model loaded. The model is modelled using very old Milkshape shareware.
This is the screenshot of the demo.
In this article, we have seen a new file API which makes writing and reading structured data intuitive and productive. By keeping the text and binary APIs similar, the user can maintain both file formats with minimal effort. The file library will be used by the new Elmax XML library to save textual and binary XML files. The XML work is an ongoing effort, and the estimated date of completion is unknown. The source code is currently hosted on GitHub.
Elmax C++ File Library is available on NuGet Gallery for VS2010 and VS2012! Remember to update NuGet to the latest version, 2.5, first.