CDataFile - An easy class for reading numeric data in CSV or Text-Delimited format.
This class reads numeric data and stores it for easy access. The data can be accessed by (row, column) from any data reduction routines you may have.
Introduction
This article presents the CDataFile class. CDataFile is designed to provide an interface for using and manipulating CSV (comma-separated values) and other text-delimited, tabular data files. CDataFile was first published in November 2002; it has since undergone several improvements and enhancements and was eventually completely redesigned. This implementation has removed all dependence on MFC and now makes heavy use of the Standard Template Library (STL). Throughout the class, you will find several interesting tricks and some creative use of predicates, function objects, and the STL algorithms. Most importantly, you will find a powerful interface that is designed for usability by the novice and the experienced programmer alike.
Why CDataFile?
When I set out to search for a class to read CSV and other text data, I had these criteria:
- The class had to be very fast in parsing, lookups, and data manipulation.
- The class had to provide a simple (column, row) type interface.
- The class had to be able to efficiently handle data files in excess of 100 MB.
To my surprise, I was hard pressed to find anything that met my criteria. I found some solutions that used ODBC and even DOM, but nothing lightweight, simple to use, or very fast. So I decided to write my own, and CDataFile is the result.
Using CDataFile
To use CDataFile, you need to add DataFile.cpp and DataFile.h to your project and include the following header file:
#include "DataFile.h"
I have listed the member functions and operators of CDataFile in the tables below. Note that the term variable in this class is synonymous with field or column, and the term sample is synonymous with record or row.
CDataFile Member Functions
Function | Description |
AppendData | Appends data to a specified variable. |
ClearData | Clears all data from the CDataFile. |
CreateVariable | Creates a new variable in the CDataFile. |
DeleteVariable | Deletes a variable from a CDataFile. |
FromVector | A static function to create a CDataFile from a vector. |
GetData | Gets data from the CDataFile. |
GetLastError | Gets the last error encountered by the CDataFile. |
GetNumberOfSamples | Gets the number of samples contained in a variable. |
GetNumberOfVariables | Gets the number of variables in the CDataFile. |
GetReadFlags | Gets the current read flags. |
GetVariableIndex | Gets the lookup index of a specified variable. |
GetVariableName | Gets the name of the variable at a specified location. |
ReadFile | Reads the contents of a file and stores it in the CDataFile. |
SetData | Sets data in the CDataFile. |
SetDelimiter | Sets the delimiter to use for parsing files. |
SetReadFlags | Sets the read flags of the CDataFile. |
SetReserve | Sets the capacity of contiguous memory for the CDataFile before a reallocation is needed. |
WriteFile | Writes the contents of the CDataFile to a file. |
CDataFile Operators
Operator | Description |
operator() | Gets data from the CDataFile depending on which overload is called. |
operator[][] | Gets a reference to the data at the specified location. (Note: valid for DOUBLE mode only, not STRING.) |
operator+ | Combines the contents of two or more CDataFile objects. |
operator+= | Appends the contents of another CDataFile to the CDataFile. |
operator= | Sets the contents of the CDataFile to the contents of another CDataFile. |
operator<< | Puts the contents of a CDataFile to a stream. |
operator>> | Extracts the contents of a CDataFile from a stream. |
Constructing a CDataFile
CDataFile(void);
CDataFile(const int& dwReadFlags);
CDataFile(const char* szFilename, const int& dwReadFlags = DF::RF_READ_AS_DOUBLE);
CDataFile(const DF::DF_SELECTION& df_selection);
CDataFile(const CDataFile& df);
Parameters
dwReadFlags
Determines how the data is read and stored in the CDataFile. Can be any combination of the following:
- RF_READ_AS_DOUBLE (takes priority if RF_READ_AS_STRING is also set)
- RF_READ_AS_STRING
- RF_APPEND_DATA (takes priority if RF_REPLACE_DATA is also set)
- RF_REPLACE_DATA
szFilename
The fully qualified path to the data file to read upon construction.
df_selection
A subset of another CDataFile from which the constructor will take its initial values.
df
Another CDataFile from which the constructor will take its initial values.
Remarks
None of the CDataFile constructors will throw an exception; any exceptions are caught internally. Optionally, the user can call CDataFile::GetLastError() to retrieve error information about an exception raised during construction.
Example
// Construct an empty CDataFile:
CDataFile df;

// Construct a CDataFile to read a CSV file containing data of type double:
CDataFile df("C:\\MyData.csv");

// Construct a CDataFile to read a CSV file containing data of type string:
CDataFile df("C:\\MyStringData.csv", DF::RF_READ_AS_STRING);

// Construct an empty CDataFile that will read string data and append
// data with each subsequent file read:
CDataFile df(DF::RF_READ_AS_STRING | DF::RF_APPEND_DATA);

// Construct a CDataFile from a subset of another CDataFile:
CDataFile df1("C:\\MyData.csv");
CDataFile df2(df1(iLeftColumn, iTopRow, iRightColumn, iBottomRow));
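The Remarks above describe constructors that catch exceptions internally and report them through GetLastError(). The following is a minimal sketch of that general pattern, not the CDataFile implementation: the class name Loader, the IsOk() member, and the error text are all hypothetical and exist only for illustration.

```cpp
#include <exception>
#include <fstream>
#include <stdexcept>
#include <string>

// Hypothetical class showing the "catch internally, report through
// GetLastError()" pattern. It is not part of CDataFile.
class Loader
{
public:
    explicit Loader(const char* szFilename)
    {
        try
        {
            std::ifstream in(szFilename);
            if (!in)
                throw std::runtime_error(std::string("Cannot open ") + szFilename);
            m_ok = true;
        }
        catch (const std::exception& e)
        {
            // The exception is remembered here and never leaves the constructor.
            m_lastError = e.what();
        }
    }

    bool IsOk() const { return m_ok; }
    const char* GetLastError() const { return m_lastError.c_str(); }

private:
    bool m_ok = false;
    std::string m_lastError;
};
```

With a design like this, the caller constructs the object unconditionally and then checks its state, rather than wrapping the construction in a try block.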
CDataFile::AppendData()
The AppendData() member function appends data to the end of a specified variable. You can append values one at a time, or append the contents of an entire vector. AppendData() will return true if it was successful, or false if an error was encountered.
bool AppendData(const int& iVariable, const char* szValue);
bool AppendData(const int& iVariable, const double& value);
bool AppendData(const int& iVariable, const std::vector<double>& vData);
bool AppendData(const int& iVariable, const std::vector<std::string>& vszData);
Parameters
iVariable
The index of the variable to which the data will be appended.
szValue
The value to append. (Assumes DF::RF_READ_AS_STRING.)
value
The value to append. (Assumes DF::RF_READ_AS_DOUBLE.)
vData
A const reference to a vector containing the values to append. (Assumes DF::RF_READ_AS_DOUBLE.)
vszData
A const reference to a vector containing the values to append. (Assumes DF::RF_READ_AS_STRING.)
Return Value
Returns true if the function was successful, false if an error was encountered.
Remarks
If the user appends string data, it is appended to the CDataFile's internal string vector, whereas values of type double are appended to the class's internal double vector.
Example
// To append 0.2123 to the variable at index 0:
if(!df.AppendData(0, 0.2123))
    cout << df.GetLastError();

// To append the contents of a std::vector<double>, v, to the variable at index 4:
if(!df.AppendData(4, v))
    cout << df.GetLastError();
CDataFile::ClearData()
The ClearData() member function is responsible for clearing all the data from a CDataFile, reclaiming any allocated memory, and zeroing the class's internal buffer size.
void ClearData(void);
Remarks
When reading a new data file, the class will again reserve a block of contiguous memory equal to the size specified by CDataFile::SetReserve(); if no reserve was set, the internal buffer capacity will be zero. The class will allocate internal buffer storage as needed, but may exhibit the behavior studied in the linked article if the class is required to read excessively large files. Calling CDataFile::SetReserve() with an adequate capacity before CDataFile::ReadFile() will eliminate this behavior.
CDataFile::ClearData() is called automatically when a CDataFile is destroyed.
Example
// Flush the contents of a CDataFile, df.
df.ClearData();
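The effect of reserving capacity up front can be sketched with a plain std::vector. This is only an illustration under the assumption that the internal storage behaves like a std::vector<double>; countReallocations is a hypothetical helper and is not part of CDataFile.

```cpp
#include <cstddef>
#include <vector>

// Counts how many times a std::vector<double> changes capacity while n
// values are appended. If nReserve > 0, that capacity is requested before
// the first append, which mirrors calling SetReserve() before ReadFile().
// (Hypothetical helper for illustration only.)
std::size_t countReallocations(std::size_t n, std::size_t nReserve)
{
    std::vector<double> v;
    if (nReserve > 0)
        v.reserve(nReserve);

    std::size_t reallocations = 0;
    std::size_t lastCapacity = v.capacity();
    for (std::size_t i = 0; i < n; ++i)
    {
        v.push_back(static_cast<double>(i));
        if (v.capacity() != lastCapacity)
        {
            // Capacity grew, so the existing elements were moved to new storage.
            ++reallocations;
            lastCapacity = v.capacity();
        }
    }
    return reallocations;
}
```

Calling countReallocations(n, n) performs no reallocation during the appends, while countReallocations(n, 0) reallocates repeatedly as the vector grows. The exact count depends on the standard library's growth policy.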
CDataFile::CreateVariable()
The member function, CreateVariable(), is provided as a means to append a new variable to the end of your CDataFile. Think of it as adding a new column to the right of your data in Excel. You can create variables of pre-determined sizes and an initial value, or you can create a variable from an existing vector.
bool CreateVariable(const char* szName, const double& initial_value, const int& iSize = 0);
bool CreateVariable(const char* szName, const std::string& initial_value, const int& iSize = 0);
bool CreateVariable(const char* szName, const std::vector<double>& vData);
bool CreateVariable(const char* szName, const std::vector<std::string>& vszData);
Parameters
szName
The name you want to assign to the variable. Think of it as a column label in Excel or a field name in a database.
initial_value
The value you want all of the new data to contain. Think of it as assigning all the rows in a column this initial value.
iSize
The number of samples (rows or records) you want your variable to contain.
vData
A vector of type double containing the data to assign to the new variable.
vszData
A vector of type string containing the data to assign to the new variable.
Return Value
Returns true if the function was successful, false if an error was encountered.
Remarks
The default size of a new variable is 0. The user must then call CDataFile::AppendData() to add values to the variable. If the user attempts to call CDataFile::SetData() on a variable of size 0, an error will occur.
Example
// Create a variable in df named "MyVar"
// with 27 samples initialized to 0.0.
if(!df.CreateVariable("MyVar", 0.0, 27))
    cout << df.GetLastError();

// Create a variable in df named "My Vector Var"
// containing the data in vector, v.
if(!df.CreateVariable("My Vector Var", v))
    cout << df.GetLastError();
CDataFile::DeleteVariable()
The DeleteVariable() member function is provided to delete a variable from a CDataFile. Think of it as deleting a column in Excel, or an entire field in a database.
bool DeleteVariable(const int& iVariable);
Parameters
iVariable
The index of the variable to be deleted from the CDataFile.
Return Value
Returns true if the function was successful, false if an error was encountered.
Remarks
Use CDataFile::GetLastError() to retrieve any error information.
Example
// Delete the variable at index 3 from CDataFile, df.
if(!df.DeleteVariable(3))
    cout << df.GetLastError();
CDataFile::FromVector()
The static member function, FromVector(), is provided as a means to create a CDataFile object from an existing data vector. This function is particularly useful when using CDataFile::operator+ and CDataFile::operator+=.
static CDataFile FromVector(const char* szVariableName, const std::vector<double>& vData);
static CDataFile FromVector(const char* szVariableName, const std::vector<std::string>& vData);
Parameters
szVariableName
The name you want to assign to the variable. Think of it as a column label in Excel or a field name in a database.
vData
A vector of type double or string containing the data to assign to the new variable.
Return Value
Returns a CDataFile containing the resultant variable.
Remarks
Use this static function when you need to convert a vector of data to a CDataFile.
Example
// Add the contents of vector, v, to CDataFile, df, using operator +=
df += CDataFile::FromVector("MyVectorData", v);

// Combine the contents of vector, v1, and vector, v2,
// and set the CDataFile, df, equal to the result.
df = CDataFile::FromVector("My V1 Data", v1) + CDataFile::FromVector("My V2 Data", v2);
CDataFile::GetData()
There are several overloads provided for the member function, GetData(). This allows great flexibility in how the user retrieves the data. You can get data by variable index or by variable name, and you can obtain a single sample or all samples.
double GetData(const int& iVariable, const int& iSample);
double GetData(const char* szVariableName, const int& iSample);
int GetData(const int& iVariable, std::vector<double>& rVector);
int GetData(const char* szVariableName, std::vector<double>& rVector);
int GetData(const int& iVariable, const int& iSample, char* lpStr);
int GetData(const char* szVariableName, const int& iSample, char* lpStr);
int GetData(const int& iVariable, const int& iSample, std::string& rStr);
int GetData(const char* szVariableName, const int& iSample, std::string& rStr);
int GetData(const int& iVariable, std::vector<std::string>& rVector);
int GetData(const char* szVariableName, std::vector<std::string>& rVector);
Parameters
iVariable
The index of the variable from which to retrieve the data.
iSample
The sample number (record or row, 0-indexed) to retrieve.
rVector
A reference to a vector containing the proper data type to receive the data.
lpStr
A pointer to a string buffer to receive the data.
rStr
A reference to a std::string that will receive the data.
Return Value
double GetData(const int& iVariable, const int& iSample);
double GetData(const char* szVariableName, const int& iSample);
Returns a value of type double equal to the data at the specified location if successful, DF::ERRORVALUE if an error is encountered.

int GetData(const int& iVariable, std::vector<double>& rVector);
int GetData(const char* szVariableName, std::vector<double>& rVector);
int GetData(const int& iVariable, std::vector<std::string>& rVector);
int GetData(const char* szVariableName, std::vector<std::string>& rVector);
Returns the new size of rVector if successful, -1 if an error is encountered.

int GetData(const int& iVariable, const int& iSample, char* lpStr);
int GetData(const char* szVariableName, const int& iSample, char* lpStr);
int GetData(const int& iVariable, const int& iSample, std::string& rStr);
int GetData(const char* szVariableName, const int& iSample, std::string& rStr);
Returns the new length of lpStr or rStr if successful, -1 if an error is encountered.
Remarks
When GetData() is called with parameters of type double, DF::RF_READ_AS_DOUBLE is assumed. When it is called with parameters of type char* or std::string, DF::RF_READ_AS_STRING is assumed.
Example
// Initialize a variable, d, to sample 7 of "MyVar" from CDataFile, df.
double d = df.GetData("MyVar", 7);
if(d == DF::ERRORVALUE)
    cout << df.GetLastError();

// Get string data from sample 3 of variable 9 from CDataFile, df.
// The buffer must be large enough for the value; 256 is an assumed upper bound.
CString szValue = "";
int iLength = 0;
iLength = df.GetData(9, 3, szValue.GetBuffer(256));
szValue.ReleaseBuffer();
if(iLength == -1)
    cout << df.GetLastError();
else
{
    //... do something
}

// Get all the data from "My Variable 2" out of CDataFile,
// df, and put it in vector, vData.
std::vector<double> vData;
int iSize = df.GetData("My Variable 2", vData);
if(iSize == -1)
    cout << df.GetLastError();
else
{
    //... do something
}
CDataFile::GetLastError()
The member function, GetLastError(), is provided as a means to extract information from a CDataFile regarding the last error encountered by the class.
const char* GetLastError(void) const;
Return Value
Returns a const char* containing information about the last error encountered by the class.
Example
// Display the last error encountered by CDataFile, df.
cout << df.GetLastError();
CDataFile::GetNumberOfSamples()
The GetNumberOfSamples() member function is provided as a means to determine how many samples are contained in any given variable.
int GetNumberOfSamples(const int& iVariable) const;
Parameters
iVariable
The index of the variable for which to obtain the number of samples.
Return Value
Returns the number of samples if the function was successful, -1 if an error was encountered.
Remarks
Use CDataFile::GetLastError() to retrieve any error information.
Example
// Get the number of samples contained
// in the variable at index 3 from CDataFile, df.
int nSamps = df.GetNumberOfSamples(3);
if(nSamps == -1)
    cout << df.GetLastError();
CDataFile::GetNumberOfVariables()
The GetNumberOfVariables() member function is provided so that the user can obtain the number of variables currently contained in the CDataFile.
int GetNumberOfVariables(void) const;
Return Value
Returns the number of variables contained in the CDataFile.
Example
// Get the number of variables contained in CDataFile, df.
int nVars = df.GetNumberOfVariables();
CDataFile::GetReadFlags()
The GetReadFlags() member function is provided as a means to obtain the current read flags that have been either set or cleared.
int GetReadFlags(void) const;
Return Value
Returns the current read flags.
Remarks
The function returns an int that contains the flags encoded within it. Use the bitwise & operator to determine which flags are actually set.
Example
// Check to see if the user has set DF::RF_APPEND_DATA
cout << "Append mode is "
     << (df.GetReadFlags() & DF::RF_APPEND_DATA ? "" : "NOT ")
     << "set!";
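The flag test above can be sketched in isolation. The numeric values below are assumptions chosen for illustration; the real constants are defined in the DF namespace of DataFile.h and may differ. The namespace Sketch and the helpers isSet and readsAsString are hypothetical names.

```cpp
// Hypothetical flag values for illustration only. The real constants live
// in the DF namespace of DataFile.h and may use different values.
namespace Sketch
{
    enum ReadFlags
    {
        RF_READ_AS_DOUBLE = 0x01,
        RF_READ_AS_STRING = 0x02,
        RF_APPEND_DATA    = 0x04,
        RF_REPLACE_DATA   = 0x08
    };

    // True if every bit of 'flag' is present in 'flags'.
    inline bool isSet(int flags, int flag)
    {
        return (flags & flag) == flag;
    }

    // Applies the documented priority rule: RF_READ_AS_DOUBLE wins when
    // both read modes are set, so string mode requires the double bit
    // to be clear.
    inline bool readsAsString(int flags)
    {
        return isSet(flags, RF_READ_AS_STRING) && !isSet(flags, RF_READ_AS_DOUBLE);
    }
}
```

Because each flag occupies its own bit, flags can be combined with | and tested independently with &, which is what makes the priority rules in the constructor documentation possible.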
CDataFile::GetVariableIndex()
The GetVariableIndex() member function is provided as a means to look up the index of a variable, given its name and/or other information.
int GetVariableIndex(const char* szVariableName, const int& iStartingIndex = 0);
int GetVariableIndex(const char* szVariableName, const char* szSourceFilename, const int& iStartingIndex = 0);
Parameters
szVariableName
The name of the variable for which to find the index.
iStartingIndex
The index from which to begin the search.
szSourceFilename
The name of the file from which the variable originated.
Return Value
Returns the index (0-based) of the variable if the function was successful, -1 if an error was encountered.
Remarks
GetVariableIndex() will return the first instance of the variable name. If you have variables in your data with the same name, and you need an instance of the variable name other than the one first encountered, you will want to offset your search with iStartingIndex.
You may have data in a CDataFile that comes from different source files (e.g., by using DF::RF_APPEND_DATA). In these cases, you may want an instance of a variable from a particular source file, and you would use the overload that takes szSourceFilename.
Example
// Get the index of "MyVar" from CDataFile, df.
int iVar = df.GetVariableIndex("MyVar");
if(iVar == -1)
    cout << df.GetLastError();

// Get the index of "My Var 2" occurring after
// variable 3 whose source file is "C:\data.csv".
int iVar2 = df.GetVariableIndex("My Var 2", "C:\\data.csv", 3);
if(iVar2 == -1)
    cout << df.GetLastError();
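The offset search described in the Remarks can be sketched over a plain list of names. This is an illustration, not the class's code: findVariableIndex is a hypothetical function, and it assumes the search includes the starting index itself, which the original documentation does not state explicitly.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Returns the 0-based index of the first name equal to 'target' at or
// after 'start', or -1 if there is none. (Hypothetical helper; whether
// CDataFile treats the starting index as inclusive is an assumption.)
int findVariableIndex(const std::vector<std::string>& names,
                      const std::string& target,
                      int start = 0)
{
    if (start < 0)
        return -1;
    for (std::size_t i = static_cast<std::size_t>(start); i < names.size(); ++i)
    {
        if (names[i] == target)
            return static_cast<int>(i);
    }
    return -1;
}
```

To walk through every variable that shares a name, call the function again with the previous result plus one as the starting index until it returns -1.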
CDataFile::GetVariableName()
The GetVariableName() member function is provided to look up the name of a variable (or field) given its index (or 0-based column number).
int GetVariableName(const int& iVariable, char* lpStr);
int GetVariableName(const int& iVariable, std::string& rStr);
Parameters
iVariable
The index of the variable for which to obtain the name.
lpStr
A pointer to a string buffer to receive the name.
rStr
A string to receive the name.
Return Value
Returns the length of the variable name if the function was successful, -1 if an error was encountered.
Remarks
Use CDataFile::GetLastError() to retrieve any error information.
Example
// Get the variable name at index 3.
// The buffer must be large enough for the name; 256 is an assumed upper bound.
CString szVarName = "";
int iLength = df.GetVariableName(3, szVarName.GetBuffer(256));
szVarName.ReleaseBuffer();

// or...
std::string szVar = "";
iLength = df.GetVariableName(3, szVar);
if(iLength == -1)
    cout << df.GetLastError();
else
    cout << szVar.c_str();
CDataFile::ReadFile()
The ReadFile() member function is provided as a way to easily read a data file; it wraps all of the file I/O for you.
bool ReadFile(const char* szFilename);
bool ReadFile(const char* szFilename, const unsigned& dwReadFlags);
Parameters
szFilename
The fully qualified path to the file.
dwReadFlags
The flags to specify how to read the file. (See Constructing a CDataFile for more info.)
Return Value
Returns true if the function was successful, false if an error was encountered.
Remarks
Use CDataFile::GetLastError() to retrieve any error information.
Example
// Read the contents of "C:\test.csv" into CDataFile, df.
if(!df.ReadFile("C:\\test.csv"))
    cout << df.GetLastError();

// Append the data from "C:\test2.csv" to CDataFile, df.
if(!df.ReadFile("C:\\test2.csv", DF::RF_APPEND_DATA))
    cout << df.GetLastError();
CDataFile::SetData()
The SetData() member function is provided as a way to set data values stored in a CDataFile.
bool SetData(const int& iVariable, const int& iSample, const double& value);
bool SetData(const int& iVariable, const int& iSample, const char* szValue);
Parameters
iVariable
The variable for which to set the data.
iSample
The sample for which to set the data.
value
The value that the data will be set to. (Assumes DF::RF_READ_AS_DOUBLE.)
szValue
The value that the data will be set to. (Assumes DF::RF_READ_AS_STRING.)
Return Value
Returns true if the function was successful, false if an error was encountered.
Remarks
Use CDataFile::GetLastError() to retrieve any error information.
Example
// Set variable 3, sample 0 to 2.121 in CDataFile, df.
if(!df.SetData(3, 0, 2.121))
    cout << df.GetLastError();
CDataFile::SetDelimiter()
The SetDelimiter() member function is provided to set the delimiter that will separate values in a CDataFile.
void SetDelimiter(const char* delim);
Parameters
delim
The character to use for parsing data.
Example
// Set the delimiter in CDataFile, df, to 'tab'.
df.SetDelimiter("\t");
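To show what a delimiter does during parsing, here is a minimal sketch that splits one line of text into numeric fields. It is not the parser used by CDataFile: splitLine is a hypothetical function, it handles only a single-character delimiter, and it does no quoting or error reporting.

```cpp
#include <cstdlib>
#include <sstream>
#include <string>
#include <vector>

// Splits one line of text on a single-character delimiter and converts
// each field to double with std::strtod. Fields that do not start with a
// number convert to 0.0. (Hypothetical helper for illustration only.)
std::vector<double> splitLine(const std::string& line, char delim)
{
    std::vector<double> values;
    std::istringstream in(line);
    std::string field;
    while (std::getline(in, field, delim))
        values.push_back(std::strtod(field.c_str(), nullptr));
    return values;
}
```

For example, splitLine("1.5\t2\t-3.25", '\t') yields three values, and the same line split on ',' would yield a single field, which is why the delimiter must match the file being read.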
CDataFile::SetReserve()
The SetReserve() member function is provided to set the capacity of a CDataFile.
void SetReserve(const int& nReserve);
Parameters
nReserve
The number of elements to reserve space for.
Example
// Set the reserve of CDataFile, df, to 1000000.
df.SetReserve(1000000);
CDataFile::WriteFile()
The WriteFile() member function is provided to simplify writing a CDataFile to disk.
bool WriteFile(const char* szFilename, const char* szDelim = ",");
Parameters
szFilename
The fully qualified path for the destination file.
szDelim
The delimiter to use to separate data values.
Return Value
Returns true if the function was successful, false if an error was encountered.
Remarks
Use CDataFile::GetLastError() to retrieve any error information.
Example
// Write the contents of CDataFile, df, to "C:\test.csv".
if(!df.WriteFile("C:\\test.csv"))
    cout << df.GetLastError();
CDataFile::operator()
The () operator is provided to easily extract data from a CDataFile.
double operator()(const int& iVariable, const int& iSample);
int operator()(const int& iVariable, const int& iSample, char* lpStr);
DF::DF_SELECTION operator()(const int& left, const int& top, const int& right, const int& bottom);
Parameters
iVariable
The variable for which to obtain the data.
iSample
The sample for which to obtain the data.
lpStr
A string buffer to receive the data.
left, top, right, bottom
The coordinates from which to obtain the resulting selection. Think of it as highlighting a range of cells in Excel.
Return Value
double operator()(const int& iVariable, const int& iSample);
Returns the value at the specified location if successful, DF::ERRORVALUE if an error was encountered.

int operator()(const int& iVariable, const int& iSample, char* lpStr);
Returns the new length of lpStr if successful, -1 if an error is encountered.

DF::DF_SELECTION operator()(...);
Returns a selection (DF_SELECTION) containing the values in the specified range.
Remarks
Use CDataFile::GetLastError() to retrieve any error information.
Example
// Set d equal to the data at (2,4) from CDataFile, df.
double d = df(2,4);

// Get the string data contained at (1,7).
// The buffer must be large enough for the value; 256 is an assumed upper bound.
CString szData = "";
int iLength = df(1,7,szData.GetBuffer(256));
szData.ReleaseBuffer();

// Create a new CDataFile from the range (0,0,9,120).
CDataFile dfNew(df(0,0,9,120));
CDataFile::operator[][]
The [][] operator is provided to easily extract data of type double from a CDataFile.
Return Value
Returns a reference to the data at the specified location.
Remarks
This operator is only valid for DF::RF_READ_AS_DOUBLE. Since an actual reference is returned, you are able to assign values as well as read them.
Example
// Set d equal to the data at variable 2, sample 4 from CDataFile, df.
double d = df[2][4];

// Set the data at variable 9, sample 0, from CDataFile, df, equal to d.
df[9][0] = d;
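C++ has no single [][] operator, so this syntax is normally built from an operator[] whose result can itself be subscripted. The sketch below shows one common way to do that with nested vectors. The class Table is hypothetical and makes no claim about how CDataFile stores its data internally.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a table of doubles stored one variable
// (column) at a time. Not the CDataFile implementation.
class Table
{
public:
    Table(std::size_t nVariables, std::size_t nSamples)
        : m_data(nVariables, std::vector<double>(nSamples, 0.0)) {}

    // The first subscript selects the variable. Returning the column by
    // reference lets the second [] select the sample and yield a double&
    // that can be read or assigned.
    std::vector<double>& operator[](std::size_t iVariable)
    {
        return m_data[iVariable];
    }

    // Read-only access for const tables.
    const std::vector<double>& operator[](std::size_t iVariable) const
    {
        return m_data[iVariable];
    }

private:
    std::vector<std::vector<double>> m_data;
};
```

With this design, t[2][4] = 1.25 writes directly into the stored value, and no bounds checking is performed, which is the usual trade-off of subscript access compared with a checked function such as GetData().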
CDataFile::operator+
The + operator is provided to combine the contents of multiple CDataFile objects.
CDataFile operator+ (const CDataFile&) const;
Example
// Set CDataFile, df, to the combined data of df2, df3, and df4.
df = df2 + df3 + df4;
CDataFile::operator+=
The += operator is provided to append a CDataFile to another CDataFile.
CDataFile& operator+=(const CDataFile&);
Example
// Append the contents of CDataFile, df2, to CDataFile, df.
df += df2;
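The relationship between + and += can be sketched with a small stand-in type. This assumes that combining means placing the variables of the right-hand object after those of the left-hand object, as the FromVector() examples suggest. Columns is a hypothetical struct, and operator+ is written here as a free function even though the class declares it as a member.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a set of named columns. Not the CDataFile class.
struct Columns
{
    std::vector<std::string> names;          // one name per variable
    std::vector<std::vector<double>> data;   // data[i] holds variable i

    // Appends every variable of rhs after the variables already held.
    Columns& operator+=(const Columns& rhs)
    {
        if (&rhs == this)
        {
            const Columns copy(rhs);  // avoid inserting a range into itself
            return *this += copy;
        }
        names.insert(names.end(), rhs.names.begin(), rhs.names.end());
        data.insert(data.end(), rhs.data.begin(), rhs.data.end());
        return *this;
    }
};

// operator+ written in terms of operator+=. lhs is taken by value, so the
// left operand is copied once and neither argument is modified.
inline Columns operator+(Columns lhs, const Columns& rhs)
{
    lhs += rhs;
    return lhs;
}
```

Writing + in terms of += keeps the two operators consistent, and it is why a chain such as df2 + df3 + df4 produces temporaries while df += df2 modifies df in place.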
CDataFile::operator=
The = operator sets the contents of one CDataFile equal to the contents of another CDataFile or of a selection.
CDataFile& operator =(const CDataFile&);
CDataFile& operator =(const DF::DF_SELECTION&);
Example
// Set CDataFile, df2, equal to CDataFile, df1.
df2 = df1;

// Set df3 equal to the selection df(0,0,4,120).
df3 = df(0,0,4,120);
CDataFile::operator<<
The << operator puts the contents of a CDataFile to a stream.
std::ostream& operator << (std::ostream&, const CDataFile&);
std::ostream& operator << (std::ostream&, const DF::DF_SELECTION&);
Example
// Put CDataFile, df, to outStream.
outStream << df;

// Put the selection (0,0,5,15) to outStream.
outStream << df(0,0,5,15);
CDataFile::operator>>
The >> operator gets the contents of a CDataFile from a stream.
std::istream& operator >> (std::istream&, CDataFile&);
Example
// Get CDataFile, df, from inStream.
inStream >> df;