Sometimes there is a need to store simple data objects in a file from which they can later be retrieved using a key or index.
In many applications with such a need, an embedded database may be used. In some cases, though, an embedded database is more machinery than the problem requires,
and it increases the complexity of deployment: the database becomes another component which has to be managed.
While working with the source code of an older point of sale application, I came across the implementation of a lightweight
file storage mechanism for storing multiple instances of a given data structure using a simple key to identify the
specific data instance to be accessed within the file. In the case of this point of sale application, the data
stored in this fashion was a list of cashiers (each cashier's record providing data for controlling cashier operations), a list of employees (each
employee's record providing data for employee time keeping), and other data. These data objects had the following characteristics in common:
- the actual data stored was a simple C language binary data struct,
- each object of a particular type had a unique key such as an employee identification number,
- there were few changes to the elements of the data objects during operations once the objects were created and stored,
- most of the time the files were created and a set of objects inserted into the file and the number of objects did not change often, and
- there were usually only a few (less than 100) individual data objects of a particular type.
At the time the original application was written, the developers did not consider a lightweight, embeddable database engine.
The original application environment was a small microprocessor terminal with a small amount of memory using a simple, proprietary
Operating System. The same pattern for file storage of objects was used in several places in the source code though the
source text was cloned and modified rather than the developers creating a single, generalized function library to be
used in multiple places. For one type of data object, whose maximum count was increased by an order of
magnitude, a B-tree style database function library was introduced; however, the other files were not changed.
The basic pattern used for data storage in the point of sale application was to have an individual file for each data type. Each file contained three data areas:
- file description header containing file content management data,
- an index section that had a finite size and which contained the index data used to lookup a particular data record, and
- a data record section that contained the actual data being stored.
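To make the layout concrete, here is a minimal sketch of how such a file might be organized. The names (FileDescHeader, record_offset) and the exact fields are my own assumptions for illustration, not the application's actual declarations.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical file description header; the real application's
   field names are not shown in the article, so these are invented. */
typedef struct {
    uint32_t maxRecords;   /* fixed when the empty file is created  */
    uint32_t usedRecords;  /* number of records currently stored    */
    uint32_t indexSize;    /* size of one user index struct         */
    uint32_t recordSize;   /* size of one user data record          */
} FileDescHeader;

/* Byte offset where a record's data begins: the description header,
   then an optional user header, then the full fixed-size index area,
   then fixed-size record slots.  Block numbers are 1-based, matching
   the "one to the maximum number of records" convention. */
static long record_offset(const FileDescHeader *h,
                          uint32_t userHeaderSize,
                          uint32_t blockNumber)
{
    /* each stored index entry = record block number + user index data */
    long indexEntry = (long)sizeof(uint32_t) + h->indexSize;
    long dataStart  = (long)sizeof(FileDescHeader) + userHeaderSize
                    + (long)h->maxRecords * indexEntry;
    return dataStart + (long)(blockNumber - 1) * h->recordSize;
}
```

Because every section has a fixed size, any record's position can be computed arithmetically without scanning the file.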
The number of indices available was fixed at the time an empty file was created and the sizing information was stored in the
file description header. The fixed size would not ordinarily be a problem with the application as people setting up the
point of sale terminal for an end customer would typically size the maximum number of data objects (cashiers, employees, etc.)
much larger than the actual expected number of objects needed. For instance, the installation person configuring a setup for a small restaurant
with 10 employees would size the employee file for 100 employees.
The index information is stored in the index area of the file as a sorted list. When a particular index is retrieved,
a binary search of the sorted list of indices is used to locate the proper index, should it exist, and the index is then
used to locate the data record itself. When new records are stored, the insertion point for the new index is
located and the index is inserted into the list at that point. When existing records are deleted,
the existing index is located, the index removed from the list, and the list consolidated. This makes the insertion and deletion
of records expensive, because the index list must be maintained in sorted order without holes or empty index slots. However, if
most operations are reads or record updates, this structure is workable and simple for smaller data sets.
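The lookup can be sketched as a binary search over an in-memory copy of the index area, driven by a user-supplied collation callback. The names locate_index and int_collate are illustrative assumptions, not the application's code.

```c
#include <stddef.h>

typedef int (*CollateFn)(void *pIndex1, void *pIndex2);

/* Binary search over a sorted array of fixed-size index entries (an
   in-memory stand-in for the file's index area).  Returns the slot of
   a match with *found set to 1, or the insertion point with *found 0. */
static size_t locate_index(void *base, size_t count, size_t entrySize,
                           void *key, CollateFn collate, int *found)
{
    size_t lo = 0, hi = count;
    *found = 0;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        int cmp = collate(key, (char *)base + mid * entrySize);
        if (cmp == 0) { *found = 1; return mid; }
        if (cmp < 0)
            hi = mid;       /* key collates below entry mid */
        else
            lo = mid + 1;   /* key collates above entry mid */
    }
    return lo;              /* insertion point for a new index */
}

/* sample collation for a plain integer key */
static int int_collate(void *a, void *b)
{
    int x = *(int *)a, y = *(int *)b;
    return (x > y) - (x < y);
}
```

An insertion would shift the entries at and above the returned slot up by one position and write the new index there, which is exactly why inserts and deletes are the expensive operations.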
The implementation of these file access primitives within the point of sale application had a number of issues. First of all,
instead of a single generalized implementation, each of the data types had its own implementation of the same generalized concept, resulting in
duplicated source text that increased both the size of the source and the size of the resulting binaries.
Secondly, using the file access primitives involved multiple steps during which the programmer needed to know a fair number of details about the file structure.
The source code provides a generalized function library with an interface to create a file and perform insertions, deletions, updates,
and retrievals of the data in the file. The steps to use this library are:
- determine the data object record and the data object index that provides a unique key for the data object record,
- determine the maximum number of objects that will be stored, and
- determine a comparison function that will compare two indices and indicate the collation order of the indices.
The comparison function will determine the sorted order of the indices.
The file management data is separated from the user's data in this implementation, and the file management data is encapsulated in the
function library so that the user of the library does not know about the actual physical file implementation. The user is only concerned
with the data that is stored and the semantics of the data access. The data access functions come in two layers: a
layer in which the file path name is specified and a lower level in which the file handle is specified. The file path name
layer uses the lower level file handle layer by opening the specified file and then calling the appropriate file handle function to do the
actual operation. For a large number of operations, opening the file once and then performing the
operations through the file handle layer is significantly cheaper than using the file path name layer, which incurs the file-opening overhead on every call.
In order to separate out the file content management duties from the file content itself, file creation involves a step
of creating the index section of the file before the file is first used. Each index entry contains two pieces of data, a record
block number (a number from one to the maximum number of records the file will contain) and the user's index data.
The initializing process involves writing the indices each with its record block number and a zeroed out section to be used for the
user's index data. As new records are added or existing records are deleted, the indices with their assigned record block numbers
are moved and the user data portion of each index is modified; however, the record block number assigned to a particular
index never changes. The indices are managed as a pool of handles to the actual record storage space.
Because the file and record management data is separated from the user's data, the user of the library does not know the physical
location of the index or the record data to which the index references. The way that the index and the corresponding data record are
linked means that the actual data record may or may not contain the index data. It also means that the user could use the library to
create a file that is composed only of index data without a corresponding record.
Using the code
The function library provides basic file record manipulation:
- create and initialize a file,
- insert indices and records into the file with the indices ordered by a collation function,
- delete an index and record from the file when a collation function indicates an index from the file matches the search index,
- retrieve a record from the file by specifying an index with a collation function, and
- update an existing record whose index matches a specified index.
Using the function library, the user will provide at least one data structure, the index, and possibly two optional others: the header information
structure and the record data structure used for data that is stored using the index as a key.
The file access operations (insertion, deletion, retrieval, and update) require two
pieces of information that are used to determine the location of the index in the list of indices: the index data struct
provided and a collation function that indicates whether the index data struct provided is above, below, or equal to
an index data struct retrieved from the file. Using this method, the data type for the key can be whatever the user
of the library desires so long as there is a collation sequence for the keys and there is a function that can compare
two indices and determine whether one is higher or lower in the collation sequence than the other.
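As an example of a non-string key, here is a sketch of a collation function for a hypothetical composite key made of a department and an employee number. EmpIndex and EmpCollate are invented names for illustration.

```c
/* Hypothetical two-part key, showing that any key type works as long
   as a total ordering over the keys can be computed. */
typedef struct {
    int department;
    int employeeNumber;
} EmpIndex;

/* Compare by department first, then by employee number, returning
   -1, 0, or +1 in the same style as strcmp()/memcmp(). */
static int EmpCollate(void *pIndex1, void *pIndex2)
{
    EmpIndex *a = (EmpIndex *)pIndex1;
    EmpIndex *b = (EmpIndex *)pIndex2;
    if (a->department != b->department)
        return (a->department > b->department) ? 1 : -1;
    if (a->employeeNumber != b->employeeNumber)
        return (a->employeeNumber > b->employeeNumber) ? 1 : -1;
    return 0;
}
```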
Some caveats and quid pro quos
The function library has a couple of data structures and procedures that constrain the size of the index struct.
The index struct should be small, meaning less than 256 bytes, because the function library uses temporary buffer areas
to hold index data. In order to improve throughput, the index area
of the file is shifted in multiples of the stored index entry size. The stored index size is the size of the user's index
data struct along with an additional area that contains the record block number assigned to the index.
The actual record data associated with the index and which is stored in the record data section of the file is never
moved the way the indices are moved. When a record is deleted, the index associated with the
record is copied to a temporary location, the index area is shifted, or rippled down, overwriting the index of the
record being deleted, and the index in the temporary location is then placed in the index area just past the sorted
list of active indices. Indices are therefore never destroyed: when needed, they are pulled from
the pool of available indices stored above the sorted list of in-use indices, and when no longer needed they
are returned to that pool. As a consequence, the records stored in the record area are not kept
in sorted order the way the indices are; after a number of insertions and deletions, the indices will remain in sorted order but the record data will not.
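The ripple-and-park deletion can be sketched on an in-memory copy of the index area. The function name and layout here are assumptions for illustration; the 256-byte temporary matches the index-size limit noted below.

```c
#include <string.h>
#include <stddef.h>

/* Delete entry number `victim` from an index area holding `inUse`
   sorted entries of `entrySize` bytes each.  The freed entry keeps
   its record block number and is parked just past the in-use entries,
   so block numbers are recycled, never destroyed. */
static void delete_index(char *indexArea, size_t entrySize,
                         size_t inUse, size_t victim)  /* victim < inUse */
{
    char temp[256];   /* index entries are constrained to < 256 bytes */
    memcpy(temp, indexArea + victim * entrySize, entrySize);
    /* ripple the entries above the victim down by one slot */
    memmove(indexArea + victim * entrySize,
            indexArea + (victim + 1) * entrySize,
            (inUse - victim - 1) * entrySize);
    /* park the freed entry at the top of the in-use list (free pool) */
    memcpy(indexArea + (inUse - 1) * entrySize, temp, entrySize);
}
```

Note that only index entries move; the record slots they point to stay where they are, which is why the record area drifts out of sorted order over time.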
The current implementation of this function library does not overwrite the record data when its associated index
is put into the free pool when the record is deleted. This could be a security concern for sensitive data.
The underlying file access primitives used in this function library are those from the Standard C Library as
provided with the Microsoft Visual Studio 2005 Integrated Development Environment (fflush()). Changing the file access
primitives would require modifications to the function library internals. The external interface functions that
use a file name expect the file name to be ANSI single byte zero terminated character strings. In applications
requiring the use of UNICODE file names with a different set of file primitives, the function interface would change.
Creating and initializing the file
Creating and initializing the file requires that the user specify the maximum number of
records and provide the sizes for the index and for the record. A file content header that allows the user to
store file content management information such as a version number or other data can also be specified. In some
cases, the user of the library may want to query the file to determine the maximum number of records that
can be put into the file as well as how many records are currently in the file. A secondary file information
structure is provided in the header file to allow this inquiry to be made.
The intent for a person using the library is that they will have a C or C++ struct or class that will be used as
a description of the actual objects stored, the record data, and a C or C++ struct or class that will be used
as a description of the index object used to identify a specific record. So when the function is called to
create and initialize the file, the easiest way of doing so is to use the sizeof() operator in the function call:
FileLibCreate ("testname1", 5000, sizeof(IndexThing), sizeof(RecordThing), sizeof(HeaderThing));
Accessing and modifying the file contents
The function library provides a set of functions to insert a new record, delete an old record, or to update an existing record.
MakeIndexAndRecord (iLoop, MyIndex, MyRecord);
FileLibInsert ("testname1", &MyIndex, &MyRecord, CollateFunc);
All of these primitives use a collation function to locate either an existing record or the location within the sorted list where a new record should go. The prototype for the collation function is
int CollateFunc (void *pIndex1, void *pIndex2), and this function will return one of three values. If the index
pIndex1 is lower in the collating sequence for the indices than the index
pIndex2, then the collating function should return a value of negative one (-1). If the index
pIndex1 is equal to the index
pIndex2, then the collating function should return a value of zero (0). If the index
pIndex1 is higher in the collating sequence than the index
pIndex2, then the collating function should return a value of positive one (+1). The values returned by the collating function
should be similar to the values returned by comparison functions in the Standard C Library such as the string comparison function
strcmp(s1, s2) or the memory comparison function
memcmp (b1, b2, n). In fact, if the key is a text string, the collation function
may be nothing more than a use of the string comparison function
strcmp (s1, s2), with the collation function returning strcmp()'s return code:
static int CollateFunc (void *pIndex1, void *pIndex2)
{
    IndexThing *pLookFor = (IndexThing *)pIndex1;
    IndexThing *pFileItem = (IndexThing *)pIndex2;

    return (strcmp (pLookFor->aszKey, pFileItem->aszKey));
}
Points of interest
When testing, I tried using a file with five thousand indices and found that the open and close overhead
was substantial when inserting five thousand records into a file. By separating the file open logic from the actual
file manipulation functions, creating a layer that used a file handle to specify the file rather than the text
name of the file, the time taken to generate a file with five thousand records on a laptop was significantly reduced.
The search procedure used to find a specified index in the sorted list of indices uses a binary search to probe the list of indices stored on disk to determine a small window of indices.
At the point where the binary search probing narrows the list to a subsection, the subsection is read into memory and searched sequentially. The reason for
this is that searching a sequential list in memory is quicker than searching a disk file that is stored on an electro-mechanical device with quite a few delays involved
in locating and retrieving a particular area on the disk. An interesting experiment would be to change the procedure used so as to increase the size
of the subsection and to use a binary search for the in-memory search as well.
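The two-phase lookup can be sketched as follows; the WINDOW size and the name hybrid_find are assumptions, and a plain sorted int array stands in for the on-disk index entries.

```c
#include <stddef.h>

/* Two-phase lookup sketch: binary-probe the sorted keys (standing in
   for the on-disk index area) until the candidate window is small,
   then scan that window sequentially, as the file version does after
   reading the window into memory.  WINDOW is a tuning assumption. */
#define WINDOW 8

static long hybrid_find(const int *keys, size_t count, int wanted)
{
    size_t lo = 0, hi = count;
    while (hi - lo > WINDOW) {               /* phase 1: binary probes */
        size_t mid = lo + (hi - lo) / 2;
        if (keys[mid] < wanted)
            lo = mid + 1;
        else
            hi = mid + 1;                    /* keep mid inside the window */
    }
    for (size_t i = lo; i < hi; i++)         /* phase 2: sequential scan */
        if (keys[i] == wanted)
            return (long)i;
    return -1;                               /* not found */
}
```

In the file-based version, each phase-1 probe is a seek-and-read, so stopping the probes early and reading the final window in one operation trades a few extra in-memory comparisons for fewer disk accesses.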
While working on the test harness and demonstration source code showing how the library could be used, I realized that the various files were analogous to tables in a database.
This insight led to adding a function to the library to iterate over a file, allowing a filter or selection function to select indices that match
criteria other than the collation function's criteria. By adding this iterator functionality, a more database-oriented approach using foreign keys becomes possible when
using this function library.
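The iterator idea might look like the following sketch; FilterFunc, iterate_indices, and even_filter are invented names, not the library's actual API.

```c
#include <stddef.h>

/* Selection callback: return nonzero if the index matches. */
typedef int (*FilterFunc)(void *pIndex, void *pContext);

/* Walk every in-use index entry and hand each one to the
   user-supplied selection function; return the number of matches.
   An in-memory array stands in for the file's index area. */
static size_t iterate_indices(void *indexArea, size_t count, size_t entrySize,
                              FilterFunc filter, void *pContext)
{
    size_t matches = 0;
    for (size_t i = 0; i < count; i++)
        if (filter((char *)indexArea + i * entrySize, pContext))
            matches++;
    return matches;
}

/* example selection: match indices whose integer key is even */
static int even_filter(void *pIndex, void *pContext)
{
    (void)pContext;
    return (*(int *)pIndex % 2) == 0;
}
```

A selection like this is the file-based analogue of a WHERE clause on a non-key column, since it does not rely on the sorted order maintained by the collation function.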
Another potentially useful addition would be to store, in the file management data kept in the struct
FileLibHeader, information about
file usage such as the date and time of the last insert, the last delete, and the last update. It may also be useful to keep counts of the number
of inserts, deletes, updates, retrieves, etc., in the header.