Download arraysplice.zip - 1.8 MB

Introduction

The C Standard Library seems quite sparse when compared to the richness of the C++ Standard Template Library. One such functional area is a dynamic vector which provides primitives for the building and manipulating of a list. Since there is no list primitives available, C programmers have a tendency to rely on arrays and loops resulting in source code that in other programming languages such as python or C++ would use higher level, more abstract and more problem centered source code.

Background

The C++11 Standard Template Library and the C++11 standard changes have injected a functional language programming flavor to C++. After spending time with Bjarne Stroustrup's latest version of The C++ Programming Language along with reading Allen Holub's book Holub on Patterns, I realized that I was programming with way too much focus on the solution domain, how to instantiate a solution in source code, and insufficient focus on the problem domain. And my primary work as a maintenance programmer for older C source code was encouraging a solution domain focus implementing functionality with what Holub calls "raw loops" and writing functions that ask for information rather than asking for help by delegating actions to modules or objects or functions.

To work on reshaping my focus, I decided to work with a list as an abstract data type to experiment with the implementation of a generic list functionality written in C and usable by a C programmer. The goal of this library was to provide some of the list functionality available in the C++11 Standard Template Library as well as scripting languages such as php or python.

The question in my mind was how to go about providing a list functionality that could be used with a variety of different data types. The C programming language has few facilities available for object oriented programming and the C Preprocessor, while quite remarkable for its day, does not have the features and functionality of C++ templates for generic programming. The limitations of C meant that providing list functionality would place a fair amount of reliance on the C programmer and that there would be use of void pointers and other work arounds to replace what would be available from the C++ compiler with templates.

The goals of this exercise was to write a set of useful list functions while exploring the possibilities of what using a list data structure provided. The basic operations were to be able to add or remove items from a list and to make changes to the list. I looked to the splice() function provided by php as a model for the Splice() method implemented in this package as well as to the C++11 Standard Template Library for a model for applying algorithms to the elements of a list.

In order to make the If types of methods - FindIf(), FindIfNext(), ForEachRemoveIf(), etc. - more useful and flexible, I borrowed the method of passing a function pointer from the qsort() and bsearch() functions of the C Standard Library and added an additional pointer operand that can be used to provide state data to the function used to process the list. The additional pointer operand opens the door to a world of possibilities by providing state data during list processing as well as the possibility of providing function pointers within a struct to do further operations on list items.

Using the code

Basic Overview

The actual list functionality library is small. In order to reduce the chances of a name space collision, there is a single externally visible struct which contains function pointers to the actual list manipulation functions. The only type available is a type, ItemArray, with a single externally visible instance of this type, ItemArrayFactory. This externally visible instance is initialized at application startup with default values for an ItemArray instance that is created with the ItemArrayFactory.Create() method. The ItemArrayFactory.Create() method returns a pointer to an initialized area of memory that will be used to maintain the list state data as well as to provide the methods available for manipulating a list. This use of a struct with function pointers is similar to how a struct can be used in C++ where a struct is basically a class in which all members and methods have public visibility.

The C programming language does not have provisions for a constructor nor a destructor which means that the opportunity for a programmer to use an improperly initialized variable is high as is the opportunity to create memory leaks. Since all members of a struct have public visibility it is difficult to enforce encapsulation and the hiding of data members. The lack of templates means that to provide a generic programming approach that works with multiple data types requires work arounds which reduces the ability of the compiler to use static analysis to detect possible source code defects and generate appropriate warnings.

With these and other deficiencies in the C programming language there is much greater reliance on programmer skill and discipline.

Here is a code sample from the test harness in arraysplice.cpp. This sample creates an empty list using the ItemArrayFactory.Create() method and then does a few preliminary consistency checks. These lines of source show the style that is used by this library:

//  Create a new, empty list to contain copies of ArrayItem structs.  The initial max size is 6
ItemArray *jj = ItemArrayFactory.Create (sizeof(ArrayItem), 6);
if ( ! ItemArrayFactory.IsValidArray (&jj)) {
    std::cout << "test 2a, failed, IsValidArray() detected valid array as invalid." << std::endl;
}
if ( ! ItemArrayFactory.IsEmpty (&jj)) {
    std::cout << "test 2b, failed, IsEmpty() detected empty array as not empty." << std::endl;
}

This sample of source shows a couple of style features. First of all the ItemArrayFactory.Create() method is used to create an empty list. The size of the items to be held in the list is specified. The easiest way to do this is to use the sizeof() operator and let the compiler figure it out. The second argument is an initial maximum count. As items are added to the list and the maximum count is exceeded, the list will be reallocated using the C Standard Library function realloc() to create a copy of the existing list which has a larger maximum count value creating room for additional items.

Notice that the IsValidArray() method of the ItemArrayFactory external is used to validate that the list object was actually created. The newly created ItemArray object jj has the same method available since jj is a pointer to an initialized ItemArray object which contains the same function pointers as does ItemArrayFactory. When a new ItemArray object is created, it contains a copy of the ItemArrayFactory object with some of the data members initialized for a list. However since the Create() method may fail, we use the ItemArrayFactory.IsValidArray() method since that does exist whether jj was created properly or not.

Later on in the test harness you can see source lines where an item is put onto the end of the list using a PushBack() method as well as removed from the list using a RemoveBack() method similar to the following sample.

//  Add an item to the list
jj->PushBack (&jj, &oneItem);

//  ...

// Remove the item on the end of the list getting a copy of the item
jj->RemoveBack (&jj, &twoItem);

Notice that with both of these we use the ItemArray pointer to invoke the method we want to use and that the first argument to the method is the ItemArray pointer itself. The first argument is used with almost every one of the functions in the list library in order to identify the actual list object. If this were a C++ struct we would not need to have this first argument which identifies the actual object. Instead of jj->PushBack(jj, &oneItem); we would instead write jj->PushBack(&oneItem); and the C++ compiler would ensure that the this pointer was properly set with the actual object address. However with C the programmer must provide the object since what we are really doing is calling a function through a function pointer variable which happens to be in the ItemArray struct object.

Interface and Data Versus Command Arguments

Some of the methods in the splicing library are straightforward with functional cohesion in that they perform a single task or function without side effects. Some of the methods are complex with interface arguments acting as both data (for instance when a starting position is greater than equal to 0) and as command (for instance when a starting position is negative indicting to use the results of the last comparison).

This type of flexibility is often a source of defects since it provides coupling between modules or functions which is not visible. Implementing and using side effects in general is not a good design decision. The comparison index side effect is a double edged sword. With the FindIf() and the FindIfNext() pair of methods using the comparison index side effect makes quite a bit of sense and is actually part of the visible interface.

However allowing the comparison index side effect to be available for other methods such as the Splice() method begins to move into dangerous territory.

Considerations for Memory Management

One issue with using the C programming language for this library which is expected to be used in a C program is that some of the library functions will some times need to request additional memory in order to have sufficient memory to hold a modified list. There are really only two possible alternatives, always return a pointer to the object when a method which may perform a memory realloc() returns or to require that those methods that may need to do a realloc() are provided the address of a variable that will be updated with the new address should a realloc() be done.

A third alternative could have been to change the memory management. The way that the list objects are created is to allocate a memory area that contains both the list object header with the list management data and the actual list storage area. The list object header contains a pointer to the area after the list object header which is then used to contain the actual list data. The third alternative would be to approach the list object representation a bit differently in that rather than being a single contiguous memory area that contains both the list management information and the actual list data, the list object is composed of two pieces of memory, a list object header in one contiguous memory area and the list data in a second contiguous memory area.

Originally, I tried the first course of action. While testing and extending the library, I discovered that knowledge of when to do an assignment where a realloc() may happen was adding additional complexity to the source and required more knowledge of the method internals than seemed reasonable. Whether a realloc() was done or not seemed to be an implementation detail that should remain private. So I modified the interface to all functions so that they would take the address of the pointer to the object rather than the pointer to the object itself.

What the Library Offers and Using the Standard C Library

The point of the library is to offer a set of primitives for list processing. The basic functionality covers creating and deleting a list object along with a set of methods that allows the user to splice lists into a list or slice a part of a list into a list. Since the C Standard Library offers a sort, the qsort() function, and a binary search, the bsearch() function, two methods were added to provide a way to sort a list and to search a list for an item by providing methods which wrap those two functions from the C Standard Library.

The test harness provides an idea of the functionality available and how to use it. Some of the same ways that the C++ Standard Template Library vector class is used are ways that this library could be used. Since list data area is pointed to be the ItemArray struct member m_ItemList and this member is public and visible, a list could be accessed as a C array using the ItemArray struct member m_ItemCount to know how many array elements are being used.

The package provides a ways to build a list, one item at a time or by combining lists into a different list, and to perform various operations to search for items in the list and to modify the list or the items in the list.

The C programming language, often called a high level assembly language, provides a mechanism, the function pointer, which provides a great deal of flexibility as well as lots of rope to hang yourself. This package uses the function pointer for two specific cases, applying an algorithm or operation to items in the list and providing a comparison function for those methods that rely on a comparison to find an item in the list.

Since these functions are provided by the programmer, they are both called for each item while iterating the list opening the door to using them in similar ways. While the comparison function is intended to only do a comparison, there is nothing to stop a programmer from using the comparison function to also modify items as they are processed for comparison. I am not sure if this is a defect or if it is a feature.

Points of Interest

Performance of Various Methods

After completing a first implementation, I was curious as to what kind of performance would be seen with large lists. Since the methods use a contiguous area of memory, insertions at any point other than the end of the list requires moving data from the area where the insertion will take place before inserting the new data.

I created a simple list and then used the simple list in a loop with the Splice() method to get some idea as to performance implications. The testing involved a loop whose source code was similar to the following with two variations of the Splice() method. I first used a debug build with no optimization then recompiled a release build to see if there would be a difference.

jj = ItemArrayFactory.Create (sizeof(ArrayItem), 100000);
j2 = ItemArrayFactory.Create (sizeof(ArrayItem), 6);

for (int ii = 0; ii < 33; ii++) {
    ArrayItem initval = {20, ii};
    j2->PushBack (&j2, &initval);
}

for (int ii = 0; ii < 100000; ii++) {
    jj->Splice (&jj, -1, 0, &j2, 0, -1);
}

Two variations tried were with the Splice() first putting the list j2 at the beginning of the list jj repeatedly (starting position was 0) and secondly putting the list j2 at the end of list jj repeatedly (starting position specified was -1 indicating to use the end of the list).

The tests were run on a Dell laptop with an Intel Core 2 Duo 2.4GHz CPU with 3.5 GB of RAM. Two test sequences were run, a debug build and a release build, each compiled with Visual Studio 2005.

In the first test run with a debug build, the first variation while watching the process in the taskmanager utility of Windows XP showed CPU time of 0:12:32 with a VM size of 26,292 while the second variation showed a CPU time of 0:01:42 with a VM size of 26,256. The number of Page Faults was similar at some 20.8 million. Internally the Splice() method has to perform a memmove() to open up an insertion point when the Splice() is not at the end of the list and this operation seems to be quite costly.

In the second test with a release build of only doing the list splice at the beginning of the destination list, the data shown in the taskmanager utility was similar to the debug build doing insertions at the beginning of the list. The similarity of performance data between debug and release builds indicates that most of the work is probably page faults and moving data around.

Some Interesting Syntax

While working with this library, I tried some unusual syntax in places as well as an experiment or two.

The first experiment was to have some functions return a struct rather than a single value. This allowed some functions to return both a pointer to an item on the list as well as a pointer to the list object itself. A similar struct could be used to provide both an error code along with the function return value. In many cases a function value is used to return either valid data or an invalid data item to indicate an error occurred. An example would be the malloc() function which returns a valid pointer or a NULL pointer to indicate that the malloc attempt failed.

In the test harness, I used the following line of source which combines finding an element in a list which is then used as an insertion point for a Splice(). I am not recommending this construct as good practice however it is interesting that it does compile and it works.

jj = jj->SpliceOnFind (&(jj->FindIf (&jj, 0, -1, ArrayItemRemoveGreaterThan, &cmpGreaterThan).pList), 0, &j2, 0, -1);

The problem is that the struct returned by the FindIf() method is a temporary which disappears so that if the SpliceOnFind() does need to do a realloc() of memory, the address of the new list object will also be discarded. There are three easy solutions. The first and most straightforward is to separate this into two lines of source which use the same list object pointer. The second would be to use the comma operator after the FindIf() method call to introduce a sequence point followed by address of the list object pointer as in (jj->FindIf(&jj, ...), &jj) or &(jj->FindIf(&jj, ...), jj). The third would be use the assignment operator to get the list object pointer that is returned by the SpliceOnFind() method which is the method used in this example.

Complexity of the Method Interfaces

One possible problem is the complexity of the method interfaces as well as the fact that some of the methods have side effects. For some method arguments, special values are used to turn data arguments into command arguments. For instance the ending position that is specified in several places will accept a value of negative one (-1) to indicate that the rest of the list is to be used. This means that the function call does not need to determine the ending position of the list but instead can just indicate to use the end of the list what ever that value may be.

Side effects are created by the FindIf() and FindIfNext() methods in which the position for a matching item in the list is stored in the list object management data. This facilitates the use of FindIf() followed by repeated calls to FindIfNext() to find all of the items matching a criteria. Since the pointer to the comparison function as well as a pointer to the comparison state object are also stored in the list object, this makes the use of FindIf() and FindIfNext() easier to use as a pair such as in the following example from the test harness.

ArrayItem cmp = {1, 0}, found = {0, 0};
iLoop = 0;
pFound = (ArrayItem *)(jj->FindIf (&jj, 0, -1, ArrayItemCmp, &cmp)).pItem;
while (pFound) {
    found.i += pFound->i;     // pFound->i should always be equal to 1 since that is what we are looking for
    found.j += pFound->j;
    iLoop++;                 // value of iLoop is the count of number of matching items
    if (pFound->i != 1) {
        std::cout << "test 13f, failed, PushBack() FindIf() FindIfNext() pFound check failed." << std::endl;
    }
    pFound = (ArrayItem *)(jj->FindIfNext (&jj, -1, -1, 0, 0)).pItem;
}

One design question for a package like this is how much should side effects be relied on to provide the necessary behavior. Some methods seem to be obvious pairs to be coupled with a side effect such as FindIf() and the FindIfNext() methods. However should the ForEachRemoveIf() function use the same side effect as a place to begin its search? Perhaps there should instead be a way to specify an action

The problem with side effects is that they introduce coupling between methods which can trip up someone who is using the package. So there really needs to be some syntax in the method interface so that the user of the methods knows when they are relying on side effects and when they are not. There also needs to be some provision to detect if the side effect state is incompatible with the method being invoked and to then do something reasonable. The result is to increase the complexity of the methods while also increasing the cognitive load on the programming using the methods.

History

Keep a running update of any changes or improvements you've made here.