Generic Gap Buffer

Jacob Slusser

4.89/5 (34 votes)

Oct 17, 2007

20 min read

77590

632

A list-style collection for fast insert and remove operations.

Download source code (v1.1) - 56.7 KB

Abstract

There is no shortage of collection classes in the .NET Framework. I would suggest that the reason for this is because no one type of collection is good at everything. Each specializes in a certain task and addresses a certain requirement (or limitation) of computing. In the following article I'll present an implementation of a list style collection called a "gap buffer" that has excellent performance when inserting or removing items at or near the same relative index.

Introduction
A Primer on the List<T> Collection
- Limits of the List<T>
How a Gap Buffer Works
- Benefits of the Gap Buffer
The GapBuffer<T> Implementation
- Performance Test
- Source Code
Conclusion
History

Introduction

One of the rules of programming that I've tried to live by over the years is that whenever possible—use the simplest tool for the job. Start with what's available and has been proven to work before writing your own. With the wealth of classes and objects supplied in the .NET Framework, I'm hard pressed to ever encounter a need that hasn't already been addressed. However, occasionally that happens and it happened to me recently when I was using a List<T> to manage a large collection of objects.

The List<T> is a fine piece of engineering and in nine out of ten cases it's exactly what I need, but on a recent project I hit a wall. Have you ever tried to perform numerous insert or remove operations on a large List<T>? Depending on the number of elements being managed, the performance can range from "acceptable" to "my grandmother is more likely to finish a game of Risk™ first". This led me to investigate different types of collections and memory management techniques that would yield better performance when exchanging elements in the middle of the collection.

The "gap buffer", as I'll explain shortly, was developed to address this exact need. I didn't invent it and I'm not sure who did. I do know that it has been around for a long time as most well engineered things are. It became clear that this was the collection type that I was looking for, but look as I may, I couldn't find a .NET implementation of one. From that investigation was born the GapBuffer<T> class; a .NET collection that implements the principles of a gap buffer and is a drop in replacement for a List<T> when repetitive insert and remove performance is critical.

To understand how the generic GapBuffer<T> works, it's helpful to first have a clear understanding (or review) of its nearest cousin: the aforementioned List<T>.

Primer on the List<T> Collection

The List<T> collection is what you might call a "dynamic array". A student of computer science will note that the words "dynamic" and "array" don't usually go together. When an array is created, the size of the array must be specified up front. Once the appropriate amount of memory has been allocated, the size of the array cannot change. The reason for this is because as an operating system dishes out a memory request, it will very likely have memory allocated to a different purpose just before and just after the memory block given for the array. Block after block, like a row of books on a bookcase, the objects in memory are given their appropriate space. When that array reaches its limits and needs to expand, the chances of its being empty space on either side are practically zero and there's simply no adjacent space available. It's like if you were to remove the paperback version of 'The Legend of Sleepy Hollow' from a full bookshelf and try to replace it with the hardback version of 'War and Peace', it simply wouldn't fit. The only solution is to put 'War and Peace' somewhere else and that's exactly how the List<T> works. Ignoring for a moment some of the more specific mechanics of a managed collection we can draw a simple analogy. Consider the following sentence fragment:

In this example we'll pretend that the amount of total space is akin to the size of an array and the words "In a hole in the ground" occupy some of that space. We'll also pretend that, like an array, we decided how much space we needed up front and, as you can see, we have a little extra left over. Seeing that we have a little extra room we'll add a few more words:

Letter-by-letter the available space begins to fill until eventually the limit is reached. Like an array that's been filled to capacity there's no more room. In that case we need to create a new space. Once the operating system allocates us some more memory and the new space has been created, the existing contents can be copied over and the phrase completed:

It may look like we simply added more space but as I stated before this is not possible. We had to create an entirely new space and copy what we had in step two into step three. The process of filling the available space and then making new space is repeated as more and more words are added. Hence, our example is a bit like a "dynamic array" or List<T>. If you were to run Reflector and peer inside of the List<T> class, you would see that it's really only a fancy wrapper around a run-of-the-mill array. What makes it so useful is that all the work of jockeying around the memory required to maintain the contents as it grows (or shrinks) is automatically handled for you. Below is a basic diagram of the essential components of the List<T> class:

As you can see, the Count property of the List<T> is not the actual size of the internal array but rather how much of the internal array is being used. What the consumer of the collection sees is just a subset of the internal array. As more items are added to the List<T>, the Count property is updated to reflect the actual amount of the internal array being used. When the internal array is filled to capacity a new, larger array must be made; the contents of the current internal array are then copied into the new array; the new array then becomes the internal array for the List<T> and the old one is discarded. By all accounts this technique is quite effective. While there is some overhead incurred when the internal array reaches capacity and a new one must be created, there is little to no work necessary to add new items into the empty portion of the internal array. This tradeoff is acceptable because it allows us not having to state in advance how much space we're going to need. It will grow as necessary.

Limits of the List<T>

Like Kryptonite to Superman, every superhero or collection has its weakness and the List<T> is no different. While appending items to the end of the list requires very little work, inserting items somewhere in the middle of the list requires a great deal of work. The reason is because adding an item to the end of the list means simply placing it in the next unused index and increasing the Count. However, there is no unused space in the middle of a List<T>. Back-to-back, shoulder-to-shoulder the items in a List<T> are lined up. So how are we going to wedge another item in? The only way is to copy the entire contents of the internal array after the point in question over one space. There's no way around it. Every time we want to place a new item somewhere in the middle of the array, everyone after that point has to shift over to make room for the new guy. This effect gets worse as the number of the items in the List<T> increases and the insertion point gets closer to the beginning of the collection. It's a little like cutting in a line of New Yorkers waiting to see a Yankees game. You cut in front of one or two guys at the back of the line and you might not get hurt, but cutting in front of a couple hundred (or thousand) and the game isn't all that you'll be missing.

How a Gap Buffer Works

A gap buffer attempts to overcome the overhead of inserting items into the middle of the collection by placing the empty portion of the array in the middle instead of at the end; hence the name "gap buffer" referring to the "gap" in the middle of the "buffer". To be more specific, the gap itself is not fixed to any one position at all. At any time it could be in the middle of the buffer or just as easily at the beginning, the end, or anywhere between the two. Here's what the GapBuffer<T> looks like:

As you can see, a gap buffer uses the same pattern of wrapping an internal array, but in this case the unused portion of the array is not fixed at one end. The job of the collection is to combine the used portions of the internal array into one continuous view of the contents. In purpose and in practice it works very much like a List<T>. New items are inserted into the empty portion of the internal array and when the array is filled to capacity, a new one is made. However, as previously noted, when an item is inserted into the beginning of a List<T>, all the items after that point need to be moved down each time this operation is performed because a List<T> keeps all the empty portion of the internal array at the end. In a GapBuffer<T> this shifting of contents to make room still needs to happen—but only once. After the insertion point is established and the gap is moved, the empty portion of the internal array will now start at that point. All subsequent insertions to the same position require no extra work at all because there is now ample space at that position to insert the new items. Using the same illustration we used with the List<T>, the GapBuffer<T> works a little like this:

The gap of unoccupied space is located in the middle (or in this case, closer to the beginning) of the block allocated for the phrase. While a sentence like this would obviously look a little strange with so much space in the middle, a gap buffer works by combining the contents so that it appears as though it's one continuous block of text and would read "In the ground lived a hobbit." Now it's a snap if we wanted to insert some text after the word "in". Instead of having to relocate a huge portion of the contents, the gap is already at the insertion point and more text can be filled in:

So long as the unused portion of space is where we want life is great. No extra work was necessary to make room for the newly added text. To then insert some text elsewhere in the phrase, the gap needs to be moved. Isn't this the same problem the List<T> had? Well yes, but to a much lesser extent. If we wanted to place some text right after the word "ground" the gap only needs to move a few spaces from where it's currently located. If we were to take our simple example and expand it to say we have the entire text of 'The Hobbit' book following this first sentence, the methodology of a List<T> would buckle trying to move over everything to place some new text here. But with a gap buffer, only a few items are moved:

Once the gap has been placed, letter by letter, the new text can be inserted without ever having to incur any more overhead:

As with a List<T>, the process of filling the empty space continues until at which point, the buffer becomes full and the contents must be copied into a new, larger buffer. The work of the collection is to make this transparent to the consumer. While there may be a big gaping hole in the middle of the array when accessing the collection it appears as though it's one continuous phrase or block.

Note: The simple examples above are not intended to advocate that either a List<T> or a GapBuffer<T> should be the structure of choice when storing and manipulating text and are given only by way of analogy. Traditionally a gap buffer is well suited to this purpose. However, a discussion about string handling in .NET and the use of a gap buffer for storing text is outside the scope of this article.

Benefits of a Gap Buffer

One can immediately see the potential benefits of this approach. If all we ever wanted was to add items to the end of the collection (or internal array) we would keep the empty portion at the end, as does the List<T>. If in another case we knew that we might want to insert items at the beginning of the collection we would prefer to have the empty portion of the array at the beginning. Also if we wanted to insert in the middle of the collection we would want the gap to be in the middle and so on. The gap buffer makes this possible by not fixing the unused portion of the internal array to any one place, but rather allowing it to adjust to the most optimal place.

At worse case we have to move the gap from the very end of the buffer to the very beginning in which case we would incur the same performance penalty that a List<T> does, but if the next operation were somewhere in the middle we would only have to shift half the contents of the internal buffer. Better still, what if we performed a subsequent operation only a few indexes over from the previous, or maybe even inserted an item at the same index as last time? In that case the gap wouldn't need to shift more than a few items or even not at all. We would simply be filling up the unused space of the gap and none of the performance penalty associated with shifting over the contents of the internal buffer would be incurred. This is why it's said that a gap buffer offers increased performance when repetitious operations occur at or near the same relative index. The closer we are to the last operation, the less overhead there is in moving the gap.

The GapBuffer<T> Implementation

To implement a GapBuffer<T> there are a couple bits of information that we need to keep track of. Whereas the List<T> only needs to maintain a variable that represents the actual count (or rather, where the used part of the internal array ends and the unused portion begins), the gap buffer needs two variables: one to specify the index of where the gap starts and another to specify where the gap ends. The total size of the gap can then be found by subtracting where the gap ends from where it begins. A simple example of this is the Count property which is calculated by subtracting the size of the gap from the total capacity of the internal buffer; what's left over is how much of the internal array is being used:

public int Count
{
    get
    {
        return this._buffer.Length - (this._gapEnd - this._gapStart);
    }
}

The real magic of maintaining the internal array in the GapBuffer<T> collection comes from the private PlaceGapStart method. Before each modification to the collection, the PlaceGapStart method is called to move the gap into the insertion (or deletion) position. This will make sure that the available space is where we want to make changes. To move the gap is simply a matter of moving the items of the internal array towards where the gap is located, thereby closing it up and opening a gap in the new location. Like when Doc. Brown is trying to stretch the wire from the clock tower to the light post at the street below; when he pulls the connection closed at one spot it opens up at another. Fortunately for us, this is the effect we want and it is accomplished programmatically like this:

private void PlaceGapStart(int index)
{
    // Are we already there?

    if (index == this._gapStart)
        return;

    // Is there even a gap?

    if ((this._gapEnd - this._gapStart) == 0)
    {
        this._gapStart = index;
        this._gapEnd = index;
        return;
    }

    // Which direction do we move the gap?

    if (index < this._gapStart)
    {
        // Move the gap near (by copying the items at the beginning

        // of the gap to the end)

        int count = this._gapStart - index;
        int deltaCount = (this._gapEnd - this._gapStart < count ?
            this._gapEnd - this._gapStart : count);
        Array.Copy(this._buffer, index, this._buffer,
            this._gapEnd - count, count);
        this._gapStart -= count;
        this._gapEnd -= count;

        // Clear the contents of the gap

        Array.Clear(this._buffer, index, deltaCount);
    }
    else
    {
        // Move the gap far (by copying the items at the end

        // of the gap to the beginning)

        int count = index - this._gapStart;
        int deltaIndex = (index > this._gapEnd ? index : this._gapEnd);
        Array.Copy(this._buffer, this._gapEnd,
            this._buffer, this._gapStart, count);
        this._gapStart += count;
        this._gapEnd += count;

        // Clear the contents of the gap

        Array.Clear(this._buffer, deltaIndex, this._gapEnd - deltaIndex);
    }
}

There's a little bit of math involved here, but don't be intimidated by it. Some basic logic determines which way the gap needs to move (if at all), and then some simple calculations determine how many items we need to copy. Once we copy the items that need to move from one side of the gap to the other we also must clear any portion of the gap that has lingering copies of items by calling Array.Clear. If we didn't there might still be copies of items within the gap, and even though they would not be visible to the consumer of the collection, that reference would be enough to keep them from being garbage collected. Once all the items that are affected by the shifting of the gap are copied to their new position, the _gapStart and _gapEnd variables are updated to represent the new gap position. With the hard work of writing the PlaceGapStart method in place, inserting an item into the collection becomes a breeze:

public void Insert(int index, T item)
{
    if (index < 0 || index > Count)
        throw new ArgumentOutOfRangeException
            ("index", Resource.ArgumentOutOfRange_Index);

    // Prepare the buffer

    PlaceGapStart(index);
    EnsureGapCapacity(1);

    this._buffer[index] = item;
    this._gapStart++;
    this._version++;
}

First, the gap is moved to the point of insertion using the PlaceGapStart method. Then we ensure that there is room enough for the new item (and if not, create a new, larger internal array). Once those checks and balances have been done, we can set the item at the specified index to the new value and increase the gap start. By increasing the gap start by one, we're also shrinking the total gap size by one and, in effect, saying that the gap now begins after the item just inserted. The benefits of the gap buffer are realized when we can continually insert new items at the start of the gap and there is no need to shift the contents of the internal buffer or increase its size. Removing an item is even easier:

public void RemoveAt(int index)
{
    if (index < 0 || index >= Count)
        throw new ArgumentOutOfRangeException("index",
            Resource.ArgumentOutOfRange_Index);

    // Place the gap at the index and increase the gap size by 1

    PlaceGapStart(index);
    this._buffer[this._gapEnd] = default(T);
    this._gapEnd++;
    this._version++;
}

By placing the gap just before the item to remove and then increasing the size of the gap by one, we basically expand the gap right over the item to be removed. To ensure that the removed item is a candidate for garbage collection we must also reset the item at that position to the generic type's default.

All other common collection functions are done in a similar manor. Most of the time these operations are done in two phases: once for the content before the gap and once again for the content after. Sometimes all that is needed is a little arithmetic as is the case when implementing the iterator. By making a simple adjustment to the index we can have it point to the contents before or after the gap. Any time the index is greater than or equal to _gapStart we know that we must account for the gap and add its size to the desired index. Once that adjustment has been made, it's business as usual and all this takes place unbeknownst to the collection consumer:

public T this[int index]
{
    get
    {
        if (index < 0 || index >= Count)
            throw new ArgumentOutOfRangeException
                ("index", Resource.ArgumentOutOfRange_Index);

        // Find the correct span and get the item

        if (index >= this._gapStart)
            index += (this._gapEnd - this._gapStart);

        return this._buffer[index];
    }
    set
    {
        if (index < 0 || index >= Count)
            throw new ArgumentOutOfRangeException
                ("index", Resource.ArgumentOutOfRange_Index);

        // Find the correct span and set the item

        if (index >= this._gapStart)
            index += (this._gapEnd - this._gapStart);

        this._buffer[index] = value;
        this._version++;
    }
}

Performance Test

To ensure that theory translates into practice I've written a simple performance test to determine the real gains of using a GapBuffer<T> versus a List<T>. I measured the time it took to insert or remove a number of items either at random or at a fixed position (index zero). This should represent the full spectrum of performance gain a GapBuffer<T> can offer. I then did the same for a List<T> and compared the results. In every test the number of items either removed or inserted was 200,000. Here are the results:

Inserting Items Performance Test

Test	List<int>	GapBuffer<int>	Performance Gain
Inserting at random index:	7,985ms	7,459ms	7%
Inserting at fixed index (zero):	16,144ms	35ms	460%

Removing Items Performance Test

Test	List<int>	GapBuffer<int>	Performance Gain
Removing at random index:	8,700ms	7,458ms	17%
Removing at fixed index (zero):	16,360ms	2ms	8179%

As you can see, the performance gain ranges from marginal to dramatic! It also demonstrates the conclusions we made earlier: the more items that need to be shifted around, the more work is necessary and consequently, the more time it will take. Inserting or removing at the first index in a collection represents the absolute extreme of this principle, causing maximum impact on the List<T>. On the other hand the GapBuffer<T> excels at this task. Each time the List<T> needs to insert an item at index zero, every other item in the collection must move and the performance will get worse and worse as the collection grows. With the GapBuffer<T> we know that the gap will arrive at index zero during the first operation and stay there for every subsequent operation, requiring no extra work to continually add items at that index. Even when the index is not consistent and is randomly located, we can see that the GapBuffer<T> offers slightly better performance because it's probable that when inserting items at random, the next index of operation may be nearby and therefore not require a large amount of items to be moved. Removing items works in almost the same way and the results are naturally similar.

Real world performance will be somewhere between the random and index specific tests. When used as a substitute for a List<T>, where there is little repetition in the index used, a very modest performance gain will still be realized. If on the other hand, an application plays to the strengths of the GapBuffer<T> an enormous performance gain can be realized. Every collection has its strength and the GapBuffer<T> makes quick work of repetitive tasks.

In all other common collection operations (such as appending and iterating) I found the List<T> to perform slightly better than the GapBuffer<T>. This is not surprising because, however small it is, there is a little more overhead associated with maintaining the gap in the middle of the array. There are additional method calls to check the placement of the gap, perform adjustments on its location, and to work around it when iterating through the collection. That being said, I didn't find that performance loss to be significant nor did that performance loss outweigh the benefits that the GapBuffer<T> offers. That leads me to conclude that unless one is absolutely sure all work on the collection will be appending and iterating, it's more likely that the GapBuffer<T> will be a better choice, even if the number of insert and remove operations is comparatively low.

Source Code

I've included the complete source for my implementation of a generic gap buffer along with the project I used to do performance testing. I've taken great care to unit test the code to ensure that it runs perfectly. As I mentioned above, it should be a drop-in replacement for most cases where a List<T> is currently being used. It should go without saying that the source code requires .NET 2.0 and greater to compile. There is no restriction on the use of the code and I only ask that appropriate recognition be given to the author.

Conclusion

At the beginning I stated that the reason there are so many collections available to .NET programmers is because there is no one collection that is perfect in everything. Each has been designed for a specific need and so has the GapBuffer<T>. With it comes some added internal complexity when compared to a List<T> but also offers dramatic performance increases when used for what it's been designed. Should the need ever arise where inserting into a random or repetitive place in a collection becomes frequent, I would highly recommend considering the GapBuffer<T>.

History

Version 1.1 - Modified the code and article to properly clear the gap of copied items and ensure they are subject to garbage collection.