
Squeezing more performance from SortedList

10 Oct 2008
A SortedList implementation using a cyclic algorithm and C# IDictionary tweaks.

[Figure: typical_results.png — typical benchmark results (Excel graphs)]

Introduction

This article describes an IDictionary<TKey,TValue> implementation similar to SortedList<TKey,TValue>. The implementation uses TKey[] and TValue[] arrays as the underlying data structure. Performance increase is achieved by using cyclic array indexing and by pre-fetching the 32 most significant bits of TKey.

Asymptotic Time Complexity of SplitArrayIntDictionary<TValue>.Insert()

P = probability of random data.
(1-P) = probability of monotonic (sorted) input data.
O(log(n)) = search time in the sorted array.
O(n) = time needed for Array.Copy.
O(1) = time needed to write the new key/value pair into the arrays.
O(SplitArrayIntDictionary.Insert(TKey,TValue)) =
P*( O(log(n)) + O(n) ) + (1-P)*O(1)
< P*O(n)
< O(n)

For comparison, the average time of a balanced binary tree insert (SortedDictionary<TKey,TValue>.Insert(TKey,TValue)) is O(log(n)), which is more efficient for large dictionaries. For small dictionaries (up to 10,000 - 50,000 elements), the Array.Copy operation is so efficient that inserting a new key into SplitArrayIntDictionary is faster than into SortedDictionary (see the Excel graphs at the top of this page).

Cyclic Array.Copy Background

The .NET SortedList implements IDictionary<TKey,TValue> by synchronizing two arrays - TKey[] and TValue[]. The algorithm basically shifts the arrays each time a key needs to be inserted in the middle of the array. The shift operation uses Array.Copy (which is implemented internally using memmove). memmove is much faster than an equivalent C# for {} implementation, especially if we do not want to use auxiliary memory (in place copy operation):

void SlowMoveInPlace(TKey[] keyArray, int source, int destination , int length)
{
    Debug.Assert(source != destination);
    Debug.Assert(source + length <= keyArray.Length);
    Debug.Assert(destination + length <= keyArray.Length);

    //assume destination and source overlap
    //non-cyclic implementation          
    if (source > destination)
    {
        for( int dIndex = destination, 
                 sIndex = source, 
                 sIndexMax = source + length -1 ; 
             sIndex <= sIndexMax ; 
             sIndex++ , 
             dIndex++)
        {          
          keyArray[dIndex] = keyArray[sIndex];
        }
    }
    else
    {
        for( int dIndex = destination+length-1, 
                 sIndex = source+length-1, 
                 sIndexMin = source; 
             sIndex >= sIndexMin ; 
             sIndex-- , 
             dIndex--)
        {
            keyArray[dIndex] = keyArray[sIndex];        
        }
    }
}

void FastMoveInPlace(TKey[] keyArray, int source, int destination , int length)
{
    Array.Copy(keyArray,source,keyArray,destination,length);
}
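As a quick standalone check (my own sketch, not part of the article's source), Array.Copy can indeed be trusted with overlapping in-place moves, thanks to its memmove semantics:

```csharp
using System;

static class ArrayCopyDemo
{
    // Shift four elements one slot to the left, in place. The source and
    // destination ranges overlap, yet Array.Copy (memmove) handles it correctly.
    public static string ShiftLeftInPlace()
    {
        int[] a = { 0, 1, 2, 3, 4, 5 };
        Array.Copy(a, 1, a, 0, 4);   // copy a[1..4] onto a[0..3]
        return string.Join(",", a);  // "1,2,3,4,4,5"
    }
}
```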

Three additional issues add overhead to the copy operation. First, we need to keep the TKey and TValue arrays synchronized, so each Array.Copy command is executed twice. (We could have used a single array of KeyValuePair<TKey,TValue>, but that would hurt the cache locality of key-only scans through TKey[].)

void ArrayCopy(int source, int destination, int length)
{
    Array.Copy(this.nodeKeys, source,
               this.nodeKeys, destination,
               length);

    Array.Copy(this.nodeValues, source,
               this.nodeValues, destination,
               length);
}

Second, a cyclic array requires more "if" clauses since the source and/or the destination may wrap over the end of the array. Here is a sample from CyclicArrayCopy():

void CyclicArrayCopy(int source, int destination, int length)
{
    Debug.Assert(source != destination);        
    Debug.Assert(length > 0);
    Debug.Assert(source >= 0);
    Debug.Assert(destination >= 0);
    Debug.Assert(source <= LastIndex);
    Debug.Assert(destination <= LastIndex);
    // The following asserts make sure we do not require
    // auxiliary memory allocation.
    bool shiftLeft = CyclicSub(destination,source) >= CyclicSub(source,destination);
    bool shiftRight = CyclicSub(destination,source) <= CyclicSub(source,destination);
    Debug.Assert(shiftRight || (shiftLeft  && (CyclicSub(source,destination) +
        (length-1)<=LastIndex)));
    Debug.Assert(shiftLeft  || (shiftRight && (CyclicSub(destination,source) +
        (length-1)<=LastIndex)));
    
    int srcEndIndex = CyclicAdd(source, length-1);
    int dstEndIndex = CyclicAdd(destination, length-1);
    
    if(...)
        // abcd00 --> 00abcd
    
    if(...)
        // cd000ab --> 00abcd0
        // cd000ab --> 000abcd
        
    else if(...)
        // cd000ab --> abcd000
        // cd000ab --> 0abcd00
    else if(...)
        // 00abcd0 --> cd000ab
        // 000abcd --> cd000ab
    else if (...)
        // abcd000 --> cd000ab
        // 0abcd00 --> cd000ab
    else if (...)
        // d000abc --> bcd000a
    else
        // bcd000a --> d000abc
#region proof 
    Debug.Assert(destination < source);
    int length5 = LastIndex+1 - source;
    int shift_ = source - destination;
    Debug.Assert(shiftLeft);        
    Debug.Assert(shift_ + (length - 1) <= LastIndex);
    Debug.Assert((source - destination) + (length - 1) <= LastIndex);
    Debug.Assert((source - destination) + (length - 1) < LastIndex+1);
    
    //by definition srcEndIndex = source+(length-1)-(LastIndex+1)
    //which gives us this equation:
    Debug.Assert(srcEndIndex - destination  <0);
    Debug.Assert(destination > srcEndIndex);
    // concludes the proof
#endregion
    //requires proof that destination > srcEndIndex
    //otherwise the first ArrayCopy() call
    //would overwrite source data required 
    //for the last ArrayCopy() call
    ArrayCopy(source, destination, length5);
    ArrayCopy(0, LastIndex+1 - shift_, shift_);
    ArrayCopy(shift_, 0, srcEndIndex+1);
}

Third, the bigger the array, the more time the Array.Copy command would require - especially if the source and the destination cause too many CPU cache misses (see the Points of Interest section at the bottom of this page).

Algorithm Improvements

Splitting the Array

Jacob Slusser gives a great overview of Gap Buffers, highly recommended reading!

The first shift operation shifts the array one index to the right or one index to the left. See the following pseudo code:

TKey[] nodeKeys  = {'a','e','f', 0 , 0 , 0 , 0 , 0 }
TKey key = 'd'

//make room for 'd'
ShiftLeft(nodeKeys, index=0,length=1,shift=1);
Assert(nodeKeys == { 0 ,'e','f', 0 , 0 , 0 , 0 ,'a'});

//insert 'd'
nodeKeys[0] = key;    
Assert(nodeKeys == {'d','e','f', 0 , 0 , 0 , 0 ,'a'});

If the next shift operation follows the previous shift (in index and direction), the array is shifted to the maximum.

TKey[] nodeKeys  = {'d','e','f', 0 , 0 , 0 , 0 ,'a'}
TKey key = 'c'

//make room for 'c'
ShiftLeft(nodeKeys, index=6,length=1,shift=4);
Assert(nodeKeys == {'d','e','f','a', 0 , 0 , 0 , 0 });

//Insert 'c' into nodeKeys
nodeKeys[7] = key;    
Assert(nodeKeys == {'d','e','f','a', 0 , 0 , 0 ,'c'});

This gap in the array allows fast insert operations as long as the input keeps decreasing monotonically:

TKey[] nodeKeys  = {'d','e','f','a', 0 , 0 , 0 ,'c'}
key = 'b'
// no shift required
// Insert key into nodeKeys
nodeKeys[6] = key;
Assert(nodeKeys == {'d','e','f','a', 0 , 0 ,'b','c'});

When the input is no longer monotonic, the algorithm starts from the beginning, shifting by one index only.

TKey[] nodeKeys  = {'d','e','f','a', 0 , 0 ,'b','c'}
key = 'g'

// Make room for 'g'
ShiftRight(nodeKeys,index=3,length=1,shift=1);

// Insert 'g'
nodeKeys[3] = key;
Assert(nodeKeys == {'d','e','f','g','a', 0 ,'b','c'})
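The shift steps above can be sketched with a small helper. Note that this is my own toy implementation with guessed parameter semantics (here, index names the slot of the element being moved, which differs slightly from the article's elided ShiftLeft); vacated slots are marked with '_':

```csharp
static class GapDemo
{
    // Move `length` elements starting at `index` left by `shift` slots
    // (1 <= shift < a.Length), wrapping around the end of the array.
    // Vacated slots become '_'.
    public static char[] ShiftLeft(char[] a, int index, int length, int shift)
    {
        int n = a.Length;
        for (int i = 0; i < length; i++)   // left-to-right keeps overlapping moves safe
        {
            int src = (index + i) % n;
            int dst = (src - (shift % n) + n) % n;
            a[dst] = a[src];
            a[src] = '_';
        }
        return a;
    }
}
```

With this helper, shifting 'a' out of slot 0 by one slot turns {'a','e','f','_','_','_','_','_'} into {'_','e','f','_','_','_','_','a'}, matching the first step above.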

The cyclic algorithm keeps track of the following properties: Split, MinIndex, MaxIndex, LocalMaxIndex, and LocalMinIndex.

bool Split
{
    get
    {
        if (MinIndex != this.LocalMinIndex)
        {
            Debug.Assert(MaxIndex != this.LocalMaxIndex);
            return true;
        }
        return false;
    }
}

// In the example below the sorted array is split into two sections: 
// from LocalMinIndex to MaxIndex ('b','c','d','e','f','g')
// from MinIndex to LocalMaxIndex ('a')
nodeKeys = {'d','e','f','g','a', 0 ,'b','c'}
Assert(Split)
Assert(MaxIndex == 3)
Assert(nodeKeys[MaxIndex] == 'g')
Assert(MinIndex == 4)
Assert(nodeKeys[MinIndex] == 'a')
Assert(LocalMinIndex == 6)
Assert(nodeKeys[LocalMinIndex] == 'b')
Assert(LocalMaxIndex == 4)
Assert(nodeKeys[LocalMaxIndex] == 'a')

// In the example below the sorted array has only one section:
// from MinIndex to MaxIndex ('a','b','c','d','e','f','g')
nodeKeys = {'d','e','f','g', 0 ,'a','b','c'}
Assert(!Split)
Assert(MaxIndex == LocalMaxIndex == 3)
Assert(MinIndex == LocalMinIndex == 5)
Assert(nodeKeys[MaxIndex] == 'g')
Assert(nodeKeys[MinIndex] == 'a')

Shifting in Both Directions

Since the underlying array is cyclic, it supports both shift right and shift left operations. The algorithm chooses the shift direction that requires less read/write memory operations.

bool ShiftRightIsFaster(int index)
{
    Debug.Assert(!Full);

    return (CyclicSub(LocalMaxIndex + 1,index) <
            CyclicSub(index+1,LocalMinIndex));        
}

bool Insert(TKey key, TValue value, bool overwriteValue)
{
    if (Empty) ...
    else if (Split && (newLocalMinKey || newLocalMaxKey)) ...
    else if (!Split && newMinKey) ...
    else if (!Split && newMaxKey) ...
    else if (Split && newMinKey) ...
    else if (Split && newMaxKey) ...
    else
    {
        bool foundKey;
        index = this.CyclicSearch(key,prefetch, out foundKey);

        if (foundKey)
        {
            if (!overwriteValue)
            {
                // key,value pair already exists
                return false;
            }
            //replace value of existing key
            Insert(key, value, prefetch,index);
            return true;
        }
    }

    if (Full) return false;
    //shift the array to accommodate the new node      
    if (ShiftRightIsFaster(index))
    {
        index = CyclicInc(index); //index++
        InsertShiftRight(key, prefetch, value, index);
    }
    else
    {
        InsertShiftLeft(key, prefetch, value, index);            
    }
    return true;
}

C# Code Optimizations

Cyclic Indexing

There are three basic approaches to implement a cyclic index + and - operator:

  1. Obviously, using the integer modulus operator (requires an integer division):

     int CyclicAdd(int index, int increase)
     {    
         Debug.Assert(increase >= 0);
         // LastIndex+1 == Capacity
         return (index + increase) % (LastIndex+1);    
     }

  2. Using an "if" to detect an index that is out of bounds:

     int CyclicAdd(int index, int increase)
     {
         Debug.Assert(increase >= 0);
         Debug.Assert(increase <= LastIndex+1);
         Debug.Assert(index >= 0);
         Debug.Assert(index <= LastIndex);
         index += increase;
         if (index > LastIndex)
         {
             index -= LastIndex+1;
         }
         return index;
     }

  3. And the fastest implementation, which assumes that LastIndex+1 is a power of 2. So if the capacity of an array is 8 == 00001000b, then LastIndex == 7 == 00000111b:

     int CyclicAdd(int index, int increase)
     {
         Debug.Assert(increase >= 0);
         Debug.Assert(increase <= LastIndex+1);
         Debug.Assert(index >= 0);
         Debug.Assert(index <= LastIndex);
         return (index + increase) & LastIndex;
     }

    Since the JIT would easily inline each CyclicAdd() call, we can reuse it to implement the subtract operation:

    int CyclicSub(int index, int dec)
    {
        Debug.Assert(dec >= 0);
        Debug.Assert(dec <= LastIndex+1);
        return CyclicAdd(index, LastIndex+1 - dec);
    }
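A standalone sketch (class and method names are mine) that exercises the variants: when the capacity is a power of two, the bitmask version agrees with the modulus version for every valid input, and CyclicSub wraps as expected:

```csharp
static class CyclicIndexDemo
{
    const int LastIndex = 7;   // capacity 8, a power of two

    public static int CyclicAddMod(int index, int inc)  => (index + inc) % (LastIndex + 1);
    public static int CyclicAddMask(int index, int inc) => (index + inc) & LastIndex;

    // Subtraction reuses addition, exactly as in the article.
    public static int CyclicSub(int index, int dec)     => CyclicAddMask(index, LastIndex + 1 - dec);

    // Exhaustively compare the modulus and bitmask implementations.
    public static bool AllAgree()
    {
        for (int i = 0; i <= LastIndex; i++)
            for (int inc = 0; inc <= LastIndex + 1; inc++)
                if (CyclicAddMod(i, inc) != CyclicAddMask(i, inc))
                    return false;
        return true;
    }
}
```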

TKey Comparisons

One of the most time consuming operations in the algorithm is the search loop. Most (if not all) .NET collections require either TKey to implement the IComparable<TKey> interface, or the user to provide an IComparer<TKey>. The approach in this example is to optimize speed at the cost of "Best Practices". Mind you, it is still possible to use IComparer and IComparable, but it would incur the penalty of a second virtual function call.

So - what's so bad with the interface approach?

class SplitArrayDictionary<TKey,TValue,TComparer>
      where TComparer : IComparer<TKey> , new()
{
    readonly static IComparer<TKey> Comparer = new TComparer();
    
    int Search(TKey key, out bool keyFound)
    {
        for (...)
        {
            if (Comparer.Compare(key,nodeKeys[index]) == 0)
            {
                keyFound = true;
                return index;
            }
            ...
        }
    }
}

The C# JIT compiles the Search method once, and distinguishes between different TKey types with a vtable (similar to virtual method calls). So there is a table that either points to Compare(int,int) when TKey==int, or Compare(long,long) when TKey==long. The JIT does not "inline" the Compare method inside the loop since that would require separate Search compilations for int and long. This JIT decision is by-design, and until it is relaxed - well - we need to look for something faster.

Abstract methods are not considered "best practice" for implementing a C# "contract", and the JIT doesn't inline abstract or virtual function calls either. However, abstract method calls are faster since the runtime JIT does not insert an if (Comparer!=null) {} clause before the virtual method call ("this" is never null).

abstract class SplitArrayDictionary<TKey,TValue>
{
    abstract int Compare(TKey left, TKey right);
    
    int Search(TKey key, out bool keyFound)
    {
        for (...)
        {
            if (this.Compare(key,nodeKeys[index]) == 0)
            {
                keyFound = true;
                return index;
            }
            ...
        }
    }
}

class SplitArrayIntDictionary<TValue> : SplitArrayDictionary<int,TValue>
{
    sealed override int Compare(int left, int right)
    {
        return left-right; // might overflow !!
    }
}

Taking into account overflow conditions:

class SplitArrayIntDictionary<TValue> : SplitArrayDictionary<int,TValue>
{
    sealed override int Compare(int left, int right)
    {
        if (left > 0)
        {
            if (right < 0) return 1;
            return left-right;
        }
        if (right > 0) return -1;
        return left-right;
    }
}

A safe and faster alternative:

class SplitArrayIntDictionary<TValue> : SplitArrayDictionary<int,TValue>
{
    sealed override int Compare(int left, int right)
    {
        if (left  > right) return 1;
        if (left < right) return -1;
        return 0;
    }
}
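A standalone sketch of why the branch-based version is the safe one: plain subtraction overflows when the operands have opposite signs and can report the wrong order (class and method names below are mine):

```csharp
static class CompareDemo
{
    // Naive subtraction-based compare; wraps around for operands of opposite signs.
    public static int CompareBySub(int left, int right) => unchecked(left - right);

    // Branch-based compare; correct over the full int range.
    public static int CompareSafe(int left, int right)
    {
        if (left > right) return 1;
        if (left < right) return -1;
        return 0;
    }

    // int.MaxValue - (-1) overflows to int.MinValue, so the naive compare
    // claims int.MaxValue < -1, while the safe compare orders them correctly.
    public static bool SubtractionMisorders()
        => CompareBySub(int.MaxValue, -1) < 0 && CompareSafe(int.MaxValue, -1) > 0;
}
```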

The second caveat is that Compare does excess computations. It distinguishes between three different conditions: "<", or "==", or ">". In some cases, we might only want to distinguish between two conditions: "<=" or ">".

bool KeyMightBeInArray(TKey key)
{
    return !Empty &&
           LessOrEqual(key, nodeKeys[MaxIndex]) &&
           GreaterOrEqual(key, nodeKeys[MinIndex]) &&
           (   !Split ||
               LessOrEqual(key, nodeKeys[LocalMaxIndex]) ||
               GreaterOrEqual(key, nodeKeys[LocalMinIndex]));
}

This requires a TKey specific implementation of these methods:

class SplitArrayIntDictionary<TValue> : SplitArrayDictionary<int,TValue>
{
    sealed override bool LessOrEqual(int left, int right)
    {
        return (left <= right);
    }
    
    sealed override bool GreaterOrEqual(int left, int right)
    {
        return (left >= right);
    }
}

In order to speed up the comparisons even further, we pre-fetch the 32 most significant bits of the key and store them in a separate nodeKeyMSB[] int array. So we define yet another abstract method, PrefetchKeyMSB(TKey), which returns the 32 most significant bits of the key. For example, the MSB of a long is its upper 32 bits, and the MSB of a person's name could be the first two letters of the last name.

class LongDictionary<TValue> : BaseDictionary<long,TValue>
{
    sealed override int PrefetchKeyMSB(long key) 
    {
        return (int)(key >> 32);    // most significant 32 bits
    }
}
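A sketch of two concrete PrefetchKeyMSB implementations (the string variant is a hypothetical analogue of the "first two letters" idea, not code from the article):

```csharp
static class PrefetchDemo
{
    // Upper 32 bits of a 64-bit key, exactly as in LongDictionary above.
    public static int LongMsb(long key) => (int)(key >> 32);

    // First two UTF-16 characters of a string packed into 32 bits.
    public static int StringMsb(string key)
    {
        int hi = key.Length > 0 ? key[0] : 0;
        int lo = key.Length > 1 ? key[1] : 0;
        return (hi << 16) | lo;
    }
}
```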

And the Insert() method also stores the MSB separately:

readonly static bool TKeyIsInteger = (typeof(TKey).FullName == "System.Int32");
void Insert(TKey key, TValue value, int prefetch, int index)
{
    nodeKeys[index] = key;
    nodeValues[index] = value;
    if (!TKeyIsInteger)
    {
        nodeKeyMSB[index] = prefetch;
    }
}

Integer comparisons are inlined by the JIT runtime compiler, and when they are executed inside a loop, the performance boost becomes significant.

int SearchInRange(  TKey key,
                    int keyPrefetch, 
                    out bool keyFound, 
                    int searchMinIndex, 
                    int searchMaxIndex)
{
    if (Equal(key, nodeKeys[searchMinIndex]))
    {
        keyFound = true;
        return searchMinIndex;
    }

    if (Equal(key, nodeKeys[searchMaxIndex]))
    {
        keyFound = true;
        return searchMaxIndex;
    }

    int delta = (searchMaxIndex - searchMinIndex) / 2;
    
    for (; delta != 0 ; delta = (searchMaxIndex - searchMinIndex) / 2)
    {
        Debug.Assert(Greater (key,nodeKeys[searchMinIndex]));
        Debug.Assert(Less    (key,nodeKeys[searchMaxIndex]));            
    
        int index = searchMinIndex + delta;
        int prefetch = nodeKeyMSB[index];
        Debug.Assert(index >= 0 && index <= LastIndex);
        if (keyPrefetch < prefetch)
        {
            searchMaxIndex = index;
            continue;
        }

        if (keyPrefetch > prefetch)
        {
            searchMinIndex = index;
            continue;
        }

        Debug.Assert(keyPrefetch == prefetch);
        int compare = Compare(key, GetKey(index));
        if (compare == 0)
        {
            keyFound = true;
            return index;
        }

        if (compare < 0)
        {
            searchMaxIndex = index;
        }
        else
        {
            Debug.Assert(compare > 0);
            searchMinIndex = index;
        }
    }//for
    Debug.Assert(Greater(key, nodeKeys[searchMinIndex]));
    Debug.Assert(Less   (key, nodeKeys[searchMaxIndex]));
    Debug.Assert(delta == 0);
    keyFound = false;
    return searchMinIndex;   
}
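The same idea boiled down to a minimal, non-cyclic form (my own sketch, not the article's code): a binary search over sorted long keys that consults the cached 32-bit prefixes first and touches the full 8-byte keys only on a prefix tie:

```csharp
static class PrefetchSearchDemo
{
    public static int[] BuildMsb(long[] keys)
    {
        var msb = new int[keys.Length];
        for (int i = 0; i < keys.Length; i++)
            msb[i] = (int)(keys[i] >> 32);    // cache the upper 32 bits of every key
        return msb;
    }

    // Returns the index of `key`, or the insertion point when not found.
    public static int Search(long[] keys, int[] msb, long key, out bool found)
    {
        int keyMsb = (int)(key >> 32);
        int lo = 0, hi = keys.Length - 1;
        while (lo <= hi)
        {
            int mid = lo + (hi - lo) / 2;
            int c = keyMsb.CompareTo(msb[mid]);       // cheap int comparison first
            if (c == 0) c = key.CompareTo(keys[mid]); // prefix tie: full 64-bit compare
            if (c == 0) { found = true; return mid; }
            if (c < 0) hi = mid - 1; else lo = mid + 1;
        }
        found = false;
        return lo;
    }
}
```

For signed longs this is order-consistent, because the arithmetic shift preserves the sign of the high half, so the prefix comparison never disagrees with the full comparison.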

Using the Code

The code is released under the MIT Open Source license. The IDictionary methods implemented are: Insert, TryGetValue, indexer[], and Clear. You are invited to implement the rest of the methods. Do not forget to derive your own class from SplitArrayDictionary<TKey,TValue> and implement the following methods:

abstract public int PrefetchKeyMSB(TKey key);
abstract public int Compare(TKey left, TKey right);
abstract public bool LessOrEqual(TKey left, TKey right);
abstract public bool Less(TKey left, TKey right);
abstract public bool GreaterOrEqual(TKey left, TKey right);
abstract public bool Greater(TKey left, TKey right);
abstract public bool Equal(TKey left, TKey right);

The project was developed with Visual Studio 2008 Beta 2 ("Orcas"). Since I haven't used any special .NET 3.5 features, the code should be easily ported to .NET 2.0.

Points of Interest

  1. The results in the Excel graphs were produced on the following laptop:

     Intel Core 2 T6700 @ 2.33GHz
     Cache size = 2MB
     2x2GB=4GB of 512MHz DDR2 RAM
     Windows XP Pro (32bit)

    The results were compared with the following Workstation:

    Two Intel Xeon 5160 2.6 GHz
    4x2GB=8GB fully buffered DIMMs running at 667MHz
    Windows XP x64 Edition
    Cache size = 4MB

    All benchmarks performed better on the laptop, except SortedList, which performed better on the workstation in the following cases:

    1. 100% random data when number of elements > 500,000 (maximum that was tested).
    2. 10% random data when number of elements > 100,000.

    What do you think is the reason for that?

  2. The memory consumption of this example was slightly lower than that of SortedList, and about half of the memory that SortedDictionary consumed. Memory and time benchmarking were done following Tony's (Tobias Hertkorn) advice.

FAQ

Q: Wouldn't it be more cost-beneficial to buy a better computer (rather than tweaking the implementation)?

A: Power-hungry clients will spend money on new hardware whether you tweak the code or not. Any improvement on top of that is always welcome.

Q: Why re-invent the wheel? Using standard components (SortedList) is safer.

A: The .NET Framework is, well, a framework. I believe that Microsoft should leave some room for contributors (CodeProject, anyone?). You are all invited to peer review this work and contribute, so that the code becomes more complete and robust.

Q: Is unmanaged code faster?

A: Yes... for now. As Microsoft/Sun/IBM continue to tweak their compilers, higher level languages will close the gap. Part of this work is to highlight places that need attention. Banging my head against the wall did not leave any lasting impression; hopefully this work will do better.

Q: What about the (poor) asymptotic performance?

A: It should be possible to use this implementation as a node of a balanced binary tree. The hybrid data structure would still have the "CPU cache locality" boost together with O(log N) asymptotic performance. Unfortunately, I don't have more time to work on it. If you are into it, start reading about AVL trees and Scapegoat trees. Since the array is quite big, one can use "helper" objects without affecting the memory consumption much. So, if for example, we track for each array its total weight (including children) and add a back pointer to the parent node, together with some Garbage Collection awareness ...

Q: Who cares about a slightly faster Dictionary?

A: Dictionaries are all about fast search performance, so even a small gain might prove useful for someone out there. Some of the tweaks described in this work can prove useful for other data structures as well.

History

  • 18-Oct-2007: Initial version.
  • 25-Oct-2007: Added FAQ section.
  • 10-Oct-2008: Added Remove() and the IEnumerable<> implementation.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


About the Author

itaifrenkel
Software Developer
Israel
