Click here to Skip to main content
6,629,377 members and growing! (23,681 online)
Email Password   helpLost your password?
General Programming » Algorithms & Recipes » General     Intermediate License: The Code Project Open License (CPOL)

Squeezing more performance from SortedList

By itaifrenkel

A SortedList implementations using a cyclic algorithm and C# IDictionary tweaks
C# 3.0, Windows, .NET 3.5VS2008, Dev
Posted:20 Oct 2007
Updated:10 Oct 2008
Views:17,220
Bookmarked:26 times
Unedited contribution
Announcements
Loading...
 
Search    
Advanced Search
Add to IE Search
printPrint   add Share
      Discuss Discuss   Broken Article?Report  
17 votes for this article.
Popularity: 5.17 Rating: 4.20 out of 5

1
2 votes, 11.8%
2
1 vote, 5.9%
3
3 votes, 17.6%
4
11 votes, 64.7%
5

Download SplitArrayDictionary_src.zip - 14.81 KB
Download SplitArrayDictionary_demo.zip - 60.38 KB

typical_results.png

Introduction

This article describes an IDictionary<TKey,TValue> implementation similiar to SortedList<TKey,TValue>. The implementation uses TKey[] and TValue[] arrays as the underlying data structure. Performance increase is achieved by using cyclic array indexing and by prefetching the 32 most significant bits of TKey.

Assymptotic Time Complexity of
SplitArrayIntDictionary<TValue>.Insert()

P = probability of random data.
(1-P) = probability of monotonously (sorted) input data.
O(log(n)) = search time in the sorted array.
O(n) = time needed for Array.Copy.
O(1) = time needed to write the new key/value pair into the arrays.
O(SplitArrayIntDictionary.Insert(TKey,TValue)) =
P*( O(log(n)) + O(n) ) + (1-P)*O(1)
< P*O(n)
< O(n)

For comparison, the average time of a balanced binary tree
(SortedDictionary<Tkey,TValue>.Insert(TKey,TValue)) operation is O(log(n))
which is more efficient for large dictionaries. For small dictionaries (up to 50,000 - 10,000 elements) the Array.Copy operation is so efficient, that inserting a new key to SplitArrayIntDictionary is faster than SortedDictionary (see the excel graphs at the top of this page).

Cyclic Array.Copy Background

The .NET SortedList implements IDictionary<TKey,TValue>, by synchronizing two arrays - TKey[] and TValue[]. The algorithm basically shifts the arrays each time a key needs to be inserted in the middle of the array. The shift operation uses Array.Copy (which is implemented internally using memmove). Memove is much faster than an equivalent C# for {} implementation, especially if we do not want to use auxillary memory (in place copy operation):

void SlowMoveInPlace(TKey[] keyArray, int source, int destination , int length)
{
    Debug.Assert(source != destination);
    Debug.Assert(source+length < keyArray.Length);
    Debug.Assert(destination+length < keyArray.Length);

    //assume destination and source overlap
    //non-cyclic implementation          
    if (sIndex > dIndex)
    {
        for( int dIndex = destination, 
                 sIndex = source, 
                 sIndexMax = source + length -1 ; 
             sIndex <= sIndexMax ; 
             sIndex++ , 
             dIndex++)
        {          
          keyArray[dIndex] = keyArray[sIndex];
        }
    }
    else
    {
        for( int dIndex = destination+length-1, 
                 sIndex = source+length-1, 
                 sIndexMin = source; 
             sIndex >= sIndexMin ; 
             sIndex-- , 
             dIndex--)
        {
            keyArray[dIndex] = keyArray[sIndex];        
        }
    }
}

void FastMoveInPlace(TKey[] keyArray, int source, int destination , int length)
{
    Array.Copy(keyArray,source,keyArray,destination,length);
}

Three additional issues incur overhead on the Copy operation: First, we need to keep the TKey and TValue arrays synchronized. So each Array.Copy command is executed twice. (We could have used one array containing a KeyValuePair<TKey,TValue> but it would involve boxing of TKey and also hurt cache locality of TKey[]).

void ArrayCopy(int source, int destination, int length)
{
    Array.Copy(this.nodeKeys, source,
               this.nodeKeys, destination,
               length);

    Array.Copy(this.nodeValues, source,
               this.nodeValues, destination,
               length);
}

Second, a cyclic array requires more "if" clauses since the source and/or the destination may wrap over the end of the array. Here is a sample from CyclicArrayCopy()

void CyclicArrayCopy(int source, int destination, int length)
{
    Debug.Assert(source != destination);        
    Debug.Assert(length > 0);
    Debug.Assert(source >= 0);
    Debug.Assert(destination >= 0);
    Debug.Assert(source <= LastIndex);
    Debug.Assert(destination <= LastIndex);
    // The following assert makes sure we do not require
    // auxillary memory allocation.
    bool shiftLeft = CyclicSub(destination,source) >= CyclicSub(source,destination);
    bool shiftRight = CyclicSub(destination,source) <= CyclicSub(source,destination);
    Debug.Assert(shiftRight || (shiftLeft  && (CyclicSub(source,destination) +
        (length-1)<=LastIndex)));
    Debug.Assert(shiftLeft  || (shiftRight && (CyclicSub(destination,source) +
        (length-1)<=LastIndex)));
    
    int srcEndIndex = CyclicAdd(source, length-1);
    int dstEndIndex = CyclicAdd(destination, length-1);
    
    if(...)
        // abcd00 --> 00abcd
    
    if(...)
        // cd000ab --> 00abcd0
        // cd000ab --> 000abcd
        
    else if(...)
        // cd000ab --> abcd000
        // cd000ab --> 0abcd00
    else if(...)
        // 00abcd0 --> cd000ab
        // 000abcd --> cd000ab
    else if (...)
        // abcd000 --> cd000ab
        // 0abcd00 --> cd000ab
    else if (...)
        // d000abc --> bcd000a
    else
        // bcd000a --> d000abc
#region proof 
    Debug.Assert(destination < source);
    int length5 = LastIndex+1 - source;
    int shift_ = source - destination;
    Debug.Assert(shiftLeft);        
    Debug.Assert(shift_ + (length - 1) <= LastIndex);
    Debug.Assert((source - destination) + (length - 1) <= LastIndex);
    Debug.Assert((source - destination) + (length - 1) < LastIndex+1);
    
    //by definition srcEndIndex = source+(length-1)-(LastIndex+1)
    //which gives us this equation:
    Debug.Assert(srcEndIndex - destination  <0);
    Debug.Assert(destination > srcEndIndex);
    // concludes the proof
#endregion
    //requires proof that destination > srcEndIndex
    //otherwise the first ArrayCopy() call
    //would overwrite source data required 
    //for the last ArrayCopy() call
    ArrayCopy(source, destination, length5);
    ArrayCopy(0, LastIndex+1 - shift_, shift_);
    ArrayCopy(shift_, 0, srcEndIndex+1);
}

Third, the bigger the array, the more time Array.Copy command would require - especially if the source and the destination cause too many CPU cache misses (see the Points of Interest section at the bottom of this page).

Algorithm Improvements

Splitting the Array

Jacob Slusser gives a great overview of Gap Buffers, highly recommended reading ! </update>

The first shift operation shifts the array one index to the right or one index to the left. See the following pseudo code:

TKey[] nodeKeys  = {'a','e','f', 0 , 0 , 0 , 0 , 0 }
TKey key = 'd'

//make room for 'd'
ShiftLeft(nodeKeys, index=0,length=1,shift=1);
Assert(nodeKeys == { 0 ,'e','f', 0 , 0 , 0 , 0 ,'a'});

//insert 'd'
nodeKeys[0] = key;    
Assert(nodeKeys == {'d','e','f', 0 , 0 , 0 , 0 ,'a'});

If the next shift operation follows the previous shift (in index and direction) the array is shifted to the maximum.

TKey[] nodeKeys  = {'d','e','f', 0 , 0 , 0 , 0 ,'a'}
TKey key = 'c'

//make room for 'c'
ShiftLeft(nodeKeys, index=6,length=1,shift=4);
Assert(nodeKeys == {'d','e','f','a', 0 , 0 , 0 , 0 });

//Insert 'c' into nodeKeys
nodeKeys[7] = key;    
Assert(nodeKeys == {'d','e','f','a', 0 , 0 , 0 ,'c'});

This gap in the array allows fast insert operations as long as the input keeps the monotonous decrease:

TKey[] nodeKeys  = {'d','e','f','a', 0 , 0 ,'c'}
key = 'b'
// no shift required
// Insert key into nodeKeys
nodeKeys[6] = key;
Assert(nodeKeys == {'d','e','f','a', 0 , 0 ,'b','c'});

When the input is no longer monotonous the algorithm starts from the beginning, shifting only by one index.

TKey[] nodeKeys  = {'d','e','f','a', 0 , 0 ,'b','c'}
key = 'g'

// Make room for 'g'
ShiftRight(nodeKeys,index=3,length=1,shift=1);

// Insert 'g'
nodeKeys[3] = key;
Assert(nodeKeys == {'d','e','f','g','a', 0 ,'b','c'})

The cyclic algorithm keeps track of the following properties:
Split,MinIndex,MaxIndex,LocalMaxIndex,LocalMinIndex

bool Split
{
    get
    {
        if (MinIndex != this.LocalMinIndex)
        {
            Debug.Assert(MaxIndex != this.LocalMaxIndex);
            return true;
        }
        return false;
    }
}

// In the example below the sorted array is split into two sections: 
// from LocalMinIndex to MaxIndex ('b','c','d','e','f','g')
// from MinIndex to LocalMaxIndex ('a')
nodeKeys = {'d','e','f','g','a', 0 ,'b','c'}
Assert(Split)
Assert(MaxIndex == 3)
Assert(nodeKeys[MaxIndex] == 'g')
Assert(MinIndex == 4)
Assert(nodeKeys[MinIndex] == 'a')
Assert(LocalMinIndex == 6)
Assert(nodeKeys[LocalMinIndex] == 'b')
Assert(LocalMaxIndex == 4)
Assert(nodeKeys[LocalMaxIndex] == 'a')

// In the example below the sorted array has only one section:
// from MinIndex to MaxIndex ('a','b','c','d','e','f','g')
nodeKeys = {'d','e','f','g', 0 ,'a','b','c'}
Assert(!Split)
Assert(MaxIndex == LocalMaxIndex == 3)
Assert(MinIndex == LocalMinIndex == 5)
Assert(nodeKeys[MaxIndex] == 'g')
Assert(nodeKeys[MinIndex] == 'a')

Shifting in both directions

Since the underlying array is cyclic it supports both shift right and shift left operations. The algorithm chooses the shift direction that requires less read/write memory operations.

bool ShiftRightIsFaster(int index)
{
    Debug.Assert(!Full);

    return (CyclicSub(LocalMaxIndex + 1,index) <
            CyclicSub(index+1,LocalMinIndex));        
}

bool Insert(TKey key, TValue value, bool overwriteValue)
{
    if (Empty) ...
    else if (Split && (newLocalMinKey || newLocalMaxKey)) ...
    else if (!Split && newMinKey) ...
    else if (!Split && newMaxKey) ...
    else if (Split && newMinKey) ...
    else if (Split && newMaxKey) ...
    else
    {
        bool foundKey;
        index = this.CyclicSearch(key,prefetch, out foundKey);

        if (foundKey)
        {
            if (!overwriteValue)
            {
                // key,value pair already exists
                return false;
            }
            //replace value of existing key
            Insert(key, value, prefetch,index);
            return true;
        }
    }

    if (Full) return false;
    //shift the array to accomodate space for the new node      
    if (ShiftRightIsFaster(index))
    {
        index = CyclicInc(index); //index++
        InsertShiftRight(key, prefetch, value, index);
    }
    else
    {
        InsertShiftLeft(key, prefetch, value, index);            
    }
    return true;

C# code optimizations

Cyclic Indexing

There are three basic approaches to implement a cyclic index + and - operator:
Obviously using the integer modulus operator (requires int division operation)

int CyclicAdd(int index, int increase)
{    
    Debug.Assert(increase >= 0);
    // LastIndex+1 == Capacity
    return (index + increase) % (LastIndex+1);    
}

Using an if to detect index out of bounds.

int CyclicAdd(int index, int increase)
{
    Debug.Assert(increase >= 0);
    Debug.Assert(increase <= LastIndex+1);
    Debug.Assert(index >= 0);
    Debug.Assert(index <= LastIndex);
    index += increase;
    if (index > LastIndex)
    {
        index -= LastIndex+1;
    }
    return index;
}

And the fastest implementation assumes that LastIndex+1 is a power of 2.
So if the capacity of an array is 8 = 00001000b
then LastIndex == 7 == 00000111b

int CyclicAdd(int index, int increase)
{
    Debug.Assert(increase >= 0);
    Debug.Assert(increase <= LastIndex+1);
    Debug.Assert(index >= 0);
    Debug.Assert(index <= LastIndex);
    return (index + increase) & LastIndex;
}

Since the JIT would easily inline each CyclicAdd() call, we can reuse it to implement the subtract operation:

int CyclicSub(int index, int dec)
{
    Debug.Assert(dec >= 0);
    Debug.Assert(dec <= LastIndex+1);
    return CyclicAdd(index, LastIndex+1 - dec);
}

TKey comparisons

One of the most time consuming operation in the algorithm is the Search loop. Most (if not all) .NET collections require either the TKey to implement the IComparer interface, or that the user would provide IComparable.
The approach in this example is to optimize speed at the cost of "Best Practices". Mind you, it is still possible to use IComparer and IComparable, but it would involve a penalty of a second virtual function call.

So - what's so bad with the interface approach ?

class SplitArrayDictionary<TKey,TValue,TComparer>
    where TComparer : IComparer<TKey> , new()
{
    readonly static IComparer<TKey> Comparer = new TComparer();
    
    int Search(TKey key, out bool keyFound)
    {
        for (...)
        {
            if (Comparer.Compare(key,nodeKeys[index]) == 0)
            {
                keyFound = true;
                return index;
            }
            ...
        }
    }
}

The C# JIT compiles the Search method once, and distinguishes between differnet TKey types with a vtable (simmiliar to virtual method calls). So there is a table that either points to Compare(int,int) when TKey==int or Compare(long,long) when TKey==long.
The JIT does not "inline" the Compare method inside the loop, since that would require seperate Search compilations for int and long. This JIT decision is by-design, and until it is relaxed - well - we need to look for something faster.

Abstract methods are not considered "best practice" for implementing a C# "contract", and the JIT doesn't inline abstract or virtual function calls either. However, abstract methods call are faster since the runtime JIT does not insert a if (Comparer!=null) {} clause before the virtual method call ("this" is never null).

class SplitArrayDictionary<TKey,TValue>
{
    abstract int Compare(TKey left, TKey right);
    
    int Search(TKey key, out bool keyFound)
    {
        for (...)
        {
            if (this.Compare(key,nodeKeys[index]) == 0)
            {
                keyFound = true;
                return index;
            }
            ...
        }
    }
}

class SplitArrayIntDictionary<TValue> : SplitArrayDictionary<int,TValue>
{
    sealed override int Compare(int left, int right)
    {
        return left-right; // might overflow !!
    }
}

Taking into account overflow conditions:

class SplitArrayIntDictionary<TValue> : SplitArrayDictionary<int,TValue>
{
    sealed override int Compare(int left, int right)
    {
        if (left > 0)
        {
            if (right < 0) return 1;
            return left-right;
        }
        if (right > 0) return -1;
        return left-right;
    }
}

A safe and faster alternative:

class SplitArrayIntDictionary<TValue> : SplitArrayDictionary<int,TValue>
{
    sealed override int Compare(int left, int right)
    {
        if (left  > right) return 1;
        if (right < left) return -1;
        return 0;
    }
}

2) The second caveat is that Compare does excess computations. It distinguishes between 3 different conditions "<" or "==" or ">". In some cases we might want to distinguish only between 2 conditions "<=" or ">".

bool KeyMightBeInArray(TKey key)
{
    return !Empty &&
           LessOrEqual(key,nodeKeys[MaxIndex]) &&
           GreaterOrEqual(key,nodeKeys[MinIndex]) &&
           (   !Split ||
               LessOrEqual(key, nodeKeys[LocalMaxIndex]) ||
               GreaterOrEqual(key, nodeKeys[LocalMinIndex]))
            );
}

This requires TKey specific implementation of these methods:

class SplitArrayIntDictionary<TValue> : SplitArrayDictionary<int,TValue>
{
    sealed override bool LessOrEqual(int left, int right)
    {
        return (left <= right);
    }
    
    sealed override bool GreaterOrEqual(int left, int right)
    {
        return (left >= right);
    }
}

In order speed the comparisons even further we prefetch the 32 most significant bits of the key and store it in a separate nodePrefetch[] int array. So we define yet another abstract method PrefetchKeyMSB(TKey) which returns the 32 Most Significant Bits of key. For example, the MSB of long is the upper 32bits and the MSB for a person's name is first two letters of the last name.

class LongDictionary<TValue> : BaseDictionary<long,TValue>
{
    sealed override int PrefetchKeyMSB(long key) 
    {
        return (int)(key >> 32);    // most significant 32 bits
    }
}

And the Insert() method also stores seperatly the MSB:

readonly static bool TKeyIsInteger = (typeof(TKey).FullName == "System.Int32");
void Insert(TKey key, TValue value, int prefetch, int index)
{
    nodeKeys[index] = key;
    nodeValues[index] = value;
    if (!TKeyIsInteger)
    {
        nodeKeyMSB[index] = prefetch;
    }
}

Integer comparisons are inlined by the JIT runtime compiler and when they are executed inside a loop, the performance boost becomes significant.

int SearchInRange(  TKey key,
                    int keyPrefetch, 
                    out bool keyFound, 
                    int searchMinIndex, 
                    int searchMaxIndex)
{
    if (Equal(key, nodeKeys[searchMinIndex]))
    {
        keyFound = true;
        return searchMinIndex;
    }

    if (Equal(key, nodeKeys[searchMaxIndex]))
    {
        keyFound = true;
        return searchMaxIndex;
    }

    int delta = (searchMaxIndex - searchMinIndex) / 2;
    
    for (; delta != 0 ; delta = (searchMaxIndex - searchMinIndex) / 2)
    {
        Debug.Assert(Greater (key,nodeKeys[searchMinIndex]));
        Debug.Assert(Less    (key,nodeKeys[searchMaxIndex]));            
    
        int index = searchMinIndex + delta;
        int prefetch = nodeKeyMSB[index];
        Debug.Assert(index >= 0 && index <= LastIndex);
        if (keyPrefetch < prefetch)
        {
            searchMaxIndex = index;
            continue;
        }

        if (keyPrefetch > prefetch)
        {
            searchMinIndex = index;
            continue;
        }

        Debug.Assert(keyPrefetch == prefetch);
        int compare = Compare(key, GetKey(index));
        if (compare == 0)
        {
            keyFound = true;
            return index;
        }

        if (compare < 0)
        {
            searchMaxIndex = index;
        }
        else
        {
            Debug.Assert(compare > 0);
            searchMinIndex = index;
        }
    }//for
    Debug.Assert(Greater(key, nodeKeys[searchMinIndex]));
    Debug.Assert(Less   (key, nodeKeys[searchMaxIndex]));
    Debug.Assert(delta == 0);
    keyFound = false;
    return searchMinIndex;   
}

Using the code

The code is released under the MIT open source license. The IDictionary methods implemented are: Insert, TryGetValue, indexer[] and Clear. You are invited to implement the rest of the methods. Do not forget to derive your own class from SplitArrayDictionary<TKey,TValue> and implement the following methods:

    abstract public int PrefetchKeyMSB(TKey key);
    abstract public int Compare(TKey left, TKey right);
    abstract public bool LessOrEqual(TKey left, TKey right);
    abstract public bool Less(TKey left, TKey right);
    abstract public bool GreaterOrEqual(TKey left, TKey right);
    abstract public bool Greater(TKey left, TKey right);
    abstract public bool Equal(TKey left, TKey right);

The project was developed with the beta 2 version of Orcas (Visual Studio 2008 Beta2). Since I haven't used any special .NET 3.5 features, the code should be easily ported to .NET 2.

Points of Interest

1)The results from excel where produced on the following Laptop:
Intel Core 2 T6700 @ 2.33GHz
Cache size = 2MB
2x2GB=4GB of 512MHz DDR2 RAM
Windows XP Pro (32bit)

The results were compared with the following workstation:
Two Intel Xeon 5160 2.6 GHz
4x2GB=8GB fully buffered DIMMs running at 667MHz
Windows XP x64 Edition
Cache size = 4MB

All benchmarks performed better on the laptop except the SortedList that performed better on the workstation in the following cases:
1) 100% random data when number of elements > 500,000 (maximum that was tested).
2) 10% random data when number of elements > 100,000
What do you think is the reason for that?

2) The memory consumption of this example was slightly lower than SortedList and about half of the memory that SortedDictionary consumed.
Memory and Time benchmarking was done following Tony's (Tobias Hertkorn) advice

FAQ

Q: Wouldn't it be more cost-beneficial to buy a better computer (rather than tweaking the implementation ?)
A: Clients that are power hungry would spend money on new hardware, whether you tweak the code or not. Any improvement over that would always be welcome.

Q: Why re-invent the wheel ? Using standard components (SortedList) is safer.
A: The .NET framework, is well, a framework. I beleive that Microsoft should leave some room for contributers (CodeProject anyone ?). You are all invitred to peer review this work and contribute so the code will be more complete and robust.

Q: Unmanaged code is faster
A: Yes... for now. As Microsoft/Sun/IBM continue to tweak compilers, higher level languages will close the gap. Part of this work is to highlight places that need attention. Banging my head on the wall did not leave any long lasting impression, hopefully this work would do better.

Q: What about the (poor) assymptotic performance
A: It should be possible to use this implementation as a node of a balanced binary tree. The hybrid data structure would still have the "CPU cache locality" boost together with O(log N) assymptotic performance. Unfortunatly, I don't have more time to work on it. If you are into it, start reading about AVL trees and Scapegoat trees. Since the array is quite big, one can use "helper" objects without effecting much the memory consumption. So if for example, we track for each Array it's total weight (including children) and add a back pointer to the parent node, together with some Garbage Collection awareness ...

Q: Who cares about a slightly faster Dictionary.
A: Dictionaries are related to fast search performance, so even a little gain might probably proove usefull for someone out there. Some of the tweaks described in this work, can proove usefull for other data structures as well.

History

18-oct-2007 Initial version.

25-oct-2007 FAQ section.

10-oct-2008 Added Remove() and IEnumerable<> implementation

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

itaifrenkel


Member

Occupation: Software Developer
Location: Israel Israel

Other popular Algorithms & Recipes articles:

Article Top
You must Sign In to use this message board.
FAQ FAQ 
 
Noise Tolerance  Layout  Per page   
 Msgs 1 to 11 of 11 (Total in Forum: 11) (Refresh)FirstPrevNext
GeneralCount, Capacity [modified] Pinmemberjohnj198410:32 4 Oct '08  
GeneralRe: Count, Capacity Pinmemberitaifrenkel14:55 10 Oct '08  
General.NET 2.0? PinmemberSoftware_Architect7:34 12 Nov '07  
GeneralRe: .NET 2.0? Pinmemberitaifrenkel3:54 25 Nov '07  
Generalthanks Pinmemberpjd100119:04 25 Oct '07  
GeneralExcellent PinmemberPete O'Hanlon3:32 23 Oct '07  
GeneralRe: Excellent Pinmemberitaifrenkel15:30 24 Oct '07  
GeneralWow!!! PinmemberLev Vayner.3:23 23 Oct '07  
GeneralRe: Wow!!! Pinmemberitaifrenkel15:30 24 Oct '07  
GeneralIs it worth it? PinmemberJoe Sonderegger21:10 22 Oct '07  
GeneralRe: Is it worth it? Pinmemberitaifrenkel15:33 24 Oct '07  

General General    News News    Question Question    Answer Answer    Joke Joke    Rant Rant    Admin Admin   

PermaLink | Privacy | Terms of Use
Last Updated: 10 Oct 2008
Editor: Sean Ewington
Copyright 2007 by itaifrenkel
Everything else Copyright © CodeProject, 1999-2009
Web20 | Advertise on the Code Project