Posted 20 Oct 2007

# Squeezing more performance from SortedList

Updated 10 Oct 2008

A `SortedList` implementation using a cyclic algorithm and C# `IDictionary` tweaks.

## Introduction

This article describes an `IDictionary<TKey,TValue>` implementation similar to `SortedList<TKey,TValue>`. The implementation uses `TKey[]` and `TValue[]` arrays as the underlying data structure. Performance increase is achieved by using cyclic array indexing and by pre-fetching the 32 most significant bits of `TKey`.

## Asymptotic Time Complexity of SplitArrayIntDictionary<TValue>.Insert()

```
P = probability of random input data.
(1-P) = probability of monotonic (sorted) input data.
O(log(n)) = search time in the sorted array.
O(n) = time needed for Array.Copy.
O(1) = time needed to write the new key/value pair into the arrays.

O(SplitArrayIntDictionary.Insert(TKey,TValue))
    = P*( O(log(n)) + O(n) ) + (1-P)*O(1)
    = P*O(n) + (1-P)*O(1)
    <= O(n)
```

For comparison, the average time of a balanced binary tree insert (`SortedDictionary<TKey,TValue>.Insert(TKey,TValue)`) is O(log(n)), which is more efficient for large dictionaries. For small dictionaries (up to roughly 10,000 - 50,000 elements), the `Array.Copy` operation is so efficient that inserting a new key into `SplitArrayIntDictionary` is faster than into `SortedDictionary` (see the Excel graphs at the top of this page).
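The trade-off can be reproduced with the framework's own collections: inserting keys in descending order forces `SortedList` to shift its internal arrays on every insert, while `SortedDictionary` pays only the tree cost. A minimal timing sketch (the class and method names below are mine, not part of the article's code; absolute numbers depend on the machine):

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

public static class InsertBenchmark
{
    // Inserts n keys in descending order (worst case for SortedList's
    // array shifting) and returns the elapsed milliseconds.
    public static long TimeInserts(IDictionary<int, int> dictionary, int n)
    {
        var watch = Stopwatch.StartNew();
        for (int key = n; key > 0; key--)
        {
            dictionary.Add(key, key);
        }
        watch.Stop();
        return watch.ElapsedMilliseconds;
    }

    public static void Main()
    {
        const int n = 20000;
        Console.WriteLine("SortedList:       {0} ms",
            TimeInserts(new SortedList<int, int>(), n));
        Console.WriteLine("SortedDictionary: {0} ms",
            TimeInserts(new SortedDictionary<int, int>(), n));
    }
}
```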

## Cyclic Array.Copy Background

The .NET `SortedList` implements `IDictionary<TKey,TValue>` by keeping two arrays - `TKey[]` and `TValue[]` - synchronized. The algorithm basically shifts the arrays each time a key needs to be inserted in the middle of the array. The shift operation uses `Array.Copy` (which is implemented internally using `memmove`). `memmove` is much faster than an equivalent C# `for` loop, especially if we do not want to use auxiliary memory (an in-place copy operation):

```
void SlowMoveInPlace(TKey[] keyArray, int source, int destination, int length)
{
    Debug.Assert(source != destination);
    Debug.Assert(source + length <= keyArray.Length);
    Debug.Assert(destination + length <= keyArray.Length);

    // assume destination and source overlap
    // non-cyclic implementation
    if (source > destination)
    {
        // copy forwards, so each source element is read before it is overwritten
        for (int dIndex = destination,
                 sIndex = source,
                 sIndexMax = source + length - 1;
             sIndex <= sIndexMax;
             sIndex++, dIndex++)
        {
            keyArray[dIndex] = keyArray[sIndex];
        }
    }
    else
    {
        // copy backwards for the same reason
        for (int dIndex = destination + length - 1,
                 sIndex = source + length - 1,
                 sIndexMin = source;
             sIndex >= sIndexMin;
             sIndex--, dIndex--)
        {
            keyArray[dIndex] = keyArray[sIndex];
        }
    }
}

void FastMoveInPlace(TKey[] keyArray, int source, int destination, int length)
{
    Array.Copy(keyArray, source, keyArray, destination, length);
}
```

Three additional issues add overhead to the `Copy` operation. First, we need to keep the `TKey` and `TValue` arrays synchronized, so each `Array.Copy` command is executed twice. (We could have used a single array of `KeyValuePair<TKey,TValue>` instead, but that would hurt the cache locality of the key comparisons during the search.)

```
void ArrayCopy(int source, int destination, int length)
{
    Array.Copy(this.nodeKeys, source,
               this.nodeKeys, destination,
               length);

    Array.Copy(this.nodeValues, source,
               this.nodeValues, destination,
               length);
}
```

Second, a cyclic array requires more `if` clauses, since the source and/or the destination may wrap around the end of the array. Here is a sample from `CyclicArrayCopy()`:

```
void CyclicArrayCopy(int source, int destination, int length)
{
    Debug.Assert(source != destination);
    Debug.Assert(length > 0);
    Debug.Assert(source >= 0);
    Debug.Assert(destination >= 0);
    Debug.Assert(source <= LastIndex);
    Debug.Assert(destination <= LastIndex);
    // The following asserts make sure we do not require
    // auxiliary memory allocation.
    bool shiftLeft  = CyclicSub(destination,source) >= CyclicSub(source,destination);
    bool shiftRight = CyclicSub(destination,source) <= CyclicSub(source,destination);
    Debug.Assert(shiftRight || (shiftLeft  && (CyclicSub(source,destination) +
                                               (length-1) <= LastIndex)));
    Debug.Assert(shiftLeft  || (shiftRight && (CyclicSub(destination,source) +
                                               (length-1) <= LastIndex)));

    if (...)
        // abcd00 --> 00abcd

    if (...)
        // cd000ab --> 00abcd0
        // cd000ab --> 000abcd
    else if (...)
        // cd000ab --> abcd000
        // cd000ab --> 0abcd00
    else if (...)
        // 00abcd0 --> cd000ab
        // 000abcd --> cd000ab
    else if (...)
        // abcd000 --> cd000ab
        // 0abcd00 --> cd000ab
    else if (...)
        // d000abc --> bcd000a
    else
    {
        // bcd000a --> d000abc
        #region proof
        Debug.Assert(destination < source);
        int length5 = LastIndex+1 - source;
        int shift_ = source - destination;
        Debug.Assert(shiftLeft);
        Debug.Assert(shift_ + (length - 1) <= LastIndex);
        Debug.Assert((source - destination) + (length - 1) <= LastIndex);
        Debug.Assert((source - destination) + (length - 1) < LastIndex+1);

        // by definition srcEndIndex = source+(length-1)-(LastIndex+1),
        // which gives us this equation:
        Debug.Assert(srcEndIndex - destination < 0);
        Debug.Assert(destination > srcEndIndex);
        // concludes the proof
        #endregion
        // The proof that destination > srcEndIndex is required;
        // otherwise the first ArrayCopy() call
        // would overwrite source data required
        // for the last ArrayCopy() call.
        ArrayCopy(source, destination, length5);
        ArrayCopy(0, LastIndex+1 - shift_, shift_);
        ArrayCopy(shift_, 0, srcEndIndex+1);
    }
}
```

Third, the bigger the array, the more time the `Array.Copy` command requires - especially if the source and the destination cause too many CPU cache misses (see the Points of Interest section at the bottom of this page).

## Algorithm Improvements

### Splitting the Array

Jacob Slusser gives a great overview of gap buffers; highly recommended reading!

The first shift operation shifts the array one index to the right or one index to the left. See the following pseudo code:

```
TKey[] nodeKeys = {'a','e','f', 0 , 0 , 0 , 0 , 0 };
TKey key = 'd';

// make room for 'd'
ShiftLeft(nodeKeys, index=0, length=1, shift=1);
Assert(nodeKeys == { 0 ,'e','f', 0 , 0 , 0 , 0 ,'a'});

// insert 'd'
nodeKeys[0] = key;
Assert(nodeKeys == {'d','e','f', 0 , 0 , 0 , 0 ,'a'});
```

If the next shift operation follows the previous one (in index and direction), the array is shifted as far as possible, opening a gap:

```
TKey[] nodeKeys = {'d','e','f', 0 , 0 , 0 , 0 ,'a'};
TKey key = 'c';

// make room for 'c'
ShiftLeft(nodeKeys, index=6, length=1, shift=4);
Assert(nodeKeys == {'d','e','f','a', 0 , 0 , 0 , 0 });

// insert 'c' into nodeKeys
nodeKeys[7] = key;
Assert(nodeKeys == {'d','e','f','a', 0 , 0 , 0 ,'c'});
```

This gap in the array allows fast insert operations as long as the input keeps decreasing monotonically:

```
TKey[] nodeKeys = {'d','e','f','a', 0 , 0 , 0 ,'c'};
TKey key = 'b';

// no shift required
// insert key into nodeKeys
nodeKeys[6] = key;
Assert(nodeKeys == {'d','e','f','a', 0 , 0 ,'b','c'});
```

When the input is no longer monotonic, the algorithm starts over, shifting by one index again.

```
TKey[] nodeKeys = {'d','e','f','a', 0 , 0 ,'b','c'};
TKey key = 'g';

// make room for 'g'
ShiftRight(nodeKeys, index=3, length=1, shift=1);

// insert 'g'
nodeKeys[3] = key;
Assert(nodeKeys == {'d','e','f','g','a', 0 ,'b','c'});
```

The cyclic algorithm keeps track of the following properties: `Split`, `MinIndex`, `MaxIndex`, `LocalMaxIndex`, and `LocalMinIndex`.

```
bool Split
{
    get
    {
        if (MinIndex != this.LocalMinIndex)
        {
            Debug.Assert(MaxIndex != this.LocalMaxIndex);
            return true;
        }
        return false;
    }
}

// In the example below the sorted array is split into two sections:
//   from LocalMinIndex to MaxIndex ('b','c','d','e','f','g')
//   from MinIndex to LocalMaxIndex ('a')
nodeKeys = {'d','e','f','g','a', 0 ,'b','c'}
Assert(Split)
Assert(MaxIndex == 3)
Assert(nodeKeys[MaxIndex] == 'g')
Assert(MinIndex == 4)
Assert(nodeKeys[MinIndex] == 'a')
Assert(LocalMinIndex == 6)
Assert(nodeKeys[LocalMinIndex] == 'b')
Assert(LocalMaxIndex == 4)
Assert(nodeKeys[LocalMaxIndex] == 'a')

// In the example below the sorted array has only one section:
//   from MinIndex to MaxIndex ('a','b','c','d','e','f','g')
nodeKeys = {'d','e','f','g', 0 ,'a','b','c'}
Assert(!Split)
Assert(MaxIndex == LocalMaxIndex == 3)
Assert(MinIndex == LocalMinIndex == 5)
Assert(nodeKeys[MaxIndex] == 'g')
Assert(nodeKeys[MinIndex] == 'a')
```

### Shifting in Both Directions

Since the underlying array is cyclic, it supports both shift-right and shift-left operations. The algorithm chooses the shift direction that requires fewer memory read/write operations.

```
bool ShiftRightIsFaster(int index)
{
    Debug.Assert(!Full);

    return (CyclicSub(LocalMaxIndex + 1, index) <
            CyclicSub(index + 1, LocalMinIndex));
}

bool Insert(TKey key, TValue value, bool overwriteValue)
{
    if (Empty) ...
    else if (Split && (newLocalMinKey || newLocalMaxKey)) ...
    else if (!Split && newMinKey) ...
    else if (!Split && newMaxKey) ...
    else if (Split && newMinKey) ...
    else if (Split && newMaxKey) ...
    else
    {
        bool foundKey;
        index = this.CyclicSearch(key, prefetch, out foundKey);

        if (foundKey)
        {
            if (!overwriteValue)
            {
                return false;
            }
            // replace value of existing key
            Insert(key, value, prefetch, index);
            return true;
        }
    }

    if (Full) return false;
    // shift the array to accommodate space for the new node
    if (ShiftRightIsFaster(index))
    {
        index = CyclicInc(index); // index++
        InsertShiftRight(key, prefetch, value, index);
    }
    else
    {
        InsertShiftLeft(key, prefetch, value, index);
    }
    return true;
}
```

## C# Code Optimizations

### Cyclic Indexing

There are three basic approaches to implementing a cyclic index `+` and `-` operator:

• The obvious approach uses the integer modulus operator (which requires an integer division):

```
int CyclicAdd(int index, int increase)
{
    Debug.Assert(increase >= 0);
    // LastIndex+1 == Capacity
    return (index + increase) % (LastIndex+1);
}
```

• The second uses an `if` to detect an index out of bounds:

```
int CyclicAdd(int index, int increase)
{
    Debug.Assert(increase >= 0);
    Debug.Assert(increase <= LastIndex+1);
    Debug.Assert(index >= 0);
    Debug.Assert(index <= LastIndex);
    index += increase;
    if (index > LastIndex)
    {
        index -= LastIndex+1;
    }
    return index;
}
```

• The fastest implementation assumes that `LastIndex+1` is a power of 2. So if the capacity of the array is 8 = 00001000b, then `LastIndex` == 7 == 00000111b:

```
int CyclicAdd(int index, int increase)
{
    Debug.Assert(increase >= 0);
    Debug.Assert(increase <= LastIndex+1);
    Debug.Assert(index >= 0);
    Debug.Assert(index <= LastIndex);
    return (index + increase) & LastIndex;
}
```
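A quick sanity check that the three approaches agree when the capacity is a power of two (a standalone sketch; the `CyclicAddMod`/`CyclicAddIf`/`CyclicAddMask` names are mine):

```csharp
using System;

public static class CyclicIndexDemo
{
    const int LastIndex = 7; // capacity 8, a power of two

    public static int CyclicAddMod(int index, int increase)
        => (index + increase) % (LastIndex + 1);        // integer division

    public static int CyclicAddIf(int index, int increase)
    {
        index += increase;                              // branch instead of division
        if (index > LastIndex) index -= LastIndex + 1;
        return index;
    }

    public static int CyclicAddMask(int index, int increase)
        => (index + increase) & LastIndex;              // single AND instruction

    public static void Main()
    {
        for (int index = 0; index <= LastIndex; index++)
            for (int increase = 0; increase <= LastIndex + 1; increase++)
            {
                int expected = CyclicAddMod(index, increase);
                if (CyclicAddIf(index, increase) != expected ||
                    CyclicAddMask(index, increase) != expected)
                    throw new Exception("implementations disagree");
            }
        Console.WriteLine("all three implementations agree");
    }
}
```

Note that the mask version only works because the capacity is a power of two; the article's dictionary rounds its capacity up accordingly.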

Since the JIT can easily inline each `CyclicAdd()` call, we can reuse it to implement the subtract operation:

```
int CyclicSub(int index, int dec)
{
    Debug.Assert(dec >= 0);
    Debug.Assert(dec <= LastIndex+1);
    // subtracting dec is the same as cyclically adding (Capacity - dec)
    return CyclicAdd(index, LastIndex+1 - dec);
}
```

### TKey Comparisons

One of the most time-consuming operations in the algorithm is the search loop. Most (if not all) .NET collections require either that `TKey` implement the `IComparable` interface, or that the user provide an `IComparer`. The approach in this example is to optimize for speed at the cost of "best practices". Mind you, it is still possible to use `IComparer` and `IComparable`, but that would incur the penalty of a second virtual function call.

So - what's so bad with the interface approach?

```
class SplitArrayDictionary<TKey,TValue,TComparer>
    where TComparer : IComparer<TKey>, new()
{
    readonly static IComparer<TKey> Comparer = new TComparer();

    int Search(TKey key, out bool keyFound)
    {
        for (...)
        {
            if (Comparer.Compare(key, nodeKeys[index]) == 0)
            {
                keyFound = true;
                return index;
            }
            ...
        }
    }
}
```

The C# JIT compiles the `Search` method once, and distinguishes between different `TKey` types through a vtable (similar to virtual method calls): a table entry points to `Compare(int,int)` when `TKey == int`, or to `Compare(long,long)` when `TKey == long`. The JIT does not inline the `Compare` method inside the loop, since that would require separate `Search` compilations for `int` and `long`. This JIT decision is by design, and until it is relaxed - well - we need to look for something faster.

Abstract methods are not considered "best practice" for implementing a C# "contract", and the JIT does not inline abstract or virtual function calls either. However, an abstract method call is faster, since the runtime does not need to insert an `if (Comparer != null) {}` style null check before the call ("`this`" is never `null`).

```
abstract class SplitArrayDictionary<TKey,TValue>
{
    abstract int Compare(TKey left, TKey right);

    int Search(TKey key, out bool keyFound)
    {
        for (...)
        {
            if (this.Compare(key, nodeKeys[index]) == 0)
            {
                keyFound = true;
                return index;
            }
            ...
        }
    }
}

class SplitArrayIntDictionary<TValue> : SplitArrayDictionary<int,TValue>
{
    sealed override int Compare(int left, int right)
    {
        return left - right; // might overflow !!
    }
}
```

Taking overflow conditions into account (both operands must be on the same side of zero before subtracting; note the `>=` comparisons, so that zero is grouped with the positive operands):

```
class SplitArrayIntDictionary<TValue> : SplitArrayDictionary<int,TValue>
{
    sealed override int Compare(int left, int right)
    {
        if (left >= 0)
        {
            if (right < 0) return 1;
            return left - right; // both operands non-negative: no overflow
        }
        if (right >= 0) return -1;
        return left - right;     // both operands negative: no overflow
    }
}
```

A safe and faster alternative:

```
class SplitArrayIntDictionary<TValue> : SplitArrayDictionary<int,TValue>
{
    sealed override int Compare(int left, int right)
    {
        if (left > right) return 1;
        if (left < right) return -1;
        return 0;
    }
}
```
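The overflow hazard of the subtraction-based `Compare` is easy to demonstrate in isolation (a standalone sketch; the class and method names are mine):

```csharp
using System;

public static class CompareDemo
{
    // Subtraction-based comparison: fast, but wraps around for distant operands.
    public static int CompareBySubtraction(int left, int right) => left - right;

    // Branch-based comparison: correct for every pair of ints.
    public static int CompareByBranches(int left, int right)
    {
        if (left > right) return 1;
        if (left < right) return -1;
        return 0;
    }

    public static void Main()
    {
        // int.MinValue < 1, so a correct Compare must return a negative number...
        Console.WriteLine(CompareByBranches(int.MinValue, 1));    // -1
        // ...but int.MinValue - 1 wraps around to int.MaxValue: the wrong sign.
        Console.WriteLine(CompareBySubtraction(int.MinValue, 1)); // 2147483647
    }
}
```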

The second caveat is that `Compare` does excess computation: it distinguishes between three different conditions - "<", "==", and ">". In some cases we only need to distinguish between two conditions, such as "<=" versus ">".

```
bool KeyMightBeInArray(TKey key)
{
    return !Empty &&
           LessOrEqual(key, nodeKeys[MaxIndex]) &&
           GreaterOrEqual(key, nodeKeys[MinIndex]) &&
           (   !Split ||
               LessOrEqual(key, nodeKeys[LocalMaxIndex]) ||
               GreaterOrEqual(key, nodeKeys[LocalMinIndex]));
}
```

This requires `TKey`-specific implementations of these methods:

```
class SplitArrayIntDictionary<TValue> : SplitArrayDictionary<int,TValue>
{
    sealed override bool LessOrEqual(int left, int right)
    {
        return (left <= right);
    }

    sealed override bool GreaterOrEqual(int left, int right)
    {
        return (left >= right);
    }
}
```

To speed up the comparisons even further, we pre-fetch the 32 most significant bits of each key and store them in a separate `int[] nodeKeyMSB` array. So we define yet another abstract method, `PrefetchKeyMSB(TKey)`, which returns the 32 most significant bits of the key. For example, the MSB of a `long` is its upper 32 bits, and the MSB of a person's name could be the first two letters of the last name.

```
class LongDictionary<TValue> : BaseDictionary<long,TValue>
{
    sealed override int PrefetchKeyMSB(long key)
    {
        return (int)(key >> 32); // most significant 32 bits
    }
}
```
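For a string key such as a last name, the same idea can be sketched by packing the first two characters into an `int` (a hypothetical example, not from the article's code; it assumes ordinal, case-sensitive ordering):

```csharp
using System;

public static class NamePrefetch
{
    // Packs the first two characters (16 bits each) of the last name into an int.
    // Keys that differ in those characters can be ordered by the int alone;
    // equal prefetches fall back to the full (slower) string comparison.
    public static int PrefetchKeyMSB(string lastName)
    {
        char first  = lastName.Length > 0 ? lastName[0] : '\0';
        char second = lastName.Length > 1 ? lastName[1] : '\0';
        return (first << 16) | second;
    }

    public static void Main()
    {
        Console.WriteLine(PrefetchKeyMSB("Adams") < PrefetchKeyMSB("Baker"));   // True
        Console.WriteLine(PrefetchKeyMSB("Smith") == PrefetchKeyMSB("Smythe")); // True
    }
}
```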

And the `Insert()` method also stores the MSB separately:

```
readonly static bool TKeyIsInteger = (typeof(TKey) == typeof(int));

void Insert(TKey key, TValue value, int prefetch, int index)
{
    nodeKeys[index] = key;
    nodeValues[index] = value;
    if (!TKeyIsInteger)
    {
        // an int key is its own 32 most significant bits,
        // so the prefetch array is only needed for other key types
        nodeKeyMSB[index] = prefetch;
    }
}
```

Integer comparisons are inlined by the JIT runtime compiler, and when they are executed inside a loop, the performance boost becomes significant.

```
int SearchInRange(TKey key,
                  int keyPrefetch,
                  out bool keyFound,
                  int searchMinIndex,
                  int searchMaxIndex)
{
    if (Equal(key, nodeKeys[searchMinIndex]))
    {
        keyFound = true;
        return searchMinIndex;
    }

    if (Equal(key, nodeKeys[searchMaxIndex]))
    {
        keyFound = true;
        return searchMaxIndex;
    }

    int delta = (searchMaxIndex - searchMinIndex) / 2;

    for (; delta != 0; delta = (searchMaxIndex - searchMinIndex) / 2)
    {
        Debug.Assert(Greater(key, nodeKeys[searchMinIndex]));
        Debug.Assert(Less   (key, nodeKeys[searchMaxIndex]));

        int index = searchMinIndex + delta;
        int prefetch = nodeKeyMSB[index];
        Debug.Assert(index >= 0 && index <= LastIndex);
        if (keyPrefetch < prefetch)
        {
            searchMaxIndex = index;
            continue;
        }

        if (keyPrefetch > prefetch)
        {
            searchMinIndex = index;
            continue;
        }

        Debug.Assert(keyPrefetch == prefetch);
        int compare = Compare(key, GetKey(index));
        if (compare == 0)
        {
            keyFound = true;
            return index;
        }

        if (compare < 0)
        {
            searchMaxIndex = index;
        }
        else
        {
            Debug.Assert(compare > 0);
            searchMinIndex = index;
        }
    } //for

    Debug.Assert(Greater(key, nodeKeys[searchMinIndex]));
    Debug.Assert(Less   (key, nodeKeys[searchMaxIndex]));
    Debug.Assert(delta == 0);
    keyFound = false;
    return searchMinIndex;
}
```
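The effect of the prefetch can be isolated in a standalone, non-cyclic binary search over `long` keys - a simplified sketch of the idea, not the article's `SearchInRange` itself (all names here are mine):

```csharp
using System;

public static class PrefetchSearchDemo
{
    // Binary search over a sorted long[] that consults a pre-fetched int[]
    // of the upper 32 bits first, falling back to the full 64-bit
    // comparison only when the upper halves tie.
    public static int Search(long[] keys, int[] keyMSB, long key)
    {
        int keyPrefetch = (int)(key >> 32);
        int low = 0, high = keys.Length - 1;
        while (low <= high)
        {
            int middle = low + (high - low) / 2;
            int prefetch = keyMSB[middle];
            int compare = keyPrefetch != prefetch
                ? (keyPrefetch < prefetch ? -1 : 1)   // cheap int comparison
                : key.CompareTo(keys[middle]);        // full comparison on ties
            if (compare == 0) return middle;
            if (compare < 0) high = middle - 1; else low = middle + 1;
        }
        return -1; // not found
    }

    public static void Main()
    {
        long[] keys = { 1L, 2L, (5L << 32) + 7, (9L << 32) + 3 };
        int[] keyMSB = new int[keys.Length];
        for (int i = 0; i < keys.Length; i++) keyMSB[i] = (int)(keys[i] >> 32);

        Console.WriteLine(Search(keys, keyMSB, (5L << 32) + 7)); // 2
        Console.WriteLine(Search(keys, keyMSB, 42L));            // -1
    }
}
```

The ordering stays correct because `(int)(a >> 32) < (int)(b >> 32)` implies `a < b` for any two longs; only ties need the full comparison.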

## Using the Code

The code is released under the MIT open source license. The `IDictionary` methods implemented are `Insert`, `TryGetValue`, the `indexer[]`, and `Clear`. You are invited to implement the rest of the methods. Do not forget to derive your own class from `SplitArrayDictionary<TKey,TValue>` and implement the following methods:

```
abstract public int PrefetchKeyMSB(TKey key);
abstract public int Compare(TKey left, TKey right);
abstract public bool LessOrEqual(TKey left, TKey right);
abstract public bool Less(TKey left, TKey right);
abstract public bool GreaterOrEqual(TKey left, TKey right);
abstract public bool Greater(TKey left, TKey right);
abstract public bool Equal(TKey left, TKey right);
```
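What such a derived class has to supply can be sketched in isolation. The base class below is a stand-in exposing only the abstract contract listed above, not the article's full implementation; the `LongKeyContract` names are mine:

```csharp
using System;

// Stand-in base with only the abstract members the article asks you to implement.
public abstract class SplitArrayDictionaryContract<TKey>
{
    public abstract int PrefetchKeyMSB(TKey key);
    public abstract int Compare(TKey left, TKey right);
    public abstract bool LessOrEqual(TKey left, TKey right);
    public abstract bool Less(TKey left, TKey right);
    public abstract bool GreaterOrEqual(TKey left, TKey right);
    public abstract bool Greater(TKey left, TKey right);
    public abstract bool Equal(TKey left, TKey right);
}

// One possible long-keyed implementation.
public sealed class LongKeyContract : SplitArrayDictionaryContract<long>
{
    public override int PrefetchKeyMSB(long key) => (int)(key >> 32);
    public override int Compare(long left, long right) =>
        left > right ? 1 : (left < right ? -1 : 0);   // branch-based: no overflow
    public override bool LessOrEqual(long left, long right) => left <= right;
    public override bool Less(long left, long right) => left < right;
    public override bool GreaterOrEqual(long left, long right) => left >= right;
    public override bool Greater(long left, long right) => left > right;
    public override bool Equal(long left, long right) => left == right;
}

public static class ContractDemo
{
    public static void Main()
    {
        var keys = new LongKeyContract();
        Console.WriteLine(keys.Compare(2L, 3L));          // -1
        Console.WriteLine(keys.PrefetchKeyMSB(1L << 33)); // 2
    }
}
```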

The project was developed with the beta 2 version of Orcas (Visual Studio 2008 Beta 2). Since I haven't used any special .NET 3.5 features, the code should be easy to port to .NET 2.0.

## Points of Interest

1. The results in the Excel graphs were produced on the following laptop:

```
Intel Core 2 T6700 @ 2.33GHz
Cache size = 2MB
2x2GB=4GB of 512MHz DDR2 RAM
Windows XP Pro (32bit)
```

The results were compared with the following workstation:

```
Two Intel Xeon 5160 2.6 GHz
4x2GB=8GB fully buffered DIMMs running at 667MHz
Windows XP x64 Edition
Cache size = 4MB
```

All benchmarks performed better on the laptop, except `SortedList`, which performed better on the workstation in the following cases:

• 100% random data when the number of elements > 500,000 (the maximum that was tested).
• 10% random data when the number of elements > 100,000.

What do you think is the reason for that?

2. The memory consumption of this example was slightly lower than `SortedList` and about half of the memory that `SortedDictionary` consumed. Memory and time benchmarking were done following Tony's (Tobias Hertkorn) advice.

## FAQ

Q: Wouldn't it be more cost-beneficial to buy a better computer (rather than tweaking the implementation)?

A: Power-hungry clients will spend money on new hardware whether you tweak the code or not. Any improvement on top of that is always welcome.

Q: Why re-invent the wheel? Using standard components (`SortedList`) is safer.

A: The .NET Framework is, well, a framework. I believe that Microsoft should leave some room for contributors (CodeProject, anyone?). You are all invited to peer review this work and contribute, so that the code becomes more complete and robust.

Q: Is unmanaged code faster?

A: Yes... for now. As Microsoft/Sun/IBM continue to tweak their compilers, higher level languages will close the gap. Part of this work is to highlight places that need attention. Banging my head on the wall did not leave any long-lasting impression; hopefully this work will do better.

Q: What about the (poor) asymptotic performance?

A: It should be possible to use this implementation as a node of a balanced binary tree. The hybrid data structure would keep the "CPU cache locality" boost together with O(log(n)) asymptotic performance. Unfortunately, I don't have more time to work on it. If you are into it, start by reading about AVL trees and Scapegoat trees. Since each array is quite big, one can use "helper" objects without affecting the memory consumption much. So, if for example, we track for each array its total weight (including children) and add a back pointer to the parent node, together with some garbage collection awareness ...

Q: Who cares about a slightly faster Dictionary?

A: Dictionaries are all about fast search performance, so even a small gain may well prove useful for someone out there. Some of the tweaks described in this work can prove useful for other data structures as well.

## History

• 18-Oct-2007: Initial version.
• 10-Oct-2008: Added `Remove()` and the `IEnumerable<>` implementation.
