# B-Tree Sorted Dictionary

, 27 Aug 2012 CPOL
 Rate this:
In-memory B-Tree sorted dictionary

## Introduction

B-Tree is a balanced m-way tree. This discussion from Wiki is a good material to introduce one to the characteristics and node layout of a B-Tree data structure: http://en.wikipedia.org/wiki/B-tree.

This article discusses an in-memory B-Tree implementation. Although B-Tree is typically used in file/database system indexing (see Open Source B-Tree On Disk @: http://sop.codeplex.com), there is a significant value in implementing a B-Tree based collection or dictionary in the .NET framework or in any language/platform for that matter.

Typical in-memory sorted dictionary data structures today are based on the Binary Tree algorithm, which is not to be confused with B-Tree. Each node of a Binary Tree can contain a single item, whereas a B-Tree can contain a user defined number of items per node, and its nodes are kept balanced. This is a very important differentiation. Being that each node has a single item, storing a large number of items in a Binary Tree will generate a tall and narrow node graph with numerous nodes.

In a corresponding B-Tree implementation, the graph will tend to be shorter and wider with a lot fewer nodes. This is because a node in a B-Tree is typically configured to have numerous items; e.g., a B-Tree dictionary with 12 items will require a single node to contain the items, and a Binary Tree will require 12 nodes which can be scattered around in the heap (memory). Increasing our item sampling to thousands, hundreds of thousands, or millions, we're talking about a significant differentiation and significant optimization that a corresponding B-Tree based sorted dictionary can provide. A million single item node of a Binary Tree vs. around eighty three thousand of a B-Tree, for a 12 item node setup, and even far less if the user specifies more items per node than mentioned.

With this characterization, it is easy to imagine that a Binary Tree will tend to use more memory, and will tend to generate more memory fragmentation than a respective dictionary utilizing the B-Tree algorithm. And in production environments, we can't afford to have a system that is prone to memory fragmentation as in time, the application will degrade in performance and can cause out of memory conditions due to lack of contiguous allocable space.

## Implementation

In this article and associated code implementation, I've selected C# and the .NET framework as the language and platform of implementation. Since B-Tree is a well established algorithm, I will focus my discussion on the high level design/implementation issues including the interface contract. The complete source code is available for download.

### IBTreeDictionary: Extending the Generics IDictionary

Generics is a great feature for enforcing coding and compile time data type checks. Thus, it is a great feature to support it in this implementation of B-Tree based Sorted Dictionary (note: non-Generics implementation is also supported). The Sorted Dictionary implements the Generics `IDictionary` interface in order to provide the capability mentioned.

In order to expose B-Tree specific features such as duplicate Key is allowed and thus requiring an explicit Item Iterator to give fine grained control in item traversal, I am extending the `IDictionary` interface to add methods for the mentioned feature, giving birth to the `IBTreeDictionary` interface. Here is how the said interface looks like:

```public interface IBaseCollection<T> : ICloneable
{
bool MoveNext();
bool MovePrevious();
bool MoveFirst();
bool MoveLast();
bool Search(T Value);
}

public interface IBTreeDictionary<TKey, TValue> :
System.Collections.Generic.IDictionary<TKey,
TValue>, IBaseCollection<TKey>
{
Sop.Collections.BTree.SortOrderType SortOrder{ get;set; }
BTreeItem<TKey, TValue> CurrentEntry{ get; }
TKey CurrentKey{ get; }
TValue CurrentValue{ get; set; }
void Remove();
}```

In order to consume items even if they have duplicate keys, the dictionary requires a feature to be able to set an item pointer to a desired position, then traverse a range while consuming the values of the items in the range. This is why the `Movexxx` methods and the `CurrentKey`/`CurrentValue` item pointers are required.

Also, B-Tree provides a feature to allow a user to traverse items either in ascending or descending sort order, thus the `SortOrder` attribute was provided.

## Management and Search Functions

Following are selected B-Tree management and search functions to illustrate some behavior which characterizes and gives the user the idea of this flavor implementation of the very useful B-Tree algorithm:

Adding an item to a B-Tree requires determination of the appropriate node where to add the item. This is required because items of the tree are maintained sorted as they get inserted or removed. Traversing the tree in order to find the node can be done recursively or via a looping construct such as a `while` loop. The latter is preferred than recursion as in this situation, it removes the requirement of stack usage, i.e., when done, there is no implicit stack (pop) operations as the innermost function unwinds/returns to the main caller function.

This block illustrates a `while` loop that is used to traverse the tree to find the target node. It uses `BinarySearch` for each node of the tree to further optimize the search per node. This is possible since each item in the node is sorted and thus, `BinarySearch` allows an optimal searching and minimal comparison of keys.

```TreeNode CurrentNode = this;
int Index = 0;
while (true)
{
Index = 0;
byte NoOfOccupiedSlots = CurrentNode.count;
if (NoOfOccupiedSlots > 1)
{
if (ParentBTree.SlotsComparer != null)
Index = Array.BinarySearch(CurrentNode.Slots, 0,
NoOfOccupiedSlots, Item, ParentBTree.SlotsComparer);
else
Index = Array.BinarySearch(CurrentNode.Slots, 0,
NoOfOccupiedSlots, Item);
if (Index < 0)
Index = ~Index;
}
else if (NoOfOccupiedSlots == 1)
{
if (ParentBTree.Comparer != null)
{
if (ParentBTree.Comparer.Compare(CurrentNode.Slots[0].Key, Item.Key) < 0)
Index = 1;
}
else
{
try
{
if (System.Collections.Generic.Comparer<TKey>.Default.Compare(
CurrentNode.Slots[0].Key, Item.Key) < 0)
Index = 1;
}
catch (Exception)
{
if (CurrentNode.Slots[0].Key.ToString().CompareTo(Item.Key.ToString()) < 0)
Index = 1;
}
}
}
if (CurrentNode.Children != null)
// if not an outermost node let next lower level node do the 'Add'.
CurrentNode = CurrentNode.Children[Index];
else
break;
}```

At this point, the correct node to add the item to is found. Now, invoke the actual `Add` function:

```CurrentNode.Add(ParentBTree, Item, Index);
CurrentNode = null;```

Here is the `Add` function that does the actual item insert to the found node. It maintains the items of the node to be sorted, and manages the breakup of the node when it is full, and other B-Tree maintenance actions required such as a broke up node's parent promotion, etc...:

```void Add(BTreeAlgorithm<TKey, TValue> ParentBTree,
BTreeItem<TKey, TValue> Item, int Index)
{
byte NoOfOccupiedSlots = count;
// Add. check if node is not yet full..
if (NoOfOccupiedSlots < ParentBTree.SlotLength)
{
ShiftSlots(Slots, (byte)Index, (byte)NoOfOccupiedSlots);
Slots[Index] = Item;
count++;
return;
}
else
{
Slots.CopyTo(ParentBTree.TempSlots, 0);
byte SlotsHalf = (byte)(ParentBTree.SlotLength >> 1);
if (Parent != null)
{
bool bIsUnBalanced = false;
int iIsThereVacantSlot = 0;
if (IsThereVacantSlotInLeft(ParentBTree, ref bIsUnBalanced))
iIsThereVacantSlot = 1;
else if (IsThereVacantSlotInRight(ParentBTree, ref bIsUnBalanced))
iIsThereVacantSlot = 2;
if (iIsThereVacantSlot > 0)
{
// copy temp buffer contents to the actual slots.
byte b = (byte)(iIsThereVacantSlot == 1 ? 0 : 1);
CopyArrayElements(ParentBTree.TempSlots, b,
Slots, 0, ParentBTree.SlotLength);
if (iIsThereVacantSlot == 1)
// Vacant in left, "skud over"
// the leftmost node's item to parent and left.
DistributeToLeft(ParentBTree,
ParentBTree.TempSlots[ParentBTree.SlotLength]);
else if (iIsThereVacantSlot == 2)
// Vacant in right, move the rightmost
// node item into the vacant slot in right.
DistributeToRight(ParentBTree, ParentBTree.TempSlots[0]);
return;
}
else if (bIsUnBalanced)
{ // if this branch is unbalanced..
// _BreakNode
// Description :
// -copy the left half of the slots
// -copy the right half of the slots
// -zero out the current slot.
// -copy the middle slot
// -allocate memory for children node *s
// -assign the new children nodes.
TreeNode LeftNode;
TreeNode RightNode;
try
{
RightNode = ParentBTree.GetRecycleNode(this);
LeftNode = ParentBTree.GetRecycleNode(this);
CopyArrayElements(ParentBTree.TempSlots, 0,
LeftNode.Slots, 0, SlotsHalf);
LeftNode.count = SlotsHalf;
CopyArrayElements(ParentBTree.TempSlots,
(ushort)(SlotsHalf + 1), RightNode.Slots, 0, SlotsHalf);
RightNode.count = SlotsHalf;
ResetArray(Slots, null);
count = 1;
Slots[0] = ParentBTree.TempSlots[SlotsHalf];
Children = new TreeNode[ParentBTree.SlotLength + 1];
ResetArray(Children, null);
Children[(int)Sop.Collections.BTree.ChildNodes.LeftChild] = LeftNode;
Children[(int)Sop.Collections.BTree.ChildNodes.RightChild] = RightNode;
//SUCCESSFUL!
return;
}
catch (Exception)
{
Children = null;
LeftNode = null;
RightNode = null;
throw;
}
}
else
{
// All slots are occupied in this and other siblings' nodes..
TreeNode RightNode;
try
{
// prepare this and the right node sibling
// and promote the temporary parent node(pTempSlot).
RightNode = ParentBTree.GetRecycleNode(Parent);
// zero out the current slot.
ResetArray(Slots, null);
// copy the left half of the slots to left sibling
CopyArrayElements(ParentBTree.TempSlots, 0, Slots, 0, SlotsHalf);
count = SlotsHalf;
// copy the right half of the slots to right sibling
CopyArrayElements(ParentBTree.TempSlots,
(ushort)(SlotsHalf + 1), RightNode.Slots, 0, SlotsHalf);
RightNode.count = SlotsHalf;
// copy the middle slot to temp parent slot.
ParentBTree.TempParent = ParentBTree.TempSlots[SlotsHalf];
// assign the new children nodes.
ParentBTree.TempParentChildren[
(int)Sop.Collections.BTree.ChildNodes.LeftChild] = this;
ParentBTree.TempParentChildren[
(int)Sop.Collections.BTree.ChildNodes.RightChild] = RightNode;
ParentBTree.PromoteParent = (TreeNode)Parent;
ParentBTree.PromoteIndexOfNode = GetIndexOfNode(ParentBTree);
//TreeNode o = (TreeNode)Parent;
//o.Promote(ParentBTree, GetIndexOfNode(ParentBTree));
//SUCCESSFUL!
return;
}
catch (Exception)
{
RightNode = null;
throw;
}
}
}
else
{
// _BreakNode
// Description :
// -copy the left half of the temp slots
// -copy the right half of the temp slots
// -zero out the current slot.
// -copy the middle of temp slot to 1st elem of current slot
// -allocate memory for children node *s
// -assign the new children nodes.
TreeNode LeftNode;
TreeNode RightNode;
try
{
RightNode = ParentBTree.GetRecycleNode(this);
LeftNode = ParentBTree.GetRecycleNode(this);
CopyArrayElements(ParentBTree.TempSlots, 0,
LeftNode.Slots, 0, SlotsHalf);
LeftNode.count = SlotsHalf;
CopyArrayElements(ParentBTree.TempSlots,
(ushort)(SlotsHalf + 1), RightNode.Slots, 0, SlotsHalf);
RightNode.count = SlotsHalf;
ResetArray(Slots, null);
Slots[0] = ParentBTree.TempSlots[SlotsHalf];
count = 1;
Children = new TreeNode[ParentBTree.SlotLength + 1];
Children[(int)Sop.Collections.BTree.ChildNodes.LeftChild] = LeftNode;
Children[(int)Sop.Collections.BTree.ChildNodes.RightChild] = RightNode;
return; // successful
}
catch (Exception)
{
LeftNode = null;
RightNode = null;
RightNode = null;
throw;
}
}
}
}```

### Remove Method

There are two versions of `Remove` in a B-Tree; one that will remove an item given a Key, and the other that will remove the currently selected item of the tree. The former will cause the B-Tree to search for the item, given a key, and position the item pointer to the item with the matching key in the tree. The same node determination listed in the Add block is used to find the node and item to delete. After the item pointer is positioned at the right one, the item is actually deleted from the tree. Here is the code that illustrates item deletion.

When the item for deletion is not in the outermost node, the tree moves the next item from the outermost node to overwrite the currently selected item's slot, causing removal of the said item from the tree.

```internal protected bool Remove(BTreeAlgorithm<TKey, TValue> ParentBTree)
{
if (Children != null)
{
byte byIndex = ParentBTree.CurrentItem.NodeItemIndex;
MoveNext(ParentBTree);
Slots[byIndex] =
ParentBTree.CurrentItem.Node.Slots[ParentBTree.CurrentItem.NodeItemIndex];
}
return true;
}```

Then it takes care of managing the outermost node either to set to `null` the slot vacated by the moved item, or to maintain the balance of the tree by pulling an item from either the left or right sibling node.

```internal void FixTheVacatedSlot(BTreeAlgorithm<TKey, TValue> ParentBTree)
{
sbyte c;
c = (sbyte)count;
if (c > 1) // if there are more than 1 items in slot then..
{ //***** We don't fix the children since there are no children at this scenario.
if (ParentBTree.CurrentItem.NodeItemIndex < c - 1)
MoveArrayElements(Slots,
(ushort)(ParentBTree.CurrentItem.NodeItemIndex + 1),
ParentBTree.CurrentItem.NodeItemIndex,
(ushort)(c - 1 - ParentBTree.CurrentItem.NodeItemIndex));
count--;
Slots[count] = null; // nullify the last slot.
}
else
{ // only 1 item in slot
if (Parent != null)
{
byte ucIndex;
// if there is a pullable item from sibling nodes.
if (SearchForPullableItem(ParentBTree, out ucIndex))
{
if (ucIndex < GetIndexOfNode(ParentBTree))
PullFromLeft(ParentBTree); // pull an item from left
else
PullFromRight(ParentBTree); // pull an item from right
}
else
{ // Parent has only 2 children nodes..
if (Parent.Children[0] == this)
{ // this is left node
TreeNode RightSibling = GetRightSibling(ParentBTree);
Parent.Slots[1] = RightSibling.Slots[0];
Parent.count = 2;
RightSibling = null;
}
else
{ // this is right node
Parent.Slots[1] = Parent.Slots[0];
TreeNode LeftSibling = GetLeftSibling(ParentBTree);
Parent.Slots[0] = LeftSibling.Slots[0];
Parent.count = 2;
LeftSibling = null;
}
// nullify Parent's children will cause this
// tree node instance to be garbage collected
// as this is child of parent!
Parent.Children[0] = null;
Parent.Children[1] = null;
Parent.Children = null;
Clear();
}
}
else
{ // only 1 item in root node !
Slots[0] = null; // just nullIFY the slot.
count = 0;
// Point the current item pointer to end of tree
}
}
}```

### Search Method

The `Search` method contains code that has the same logic as the first portion of the `Add` method listed above. It traverses the tree from the root down the children nodes, until it finds the target node and item using an iterative approach (`while` loop). But instead of adding an item, when it is found, `Search` merely positions the current item pointer to the found node and item in the node.

Having an item pointer combined with the `Search` and `Movexxx` methods is a powerful feature. E.g., doing a range query is very easy in this combination of methods. Position the item pointer to the (start) item you'd like to query, then use either the `MoveNext` or `MovePrevious` method to sequentially traverse the nearby items, until you'd read all the items in range that you are interested about.

## Using the Code

Download the source code zip file, uncompress it to your desired target folder, and open the B-Tree.sln solution inside VS 2008. Browse through the project `SampleUsage `source code in order to see how to use the `BTreeSortedDictionary`, run it, experiment to use other methods available, and reuse your experience using the Generic `SortedDictionary`. Enjoy!

## Comparison Results

```1 Million Inserts on .Net SortedDictionary: 12 sec 308 MB peak 286 MB end
1 Million Inserts on BTreeDictionary:       9 sec 300 MB peak 277 MB end
1 Million Search on .Net SortedDictionary:  7 sec 309 MB peak 286 MB end
1 Million Search on BTreeDictionary:        7 sec 309 MB peak 277 MB end```

Conclusion: This B-Tree sorted dictionary outperforms the .NET Framework's `SortedDictionary` in both memory utilization and speed of operations. Inserting 1 million items, this BTree outperforms .NET's by 3 seconds, and utilized ~9MB of less memory.

In searching, both are of the same speed, but then again, in memory utilization, BTree dictionary outperforms .NET's `SortedDictionary` by using 9MB of less memory. Multiplying the load 10X and using both in production environment, that is the true test, and I anticipate you will be very satisfied with this BTree dictionary due to architectural reasons I discussed at the top of this article.

Here is the stress program used:

```static void Main(string[] args)
{
SortedDictionary<string, string> l1 =
new SortedDictionary<string, string>();
StressSearch(l1, 1000000);
//** uncomment to run the BTreeDictionary test
//   and comment out the above block to turn off SortedDictionary test
//BTreeDictionary<string, string> l2 =
//            new BTreeDictionary<string, string>();
//StressSearch(l2, 1000000);
return;
}

static void StressAdd(IDictionary<string, string> l, int Max)
{
for (int i = 0; i < Max; i++)
{
l.Add(string.Format("{0} key kds vidsbnvibsnv siuv " +
"siuvsiubvisubviusvifsudjcnjdnc", i),
"value kds vidsbnvibsnv siuv " +
"siuvsiubvisubviusvifsudjcnjdnccscs");
}
}

static void StressSearch(IDictionary<string, string> l, int Max)
{
Console.WriteLine("Search: {0}", DateTime.Now);
for (int i = 0; i < Max; i++)
{
string v = l[string.Format("{0} key kds vidsbnvibsnv siuv " +
"siuvsiubvisubviusvifsudjcnjdnc", i)];
}
Console.WriteLine("Search: {0}", DateTime.Now);
}```

## Points of Interest

I really learnt a lot and enjoyed authoring this B-Tree implementation, which I did the C++ version back in the mid-90s; in fact, I enjoyed it so much, it propelled me to work on the on-disk version that is available now as Open Sourced software in CodePlex (http://sop.codeplex.com). Pls. feel free to join, participate and use the new "on disk" version.

## History

• 7/23/2010 - Initial submission
• 7/24/2010 - Added performance comparisons with .NET Framework's `SortedDictionary`
• 8/14/2012 - Modified "points of interest" section to mention just Open Sourced B-Tree Gold (v4.5) product

## Share

Chief Technology Officer 4ATech
United States
Oversees technological direction & Software dev't @ 4ATech
Microsoft Alumnus (Software Design Engineer)
Incubating Object Persistence Framework since 1995

 First Prev Next
 Duplicates Looping and Deleting runfastman 27-Feb-14 11:47
 Re: Duplicates Looping and Deleting Gerardo Recinto 28-Feb-14 9:42
 Write to file Shargon_85 16-Jul-13 3:58
 Re: Write to file Gerardo Recinto 25-Feb-14 22:21
 Introduction paragraph correction yloginov 20-May-13 11:11
 Re: Introduction paragraph correction Gerardo Recinto 20-May-13 19:50
 Re: Introduction paragraph correction yloginov 21-May-13 3:35
 Re: Introduction paragraph correction Gerardo Recinto 28-Feb-14 12:28
 Good learning material enjaychang 17-Jan-13 3:05
 Re: Good learning material Gerardo Recinto 18-Feb-13 13:23
 Re: Good learning material enjaychang 6-Mar-13 4:53
 Memory usage Lcng 9-Oct-12 5:52
 Re: Memory usage [modified] Gerardo Recinto 9-Oct-12 21:12
 B tree inserion of records in C language Member 9455937 4-Oct-12 4:13
 My vote of 5 Christian Amado 28-Aug-12 13:53
 Re: My vote of 5 Gerardo Recinto 2-Sep-12 19:23
 Nice work! Filipe Cantarelli 16-Aug-12 9:08
 Re: Nice work! Gerardo Recinto 16-Aug-12 10:53
 Hi Filipe, Thx for the feedback! Btw, pls. look into the souce set I uploaded in http://sop.codeplex.com. The in-memory and on disk B-Trees are both included and latest sources. This one attached in codeproject article is pretty much legacy (old) now. (sample improvements are: all remaining recursive calls were translated to looping constructs AND node recycling was introduced, etc... You will totally like the On Disk version, it is awesomer. My replies below... Enjoy!! ********************* I wonder how you came up with the value 12 for the node size. Have you tried different values and picked up the best? I know that for disks this value should be big enough so when we read from it we get the max amount of data we can, isn't that right? >>> You are right. On Disk version (in sop.codeplex) has default set to greater or equal to 200 if I remember this right. Btw, 12 default value for in-memory is OK but the ideal is something higher, I agree. That is why it is only default value and user can set it to preferred one. But in memory this doesn't apply at all. If we choose a value too low for the node size, then the structure basically becomes a binary tree, but if we choose a value too high we are basically performing binary searches inside the nodes which is no different than using a binary tree. So I see the best choice here is a median value. >>> absolutely. A graph would be cool to see how this parameter affects overall performance. >>> indeed. Also, you talked about the advantage in memory, but what's the advantage in performance? As for the structure, I see how this could improve cache performance because of the contiguous allocation in arrays and less 'pointer traversal'. But would it perform better than a binary self-balanced tree? >>> Performance is not in pointer traversal, but in reducing the amount or number of allocated nodes. Binary tree since containing only one item per node, will require one node per item (imagine 1 million, it will require 1 million nodes, how about 10 mil, 100s mil,...), whereas B-Tree will allow multiple items contained within one node. Having it 1 to 1 for Binary Tree is on the extreme level and can cause (prone to) memory fragmentation and costlier Garbage Collector tracking/management. Sorry for the amount of questions. I know my english sucks, sorry about that too. >>> no problem, and your English is good, ideas conveyed are very clear. -G.
 My vote of 5 Abdul Quader Mamun 15-Aug-12 18:55
 Re: My vote of 5 Abdul Quader Mamun 15-Aug-12 18:56
 Re: My vote of 5 Gerardo Recinto 20-Aug-12 22:23
 what's the B-Tree like when new word added? murderer_pro 2-Nov-11 17:12
 I don't like your "stress test" [modified] carga 16-Jul-11 13:54
 Re: I don't like your "stress test" Gerardo Recinto 26-Jul-11 8:07
 How display the structure of the tree vinycc1 16-Jun-11 10:43
 Re: How display the structure of the tree Gerardo Recinto 3-Jul-11 19:16
 Hamming Distance xfx 26-Mar-11 9:23
 Re: Hamming Distance xfx 26-Mar-11 15:00
 Re: Hamming Distance Gerardo Recinto 20-Apr-11 21:20
 Re: Hamming Distance xfx 21-Feb-12 19:43
 My vote of 1 Anders M. Mikkelsen 30-Aug-10 21:49
 Re: My vote of 1 Gerardo Recinto 3-Sep-10 8:54
 My vote of 5 Nishant Sivakumar 29-Jul-10 9:39
 disk-based? Unruled Boy 27-Jul-10 2:27
 Re: disk-based? Gerardo Recinto 27-Jul-10 9:23
 Re: disk-based? Unruled Boy 1-Aug-10 19:01
 Re: disk-based? Gerardo Recinto 3-Sep-10 8:49
 My vote of 5 ThatsAlok 26-Jul-10 22:31
 Re: My vote of 5 Gerardo Recinto 27-Jul-10 10:38
 Nice effort César de Souza 24-Jul-10 4:24
 Re: Nice effort [modified] Gerardo Recinto 24-Jul-10 9:44
 Re: Nice effort César de Souza 24-Jul-10 10:45
 Re: Nice effort Gerardo Recinto 24-Jul-10 18:56
 Re: Nice effort César de Souza 25-Jul-10 10:25
 I like it KenJohnson 23-Jul-10 21:30
 Re: I like it Gerardo Recinto 24-Jul-10 9:52
 Last Visit: 31-Dec-99 19:00     Last Update: 1-Apr-15 12:04 Refresh 1