Click here to Skip to main content
15,867,686 members
Articles / Programming Languages / C#
Tip/Trick

GUID Tree Data Structure

Rate me:
Please Sign up or sign in to vote.
5.00/5 (4 votes)
14 Feb 2013CPOL2 min read 13.2K   176   4   1
Data structure designed to store billions of items using GUID as a key.

Introduction

With modern day data analysis, we come across a situation where we have billions of records, each with its own GUID. How do we organize terabytes of data and retrieve what we want?  

The solution I describe uses GUIDs as a key, but a more generalized solution is possible.

Background 

While working at my previous company, someone mentioned that they had billions of records to process. They wanted to count the number of unique GUIDs, but couldn't find a useful tool to use. The idea popped into my head and here's the result. 

The solution is based on the notion of a base n key with m digits. In the case of a GUID, we have 32 hexadecimal digits. As a result, we can create a tree structure with a depth of 32 (for each of the digits in the GUID). Each node would have an array of 16 sub-trees attached to it. 

A simplified example is seen below:

Image 1

Under A, below the root, the array contains 3 elements, with links to 3 sub-trees. The System.Guid method ToByteArray() returns a byte array of 16 bytes. Since a byte contains 256 possible values, that seemed like an unnecessary use of memory. As a result I split it into a 32 element byte array, where each element contained a hexadecimal value. This allows for a node with an array containing 16 sub-trees, instead of 256 sub-trees. I could have used a linked list instead of an array, but I didn't want to increase retrieval time. I sacrificed memory for performance. The solution presented uses two types of data structures, depending on the number of records you need to store.

Memory Based

For small data sets, when we are certain all data can be stored in memory. The above data structure is used here.

File Based

For large data sets consisting of billions of records. The solution uses the file system to store data. The above data structure is represented as a series of file system directories. The limitation is that you need a drive with sufficient memory. You will need to use a spanned hard drive to store all the billions of records you need to work with. This is currently single-threaded, although a multi-threaded solution is possible.

Using the code

For the in-memory solution:
C#
// Create collection for memory-based storage
GuidCollection<int> guidCollection = new GuidCollection<int>();

// Key we will use
Guid newGuid = Guid.NewGuid();

// Add
guidCollection.Add(newGuid, 5);

// Contains key
bool res = guidCollection.ContainsKey(newGuid);

// Count
int count = guidCollection.guidArrayTree.Count;

// Allow overwriting data
guidCollection.OverwriteValues = true;

// Index (set)
guidCollection[newGuid] = 10;

// Index (get)
int val = guidCollection[newGuid];

// Enumerate
foreach (int intVal in guidCollection.GetEnumerator())
{
    System.Console.WriteLine("Value is: " + intVal);
}

// Retrieve
int retrivedValue;
guidCollection.TryGetValue(newGuid, out retrivedValue);

// Remove
guidCollection.Remove(newGuid);
For the file-based solution: 
C#
// directory where we store the data structure
string dir = "C://TestDir/";

// Create collection for file-based storage
GuidCollection<int> guidCollection = new GuidCollection<int>(dir);

// Key we will use
Guid newGuid = Guid.NewGuid();

// Add
guidCollection.Add(newGuid, 5);

// Contains key
bool res = guidCollection.ContainsKey(newGuid);

// Count
int count = guidCollection.guidArrayTree.Count;

// Allow overwriting data
guidCollection.OverwriteValues = true;

// Index (set)
guidCollection[newGuid] = 10;

// Index (get)
int val = guidCollection[newGuid];

// Enumerate
foreach (int intVal in guidCollection.GetEnumerator())
{
    System.Console.WriteLine("Value is: " + intVal);
}

// Retrieve
int retrivedValue;
guidCollection.TryGetValue(newGuid, out retrivedValue);

// Remove
guidCollection.Remove(newGuid);

Points of Interest

The GetEnumerator() method required recursion, since we are dealing with a deeply embedded data structure and each level had multiple entries. I got stuck on that, until I found this article: http://stackoverflow.com/questions/2055927/ienumerable-and-recursion-using-yield-return

Adding and retrieving data is in linear time. However, we pay for this in terms of memory usage.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Software Developer
United States United States
I’ve been a software developer for the last 11 years.
My language of choice is C#.
My strength lies in windows based programming as an SDET.

Strengths include:
• Object Oriented Programming
• Design Patterns
• Data Structures, Algorithms
• Windows Automation

Principle technologies include:
• Visual Studios 2010 (Ultimate, TFS, Test Manager)
• C# 4.0, XML, WPF, SQL Server

Education includes:
• BSEE - Concordia University, Quebec, Canada
• Programmer Analyst Certificate - John Abbott College, Qc, Canada
• Certified Scrum Master Training - Danube, Bellevue, WA
• Windows Azure Training - Wintellect, Bellevue, WA

Certification:
• Microsoft Certified Solution Developer for the MS.NET Platform

Comments and Discussions

 
QuestionEvidence... Pin
Mehdi Gholam14-Feb-13 6:59
Mehdi Gholam14-Feb-13 6:59 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.