Introduction
This article takes a closer than normal look at everyone’s favourite data structure – the .NET dictionary. There are quite a
lot of different dictionary classes in .NET these days, so I’m going to explore how they work,
what their differences are and where to use them.
Background
I recently wrote my own dictionary class, one which stored the key-value pairs as an ordered list, with an unordered section at the end for
newly inserted entries that would later be moved to the correct place at an opportune idle moment. If the key could not be found using a
binary search in the ordered part, the unordered part would be scanned. I did it this way because, for my application, inserts would be
very rare, but when they did happen I didn’t want to have to shift the ordered part around, as it was going to be very large.
I have an affection for the binary search, a slightly tricky algorithm. My algorithms tutor at university had made me aware of just how
tricky it is – many of the books at the time got it wrong (the edge cases: first, last, not present, empty list, etc.) – but when
programmed properly it gives a very good worst-case look-up time. I was fairly confident the performance of my
dictionary would be alright.
Testing showed it worked accurately but was beaten hands down speed-wise by the standard .NET Dictionary
class. I had
hoped the performance would be similar and felt deflated when it wasn’t, but at least my implementation used less memory. My mistake had
been to only consider the worst-case performance which was actually very good, not the general case which wasn’t. Sadly MegaDictionary™
didn’t deliver the megadom I was expecting and was consigned to the bin; it had all been a misadventure.
Algorithms are such that one rarely fits all: there’s often a trade-off between CPU cycles and memory, or one works well for
small amounts of data but falls apart for larger amounts. Knowing the right algorithm for the situation is what it’s all about.
This, combined with my surprise at the existence of ConcurrentDictionary – something I only discovered when I met it in
code one day – prompted me to spend a bit of time refreshing my knowledge of these ubiquitous structures so that I could authoritatively
declare which would be best for any circumstance, thus impressing my colleagues, wife, dog, etc. And today I’m sharing my findings
on CodeProject.
The Basics
A dictionary is a data structure which provides a one-way association between two object instances, or put more simply, it allows you
to say that for one specific thing you want to be able to get another specific thing. They are used externally to the objects
concerned. If you have a User object and an Address object, you could use a dictionary, external to both, to
associate the address with a user. Normally a better design would be just to have a reference to Address as a field of
User, but it’s another way to do it.
The thing we’re mapping from is called the Key and the thing it maps to the Value. Most dictionaries use a hash table
algorithm under the covers. For these, the value can be anything but keys must do two things:
- Provide an equality method, a way in which we can tell if two objects are the same thing. For people, we might say that if
two instances share the same first name, last name and date of birth, we consider them as representing the same real person. Better
yet, we could compare National Insurance numbers (or Social Security numbers if you prefer) or a database identifier if we have
one as these already have the concept of uniqueness ingrained in them.
- Provide a hash code based upon the fields we’ve identified as defining uniqueness, i.e., the same fields we used for equality.
A hash code in the case of a .NET dictionary is simply a single integer number. That’s single as in just one number, not the Single floating-point type.
Hash codes don’t need to be unique but they do have to be consistent. That is to say, the same key must always produce the same code.
Ideally we’d want the hash code from every object to be unique but this isn’t possible. For instance, if you were to stick five billion items
in your dictionary (good luck with that), then it simply wouldn’t be possible to give each a unique number in 32 bits of space. More subtly,
we want the codes to be randomly unique, a term that might confuse you. I’m not sure what it means either – I just made it up - but I’ll
explain what I’m on about later.
In essence, a hash table works by having a set of buckets, typically slightly larger in quantity than the number of items. Each key/value
pair is placed in a bucket, and the hash code of the key determines which bucket that is (or would be). Depending on the implementation,
buckets can hold one key/value pair or more than one. In the one-item-per-bucket implementation, if the key/value found there is not the right one, you need
to look in another bucket. In the multiple-key/value-per-bucket implementation, you need to rummage around the other items in the same bucket. The
equality method is used to determine which pair is the one we’re after.
When using multiple items per bucket (as Dictionary<TKey, TValue> does), things work best when we get close to the one-item-per-bucket
utopia: performance is exceedingly good, and this is why careful attention to the hash code the key produces is important. In
the worst case, a type that returned the same hash code for all its instances would end up with everything in the same bucket, which would then
need to be sequentially searched. That, my friends, would suck.
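To make the bucket selection concrete, the index is essentially derived from the hash code like this (a simplified sketch of the idea, not the actual framework source):
// Simplified sketch: how a hash table turns a key's hash code into a bucket index.
// Real implementations differ in detail, but the core idea is a modulus.
private static int GetBucketIndex(object key, int bucketCount)
{
    int hash = key.GetHashCode() & 0x7FFFFFFF; // clear the sign bit so the index is never negative
    return hash % bucketCount;                 // map the hash onto the available buckets
}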
Some Ways are Quicker than Others
Let’s consider an example where a set of numbers comes to us and we want to count the instances of each number. Because we’re only interested in
the dictionary, I won’t make them random or anything interesting, but will have twenty thousand different numbers, each of which comes in twenty
thousand times, and use a Dictionary to count the occurrences of each. This will mean 800 million dictionary operations! I’ll time the whole thing so
we can get an idea of performance:
private static void Example1()
{
    Dictionary<int, int> instanceCount = new Dictionary<int, int>();
    Stopwatch stopwatch = Stopwatch.StartNew();
    for (int count = 0; count < 20000; count++)
    {
        for (int number = 0; number < 20000; number++)
        {
            if (!instanceCount.ContainsKey(number))
            {
                instanceCount.Add(number, 1);
            }
            else
            {
                int current = instanceCount[number];
                instanceCount[number] = current + 1;
            }
        }
    }
    stopwatch.Stop();
    Console.WriteLine("Example 1 takes {0:0.00} seconds", stopwatch.ElapsedMilliseconds / 1000d);
}
That takes 26.03 seconds on my laptop in a release build without a debugger attached. In the sample code, but not shown here, there is also a sanity check
to make sure it's doing what we expect and to ensure that the whole thing doesn't get optimised away by a too-keen compiler. I often see dictionary
code like this, but there is a simple way to save a look-up. You don’t need to check whether a value exists before setting it, and can save time and
typing by doing this:
private static void Example2()
{
    Dictionary<int, int> instanceCount = new Dictionary<int, int>();
    Stopwatch stopwatch = Stopwatch.StartNew();
    int current = 0;
    for (int count = 0; count < 20000; count++)
    {
        for (int number = 0; number < 20000; number++)
        {
            instanceCount.TryGetValue(number, out current);
            instanceCount[number] = current + 1;
        }
    }
    stopwatch.Stop();
    Console.WriteLine("Example 2 takes {0:0.00} seconds",
        stopwatch.ElapsedMilliseconds / 1000d);
}
That takes 18.63 seconds – better.
Like lists, you can add things to a dictionary without worrying about its capacity as it just automatically grows when needed, but under the covers
there is no such thing as a variable length structure. When you are allocated memory, you have to say how much you want and then are stuck with
it, so in order to grow it sometimes becomes necessary to create a new bigger structure, copy all the existing data into it and use that. This
comes at a performance cost and could hurt if you are tight on memory as to do this you will need to have both the old and new structure in memory
before you dispose of the old. Also, when things grow they usually double in size (by design), so if you insert just enough items to make your structure
resize, it will take up roughly twice as much memory as it needs, hardly ideal.
So, if you know how many items there are going to be, it's good practice to tell your collection in the constructor. We know there are going to be
20,000 items in our dictionary, so let’s create it at that size up front so it doesn't have to grow and allocate more memory than needed:
private static void Example3()
{
    Dictionary<int, int> instanceCount = new Dictionary<int, int>(20000);
    Stopwatch stopwatch = Stopwatch.StartNew();
    int current = 0;
    for (int count = 0; count < 20000; count++)
    {
        for (int number = 0; number < 20000; number++)
        {
            instanceCount.TryGetValue(number, out current);
            instanceCount[number] = current + 1;
        }
    }
    stopwatch.Stop();
    Console.WriteLine("Example 3 takes {0:0.00} seconds",
        stopwatch.ElapsedMilliseconds / 1000d);
}
It now takes 18.7 seconds, which is fractionally longer than before, but that will be down to something else going on on my computer – the dictionary
actually has less work to do. Realistically, there is no gain here, but it was worth a try.
Why use a Dictionary
at all in this case? If you are mapping a set of sequential numbers to something, an array saves time and memory:
private static void Example4()
{
    int[] instanceCount = new int[20000];
    Stopwatch stopwatch = Stopwatch.StartNew();
    for (int count = 0; count < 20000; count++)
    {
        for (int number = 0; number < 20000; number++)
        {
            instanceCount[number]++;
        }
    }
    stopwatch.Stop();
    Console.WriteLine("Example 4 takes {0:0.00} seconds",
        stopwatch.ElapsedMilliseconds / 1000d);
}
That takes just 0.49 seconds! Now, let’s look at a weirder case. This time, we will still have 20,000 numbers come in 20,000 times, but I’m going to space
them 21,023 apart:
private static void Example5()
{
    Dictionary<int, int> instanceCount = new Dictionary<int, int>(20000);
    Stopwatch stopwatch = Stopwatch.StartNew();
    int current = 0;
    for (int count = 0; count < 20000; count++)
    {
        for (int number = 0; number < 20000 * 21023; number += 21023)
        {
            instanceCount.TryGetValue(number, out current);
            instanceCount[number] = current + 1;
        }
    }
    stopwatch.Stop();
    Console.WriteLine("Example 5 takes {0:0.00} seconds",
        stopwatch.ElapsedMilliseconds / 1000d);
}
I can't tell you how long this takes because I had to stop my computer after four and a half hours as the room was getting hot! A slight change, an inordinate
amount of time – do you know why? In all likelihood you'll never encounter this situation in the real world but, because knowledge is power, here’s why. Integers
just return themselves as their hash code, which is ideal – no work is required and uniqueness is guaranteed. Nice. Now, when we allocate a Dictionary
with a capacity of 20,000 items, it will actually go and create 21,023 buckets. It uses a higher prime number internally. When the dictionary gets the hash code, it
simply mods it with the number of buckets to give the index of the bucket which it is going to use for the item.
So 0 % 21023 = 0, and the first item goes in bucket 0. 21023 % 21023 = 0, bucket 0 again; 42046 % 21023 = 0, once again. In this example, everything ends up in
the first bucket and we end up with a lot of empty buckets and a headache. Everything is placed in one huge sequential list which needs scanning on each access. This
is what I was driving at about random uniqueness earlier: our hash codes are unique, but that’s not enough here.
Let’s think about this in terms of operations. In our other examples, we did achieve the utopia of one item per bucket, so we had to do around 800 million
'operations'. This time, because everything is in one bucket we need to scan through it, and that adds on average an extra 10,000 (20,000 / 2) operations
per previous operation, making roughly 8 trillion operations in total – which, at the rate the earlier examples managed, we can roughly estimate would have taken about 55 hours to complete!
This is an interesting thing about hash tables: their worst-case performance is truly dire, not a patch on that of a binary search, yet worst cases are rare and
in the normal case performance is excellent – O(1), if you’re into that sort of thing.
Making Your Types Dictionary Friendly
Every type in .NET derives from Object and gets three virtual methods as standard:
public virtual string ToString() | The ever-useful method that gets called whenever a textual representation of the object is required. Very useful for debugging, sticking objects in controls, etc. |
public virtual bool Equals(Object obj) | An equality comparer, as described above |
public virtual int GetHashCode() | A hash code generator, as described above |
By overriding Equals()
and GetHashCode()
your object is ready for use as a key! The recommendation is that you override all three
in all your types but I don’t think anyone is that saintly.
These overrides have been around since the inception of .NET and predate generics so Equals()
takes a parameter of type Object
. With
the arrival of generics in .NET 2.0, things became more type-safe and performance improved because value types no longer needed to be boxed and unboxed in collections.
Boxing and unboxing take a surprising amount of effort and memory. (An integer is 4 bytes on a 32-bit system. When boxed, you end up with a 4-byte reference to those
4 integer bytes, plus an 8-byte type information/sync block overhead, so the boxed integer costs you 16 bytes.)
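To illustrate, here's a small hypothetical snippet: with the non-generic Hashtable every int key and value is boxed to object on the way in and cast back out, whereas the generic Dictionary stores them directly:
// Non-generic: both key and value are boxed onto the heap, and a cast is needed to read them back.
Hashtable boxed = new Hashtable();
boxed[1] = 42;
int fromBoxed = (int)boxed[1];

// Generic: the ints are stored as ints - no boxing, no casting.
Dictionary<int, int> unboxed = new Dictionary<int, int>();
unboxed[1] = 42;
int fromUnboxed = unboxed[1];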
If you are going to allow your object to be used as a key in a generic collection, you won’t really want to use an Equals
method which takes a
parameter of type Object
. Instead, we can implement the IEquatable<T> interface, which gives us a strongly typed version of the method:
public class User : IEquatable<User>
{
    public string FirstName { get; set; }
    public string LastName { get; set; }

    private DateTime _dateOfBirth;
    public DateTime DateOfBirth
    {
        get { return _dateOfBirth; }
        set { _dateOfBirth = value.Date; }
    }

    public string NationalInsuranceNumber { get; set; }

    public override string ToString()
    {
        return string.Join(" ", FirstName, LastName);
    }

    // Identity is defined by the National Insurance number, so that's what we hash.
    public override int GetHashCode()
    {
        return NationalInsuranceNumber == null ?
            0 : NationalInsuranceNumber.GetHashCode();
    }

    public override bool Equals(object obj)
    {
        User otherUser = obj as User;
        if (otherUser == null) return false;
        return Equals(otherUser);
    }

    public bool Equals(User other)
    {
        if (other == null) return false;
        return NationalInsuranceNumber == other.NationalInsuranceNumber;
    }
}
All primitive types provide their own speedy version of GetHashCode()
, so when you build your types it's usual just to combine the hash codes
of the fields that define identity. If you have a compound key, that is you need more than one field to define uniqueness, the usual approach is to bitwise
exclusive-or the hashes together:
public class Product : IEquatable<Product>
{
    public string Manufacturer { get; set; }
    public string ProductCode { get; set; }

    public override int GetHashCode()
    {
        return Manufacturer.GetHashCode() ^ ProductCode.GetHashCode();
    }

    public bool Equals(Product other)
    {
        return other.Manufacturer == Manufacturer && other.ProductCode == ProductCode;
    }
}
This approach works well, but make sure that the hash codes from the compound fields are not aligned in any way. If you had a compound key using two
integer identifiers which are usually the same and you XORed their hash codes together, the result would usually be zero and you'd end up in the one-bucket scenario.
Although this is unusual, it’s something to have in the back of your mind.
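If that worries you, a common alternative is to fold the fields together with prime multiplication instead of a plain XOR, so that equal field hashes no longer cancel out. Here's a rough sketch of what that might look like for the Product type above (one common pattern, not the only way to do it):
public override int GetHashCode()
{
    unchecked // overflow is fine here - we just want the bits to mix
    {
        int hash = 17;
        hash = hash * 31 + (Manufacturer == null ? 0 : Manufacturer.GetHashCode());
        hash = hash * 31 + (ProductCode == null ? 0 : ProductCode.GetHashCode());
        return hash;
    }
}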
Defining GetHashCode() and Equals() Outside the Type
Usually a type will define its own Equals() and GetHashCode() specifying how it should be used as a key in a dictionary, but this can
also be done outside the type. This is similar to a type that implements IComparable. When sorting, you’d normally use the implementation of that interface,
but you can elect instead to use an instance of a separate IComparer. It’s a common approach when you want different sort orders – first name then last, but also
last name then first, etc. When we want to change the identity and hashing logic of an object in a dictionary, we can choose to use a separate IEqualityComparer.
If we consider our User type above and realise that sometimes the National Insurance number has been left empty, this causes a problem for our dictionary.
Keys need to be unique, and now we have different users sharing the same key. Worse yet, the developer who wrote the class likes it the way it is and
has expressed an intention to hurt you if you change it.
To solve the problem, what seems to make sense here is to use the National Insurance number when present, but default to the compound key of first name, last name and
date of birth when not. This is, however, a very bad idea. Recall that GetHashCode() must always return the same code for the same object.
While it works at first, should somebody add the National Insurance number later on, the hash code will change and you’ll have a broken dictionary.
Instead, we’re going to have to forget all about National Insurance (I wish) and just use the name/date compound key. We are not allowed to change User,
as this would involve a trip to the hospital, so we need to do this externally:
public class UserEqualityComparer : IEqualityComparer<User>
{
    public bool Equals(User x, User y)
    {
        return x.FirstName == y.FirstName &&
               x.LastName == y.LastName &&
               x.DateOfBirth == y.DateOfBirth;
    }

    public int GetHashCode(User user)
    {
        return user.FirstName.GetHashCode() ^ user.LastName.GetHashCode()
               ^ user.DateOfBirth.GetHashCode();
    }
}
And now, we can do this:
private static void UserDictionary1()
{
    User peter = new User { FirstName = "Peter",
        LastName = "Anderson", DateOfBirth = new DateTime(1965, 5, 3) };
    User sarah = new User { FirstName = "Sarah",
        LastName = "Anderson", DateOfBirth = new DateTime(1990, 8, 20) };

    Dictionary<User, string> userGender = new Dictionary<User, string>();
    userGender[peter] = "boy";
    userGender[sarah] = "girl";
    Console.WriteLine("Peter is a {0}", userGender[peter]);

    Dictionary<User, string> userGender2 = new Dictionary<User, string>(new UserEqualityComparer());
    userGender2[peter] = "boy";
    userGender2[sarah] = "girl";
    Console.WriteLine("Peter is a {0}", userGender2[peter]);
}
Comparison of Dictionaries
Now that we know what a dictionary is and what we need to do to make an object ready to be used in one, it's a good time to look at the different
dictionaries supplied in the .NET Framework. There are quite a few of them, but this largely stems from there being one set which predates generics and
another set using generics that replaces it. The table below shows the different .NET dictionaries, when they were introduced, the order in which entries
are returned when you enumerate over them, and the likely overhead per item on a 32-bit system.
Name | Introduced in | Enumeration order | Overhead per item |
Hashtable | .NET 1.0 | Arbitrary | 12 - 28 bytes |
ListDictionary | .NET 1.0 | Insertion | 20 - 36 bytes |
HybridDictionary | .NET 1.0 | Arbitrary | 12 - 36 bytes |
StringDictionary | .NET 1.0 | Arbitrary | 12 - 28 bytes |
OrderedDictionary | .NET 2.0 | Insertion | 20 - 36 bytes |
Dictionary<TKey, TValue> | .NET 2.0 | Arbitrary | 12 bytes |
SortedDictionary<TKey, TValue> | .NET 2.0 | Comparer | Difficult to tell |
SortedList<TKey, TValue> | .NET 2.0 | Comparer | 8 bytes |
ConcurrentDictionary<TKey, TValue> | .NET 4.0 | Arbitrary | Difficult to tell |
The overhead for the non-generic dictionaries varies according to whether the key and value already exist on the heap or not. Using
ListDictionary
as a simple example, this uses a linked list to store the key value pairs. If we have a user and address already on the heap,
we create a list item which stores a reference to the key (4 bytes), a reference to the value (4 bytes) and a reference to the next item in the list
(4 bytes). Because this item is itself stored on the heap, as with all heap objects it picks up an 8 byte overhead (4 byte type information, 4 byte
sync block), and we end up with 20 bytes.
If instead we use integers as both key and value, these need to be boxed, so each 4-byte integer value gets added to the heap (again with an 8-byte
overhead), making 24 bytes for the pair. This gives a total of 44 bytes per item. Take away the actual size of the key and value – 8 bytes – and you end up with 36 bytes of overhead.
This gives an idea of storage space, but is only indicative. For Hashtable
for instance, that overhead is per bucket not per item and there
are always more buckets than needed. In my tests when I put 10 million items in a Hashtable
, I ended up with 13.9 million buckets.
Hashtable
The Hashtable
class was the original dictionary in .NET 1.0, but there seems little point in using it these days as the Dictionary
class that arrived in 2.0 does the same thing more efficiently and with better type safety. Hashtable
uses a one item per bucket implementation and
a technique known as rehashing for determining the bucket.
The idea is this: The key in the hashtable provides a hash code via its GetHashCode()
function. Rather than just mod-ing this with the number of
buckets to get an index (as Dictionary
does), the hash code is hashed once again prior to the modulus function. If a collision occurs because the bucket
is in use, rehashing is performed again using a slightly altered hash function. The alteration is not in the way it works but in the coefficients that are used.
It’s all quite mathematical and relies on the number of buckets being prime in order to work. I spent a while getting quite confused looking at it, and since we don’t
use it any more decided not to dig too deeply.
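Very roughly, though, the probing works along these lines – a heavily simplified sketch to show the shape of the idea rather than the real Hashtable source:
// Heavily simplified sketch of rehashing (double hashing).
// If the first bucket is taken, keep stepping by an increment derived from the hash itself.
// bucketIsFree is a stand-in for the real bucket bookkeeping.
private static int FindBucket(int hashCode, int bucketCount, Func<int, bool> bucketIsFree)
{
    int hash = hashCode & 0x7FFFFFFF;
    int increment = 1 + (hash % (bucketCount - 1)); // the 'second' hash - never zero
    int index = hash % bucketCount;
    while (!bucketIsFree(index))
    {
        index = (index + increment) % bucketCount;  // probe the next candidate bucket
    }
    return index;
}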
ListDictionary
This class doesn’t use a hash table internally; instead it stores the key/value pairs in a forward-only linked list, so you can’t set an initial size on construction.
When you add, read or remove from a ListDictionary, the linked list must be traversed to see if the item already exists; if not, when adding, it’s appended
to the tail of the list. This preserves the order in which things are added, but it also means look-ups are linear and performance degrades rapidly as the list grows,
so you should only use ListDictionary when the number of entries is in single figures. Because no hash table is involved, types do not need to override GetHashCode()
to work as a key, as it doesn’t get called (they should, of course). Optionally, you can specify an external IComparer so that the Equals() method wouldn’t
technically need overriding either.
HybridDictionary
With a small number of entries, ListDictionary
outperforms Hashtable
, but doesn’t compete when dealing with large numbers of entries. A
HybridDictionary
is just the combination of a ListDictionary
and a Hashtable
. It starts off storing everything in the
ListDictionary
, but after eight items have been added it creates the Hashtable
, copies everything into that and deletes the ListDictionary
.
Theoretically, it should give better performance when the number of entries starts off low but might grow. I always used to use this as my dictionary class in .NET 1.1, but in reality
I doubt it ever did anything for me performance-wise.
StringDictionary
This dictionary maps strings to other strings, presumably because the .NET Framework team thought this would be a common occurrence and that it would be worth creating
a strongly typed dictionary for the job. The keys are case-insensitive, and internally it just uses a Hashtable as you would yourself, but wraps the casting up inside.
So, speed- and size-wise, it will perform the same as Hashtable.
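A quick illustration of that case-insensitive behaviour (a tiny hypothetical snippet):
StringDictionary settings = new StringDictionary();
settings["Theme"] = "dark";
Console.WriteLine(settings["THEME"]); // prints "dark" - key look-ups ignore case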
OrderedDictionary
This was introduced in .NET 2.0 but is unusual in that it is not generic. It’s hybrid in nature in that it behaves like a dictionary but also preserves insertion order and
allows you to fetch an item by its index as well as its key. The most likely reason it is non-generic is to cater for the instance where the key is an integer. In this case,
what would the indexer do? It takes an integer value but has no way of knowing whether it represents the key or the index.
Internally, it is simply a Hashtable
and an ArrayList
. When an item is added, it is inserted into the hash table and appended to the list. When
removed, it is taken out of the hash table, and a search is performed to find it in the list so that it can be removed from there too. Because every entry is stored twice, it’s a memory-hungry
structure and best avoided for large numbers of items.
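A small hypothetical snippet shows the dual access it offers:
OrderedDictionary capitals = new OrderedDictionary();
capitals.Add("England", "London");
capitals.Add("France", "Paris");
Console.WriteLine(capitals["France"]); // look-up by key: Paris
Console.WriteLine(capitals[0]);        // look-up by index: London - insertion order is preserved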
Dictionary<TKey, TValue>
This is the standard replacement for all the non-generic dictionaries. It works in a similar manner to Hashtable but has slightly reworked internals. In particular,
each bucket is represented as a linked list. Each element in the list stores the key, the value, the hash of the key and a link to the next item, so on a 32-bit system that’s eight bytes
on top of the key and value themselves. Additionally, you need an index which denotes the first item in the linked list for each bucket, so you get 12 bytes of overhead per key/value
insertion. Remember too that, like Hashtable, there are always more buckets than needed, and that can inflate the memory consumption.
It’s interesting that the hash code is stored for each item as well as the key; after all, if you have the key you can just call GetHashCode() on it to get it again.
Presumably, they concluded it was better to spend 4 bytes of memory and only have to call it once. In case you are wondering why we need the hash code at all once we are already in
the right bucket, it is because items can switch buckets as the dictionary grows (remember, we get the bucket index by mod-ing the hash by the number of buckets – if we add buckets,
things are going to get moved around).
SortedDictionary<TKey, TValue>
This dictionary stores its key/value pairs ordered by key in a red-black tree, a fiddly self-balancing data structure. There is no hash table involved, so once again
GetHashCode() is not used. Instead, if you want to use a type as a key here, it needs to implement IComparable, or you need to provide a separate
IComparer to determine the ordering.
SortedDictionary is similar to SortedList below, and together they demonstrate the compromising world of algorithms. SortedDictionary
requires more memory but offers faster insertion times, except when the data is already sorted, in which case SortedList will be quicker.
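As a quick sketch, enumeration comes back in key order rather than insertion order:
SortedDictionary<string, string> extensions = new SortedDictionary<string, string>();
extensions["zip"] = "archive";
extensions["bmp"] = "image";
extensions["cs"] = "source code";
foreach (KeyValuePair<string, string> pair in extensions)
{
    // Prints bmp, cs, zip - ordered by key, regardless of the order they were added.
    Console.WriteLine("{0} -> {1}", pair.Key, pair.Value);
}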
SortedList<TKey, TValue>
Despite its name, this is still a dictionary (but also a list!) and one that behaves much the same as SortedDictionary; you need to provide a comparer in the
same way to allow ordering by the keys. Internally, it stores the keys sequentially in one array and the values in another. When you look up a value from the key, a binary search
is performed on the key array and, if found, its index is used to find the value in the value array. Insertion has to shift every item after the insertion point in both arrays, so it
is an extremely heavyweight action. If you want to see performance death, sort your key/value pairs, reverse the order and insert them one by one! (Note: I haven't tried this.)
Although huge CPU effort can be required to use this thing, the trade-off is in memory consumption – there is very little overhead as it only stores the keys and the values – meaning that if
you want to store a lot of stuff and memory is becoming a problem, it might be a good choice. Just make sure data is sorted before you add it, or that it's something which can
be serialized and deserialized easily.
ConcurrentDictionary<TKey, TValue>
Urgh, threading, doesn’t it ever go away? None of the dictionaries so far is thread-safe, so if you are working in a multithreaded environment it's your responsibility to
synchronise access. ConcurrentDictionary arrived in .NET 4.0 and works like an ordinary dictionary, but also exposes some additional methods which provide atomic access
to the dictionary and thus should be safe to use from different threads. I don’t want to look too closely at this because it takes us into the theory of multithreading, and this is an
article about dictionaries and the way they work.
But to give a taste of the sort of thing, take a look at the TryUpdate method. This updates a value in the dictionary for a given key only when the existing value matches
a value you supply. If this action wasn't atomic, you'd first need to get the existing value for the key, check whether it matched what you were expecting and, if so, perform the update. Because
two or three actions are being performed here, there's the danger that another thread could change the dictionary after the first and before the last, so you would then need to synchronise all access
to the dictionary.
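To give a flavour, the counting loop from earlier could be written against ConcurrentDictionary using AddOrUpdate, which performs the read-modify-write for you (a sketch for illustration, not a benchmark):
ConcurrentDictionary<int, int> instanceCount = new ConcurrentDictionary<int, int>();
for (int number = 0; number < 20000; number++)
{
    // Adds the key with a count of 1 if it's new, otherwise updates the existing count atomically.
    instanceCount.AddOrUpdate(number, 1, (key, current) => current + 1);
}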
Conclusion
I hope that perhaps you have picked up something new about dictionaries having read this far. My general conclusion is that you don’t actually need to know most of this and if you’ve
been using Dictionary
in your code without a care in the world you’ve probably been doing the right thing.
The performance is superb as long as you stick to the rules, and unless memory consumption becomes challenging it should provide well for you. If you need to get stuff in an ordered
fashion, you have the dictionaries which support this too.
If you do find yourself running out of memory and vast dictionaries are responsible, consider using SortedList
but be prepared for a dramatic slow-down.
Points of Interest
Contrary to what has been written above, a dictionary is actually a book describing the meaning of words and has nothing to do with computers whatsoever.
Like all great things (except for the transistor and paperclip), it was invented in England. The first recognisable dictionary, A Dictionary of the English
Language was published in 1755 and had been compiled by the great Samuel Johnson who also told us that "when a man is tired of London, he is tired of
life; for there is in London all that life can afford." I wouldn't argue with that Samuel,
but it is nice to get away at the weekend.
History
- 26th May 2013: Initial version
- 28th May 2013: Tidy up and some rewording for clarity