Last year I published the article Dictionary + Locking versus ConcurrentDictionary, in which I explained why using a normal dictionary plus some locking technique may be better than using the ConcurrentDictionary<TKey, TValue>, and I finished by presenting a ThreadSafeDictionary<TKey, TValue> that uses the same lock-free reading technique as the ConcurrentDictionary<TKey, TValue> while avoiding that dictionary's problems and even offering some extra methods.

Well, at this moment I am focusing on writing small solutions, so people can simply copy one file and use it, and the ThreadSafeDictionary<TKey, TValue> I presented doesn't fit that category. In fact, when I use such a dictionary for caches I always use GetOrCreateValue() and sometimes the setter, Remove() or Clear(), but I don't use any of the other methods. So, why not write a class specialized to do only that, making it very small and optimized for that use case?

Well, that's what this article is all about.
GetOrCreateValue() is the ideal method for any kind of cache. You pass a key (for example, a record's Id) and, if the value (the record) is there, you get it directly. If not, a delegate capable of creating that value is invoked (loading it from the database, for instance), the result is cached so it doesn't need to be loaded again in a future call and, finally, it is returned to the caller.

Note that even though I used a database record as the example, this applies to any "function" that always generates the same value when given the same key. So, any time-consuming computation may benefit from this kind of solution (in fact, I always expected some kind of automatic cache for functional languages).
If you need a GetOrCreateValue() method, you can actually look for an existing solution.

The Dictionary<TKey, TValue> class is the "default" solution. It doesn't come with a GetOrCreateValue() method, but it is possible to write one as an extension method by combining a lookup with the Add() method. Having such a method as part of the dictionary itself could improve performance, as the Add() method does a new search before adding the item, but since that search is very fast and we never add an item twice, it is not a real problem. The real problem with this solution is that it is not thread-safe and, in many situations, thread safety is exactly what we need.
As I presented in the other article, doing a full lock during the search (independently of whether we add the value or not) may actually be faster than using a ReaderWriterLockSlim, as that lock is not as slim as the name suggests. But a full lock really forbids two (or more) threads from accessing the dictionary at the same time: one thread must wait for the other to finish. Also, the locks flush the CPU caches, so the next calls (even ones unrelated to the dictionary) may need to access the main memory again, becoming slower (something extremely hard to measure).
This is where the ConcurrentDictionary<TKey, TValue> is a winner. Its method is actually named GetOrAdd(), but we can say it is GetOrCreateValue() with a different name. It really allows two or more threads to access its contents in parallel by using lock-free techniques, so when the contents are already in the dictionary no lock is needed and two or more threads can really access the same item in parallel.

There are two downsides to its implementation. The first one is that it uses a Thread.MemoryBarrier(), which completely flushes the CPU caches and which is not really needed considering the .NET strong memory model (and, if it were needed, it would still be bugged, as it is done in the wrong place). The second downside, and this is a big one, is that when a value isn't in the cache it is generated outside of a lock, which means that two or more threads can generate a value for the same key. Even though every thread receives the same instance in the end, the other values generated in parallel are lost. This can be very problematic if such values must be disposed or if they actually can't be created in parallel.
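The second downside is easy to demonstrate. In this small program (my own code, not from the article) several threads race on GetOrAdd() for the same key, and the factory may run more than once even though only one value survives:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

public static class GetOrAddRace
{
    public static void Main()
    {
        var dictionary = new ConcurrentDictionary<int, object>();
        int creations = 0;

        Parallel.For(0, 8, _ =>
        {
            dictionary.GetOrAdd(42, key =>
            {
                // The factory runs outside the dictionary's lock, so
                // several threads may enter here for the same key.
                Interlocked.Increment(ref creations);
                Thread.Sleep(50); // widen the race window
                return new object();
            });
        });

        // "creations" may be greater than 1, even though the dictionary
        // only ever contains a single value for key 42. The extra
        // instances are simply lost.
        Console.WriteLine(creations);
    }
}
```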
So, the solution was to create a dictionary that is lock-free for reads, like the ConcurrentDictionary<TKey, TValue>, but one that holds a lock while creating values. That's what I presented in the other article, but it was a big dictionary if we only need the cache functionality. In fact, because of the other functionalities it was not really optimized for this use case, taking an upgradeable lock (which only exists to support the other methods) and then upgrading it. This means that a class without those other methods could avoid the UpgradeableLock (which was effectively working as a full lock anyway) and go for a full/write lock directly.

The ThreadSafeGetOrCreateValueDictionary<TKey, TValue> is the new implementation, optimized for real parallelism and CPU cache utilisation, and built to avoid creating two values for the same key. It does lock-free reads and, when a value needs to be added, it uses a full lock (the normal Monitor lock) to create and add it, so it is impossible to create two values for the same key and lose one. And, by using the normal lock, it is also capable of being used recursively (this may happen if the delegate used to create a value needs to read another value from the dictionary, as in a recursive Fibonacci implementation).
It also has methods to Set() a value, Remove() a key and Clear() the dictionary, but nothing else. It is not intended to be seen as a "collection"; it is intended to be used as a cache.
The code for this optimized dictionary is this:
public sealed class ThreadSafeGetOrCreateValueDictionary<TKey, TValue>
{
    private sealed class _Node
    {
        internal _Node(int hashCode, _Node nextNode, TKey key, TValue value)
        {
            _hashCode = hashCode;
            _nextNode = nextNode;
            _key = key;
            _value = value;
        }

        internal readonly int _hashCode;
        internal _Node _nextNode;
        internal readonly TKey _key;
        internal readonly TValue _value;
    }

    private readonly object _lock = new object();
    private readonly Func<TKey, TValue> _creator;
    private readonly IEqualityComparer<TKey> _comparer;
    private _Node[] _baseNodes;
    private int _count;

    public ThreadSafeGetOrCreateValueDictionary(Func<TKey, TValue> creator):
        this(creator, 31, null)
    {
    }

    public ThreadSafeGetOrCreateValueDictionary(Func<TKey, TValue> creator, int capacity, IEqualityComparer<TKey> comparer)
    {
        if (creator == null)
            throw new ArgumentNullException("creator");

        capacity = DictionaryHelper.AdaptSize(capacity);

        if (comparer == null)
            comparer = EqualityComparer<TKey>.Default;

        _creator = creator;
        _comparer = comparer;
        _baseNodes = new _Node[capacity];
    }

    public TValue GetOrCreateValue(TKey key)
    {
        if (key == null)
            throw new ArgumentNullException("key");

        int hashCode = _comparer.GetHashCode(key) & int.MaxValue;

        // Lock-free read: copy the field so the length and the contents
        // are read from the same array, even if a resize happens in parallel.
        var baseNodes = _baseNodes;
        int bucketIndex = hashCode % baseNodes.Length;
        var node = baseNodes[bucketIndex];
        while(node != null)
        {
            if (hashCode == node._hashCode)
                if (_comparer.Equals(key, node._key))
                    return node._value;

            node = node._nextNode;
        }

        // Not found: lock and search again, as another thread
        // may have added the item in the meantime.
        lock(_lock)
        {
            bucketIndex = hashCode % _baseNodes.Length;
            node = _baseNodes[bucketIndex];
            while(node != null)
            {
                if (hashCode == node._hashCode)
                    if (_comparer.Equals(key, node._key))
                        return node._value;

                node = node._nextNode;
            }

            if (_count >= _baseNodes.Length)
            {
                _Resize();
                bucketIndex = hashCode % _baseNodes.Length;
            }

            TValue value = _creator(key);
            var newNode = new _Node(hashCode, _baseNodes[bucketIndex], key, value);
            _baseNodes[bucketIndex] = newNode;
            _count++;
            return value;
        }
    }

    private void _Resize()
    {
        int newSize = DictionaryHelper.AdaptSize(_baseNodes.Length * 2);
        var newNodes = new _Node[newSize];
        foreach(var baseNode in _baseNodes)
        {
            var oldNode = baseNode;
            while(oldNode != null)
            {
                int hashCode = oldNode._hashCode;
                int bucketIndex = hashCode % newSize;
                var newNode = new _Node(hashCode, newNodes[bucketIndex], oldNode._key, oldNode._value);
                newNodes[bucketIndex] = newNode;
                oldNode = oldNode._nextNode;
            }
        }

        // Publish the new buckets only after they are completely filled.
        _baseNodes = newNodes;
    }

    // The Set(), Remove() and Clear() methods are omitted here for brevity.
}
And you use it like this:
var dictionary = new ThreadSafeGetOrCreateValueDictionary<int, int>(DelegateToGenerateTheValue);
int value = dictionary.GetOrCreateValue(57);
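And, because the class uses the normal (re-entrant) Monitor lock, the creation delegate may call back into the dictionary. As an illustration of the recursive use mentioned earlier (my own sketch, built on the class presented above), a memoized Fibonacci could look like this:

```csharp
// A memoized Fibonacci using the dictionary. The creation delegate
// re-enters GetOrCreateValue(), which only works because the same
// thread can acquire the Monitor lock recursively.
public static class FibonacciCache
{
    private static readonly ThreadSafeGetOrCreateValueDictionary<int, long> _cache =
        new ThreadSafeGetOrCreateValueDictionary<int, long>(Create);

    private static long Create(int n)
    {
        if (n <= 1)
            return n;

        // Recursive calls go through the cache again, so each value
        // is computed only once.
        return _cache.GetOrCreateValue(n - 1) + _cache.GetOrCreateValue(n - 2);
    }

    public static long Get(int n)
    {
        return _cache.GetOrCreateValue(n);
    }
}
```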
Why does the ThreadSafeGetOrCreateValueDictionary<TKey, TValue> receive the delegate in its constructor? There are two reasons:

- It is expected that we always use the same logic to generate values, so giving it in the constructor guarantees that it is always the same logic;
- When we write anonymous lambdas that capture variables, the compiler generates a new delegate instance per call instead of reusing a single delegate instance but, as we give the delegate to the constructor, it will be created only once instead of being created at each call to the GetOrCreateValue() method.

In fact, the first item is the important one. The second is only a nice consequence, as those allocations are so fast that I can't consider them important.
Multi "column" keys
A problem I often see is related to multi-column keys. Normal dictionaries only support a single key, and the solution I am presenting is in the same situation. I often see people writing specialized dictionaries that try to deal with those multiple columns, many times using a .NET dictionary where the key is the first column and the value is another dictionary that uses the second column as its key, but the simplest and best performing solution is to use a Tuple<AnyNumberOfArguments> as the key. So, don't worry: even though you can only use a single object as the key, that object can be composed of other objects, allowing you to have as many "columns" as needed in the key.
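As a short sketch of the idea (my own example; the column names are illustrative), a two-column key per (customerId, year) pair works out of the box, because Tuple implements Equals() and GetHashCode() over all of its components:

```csharp
using System;
using System.Collections.Generic;

// A cache keyed by two "columns": customer id and year.
var cache = new Dictionary<Tuple<int, int>, decimal>();
cache.Add(Tuple.Create(57, 2013), 199.90m);

// A different Tuple instance with the same component values finds the entry.
decimal total = cache[Tuple.Create(57, 2013)];
```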
KeyValuePair<TKey, TValue> problem
For years I used the KeyValuePair<TKey, TValue> as a dictionary key when I needed a two-column key. The worst part is that I did performance tests and it seemed to work great but, well, I was doing a bad thing for years.

I always believed the KeyValuePair<TKey, TValue> implemented the GetHashCode() and Equals() methods, but it doesn't!

This means that only the first field is used to generate the hash code (so, only the Key of the pair is used). In my cases it was usually not a performance problem because I didn't usually have many repetitions of the first column, but it means that every item that has the same value for the first column will end up in the same bucket.
So, to solve this problem I created the Pair<TKey, TValue> struct. It is very similar to the KeyValuePair<TKey, TValue>, but it implements GetHashCode() and the equality methods (the one coming from the System.Object type and the one from the IEquatable<T> interface). So, this is the ideal type to use when you want two-column keys.
The code for the struct is this:
public struct Pair<TKey, TValue>:
    IEquatable<Pair<TKey, TValue>>
{
    public Pair(TKey key, TValue value)
    {
        _key = key;
        _value = value;
    }

    private readonly TKey _key;
    public TKey Key { get { return _key; } }

    private readonly TValue _value;
    public TValue Value { get { return _value; } }

    public override int GetHashCode()
    {
        int hashCode = 0;
        if (_key != null)
            hashCode = _key.GetHashCode();

        // Rotate the key's hash so the key and value hashes don't cancel out.
        hashCode = (hashCode << 16) | (hashCode >> 16);

        if (_value != null)
            hashCode ^= _value.GetHashCode();

        return hashCode;
    }

    public override bool Equals(object obj)
    {
        if (obj is Pair<TKey, TValue>)
            return Equals((Pair<TKey, TValue>)obj);

        return false;
    }

    public bool Equals(Pair<TKey, TValue> other)
    {
        return
            EqualityComparer<TKey>.Default.Equals(_key, other._key) &&
            EqualityComparer<TValue>.Default.Equals(_value, other._value);
    }

    public override string ToString()
    {
        return string.Concat(_key, ", ", _value);
    }
}
And you can test the performance difference by using this sample:
var stopwatch = new Stopwatch();

stopwatch.Start();
var hashsetOfPair = new HashSet<Pair<int, int>>();
for(int i=0; i<1000; i++)
    for(int j=0; j<1000; j++)
        hashsetOfPair.Add(new Pair<int,int>(i, j));
stopwatch.Stop();
Console.WriteLine("Pair: " + stopwatch.Elapsed);

stopwatch.Restart();
var hashsetOfKeyValuePair = new HashSet<KeyValuePair<int, int>>();
for(int i=0; i<1000; i++)
    for(int j=0; j<1000; j++)
        hashsetOfKeyValuePair.Add(new KeyValuePair<int,int>(i, j));
stopwatch.Stop();
Console.WriteLine("KeyValuePair: " + stopwatch.Elapsed);
Actually, the version with Pair<int, int> takes less than a second to complete on my computer, while the KeyValuePair<int, int> version takes more than 40 seconds.
Lesson learned: always check whether a struct implements the GetHashCode() method. And this reminds me that I always considered GetHashCode() on the System.Object class to be a design error; it would be better if it only existed in an interface (inside IEquatable<T> would be a good place) that should be implemented by types that can be used as hash-table keys. If that were the case, a KeyValuePair<TKey, TValue> would not be a valid type for a dictionary key and I wouldn't have committed such a mistake (and I imagine many other people wouldn't either).
Usually I use this kind of solution to create static caches.

I know, some people immediately see a problem there: everything that's put in the cache will stay alive for the entire life of the application. So, is that right?

I can say that I usually use these caches to keep delegates that access fields and properties in memory and, as there is a limited number of types and properties I will use, I don't see a problem if they are kept alive for the entire application's lifetime. We can say the code of the application itself almost fits the same category.
But for other caches it may be a problem. This is why I also created a weak version of it: the WeakGetOrCreateValueDictionary<TKey, TValue>.

The principle is the same: a class with only a few cache-oriented methods, like GetOrCreateValue() and Remove(). But the generated values are stored in WeakReferences, so they can be collected.

The thing that complicates this solution is the fact that the dictionary is not notified when items are collected. I thought about using one of my old solutions so it would get notified, but I decided to do that analysis in the _Resize() method instead. Every time the dictionary must be resized it counts how many items are still alive (recalculating the _count) and, depending on the situation, it simply removes the collected nodes. So, there's a small risk of keeping a large array of WeakReferences in memory if you put lots of items in the dictionary (which are eventually collected) and never use the dictionary again, but I really think it works for the situations where it is going to be used, and the WeakReferences are probably much smaller than the items that were collected.
I am not going to put its code here in the article's text, but you can download the sample and check it there.
There's an extra class that you will probably like to use with the WeakGetOrCreateValueDictionary<TKey, TValue>. The purpose of the GCKeepAliver is to keep items alive during the next collection (usually the next full collection). The reason for this is simple: it is not important how often you use a cached item; if at the moment of a garbage collection it is not in use, it will be collected.

Now let's be extremist. Imagine that you have an important object that takes one minute to generate. After its generation you use it once per second, but you keep a strong reference to it for only a millisecond each time. If a collection happens in any of the other 999 milliseconds of a second, the object will be collected. I am sure you don't want that.

Even though I am using an extreme situation, this is something that happens when using weak references. So, if you want to guarantee that recently used objects are kept alive for some time, call GCKeepAliver.KeepAlive() giving it such an object, and the object will survive the next collection. After surviving the next collection, if it is used again, it will survive yet another collection. This means that those objects don't have a specific time to live, yet they will be kept in memory if they are used frequently, and at the same time they will be collected if they aren't used between two garbage collections. The best part is that, by being collection-related instead of time-related, they are still allowed to be collected if the computer is running really short on memory and doing garbage collections frequently.

The WeakGetOrCreateValueDictionary doesn't call GCKeepAliver.KeepAlive() by default on purpose, as you may want to use a different technique to keep items alive, if you want to do that at all.
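Since the GCKeepAliver code is only in the download, here is a rough sketch (my own code, almost certainly different from the actual implementation) of how such a "survive the next collection" helper could work:

```csharp
using System;
using System.Collections.Generic;

// Sketch: hold strong references to recently used items until a full (gen 2)
// collection has happened since their last use. All names and the exact
// policy here are my assumptions, not the article's code.
public static class GCKeepAliverSketch
{
    private static readonly object _lock = new object();
    private static HashSet<object> _aliveItems = new HashSet<object>();
    private static int _lastGen2Count = GC.CollectionCount(2);

    public static void KeepAlive(object item)
    {
        lock(_lock)
        {
            // If a full collection happened since the last call, drop the old
            // set: items not "kept alive" again since then become collectable.
            int gen2Count = GC.CollectionCount(2);
            if (gen2Count != _lastGen2Count)
            {
                _lastGen2Count = gen2Count;
                _aliveItems = new HashSet<object>();
            }

            _aliveItems.Add(item);
        }
    }
}
```

Note that this sketch only discards the old set when KeepAlive() is called again; a more complete version would also react to collections on its own.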
The sample application is included only to show how to use the dictionary methods and to show that it is really fast; it is not doing anything useful. It is up to you to use the solution in your applications if you need a cache that's weak, or one that can be accessed in parallel without the problems of the ConcurrentDictionary<TKey, TValue> (if those are a problem for you, or even if you are just not sure).
Understanding the code - Advanced
If you stop reading the article here you can already use the dictionary to cache your data. This topic is more advanced than the rest of the article, but it can be useful if you want to understand how the dictionary works.

To understand this code we must understand what a hash code is, how it can be used to index values and how lock-free programming works. So...

Imagine that you have this index for a phonebook:

[Phonebook index: each starting letter points to a page number; both A and P point to page 57.]

I know it is not ordered and this is not how phonebooks are organised, yet it is valid for my purpose. Also consider that inside a page the names aren't ordered either, but all the names for a given starting letter are always put on a single page.

So, if you want to look for my name (Paulo), will you look for it at page 2? Or page 30? I am pretty sure that you will look directly at page 57. You may lose time if there are too many names on page 57, and you may need to read the entire page if my name is not there, only to be sure it is really not there. You will also see that there are names starting with A on this page, yet you don't need to lose time reading pages 50, 20, 15 etc.

So we can say that this is very similar to how hash codes work. A hash code doesn't guarantee the data will be ordered and, if there are many items with the same hash code (or that end up on the same page [known as a bucket]), we may spend too much time looking for the right one, effectively having to look at all the items on that "page" if the key is not there.

But differently from letters (26 if we use English letters), we have 32-bit hash codes, which gives us more than 4 billion values, so collisions are expected to be less common. But this is one of the reasons that make the hash code very important: if only the first letter were used to generate the hash codes of strings, we would only generate 26 different values for all the possible names, effectively having bad indexing (don't worry, .NET strings have a good hash code).
The HashCode and the Bucket
I put the letter A and the letter P on the same page for a reason. When we use dictionaries those pages (buckets) are pre-allocated, but we don't expect to pre-allocate 4 billion buckets for those 32-bit values. In the implementation already presented I start with 31 buckets. That's why the bucketIndex is calculated as hashCode % _baseNodes.Length. The _baseNodes.Length is the number of buckets/pages that we have. So, hash codes ranging from 0 to 30 will have a bucketIndex equal to the hash code itself. But 31 % 31 gives 0, so items with a hash code of 31 will also be in bucket index 0 (and items with 62 or any other multiple of 31 will also be in bucket/page 0).

For example, if our keys are ints (and ints return themselves as their hash codes) that increase by 10, we will have the following result:

[Table: keys increasing by 10 and their resulting bucket indexes (hashCode % 31).]

I think it is already possible to see that, even though we jump values 10 by 10, the tendency is to use all the buckets (you can see that we did not have a single collision yet). Eventually we will have to increase the number of buckets (after all, with 31 buckets, if we have 100 well-distributed items we expect to have around 3 items per bucket, but increase the number of items to 10000 and we expect around 300 items per bucket).
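These bucket calculations can be checked with a couple of lines (my own example, using the same 31-bucket rule as the dictionary's initial capacity):

```csharp
using System;

int bucketCount = 31;

// Hash codes 0..30 map directly to their own bucket index.
Console.WriteLine(17 % bucketCount);  // 17

// Multiples of 31 all share bucket 0.
Console.WriteLine(31 % bucketCount);  // 0
Console.WriteLine(62 % bucketCount);  // 0

// Keys increasing 10 by 10 still spread over different buckets.
Console.WriteLine(10 % bucketCount);  // 10
Console.WriteLine(40 % bucketCount);  // 9
Console.WriteLine(100 % bucketCount); // 7
```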
Increasing the number of buckets
Actually, every time the _count becomes bigger than the number of buckets we resize the buckets array. This process requires all nodes to be visited to recalculate each node's bucket index for the new number of buckets.

It is not mandatory to resize when the _count becomes bigger than the bucket count. I do this because, for very well distributed hash codes (for example, when we use int values that increase by one each time), it means that we will never have a bucket collision. In some situations it could be better to use different resize rules, like only resizing when the count is bigger than 2 times the number of buckets, but this is something my code is not prepared to deal with.

When resizing I try to use prime numbers for the bucket count. This helps to create a good distribution of values when the hash codes are multiples of some value. For example, imagine that instead of 31 we had 30 buckets. If the hash codes were increasing 10 by 10, then the values 0, 30, 60 etc. would end up in the same bucket (the same applies to the values 10, 40, 70), while all the buckets whose indexes aren't multiples of 10 would be empty. This is bad.

Of course we will have problems with 31 buckets if all the hash codes are multiples of 31, yet it is more common to have values that are multiples of something smaller (usually 2, 10 or 16). Also, even with hash codes that are multiples of 31, we will solve the problem when resizing, as the next size will not be a multiple of 31.

But if you look at the code you will see that I am not limiting the results to real primes, as that would take too much time and go against the purpose of having a fast dictionary. So I only search for a value that's not divisible by any of the primes from 2 to 31.
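The DictionaryHelper.AdaptSize() method itself is only in the downloadable code. A possible implementation following the rule just described (my assumption, not necessarily the author's exact code) is:

```csharp
using System;

// Sketch of a capacity helper: returns a value >= capacity that is not
// divisible by any prime up to 31 (a cheap stand-in for a real prime search).
public static class DictionaryHelper
{
    private static readonly int[] _primes = { 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31 };

    public static int AdaptSize(int capacity)
    {
        if (capacity < 31)
            return 31;

        // Start at an odd value and search upwards, skipping even numbers.
        for(int candidate = capacity | 1; ; candidate += 2)
        {
            bool acceptable = true;
            foreach(int prime in _primes)
            {
                if (candidate % prime == 0 && candidate != prime)
                {
                    acceptable = false;
                    break;
                }
            }

            if (acceptable)
                return candidate;
        }
    }
}
```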
Why is there that hashCode & int.MaxValue?
If we have a negative hashCode, then hashCode % someValue gives a negative number, which is invalid as a bucket index. We could use Math.Abs() or we can use & int.MaxValue to lose the sign bit. I didn't test it again but, the last time I checked, using the & operator was faster.
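A tiny check (my own example) of why the & is needed:

```csharp
using System;

int hashCode = -5;

// In C#, % keeps the sign of the dividend, so this is a negative,
// invalid bucket index:
Console.WriteLine(hashCode % 31);                  // -5

// Clearing the sign bit first gives a valid, non-negative index:
Console.WriteLine((hashCode & int.MaxValue) % 31); // 28
```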
Comparing the hash codes before comparing the items
If you look at the code you will see that I compare the hash code stored in the node with the hash code of the key I am searching for before comparing the key itself. I also store the hash code in the node itself.

This is done for performance reasons. We don't know how much time the key takes to calculate its hash code, so by storing it in the node we avoid a possible performance hit; even when calculating the hash code is fast, we avoid a virtual call. We also don't know how much time the Equals() method takes to compare the objects. For an int we can say that first checking the hash code and then checking the key is slower than comparing the value directly, but that's the fast case. If Equals() is slow, we don't want to lose time comparing keys that have different hash codes, as they are guaranteed to be different.
If you arrived here you probably understand why the hash codes are important and why I did those "strange calculations" to get the bucketIndex and the number of buckets when resizing. But now we have the other part of the problem: the lock-free reading.

For example, if you look at the _Resize() method you will see that new nodes are created. If we didn't have lock-free readers we could simply change the _nextNode of the existing nodes instead of creating new ones (the garbage collector would love that). In fact, we could do something more similar to what the normal Dictionary<TKey, TValue> does by using structs instead of classes for the nodes, so all the nodes would be stored in a single array and count as a single object, which is again better for the garbage collector.

But we can't change our objects by using non-atomic operations and, even though changing the _nextNode reference is an atomic operation, we must be sure that all the items that enter a new bucket are changed together and, as we can't do that, we create new nodes that will not be seen until we do the:

_baseNodes = newNodes;
Many people believe that we must mark our fields as volatile to do lock-free reads, or that we should use the Volatile methods, but we don't. .NET has a strong memory model, which means it is impossible for a different thread to see _baseNodes referencing a new array without also seeing the right contents that were put inside it before the _baseNodes = newNodes; assignment.

Also, some people are worried that, by not doing a volatile read, we may not see an item that was just added by another thread. And that's true, but that's why we do a first read and, if we don't find the item, we take a lock and search for it again. If the item was just added, our lock guarantees that we will see the latest version of the items in memory, and also that no-one else will be able to add an item at the same time.
So, what do we need to do the lock-free read?

- We must access the _baseNodes field only once per read. That's why I copy _baseNodes into a local variable and then get the length and the contents from that local;
- We must guarantee that items don't change in a non-atomic manner. Actually, the Set() method could change the _value of nodes directly for reference types and for primitives the size of an int or smaller (long is safe only on 64-bit processors), but it can't for other value types, so we simply consider that we don't have such a guarantee;
- We must guarantee that changes to the _baseNodes contents are atomic. That's why our _Node type is a class (a reference type) and not a struct. .NET will never allocate an array of references with the wrong alignment (which would cause the reads and writes to be non-atomic);
- We must also guarantee that, when we resize, we only publish the new nodes array when it is completely filled. That's why I keep the new array in a local variable until the resize is finished, and only then set _baseNodes to the new array.

All changes that can be seen by the lock-free reads must also be atomic. This affects all the methods that can change the contents of the dictionary, independently of whether those methods use locks (as a lock only protects other code that uses the same lock). Considering our methods, we have such a guarantee because:
- The Clear() method only replaces base nodes with null (an atomic operation), and the _count is never seen by the lock-free reads;
- The Remove() method only changes the base node (or the _nextNode field of a previous node) to reference the _nextNode of the removed one, and replacing a reference is an atomic operation;
- The Set() method guarantees that the new node is completely filled before making it accessible;
- And, as those changes can be problematic relative to each other (for example, between the search and the addition of a new node, another one could've been added), they all use a lock to protect themselves from each other.
In the beginning of the article I said that the ConcurrentDictionary<TKey, TValue> has the downside of using Thread.MemoryBarrier() and that the barrier is in the wrong place. To explain it in detail: the GetOrAdd() method first uses the TryGetValue() method. That method gets the first item in the bucket, then calls Thread.MemoryBarrier() and then processes the entire bucket.

The problems are:

- The Thread.MemoryBarrier() does a full fence. This means that all data that was modified by the current thread (and consequently the CPU) must be sent to the main memory, and also that the local caches used for reads must be invalidated. In .NET 4.5 we have the Volatile class that allows us to use half fences (in this case, a Volatile.Read() would do a better job);
- Even the Volatile.Read() is not necessary under the .NET strong memory model. We are not reading two or more fields, or accessing two or more array items, in a specific order that must be enforced because other threads change them in that same order. We are accessing an item and then the contents of that item and, if the item is becoming available to this thread/CPU for the first time, it is the responsibility of .NET to guarantee that its contents are all OK. If .NET didn't guarantee this then we could actually read real garbage from memory, not only "uninitialized" objects, which goes against .NET being safe. So, it may be an implementation detail, but this is a guarantee of the Microsoft .NET implementation, and I really believe all other implementations should give the same guarantee;
- If you still believe my code is bugged by not using a barrier or the Volatile.Read(), then you should know that the ConcurrentDictionary<TKey, TValue> is bugged too. It only uses a single Thread.MemoryBarrier() before starting to process a bucket, not between one item and the next and, well, it is possible to change the values of existing items. Such a change can affect the intermediate nodes and so, if the barrier were needed, it should be inside the loop, not outside it.
Finally, if you want, you can check this link: http://joeduffyblog.com/2007/11/10/clr-20-memory-model/. To me the most important rule is:

- Rule 1: Data dependence among loads and stores is never violated.

This rule is actually everything that I need in my implementation, as I don't try to change more than one field in any particular order. The only important thing is to publish a new object and have its contents correctly visible to other threads.

And if you still don't believe me, then change the code to use the Volatile.Read() method available in .NET 4.5 (or the Thread.VolatileRead() available in previous versions) when reading the bucket array, one of its items or the _nextNode. Note that simply marking the array field as volatile would not guarantee that its items are accessed as volatile.