Before .Net 4.0 we had no choice. If we wanted to use dictionaries from multiple threads, we had to take care of the synchronization ourselves.
I am pretty sure many of us created our own thread-safe implementations, be it by creating entire dictionary types (less probable) or by simply creating a class that had a dictionary and wrapped all the methods with a lock.
But now we have the ConcurrentDictionary. The Thread Safety section of the MSDN Dictionary documentation indicates that, if you want a thread-safe alternative, you should see the ConcurrentDictionary.
So, now there is a replacement for the dictionary that is already thread-safe, and we can stop writing our own implementations. That's great!
I think I only really tried to use the ConcurrentDictionary once. In my initial speed test it did great, so I immediately used it as a replacement in one of my classes, did some tests... and started to get exceptions.
So, what's wrong? Isn't it thread safe already?
In my tests I discovered what was wrong, but the MSDN documentation of the GetOrAdd method that receives a delegate does not have a Remarks section for version 4 of the framework. But check the documentation for version 4.5. It has a Remarks section that says:
"If you call GetOrAdd simultaneously on different threads, addValueFactory may be called multiple times, but its key/value pair might not be added to the dictionary for every call."
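The behavior described in that remark can be reproduced with a small sketch. The class name GetOrAddRaceDemo and the Barrier used to keep both threads inside the factory at the same time are assumptions made only for this illustration; the only thing taken from the text is ConcurrentDictionary.GetOrAdd with a delegate.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

public static class GetOrAddRaceDemo
{
    // Calls GetOrAdd for the same key from two threads and returns how many
    // times the factory ran. sameValueReturned tells whether both callers
    // received the single value that was actually stored.
    public static int Run(out bool sameValueReturned)
    {
        var dictionary = new ConcurrentDictionary<int, object>();
        int factoryCalls = 0;
        object first = null, second = null;

        using (var bothInsideFactory = new Barrier(2))
        {
            Func<int, object> factory = key =>
            {
                Interlocked.Increment(ref factoryCalls);
                // Keep this thread inside the factory until the other thread
                // is inside it too (the timeout only guards against a hang).
                bothInsideFactory.SignalAndWait(TimeSpan.FromSeconds(10));
                return new object();
            };

            var t1 = new Thread(() => first = dictionary.GetOrAdd(1, factory));
            var t2 = new Thread(() => second = dictionary.GetOrAdd(1, factory));
            t1.Start();
            t2.Start();
            t1.Join();
            t2.Join();
        }

        sameValueReturned = ReferenceEquals(first, second);
        return factoryCalls;
    }
}
```

In this sketch both threads create a value, but only one of the two objects ends up in the dictionary and is returned to both callers. The other object is silently dropped.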
And there was my problem. At the time I had to do many tests before figuring it out as such documentation didn't exist. My problem with this approach is that I generally use a dictionary to cache data that:
- Is very slow to create;
- Can't be created twice, be it because the second time it will throw an exception or because it will leak resources if created two or more times.
And the second situation is the problem. If two threads see that the value does not exist, both create the value, but only one result will really be stored and returned. And what about the other?
If the creation throws an exception, you have a problem that you may solve with a try-catch (it is not elegant, but it works), but what about a resource that is created and never collected?
You can say that an object created without any references will always be collected. But, think again, as one of these situations may be happening:
- You are emitting code. I do that in my remoting framework and I use a single non-collectible assembly for all implementations. If I create two types instead of one, both will be there forever, even if one is never used.
- You are creating another thread, directly or indirectly. It is possible that you create a component that has its own dedicated thread to process messages asynchronously, but in the order they are received. So, you create the component and it creates a thread. You dispose the component and it finishes the thread. But if the component is simply lost, nobody disposes it and, as the thread it created holds a reference to it, neither the thread nor the component will die.
- You are doing P/Invokes and receiving handles that must be closed the same number of times they are opened;
- I am pretty sure there are a lot of other possible situations. Maybe the dictionary is there simply to hold references to services requested at a specific server, and you should never request two identical services at the same server or it will log that you are doing bad things (I worked at a company where such a situation could generate legal penalties).
So, it is easy to see that you can't blindly replace a dictionary + locks with a ConcurrentDictionary, even if the documentation says it is the thread-safe equivalent.
Maybe you are not sure that we won't have the same problem with a normal dictionary. Well, depending on the implementation that may also happen, but let's see one of the simplest strategies:
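lock (dictionary) // full lock held for the search, the creation and the add
    if (!dictionary.TryGetValue(key, out result))
        dictionary.Add(key, result = createValue(key));
return result;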
In this situation, we hold a full lock while we do the search. If the item we search is not there, we create it, while still holding the full lock, we add it to the dictionary and we finally release the lock and return the result. If two threads are searching for the same value at the same time, one will win and do the entire job while the other will simply wait. Then the other gets the result just created and never tries to create a new result.
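The strategy just described can be written as a small self-contained class. This is only a sketch: the name FullLockCache and its constructor are hypothetical, and locking on the private dictionary instance is just one possible choice of lock object.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical wrapper: a plain Dictionary guarded by one exclusive lock.
public sealed class FullLockCache<TKey, TValue>
{
    private readonly Dictionary<TKey, TValue> _dictionary = new Dictionary<TKey, TValue>();
    private readonly Func<TKey, TValue> _createValue;

    public FullLockCache(Func<TKey, TValue> createValue)
    {
        if (createValue == null)
            throw new ArgumentNullException("createValue");

        _createValue = createValue;
    }

    public TValue GetOrCreate(TKey key)
    {
        // The lock is held during the search, the creation and the add,
        // so the factory can never run twice for the same key.
        lock (_dictionary)
        {
            TValue result;
            if (!_dictionary.TryGetValue(key, out result))
            {
                result = _createValue(key);
                _dictionary.Add(key, result);
            }

            return result;
        }
    }
}
```

The price of that guarantee is that every other caller, even one asking for an unrelated key, waits while a value is being created.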
Much better, don't you think?
Not really. I don't have a problem if two instances are created in parallel as long as only one is used.
Ok. The situation I was presenting is not always a problem. You can simply create two instances in parallel and discard one. So, how does the ConcurrentDictionary perform compared to the normal dictionary + locks?
And the answer is: That depends on the lock strategy and on your usage of the dictionaries.
First, even if you can create the same value twice, what are the chances that two threads really try to create the same value at the same time?
Second, how much time will they lose creating such a value?
I can very easily build an example where creating the value takes 10 seconds. Five seconds into the creation, another thread tries to GetOrAdd the same item, also starting to create the value.
In this case, we will have 2 CPUs working in parallel for 5 seconds, then the first thread finishes and the second thread continues to create its value for five more seconds... only to discover that there is a value there already and use it, simply losing the value it created.
If the second thread simply waits, it will let the second CPU do other things (good for other unrelated threads, applications or battery consumption) and it will get the good result after 5 seconds, not after 10.
So, the winner in this situation is the normal dictionary + a full lock.
That's a false situation
Ok, that's a forced example. It is not really a false situation, but too extreme for normal dictionary uses. So, what will happen if the first thread is creating an item (10 seconds) and another thread decides to read an unrelated item, that's already there?
Well, with the ConcurrentDictionary that will be possible, as there is no lock holding readers. With a normal dictionary and a full lock, the reader must wait simply because the lock is exclusive, even if it wants to read a completely unrelated bucket.
The ConcurrentDictionary is winning here.
Note: I am considering that you already know about dictionary buckets and nodes/entries. If you don't, I suggest the article Understanding Generic Dictionary in-depth by Ofir Makmal, as he did a good job explaining them and I really don't want to lose the focus of this article by explaining them again.
Multiple Readers, Single Writer Lock
But, what about a multiple readers, single writer lock instead of a full lock over a normal dictionary?
Well, if the thread creating the value holds an upgradeable lock until the moment the value is finally there to upgrade the lock type to write lock, then reads can be done in parallel.
We will solve our issue with a reader waiting for 10 seconds for nothing. But, if you do many more reads than writes, you will see that the ConcurrentDictionary is still faster, as it is implemented to use lock-free reading, while the ReaderWriterLockSlim is terrible for dictionary reads. It is usually preferable to use full locks for reading dictionaries than to use the ReaderWriterLockSlim.
The ConcurrentDictionary wins again.
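As a sketch of the upgradeable lock approach, assuming the standard ReaderWriterLockSlim from System.Threading: the class name ReaderWriterCache is hypothetical, and the initial attempt under a plain read lock is an extra step added here so that hits never compete for the single upgradeable slot.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Hypothetical wrapper: a plain Dictionary guarded by a reader/writer lock.
public sealed class ReaderWriterCache<TKey, TValue>
{
    private readonly Dictionary<TKey, TValue> _dictionary = new Dictionary<TKey, TValue>();
    private readonly ReaderWriterLockSlim _lock = new ReaderWriterLockSlim();
    private readonly Func<TKey, TValue> _createValue;

    public ReaderWriterCache(Func<TKey, TValue> createValue)
    {
        _createValue = createValue;
    }

    public bool TryGetValue(TKey key, out TValue value)
    {
        _lock.EnterReadLock();
        try { return _dictionary.TryGetValue(key, out value); }
        finally { _lock.ExitReadLock(); }
    }

    public TValue GetOrCreate(TKey key)
    {
        TValue result;
        if (TryGetValue(key, out result))
            return result;

        // Only one thread at a time can hold the upgradeable lock,
        // but plain readers are still allowed in while it is held.
        _lock.EnterUpgradeableReadLock();
        try
        {
            // Search again: another thread may have added the value meanwhile.
            if (_dictionary.TryGetValue(key, out result))
                return result;

            result = _createValue(key); // readers are not blocked here

            _lock.EnterWriteLock();     // readers are blocked only for the add
            try { _dictionary.Add(key, result); }
            finally { _lock.ExitWriteLock(); }

            return result;
        }
        finally { _lock.ExitUpgradeableReadLock(); }
    }
}
```

With this shape a 10 second creation no longer stops readers of unrelated items, although each read still pays for entering and leaving the read lock.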
Note: I already presented the YieldReaderWriterLock and the YieldReaderWriterLockSlim classes in the article Managed Thread Synchronization. By using that reader writer lock I get a considerable speed gain with the lock itself (and I have since evolved it into the SpinReaderWriterLockSlim), allowing many reads to be done in parallel with almost no impact. I still use it personally, but the lock-free reading of the ConcurrentDictionary is still faster.
Multiple writes to different buckets
The story doesn't end there. What will happen if we have many items to add, all of them with different and non-colliding keys and buckets?
This one surprised me at first, but I was using a bad example. I was using an int, int dictionary, and my factory was immediately returning the -key (negative key) as the result.
I expected the ConcurrentDictionary to be the fastest one here, but it was the worst one. Any normal dictionary + any locking was doing better. So, why?
Well, the way the ConcurrentDictionary allocates the nodes and puts them into its buckets is different. It is optimized to allow lock-free reading. But, when items are added, allocating such a node is expensive.
Even if it could add many items in parallel, the allocation of those nodes was consuming more time than using a full lock.
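A rough way to repeat this kind of measurement is sketched below. It is not the sample program of the article: the class name AddBenchmarkSketch, the use of Parallel.For and the Stopwatch timing are all assumptions, and the numbers will vary with the machine, the runtime version and the item count.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading.Tasks;

public static class AddBenchmarkSketch
{
    // Adds 'count' distinct keys in parallel using GetOrAdd with a trivial factory.
    public static TimeSpan TimeConcurrentDictionary(int count, out int itemsStored)
    {
        var dictionary = new ConcurrentDictionary<int, int>();
        var watch = Stopwatch.StartNew();
        Parallel.For(0, count, i => { dictionary.GetOrAdd(i, key => -key); });
        watch.Stop();

        itemsStored = dictionary.Count;
        return watch.Elapsed;
    }

    // Adds the same keys in parallel to a normal dictionary under a full lock.
    public static TimeSpan TimeDictionaryWithFullLock(int count, out int itemsStored)
    {
        var dictionary = new Dictionary<int, int>();
        var watch = Stopwatch.StartNew();
        Parallel.For(0, count, i =>
        {
            lock (dictionary)
            {
                int result;
                if (!dictionary.TryGetValue(i, out result))
                    dictionary.Add(i, -i);
            }
        });
        watch.Stop();

        itemsStored = dictionary.Count;
        return watch.Elapsed;
    }
}
```

Comparing the two returned times for a few million items gives a first impression of the cost of adding items in each approach. A serious comparison would also need warm-up runs and several repetitions.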
Going back to the main problem: Why use a dictionary?
Let's be honest, if we have the delegate that creates the values, and it is instantaneous, we don't need a dictionary, right? We can call the delegate directly, right?
Well... the answer, as always, is that it depends.
Imagine that your key is a string containing the path of a page in your webserver, and the value is of a type that holds the actual number of users in the page + the total number of accesses done since the server started.
Creating such an object with a zero count is almost instantaneous. Later, you don't create a new one, you change the values in it. It can be created twice as long as only one instance is used. But, as the node allocation of the ConcurrentDictionary is slower, you may have a better creation time with a normal dictionary + a lock.
So, with another forced example I show how a normal dictionary is better... but not for long.
Even if the ConcurrentDictionary node allocation is slower, we don't try to put 100 million items into the dictionary in a few seconds. That naturally takes time to happen.
Also, after an item is created, every later access is a read. How its contents are changed is another story. So, it is not important if it took a few more microseconds to create an item. Reads will be faster (ok, by some microseconds too), but they happen much more frequently. So, the ConcurrentDictionary is winning again.
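The page counter scenario could look like the sketch below. The names PageStats and PageTracker and the use of Interlocked for the counters are assumptions made for the illustration; the text only says that the value holds the current number of users and the total number of accesses.

```csharp
using System.Collections.Concurrent;
using System.Threading;

// Hypothetical value type of the dictionary: created empty, then only mutated.
public sealed class PageStats
{
    private int _currentUsers;
    private long _totalAccesses;

    public int CurrentUsers { get { return Volatile.Read(ref _currentUsers); } }
    public long TotalAccesses { get { return Interlocked.Read(ref _totalAccesses); } }

    public void Enter()
    {
        Interlocked.Increment(ref _currentUsers);
        Interlocked.Increment(ref _totalAccesses);
    }

    public void Leave()
    {
        Interlocked.Decrement(ref _currentUsers);
    }
}

public static class PageTracker
{
    private static readonly ConcurrentDictionary<string, PageStats> _pages =
        new ConcurrentDictionary<string, PageStats>();

    public static PageStats Enter(string path)
    {
        // If two threads race here, one empty PageStats is created for nothing
        // and dropped. That is harmless because only the stored one is counted on.
        PageStats stats = _pages.GetOrAdd(path, key => new PageStats());
        stats.Enter();
        return stats;
    }
}
```

After the first request for a page, every call is a lock-free read of the dictionary followed by two interlocked increments, which is the case where the ConcurrentDictionary pays off.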
What about different items that take time to create?
This is the ConcurrentDictionary's strongest point: creating many different time-consuming items and also adding them in parallel.
The ConcurrentDictionary uses many different locks to allow adding items concurrently, but the logic to decide which lock to use + having to acquire many locks while resizing its buckets doesn't really help too much. Putting data into a bucket is extremely fast. What really makes it the winner is the fact that it can create those values in parallel.
But, wait, we can also do the same. If we don't care if values are created in parallel and some of them are lost, we could lock, check if the item is there, release the lock, create the value, lock again, check again and if required, add the item. The code is like this:
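lock (_dictionary) // first check
    if (_dictionary.TryGetValue(i, out result)) return result;
int createdResult = _createValue(i); // created outside the lock, possibly in parallel
lock (_dictionary) // check again: another thread may have added the item meanwhile
    if (!_dictionary.TryGetValue(i, out result)) _dictionary.Add(i, result = createdResult);
return result;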
* Remember, I am using an int, int dictionary.
And with this simple structure, a normal dictionary performs almost as well as a ConcurrentDictionary when adding slow-to-create items in parallel. But with the same problem that some values may be generated and never used.
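The lock, check, release, create, lock again sequence can be packed into a generic class. This is a sketch under assumptions: the name DoubleCheckedCache is hypothetical and the private dictionary instance is used as the lock object.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical wrapper: values are created outside the lock, so two threads
// may create a value for the same key and one of the two is dropped.
public sealed class DoubleCheckedCache<TKey, TValue>
{
    private readonly Dictionary<TKey, TValue> _dictionary = new Dictionary<TKey, TValue>();
    private readonly Func<TKey, TValue> _createValue;

    public DoubleCheckedCache(Func<TKey, TValue> createValue)
    {
        _createValue = createValue;
    }

    public TValue GetOrCreate(TKey key)
    {
        TValue result;

        lock (_dictionary)
        {
            if (_dictionary.TryGetValue(key, out result))
                return result;
        }

        // No lock is held here, so unrelated values are created in parallel.
        TValue created = _createValue(key);

        lock (_dictionary)
        {
            // Another thread may have added the item while we were creating ours.
            if (_dictionary.TryGetValue(key, out result))
                return result; // our 'created' value is lost

            _dictionary.Add(key, created);
            return created;
        }
    }
}
```

Every caller receives the stored instance, but the factory may run more than once for the same key, exactly like the GetOrAdd of the ConcurrentDictionary.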
So, is there any conclusion?
Well, at this moment, I have some.
- All dictionaries are fast. I was creating millions of items and it is still fast. We will usually only create a small amount of items and read them with intervals, so we will not perceive the time spent reading them;
- If you can't create the same value twice, forget using the ConcurrentDictionary;
- If you really care about performance to the end, you may still have a better performance with a normal dictionary + a full lock. The important factors, in this case, are the number of adds and removes done. Reads will, unfortunately, be slower than with a ConcurrentDictionary;
- Even if I didn't present it, you have more freedom with a dictionary and a lock. You can lock once and add many items, remove many items, do many searches and only then release the lock;
- Avoid the ReaderWriterLockSlim if you usually have many more reads than writes. Dictionaries are so fast that a full lock is faster than a read lock. In this case, that depends on how much time you spend creating values inside a lock (if you do).
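The point about locking once and doing many operations can be illustrated with two helper methods. The class and method names (BatchUnderOneLock, AddMany, RemoveWhere) are hypothetical, and the sketch assumes that every other user of the dictionary locks on the same dictionary instance.

```csharp
using System;
using System.Collections.Generic;

public static class BatchUnderOneLock
{
    // Adds or replaces all the items while acquiring the lock a single time.
    public static void AddMany<TKey, TValue>(
        Dictionary<TKey, TValue> dictionary,
        IEnumerable<KeyValuePair<TKey, TValue>> items)
    {
        lock (dictionary)
        {
            foreach (KeyValuePair<TKey, TValue> item in items)
                dictionary[item.Key] = item.Value;
        }
    }

    // Removes every item accepted by the predicate under a single lock
    // and returns how many items were removed.
    public static int RemoveWhere<TKey, TValue>(
        Dictionary<TKey, TValue> dictionary,
        Func<TKey, TValue, bool> shouldRemove)
    {
        lock (dictionary)
        {
            // Collect first: removing while enumerating is not safe
            // on every framework version.
            var keysToRemove = new List<TKey>();
            foreach (KeyValuePair<TKey, TValue> pair in dictionary)
                if (shouldRemove(pair.Key, pair.Value))
                    keysToRemove.Add(pair.Key);

            foreach (TKey key in keysToRemove)
                dictionary.Remove(key);

            return keysToRemove.Count;
        }
    }
}
```

With a ConcurrentDictionary each of those adds and removes would be an independent operation, and other threads could observe the dictionary in the middle of the batch.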
So, I think that even if my examples were extreme, they show us how the use of the ConcurrentDictionary is not always the best solution.
Understanding the differences
Well, the main reason to start this article was that I was looking for better solutions.
What can I say, I tried to deeply understand how the dictionaries work (and I think I really do understand them now).
I can say that the buckets and nodes of the ConcurrentDictionary are simpler. When I tried to create a dictionary for the first time I did something very similar. The normal dictionary class, which seems to be the simpler version, is in fact more complex.
Each node is a full class in the ConcurrentDictionary. In the normal Dictionary, the nodes are implemented by a value type, and all of them are inside a giant array while the buckets are indexes to find those nodes in that array. Also, instead of a next node being a simple reference to another node, it is again an index into that array (after all, a struct node can't have a struct node as a member).
When adding and removing, the normal dictionary can't simply create a new node, it must check if there is an index of a removed node to take its place, or it will use the "Count" as the position of the new node in the nodes array. In fact, resizes are mandatory for normal dictionaries when all nodes are filled.
In the ConcurrentDictionary, a node is simply a new object. Removing a node is simply losing its reference. Adding a node is simply creating a new node instance and pointing to it. Resizes are done only to avoid collisions, but they aren't mandatory.
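The two layouts can be compared side by side in a heavily simplified sketch. Neither class below is the real framework code: the names ArrayEntryMap and NodeMap are hypothetical, keys and values are plain ints, there is no removal, no resize and no duplicate check, and neither class is thread-safe. They exist only to show where the nodes live.

```csharp
using System;

// Layout in the style of the normal Dictionary: struct entries in one big
// array, buckets and "next" links stored as indexes into that array.
public sealed class ArrayEntryMap
{
    private struct Entry
    {
        public int Key;
        public int Value;
        public int Next; // index of the next entry of the same bucket, or -1
    }

    private readonly int[] _buckets; // index of the first entry, or -1
    private readonly Entry[] _entries;
    private int _count;

    public ArrayEntryMap(int capacity)
    {
        _buckets = new int[capacity];
        for (int i = 0; i < capacity; i++) _buckets[i] = -1;
        _entries = new Entry[capacity]; // all the nodes are allocated at once
    }

    public void Add(int key, int value)
    {
        if (_count == _entries.Length)
            throw new InvalidOperationException("Full: a real dictionary must resize here.");

        int bucket = (key & int.MaxValue) % _buckets.Length;
        _entries[_count] = new Entry { Key = key, Value = value, Next = _buckets[bucket] };
        _buckets[bucket] = _count++;
    }

    public bool TryGetValue(int key, out int value)
    {
        int bucket = (key & int.MaxValue) % _buckets.Length;
        for (int i = _buckets[bucket]; i >= 0; i = _entries[i].Next)
            if (_entries[i].Key == key) { value = _entries[i].Value; return true; }

        value = 0;
        return false;
    }
}

// Layout in the style of the ConcurrentDictionary: one object per node.
public sealed class NodeMap
{
    private sealed class Node
    {
        public readonly int Key;
        public readonly int Value;
        public readonly Node Next;
        public Node(int key, int value, Node next) { Key = key; Value = value; Next = next; }
    }

    private readonly Node[] _buckets;

    public NodeMap(int bucketCount) { _buckets = new Node[bucketCount]; }

    public void Add(int key, int value)
    {
        int bucket = (key & int.MaxValue) % _buckets.Length;
        // One allocation per item; a full bucket array never forces a resize.
        _buckets[bucket] = new Node(key, value, _buckets[bucket]);
    }

    public bool TryGetValue(int key, out int value)
    {
        int bucket = (key & int.MaxValue) % _buckets.Length;
        for (Node node = _buckets[bucket]; node != null; node = node.Next)
            if (node.Key == key) { value = node.Value; return true; }

        value = 0;
        return false;
    }
}
```

In the first class a full entries array makes Add fail, which is why resizes are mandatory there. In the second class more items than buckets only means longer chains.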
So, if the Dictionary uses a more complex algorithm (on purpose), how can the ConcurrentDictionary be better for multi-threading?
And the truth is: Having all the nodes in a single array is much faster to allocate and to read, even if we need another array to tell where to find those items. It initially uses more memory for the same number of buckets, but new items don't need new allocations, don't need new object synchronization and don't force new garbage collections. They are already there.
But, replacing the contents of a node is not an "atomic" operation, and that is one of the facts that make it thread-unsafe. With nodes as objects, an entire node is initially created, then a single reference needs to be updated to point to it (that's an atomic operation). So, reader threads can read the dictionary without locks. They will either read the old value or the new value. There is no chance of reading an incomplete value.
So, the truth is: Normal dictionaries are faster to read, as long as you don't need to lock. It is the lock for reading that makes them slower on reads.
To do better is somewhat problematic. Each dictionary has an approach that works, has strong points and weak points. They are simply different strong and weak points.
What I wanted was to have the best of both, and it is simply impossible. So I must choose... and verify if I can do better in what I chose.
Or, try both.
For my own comprehension I did a full dictionary implementation using the same approach used by the normal dictionary. I got effectively the same speed. Then I tried to make it thread-safe, putting the locks exactly where they should be and avoiding repeated searches.
To explain: with a normal dictionary I can use a read lock, try to find an item and, if it is not there, enter an upgradeable lock, but then I must search again (as during the transition some other thread may have created the value).
With my own implementation, I kept the bucket in memory and a version. If the version didn't change, I didn't need to do a second search, I could directly create the value and then upgrade to a write lock.
Well, it had good performance, but still, my normal situation is some writes and a lot of reads. So, it is preferable to do reads that avoid locks completely.
So, I did another implementation. Now, each node is a normal reference. Items are slower to create, but reads are faster.
I use my own SpinReaderWriterLockSlim (which is much faster than the ReaderWriterLockSlim in most cases) for situations where a lock is needed, but most reads are lock-free. In my tests, it was working fine, but I still had the problem that two unrelated values couldn't be written at once.
I decided to go further and made each bucket have its own lock. So, when adding an item, I got a read-lock on the buckets (saying: Don't resize the buckets) and a write-lock on the bucket itself. That allowed me to create two values in parallel without ever creating duplicates.
The result? Well, it was slower to add items, as now I had 2 locks instead of one. Ok, in the event of many writes to different nodes it was faster, even faster than the ConcurrentDictionary, but that's not my main goal. I usually have lots of reads and some writes, which need to be protected "just in case" there is a collision.
So, I removed those secondary locks. Yet, the lock-free reading allows reads to be done while there are writes. That's much better than having "many reads" or "a single write". Readers never have to wait. In the event of 10 threads reading, and one creating a value for 10 seconds, I can read freely instead of making 10 threads wait.
What can I do to make it better?
Using a single reader writer lock, it was already better than the ConcurrentDictionary in many situations, as resizes were faster.
Considering my optimistic lock also does great when there is no concurrency, it is better if there are no real simultaneous writes, yet it is protected if that happens.
Then the only thing that was missing were specialized methods. As I said, with a dictionary + a lock we can add many items at once. So, why not have that option in my dictionary?
And that's what I did. I tried to put many "multiple actions methods" inside the dictionary, that do a single lock while doing those actions.
So, you have the GetOrCreateValue (which is the safe equivalent of the ConcurrentDictionary's GetOrAdd, but which does not have an equivalent in the normal dictionary), the TryGetValueAndRemove and other methods, like RemoveMany, which can accept a delegate to check all the items and tell which should be removed (avoiding many searches), or which can receive a collection of keys to be removed in a single lock (as obtaining and releasing the locks is time consuming).
Also, seeing how the approach of the GetOrAdd of the ConcurrentDictionary can have its advantages (creating two or more items in parallel, even if some of them will be discarded), I decided to have the GetOrCreateDiscardableValue. If the value needs to be created, it is created outside of a lock but, if in the meantime another thread put a value there, that other value will be used. But my version has the advantage that you can give a delegate to properly discard the generated value.
After all, who knows? You did some calls in parallel, that's ok, but if one of the generated values will not be used, you must discard it properly (for example, many P/Invokes return handles that must be closed the same number of times as they were opened).
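The idea of a discard delegate can be approximated on top of the ConcurrentDictionary itself. The sketch below is not the GetOrCreateDiscardableValue of the downloadable source: the extension method name GetOrAddDiscardable and its parameters are hypothetical, and only TryGetValue and TryAdd of the real ConcurrentDictionary are used.

```csharp
using System;
using System.Collections.Concurrent;

public static class DiscardableExtensions
{
    // Creates the value outside of any lock. If another thread stored a value
    // for the same key first, the value created here is handed to
    // discardValue (to close handles, dispose components and so on).
    public static TValue GetOrAddDiscardable<TKey, TValue>(
        this ConcurrentDictionary<TKey, TValue> dictionary,
        TKey key,
        Func<TKey, TValue> createValue,
        Action<TValue> discardValue)
    {
        TValue existing;
        if (dictionary.TryGetValue(key, out existing))
            return existing;

        TValue created = createValue(key);

        while (true)
        {
            if (dictionary.TryAdd(key, created))
                return created;

            if (dictionary.TryGetValue(key, out existing))
            {
                discardValue(created); // we lost the race: clean up our value
                return existing;
            }

            // The competing item was removed between the two calls: try again.
        }
    }
}
```

A typical use would pass something like value => value.Dispose() as the discard delegate, so a handle opened by the losing thread is closed instead of leaked.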
Finally, I decided to add a Lock() method to the dictionary, which returns a disposable LockedDictionary. With it you can do many of your own actions while the lock is still held, without any method trying to acquire an extra lock, and then you dispose the locked dictionary to release the lock.
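The shape of such a Lock() method can be sketched as follows. This is not the LockedDictionary of the downloadable source: the names LockableDictionary and LockedView are hypothetical, and a plain Monitor is assumed instead of the reader writer lock used by the real implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

public sealed class LockableDictionary<TKey, TValue>
{
    private readonly Dictionary<TKey, TValue> _items = new Dictionary<TKey, TValue>();
    private readonly object _sync = new object();

    // Acquires the lock and returns a view that releases it when disposed.
    // With a Monitor, Dispose must be called by the thread that called Lock().
    public LockedView Lock()
    {
        Monitor.Enter(_sync);
        return new LockedView(this);
    }

    public sealed class LockedView : IDisposable
    {
        private LockableDictionary<TKey, TValue> _owner;

        internal LockedView(LockableDictionary<TKey, TValue> owner)
        {
            _owner = owner;
        }

        private Dictionary<TKey, TValue> Items
        {
            get
            {
                if (_owner == null)
                    throw new ObjectDisposedException("LockedView");

                return _owner._items;
            }
        }

        // None of these members acquires a lock: it is already held.
        public int Count { get { return Items.Count; } }
        public void Add(TKey key, TValue value) { Items.Add(key, value); }
        public bool Remove(TKey key) { return Items.Remove(key); }
        public bool TryGetValue(TKey key, out TValue value) { return Items.TryGetValue(key, out value); }

        public void Dispose()
        {
            if (_owner == null)
                return; // already released

            object sync = _owner._sync;
            _owner = null;
            Monitor.Exit(sync);
        }
    }
}
```

Used inside a using block, the view lets a caller add, remove and search many items while paying for the lock only once.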
I really think that with all those extra methods, such implementation is more complete to be used in many different scenarios where performance is as critical as thread-safety.
I also decided to make my dictionary conform more closely to the Single Responsibility Principle, so it does not implement the normal IDictionary interface, as I consider it too bloated with members like ICollections instead of IEnumerables. Also, it does not implement serialization specific tasks (if you want to serialize it, use a framework that allows serializers to be registered) but at the same time it is more complete as a dictionary and also has the TrimExcess() method available to lists and hashsets.
As I always do with this kind of article, the sample is a program that simply compares performance of the many techniques and situations discussed in the article.
It does not compare all the extra methods of my dictionary implementation, but you can already see the difference between the locking techniques, and my dictionary is already doing great with the normal methods. The extra methods are exactly that: extra methods to gain even more performance.
You will find the entire source code of my dictionary and also my SpinReaderWriterLockSlim, which is used by my dictionary.
Points of Interest
To me, one of the points of interest is how the MSDN documentation makes people think that the ConcurrentDictionary is simply the replacement for dictionaries when multi-threading is required, which is not always the case.
Also, doing my own implementations gave me the possibility to create different kinds of dictionary. For example, the ThreadSafeDictionary looks more like the ConcurrentDictionary, the WeakDictionary (not present in this article) looks more like the normal Dictionary (because items are collected and removed very often) and I am also creating a BigDictionary, which allows me to create really big dictionaries (with more than 3 billion items, considering there is enough memory).
And who knows, maybe I can think about creating my own database that supports indexes now that I know how to work with nodes and hashcodes in different manners.
I already ended the discussion about one type of dictionary versus the other. Now I will talk about one of the decisions I made in my implementation: I never use Thread.MemoryBarrier(). Not on writes, not on reads.
If we look at the ConcurrentDictionary source code on the internet (or if we reflect it) we will see that the read methods first read the node from the buckets, then they do the memory barrier and finally they start reading the contents of the node and getting new nodes.
On the other hand, writes don't have any memory barrier.
I looked at many places to try to understand that reasoning. In theory, Thread.MemoryBarrier() calls should be made just before a new object is made available to other threads (that is, after we fill the properties, we do a Thread.MemoryBarrier() and then we set a shared variable) and when a reference is read from a shared variable, but before reading its contents.
That's related to how caches read and write data. In theory it is possible that we create an object, fill its contents (which remain in the CPU cache) and then set a reference that is immediately written to the main memory. That is, the main memory has an up-to-date reference, pointing to contents that are still not up-to-date (and have garbage).
But why doesn't the ConcurrentDictionary do that? Why does it only use Thread.MemoryBarrier() on reads?
Well, apparently, all writes to public memory on the .Net have the "release" semantics, which means that when I do:
_globalVariable = x;
It will guarantee that all the contents of x are flushed to the main memory before x is put into _globalVariable.
So, why is there a Thread.MemoryBarrier() on reads?
From the same sources I saw that writes are guaranteed to have "release" semantics while reads are not guaranteed to have "acquire" semantics.
That is, if CPU 1 has already cached the area of memory with the contents of x in one operation, CPU 2 can change those contents and then set _globalVariable to x, and CPU 1 can then read _globalVariable pointing to an area of memory that it already cached with the wrong values.
So, putting a read barrier just after getting the reference from _globalVariable is good. In this particular case, that happens when reading the node reference from the buckets array.
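The publish and read pattern discussed here can be sketched with the Volatile class, assuming .Net 4.5 or later where System.Threading.Volatile exists. The names PublishedNode and PublicationSketch are hypothetical. Volatile.Write gives the store "release" semantics and Volatile.Read gives the load "acquire" semantics, which is a different choice from the full Thread.MemoryBarrier() calls discussed in the text.

```csharp
using System.Threading;

// Immutable once constructed: it is completely filled inside its constructor.
public sealed class PublishedNode
{
    public readonly int Key;
    public readonly int Value;

    public PublishedNode(int key, int value)
    {
        Key = key;
        Value = value;
    }
}

public static class PublicationSketch
{
    private static PublishedNode _globalVariable;

    public static void Publish(int key, int value)
    {
        // The object is completely built first...
        var x = new PublishedNode(key, value);

        // ...and only then made visible with a single reference write.
        // Volatile.Write has "release" semantics: the writes that filled x
        // cannot be moved after this store.
        Volatile.Write(ref _globalVariable, x);
    }

    public static bool TryRead(out int key, out int value)
    {
        // Volatile.Read has "acquire" semantics: the reads of x.Key and
        // x.Value below cannot be moved before this load.
        PublishedNode x = Volatile.Read(ref _globalVariable);
        if (x == null)
        {
            key = 0;
            value = 0;
            return false;
        }

        key = x.Key;
        value = x.Value;
        return true;
    }
}
```

A reader therefore sees either no node at all or a complete one, which is the property the lock-free reads depend on.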
But the .Net doesn't have a read barrier, it is a full barrier, which causes a big performance loss, and also:
- In the ConcurrentDictionary, the barrier is done after reading the first node, but not while navigating to the next nodes. I first thought it was OK as inserts replace the first node, but updates replace any node so, if the barrier is needed, it is needed for all nodes (or, on the update, all nodes should be copied, to then replace the entire bucket, not a single item in the bucket);
- I simply can't simulate such a situation on my computer. Apparently the problem is only visible on Itanium processors and will never happen on my computer. But, as I can't reproduce the problem, I prefer to avoid something that is making my code slower for no reason (after all, if I am still doing something wrong, I will never be able to test);
- The new nodes are filled inside their constructor. It is not: create an object, fill it, show it to other threads. Or worse: create an object, fill it, show it to other threads, now change its contents, try to show it again. The real code is: create the object already filled. Now that it is complete, show it to other threads and CPUs. And here I believe the .Net framework will never let a thread see a reference for the first time and see contents that may have already been prefetched by the CPUs with garbage. That would be terrible, especially when dealing with window handles and other P/Invokes;
- On this MSDN Article, the technique 1 does exactly what I do, without any memory barriers for reads and, considering who wrote the article, I believe it's a safe technique.
So, to end this topic, I didn't put any Thread.MemoryBarrier() in my implementation. If my implementation really causes problems on Itanium processors, I will think about creating an Itanium-specific compilation with such memory barriers, the same way I will do if there is a processor or .Net implementation that needs the memory barriers on writes too. But, for the moment, I can let the CPU caches do their jobs freely on reads.
- March, 7th, 2013. Corrected the return of the TryGetValueAndAddOrReplace(), which was inverted, added the properties TypeOfValue in the untyped dictionary interface and added the support for different locks to be used in the ThreadSafeDictionary. On single CPU machines, the OptimisticReaderWriterLock will be used instead of the
- February, 18th, 2013. First version.