You are attributing the code's 'non-deterministic' behavior to the wrong cause: it is due to caching being applied to the field.
The compiler might optimize the use of that field for a single thread, so the value might then be updated with a certain 'delay'.
volatile means the field will not be cached on a per-thread basis, so all threads on all CPUs will see changes to the field immediately. In effect, it disables compiler optimizations for that field.
There is no bug here; there is nothing to fix. The fields were not "out of sync": the values were cached.
Using a word like "synched" in this context is not correct, and is indeed misleading. You are also assuming the difference comes down to debug mode versus release mode; you could run the program hundreds of times with different optimizations enabled and get different results.
I have rated most of your articles a 5, but this one is not OK.
The fact that other people have given you a 5 on this is, to say the least, concerning.
Leonardo Paneque wrote:
You are attributing the code's 'non-deterministic' behavior to the wrong cause: it is due to caching being applied to the field.
The compiler might optimize the use of that field for a single thread, so the value might then be updated with a certain 'delay'.
volatile means the field will not be cached on a per-thread basis, so all threads on all CPUs will see changes to the field immediately. In effect, it disables compiler optimizations for that field.
I get your point and you are right. If we use the volatile keyword, we are fiddling with the free, optimized decisions the processor makes about memory synchronization. Now the processor has to do the extra work of flushing the thread's local storage as well as main memory. But the question is: does this sync introduce a delay, and can that delay be optimized?
Leonardo Paneque wrote:
There is no bug here; there is nothing to fix. The fields were not "out of sync": the values were cached.
There is no bug from the processor's perspective, but my threads want current data, and stale data can lead to non-deterministic behavior in my application. At the end I have demonstrated in a video a bug caused by the absence of the volatile keyword: the program keeps running forever because it is not getting the fresh value. So from the processor optimization perspective things are perfect, but from the app perspective things can be problematic.
Leonardo Paneque wrote:
Using a word like "synched" in this context is not correct, and is indeed misleading. You are also assuming the difference comes down to debug mode versus release mode; you could run the program hundreds of times with different optimizations enabled and get different results.
As we have two memory storages, I used the word sync; I am not sure the word is misleading. In debug mode your program runs a bit slower, giving the processor a chance to flush the memory, which is why I said to run in release mode so that we can see the issue. I agree that with different processors (as each has a different memory model) you can get different results. At least on all my PCs I was able to reproduce the issue.
Leonardo Paneque wrote:
I have rated most of your articles a 5, but this one is not OK.
The fact that other people have given you a 5 on this is, to say the least, concerning.
I do not write for votes; I crave discussions like this, which help me grow. This discussion is worth more than 100 votes.
That said, you have highlighted very important points; I will update the article with a note about the performance implications of volatile.
Cross question: is there a better way to make all my threads see fresh memory without losing too much optimization?
On x86 and x64 there are no performance issues; most people agree the impact is zero, but that's due to the CPU architecture. On ARM processors it might be a problem, not because of volatile itself, but because of the way the cache is flushed (everything is flushed).
Is there a better way to make all my threads see fresh memory without losing too much optimization?
Yes, to address that question... but more broadly, to upgrade your knowledge of threading, I want to clarify this for you and anybody who comes after.
As for your relationship with threading: first you don't know volatile; when the right problem arises you will discover it, and you will go through a period when you like volatile because it solves the immediate problem. Then you will start to be frustrated by its limitations. You may start to like lock again... And finally you will start avoiding both lock and volatile as if they were a curse.
I'll try to make these phases go fast and with little pain.
1) Volatile solves a problem.
It allows you to tell the compiler that a certain variable must be given special consideration because it is being used by various threads at the same time. In fact, volatile is excellent for telling the compiler that those threads should read the value of the variable from shared memory each time (and not from a cached copy).
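For instance, a minimal sketch of the classic use case, a stop flag read by a worker thread (the class and member names are mine, not from the article):
C#
// The worker re-reads _stopRequested from shared memory on every iteration,
// because the field is declared volatile.
class Worker
{
    private volatile bool _stopRequested;

    public void Run()
    {
        while (!_stopRequested)
        {
            // do work
        }
    }

    public void Stop()
    {
        _stopRequested = true; // becomes visible to the worker right away
    }
}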
2) Volatile is misleading.
Let's say you have split an operation across various threads, and you want to report the total progress in real time.
Of course you will have a shared variable for that. Let's say you have an int that will go from 0 to 1000, where 1000 means the work is done, and each thread has half the work (that is, each will increment the variable 500 times).
What can we expect? Well, that after both threads finish the final value will be 1000. Well, no. Because these threads are caching the variable. Your first instinct may be to throw volatile into the cage and hope for the best, and it will work most of the time... but not quite: sometimes you get 999, other times 997, and so on. What happens is that each thread will have something like this:
C#
_progress++;
How does the compiler handle this? Like so, of course:
C#
var tmp = _progress + 1;
_progress = tmp;
Do you see the problem? No? Ok, let me try again:
C#
var tmp = _progress + 1;
// These are not a single instruction
_progress = tmp;
"What's wrong with that?" - you may say - "It should work anyway". Well, the problem is that you have two threads, and those threads can be preempted anytime, perhaps...
C#
var tmp = _progress + 1;
//Just right here!!!!!
_progress = tmp;
And now each thread has its own version of the new value to write to the field, and both are gonna write! Oh no!
The initial value is 0.
Thread A comes and reads 0, and says... I'll update it to 1.
Thread A gets preempted.
Thread B comes and reads 0, and says... I'll update it to 1.
Thread B writes the field as 1.
Thread A is awakened.
Thread A writes the field as 1.
Both threads will swear they have just updated the value of the field. That is, from their viewpoint the field had the value 0 and now it has 1. Yet the final value is 1, not 2.
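Here is a minimal repro sketch of that race, using my own scaffolding around the numbers above:
C#
// Two threads each increment a shared, unsynchronized field 500 times.
// The expected final value is 1000; lost updates can make it smaller.
using System;
using System.Threading;

class LostUpdateDemo
{
    static int _progress;

    static void Work()
    {
        for (var i = 0; i < 500; i++)
            _progress++; // read-modify-write: not atomic
    }

    static void Main()
    {
        var a = new Thread(Work);
        var b = new Thread(Work);
        a.Start(); b.Start();
        a.Join(); b.Join();
        Console.WriteLine(_progress); // sometimes 999, 997, ... (raise the
                                      // iteration count to see it more often)
    }
}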
3) Love and hate of lock
Oh, cruel threads, what are we gonna do? Of course one solution is to use Monitor:
C#
lock (something)
{
_progress++;
}
Yep, that works... but if we are going to need a lock, why are we using volatile in the first place?
Monitor is the de facto synchronization mechanism in C#. If you want to make sure only one thread enters a critical section, use Monitor. It just works. And that's the problem.
Sometimes you don't want just one thread; for example, you may want multiple threads to read and only one to write (at a time). That's what ReaderWriterLock(Slim) is for, except for some corner cases, which thankfully have already been addressed more recently with Lazy<T> and ThreadLocal<T>.
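A minimal sketch of that pattern (the field and method names are mine):
C#
// Many threads may hold the read lock at once; the write lock is exclusive.
private static readonly ReaderWriterLockSlim _rwLock = new ReaderWriterLockSlim();
private static int _sharedValue;

static int ReadValue()
{
    _rwLock.EnterReadLock();
    try { return _sharedValue; }
    finally { _rwLock.ExitReadLock(); }
}

static void WriteValue(int value)
{
    _rwLock.EnterWriteLock();
    try { _sharedValue = value; }
    finally { _rwLock.ExitWriteLock(); }
}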
But what if I want multiple writes?
Yes, I want multiple writers; there is no energy or time to have threads sitting around! What I want is lock-free systems!
For example, let's get back to our previous example. Let's say the work these threads are doing is copying an array; they are cooperating to copy it. Of course, if they are going to cooperate to copy the array, they need to be able to write to it simultaneously.
C#
// Two threads use this code to cooperate to copy the array
var index = 0;
do
{
    index = _progress++;
    arrayTarget[index] = arraySource[index];
} while (index < 1000);
Of course, this time the problem is clear as day. Because of that [insert insult here] increment they may end up copying parts of the array twice!
So, how about a lock?
C#
// Two threads use this code to cooperate to copy the array
int index;
do
{
    lock (something)
    {
        index = _progress++;
    }
    if (index < 1000)   // another thread may have claimed the last slot already
        arrayTarget[index] = arraySource[index];
} while (index < 1000);
Well, it is no good either, because now the threads have to compete for the lock on every iteration. ReaderWriterLock is not going to help either, because they are all writers. If those are the only options, it would be better to have a single thread do the job.
So, how do you solve this?
4) Avoiding volatile
Here is the truly unsung hero: Interlocked
What you need is to increment _progress in a single (atomic) operation (in such a way that the thread will not be preempted in the middle of it), and Interlocked allows you to do just that!
C#
// Two threads use this code to cooperate to copy the array
int index;
do
{
    // Interlocked.Increment returns the new value; subtract 1 to get the
    // slot this thread has claimed for itself.
    index = Interlocked.Increment(ref _progress) - 1;
    if (index < 1000)
        arrayTarget[index] = arraySource[index];
} while (index < 1000);
Yay! A solution. There is just one problem... you need to stop using volatile to be able to use Interlocked. Why? Well, Interlocked needs a ref to the field, and passing a volatile field by ref loses its volatile semantics (the compiler even warns you about it).
This brings up another question... if I need both Interlocked and volatile, what should I do?
There are two answers:
A) Use Thread.VolatileRead(ref v) and Thread.VolatileWrite(ref v, value), which have the same semantics that volatile had in the first place (read the most up-to-date value, flushing the cache).
B) Use Thread.MemoryBarrier(), which is what volatile actually does (it keeps the compiler from reordering operations and caching values). In fact, Thread.VolatileWrite and Thread.VolatileRead use Thread.MemoryBarrier internally.
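A small sketch of option A (the field and method names are mine):
C#
// The field is NOT declared volatile, so it can be passed by ref without a
// warning; freshness is requested explicitly at each read site instead.
private static int _progress;

static void Advance()
{
    Interlocked.Increment(ref _progress);      // atomic read-modify-write
}

static int CurrentProgress()
{
    return Thread.VolatileRead(ref _progress); // reads the most up-to-date value
}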
Given where the industry stands today, you are going to see less and less lock and volatile, and more and more Interlocked. This move is driven by the desire for better CPU utilization and better energy consumption. There is an ongoing push to develop more and better lock-free and even wait-free* data structures.
*: There are many self-proclaimed wait-free data structures that turn out to be merely lock-free under scrutiny (marketing?); only a few actually are what they claim. Creating a truly wait-free data structure is a very hard task (for which I'll give you some tips below). Lock-free, on the other hand, is a perfectly reachable objective.
I guess eventually most cases will be covered. For now, we need to spread the knowledge, because we don't know who is going to complete the holy task of creating a truly mutable wait-free dictionary that doesn't have a limited capacity of threads or items (and releasing it as free and open source software). [I don't even know if such a data structure is possible, but I'll be looking for it (does it already exist?). Really, that person deserves a medal or something.]
5) Extra
These are tips for creating wait-free and lock-free data structures.
a) The key to creating lock-free and possibly wait-free data structures is to allow threads to cooperate (with few exceptions).
Sometimes you will need some kind of protocol that allows threads to decide how to help each other, so that if a thread cannot do something it doesn't have to wait; instead, it goes and helps another thread*.
*: Hence the term wait-free.
Eventually another thread may come and help the first thread with the task it could not do before, or that thread may come back and try again after helping another thread.
b) To create such a protocol that allows threads to decide how to help each other, it is important to be able to identify threads uniquely. Sadly, the ManagedThreadId can be recycled by the runtime. Here, thread-local storage and Interlocked come in handy to assign unique temporal IDs, work tickets or other mechanisms that facilitate the protocol; see the sketch below.
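A sketch of that idea (the field names are mine, not from the post):
C#
// Each thread lazily receives an ID that is unique for the lifetime of the
// process and, unlike ManagedThreadId, is never recycled.
private static int _nextId;
private static readonly ThreadLocal<int> ThreadId =
    new ThreadLocal<int>(() => Interlocked.Increment(ref _nextId));

// Usage: ThreadId.Value is stable and unique for the current thread.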
c) You have been able to use thread-local storage since .NET 2.0 with LocalDataStoreSlot. Also, if you store something on the stack (stackalloc?) it will be thread-local, because the stack is thread-local.
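For reference, the slot API looks roughly like this (a minimal sketch):
C#
// Allocate a slot once; each thread then sees its own private value in it.
LocalDataStoreSlot slot = Thread.AllocateDataSlot();
Thread.SetData(slot, 42);               // stored for the current thread only
int value = (int)Thread.GetData(slot);  // other threads see their own copy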
d) Interlocked.Increment wraps values; after reaching int.MaxValue it goes back to int.MinValue. This allows you to create wait-free circular stacks and queues, under the constraint that the capacity should divide the total counter range (i.e. be a power of two, so you don't have to wrap the indexes yourself; just mask them onto an internal array), as sketched below.
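A sketch of that index mapping (the buffer and field names are mine):
C#
// The ticket counter will eventually wrap into negative territory; masking
// with (capacity - 1) still yields a valid slot when capacity is a power of two.
private static int _ticket = -1;
private static readonly string[] _buffer = new string[1024];

static int NextSlot()
{
    int ticket = Interlocked.Increment(ref _ticket);
    return ticket & (_buffer.Length - 1); // cheap modulo for powers of two
}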
e) Using Interlocked.CompareExchange, Interlocked.Increment and Thread.VolatileWrite you can have critical sections that will only be entered a limited number of times (for example only once, which is good for lazy initialization). Or you can use them to allow only one thread at a time without making other threads wait (similar to Monitor.TryEnter, but much more lightweight).
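For instance, a once-only initialization sketch (names are mine):
C#
// 0 = not initialized, 1 = claimed. Only the thread that wins the swap runs
// the body; every other thread skips it immediately, without blocking.
private static int _initialized;

static void EnsureInitialized()
{
    if (Interlocked.CompareExchange(ref _initialized, 1, 0) == 0)
    {
        // ... one-time setup goes here ...
    }
}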
f) Did you create a data structure that is misbehaving, and you don't know what may be going wrong? Use that wait-free circular something from tip (d) to log debug information without introducing locks.
g) Homework: learn about incremental resizing. Hint: yes, you will have various threads cooperate to copy an array.
Did you manage to implement your data structure without adding artificial delays?
No spin locks?
No busy whiles?
No nasty gotos?
Enumeration doesn't repeat items?
No items lost when resizing?
No items lost in a collision?
No mysterious hangs?
No [I don't know what]?
Then congratulations! You just created a wait-free data structure [under the assumption that it works... test, test, test].
Otherwise, that is pretty much where I am... I don't know what the next tip may be.
Final note: I wish there were a button to convert this message into an article. This is why I cannot maintain a blog: if I sit down in advance to write for a blog I don't know what to write, but then I sit down to answer a question in some random thread, and there you go, a post that could be in a blog.
You are absolutely right. Using the volatile keyword in the sample code is a substitute for thinking. The code may perform as intended, but then again it might not.
The CLR framework does give you a nice and clean interface to threading, but the fundamental issues regarding concurrency are not resolved by throwing in "volatile".
The issue is not really concurrency; it's more that fresh, recent updates are not received in the thread's local cache. Concurrency I can solve with Mutex, Semaphore(Slim), lock and Monitor.
A better/legal way would be to put the variable updates between lock calls. Also, a comparison of that approach with volatile would be an awesome test to conclude which is better.
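A sketch of that lock-based approach (the names are mine):
C#
// lock gives both atomicity and visibility: the memory barriers taken on
// entry and exit guarantee a reader sees the latest written value.
private static readonly object _gate = new object();
private static bool _flag;

static void SetFlag()
{
    lock (_gate) { _flag = true; }
}

static bool ReadFlag()
{
    lock (_gate) { return _flag; }
}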
Eric is a legend. The second one was really catchy and nice.
But I do not accept that volatile is evil. It has its own place in multithreading. I agree that it slows the application, but then the adverse effects are minimized.
I'm somewhat confused by this article. You talk of variables having different storage allocated on a per-thread basis, which would suggest you're talking about thread-local storage; only you have to tell the compiler to use that (using ThreadLocal<>), it isn't the default.
It's quite possible that the memory access is being cached (e.g. in the CPU's per-core L1 cache), which might give the symptoms you're describing, although I'd expect the situation to resolve itself in time (e.g. when the thread swapped from one core to another). It's also possible that the JITter may decide to optimise out the memory fetch for (what it perceives as being) an invariant variable (although I'm not seeing it do so).
However, I'm not clear on what you're saying the problem is, and I'm not seeing any issues running the supplied code (debug or release). The code runs to completion (even if I add the necessary t1.Join() call in to block the main thread until the looping thread has had a chance to respond).
Threading (and synchronising memory access) is an interesting topic and worthy of discussion. I'm just not sure this article makes the subject any clearer.
You are right: the thread uses the cache memory (which for clarity I have termed local memory), so the data is stale. Yes, we can resolve it by using Thread.Sleep or a block, but only if we have the luxury of making the other thread sleep.
John Brett wrote:
However, I'm not clear on what you're saying the problem is, and I'm not seeing any issues running the supplied code (debug or release). The code runs to completion (even if I add the necessary t1.Join() call in to block the main thread until the looping thread has had a chance to respond).
Do see my video at the end, where I run the complete code in release mode and the memory does not sync. I am not sure it will also reproduce at your end. Do not put in a sleep or a block; then the cache and main memory get synced and you will not be able to reproduce the problem.
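For anyone trying to reproduce it, this is roughly the shape of the sample (a reconstruction with my own names, not the article's exact code):
C#
// Build in release mode and run without a debugger: the JIT may hoist the
// read of _stop out of the loop, so the worker never sees the update and hangs.
using System;
using System.Threading;

class HangDemo
{
    static bool _stop; // deliberately NOT volatile

    static void Main()
    {
        var worker = new Thread(() =>
        {
            while (!_stop) { }           // may spin forever
            Console.WriteLine("stopped");
        });
        worker.Start();
        Thread.Sleep(1000);
        _stop = true;                    // the worker may never observe this
        worker.Join();
    }
}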
Hi,
Interesting article, but I am wondering if there is not some sort of synchronisation mechanism between the main memory and the thread memory?
Maybe you can re-do the test but add a Sleep(0) call in your threading function?