Introduction

First, a warning: this is a difficult article that goes really deep inside the .NET machinery, so if you don’t get it the first time (or even the second or third time…), don’t worry and come back later.

For a training session I taught at the end of last year, I wanted to demonstrate some subtleties of multi-threading, and more specifically some memory visibility issues that should cause a program to hang.
So I developed a small sample that I expected to show the issue, but instead of hanging, the program completed!

After manipulating the program further, I obtained the behavior I wanted: the program hung. But that still didn’t explain why my original version managed to complete.

<SPOILER>
I suspected some JITter optimizations, and that was indeed the case, but I needed more information to completely explain this strange behavior.
As is often the case, StackOverflow was of great help; if you’re curious, you can have a look at the original SO thread.
</SPOILER>

In this article, I’ll “build” and explain the issue step by step, trying to make it more understandable than the SO thread, which is quite dry.

A No-brainer…

Say you are a naive developer who loves simplicity.

You’re asked to synchronize two threads, so you ask yourself this question: what’s the simplest way of synchronizing two threads?
Easy peasy: a spin loop.

So 2 minutes later, you’re done with a simple, but you think brilliant, implementation:

using System;
using System.Threading;

namespace Tests
{
    public class AwesomeSpin
    {
        bool ok = false;

        void Spin()
        {
            while (!ok) ;
        }
        
        void Run()
        {
            Thread thread = new Thread(Spin);
            thread.Start();

            Console.Write("Press enter to notify thread...");
            Console.ReadLine();

            ok = true;

            Console.WriteLine("Thread notified.");
        }
        
        static void Main()
        {
            new AwesomeSpin().Run();
        }
    }
}

So the main thread starts another thread which should spin until we press enter to notify it that it has spun enough for today.

You compile your work:

> csc /optimize+ AwesomeSpin.cs

And you run it:

> AwesomeSpin.exe
Press enter to notify thread...<You press enter>
Thread notified.

>

The second > indicates that the program has correctly terminated and that you’re back to the shell which is requesting more commands to execute.

Perfect! It works just as expected.
You’re the boss!

…Well, Almost

You commit and push your code and as you’ve done a pretty good job, you have the right to recover from this long and exhausting coding session with a well-deserved coffee.

But before you’ve finished your coffee, you receive a message from the testing team:

Hello,
your new component has been running for a few minutes now without any output and it seems stuck!
The testing timeouts have been hit!
Could you please check that all is OK?
Regards

Like any developer in this situation, your first thought is “WTF?”.

Oops!

Then you decide to have a closer look at how the testers have run your code, and you realize that it has been compiled and run with many different platform and host combinations.
The testing team has sent you a report like the following:

Platform	Host	Result
AnyCPU	x86	OK
AnyCPU	x64	KO
x86	x86	OK
x86	x64	OK
x64	x86	X
x64	x64	KO

Hmm! Seems like there is an issue with 64-bit…

Note that by default, CSC flags the resulting assembly as supporting platform “AnyCPU” meaning it will run in a 64-bit CLR if one is available on the host and in a 32-bit CLR otherwise.
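
If you are unsure which CLR actually picked up your assembly, the process can report it itself. The following is only a minimal sketch: the ClrBitness class and its Describe method are hypothetical names introduced here for illustration, and Environment.Is64BitOperatingSystem assumes .NET Framework 4.0 or later.

```csharp
using System;

public static class ClrBitness
{
    // Returns a short description of the process the code is running in.
    public static string Describe()
    {
        // IntPtr.Size is 4 bytes in a 32-bit process and 8 bytes in a 64-bit one.
        int bits = IntPtr.Size * 8;
        return string.Format(
            "{0}-bit process on a {1}-bit OS",
            bits,
            Environment.Is64BitOperatingSystem ? 64 : 32);
    }
}
```

Printing ClrBitness.Describe() at the top of Main should report a 64-bit process for the AnyCPU and x64 builds on a 64-bit host, and a 32-bit process for the x86 build.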

As you’re a conscientious employee and/or a curious geek, you try to reproduce the issue yourself.

You set up a 64-bit machine and update your code.

First, you force your .NET binary to be run only by the 32-bit CLR:

> csc /platform:x86 /optimize+ AwesomeSpin.cs

And you rerun it:

> AwesomeSpin.exe
Press enter to notify thread...
Thread notified.

>

So far so good.

Then, you try in 64-bit mode:

> csc /platform:x64 /optimize+ AwesomeSpin.cs

And you rerun it again:

> AwesomeSpin.exe
Press enter to notify thread...
Thread notified.
^C

Oops, indeed it’s stuck and you have to CTRL-C to stop the program. :/

But this is a good thing: a bug that you can reproduce is already half fixed.

Note that you could also have used CorFlags.exe to change the assembly’s CLR header flags and run it with a different CLR, but recompiling better illustrates the way you do it with VS.

When Things Become Crazy

The code is quite light and the only idea you have to confirm that the issue is in the Spin method is to use your best debugging wizardry … console output:

void Spin()
{
    Console.WriteLine("\nBefore spin loop.");
        
    while (!ok) ;
            
    Console.WriteLine("After spin loop.");
}

And here we go again:

Compile:

> csc /platform:x86 /optimize+ AwesomeSpinDebug.cs

and run:

> AwesomeSpinDebug.exe
Press enter to notify thread...
Before spin loop.

Thread notified.
^C

Ok it’s still stuck.

But … wait … I’m in x86 mode!

Just to check, you comment out the two Console.WriteLine lines:

void Spin()
{
    // Console.WriteLine("\nBefore spin loop.");

    while (!ok) ;
    
    // Console.WriteLine("After spin loop.");
}

One more compilation:

> csc /platform:x86 /optimize+ AwesomeSpinDebug.cs

And one more run:

> AwesomeSpinDebug.exe
Press enter to notify thread...
Thread notified.

>

And it works again!

…

As developers, we all know these moments, when you feel you’ve lost control over things and the machine does what it wants.

This time, you really think software development is not the job for you and that it’ll drive you crazy, and you start asking Google whether there is an open position at the closest fast-food restaurant.

WTF?

But in a last fit of pride, you decide to investigate more and you decompile your executables with your favorite IL disassembler.
(I often use ILSpy but for simple cases like this one, ILDasm does the job too.)

With platform x86 without debugging output (the original version), you get:

.method private hidebysig instance void  Spin() cil managed
{
  // Code size       9 (0x9)
  .maxstack  8
  IL_0000:  ldarg.0
  IL_0001:  ldfld      bool Tests.MemoryVisibility::ok
  IL_0006:  brfalse.s  IL_0000
  IL_0008:  ret
} // end of method MemoryVisibility::Spin

With x64 platform, still without debugging output:

.method private hidebysig instance void  Spin() cil managed
{
  // Code size       9 (0x9)
  .maxstack  8
  IL_0000:  ldarg.0
  IL_0001:  ldfld      bool Tests.MemoryVisibility::ok
  IL_0006:  brfalse.s  IL_0000
  IL_0008:  ret
} // end of method MemoryVisibility::Spin

And finally with platform x86 with debugging output:

.method private hidebysig instance void  Spin() cil managed
{
  // Code size       29 (0x1d)
  .maxstack  8
  IL_0000:  ldstr      "\nBefore spin loop."
  IL_0005:  call       void [mscorlib]System.Console::WriteLine(string)
  IL_000a:  ldarg.0
  IL_000b:  ldfld      bool Tests.MemoryVisibility::ok
  IL_0010:  brfalse.s  IL_000a
  IL_0012:  ldstr      "After spin loop."
  IL_0017:  call       void [mscorlib]System.Console::WriteLine(string)
  IL_001c:  ret
} // end of method MemoryVisibility::Spin

As you’re probably not a CIL guru (pardon if you are), let me give you a little insight.
The important part, i.e. the spinning, is in these 3 lines of code:

IL_0000:  ldarg.0
IL_0001:  ldfld      bool Tests.MemoryVisibility::ok
IL_0006:  brfalse.s  IL_0000

It means:

Push the first method argument, i.e. the implicit “this” reference, onto the evaluation stack
Pop the object reference from the top of the stack and push the value of the object’s “ok” field
Check the boolean value at the top of the stack: if it is false, jump back to the instruction 2 lines above, otherwise continue with the next instruction

Conclusion: The spinning part is exactly the same (except of course the labels’ offsets) for the three programs.

And you suddenly remember that the platform (x86 or x64) only instructs the C# compiler to generate metadata that will determine which CLR will run the code, without impacting the way the C# compiler generates IL code.
And this is a good thing: only a native compiler should have to deal with the x86/x64 dichotomy.

So the issue is not at the IL level and you know what that means: you’ll have to go deeper, where no .NET developer should have to go (and where 99.42% of them will never go, and this is a good thing): in the native assembly Mordor!
But as a seasoned former C/C++ programmer, you don’t fear it!

Inside Mount Doom

In a last effort to preserve your mental sanity, you run your programs again but this time you attach Visual Studio to check the resulting native assembly code of the Spin method.

With platform x86 and with output, you get:

00000000  push        ebp 
00000001  mov         ebp,esp 
00000003  push        esi 
00000004  mov         esi,ecx 
00000006  call        5BE97904 
0000000b  mov         ecx,eax 
0000000d  mov         edx,dword ptr ds:[03352178h] 
00000013  mov         eax,dword ptr [ecx] 
00000015  mov         eax,dword ptr [eax+3Ch] 
00000018  call        dword ptr [eax+10h] 
0000001b  movzx       eax,byte ptr [esi+4] 
0000001f  test        eax,eax 
00000021  je          0000001F 
00000023  call        5BE97904 
00000028  mov         ecx,eax 
0000002a  mov         edx,dword ptr ds:[0335217Ch] 
00000030  mov         eax,dword ptr [ecx] 
00000032  mov         eax,dword ptr [eax+3Ch] 
00000035  call        dword ptr [eax+10h] 
00000038  pop         esi 
00000039  pop         ebp 
0000003a  ret

The spinning part being:

0000001f  test        eax,eax 
00000021  je          0000001F

With platform x86 but no output:

00000000  push        ebp 
00000001  mov         ebp,esp 
00000003  cmp         byte ptr [ecx+4],0 
00000007  je          00000003 
00000009  pop         ebp 
0000000a  ret

Spinning part:

00000003  cmp         byte ptr [ecx+4],0 
00000007  je          00000003

And with platform x64 without output:

00000000  mov         al,byte ptr [rcx+8] 
00000003  movzx       ecx,al 
00000006  test        ecx,ecx 
00000008  je          0000000000000006 
0000000a  rep ret

Spinning part:

00000006  test        ecx,ecx 
00000008  je          0000000000000006

Again, this bunch of cryptic code deserves some explanation:

The cmp instruction compares its two operands and sets some CPU flags depending on the result: the zero flag, a.k.a. ZF, is set (1) if they are equal, cleared (0) otherwise
The test instruction does a bitwise AND of its two operands and sets some flags depending on the result: ZF is set (1) if the result is 0, cleared (0) otherwise
The je instruction jumps to the given address if ZF is set (1)

So the loops run while a zero (false) value is provided either as the first operand of cmp or as the first and second operands of test.

But the most important thing to notice is that:

Sometimes the .NET JITter re-reads the “ok” flag from memory on every iteration, from a location shared by the main thread and the spin thread (address [ecx+4])
Sometimes it loads the value once into a CPU register (eax or ecx) and then only re-tests that register, which is accessible only to the spin thread

In the latter case, the spin thread can’t see the new flag value because it only looks in a register: this is the memory visibility issue I wanted to demonstrate at first.

So you get the answer to the initial question: the behavior varies depending on the way the different JITters (the one of the x86 CLR and the one of the x64 CLR) optimize the code when they compile the IL code into native code.
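
In C# terms, the register-caching variants behave roughly as if the field had been copied into a local variable before the loop. The sketch below is a hand-written, single-threaded analogy, not actual JIT output: HoistedSpin, its methods, the iteration cap and the onIteration callback are all hypothetical, introduced so that the effect can be observed deterministically and without hanging.

```csharp
using System;

public class HoistedSpin
{
    bool ok = false;

    public void Notify()
    {
        ok = true;
    }

    // Like the "memory" variant: the field is re-read on every iteration.
    // Returns true if the loop saw the flag before reaching the iteration cap.
    public bool SpinOnField(int maxIterations, Action onIteration)
    {
        for (int i = 0; i < maxIterations; i++)
        {
            if (ok) return true;
            onIteration();   // gives the caller a chance to set the flag
        }
        return ok;
    }

    // Like the "register" variant: the field is read once into a local,
    // and only that private copy is tested afterwards.
    public bool SpinOnCachedCopy(int maxIterations, Action onIteration)
    {
        bool cached = ok;    // single read, before the loop
        for (int i = 0; i < maxIterations; i++)
        {
            if (cached) return true;
            onIteration();   // the field may change here, but 'cached' will not
        }
        return cached;
    }
}
```

If you pass the object's own Notify method as onIteration, SpinOnField returns true, whereas SpinOnCachedCopy returns false however many iterations you allow: the field is updated, but the loop never looks at it again.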

Solution

So now you’ve understood why your code was behaving “strangely” in some contexts.

Of course, you can’t release such code into the wild, and you must fix it so that it behaves consistently whatever the CLR used to run it.

There is a well-known solution for this “issue”: tagging the data you want to protect from any over-optimization with the “volatile” metadata.

The volatile concept exists in many languages: it tells the compiler not to apply overly clever optimizations to a variable, because they could completely break the program, as demonstrated above. Checking a copy of the value instead of the value itself is indeed not a good idea here, but the compiler does not understand your code’s semantics.

With languages that are directly compiled to native code like C or C++, the volatile keyword is directly interpreted by their respective compilers when they generate the native library or executable.

But the C# compiler does far less work than a C/C++ compiler as most of the optimizations are deferred to the native compilation step done by the CLRs’ JITters.

And indeed the C# volatile modifier is simply forwarded to the JITters through some assembly metadata (System.Runtime.CompilerServices.IsVolatile), informing them that they should be cautious and ensure that:

every read of the field really goes to memory, so that it returns the latest value stored there
every write to the field really stores the new value to memory

This means that the JITters can no longer apply some optimizations, like caching the value in a register for faster access, which is of course a bad idea in our case.

So let’s try with this fix:

volatile bool ok = false;

void Spin()
{
    while (!ok) ;
}

If you now have a look at the IL generated by the C# compiler, you see:

.method private hidebysig instance void  Spin() cil managed
{
  // Code size       11 (0xb)
  .maxstack  8
  IL_0000:  ldarg.0
  IL_0001:  volatile.
  IL_0003:  ldfld      bool modreq([mscorlib]System.Runtime.CompilerServices.IsVolatile) Tests.AwesomeSpinFixed::ok
  IL_0008:  brfalse.s  IL_0000
  IL_000a:  ret
} // end of method AwesomeSpinFixed::Spin

So this is the same code protected with some “volatile” metadata.

And now it works like a charm whatever the code you put before and after the spin loop, whatever the platform you set at compile time and whatever the CLR you use at runtime.
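
If you would rather not mark the whole field as volatile, the same guarantee can be requested access by access. This is a sketch under assumptions, not the fix used above: it relies on the System.Threading.Volatile class, which is only available from .NET Framework 4.5 onwards, and VolatileReadSpin and Notify are hypothetical names.

```csharp
using System.Threading;

public class VolatileReadSpin
{
    bool ok = false;   // deliberately NOT declared volatile

    public void Spin()
    {
        // Volatile.Read performs a real read of the field on every iteration,
        // so the JITter cannot replace it with a copy cached in a register.
        while (!Volatile.Read(ref ok)) ;
    }

    public void Notify()
    {
        // Volatile.Write performs a real store of the new value to the field.
        Volatile.Write(ref ok, true);
    }
}
```

The drawback of this style is that every access has to remember to go through Volatile, whereas the volatile modifier covers all reads and writes of the field at once.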

This time, you can gaze at your brilliant work with pride for good.

Conclusion

This was a tricky one, probably the trickiest thing I’ve done with .NET, but it’s a really interesting one for at least three reasons:

It demonstrates the memory visibility issue
It shows that multi-threaded code can be quite subtle, particularly when optimizations come into play, so that great care should be taken when writing it
.NET does its best to encapsulate the underlying platform in a consistent manner, but like a lot of abstractions it is not perfect: it is a “leaky” abstraction, meaning that the programmer sometimes has to be aware of underlying details that the higher-level abstraction does not fully hide.

This last point is not an issue in itself, and there are more important leaky abstractions, like floating point numbers (but that’s for another article).

Of course, you should never synchronize your threads with such a basic construct. If, after a thorough profiling of your code, you determine that you really need spinning, you can use the .NET Framework SpinLock; but be aware that it’s a value type, so be very cautious when using it too.
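
To make this advice a bit more concrete, here is a hedged sketch of two things: waiting on a ManualResetEventSlim instead of a hand-rolled flag, and the kind of surprise the value-type nature of SpinLock can cause. SignalledWorker and SpinLockCopyPitfall are hypothetical names, and both framework types assume .NET Framework 4.0 or later.

```csharp
using System.Threading;

// A worker that waits for a signal without a hand-written spin loop.
public class SignalledWorker
{
    readonly ManualResetEventSlim signal = new ManualResetEventSlim(false);

    // Blocks the calling thread until Notify has been called.
    public void WaitForSignal()
    {
        signal.Wait();
    }

    public void Notify()
    {
        signal.Set();
    }
}

public static class SpinLockCopyPitfall
{
    // Returns true if the ORIGINAL lock is held after entering a COPY of it.
    public static bool OriginalHeldAfterLockingCopy()
    {
        SpinLock original = new SpinLock(false); // false: no owner tracking
        SpinLock copy = original;                // struct assignment copies the lock state

        bool taken = false;
        try
        {
            copy.Enter(ref taken);               // acquires the copy only
            return original.IsHeld;              // the original is untouched
        }
        finally
        {
            if (taken) copy.Exit();
        }
    }
}
```

OriginalHeldAfterLockingCopy returns false: locking a copy protects nothing, because the two structs are independent locks. This is why a SpinLock is usually kept in a non-readonly field and only ever passed by reference.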

Kudos to Hans Passant for confirming the issue and to MagnatLU for providing the debug wisdom necessary to extract the native assembly code and make the issue “clearly” appear.

If you catch any typo or mistake, have additional questions or want to share this kind of crazy experience, feel free to leave a comment.

Note to any future employer: this is not a real-life story, I promise that, except for demonstration purposes, I’ve never tried to spin a thread this way!