Click here to Skip to main content
Click here to Skip to main content

Concurrency Hazards: False Sharing

By , 10 Jan 2010
 

Note: The demo requires .NET 4.0 Beta 2.

Graph 1: Speedup

Table of Contents

Introduction

There are many reasons why a program doesn't scale well across multiple cores. Synchronization is often to blame, or it could be a saturated memory bus. In this article, I will concentrate on another concurrency hazard: false sharing.

In the graph above, the green line shows good scaling for up to 8 cores. There is close to linear speedup, which as good as it gets for most applications. However, the other three lines show an actual slowdown as more threads are used. This means that as you add more hardware, the wall-clock time to complete the processing actually gets longer.

There is very little difference in the data structures used for each of the four tests, but their performances are dramatically different. In fact, comparing the best and worse times (the red and green lines for 8 threads) shows a difference of about 100 times the speed. That is, instead of achieving a speedup of 7.76 x, the slowest case, in fact, shows a slowdown of 12.6 x. In seconds, that is the difference between 830 ms and 8.5 ms.

This article explores the cause of this anomaly and how it can be detected and avoided. I will also explain why one intrinsic data structure in .NET (Array) is particularly susceptible.

Nutshell

False sharing is also known as cache-line ping-ponging. It is caused by one or more cores repeatedly invalidating the caches of the other cores, even while accessing isolated state. This forces the other cores to read data from main memory instead of their local cache, which slows them down considerably: in the demo, by up to two orders of magnitude.

Details

Cache lines

The data in a cache is grouped into blocks called cache-lines, which are typically 64 or 128 bytes wide. These are the smallest units of memory that can be read from, or written to, main memory. This works well in most programs as data that is close in memory is often needed close in time by a particular thread. However, this is the root of the false sharing problem.

Cache coherence

When a program writes a value to memory, it goes firstly to the cache of the core that ran the code. If any other caches hold a copy of that cache line, their copy is marked as invalid and cannot be used. The new value is written to main memory, and the other caches must re-read it from there if they need it. Although this synchronization is implemented in hardware, it still takes time. And, of course, reading from main memory takes a few hundred clock cycles by itself.

Modern processors use the MESI protocol to implement cache coherence. This basically means each cache line can be in one of four states:

  • M odified
  • E xclusive
  • S hared
  • I nvalid

When a core modifies any data in a cache line, it transitions to "Modified", and any other caches that hold a copy of the same cache line are forced to "Invalid". The other cores must then read the data from main memory next time they need it. That's all I need to say about this here. A detailed explanation is available on Wikipedia[^], if you're interested.

False sharing

Imagine two different variables are being used by two different threads on two different cores. This appears to be embarrassingly parallel, as the different threads are using isolated data. However, if the two variables are located in the same cache line and at least one is being written, then there will be contention for that cache line. This is false sharing.

It is called false sharing because even though the different threads are not sharing data, they are, unintentionally, sharing a cache line.

Demo application

The demo basically increments counters in an int[] and measures the time taken in different configurations. For each of the four layouts of the array (padding and spacing), it measures the time taken, using from one to Environment.ProcessorCount threads. It then calculates the speedup (sequential time / concurrent time) and the efficiency (speedup / thread count).

Note: the demo should be built without optimizations, so we know exactly what code is being run.

Basic implementation

Here is a simple version of the test method:

partial class Worker
{
  const int ITERS = ( int ) 1e7;

  public TimeSpan Work( int threadCount )
  {
    // declare the counters
    int[] data = new int[ threadCount ];
    
    // each thread does an equal amount of the work
    int iters = ITERS / threadCount;
  
    // synchronization
    var mre = new ManualResetEvent( false );
    var countdown = new CountdownEvent( threadCount );

    // spawn threads
    for ( int thread = 0 ; thread < threadCount ; thread++ )
    {
      int iThread = thread; // capture

      new Thread( () =>
      {
        // anchor each thread to a core
        SetThreadAffinityMask( GetCurrentThread(), 
                               new UIntPtr( 1u << iThread ) );

        int offset = iThread;

        mre.WaitOne();

        for ( int x = 0 ; x < iters ; x++ )
            data[ offset ]++; // the 'work'
    
        countdown.Signal();
    
      } ) { IsBackground = true }.Start();
    }
    
    var sw = StopWatch.StartNew();
  
    mre.Set();
  
    countdown.Wait();
  
    return sw.Elapsed;
}

Efficiency graph

I will use graphs of efficiency from here on, as it is easier to make comparisons than when using speedup.

Graph 2: Efficiency

No padding, no spacing

The red line shows the performance just using a basic int[]. Below is a diagram of how the counters (C0 - C7) are laid out in memory. The numbers below the counters are the offsets in bytes.

As you can see, all the counters together occupy just 32 bytes, so all of them can fit in a single cache line. This means that all eight cores are reading and writing to the same cache line as fast as they can. This causes massive contention, which is shown by the slow time measurements. This is as bad as it gets for false sharing.

No padding, spacing

Let's try to separate the counters so that they are on different cache lines. We can do this by adding space between them in the array.

We do this by altering the basic code just a little:

...
  // declare the counters
  // int[] data = new int[ threadCount ];
  int[] data = new int[ threadCount * spacing ];
...
  // int offset = iThread;
  int offset = iThread * spacing;
...

The results of this test are shown in purple in the graphs. As you can see, there is some improvement, but performance is still nowhere near linear efficiency. To understand why this is, I need to take a quick look at the way Array is implemented in .NET.

Array in .NET

Every heap-allocated object has some metadata, similar to the vtable in C++, which is stored in memory just before the actual data. This metadata includes things like an object header and a pointer to a method table. The Array object has some additional metadata. This depends on whether it is a CLR SZARRAY (single-dimension, zero-based) or an MDARRAY (multi-dimension), but both include the element type and array length.

We are interested in the array size entry. This is read most times your code indexes into the array to check that your index is within the bounds of the array. I say most because for certain code patterns, some of the checks can be optimized away, but it's mostly true. The checks are done so that if your index points outside your array, you will get an IndexOutOfRangeException.

So basically, every time you write data[ offset ], there is a read of the memory just prior to the actual data in the array.

Padding, no spacing

In the code so far, the first counter has always been stored at index 0 of the data array. This is, very probably, in the same cache line as the array size member of the Array metadata. Since one core is continuously writing to this counter, the cache line is being invalidated continuously, which causes contention with all the other threads that are checking the size every time they index into the array. This is false sharing again, and is the reason why just spacing the array didn't work.

So I have added padding before the first counter in the array so that it no longer shares a cache line with the hidden size metadata.

This is achieved by this alteration of the basic code:

...
  // declare the counters
  // int[] data = new int[ threadCount ];
  int[] data = new int[ padding + threadCount ];
...
  // int offset = iThread;
  int offset = padding + iThread;
...

These results are shown in blue in the graphs. They are better than the basic array results, but apart from the two threads case, they are worse than the previous results using spacing. See the "Points of Interest" section below for an explanation of the two threads case.

In this test, we have removed the hidden size false sharing, but the counters still share a cache line. This is again causing false sharing and poor performance.

Padding, spacing

The last test case incorporates both the above solutions.

The final code looks like this:

...
  // declare the counters
  // int[] data = new int[ threadCount ];
  int[] data = new int[ padding + ( threadCount * spacing ) ];
...
  // int offset = iThread;
  int offset = padding + ( iThread * spacing );
...

These results are shown in green on the graphs, and give near-linear efficiency. So with this optimization, I have finally removed all the false sharing.

This approach is an optimization that trades space for time performance. In this case, for an eight-core machine, I use about 1 KB of space instead of 32 bytes, but the performance has increased by about a factor of 100.

Mostly readers

The last thing I want to show you with the demo is that false sharing can occur even when you are mostly reading, as long as there is at least one writer. With the check box checked, only the first counter is written and the rest are only read. Here is the results graph:

Graph 3: Mostly readers

The performance is slightly better than when all the threads are writing, but it is still terrible for the three false sharing tests. This shows that it just takes one core, continuously writing to a cache line, to cause enough contention to thwart the other seven cores.

Solution

The previous section on the evolutions of the demo show how to solve the problem of false sharing, and it is quite simple: move hot variables apart in memory by at least the size of a cache line - usually 64 or 128 bytes. This is actually the opposite of the hot / cold optimization technique for sequential programs where you try to put the most frequently accessed data close in memory.

It really is that simple. The "art" is in diagnosing false sharing in the first place and locating the data and code paths that are causing the problem. This is addressed in the next section.

Detection

Qualitative

There are many possible causes of poor performance in concurrent software. As I said, synchronization is often to blame, if you have had to use it. Diagnosing a problem is still more art than science, but the clues are there.

If you suspect poor scaling, you probably have a rough feeling of how fast your code should be executing for linear efficiency. Your wall clock measurements tell you that you are falling short of that. It's a start, but you must also have a good understanding of how your implementation fits together. If you don't, you won't be able to reason about how the measurements you make relate to your code.

The first tool you will use is Task Manager. For false sharing, you will see 100% processor usage across all cores. This contrasts with say, an asynchronous IO implementation, which would show low processor usage. In the Task Manager, you can also check the memory usage, the number of threads, and the number of handles held. These won't be a problem for false sharing, but they can rule out other causes.

Next, it is useful to accurately measure performance as I have done in the graphs: measuring times using from one to all cores. Obviously, to do this properly, you will need a machine that has at least as many cores as your expected clients' machines. For false sharing, you will see efficiencies significantly less than 100%, and which vary a fair bit between test runs. This is in contrast to say, a coarse grained lock convoy, which would show a consistent speedup of 1. That is to say, as you add cores, the wall clock time would stay pretty much constant.

Just to make things interesting, your measurements are unlikely to be as conclusive as those in this article. I have used a very small unit of work (just incrementing a counter) to highlight the effects of false sharing. In real code, your false sharing will probably affect a much smaller percentage of the work and so the effect on run times will be masked.

Quantitative

The previous section may sound like a lot of hand-waving, but with experience you can perform a basic triage of possible problems and rank them for further investigation. This section presumes that you suspect false sharing and shows you how to use Visual Studio to verify or eliminate false sharing as a cause.

I am using VS 2010 beta 2, but these profiling tools were available in some versions of VS 2008. There are also a whole load of excellent new debugging tools in VS 2010, but that is a subject for another article.

I will now walk through the process of detecting false sharing. Create a new performance session using sampling and open its properties. In the Sampling page you can change the sample event from "Clock Cycles" to "Performance Counter". This will present you with a bewildering range of counters to choose from. Select either "L2 Misses" (if it is available) or "L2 Lines In" as shown here:

Performance properties

You can also set the "Sampling Interval" to something smaller than the default. I have chosen to sample at each 100,000 events.

Now, launch the profiler and exercise the suspect functionality. If false sharing is occurring, you will see a lot of cache misses. This is effectively measuring the number of times a cache line has been marked as "Invalid" because another core has written to it. The summary screen will helpfully show your "Hot Path" like this:

Profile summary

If you click on the offending method, you will be taken to the "Function Details" page. Here, even more helpfully, the offending statement(s) will be highlighted:

Worker.cs

This result has confirmed that the increment statement is causing a lot of cache misses, which is an indication of false sharing. You now know where to look in your code and what to look for. Hopefully, you will be able to relocate the hot data in memory and fix the bug.

Points of interest

Two threads case

The test results for the two threads case look suspiciously good. The reason for this is the hardware cache layout in my machine and the fact that I am P/Invoking SetThreadAffinityMask to anchor each thread to a particular core.

In my machine, each core has 32 KB of L1 data cache, and each pair of cores share 6 MB of unified L2 cache. So, when testing with two threads on cores 0 and 1, they have a shared L2 cache. This means that only this L2 cache needs to be involved in cache coherence for the false-shared cache line. Although the L2 cache is slower than the L1 cache (17 cycles instead of 3 cycles), it is still a lot quicker than main memory (hundreds of cycles). This mitigates the false sharing effect, although it is still measurable.

Conclusion

Well, that's about it. False sharing is a real world problem that exists in a lot of concurrent software. As always, prevention is better than cure, and I hope I have raised awareness of this hazard and that you will avoid it altogether. If you do suspect false sharing in existing code, I have shown you how to diagnose and fix it.

I hope you've enjoyed reading this article. Please leave a vote and / or a comment below. Thanks.

History

  • 2010 - 01 - 11: First version.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Nicholas Butler

United Kingdom United Kingdom

I built my first computer, a Sinclair ZX80, on my 11th birthday in 1980.
In 1992, I completed my Computer Science degree and built my first PC.
I discovered C# and .NET 1.0 Beta 1 in late 2000 and loved them immediately.
I have been writing concurrent software professionally, using multi-processor machines, since 1995.
 
In real life, I have spent 3 years travelling abroad,
I have held a UK Private Pilots Licence for 20 years,
and I am a PADI Divemaster.
 
I now live near idyllic Bournemouth in England.
 
If you would like help with multithreading, please contact me via my website:
 
 
I can work 'virtually' anywhere!

Sign Up to vote   Poor Excellent
Add a reason or comment to your vote: x
Votes of 3 or less require a comment

Comments and Discussions

 
You must Sign In to use this message board.
Search this forum  
    Spacing  Noise  Layout  Per page   
GeneralMy vote of 5memberKjellKod.cc4-Dec-11 10:24 
I liked it, I have used similar demonstrations using C++ . I liked your approach with the demo program
GeneralRe: My vote of 5mentorNicholas Butler14-Dec-11 0:59 
Thanks Smile | :)
 
Nick

GeneralCache line size questionmembergilgamash8-Dec-10 23:26 
Greetings Nicholas,
 
first of all thanks for a nice and quick to grab, well written article. Just two short questions:
 
1) How do you choose the cache size in general? In your article you assume 128 bytes, but would you suggest in general to call some cpuid-instruction?
2) Any idea whether a pointer in C++ for arbitrary objects is of rather constant size and/or whether the size is clear already before 'new'ing? Basically, should I pad the cashline before the new operand (that is right after declaring for instance a pointer) of after the new operation, or does it not matter at all?

Thanks a lot for help and best wishes,
G.
GeneralRe: Cache line size questionmentorNicholas Butler8-Dec-10 23:38 
Thanks - glad you like it Smile | :)
 
1) Intel CPUs generally use 64-byte cache lines and AMD use 128-byte. You can P/Invoke GetLogicalProcessorInformation if you want to be 100% sure.
 
2) I'm not sure I understand this one. A C++ pointer is 32 bits in a 32-bit process and 64 bits in a 64-bit process.
 
Nick
 
EDIT: oops

modified on Thursday, December 9, 2010 5:44 AM

GeneralRe: Cache line size question [modified]membergilgamash8-Dec-10 23:53 
Hi again and thanks for the quick reply. I guess GetLogicalProcessorInformation is some MS-specific command? I am currently working with Gcc on a QNX-Sytem, but I will consider calling the cpuid assembly command my architecture offers.
 
Concerning the pointers: I sorta misformulated the question. My point were the different standards -- check Fog, Agner: Callong conventions for different c++ compilers and operating systems -- according to which for instance data member pointer vary in size depending on compiler and OS (and here I have no idea whether Linux and QNX behave identically). To be on the safe side, I will probably use a size-of operator.
 
Regards,
G.

modified on Thursday, December 9, 2010 6:15 AM

GeneralMy vote of 5memberSuntorn27-Oct-10 12:35 
Really works!!!
GeneralGreat demo and real implementation! 10/10members391029327-Oct-10 11:24 
Hi Nicholas,
 
The article is one of the most helpful papers to me.
 
Still, I have couple questions.
1. On your testing hardware, the cache line size is 128 bytes, isn't it?
The padding and spacing parameter are 32 in a number of the integer unit so the actual offset for each thread is 4x32 =128 bytes.
Therefore, if the program is run on a low-end system with 64 bytes cache lines. The padding and spacing size should be down to 16, again am I right?
 
2. I tried both cases, but the performance results didn't show a big difference. One evident thing is space used for the array allocation. Do you think so?
 
Again,thanks for your writing this article.
 
Regards,
Suntorn
GeneralReal situationsmemberfederico.strati14-Sep-10 23:35 
I come from a c++ environment,
hence I didn't get the whole point
of the exercise: apart from using
a particular kind of structure
(i.e. the array) that is somewhat
"shared" as the memory usage goes
what would be the real situations
alike?
 
I mean... in real programming
in a multithreaded environment
you usually take care to differentiate
thread local storage from shared
memory constructs.
 
In other words I'm curious:
Is there an example in C++ of such
a problem or is this problem only
related to the .Net way to handle
a particular structure (i.e. the array)?
 
Thanks
GeneralRe: Real situationsmemberfederico.strati14-Sep-10 23:57 
Actually, your previous article:
"Superlinear: an investigation into concurrent speedup"
http://www.codeproject.com/KB/threads/Superlinear.aspx
does answer a bit my question:
no false sharing in C++ if you use an array.
Anyway, still I don't get the point about
possible false sharing in C++.
Is false sharing related only to the
way .Net framework is built?
 
Cheers
GeneralMy vote of 5memberEric Xue (brokensnow)6-Sep-10 21:19 
interesting article! thanks for sharing
GeneralMy vote of 5memberSauro Viti22-Jul-10 3:25 
very good article: it explains a complex problem in an easy and understandable way
GeneralMy vote of 5memberGünther M. FOIDL20-Jul-10 0:56 
Well done!
GeneralRe: My vote of 5mentorNicholas Butler20-Jul-10 9:06 
Thanks Smile | :)
 
BTW, I enjoyed your article on Vernam encryption too.
 
Nick
GeneralI would give 10 if I couldmemberOmari O.20-Feb-10 3:41 
Thumbs Up | :thumbsup:
QuestionDo Windows concurrency primitives lead to false sharing?memberhuebel3-Feb-10 2:40 
Thanks for an enlightening article and good discussions.
 
Is there any light on why the results are "less consistent" on release and where do these "fluctuations" come from? Could it be the same thing, that Marc Clifton experienced as he tells another sad story about Microsoft in his comment on this article?
 
I can't help wondering though: Even when it comes down to the physical workings of multi-cores, the processes must still fight for resources, be it a number of garbage collectors or a virtual memory manager all competing for access to the same physical memory.
 
Serializing access to a device (RAM), which is accessible only via a bus, will necessarily have to occur on any computer using a von Neumann architecture.
 
By using the performance counters of an underlying layer (here the CPU itself), can we really avoid to influence what is experienced in the upper layer? Sort of the "Heisenberg"-principle (when you inspect a system, you inevitably change some aspect of it)...
 
Back to your example: Even if each thread has an affinity to a core, there are the OS processes influencing that. So what L2 misses are you measuring? There obviously is some sampling going on, which stores some numbers in memory, and associates them to an abstract representation of the program code, giving you the profilers result.
 
I might remember the nature of Windows' concurrency primitives wrongly -- I've been working inside virtual machines for too long Wink | ;-) -- but doesn't synchronisation somehow imply false sharing?
 
Really nice, thought provoking, article. 5 stars from me!
GeneralGood Job!!memberyash714721-Jan-10 20:51 
You have explained the complex things in a very easy and understandable way. Keep writing. Thanks Smile | :)
 
yash

GeneralRe: Good Job!!mentorNicholas Butler23-Jan-10 21:54 
Thank you - I'm glad you liked it Smile | :)
 
Nick
 
----------------------------------
Be excellent to each other Smile | :)

GeneralFantastic articlememberdisore21-Jan-10 3:42 
You definitely got my 5.
 
Why is it drug addicts and computer afficionados are both called users?
--Clifford Stoll

GeneralRe: Fantastic articlementorNicholas Butler23-Jan-10 21:53 
Thanks Smile | :)
 
Nick
 
----------------------------------
Be excellent to each other Smile | :)

GeneralAwesome, and patternsmemberalex turner19-Jan-10 0:58 
Awesome, just awesome. It is great to see someone working close to the machine occasionally. This is the sort of hard core software engineering which separated the great applications from the also-rans.
 
I have a thought on this area. Do you think that immutability could act as a general approach to avoid this issue? In the case of multiple readers, one updater, if a program is written such that a write creates a new instance rather than mutating an existing one, then false sharing would not occur unless one was unlucky enough for the new object to co-exist on the same line as one being read.
 
Whilst immutability seems expensive (creation of a lot of objects) functional patterns appear to offer some reasonable approaches to avoiding excessive mutation. For example, the lack of marshalling intermediate values through variables avoids excessive mutation.
 
Just a thought - and thanks again for the post!
 
- AJ
 
www.nerds-central.com - welcomes all nerds Smile | :)

GeneralRe: Awesome, and patternsmentorNicholas Butler19-Jan-10 2:03 
Thanks AJ Smile | :)
 
Your suggestion about using immutability is certainly valid. When I design a concurrent algorithm, I always try these approaches ( best first ): isolation ; immutability ; synchronization.
 
The problem with immutability in .NET is, as you say, creating lots of short-lived objects. Actually, it's not creating them that's the problem - it's the GC collecting them. Using the Server GC helps a bit on multicore hardware, but doesn't really solve the problem.
 
Nick
 
----------------------------------
Be excellent to each other Smile | :)

GeneralExcellentmemberPAU_CARRE18-Jan-10 20:30 
Simply one of the best articles I have ever read in Code Project. It is very rigorous and detailed as well properly explained, clear and easy to understand.
Thanks for sharing your knowledge
...anyway I wonder if the metadata issue applies to Java...
GeneralRe: ExcellentmentorNicholas Butler19-Jan-10 1:35 
Thank you very much Smile | :)
 
Nick
 
----------------------------------
Be excellent to each other Smile | :)

GeneralTopmemberdojohansen12-Jan-10 23:11 
Excellent article, a five from me. The problem you describe is of course only relevant in a few particular cases and in many pieces of software one can probably ignore it completely at no real cost. On the other hand I'm pretty sure there are cases where false sharing really makes a huge difference, so it's certainly useful to be aware of. Most importantly I think you've made a good choice in limiting the scope to discussing what false sharing is and how it can be worked around. The presentation is clear and easy to follow, and in my view the decision to write the code deliberately so that it highlights the problem (with thread affinity and unoptimized code) rather than give a "realistic" example is the right one.
 
I think the article could be further improved by comparing your solution with another way of eliminating false sharing: By storing the thread data on the stack, i.e. using a recursive rather than iterative approach. It seems to me that with some data structures, say traversing a tree of depth 7 with 50,000 nodes, using recursion would be both easier and eliminate the false sharing at less space cost than trying to pad the data structure.
GeneralRe: TopmentorNicholas Butler14-Jan-10 4:45 
Thanks for your comments and vote Smile | :)
 
There are many ways of locating data to avoid false sharing. Talking about stacks: if you are using a value type, one of the easiest ways is to put the data on each thread's own stack.
 
Nick
 
----------------------------------
Be excellent to each other Smile | :)

QuestionWhat about Release configuration?memberAlex Fr12-Jan-10 2:53 
You suggest to run this sample without optimization. However, final conclusions regarding program performance are always done by optimized Release build. What are Release build relults?
Generally, great information, I work a lot with code optimizations, and never think about this issue.
AnswerRe: What about Release configuration?mentorNicholas Butler12-Jan-10 6:33 
Hi Alex,
 
Good question! The results for an optimized build are much the same, but there are more fluctuations between runs.
 
When you allow optimizations, you don't know for sure what csc, the JIT compiler and your Out-of-Order CPUs are actually doing. For instance, this optimization is legal and may be applied some of the time:
...
 
  for ( int x = 0 ; x < iters ; x++ ) data[ offset ]++;
 
  // this is a legal optimization:

  data[ offset ] += iters;
 
...
I tested with optimizations switched on and the only change was that the results were less consistent. Since this article aims to demonstrate false sharing, I just chose the simplest way to do that.
 
Nick
 
----------------------------------
Be excellent to each other Smile | :)

GeneralRe: What about Release configuration?memberAlex Fr12-Jan-10 21:45 
I asked this, because many "false optimizations" give performance boost only in Debug configuration. For example, in C++, for this function:
 
int Sum(int* pArray, int size)
{
    int result = 0;
 
    for (int i = 0; i < size; ++i)
    {
        result += pArray[i];
    }
 
    return result;
}
 
we can write "false optimization":
 
int Sum(int* pArray, int size)
{
    int result = 0;
 
    for (int i = 0; i < size; ++i)
    {
        result += *pArray++;
    }
 
    return result;
}
 
In Debug configuration, the second version runs much better, because it doesn't compute array offset. However, in Release configuration both versions have the same performance - compiler is smart enough to improve our code. Also, second version is less readable, and cannot be parallelized with OpenMP or parallel_for. Sometimes "false optimizations" may increase Release build execution time.
To prove your concept, it is necessary to write the code which cannot be optimized by the simple way you show: data[ offset ] += iters, and test Release execution time.
 
Regards,
Alex.
GeneralSuggestionsmemberRobodroid12-Jan-10 1:47 
Very good article. Here are a couple of suggestions to improve it:
 
First, you can optionally make the writer in the "single writer" option the last thread. This way, you can add evidence to your hypothesis that it's cache-line ping-ponging. (Not to say you haven't adequately proved it later in the article.) On the two-processor case, it should greatly improve the efficiency and show that the array size and first element (in the case of spacing) are in contention.
 
Second, you might want to provide a few guidelines on how to avoid some of the larger pitfalls like this one. One that I used is to cluster thread data together. Rather than have:
 
int data1[threadCount];
int data2[threadCount];
int data3[threadCount];
...etc...
we would do something like:
struct threadData {
    int data1;
    int data2;
    int data3;
    ...other task variables...
    char FILLER[...]; // Pad to end of current cache-line
};
threadData threads[threadCount];
 
This way, (1) each thread will consume fewer cache-lines, since all the thread data is close together; (2) you won't consume as much space in your program, since one padding item will pad all your thread-specific data.
GeneralRe: SuggestionsmentorNicholas Butler12-Jan-10 5:43 
Thanks robodroid Smile | :)
 
Good points - you are of course quite right.
 
Nick
 
----------------------------------
Be excellent to each other Smile | :)

QuestionGreat ReadmemberKarl Nicoll11-Jan-10 22:57 
This is a great read, and a very insightful look at the problem of multithreading efficiency. Definitely a 5/5 article for me.
 
Out of curiousity though, are there any improvements if you turn optimizations on? Forgive my ignorance if this is a stupid question, as I suspect it might be!
 
-- Karl Nicoll Smile | :)
AnswerRe: Great ReadmentorNicholas Butler12-Jan-10 0:38 
Thanks Karl Smile | :)
 
Karl Nicoll wrote:
Out of curiosity though, are there any improvements if you turn optimizations on?

 
As far as I can tell, there are no optimizations applied that solve false sharing.
 
If you run the demo with optimizations turned on, you will see slightly better results, but not consistently. This is because the supporting code around the false sharing part is being optimized.
 
Nick
 
----------------------------------
Be excellent to each other Smile | :)

GeneralGreat article, but...memberRob V.11-Jan-10 21:28 
Great article, but once again it's about some minutae that the compiler, or w/.NET, the CLR, should deal with. The .NET CLR should detect one, dual, quad cores, and optimize the code accordingly.
 
Of course, the problem is that nobody writes compilers any more, so the people working on the .NET runtime environment just don't get it. Not that Java is any better, the more emphasis on JIT in Java means less effort to try to optimize for the CPU(s) the coed is running on.
 
Ultimately, the reason to use .NET or Java is portability, but with the costs already associated with it, I shouldn't have to read a deep article like this, and put all this extra work into my code: the underlying runtime environment should do it for me. If no, I might as well go back to C++.
GeneralRe: Great article, but...mentorNicholas Butler12-Jan-10 0:30 
Thanks Rob Smile | :)
 
There's been a lot of research into smart tools for concurrent software over the last 50 years - it's not a new problem in Computer Science!
 
The consensus seems to be that OO languages are too difficult for tools to reason about. Functional languages promise much, but we'll have to wait and see if they become mainstream.
 
If you're writing concurrent code at the moment, there's still a lot to (re)learn about hardware as you are much closer to the metal than if you were say, writing for asp.net. It's a whole different problem that requires a whole different set of skills.
 
Nick
 
----------------------------------
Be excellent to each other Smile | :)

GeneralBrrrrrrrrilliant!!!!!!!!!!!membersamir4118011-Jan-10 20:48 
An absolutely fantastic article....very informative and knowledge packed content. I am very delighted to have read this article first thing in the day.
GeneralRe: Brrrrrrrrilliant!!!!!!!!!!!mentorNicholas Butler12-Jan-10 0:16 
Thanks Samir Smile | :)
 
I'm glad to have helped you start your day on a positive note.
 
Nick
 
----------------------------------
Be excellent to each other Smile | :)

GeneralVery Well Thought Outmembersam.hill11-Jan-10 12:51 
Excellent article.
I also appreciate Marc's comments.
This is an extremely important, but complex topic and I will be studying your code and your text carefully.
 
Thanks.
GeneralRe: Very Well Thought OutmentorNicholas Butler12-Jan-10 0:14 
Thanks Sam Smile | :)
 
sam.hill wrote:
This is an extremely important, but complex topic

 
Absolutely! I expect concurrency will become one of the most important topics of the next few years.
 
Nick
 
----------------------------------
Be excellent to each other Smile | :)

GeneralExcellent articleprotectorMarc Clifton11-Jan-10 12:01 
I thought I'd share this experience...
 
About 10 years ago I wrote an app in C++/MFC to do some intensive analysis of a switch ring network for communication satellites. Everything was single-threaded because, well, 10 years ago, multiple cores was not really an option.
 
About a year ago I revamped the code to utilize multiple cores and stuck a C# front-end on it. On my test machines, it looked good, 100% utilization of both cores. However, when my client got around to testing the app on a multiple core system, he discovered that the analysis times were LONGER!
 
So, first moral of the story is, don't trust what you see in the CPU utilization. To this day, I still don't know why my systems (Vista) are showing 100% utilization and his (XP and Vista) aren't. I borrowed a 4 core XP system from another client (very generous of him) and sure enough, I was getting about 25% aggregate utilization and it was a lot slower.
 
Searching around for ideas and also stripping pieces out of the C++ code to see where the problem was, I discovered that any heap allocations by the STL was causing a tremendous slowdown. Googling that, I came across heap managers that replace the MVC managers, because apparently they do locks and are completely unfriendly to multiple threads.
 
Moral #2: Don't trust anything Microsoft writes. OK, that's a bit harsh, but I am using the latest MFC, VS 2008, etc., you would think they could write their framework better.
 
After fussing with what was obviously really kludgy (and expensive!) solutions for replacing the heap manager, I ended up breaking out the core processing into completely separate processes and using Named Pipes with a C# front-end as the interface.
 
Result: my client reports a 15x speed improvement on an 8 core system!
 
So, I realize this story is somewhat off topic, but I wanted to share it, as I'm pretty shocked at the hoops I had to go through to get the performance that I would have expected to get simply by doing the analysis in separate threads under the same process. Sadly, it wasn't that simple, and I'm now left with a major distrust of any framework and core system functions.
 
Marc
GeneralRe: Excellent articlementorNicholas Butler12-Jan-10 1:06 
Thanks Marc Smile | :)
 
Memory is still a problem in .NET. Allocating it is fast, but the GC causes problems. Using the server GC helps on multi-core hardware, but the solution is not to allocate it in the first place!
 
I built my 8-core box nearly 2 years ago and I've learnt a lot from using it. Do you remember this article: PetriDish: Multi-threading for performance in C#[^]? In the conclusion, I recommended getting a many-core box as they behave so differently from dual core boxes. Two years later I believe that even more. You just can't predict how an application will scale without actually trying it out.
 
Anyway, I'm glad your client is happy. When you get multi-threading right it's very rewarding!
 
Nick
 
----------------------------------
Be excellent to each other Smile | :)

RantRe: Excellent articlememberPete Appleton12-Jan-10 21:21 
Moral 2: Don't trust anything Microsoft writes
 
And you haven't learnt that already in 10 years? Sigh | :sigh: Although I would have thought that something this basic would be dealt with by the compiler, especially given how they're banging on about multi threading at the moment. OTOH, when I look at the performance metrics for a lot of the MS server software when under load then I often see lousy threading performance so it shouldn't really be a surprise...
 
Seriously, this article is excellent and was a real eye opener for me, and your comment also adds value. One thing that I think we tend to forget is that at the end of the day we're still pushing electrons around a piece of silicon, and all of the abstractions of the languages / frameworks are exactly that; at the end of the day, if you want to use the hardware efficiently then you need to understand it!
 
--
What's a signature?

GeneralExcellent writing stylememberRabinDl11-Jan-10 11:10 
Just forced me to understand subject so well ,even I was just a newbie about concurrency Wink | ;)
GeneralRe: Excellent writing stylementorNicholas Butler12-Jan-10 0:11 
Thanks Rabin Smile | :)
 
Nick
 
----------------------------------
Be excellent to each other Smile | :)

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Rant Rant    Admin Admin   

Permalink | Advertise | Privacy | Mobile
Web03 | 2.6.130617.1 | Last Updated 11 Jan 2010
Article Copyright 2010 by Nicholas Butler
Everything else Copyright © CodeProject, 1999-2013
Terms of Use
Layout: fixed | fluid