|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Announcements
Want a new Job?
Chapters
Services
Feature Zones
|
Note: This is an unedited contribution. If this article is inappropriate,
needs attention or copies someone else's work without reference then please
Report This Article
Table of Contents
IntroductionThis article has two aims. Firstly, there are a series of five exercises that detail the process of successfully multi-threading a sequential algorithm with timely progress display in the UI. It also provides an implementation of a thread local storage class which can help realise the performance promise of many-core machines. The exercises start at beginner's level, but even if you are already proficient in concurrent programming I suggest at least skimming though them as they build on one another. If you are short of time, you might like to skip ahead to the description of the new helper class in Appendix A. The code uses the December 2007 CTP of Microsoft's PFX library to hide some of the plumbing. BackgroundOverviewComputers are good at counting, so this is what I chose to use as the test load. Actually, in the last example which is the correct version, only 20% of execution time is spent doing the "work". The rest is spent in the infrastructure, but that's not a problem because it's all work to the CPUs. The interesting part is the relative comparison of the different tests, so the absolute throughput is almost irrelevant. I say almost, because it is hard to fully appreciate just how fast normal computers are these days. We are talking orders of magnitude here. Clock speeds are around 109 Hz, which is absolutely astounding if you think about it. Light moves at 3 x 108 ms-1, so in one clock cycle even light can only travel about a tenth of a metre. Put another way, say you are 2m tall. In the time it takes light to travel from your foot to your eye, a single core can execute 20 instructions. They're fast. However, even this power is not sufficient for some tasks. Since it looks unlikely that clock speeds will increase significantly, the current hardware trend is towards multi- and many-core machines. Using these machines correctly and efficiently for performance gains is the scenario addressed by this article. It is aimed at people who want to access the scalable performance improvements that concurrent software is capable of on modern machines. It must be noted that the hardware is still in its infancy. The eight-core box I use is considered high-end at the moment, but the maximum performance increase it can theoretically provide is less than an order of magnitude. If your users have two or even four-core machines, the potential gain is obviously less. I have no doubt that many-core machines are coming, but they won't be here tomorrow. This is, I believe, the reason why support from library providers, in particular Microsoft, is only now beginning to emerge.
Writing correct and efficient concurrent software is not easy at the moment
and I see no evidence of any substantial change on the horizon.
It requires both knowledge and intelligence, but also a diligent approach.
Do not expect to add PFXMicrosoft's Parallel Extensions library is used to simplify the code by removing the thread control code. This is pretty much all it does at the moment, but it is actually quite good at it. My first impression of this library was disappointment that it didn't go further, but it is useful if you don't expect too much from it. Also, remember that at the time of writing, it is just an early CTP. HardwareI tested this application on four boxes:
The relative performance of the tests were all consistent. The quoted performance metrics are from the 8-core box. ArchitectureTestAttribute
The five test classes that make up the exercise are decorated with this custom attribute.
It has TestInfo
This class holds information about each test class and can instantiate them.
It also has some static members that search the assembly using reflection
to build a full list of tests that is used by the UI.
It also has a couple of Controller
This class takes a TestBase
This is the base class for the five tests in the exercise.
It implements all the common functionality, including creating threads, running them
and firing the Factor
Each test class overrides this Concrete tests
The five concrete test classes all derive from User Interface
The user interface is a simple WPF application.
It lists each test on the left and allows you to start one by double-clicking it.
As the test executes, it updates the progress bar at 100 millisecond intervals.
It does this using a The UI is only updated ten times a second because this is all that is required to present the user with enough information. The best way to speed up your application is to make it do less work! Calculating the progress more often would only slow down the application, with no appreciable benefit to the user. Exercise 1: Blocking the UI threadThis exercise presents the problem: a computationally intensive task being performed on the UI thread. // do not do this
[Test( 1, "Blocking" )]
partial class TLS1Blocking : TestBase
{
public override bool Block { get { return true; } }
protected override void Work()
{
DoWork( Update );
}
void Update( long i )
{
Done = i;
}
}
This example is definitely not recommended.
It overrides the
The So, it does give a good base data point. This test executes at about 150 MIPS ( million iterations per second ) on my machine. This is the value against which all the other tests are compared. Exercise 2: Freeing the UI threadThe aim in this exercise is to make the UI responsive again. // recommended sequential pattern
[Test( 2, "Sequential" )]
partial class TLS2Sequential : TestBase
{
protected override void Work()
{
DoWork( Update );
}
void Update( long i )
{
Done = i;
}
}
This is the same as #1, except that it does not override the new Thread( new ThreadStart( WorkMaster ) ) { IsBackground = true }.Start();
This basically means that the UI thread is only responsible for creating the new thread and then returns to its own duties as it should. So the UI is free to keep itself up to date and properly report the progress of the test. This technique is also used in all the subsequent tests.
The Exercise 3: Introducing concurrencyThis exercise will attempt to use all the available cores. // do not do this
[Test( 3, "Wrong" )]
class TLS3Wrong : TestBase
{
public override bool Multithreaded { get { return true; } }
protected override void Work()
{
Parallel.Do( CreateActions( Update ) );
}
void Update( long i )
{
Done = i * Glob.CoreCount;
}
}
We have changed a couple of things in this test:
Firstly, we have overridden the
The second change is to the
The last change is to the This is a bit of a cheat. We now have 8 threads all counting up in steps of 8, so each thread is reporting 8 times it's actual progress. The final result will be correct, but if you run this on a multi-core machine, you will see the progress bar in the UI jump around because the different threads run at different speeds. So this test does not exhibit correct behaviour. When you run multiple threads, you cannot expect each thread to do the same amount of work in the same time. The kernel is responsible for allocating time slices to each thread. The algorithm and physical effects are so complicated that this is essentially a non-deterministic process. For example, the cores may not be all running at the same speed all the time. If you put pressure on the machine, the CPUs will heat up and when they get too hot, they slow themselves down to prevent permanent physical damage. Also, there is always something else running on your system, even if it is only the OS itself. This causes contention for resources in an effectively random manner. All these effects combined mean that you cannot predict how fast a particular thread will run. When run, this test performs at best at about 30 MIPSPC ( million iterations per second per core ). However, because it is using all 8 cores, this equates to about 240 MIPS overall, which is roughly double the performance of the sequential algorithm.
That's not great though; each core is running at a fifth the base speed.
This is because the threads are blocking each other when writing to the shared This result has to do with memory models. A strong memory model guarantees that the actual instructions executed by the CPUs are more like what you would expect when looking at the code in a sequential manner. A weaker memory model allows the CPUs to reorder reads and writes as an optimisation technique. The C# memory model is actually quite weak, which makes writing concurrent programs correctly more difficult. All it requires for normal state is that reads and writes appear not to move within a particular thread. However, the x86 memory model is actually stronger, so optimisations that are allowed by C# never actually happen. This is a problem because the Itanium memory model is weaker, so if you test on an x86, you may never see problems that will occur when a user runs your program on an Itanium system.
The contentions when the threads all try to write the So, in summary, although this test uses all available horsepower, it is both incorrect and slow. Exercise 4: Example of what not to doThis exercise will concentrate on making the program correct. // do not do this
[Test( 4, "Worse" )]
partial class TLS4Worse : TestBase
{
long _Count = 0;
protected override void Work()
{
_Count = 0;
Parallel.Do( CreateActions( Update ) );
}
void Update( long i )
{
Done = Interlocked.Increment( ref _Count );
}
}
The reason the #3 was incorrect was that each thread was writing to the
The
So using
However, considering performance, we shouldn't expect any improvements over #3.
In fact, this test performs much worse because of the extra work and contentions caused by the
The lesson to learn here is that shared state is a performance pitfall, even if you can make the execution correct. While this test is correct enough and uses all the available cores, it's performance is about an order of magnitude worse than the sequential algorithm even on an 8 core machine. We can do better. Exercise 5: Using the Thread Local Storage classThis exercise will solve the performance problems, while keeping the code correct. // this is the recommended strategy
[Test( 5, "Right" )]
partial class TLS5Right : TestBase
{
public override long Done
{
get
{
var tls = TLS;
if ( tls == null ) return 0;
return tls.Results.Sum( s => s.Done );
}
}
protected override void Work()
{
TLS = new Common.TLS<State>();
Parallel.Do( CreateActions( Update ) );
}
void Update( long i, State state )
{
++state.Done;
}
}
partial class TestBase
{
protected class State
{
public long Done = 0;
}
}
In this test, we introduce a class called
The
If you remember, the UI thread accesses the current work completed percentage ten times a second.
This is extremely slow from the point of view of the hardware, while being fast enough for the user.
So this test defers the calculation to when it is needed by overriding the So, we've removed the blocking synchronisation from the inner loop, how does it perform? We're up to 160 MIPSPC, which is actually a bit faster than the sequential algorithm. But we're also running on 8 cores, giving 1,300 MIPS overall. So on an 8 core machine, we have achieved about a factor of 8 performance improvement. Not bad. Also, this algorithm is correct and will scale with as many cores as you can give it. Appendix A: Understanding the Thread Local Storage class
Here is the implementation of the using System;
using System.Collections.Generic;
using System.Threading;
namespace Common
{
public class TLS<DATA> where DATA : class, new()
{
volatile Dictionary<int, DATA> _List =
new Dictionary<int, DATA>();
volatile object _Key = new object();
public DATA Current
{
get
{
int thread = Thread.CurrentThread.ManagedThreadId;
DATA tls = null;
if ( !_List.TryGetValue( thread, out tls ) )
{
lock ( _Key )
{
if ( !_List.TryGetValue( thread, out tls ) )
{
tls = new DATA();
Dictionary<int, DATA> list =
new Dictionary<int, DATA>( _List );
list.Add( thread, tls );
Thread.MemoryBarrier();
_List = list;
}
}
}
return tls;
}
}
public ICollection<DATA> Results
{
get
{
lock ( _Key ) return new List( _List.Values );
}
}
}
}
There's not much code, but it does some very useful things and is as bulletproof as I could make it.
The state is a collection of the generic type parameter
The
The Appendix B: Comparison with ThreadStaticAttribute
When the Appendix C: Comparison with PFX Thread Local SelectorSome of the PFX methods allow you to select thread-local data, for example: partial class Parallel
{
public static void ForEach<TSource, TLocal>(
IEnumerable<TSource> source,
Func<TLocal> threadLocalSelector,
Action<TSource, int, ParallelState<TLocal>> body,
Action<TLocal> threadLocalCleanup
)
{...}
}
This has the same problem as the You can combine the two techniques, like this: var tls = new TLS<State>();
Parallel.ForEach(
collection,
() => tls.Current,
( data, index, parallelState ) =>
{
State state = parallelState.ThreadLocalState;
...
} );
Conclusion
I hope that this article has answered some of your questions about concurrent programming.
I have not gone into too much depth in order to limit its length.
My goal in writing this was really to generate interest in the subject
and, as I said at the beginning, to provoke more questions.
You may, of course, use the Thanks for reading. ReferencesMS Parallel Computing Developer Center [^]History
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||