I've been playing around with the Parallel Extensions shipped as part of .NET 4.0 and Visual Studio 2010 RC.
Making immutable, atomic, self-contained CPU-bound operations run in parallel is pretty easy. However, as soon as you drill down to more complex executions, where issues such as memory sharing, allocations, delegates, false sharing, and synchronization mechanisms come into play, you may have trouble finding the optimal parallel execution model for your code.
I was interested in testing a specific scenario to see if the Parallel Extensions could be beneficial.
Consider the following example:
I have a method which gets a list of service endpoint addresses which I need to invoke.
I need to make the invocations concurrently and then wait until all of the results are completed and ready.
Prior to the arrival of Parallel Extensions, I used the standard asynchronous invocation pattern (Begin/EndXXX) and used “WaitHandle.WaitAll” to wait for all the results to be ready.
In this specific case, I invoke a simple operation that does only IO-bound work.
I was wondering what would be the result of using the Parallel Extensions in this case.
You should note that testing parallelism depends on the code you're executing and the machine configuration, and you should ensure that other running processes don't affect your test.
Because I'm testing an atomic IO-bound operation, I expect to find neither an improvement nor a degradation in performance when comparing the asynchronous invocation with the Parallel Extensions.
The test runs 8 service calls concurrently. I will describe it in terms of four execution patterns: asynchronous invocation, Tasks (TPL), Parallel.For, and PLINQ.
The test was executed on my machine, which has a quad-core CPU, 4 GB of RAM, and Windows 7 Ultimate 64-bit.
For each execution pattern, I run a dummy call first (to satisfy JIT and initialization) and then I monitor 5 executions, each making these 8 concurrent service calls.
The service operation implementation does only “Thread.Sleep” for 2 seconds, which means that the whole client test should take 2 seconds on average per run.
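The measurement procedure described above (one untimed dummy call, then 5 timed executions) could be sketched roughly as follows. This is only an illustration and not the code attached to the post: the names `TimingHarness`, `Measure`, and `fakePattern` are made up for the example, and the stand-in pattern sleeps for 200 ms instead of making 8 real service calls.

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using System.Threading;

static class TimingHarness
{
    // Runs one untimed warm-up call, then times `runs` executions of `pattern`.
    // Returns the elapsed milliseconds of each timed run.
    public static double[] Measure(Action pattern, int runs)
    {
        pattern(); // dummy call: absorbs JIT and first-use initialization

        var results = new double[runs];
        for (int i = 0; i < runs; i++)
        {
            Stopwatch watch = Stopwatch.StartNew();
            pattern();
            watch.Stop();
            results[i] = watch.Elapsed.TotalMilliseconds;
        }
        return results;
    }

    static void Main()
    {
        // Hypothetical stand-in for "8 concurrent service calls";
        // it sleeps 200 ms rather than 2 s to keep the demo short.
        Action fakePattern = () => Thread.Sleep(200);

        double[] timings = Measure(fakePattern, 5);
        Console.WriteLine("Average: {0:F0} ms over {1} runs",
            timings.Average(), timings.Length);
    }
}
```

Passing each execution pattern as an `Action` keeps the timing logic identical across patterns, so any difference in the averages comes from the pattern itself.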
The following represents the list of service calls my operation receives. (For the sake of the example, it’s just a list of integers, and I make one service call for each.)
static List<int> _serviceCalls = Enumerable.Repeat(1, 8).ToList();
I repeated the entire invocation of all the patterns two more times – one time right after and the last one after 1 minute of sleep.
This is because I wanted to demonstrate the fact that the Parallel Extensions take some initialization overhead when being used once in a while (you'll see that in the results).
static void CallServicesAsync(IMyService client)
{
    WaitHandle[] handles = _serviceCalls
        .Select(c => client.BeginDo(null, null).AsyncWaitHandle).ToArray();
    WaitHandle.WaitAll(handles);
}
This pattern utilizes the asynchronous invocation pattern with the standard .NET thread pool.
static void CallServicesTasks(IMyService client)
{
    Task[] tasks = _serviceCalls
        .Select(c => Task.Factory.StartNew(client.Do)).ToArray();
    Task.WaitAll(tasks);
}
This pattern utilizes the tasks infrastructure included in the Parallel Extensions.
This pattern utilizes the parallel iteration loops included in the Parallel Extensions.
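As a rough illustration of the parallel-loop pattern, here is a minimal sketch. It is not the code attached to the post: `IMyService` is reduced to a single `Do()` method, and `SleepingService`, `ParallelForSketch`, and `CallServicesParallelFor` are hypothetical names introduced for this example. The fake service sleeps for 200 ms instead of 2 seconds.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for the post's service proxy; only Do() is modeled.
interface IMyService
{
    void Do();
}

// Fake service that sleeps instead of calling out over the network.
class SleepingService : IMyService
{
    private int _calls;
    public int Calls { get { return _calls; } }

    public void Do()
    {
        Thread.Sleep(200); // the post's service sleeps for 2 seconds
        Interlocked.Increment(ref _calls);
    }
}

static class ParallelForSketch
{
    static readonly List<int> _serviceCalls = Enumerable.Repeat(1, 8).ToList();

    // Issues one blocking call per list entry. Parallel.For returns
    // only after every iteration has finished, so no explicit wait is needed.
    public static void CallServicesParallelFor(IMyService client)
    {
        Parallel.For(0, _serviceCalls.Count, i => client.Do());
    }

    static void Main()
    {
        var client = new SleepingService();
        CallServicesParallelFor(client);
        Console.WriteLine("Completed calls: {0}", client.Calls);
    }
}
```

Note that `Parallel.For` decides on its own how many iterations run at the same time, so the 8 blocking calls are not guaranteed to start simultaneously.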
.ForAll(c => client.Do());
This pattern utilizes PLINQ infrastructure.
The most consistent pattern, yielding an average of almost 2 seconds, is by far the asynchronous invocation pattern. This did not meet my expectation that the patterns would perform about the same.
This may change when they integrate the thread pool of the Parallel Extensions as the standard .NET thread pool, but currently, if I am facing a simple atomic IO-bound operation that I would like to run concurrently – I might as well keep using the standard asynchronous invocation pattern.
The following key issues led me to this conclusion:
- Initialization overhead – You can see in the results that when you access the Parallel Extensions infrastructure only once in a while, you may encounter some overhead for initializing the related context. This is noticeable in all of the Parallel Extensions infrastructures – Parallel.For, Tasks, and PLINQ; whichever ran first incurred the overhead (after the 1 minute pause).
Unfortunately, this basically means that you will see the 2-second average only when the calls are made relatively close together in time (you may see the overhead in the asynchronous pattern too from time to time, but it is usually the lowest).
- PLINQ – A specific thing about PLINQ is that it uses 1 thread per core by default.
This means that in my case where the machine has 4 cores, you would see an average of 4 seconds! (4 cores – 4 threads – 4 concurrent calls)
This is why I used the “WithDegreeOfParallelism” directive on the PLINQ query. It is extremely important to address it when dealing with IO-bound operations using PLINQ.
I just hope the everyday developer will keep that in mind. :)
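To make the PLINQ arithmetic concrete, here is a minimal sketch, not the code attached to the post. `PlinqDegreeSketch`, `EstimateSeconds`, and `CallServicesPlinq` are hypothetical names, and the estimate assumes an idealized model in which the calls run in equal-length waves with no scheduling overhead. The service call is passed in as a plain `Action` so the example stays self-contained.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;

static class PlinqDegreeSketch
{
    static readonly List<int> _serviceCalls = Enumerable.Repeat(1, 8).ToList();

    // Idealized estimate: calls run in waves of `degree`,
    // and each wave lasts as long as one call.
    public static double EstimateSeconds(int calls, int degree, double secondsPerCall)
    {
        int waves = (calls + degree - 1) / degree; // integer ceiling of calls / degree
        return waves * secondsPerCall;
    }

    // Asks PLINQ for up to one worker per call instead of its
    // default of one per core, then counts the completed calls.
    public static int CallServicesPlinq(Action serviceCall)
    {
        int completed = 0;
        _serviceCalls
            .AsParallel()
            .WithDegreeOfParallelism(_serviceCalls.Count)
            .ForAll(c =>
            {
                serviceCall();
                Interlocked.Increment(ref completed);
            });
        return completed;
    }

    static void Main()
    {
        Console.WriteLine(EstimateSeconds(8, 4, 2.0)); // 4 workers: two waves of 2 s
        Console.WriteLine(EstimateSeconds(8, 8, 2.0)); // 8 workers: one wave of 2 s
        Console.WriteLine(CallServicesPlinq(() => Thread.Sleep(200)));
    }
}
```

Under this model, 8 calls of 2 seconds each take about 4 seconds with 4 workers and about 2 seconds with 8, which matches the averages described above. In practice the thread pool may still take a moment to supply the extra threads, so the measured time can be somewhat higher than the estimate.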
In spite of these findings, you may still want to consider using the Parallel Extensions to parallelize simple atomic IO-bound operations if the issues above don't concern you. That way, you'd have what people may consider more readable code, and you'd rely on a promising infrastructure and its future optimizations.
Feel free to download the code and play with it yourself (enter the specific post page to get the attachment link).
Update 01/04: Be sure to read the second take post.