|
||||||||||||||||||||||||||
|
||||||||||||||||||||||||||
|
Announcements
Chapters
Services
Feature Zones
|
IntroductionNowadays, dual core PCs have become more and more affordable, and have gradually become the standard. Quad core PCs are also getting closer, and of course, PCs with greater number of CPUs/cores are also available. Since there is huge amounts of computational tasks which are quite time consuming, parallel/distributed computing of these tasks is very critical. So, as users and developers receive PCs with several CPUs/cores, there is an obvious wish to use all the computing power of these PCs and load all the cores for paralleling time consuming computations. This article is going to discuss the topic of paralleling computations in C#, distributing them effectively through all cores available in the system. We will take a very brief look at what is provided by Microsoft’s parallel computation library, but the main aim of the article is to discuss how to implement parallelism using only standard facilities of the .NET framework and how to make it easy to use, so minimal changes should be done to the existing code in order to support parallelism. Microsoft’s solutionAs it is known, Microsoft provides an extension for .NET framework 3.5, which allows parallel computations to be distributed across all the cores available in a system. A dedicated blog is also available, where different news and information about the library are available. Microsoft’s library is quite powerful, easy to use, and provides a lot of different features, which allow solving the different tasks of parallel computations. For example, let’s take a look at how to parallel the code below, which does the multiplication of two square matrices: void MatrixMultiplication( double[,] a, double[,] b, double[,] c )
{
int s = a.GetLength( 0 );
for ( int i = 0; i < s; i++ )
{
for ( int j = 0; j < s; j++ )
{
double v = 0;
for ( int k = 0; k < s; k++ )
{
v += a[i, k] * b[k, j];
}
c[i, j] = v;
}
}
}
The paralleled version of the code will look like this: void ParalleledMatrixMultiplicationMS( double[,] a, double[,] b, double[,] c )
{
int s = a.GetLength( 0 );
System.Threading.Parallel.For( 0, s, delegate( int i )
{
for ( int j = 0; j < s; j++ )
{
double v = 0;
for ( int k = 0; k < s; k++ )
{
v += a[i, k] * b[k, j];
}
c[i, j] = v;
}
} );
}
Microsoft’s solution works really well, and their library provides much more than just a single
I am not sure if all the above points are important for you or not. Personally, I am interested in supporting the Mono project, and also, I am not yet ready to switch all my projects into the .NET 3.5 version. So, in this article, we are going to discuss a custom In case you are not interested in anything else except Microsoft’s solutions, the article still may be interesting for those who are willing to learn how to implement an easy to use paralleling approach, using just plain C#. Using the codeWe’ll start by describing how to use our custom implementation of parallelism routines, and the leave implementation details for the next section. As it was stated above, our aim is to make something similar to what Microsoft provides in their parallel extensions library and make it easy to use. So, let’s just implement a variant of void ParalleledMatrixMultiplicationAForge( double[,] a, double[,] b, double[,] c )
{
int s = a.GetLength( 0 );
AForge.Parallel.For( 0, s, delegate( int i )
{
for ( int j = 0; j < s; j++ )
{
double v = 0;
for ( int k = 0; k < s; k++ )
{
v += a[i, k] * b[k, j];
}
c[i, j] = v;
}
} );
}
So, as we can see from the code above, using our custom parallelism implementation is as simple as using Microsoft’s solution, and the only difference is just the namespace name where the // Microsoft's solution
System.Threading.Parallel.For( 0, s, delegate( int i )
...
// our implementation
AForge.Parallel.For( 0, s, delegate( int i )
...
Yes, as I have already mentioned, Microsoft provides additional Implementation detailsThe complete code of our Our implementation of parallelism is going to be based on regular classes from the // Initialize Parallel class's instance creating required number of threads // and synchronization objects private void Initialize( ) { // array of events, which signal about available job jobAvailable = new AutoResetEvent[threadsCount]; // array of events, which signal about available thread threadIdle = new ManualResetEvent[threadsCount]; // array of threads threads = new Thread[threadsCount]; for ( int i = 0; i < threadsCount; i++ ) { jobAvailable[i] = new AutoResetEvent( false ); threadIdle[i] = new ManualResetEvent( true ); threads[i] = new Thread( new ParameterizedThreadStart( WorkerThread ) ); threads[i].IsBackground = true; threads[i].Start( i ); } } What is the idea behind two events which are created for each thread? The Now, it is time to see how the jobs are scheduled ... public static void For( int start, int stop, ForLoopBody loopBody )
{
lock ( sync )
{
// get instance of parallel computation manager
Parallel instance = Instance;
instance.currentIndex = start - 1;
instance.stopIndex = stop;
instance.loopBody = loopBody;
// signal about available job for all threads and mark them busy
for ( int i = 0; i < threadsCount; i++ )
{
instance.threadIdle[i].Reset( );
instance.jobAvailable[i].Set( );
}
// wait until all threads become idle
for ( int i = 0; i < threadsCount; i++ )
{
instance.threadIdle[i].WaitOne( );
}
}
}
From the code above, it can be seen that jobs scheduling is very simple - save loop attributes, mark all threads as busy ( The last step is to see how the worker threads are actually working ... // Worker thread performing parallel computations in loop
private void WorkerThread( object index )
{
int threadIndex = (int) index;
int localIndex = 0;
while ( true )
{
// wait until there is job to do
jobAvailable[threadIndex].WaitOne( );
// exit on null body
if ( loopBody == null )
break;
while ( true )
{
// get local index incrementing global loop's current index
localIndex = Interlocked.Increment( ref currentIndex );
if ( localIndex >= stopIndex )
break;
// run loop's body
loopBody( localIndex );
}
// signal about thread availability
threadIdle[threadIndex].Set( );
}
}
So, as we can see from the above code, all the worker threads just sit doing nothing, and wait until there is something to do, which is signaled by the The entire implementation is very simple, and may be done using any version of the .NET framework, which is what we wanted from the beginning. Now, it is time to test it and see its performance compared to Microsoft's solution. Performance testsHow are we going to test the performance? Well, we'll use a simple technique and just run our routines many times in a loop, checking how much time we spend for it. The test code looks something like this: // run specified number of tests
for ( int test = 0; test < tests; test++ )
{
// test 1
DateTime start = DateTime.Now;
for ( int run = 0; run < runs; run++ )
{
MatrixMultiplication( a, b, c1 );
}
DateTime end = DateTime.Now;
TimeSpan span = end - start;
Console.Write( span.TotalMilliseconds.ToString( "F3" ) + "\t | " );
test1time += span.TotalMilliseconds;
// other tests
...
Note that we are going to run several iterations of our tests, so in the end, we'll also get the average performance: // provide average performance
test1time /= tests;
test2time /= tests;
test3time /= tests;
Console.WriteLine( "------------------- AVG -------------------" );
Console.WriteLine( test1time.ToString( "F3" ) + "\t | " +
test2time.ToString( "F3" ) + "\t | " +
test3time.ToString( "F3" ) + "\t | " );
So, let's run it and see the results of our tests (the tests below were done on an Intel Core 2 Duo CPU - 2.2 GHz):
From the provided results, we can see that our implementation does not look to be worse than Microsoft’s solution. Yes, we don’t have all the features and flexibility, but we’ve met our requirements to support early .NET versions and Mono too. From the above results, it looks like our implementation even performs a bit better. Analyzing our results a bit further, we can see that Microsoft’s solution requires a bit more time on the very first run, which means that they perform a more complex initialization of worker threads, consuming more time for it. Also, from our results, we can see that decreasing matrix size results in the fact that performance increase becomes not so evident in the case where work amount is too small for parallelism. But, we'll discuss this in the next section. Do profiling before and after optimizationIt is a common and good practice to do some sort of profiling and performance test before you start optimizing some code and right after. Performance tests will show how much you got from optimization, and if you got anything at all. In many cases, you may believe that you are optimizing your code, making it run faster, but in actuality, you may get a performance decrease. Parallelism is not a panacea, and in some cases, your paralleled code may perform slower. This may happen due to the fact that the amount of paralleled work is very small and the amount of time spent on threads synchronization is much greater. To demonstrate this effect, let's take a look at the result of paralleling the same matrix multiplication, but let's take small matrices this time:
As we can see from the above result, it is much faster to multiply small matrices without any parallelism at all. Paralleling such computations just leads to the fact that all additional routines for threads synchronization take much more time than the actual useful work. So, measure performance before making decisions on which code to use. Taking into account the fact that the ConclusionSo, as we can see, we've managed to make our own small, but easy to use We did not discuss other paralleling tasks in the article, but just matrix multiplication, so more samples may be investigated to provide clearer results. Personally, I am already using this Points of interestSome tuning of the existing implementation may be done by investigating the optimal amount of background threads to use for parallelism. The current implementation, by default, creates the amount of threads which is equal to the number of cores in the system. However, this may be changed by using the As another direction of increasing performance of parallel computational tasks, the utilization of GPUs may be considered. For example, NVidia provides the CUDA library, which may be used to utilize their GPUs for general purpose computing. Taking a look at the CUDA website, we can find that it was successfully applied to many different applications. CreditsI would like to thank Israel Lot for his valuable comments on the AForge.NETAlthough the code and demo application are provided with the article, the | |||||||||||||||||||||||||