Introduction

Nowadays, dual core PCs have become more and more affordable, and have gradually become the standard. Quad core PCs are also getting closer, and of course, PCs with greater number of CPUs/cores are also available. Since there is huge amounts of computational tasks which are quite time consuming, parallel/distributed computing of these tasks is very critical. So, as users and developers receive PCs with several CPUs/cores, there is an obvious wish to use all the computing power of these PCs and load all the cores for paralleling time consuming computations.

This article is going to discuss the topic of paralleling computations in C#, distributing them effectively through all cores available in the system. We will take a very brief look at what is provided by Microsoft’s parallel computation library, but the main aim of the article is to discuss how to implement parallelism using only standard facilities of the .NET framework and how to make it easy to use, so minimal changes should be done to the existing code in order to support parallelism.

Microsoft’s solution

As it is known, Microsoft provides an extension for .NET framework 3.5, which allows parallel computations to be distributed across all the cores available in a system. A dedicated blog is also available, where different news and information about the library are available.

Microsoft’s library is quite powerful, easy to use, and provides a lot of different features, which allow solving the different tasks of parallel computations. For example, let’s take a look at how to parallel the code below, which does the multiplication of two square matrices:

void MatrixMultiplication( double[,] a, double[,] b, double[,] c )
{
    int s = a.GetLength( 0 );

    for ( int i = 0; i < s; i++ )
    {
        for ( int j = 0; j < s; j++ )
        {
            double v = 0;

            for ( int k = 0; k < s; k++ )
            {
                v += a[i, k] * b[k, j];
            }

            c[i, j] = v;
        }
    }
}

The paralleled version of the code will look like this:

void ParalleledMatrixMultiplicationMS( double[,] a, double[,] b, double[,] c )
{
    int s = a.GetLength( 0 );

    System.Threading.Parallel.For( 0, s, delegate( int i )
    {
        for ( int j = 0; j < s; j++ )
        {
            double v = 0;

            for ( int k = 0; k < s; k++ )
            {
                v += a[i, k] * b[k, j];
            }

            c[i, j] = v;
        }
    } );
}

Microsoft’s solution works really well, and their library provides much more than just a single Parallel.For(). However, there are some issues, which may make this library not so preferred for your application:

The parallel computations extension is targeted for .NET framework 3.5. The 3.5 version is really nice, and provides a lot, but some applications may still want to support earlier .NET framework versions, like 2.0, for example, which makes it impossible for them to use this extension.
The parallel computations extension is not yet included into the standard .NET framework installation, but is provided as a separate component, which should be installed on a target system as well. This may complicate the distribution of your product a bit, requiring you to provide special instructions to users specifying all the dependencies, or to incorporate the installation of the extension into your product’s installation.
The parallel computations extension provided by Microsoft is aimed to run on Windows systems. However, if you are developing cross platform applications, which should also run on Linux, for example, under the Mono environment, then you will be left without the nice paralleling features. Although parallel extensions are going to appear in Mono soon, they are not yet available in the current version (1.9).

I am not sure if all the above points are important for you or not. Personally, I am interested in supporting the Mono project, and also, I am not yet ready to switch all my projects into the .NET 3.5 version.

So, in this article, we are going to discuss a custom Parallel.For() implementation, which could be used with early .NET versions as well as with Mono, and also may be incorporated easily into any project, since it represents just a tiny DLL assembly. By doing our own implementation, we can maintain the simplicity of its usage, and a good performance which is not going to be worse than that provided by Microsoft’s parallel extension. Note: we are not going to implement a complete analogue of the parallel extensions library provided by Microsoft, but just the Parallel.For(), which covers quite a lot of most paralleling tasks.

In case you are not interested in anything else except Microsoft’s solutions, the article still may be interesting for those who are willing to learn how to implement an easy to use paralleling approach, using just plain C#.

Using the code

We’ll start by describing how to use our custom implementation of parallelism routines, and the leave implementation details for the next section. As it was stated above, our aim is to make something similar to what Microsoft provides in their parallel extensions library and make it easy to use. So, let’s just implement a variant of Parallel.For() which is provided by Microsoft. The only difference of our variant is that we’ll have just a single definition of the method, which accepts start and stop indexes of the for-loop and the loop’s body as a delegate. Below is our square matrix multiplication code, but paralleled using our Parallel.For() implementation.

void ParalleledMatrixMultiplicationAForge( double[,] a, double[,] b, double[,] c )
{
    int s = a.GetLength( 0 );

    AForge.Parallel.For( 0, s, delegate( int i )
    {
        for ( int j = 0; j < s; j++ )
        {
            double v = 0;

            for ( int k = 0; k < s; k++ )
            {
                v += a[i, k] * b[k, j];
            }

            c[i, j] = v;
        }
    } );
}

So, as we can see from the code above, using our custom parallelism implementation is as simple as using Microsoft’s solution, and the only difference is just the namespace name where the Parallel class is defined:

// Microsoft's solution
System.Threading.Parallel.For( 0, s, delegate( int i )
...

// our implementation
AForge.Parallel.For( 0, s, delegate( int i )
...

Yes, as I have already mentioned, Microsoft provides additional Parallel.For() definitions, which may work not only with delegates, but with lambda expressions, for example. But, this is just an extra flexibility feature which comes for a cost. As for me, delegates are enough and they don’t require .NET framework 3.5.

Implementation details

The complete code of our Parallel.For() implementation may be found in the attachments to the article, but here, we’ll just discuss the main idea, which is concentrated in three main routines – initialization, job scheduling, and execution.

Our implementation of parallelism is going to be based on regular classes from the System.Threading namespace, which are available in any version of .NET – Thread, AutoResetEvent, and ManualResetEvent. To parallel for-loops, we need to create a certain amount of threads, which will be used to execute iterations of the for-loop body, but the events are required to signal about thread availability and job availability. By default, we create the amount of threads which is equal to the amount of cores in the system, but this value may be configured by the user, so more (or less) threads will be used to parallel loops.

// Initialize Parallel class's instance creating required number of threads
// and synchronization objects
private void Initialize( )
{
    // array of events, which signal about available job
    jobAvailable = new AutoResetEvent[threadsCount];
    // array of events, which signal about available thread
    threadIdle = new ManualResetEvent[threadsCount];
    // array of threads
    threads = new Thread[threadsCount];

    for ( int i = 0; i < threadsCount; i++ )
    {
        jobAvailable[i] = new AutoResetEvent( false );
        threadIdle[i]   = new ManualResetEvent( true );

        threads[i] = new Thread( new ParameterizedThreadStart( WorkerThread ) );
        threads[i].IsBackground = true;
        threads[i].Start( i );
    }
}

What is the idea behind two events which are created for each thread? The threadIdle events are required to signal if a thread is idle doing nothing (available for some job), or busy performing some calculations. The jobAvailable events are required to signal to a particular thread that it needs to wake up and do some work. So, after the above initialization is done, all threadIdle events are set into signaled state, what means that they are all in idle state and available to do something, but all jobAvailable events are set into non-signaled state, which means there is no work for the threads yet. All threads wait on these jobAvailable events, and right after they are turned into signaled state, the threads will wake up and start their job. We’ll see a worker thread’s function a bit later, so it will become clearer how the threads get their job.

Now, it is time to see how the jobs are scheduled ...

public static void For( int start, int stop, ForLoopBody loopBody  )
{
    lock ( sync )
    {
        // get instance of parallel computation manager
        Parallel instance = Instance;

        instance.currentIndex   = start - 1;
        instance.stopIndex      = stop;
        instance.loopBody       = loopBody;

        // signal about available job for all threads and mark them busy
        for ( int i = 0; i < threadsCount; i++ )
        {
            instance.threadIdle[i].Reset( );
            instance.jobAvailable[i].Set( );
        }

        // wait until all threads become idle
        for ( int i = 0; i < threadsCount; i++ )
        {
            instance.threadIdle[i].WaitOne( );
        }
    }
}

From the code above, it can be seen that jobs scheduling is very simple - save loop attributes, mark all threads as busy (threadIdle[i].Reset( )), signal to all threads that there is a job for them (jobAvailable[i].Set( )), and then just wait until the threads become idle again.

The last step is to see how the worker threads are actually working ...

// Worker thread performing parallel computations in loop
private void WorkerThread( object index )
{
    int threadIndex = (int) index;
    int localIndex = 0;

    while ( true )
    {
        // wait until there is job to do
        jobAvailable[threadIndex].WaitOne( );

        // exit on null body
        if ( loopBody == null )
            break;

        while ( true )
        {
            // get local index incrementing global loop's current index
            localIndex = Interlocked.Increment( ref currentIndex );

            if ( localIndex >= stopIndex )
                break;

            // run loop's body
            loopBody( localIndex );
        }

        // signal about thread availability
        threadIdle[threadIndex].Set( );
    }
}

So, as we can see from the above code, all the worker threads just sit doing nothing, and wait until there is something to do, which is signaled by the jobAvailable events. Once the events are received, the threads start their work - safely receive the loop index they need to work on using an interlocked increment, and then just execute the loop's body with the required index, which is done until the entire loop is calculated.

The entire implementation is very simple, and may be done using any version of the .NET framework, which is what we wanted from the beginning. Now, it is time to test it and see its performance compared to Microsoft's solution.

Performance tests

How are we going to test the performance? Well, we'll use a simple technique and just run our routines many times in a loop, checking how much time we spend for it. The test code looks something like this:

// run specified number of tests
for ( int test = 0; test < tests; test++ )
{
    // test 1
    DateTime start = DateTime.Now;

    for ( int run = 0; run < runs; run++ )
    {
        MatrixMultiplication( a, b, c1 );
    }

    DateTime end = DateTime.Now;
    TimeSpan span = end - start;

    Console.Write( span.TotalMilliseconds.ToString( "F3" ) + "\t | " );
    test1time += span.TotalMilliseconds;
    
    // other tests
    ...

Note that we are going to run several iterations of our tests, so in the end, we'll also get the average performance:

// provide average performance
test1time /= tests;
test2time /= tests;
test3time /= tests;

Console.WriteLine( "------------------- AVG -------------------" );
Console.WriteLine(  test1time.ToString( "F3" ) + "\t | " +
                    test2time.ToString( "F3" ) + "\t | " +
                    test3time.ToString( "F3" ) + "\t | " );

So, let's run it and see the results of our tests (the tests below were done on an Intel Core 2 Duo CPU - 2.2 GHz):

Matrix size: 50, runs: 200
Starting test with 2 threads

Clear C# | AForge   | MS       |
156.250  | 109.37   | 218.750  |  
171.875  | 93.750   | 125.000  |  
156.250  | 109.375  | 109.375  |  
171.875  | 93.750   | 125.000  |  
156.250  | 93.750   | 125.000  |  
------------------- AVG --------
162.500  | 100.000  | 140.625  |

Matrix size: 100, runs: 100
Starting test with 2 threads

Clear C# | AForge    | MS        |
687.500  | 390.625   | 515.625   |  
718.750  | 390.625   | 406.250   |  
703.125  | 390.625   | 406.250   |  
687.500  | 390.625   | 406.250   |  
734.375  | 390.625   | 406.250   |  
------------------- AVG ----------
706.250  | 390.625   | 428.125   |

Matrix size: 250, runs: 40
Starting test with 2 threads

Clear C#     | AForge    | MS        |
4453.125     | 2484.375  | 2593.750  |  
4609.375     | 2500.000  | 2500.000  |  
4515.625     | 2484.375  | 2500.000  |  
4546.875     | 2484.375  | 2500.000  |  
4671.875     | 2500.000  | 2500.000  |  
------------------- AVG --------------
4559.375     | 2490.625  | 2518.750  |

Matrix size: 1000, runs: 10
Starting test with 2 threads

Clear C#     | AForge       | MS         |
133078.125   | 72406.250    | 72531.250  |  
134875.000   | 72718.750    | 72406.250  |  
135296.875   | 72578.125    | 72375.000  |  
135484.375   | 72531.250    | 75062.500  |  
136500.000   | 72515.625    | 72343.750  |  
------------------- AVG -------------------
135046.875   | 72550.000    | 72943.750  |

From the provided results, we can see that our implementation does not look to be worse than Microsoft’s solution. Yes, we don’t have all the features and flexibility, but we’ve met our requirements to support early .NET versions and Mono too. From the above results, it looks like our implementation even performs a bit better.

Analyzing our results a bit further, we can see that Microsoft’s solution requires a bit more time on the very first run, which means that they perform a more complex initialization of worker threads, consuming more time for it.

Also, from our results, we can see that decreasing matrix size results in the fact that performance increase becomes not so evident in the case where work amount is too small for parallelism. But, we'll discuss this in the next section.

Do profiling before and after optimization

It is a common and good practice to do some sort of profiling and performance test before you start optimizing some code and right after. Performance tests will show how much you got from optimization, and if you got anything at all. In many cases, you may believe that you are optimizing your code, making it run faster, but in actuality, you may get a performance decrease. Parallelism is not a panacea, and in some cases, your paralleled code may perform slower. This may happen due to the fact that the amount of paralleled work is very small and the amount of time spent on threads synchronization is much greater. To demonstrate this effect, let's take a look at the result of paralleling the same matrix multiplication, but let's take small matrices this time:

Matrix size: 10, runs: 1000
Starting test with 2 threads

Clear C# | AForge    | MS        |
0.000    | 46.875    | 156.250   |  
15.625   | 15.625    | 46.875    |  
0.000    | 15.625    | 46.875    |  
0.000    | 15.625    | 46.875    |  
0.000    | 15.625    | 31.250    |  
------------------- AVG ----------
3.125    | 21.875    | 65.625    |

As we can see from the above result, it is much faster to multiply small matrices without any parallelism at all. Paralleling such computations just leads to the fact that all additional routines for threads synchronization take much more time than the actual useful work. So, measure performance before making decisions on which code to use. Taking into account the fact that the Parallel.For() syntax does not differ a lot from the regular for-loop statement, it should not be an issue to change a few lines of code to perform a different test.

Conclusion

So, as we can see, we've managed to make our own small, but easy to use Parallel.For() implementation, which performs quite well, and is not much worse than the parallel extension from Microsoft. Of course, Microsoft provides more flexibility and features, but many paralleling tasks may be solved with just paralleling for-loops, which was our aim, and we got it.

We did not discuss other paralleling tasks in the article, but just matrix multiplication, so more samples may be investigated to provide clearer results. Personally, I am already using this AForge.Parallel.For() implementation in another project, which deals with image processing and other stuff. Image processing, being one of the areas which may use quite time consuming computations, is one of the many areas that can utilize parallel computations successfully. And preliminary tests of that project already showed the boost from parallelism.

Points of interest

Some tuning of the existing implementation may be done by investigating the optimal amount of background threads to use for parallelism. The current implementation, by default, creates the amount of threads which is equal to the number of cores in the system. However, this may be changed by using the AForge.Parallel.ThreadsCount property, which allows the user to specify exactly how many background threads to create. For example, if we take a look at Microsoft's solution, we can find that they create more than two threads on a dual core system. Inspecting the Task Manager on my system, I found that around 10 additional threads were created when their Parallel.For() was invoked.

As another direction of increasing performance of parallel computational tasks, the utilization of GPUs may be considered. For example, NVidia provides the CUDA library, which may be used to utilize their GPUs for general purpose computing. Taking a look at the CUDA website, we can find that it was successfully applied to many different applications.

Credits

I would like to thank Israel Lot for his valuable comments on the AForge.Parallel implementation and his contribution.

AForge.NET

Although the code and demo application are provided with the article, the AForge.Parallel is going to become one of the new features of the AForge.NET framework, and is going to be released in the 2.0 version, where the class will be utilized for paralleling other classes of the framework.