Click here to Skip to main content
Click here to Skip to main content
Go to top

Tricky Programming on Multi-Core Processors

, 29 Feb 2008
Rate this:
Please Sign up or sign in to vote.
An article on experiments with multithreading on a multi-core processor
parallelMarkMain.jpg

Introduction

Nowadays, multi-core processing is a growing industry trend as single core processors rapidly reach the physical limits of possible complexity and speed. It's obvious that if such wonderful hardware exists, it should be supported in .NET, shouldn't it? To check this I looked for a well-known easy-parallelizable algorithm. The benchmark should have handled different .NET parallelizing techniques with memory and mathematical operations. In my opinion, the best algorithms that fit these requirements are QuickSort and Matrix multiplication.

Please take into account, that this article doesn't describe any algorithms' optimization techniques. Instead it shows, how to parallelize your C# code and compares the performance of different approaches.

The Idea

Quicksort can be easily parallelized due to its divide-and-conquer nature. If we have p processors, we can divide a list of n elements into p sublists in Θ(n) average time, then sort each of these in average time. Ignoring the Θ(n) preprocessing, this a is linear speedup. Given n processors, only Θ(n) time is required overall. One advantage of parallel quicksort over other parallel sort algorithms is that no synchronization is required. A new thread could be started as soon as a sublist is available for it to work on and it does not communicate with other threads. When all threads are complete, the sort is done.

What about matrices? Well, it's quite the same — easy to paralellize (lots of ways to do it) and no synchronization required. Also, it's much easier to divide a multiplication process into two equal sub-processes (what is not done in Quicksort). In general, an algorithms complexity is . The speedup depends on an approach that programmer chooses. In case of proposed tests the speedup was up to 50%.

Background

If you are not familiar with Quicksort or Matrix multiplication, please check the reference section. You should also know a little about multithreading. Some of the tests use Parallel FX library. It is free to download and is included in source files.

Inside the code::Quicksort

The QuickSort<T> where T : IComparable class contains all tests and three basic methods that implement Quicksort. The standard Quicksort function looks like this:

private void QSort(int left, int right)
{
   if (right > left)
   {
      int pivotIndex = GetPivotIndex(left, right); // median-of-three approach
      int pivotNewIndex = Partition(left, right, pivotIndex); 
      QSort(left, pivotNewIndex - 1);
      QSort(pivotNewIndex + 1, right);
   }
}

The first test uses managed threads to parallelize quicksort. It's advantages are:

  • No extra-libraries are needed.
  • Best performance results in benchmark
public void SortParallelThreads(T[] array)
{
   this.array = array;
   int pivotIndex = GetPivotIndexParallel(array); // get pivot index after partitioning
  
   Thread sort1 = new Thread(new ThreadStart(delegate()
   {
      QSort(0, pivotIndex); // sort the first part of the array
      ((AutoResetEvent)(waitHandles[0])).Set(); // signal, that sorting is completed
   }));

   Thread sort2 = new Thread(new ThreadStart(delegate()
   {
      QSort(pivotIndex + 1, array.Length - 1); // sort the second part of the array
      ((AutoResetEvent)(waitHandles[1])).Set();
   }));

   sort1.Start(); // start threads one by one
   sort2.Start(); 
   WaitHandle.WaitAll(waitHandles); // wait until both sorting threads finish their work
}

This test uses Microsoft Parallel Extensions to .NET Framework 3.5. The advantages of using Parallel FX are:

  • No need in thread synchronization (meaning main thread and both sorting threads)
  • The smallest code
public void SortParallelFX(T[] array)
{
   this.array = array;
   int pivotIndex = GetPivotIndexParallel(array); 

   Parallel.Do(delegate()
   {
      QSort(0, pivotIndex);
   },
   delegate()
   {
      QSort(pivotIndex + 1, array.Length - 1); 
   });
}

The last test uses the ThreadPool class to queue and run sorting threads.

public void SortParallelThreadPool(T[] array)
{
   this.array = array;
   int pivotIndex = GetPivotIndexParallel(array);

   ThreadPool.QueueUserWorkItem(new WaitCallback(delegate(object x)
   {
      QSort(0, pivotIndex);
      ((AutoResetEvent)(waitHandles[0])).Set();
   }));

   ThreadPool.QueueUserWorkItem(new WaitCallback(delegate(object x)
   {
      QSort(pivotIndex + 1, array.Length - 1);
      ((AutoResetEvent)(waitHandles[1])).Set();
   }));

   WaitHandle.WaitAll(waitHandles);
}

Inside the code::Matrix multiplication

The Matrix class is a wrapper for double[,] array. Its Multiply(..) method is shared by all tests except MultiplyFXFor() (see below).

The MultiplyThreads(), MultiplyThreadPool(), MultiplyFXDo() methods share the same concept: divide a multiplication operation in two sub-operations and process them side by side (using both processor's cores). The code for these methods is slightly different from the code of Quicksort methods. (Multiply(..) is called instead of QSort(..)).

The m1 and m2 parameters are matrix multipliers, the resMatrix is the product. beg and end parameters are used to parallelize the process.

For example: linear sort calls Multiply(.., 0, m1.rows), while most of tests use Multiply(.., 0, m1.rows/2) and Multiply(.., m1.rows/2 + 1, m1.rows).

private static void Multiply(Matrix m1, Matrix m2, Matrix resMatrix, int beg, int end)
{
   for (int i = beg; i < end; i++)
      for (int j = 0; j < m2.columns; j++)
      {
         double tValue = 0.0;
         for (int k = 0; k <= m1.columns - 1; k++)
         {
            tValue += m1[i, k] * m2[k, j];
         }
         tMatrix[i, j] = tValue;
      }
}

MultiplyFXFor() has no analogues in sorting tests. It takes the advantage of ParallelFX library in loops parallelizing.

Parallel.For(0, m1.rows,delegate (int i) // Parallel.For is used
{
   for (int j = 0; j < m2.columns; j++)
   {
      double tValue = 0.0;
      for (int k = 0; k <= m1.columns - 1; k++)
      {
         tValue += m1[i, k] * m2[k, j];
      }
      tMatrix[i, j] = tValue;
   }
});

The Results

Each quicksort method passed 10 tests with random integer arrays of 10^7 elements. Quicksort terrifically depends on an array structure and on pivot elements. This explains results diversity and answers the question "why parallel tests are not twice as fast as the linear one?" As I expected, the "pure .NET" sort should beat all tests. Very nice results for Array.Sort

Test 1 2 3 4 5 6 7 8 9 10 Average
Linear 6.05 5.93 5.91 5.87 5.88 5.94 5.82 5.79 5.77 5.82 5.88
Parallel FX 3.85 5.76 3.85 3.73 4.1 5.91 4.59 3.57 4.24 5.04 4.47
Threads 3.65 5.73 3.68 3.87 4.2 5.84 4.4 3.53 4.15 4.91 4.39
ThreadPool 3.82 5.83 3.85 4.06 4.2 5.96 4.65 3.71 4.34 5.23 4.56
Array.Sort(..) 1.7 1.7 1.7 1.68 1.69 1.7 1.7 1.7 1.68 1.69 1.7

Each matrix multiplication method passed 5 tests with random 1000x1000 matrices. Because each parallel operation works with the same amount of data, the results do not diverse much, and the speedup is close to 50%.

Test 1 2 3 4 5 Average
Linear 14.02 14.15 14.15 14.05 14.17 14.12
Threads 7.35 7.33 7.38 7.36 7.38 7.36
Paralle.Do 7.6 7.41 7.38 7.38 7.4 7.43
Parallel.For 8.38 8.38 8.36 8.41 8.39 8.38
ThreadPool 7.41 7.38 7.4 7.38 7.4 7.39

Conclusion

The Task Parallel Library (TPL) is designed to make it much easier to write managed code that can automatically use multiple processors. Using the library, you can conveniently express potential parallelism in existing sequential code, where the exposed parallel tasks will be run concurrently on all available processors.

The same could be done with "traditional" approaches like Managed threads or ThreadPool. But ParallelFX library has one significant advantage: it uses all available processor to parallelize the operation. Programmers should not worry about the end-users CPU type, it could be single-core, multi-core, or event multiprocessor system. Although Parallel.For is slower than Parallel.Do (speaking about multiplication tests) on a processor with two cores, it should work much faster on a processor with three or four cores, and no code changes are needed!
At last, this is CPU's usage during the parallel multiplication tests. 100% usage of both cores.

Reference

What is multi-core processor?
Quicksort - Wikipedia, the free encyclopedia
Matrix multiplication - Wikipedia, the free encyclopedia
Optimize Managed Code For Multi-Core Machines
Download Microsoft Parallel Extensions to .NET Framework 3.5

History

  • 24.02.2008 - Release
  • Smallest code

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Share

About the Author

Zaur Nasibov
Software Developer
Finland Finland
I'm a Master degree student, studying at the University of Joensuu, Finland.

Comments and Discussions

 
GeneralMy vote of 5 Pinmvpthatraja21-Jan-11 17:12 
Generalinteresting - have 5 PinmemberPranay Rana3-Jan-11 18:39 
GeneralMy vote of 1 PinmemberAndreas04920-Feb-10 6:38 
GeneralExcellent PinmemberRugbyLeague2-Oct-08 2:40 
GeneralRe: Excellent PinmemberZaur Nasibov2-Oct-08 3:22 
GeneralGreat Introduction, Thanks! PinmemberNiteBeast22-May-08 16:42 
GeneralVery Interesting Pinmembermerlin9815-Mar-08 3:50 
GeneralRe: Very Interesting PinmemberZaur Nasibov5-Mar-08 9:34 
GeneralVery interesting. PinmemberMuaddubby4-Mar-08 2:22 
GeneralRe: Very interesting. PinmemberZaur Nasibov4-Mar-08 4:09 
GeneralRe: Very interesting. PinmemberMuaddubby4-Mar-08 4:33 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.

| Advertise | Privacy | Mobile
Web01 | 2.8.140926.1 | Last Updated 29 Feb 2008
Article Copyright 2008 by Zaur Nasibov
Everything else Copyright © CodeProject, 1999-2014
Terms of Service
Layout: fixed | fluid