Faster Sorting through Appropriate Threading on Multi-processor Machines

Jason Rudland

4.20/5 (8 votes)

Mar 5, 2009

Apache

4 min read

33997

386

Splitting a generics List of KeyValuePairs over the number of processors and then merging the result lists over half the processors and then half again, until only one list, the final list, is left.

Download source - 128.89 KB

Introduction

The goal was to achieve improved sorting performance by spawning a limited number of threads to use as much of the available processing power on multi-core and/or multi-processor machines as possible. The result is an improvement of 40% on dual core and more than 100% on a 4 processor machine.

Background

As part of a team involved in importing RDBMSs into an Associative Knowledge base, I recently inherited some code that hashed through a large amount of relational data, aggregating it before pushing it onto the server.

This article though, really has nothing to do with that. I was just looking for an excuse to generate heat. In fact I was quite embarrassed that the code was consuming a full 25% of the available processing on the quad core server, so I decided to do something about it.

After 2 months of cursing, I now have a new found respect for threaded architecture. There are pitfalls galore, even if you read all the documents. Nuances, such as code that runs on one machine and not on another. A lock that just doesn't seem to lock, and locks that make your code run like molasses.

The number one lesson I've learned is not to spawn threads unless there is a clear benefit to doing so. Even cheap threads are expensive when you have thousands of them around.

The number two lesson is to minimize contention or to avoid locking if you can.

Okay, so applying these lessons to sorting gives the following:

Split your big unsorted list by the number of available processors.
Spawn one thread per split that sorts them using your favourite sorting routine.
Merge pairs of result lists using Mergesort until you have no more results to merge.

(You will find this logic applied in the ParallelSorter method of Jigger.Parallel<T>).

Perfectionist Disclaimer

I do NOT claim that the source code provided has much merit as far as programming style goes. The sorting logic was extracted from a working class and placed in a test and benchmarking solution for the sole purpose of proving that it was worth doing. Memory consumption was taken into consideration but not to the detriment of performance.

Suggestions

However, having said that, if you have anything (con/de)structive to say, I'm all ears.

Source and Contributing

The zip contains the source, unit test and benchmarking files. I hope you have Visual Studio 2008, otherwise you will have to do some copy-pasting.

You will also find a set of results from my tests on my Dual Core notebook and on a Dual core, Dual processor server. I would be interested to hear from those of you that have other configurations. To add to the list, just install Ntime.

Using the Code

You will find many examples in the Parallel.Sort.Test.ParallelTester class. I have extracted what's pertinent and put it below:

//Initialise the collection
List<KeyValuePair<int, string>> stringDupesList= 
     new List<KeyValuePair<int, string>>(); 

// Load up a List of KeyValuePairs
ManyStringAdder(ref stringDupesList, 20);

//Initialise the Parallel sorter
Parallel<KeyValuePair<int, string>> parallelDo = 
     new Parallel<KeyValuePair<int, string>>(new KeyValueComparer<int, string>()); 

// And now you can sort using one of the 3 supplied 
// Sort using MergeSort. This one is a stable sort
stringDupesList = parallelDo.MergeSort(stringDupesList));

// This is the sort provided by Microsoft. 
// I've assumed it to be a derivative of QuickSort
stringDupesList = parallelDo.MSQuickSort(stringDupesList));

// This is my attempt at using the Logic from FastQuickSort with MergeSort. Not so good.
stringDupesList = parallelDo.MergeInsertionSort(stringDupesList));

Points of Interest

Building a solution that's worthy of being published as an article requires quite an effort. I had already benchmarked my sort with a console application, but when you need to show your results to others you have to be more formal and certain. For unit testing, I had already settled on Visual Studio 2008 but for micro benchmarking I had to do some research.

Micro-benchmarking

After a few days of rummaging around the interweb, wondering whether I should translate my own JavaScript profiler and checking to see if profiling tools could benchmark to the degree I require, I ended up with three 3^rd party candidates.

Nperf is a neglected but cool tool that provides graphs. Maybe part of Gallio ?
Ntime provides an nUnit interface, is feature full and was easy to implement
Microbenchmarking by Jon Skeet is C# source that you compile in your class

I went with NTime because I had little time to figure out the other candidates. Jon Skeet's tool provides a subset of the functionality in NTime and NPerf was temperamental to say the least. I still think I should translate my JavaScript profiler because it outputs to XML, ready to graph. But that's for another day.

Testing

The magic here is in what to test. I knew that my sort was running in production but I had never tested some obvious combinations of data types, values and thread count because my application never went near them. But, if you want other people to run your software, you are going to have to think about more uses for your code and test it. In the end, I spent half a day on the sort class and more than a week writing the tests and benchmarking in the hopes that I'll be able to re-use some of it in the future.

Thanks

In no particular order to:

NTime (by AdamSlosarski) for saving my time
Nick Taylor for being a sounding board
Kevin Fleming for inspiring usage of sound practices that I sometimes use
My wife for going with the flow
Ron Everett for not saying no when I said I wanted to write this article
Everyone else that I know

History

2009-03-05
- First release of Jigger.Parallel<T>.Sort