
# Parallel QuickSort with U++ CoWork

10 Jan 2016 · BSD · 5 min read
QuickSort, like other divide-and-conquer algorithms, is quite amenable to parallelization. We provide such a version using the CoWork parallelization class of the U++ framework.

## Introduction

QuickSort, like other divide-and-conquer algorithms, is quite amenable to parallelization. In this article, we shall provide such a version using the CoWork parallelization class of the U++ framework.

## Serial Version of the Algorithm

As the starting point, we shall use the standard `Sort` function template from the U++ Core library:

C++
```cpp
// highly optimized selection sort for small number of elements
template <class I, class Less>
void ForwardSort(I begin, I end, const Less& less);

template <class I, class Less>
force_inline
void OrderIter2__(I a, I b, const Less& less)
{
    if(less(*b, *a))
        IterSwap(a, b);
}

dword  Random(dword n); // High quality PRNG

template <class I, class Less>
void Sort(I l, I h, const Less& less)
{
    for(;;) {
        int count = int(h - l);
        if(count < 2)
            return;
        if(count < 8) {                         // Final optimized SelectSort
            ForwardSort(l, h, less);
            return;
        }
        int pass = 4;
        for(;;) {
            I middle = l + (count >> 1);        // get the middle element
            OrderIter2__(l, middle, less);      // sort l, middle, h-1 to find median of 3
            OrderIter2__(middle, h - 1, less);
            OrderIter2__(l, middle, less);      // median is now in middle
            IterSwap(l + 1, middle);            // move median pivot to l + 1
            I ii = l + 1;
            for(I i = l + 2; i != h - 1; ++i)   // do partitioning; already l <= pivot <= h - 1
                if(less(*i, *(l + 1)))
                    IterSwap(++ii, i);
            IterSwap(ii, l + 1);                // put pivot back in between partitions
            I iih = ii;
            while(iih + 1 != h && !less(*ii, *(iih + 1))) // Find middle range of elements equal to pivot
                ++iih;
            if(pass > 5 || min(ii - l, h - iih) >
                           (max(ii - l, h - iih) >> pass)) { // partition sizes ok or we have done max attempts
                if(ii - l < h - iih - 1) {      // recurse on smaller partition, tail on larger
                    Sort(l, ii, less);
                    l = iih + 1;
                }
                else {
                    Sort(iih + 1, h, less);
                    h = ii;
                }
                break;
            }
            IterSwap(l, l + (int)Random(count));     // try some other random elements for median pivot
            IterSwap(middle, l + (int)Random(count));
            IterSwap(h - 1, l + (int)Random(count));
            pass++;
        }
    }
}
```

This is a fairly standard implementation. For small final partitions (fewer than 8 elements), an optimized selection sort is used.

The pivot is chosen as the median of the first, middle, and last elements, and the algorithm detects degenerate cases, in which it makes up to 2 more attempts to find a better pivot, using random elements to compute the median.

The algorithm uses "fat" partitioning: the central run of elements equal to the pivot is excluded from both partitions.

Finally, the algorithm recurses on the smaller partition and continues with tail iteration on the larger one.

Now, the feature of `QuickSort` that makes it particularly easy to parallelize is the fact that once the range is partitioned, there is no data sharing between partitions. Practically, all we need to do is replace recursion with parallel execution.

The only problem here is how to reasonably manage threads. Thankfully, U++ provides a useful parallelization class `CoWork`.

## Meet the CoWork

`CoWork` can be thought of as an interface to the global thread pool. It has two fundamental operations:

#### Do

Schedules some code to be executed. Note that `Do` can also be invoked using `operator &`.

#### Finish

Waits for all the code scheduled by `Do` to finish. `Finish` is also invoked from the `CoWork` destructor, so it is usually omitted from the code.

With respect to memory visibility, all changes made before `Do` are visible when the scheduled code is performed and all changes made in the scheduled code are visible after `Finish`.

How exactly the scheduled code is executed is an implementation detail of `CoWork`; in any case, you can expect `CoWork` to try as hard as possible to use the global thread pool to run scheduled code in parallel.

## Parallel QuickSort

Now we have the algorithm, and we have the tool to make it parallel, let us combine them:

C++
```cpp
template <class I, class Less>
void CoSort(CoWork& cw, I l, I h, const Less& less)
{
    const int PARALLEL_THRESHOLD = 80;

    for(;;) {
        int count = int(h - l);
        if(count < 2)
            return;
        if(count < 8) {                         // Final optimized SelectSort
            ForwardSort(l, h, less);
            return;
        }
        int pass = 4;
        for(;;) {
            I middle = l + (count >> 1);        // get the middle element
            OrderIter2__(l, middle, less);      // sort l, middle, h-1 to find median of 3
            OrderIter2__(middle, h - 1, less);
            OrderIter2__(l, middle, less);      // median is now in middle
            IterSwap(l + 1, middle);            // move median pivot to l + 1
            I ii = l + 1;
            for(I i = l + 2; i != h - 1; ++i)   // do partitioning; already l <= pivot <= h - 1
                if(less(*i, *(l + 1)))
                    IterSwap(++ii, i);
            IterSwap(ii, l + 1);                // put pivot back in between partitions
            I iih = ii;
            while(iih + 1 != h && !less(*ii, *(iih + 1))) // Find middle range of elements equal to pivot
                ++iih;
            if(pass > 5 || min(ii - l, h - iih) >
                           (max(ii - l, h - iih) >> pass)) { // partition sizes ok or we have done max attempts
                if(ii - l < h - iih - 1) {      // schedule or recurse on smaller partition, tail on larger
                    if(ii - l < PARALLEL_THRESHOLD)  // too small to run in parallel?
                        Sort(l, ii, less);           // resolve in this thread
                    else
                        cw & [=, &cw] { CoSort(cw, l, ii, less); }; // schedule for parallel execution
                    l = iih + 1;
                }
                else {
                    if(h - iih - 1 < PARALLEL_THRESHOLD) // too small to run in parallel?
                        Sort(iih + 1, h, less);          // resolve in this thread
                    else
                        cw & [=, &cw] { CoSort(cw, iih + 1, h, less); }; // schedule for parallel execution
                    h = ii;
                }
                break;
            }
            IterSwap(l, l + (int)Random(count));     // try some other random elements for median pivot
            IterSwap(middle, l + (int)Random(count));
            IterSwap(h - 1, l + (int)Random(count));
            pass++;
        }
    }
}

template <class I, class Less>
void CoSort(I l, I h, const Less& less)
{
    CoWork cw;
    CoSort(cw, l, h, less);
    // CoWork destructor waits until all scheduled work is finished
}
```

Just as promised, about the only change to the `QuickSort` code is the replacement of recursion with `CoWork::Do` scheduling (in `operator &` form).

The `PARALLEL_THRESHOLD` constant prevents parallel execution for small partitions: scheduling them in parallel has diminishing returns, and at some point false cache-line sharing and `CoWork` management costs outweigh the benefits of using more CPU cores. The value `80` used here is a combination of gut feeling and some experimentation and is not really critical; experiments with sorting `String`s showed that performance generally worsens when this value is above `200` or below `20`.

## Benchmarks

We have benchmarked `CoSort` against the original serial version, sorting an `Upp::Vector` of `N` randomly generated short `Upp::String`s. For reference, we have also benchmarked the standard library `std::sort` on a `std::vector<std::string>` with the same data set.

Benchmarks were performed on three platforms, using svn revision 9390 of the U++ framework. The benchmarking package was compiled with 'fast' optimization.

Intel® Pentium(R) CPU G620 @ 2.60GHz (2 cores, 2 threads), 4GB RAM
Ubuntu Linux 64-bit
gcc (Ubuntu 5.2.1-22ubuntu2) 5.2.1 20151010

| N | Sort | CoSort | Sort / CoSort | std::sort |
|-----|-----------|-----------|-------|-----------|
| 10³ | 63.22 us | 53.12 us | 1.19x | 241.82 us |
| 10⁴ | 968.93 us | 603.93 us | 1.60x | 3180 us |
| 10⁵ | 11.50 ms | 6.58 ms | 1.75x | 38.32 ms |
| 10⁶ | 137.20 ms | 77.00 ms | 1.78x | 427.70 ms |
| 10⁷ | 1.63 s | 0.89 s | 1.83x | 4.85 s |

Intel® Core™ i7-4771 @ 3.5GHz (4 cores, 8 threads), 16GB RAM
Windows 10 64-bit
Visual C++ 2014 64-bit mode

| N | Sort | CoSort | Sort / CoSort | std::sort |
|-----|-----------|-----------|-------|-----------|
| 10³ | 47.87 us | 44.27 us | 1.08x | 193.37 us |
| 10⁴ | 642.97 us | 301.97 us | 2.13x | 2490 us |
| 10⁵ | 7.96 ms | 2.62 ms | 3.04x | 31.32 ms |
| 10⁶ | 95.90 ms | 28.50 ms | 3.36x | 381.60 ms |
| 10⁷ | 1.16 s | 0.30 s | 3.87x | 4.20 s |
| 10⁸ | 13.63 s | 3.69 s | 3.69x | 49.08 s |

Broadcom BCM2836 ARM Cortex-A7 900 MHz, 4 cores, 4 threads, 1GB SDRAM (Raspberry Pi 2)
Ubuntu Linux
gcc (Ubuntu 5.2.1-22ubuntu2) 5.2.1 20151010, 32-bit

| N | Sort | CoSort | Sort / CoSort | std::sort |
|-----|-----------|-----------|-------|-----------|
| 10³ | 305.87 us | 291.77 us | 1.04x | 1470 us |
| 10⁴ | 4.58 ms | 2.33 ms | 1.96x | 20.00 ms |
| 10⁵ | 69.05 ms | 33.64 ms | 2.05x | 250.48 ms |
| 10⁶ | 1.02 s | 0.49 s | 2.08x | 3.13 s |

(The Pentium G620 system does not have enough memory to run the 10⁸ test, and the ARM system does not have enough memory for tests beyond 10⁶.)

## Conclusion

Benchmark results indicate that for large enough datasets, the performance ratio between the serial and parallel versions of the algorithm approaches the number of physical CPU cores on Intel CPUs. On the other hand, parallelization does not seem to make much sense for small datasets, where factors like false cache-line sharing and job/thread management costs in `CoWork` prevent any noticeable speedup and ultimately lead to a slowdown. In a production version of `CoSort`, it would probably be advisable to revert to purely serial code for fewer than 500 elements.

The ARM results are somewhat disappointing; the plausible explanation is a memory bandwidth bottleneck - unlike the Intel i7 system, which has about 20GB/s of bandwidth, the RAM bandwidth of the Raspberry Pi is only about 200MB/s.

## History

• January 18th, 2016: ARM CPU benchmark results added