Parallel QuickSort with U++ CoWork
Introduction
QuickSort, like other divide-and-conquer algorithms, is quite amenable to parallelization. In this article, we provide such a version using the CoWork parallelization class of the U++ framework.
Serial Version of the Algorithm
As the starting point, we shall use the standard Sort function template from the U++ Core library:
```cpp
// highly optimized selection sort for small number of elements
template <class I, class Less>
void ForwardSort(I begin, I end, const Less& less);

template <class I, class Less>
force_inline
void OrderIter2__(I a, I b, const Less& less)
{
    if(less(*b, *a))
        IterSwap(a, b);
}

dword Random(dword n); // High quality PRNG

template <class I, class Less>
void Sort(I l, I h, const Less& less)
{
    for(;;) {
        int count = int(h - l);
        if(count < 2)
            return;
        if(count < 8) { // Final optimized SelectSort
            ForwardSort(l, h, less);
            return;
        }
        int pass = 4;
        for(;;) {
            I middle = l + (count >> 1); // get the middle element
            OrderIter2__(l, middle, less); // sort l, middle, h-1 to find median of 3
            OrderIter2__(middle, h - 1, less);
            OrderIter2__(l, middle, less); // median is now in middle
            IterSwap(l + 1, middle); // move median pivot to l + 1
            I ii = l + 1;
            for(I i = l + 2; i != h - 1; ++i) // do partitioning; already l <= pivot <= h - 1
                if(less(*i, *(l + 1)))
                    IterSwap(++ii, i);
            IterSwap(ii, l + 1); // put pivot back in between partitions
            I iih = ii;
            while(iih + 1 != h && !less(*ii, *(iih + 1))) // find middle range of elements equal to pivot
                ++iih;
            if(pass > 5 || min(ii - l, h - iih) >
                           (max(ii - l, h - iih) >> pass)) { // partition sizes ok or we have done max attempts
                if(ii - l < h - iih - 1) { // recurse on smaller partition, tail on larger
                    Sort(l, ii, less);
                    l = iih + 1;
                }
                else {
                    Sort(iih + 1, h, less);
                    h = ii;
                }
                break;
            }
            IterSwap(l, l + (int)Random(count)); // try some other random elements for median pivot
            IterSwap(middle, l + (int)Random(count));
            IterSwap(h - 1, l + (int)Random(count));
            pass++;
        }
    }
}
```
This is a fairly standard implementation. For small final partitions (fewer than 8 elements), an optimized selection sort is used.
The pivot is chosen as the median of the first, middle, and last elements, and the algorithm detects degenerate partitionings, in which case it makes up to 2 more attempts to find a better pivot, taking the median of random elements.
The algorithm uses "fat" partitioning - the central run of elements equal to the pivot is excluded from both partitions.
Finally, the algorithm recurses on the smaller partition and continues by tail execution on the larger one.
Now, the feature of QuickSort that makes it particularly easy to parallelize is the fact that once partitioned, there is no data sharing between partitions. Practically, all we need to do is replace the recursion with parallel execution.
The only problem is how to reasonably manage threads. Thankfully, U++ provides a useful parallelization class, CoWork.
Meet CoWork
CoWork can be thought of as an interface to the global thread pool. It has two fundamental operations:

- Do - schedules some code to be executed. Note that Do can also be invoked using operator&.
- Finish - waits for all the code scheduled by Do to finish. Finish is also invoked from the CoWork destructor, so it is usually omitted from the code.

With respect to memory visibility, all changes made before Do are visible when the scheduled code is performed, and all changes made in the scheduled code are visible after Finish.
How the code is executed by CoWork is an implementation detail; in any case, you can expect CoWork to try as hard as possible to use the global thread pool to run scheduled code in parallel.
Parallel QuickSort
Now that we have the algorithm and the tool to make it parallel, let us combine them:
```cpp
template <class I, class Less>
void CoSort(CoWork& cw, I l, I h, const Less& less)
{
    const int PARALLEL_THRESHOLD = 80;
    for(;;) {
        int count = int(h - l);
        if(count < 2)
            return;
        if(count < 8) { // Final optimized SelectSort
            ForwardSort(l, h, less);
            return;
        }
        int pass = 4;
        for(;;) {
            I middle = l + (count >> 1); // get the middle element
            OrderIter2__(l, middle, less); // sort l, middle, h-1 to find median of 3
            OrderIter2__(middle, h - 1, less);
            OrderIter2__(l, middle, less); // median is now in middle
            IterSwap(l + 1, middle); // move median pivot to l + 1
            I ii = l + 1;
            for(I i = l + 2; i != h - 1; ++i) // do partitioning; already l <= pivot <= h - 1
                if(less(*i, *(l + 1)))
                    IterSwap(++ii, i);
            IterSwap(ii, l + 1); // put pivot back in between partitions
            I iih = ii;
            while(iih + 1 != h && !less(*ii, *(iih + 1))) // find middle range of elements equal to pivot
                ++iih;
            if(pass > 5 || min(ii - l, h - iih) >
                           (max(ii - l, h - iih) >> pass)) { // partition sizes ok or we have done max attempts
                if(ii - l < h - iih - 1) { // schedule or recurse on smaller partition, tail on larger
                    if(ii - l < PARALLEL_THRESHOLD) // too small to run in parallel?
                        Sort(l, ii, less); // resolve in this thread
                    else
                        cw & [=, &cw] { CoSort(cw, l, ii, less); }; // schedule for parallel execution
                    l = iih + 1;
                }
                else {
                    if(h - iih - 1 < PARALLEL_THRESHOLD) // too small to run in parallel?
                        Sort(iih + 1, h, less); // resolve in this thread
                    else
                        cw & [=, &cw] { CoSort(cw, iih + 1, h, less); }; // schedule for parallel execution
                    h = ii;
                }
                break;
            }
            IterSwap(l, l + (int)Random(count)); // try some other random elements for median pivot
            IterSwap(middle, l + (int)Random(count));
            IterSwap(h - 1, l + (int)Random(count));
            pass++;
        }
    }
}

template <class I, class Less>
void CoSort(I l, I h, const Less& less)
{
    CoWork cw;
    CoSort(cw, l, h, less);
    // CoWork destructor waits until all scheduled work is finished
}
```
Just as promised, about the only change to the QuickSort code is the replacement of recursion with CoWork::Do scheduling (in operator& form).
The PARALLEL_THRESHOLD constant prevents parallel execution for small partitions - scheduling them in parallel has diminishing returns, and at some point false cacheline sharing and CoWork management costs outweigh the benefit of using more CPU cores. The value 80 used here is a combination of gut feeling and some experimentation and is not really critical; experiments with sorting Strings showed that performance generally gets worse when this value is above 200 or below 20.
Benchmarks
We have benchmarked CoSort against the original serial version, sorting an Upp::Vector of N randomly generated short Upp::Strings. For reference, we have also benchmarked the standard library std::sort on a std::vector<std::string> with the same data set.
Benchmarks were performed on the following platforms, using svn revision 9390 of the U++ framework. The benchmarking package was compiled with 'fast' optimization.
Intel® Pentium® CPU G620 @ 2.60GHz (2 cores, 2 threads), 4GB RAM
Ubuntu Linux 64-bit
gcc (Ubuntu 5.2.1-22ubuntu2) 5.2.1 20151010
N | Sort | CoSort | Sort / CoSort | std::sort |
---|---|---|---|---|
10³ | 63.22 us | 53.12 us | 1.19x | 241.82 us |
10⁴ | 968.93 us | 603.93 us | 1.60x | 3180 us |
10⁵ | 11.50 ms | 6.58 ms | 1.75x | 38.32 ms |
10⁶ | 137.20 ms | 77.00 ms | 1.78x | 427.70 ms |
10⁷ | 1.63 s | 0.89 s | 1.83x | 4.85 s |
Intel® Core™ i7-4771 @ 3.5GHz (4 cores, 8 threads), 16GB RAM
Windows 10 64-bit
Visual C++ 2014 64-bit mode
N | Sort | CoSort | Sort / CoSort | std::sort |
---|---|---|---|---|
10³ | 47.87 us | 44.27 us | 1.08x | 193.37 us |
10⁴ | 642.97 us | 301.97 us | 2.13x | 2490 us |
10⁵ | 7.96 ms | 2.62 ms | 3.04x | 31.32 ms |
10⁶ | 95.90 ms | 28.50 ms | 3.36x | 381.60 ms |
10⁷ | 1.16 s | 0.30 s | 3.87x | 4.20 s |
10⁸ | 13.63 s | 3.69 s | 3.69x | 49.08 s |
Broadcom BCM2836 ARM Cortex-A7 900 MHz, 4 cores, 4 threads, 1GB SDRAM (Raspberry Pi 2)
Ubuntu Linux
gcc (Ubuntu 5.2.1-22ubuntu2) 5.2.1 20151010, 32-bit
N | Sort | CoSort | Sort / CoSort | std::sort |
---|---|---|---|---|
10³ | 305.87 us | 291.77 us | 1.04x | 1470 us |
10⁴ | 4.58 ms | 2.33 ms | 1.96x | 20.00 ms |
10⁵ | 69.05 ms | 33.64 ms | 2.05x | 250.48 ms |
10⁶ | 1.02 s | 0.49 s | 2.08x | 3.13 s |
(The Pentium G620 system does not have enough memory to run the 10⁸ test; the ARM system does not have enough memory for tests beyond 10⁶.)
Conclusion
Benchmark results indicate that for large enough datasets, the performance ratio between the serial and parallel versions of the algorithm approaches the number of physical CPU cores on Intel CPUs. On the other hand, parallelization does not seem to make much sense for small datasets, where factors like false cache-line sharing and job/thread management costs in CoWork prevent any noticeable speedup and ultimately lead to a slowdown. In a production version of CoSort, it would probably be advisable to revert to purely serial code for fewer than 500 elements.
The ARM results are somewhat disappointing; the plausible explanation is a memory bandwidth bottleneck - unlike the Intel i7 system, which has about 20GB/s of bandwidth, the RAM bandwidth of the Raspberry Pi is only about 200MB/s.
History
- January 18th, 2016: ARM CPU benchmark results added