Parallel QuickSort with U++ CoWork
Introduction
QuickSort, like other divide-and-conquer algorithms, is quite amenable to parallelization. In this article, we provide such a version using the CoWork parallelization class of the U++ framework.
Serial Version of the Algorithm
As the starting point, we shall use the standard Sort function template from the U++ Core library:
```cpp
// highly optimized selection sort for small number of elements
template <class I, class Less>
void ForwardSort(I begin, I end, const Less& less);

template <class I, class Less>
force_inline
void OrderIter2__(I a, I b, const Less& less)
{
    if(less(*b, *a))
        IterSwap(a, b);
}

dword Random(dword n); // High quality PRNG

template <class I, class Less>
void Sort(I l, I h, const Less& less)
{
    for(;;) {
        int count = int(h - l);
        if(count < 2)
            return;
        if(count < 8) { // Final optimized SelectSort
            ForwardSort(l, h, less);
            return;
        }
        int pass = 4;
        for(;;) {
            I middle = l + (count >> 1); // get the middle element
            OrderIter2__(l, middle, less); // sort l, middle, h-1 to find median of 3
            OrderIter2__(middle, h - 1, less);
            OrderIter2__(l, middle, less); // median is now in middle
            IterSwap(l + 1, middle); // move median pivot to l + 1
            I ii = l + 1;
            for(I i = l + 2; i != h - 1; ++i) // do partitioning; already l <= pivot <= h - 1
                if(less(*i, *(l + 1)))
                    IterSwap(++ii, i);
            IterSwap(ii, l + 1); // put pivot back in between partitions
            I iih = ii;
            while(iih + 1 != h && !less(*ii, *(iih + 1))) // find middle range of elements equal to pivot
                ++iih;
            if(pass > 5 || min(ii - l, h - iih) >
                           (max(ii - l, h - iih) >> pass)) { // partition sizes ok or we have done max attempts
                if(ii - l < h - iih - 1) { // recurse on smaller partition, tail on larger
                    Sort(l, ii, less);
                    l = iih + 1;
                }
                else {
                    Sort(iih + 1, h, less);
                    h = ii;
                }
                break;
            }
            IterSwap(l, l + (int)Random(count)); // try some other random elements for median pivot
            IterSwap(middle, l + (int)Random(count));
            IterSwap(h - 1, l + (int)Random(count));
            pass++;
        }
    }
}
```
This is a fairly standard implementation. For small final partitions (fewer than 8 elements), an optimized selection sort is used.
The pivot is chosen as the median of the first, middle, and last elements, and the algorithm detects degenerate partitionings, in which case it makes up to 2 more attempts to find a better pivot, taking the median of random elements.
The algorithm uses "fat" partitioning - the central run of elements equal to the pivot is excluded from both partitions.
Finally, the algorithm recurses on the smaller partition and continues by tail execution on the larger one.
Now, the feature of QuickSort that makes it particularly easy to parallelize is the fact that once partitioned, there is no data sharing between partitions. Practically, all we need to do is replace the recursion with parallel execution.
The only problem is how to reasonably manage threads. Thankfully, U++ provides a useful parallelization class, CoWork.
Meet CoWork
CoWork can be thought of as an interface to the global thread pool. It has two fundamental operations:

- Do - schedules some code to be executed. Note that Do can also be invoked using operator&.
- Finish - waits for all the code scheduled by Do to finish. Finish is also invoked from the CoWork destructor, so it is usually omitted from the code.

With respect to memory visibility, all changes made before Do are visible when the scheduled code is performed, and all changes made in the scheduled code are visible after Finish.
How the code is executed by CoWork is an implementation detail; in any case, you can expect CoWork to try as hard as possible to use the global thread pool to run scheduled code in parallel.
Parallel QuickSort
Now that we have the algorithm and the tool to make it parallel, let us combine them:
```cpp
template <class I, class Less>
void CoSort(CoWork& cw, I l, I h, const Less& less)
{
    const int PARALLEL_THRESHOLD = 80;
    for(;;) {
        int count = int(h - l);
        if(count < 2)
            return;
        if(count < 8) { // Final optimized SelectSort
            ForwardSort(l, h, less);
            return;
        }
        int pass = 4;
        for(;;) {
            I middle = l + (count >> 1); // get the middle element
            OrderIter2__(l, middle, less); // sort l, middle, h-1 to find median of 3
            OrderIter2__(middle, h - 1, less);
            OrderIter2__(l, middle, less); // median is now in middle
            IterSwap(l + 1, middle); // move median pivot to l + 1
            I ii = l + 1;
            for(I i = l + 2; i != h - 1; ++i) // do partitioning; already l <= pivot <= h - 1
                if(less(*i, *(l + 1)))
                    IterSwap(++ii, i);
            IterSwap(ii, l + 1); // put pivot back in between partitions
            I iih = ii;
            while(iih + 1 != h && !less(*ii, *(iih + 1))) // find middle range of elements equal to pivot
                ++iih;
            if(pass > 5 || min(ii - l, h - iih) >
                           (max(ii - l, h - iih) >> pass)) { // partition sizes ok or we have done max attempts
                if(ii - l < h - iih - 1) { // schedule or recurse on smaller partition, tail on larger
                    if(ii - l < PARALLEL_THRESHOLD) // too small to run in parallel?
                        Sort(l, ii, less); // resolve in this thread
                    else
                        cw & [=, &cw] { CoSort(cw, l, ii, less); }; // schedule for parallel execution
                    l = iih + 1;
                }
                else {
                    if(h - iih - 1 < PARALLEL_THRESHOLD) // too small to run in parallel?
                        Sort(iih + 1, h, less); // resolve in this thread
                    else
                        cw & [=, &cw] { CoSort(cw, iih + 1, h, less); }; // schedule for parallel execution
                    h = ii;
                }
                break;
            }
            IterSwap(l, l + (int)Random(count)); // try some other random elements for median pivot
            IterSwap(middle, l + (int)Random(count));
            IterSwap(h - 1, l + (int)Random(count));
            pass++;
        }
    }
}

template <class I, class Less>
void CoSort(I l, I h, const Less& less)
{
    CoWork cw;
    CoSort(cw, l, h, less);
    // CoWork destructor waits until all scheduled work is finished
}
```
Just as promised, about the only change to the QuickSort code is the replacement of recursion with CoWork::Do scheduling (in operator& form).
The PARALLEL_THRESHOLD constant prevents parallel execution for small partitions - scheduling them in parallel has diminishing returns, and at some point false cacheline sharing and CoWork management costs outweigh the benefit of using more CPU cores. The value 80 used here is a combination of gut feeling and some experimentation and is not really critical; experiments with sorting Strings showed that performance generally gets worse when this value is above 200 or below 20.
Benchmarks
We have benchmarked CoSort against the original serial version, sorting an Upp::Vector of N randomly generated short Upp::Strings. For reference, we have also benchmarked the standard library std::sort on a std::vector<std::string> with the same data set.
Benchmarks were performed on the following platforms, using svn revision 9390 of the U++ framework. The benchmarking package was compiled with 'fast' optimization.
Intel® Pentium® CPU G620 @ 2.60GHz (2 cores, 2 threads), 4GB RAM
Ubuntu Linux 64-bit
gcc (Ubuntu 5.2.1-22ubuntu2) 5.2.1 20151010
N | Sort | CoSort | Sort / CoSort | std::sort |
---|---|---|---|---|
10³ | 63.22 us | 53.12 us | 1.19x | 241.82 us |
10⁴ | 968.93 us | 603.93 us | 1.60x | 3180 us |
10⁵ | 11.50 ms | 6.58 ms | 1.75x | 38.32 ms |
10⁶ | 137.20 ms | 77.00 ms | 1.78x | 427.70 ms |
10⁷ | 1.63 s | 0.89 s | 1.83x | 4.85 s |
Intel® Core™ i7-4771 @ 3.5GHz (4 cores, 8 threads), 16GB RAM
Windows 10 64-bit
Visual C++ 2014 64-bit mode
N | Sort | CoSort | Sort / CoSort | std::sort |
---|---|---|---|---|
10³ | 47.87 us | 44.27 us | 1.08x | 193.37 us |
10⁴ | 642.97 us | 301.97 us | 2.13x | 2490 us |
10⁵ | 7.96 ms | 2.62 ms | 3.04x | 31.32 ms |
10⁶ | 95.90 ms | 28.50 ms | 3.36x | 381.60 ms |
10⁷ | 1.16 s | 0.30 s | 3.87x | 4.20 s |
10⁸ | 13.63 s | 3.69 s | 3.69x | 49.08 s |
Broadcom BCM2836 ARM Cortex-A7 900 MHz, 4 cores, 4 threads, 1GB SDRAM (Raspberry Pi 2)
Ubuntu Linux
gcc (Ubuntu 5.2.1-22ubuntu2) 5.2.1 20151010, 32-bit
N | Sort | CoSort | Sort / CoSort | std::sort |
---|---|---|---|---|
10³ | 305.87 us | 291.77 us | 1.04x | 1470 us |
10⁴ | 4.58 ms | 2.33 ms | 1.96x | 20.00 ms |
10⁵ | 69.05 ms | 33.64 ms | 2.05x | 250.48 ms |
10⁶ | 1.02 s | 0.49 s | 2.08x | 3.13 s |
(The Pentium G620 system does not have enough memory to run the 10⁸ test; the ARM system does not have enough memory for tests beyond 10⁶.)
Conclusion
Benchmark results indicate that for large enough datasets, the performance ratio between the serial and parallel versions of the algorithm approaches the number of physical CPU cores on Intel CPUs. On the other hand, parallelization does not seem to make much sense for small datasets, where factors like false cache-line sharing and job/thread management costs in CoWork prevent any noticeable speedup and ultimately lead to a slowdown. In a production version of CoSort, it would probably be advisable to revert to purely serial code for fewer than 500 elements.
The ARM results are somewhat disappointing; the plausible explanation is a memory bandwidth bottleneck - unlike the Intel i7 system, which has about 20GB/s of bandwidth, the RAM bandwidth of the Raspberry Pi is only about 200MB/s.
History
- January 18th, 2016: ARM CPU benchmark results added