According to the latest research, the qsort(…) (ANSI C) and std::sort(…) (ISO/IEC 14882(E)) functions are no longer the fastest implementations of the famous quicksort algorithm. Its performance significantly degrades while using it for the big-data sorting. To benefit in sorting of the big-data, we obviously need a different approach that allows to increase the overall performance of the the classical quicksort over its known shortcomings and limitations.

**Note**: Read the original article An Efficient Parallel Three-Way Quicksort Using Intel C++ Compiler And OpenMP 4.5 Library at Intel® Developer Zone.

## Introduction

*"Sorting plays a major role in commercial data processing and in modern scientific computing. Indeed, a sorting algorithm was named as one of the top ten algorithms for science and engineering of the 20th century", - *

*Robert Sedgewick, Kevin Wayne, "Algorithms 4*^{th} Edition".

In this article, I’d like to introduce the modern code in C++11, implementing the parallel three-way quicksort, which is asymptotically faster and more efficient than the famous heapsort and mergesort algorithms. Also, the parallel code, discussed in this article, provides more performance speed-up compared to the existing implementations of the fast quicksort, such as `qsort(…)`

(ANSI C) and `std::sort(…)`

(ISO/IEC 14882(E)). Specifically, I will look into the three-way quicksort algorithm and its complexity. Also, I will thoroughly discuss how to deliver the modern code, using Intel C++ Compiler and OpenMP 4.5 library, that runs the sort in parallel. I will explain how to use the OpenMP’s concurrent tasks to parallelize the recursive sub-sorts, drastically improving the overall performance. Finally, I will introduce a sample program that demonstrates the parallel sort, running it on the machine with symmetric multi-core Intel CPUs, such as Intel® Core™ i9 and Intel® Xeon™ E5, and, at the same time, evaluate its performance compared to the sort performed by `std::sort(…)`

, implemented as a part of the C++ Standard Library.

## Three-Way Quicksort Algorithm

The main idea of performing the three-way quicksort is to divide an entire array into three subsequent partitions, consisting of elements that are less, equal to or greater than the value of pivot element, respectively. Then, the left and right partitions are recursively sorted by performing the same three-way partitioning. The using of the three-way quicksort allows to reduce the best-case computational complexity of the sort from linearithmic O(n log n) to linear O(n), while sorting arrays with a huge number of elements. The following algorithm is O((n log n) / n) = O(log n) times faster, rather than the famous heapsort and mergesort algorithms. Moreover, the three-way quicksort is cache-coherent, greatly affecting the CPU’s cache pipeline and, thus, the overall performance of the sort, in general.

The three-way quicksort algorithm is mainly based on a single pass through each element in the array, from left to right. Each element in the array is compared to the value of pivot using three-way comparison, discussed below. The three-way comparison handles three main cases when an element is less, greater than or equal to the value of pivot. If an element is less than the value of pivot, this element is exchanged to the left partition, otherwise if an element is greater than the value of pivot, it’s exchanged to the right partition. Also, the elements equal to the value of pivot are never exchanged. During this process, it maintains three indices of the left, right and i, respectively. The index i is used for accessing each element in the array, while the left and right indices are used for swapping elements to a specific partition.

The three-way quicksort algorithm can be formulated as follows:

- Let arr[0..N] – an array of elements, low, high – the indices of the first and last elements;
- Select a pivot as the value of the first element (pivot = arr[low]);
- Initialize the left variable as the index of the first element (left = low);
- Initialize the right variable as the index of the last element (right = high);
- Initialize the variable i as the index of the second element (i = low + 1);
- For each i-th element arr[i] in the array (i <= high), do the following:
Compare i-th element arr[i] with the value of pivot:

- If the i-th element arr[i] is less than the pivot value, then exchange it with the element at left index and increment left and i indices by 1;
- If the i-th element arr[i] is greater than the pivot value, then exchange it with the element at right index and decrement the right index by 1;
- Otherwise, do not perform any exchange and increment the index i by 1;

- Recursively sort the left partition of the array arr[low..left - 1];
- Recursively sort the right partition of the array arr[right + 1..high];

After performing the three-way partitioning, the following algorithm recurs the sorting of the left and right partitions, independently. This operation can be performed by spawning two concurrent tasks that will do the recursive sorting in parallel.

## An Efficient Parallel Sort

The modern code in C++11 listed below implements the parallel three-way quicksort algorithm:

namespace internal
{
std::size_t g_depth = 0L;
const std::size_t cutoff = 1000000L;
template<class RanIt, class _Pred>
void qsort3w(RanIt _First, RanIt _Last, _Pred compare)
{
if (_First >= _Last) return;
std::size_t _Size = 0L; g_depth++;
if ((_Size = std::distance(_First, _Last)) > 0)
{
RanIt _LeftIt = _First, _RightIt = _Last;
bool is_swapped_left = false, is_swapped_right = false;
typename std::iterator_traits<RanIt>::value_type _Pivot = *_First;
RanIt _FwdIt = _First + 1;
while (_FwdIt <= _RightIt)
{
if (compare(*_FwdIt, _Pivot))
{
is_swapped_left = true;
std::iter_swap(_LeftIt, _FwdIt);
_LeftIt++; _FwdIt++;
}
else if (compare(_Pivot, *_FwdIt)) {
is_swapped_right = true;
std::iter_swap(_RightIt, _FwdIt);
_RightIt--;
}
else _FwdIt++;
}
if (_Size >= internal::cutoff)
{
#pragma omp taskgroup
{
#pragma omp task untied mergeable
if ((std::distance(_First, _LeftIt) > 0) && (is_swapped_left))
qsort3w(_First, _LeftIt - 1, compare);
#pragma omp task untied mergeable
if ((std::distance(_RightIt, _Last) > 0) && (is_swapped_right))
qsort3w(_RightIt + 1, _Last, compare);
}
}
else
{
#pragma omp task untied mergeable
{
if ((std::distance(_First, _LeftIt) > 0) && is_swapped_left)
qsort3w(_First, _LeftIt - 1, compare);
if ((std::distance(_RightIt, _Last) > 0) && is_swapped_right)
qsort3w(_RightIt + 1, _Last, compare);
}
}
}
}
template<class BidirIt, class _Pred >
void parallel_sort(BidirIt _First, BidirIt _Last, _Pred compare)
{
std::size_t pos = 0L; g_depth = 0L;
if (!misc::sorted(_First, _Last, pos, compare))
{
#pragma omp parallel num_threads(12)
#pragma omp master
internal::qsort3w(_First, _Last - 1, compare);
}
}
}

In this code, we perform the three-way partitioning sequentially, because it normally incurs the data flow dependency and, thus, cannot be run in parallel. Also, in this case, the parallel optimization is not required since the complexity of the three-way partitioning is always O(n), providing the sufficient performance speed-up.

In turn, the fragment of code that does the recursive sorting can be easily run in parallel, drastically increasing the overall performance of the sort. To perform the parallel recursive sorting, I’ve implemented the code that, while being executed, creates a group of two concurrent OpenMP tasks using `#pragma omp taskgroup {}`

directive. Both of these tasks are scheduled and launched by using the OpenMP’s `#pragma omp task untied mergeable {}`

directive, performing the recursive sorting in its own separate thread. I’ve used `untied`

clause to make sure that the following task is executed by more than one thread. As well, I specified the `mergeable`

clause, so that the task will use the same data context as the code generating this task:

if (_Size >= internal::cutoff)
{
#pragma omp taskgroup
{
#pragma omp task untied mergeable
if ((std::distance(_First, _LeftIt) > 0) && (is_swapped_left))
qsort3w(_First, _LeftIt - 1, compare);
#pragma omp task untied mergeable
if ((std::distance(_RightIt, _Last) > 0) && (is_swapped_right))
qsort3w(_RightIt + 1, _Last, compare);
}
}
else
{
#pragma omp task untied mergeable
{
if ((std::distance(_First, _LeftIt) > 0) && is_swapped_left)
qsort3w(_First, _LeftIt - 1, compare);
if ((std::distance(_RightIt, _Last) > 0) && is_swapped_right)
qsort3w(_RightIt + 1, _Last, compare);
}
}

The first task being scheduled performs the sorting of the left partition, while the second task does the right partition sorting, respectively.

The code listed above, actually performs the conditional tasking. Before generating a task, it performs a check if a partition is required to be sorted and we’re not sorting an empty partition. If so, it spawns a task that recursively calls the `qsort3w(…)`

function to perform the actual sorting.

Here’s also one effective optimization of the parallel three-way quicksort. In some cases, when sorting arrays with lots of duplicate elements, the recursion depth of the sort might increase and the code will generate an enormously large number of parallel tasks, producing the sufficient overhead time spent for synchronization and tasks scheduling. To avoid the following issue occurrence, I’ve implemented a fragment of code that first checks if the size of an array being sorted exceeds the specific cutoff boundary. If so, it normally creates a group of two concurrent tasks as it has already been discussed above. Otherwise, it merges both recursive calls to the `qsort3w(…)`

function, executed by a single thread. This, in turn, allows to reduce the number of parallel recursive tasks being scheduled.

The entire process of sorting starts by calling the `qsort3w(…)`

function, launched in a separate thread to invoke the parallelism:

template<class BidirIt, class _Pred >
void parallel_sort(BidirIt _First, BidirIt _Last, _Pred compare)
{
std::size_t pos = 0L; g_depth = 0L;
if (!misc::sorted(_First, _Last, pos, compare))
{
#pragma omp parallel num_threads(12)
#pragma omp master
internal::qsort3w(_First, _Last - 1, compare);
}
}

It’s recommended to invoke the `qsort3w(…)`

function in the master thread of a parallel region, rather than to use OpenMP’s tasking directives for that purpose.

## Sample Demo Program In C++11

To demonstrate the performance and efficiency of the parallel code delivered, I've implemented a sample demo program, evaluating the time spent for the sort by using the regular `std::sort`

and the parallel sort discussed in this article:

namespace parallel_sort_impl
{
#if defined( _WIN32 )
static HANDLE hStdout = ::GetStdHandle(STD_OUTPUT_HANDLE);
const WORD wdRed = FOREGROUND_RED | FOREGROUND_INTENSITY;
const WORD wdWhite = FOREGROUND_RED | FOREGROUND_GREEN | FOREGROUND_BLUE;
#endif
void stress_test(void)
{
while (true)
{
std::size_t count = 0L;
std::vector<std::int64_t> array, array_copy;
misc::init(array, std::make_pair(std::pow(10, misc::minval_radix), \
std::pow(10, misc::maxval_radix)), count);
array_copy.resize(array.size());
std::copy(array.begin(), array.end(), array_copy.begin());
std::chrono::system_clock::time_point \
time_s = std::chrono::system_clock::now();
std::cout << "sorting an array...\n";
std::sort(array.begin(), array.end(),
[](std::int64_t first, std::int64_t end) { return first < end; });
std::chrono::system_clock::time_point \
time_f = std::chrono::system_clock::now();
std::chrono::system_clock::duration \
std_sort_time_elapsed = time_f - time_s;
std::cout << std::setiosflags(std::ios::fixed) << std::setprecision(4)
<< "array size = " << count << " execution time (std::sort): "
<< std_sort_time_elapsed.count() << " ms ";
time_s = std::chrono::system_clock::now();
internal::parallel_sort(array_copy.begin(), array_copy.end(),
[](std::int64_t first, std::int64_t end) { return first < end; });
time_f = std::chrono::system_clock::now();
std::size_t position = 0L;
std::chrono::system_clock::duration \
qsort_time_elapsed = time_f - time_s;
bool is_sorted = misc::sorted(array_copy.begin(), array_copy.end(), position,
[](std::int64_t first, std::int64_t end) { return first < end; });
std::double_t time_diff = \
std_sort_time_elapsed.count() - qsort_time_elapsed.count();
#if defined( _WIN32 )
::SetConsoleTextAttribute(hStdout, \
(is_sorted == true) ? wdWhite : wdRed);
#else
if (is_sorted == false)
std::cout << "\033[1;31m";
#endif
std::cout << "<--> (internal::parallel_sort): "
<< qsort_time_elapsed.count() << " ms " << "\n";
std::cout << "verification: ";
if (is_sorted == false) {
std::cout << "failed at pos: " << position << "\n";
std::cin.get();
misc::print_out(array_copy.begin() + position,
array_copy.end() + position + 10);
}
else {
std::double_t ratio = qsort_time_elapsed.count() / \
(std::double_t)std_sort_time_elapsed.count();
std::cout << std::setiosflags(std::ios::fixed) << std::setprecision(2)
<< "passed... [ time_diff: " << std::fabs(time_diff)
<< " ms (" << "ratio: " << (ratio - 1.0) * 100
<< "% (" << (1.0f / ratio) << "x faster)) depth = "
<< internal::g_depth << " ]" << "\n";
}
std::cout << "\n";
#if !defined ( _WIN32 )
if (is_sorted == false)
std::cout << "\033[0m";
#endif
}
}
void parallel_sort_demo(void)
{
std::size_t count = 0L;
std::cout << "Enter the number of data items N = "; std::cin >> count;
std::vector<std::int64_t> array;
misc::init(array, std::make_pair(std::pow(10, misc::minval_radix), \
std::pow(10, misc::maxval_radix)), count);
std::chrono::system_clock::time_point \
time_s = std::chrono::system_clock::now();
internal::parallel_sort(array.begin(), array.end(),
[](std::int64_t first, std::int64_t end) { return first < end; });
std::chrono::system_clock::time_point \
time_f = std::chrono::system_clock::now();
std::size_t position = 0L;
std::chrono::system_clock::duration \
qsort_time_elapsed = time_f - time_s;
std::cout << "Execution Time: " << qsort_time_elapsed.count()
<< " ms " << "depth = " << internal::g_depth << " ";
bool is_sorted = misc::sorted(array.begin(), array.end(), position,
[](std::int64_t first, std::int64_t end) { return first < end; });
std::cout << "(verification: ";
if (is_sorted == false) {
std::cout << "failed at pos: " << position << "\n";
std::cin.get();
misc::print_out(array.begin() + position, array.end() + position + 10);
}
else {
std::cout << "passed...)" << "\n";
}
char option = '\0';
std::cout << "Do you want to output the array [Y/N]?"; std::cin >> option;
if (option == 'y' || option == 'Y')
misc::print_out(array.begin(), array.end());
}
}
int main()
{
std::string logo = "Parallel Sort v.1.00 by Arthur V. Ratz";
std::cout << logo << "\n\n";
char option = '\0';
std::cout << "Do you want to run stress test first [Y/N]?"; std::cin >> option;
std::cout << "\n";
if (option == 'y' || option == 'Y')
parallel_sort_impl::stress_test();
if (option == 'n' || option == 'N')
parallel_sort_impl::parallel_sort_demo();
return 0;
}
#endif // PARALLEL_STABLE_SORT_STL_CPP

In conclusion, as you can see from using the sample demo program, the parallel sort discussed in this article is drastically (up to 2x-6x times) faster, rather than the regular `qsort`

or `std::sort`

functions.

## History

- 5
^{th} December, 2015 - First revision - 8
^{th} December, 2015 - Second revision - 10
^{th} December, 2015 - Third revision - 12
^{th} December, 2015 - Final revision (performance analytics added) - 16
^{th} November, 2017 - Final revision (huge material update!) - 7
^{th} May, 2020 - Final revision (total material update!)