Click here to Skip to main content
15,867,756 members
Articles / Programming Languages / C++11

An Efficient Parallel Three-Way Quicksort Using Intel C++ Compiler And OpenMP 4.5 Library

Rate me:
Please Sign up or sign in to vote.
4.88/5 (87 votes)
6 May 2020CPOL7 min read 133.1K   2.2K   118   65
In this article, I’d like to introduce the modern code in C++11, implementing the parallel three-way quicksort, which is asymptotically faster and more efficient than the famous heapsort and mergesort algorithms.
According to the latest research, the qsort(…) (ANSI C) and std::sort(…) (ISO/IEC 14882(E)) functions are no longer the fastest implementations of the famous quicksort algorithm. Its performance significantly degrades while using it for the big-data sorting. To benefit in sorting of the big-data, we obviously need a different approach that allows to increase the overall performance of the the classical quicksort over its known shortcomings and limitations.

Note: Read the original article An Efficient Parallel Three-Way Quicksort Using Intel C++ Compiler And OpenMP 4.5 Library at Intel® Developer Zone.

Introduction

"Sorting plays a major role in commercial data processing and in modern scientific computing. Indeed, a sorting algorithm was named as one of the top ten algorithms for science and engineering of the 20th century", -

Robert Sedgewick, Kevin Wayne, "Algorithms 4th Edition".

In this article, I’d like to introduce the modern code in C++11, implementing the parallel three-way quicksort, which is asymptotically faster and more efficient than the famous heapsort and mergesort algorithms. Also, the parallel code, discussed in this article, provides more performance speed-up compared to the existing implementations of the fast quicksort, such as qsort(…) (ANSI C) and std::sort(…) (ISO/IEC 14882(E)). Specifically, I will look into the three-way quicksort algorithm and its complexity. Also, I will thoroughly discuss how to deliver the modern code, using Intel C++ Compiler and OpenMP 4.5 library, that runs the sort in parallel. I will explain how to use the OpenMP’s concurrent tasks to parallelize the recursive sub-sorts, drastically improving the overall performance. Finally, I will introduce a sample program that demonstrates the parallel sort, running it on the machine with symmetric multi-core Intel CPUs, such as Intel® Core™ i9 and Intel® Xeon™ E5, and, at the same time, evaluate its performance compared to the sort performed by std::sort(…), implemented as a part of the C++ Standard Library.

Three-Way Quicksort Algorithm

The main idea of performing the three-way quicksort is to divide an entire array into three subsequent partitions, consisting of elements that are less, equal to or greater than the value of pivot element, respectively. Then, the left and right partitions are recursively sorted by performing the same three-way partitioning. The using of the three-way quicksort allows to reduce the best-case computational complexity of the sort from linearithmic O(n log n) to linear O(n), while sorting arrays with a huge number of elements. The following algorithm is O((n log n) / n) = O(log n) times faster, rather than the famous heapsort and mergesort algorithms. Moreover, the three-way quicksort is cache-coherent, greatly affecting the CPU’s cache pipeline and, thus, the overall performance of the sort, in general.

The three-way quicksort algorithm is mainly based on a single pass through each element in the array, from left to right. Each element in the array is compared to the value of pivot using three-way comparison, discussed below. The three-way comparison handles three main cases when an element is less, greater than or equal to the value of pivot. If an element is less than the value of pivot, this element is exchanged to the left partition, otherwise if an element is greater than the value of pivot, it’s exchanged to the right partition. Also, the elements equal to the value of pivot are never exchanged. During this process, it maintains three indices of the left, right and i, respectively. The index i is used for accessing each element in the array, while the left and right indices are used for swapping elements to a specific partition.

The three-way quicksort algorithm can be formulated as follows:

  1. Let arr[0..N] – an array of elements, low, high – the indices of the first and last elements;
  2. Select a pivot as the value of the first element (pivot = arr[low]);
  3. Initialize the left variable as the index of the first element (left = low);
  4. Initialize the right variable as the index of the last element (right = high);
  5. Initialize the variable i as the index of the second element (i = low + 1);
  6. For each i-th element arr[i] in the array (i <= high), do the following:

    Compare i-th element arr[i] with the value of pivot:

    1. If the i-th element arr[i] is less than the pivot value, then exchange it with the element at left index and increment left and i indices by 1;
    2. If the i-th element arr[i] is greater than the pivot value, then exchange it with the element at right index and decrement the right index by 1;
    3. Otherwise, do not perform any exchange and increment the index i by 1;
  7. Recursively sort the left partition of the array arr[low..left - 1];
  8. Recursively sort the right partition of the array arr[right + 1..high];

After performing the three-way partitioning, the following algorithm recurs the sorting of the left and right partitions, independently. This operation can be performed by spawning two concurrent tasks that will do the recursive sorting in parallel.

An Efficient Parallel Sort

The modern code in C++11 listed below implements the parallel three-way quicksort algorithm:

C++
namespace internal
{
	std::size_t g_depth = 0L;
	const std::size_t cutoff = 1000000L;

	template<class RanIt, class _Pred>
	void qsort3w(RanIt _First, RanIt _Last, _Pred compare)
	{
		if (_First >= _Last) return;
		
		std::size_t _Size = 0L; g_depth++;
		if ((_Size = std::distance(_First, _Last)) > 0)
		{
			RanIt _LeftIt = _First, _RightIt = _Last;
			bool is_swapped_left = false, is_swapped_right = false;
			typename std::iterator_traits<RanIt>::value_type _Pivot = *_First;

			RanIt _FwdIt = _First + 1;
			while (_FwdIt <= _RightIt)
			{
				if (compare(*_FwdIt, _Pivot))
				{
					is_swapped_left = true;
					std::iter_swap(_LeftIt, _FwdIt);
					_LeftIt++; _FwdIt++;
				}

				else if (compare(_Pivot, *_FwdIt)) {
					is_swapped_right = true;
					std::iter_swap(_RightIt, _FwdIt);
					_RightIt--;
				}

				else _FwdIt++;
			}

			if (_Size >= internal::cutoff)
			{
				#pragma omp taskgroup
				{
					#pragma omp task untied mergeable
					if ((std::distance(_First, _LeftIt) > 0) && (is_swapped_left))
						qsort3w(_First, _LeftIt - 1, compare);

					#pragma omp task untied mergeable
					if ((std::distance(_RightIt, _Last) > 0) && (is_swapped_right))
						qsort3w(_RightIt + 1, _Last, compare);
				}
			}

			else
			{
				#pragma omp task untied mergeable
				{
					if ((std::distance(_First, _LeftIt) > 0) && is_swapped_left)
						qsort3w(_First, _LeftIt - 1, compare);

					if ((std::distance(_RightIt, _Last) > 0) && is_swapped_right)
						qsort3w(_RightIt + 1, _Last, compare);
				}
			}
		}
	}

	template<class BidirIt, class _Pred >
	void parallel_sort(BidirIt _First, BidirIt _Last, _Pred compare)
	{
		std::size_t pos = 0L; g_depth = 0L;
		if (!misc::sorted(_First, _Last, pos, compare))
		{
			#pragma omp parallel num_threads(12)
			#pragma omp master
				internal::qsort3w(_First, _Last - 1, compare);
		}
	}
}

In this code, we perform the three-way partitioning sequentially, because it normally incurs the data flow dependency and, thus, cannot be run in parallel. Also, in this case, the parallel optimization is not required since the complexity of the three-way partitioning is always O(n), providing the sufficient performance speed-up.

In turn, the fragment of code that does the recursive sorting can be easily run in parallel, drastically increasing the overall performance of the sort. To perform the parallel recursive sorting, I’ve implemented the code that, while being executed, creates a group of two concurrent OpenMP tasks using #pragma omp taskgroup {} directive. Both of these tasks are scheduled and launched by using the OpenMP’s #pragma omp task untied mergeable {} directive, performing the recursive sorting in its own separate thread. I’ve used untied clause to make sure that the following task is executed by more than one thread. As well, I specified the mergeable clause, so that the task will use the same data context as the code generating this task:

C++
if (_Size >= internal::cutoff)
{
    #pragma omp taskgroup
    {
        #pragma omp task untied mergeable
        if ((std::distance(_First, _LeftIt) > 0) && (is_swapped_left))
            qsort3w(_First, _LeftIt - 1, compare);

        #pragma omp task untied mergeable
        if ((std::distance(_RightIt, _Last) > 0) && (is_swapped_right))
            qsort3w(_RightIt + 1, _Last, compare);
    }
}

else
{
    #pragma omp task untied mergeable
    {
        if ((std::distance(_First, _LeftIt) > 0) && is_swapped_left)
            qsort3w(_First, _LeftIt - 1, compare);

        if ((std::distance(_RightIt, _Last) > 0) && is_swapped_right)
            qsort3w(_RightIt + 1, _Last, compare);
    }
}

The first task being scheduled performs the sorting of the left partition, while the second task does the right partition sorting, respectively.

The code listed above, actually performs the conditional tasking. Before generating a task, it performs a check if a partition is required to be sorted and we’re not sorting an empty partition. If so, it spawns a task that recursively calls the qsort3w(…) function to perform the actual sorting.

Here’s also one effective optimization of the parallel three-way quicksort. In some cases, when sorting arrays with lots of duplicate elements, the recursion depth of the sort might increase and the code will generate an enormously large number of parallel tasks, producing the sufficient overhead time spent for synchronization and tasks scheduling. To avoid the following issue occurrence, I’ve implemented a fragment of code that first checks if the size of an array being sorted exceeds the specific cutoff boundary. If so, it normally creates a group of two concurrent tasks as it has already been discussed above. Otherwise, it merges both recursive calls to the qsort3w(…) function, executed by a single thread. This, in turn, allows to reduce the number of parallel recursive tasks being scheduled.

The entire process of sorting starts by calling the qsort3w(…) function, launched in a separate thread to invoke the parallelism:

C++
template<class BidirIt, class _Pred >
void parallel_sort(BidirIt _First, BidirIt _Last, _Pred compare)
{
    std::size_t pos = 0L; g_depth = 0L;
    if (!misc::sorted(_First, _Last, pos, compare))
    {
        #pragma omp parallel num_threads(12)
        #pragma omp master
            internal::qsort3w(_First, _Last - 1, compare);
    }
}

It’s recommended to invoke the qsort3w(…) function in the master thread of a parallel region, rather than to use OpenMP’s tasking directives for that purpose.

Sample Demo Program In C++11

To demonstrate the performance and efficiency of the parallel code delivered, I've implemented a sample demo program, evaluating the time spent for the sort by using the regular std::sort and the parallel sort discussed in this article:

C++
namespace parallel_sort_impl
{
#if defined( _WIN32 )
	static HANDLE hStdout = ::GetStdHandle(STD_OUTPUT_HANDLE);
	const WORD wdRed = FOREGROUND_RED | FOREGROUND_INTENSITY;
	const WORD wdWhite = FOREGROUND_RED | FOREGROUND_GREEN | FOREGROUND_BLUE;
#endif

	void stress_test(void)
	{
		while (true)
		{
			std::size_t count = 0L;
			std::vector<std::int64_t> array, array_copy;
			misc::init(array, std::make_pair(std::pow(10, misc::minval_radix), \
				std::pow(10, misc::maxval_radix)), count);

			array_copy.resize(array.size());
			std::copy(array.begin(), array.end(), array_copy.begin());

			std::chrono::system_clock::time_point \
				time_s = std::chrono::system_clock::now();

			std::cout << "sorting an array...\n";

			std::sort(array.begin(), array.end(),
				[](std::int64_t first, std::int64_t end) { return first < end; });

			std::chrono::system_clock::time_point \
				time_f = std::chrono::system_clock::now();

			std::chrono::system_clock::duration \
				std_sort_time_elapsed = time_f - time_s;

			std::cout << std::setiosflags(std::ios::fixed) << std::setprecision(4)
				<< "array size = " << count << " execution time (std::sort): " 
                                   << std_sort_time_elapsed.count() << " ms ";

			time_s = std::chrono::system_clock::now();

			internal::parallel_sort(array_copy.begin(), array_copy.end(),
				[](std::int64_t first, std::int64_t end) { return first < end; });

			time_f = std::chrono::system_clock::now();

			std::size_t position = 0L;
			std::chrono::system_clock::duration \
				qsort_time_elapsed = time_f - time_s;

			bool is_sorted = misc::sorted(array_copy.begin(), array_copy.end(), position,
				[](std::int64_t first, std::int64_t end) { return first < end; });

			std::double_t time_diff = \
				std_sort_time_elapsed.count() - qsort_time_elapsed.count();

			#if defined( _WIN32 )
			::SetConsoleTextAttribute(hStdout, \
				(is_sorted == true) ? wdWhite : wdRed);
			#else
			if (is_sorted == false)
				std::cout << "\033[1;31m";
			#endif	

			std::cout << "<--> (internal::parallel_sort): " 
                      << qsort_time_elapsed.count() << " ms " << "\n";

			std::cout << "verification: ";

			if (is_sorted == false) {
				std::cout << "failed at pos: " << position << "\n";
				std::cin.get();
				misc::print_out(array_copy.begin() + position, 
                                array_copy.end() + position + 10);
			}

			else {
				std::double_t ratio = qsort_time_elapsed.count() / \
					(std::double_t)std_sort_time_elapsed.count();
				std::cout << std::setiosflags(std::ios::fixed) << std::setprecision(2)
					<< "passed... [ time_diff: " << std::fabs(time_diff)
					<< " ms (" << "ratio: " << (ratio - 1.0) * 100 
                    << "% (" << (1.0f / ratio) << "x faster)) depth = "
					<< internal::g_depth << " ]" << "\n";
			}

			std::cout << "\n";

			#if !defined ( _WIN32 )
			if (is_sorted == false)
				std::cout << "\033[0m";
			#endif	
		}
	}

	void parallel_sort_demo(void)
	{
		std::size_t count = 0L;
		std::cout << "Enter the number of data items N = "; std::cin >> count;

		std::vector<std::int64_t> array;
		misc::init(array, std::make_pair(std::pow(10, misc::minval_radix), \
			std::pow(10, misc::maxval_radix)), count);

		std::chrono::system_clock::time_point \
			time_s = std::chrono::system_clock::now();

		internal::parallel_sort(array.begin(), array.end(),
			[](std::int64_t first, std::int64_t end) { return first < end; });

		std::chrono::system_clock::time_point \
			time_f = std::chrono::system_clock::now();

		std::size_t position = 0L;
		std::chrono::system_clock::duration \
			qsort_time_elapsed = time_f - time_s;

		std::cout << "Execution Time: " << qsort_time_elapsed.count()
				  << " ms " << "depth = " << internal::g_depth << " ";

		bool is_sorted = misc::sorted(array.begin(), array.end(), position,
			[](std::int64_t first, std::int64_t end) { return first < end; });

		std::cout << "(verification: ";

		if (is_sorted == false) {
			std::cout << "failed at pos: " << position << "\n";
			std::cin.get();
			misc::print_out(array.begin() + position, array.end() + position + 10);
		}

		else {
			std::cout << "passed...)" << "\n";
		}

		char option = '\0';
		std::cout << "Do you want to output the array [Y/N]?"; std::cin >> option;

		if (option == 'y' || option == 'Y')
			misc::print_out(array.begin(), array.end());
	}
}

int main()
{
	std::string logo = "Parallel Sort v.1.00 by Arthur V. Ratz";
	std::cout << logo << "\n\n";

	char option = '\0';
	std::cout << "Do you want to run stress test first [Y/N]?"; std::cin >> option;
	std::cout << "\n";

	if (option == 'y' || option == 'Y')
		parallel_sort_impl::stress_test();
	if (option == 'n' || option == 'N')
		parallel_sort_impl::parallel_sort_demo();

	return 0;
}

#endif // PARALLEL_STABLE_SORT_STL_CPP

In conclusion, as you can see from using the sample demo program, the parallel sort discussed in this article is drastically (up to 2x-6x times) faster, rather than the regular qsort or std::sort functions.

History

  • 5th December, 2015 - First revision
  • 8th December, 2015 - Second revision
  • 10th December, 2015 - Third revision
  • 12th December, 2015 - Final revision (performance analytics added)
  • 16th November, 2017 - Final revision (huge material update!)
  • 7th May, 2020 - Final revision (total material update!)

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Software Developer (Senior) EpsilonDev
Ukraine Ukraine
I’m software developer, system analyst and network engineer, with over 20 years experience, graduated from L’viv State Polytechnic University and earned my computer science and information technology master’s degree in January 2004. My professional career began as a financial and accounting software developer in EpsilonDev company, located at L’viv, Ukraine. My favorite programming languages - C/C++, C#.NET, Java, ASP.NET, Node.js/JavaScript, PHP, Perl, Python, SQL, HTML5, etc. While developing applications, I basically use various of IDE’s and development tools, including Microsoft Visual Studio/Code, Eclipse IDE for Linux, IntelliJ/IDEA for writing code in Java. My professional interests basically include data processing and analysis algorithms, artificial intelligence and data mining, system analysis, modern high-performance computing (HPC), development of client-server web-applications using various of libraries, frameworks and tools. I’m also interested in cloud-computing, system security audit, IoT, networking architecture design, hardware engineering, technical writing, etc. Besides of software development, I also admire to write and compose technical articles, walkthroughs and reviews about the new IT- technological trends and industrial content. I published my first article at CodeProject in June 2015.

Comments and Discussions

 
QuestionOMP with visual studio? Pin
Dang D. Khanh7-Oct-22 23:10
Dang D. Khanh7-Oct-22 23:10 
AnswerRe: OMP with visual studio? Pin
Arthur V. Ratz7-Oct-22 23:47
professionalArthur V. Ratz7-Oct-22 23:47 
GeneralMy vote of 5 Pin
Grigoriy Stepovich2-Dec-17 20:34
professionalGrigoriy Stepovich2-Dec-17 20:34 
GeneralRe: My vote of 5 Pin
Arthur V. Ratz2-Dec-17 21:53
professionalArthur V. Ratz2-Dec-17 21:53 
Question3-way QuickSort complexity number wrong Pin
Miroslav Fidler20-Nov-17 20:37
Miroslav Fidler20-Nov-17 20:37 
AnswerRe: 3-way QuickSort complexity number wrong Pin
Arthur V. Ratz20-Nov-17 20:50
professionalArthur V. Ratz20-Nov-17 20:50 
GeneralRe: 3-way QuickSort complexity number wrong Pin
Miroslav Fidler20-Nov-17 21:29
Miroslav Fidler20-Nov-17 21:29 
AnswerRe: 3-way QuickSort complexity number wrong Pin
Arthur V. Ratz20-Nov-17 20:54
professionalArthur V. Ratz20-Nov-17 20:54 
GeneralRe: 3-way QuickSort complexity number wrong Pin
Miroslav Fidler20-Nov-17 21:32
Miroslav Fidler20-Nov-17 21:32 
QuestionArticle is still absolute nonsense Pin
Miroslav Fidler20-Sep-17 2:13
Miroslav Fidler20-Sep-17 2:13 
AnswerRe: Article is still absolute nonsense Pin
Arthur V. Ratz20-Sep-17 4:25
professionalArthur V. Ratz20-Sep-17 4:25 
QuestionAnother really bad article Pin
Byron Goodman15-Apr-17 11:13
Byron Goodman15-Apr-17 11:13 
AnswerRe: Another really bad article Pin
Rick York19-Sep-17 5:57
mveRick York19-Sep-17 5:57 
GeneralRe: Another really bad article Pin
Arthur V. Ratz19-Sep-17 6:52
professionalArthur V. Ratz19-Sep-17 6:52 
GeneralRe: Another really bad article Pin
Rick York19-Sep-17 10:48
mveRick York19-Sep-17 10:48 
GeneralRe: Another really bad article Pin
Byron Goodman19-Sep-17 7:47
Byron Goodman19-Sep-17 7:47 
Question5 Pin
afigegoznaet12-Apr-17 23:37
professionalafigegoznaet12-Apr-17 23:37 
PraiseVote of 5 Pin
Beginner Luck4-Jul-16 15:03
professionalBeginner Luck4-Jul-16 15:03 
GeneralRe: Vote of 5 Pin
Arthur V. Ratz4-Jul-16 19:37
professionalArthur V. Ratz4-Jul-16 19:37 
GeneralMy vote of 5 Pin
Kishor-KW7-Jan-16 18:12
Kishor-KW7-Jan-16 18:12 
BugWarning: This article contains many factual errors and blatant bugs Pin
Miroslav Fidler7-Jan-16 8:42
Miroslav Fidler7-Jan-16 8:42 
GeneralRe: Warning: This article contains many factual errors and blatant bugs Pin
Arthur V. Ratz7-Jan-16 12:40
professionalArthur V. Ratz7-Jan-16 12:40 
GeneralRe: Warning: This article contains many factual errors and blatant bugs Pin
Miroslav Fidler7-Jan-16 20:09
Miroslav Fidler7-Jan-16 20:09 
GeneralRe: Warning: This article contains many factual errors and blatant bugs Pin
Arthur V. Ratz7-Jan-16 20:12
professionalArthur V. Ratz7-Jan-16 20:12 
GeneralRe: Warning: This article contains many factual errors and blatant bugs PinPopular
Giuseppe Ciaccio31-Jan-16 6:58
Giuseppe Ciaccio31-Jan-16 6:58 
After reading this article I can confirm it contains a number of flaws.

Notably, the concept of "false sharing" as reported in the article is wrong. False sharing has nothing to do with coherency of multiple copies of a single variable across caches: false sharing is when DISTINCT variables are allocated onto the same cache line, so that using one of them invalidates also the other (thus two threads will incur repeated cache misses although they do not actually share nothing). The coherency problem is solved by the MESI protocol IN HARDWARE, there is no need and no way to address it in the program (the "volatile" attribute is not enough for that).

The code too seems to contain a lot of bugs. For instance, it seems to divide the array into chunks and let each chunk be independently sorted by a thread; this however will not yield a sorted array as a whole. Moreover, the use of "Sleep(1)" for (pseudo)synchronizing threads is not guaranteed to work in general, besides incurring an obvious performance penalty. Another wrong piece of code is the one which performs a "search for the smallest data item index in the padded array", due to a race on variable min_n .

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.