Multithreaded Parallel Scalable Sort Harnessing the Power of OpenMP Performance Library - Part 2: Effective Parallel Sort

Arthur V. Ratz, 20 Nov 2017
In this article, we'll discuss how to use the Intel C++ Compiler and the OpenMP performance library to deliver modern parallel code that implements an efficient parallel sorting algorithm.

Note: Optionally, you can evaluate the program that performs the fast parallel sort discussed in this article by using the virtual machine environment with 64-bit CentOS and the parallel sort executable pre-installed, available for download here (login: root, password: intel).

Introduction

In this second article in the series, we'll discuss most of the performance optimizations applied to the legacy sequential code that performs the actual sorting. The main reason to apply these optimizations and deliver modern parallel code is that sorting can be performed much faster when it is properly scaled across all active CPU cores during execution. In other words, in this article we'll introduce the use of the Intel® C++ compiler and the OpenMP performance library to leverage the power of modern Intel® CPUs capable of symmetric parallel code execution, which in turn provides a sufficient performance acceleration of the code being executed.

Specifically, we'll discuss the fragments of code implementing fast parallel sorting that are the basic optimization hotspots, whose performance can be improved by using specific OpenMP directive constructs, including OpenMP's parallel tasks, parallel worksharing and tight loop parallelization constructs, loop unrolling, etc.

The first topic that we're going to discuss is the parallel recursion used to optimize the performance of the improved 3-way quicksort performed at the backend. At this point, we'll discuss how to use OpenMP's tasks to provide concurrent execution of the sorter routine recursively called by a team of CPU hardware threads. Another optimization we'll spend a moment discussing is the parallel worksharing construct used to parallelize the sorting of an entire array split up into a number of partitions, as well as tight loop parallelization used to optimize the performance of specific loops that perform the actual sorting.

Additionally, we'll discuss how to increase the performance of the fragments of code that perform small-array and auxiliary sorting and cannot be perfectly parallelized using specific OpenMP directive constructs, due to a number of data-flow dependency and race condition issues.

During the discussion, we'll spend much of the time analyzing the parallel code that performs the actual sorting. We'll also provide useful guidelines on how to compile and run this code to evaluate its performance and reliability.
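
For reference, the code discussed in this article can typically be built with the Intel C++ Compiler on Linux along the lines of the following command (the source file name here is an assumption; adjust the paths and flags to your own environment):

icpc -O3 -qopenmp -std=c++11 parallel_sort.cpp -o parallel_sort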

At the end of this article, we'll provide a basic optimization hotspot analysis performed with the Intel® VTune™ Amplifier tool and briefly explain the results of the performance assessment.

This article is the final chapter of the series, in which we discuss the efficient parallel sorting algorithm introduced in the first article and optimized by using the OpenMP performance library.

Background

Parallel 3-Way Quicksort

In this paragraph, we'll discuss how to use the OpenMP library to deliver parallel code that implements the 3-way quicksort performed at the backend. The parallel optimization we're about to discuss provides only a small performance speed-up on its own but, along with the other optimizations discussed below, significantly affects the overall performance of the entire sorting process.

As we've already discussed in the first article in the series, most existing variants of the quicksort algorithm, including the 3-way quicksort, are based on the idea of recursive sorting, during which the subsequent parts of the partitioned array are sorted recursively by calling the routine that performs the actual sorting.

In legacy sequential code that implements a quicksort algorithm, the recursive calls to the sorter routine are performed sequentially. To sort the leftmost part of the array, we make the first recursive call to the sorter routine and, after it completes its execution, call the same sorter routine a second time to sort the rightmost part of the array. As you might have noticed while running this sequential code, the second call to the sorter routine is not performed until the sorter function invoked during the first call has finished its execution. This is typically called sequential execution, which in most cases limits the performance speed-up of the sorting process.

In delivering modern parallel code, our goal is to provide a sufficient performance speed-up by implementing parallel recursion. This can easily be done by using specific directive constructs of the OpenMP library. Specifically, we'll use OpenMP's #pragma omp task directive construct to execute the pair of recursive calls to the sorter routine in parallel, so that the recursive calls to the sorter routine are performed simultaneously, as shown in Fig. 1:
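
Fig. 1 in the original article is a code screenshot and is not reproduced here. The following minimal, self-contained sketch (placeholder code operating on a plain int array, not the article's actual sorter routine) illustrates the same parallel-recursion pattern built on OpenMP tasks:

#include <algorithm>
#include <vector>

static void qsort_task(int* first, int* last)
{
	if (last - first < 2) return;
	int pivot = *(first + (last - first) / 2);

	// 3-way split: [first, lt) < pivot, [lt, gt) == pivot, [gt, last) > pivot
	int* lt = std::partition(first, last, [=](int v) { return v < pivot; });
	int* gt = std::partition(lt, last, [=](int v) { return v == pivot; });

	// Each recursive call becomes an independent OpenMP task
	#pragma omp task untied mergeable
	qsort_task(first, lt);
	#pragma omp task untied mergeable
	qsort_task(gt, last);
}

void parallel_qsort(std::vector<int>& a)
{
	#pragma omp parallel      // create a team of threads
	#pragma omp single nowait // one thread spawns the initial task tree
	qsort_task(a.data(), a.data() + a.size());
}   // the implicit barrier here waits for all outstanding tasks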

As you might have noticed from the fragment of code above, while running these two parallel tasks we perform an additional check of whether the leftmost or rightmost part of the array actually requires sorting. Specifically, we verify that the size of the part to be sorted is not zero and that data items were actually exchanged while performing the sorting. If so, we execute the sorter routine that performs the actual 3-way quicksort of the specific part of the array; otherwise, we'd be performing an empty task.

In this case, we use the #pragma omp task directive construct with two additional clauses, untied and mergeable. The first clause specifies that a task is not tied to the thread that first starts executing it, so a suspended task may be resumed by any thread in the team. The second clause specifies that the task's data environment may be merged with that of its generating task by OpenMP's built-in task scheduling mechanism.

However, executing parallel tasks that sort arrays with a typically small number of data items causes a significant overhead during parallel task scheduling. That's why we perform an additional check of whether the size of the array to be sorted exceeds the cut-off boundary. If it doesn't, we merge both concurrent tasks into a single task that makes the two recursive calls to the sorter routine sequentially while itself running in parallel. This is done by using slightly different OpenMP parallel directive constructs, #pragma omp parallel and #pragma omp single nowait, thoroughly discussed below:

if (_Size >= internal::cutoff_high)
{
	#pragma omp task untied mergeable
	if (std::distance(_First, _LeftIt) > 0 && is_swapped_left)
	    internal::_qs3w(_First, _LeftIt, compare);

	#pragma omp task untied mergeable
	if (std::distance(_RightIt, _Last) > 0 && is_swapped_right)
	    internal::_qs3w(_RightIt, _Last, compare);
}

else
{
	#pragma omp parallel num_threads(6)
	#pragma omp single nowait
	{
	    if (std::distance(_First, _LeftIt) > 0 && is_swapped_left)
		internal::_qs3w(_First, _LeftIt, compare);
	    if (std::distance(_RightIt, _Last) > 0 && is_swapped_right)
		internal::_qs3w(_RightIt, _Last, compare);
	}
}    

As you might have noticed from the code fragment above, we don't use any specific thread synchronization mechanism, such as #pragma omp taskwait, while running the recursive calls in parallel, since the parallel execution of the sorter routine does not incur any data-flow dependency or race condition issues. Also, we use OpenMP's parallel tasks in favour of #pragma omp parallel sections, which creates a significant overhead during its execution and requires an extra barrier synchronization.

Optimizing Introspective Sort Performance Using OpenMP's Tasks

Now that we've optimized the performance of the 3-way quicksort by implementing parallel recursion using OpenMP's task directive constructs, it's time to discuss how to use the same parallel tasks mechanism to optimize the performance of the code implementing the introspective sort algorithm introduced in the first article in the series.

As we've mentioned above, using parallel tasks to deliver modern code that properly scales the introspective sort across multiple CPU cores is quite similar to implementing the parallel recursion. Recall that, prior to running the parallel tasks that do the actual sorting, we check whether the number of data items in the array to be sorted exceeds a specific cut-off boundary. As we've already discussed, this is done to eliminate the case in which parallel task execution involves additional overhead spent on OpenMP task scheduling and reduces the effective memory bandwidth. If the number of data items is typically small, we have to join (or simply merge) the two concurrent parallel tasks into a single task, which, in turn, is executed in parallel.

To sort arrays with fewer data items than the value of the cut-off boundary, we use #pragma omp parallel followed by either the #pragma omp single nowait or the #pragma omp master directive construct. The first directive construct causes the fragment of code that makes the recursive calls to the sorter routine to be executed by a team of CPU hardware threads as a parallel region. The second directive construct is used to execute the recursive calls by only one hardware thread: either any one of the active CPU threads (single) or the master thread that forks the execution of all the other threads (master). By using the nowait clause, we specify that there's no barrier synchronization of the threads being executed:

#pragma omp parallel num_threads(6)
#pragma omp single nowait
{
	if (std::distance(p.first, _Last) > 0)
		internal::_qs3w(p.first, _Last, compare);
	if (std::distance(_First, p.second) > 0)
		internal::_qs3w(_First, p.second, compare);
}    

As you might have noticed from the code snippet above, the OpenMP parallel tasks that we use to deliver the code implementing the fast sorting algorithm have a number of nested levels. In this case, a parallel task performs an invocation of the sorter routine, which in turn executes one or more concurrent parallel tasks.

Obviously, when using #pragma omp parallel and #pragma omp single nowait combined with a #pragma omp task directive construct at the next nested level, we might want to either use the untied and mergeable clauses or subscribe a smaller number of hardware threads, to reduce the overhead spent on task scheduling and avoid the case in which lots of threads performing empty tasks are scheduled. In particular, those two clauses are used to make sure that the parallel tasks executed at the next nested level are not tied to the threads executing the specific parent tasks. Fig. 2 illustrates the process of nested parallel task execution:
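
Note that, for the tasks spawned at a nested level to actually run in parallel, nested parallelism must be enabled in the OpenMP runtime. The following is a minimal sketch of a one-time configuration performed before invoking the sorter; the calls are standard OpenMP runtime functions, while bounding the nesting depth to two levels is a tuning assumption:

#include <omp.h>

void configure_openmp_nesting()
{
	omp_set_nested(1);             // allow nested parallel regions
	omp_set_max_active_levels(2);  // but bound the nesting depth
	omp_set_dynamic(0);            // keep team sizes deterministic
}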

In the next paragraph, we'll thoroughly discuss how to deliver the parallel code that implements the forefront fast sorting algorithm by using OpenMP's parallel worksharing construct to properly scale the sorting of a partitioned array across multiple CPU cores.

Using OpenMP's Parallel Worksharing Directive Construct

As we've already discussed in the first article in the series, the performance speed-up of the sorting algorithm being introduced significantly degrades when sorting arrays with lots of duplicate values. To improve the performance of the sorting process, we'll use a slightly different optimization rather than solely OpenMP's parallel tasks.

Specifically, we'll use OpenMP's parallel worksharing directive construct to sort particular chunks of the array in parallel by performing the introspective sort. To do this, we'll use a parallel optimization similar to the one commonly used to share the execution of the iterations of a sequential loop between multiple threads in a team.

The first thing we have to do is use the #pragma omp parallel directive construct to create a parallel region of code executed by each of the hardware threads in a team. In the parallel region we've declared, we'll invoke the introspective sort routine that, in turn, performs the actual sorting of a specific chunk of the array. In the parallel region executed by each particular thread, we compute the starting and ending positions of the chunk to be sorted. This can be done by using the following expressions inside the parallel region:

volatile int n_threads = omp_get_max_threads();
volatile int thread_id = omp_get_thread_num();

std::size_t start_pos = thread_id * chunk_size / n_threads;
std::size_t end_pos = (thread_id + 1) * chunk_size / n_threads;

According to the snippet of code listed above, we first obtain the number of CPU hardware threads by using the omp_get_max_threads() OpenMP runtime function. Then, to compute the starting position of a specific chunk, we divide the potential size of the array by the number of hardware threads to obtain the grain size of each chunk, and multiply that value by the index of the current thread obtained by calling the omp_get_thread_num() function. Similarly, we compute the ending position of the chunk to be sorted: we use the same expression, with the only difference being that we multiply the chunk grain size by the index of the next thread in the team:

misc::partitioner p(std::make_pair(0, _Size), \
	omp_get_max_threads());

#pragma omp parallel num_threads(6)
{
	volatile int tid = omp_get_thread_num();
	internal::intro_sort(_First + p[tid].first(),
		_First + p[tid].second(), compare);
}

#pragma omp parallel num_threads(6)
#pragma omp master
	internal::intro_sort(_First, _Last - 1, compare);

As you can see from the code snippet above, during the development cycle we purposely wrapped the code that computes a chunk's starting and ending positions into the class misc::partitioner, which we'll discuss further in the "Using the code" section of this article. Finally, after each chunk has been pre-sorted in parallel by a team of CPU hardware threads, we launch the execution of a parallel task that invokes the same introspective sort routine to sort the entire array.
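
The misc::partitioner class itself ships with the article's downloadable sources rather than being listed here. The following minimal reconstruction is an assumption, based solely on the start/end position arithmetic shown above and on how p[tid].first() and p[tid].second() are used:

#include <cstddef>
#include <utility>
#include <vector>

namespace misc
{
	struct interval
	{
		std::size_t lo, hi;
		std::size_t first()  const { return lo; }
		std::size_t second() const { return hi; }
	};

	class partitioner
	{
		std::vector<interval> m_chunks;
	public:
		// One sub-interval per thread, computed exactly like the
		// start_pos / end_pos expressions shown earlier
		partitioner(std::pair<std::size_t, std::size_t> range, int n_threads)
		{
			std::size_t size = range.second - range.first;
			for (int tid = 0; tid < n_threads; tid++)
				m_chunks.push_back({ range.first + tid * size / n_threads,
					range.first + (tid + 1) * size / n_threads });
		}

		const interval& operator[](std::size_t tid) const
		{ return m_chunks[tid]; }
	};
}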

The main idea of worksharing the introspective sorting process is that each thread in the team performs the introspective sort by calling the specific sorter routine for its own chunk, which, in turn, performs the actual sorting if and only if that chunk of data items is unsorted. This allows us to significantly accelerate the performance of the entire sorting process.

Also, according to the nature of the algorithm itself, which we've already discussed in the previous article, an array with lots of duplicate data items is sorted faster by the 3-way quicksort algorithm once particular parts of the array have already been sorted. That's actually why, at this point, we pre-sort each chunk of the array in parallel and then, finally, perform the same introspective sort on an array in which particular parts are already sorted.

Parallel Auxiliary Sorting

As we've already discussed in the first article, along with the improved 3-way quicksort and introspective sort algorithms, we use other, less efficient sorting algorithms for either auxiliary pre-sorting or small-array sorting purposes. To sort arrays whose size does not exceed 10^2 data items, we normally use the insertion sort algorithm. Likewise, to perform an efficient array pre-sort, we use a variant of the cocktail shaker sort algorithm.

For performance optimization purposes, it is most appropriate to run the routines that perform both the insertion and cocktail shaker sorts as parallel tasks, similarly to what we've already discussed before.

Cocktail Shaker Sort

As we've already discussed, we pre-sort the entire array by using the cocktail shaker sort, which efficiently sorts arrays in which all data items are arranged in reverse order. The cocktail shaker sort is performed prior to the introspective sort at the forefront. Naturally, we run the cocktail shaker sort as a parallel task:

#pragma omp task untied mergeable
	internal::shaker_sort(_First, _Last - 1, compare);

#pragma omp taskwait

Unlike in all the other cases, in this particular case we turn on barrier synchronization for this parallel task to ensure that the cocktail shaker pre-sort has completed before the introspective sort is launched. To provide the parallel task synchronization, we use the #pragma omp taskwait directive construct.

Adjacent Sort

Another algorithm that we use for pre-sorting purposes is the adjacent sort algorithm, a specific variant of the trivial bubble sort commonly used for pre-sorting. Unlike the other trivial algorithms, the code implementing the adjacent sort is easily parallelized by using OpenMP's #pragma omp parallel for tight loop parallelization directive construct, which allows each particular loop iteration to be executed in its own thread. Since the adjacent sort does not incur data-flow dependency or race condition issues, we can use tight loop parallelization without any synchronization mechanisms, as shown in the fragment of code below:

#pragma omp parallel for
for (auto _FwdIt = _First; _FwdIt != _Last - 1; _FwdIt++)
{
	if (!compare(*_FwdIt, *(_FwdIt + 1)) && *_FwdIt != *(_FwdIt + 1))
		std::iter_swap(_FwdIt, _FwdIt + 1);
}

Insertion Sort

According to the fast sorting algorithm we introduced in the previous article, the entire sorting process can be less fast and efficient when sorting arrays with a typically small number of data items. To sort arrays whose size is less than 10^2 data items, it is recommended to use a different sorting algorithm, such as the conventional insertion sort.

Unfortunately, due to the nature of the insertion sort algorithm, we cannot use any of the parallel optimizations to improve its performance: the insertion sort cannot be performed in parallel due to a number of persistent data-flow dependency and race condition issues. All we can do in this particular case is improve the effective memory bandwidth by unrolling the specific loops that perform the actual insertion sort:

while (_LeftIt <= _RightIt)
{
    do_insertion(_First, _LeftIt - 1, _RightIt, compare); _LeftIt++;
    do_insertion(_First, _LeftIt - 1, _RightIt, compare); _LeftIt++;
    do_insertion(_First, _LeftIt - 1, _RightIt, compare); _LeftIt++;
    do_insertion(_First, _LeftIt - 1, _RightIt, compare); _LeftIt++;
}

As we can see from the code snippet above, during each iteration of the main loop we sequentially execute the do_insertion function, which performs the actual insertion, four times in a row, so the loop condition is evaluated only once per four insertions. This is typically called "loop unrolling".
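
For comparison, the rolled form of the same loop is shown below. With the Intel C++ Compiler, a similar effect can often be requested declaratively with #pragma unroll instead of unrolling by hand; whether the compiler honors the request depends on the loop shape, so treat this as an alternative sketch rather than the article's code:

// Rolled equivalent of the loop above, with a requested unroll factor of 4
#pragma unroll(4)
for (; _LeftIt <= _RightIt; _LeftIt++)
	do_insertion(_First, _LeftIt - 1, _RightIt, compare);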

Conclusion

However, there are fragments of code that cannot be perfectly parallelized by using OpenMP's directive constructs. For example, the code that performs the array partitioning according to the classical quicksort algorithm, executed as part of the introspective sort, cannot be perfectly parallelized using specific OpenMP directive constructs due to a number of data-flow dependency and race condition issues. That's why we don't apply any parallel optimization to this particular code, which is intended to be executed sequentially. Likewise, the code that implements the median-of-three and median-of-nine algorithms cannot be parallelized, as a result of the same data-flow dependency and race condition issues.

Parallel Sorting Algorithm Performance Assessment

In this section, we'll discuss the performance of the modern parallel code that implements the fast parallel sorting algorithm. Before we begin, let's spend a moment and take a closer look at the computational complexity of the algorithm. As we've already discussed a number of times, it belongs to the class of fast sorting algorithms and performs the actual sorting based on the improved 3-way quicksort. Unlike the conventional quicksort proposed by C.A.R. (Tony) Hoare, the worst-case complexity of this algorithm is estimated as O(N x log2(N)), while the best case, for example for arrays consisting largely of duplicate items, approaches O(N). Compared to the classical quicksort, this algorithm does not degrade when sorting either arrays with lots of duplicates or arrays in which the initial order of data items forms an interleaved sequence.

To demonstrate the performance of the parallel code implementing the new fast sorting algorithm, we've prepared a number of aggressive tests to evaluate the efficiency and reliability of the sorting process, as well as to compare its performance speed-up with that of the std::sort C++ template function, which performs sequential sorting.

Since we've delivered the modern parallel code implementing the fast sorting algorithm, let's measure its actual performance speed-up by using Intel® VTune™ Amplifier XE. By running the parallel code under this performance assessment tool, we obtained the following results:

As we can see from the chart above, the effective time spent on parallel sort execution is almost 80% of the entire CPU time, which indicates the efficiency of the parallel optimizations applied during the legacy sequential code modernization, while the spin time, during which the threads are synchronized, is just 9.479 secs. However, the parallel code being executed does produce a noticeable overhead, spent mostly on thread scheduling. As we might have noticed from the OpenMP analysis, the parallel region time is about 91.5% and the time spent on sequential code execution is just 9.026% of the entire execution time. This, in turn, is a very good result, since the entire sorting process is properly scaled across all CPU hardware threads.

The CPU usage histogram shows that the simultaneous utilization of the logical CPUs is, in this particular case, rated Ok or Ideal rather than Idle or Poor. This is further evidence that the parallel sorting process scales properly. Similar results can be observed by analyzing the OpenMP Region CPU Usage histogram below:

The overall elapsed time estimated by the Intel® VTune™ Amplifier XE tool for the parallel code that performs the actual sorting is 16.183 secs, which is a pretty good result, indicating that the parallel optimizations applied provide a sufficient performance speed-up of the parallel code compared to its sequential execution.
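
For reference, a comparable hotspot collection can be launched from the command line with the 2017-era VTune Amplifier XE driver; the executable name of the sort program below is an assumption:

amplxe-cl -collect hotspots -- ./parallel_sort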

Using the code

In this section, we'll discuss the implementation of the fast parallel sorting algorithm using C++11 and the OpenMP performance library. The first routine whose code we're going to analyze is internal::parallel_sort, which performs the actual forefront sorting:

	template<class BidirIt, class _Pred>
	void parallel_sort(BidirIt _First, BidirIt _Last, _Pred compare)
	{
		std::size_t pos = 0L; g_depth = 0L;
		// Compute the potential size of the array to be sorted
		std::size_t _Size = std::distance(_First, _Last);

		// Perform the parallel cocktail shaker sort
		#pragma omp task untied mergeable
			internal::shaker_sort(_First, _Last - 1, compare);

		// Synchronize threads until the parallel task has completed its execution
		#pragma omp taskwait
		
		// Let's give a first chance check if the array has already been sorted and
		// obtain the position of the first unsorted data item
		if (!misc::sorted(_First, _Last, pos, compare))
		{
			BidirIt _LeftIt = _First, _RightIt = _First + pos;

			// Compute the radix MSD position for the size value
			std::size_t _MaxSizeRadix  = std::log10((double)_Size);
			// Compute the radix MSD position for the maximum value
			std::size_t _MaxValueRadix = std::log10((double)*_RightIt);

			// Perform a check if the MSD position of the size is greater
			if (_MaxSizeRadix > _MaxValueRadix)
			{
				// If so, perform a check if the value of the first data item is
				// equal to the value of the first unsorted item that occurs in the array
				// (e.g. the array is an interleave sequence)
				if (*_LeftIt == *(_RightIt + 1))
				{
					// Perform the parallel task that executes 
					// the 3-way quicksort routine at the backend
					#pragma omp parallel num_threads(12)
					#pragma omp master
						internal::_qs3w(_First, _Last - 1, compare);
				
					// Terminate the process of sorting.
					return;
				}
			}

			// Otherwise, if the array is not an interleave sequence
			// execute a loop to sort the array by performing the introspective sort
			do
			{
				// Perform the pre-sorting of the array by using adjacent sort
				internal::adjacent_sort(_First, _Last, compare);
				// Perform the actual sorting by launching the introspective sort
				internal::parallel_sort1(_First, _Last, compare);
			  // Perform the last chance check if the array has already been sorted
			  // If not, proceed with the process of sorting over again until the entire array is sorted
			} while (!misc::sorted(_First, _Last, pos, compare));
		}
	}    

During this function's execution, we first compute the actual size of the entire array to be sorted. For that purpose, we use the std::distance C++ template function, which finds the distance between two iterators that point to specific data items in the array to be sorted. Once we've computed the size of the array, we launch the execution of the parallel task that performs the cocktail shaker sort, whose code is discussed below, in case the array is sorted in reverse order. At this point, we provide barrier synchronization for this particular task so that the rest of the sorting is performed only after the parallel task completes. After that, we perform a first-chance check of whether the array has already been sorted; if so, we terminate the sorting process. We also obtain the position of the first unsorted item in the array.

Otherwise, we compute the MSD radix positions of both the size value and the value of the first unsorted data item to determine whether the array is an interleaved sequence. To do this, we use the std::log10 function, then check whether the MSD position of the size value is the greater of the two and whether the first data item is equal to the value of the item that follows the first unsorted one. If both conditions hold, the array is treated as an interleaved sequence.

If the array is an interleaved sequence, we launch a parallel task that invokes the internal::_qs3w function, which performs the 3-way quicksort of the array at the backend, and after its execution we terminate the sorting process.

In the case when the array is not an interleaved sequence, we delegate execution to the regular introspective sort. To do this, we implement a loop and proceed with the sorting process iteratively until we've obtained a sorted array of items. During each iteration of this loop, we perform a pre-sort by calling the internal::adjacent_sort function. Once the array has been pre-sorted, we perform the next stage of the forefront sorting by calling the internal::parallel_sort1 function:

	template<class BidirIt, class _Pred>
	void parallel_sort1(BidirIt _First, BidirIt _Last, _Pred compare)
	{
		// Compute the potential size of the array to be sorted
		std::size_t _Size = std::distance(_First, _Last);
		// Find the value of the maximum data item in the array
		typename std::iterator_traits<BidirIt>::value_type \
			_MaxValue = *std::max_element(_First, _Last);

		// Compute the radix MSD position for the size value
		std::size_t _MaxSizeRadix  = std::log10((double)_Size);
		// Compute the radix MSD position for the maximum value
		std::size_t _MaxValueRadix = std::log10((double)_MaxValue);

		// Perform a check if the MSD position of the size is greater
		// (e.g. the array is an interleave sequence)
		if (_MaxSizeRadix > _MaxValueRadix)
		{
			// Partition the entire array into chunks of a fixed size
			// based on finding sub-intervals for each particular chunk
			misc::partitioner p(std::make_pair(0, _Size), \
				omp_get_max_threads());

			// Execute a parallel region in which the introspective sorter
			// function is invoked for each particular chunks to be sorted
			#pragma omp parallel num_threads(12)
			{
				volatile int tid = omp_get_thread_num();
				internal::intro_sort(_First + p[tid].first(),
					_First + p[tid].second(), compare);
			}

			// Perform a parallel task to perform a final sort that
			// will arrange the entire array into an ordered sequence
			#pragma omp parallel num_threads(12)
			#pragma omp master
				internal::intro_sort(_First, _Last - 1, compare);
		}
		
		else
		{
			// Otherwise, launch a parallel task to perform 
			// an introspective sort of the entire array
			#pragma omp parallel num_threads(12)
			#pragma omp master
				internal::intro_sort(_First, _Last - 1, compare);
		}
	}

During this function's execution, we first compute the size of the array being sorted and obtain the value of the maximum data item. Then we check whether the array is an interleaved sequence. If so, we partition the array into chunks of fixed size and execute a parallel region in which each particular chunk is sorted in its own thread in the team. This is called "parallel worksharing". After that, we perform a regular introspective sort by calling the internal::intro_sort function to finalize the sorting process.

In the case when the array is not an interleaved sequence, we perform the introspective sort of the entire array by launching a specific parallel task:

	template<class BidirIt, class _Pred>
	void intro_sort(BidirIt _First, BidirIt _Last, _Pred compare)
	{
		// Check if the array size is not zero
		if (_First >= _Last) return;

		std::size_t pos = 0L; g_depth++;
		// Perform a check if the array has already been sorted.
		// If so terminate the process of sorting
		if (misc::sorted(_First, _Last, pos, compare)) return;

		std::size_t _Size = 0L;
		// Compute the actual size of the array and perform a check
		// if the size does not exceed a lower cutting off boundary
		if ((_Size = std::distance(_First, _Last)) > internal::cutoff_low)
		{
			BidirIt _LeftIt = _First, _RightIt = _Last;

			// If so, partition the array by using Hoare's quicksort partitioning
			std::pair<BidirIt, BidirIt> p \
				= internal::partition(_First, _Last, compare);

			// Perform a check if the size of the array does not 
			// exceed the higher cutting off boundary
			if (_Size > internal::cutoff_high)
			{
				// If not, launch the first parallel task 
				// that will perform the improved 3-way quicksort at the backend
				#pragma omp task untied mergeable
				// Perform a check if the size of the partition is not zero and
				// we're not performing an empty null-task
				if (std::distance(p.first, _Last) > 0)
					internal::_qs3w(p.first, _Last, compare);

				// If not, launch the second parallel task 
				// that will perform the improved 3-way quicksort at the backend
				#pragma omp task untied mergeable
				// Perform a check if the size of the partition is not zero and
				// we're not performing an empty null-task
				if (std::distance(_First, p.second) > 0)
					internal::_qs3w(_First, p.second, compare);
			}

			else
			{
				// Otherwise, merge the two concurrent parallel tasks into a single parallel task
				// in which calls to the sorter routine are performed sequentially
				#pragma omp parallel num_threads(12)
				#pragma omp single nowait
				{
					if (std::distance(p.first, _Last) > 0)
						internal::_qs3w(p.first, _Last, compare);
					if (std::distance(_First, p.second) > 0)
						internal::_qs3w(_First, p.second, compare);
				}
			}
		}

		else
		{
			// If the size of the array is less than 10^2, 
			// perform a regular insertion sort launched 
			// during the parallel task execution
			#pragma omp parallel num_threads(12)
			#pragma omp single nowait
			{
				if (std::distance(_First, _Last) > 0)
					internal::insertion_sort(_First, _Last + 1, compare);
			}
		}
	}

During the execution of this function, we first check whether the array has already been sorted. Then we perform another check of whether the size of the array exceeds the lower cut-off boundary. If not, we perform a regular insertion sort by launching a specific parallel task that invokes the internal::insertion_sort routine. Otherwise, we partition the array by calling the internal::partition function discussed below. This function returns a pair of iterators to the leftmost and rightmost partitions of the array being sorted. After that, we sort both the leftmost and rightmost partitions in parallel by launching two concurrent OpenMP tasks that perform the improved 3-way quicksort of the specific partitions. Before calling the internal::_qs3w function during each task's execution, we check that the partition's size is not zero, so that we don't execute an empty task.

	template<class BidirIt, class _Pred>
	inline std::pair<BidirIt, BidirIt> partition(BidirIt _First, \
		BidirIt _Last, _Pred compare)
	{
		std::size_t _Size = 0L;
		// Compute the actual size of the array. If the size is equal to 1
		// then perform a tiny sort by exchanging the adjacent data items
		if ((_Size = std::distance(_First, _Last)) == 1)
		{
			if (!compare(*_First, *_Last))
				std::iter_swap(_First, _Last);

			// Return the pair of pointers to the two subsequent
			// partitions in the array to be sorted
			return std::make_pair(_First, _Last);
		}

		if (_Size == 2)
		{
			// If the array size is equal to 2, perform a tiny sort by sorting three values.
			internal::sort3v(_First, _First + 1, _Last, compare);
			// Return the pair of pointers to the two subsequent
			// partitions in the array to be sorted
			return std::make_pair(_First, _Last);
		}

		// Compute the middle of the array to be sorted
		BidirIt _LeftIt = _First, _RightIt = _Last, \
			_MidIt = _LeftIt + _Size / 2;

		// Perform a tiny sort of three values 
		// at the beginning, middle and the end of the array
		internal::sort3v(_LeftIt, _MidIt, _RightIt, compare);
		// Perform a tiny sort of three values at the middle of the array
		internal::sort3v(_MidIt - 1, _MidIt, _MidIt + 1, compare);

		// Compute the median by using median-of-nine algorithm
		BidirIt _Median = internal::med9v(_LeftIt, _MidIt, _RightIt);
		// Obtain the value of pivot based on the value of median
		typename std::iterator_traits<BidirIt>::value_type _Pivot = *_Median;

		// Perform a check if the first data item is equal to the value of median
		if (*_First == _Pivot)
			// If so, swap it with the middle data item
			std::iter_swap(_First, _MidIt);

		// Perform a check if the last data item is equal to the value of median
		if (*_Last == _Pivot)
			// If so, swap it with the middle data item
			std::iter_swap(_Last, _MidIt);

		// Perform partitioning iteratively until we've re-arranged the array to be sorted
		// and obtained the pointers to the leftmost and rightmost partitions.
		while (_LeftIt <= _RightIt)
		{
			// Iterate through the array from left and for each data item
			// perform a check if it's greater than the value of pivot
			for (; _LeftIt <= _Last; _LeftIt++)
			{
				// Perform a check if the value of the current 
				// data item is greater than the value of pivot or is equal to it 
				if (compare(_Pivot, *_LeftIt) || _Pivot == *_LeftIt)
				{
					// If so, perform the iteration through the array from the right
					// until we've reached the data item which is already found from left
					// For each item perform a check if it's not less than the value of pivot
					for (; _RightIt >= _LeftIt; _RightIt--)
					{
						// If the current item's value is less than the value of pivot or 
						// at least equal to it, exchange the specific data item being found
						if (compare(*_RightIt, _Pivot) || _Pivot == *_RightIt)
						{
							// Perform a check if the value of the pointer obtained from left
							// is less than or equal to the value of pointer from the right
							if (_LeftIt <= _RightIt)
							{
								// If so, exchange the specific data items
								std::iter_swap(_LeftIt, _RightIt);
								// Increment the pointer to the left item and 
								// decrement the pointer to the right item
								_LeftIt++; _RightIt--;
							}

							// Terminate the loop execution
							break;
						}
					}

					// Terminate the loop execution
					break;
				}
			}
		}

		// Return the pair of pointers to the two subsequent
		// partitions in the array to be sorted
		return std::make_pair(_LeftIt, _RightIt);
	}

While executing the partitioning routine whose code is listed above, we check whether the array size is equal to 1. If so, we perform a very small sort by swapping the two data items into the correct order. Then we perform a similar check of whether the size of the array is equal to 2 and, if so, perform a tiny sort of three values.

If the array to be sorted contains more than two data items, we perform the actual partitioning. First, we compute the value of the median by using the median-of-nine algorithm, then take the value obtained as the pivot used when performing the quicksort partitioning.

Then, we iterate through the array from the left and, for each particular item, check whether its value is greater than or equal to the pivot. If so, we proceed with another iteration from the right and, for each item, check whether its value is less than or equal to the pivot, until we've reached the data item previously found from the left. If the value of the current item we've stepped onto from the right is less than the pivot, we exchange these two data items. This process proceeds until the array contains all items with values less than the pivot at the left and all items whose values are greater than the pivot at the right.

template<class RanIt, class _Pred>
    void _qs3w(RanIt _First, RanIt _Last, _Pred compare)
{
    // Check if the array size is not zero
    if (_First >= _Last) return;

    // Perform a check if the size of the array is equal to 1
    if (std::distance(_First, _Last) == 1)
    {
        // If so, check if the value of the first item is greater
        // than the value of the last item. If so, exchange these both items
        if (*_First > *_Last)
            std::iter_swap(_First, _Last);

        // Terminate the process of sorting
        return;
    }

    // Compute the size of the array to be sorted
    std::size_t _Size = 0L; g_depth++;
    if ((_Size = std::distance(_First, _Last)) > 0)
    {
        // Compute the middle of the array to be sorted
        RanIt _LeftIt = _First, _RightIt = _Last,
            _MidIt = _First + (_Last - _First) / 2;

        bool is_swapped_left = false, is_swapped_right = false;
        // Compute the value of median by using median-of-nine algorithm
        RanIt _Median = internal::med9v(_LeftIt, _MidIt, _RightIt);
        // Obtain the value of pivot equal to the value of median
        typename std::iterator_traits<RanIt>::value_type _Pivot = *_Median;

        // Iterate through the array to be sorted and for each item
        // perform a check if it's less than the value of pivot
        for (RanIt _FwdIt = _LeftIt; _FwdIt <= _RightIt; _FwdIt++)
        {
            // Check if the value of the item is less than the value of pivot
            if (compare(*_FwdIt, _Pivot))
            {
                // If so, exchange the current data item with the next item from left
                is_swapped_left = true;
                std::iter_swap(_FwdIt, _LeftIt);
                // Increment the value of pointer to the succeeding data item from left
                _LeftIt++;
            }

            // Check if the value of the item is greater than the value of pivot
            if (compare(_Pivot, *_FwdIt))
            {
                is_swapped_right = true;
                // If so, exchange the current data item with the next item from right
                std::iter_swap(_FwdIt, _RightIt);
                // Decrement the value of pointer to the succeeding data item from right
                // and the pointer to the current data item in the array
                _RightIt--; _FwdIt--;
            }
        }

        // Perform a check if the size is greater than
        // the value of the higher cutting off boundary
        if (_Size >= internal::cutoff_high)
        {
            // If so, launch parallel tasks to sort
            // the either leftmost or rightmost part of the array
            #pragma omp task untied mergeable
            if (std::distance(_First, _LeftIt) > 0 && is_swapped_left)
                internal::_qs3w(_First, _LeftIt, compare);

            #pragma omp task untied mergeable
            if (std::distance(_RightIt, _Last) > 0 && is_swapped_right)
                internal::_qs3w(_RightIt, _Last, compare);
        }

        else
        {
            // Otherwise, merge the two concurrent parallel tasks into a single parallel task
            // in which the leftmost and rightmost parts are sorted sequentially
            #pragma omp parallel num_threads(12)
            #pragma omp single nowait
            {
                if (std::distance(_First, _LeftIt) > 0 && is_swapped_left)
                    internal::_qs3w(_First, _LeftIt, compare);
                if (std::distance(_RightIt, _Last) > 0 && is_swapped_right)
                    internal::_qs3w(_RightIt, _Last, compare);
            }
        }
    }
}

Now it's time to discuss the internal::_qs3w function, which implements the improved 3-way quicksort algorithm. After performing a series of checks to determine the initial order of the array to be sorted, we launch the backend sorting process using the improved 3-way quicksort algorithm. To do this, we similarly find the pivot value by using the median-of-nine algorithm. Then, we iterate through the array and, for each particular item, check whether its value is less than the pivot. If so, we exchange that data item with the next item from the left, and if the data item is greater than the pivot, we respectively exchange it with the next item from the right.

At the end of the 3-way quicksort pass, we've obtained the two iterators to the leftmost and rightmost partitions of the array. Then we perform a parallel recursive sort by calling the same sorter routine during each parallel task's execution. We proceed recursively with this process until we've obtained a sorted array of items.

Besides performing the effective introspective sort and the improved 3-way quicksort, we also use other algorithms for small-array and auxiliary sorting. The first such algorithm is the insertion sort, which we use to sort small arrays of items:

	template<class RandomIt, class _Pred>
	void insertion_sort(RandomIt _First, RandomIt _Last, _Pred compare)
	{
		RandomIt _LeftIt = _First + 1, _RightIt = _Last;

		// Perform the insertion sort by iterating through the array
		while (_LeftIt <= _RightIt)
		{
			// For each data item make the number of calls 
			// to the function that performs the actual sorting
			do_insertion(_First, _LeftIt - 1, _RightIt, compare); _LeftIt++;
			do_insertion(_First, _LeftIt - 1, _RightIt, compare); _LeftIt++;
			do_insertion(_First, _LeftIt - 1, _RightIt, compare); _LeftIt++;
			do_insertion(_First, _LeftIt - 1, _RightIt, compare); _LeftIt++;
		}
	}

As we've already discussed, we cannot normally parallelize the execution of the insertion sort. That's why, to improve the performance, we use a slightly different approach known as "loop unrolling". In this case, we unroll the main loop by a factor of four, sequentially calling the do_insertion function that performs the actual sorting. This, in turn, provides a performance speed-up of the insertion sort process:

	template<class BidirIt, class _Pred>
	void do_insertion(BidirIt _First, BidirIt _RevIt, BidirIt _Last, _Pred compare)
	{
		typename std::iterator_traits<BidirIt>::\
			value_type _Value = *(_RevIt + 1);

		// Iterate through the array from the current 
		// position downto the position of the first data item
	    // For each data item in the subset of previous data items
		// perform a check if it's greater or equal to the specific 
		// value of argument
		while (compare(_Value, *_RevIt))
		{
			// If so, do the insertion of the 
			// value of argument to the specific position
			*(_RevIt + 1) = *_RevIt; *_RevIt = _Value;
			if (_RevIt <= _First) break; _RevIt--;
		}
	}

During the do_insertion function's execution, we iterate through the array down to the first data item and, for each item, check whether its value is greater than the value of the specific argument. If so, we insert the argument's value at the specific position before the item whose value is greater.

Another kind of auxiliary sorting that we've implemented is the cocktail shaker sort, during which we scan the array from the left and right simultaneously and exchange pairs of data items from the left and right ends toward the middle. This is the most useful sorting method in the case when we need to sort arrays arranged in reverse order:

	template<class BidirIt, class _Pred>
	void shaker_sort(BidirIt _First, BidirIt _Last, _Pred compare)
	{
		BidirIt _LeftIt = _First, _RightIt = _Last, \
			_MidIt = _LeftIt + (_RightIt - _LeftIt) / 2;

		// Do the cocktail shaker sort by exchanging the data items from left and right
		while (!compare(*_LeftIt, *_RightIt) && _RightIt > _MidIt)
			std::iter_swap(_LeftIt++, _RightIt--);
	}

We've also implemented the adjacent sort algorithm to pre-sort the array before sorting it with the introspective sort algorithm discussed above:

	template<class BidirIt, class _Pred>
	void adjacent_sort(BidirIt _First, BidirIt _Last, _Pred compare)
	{
		// Iterate through the array of items in parallel
		#pragma omp parallel for
		for (auto _FwdIt = _First; _FwdIt != _Last - 1; _FwdIt++)
		{
			// For each item perform a check if the following item
			// is greater than the next adjacent item. If so, exchange these items
			if (!compare(*_FwdIt, *(_FwdIt + 1)) && *_FwdIt != *(_FwdIt + 1))
				std::iter_swap(_FwdIt, _FwdIt + 1);
		}
	}

As we've already discussed, we use the median-of-nine algorithm to find the optimal pivot value. The code below demonstrates the median-of-three and median-of-nine computation process:

	template<class BidirIt, class _Pred>
	void sort3v(BidirIt _First, BidirIt _Mid, BidirIt _Last, _Pred compare)
	{
		// If the last data item is less than the first, exchange these both items
		if (compare(*_Last, *_First))
			std::iter_swap(_First, _Last);
		// If the middle data item is less than the first, exchange these both items
		if (compare(*_Mid, *_First))
			std::iter_swap(_First, _Mid);
		// If the last data item is less than the middle, exchange these both items
		if (compare(*_Last, *_Mid))
			std::iter_swap(_Mid, _Last);
	}

	template<class BidirIt>
	BidirIt med3v(BidirIt _First, BidirIt _Mid, BidirIt _Last)
	{
		if ((*_First <= *_Mid && *_First >= *_Last) ||
			(*_First <= *_Last && *_First >= *_Mid))
			// Return the first item if its value lies between the other two
			return _First;
		else if ((*_Mid <= *_First && *_Mid >= *_Last) ||
			(*_Mid <= *_Last && *_Mid >= *_First))
			// Return the middle item if its value lies between the other two
			return _Mid;
		else if ((*_Last <= *_First && *_Last >= *_Mid) ||
			(*_Last <= *_Mid && *_Last >= *_First))
			// Return the last item if its value lies between the other two
			return _Last;

		return _First;
	}

	template<class BidirIt>
	BidirIt med9v(BidirIt _First, BidirIt _Mid, BidirIt _Last)
	{
		std::size_t _Size = 0L;
		std::vector<BidirIt> median; median.resize(9);

		// Compute the size of each fragment of the array
		if ((_Size = (std::distance(_First, _Last) / 9)) >= 3)
		{
			//#pragma  ivdep
			//#pragma  vector always
			// Collect an iterator to each of the nine evenly
			// spaced positions of the array into the temporary vector
			for (int index = 0; index < 9; index++)
			     median[index] = _First + _Size * index;

			// Find the median of the first fragment
			BidirIt _Med1 = med3v(median[0], median[1], median[2]);
			// Find the median of the second fragment
			BidirIt _Med2 = med3v(median[3], median[4], median[5]);
			// Find the median of the third fragment
			BidirIt _Med3 = med3v(median[6], median[7], median[8]);

			// Compute the final value of the median-of-nine
			return med3v(_Med1, _Med2, _Med3);
		}

		else {
			return med3v(_First, _Mid, _Last);
		}
	}    

Finally, the code below demonstrates the implementation of the main program that drives the sorting process:

#include "stdafx.h"

#include "parallel_sort.h"

#ifndef PARALLEL_STABLE_SORT_STL_CPP
#define PARALLEL_STABLE_SORT_STL_CPP

namespace parallel_sort_impl
{
#if defined( _WIN32 )
	static HANDLE hStdout = ::GetStdHandle(STD_OUTPUT_HANDLE);
	const WORD wdRed = FOREGROUND_RED | FOREGROUND_INTENSITY;
	const WORD wdWhite = FOREGROUND_RED | FOREGROUND_GREEN | FOREGROUND_BLUE;
#endif

	void stress_test(void)
	{
		while (true)
		{
			std::size_t count = 0L;
			int sort_type = std::rand() % 5;
			std::vector<std::int64_t> array, array_copy;
			// Generate the values of data items in the array to be sorted
			misc::init(array, std::make_pair(std::pow(10, misc::minval_radix), \
				std::pow(10, misc::maxval_radix)), count, sort_type);

			// Resize the array and copy all items to the second array to be sorted in parallel
			array_copy.resize(array.size());
			std::copy(array.begin(), array.end(), array_copy.begin());

			// Obtain the value of walltime prior to performing the sequential sorting
			std::chrono::system_clock::time_point \
				time_s = std::chrono::system_clock::now();

			std::cout << "sorting an array... (sort type: " << misc::sorttype2string(sort_type) << ")\n";

			// Perform the sequental sort by using std::sort function
			std::sort(array.begin(), array.end(),
				[](std::int64_t first, std::int64_t end) { return first < end; });

			// Obtain the value of walltime after to performing the sequential sorting
			std::chrono::system_clock::time_point \
				time_f = std::chrono::system_clock::now();

			// Compute the overall execution walltime
			std::chrono::system_clock::duration \
				std_sort_time_elapsed = time_f - time_s;

			std::cout << std::setiosflags(std::ios::fixed) << std::setprecision(4)
				<< "array size = " << count << " execution time (std::sort): " << std_sort_time_elapsed.count() << " ms ";

			// Obtain the value of walltime prior to performing the parallel sorting
			time_s = std::chrono::system_clock::now();

			// Perform the parallel sorting
			internal::parallel_sort(array_copy.begin(), array_copy.end(),
				[](std::int64_t first, std::int64_t end) { return first < end; });

			// Obtain the value of walltime after to performing the parallel sorting
			time_f = std::chrono::system_clock::now();

			std::size_t position = 0L;
			// Compute the overall execution walltime spent for parallel sorting
			std::chrono::system_clock::duration \
				intro_sort_time_elapsed = time_f - time_s;

			// Perform a verification if the array is properly sorted
			bool is_sorted = misc::sorted(array_copy.begin(), array_copy.end(), position,
				[](std::int64_t first, std::int64_t end) { return first < end; });

			// Compute the actual performance gain as the difference of execution walltime values
			std::double_t time_diff = \
				std_sort_time_elapsed.count() - intro_sort_time_elapsed.count();

			#if defined( _WIN32 )
			::SetConsoleTextAttribute(hStdout, \
				(is_sorted == true) ? wdWhite : wdRed);
			#else
			if (is_sorted == false)
				std::cout << "\033[1;31m";
			#endif	

			std::cout << "<--> (internal::parallel_sort): " << intro_sort_time_elapsed.count() << " ms " << "\n";

			std::cout << "verification: ";

			if (is_sorted == false) {
				std::cout << "failed at pos: " << position << "\n";
				std::cin.get();
				// Print out the array if the sorting has failed
				misc::print_out(array_copy.begin() + position, array_copy.begin() + position + 10);
			}

			else {
				// Print out the statistics
				std::double_t ratio = intro_sort_time_elapsed.count() / \
					(std::double_t)std_sort_time_elapsed.count();
				std::cout << std::setiosflags(std::ios::fixed) << std::setprecision(2)
					<< "passed... [ time_diff: " << std::fabs(time_diff)
					<< " ms (" << "ratio: " << (ratio - 1.0) * 100 << "%) depth = "
					<< internal::g_depth << " ]" << "\n";
			}

			std::cout << "\n";

			#if !defined ( _WIN32 )
			if (is_sorted == false)
				std::cout << "\033[0m";
			#endif	
		}
	}

	void parallel_sort_demo(void)
	{
		std::size_t count = 0L;
		std::cout << "Enter the number of data items N = "; std::cin >> count;

		std::vector<std::int64_t> array;
		misc::init(array, std::make_pair(std::pow(10, misc::minval_radix), \
			std::pow(10, misc::maxval_radix)), count, 0);

		std::chrono::system_clock::time_point \
			time_s = std::chrono::system_clock::now();

		internal::parallel_sort(array.begin(), array.end(),
			[](std::int64_t first, std::int64_t end) { return first < end; });

		std::chrono::system_clock::time_point \
			time_f = std::chrono::system_clock::now();

		std::size_t position = 0L;
		std::chrono::system_clock::duration \
			intro_sort_time_elapsed = time_f - time_s;

		std::cout << "Execution Time: " << intro_sort_time_elapsed.count()
				  << " ms " << "depth = " << internal::g_depth << " ";

		bool is_sorted = misc::sorted(array.begin(), array.end(), position,
			[](std::int64_t first, std::int64_t end) { return first < end; });

		std::cout << "(verification: ";

		if (is_sorted == false) {
			std::cout << "failed at pos: " << position << "\n";
			std::cin.get();
			misc::print_out(array.begin() + position, array.begin() + position + 10);
		}

		else {
			std::cout << "passed...)" << "\n";
		}

		char option = '\0';
		std::cout << "Do you want to output the array [Y/N]?"; std::cin >> option;

		if (option == 'y' || option == 'Y')
			misc::print_out(array.begin(), array.end());
	}
}

int main()
{
	std::string logo = "Parallel Sort v.1.00 (C)CPOL by Arthur V. Ratz";
	std::cout << logo << "\n\n";

	char option = '\0';
	std::cout << "Do you want to run stress test first [Y/N]?"; std::cin >> option;
	std::cout << "\n";

	if (option == 'y' || option == 'Y')
		parallel_sort_impl::stress_test();
	if (option == 'n' || option == 'N')
		parallel_sort_impl::parallel_sort_demo();

	return 0;
}

#endif // PARALLEL_STABLE_SORT_STL_CPP
    

Points of Interest

The fast and efficient parallel sort introduced in this article could be made even faster by using other multithreading libraries and packages along with the OpenMP performance library discussed here. For example, we could also use Intel® Threading Building Blocks (TBB) to provide a performance speed-up of the code introduced in this article. To do this, we could develop a code wrapper based on the tbb::parallel_for template function, similar to the implementation of tbb::parallel_sort, and then use the code that performs the parallel fast sorting discussed in this article, instead of the std::sort C++ template function, to gain even more sorting performance.
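
As a sketch of that idea, the pair of OpenMP tasks that sort the leftmost and rightmost partitions could be expressed with TBB's tbb::parallel_invoke instead. This is an illustration only, assuming TBB is available; the wrapper function name is hypothetical, and it delegates to the article's internal::_qs3w routine:

#include <iterator>
#include <tbb/parallel_invoke.h>

template<class BidirIt, class _Pred>
void qs3w_partitions_tbb(BidirIt _First, BidirIt _LeftIt,
	BidirIt _RightIt, BidirIt _Last, _Pred compare)
{
	// Run both partition sorts concurrently on the TBB task scheduler
	tbb::parallel_invoke(
		[&] { if (std::distance(_First, _LeftIt) > 0)
			internal::_qs3w(_First, _LeftIt, compare); },
		[&] { if (std::distance(_RightIt, _Last) > 0)
			internal::_qs3w(_RightIt, _Last, compare); });
}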

History

  • November 19, 2017 - The final revision of this article has been published;

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)
