## Editorial Note

*This article is the first in a series of two, in which we'll introduce and thoroughly discuss a new fast sorting algorithm intended for sorting arrays with a typically huge number (10^7-10^10) of data items. Specifically, in this first article we'll formulate the new algorithm and discuss a number of algorithm-level optimizations that significantly speed up the sorting process compared to other existing fast sorting algorithms. We'll also discuss the computational complexity of the algorithm and explore its efficiency and reliability when sorting various datasets.*

*In the second article, we'll discuss how to use the Intel® C++ compiler and the OpenMP performance library to deliver C++11 code that implements a parallel version of the sorting algorithm whose serial version is introduced and formulated in this first article. Specifically, we'll find out how that code sample scales under Amdahl's law across multiple processor cores, leveraging modern symmetric Intel® CPUs with 64-bit multicore architecture.*

*In conclusion, we'll provide an analysis of the basic performance characteristics of the parallel code implementing the new fast sorting algorithm, using Intel® VTune Amplifier XE to collect and visualize the performance data.*

## Introduction

*"Sorting plays a major role in commercial data processing and in modern scientific computing. Indeed, a sorting algorithm was named as one of the top ten algorithms for science and engineering of the 20th century."*

*- Robert Sedgewick, Kevin Wayne, "Algorithms, 4th Edition"*

A *sorting algorithm* is an algorithm that arranges all data items of an array in a specific numerical or lexicographical order (see figure below). Sorting algorithms are typically used to reduce the complexity and improve the performance of other algorithms, as well as to canonicalize sorted data and produce human-readable output. In fact, sorting algorithms have many known applications in commercial data processing, as well as in modern scientific computing in general.

The most common applications of sorting algorithms include financial transaction processing, mathematical and economic statistics, linear algebra and functional optimization, data compression and visualization, cryptography, linguistic systems, timeline and task scheduling, logs and reports, genetic programming, AI data mining and neural networks, etc.

In fact, there are a number of reasons why we might want to use sorting as part of other, more complex algorithms. For example, the trivial statistical sub-problem of finding the top N students with the highest scores can be simplified by using a sorting algorithm. Sorting can also be used in data compression algorithms to find the sequences of characters that occur more or less frequently in the message being encoded. In genetic programming, we use sorting algorithms to obtain the chromosomes in the current population that represent the fittest candidate solutions. Moreover, sorting allows us to optimize retrieving the most up-to-date information from a database, such as the list of the latest remote transactions to a server ordered by date and time.
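As a small illustration of the first example above, here is a sketch in C++ (the function name and sample scores are hypothetical). Since only the first N positions need to be ordered, `std::partial_sort` is sufficient and avoids a full sort:

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <vector>

// Return the top-n scores in descending order. Only the first n positions
// of the array are ordered; the remaining items are left unordered, which
// is exactly the simplification sorting brings to this sub-problem.
std::vector<int> top_n_scores(std::vector<int> scores, std::size_t n) {
    n = std::min(n, scores.size());
    std::partial_sort(scores.begin(), scores.begin() + n, scores.end(),
                      std::greater<int>());
    scores.resize(n);
    return scores;
}
```

For instance, `top_n_scores({70, 95, 88, 60, 91}, 3)` yields `{95, 91, 88}`.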

Since the beginning of the computing era, a number of trivial sorting algorithms have been formulated. These include the famous bubble sort, as well as the selection and insertion sort algorithms, which perform exchange-based stable sorting. There is also a special class of fast and efficient sorting methods, such as the quicksort, merge sort, and heapsort algorithms, capable of sorting datasets with a typically huge number of items drastically faster than the elementary methods mentioned above.

On the one hand, many of those algorithms provide an efficient method for sorting arrays of data items of either a fundamental or an abstract data type (ADT). On the other hand, each of those algorithms can become inefficient due to a number of potential drawbacks that affect the overall performance of sorting arrays with a typically huge number of data items.

The sorting process itself remains one of the heaviest tasks performed by modern computers, normally consuming large amounts of processor and system memory resources. That's why sorting enormously huge amounts of data items can become impractical due to performance degradation caused solely by the nature and specifics of the sorting algorithm applied.

Specifically, the main purpose of this article is to introduce a new fast sorting algorithm that provides more efficiency and a better performance speed-up than some known implementations of existing sorting algorithms, such as the classical quicksort and heapsort.

The new fast sorting algorithm being discussed is mainly based on an improved introspective sort (introsort) that employs more than one sorting algorithm, such as the efficient 3-way quicksort or the trivial classical insertion sort, to perform the actual sorting. Unlike the conventional introspective sort proposed by David Musser in 1997, this algorithm relies entirely on observing the input array, determining its size and initial order in order to select the optimal backend algorithm (e.g., 3-way quicksort vs. insertion sort) for the actual sorting. Further in this article, we'll thoroughly discuss both of these sorting algorithms in detail. In addition, we'll also discuss particular aspects of other algorithms, such as the adjacent sort and cocktail shaker sort, used for auxiliary pre-sorting and the sorting of small arrays.

Furthermore, once the algorithm has been formulated, we'll spend a bit more time on a number of effective algorithm-level performance optimizations that drastically increase the speed of the sorting process, making the algorithm being introduced even faster (2x-6x) than a number of known implementations based on the fast quicksort and heapsort algorithms that, in turn, were formulated as part of industrial standards for the C and C++ programming languages.

In this article, we're aiming to achieve at least two primary goals:

- *Introduce a principally new fast sorting algorithm capable of providing a better performance speed-up than previously formulated fast sorting algorithms, such as the classical quicksort, merge sort, and heapsort;*
- *Reveal a number of flaws and drawbacks affecting the actual performance of mature, stable versions of existing sorting algorithms, such as the* `qsort` *and* `std::sort` *functions implemented as part of the C and C++ standard libraries.*

The known algorithm-level optimizations basically include the various schemes used by the introspective sort algorithm to partition the entire array into multiple fragments of data items to be sorted. We'll also explain the median-of-three and median-of-nine algorithms used for the optimal selection of the position and value of the pivot data item that splits the entire array into two subsequent parts to be sorted. This, in turn, has a large influence on the sorter's binary heap size (i.e., the recursion depth), and thus on the performance of the sorting process itself. Finally, we'll also discuss the optimization of the sorter function's recursive calls, which results in a lower recursion depth and thus a greater performance speed-up of the entire sorting process.

At the bottom of this article, we'll provide a basic performance analysis of the sorting algorithm being formulated, based on an estimation of the algorithm's computational complexity and level of entropy. We'll also survey the algorithm, comparing it with other known implementations of fast sorting algorithms such as the `std::sort` C++ standard template function, formulated as part of the industrial standard ISO/IEC 14882:2003(E) for the C++ programming language.

Once the new sorting algorithm has been formulated, we'll thoroughly discuss the ready-to-use C++11 code that implements it. In particular, we'll discuss code implemented using the C++ standard template library (STL), to conform exactly to the existing programming standards.

## Background

In this article, we'll formulate and discuss an improved, performance-optimized algorithm that sorts an array of data items based on a 3-way quicksort using various partitioning schemes, as well as a variety of other sorting algorithms used for such purposes as auxiliary pre-sorting and the sorting of small arrays.

**Fast Sorting Algorithm Overview**

The algorithm being discussed belongs to the special class of introspective sort algorithms, which perform hybrid adaptive sorting that relies on multiple existing sorting methods and algorithms rather than being formulated independently. The main benefit of the introspective sort is that it typically provides both fast average performance and an asymptotically reduced worst-case complexity of O(N x log2(N)), as well as an entropy-optimal sorting process.

The introspective sort algorithm was initially proposed by David Musser in the late 1990s and, compared to such algorithms as quicksort, selection sort, and insertion sort, still remains one of the newest and, at the same time, fastest, most efficient, and most reliable methods of sorting.

The main idea of using multiple sorting algorithms is that most existing sorting algorithms can be more or less efficient when used to sort different datasets. The computational complexity of many adaptive sorting algorithms varies depending on the particular sequences of data items they are used to sort. For example, the famous classical quicksort algorithm's performance significantly decreases when sorting either datasets with lots of duplicate items or datasets whose data items are already sorted in reverse order. In this case, the typically fast quicksort, whose computational complexity is O(N x log2(N)) in the average case, asymptotically degrades to O(N^2), which is quite similar to the complexity of any selection- or insertion-based algorithm. In turn, this makes quicksort many times less efficient when sorting datasets with a typically huge number of data items. On the other hand, the trivial selection and insertion sort algorithms become more efficient than the fast quicksort when it's necessary to sort typically small arrays of data items.

The introspective sort was specially formulated as a workaround for the problem mentioned above. During the sorting process, the introspective algorithm performs an intro-selection of the most appropriate algorithm for sorting one or more fragments of an array, namely the algorithm for which sorting that particular fragment is most efficient. The intro-selection is actually quickselect (a variant of the classical quicksort), based on performing array partitioning as well as comparison and exchange operations on specific data items while performing the actual sorting.

There are two main scenarios for performing the introspective sort. The first variant, initially proposed by David Musser, basically relies on dynamically controlling the maximum recursion depth, while the second variant is mainly based on selecting the most suitable algorithm depending on the potential size of a chunk and the initial order of the data items to be sorted, on optimized partitioning that ensures a sufficient performance speed-up of the sorting process, etc.

Specifically, the algorithm introduced in this article uses the second scenario. In this particular case, we'll use the introspective sort as a forefront sorting procedure. The forefront sorting algorithm can be formulated as follows:

**1.** Compute the potential size of the entire array of data items;

**2.** If the potential size of the array is less than or equal to 10^2, perform a regular insertion sort and go to step 10;

**3.** Apply cocktail shaker sort to re-arrange an input array that is already sorted in reverse order, then go to the next step;

**4.** Perform a first-chance check of whether the array has already been sorted. If so, jump to step 10;

**5.** Compute the position of the first unsorted data item in the array;

**6.** Test whether the array of data items to be sorted is an interleaved sequence by comparing the value of the first item in the array with the value of the data item located at the position starting at which the sequence of items is unsorted;

**7.** If these values are equal, perform 3-way quicksort and go to step 10;

**8.** Pre-sort the array using the adjacent sort algorithm and launch the next stage of the introspective sort;

**9.** Perform a last-chance check of whether the array has been sorted. If not, return to step 8;

**10.** Terminate the process of sorting;
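The steps above can be sketched as a small dispatcher in C++. This is a simplified sketch, not the article's implementation: the helper names are hypothetical, `std::sort` stands in for the backend introspective sort developed later in the article, and the reversal shortcut in step 3 is a simplification of the cocktail shaker pass:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper: a plain insertion sort for small arrays (step 2).
void insertion_sort(std::vector<int>& a) {
    for (std::size_t i = 1; i < a.size(); ++i) {
        int key = a[i];
        std::size_t j = i;
        while (j > 0 && a[j - 1] > key) { a[j] = a[j - 1]; --j; }
        a[j] = key;
    }
}

// Forefront dispatcher: selects a sorting method based on size and order.
void forefront_sort(std::vector<int>& a) {
    // Steps 1-2: small arrays (<= 100 items) go straight to insertion sort.
    if (a.size() <= 100) { insertion_sort(a); return; }
    // Step 3: an array in fully reverse order reduces to a reversal here.
    if (std::is_sorted(a.rbegin(), a.rend())) {
        std::reverse(a.begin(), a.end());
        return;
    }
    // Step 4: first-chance check of whether the array is already sorted.
    if (std::is_sorted(a.begin(), a.end())) return;
    // Steps 5-9: delegate to the backend; std::sort stands in here.
    std::sort(a.begin(), a.end());
}
```

The dispatcher itself moves no data except in the reverse-order shortcut; it only decides which backend does the work.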

As you might have noticed, the algorithm formulated above does not actually perform any sorting. Instead, its main purpose is to select the proper method of sorting based on observing the array's size and initial order. This is not a conventional sorting algorithm: according to it, we decide which particular sorting algorithm should be applied to perform the backend array sorting.

Before we begin the actual sorting, the first thing we have to do is compute the potential size of the input array. This value is used to determine which particular algorithm is optimal for sorting an array of that size. According to the algorithm being discussed, we normally use the insertion sort algorithm to sort arrays whose size is typically small and does not exceed 10^2 data items. After we've used insertion sort for small-array sorting, we terminate the sorting process, assuming that the entire array has been sorted. To sort arrays with a typically huge number of data items, we launch the introspective sort process, based on the 3-way quicksort algorithm at the backend.

Since we're about to sort a huge array of data items, prior to performing the introspective sort we check the initial order of the data items in the array and then pre-sort the array using one of the auxiliary sorting algorithms. Specifically, during the next step, we use cocktail shaker sort to sort an array in which the data items are already sorted in reverse order. At this point, we do not perform any additional verification. The cocktail shaker sort proceeds until we've changed the original order of all data items placed in reverse order in the array being sorted. If the initial order of the array is not reversed, we simply bypass any sorting and go to the next step of the algorithm. As we've already explained, in this case there's no need for any additional check of whether the array has already been sorted. In fact, we purposely omit the additional verification to optimize the performance of the entire sorting process.
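For reference, here is a minimal sketch of the cocktail shaker (bidirectional bubble) sort used as the auxiliary pass; the function name is hypothetical. Alternating forward and backward passes make it effective on reverse-ordered input, since each round moves both the largest remaining item right and the smallest remaining item left:

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Cocktail shaker sort: bubble passes alternate direction, shrinking the
// unsorted window [lo, hi] from both ends until no swaps occur.
void cocktail_shaker_sort(std::vector<int>& a) {
    if (a.empty()) return;
    std::size_t lo = 0, hi = a.size() - 1;
    bool swapped = true;
    while (swapped && lo < hi) {
        swapped = false;
        for (std::size_t i = lo; i < hi; ++i)      // forward pass
            if (a[i] > a[i + 1]) { std::swap(a[i], a[i + 1]); swapped = true; }
        --hi;
        for (std::size_t i = hi; i > lo; --i)      // backward pass
            if (a[i - 1] > a[i]) { std::swap(a[i - 1], a[i]); swapped = true; }
        ++lo;
    }
}
```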

After the input array has been pre-sorted using the cocktail shaker sort algorithm, we perform a first-chance check of whether the array has been sorted. If so, we jump to the last step of the algorithm and terminate the sorting process. Otherwise, we compute the position of the first unsorted data item in case the array has been partially sorted. Then we need to check whether the unsorted part of the array is an interleaved sequence, in which a certain fragment of already-sorted data items occurs in the array more than once (e.g., 1,2,3,4,1,2,3,4,1,2,3,4,...). The check is typically done by comparing the value of the first data item in the array with the value of the unsorted data item that was found. If these values are equal and the unsorted part of the array appears to be an interleaved sequence, we immediately proceed with the 3-way quicksort, used to sort the entire array. When the 3-way quicksort completes, we terminate the sorting process.

If the array to be sorted is not an interleaved sequence, we proceed with the regular introspective sort. To do this, we first pre-sort the array using the adjacent sort, during which we actually sort adjacent neighboring data items. This, in turn, increases both the performance and the reliability of the sorting process, and reduces the entropy of the sorting algorithm itself. After that, we proceed with the backend stage of the introspective sort, thoroughly discussed below.

To provide maximum reliability of the sorting process, we iteratively perform a last-chance check of whether the entire array has been sorted. To do this, we perform the adjacent pre-sort and the introspective sort, and finally perform the following verification: if the array is still not entirely sorted, we repeat the same adjacent pre-sort and introspective sort over again, until we've obtained an array in which all data items are properly ordered. After that, we finish the array sorting. The block diagram that illustrates the forefront sorting algorithm is shown in Fig. 1.

As you can see from the explanation and the block diagram above, the proposed algorithm uses a slightly different approach than controlling the sorter's binary heap size based on computing the maximum recursion depth.

As we've already discussed, we next proceed with the backend stage of the sorting algorithm. During this stage, we perform one more additional check to determine whether the array to be sorted consists of a large number of data items with duplicate values. This is typically done by first searching for the data item with the maximum value in the array. Then we need to compute the radix value (i.e., the position of the most significant digit) for both the array size and the maximum value in the array. After that, we check whether the position of the MSD in the size value is greater; if so, we partition the array into chunks of a specific size and then sort those chunks separately using the introspective sort discussed in the next paragraph. This, in turn, allows us to avoid sorting those chunks whose data items are already ordered. Otherwise, we don't need to perform the partitioning and use the introspective algorithm to sort the entire array.

The backend stage of the algorithm being discussed is very simple and has the following steps:

**1.** Find the data item whose value is the maximum in the array to be sorted;

**2.** Compute the radix value of both the array size and the maximum value in the array;

**3.** Check whether the radix (i.e., the position of the MSD) of the size value is greater;

**4.** If so, split the entire array into a number of chunks of equal size; otherwise, go to step 7;

**5.** Sort the data items within each chunk independently using the introspective sort algorithm;

**6.** Sort all subsequent chunks of data items in the array to be sorted and go to step 8;

**7.** Perform a regular introspective sort of the entire array;

**8.** Terminate the sorting process;
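Under stated assumptions, the backend decision above can be sketched as follows. Here `digits()` computes what the article calls the radix (the MSD position) of a value; merging the independently sorted chunks with `std::inplace_merge` is one possible reading of step 6, which the article leaves abstract; `std::sort` stands in for the introspective sort, and non-negative values are assumed:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Position of the most significant decimal digit ("radix" in the article).
static int digits(std::size_t v) {
    int d = 1;
    while (v >= 10) { v /= 10; ++d; }
    return d;
}

// Backend stage sketch (steps 1-8): chunked sorting for arrays whose size
// has more decimal digits than their maximum value, i.e. arrays that must
// contain many duplicates.
void backend_sort(std::vector<int>& a) {
    if (a.empty()) return;
    int max_value = *std::max_element(a.begin(), a.end());   // step 1
    if (max_value >= 0 &&
        digits(a.size()) > digits(static_cast<std::size_t>(max_value))) {
        // Steps 4-6: split into equal chunks, sort each, then merge them.
        std::size_t chunk = a.size() / 10 + 1;   // assumed chunk sizing
        for (std::size_t i = 0; i < a.size(); i += chunk)
            std::sort(a.begin() + i,
                      a.begin() + std::min(i + chunk, a.size()));
        for (std::size_t i = chunk; i < a.size(); i += chunk)
            std::inplace_merge(a.begin(), a.begin() + i,
                               a.begin() + std::min(i + chunk, a.size()));
    } else {
        // Step 7: introspective sort of the whole array (std::sort here).
        std::sort(a.begin(), a.end());
    }
}
```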

Fig. 2 illustrates the backend sorting algorithm:

**Introspective Sort**

In this section of the article, we'll discuss the backend stage of the introspective algorithm, which is used for sorting when the array is neither a reversed nor an interleaved sequence. To be more specific, let's formulate the backend introspective sort algorithm, according to which we partition the input array and then sort each partition using the fast and efficient 3-way quicksort. The backend introspective sort is even simpler than the forefront algorithm discussed above and has the following steps:

**1.** Check whether the array has already been sorted. If so, go to the last step 7 and terminate the sorting process; otherwise, go to the next step;

**2.** Compute the actual size of the array to be sorted;

**3.** Check whether the size of the array to be sorted is greater than the upper cut-off boundary. If not, just perform a regular insertion sort of the entire array. Otherwise, proceed with the next step of the algorithm;

**4.** Partition the array into two subsequent parts, leftmost and rightmost, according to the partitioning scheme used in the classical quicksort algorithm;

**5.** Sort the leftmost partition with the backend 3-way quicksort algorithm;

**6.** Sort the rightmost partition with the backend 3-way quicksort algorithm;

**7.** Terminate the sorting process;

In addition, the introspective sorting process is illustrated in the block diagram shown in Fig. 3 below:

Before performing the actual sorting, we obviously check whether the array has already been sorted. If so, we terminate the sorting process. If not, we proceed with the next step and compute the actual size of the array. After that, we check whether the potential size of the array is greater than the sorting cut-off boundary. If so, we partition the array using the classical quicksort partitioning scheme. Finally, we sort each of those two partitions independently using the improved 3-way quicksort algorithm. Please note that, unlike the conventional quicksort, at this point we do not perform the introspective sort recursively. Instead, we delegate the sorting process to the 3-way quicksort function, which in turn performs the recursive sorting.
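A runnable sketch of steps 1-7 follows. The cut-off value of 32 and the middle-element pivot are assumptions for illustration, and the recursive calls here stand in for the delegation to the 3-way quicksort function described in the text:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Insertion sort over the inclusive range [lo, hi] (step 3 fallback).
void insertion_sort_range(std::vector<int>& a,
                          std::ptrdiff_t lo, std::ptrdiff_t hi) {
    for (std::ptrdiff_t i = lo + 1; i <= hi; ++i) {
        int key = a[i];
        std::ptrdiff_t j = i;
        while (j > lo && a[j - 1] > key) { a[j] = a[j - 1]; --j; }
        a[j] = key;
    }
}

// Backend introspective sort sketch over the inclusive range [lo, hi].
void backend_introsort(std::vector<int>& a,
                       std::ptrdiff_t lo, std::ptrdiff_t hi) {
    if (lo >= hi) return;
    // Step 1: already-sorted check.
    if (std::is_sorted(a.begin() + lo, a.begin() + hi + 1)) return;
    // Steps 2-3: below the cut-off boundary, plain insertion sort suffices.
    const std::ptrdiff_t kCutoff = 32;   // assumed cut-off value
    if (hi - lo < kCutoff) { insertion_sort_range(a, lo, hi); return; }
    // Step 4: bidirectional partitioning around a middle-element pivot.
    int pivot = a[lo + (hi - lo) / 2];
    std::ptrdiff_t i = lo, j = hi;
    while (i <= j) {
        while (a[i] < pivot) ++i;
        while (a[j] > pivot) --j;
        if (i <= j) std::swap(a[i++], a[j--]);
    }
    // Steps 5-6: sort both partitions (3-way quicksort in the article).
    backend_introsort(a, lo, j);
    backend_introsort(a, i, hi);
}
```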

**Partitioning schemes**

Partitioning is one of the most essential steps of the backend introspective sort discussed above. During this step, we logically split the entire array to be sorted into two subsequent parts (i.e., partitions), which are sorted recursively. The main idea of partitioning is that we first select a pivot data item and then re-arrange the array so that the data items whose values are less than the value of the pivot are located in the leftmost partition of the array (i.e., before the position of the pivot), while the data items whose values are greater than the value of the pivot are located in the rightmost partition, respectively.

As we might have noticed, in this particular case, by performing the partitioning we actually pre-sort the array by exchanging data items relative to the position of the selected pivot. There are a couple of array partitioning schemes, such as the natural Hoare scheme and the Lomuto scheme. There are also different methods for selecting a pivot, such as using the data item located at the middle position of the array as the pivot, or more sophisticated methods such as the median-of-three and median-of-nine computations. We'll thoroughly discuss these methods later in this article.

Since we're going to use partitioning solely for pre-sorting and optimization purposes, we select a pivot and then perform a bidirectional partitioning of the array, which can be formulated as follows.

According to this scheme, we iterate through the array starting at the position of the first data item. Scanning the array from the left, we step to each data item and compare its value with the value of the selected pivot. If the value of the current data item is greater than the value of the pivot, we perform another iteration through the array starting from the end, at the position of the last data item. While iterating backwards, we sequentially compare each data item with the value of the pivot. If the value of the current item is less than the value of the pivot, we simply exchange it with the previously found data item from the left whose value is greater than the value of the pivot.

The main idea of this bidirectional partitioning is that we step through the array from the left and right sides, performing a series of iterations to locate pairs of data items with values that are greater than or less than the value of the pivot, respectively. For each such pair, we perform an exchange so that the data item with the smaller value ends up at a position before the pivot, and the data item with the greater value at a position after the pivot. Once we've found the first pair of items to exchange, we proceed with the iterations from the left and right to find another pair of data items whose values can be exchanged, until the scan from the left has passed the scan from the right. Finally, we terminate the partitioning process. An example of the partitioning process is shown in Fig. 4.

Once we've partitioned the array, we then sort each subsequent partition using the 3-way quicksort discussed below. Obviously, according to Hoare's partitioning scheme, the data items located between the position of the first item in the array and the final position reached from the right form the first partition, and, similarly, the data items between the final item reached from the left and the last item in the array form the second partition.

To optimize the performance of the sorting process at the algorithm level, we'll use a slightly different method for splitting the array. Specifically, the first partition will consist of the data items located between the first item in the array and the final item reached from the left, and the second partition will include the data items between the final item reached from the right and the last item in the array, respectively. Both methods for splitting an array are shown in Fig. 5.
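The bidirectional scan described above can be sketched as a small routine. This is an illustrative sketch, not the article's code: the function name is hypothetical, and it returns the final left and right scan positions so that the caller can form the partitions using either splitting variant of Fig. 5:

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Bidirectional partitioning of the inclusive range [lo, hi] around the
// given pivot value. Scans run towards each other, swapping out-of-place
// pairs. On return, items in [lo, second] are <= pivot and items in
// [first, hi] are >= pivot, so recursion may proceed on those two ranges.
std::pair<std::ptrdiff_t, std::ptrdiff_t>
partition_range(std::vector<int>& a,
                std::ptrdiff_t lo, std::ptrdiff_t hi, int pivot) {
    std::ptrdiff_t i = lo, j = hi;
    while (i <= j) {
        while (a[i] < pivot) ++i;   // scan from the left for an item >= pivot
        while (a[j] > pivot) --j;   // scan from the right for an item <= pivot
        if (i <= j) std::swap(a[i++], a[j--]);
    }
    return std::make_pair(i, j);
}
```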

**Median of Medians**

As we've already discussed in the previous section, one of the essential steps of the partitioning algorithm is selecting the pivot data item that logically divides the entire array into two subsequent parts. The proper selection of the pivot value drastically optimizes the performance of the sorting process by reducing the number of recursive calls to the sorter function and thus the size of the sorter's binary heap. There are a number of methods by which the pivot data item can be selected. The classical quicksort algorithm proposed by C. A. R. (Tony) Hoare uses the data item located at the middle of the array as the pivot for partitioning. In turn, other versions of the quicksort algorithm, including the dual-pivot quicksort proposed by Vladimir Yaroslavskiy, choose their pivots differently, for example based on data items taken from near the beginning and end of the array to be sorted.

However, to optimize performance at the algorithm level, we can use more sophisticated methods for finding the value of the pivot data item. The most popular methods include the median-of-three and median-of-nine algorithms that we're about to discuss in this section.

The median is the central data item that logically separates an array into parts. In most cases, the value of the median is the middle value of all items in the array. To compute the median of an array that consists of an odd number of items, all we need to do is take the value of the data item located in the middle. If the number of data items in the array is even, we compute the median as the arithmetic average of the values of the two data items in the middle of the array. The main requirement for computing the median is that the array must previously have been sorted in either ascending or descending order.

The median algorithm used by quicksort does not actually perform any average-value computations, since the input array to be sorted is initially unordered. Instead, we take the values of three data items from the beginning, the middle, and the end of the array. To obtain the value of the median, we select the middle of these three values, i.e., the value that is greater than the minimum and less than the maximum of the three. Optionally, we can do a tiny sort by exchanging these three values so that the minimum value appears at the first position, the median value in the middle, and the maximum at the last position.

The algorithm we've just introduced is the median-of-three algorithm. To make the selection of the median more optimal, we normally use the median-of-nine algorithm, according to which we split the unsorted array into nine chunks, whose size is determined by dividing the potential size of the array by 9. Then we perform a series of iterations, incrementing a loop counter variable by the previously computed size of each chunk. During each iteration, we retrieve the value at the position of the first item in each subsequent chunk, pointed to by the loop counter variable, and copy these values to a separate temporary array, until we've obtained an array of nine intermediate values used to compute the final value of the median. After that, we apply the median-of-three algorithm to each group of three values stored in the temporary location and, as a result, obtain three different median values. Finally, we apply the median-of-three algorithm once again to obtain the final value of the median by selecting among those three previously obtained values. Fig. 6 illustrates the median-of-three and median-of-nine computation.
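The computation above can be sketched in a few lines; the helper names `median3` and `median_of_nine` are hypothetical, and the fallback for very small arrays is an added assumption:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Median of three values without sorting: the middle of a, b, c.
int median3(int a, int b, int c) {
    return std::max(std::min(a, b), std::min(std::max(a, b), c));
}

// Median-of-nine: sample the first item of each of nine equally sized
// chunks, take the median of each group of three samples, then the
// median of those three medians.
int median_of_nine(const std::vector<int>& a) {
    std::size_t step = a.size() / 9;
    if (step == 0) return a[a.size() / 2];  // too small: fall back to middle
    int s[9];
    for (int k = 0; k < 9; ++k) s[k] = a[k * step];
    return median3(median3(s[0], s[1], s[2]),
                   median3(s[3], s[4], s[5]),
                   median3(s[6], s[7], s[8]));
}
```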

**3-Way Quicksort Algorithm**

As we've already discussed, the introspective sorting algorithm described above does not perform the actual sorting. The sorting of each fragment of the array after partitioning is typically done using 3-way quicksort, one of the fastest and most efficient sorting algorithms ever formulated. 3-way quicksort was documented by *Robert Sedgewick* and *Kevin Wayne* in their *"Algorithms, 4th Edition"* book and other materials published by these two authors. In fact, the 3-way variant of the quicksort algorithm is much simpler than the classical quicksort invented by Tony Hoare. It can be formulated as follows.

According to the 3-way quicksort procedure, we select the pivot data item using the median-of-nine algorithm and then iterate through the array to be sorted. Stepping to each data item, we check whether the current data item's value is greater than the value of the pivot. If so, we exchange the current item with the item reached from the right of the array. Otherwise, if the current data item's value is less than the value of the pivot, we exchange it with the item reached from the left. Similarly, we exchange all other data items whose values are less than the pivot with each next item from the left, and those items whose values are greater than the pivot with each next item from the right, and so on. The data items whose values are equal to the value of the pivot remain in the middle. Let's note that, unlike the classical quicksort, the 3-way partitioning is performed in a single pass through the array to be sorted.

Then, similarly to all other variants of the quicksort algorithm, the array is partitioned into two subsequent parts, each sorted by an independent recursive call to the 3-way quicksort routine. The array to be sorted is split using the second variant shown in Fig. 5.
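The single-pass 3-way partition described above can be sketched as follows. This follows the Dijkstra-style scheme popularized by Sedgewick and Wayne: items less than the pivot move left, items greater move right, and equal items collect in the middle, which is then excluded from the recursion. The middle-element pivot here is a simplification of the article's median-of-nine selection:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// 3-way quicksort over the inclusive range [lo, hi]. After the single
// partitioning pass, a[lo..lt-1] < pivot, a[lt..gt] == pivot, and
// a[gt+1..hi] > pivot; only the outer two ranges are recursed into.
void three_way_quicksort(std::vector<int>& a,
                         std::ptrdiff_t lo, std::ptrdiff_t hi) {
    if (lo >= hi) return;
    int pivot = a[lo + (hi - lo) / 2];   // simplified pivot choice
    std::ptrdiff_t lt = lo, i = lo, gt = hi;
    while (i <= gt) {
        if (a[i] < pivot)      std::swap(a[lt++], a[i++]);
        else if (a[i] > pivot) std::swap(a[i], a[gt--]);
        else                   ++i;
    }
    three_way_quicksort(a, lo, lt - 1);  // items strictly below the pivot
    three_way_quicksort(a, gt + 1, hi);  // items strictly above the pivot
}
```

Because the equal-to-pivot band is never revisited, arrays with many duplicate values cost the algorithm nothing extra, which is exactly the case that degrades the classical quicksort.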

The block diagram illustrating the 3-way quicksort algorithm is shown in Fig. 7:

The main benefit of using 3-way quicksort rather than other sorting algorithms is that it provides excellent performance and an entropy-optimal sorting process. The best-case computational complexity of the 3-way quicksort algorithm is O(N), for example when all values are equal, and its expected complexity is O(N x log2(N)). Compared to the 3-way quicksort, the classical variant of the quicksort algorithm has much lower performance, since its worst-case complexity can be estimated as O(N^2) when the algorithm degrades while sorting arrays with lots of duplicate values. In this case, the performance of the 3-way quicksort is even higher, since prior to the actual sorting we apply the complex pre-sorting of the array discussed above.

## Conclusion

In this first article of the series, we've thoroughly discussed the new fast quicksort-based sorting algorithm used for sorting arrays with a typically huge number of data items. In the second article, coming soon, we'll discuss the code that implements the sorting algorithm. To implement the algorithm, we'll use the Intel® C++ Compiler and the OpenMP performance library to deliver modern code that performs the actual sorting according to the algorithm being introduced.

## History

**December 5, 2015** - The first revision of the article was published

**December 8, 2015** - The second revision of the article was published

**December 10, 2015** - The third revision of the article was published

**December 12, 2015** - The final revision of the article was published (performance analytics added)

**November 16, 2017** - The final revision of the article was published (major material update!)