The generic sort algorithm in .NET does not perform well on sorting strings. The reason is that it performs too many character comparisons while comparing strings. There is a better algorithm, described by Prof. Robert Sedgewick, called multikey quicksort. Sedgewick is known for his developer-friendly books on algorithms and for his empirical approach to studying algorithm performance. For string sort, Sedgewick provides a reference implementation in C. His implementation is quite hard to understand, but it achieves excellent performance.
String Sort Implementation in C# and F#
I provide three implementations of the multikey quicksort algorithm in C# and F# (in addition to the pseudo-code below). My implementations are slightly different from Sedgewick's and slightly slower, but much faster than the .NET implementation. Initially, I had difficulty adapting Sedgewick's implementation because of a corner case: a string can contain the '\0' character. In the new code, this problem is fixed, and I provide a C# implementation very close to the reference.
The problem with .NET sort is not in the implementation per se, but in the algorithm. .NET implements plain quicksort, which is one of the fastest algorithms for integers or doubles. Sedgewick proposes a few extensions to plain quicksort which make the algorithm suitable for:
- data which contains duplicates
- multikey data, such as strings or arrays of characters
Additionally, Sedgewick uses two other tricks:
- insertion sort for small arrays
- pseudo-median of 9 to find a good pivot
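Both tricks are independent of the multikey idea. A sketch of what they look like, in Python for brevity (the helper names, the cutoff policy, and the exact sampling offsets are my choices, not Sedgewick's):

```python
def median3(a, i, j, k, key):
    # Return the index of the median of a[i], a[j], a[k] under key.
    if key(a[i]) < key(a[j]):
        if key(a[j]) < key(a[k]): return j
        return k if key(a[i]) < key(a[k]) else i
    if key(a[i]) < key(a[k]): return i
    return k if key(a[j]) < key(a[k]) else j

def pseudo_median_of_9(a, lo, hi, key):
    # Pseudo-median of 9: the median of three medians-of-three,
    # sampled evenly across a[lo:hi]. Intended for large subarrays.
    n = hi - lo
    step = max(n // 8, 1)
    m1 = median3(a, lo, lo + step, lo + 2 * step, key)
    mid = lo + n // 2
    m2 = median3(a, mid - step, mid, mid + step, key)
    m3 = median3(a, hi - 1 - 2 * step, hi - 1 - step, hi - 1, key)
    return median3(a, m1, m2, m3, key)

def insertion_sort(a, lo, hi, key):
    # Used instead of recursing when a subarray falls below a cutoff.
    for i in range(lo + 1, hi):
        x = a[i]
        j = i - 1
        while j >= lo and key(a[j]) > key(x):
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = x
```

A quicksort would call pseudo_median_of_9 to pick its pivot index and fall back to insertion_sort once a subarray shrinks below some small threshold.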
Pseudo-code of Multikey Quicksort
The real code of string sort is somewhat hard to understand because it uses in-place updates and multiple indices. The multikey quicksort algorithm itself, however, is not. In this section, I provide actual F# code to serve as pseudo-code. This code can serve as a high-level specification of the actual algorithm, but it omits details which affect performance.
    let filter = Array.filter
    let concatenate = Array.concat

    let cmp (depth: int, s: string, t: string) =
        let lenS, lenT = s.Length, t.Length
        assert(lenS >= depth); assert(lenT >= depth)
        if (lenS = depth && lenT = depth) then (*return*) 0
        else if (lenS = depth) then (*return*) -1
        else if (lenT = depth) then (*return*) 1
        else
            assert(lenS > depth); assert(lenT > depth)
            let chS, chT = s.[depth], t.[depth]
            if chS < chT then (*return*) -1
            else if chS = chT then (*return*) 0
            else (*return*) 1

    let rec pseudoCodeMultiKeyQuickSort(input: string[], depth: int): string[] =
        if (input.Length = 0) then Array.empty
        else
            let pivot = input.[input.Length / 2]
            let less = filter (fun s -> cmp(depth, s, pivot) < 0) input
            let eq = filter (fun s -> cmp(depth, s, pivot) = 0) input
            let gt = filter (fun s -> cmp(depth, s, pivot) > 0) input
            let sortedLess = pseudoCodeMultiKeyQuickSort(less, depth)
            let sortedEq =
                if (eq.Length > 0 && depth < eq.[0].Length)
                then pseudoCodeMultiKeyQuickSort(eq, depth + 1)
                else eq
            let sortedGt = pseudoCodeMultiKeyQuickSort(gt, depth)
            (*return*) concatenate [|sortedLess; sortedEq; sortedGt|]

    let pseudoStringSort (input: string[]) =
        (*return*) pseudoCodeMultiKeyQuickSort(input, 0 (*depth*))
Please pay attention to the comparison function (cmp), which takes a depth parameter in addition to the two strings (s and t) to be compared. Also note how the algorithm recurses along two dimensions: subarrays and the depth parameter (see Figure 1). You can think of the algorithm as operating on levels: first it sorts by the first character, finds all strings whose first character is equal to the pivot's, and, for this group of strings only, sorts by the second character. Special care has to be taken for strings which run out of characters as the depth parameter grows. Depth grows only in the recursive call on the eq block. At that point, we have a block of strings which share the same prefix from 0 to depth (inclusive). As depth grows, we will eventually end up with empty suffixes. If the current suffix (starting at depth) of the first string in the block of equal strings is empty (tested by the depth < eq.[0].Length condition), then because all strings in this block are equal, all of their suffixes are empty. This is used as a termination condition for the recursion by depth. The other termination condition is that a subarray has become empty.
Finally, some notes on F# syntax might be useful for those unfamiliar with the language. [|a; b; c|] denotes an array with elements a, b, and c. arr.[index] accesses the element at position index. let x = 2 binds the variable x to the value 2. fun s -> f(s) is the F# syntax for a lambda expression: a predicate which takes s, runs f with argument s, and returns the result of f to the caller. The predicate may be executed multiple times.
Figure 1: Multikey quick sort recursion.
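For readers more comfortable with Python than F#, the pseudo-code above transliterates almost mechanically; here is a sketch with the same structure and the same two termination conditions (the function names are mine, and this is the functional specification, not the in-place algorithm that is measured below):

```python
def cmp_at(depth, s, t):
    # Compare s and t by the single character at position depth.
    # A string that ends exactly at depth sorts before any longer one.
    if len(s) == depth and len(t) == depth: return 0
    if len(s) == depth: return -1
    if len(t) == depth: return 1
    return (s[depth] > t[depth]) - (s[depth] < t[depth])

def multikey_quicksort(strings, depth=0):
    # Functional version: three buckets, then recursion on the outer
    # buckets (same depth) and on the equal bucket (depth + 1).
    if not strings:
        return []
    pivot = strings[len(strings) // 2]
    less = [s for s in strings if cmp_at(depth, s, pivot) < 0]
    eq = [s for s in strings if cmp_at(depth, s, pivot) == 0]
    gt = [s for s in strings if cmp_at(depth, s, pivot) > 0]
    # Recurse by depth only while the equal block still has characters.
    if eq and depth < len(eq[0]):
        eq = multikey_quicksort(eq, depth + 1)
    return multikey_quicksort(less, depth) + eq + multikey_quicksort(gt, depth)
```

For example, multikey_quicksort(["banana", "band", "ban", ""]) returns ["", "ban", "banana", "band"].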
Dimensions by Which to Judge String Sorting Algorithms
The following dimensions affect which string sorting algorithm performs best:
Length of Input
Small arrays vs. large arrays: .NET Sort performs well on small string arrays and on arrays where the strings are short. As the input array grows in size, or as the strings in it grow longer, .NET takes much longer to terminate.
If there are many duplicate strings, ternary partitioning (i.e. partitioning into three parts) in multikey quicksort (as implemented for strings by Sedgewick) will pay off despite its overhead. If duplicate strings are not expected, binary partitioning algorithms will be faster.

Ternary partitioning works in the following way. A pivot is chosen and, in the end, the elements are partitioned into three buckets in the following order: "< pivot", "= pivot", and "> pivot". In contrast, binary partitioning uses only two buckets: "<= pivot" and "> pivot". Ternary partitioning is more complicated to implement without an additional array, as it first splits the elements into four groups in the order: "= pivot", "< pivot", "> pivot", "= pivot". It then moves the two "= pivot" blocks back into the middle: "< pivot", "= pivot", "= pivot", "> pivot". In addition to having bigger inner loops than binary quicksort, one iteration of ternary quicksort requires more data swapping. The size of the code in the inner loop greatly affects the performance of quicksort. However, if the "= pivot" blocks happen to be large, there is less work to do when recursing on the "< pivot" and "> pivot" blocks. In the case of multikey quicksort, the pivot is a character and will most likely match many elements.
As the strings in the input array grow longer, the ratio between the running times of generic sort and string sort generally becomes larger. This case is worsened if the strings to be sorted share many long common prefixes, which is the case when sorting words from the English language.
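The cost of common prefixes to a generic comparison sort can be made visible with a small counting experiment (the wrapper class below is my own illustration, not part of any library):

```python
class CountingStr:
    # Wraps a string and counts the character comparisons performed
    # by a comparison-based sort such as Python's built-in sorted().
    comparisons = 0

    def __init__(self, s):
        self.s = s

    def __lt__(self, other):
        # Character-by-character comparison, the way a generic string
        # comparator works: a shared prefix must be re-scanned on
        # every comparison of the pair.
        for a, b in zip(self.s, other.s):
            CountingStr.comparisons += 1
            if a != b:
                return a < b
        return len(self.s) < len(other.s)
```

Sorting the same suffixes with and without a 20-character shared prefix performs the same number of string comparisons, but each one costs about 21 character comparisons instead of 1; multikey quicksort passes over each prefix character once per block instead.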
Adaptation to Non-English Languages
The reference code provided sorts strings by comparing their characters from first to last, and compares characters by their ASCII codes. There are two problems: reverse sort order and non-English languages.
- First, even if you provide a char comparator, sorting in reverse will not work unless the algorithm is changed. This is caused by strings that become empty as the depth parameter grows. It is possible to change the algorithm to account for reverse comparators, but providing only a char comparator is not sufficient.
- The second problem is that, for many languages, a char comparator is not enough to derive a string comparator. The issue is outlined in an MSDN article: there are some locales in which there is no one-to-one correspondence between characters and letters.

To sort in non-English languages, one should break the sequence of characters into letters and then sort by letters. Luckily, .NET provides a class called SortKey which converts a string in a non-English language to a byte array. Sorting strings by their corresponding byte arrays results in the correct sort order. In this way, the multikey quicksort algorithm can easily be adapted to operate on byte arrays instead of strings. For details on converting strings to SortKeys, please check the referenced MSDN article. For efficiency, strings should be converted to SortKeys once, before running the sorting algorithm.
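The same pipeline can be sketched outside .NET: convert each string to a key sequence once up front, run multikey quicksort over the keys, and map the keys back to strings. The helper names below are mine, not a library API; the usage example encodes to UTF-8 only because UTF-8 byte order happens to match code-point order, and a locale-aware key function (playing the role of SortKey) would slot in the same way:

```python
def multikey_sort_keys(keys, depth=0):
    # Multikey quicksort over sequences of comparable elements
    # (e.g. byte arrays); a sequence ending at depth sorts before
    # any longer one, the same rule as for strings.
    if not keys:
        return []
    pivot = keys[len(keys) // 2]
    def c(k):
        if len(k) == depth and len(pivot) == depth: return 0
        if len(k) == depth: return -1
        if len(pivot) == depth: return 1
        return (k[depth] > pivot[depth]) - (k[depth] < pivot[depth])
    less = [k for k in keys if c(k) < 0]
    eq = [k for k in keys if c(k) == 0]
    gt = [k for k in keys if c(k) > 0]
    if eq and depth < len(eq[0]):
        eq = multikey_sort_keys(eq, depth + 1)
    return multikey_sort_keys(less, depth) + eq + multikey_sort_keys(gt, depth)

def sort_strings_by_key(strings, to_key):
    # Convert once before sorting (as recommended above), sort the
    # keys, then map each key occurrence back to one original string.
    buckets = {}
    for s in strings:
        buckets.setdefault(to_key(s), []).append(s)
    return [buckets[k].pop(0)
            for k in multikey_sort_keys([to_key(s) for s in strings])]
```

For example, sort_strings_by_key(words, lambda s: s.encode("utf-8")) sorts words in code-point order via their byte arrays.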
As explained in the previous section, careful comparison of sorting algorithms has to account for at least the above-mentioned input dimensions. It is best to keep two of the dimensions fixed and plot the running times against values of the third. Due to lack of time, I show only running times where each of the three dimensions is relatively large. My code contains a function which runs and tests all algorithms on automatically generated inputs of the specified dimensions; it is found in the F# file provided. In the results below, the percentage column shows the slowdown (negative means speedup) relative to Sedgewick's implementation translated to C#.
number of unique elements = 5000000;
number of duplicates for each unique element = 5;
length of each string = 20;
random seed = 384585;
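The generator used for the measurements is in the accompanying F# file; an equivalent generator for these parameters looks roughly like this (the function name and the alphabet are my choices):

```python
import random

def make_input(unique, dups, length, seed):
    # Produces unique * dups strings in total: each random string of
    # the given length is repeated dups times, then the whole array
    # is shuffled so duplicates are scattered.
    rng = random.Random(seed)
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    uniques = ["".join(rng.choice(alphabet) for _ in range(length))
               for _ in range(unique)]
    data = [s for s in uniques for _ in range(dups)]
    rng.shuffle(data)
    return data
```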
String Sort in C# : avg: 31.247667(sec.) 0.4 %
String Sort in F# : avg: 44.563966(sec.) 43.2 %
Sedgewick (translated to C#): avg: 31.119719(sec.) 0.0 %
Dot Net Generic Sort : avg: 145.528939(sec.) 367.6 %
Fixed parameters as above, variable string length
string length = 5
String Sort in C# : avg: 25.546378(sec.) 1.1 %
String Sort in F# : avg: 33.159264(sec.) 31.2 %
Sedgewick (translated to C#): avg: 25.268513(sec.) 0.0 %
Dot Net Generic Sort : avg: 129.865247(sec.) 413.9 %
string length = 10
String Sort in C# : avg: 29.166639(sec.) -6.5 %
String Sort in F# : avg: 46.012092(sec.) 47.4 %
Sedgewick (translated to C#): avg: 31.210900(sec.) 0.0 %
Dot Net Generic Sort : avg: 152.109868(sec.) 387.4 %
string length = 20
String Sort in C# : avg: 37.304173(sec.) 16.2 %
String Sort in F# : avg: 52.745780(sec.) 64.3 %
Sedgewick (translated to C#): avg: 32.095941(sec.) 0.0 %
Dot Net Generic Sort : avg: 162.737198(sec.) 407.0 %
string length = 40
String Sort in C# : avg: 39.446201(sec.) 9.1 %
String Sort in F# : avg: 52.786923(sec.) 46.0 %
Sedgewick (translated to C#): avg: 36.151525(sec.) 0.0 %
Dot Net Generic Sort : avg: 167.100856(sec.) 362.2 %
string length = 40
String Sort in C# : avg: 44.527835(sec.) 26.9 %
String Sort in F# : avg: 54.424359(sec.) 55.1 %
Sedgewick (translated to C#): avg: 35.098257(sec.) 0.0 %
Dot Net Generic Sort : avg: 164.124873(sec.) 367.6 %