## Introduction

The generic sort algorithm in .NET does not perform well on sorting `string`

s. The reason for that is that it performs too many character comparisons while comparing `string`

s.

There is a better algorithm described by Prof. Robert Sedgewick called **multikey quicksort**. Sedgewick is known for his developer friendly books on algorithms and his approach to empirically study algorithm performance. In the case of `string`

sort, Sedgewick provides a reference implementation of `string`

sort in C:

His implementation is quite hard to understand but achieves excellent performance.

## String Sort Implementation in C# and F#

I provide three implementations of multikey quick sort algorithm in C# and F# (in addition to the pseudo-like code below). My implementations are slightly different from Sedgewick's and slightly slower, but much faster than the .NET implementation. Initially, I had difficulty adapting Sedgewick's implementation due to a corner case when the `string`

can contain the '`\0`

' character. In the new code, this problem is fixed and I provide an implementation in C# very close to the reference.

The problem with .NET sort is not in the implementation per se, but in the algorithm. .NET implements plain quick sort, which is one of the fastest algorithms for integers or doubles. Sedgewick proposes a few extensions to plain quick sort which make the algorithm useful for:

*data which contains duplicates* *multikey data, such as strings or arrays of characters *

Additionally, Sedgewick uses two other tricks:

*insertion sort for small arrays* *pseudo-median of 9 to find a good pivot*

## Pseudo-code of Multikey Quicksort

The real code of `string`

sort is somewhat hard to understand because it uses inplace update and multiple indices. The multikey quicksort algorithm however, is not. In this section, I provide actual F# code to serve as pseudo-code. This code can serve as a high-level specification of the actual algorithm. However, this pseudo-code omits details which affect performance.

let filter = Array.filter
let concatenate = Array.concat
let cmp (depth: int, s: string, t: string) =
let lenS, lenT = s.Length, t.Length
assert(lenS >= depth); assert(lenT >= depth)
if (lenS = depth && lenT = depth) then (*return*) 0
else if (lenS = depth) then (*return*) -1
else if (lenT = depth) then (*return*) 1
else
assert(lenS > depth); assert(lenT > depth)
let chS, chT = s.[depth], t.[depth]
if chS < chT then (*return*) -1
else if chS = chT then (*return*) 0
else (*return*) 1
let rec pseudoCodeMultiKeyQuickSort(input: string[], depth: int): string[] =
if (input.Length = 0) then Array.empty
else
let pivot = input.[input.Length / 2]
let less = filter(fun s -> cmp(depth, s, pivot) < 0) input
let eq = filter(fun s -> cmp(depth, s, pivot) = 0) input
let gt = filter(fun s -> cmp(depth, s, pivot) > 0) input
let sortedLess = pseudoCodeMultiKeyQuickSort(less, depth)
let sortedEq = if (eq.Length > 0 && depth < eq.[0].Length) then
pseudoCodeMultiKeyQuickSort
(eq, depth + 1)
else eq
let sortedGt = pseudoCodeMultiKeyQuickSort(gt, depth)
(*return*)
concatenate [|sortedLess; sortedEq; sortedGt|]
let pseudoStringSort (input: string[]) =
(*return*) pseudoCodeMultiKeyQuickSort(input, 0 (*depth*))
pseudoStringSort [|"aaa";"a";"b";"";"ba";"bbb"|]

Please pay attention to the comparison function (named `cmp`

, line 3) which takes a depth parameter in addition to the two `string`

s (`s`

and `t`

) to be compared. Also note how the algorithm recurses on two dimensions: subarrays and the depth parameter (see Figure 1). You can think that the algorithm operates on levels: first it sorts by the first character, finds all `string`

s where the first character is equal to the pivot and, for this group of `string`

s only, sorts by the second character. Special care has to be taken for `string`

s which run out of characters as the depth parameter grows. Depth grows only on line 21. In that line, we have a block of `string`

s which have the same prefix from 0 to depth (inclusive). As depth grows, it will eventually end up with empty `string`

s. If the current suffix (starting at depth) of the first `string`

in the block of equal `string`

s (line 21) is the empty `string`

(tested by `eq.[0].Length `

condition), then because all `string`

s in this block are equal, they are all empty. This is used as an termination condition to the recursion by depth. The other termination condition is that a subarray has become empty.

Finally, some notes on the F# syntax might be useful for those unfamiliar with F#. `[|a;b;c|]`

is used to denote an array of `a`

, `b `

and `c`

; `arr.[index]`

is used to access an element at position index; let `x = 2 `

is used to bind variable `x `

to value `2`

; `fun s -> f(s)`

is the F# syntax for a lambda expression, or a predicate which will take `s`

and run `f`

with argument `s`

and return the result of `f`

to the caller. The predicate may be executed multiple times.

Figure 1: Multikey quick sort recursion.

## Dimensions by Which to Judge String Sorting Algorithms

The following dimensions will affect which `string `

sorting algorithm performs the best:

#### Length of Input

Small arrays vs. large Arrays. .NET Sort will perform well on small `string `

arrays and on arrays where the `string`

s are short. As the input array grows in size, or the `string`

s in the array grow, .NET will take much longer to terminate.

#### Duplicate Strings

If there are many duplicate `string`

s, ternary partitioning (i.e. partitioning into three parts) in multikey quicksort (as implemented for `string`

s by Sedgewick) will pay off for its overhead. If duplicate `string`

s are not expected, binary partitioning algorithms will be faster. Ternary partitioning works in the following way. A pivot is chosen and in the end the elements are partitioned into three buckets in the following order: "`<= pivot`

", "`= pivot`

", and "`>= pivot`

". In contrast, binary partitioning will use only two buckets: "`<= pivot`

" and "`> pivot`

". Ternary partitioning is more complicated to implement without using an additional array as it first splits the elements in four groups in the order given: "`= pivot`

", "`< pivot`

", "`> pivot`

", "`= pivot`

". Then it moves back the two "`= pivot`

" blocks in the middle: "`< pivot`

", "`= pivot`

", "`= pivot`

", "`> pivot`

". In addition to having bigger inner loops than binary quick sort, one iteration of ternary quick sort requires more data swapping. The size of the code in the inner loop affects greatly the performance of quick sort. However, if the "`= pivot`

" blocks happen to be large, then there is less work to do when recursing quick sort on the "`< pivot`

" and "`> pivot`

" blocks. In the case of multikey quick sort, the pivot is a character and most likely will match many elements.

#### String Length

If the input array contains larger and larger `string`

s, the ratio between `string `

sort and generic sort in general becomes larger. This case is worsened if the `string`

s to be sorted contain a large number of long common prefixes. This is the case for sorting words from the English Language.

## Adaptation to Non-English Languages

The reference code provided sorts `string`

s by comparing the characters of `string`

s from first to last, and compares the characters by their ASCII codes. There are two problems: reverse sort order and non-English languages.

- First, even if you provide a
`char `

comparator sorting in *reverse* will not work unless the algorithm is changed. This is caused by empty `string`

s (or `string`

s that become empty as the depth parameter grows). It's possible to change the algorithm to account for reverse comparators, but providing a `char `

comparator only is not sufficient. - The second problem has to do with the fact that for many languages, a
`char `

comparator is not enough to derive a `string `

comparator. The issue is outlined in an MSDN article titled:
There are some locales in which there does not exist a one-to-one correspondence between characters and letters. To sort in non-English languages, one should break the sequence of characters into letters and then sort by letters. Luckily, .NET provides a class called `SortKey `

which will convert a `string `

in a non-English language to a byte array. Sorting `string`

s by their corresponding byte arrays will result in the correct sort order. In this way, the multikey quicksort algorithm can be easily adapted to operate on byte arrays instead on `string`

s. For details on converting `string`

s to `SortKey`

s, please check the referenced MSDN article. For efficiency purposes, `string`

s should be converted to `SortKey`

s before running the sorting algorithm.

## Comparisons

As explained in the previous section, careful comparison of sorting algorithms will have to account for at least the above mentioned input dimensions. It is best to keep two of the dimensions fixed and plot the running times against values of the third. Due to lack of time, I will show only running times where each of the three dimensions is relatively large. My code contains a function which will run and test all algorithms for automatically generated inputs of specified dimensions. This function is found in the F# file provided.

number of unique elements = 5000000;
number of duplicates for each unique element = 5;
length of each string = 20;
random seed = 384585;
================================================================
String Sort in C# : avg: 31.247667(sec.) 0.4 %
String Sort in F# : avg: 44.563966(sec.) 43.2 %
Sedgewick (translated to C#): avg: 31.119719(sec.) 0.0 %
Dot Net Generic Sort : avg: 145.528939(sec.) 367.6 %

*Fixed parameters as above, variable string length*
===============================================================
string length = 5
-------------------
String Sort in C# : avg: 25.546378(sec.) 1.1 %
String Sort in F# : avg: 33.159264(sec.) 31.2 %
Sedgewick (translated to C#): avg: 25.268513(sec.) 0.0 %
Dot Net Generic Sort : avg: 129.865247(sec.) 413.9 %
===============================================================
string length = 10
-------------------
String Sort in C# : avg: 29.166639(sec.) -6.5 %
String Sort in F# : avg: 46.012092(sec.) 47.4 %
Sedgewick (translated to C#): avg: 31.210900(sec.) 0.0 %
Dot Net Generic Sort : avg: 152.109868(sec.) 387.4 %
===============================================================
string length = 20
-------------------
String Sort in C# : avg: 37.304173(sec.) 16.2 %
String Sort in F# : avg: 52.745780(sec.) 64.3 %
Sedgewick (translated to C#): avg: 32.095941(sec.) 0.0 %
Dot Net Generic Sort : avg: 162.737198(sec.) 407.0 %
===============================================================
string length = 40
-------------------
String Sort in C# : avg: 39.446201(sec.) 9.1 %
String Sort in F# : avg: 52.786923(sec.) 46.0 %
Sedgewick (translated to C#): avg: 36.151525(sec.) 0.0 %
Dot Net Generic Sort : avg: 167.100856(sec.) 362.2 %
===============================================================
string length = 40
-------------------
String Sort in C# : avg: 44.527835(sec.) 26.9 %
String Sort in F# : avg: 54.424359(sec.) 55.1 %
Sedgewick (translated to C#): avg: 35.098257(sec.) 0.0 %
Dot Net Generic Sort : avg: 164.124873(sec.) 367.6 %

## Enjoy!