Click here to Skip to main content
11,920,772 members (57,054 online)
Click here to Skip to main content
Add your own
alternative version


94 bookmarked

Fast String Sort in C# and F#

, 16 Jan 2011 CPOL
Rate this:
Please Sign up or sign in to vote.
Implementation of Multikey String Quick Sort (following Sedgewick)


The generic sort algorithm in .NET does not perform well on sorting strings. The reason for that is that it performs too many character comparisons while comparing strings.

There is a better algorithm described by Prof. Robert Sedgewick called multikey quicksort. Sedgewick is known for his developer friendly books on algorithms and his approach to empirically study algorithm performance. In the case of string sort, Sedgewick provides a reference implementation of string sort in C:

His implementation is quite hard to understand but achieves excellent performance.

String Sort Implementation in C# and F#

I provide three implementations of multikey quick sort algorithm in C# and F# (in addition to the pseudo-like code below). My implementations are slightly different from Sedgewick's and slightly slower, but much faster than the .NET implementation. Initially, I had difficulty adapting Sedgewick's implementation due to a corner case when the string can contain the '\0' character. In the new code, this problem is fixed and I provide an implementation in C# very close to the reference.

The problem with .NET sort is not in the implementation per se, but in the algorithm. .NET implements plain quick sort, which is one of the fastest algorithms for integers or doubles. Sedgewick proposes a few extensions to plain quick sort which make the algorithm useful for:

  • data which contains duplicates
  • multikey data, such as strings or arrays of characters

Additionally, Sedgewick uses two other tricks:

  • insertion sort for small arrays
  • pseudo-median of 9 to find a good pivot

Pseudo-code of Multikey Quicksort

The real code of string sort is somewhat hard to understand because it uses inplace update and multiple indices. The multikey quicksort algorithm however, is not. In this section, I provide actual F# code to serve as pseudo-code. This code can serve as a high-level specification of the actual algorithm. However, this pseudo-code omits details which affect performance.

let filter = Array.filter //used as an alias         		//line  1
let concatenate = Array.concat //another alias                  //line  2

let cmp (depth: int, s: string, t: string) =                    //line  3
    let lenS, lenT = s.Length, t.Length
    assert(lenS >= depth); assert(lenT >= depth)           	//line  4
    if (lenS = depth && lenT = depth) then (*return*) 0       	//line  5
    else if (lenS = depth) then (*return*) -1                   //line  6
    else if (lenT = depth) then (*return*) 1                    //line  7
        assert(lenS > depth); assert(lenT > depth)              //line  8
        let chS, chT = s.[depth], t.[depth]                     //line  9
        if chS < chT then (*return*) -1                         //line 10
        else if chS = chT then (*return*) 0                     //line 11
        else (*return*) 1                                   	//line 12

// Do not use in production. This code is meant to serve as pseudo-code only
// Production code uses details that are not presented here
let rec pseudoCodeMultiKeyQuickSort(input: string[], depth: int): string[] = //line 13
    if (input.Length = 0) then Array.empty                               	//line 14
        let pivot = input.[input.Length / 2] //in reality median of 3 or 9 is used //line 15
        //in reality the following filters are done in place in two passes         //line 16
        let less = filter(fun s -> cmp(depth, s, pivot) < 0) input //  s < pivot	//line 17
        let eq = filter(fun s -> cmp(depth, s, pivot) = 0) input   //  s = pivot 	//line 18
        let gt = filter(fun s -> cmp(depth, s, pivot) > 0) input   //  s > pivot 	//line 19
        let sortedLess = pseudoCodeMultiKeyQuickSort(less, depth) 		//line 20
        let sortedEq = if (eq.Length > 0 && depth < eq.[0].Length) then	//line 21
			(eq, depth + 1) //note that depth grows   //line 21
                       else eq                                       	//line 22
        let sortedGt = pseudoCodeMultiKeyQuickSort(gt, depth)        	//line 23
        //in reality the algorithm updates the same array and					
        //recursively passes from and to indices to define subarrays
        concatenate [|sortedLess; sortedEq; sortedGt|]              	//line 24

let pseudoStringSort (input: string[]) =                               	//line 25
    (*return*) pseudoCodeMultiKeyQuickSort(input, 0 (*depth*))         	//line 26

pseudoStringSort [|"aaa";"a";"b";"";"ba";"bbb"|]

Please pay attention to the comparison function (named cmp, line 3) which takes a depth parameter in addition to the two strings (s and t) to be compared. Also note how the algorithm recurses on two dimensions: subarrays and the depth parameter (see Figure 1). You can think that the algorithm operates on levels: first it sorts by the first character, finds all strings where the first character is equal to the pivot and, for this group of strings only, sorts by the second character. Special care has to be taken for strings which run out of characters as the depth parameter grows. Depth grows only on line 21. In that line, we have a block of strings which have the same prefix from 0 to depth (inclusive). As depth grows, it will eventually end up with empty strings. If the current suffix (starting at depth) of the first string in the block of equal strings (line 21) is the empty string (tested by eq.[0].Length condition), then because all strings in this block are equal, they are all empty. This is used as an termination condition to the recursion by depth. The other termination condition is that a subarray has become empty.

Finally, some notes on the F# syntax might be useful for those unfamiliar with F#. [|a;b;c|] is used to denote an array of a, b and c; arr.[index] is used to access an element at position index; let x = 2 is used to bind variable x to value 2; fun s -> f(s) is the F# syntax for a lambda expression, or a predicate which will take s and run f with argument s and return the result of f to the caller. The predicate may be executed multiple times.


Figure 1: Multikey quick sort recursion.

Dimensions by Which to Judge String Sorting Algorithms

The following dimensions will affect which string sorting algorithm performs the best:

Length of Input

Small arrays vs. large Arrays. .NET Sort will perform well on small string arrays and on arrays where the strings are short. As the input array grows in size, or the strings in the array grow, .NET will take much longer to terminate.

Duplicate Strings

If there are many duplicate strings, ternary partitioning (i.e. partitioning into three parts) in multikey quicksort (as implemented for strings by Sedgewick) will pay off for its overhead. If duplicate strings are not expected, binary partitioning algorithms will be faster. Ternary partitioning works in the following way. A pivot is chosen and in the end the elements are partitioned into three buckets in the following order: "<= pivot", "= pivot", and ">= pivot". In contrast, binary partitioning will use only two buckets: "<= pivot" and "> pivot". Ternary partitioning is more complicated to implement without using an additional array as it first splits the elements in four groups in the order given: "= pivot", "< pivot", "> pivot", "= pivot". Then it moves back the two "= pivot" blocks in the middle: "< pivot", "= pivot", "= pivot", "> pivot". In addition to having bigger inner loops than binary quick sort, one iteration of ternary quick sort requires more data swapping. The size of the code in the inner loop affects greatly the performance of quick sort. However, if the "= pivot" blocks happen to be large, then there is less work to do when recursing quick sort on the "< pivot" and "> pivot" blocks. In the case of multikey quick sort, the pivot is a character and most likely will match many elements.

String Length

If the input array contains larger and larger strings, the ratio between string sort and generic sort in general becomes larger. This case is worsened if the strings to be sorted contain a large number of long common prefixes. This is the case for sorting words from the English Language.

Adaptation to Non-English Languages

The reference code provided sorts strings by comparing the characters of strings from first to last, and compares the characters by their ASCII codes. There are two problems: reverse sort order and non-English languages.

  • First, even if you provide a char comparator sorting in reverse will not work unless the algorithm is changed. This is caused by empty strings (or strings that become empty as the depth parameter grows). It's possible to change the algorithm to account for reverse comparators, but providing a char comparator only is not sufficient.
  • The second problem has to do with the fact that for many languages, a char comparator is not enough to derive a string comparator. The issue is outlined in an MSDN article titled:

    There are some locales in which there does not exist a one-to-one correspondence between characters and letters. To sort in non-English languages, one should break the sequence of characters into letters and then sort by letters. Luckily, .NET provides a class called SortKey which will convert a string in a non-English language to a byte array. Sorting strings by their corresponding byte arrays will result in the correct sort order. In this way, the multikey quicksort algorithm can be easily adapted to operate on byte arrays instead on strings. For details on converting strings to SortKeys, please check the referenced MSDN article. For efficiency purposes, strings should be converted to SortKeys before running the sorting algorithm.


As explained in the previous section, careful comparison of sorting algorithms will have to account for at least the above mentioned input dimensions. It is best to keep two of the dimensions fixed and plot the running times against values of the third. Due to lack of time, I will show only running times where each of the three dimensions is relatively large. My code contains a function which will run and test all algorithms for automatically generated inputs of specified dimensions. This function is found in the F# file provided.

number of unique elements = 5000000;
number of duplicates for each unique element = 5;
length of each string = 20;

random seed = 384585;
String Sort in C#           :   avg: 31.247667(sec.)    0.4 %
String Sort in F#           :   avg: 44.563966(sec.)    43.2 %
Sedgewick (translated to C#):   avg: 31.119719(sec.)    0.0 %
Dot Net Generic Sort        :   avg: 145.528939(sec.)   367.6 %
Fixed parameters as above, variable string length
string length = 5 
String Sort in C#           :   avg: 25.546378(sec.)    1.1 %
String Sort in F#           :   avg: 33.159264(sec.)    31.2 %
Sedgewick (translated to C#):   avg: 25.268513(sec.)    0.0 %
Dot Net Generic Sort        :   avg: 129.865247(sec.)   413.9 %

string length = 10
String Sort in C#           :   avg: 29.166639(sec.)    -6.5 %
String Sort in F#           :   avg: 46.012092(sec.)    47.4 %
Sedgewick (translated to C#):   avg: 31.210900(sec.)    0.0 %
Dot Net Generic Sort        :   avg: 152.109868(sec.)   387.4 %

string length = 20
String Sort in C#           :   avg: 37.304173(sec.)    16.2 %
String Sort in F#           :   avg: 52.745780(sec.)    64.3 %
Sedgewick (translated to C#):   avg: 32.095941(sec.)    0.0 %
Dot Net Generic Sort        :   avg: 162.737198(sec.)   407.0 %

string length = 40
String Sort in C#           :   avg: 39.446201(sec.)    9.1 %
String Sort in F#           :   avg: 52.786923(sec.)    46.0 %
Sedgewick (translated to C#):   avg: 36.151525(sec.)    0.0 %
Dot Net Generic Sort        :   avg: 167.100856(sec.)   362.2 %
string length = 40
String Sort in C#           :   avg: 44.527835(sec.)    26.9 %
String Sort in F#           :   avg: 54.424359(sec.)    55.1 %
Sedgewick (translated to C#):   avg: 35.098257(sec.)    0.0 %
Dot Net Generic Sort        :   avg: 164.124873(sec.)   367.6 %



This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


About the Author

Stefan Savev 2
United States United States
No Biography provided

You may also be interested in...

Comments and Discussions

GeneralMy vote of 5 Pin
davide-pro-99917-Oct-12 7:51
memberdavide-pro-99917-Oct-12 7:51 
GeneralMy vote of 5 Pin
hazeleekaizera15-Oct-12 19:49
memberhazeleekaizera15-Oct-12 19:49 
GeneralMy vote of 5 Pin
kishore doni2-Sep-12 21:17
memberkishore doni2-Sep-12 21:17 
Questionhei Pin
Gun Gun Febrianza20-Aug-12 4:10
memberGun Gun Febrianza20-Aug-12 4:10 
GeneralMy vote of 5 Pin
Gun Gun Febrianza20-Aug-12 4:09
memberGun Gun Febrianza20-Aug-12 4:09 
GeneralMy vote of 5 Pin
Dan Randolph4-Jul-12 9:16
memberDan Randolph4-Jul-12 9:16 
GeneralMy vote of 5 Pin
manoj kumar choubey28-Mar-12 1:41
membermanoj kumar choubey28-Mar-12 1:41 
GeneralMy vote of 5 Pin
Kanasz Robert1-Dec-11 3:48
memberKanasz Robert1-Dec-11 3:48 
GeneralMy vote of 5 Pin
RaviSant5-May-11 1:11
memberRaviSant5-May-11 1:11 
GeneralVery Good Article Pin
RaviSant5-May-11 1:11
memberRaviSant5-May-11 1:11 
GeneralNice Pin
CIDev17-Jan-11 10:00
memberCIDev17-Jan-11 10:00 
GeneralCode for performance comparison is missing Pin
Alois Kraus15-Jan-11 12:00
memberAlois Kraus15-Jan-11 12:00 
GeneralRe: Code for performance comparison is missing Pin
Stefan Savev 215-Jan-11 12:45
memberStefan Savev 215-Jan-11 12:45 
GeneralRe: Code for performance comparison is missing Pin
Alois Kraus17-Jan-11 10:23
memberAlois Kraus17-Jan-11 10:23 
GeneralRe: Code for performance comparison is missing Pin
Stefan Savev 217-Jan-11 10:36
memberStefan Savev 217-Jan-11 10:36 
GeneralMy Vote of 5 Pin
RaviRanjankr15-Jan-11 1:07
memberRaviRanjankr15-Jan-11 1:07 
GeneralThanks a five Pin
KenJohnson14-Jan-11 10:02
memberKenJohnson14-Jan-11 10:02 
GeneralBroken link Pin
Henry.Ayoola14-Jan-11 6:09
memberHenry.Ayoola14-Jan-11 6:09 
GeneralRe: Broken link Pin
Stefan Savev 214-Jan-11 7:03
memberStefan Savev 214-Jan-11 7:03 
GeneralRe: Broken link Pin
RugbyLeague14-Jan-11 7:04
memberRugbyLeague14-Jan-11 7:04 
Generalthanks for sharing - have 5 Pin
Pranay Rana14-Jan-11 5:31
memberPranay Rana14-Jan-11 5:31 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.

| Advertise | Privacy | Terms of Use | Mobile
Web03 | 2.8.151120.1 | Last Updated 17 Jan 2011
Article Copyright 2011 by Stefan Savev 2
Everything else Copyright © CodeProject, 1999-2015
Layout: fixed | fluid