Click here to Skip to main content
Click here to Skip to main content

Fast List<String> Sort in C#

By , 23 Aug 2012
 

Introduction

While searching for how to speed up string sorting in .NET collections, I came across this article: Fast String Sort in C# and F# 

I downloaded the StringSort class from the article, tested the speed, and found it sorts 50,000 strings in about 40 to 50% less time than the Generic IEnumerable object sorting methods.  This is when the compare method is used with the Ordinal option. Otherwise results are much worse for standard QuickSort string sorting.

So, to build on the previous project by Stefan Savev 2, I have derived a new sfList class from List<string>.

Using the code 

For your List<string> objects, you simply use sfList instead. The only thing that is different is the method it uses to sort strings.
using System;
using System.Collections.Generic;
using Sorting.CSharpStringSort;

namespace SortTests.Sorting
{
  public class sfList : List<string>
  {
    public sfList() : base() { }

    public sfList(int size) : base(size) { }

    public sfList(IEnumerable<String> aArray) : base(aArray) { }

    public new void Sort()
    {
      //StringSort.Sort(this);
      string[] a = this.ToArray();
      this.Clear();
      //sort array and refill contents of this (faster than directly sorting List)
      StringSort.Sort(ref a);
      this.AddRange(a);
    }
  }
}

From my initial testing, I could see that it is faster to copy the List to a string[], clear the list, sort the array using StringSort.Sort and refill the list with the result. This is what is done by replacing the List.Sort method with my call to StringSort.Sort. A very small amount of time is used to do the two extra copies compared to the sort time.

The StringSort class Sort method has been changed to do an in-place Sort on string[].

There is also a overloaded version of Sort that takes an unsorted List<string> parameter and returns a copy of a of a sorted list.

public static void Sort(string[] inout)
{
    InPlaceSort(inout, 0, 0, inout.Length);
}
public static List<string> Sort(List<string> sList)
{
  string[] aCopy = sList.ToArray();
  InPlaceSort(aCopy, 0, 0, aCopy.Length);
  return new List<string>(aCopy);
}

Other experimental overloads and unused private methods have been removed from my original posted version.

The first time SortTests.exe is ran, it generates a randomized list of values for sorting by using multiple reads of dic.txt. Random numbers are also appended to the original string values.

More Generic Collections

I have included a KeyedList<TKey, TValue> Class that is derived from List<T>.

I implemented the fast string sorting algorithm for this class' Sort method with good results.

The combined time of filling the KeyedList with values and then sorting it is much faster than adding the same values to a SortedList<string, string> collection, for example.

Here is the sort method that uses a Sort method overload to sort an array of KeyValuePair on string keys: 

    /// <summary>
    /// Use for default sorting on int or string keys
    /// </summary>
    public new void Sort()
    {
      if (typeof(TKey) == typeof(String))
      {
        KeyValuePair<TKey, TValue>[] kvpArray = this.ToArray();
        this.Clear();
        this.Sort(kvpArray); //sort String key type
        this.AddRange(kvpArray);
        if (!this.SortAscd)
        {
          this.Reverse();
        }
      }
      else
      {
        this.Sort(CompareKey);
      }
      //this.Sort(CompareKey);
      _IsSorted = true;
    }
public void Sort(KeyValuePair<TKey, TValue>[] sList)
{
  InPlaceSort(sList, 0, 0, sList.Length);
}

void InPlaceSort(KeyValuePair<TKey, TValue>[] input, int depth, int st, int ed)
{
   //this is all the same as StringSort except that input[indx].Key.ToString() is the sort key
}


For more on KeyedList, see my code sample here: Derive a KeyedList<TKey, Tvalue> generic class

Stopwatch Tests 

I improved that Stopwatch class in .NET. by deriving a new class from it call TimeCounter. Total Ticks are converted to milliseconds and rounded to the nearest 10 ms. This gives more consistent results. The practical resolution of Stopwatch is no more than 10 ms so I don't want to show a bunch of extra digits that don't add any real information. What interests me in these test is actually the relative differences among the different sort tests.

Sort test with 397540 elements		
                                          Milliseconds	  Relative Differences
#Array.Sort<string>(string[], CompareFun)  980            1.0000
#List<string>.Sort(CompareFun)	           860	          0.8776
*StringSort.Sort(List<string>)	           550	          0.5612
sfList.Sort()	                           540	          0.5510
StringSort.Sort(string[])	           530	          0.5408
Sedgewick.Sort(string[])	           450	          0.4592

# Without CompareFun with Ordinal option this method is slower by about 3x		
* Returns a new copy of List<string>		


 

So from these elapsed ms tests, you can see that both Array.Sort() and List.Sort() (original Microsoft sort methods) are around 45% slower than StringSort. With my improved timing method, I found that Sedgewick is somewhat faster than StringSort (over 50% faster than standard Array.Sort). Sorting strings using Microsoft quicksort implementation seems to have a large overhead. Replacisng List.Sort() in a derived class or using the StringSort.Sort(List<string>) overload yields a measurably quicker solution.

I also did tests on a 64-bit Windows 7 system to see if the relative difference results were different. These test results are in the download under SortTest.xsl.

History

First published July 4, 2012

Added Stopwatch tests results for Windows 7 64-bit system - July 7, 2012

Improved Stopwatch timing method and added relative differences calculation. Implemented a version of StringSort in derived KeyedList class - July 17, 2012

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Dan Randolph
Software Developer (Senior) Idea
United States United States
Member
Dan Randolph is currently a GIS Developer for Idea, and a graduate in Computer Science from the University of Wyoming. He is an active member of the Denver Visual Studio User Group. You can find him posting in the forums on Microsoft.com, Code Project, and LinkedIn.

Sign Up to vote   Poor Excellent
Add a reason or comment to your vote: x
Votes of 3 or less require a comment

Comments and Discussions

 
You must Sign In to use this message board.
Search this forum  
    Spacing  Noise  Layout  Per page   
QuestionLets be honest...memberSledgeHammer0127 Aug '12 - 13:25 
Not intended as a bash or anything, just an observation Smile | :) Smile | :) , but it took you 397540 elements to save 530ms in runtime.
 
1) 530ms is not going to be noticable to any user. Thats 1/2 a second, you know Smile | :) .
 
2) Are you really going to store 397540 elements in memory in the real world? Probably not. They are going to come from a database of some sort. SQL is going to have its own sort facility.
 
And those times are comparing the worst possible method to the best possible method. With the List, the savings is only 410ms.
AnswerRe: Lets be honest...memberDan Randolph27 Aug '12 - 18:08 
No, I left out the worst possible sorting. That would be List.Sort()
without the compare function. The standard string sorting does not use the Ordinal compare option and is much slower. I thought it would be unfair to put this in my comparison. But, you can look at the original article for the benchmarks. Fast String Sort in C# and F#[^]
GeneralMy vote of 4memberVitaly Tomilov18 Jul '12 - 8:34 
An article about sorting algorithms that doesn't even mention anything about different sorting algorithms and how they compare,...well, is not good, imho.
 
This guy looks almost like you, though he explains better: http://www.cprogramming.com/tutorial/computersciencetheory/sortcomp.html[^] Smile | :)
 
Another one: http://www.devx.com/vb2themax/Article/19900/[^]
GeneralRe: My vote of 4memberDan Randolph18 Jul '12 - 10:02 
If you read closely, this article is an extension (Alternative) to "Fast String Sort in C# and F#".
The point of it is to show how to use sorting methods from other article to speed up string-type sorting in generic collections.
QuestionnicememberCIDev18 Jul '12 - 5:07 
A nicely written article and a useful class.
Just because the code works, it doesn't mean that it is good code.

QuestionClear after sort safer?groupPaul @ The Computer Station12 Jul '12 - 23:14 
Hi,
 
Only a minor point, but would it not be safer to clear the List after sorting the array? This way, should an unexpected error occur in the sort, the List would not end up empty.
 
I.e. Rather than:
 
string[] a = this.ToArray();
this.Clear();
//sort array and refill contents of this (faster than directly sorting List)
StringSort.Sort(ref a);
this.AddRange(a);
 
Do:
 
string[] a = this.ToArray();
//sort array and refill contents of this (faster than directly sorting List)
StringSort.Sort(ref a);
this.Clear();
this.AddRange(a);

AnswerRe: Clear after sort safer?memberDan Randolph14 Jul '12 - 4:46 
I called Clear() first to avoid holding memory in case of really large arrays. This way, the recursive InPlaceSort method has more workspace since copying the List to the array doubles the allocated memory.
 
If it fails, it will most likely fail at making a copy of the list to the array if you exceed memory limits.
I have also experienced a worst case scenario for the C-library qsort when sorting a large amount of integers that were already in nearly sorted order. Each integer had many duplicates so it ended up with worst-case partitioning. In this case, the application had to be killed anyway so error handling would not have mattered.
 
You could do something like this instead to minimize total memory used.
 
string[] a = this.ToArray();
this.Clear();
try
{
  //sort array and refill contents of this (faster than directly sorting List)
  StringSort.Sort(ref a);
  this.AddRange(a);
}
catch (Exception ex)
{
  this.AddRange(a); //something went wrong so just add back what we have
  //todo: set exception flag and/or indicate sort failed
}

GeneralRe: Clear after sort safer?groupPaul @ The Computer Station15 Jul '12 - 23:03 
Thanks for the explanation; I had not considered the memory usage issue. As you say, that would be the most likely reason for failure and therefore my proposal would actually make failure more likely. Making the assumption that the array a will not be corrupted by StringSort.Sort(), then what you propose seems far more sensible than keeping a copy of the original array for roll-back purposes.
GeneralRe: Clear after sort safer?mvpPaulo Zemek23 Aug '12 - 6:57 
The Clear, as fas as I know, does not reduces the list capacity, so it will not let the memory be freed.
So, or you should also set the capacity to a small value, or it is better to do the Clear after the sort.
AnswerRe: Clear after sort safer?memberDan Randolph23 Aug '12 - 13:21 
It does not reduce the list capacity, but the list just contains 4-byte or 8-byte references to the objects in the list. So when you clear the list, it allows the GC to free the unreferenced objects that were in the list.
 
There probably is not much additional memory used by the the sort algorithm anyway so it really does not matter when you clear the list.
GeneralRe: Clear after sort safer?mvpPaulo Zemek23 Aug '12 - 14:13 
But in that case, all the references are still valid because they are in the array to be sorted... that's why it will not change memory use at all.
GeneralRe: Clear after sort safer?memberDan Randolph23 Aug '12 - 16:57 
    static void Main(string[] args)
    {
 
      List<string> ls = new List<string>();
      ls.Add("A");
      ls.Add("B");
      ls.Add("C");
 
      Console.WriteLine("Size of list {0}", GetObjectSize(ls));
 
      String [] arStr = ls.ToArray();
      arStr[0] = "D";
      ls[0] = "E";
      Console.WriteLine("arStr[0] = {0} ls[0] = {1}", arStr[0], ls[0]);
 
      Console.WriteLine("Size of Array {0}", GetObjectSize(arStr));
 
      ls.Clear();
 
      Console.WriteLine("Size of list after clear {0}", GetObjectSize(ls));
   }
 
    private static long GetObjectSize(object obj)
    {
      var bf = new BinaryFormatter();
      var ms = new MemoryStream();
      bf.Serialize(ms, obj);
      var size = ms.Length;
      ms.Dispose();
      return size;
    }
List(T) to array does a shallow copy. In the case of string types, it copies the whole string, not the reference address.
 
Compile this code and you will get this output:
Size of list 226
arStr[0] = D ls[0] = E
Size of Array 48
Size of list after clear 206

GeneralRe: Clear after sort safer?mvpPaulo Zemek24 Aug '12 - 3:31 
I was talking about memory usage, not about serialization size.
A list of size 100 = the list itself in memory, plus an array of size 128 (check the capacity if you want to know the size of the internal array) + the data for the items.
 
When you clear the list, those items can theorically be collected, but as they are held by another array, the only thing Clear does is setting the Count to 0 and putting null to all items in the internal array. The Capacity will stay the same (in case of 100 items, 128).
 
About serialization, that depends on how the serializer works. I can tell that my implementation of a serializer will store a list with count 0 as count 0... it will not lose time storing the null items in the inner array, but that's an implementation of the serializer... I think the original one simple puts all the fields on the binary file, so:
The Count field.
The Array field.
All the items on the array, even if they are null.
 
See, it's very different.
GeneralRe: Clear after sort safer? [modified]mvpPaulo Zemek24 Aug '12 - 3:37 
Also, it is not the List to array that does a shallow copy, it is the serialization.
 
If you want to check, after copying a list to an array, use object.ReferenceEquals(list[index], array[index]). If the return is true (and it will be) that means that it is the same instance. So only the the Capacity of the list * the size of a reference (4 or 8 bytes, depending if it is 32 bits or 64 bits) is the real size of the list (+ a fixed size... but I don't know what that fixed size is) in memory. And that does not change with a clear.

modified 24 Aug '12 - 9:55.

GeneralRe: Clear after sort safer?memberDan Randolph28 Aug '12 - 4:20 
Yes, you are correct that only the references to the strings are copied. (The same thing on assignments.) I was confused about this issue. Thanks for clearing this up. It is good to know that two copies of the string objects are not in memory after an Array.Copy().
 
That being said, the Clear() has to be done at some point or you will end up with both the original data and the sorted data in the list. It might also be good practice to reset the capacity in the final list result or use TrimExcess().
GeneralRe: Clear after sort safer?mvpPaulo Zemek28 Aug '12 - 4:40 
Doing the clear after the array sort and before the AddRange will solve the problem.
 
About the TrimExcess... I can't say how useful it is... after all, if the List was already trimmed, it is of no use. If it was not but you plan to add new items later, it may be a bad decision.
But if it is to be done, the best is not TrimExcess. The best is to Clear it, set the Capacity to the array length, then do AddRange of the array. It will avoid one extra copy (that will be done with TrimExcess).

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Rant Rant    Admin Admin   

Permalink | Advertise | Privacy | Mobile
Web02 | 2.6.130516.1 | Last Updated 23 Aug 2012
Article Copyright 2012 by Dan Randolph
Everything else Copyright © CodeProject, 1999-2013
Terms of Use
Layout: fixed | fluid