Natural Sort Comparer

By , 28 Dec 2007

Introduction

Recently there was a post on Coding Horror which caught my attention. It was about natural sorting vs. ASCII sorting, or as it was referred to "Alphabetical" vs. "ASCIIBetical". The argument goes like this:

Given a set of strings ("a1", "a2" ... "a9", "a10", "a11") a normal sort will sort as follows:

a1
a10
a11
a2
a3 (etc.)

But humans normally want to see them sorted as follows:

a1
a2
a3
...
a9
a10
a11

I didn't totally agree with the statement that the first method is "ASCIIBetical" rather than "Alphabetical", given the fact that, as I understand it, A100 comes before A2 Alphabetically (check your phone book). However I agree that the second method seems more natural when dealing with filenames and the like.

I read a few of the comments and ran across a blog entry by Ian Griffiths with this implementation of a C# solution. I liked it, but there were a couple of problems for me.

1. The full implementation didn't seem encapsulated to the Comparer.
2. It was using C# 3.5 (LINQ and Anonymous Types), which made it a little less portable for those who haven't moved to 2008 yet (like the projects I'm currently working on).

So using this as a starting point, I came up with the following implementation.

Using the Code

The NaturalComparer class can be dropped into any project, but would probably make the most sense somewhere in your framework project(s). Normally I don't allow any class outside of a namespace, but as this is sample code, I left it that way so you could easily move it wherever you wanted.

using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public class NaturalComparer : Comparer<string>, IDisposable
{
private Dictionary<string, string[]> table;

public NaturalComparer()
{
table = new Dictionary<string, string[]>();
}

public void Dispose()
{
table.Clear();
table = null;
}

public override int Compare(string x, string y)
{
if(x == y)
{
return 0;
}
string[] x1, y1;
if(!table.TryGetValue(x, out x1))
{
x1 = Regex.Split(x.Replace(" ", ""), "([0-9]+)");
}
if(!table.TryGetValue(y, out y1))
{
y1 = Regex.Split(y.Replace(" ", ""), "([0-9]+)");
}

for(int i = 0; i < x1.Length && i < y1.Length; i++)
{
if(x1[i] != y1[i])
{
return PartCompare(x1[i], y1[i]);
}
}
if(y1.Length > x1.Length)
{
return 1;
}
else if(x1.Length > y1.Length)
{
return -1;
}
else
{
return 0;
}
}

private static int PartCompare(string left, string right)
{
int x, y;
if(!int.TryParse(left, out x))
{
return left.CompareTo(right);
}

if(!int.TryParse(right, out y))
{
return left.CompareTo(right);
}

return x.CompareTo(y);
}
}

This may seem a little heavy at first, but I'll take a walk through the code to explain what I did and why I did it.

Starting at the top, I originally implemented the IComparer interface, but it occurred to me that this is really at its heart a string comparison, so I switched and descended from Comparer<string> to inherit whatever other useful stuff might come with it.

The original implementation simply split the values using the Regex expression and iterated through the splits for each call. This worked fine for a small set of say 10-20 values, but when I stress tested it with 1,000,000 randomly generated values, it took over 2 minutes to sort. I'm not usually one for over-optimization, but that just wouldn't fly. My first approach to optimizing was to cache the split values in a HashTable, reasoning that it was likely the same values would be passed in to Compare multiple times. This worked great, and reduced the 1,000,000 test to just short of 12 seconds. Not bad, eh? But, of course, once I'm on the optimization kick, I just can't let it go. Before moving on, if you're on .NET 1.0 or 1.1, you'll want to implement the HashTable approach here, due to the fact that the other optimizations were .NET 2.0 only changes.

It occurred to me that every single hash table lookup and retrieve required an unboxing (e.g. casting back to string[]), and I figured I could do better. I replaced the HashTable with a generic Dictionary to get rid of the boxing/unboxing. This dropped the time down to 9.162 seconds. We can still do better though.

The original implementation first checked if the value was in the Dictionary, and if it was, then retrieved it.

if (!table.ContainsKey(x))
{
x1 = Regex.Split(x.Replace(" ", ""), "([0-9]+)");
}
else
x1 = table[x];

Around this time, I discovered something I didn't know before about .NET 2.0. There's a TryGetValue function on the Dictionary, allowing you to check for and retrieve the value in one step. That's great! The double lookup (one to check for the key and one to retrieve it) just reduced by half! This dropped the time for 1,000,000 down to 7.751 seconds. A few more little tweaks like an initial string equality check and removing a check for y.Length from the loop dropped the final time down to 6.220 seconds. Not bad for having started at 2 minutes. My only disappointment is that the normal default sort sorted the same data in about 2 seconds. Running at 1/3rd the speed of the normal sort didn't make me happy, but considering that it's a more complicated sort, I was willing to live with it. I tried applying the same caching technique to the parsed integer values in PartCompare, but it had little or no effect on the sort time, so I removed it. I had apparently reached the threshold of diminishing return.

This left me with my only remaining concern, the size of the Dictionary. I tried a couple of approaches here but couldn't really find one that worked for me. The concern was that this would bloat the memory requirement of any application using it. My first approach involved spinning off a Thread or a Timer to call back after a few seconds of non-use to clear out the table. This not only didn't really work, but it bloated the time back up to 14 seconds. That wouldn't do. I finally compromised and added the IDisposable interface. The Dispose method simply clears out the table, then nulls the reference so the GC can pick it up. Not perfect, but if you put the Comparer in a using block, it works pretty nicely. I also made table non-static as it didn't seem likely that this data would need to be shared across instances. This allowed me to do the cleanup in the Dispose method. I then moved the creation of the Dictionary to the constructor. I'm not sure if this helped or not, but seemed like a cleaner implementation.

public void Dispose()
{
table.Clear();
table = null;
}

So walking through the Compare method, this first part simply checks if the strings are identical. This may seem like wasted cycles, but in the case of a duplicate string, it will save a whole lot of parsing. Feel free to drop it if you like.

if(x == y)
{
return 0;
}

This next part uses a regular expression to break the string up into component parts based on numbers. For instance, "z25" would become {"z", "25"}. First though, it checks to see if we've already parsed this string, and if so just retrieves the cached value. If not, it breaks it and stores it. Once for x and once for y.

string[] x1, y1;
if(!table.TryGetValue(x, out x1))
{
x1 = Regex.Split(x.Replace(" ", ""), "([0-9]+)");
}
if(!table.TryGetValue(y, out y1))
{
y1 = Regex.Split(y.Replace(" ", ""), "([0-9]+)");
}

The next part iterates through the sections of the parsed string until it finds one that doesn't match. Comparing z1 and z5 will skip the initial "z", for instance, and go straight on to comparing 1 and 5. It passes these values on to PartCompare.

for(int i = 0; i < x1.Length && i < y1.Length; i++)
{
if(x1[i] != y1[i])
{
return PartCompare(x1[i], y1[i]);
}
}

PartCompare is the real heart of the comparison here. The first thing it tries to do is convert both strings to integers. If either fails, it returns a normal string comparison. For example, if 1 and 5 come in, they are compared numerically. If A and 5 come in, they are compared alphabetically. The last line is a simple integer comparison.

if(!int.TryParse(left, out x))
{
return left.CompareTo(right);
}

if(!int.TryParse(right, out y))
{
return left.CompareTo(right);
}

return x.CompareTo(y);

Coming back to Compare, the last few lines take care of the exception cases. These lines are only hit if all of one string appeared in another, for example, z95 and z95a. Only z and 95 would be compared, and both parts would come back as equal. In this case, I simply compare the length of the arrays to determine the sort. Whichever one is shorter, alphabetically speaking, comes first. In theory, the last "return 0" should never be hit, due to the initial string comparison at the top of the method, but I haven't tested that for sure.

if(y1.Length > x1.Length)
{
return 1;
}
else if(x1.Length > y1.Length)
{
return -1;
}
else
{
return 0;
}

Finally, when the comparison is done, the class should theoretically be disposed and the table cleared. The following code was used to test out the implementation (you can see where I've commented out the stress code that ran 1,000,000 values through the comparer. The initial string array is for testing the logic (e.g. does it sort the way I think it should).

List<string> testItems = new List<string>(new string[]
{
"abc92", "2", "z24", "z2", "z15", "z1", "b1","b6",
"z 21", "z22", "1", "5", "3", "b2","abc1", "abc9", "abc9z4",
"z3", "b3","z20", "a5", "z11", "b5", "b4"
});
//const int size = 1000000;
TimeSpan naturalsorttime;
TimeSpan normalsorttime;
//Random rnd = new Random(DateTime.Now.Millisecond);

//List<string> testItems = new List<string>(size);
//for(int i=0;i<size;i++)
//   testItems.Add(((char)('A' + rnd.Next(25))) + rnd.Next(100).ToString());
List<string> testitems2 = new List<string>(testItems);

DateTime start = DateTime.Now;
using(NaturalComparer comparer = new NaturalComparer())
{
testItems.Sort(comparer);
}
DateTime stop = DateTime.Now;

naturalsorttime = stop - start;

start = DateTime.Now;
testitems2.Sort();
stop = DateTime.Now;

normalsorttime = stop - start;

Console.WriteLine("Natural Comparison: ");
foreach(string _item in testItems)
{
Console.WriteLine(_item);
}
Console.WriteLine("Normal Comparison");
foreach(string _item in testitems2)
{
Console.WriteLine(_item);
}

Console.WriteLine("Elapsed time for natural sort: "+naturalsorttime);
Console.WriteLine("Elapsed time for normal sort : " + normalsorttime);

This produces the following output:

Natural Comparison:
1
2
3
5
a5
abc1
abc9
abc9z4
abc92
b1
b2
b3
b4
b5
b6
z1
z2
z3
z11
z15
z20
z 21
z22
z24
Normal Comparison
1
2
3
5
a5
abc1
abc9
abc92
abc9z4
b1
b2
b3
b4
b5
b6
z 21
z1
z11
z15
z2
z20
z22
z24
z3
Elapsed time for natural sort: 00:00:00.0050005
Elapsed time for normal sort : 00:00:00

Special thanks to Ian Griffiths for the original implementation I started from. I think the only thing left is the regular expression, but that's pretty key to the whole thing. It also makes whitespace non-significant to comparison (which is why z 21 comes after z20 instead of before z1), which was apparently a suggestion by Charles Petzold, well-known programming God. The blog page stated that the code sample was released under the MIT license which requires a disclaimer. However I couldn't find a prepared disclaimer on the code, just a link to the template. If I need to add the disclaimer to this code, I will.

Jeff Atwood seemed to think that an implementation should be 40 lines of code or less. I guess I missed the mark, and I probably could have reduced the size if I had used 3.0/3.5, and I have this annoying habit of putting brackets on everything, whether they're required or not. Even so, 74 lines isn't bad. :-)

History

• 28th December, 2007: Initial post

 Justin.Jones Web Developer United States Member
I started out in life as a musician. I started playing the violin at age 4, and I've never really stopped. Even when I fully believed my life would be spent as a musician, I had a fondness for computers.

At age 11 my parents bought our first computer, a Texas Instruments TI 99/4A. This is where I learned BASIC, spending countless hours typing in game source code from the back of magazines and various books.

I began seriously learning about the IBM when I was 16 and accidently trashed MS-DOS 4 on my Dad's new 386 trying to remove 688 Attack Sub. The phone call to tech support went something like "I've gotta get it working, my Dad's gonna KILL me!".

After a couple years of living the musician life in college, I found it wasn't condusive to maintaining scholarships. I took a few years off, regrouped, and refocused on programming, finally completing a degree.

I've been actively working in the IT field as a programmer since 1997, having just been sacked from my warehouse job at Elek-Tek when it declared bankruptcy. The first part of my career was as a C++ programmer, with a brief foray into Java in 2000, moving into ASP web development in 2001 and C# ASP.NET development in 2003.

I'm a Microsoft MCP, and still actively participate in the local music community.

Votes of 3 or less require a comment

 Search this forum Profile popups    Spacing RelaxedCompactTight   Layout Open AllThread ViewNo JavascriptPreview   Per page 102550
 First Prev Next
 My vote of 5 Abinash Bishoyi 12 Apr '13 - 4:11
 My vote of 5 shenron 29 May '11 - 23:40
 Decimal/Float sorting. Kellros 21 Nov '08 - 23:24
 It seems a bit slower, but it gets the job done.   public class HumanStringComparer : IComparer { private Regex compRegex = new Regex(@"[-+]?[0-9]*\.?[0-9]+", RegexOptions.Compiled | RegexOptions.CultureInvariant | RegexOptions.Singleline); public HumanStringComparer() : base() { //Default constructor }   #region IComparer Members   public int Compare(string x, string y) { return DecimalCompare(x, y); }   private int DecimalCompare(string x, string y) { int originalComparison = string.Compare(x, y); //if it's the same string, return 0. if (originalComparison == 0) { return 0; }   //Compare only to smallest string int smallCount = 0; if (x.Length > y.Length) { smallCount = y.Length; } else { smallCount = x.Length; }   for (int ix = 0; ix != smallCount; ix++) { //Walk through the strings comparing, check if the first difference found is a digit. if (x[ix] != y[ix]) { //First difference found, check if they are digits. if (char.IsDigit(x[ix]) && char.IsDigit(y[ix])) { //Get the first values if both are digits string a = compRegex.Match(x).Value; string b = compRegex.Match(y).Value; decimal da = 0; decimal db = 0; if (decimal.TryParse(a, out da) && decimal.TryParse(b, out db)) { int dCompare = decimal.Compare(da, db); if (dCompare != 0) { return dCompare; } } } else { return originalComparison; } } }   //Last resort, in cases of example: xyz123mx && zyx123 where one string is part of another. return originalComparison; }   #endregion } Sign In·View Thread·Permalink
 Contribution: VB-Version MathiasW 1 Jan '08 - 19:37
 For developers like me, who favor to have such tasks available in VB, instead of using a separate project in a VB.NET-solution, I contribute my translation of the NaturalComparer class to this article.   Imports System Imports System.Collections.Generic Imports System.Text.RegularExpressions   Public Class NaturalComparer Inherits Comparer(Of String) Implements IDisposable   Private table As Dictionary(Of String, String())   Public Sub New() MyBase.New() table = New Dictionary(Of String, String()) End Sub   Public Sub Dispose() Implements IDisposable.Dispose table.Clear() table = Nothing End Sub   Public Overrides Function Compare(ByVal x As String, ByVal y As String) As Integer If (x = y) Then Return 0 End If Dim y1() As String Dim x1() As String If Not x Is Nothing AndAlso Not table.TryGetValue(x, x1) Then x1 = Regex.Split(x.Replace(" ", ""), "([0-9]+)") table.Add(x, x1) End If If Not y Is Nothing AndAlso Not table.TryGetValue(y, y1) Then y1 = Regex.Split(y.Replace(" ", ""), "([0-9]+)") table.Add(y, y1) End If Dim i As Integer = 0 Do While ((i < x1.Length) AndAlso (i < y1.Length)) If (x1(i) <> y1(i)) Then Return PartCompare(x1(i), y1(i)) End If i = (i + 1) Loop If (y1.Length > x1.Length) Then Return 1 ElseIf (x1.Length > y1.Length) Then Return -1 Else Return 0 End If End Function   Private Shared Function PartCompare(ByVal left As String, ByVal right As String) As Integer Dim y As Integer Dim x As Integer If Not Integer.TryParse(left, x) Then Return left.CompareTo(right) End If If Not Integer.TryParse(right, y) Then Return left.CompareTo(right) End If Return x.CompareTo(y) End Function End Class   Regards,   Mathias Wührmann FLEXact Sign In·View Thread·Permalink
 Re: Contribution: VB-Version Justin.Jones 3 Jan '08 - 18:04
 Decimals? Chris Wuestefeld 31 Dec '07 - 5:58
 Nice implementation. I like how you did the benchmarking and optimization; really good work there.   Two weeks ago I posted my own stab at the problem, you might be interested in a look: http://www.codeproject.com/KB/dotnet/SortingStringsForHumans.aspx[^]   I didn't attempt the optimization, and you've got a good approach to that here. While you're working on that, you might consider using weak references for the values in the Dictionary, in order to avoid the need for IDisposable. Also, I wonder if using the Regex's Compiled option might have helped at all.   A couple of decision that I made differently:   1. You've made whitespace irrelevant; I think it should at least count as a token separator ("z1" should sort before "z 21"). 2. You're not able to deal with decimals. You'll order "1.1" before "1.02". 3. I went farther in "humanizing" it -- I discard everything but letters and numbers. Maybe a bad idea?   That last one is certainly debatable, but I think that the first two things need to be addressed. Sign In·View Thread·Permalink
 Re: Decimals? Justin.Jones 31 Dec '07 - 18:00
 I was surprised at how similar the code looks. I suppose that's to be expected since we were working on the same problem. Chris Wuestefeld wrote:1. You've made whitespace irrelevant; I think it should at least count as a token separator ("z1" should sort before "z 21"). It actually does, but I see what you mean. I tried "4 2" and "40". "4 2" came second because it parsed as 42. I hadn't anticipated that. I might have to disagree with Petzold on that one after all. Chris Wuestefeld wrote:2. You're not able to deal with decimals. You'll order "1.1" before "1.02". True, something else I hadn't thought of. At first glance it seems like I can fix it by changing the regular expression to "([.][0-9]+)", but I would have to parse floating point numbers instead of integers. That might take a toll on the performance.   Chris Wuestefeld wrote:3. I went farther in "humanizing" it -- I discard everything but letters and numbers. Maybe a bad idea? Probably not. I'm thinking that in a more robust implementation these should be options so that the developer can decide whether or not to take the performance hit. I think a lot of situations probably wouldn't need these options, but it's frustrating to not have them when you need them.   Chris Wuestefeld wrote:While you're working on that, you might consider using weak references for the values in the Dictionary, in order to avoid the need for IDisposable. Also, I wonder if using the Regex's Compiled option might have helped at all. Thanks, those are good suggestions. I'm going to try it out and see how it works. The IDisposable was annoying me, to be honest.   J Make the logo bigger Sign In·View Thread·Permalink
 Thanks for the quick and interesting reading   "Real programmers just throw a bunch of 1s and 0s at the computer to see what sticks" - Pete O'Hanlon Sign In·View Thread·Permalink
 Very good article Rjard Sima 29 Dec '07 - 1:53
 Sort order - numbers before letters or vice versa? MathiasW 28 Dec '07 - 22:38
 Hi,   first of all, I want to thank you for this nice comparer for natural sorting. It will be of great use for me.   There is just one thing I'm wondering about. Your "natural sort order" produces:   a5 abc1 abc9 abc9z4 abc92 b1   Looking at this (a5, ab, ...), it seems like numbers come first, then letters. But the order abc9, abc9z4, abc92 seems to be in conflict with this assumption. Shouldn't it be abc9, abc92, abc9z4?!   Regards,   Mathias Wuehrmann Sign In·View Thread·Permalink
 Re: Sort order - numbers before letters or vice versa? OR0N 29 Dec '07 - 5:29
 Re: Sort order - numbers before letters or vice versa? Justin.Jones 29 Dec '07 - 19:28
 OR0N wrote:9 < 92, therefore abc9z4 is placed before abc92   That's how I saw it too. abc9z4 parses as "abc" "9" "z" "4". The first number portion parses as 9 (since 9z4 isn't a number in base 10). While this example itself doesn't make much sense, somebody creating strings as Chapter1section1 Chapter1section2 ... Chapters1section11 (as a stupid example off the top of my head) would think this seems normal.   J Make the logo bigger Sign In·View Thread·Permalink
 Re: Sort order - numbers before letters or vice versa? MathiasW 1 Jan '08 - 19:20
 Ooops, I just found there was a bug in my pattern recognation and updated my brain coding to fix it - thanks! Sign In·View Thread·Permalink
 the p/invoke way... Chorlton Dragon 28 Dec '07 - 21:07
 Re: the p/invoke way... Justin.Jones 29 Dec '07 - 19:34
 Well sure, if you want to do it the easy way. I started out in C++ and we never do things the easy way.   My motivation for this was part intellectual curiosity and part I actually ran into this problem once and gave the stupid developer answer that Jeff so despises about a year ago. I'll probably actually use this on my new project and I try to avoid p/invoke or unsafe code where possible. In other words, I wanted to know if it could be done in managed code without being dog-a** slow.   J Make the logo bigger Sign In·View Thread·Permalink
 the 'unsafe' way ... OR0N 28 Dec '07 - 19:36
 For a few months I've been using this version of logical comparer.   Works exactly the same way your comparer does, no idea about the performance though   public class LogicalStringComparer : Comparer { public override int Compare(string x, string y) { unsafe { if (x == y) return 0;   int lastResult = 0; char* temp = (char*)IntPtr.Zero;   fixed (char* curx = x) { fixed (char* cury = y) { char* cx = curx; char* cy = cury;   while (true) { if (*cx == '\0' || *cy == '\0') return lastResult;   if (char.IsLetter(*cx) && char.IsLetter(*cy)) { lastResult = cx->CompareTo(*cy);   if (lastResult != 0) { return lastResult; } cy++; cx++; continue; }   if (char.IsWhiteSpace(*cx)) { cx++; continue; }   if (char.IsWhiteSpace(*cy)) { cy++; continue; }   int nx = -1, ny = -1;   if (char.IsNumber(*cx)) { temp = cx; while (char.IsNumber(*cx)) cx++; nx = int.Parse(new string(temp, 0, (int)((cx) - temp))); }   if (char.IsNumber(*cy)) { temp = cy; while (char.IsNumber(*cy)) cy++; ny = int.Parse(new string(temp, 0, (int)((cy) - temp))); }   if (nx > -1 && ny > -1) lastResult = nx.CompareTo(ny); if (nx > -1 && ny == -1) lastResult = -1; if (nx == -1 && ny > -1) lastResult = 1;   cy++; cx++; } } } } } } Sign In·View Thread·Permalink
 Re: the 'unsafe' way ... Justin.Jones 29 Dec '07 - 19:36
 I remember pointers!   I have to say you've got me curious how this compares performance-wise. I'll give it a try. Thanks!   J Make the logo bigger Sign In·View Thread·Permalink
 Last Visit: 31 Dec '99 - 18:00     Last Update: 25 May '13 - 11:12 Refresh 1