Natural Sort Comparer

Justin.Jones

4.62/5 (18 votes)

Dec 28, 2007

CPOL

8 min read

107502

1434

An implementation of a Natural Sort Comparer

Introduction

Recently there was a post on Coding Horror which caught my attention. It was about natural sorting vs. ASCII sorting, or as it was referred to "Alphabetical" vs. "ASCIIBetical". The argument goes like this:

Given a set of strings ("a1", "a2" ... "a9", "a10", "a11") a normal sort will sort as follows:

a1
a10
a11
a2
a3 (etc.)

But humans normally want to see them sorted as follows:

a1
a2
a3 
...
a9
a10
a11

I didn't totally agree with the statement that the first method is "ASCIIBetical" rather than "Alphabetical", given the fact that, as I understand it, A100 comes before A2 Alphabetically (check your phone book). However I agree that the second method seems more natural when dealing with filenames and the like.

I read a few of the comments and ran across a blog entry by Ian Griffiths with this implementation of a C# solution. I liked it, but there were a couple of problems for me.

The full implementation didn't seem encapsulated to the Comparer.
It was using C# 3.5 (LINQ and Anonymous Types), which made it a little less portable for those who haven't moved to 2008 yet (like the projects I'm currently working on).

So using this as a starting point, I came up with the following implementation.

Using the Code

The NaturalComparer class can be dropped into any project, but would probably make the most sense somewhere in your framework project(s). Normally I don't allow any class outside of a namespace, but as this is sample code, I left it that way so you could easily move it wherever you wanted.

using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public class NaturalComparer : Comparer<string>, IDisposable
{
    private Dictionary<string, string[]> table;

    public NaturalComparer()
    {
        table = new Dictionary<string, string[]>();
    }

    public void Dispose()
    {
        table.Clear();
        table = null;
    }

    public override int Compare(string x, string y)
    {
        if(x == y)
        {
            return 0;
        }
        string[] x1, y1;
        if(!table.TryGetValue(x, out x1))
        {
            x1 = Regex.Split(x.Replace(" ", ""), "([0-9]+)");
            table.Add(x, x1);
        }
        if(!table.TryGetValue(y, out y1))
        {
            y1 = Regex.Split(y.Replace(" ", ""), "([0-9]+)");
            table.Add(y, y1);
        }

        for(int i = 0; i < x1.Length && i < y1.Length; i++)
        {
            if(x1[i] != y1[i])
            {
                return PartCompare(x1[i], y1[i]);
            }
        }
        if(y1.Length > x1.Length)
        {
            return 1;
        }
        else if(x1.Length > y1.Length)
        {
            return -1;
        }
        else
        {
            return 0;
        }
    }

    private static int PartCompare(string left, string right)
    {
        int x, y;
        if(!int.TryParse(left, out x))
        {
            return left.CompareTo(right);
        }

        if(!int.TryParse(right, out y))
        {
            return left.CompareTo(right);
        }

        return x.CompareTo(y);
    }
}

This may seem a little heavy at first, but I'll take a walk through the code to explain what I did and why I did it.

Starting at the top, I originally implemented the IComparer interface, but it occurred to me that this is really at its heart a string comparison, so I switched and descended from Comparer<string> to inherit whatever other useful stuff might come with it.

The original implementation simply split the values using the Regex expression and iterated through the splits for each call. This worked fine for a small set of say 10-20 values, but when I stress tested it with 1,000,000 randomly generated values, it took over 2 minutes to sort. I'm not usually one for over-optimization, but that just wouldn't fly. My first approach to optimizing was to cache the split values in a HashTable, reasoning that it was likely the same values would be passed in to Compare multiple times. This worked great, and reduced the 1,000,000 test to just short of 12 seconds. Not bad, eh? But, of course, once I'm on the optimization kick, I just can't let it go. Before moving on, if you're on .NET 1.0 or 1.1, you'll want to implement the HashTable approach here, due to the fact that the other optimizations were .NET 2.0 only changes.

It occurred to me that every single hash table lookup and retrieve required an unboxing (e.g. casting back to string[]), and I figured I could do better. I replaced the HashTable with a generic Dictionary to get rid of the boxing/unboxing. This dropped the time down to 9.162 seconds. We can still do better though.

The original implementation first checked if the value was in the Dictionary, and if it was, then retrieved it.

if (!table.ContainsKey(x))
{
    x1 = Regex.Split(x.Replace(" ", ""), "([0-9]+)");
    table.Add(x, x1);
}
else
    x1 = table[x];

Around this time, I discovered something I didn't know before about .NET 2.0. There's a TryGetValue function on the Dictionary, allowing you to check for and retrieve the value in one step. That's great! The double lookup (one to check for the key and one to retrieve it) just reduced by half! This dropped the time for 1,000,000 down to 7.751 seconds. A few more little tweaks like an initial string equality check and removing a check for y.Length from the loop dropped the final time down to 6.220 seconds. Not bad for having started at 2 minutes. My only disappointment is that the normal default sort sorted the same data in about 2 seconds. Running at 1/3^rd the speed of the normal sort didn't make me happy, but considering that it's a more complicated sort, I was willing to live with it. I tried applying the same caching technique to the parsed integer values in PartCompare, but it had little or no effect on the sort time, so I removed it. I had apparently reached the threshold of diminishing return.

This left me with my only remaining concern, the size of the Dictionary. I tried a couple of approaches here but couldn't really find one that worked for me. The concern was that this would bloat the memory requirement of any application using it. My first approach involved spinning off a Thread or a Timer to call back after a few seconds of non-use to clear out the table. This not only didn't really work, but it bloated the time back up to 14 seconds. That wouldn't do. I finally compromised and added the IDisposable interface. The Dispose method simply clears out the table, then nulls the reference so the GC can pick it up. Not perfect, but if you put the Comparer in a using block, it works pretty nicely. I also made table non-static as it didn't seem likely that this data would need to be shared across instances. This allowed me to do the cleanup in the Dispose method. I then moved the creation of the Dictionary to the constructor. I'm not sure if this helped or not, but seemed like a cleaner implementation.

public void Dispose()
{
    table.Clear();
    table = null;
}

So walking through the Compare method, this first part simply checks if the strings are identical. This may seem like wasted cycles, but in the case of a duplicate string, it will save a whole lot of parsing. Feel free to drop it if you like.

if(x == y)
{
    return 0;
}

This next part uses a regular expression to break the string up into component parts based on numbers. For instance, "z25" would become {"z", "25"}. First though, it checks to see if we've already parsed this string, and if so just retrieves the cached value. If not, it breaks it and stores it. Once for x and once for y.

string[] x1, y1;
if(!table.TryGetValue(x, out x1))
{
    x1 = Regex.Split(x.Replace(" ", ""), "([0-9]+)");
    table.Add(x, x1);
}
if(!table.TryGetValue(y, out y1))
{
    y1 = Regex.Split(y.Replace(" ", ""), "([0-9]+)");
    table.Add(y, y1);
}

The next part iterates through the sections of the parsed string until it finds one that doesn't match. Comparing z1 and z5 will skip the initial "z", for instance, and go straight on to comparing 1 and 5. It passes these values on to PartCompare.

for(int i = 0; i < x1.Length && i < y1.Length; i++)
{
    if(x1[i] != y1[i])
    {
        return PartCompare(x1[i], y1[i]);
    }
}

PartCompare is the real heart of the comparison here. The first thing it tries to do is convert both strings to integers. If either fails, it returns a normal string comparison. For example, if 1 and 5 come in, they are compared numerically. If A and 5 come in, they are compared alphabetically. The last line is a simple integer comparison.

if(!int.TryParse(left, out x))
{
    return left.CompareTo(right);
}

if(!int.TryParse(right, out y))
{
    return left.CompareTo(right);
}

return x.CompareTo(y);

Coming back to Compare, the last few lines take care of the exception cases. These lines are only hit if all of one string appeared in another, for example, z95 and z95a. Only z and 95 would be compared, and both parts would come back as equal. In this case, I simply compare the length of the arrays to determine the sort. Whichever one is shorter, alphabetically speaking, comes first. In theory, the last "return 0" should never be hit, due to the initial string comparison at the top of the method, but I haven't tested that for sure.

if(y1.Length > x1.Length)
{
    return 1;
}
else if(x1.Length > y1.Length)
{
    return -1;
}
else
{
    return 0;
}

Finally, when the comparison is done, the class should theoretically be disposed and the table cleared. The following code was used to test out the implementation (you can see where I've commented out the stress code that ran 1,000,000 values through the comparer. The initial string array is for testing the logic (e.g. does it sort the way I think it should).

List<string> testItems = new List<string>(new string[]
      {
          "abc92", "2", "z24", "z2", "z15", "z1", "b1","b6",
          "z 21", "z22", "1", "5", "3", "b2","abc1", "abc9", "abc9z4",
          "z3", "b3","z20", "a5", "z11", "b5", "b4"
      });
    //const int size = 1000000;
    TimeSpan naturalsorttime;
    TimeSpan normalsorttime;
    //Random rnd = new Random(DateTime.Now.Millisecond);

    //List<string> testItems = new List<string>(size);
    //for(int i=0;i<size;i++)
    //   testItems.Add(((char)('A' + rnd.Next(25))) + rnd.Next(100).ToString());
    List<string> testitems2 = new List<string>(testItems);
    
    DateTime start = DateTime.Now;
    using(NaturalComparer comparer = new NaturalComparer())
    {
        testItems.Sort(comparer);
    }
    DateTime stop = DateTime.Now;
    
    naturalsorttime = stop - start;

    start = DateTime.Now;
    testitems2.Sort();
    stop = DateTime.Now;

    normalsorttime = stop - start;

    Console.WriteLine("Natural Comparison: ");
    foreach(string _item in testItems)
    {
        Console.WriteLine(_item);
    }
    Console.WriteLine("Normal Comparison");
    foreach(string _item in testitems2)
    {
        Console.WriteLine(_item);
    }

    Console.WriteLine("Elapsed time for natural sort: "+naturalsorttime);
    Console.WriteLine("Elapsed time for normal sort : " + normalsorttime);
    Console.Read();

This produces the following output:

Natural Comparison:
1
2
3
5
a5
abc1
abc9
abc9z4
abc92
b1
b2
b3
b4
b5
b6
z1
z2
z3
z11
z15
z20
z 21
z22
z24
Normal Comparison
1
2
3
5
a5
abc1
abc9
abc92
abc9z4
b1
b2
b3
b4
b5
b6
z 21
z1
z11
z15
z2
z20
z22
z24
z3
Elapsed time for natural sort: 00:00:00.0050005
Elapsed time for normal sort : 00:00:00

Special thanks to Ian Griffiths for the original implementation I started from. I think the only thing left is the regular expression, but that's pretty key to the whole thing. It also makes whitespace non-significant to comparison (which is why z 21 comes after z20 instead of before z1), which was apparently a suggestion by Charles Petzold, well-known programming God. The blog page stated that the code sample was released under the MIT license which requires a disclaimer. However I couldn't find a prepared disclaimer on the code, just a link to the template. If I need to add the disclaimer to this code, I will.

Jeff Atwood seemed to think that an implementation should be 40 lines of code or less. I guess I missed the mark, and I probably could have reduced the size if I had used 3.0/3.5, and I have this annoying habit of putting brackets on everything, whether they're required or not. Even so, 74 lines isn't bad. :-)

History

28^th December, 2007: Initial post