Click here to Skip to main content
Email Password   helpLost your password?

naturalcomparer.png

Introduction

Did you notice how Explorer in XP is intelligent enough to sort the files in a natural order?

If you have 10 files on your hard disk, they will show in this order:

file1.txt

file2.txt
file3.txt

file4.txt
file5.txt

file6.txt
file7.txt

file8.txt
file9.txt

file10.txt

However, if you try in under DOS, they will appear this way:

file1.txt

file10.txt
file2.txt
...

The reason for that is that DOS uses a simple alphabetical search.

The aim of this article is to show how I think Explorer does this better than DOS and provide to the CodeProject readers a class to reproduce this in their .NET programs.

Background

The .NET framework uses the IComparer interface a lot. This interface is very simple to implement; it contains a single member:

'VB
Public Function Compare(ByVal x As Object, ByVal y As Object) As Integer
//C#
int IComparer.Compare(object x, object y)

The function you must provide returns an integer which must be:

Comparing a null reference (Nothing in Visual Basic) with any reference type is allowed, and does not generate an exception. A null reference is considered to be less than any reference that is not null.

Using the code

The class NaturalComparer can be used anywhere the .NET framework requires an IComparer. That is about every operation which involves sorting data. The most common is probably Array.Sort.

In the demo program, for example, I use:

Array.Sort(lines, New NaturalComparer())

This will sort the lines using the natural order I have briefly described above.

How does this work

This NaturalComparer class uses a couple of StringParser classes to compare each item.

The StringParser is a pretty straightforward character parser. It eats the characters of the string, and returns through its member NextToken a series of tokens. Each token is either numerical or string.

public void NextToken()
{
  do
  {
    if (mCurChar == '\0')
    {
      mTokenType = NaturalComparer.TokenType.Nothing;
      mStringValue = null;
      return; 
    }
    else if (char.IsDigit(mCurChar))
    {
      mTokenType = NaturalComparer.TokenType.Numerical;
      ParseNumericalValue();
      return; 
    }
    else if (char.IsLetter(mCurChar))
    {
      mTokenType = NaturalComparer.TokenType.String;
      // This can also optionally return
      // numericals in case of Roman Numerals
      ParseString();
      return; 
    }
    else
    {
      // Ignore this character and loop some more 
      NextChar();
    }
  } while (true);
}

The StringParser ignores the punctuations and spaces.

The NaturalComparer has very little to do. It will get the first token from each string and compares their numerical values if both are numerical, if not compares the string values.

int System.Collections.Generic.IComparer<string>.Compare(string string1, 
                                                 string string2)
{
  mParser1.Init(string1);
  mParser2.Init(string2);
  int result;
  do
  {
    if (mParser1.TokenType == TokenType.Numerical & 
                 mParser2.TokenType == TokenType.Numerical)
      // both string1 and string2 are numerical 
      result = decimal.Compare(mParser1.NumericalValue, mParser2.NumericalValue);
    else
      result = string.Compare(mParser1.StringValue, mParser2.StringValue);
    if (result != 0) return result;
    else
    {
      mParser1.NextToken();
      mParser2.NextToken();
    }
  } while (!(mParser1.TokenType == TokenType.Nothing & 
             mParser2.TokenType == TokenType.Nothing));  
  return 0; //identical 
}

Points of interest

As an option, you can ask the NaturalComparer to detect and parse Roman numerals.

New NaturalComparer(NaturalComparerOptions.RomanNumbers)

The problem is that sometimes the comparer could mix a valid English name for a Roman number. This is okay if the other side of the comparison is a string, but it can mess your sort order if the other side is a number or another false Roman numeral positive.

So, use this option if you believe the likelihood of having Roman numerals is worth messing the order.

History

You must Sign In to use this message board.
 
 
Per page   
 FirstPrevNext
GeneralThis is great! But.....
nfiskeo
4:42 18 Dec '08  
How could I implement this if my data is not in a string array but rather in a generic list?
GeneralThanks
Kahiko
8:34 8 Oct '08  
Just wanted to drop a line and say thank you for your examples.

I've also added the ability to sort both ascending and descending by doing the following:

Add a new enum:

Public Enum NaturalComparerDirection
Ascending
Descending
End Enum


Add the following to the NaturalComparer classs:

Private mSortAscending As Boolean = True

Overloaded sub new:

Sub New(ByVal NaturalComparerOptions As NaturalComparerOptions, ByVal Direction As NaturalComparerDirection)
If Direction = NaturalComparerDirection.Descending Then mSortAscending = False
mNaturalComparerOptions = NaturalComparerOptions
mParser1 = New StringParser(Me)
mParser2 = New StringParser(Me)
End Sub

And changed the Compare method:

Public Function Compare(ByVal string1 As String, ByVal string2 As String) As Integer Implements System.Collections.Generic.IComparer(Of String).Compare
mParser1.Init(string1)
mParser2.Init(string2)
Dim result As Integer
Do
If mParser1.TokenType = TokenType.Numerical And mParser2.TokenType = TokenType.Numerical Then
' both string1 and string2 are numerical
result = Decimal.Compare(mParser1.NumericalValue, mParser2.NumericalValue)
Else
result = String.Compare(mParser1.StringValue, mParser2.StringValue)
End If
If result <> 0 Then
' if the sort direction is decending then reverse the result
If Not mSortAscending Then
If result = -1 Then result = 1 Else result = -1
End If
Return result
Else
mParser1.NextToken()
mParser2.NextToken()
End If
Loop Until mParser1.TokenType = TokenType.Nothing And mParser2.TokenType = TokenType.Nothing
Return 0 'identical
End Function
GeneralEven more natural
Seth Morris
20:27 22 Jan '08  
I've made a change locally to call NextToken() a second time in StringParser's Init() if the first token is a string in a "skip list." My skip list is {"a", "an", "the", "le", "la", "les", "un"}(1,2), but a more detailed implementation of would make that a settable property.

I can't decide if "A The Who Retrospective" should sort to T or W, so I don't know if NextToken() should be called until the result is not in the skip list. However, "For Whom The Bell Tolls" should sort after "For Whom I Cry."

And don't get me started on the band name The The.


1) "Le" can give you trouble between names ("Le Duc" should sort to L) and nouns ("Le Petit Prince" should sort to P). Fortunately, if you only expect have English names you can skip this.
2) Note that an app for sorting porn might require modifying this skip list in other ways.
GeneralRe: Even more natural
Pascal Ganaye
3:01 4 Jul '08  
this is certainly useful.
Personnally I think I would not remove more than one skip list.
I am not really sure either.
GeneralRe: Even more natural
supercat9
9:59 4 Jul '08  
I really don't see any reasonable way to handle things like Roman numerals and title prefixes without markup. Small roman numerals might not pose too much trouble since, if spaces are significant, i, ii, iii, iv, v, vi, vii, and viii all be handled correctly with a simple ordinal sort if there aren't any non-numeric items to sort against. If one were to sort "ix", "xix", and "xxix" as "viiii", "xviiii", and "xxviiii", things would be fine if one didn't mix Roman numerals with words in the same sorting situation (I doubt anyone would think to look for "Bob IX" between "Bob Vig" and "Bob Vin"). I see no good way to handle numbers 49 and up, since "id" is a word that should sort between "ice" and "idea", and "d" is a letter that should sort before "i".

Title prefixes, however, are going to be totally impossible to sort reasonably without markup. The titles "I Corinthians", "II Corinthians", "I Tried", "I James", and "II James" should sort in that order, but I see no reasonable way to achieve that. "A is for Apple" should sort under "A", not "Is", but I can't imagine any algorithm to determine that.


Last Updated 4 Jul 2008 | Advertise | Privacy | Terms of Use | Copyright © CodeProject, 1999-2010