# String Extension Method - Mixed Alpha/Numeric Sort Order

By , 17 Feb 2009

## Introduction

People sort mixed text and numbers differently than most computers seem to. For example, ask a computer to order the list “A1”, “A2”, and “A10”, and more than likely, you will get “A1”, “A10”, “A2”. Intuitively, we expect “2” to come before “10”, not after. Most sorts use lexical order, comparing text one character at a time. '2' comes after '1', hence, “2” follows “10”, lexically.

The question becomes, how do you get a more natural ordering?

## Background

There are a few approaches to take. The simple one, which makes some assumption on the data to be ordered, assumes text contains only a number, or a string, but never both. Other approaches assume only one numeric portion, possibly at the front of the string. More advanced techniques create a custom comparer that checks two strings segment by segment, performing an ordering on numeric segments differently than ordering on non-numeric segments. This does require you be able to provide the sort algorithm or data with the comparison function.

The algorithm in this article uses a different approach. It assumes that you are stuck with lexical ordering. So, the technique now becomes one where you transform the original string to a new string with the property that, using a traditional lexical ordering, yields the natural order for the original untransformed strings.

First, I'll define a precise natural order that I can use to implement the conversion. A 'numeric' segment consists of a string having only characters '0', '1', '2', ... '9'. A 'non-numeric' segment has only characters other than '0', '1', '2', ... '9'. One can then decompose any string into a list of alternating numeric and non-numeric segments. For example, “C.10 CODE 2” decomposes to “C.”, “10”, “ CODE “, “2”, and “C.2 CODE 10” decomposes to “C.”, “2”, “ CODE “, “10”.

Using this decomposition, I define a simple natural order as: Compare the strings segment by segment. If both segments are numeric, treat them as integer representation when comparing, otherwise, use the typical lexical comparison on the segments. The first difference generates the final comparison result. For example, “C.10 CODE 2” and “C.2 CODE 10”, 2 < 10, giving “C.2 CODE 10” < “C.10 CODE 10”. Special care should be taken for numeric segments beginning with '0'. The numeric portion should still drive of the main comparison, but, if they compare equal, the longer zero padded one should follow the shorter. “01” orders to before “001”.

This leads to an implementation suggested by one of the approaches mentioned, defining a comparer. But, how can we transform “C.2 CODE 10” = C02 and “C.10 CODE 2” = C10 into new strings, such that they always lexically order to C02 < C10, using the same rules?

To solve this, note that numeric segments have a length on par with their order of magnitude (excluding left zero padding). We can solve the problem if we can code a magnitude that orders lexically, and prefix the number segment with it. If done this way, the lexical comparison will always compare numeric segments only for equal magnitude numbers, which indeed can be done character by character.

For this implementation, I assumed most numbers have fewer than 9 digits. The code transforms these numbers into a string with the magnitude as a single digit, followed by the number. For example:

• 4 has magnitude 10^0. Coding yields “0” + “4” = “04”.
• 12345 has magnitude 10^4. Coding yields “4” + “12345” = “412345”

I also assumed no one would ever need to order numeric segments longer than ~10^9 digits long. For numbers longer than 9 digits long, I prefix with '9', followed by the magnitude of the magnitude minus 9, followed by the magnitude minus 9, followed by the number -- These magnitudes will always be equal or greater than 9; subtracting 9 allows a shorter than average text string ("0" instead "9", "1" instead of "10", etc.) For example:

• 1234567890 has magnitude 10^9. 9-9 = 0 has magnitude 10^0. Coding yields “9” + “0” + “0” “1234567890”.
• A digit sequence 900 characters long has magnitude 10^899. 899-9 = 890 has magnitude 10^2. Coding yields “9” + “2” + “890” + [the 900 characters].

As mentioned, to order zero padded numbers correctly, the algorithm must append a count of left padded zero. I used the same method used to encode the magnitude, as zero padding. If the non-padded numbers are equal, and tack these to the end of the token. They only ever matter if ever number and non-numeric characters are equal.

This technique does have a drawback. It must process every character in every string, and produce another string. In contrast, a comparison technique may be able to return an ordering after comparing only the first character.

## Implementing the Algorithm

When writing code, I find that implementing a unit test results in better code and maintainability. I implemented a unit test first, to verify the token sorts as expected (Normally, I would write an nUnit test, but until I install and configure it on my home system, a manual test will have to do). I've removed some of the more tedious aspects of the unit test from the code snippet, but give a basic idea on usage, as well as verify the token generator works.

```public void TestOrder()
{
// Create a list sample to verify order is correct.
string[] testInput = new string[] { ... };

// Hand craft the order we expect to see assuming numeric ordering works correctly.
string[] expectedOrder = new string[] { ... };

// Use method to test order.
List<string> sortedOrder = new List<string>(
from i in testInput
orderby i.GetAlphaNumericOrderToken()
select i );

// Ensure order is as expected.
// ...

// Test succeeded.
}```

Note the use of a LINQ query, and the method to get the token -- the `GetAlphaNumericOrderToken()` extension method.

Getting into the working code itself, in my `StringExtensions `class, the algorithm takes the form:

```public static string GetAlphaNumericOrderToken
( this string s, bool includeHexPatterns, bool ignoreZeroPadding )
{
// Special case for null.
if( s == null )
{
return s;
}```

Note that extension methods can be invoked, even when calling with a `null `object. The first lines handle this special case, mapping the `null `string to a `null `string token.

```// ...
// Any sequence of digits [0-9] needs to be coded
// to ensure it is ordered correctly.
MatchCollection matches;
if( includeHexPatterns )
{
// Use the combination hex/integer patterns.
matches = integerAndHexRegex.Matches( s );
}
else
{
// Use the integer only hex/integer patterns.
matches = integerRegex.Matches( s );
}
if( matches.Count == 0 )
{
// No embedded integer text.  No transform is needed.
return s;
}```

I originally coded only to match integer sequences. A commenter noted some occasions call for ordering hex sequences as well. The regular expressions driving the two different modes differ only in including the hexadecimal pattern of "0x" followed by one or more hex symbols.

After checking for a special case where no embedded numbers were found, and some space management to generate some string builders into which the token will be written:

```// ...
int pos = 0;
foreach( Match match in matches )
{
// Add text preceding the integer text.
sb.Append( s, pos, match.Index - pos );

// Encode the length of integer text, skipping any initial zeros.
int len = match.Length;

// Don't count any '0' prefix characters in the length.
int zeros = 0;
{
while( zeros + 1 < len && s[match.Index + zeros] == '0' )
{
++zeros;
}
len -= zeros;
}```

The loop starts by appending text prior to the numeric text match. It then computes a length for the numeric text token. The zero padding option forces the zero padding comparisons to the end of the token. This way, a user can arbitrarily left zero pad numeric text sequences, and still get intuitive numeric order.

```    // ...
// Append length code, the number (without zeros), and the zero length code.
AppendLengthCode( sb, len );
sb.Append( s, match.Index + zeros, match.Length - zeros );
{
AppendLengthCode( zeroPadOrder, zeros + 1 );
}

// Set the position for non-numeric text.
pos = match.Index + match.Length;
}

sb.Append( s, pos, s.Length - pos );

{
sb.Append( " " );
}

// Return the token for ordering the string.
return sb.ToString();
}```

Using the technique described already, the logic codes the length, or magnitude, of the number into the text, followed by the number text itself, sans the left padded zeros.

Finally, once all the numeric segments have been processed, the zero padding tokens are appended to the end of the string.

The code for `AppendLengthCode(..) `follows pretty directly to the coding algorithm I described above:

```private static void AppendLengthCode( StringBuilder sb, int len )
{
// I expect most integer sequences to be 9 or fewer digits.
// This allows a single digit prefix to determine the
// relative order of these sequences.
const int largestSmallMagnitude = 9;
if( len <= largestSmallMagnitude )
{
// Append the single digit prefix to give overall order for the integer segment.
sb.Append( "012345678"[( len - 1 )] );
}
else
{
// The sequence is longer than 9 digits.
// Code with the special 'this is a long sequence'
// code that orders these numbers after the shorter
// numbers.
sb.Append( '9' );

// Get a 'length' code.  This is how long the digit sequence is.
// Can subtract the largest small magnitude to normalize it to 0
// for the smallest 'large' magnitude.
string lenText = ( len - largestSmallMagnitude - 1 ).ToString();

// double code the length.
// This allows coding up to 10^9, without consuming
// too much extra space (log(n) space)
sb.Append( lenText.Length.ToString() );
sb.Append( lenText );
}
}```

## Using the Code

I created a quick Windows Form wrapper to aid in visualization of the tokens and the sort order. Starting the application should show a form with a sample list of unorder strings:

Clicking the "Lexical Sort" button orders the list according to the standard lexical order of strings. This order works once you get used:

However, using the different check boxes, and clicking 'Sort' demonstrates a more human like order to the list:

One responder asked why I would handle zero padding as a special case.  The list order follows, if you do not ignore zero padding:

Such an order might have some fringe use, so I left in the option get this ordering. I also included an option to process hexadecimal strings, to demonstrate that any text patterns that can be translated into a magnitude + lexical token can be ordered this way.

This concludes the demonstration of the test application.

## Points of Interest

One interesting extension with this algorithm would extend the ability into the .NET SQL CLR. Using a user defined function would allow SQL queries to use the ordering directly:

```SELECT *
FROM MyTable
ORDER BY dbo.GetAlphaNumericOrderToken( MyTable.TextField )```

I chose not to do this currently, since I'm not sure how difficult getting C# Express and SQL Server Express working together in a CLR environment; deployment, etc. However, it should not be too difficult to accomplish this in a full version of Visual .NET Studio. Feel free to tackle it if you are keen on using this in SQL queries.

## History

• 17th February, 2009: Initial post

 Trent Tobler Software Developer (Senior) United States Member
No Biography provided

Votes of 3 or less require a comment

 Search this forum Profile popups    Spacing RelaxedCompactTight   Noise Very HighHighMediumLowVery Low   Layout Open AllThread ViewNo JavascriptPreview   Per page 102550
 First Prev Next
 Thx Alot MRoc 29 Apr '10 - 12:54
 Explorer Luc Pattyn 17 Feb '09 - 15:42
 FWIW: If you want to sort strings the way Windows Explorer sort filenames, you can use its code, in C# that takes P/Invoke with this prototype:   ```[DllImport("shlwapi.dll", CharSet=CharSet.Unicode, ExactSpelling=true)] private static extern int StrCmpLogicalW(string s1, string s2); ```     Luc Pattyn [Forum Guidelines] [My Articles] - before you ask a question here, search CodeProject, then Google - the quality and detail of your question reflects on the effectiveness of the help you are likely to get - use the code block button (PRE tags) to preserve formatting when showing multi-line code snippets Sign In·View Thread·Permalink
 Re: Explorer bscaer 24 Feb '09 - 4:54
 Interesting PIEBALDconsult 17 Feb '09 - 8:25
 Re: Interesting Trent Tobler 17 Feb '09 - 10:37
 Re: Interesting PIEBALDconsult 17 Feb '09 - 12:25
 Re: Interesting Trent Tobler 17 Feb '09 - 15:39
 Okay. I've added some additional code to demonstrate hex sorting, as well as the ability to turn off the zero pad processing.   I've also added a simple UI test harness to more easily visualize the different orderings, as well as provided more documentation on the code itself. Sign In·View Thread·Permalink
 Re: Interesting PIEBALDconsult 17 Feb '09 - 15:42
 Re: Interesting Trent Tobler 17 Feb '09 - 15:52
 Now you tell me! Lol.   It actually went in very smoothly... Just had to include a stronger RegEx for the hexidecimal coding - /((?<=0x)[0-9A-F]+)|([0-9]+)/. All other logic is identical.   It does have the interesting side effect that it always codes the 0 in the '0x' as a number too. "10xe1", gets coded as '10' followed by 'x' followed by hex sequencce 'e1'. A little twisted and warped, but I didn't want to spend too much time on a relatively fringe need. Sign In·View Thread·Permalink
 Re: Interesting supercat9 17 Feb '09 - 12:27
 There is no way I can see to define a sensible ordering for objects that may contain both variable-length decimal numbers that should sort in numerical order even when the following character is alphabetic, and hexadecimal numbers (whether fixed-length or variable length) unless rules are imposed to define what constitutes a hexadecimal number (e.g. a "0x" prefix not preceded by an alphanumeric character could specify that the following string of alphanumeric characters should not have digit strings handled specially). In that scenario, 0x1234 and 0x12AB would sort properly, though 0xFEA would sort after them (rather than before).   Logic that assumes that things that look like hex strings are, and those that don't look like hex strings aren't, is apt to produce ugly results more often than it will produce particularly good ones. Should "88balloons" sort after "99coconuts" but before "6bedbugs"? Total mess.   BTW, even when dealing with decimal numbers, it's not always clear how they should be ordered. For example, what should be the ordering for the following: 1.1 1.5 1.21? If the numbers represent quantities, then 1.21 should be between 1.1 and 1.5; if they represent section numbers, however, they should be sorted as given. Sign In·View Thread·Permalink
 Re: Interesting Luc Pattyn 17 Feb '09 - 14:44
 On the positive side, if it works for decimal numbers, it also works for octal numbers. That should please you.     Luc Pattyn [Forum Guidelines] [My Articles] - before you ask a question here, search CodeProject, then Google - the quality and detail of your question reflects on the effectiveness of the help you are likely to get - use the code block button (PRE tags) to preserve formatting when showing multi-line code snippets Sign In·View Thread·Permalink
 Re: Interesting PIEBALDconsult 17 Feb '09 - 15:17