Update, 2015-06-29 - Project is now on GitHub and NuGet. The code has evolved to support different component lengths, cross-process thread safety, and better performance. See the GitHub repository for source code and test cases. Get the NuGet package here.
Update, 2012-09-10 - Added SQL Server performance info
Update, 2012-09-07 - The code has been updated to clear up some issues and add more testing routines, and I added some more robust explanations in the article as well.
If D657-BHF6-ZI6E-Z3A2 looks like a better identifier to you than 9a6484a2-eb08-4aea-980a-a9694bf279dc, then read on.
This article arises out of a slew of annoyances with assigning identifiers in large, distributed apps with multiple datastores. I have a short, not too unreasonable list of must-haves for identifiers, and unfortunately, most existing ID schemes fail one or more of them:
- Sufficiently unique
- Completely decoupled
- Efficiently indexed
- Wieldy / humanly readable
- Not a performance drag
About the Code
The solution includes the class library for the ID generator, as well as a console test harness. Download it and run the console app for a demonstration of how many IDs your PC can generate in one second; my main laptop can do a little over 85k. There are a few rudimentary built-in safeguards against collisions and clock problems; a section towards the end of the article explains these a bit.
To generate a new ID without needing to know what's going on under the hood, just reference the class library and call the static
IdGenerator.NewBase36() method, or
IdGenerator.NewBase36(delimiter) to render it as four delimiter-separated chunks. You can use this internally on just about any platform, across datastores, without worrying about collisions or sortability (IDs start with an ascending sequence).
The benefits of the criteria above are pretty self-evident. Uniqueness is a prerequisite of any identifier, but the definition of "sufficiently" varies. Sometimes things only need to be unique within their relational table or entity type, as with an auto-incrementing SQL identifier. Sometimes they need to be unique across a wider context, perhaps even universally (as GUIDs are) so they can identify promiscuously shared things like binaries and interface versions. The definition of "sufficiently" used here is within the context of a distributed platform, like a social network, search engine or line-of-business SaaS platform. An ascending integral number does not achieve this, but a GUID does (though it isn't sequential or easily indexed).
Decoupling ensures that no component of the system will ever fail to generate an ID; depending on another server or service introduces a point of failure that's not there with, say, a GUID.
Efficient indexing just ensures that the ID won't slow an app's reads or writes down. IDs should be portable and independent of the storage layer, but many storage providers arrange the physical record layout by the sort order of the ID. Let's respect this with some sort of ascending ordering. SQL Sequential IDs achieve this, but they create a server dependency and add a few other hardships as well.
Wieldy and human-readable go hand in hand. GUIDs are unwieldy; ever try to read one off to someone, or type it by hand? Thirty-two hex characters, with hard-to-predict dash placement, are unreasonable for communication purposes. It's not that hard to read a credit card number over the phone, though, so 16 characters should be fine. Case-sensitivity and special characters, however, are not fine. Many people in the developed world still say "backslash" when reading URLs, and until we overcome this as a species, I'm sure as hell not introducing casing or non-alphanumeric characters into an identifier scheme, as some 25-character "short GUID" implementations do.
I trust that "not a performance drag" needs no explanation.
So we're looking for something like a GUID in many ways ... relatively unique across an entire system's lifetime (but not necessarily universally). It should also be shorter than a GUID and easily indexed like an integral number or sequential GUID. It can't have a server dependency, and it must be case-insensitive and free of significant special characters.
The use cases here don't call for opacity or strong security. GUIDs aren't guessable, but these identifiers can be fully guessable and still work fine (the solution here is only partially guessable). As for performance, generation overhead should be negligible in any real-world usage scenario, and storage space shouldn't exceed a GUID's. A fixed length wouldn't hurt either, as some datastores persist, index and seek amongst fixed-length values more efficiently.
A Bit About Uniqueness
Definitions of "reasonably unique" will vary, but let's consider some not completely arbitrary usage scenarios. According to some webpage, Google US was performing 400 million queries a day at some point recently. Some other equally reputable website said there were about 21.7 billion credit card transactions in 2006. This generator can cover those volumes without issues. As for longevity, I'll arbitrarily say that 115 years would be an extraordinarily long lifetime for a system, so if we can fulfill the need for that long without collision, it should be able to handle about any enterprise app without issues. (There are provisions to milk the scheme for another 4,000 years if necessary, though.)
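Those volume and longevity claims are easy to sanity-check with some back-of-the-envelope arithmetic. This Python sketch assumes the scheme described below: at most one ID every 10 microseconds per node, and a 10-character base 36 timestamp.

```python
# Back-of-the-envelope check of the capacity and lifetime claims.

IDS_PER_DAY = 24 * 60 * 60 * 1_000_000 // 10   # one ID per 10 microseconds
print(IDS_PER_DAY)  # 8,640,000,000 per node per day

# 400 million queries/day and 21.7 billion transactions/year both fit:
assert IDS_PER_DAY > 400_000_000
assert 365 * IDS_PER_DAY > 21_700_000_000

# The 10 base 36 characters count elapsed microseconds; in years, that is:
MICROSECONDS = 36 ** 10
YEARS = MICROSECONDS / 1_000_000 / (365.25 * 24 * 3600)
print(round(YEARS, 1))  # about 115.9 years per "era"
```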
Got all that so far? Good, then ... ready, go!
Here is the composition of the 16 character scheme I'm using:
The alphanumeric requirement calls for base 36 encoding (0-9, A-Z), which also helps us stuff more uniqueness into less visual space than a hexadecimal GUID does. The storage space will be the same as a GUID, 128 bits / 16 bytes. In SQL Server, just define it as a
char(16). The 16 characters are composed of four different sections which I'll describe next; each section is base 36 encoded and then left-padded when necessary to ensure the correct length.
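The encode-and-left-pad step is straightforward; here's a minimal sketch (Python used for illustration, since the actual library is C#):

```python
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def base36_encode(value: int, width: int) -> str:
    """Encode a non-negative integer in base 36, left-padded with '0' to width."""
    if value < 0:
        raise ValueError("value must be non-negative")
    digits = ""
    while value:
        value, rem = divmod(value, 36)
        digits = ALPHABET[rem] + digits
    return (digits or "0").rjust(width, "0")

print(base36_encode(35, 2))    # "0Z"
print(base36_encode(1296, 4))  # "0100" (1296 = 36^2)
```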
A timestamp occupies the first 10 positions; this gives the IDs an ascending pattern. To allow IDs to start with low values, I calculate the timestamp as the elapsed interval from an inception date/time, which is hardcoded and should be universal across an organization. The resolution is in microseconds (ticks / 10), and a lock mechanism prevents any threads in an AppDomain from issuing two IDs within 10 microseconds of each other. That effectively eliminates the chance of any local ID collisions. You can optionally express these IDs in four groups of four characters, separated by a dash or other delimiter. This looks familiar to people, like a credit card number.
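The spacing guard can be sketched like this (a Python illustration; the real implementation is a C# lock over the last-issued timestamp, and the spin-wait details here are my assumption):

```python
import threading
import time

_lock = threading.Lock()
_last_us = 0

def next_timestamp_us() -> int:
    """Return a microsecond timestamp at least 10 microseconds after the last one.

    If a second caller arrives within 10 microseconds of the previous ID,
    it spins until the clock has advanced far enough, so no two local IDs
    can ever share a timestamp window.
    """
    global _last_us
    with _lock:
        while True:
            now_us = time.time_ns() // 1_000  # microsecond resolution
            if now_us >= _last_us + 10:
                _last_us = now_us
                return now_us
            time.sleep(0)  # yield and re-read the clock
```

Because the wait happens inside the lock, callers are serialized and each one pays at most a ~10-microsecond delay under contention.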
A server hash takes up the next 2 positions. This is a base 36 encoded checksum based on the hostname of the server. This reduces the chances of an ID collision across a distributed app. It also adds an element of traceability to ID assignment, so long as you catalog your server checksums. Since there are two base 36 characters, there are 1296 possible values. The generator provides methods to override this value if you like.
A version identifier occupies the 13th position. This is arbitrarily set to Z. Since 10 characters in base 36 provide just over 115 years worth of timestamps, this character will decrement each time the timestamps are exhausted (every ~115.8 years). This will break the sequential nature of the IDs, but if you have an app in service long enough for that to happen, you should be congratulated, and I will buy you a beer.
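The decrement logic amounts to dividing the elapsed microseconds into 36^10-sized "eras"; a sketch of the assumed mechanics (not the library's exact code):

```python
SPAN = 36 ** 10  # number of distinct 10-character base 36 timestamps

def version_char_and_offset(elapsed_us: int) -> tuple[str, int]:
    """Map elapsed microseconds to a version character plus a wrapped timestamp.

    Era 0 -> 'Z', era 1 -> 'Y', and so on; each era covers ~115.9 years.
    """
    era, offset = divmod(elapsed_us, SPAN)
    return chr(ord("Z") - era), offset

print(version_char_and_offset(5))         # ('Z', 5)
print(version_char_and_offset(SPAN + 5))  # ('Y', 5)
```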
Three random characters occupy the last positions. 36 ^ 3 gives you 46656 possible combinations. If your distributed app is big enough to have duplicate server hashes, this cuts the odds of an ID collision to n in 46656 at worst, where n is the number of servers with the same hash generating an ID in the same millionth of a second. With the 10 microsecond delay, even several servers concurrently issuing IDs in a tight loop would be unlikely to generate IDs in the same microsecond.
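Putting the four sections together, the whole scheme can be sketched in a few lines. This is a Python illustration only: the inception date, the CRC32 hostname checksum, and the random source are my assumptions, not the library's exact choices.

```python
import random
import socket
import zlib
from datetime import datetime, timezone

ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
INCEPTION = datetime(2012, 1, 1, tzinfo=timezone.utc)  # assumed inception date

def encode36(value, width):
    """Base 36 encode, left-padded with '0' to the given width."""
    digits = ""
    while value:
        value, rem = divmod(value, 36)
        digits = ALPHABET[rem] + digits
    return (digits or "0").rjust(width, "0")

def new_base36(delimiter=None):
    """Compose a 16-character ID: timestamp + host hash + version + random tail."""
    elapsed_us = int((datetime.now(timezone.utc) - INCEPTION).total_seconds() * 1_000_000)
    timestamp = encode36(elapsed_us, 10)                                       # positions 1-10
    host_hash = encode36(zlib.crc32(socket.gethostname().encode()) % 1296, 2)  # positions 11-12
    version = "Z"                                                              # position 13
    tail = "".join(random.choice(ALPHABET) for _ in range(3))                  # positions 14-16
    raw = timestamp + host_hash + version + tail
    if delimiter:
        return delimiter.join(raw[i:i + 4] for i in range(0, 16, 4))
    return raw

print(new_base36())     # 16 characters, starting with the ascending timestamp
print(new_base36("-"))  # four dash-separated groups of four
```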
I did a fair amount of brute-force testing on this generator for performance and collision avoidance. Generation speed in stress tests varied between about 60,000 and 100,000/second depending on hardware. Collisions would require two servers to have the same 2-character hash, so in my testing I overrode the hash to be identical on four different beefy Intel servers, each generating an average of a little over 100k IDs per second. 72 hours of concurrent ID generation on all four did not produce a collision.
To check the performance of 16-character strings as IDs, I compared sorting a
List<T> with 1,000,000 instances of a string ID versus an Int64 and a Guid. Sort speed is a little faster than the Guid sort; it took under 3 seconds to sort the list from a random start order. (It's important to use
string.CompareOrdinal() if you want good performance, though; the string class's default
.CompareTo() instance method is culture-aware and will slow you down fiercely.) These tests are included in the console app.
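The shape of that comparison translates to other platforms too; here's a rough Python sketch (scaled down, and note that Python's string sort is already ordinal by code point, analogous to CompareOrdinal):

```python
import random
import time
import uuid

ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
N = 200_000  # scaled down from the 1,000,000 used above

base36_ids = ["".join(random.choice(ALPHABET) for _ in range(16)) for _ in range(N)]
guid_ids = [str(uuid.uuid4()) for _ in range(N)]

for name, ids in (("base36", base36_ids), ("guid", guid_ids)):
    start = time.perf_counter()
    ids.sort()  # ordinal (code-point) comparison, like string.CompareOrdinal
    print(f"{name}: {time.perf_counter() - start:.3f}s")
```

The shorter fixed-length base 36 strings compare fewer characters per key than 36-character GUID strings, which is where the small sorting edge comes from.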
More dramatic performance differences showed up in SQL Server. I created two tables for comparison, one with a
char(16) Id to hold my sequential base 36 string, and one with a
uniqueidentifier Id. I created entities to match, and inserted 10 million records into each table using a SqlDataReader. Inserts were generally a bit faster with the
char(16) (I tried batches from 1 to 100k records per insert), and individual selects were a few thousandths of a millisecond faster with a
uniqueidentifier. However, when doing any aggregate operations or sorting based on the creation time, the sequential
char(16) was dramatically faster. For instance, after inserting 10 million records, finding the timestamp of the 9-millionth record (logically, select MIN(DateTimeCreated) from (select top 1000000 DateTimeCreated order by DateTimeCreated desc)), the
char(16) Id table was 700% to 1000% faster. Tough grouping and aggregate queries like this on the Guid table often ran for several minutes. There's a lesson here: you probably shouldn't use Guids as a clustered index on large tables you'll be processing in incremental, date- or time-grouped batches.
I verified the logic that decrements the reserved character after 115.8 years (when the base 36 sequential prefixes are exhausted) by changing the hardcoded inception date to 500+ years back. There is also a
Parse() method that will verify an Id and show its components. It takes care of casing issues and removing dashes (or any other non base 36 characters) as well. It will throw an exception if you pass it a string that doesn't conform to the Id syntax. (I didn't write a TryParse() variant.)
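The normalization that Parse() performs can be sketched like this (a Python illustration of the behavior described above; the field names are mine):

```python
import re

NON_BASE36 = re.compile(r"[^0-9A-Z]")

def parse_id(raw: str) -> dict:
    """Normalize a string and split a 16-character base 36 ID into its sections."""
    cleaned = NON_BASE36.sub("", raw.upper())  # fix casing, drop dashes etc.
    if len(cleaned) != 16:
        raise ValueError(f"not a valid 16-character base 36 ID: {raw!r}")
    return {
        "timestamp": cleaned[:10],   # positions 1-10
        "host_hash": cleaned[10:12], # positions 11-12
        "version": cleaned[12],      # position 13
        "random": cleaned[13:],      # positions 14-16
    }

print(parse_id("d657-bhf6-zi6e-z3a2"))
```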
The console app shows how many Ids your machine generates in 1 second, displaying the first and last Id generated. It also shows the inception date, demonstrates the
Parse() method, and stress tests generic list sorting with mock classes. No prep should be needed; just load up the solution and hit F5. The output looks like this:
If your Ids started out over 500 years ago, the output would look more like this:
Safeguards and limitations
The generator here takes reasonable safeguards against collisions, but they are still theoretically possible in extreme edge-case scenarios. So, you should either ensure that no two servers in your farm share the same host checksum, or design your app such that ID collisions are not catastrophic (and it's a good idea to defend against that sort of thing in general).
Keep in mind that the scheme uses 0 and O, and I and 1. People in your organization should make a habit of saying "zero" for the number and "O" for the letter, and you should display IDs in a font that enhances the differences between the characters when possible.
The sequencing of these IDs depends on two potential points of failure: a) the inception date hardcoded in the generator class, and b) the system clock. Ideally one should sync all server clocks against an external NTP service anyway, so I didn't spend much time defending against that possibility. The date is hardcoded because a) I hate storing anything critical to an app's logic in configuration files, and b) the value should be consistent across an organization, and hardcoding makes that easy to ensure.
It would be interesting to rigorously calculate collision odds in a massive server farm, or under an expected rate of clock failures. I also would love to see an insert and lookup performance comparison across a few more datastores. Time permitting I will attack these endeavors myself, but meanwhile, I thought it was better to release the article and code without them than to let it all languish until I could undertake to get everything done.
First draft 2012-09-04.