Sortable Base64 Encoding

Tomas Takac

4.28/5 (8 votes)

Nov 3, 2019

CPOL

7 min read

11700

131

Sortable and file-name friendly Base64 encoding

Download source code - 7.4 KB

Introduction

Base64 encoding can encode binary data to text but has some shortcomings. First of all, encoded values can contain "/" (a slash) which makes Base64 encoded values unusable in file names unless you apply another level of encoding. Second, the encoded values when sorted as strings are sorted differently than the original values when sorted as numbers.

Background

On my current project, we use Azure blob storage a lot. This has the nice feature that when you list blobs, it always returns them sorted by name. We rely on this feature for versioning so the newer version is before the latest version in the listing. This way, you can get the latest version of the entity by listing "top 1" blobs with a given prefix.

At one point, we needed to extend our versioning scheme. I liked the idea of encoding the binary value using Base64 but very soon, I learned about the shortcomings described in the introduction. In the end, we decided to go in a different direction but I thought it would be fun to implement this "better" Base64 anyway.

Base64

There are several ways how to encode binary data to text. You can use decimal or hexadecimal digits for example. Base64 as the name suggests uses 64 characters: besides 10 decimal digits also uppercase and lowercase letters and two additional characters "+" and "/". The "=" sign is only used for padding and does not encode any information. But different encodings have different ratio between the length of input and output. Let us encode 16 bytes, which is the length of a GUID.

Encoding	Output length	Example
Decimal digits	39	83010580618813770707986403520966229089
Hexadecimal digits	32	f3cefc61311240f9a9dc6c519c41733e
Base64	24 (22 without padding)	YfzO8xIx+UCp3GxRnEFzPg==
Binary	16

How did we get these numbers? One GUID is 16 bytes which is 128 bits - this is 2¹²⁸ combinations. We use logarithm to get number of digits or characters in output. Base of the logarithm is the number of available characters in the encoding. So for decimal digits we use log₁₀, for hexadecimal digits log₁₆ and for Base64 encoding log₆₄. As you can see, the more characters available in the encoding, the shorter the resulting text.

You can read more about binary to text encoding on Wikipedia.

Choice of Characters

The .NET implementation of Base64 encoding uses following characters: uppercase letters "A" to "Z", then lowercase letters "a" to "z", then decimal digits "0" to "9" and finally two additional characters "+" and "/".

Character	ASCII code
A .. Z	65 .. 90
a .. z	97 .. 122
0 .. 9	48 .. 57
+ (plus)	42
/ (slash)	47

This choice of characters has two problems. First obviously using "/" is problematic for file names. Then ordering of characters does not preserve sorting because for higher values in encoding, it uses lower ASCII codes. To fix these problems, we change the order of encoding characters and pick different character instead of "/".

Character	ASCII code
+ (plus)	42
0 .. 9	48 .. 57
A .. Z	65 .. 90
_ (underscore)	95
a .. z	97 .. 122

Underscore is a natural choice as it is commonly used in identifiers and is supported in file names. The new ordering also matches the ordering of ASCII codes. This is how it looks in the code:

private const string Base64Characters = 
    "+0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz";

Padding

In .NET implementation, the output is always padded with "=" at the end in order to have the output length a multiple of 4. This, of course, breaks the sorting story. To preserve sorting, we would need to add padding in front of the string. Also it is completely unnecessary and we will not use it.

We still need to insert some padding bits to get a whole number of bytes in the output. There is a simple rule for this:

Input length	Output length	Extra padding bits
3n	4n	0 bits
3n + 1	4n + 2	4 bits
3n + 2	4n + 3	2 bits

In the above table, "n" is a positive integer. We can encode 3 bytes into 4 Base64 characters without any padding. Input size that is not a multiple of 3 will result in output size in bits that is not divisible by 8. In this case, we will insert zero bits in front of the most significant character in the output.

You can verify the above in following table:

Input length	Output bit length
3n	8 * 3n = 24n = 6 * 4n
3n + 1	8 * (3n + 1) = 24n + 8 = 24n + 6 + 2 = 6 * (4n + 1) + 2
3n + 2	8 * (3n + 2) = 24n + 16 = 24n + 12 + 4 = 6 * (4n + 2) + 4

Endianness

Before we dive into the actual code, we need to make a short stop to explain the data layout. .NET as a platform is little endian. This means that for arrays, the more significant value the higher index in the array. So the least significant value is stored at the beginning of the array and the most significant value is stored at the end. This is important because when encoding as Base64, we want to put the most significant value at the beginning of the string and the least significant value at the end. So the order is vice-versa.

To learn more, read about Endianness on Wikipedia.

Encoding

First, we need to determine how many characters will be in output. We will follow the above table. We encode each group of 3 bytes using 4 Base64 characters. If we have 1 extra byte, we add 2 more characters with 4 padding bits. If we have 2 extra bytes, we add 3 more characters with 2 padding bits.

var numberOfChars = 4 * Math.DivRem(input.Length, 3, out int extraBytes);

if (extraBytes > 0)
{
    numberOfChars += extraBytes + 1;
}

var output = new char[numberOfChars];

switch (extraBytes)
{
    case 0:
        // no padding
        Encode0(input, output, input.Length - 1, 0);
        break;
    case 1:
        // 4 padding bits
        Encode1(input, output);
        break;
    case 2:
        // 2 padding bits
        Encode2(input, output);
        break;
    default:
        throw new InvalidOperationException("Unreachable code");
}

return new string(output);

Encoding Without Padding

This is the simplest case for encoding. We get 3 bytes from input and generate 4 characters in output. Please remember we read the input from the end!

private void Encode0(byte[] input, char[] output, int inputIndex, int outputIndex)
{
    while (inputIndex > 0)
    {
        var i0 = input[inputIndex];
        var i1 = input[inputIndex - 1];
        var i2 = input[inputIndex - 2];

        // 000000 001111 111122 222222
        var o0 = i0 >> 2;
        var o1 = ((i0 & 0x03) << 4) | (i1 >> 4);
        var o2 = ((i1 & 0x0F) << 2) | (i2 >> 6);
        var o3 = i2 & 0x3F;

        output[outputIndex] = Base64Characters[o0];
        output[outputIndex + 1] = Base64Characters[o1];
        output[outputIndex + 2] = Base64Characters[o2];
        output[outputIndex + 3] = Base64Characters[o3];

        inputIndex -= 3;
        outputIndex += 4;
    }
}

Encoding With 4 Padding Bits

In this case, we have an extra byte and we encode it into two Base64 characters with 4 padding bits. After we process this extra byte, we can continue the same as with no padding.

private void Encode1(byte[] input, char[] output)
{
    var len = input.Length;
    var i0 = input[len - 1];

    // ....00 000000
    var o0 = i0 >> 6;
    var o1 = i0 & 0x3F;

    output[0] = Base64Characters[o0];
    output[1] = Base64Characters[o1];

    Encode0(input, output, len - 2, 2);
}

Encoding With 2 Padding Bits

In this case, we have two extra bytes and we encode them into 3 Base64 characters with 2 padding bits. Again, after we process these two bytes, we can continue the same as with no padding.

private void Encode2(byte[] input, char[] output)
{
    var len = input.Length;
    var i0 = input[len - 1];
    var i1 = input[len - 2];

    // ..0000 000011 111111
    var o0 = i0 >> 4;
    var o1 = ((i0 & 0x0F) << 2) | (i1 >> 6);
    var o2 = i1 & 0x3F;

    output[0] = Base64Characters[o0];
    output[1] = Base64Characters[o1];
    output[2] = Base64Characters[o2];

    Encode0(input, output, len - 3, 3);
}

Decoding

We use reverse process to determine how many bytes will be in output. Each group of 4 characters encodes 3 bytes. If we have 2 extra characters, we add 1 more byte and decode with 4 padding bits. If we have 3 extra characters, we add 2 more bytes and decode with 2 padding bits.

var numberOfBytes = 3 * Math.DivRem(input.Length, 4, out int extraChars);

if (extraChars > 0)
{
    numberOfBytes += extraChars - 1;
}

var output = new byte[numberOfBytes];

switch (extraChars)
{
    case 0:
        // no padding
        Decode0(input, output, 0, output.Length - 1);
        break;
    case 2:
        // 4 padding bits
        Decode1(input, output);
        break;
    case 3:
        // 2 padding bits
        Decode2(input, output);
        break;
    default:
        throw new InvalidOperationException("Unreachable code");
}

return output;

Decoding Without Padding

This is the simplest case for decoding. We get 4 characters from input and generate 3 bytes in output. Please remember we need to put bytes to output from the end!

private void Decode0(string input, byte[] output, int inputIndex, int outputIndex)
{
    while (inputIndex < input.Length)
    {
        var i0 = Base64Characters.IndexOf(input[inputIndex]);
        var i1 = Base64Characters.IndexOf(input[inputIndex + 1]);
        var i2 = Base64Characters.IndexOf(input[inputIndex + 2]);
        var i3 = Base64Characters.IndexOf(input[inputIndex + 3]);

        // 000000 001111 111122 222222
        var o0 = (byte)((i0 << 2) | (i1 >> 4));
        var o1 = (byte)((i1 << 4) | (i2 >> 2));
        var o2 = (byte)((i2 << 6) | i3);

        output[outputIndex] = o0;
        output[outputIndex - 1] = o1;
        output[outputIndex - 2] = o2;

        inputIndex += 4;
        outputIndex -= 3;
    }
}

Decoding With 4 Padding Bits

In this case, we have 2 extra Base64 characters and we decode it into 1 byte with 4 padding bits. After we process these extra characters, we can continue the same as with no padding.

private void Decode1(string input, byte[] output)
{
    var i0 = Base64Characters.IndexOf(input[0]);
    var i1 = Base64Characters.IndexOf(input[1]);

    // ....00 000000
    var o0 = (byte)((i0 << 6) | i1);

    var len = output.Length;
    output[len - 1] = o0;

    Decode0(input, output, 2, len - 2);
}

Decoding With 2 Padding Bits

In this case, we have 3 extra Base64 characters and we decode it into 2 bytes with 2 padding bits. After we process these extra characters, we can continue the same as with no padding.

private void Decode2(string input, byte[] output)
{
    var i0 = Base64Characters.IndexOf(input[0]);
    var i1 = Base64Characters.IndexOf(input[1]);
    var i2 = Base64Characters.IndexOf(input[2]);

    // ..0000 000011 111111
    var o0 = (byte)((i0 << 4) | (i1 >> 2));
    var o1 = (byte)((i1 << 6) | i2);

    var len = output.Length;
    output[len - 1] = o0;
    output[len - 2] = o1;

    Decode0(input, output, 3, len - 3);
}

Points of Interest

Our new encoding is sortable but there are some limitations worth mentioning. First, it only works for strings of the same length, shorter strings are not always placed before longer strings. You can either pad strings to be of the same size or implement your own comparer.

Default sorting	Sorting with prefix
aa	+aa
aaa	+ab
aab	aaa
ab	aab
aba	aba
abb	abb

The other problem is that the sorting works in .NET only with ordinal string comparison. Azure blob storage lists blobs as expected but you need to specify the comparer explicitly in List.Sort() for example. When you sort the Base64 characters using the default comparer, it looks like this:

_+0123456789aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ

History

3^rd November, 2019 - Initial release