I can understand why it seems like a damn good idea: if you use SHA or MD5, they always generate an output that is the same length – 128 bits, 256 bits, 512 bits, or 1024 bits which makes it easy to store and transfer; unlike DES or AES which generate a very long stream of bytes. So if you decrypt SHA or MD5, then you get security plus an incredible degree of compression, which is really handy. So … why don't streaming services like Amazon or Netflix do that? Why aren't hard drives infinite capacity and stored as a 512 bit encrypted value? Why are webpages not compressed to 256 bits?
Unfortunately, it doesn't work like that, and this article tries to explain why not.
What is Encryption?
Let's start by looking at what encryption is:
In cryptography, encryption is the process of encoding a message or information in such a way that only authorized parties can access it and those who are not authorized cannot. Encryption does not itself prevent interference, but denies the intelligible content to a would-be interceptor. In an encryption scheme, the intended information or message, referred to as plaintext, is encrypted using an encryption algorithm – a cipher – generating ciphertext that can be read only if decrypted.
Which means that when you encrypt a message (be it a password, a phrase, a document, or a video) you can reverse the output and get the original back. If for any reason when you decrypt the encrypted data, you can't get exactly what you started with, it isn't encryption!
So Let's Try It
So let's assume you can decrypt SHA or MD5 and get the original input back, and see what happens.
For the sake of simplicity in the examples, let's invent an algorithm called SHA-2Bit which always produces a 2 bit output in the same way that SHA256 always produces a 256 bit output.
If we feed 2 bit values to SHA-2Bit, we get 2 bit outputs: it doesn't matter what the algorithm is internally, any more than it does with SHA256 – bear in mind that 99.99% of the users of SHA256 don't understand how it works either, and don't need to anyway!
So we feed it the full range of two bits values and we look at the outputs:
Input (Binary) Input (Decimal) Output (Binary) Output(Decimal)
00 0 11 3
01 1 00 0
10 2 10 2
11 3 01 1
And each output is different, which is important: if two different inputs generate the same output, then we can't decrypt it as we have no idea which of the two inputs we started with.
This is fine: So let's take those outputs as and see what we get back if we run them backwards by lookign them up in the output column above:
Output (Binary) Output (Decimal) Input (Binary) Input (Decimal)
11 3 00 0
00 0 01 1
10 2 10 2
01 1 11 3
So that works – we can encrypt any 2 bit value using SHA-2Bit and decrypt it successfully to get the original input back.
Input (Binary) Encrypted (Binary) Output (Binary)
00 11 00
01 00 01
10 10 10
11 01 11
But … what if we input a three bit value? Here we have a problem, because all of the possible two bit outputs are mapped to a separate 2 bit input (and they have to be, or we would have duplicates, and we couldn't decrypt them). So what 2 bit value can we output for 3 bit inputs? We could just look at the decimal values and work from there:
Input (Binary) Input (Decimal) Output (Binary) Output (Decimal)
000 0 11 3
001 1 00 0
010 2 10 2
011 3 01 1
And I guess that wouldn't make too much difference (though there is a difference between 2 bit "3" and 3 bit "3" in the real world).
But what values can we generate for the other three bit numbers?
Input (Binary) Input (Decimal) Output(Binary) Output(Decimal)
000 0 11 3
001 1 00 0
010 2 10 2
011 3 01 1
100 4 ? ?
101 5 ? ?
110 6 ? ?
111 7 ? ?
Any two bit values we try to chose will have already been used to map 2 bit input values so we can never decrypt values that are larger or smaller than 2 bits.
So ... It Doesn't Work for 2 Bit Outputs
And the problem is the same with SHA256: OK, there are 256 bits, which gives us 10^168 (or a 1 with 168 zeroes following) different values but there are twice as many 512 bit numbers we could provide as inputs: and a document with 33 characters in it already exceeds 256 bits!
No matter what fixed size we pick as our output, we cannot encrypt that size plus one bit, or one character, or one byte more into it as we are absolutely guaranteed to get duplicate outputs – and a single duplicate value means we cannot guarantee to successfully restore the original input given the output. We may be able to get something, but it may or may not be the original document!
But If I Use an Encryption Key, or a Salt: It'll Work - Yes?
No. It doesn't even matter what the encryption key was because we didn't even use a key in this example – just as SHA and MD5 don't use keys!
And that's important: SHA and MD5 don't use keys because they aren't encryption algorithms: they are hashing algorithms, and the difference is that encryption can be reversed, hashing can't. Any algorithm that generates a fixed size output is by definition hashing because it will produce duplicate values, where encryption will not but requires variable length output in order to ensure it.
Salts don't help either: they serve to vary the original input so that two users with the same password don't generate the same value in the "encrypted passwords" table. For example, to prevent everybody with "Password" as their password from being identifiable just by counting how many identical values you have in the table!
Hashing algorithms by their very nature discard information to produce a result in the same way that addition does: if I add 2 and 7, I always get 9 – but you can't take 9 and get back to 2 + 7, because 0 + 9, 1+8, 2+7, 3+4, 10+(-1), … and so on all give the same value. As does 7 + 2, 2+0+7, …
That's why streaming video services don't just use SHA to process the whole MP4 film, and send you a 256 bit value: if you could decrypt it and get the whole 1GB movie back, trust me, they would!
And hashing isn't compression either, because compression is (in effect) a sophisticated form of encryption which tries (very hard: very, very hard) to produce an output that is significantly smaller than the input, but which relies on the actual data and it's "inherent patterns" to enable the compression. The one thing that compression algorithms cannot guarantee is a specific output size, and a seemingly trivial change to the input can cause a large change in the output size.
Generally You Find:
- Fixed size output: Hashed data, original input cannot be recovered.
- Output smaller than input: Compressed data, original input can be recovered.
- Output larger than input: Encrypted, original input can be recovered.
And yes, MD5 is officially "broken" because there is a way to reverse some values to a "readable input" but that only works in a small number of cases, and doesn't guarantee to regenerate the specific original input. Doesn't mean it has become encryption! You still shouldn't use it for new projects.
- 2019-06-25: "SHA2" reference missed - changed to "SHA-2Bit". My thanks to @bence98 for spotting it!
- 2019-06-25: "Decode" of SHA-2Bit output to input clarified.
- 2019-06-21: Fictitious "SHA-2" algorithm renamed "SHA-2Bit" to prevent confusion with SHA-2 the real algorithm. My thanks to @xied75 (Dong Xie) and apologies for any confusion caused.
- 2019-06-20: Tidy up by Deeksha.
- 2019-06-20: Typos
- 2019-06-20: Original version