Click here to Skip to main content
15,889,471 members
Please Sign up or sign in to vote.
0.00/5 (No votes)
See more:
Hi.
I am making a program that can filter out copies in photo albums and I wanted to know if it is possible to calculate the chance that the same hash will be calculated for two different files.
I also wanted to know which hashing method is fastes compared to how uniqe the hash will be.

Lets say I have 10000 pictures. All of the pictures are taken with in, or around 8MP quality, so they are similar in size. I check each and every one against each other with MD5 hashing method. Is there any "real" chance that I will get a false positive? Does the name of the file play any part in the hash?

Thanks
Posted

"Is there any "real" chance that I will get a false positive?"
Very, very unlikely. The brute force odds are 1 in 2^128, but the actual odds are lower than that. Not so much that it makes a significant difference (Hell, it could be out by 50 orders of magnitude and you would still have a better chance of winning the lottery every week for a year. It isn't out that much though).

"Does the name of the file play any part in the hash?"
Not normally, no. It is the file content that matters, and an MD5 algorithm does not know what it's data is - it is just a stream of bytes as far as it is concerned.

"I also wanted to know which hashing method is fastes compared to how uniqe the hash will be."
Depends a lot more on the implementation of the algorithm than on the algorithm itself.
 
Share this answer
 
Comments
sigsand 23-Nov-11 12:36pm    
Nice. Thanks.
Just a little follow up on that last question.
I use the System.Security.Cryptography
And here is my method:
public static string ComputeMD5Hash(string fileName)
{
return ComputeHash(fileName, new MD5CryptoServiceProvider());
}
I will most likely be timing the execution of the code later on but there must be some general rule of thumb about which hashing algorithm is fastest.

Also a question I just thought about. Since there is so unlikely I will get a false positive, are there any methods that use less bits to calculate the hash to make it even faster. I think the lowest general method, that I know of, is using 128 bits.
OriginalGriff 23-Nov-11 14:02pm    
The lowest general method that I know of uses 8 bits - and the chance of duplication is very high!
Just XOR each byte and you will reach a 8 bit result, which duplicates pretty often....
LanFanNinja 23-Nov-11 12:44pm    
+5
Sander Rossel 23-Nov-11 13:22pm    
Nice one, 5ed :)
Mehdi Gholam 23-Nov-11 13:32pm    
Actually the collision rate is around 2^39 not the full range of 2^128.

5'ed
I have found the answer to my own question. I simply made a test program to check different methods and, surprisingly, my own implementation works the best.
I used a stopwatch to take the time on three methods. One was to get the MD5 (and SHA1 in a seperate test) hash of the files to be compared into two seperate strings and simply use if(string1.Equals(string2)) to check for a match.
A quicker way was to convert both files into a seperate byte array then looping through the arrays and comparing if the byte arrays match. Since most of the times the files will NOT be the same this dramatically increases the speed. I had a folder with 32 pictures and two of them were a copy paste. 84MB in total.
Test results:
Press enter to begin..

MD5 hash check
Nr of files to check: 32
Nr of matches: 2
Milli seconds: 6243 (Seconds: 6 )

SHA1 hash check
Nr of files to check: 32
Nr of matches: 2
Milli seconds: 7702 (Seconds: 7 )

byte array hash check
Nr of files to check: 32
Nr of matches: 2
Milli seconds: 1241 (Seconds: 1 )

Press enter to exit..

Thanks everyone for the input. If someone wants the algortihms I can post it here.
This is not exactly an answer to the question I had but people who have the same question I had might find this interesting.
I will mark the first answer as the solution since it directly answered my questions.
 
Share this answer
 
v4

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)



CodeProject, 20 Bay Street, 11th Floor Toronto, Ontario, Canada M5J 2N8 +1 (416) 849-8900