Click here to Skip to main content
15,867,830 members
Please Sign up or sign in to vote.
5.00/5 (2 votes)
See more:
I should add for clarity that strings are guaranteed to be unique.I use a classic djb2 hash function - see
http://www.cse.yorku.ca/~oz/hash.html
UINT32 gHashCh(BYTE *str)
{
	UINT32 hash = 5381;
	int c;

	while (c = *str++)
	{
		hash = ((hash << 5) + hash) + c; /* hash * 33 + c */
	}

	return hash;
}

Question #1. How do I calculate (mathematically) the probability that the hash will NOT be unique if calculated on a number of strings? I know the worst case length (16 charactes) and the maximum possible number of strings (lets say, 1000)
Question #2. Can someone give me an example of 2 different strings that produce an identical hash given algorithm above.
Posted
Updated 10-Nov-11 17:43pm
v4
Comments
Sergey Alexandrovich Kryukov 10-Nov-11 18:52pm    
You are not too trusting! Got my 5 for the question.
--SA
michaelmel 10-Nov-11 18:54pm    
Well, I do work in life safety related industry :)

1 solution

A1. An "ideal" hash will spread the strings uniform-randomly over the hash space (2^32). If you have 1000 strings, then your probability of collision in that case is approximately 1 - exp[(- 1000 * 999)/(2 * 2^32)] which is small (about 1.16e-4), but may not be negligible in your case. Your chosen hash may perform as well as that, or a bit worse, or a lot worse, depending on your input space.

A2. You can do this pretty quickly. From the reference below, if you pick a bit over 77000 strings, you will have a better than 50% probability of finding a collision. It'll take you a few minutes to write the code, and probably less than a second for it to find a collision.

[ref]http://en.wikipedia.org/wiki/Birthday_paradox[^]

Cheers,
Peter

[edit]inserted missing left bracket in formula[/edit]
 
Share this answer
 
v2
Comments
michaelmel 10-Nov-11 21:24pm    
Thanks Peter, you pointed me in the right direction. I will probably do a Monte Carlo type experiment on how good it is by generating sets of random strings and measuring the average number that it takes to reach the collision.

Obviously I expect a much higher than 0.5 probability of collision for 77000 strings given that the formula is, as you point out, for an ideal hash function which djb2 is not.

My app checks for hash uniqueness anyway and if there is a collision, it switches to slower searches by string. This is why I need two strings which produce identical hash so I can check this switchoever mechanism.

Cheers,

Michael

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)



CodeProject, 20 Bay Street, 11th Floor Toronto, Ontario, Canada M5J 2N8 +1 (416) 849-8900