15,741,818 members
See more:
I should add for clarity that strings are guaranteed to be unique.I use a classic djb2 hash function - see
http://www.cse.yorku.ca/~oz/hash.html
```UINT32 gHashCh(BYTE *str)
{
UINT32 hash = 5381;
int c;

while (c = *str++)
{
hash = ((hash << 5) + hash) + c; /* hash * 33 + c */
}

return hash;
}```

Question #1. How do I calculate (mathematically) the probability that the hash will NOT be unique if calculated on a number of strings? I know the worst case length (16 charactes) and the maximum possible number of strings (lets say, 1000)
Question #2. Can someone give me an example of 2 different strings that produce an identical hash given algorithm above.
Posted
Updated 10-Nov-11 17:43pm
v4
Sergey Alexandrovich Kryukov 10-Nov-11 18:52pm
You are not too trusting! Got my 5 for the question.
--SA
michaelmel 10-Nov-11 18:54pm
Well, I do work in life safety related industry :)

## Solution 1

A1. An "ideal" hash will spread the strings uniform-randomly over the hash space (2^32). If you have 1000 strings, then your probability of collision in that case is approximately `1 - exp[(- 1000 * 999)/(2 * 2^32)]` which is small (about 1.16e-4), but may not be negligible in your case. Your chosen hash may perform as well as that, or a bit worse, or a lot worse, depending on your input space.

A2. You can do this pretty quickly. From the reference below, if you pick a bit over 77000 strings, you will have a better than 50% probability of finding a collision. It'll take you a few minutes to write the code, and probably less than a second for it to find a collision.

Cheers,
Peter

inserted missing left bracket in formula[/edit]

v2
michaelmel 10-Nov-11 21:24pm
Thanks Peter, you pointed me in the right direction. I will probably do a Monte Carlo type experiment on how good it is by generating sets of random strings and measuring the average number that it takes to reach the collision.

Obviously I expect a much higher than 0.5 probability of collision for 77000 strings given that the formula is, as you point out, for an ideal hash function which djb2 is not.

My app checks for hash uniqueness anyway and if there is a collision, it switches to slower searches by string. This is why I need two strings which produce identical hash so I can check this switchoever mechanism.

Cheers,

Michael

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Top Experts
Last 24hrsThis month
 OriginalGriff 65 Richard MacCutchan 60 Gerry Schmitz 25 Rick York 20 Valery Possoz 10
 OriginalGriff 1,568 Richard MacCutchan 1,318 merano99 480 Rick York 430 Dave Kreskowiak 276

CodeProject, 20 Bay Street, 11th Floor Toronto, Ontario, Canada M5J 2N8 +1 (416) 849-8900