Hi,

I want to generate a unique number from a given string. The string is around 50-150 characters long. I was thinking of using GetHashCode, but I am not sure it is the best bet. The challenge is that the generated number must be independent of any platform: I want to use pure mathematical logic so that it works the same everywhere.
The input set won't be large; it will be in the thousands only.
The only requirements are that it generates the same number on all platforms and that the number is not repeated for different strings.
Any suggestions?
Comments
lukeer 13-Aug-13 1:36am
   
Does it really have to be a number? If string must be non-ambiguously connected to a number, why not use the original string instead?
Pankaj.Sinha.Techno 13-Aug-13 1:52am
   
Yes, actually I am trying to generate a number which I can use as a key in a database for specific types of entries, and that key should be the same across all clients.
   
Please see the comments to Solution 2.

What do you call uniqueness? In what scope, and why? What's the ultimate purpose of it? You know, a hash function, even a cryptographic one of considerable size, is never a 100% guarantee of uniqueness. Paradoxically, cryptography does not need uniqueness: even if some different passwords give identical hashes, it does not help to crack password protection, because this simply translates to a somewhat higher probability of an accidentally guessed password; increase the hash size, and you can make this probability as small as any preset value...

—SA
Pankaj.Sinha.Techno 13-Aug-13 2:30am
   
I agree with you, Sergey. Uniqueness is not guaranteed even by GUIDs. Security is not a concern at all. The only objective is to convert a string into a number with minimum probability of collision. I can make the input size longer to make the probability much lower. I want to generate a numeric value within a range, let's say between 6 million and 7 million. The purpose is to determine a key in a column for certain types of entries in a table (not all), so that they can be identified across all users of the product.
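
To put a rough number on "minimum probability of collision", the birthday approximation p ≈ 1 - exp(-n(n-1)/(2N)) can be evaluated for n inputs spread uniformly over N possible values. The sketch below is only an illustration: the figure of 5000 inputs is an assumption standing in for "thousands", and the function name is made up for this example.

```python
import math

def collision_probability(n_inputs: int, n_slots: int) -> float:
    """Approximate chance that at least two of n_inputs uniformly
    distributed values coincide when only n_slots values exist."""
    if n_inputs > n_slots:
        return 1.0  # pigeonhole: a collision is unavoidable
    exponent = -n_inputs * (n_inputs - 1) / (2.0 * n_slots)
    # expm1 keeps precision when the probability is extremely small
    return -math.expm1(exponent)

if __name__ == "__main__":
    # 1,000,000 slots corresponds to the 6 to 7 million range;
    # 2**32 and 2**64 correspond to 32-bit and 64-bit hash values.
    for slots in (1_000_000, 2**32, 2**64):
        print(slots, collision_probability(5000, slots))
```

Under these assumptions a range of one million values makes a collision almost certain for a few thousand strings, while a 64-bit value makes it negligible, so the width of the number matters far more than the choice of hash.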
lukeer 13-Aug-13 5:18am
   
Do the users create those strings?
In that case, you would have to use one database for all of them, connected over a network. The database could check if a string exists and either provide the matching number, or insert a new entry in the database. The number would then be an auto-incremented id managed by the DBMS.
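
A minimal sketch of that "look up or insert" idea, using SQLite from Python purely for illustration. The table name `string_ids`, the helper names, and the in-memory database are assumptions made for this example; a real deployment would point at the shared database on the network.

```python
import sqlite3

def open_registry(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and make sure the string-to-id table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS string_ids ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " value TEXT NOT NULL UNIQUE)"
    )
    return conn

def get_or_create_id(conn: sqlite3.Connection, value: str) -> int:
    """Return the id already assigned to value, inserting it first if new."""
    with conn:  # run both statements in one transaction
        conn.execute(
            "INSERT OR IGNORE INTO string_ids (value) VALUES (?)", (value,)
        )
        row = conn.execute(
            "SELECT id FROM string_ids WHERE value = ?", (value,)
        ).fetchone()
    return row[0]

if __name__ == "__main__":
    conn = open_registry()
    print(get_or_create_id(conn, "first string"))
    print(get_or_create_id(conn, "second string"))
    print(get_or_create_id(conn, "first string"))  # same id as the first call
```

Because the UNIQUE constraint lives in the database, two different strings can never receive the same id, which is the guarantee a hash function cannot give.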

Any hash function will result in a collision eventually so you will get the same value for different inputs (that is the nature of a hash function).

Depending on your use case you can start with GetHashCode(), although it exists only on the .NET platform and its result may change between .NET versions.

You can look at the MurmurHash functions here:
blog.teamleadnet.com/2012/08/murmurhash3-ultra-fast-hash-algorithm.html
https://github.com/darrenkopp/murmurhash-net/
   
Object.GetHashCode() won't work for that:
"Furthermore, the .NET Framework does not guarantee the default implementation of the GetHashCode method, and the value this method returns may differ between .NET Framework versions and platforms, such as 32-bit and 64-bit platforms."

Your second requirement, "Any two non-identical inputs must not share the same hash", is difficult for any hash function.
A hash function, by definition, creates a fixed-length byte sequence from any given input, so there is a fixed maximum number of distinct outputs. You simply cannot guarantee that there is no collision within your set of a few thousand inputs.

If you can already define all possible inputs, write a program that
1) creates all of them
2) for each of them creates a hash, random number, consecutive number, whatever
3) checks whether that output equals any already existing output
4) while (3) finds a match, repeats (2)
5) stores all of this in a look-up table (LUT)

At runtime, use the LUT instead of a function.
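
The steps above could be sketched roughly as follows, under a few assumptions of my own: SHA-256 from Python's hashlib is the candidate generator, an attempt counter is mixed in to get a different candidate on retry, and the target range of 6,000,000 up to (but not including) 7,000,000 is taken from the comments. The function name is hypothetical.

```python
import hashlib

def build_lookup_table(inputs, low=6_000_000, high=7_000_000):
    """Assign each distinct input string a distinct number in [low, high)."""
    size = high - low
    distinct = sorted(set(inputs))  # fixed order, so reruns give the same table
    if len(distinct) > size:
        raise ValueError("more inputs than available numbers")
    table = {}
    used = set()
    for text in distinct:
        attempt = 0
        while True:
            # Step 2: derive a candidate; the attempt counter changes it on retry.
            digest = hashlib.sha256(f"{attempt}:{text}".encode("utf-8")).digest()
            candidate = low + int.from_bytes(digest[:8], "big") % size
            # Step 3: compare against the numbers already handed out.
            if candidate not in used:
                break
            attempt += 1  # Step 4: collision, so generate another candidate
        used.add(candidate)
        table[text] = candidate  # Step 5: record the pair in the look-up table
    return table

if __name__ == "__main__":
    lut = build_lookup_table(["apple", "banana", "cherry"])
    print(lut)
```

Note that the assignments depend on the complete input set, so the table should be built once and then shipped or stored; rebuilding it after adding strings may change existing numbers.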
   
Comments
Sergey Alexandrovich Kryukov 12-Aug-13 17:14pm
   
I'm really sorry, but this answer is wrong in this part:
"Any two inputs must not share the same hash" is difficult for any hash function.

Not true. Any, absolutely any hash function produces identical results for identical input, by definition.

Even though your point about GetHashCode is probably correct, all OP needs is to use any suitable hash function with fixed source code, and never change this part of the code in the future.

—SA
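
As an illustration of "a hash function with fixed source code", here is a sketch of 64-bit FNV-1a written with plain integer arithmetic. FNV-1a is my choice for the example and is not one of the functions named in this thread; encoding the string as UTF-8 before hashing is also an assumption, and every client would have to use the same encoding to get the same number.

```python
FNV64_OFFSET_BASIS = 0xCBF29CE484222325
FNV64_PRIME = 0x100000001B3
MASK_64 = 0xFFFFFFFFFFFFFFFF

def fnv1a_64(text: str) -> int:
    """Return the 64-bit FNV-1a hash of text, hashed as UTF-8 bytes."""
    h = FNV64_OFFSET_BASIS
    for byte in text.encode("utf-8"):
        h ^= byte                       # mix in the next byte
        h = (h * FNV64_PRIME) & MASK_64  # multiply, keep the low 64 bits
    return h

if __name__ == "__main__":
    print(hex(fnv1a_64("a")))  # 0xaf63dc4c8601ec8c
```

The result depends only on the bytes of the input and on these constants, so the same logic ported to any other language gives the same number. It is still a hash: different strings can collide, just rarely at 64 bits.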
lukeer 13-Aug-13 1:32am
   
You got me there. It should read "Any two non-identical inputs..." I updated my solution accordingly.
   
Not clear. Who set such a weird "requirement"? This is not difficult, it is impossible for a hash; it would simply mean a function which returns a constant, which would make no sense...
—SA
lukeer 13-Aug-13 1:47am
   
Can we agree that I'm euphemising and you're splitting hairs, but that we essentially agree on the topic?
   
Not really. OP just needs some hash function in the form of source code (.NET does not guarantee that its implementation won't change, but I think for at least the cryptographic hash functions it is guaranteed). Pretty much any of them would do; the size of the hash does matter, depending on the required degree of uniqueness, which is never real uniqueness in the case of a hash.

If 100% guaranteed uniqueness is really required, nothing is a solution except some centralized source of IDs. Besides, very often people require uniqueness without telling us the scope. Perhaps this is just uniqueness on a local computer (or an even narrower scope), but people don't realize it, even though it then stops being a problem at all.

That's why I did not answer myself: I would need to know the purpose and real requirements...

You know what? I'll explicitly ask OP about required scope of uniqueness.

—SA
lukeer 13-Aug-13 2:38am
   
Although you're denying it: given what OP told us, we do agree that it can't be done with an off-the-shelf hash function.
   
I'm not really sure. I explained why.
—SA
Pankaj.Sinha.Techno 13-Aug-13 1:06am
   
My problem is that I do not know all the inputs. Any suggestions regarding SHA1 and MD5?
   
If the purpose is not security, either of them would do. If security is involved, neither of them, as both are considered broken; SHA-2 should be used.
—SA
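
For a non-security key, one possible way to turn such a digest into a number is to keep only its leading bytes. The sketch below uses Python's hashlib; the function name, the UTF-8 encoding, and the default of 64 bits are assumptions made for this example, not something stated in the thread.

```python
import hashlib

def string_to_key(text: str, algorithm: str = "sha256", bits: int = 64) -> int:
    """Hash text with the named algorithm and return the leading bits as an integer."""
    digest = hashlib.new(algorithm, text.encode("utf-8")).digest()
    if bits % 8 != 0 or not 0 < bits <= len(digest) * 8:
        raise ValueError("bits must be a multiple of 8 that fits in the digest")
    # Big-endian keeps the result identical on every platform.
    return int.from_bytes(digest[: bits // 8], "big")

if __name__ == "__main__":
    sample = "some string of 50 to 150 characters"
    for name in ("md5", "sha1", "sha256"):
        print(name, string_to_key(sample, name))
```

Truncating the digest keeps the number small enough for a 64-bit database column; the fewer bits are kept, the higher the chance that two different strings collide.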

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)



