Click here to Skip to main content
Click here to Skip to main content

Convert String to 64bit Integer

By , 20 Mar 2009
 

Introduction

.NET Framework provides the 'GetHashcode()' function returning a 32bit Integer. You can convert 'string' to 32bit Integer but it doesn't guarantee a unique number. Here is the function that converts string to a unique 64bit integer, avoids collision domain level string store, compare and trace.

Background

If you need to store a large amount of URLs in your database, you must define 'character' index for manipulation. If you can generate a matching 'one-string-to-one-numeric-value' that can use as a numeric 'Key' instead of a variant length of string.

Using the Code

  1. Convert a variable length of string to fixed length of hashcode, and it must have fast hashing speed, so use .NET provided System.Security.Cryptography.SHA256CryptoServiceProvider.
  2. Convert 32 byte hashcode to 8 byte integer, avoiding making a collision.
/// <summary>
/// Return unique Int64 value for input string
/// </summary>
/// <param name="strText"></param>
/// <returns></returns>
static Int64 GetInt64HashCode(string strText)
{
    Int64 hashCode = 0;
    if (!string.IsNullOrEmpty(strText))
    {
        //Unicode Encode Covering all characterset
          byte[] byteContents = Encoding.Unicode.GetBytes(strText);
        System.Security.Cryptography.SHA256 hash = 
		new System.Security.Cryptography.SHA256CryptoServiceProvider();
        byte[] hashText = hash.ComputeHash(byteContents);
        //32Byte hashText separate
        //hashCodeStart = 0~7  8Byte
        //hashCodeMedium = 8~23  8Byte
        //hashCodeEnd = 24~31  8Byte
        //and Fold
        Int64 hashCodeStart = BitConverter.ToInt64(hashText, 0);
        Int64 hashCodeMedium = BitConverter.ToInt64(hashText, 8);
        Int64 hashCodeEnd = BitConverter.ToInt64(hashText, 24);
        hashCode = hashCodeStart ^ hashCodeMedium ^ hashCodeEnd;
    }
    return (hashCode);
}        

Collision and Performance Test

Tested platform: Core2Duo, Windows 2003 Server SP2, .NET Framework 3.5 SP1

10,000,000 Times generate GetInt64HashCode

Collision: Not found

100,000 Times generate ElapsedTime: 830 milliseconds

10,000,000 Times generate .NET Framework provided object.GetHashCode

Collision: 4,150 found

100,000 Times generate ElapsedTime: 35 milliseconds

Points of Interest

I know that Cryptography.SHA256 does not provide perfect collision avoided hash value and compressed 32Byte to 8Byte can increase collision probability, but I think the above function shows enough performance and avoids collision.

Your reply will make the function more efficient and reliable.

This function now uses and collects large amount of URLs. There is no collision for 40,000,000 unique URLs.

History

  • 20th March, 2009: Initial post

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Composition4
Software Developer Seoul , Korea
Korea (Republic Of) Korea (Republic Of)
Member
develop, management software, website last six years, on .netframework environment
 
Interested in Architect and Framework
 
you can visit my blog : http://smack.kr/
 
articles for c#, framework, architect, and recommend books
but, only korean languages.
in the immediate future I will posts english version also.

Sign Up to vote   Poor Excellent
Add a reason or comment to your vote: x
Votes of 3 or less require a comment

Comments and Discussions

 
You must Sign In to use this message board.
Search this forum  
    Spacing  Noise  Layout  Per page   
GeneralMy vote of 5memberRob 1234523 Oct '12 - 13:35 
exactly what I needed to know, thank you
GeneralCRCmemberAndreyMir28 Mar '09 - 18:55 
What do you think about CRC algorithm?
QuestionRe: CRCmemberComposition412 Apr '09 - 17:42 
have no idea CRC algorithm.
Andrey could you give me a hint , solving above problem with CRC algorithm?
GeneralAlternative approachmemberdjlove26 Mar '09 - 11:39 
Just a couple of general comments:
 
(1) can you really not tolerate any collisions? Typically if the hashcodes are equal you then resort to a normal string compare. In your example with 32-bit hashcodes you'd only need to do this 4000ish times out of 10,000,000.
 
(2) If you want a 64-bit hashcode, it's wasteful to calculate a 256bit one like SHA256. What about MD5 (128 bits?) Even better why not just split the string up into two parts and combine their individual 32bit hashcodes into a 64bit number?
AnswerRe: Alternative approachmemberComposition426 Mar '09 - 16:41 
hi djlove , that`s a good comment , here are the reply that your comment
 
(1) example 32bit hashcode is for just comparing collision probabillity
(2) wanna result is 'one string - one unique 64bit int' not 64-bit hashcode,
the reason pick 'SHA256' is advanced cripto algorithm, almost avoded collision probability, and at least 4times paster than MD5 cripto.
 
32bit hashcode combine to 64bit collision probability depenad on 32bit hash algorithm, metioned before 32bit 'object.GetHashCode' collision probability is so high to get domain level '- not guarantee cover all over the world' unique 64bit inteager
Generalmore tested 100,000,000 unique string for hashe value , collision not found yetmemberComposition422 Mar '09 - 13:56 
I had more tested.
 
100,000,000 GUID for compute hashedkey
 
collision not found yet.
General"unique 64bit inteager"memberharold aptroot21 Mar '09 - 13:08 
You know, I rather doubt that. I think you were just lucky enough not to find a collision.
 

GeneralRe: "unique 64bit inteager"memberassmax21 Mar '09 - 23:04 
i have to second that. you sure can use strings hashing to increase performance, but 32bit is enough to narrow the range. after that you'll have to do string comparisson - as strings are unlimited you can expand your keys to 128 etc... it becomes uncertain whether you gain or loose performance/memory
AnswerRe: "unique 64bit inteager"memberComposition422 Mar '09 - 4:43 
thank for reply assmax.
in my use
first, 32bit hashed int is too collide to make unique numeric value, so 64bit inteager is selected, and it stored in database as bigint column.
second, hashed key only used for compare same url whether aleady exist in the storage, in the practical string hash executed only one time
 
string to unique 64bit inteager function is not same purpose with 'object.GetHashCode'
AnswerRe: "unique 64bit inteager"memberComposition422 Mar '09 - 4:56 
thank for herold , you are right
 
I think that will be enought collision avoid hashed key use of domain level,
but I`m not convinced of above function generating unique value
 
so, I post that want to inspect together

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Rant Rant    Admin Admin   

Permalink | Advertise | Privacy | Mobile
Web03 | 2.6.130523.1 | Last Updated 20 Mar 2009
Article Copyright 2009 by Composition4
Everything else Copyright © CodeProject, 1999-2013
Terms of Use
Layout: fixed | fluid