Click here to Skip to main content
Click here to Skip to main content

Fastest hash function for table lookups in C?!

, 28 Jan 2014 CPOL
Rate this:
Please Sign up or sign in to vote.
Optimized hash function for general table lookups

Introduction

This article is for those who need fastest hasher for their keys.

Who is that bad boy?
FNV1A_YoshimitsuTRIAD - a C etude which hashed one 256bytes block at 14GB/s on i7-4770K.
Also, on PowerPC 440 and Cortex-A8 processors, 'Luckylight' (its Japanese meaning) simply overshadowed everything known to me.

Long story short, my experience with hashing the most needed range (7..255 bytes long keys) is that 'YoshimitsuTRIAD' slashes monstrously, it simply devours the keys at sick rate.
Until recently I have not seen how Haswell performs, FNV1A_YoshimitsuTRIADiiXMM deserves special attention if XMM registers are available, this 96bytes stride variant of 'YoshimitsuTRIAD' hashed 256bytes block at 18GB/s, yippee.
As I can see things 'YoshimitsuTRIAD' is the weapon of choice when it comes to English language phrases/sentences hashing, an area of great importance.

First, I am thankful to Fantasy (Dubai), Mr.Eiht (Germany), m^2 (Poland) and Harold (Holland).
They helped me and I still didn't return the favor, so by sharing the final edition of my best table lookup hash etude I hope to gladden their hearts.
Or, as one unforgettable character (full of personality) from the Russian/Georgian/Armenian classic 'MIMINO' says:
'When he feels gladdened then I will feel that I am gladdened too. When I am gladdened then I will deliver in such a way that you too will be gladdened.'

Below, the results after running 32bit code by Intel 12.1 compiler (/Ox used):
Linear speed on Harold's Rig (i7-4770K, 3500MHz):
Allocating HASH memory 512MB ... OK

Memory pool starting address: 00F40040 ... 64 byte aligned, OK

Info1: One second seems to have 1,000 clocks.
Info2: This CPU seems to be working at 3,499 MHz.

Copying a 256MB block 1024 times i.e. 256GB READ + 256GB WRITTEN ...
memcpy(): (256MB block); 262144MB copied in 19843 clocks or 13.211MB per clock

Fetching/Hashing a 64MB block 1024 times i.e. 64GB ...
BURST_Read_4DWORDS:         (64MB block);     65536MB fetched in 4009 clocks or 16.347MB per clock
BURST_Read_8DWORDSi:        (64MB block);     65536MB fetched in 4009 clocks or 16.347MB per clock
FNV1A_penumbra:             (64MB block);     65536MB hashed in  3121 clocks or 20.998MB per clock
FNV1A_YoshimitsuTRIADiiXMM: (64MB block);     65536MB hashed in  3198 clocks or 20.493MB per clock
FNV1A_YoshimitsuTRIADii:    (64MB block);     65536MB hashed in  5288 clocks or 12.393MB per clock
FNV1A_YoshimitsuTRIAD:      (64MB block);     65536MB hashed in  4430 clocks or 14.794MB per clock
FNV1A_farolito:             (64MB block);     65536MB hashed in  5726 clocks or 11.445MB per clock
FNV1A_Yorikke:              (64MB block);     65536MB hashed in  4929 clocks or 13.296MB per clock
FNV1A_Yoshimura:            (64MB block);     65536MB hashed in  5788 clocks or 11.323MB per clock
CRC32_SlicingBy8:           (64MB block);     65536MB hashed in 36083 clocks or  1.816MB per clock

Fetching/Hashing a 2MB block 32*1024 times ...
BURST_Read_4DWORDS:         (2MB block);      65536MB fetched in 2761 clocks or 23.736MB per clock
BURST_Read_8DWORDSi:        (2MB block);      65536MB fetched in 2433 clocks or 26.936MB per clock
FNV1A_penumbra:             (2MB block);      65536MB hashed in  1591 clocks or 41.192MB per clock
FNV1A_YoshimitsuTRIADiiXMM: (2MB block);      65536MB hashed in  1716 clocks or 38.191MB per clock
FNV1A_YoshimitsuTRIADii:    (2MB block);      65536MB hashed in  3853 clocks or 17.009MB per clock
FNV1A_YoshimitsuTRIAD:      (2MB block);      65536MB hashed in  3713 clocks or 17.650MB per clock
FNV1A_farolito:             (2MB block);      65536MB hashed in  4087 clocks or 16.035MB per clock
FNV1A_Yorikke:              (2MB block);      65536MB hashed in  4462 clocks or 14.688MB per clock
FNV1A_Yoshimura:            (2MB block);      65536MB hashed in  4415 clocks or 14.844MB per clock
CRC32_SlicingBy8:           (2MB block);      65536MB hashed in 35864 clocks or  1.827MB per clock

Fetching/Hashing a 128KB block 512*1024 times ...
BURST_Read_4DWORDS:         (128KB block);    65536MB fetched in 2684 clocks or 24.417MB per clock
BURST_Read_8DWORDSi:        (128KB block);    65536MB fetched in 3104 clocks or 21.113MB per clock
FNV1A_penumbra:             (128KB block);    65536MB hashed in  1529 clocks or 42.862MB per clock
FNV1A_YoshimitsuTRIADiiXMM: (128KB block);    65536MB hashed in  1763 clocks or 37.173MB per clock
FNV1A_YoshimitsuTRIADii:    (128KB block);    65536MB hashed in  4180 clocks or 15.678MB per clock
FNV1A_YoshimitsuTRIAD:      (128KB block);    65536MB hashed in  3729 clocks or 17.575MB per clock
FNV1A_farolito:             (128KB block);    65536MB hashed in  4539 clocks or 14.438MB per clock
FNV1A_Yorikke:              (128KB block);    65536MB hashed in  4462 clocks or 14.688MB per clock
FNV1A_Yoshimura:            (128KB block);    65536MB hashed in  4586 clocks or 14.290MB per clock
CRC32_SlicingBy8:           (128KB block);    65536MB hashed in 35974 clocks or  1.822MB per clock

Fetching/Hashing a 16KB block 4*1024*1024 times ...
BURST_Read_4DWORDS:         (16KB block);     65536MB fetched in 2247 clocks or 29.166MB per clock
BURST_Read_8DWORDSi:        (16KB block);     65536MB fetched in 2277 clocks or 28.782MB per clock
FNV1A_penumbra:             (16KB block);     65536MB hashed in  1528 clocks or 42.890MB per clock
FNV1A_YoshimitsuTRIADiiXMM: (16KB block);     65536MB hashed in  1607 clocks or 40.782MB per clock
FNV1A_YoshimitsuTRIADii:    (16KB block);     65536MB hashed in  3729 clocks or 17.575MB per clock
FNV1A_YoshimitsuTRIAD:      (16KB block);     65536MB hashed in  3666 clocks or 17.877MB per clock
FNV1A_farolito:             (16KB block);     65536MB hashed in  4102 clocks or 15.977MB per clock
FNV1A_Yorikke:              (16KB block);     65536MB hashed in  4587 clocks or 14.287MB per clock
FNV1A_Yoshimura:            (16KB block);     65536MB hashed in  4461 clocks or 14.691MB per clock
CRC32_SlicingBy8:           (16KB block);     65536MB hashed in 35834 clocks or  1.829MB per clock

Fetching/Hashing a 256bytes block 256*1024*1024 times ...
BURST_Read_4DWORDS:         (256bytes block); 65536MB fetched in 2698 clocks or 24.291MB per clock
BURST_Read_8DWORDSi:        (256bytes block); 65536MB fetched in 3136 clocks or 20.898MB per clock
FNV1A_penumbra:             (256bytes block); 65536MB hashed in  3666 clocks or 17.877MB per clock
FNV1A_YoshimitsuTRIADiiXMM: (256bytes block); 65536MB hashed in  3470 clocks or 18.886MB per clock
FNV1A_YoshimitsuTRIADii:    (256bytes block); 65536MB hashed in  4840 clocks or 13.540MB per clock
FNV1A_YoshimitsuTRIAD:      (256bytes block); 65536MB hashed in  4516 clocks or 14.512MB per clock
FNV1A_farolito:             (256bytes block); 65536MB hashed in  5522 clocks or 11.868MB per clock
FNV1A_Yorikke:              (256bytes block); 65536MB hashed in  5117 clocks or 12.808MB per clock
FNV1A_Yoshimura:            (256bytes block); 65536MB hashed in  5195 clocks or 12.615MB per clock
CRC32_SlicingBy8:           (256bytes block); 65536MB hashed in 36004 clocks or  1.820MB per clock
Below, the results after running 32bit code by Intel 12.1 compiler (/Ox used):
Linear speed on Fantasy's 'Black-and-Red' Rig (i7-3930K, 4500MHz, CPU bus: 125MHz, RAM bus: 2400MHz Quad Channel):
FNV1A_YoshimitsuTRIAD: (64MB block); 65536MB hashed in  5728 clocks or 11.441MB per clock
CRC32_SlicingBy8:      (64MB block); 65536MB hashed in 37544 clocks or  1.746MB per clock

FNV1A_YoshimitsuTRIAD: (2MB block);  65536MB hashed in  4560 clocks or 14.372MB per clock
CRC32_SlicingBy8:      (2MB block);  65536MB hashed in 36394 clocks or  1.801MB per clock

FNV1A_YoshimitsuTRIAD: (16KB block); 65536MB hashed in  4390 clocks or 14.928MB per clock
CRC32_SlicingBy8:      (16KB block); 65536MB hashed in 36210 clocks or  1.810MB per clock
Below, the results after running 32bit code by Intel 12.1 compiler (/Ox used):
Linear speed on Mr.Eiht's 'Redemption' Rig (i7 3930K 3.2GHz 8x4GB @1337MHz):
FNV1A_YoshimitsuTRIAD: (64MB block); 65536MB hashed in  7307 clocks or  8.969MB per clock
CRC32_SlicingBy8:      (64MB block); 65536MB hashed in 44693 clocks or  1.466MB per clock

FNV1A_YoshimitsuTRIAD: (2MB block);  65536MB hashed in  5374 clocks or 12.195MB per clock
CRC32_SlicingBy8:      (2MB block);  65536MB hashed in 43143 clocks or  1.519MB per clock

FNV1A_YoshimitsuTRIAD: (16KB block); 65536MB hashed in  5206 clocks or 12.589MB per clock
CRC32_SlicingBy8:      (16KB block); 65536MB hashed in 42888 clocks or  1.528MB per clock

Below, the results after running 32bit code by Intel 12.1 compiler (/Ox used):
Linear speed on Sanmayce's 'Bonboniera' laptop (Core 2 T7500, 2200MHz):
FNV1A_penumbra:        (64MB block); 65536MB hashed in 14445 clocks or  4.581MB per clock
FNV1A_YoshimitsuTRIAD: (64MB block); 65536MB hashed in 16115 clocks or  4.087MB per clock
CRC32_SlicingBy8:      (64MB block); 65536MB hashed in 71573 clocks or  0.916MB per clock

FNV1A_penumbra:        (2MB block);  65536MB hashed in  6568 clocks or 10.025MB per clock
FNV1A_YoshimitsuTRIAD: (2MB block);  65536MB hashed in  9750 clocks or  6.722MB per clock
CRC32_SlicingBy8:      (2MB block);  65536MB hashed in 69763 clocks or  0.940MB per clock

FNV1A_penumbra:        (16KB block); 65536MB hashed in  5819 clocks or 11.293MB per clock
FNV1A_YoshimitsuTRIAD: (16KB block); 65536MB hashed in  8986 clocks or  7.294MB per clock
CRC32_SlicingBy8:      (16KB block); 65536MB hashed in 69467 clocks or  0.949MB per clock
Being an AMD fan (since the superb 'Barton') here come 'Zambezi' and 'Thuban' performances:
AMD FX-8120, 4515MHz = 21x215, L1 Data Cache: 8x16KB:
FNV1A_YoshimitsuTRIAD: (16KB block); 65536MB hashed in  5859 clocks or 11.186MB per clock
CRC32_SlicingBy8:      (16KB block); 65536MB hashed in 44437 clocks or  1.475MB per clock

AMD Phenom II X6 1600T, 4000MHz = 16x250, FSB Frequency: 250MHz, L1 Data Cache: 6x64KB:
FNV1A_YoshimitsuTRIAD: (16KB block); 65536MB hashed in  5769 clocks or 11.360MB per clock
CRC32_SlicingBy8:      (16KB block); 65536MB hashed in 34650 clocks or  1.891MB per clock
And a word for collision performance: problemless.
An excerpt from one of the tortures:

Purpose: to compare FNV1A_Yoshimura and CRC32 by giving the highest number of collisions i.e. the deepest nest/layer, the-lesser-the-better.
Keys are all BINARY strings from 0..18446744073709551615, i.e. all 64bit values at some step, look the sample below.
1111111111111111111111111111111111111111111111111111111111111111
=
18446744073709551615
                  18,446,744,073,709,551,616 = 2^64
                                       thousand
                                   million
                               billion
                           trillion
                        quadrillion
                    quintillion
Note1: On T7500 2200MHz the full cycle took 240h with step 111,111,111:
       for (QUADR = 0; QUADR <= ((unsigned long long)1<<32)*((unsigned long long)1<<32)-1; QUADR=QUADR+3111111111) {...}
Note2: The keys look like (at step 3,111,111,111) these:
0000,0000,0000,0000,0000,0000,0000,0000,0000,0000,0000,0000,0000,0000,0000,0000
0000,0000,0000,0000,0000,0000,0000,0000,1011,1001,0110,1111,1100,1001,1100,0111 i.e. 3111111111
0000,0000,0000,0000,0000,0000,0000,0001,0111,0010,1101,1111,1001,0011,1000,1110 i.e. 6222222222
0000,0000,0000,0000,0000,0000,0000,0010,0010,1100,0100,1111,0101,1101,0101,0101
0000,0000,0000,0000,0000,0000,0000,0010,1110,0101,1011,1111,0010,0111,0001,1100
0000,0000,0000,0000,0000,0000,0000,0011,1001,1111,0010,1110,1111,0000,1110,0011
0000,0000,0000,0000,0000,0000,0000,0100,0101,1000,1001,1110,1011,1010,1010,1010
And just for the fun of it I did similar benchmark with my dispersion 'TRISMUS' 1+ trillion 1Kb text chunks torture package (source at homepage).
Those 1,000,000,000,000 keys were 128bytes long Knight-Tour derivatives:
D:\_KAZE>Knight-Tour_FNV1A_YoshimitsuTRIADii_vs_CRC32_TRISMUS.exe a8 4000000000
...
Polynomial used:
CRC32: 0x8F6E37A0
KEYS to be hashed = 4,000,000,000x4x64
HashSizeInBits = 27
ReportAtEvery = 134,217,727
Allocating HASH memory 512MB ... OK
Allocating HASH memory 512MB ... OK
The first KT:
A8C7E8G7H5G3H1F2H3G1E2C1A2B4A6B8D7F8H7G5F7H8G6H4G2E1C2A1B3A5B7D8C6A7C8E7G8H6G4H2F1D2B1A3B5D6F5D4F3E5C4B2D3F4E6C5A4B6D5F6E4C3D1E3
...
FNV1A_YoshimitsuTRIADii: Keys = 0,000,134,217,729; 1 x MAXcollisionsAtSomeSlots = 0,012; FeeSlots = 50,530,128
CRC32 0x8F6E37A0, iSCSI: Keys = 0,000,134,217,729; 4 x MAXcollisionsAtSomeSlots = 0,011; FeeSlots = 49,561,215
FNV1A_YoshimitsuTRIADii: Keys = 0,000,268,435,457; 3 x MAXcollisionsAtSomeSlots = 0,015; FeeSlots = 19,089,321
CRC32 0x8F6E37A0, iSCSI: Keys = 0,000,268,435,457; 4 x MAXcollisionsAtSomeSlots = 0,014; FeeSlots = 18,307,048
FNV1A_YoshimitsuTRIADii: Keys = 0,000,402,653,185; 1 x MAXcollisionsAtSomeSlots = 0,019; FeeSlots = 07,194,504
CRC32 0x8F6E37A0, iSCSI: Keys = 0,000,402,653,185; 8 x MAXcollisionsAtSomeSlots = 0,017; FeeSlots = 06,762,415
FNV1A_YoshimitsuTRIADii: Keys = 0,000,536,870,913; 2 x MAXcollisionsAtSomeSlots = 0,021; FeeSlots = 02,708,588
CRC32 0x8F6E37A0, iSCSI: Keys = 0,000,536,870,913; 2 x MAXcollisionsAtSomeSlots = 0,020; FeeSlots = 02,496,170
FNV1A_YoshimitsuTRIADii: Keys = 0,000,671,088,641; 2 x MAXcollisionsAtSomeSlots = 0,023; FeeSlots = 01,022,485
CRC32 0x8F6E37A0, iSCSI: Keys = 0,000,671,088,641; 2 x MAXcollisionsAtSomeSlots = 0,023; FeeSlots = 00,922,884
...
FNV1A_YoshimitsuTRIADii: Keys = 1,000,056,291,329; 1 x MAXcollisionsAtSomeSlots = 7,930; FeeSlots = 00,000,000
CRC32 0x8F6E37A0, iSCSI: Keys = 1,000,056,291,329; 2 x MAXcollisionsAtSomeSlots = 7,910; FeeSlots = 00,000,000

As you can see the ideal dispersion would be 1,000,056,291,329/134,217,727 = 7451 MAXcollisionsAtSomeSlots, both YoshimitsuTRIADii & CRC32 did well.

And for those who want to experiment and write their own hashers I included (at download section) two simple tools (Fury & YoYo) allowing to test respectively Speed Performance & Dispersion Performance.
The collision benchmark called '~3 million IPs (dot format)' is included:
D:\_KAZE>YoYo_r2.exe IPS.TXT 20
YoYo - [CR]LF lines hasher, r.2 copyleft Kaze.
Note1: Incoming textual file can exceed 4GB and lines can be up to 1048576 chars long.
Note2: FNV1A_YoshimitsuTRIADiiXMM needs SSE4.1, so if not present YoYo will crash.
Polynomial(s) used:
CRC32C2_8slice: 0x8F6E37A0
HashSizeInBits = 20
Allocating KEY memory 1024KB ... OK
Allocating HASH memory 4MB ... OK
Allocating HASH memory 4MB ... OK
Allocating HASH memory 4MB ... OK
Hashing all the LF ending lines encountered in 42,892,307 bytes long file ...
Keys vs Slots ratio: 2:1 or 2,995,394:1,048,576
FNV1A_YoshimitsuTRIADiiXMM : Keys = 2,995,394; 1 x MAXcollisionsAtSomeSlots = 15; HASHfreeSLOTS = 0,059,943; HashUtilization = 094%; Collisions = 2,006,761
FNV1A_YoshimitsuTRIADii    : Keys = 2,995,394; 1 x MAXcollisionsAtSomeSlots = 15; HASHfreeSLOTS = 0,059,943; HashUtilization = 094%; Collisions = 2,006,761
CRC32C2_8slice             : Keys = 2,995,394; 1 x MAXcollisionsAtSomeSlots = 15; HASHfreeSLOTS = 0,062,064; HashUtilization = 094%; Collisions = 2,008,882
Physical Lines: 2,995,394
Shortest Line : 7
Longest Line  : 15
Performance: 38126.495 bytes per clock

Background

Ha, background, I remember while writing 'YoshimitsuTRIAD' I was listening to this cuore song for hours on end:

Just this once
Let me tell you you're the sweetest thing
The love in every song I sing
The music in my ears and everything
Happiness writes white
Maybe that isn't true tonight
And things you know you might forget
And other things I haven't told you yet
Close your eyes
Count to ten
Turn around
Back again
Hit the floor
Then once more
I'm still here
And it's all true
And it's all true
We don't need any kind of big parade
Just this once a little serenade
To celebrate this love we've made
We don't need
Don't need a big fanfare
This is just my heart laid bare
For anyone who might care
Go away
Round the world
Talk to all kinds of girls
But it's me you won't find
And you're mine
Close your eyes
Count to ten
Turn around
Back again
Hit the floor
Then once more
I'm still here
And it's all true, and it's all true, do you feel it too?
And it's all true, and it's all true, tell me do you feel it too?
/Tracey Thorn - It's All True/

Function homepage: http://www.sanmayce.com/Fastest_Hash/index.html

Two last things, don't complain that homepage of the bad boy is too-o-o long and second, the license, no license, it is 100% FREE.
Enjoy!

Points of Interest

To find the practical and most useful balance between overkill and underkill while being playful and LIFEFUL.

A man went into a pub and saw a dog sitting at a table playing poker with three men.
"Wow, what a smart dog!" he exclaimed.
"Nah, he's not much of a player, whenever he gets a good hand he wags his tail." a kibitzer said.

This dog reminds me of ... me.

History

  • FNV1A_YoshimitsuTRIAD written 2012-Oct-29

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Share

About the Author

Sanmayce
Other
Bulgaria Bulgaria
A Bulgarian old boy interested in search techniques, nothing special.
Follow on   Twitter

Comments and Discussions

 
-- There are no messages in this forum --
| Advertise | Privacy | Mobile
Web02 | 2.8.141015.1 | Last Updated 28 Jan 2014
Article Copyright 2014 by Sanmayce
Everything else Copyright © CodeProject, 1999-2014
Terms of Service
Layout: fixed | fluid