An add-on from 2016-Aug-22:
The fastest memmem(), known to me, is already the tuned Railgun_Swampshine_BailOut - avoiding second pattern comparison in BMH2 and pseudo-BMH4. It is called 'Trolldom', to signify its secondary purpose - to traverse fast via Trolllandia a.k.a. ugly/deformed haystacks causing quadratic behavior of most Boyer-Moore-Horspool variants. GNU/GLIBC memmem() is linearIC and sublinearIC at times while Railgun_Trolldom is mostly sublinearIC.
Railgun_Trolldom is slightly faster (both with Intel & GCC) than Railgun_Swampshine, this is mostly due to adding a bitwise BMH order 2 (8KB table overhead instead of 64KB) path - for haystacks < 77777 bytes long - the warm-up is faster.
So the result on Core 2 Q9550s @2.83GHz DDR2 @666MHz / i5-2430M @3.00GHz DDR3 @666MHz:
--------------------------------------------------------------------------------------------------------------------------------
| Searcher | GNU/GLIBC memmem() | Railgun_Swampshine | Railgun_Trolldom |
|--------------------------------------------------------------------------------------------------|---------------------------|
| Testfile\Compiler | Intel 15.0 | GCC 5.10 | Intel 15.0 | GCC 5.10 | Intel 15.0 | GCC 5.10 |
|------------------------------------------------------------------------------------------------------------------------------|
| Size: 27,703 bytes | 4506/- | 5330/14725 | 13198/- | 11581/15171 | 19105/22449 | 15493/21642 |
| Name: An_Interview_with_Carlos_Castaneda.TXT | | | | |
| LATENCY-WISE: Number of 'memmem()' Invocations: 308,062 | | | | |
| THROUGHPUT-WISE: Number of Total bytes Traversed: 3,242,492,648 | | | | |
|------------------------------------------------------------------------------------------------------------------------------|
| Size: 2,347,772 bytes | 190/- | 226/244 | 1654/- | 1729/1806 | 1794/1822 | 1743/1809 |
| Name: Gutenberg_EBook_Don_Quixote_996_(ANSI).txt | | | | |
| LATENCY-WISE: Number of 'memmem()' Invocations: 14,316,954 | | | | |
| THROUGHPUT-WISE: Number of Total bytes Traversed: 6,663,594,719,173 | | | | |
|------------------------------------------------------------------------------------------------------------------------------|
| Size: 899,425 bytes | 582/- | 760/816 | 3094/- | 2898/3088 | 3255/3289 | 2915/3322 |
| Name: Gutenberg_EBook_Dokoe_by_Hakucho_Masamune_(Japanese_UTF8).txt | | | | |
| LATENCY-WISE: Number of 'memmem()' Invocations: 3,465,806 | | | | |
| THROUGHPUT-WISE: Number of Total bytes Traversed: 848,276,034,315 | | | | |
|------------------------------------------------------------------------------------------------------------------------------|
| Size: 4,487,433 bytes | 104/- | 109/116 | 445/- | 458/417 | 450/411 | 467/425 |
| Name: Dragonfly_genome_shotgun_sequence_(ACGT_alphabet).fasta | | | | |
| LATENCY-WISE: Number of 'memmem()' Invocations: 20,540,375 | | | | |
| THROUGHPUT-WISE: Number of Total bytes Traversed: 13,592,530,857,131 | | | | |
|------------------------------------------------------------------------------------------------------------------------------|
| Size: 954,035 bytes | 99/- | 144/144 | 629/- | 580/682 | 634/807 | 585/725 |
| Name: LAOTZU_Wu_Wei_(BINARY).pdf | | | | |
| LATENCY-WISE: Number of 'memmem()' Invocations: 27,594,933 | | | | |
| THROUGHPUT-WISE: Number of Total bytes Traversed: 8,702,455,122,519 | | | | |
|------------------------------------------------------------------------------------------------------------------------------|
| Size: 15,583,440 bytes | -/- | -/- | -/- | 663/771 | 675/778 | 663/757 |
| Name: Arabian_Nights_complete.html | | | | |
| LATENCY-WISE: Number of 'memmem()' Invocations: 72,313,262 | | | | |
| THROUGHPUT-WISE: Number of Total bytes Traversed: 105,631,163,854,099 | | | | |
--------------------------------------------------------------------------------------------------------------------------------
The benchmark is downloadable at my Internet drive and my site:
https://1drv.ms/u/s!AmWWFXGMzDmEgl3izdwhSnIq0Lv5 www.sanmayce.com/Downloads/Nakamichi_Kintaro++_source_executables_64bit_(GCC510-vs-Intel150)_(TW-vs-RG)_BENCHMARK.zip
Note0: Railgun_Trolldom is slightly faster (both with Intel & GCC) than Railgun_Swampshine, this is mostly due to adding a bitwise BMH order 2 (8KB table overhead instead of 64KB) path - for haystacks < 77777 bytes long - the warm-up is faster.
Note1: The numbers represent the rate (bytes/s) at which patterns/needles 4,5,6,8,9,10,12,13,14,16,17,18,24 bytes long are memmemed into 4KB, 256KB, 1MB, 256MB long haystacks.
in fact, these numbers are the compression speed using LZSS and memmem() as matchfinder.
Note2: The Arabian Nights is downloadable at:
https://ebooks.adelaide.edu.au/b/burton/richard/b97b/complete.html
Note3: On i5-2430M, TW is catching up since this CPU crunches instructions faster while the RAM speed is almost the same, Railgun suffers from the slow RAM fetches - the prefetcher and such suck.
Note4: With a simple English text 'Tales of 1001 Nights', 15,583,440 bytes long, the cumulative size of traversed haystack data is nearly 100TB, 105,631,163,854,099 ~ 1024^4 = 1,099,511,627,776.
Note5: With a simple French text 'Agatha_Christie_85-ebooks_(French)_TXT.tar', 32,007,168 bytes long, the cumulative size of traversed haystack data is nearly 200TB ~ 234,427,099,834,376.
Since it is pure C, no intrinsics used, perhaps it will work unproblematically on most *nix systems.
To the attached archive I included the benchmark C source along with its listing in 28 pages in PDF format - to be printed and digested more easily from search techniques aficionados. ENFUN!
Here I digress because want to share my disappontment from not seeing Hamid's Turbo Family of superfast etudes utilized/appreciated. Next dumps reveal what the attached Benchmark (using Railgun_Trolldom, it parses 200+TB in this torture) does and how it compares to many optimized compressors of today.
The textual decompression picture given by:
- TurboBench Copyright (c) 2013-2016 Powturbo Feb 21 2016;
- zstd command line interface 64-bits v0.8.1, by Yann Collet;
- Nakamichi 'Kintaro++' written by Kaze.
This very benchmark (with it you can reproduce the dumps below) is at:
https://1drv.ms/u/s!AmWWFXGMzDmEgl6-_1mrjcTj07B1
Once the tarred Agatha's 85 texts into French are compressed by the benchmark let's see how the decompression rate fares against many "usual suspects" i.e. compressors, on i5-2430M @3.00GHz DDR3 @666MHz:
D:\Nakamichi_Kintaro++_source_executables_64bit_(GCC510-vs-Intel150)_(TW-vs-RG)_BENCHMARK>Nakamichi_Kintaro++_Intel_15.0_64bit.exe Agatha_Christie_85-ebooks_(French)_TXT.tar
Nakamichi 'Kintaro++', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Note1: This compile can handle files up to 1711MB.
Note2: The matchfinder/memmem() is Railgun_Trolldom.
Current priority class is HIGH_PRIORITY_CLASS.
Compressing 32007168 bytes ...
|; Each rotation means 64KB are encoded; Done 100%; Compression Ratio: 3.53:1
NumberOfFullLiterals (lower-the-better): 164
NumberOfFlushLiteralsHeuristic (bigger-the-better): 184323
Legend: WindowSizes: 1/2/3/4=Tiny/Short/Medium/Long
NumberOf(Tiny)Matches[Short]Window (4)[2]: 226869
NumberOf(Short)Matches[Short]Window (8)[2]: 119810
NumberOf(Medium)Matches[Short]Window (12)[2]: 71202
NumberOf(Long)Matches[Short]Window (16)[2]: 31955
NumberOf(MaxLong)Matches[Short]Window (24)[2]: 7078
NumberOf(Tiny)Matches[Medium]Window (5)[3]: 257313
NumberOf(Short)Matches[Medium]Window (9)[3]: 526493
NumberOf(Medium)Matches[Medium]Window (13)[3]: 285579
NumberOf(Long)Matches[Medium]Window (17)[3]: 158873
NumberOf(MaxLong)Matches[Medium]Window (24)[3]: 51276
NumberOf(Tiny)Matches[Long]Window (6)[4]: 41075
NumberOf(Short)Matches[Long]Window (10)[4]: 240454
NumberOf(Medium)Matches[Long]Window (14)[4]: 258653
NumberOf(Long)Matches[Long]Window (18)[4]: 209007
NumberOf(MaxLong)Matches[Long]Window (24)[4]: 190929
RAM-to-RAM performance: 605 bytes/s.
Compressed to 9076876 bytes.
LATENCY-WISE: Number of 'memmem()' Invocations: 102,091,852
THROUGHPUT-WISE: Number of Total bytes Traversed: 234,427,099,834,376
D:\Benchmark_Goldenboy_vs_LzTurbo_vs_Zstd>dir
08/22/2016 07:07 AM 9,076,876 Agatha_Christie_85-ebooks_(French)_TXT.tar.Nakamichi
08/22/2016 07:44 AM 124,928 Nakamichi_Kintaro++_Intel_15.0_64bit.exe
08/16/2016 01:55 PM 9,191,476 turbobenchs.exe
08/18/2016 07:24 AM 841,297 zstd-windows-v0.8.1_win64.exe
D:\Benchmark_Goldenboy_vs_LzTurbo_vs_Zstd>RUNME.BAT
D:\Benchmark_Goldenboy_vs_LzTurbo_vs_Zstd>Nakamichi_Kintaro++_Intel_15.0_64bit.exe Agatha_Christie_85-ebooks_(French)_TXT.tar.Nakamichi /report
Nakamichi 'Kintaro++', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Note1: This compile can handle files up to 1711MB.
Note2: The matchfinder/memmem() is Railgun_Trolldom.
Current priority class is HIGH_PRIORITY_CLASS.
Decompressing 9076876 bytes ...
RAM-to-RAM performance: 512 MB/s.
Compression Ratio (bigger-the-better): 3.53:1
D:\Benchmark_Goldenboy_vs_LzTurbo_vs_Zstd>zstd-windows-v0.8.1_win64.exe -b12 "Agatha_Christie_85-ebooks_(French)_TXT.tar"
12#_(French)_TXT.tar : 32007168 -> 8965791 (3.570), 4.0 MB/s , 409.2 MB/s
D:\Benchmark_Goldenboy_vs_LzTurbo_vs_Zstd>zstd-windows-v0.8.1_win64.exe -b22 "Agatha_Christie_85-ebooks_(French)_TXT.tar"
22#_(French)_TXT.tar : 32007168 -> 6802321 (4.705), 1.0 MB/s , 267.4 MB/s
D:\Benchmark_Goldenboy_vs_LzTurbo_vs_Zstd>turbobench_singlefile_noNakamichi_2G.bat "Agatha_Christie_85-ebooks_(French)_TXT.tar"
D:\Benchmark_Goldenboy_vs_LzTurbo_vs_Zstd>turbobenchs.exe "Agatha_Christie_85-ebooks_(French)_TXT.tar" -ezlib,9/fastlz,2/lzturbo,19,29,39/bzip2/chameleon,2/snappy_c/zstd,12,21/density,3/lz4,16/lz5,15/lzham,4/brieflz/brotli,1
1/crush,2/lzma,9/zpaq,2/lzf/yappy/trle/memcpy/lzsse2,1,16 -g -B2G
Benchmark: 1 from 3
6739693 21.1 0.89 74.96 lzma 9 Agatha_Christie_85-ebooks_(French)_TXT.tar
6758120 21.1 0.75 159.97 lzham 4 Agatha_Christie_85-ebooks_(French)_TXT.tar
6801784 21.3 0.88 380.07 lzturbo 39 Agatha_Christie_85-ebooks_(French)_TXT.tar
6834372 21.4 1.15 280.52 zstd 21 Agatha_Christie_85-ebooks_(French)_TXT.tar
6920292 21.6 0.34 187.25 brotli 11 Agatha_Christie_85-ebooks_(French)_TXT.tar
7680438 24.0 8.73 19.14 bzip2 Agatha_Christie_85-ebooks_(French)_TXT.tar
8897625 27.8 3.15 56.09 zpaq 2 Agatha_Christie_85-ebooks_(French)_TXT.tar
8964510 28.0 4.17 397.51 zstd 12 Agatha_Christie_85-ebooks_(French)_TXT.tar
8990840 28.1 0.89 590.17 lzturbo 29 Agatha_Christie_85-ebooks_(French)_TXT.tar
9076876 ? ? 512.00 Nakamichi Kintaro++ outside TurboBench
9147309 28.6 0.13 223.49 crush 2 Agatha_Christie_85-ebooks_(French)_TXT.tar
10104473 31.6 1.25 351.10 lz5 15 Agatha_Christie_85-ebooks_(French)_TXT.tar
10691287 33.4 4.92 2013.23 lzsse2 16 Agatha_Christie_85-ebooks_(French)_TXT.tar
10973383 34.3 10.13 214.94 zlib 9 Agatha_Christie_85-ebooks_(French)_TXT.tar
12107431 37.8 0.94 2136.70 lzturbo 19 Agatha_Christie_85-ebooks_(French)_TXT.tar
12266943 38.3 15.08 1615.99 lz4 16 Agatha_Christie_85-ebooks_(French)_TXT.tar
13238993 41.4 75.25 149.55 brieflz Agatha_Christie_85-ebooks_(French)_TXT.tar
14364428 44.9 255.83 234.91 density 3 Agatha_Christie_85-ebooks_(French)_TXT.tar
15448532 48.3 12.27 1503.99 lzsse2 1 Agatha_Christie_85-ebooks_(French)_TXT.tar
17262978 53.9 56.14 1158.91 yappy Agatha_Christie_85-ebooks_(French)_TXT.tar
17678955 55.2 172.49 344.66 lzf Agatha_Christie_85-ebooks_(French)_TXT.tar
17868118 55.8 177.74 303.39 fastlz 2 Agatha_Christie_85-ebooks_(French)_TXT.tar
17915390 56.0 926.24 1851.55 chameleon 2 Agatha_Christie_85-ebooks_(French)_TXT.tar
18694206 58.4 228.88 622.50 snappy_c Agatha_Christie_85-ebooks_(French)_TXT.tar
31952980 99.8 124.68 1230.32 trle Agatha_Christie_85-ebooks_(French)_TXT.tar
32007172 100.0 6096.18 5338.21 memcpy Agatha_Christie_85-ebooks_(French)_TXT.tar
D:\Benchmark_Goldenboy_vs_LzTurbo_vs_Zstd>turbobenchs.exe -l2
...
Brotli v16-02 Apache license https:Bzip2 v1.06 BSD like http:Chameleon v15-03 Public Domain http:Crush v1.0.0 Public Domain http:Density v0.13.0 BSD license https:FastLz v0.1.0 BSD like http:Lz4 v1.7.1 BSD license https:Lz5 v1.4.1 BSD license https:Lzham v1.1 MIT license https:Lzma v9.35 Public Domain http:lzsse v16-02 BSD license https:Snappy-c v1.1.2 BSD Like https:Yappy v2011
zlib v1.2.8 zlib license http:ZSTD v0.5.1 BSD license https:TurboRLE ESC v16-01 https:TurboRLE v16-01 https:LzTurbo v1.3 https:...
D:\Benchmark_Goldenboy_vs_LzTurbo_vs_Zstd>
That's it, those 512MB/s are the result of the quantative approach in choosing literal lengths. Now, in order we to appreciate one superb (in fact FASTEST) performer, lzturbo 39 with its amazing 380.07MB/s has to be compared with latest Yann's Zstd and with supersimplistic Nakamichi:
Agatha_Christie_85-ebooks_(French)_TXT.tar when recreated:
6801784 -> 32,007,168 380.07 MB/s lzturbo 39
8965791 -> 32,007,168 409.20 MB/s zstd-windows-v0.8.1_win64.exe -b12
8990840 -> 32,007,168 590.17 MB/s lzturbo 29
9076876 -> 32,007,168 512.00 MB/s Nakamichi Kintaro++
Such, as LzTurbo's 29/39, modes are truly amazing such performance inspires one to look beyond current bloated software and rethink why superfast implementations are not mainstreamish, yet.
An add-on from 2016-Aug-13:
Two things.
First, the fix from the last time was buggy, my apologies, now fixed, quite embarrassing since it is a simple left/right boundary check. It doesn't affect the speed, it appears as rare pattern hit misses.
Since I don't believe in saying "sorry" but in making things right, here my attempt to further disgrace my amateurish work follows:
Two years ago, I didn't pay due attention to adding 'Swampwalker' heuristic to the Railgun_Ennearch, I mean, only quick test was done and no real proofing - this was due not to a blunder of mine, nor carelessness, but overconfidence in my ability to write "on the fly". Stupid, indeed, however, when a coder gets momentum in writing simple etudes he starts gaining false confidence of mastering the subject, not good for sure!
Hopefully, other coders will learn to avoid such full of neglect style.
For manual fix into the C code, change this line:
if ( ((signed int)(i-(PRIMALposition-1)+(count-1)) >= 0) && (&pbTarget[i-(PRIMALposition-1)] <= pbTargetMax - 4) ) {
with this one:
if ( ((signed int)(i-(PRIMALposition-1)) >= 0) && (&pbTarget[i-(PRIMALposition-1)]+((PRIMALlengthCANDIDATE-4+1)-1) <= pbTargetMax - 4) ) {
Second, wanted to present the heaviest testbed for search i.e. memmem() functions: it benefits the benchmarking (speed in real application) as well as bug-control.
The benchmark is downloadable at my INTERNET drive:
https://1drv.ms/u/s!AmWWFXGMzDmEglwjlUtnMJrfhosK
It also is zipped (the 4 testfiles are not) and attached at the top of this article.
The speed showdown has three facets:
- compares the 64bit code generated from GCC 5.10 versus Intel 15.0 compilers;
- compares two datasets - search speed through English texts versus Genome ACGT-type data;
- compares the tweaked Two-Way algorithm (implemented by Eric Blake) and adopted by GLIBC as memmem() versus my Railgun_Swampshine.
Note1: The GLIBC memmem() was taken from latest (2016-08-05) glibc 2.24 tar:
https://www.gnu.org/software/libc/
Note2: Eric Blake says that he enhanced the linearity of Two-Way by adding some sublinear paths, well, Railgun is all about sublinearity, so feel free to experiment with your own testfiles (worst-case-scenarios), just make such a file feed the compressor with it, then we will see how the LINEAR Two-Way behaves versus Railgun_Swampshine.
Note3: Just copy-and-paste 'Railgun_Swampshine' or 'Railgun_Ennearch' from the benchmark's source.
So the result on Core 2 Q9550s @2.83GHz:
---------------------------------------------------------------------------------------------
| testfile\Searcher | GNU/GLIBC memmem() | Railgun_Swampshine |
|-------------------------------------------------------------------------------------------|
| Compiler | Intel 15.0 | GCC 5.10 | Intel 15.0 | GCC 5.10 |
|-------------------------------------------------------------------------------------------|
| The_Project_Gutenberg_EBook_of_Don | 190 | 226 | 1654 | 1729 |
| _Quixote_996_(ANSI).txt | | | | |
| 2,347,772 bytes | | | | |
|-------------------------------------------------------------------------------------------|
| The_Project_Gutenberg_EBook_of_Dokoe | 582 | 760 | 3094 | 2898 |
| _by_Hakucho_Masamune_(Japanese_UTF-8).txt | | | | |
| 899,425 bytes | | | | |
|-------------------------------------------------------------------------------------------|
| Dragonfly_genome_shotgun_sequence | 104 | 109 | 445 | 458 |
| _(ACGT_alphabet).fasta | | | | |
| 4,487,433 bytes | | | | |
|-------------------------------------------------------------------------------------------|
| LAOTZU_Wu_Wei_(BINARY).pdf | 99 | 144 | 629 | 580 |
| 954,035 bytes | | | | |
|-------------------------------------------------------------------------------------------|
Note: The numbers represent the rate (bytes/s) at which patterns/needles 4,5,6,8,9,10,12,13,14,16,17,18,24 bytes long are memmemed into 4KB, 256KB, 1MB, 256MB long haystacks.
in fact, these numbers are the compression speed using LZSS and memmem() as matchfinder.
Two-Way is significantly slower than BMH Order 2, the speed-down is in range:
- for TEXTUAL ANSI alphabets: 1729/226= 7.6x
- for TEXTUAL UTF8 alphabets: 2898/760= 3.8x
- for TEXTUAL ACGT alphabets: 458/109= 4.2x
- for BINARY-esque alphabets: 580/144= 4.0x
For faster RAM, than mine @666MHz, and for haystacks multimegabytes long, the speedup goes beyond 8x.
The benchmark shows the real behavior (both latency and raw speed) of the memmem variants, I added also the Thierry Lecroq's Two-Way implementation:
http://www-igm.univ-mlv.fr/~lecroq/string/node26.html#SECTION00260
However, Eric Blake's one is faster, so it was chosen for the speed showdown.
Once I measured the total length of traversed haystacks, and for files 100+MB long, it went ... quintillion of bytes i.e. petabytes - good torture it is.
Enfun!
Introduction
The C strstr-like function presented below (written by Georgi 'Kaze') is named Railgun_Quadruplet_6ppp
and is based on well-known Boyer-Moore-Horspool and simplified Karp-Rabin algorithms. The main goal: to achieve fastest execution for all kinds of pattern/needle and string/haystack lengths (starting from 2 bytes).
Railgun_Quadruplet_6ppp
is now obsolete due to the arrival of Railgun_Quadruplet_7
which boosts the BMH fragment by more than a percent. See further below for its commentless and tabulated source.
Railgun_Quadruplet_7
exploits the CPUs fast unaligned DWORD
extract capability.
The heaviest and most precise test on 1+TB text (52 patterns vs 197MB English books text repeated 256 times results in 1000+ seconds, see strstr_SHORT-SHOWDOWN_r7_Heavy_Test_LOG.txt file for more information) gives:
[strstr_SHORT-SHOWDOWN_Microsoft_v16_Ox.exe:]
The variants returning the offset of first hit only:
Railgun_Quadruplet_6pp
52, i.e., average performance: 1584KB/clock Railgun_Quadruplet_7
52, i.e., average performance: 1659KB/clock
The variants counting all hits:
Railgun_6pp
8x2, i.e., average performance: 1246KB/clock Railgun_Quadruplet_7
8x2, i.e., average performance: 1266KB/clock
[strstr_SHORT-SHOWDOWN_Intel_v12_O3.exe:]
The variants returning the offset of first hit only:
Railgun_Quadruplet_6pp
52, i.e., average performance: 1773KB/clock Railgun_Quadruplet_7
52, i.e., average performance: 1685KB/clock
The variants counting all hits:
Railgun_6pp
8x2, i.e., average performance: 1054KB/clock Railgun_Quadruplet_7
8x2, i.e., average performance: 1076KB/clock
Note: For Railgun_Quadruplet_7
on CPUs with fast DWORD
(compared to BYTE
) memory fetching I expect significant boost, AFAIK the Intel i7 is one of those, unfortunately I lack the opportunity to play with it.
Or the two bottom-lines:
The FASTEST variant counting all hits:
Railgun_Quadruplet_7
8x2 with average performance: 1266KB/clock (best performance achieved with Microsoft_v16_Ox)!
The FASTEST variant returning the offset of first hit only:
Railgun_Quadruplet_6pp
52 with average performance: 1773KB/clock (1,943,883,635,456 bytes / 1,070,220 clocks = 1732 MB/s best performance achieved with Intel_v12_O3)!
An add-on from 2014-Apr-27:
My apologies.
The 'Swampwalker' heuristic has a nasty SIGNED/UNSIGNED bug which I illustrated several months ago in my fuzzy search article now is fixed here too:
The bug is this (the variables 'i' and 'PRIMALposition' are uint32_t):
Next line assumes -19 >= 0 is true:
if ( (i-(PRIMALposition-1)) >= 0) printf ("THE NASTY-CASTY BUG: %d >= 0\n", i-(PRIMALposition-1));
Next line assumes -19 >= 0 is false:
if ( (signed int)(i-(PRIMALposition-1)) >= 0) printf ("THE NASTY-CASTY BUG: %d >= 0\n", i-(PRIMALposition-1));
And the actual fix:
...
if ( count <= 0 ) {
if ( ((signed int)(i-(PRIMALposition-1)+(count-1)) >= 0) && (&pbTarget[i-(PRIMALposition-1)] <= pbTargetMax - 4) ) {
...
I checked the code one more time, it feels alright, love it.
The buggy 'Swampshine' function (in download section) has been replaced with fixed one.
An add-on from 2014-Feb-02:
Probably I am done with updates, here comes the latest one.
The benchmark with its source at: http://www.sekireigan.com/_KAZE_32bit_memmem-like_showdown_Swampshine_r2.7z.
And the beautiful main loop:
; The main loop of pseudo Boyer-Moore-Horspool order 2 in C:
; while (i <= cbTarget-cbPattern) {
; Gulliver = 1;
; if ( bm_Horspool_Order2[*(unsigned short *)&pbTarget[i+cbPattern-1-1]] != 0 ) {
; if ( bm_Horspool_Order2[*(unsigned short *)&pbTarget[i+cbPattern-1-1-2]] == 0 ) Gulliver = cbPattern-(2-1)-2; else {
; if ( *(uint32_t *)&pbTarget[i] == ulHashPattern) {
; count = cbPattern-4+1;
; while ( count > 0 && *(uint32_t *)(pbPattern+count-1) == *(uint32_t *)(&pbTarget[i]+(count-1)) )
; count = count-4;
; if ( count <= 0 ) return(pbTarget+i);
; }
; }
; } else Gulliver = cbPattern-(2-1);
; i = i + Gulliver;
; }
; The main loop of pseudo Boyer-Moore-Horspool order 2 in Assembly by Intel C++ Compiler XE for applications running on IA-32, Version 12.1.1.258, -O3 used:
.B2.28:
0019f 0f b7 6c 02 fe movzx ebp, WORD PTR [-2+edx+eax]
001a4 0f b6 34 2c movzx esi, BYTE PTR [esp+ebp]
001a8 85 f6 test esi, esi
001aa 0f 84 bd 00 00
00 je .B2.40
.B2.29:
001b0 0f b7 6c 02 fc movzx ebp, WORD PTR [-4+edx+eax]
001b5 0f b6 34 2c movzx esi, BYTE PTR [esp+ebp]
001b9 85 f6 test esi, esi
001bb 75 0c jne .B2.31
.B2.30:
001bd 8b ac 24 18 00
01 00 mov ebp, DWORD PTR [65560+esp]
001c4 e9 a6 00 00 00 jmp .B2.41
.B2.31:
001c9 8b b4 24 70 00
01 00 mov esi, DWORD PTR [65648+esp]
001d0 8b ac 24 14 00
01 00 mov ebp, DWORD PTR [65556+esp]
001d7 3b 2c 06 cmp ebp, DWORD PTR [esi+eax]
001da 0f 85 86 00 00
00 jne .B2.38
.B2.32:
001e0 8b b4 24 18 00
01 00 mov esi, DWORD PTR [65560+esp]
001e7 85 f6 test esi, esi
001e9 0f 8e 42 06 00
00 jle .B2.110
.B2.33:
001ef 33 ff xor edi, edi
001f1 8d 2c 10 lea ebp, DWORD PTR [eax+edx]
001f4 89 8c 24 00 00
01 00 mov DWORD PTR [65536+esp], ecx
001fb 89 94 24 04 00
01 00 mov DWORD PTR [65540+esp], edx
00202 89 9c 24 0c 00
01 00 mov DWORD PTR [65548+esp], ebx
00209 89 84 24 08 00
01 00 mov DWORD PTR [65544+esp], eax
00210 8b d7 mov edx, edi
00212 8b 8c 24 10 00
01 00 mov ecx, DWORD PTR [65552+esp]
00219 8b 9c 24 7c 00
01 00 mov ebx, DWORD PTR [65660+esp]
.B2.34:
00220 8b 44 0a fc mov eax, DWORD PTR [-4+edx+ecx]
00224 3b 44 2a fc cmp eax, DWORD PTR [-4+edx+ebp]
00228 75 11 jne .B2.36
.B2.35:
0022a 47 inc edi
0022b 8d 74 1a f9 lea esi, DWORD PTR [-7+edx+ebx]
0022f 83 c2 fc add edx, -4
00232 3b bc 24 1c 00
01 00 cmp edi, DWORD PTR [65564+esp]
00239 72 e5 jb .B2.34
.B2.36:
0023b 89 8c 24 10 00
01 00 mov DWORD PTR [65552+esp], ecx
00242 8b 8c 24 00 00
01 00 mov ecx, DWORD PTR [65536+esp]
00249 8b 94 24 04 00
01 00 mov edx, DWORD PTR [65540+esp]
00250 8b 84 24 08 00
01 00 mov eax, DWORD PTR [65544+esp]
00257 8b 9c 24 0c 00
01 00 mov ebx, DWORD PTR [65548+esp]
.B2.37:
0025e 85 f6 test esi, esi
00260 0f 8e cb 05 00
00 jle .B2.110
.B2.38:
00266 bd 01 00 00 00 mov ebp, 1
0026b eb 02 jmp .B2.41
.B2.40:
0026d 8b eb mov ebp, ebx
.B2.41:
0026f 03 c5 add eax, ebp
00271 3b c1 cmp eax, ecx
00273 0f 86 26 ff ff
ff jbe .B2.28
.B2.42:
; The main loop of pseudo Boyer-Moore-Horspool order 2: 00273-0019f+6 = 218 bytes
An add-on from 2014-Jan-29:
Many thanks go to Harold, the fellow CP member who helped me to see how i7 4770K executes Railgun.
At last Railgun has become troll-proof.
More tweaks are needed, but let's not kill the joy and let's see what 'Swampshine' revision 1 holds.
Results from three type of tortures are given below:
1] 'Wikipedia' torture, XML data
2] 'GENOME' torture, 'ACGT' data
3] 'Worst-Case-Scenarios' torture, repetitive data
My verdict: Railgun_Swampshine is the weapon of choice for all kind of patterns up to ~100bytes, it won't fail you no matter how muddy the battlefield is.
Not happy with current results, some calibrations remain, some drags in first revision 'Swampshine':
1] Order 4 (instead of 2) of 'Swampwalker' is used.
2] Unoptimized case when needle is PREFIXED, (decent performance when PREFIXED&POSTFIXED and when POSTFIXED).
3] Code Size is not pretty: 00a8c-00000+4 = 2704 bytes
4] Much steeper speed degradation than expected with increasing the needle length.
Doing Search for Pattern(65bytes) into String(52428800bytes) as-one-line ...
Railgun_Swampshine precise performance: 18941MB/s; PRIMALposition=1; PRIMALlength=65
Railgun_Ennearch precise performance: 18859MB/s
Railgun_DecumanusBITified precise performance: 19136MB/s
Boyer_Moore-Horspool precise performance: 6959MB/s
Doing Search for Pattern(83bytes) into String(52428800bytes) as-one-line ...
Railgun_Swampshine precise performance: 17734MB/s; PRIMALposition=1; PRIMALlength=68
Railgun_Ennearch precise performance: 18818MB/s
Railgun_DecumanusBITified precise performance: 18941MB/s
Boyer_Moore-Horspool precise performance: 7871MB/s
Doing Search for Pattern(98bytes) into String(52428800bytes) as-one-line ...
Railgun_Swampshine precise performance: 17791MB/s; PRIMALposition=1; PRIMALlength=68
Railgun_Ennearch precise performance: 19654MB/s
Railgun_DecumanusBITified precise performance: 18948MB/s
Boyer_Moore-Horspool precise performance: 7917MB/s
Doing Search for Pattern(113bytes) into String(52428800bytes) as-one-line ...
Railgun_Swampshine precise performance: 17360MB/s; PRIMALposition=1; PRIMALlength=68
Railgun_Ennearch precise performance: 20802MB/s
Railgun_DecumanusBITified precise performance: 19013MB/s
Boyer_Moore-Horspool precise performance: 8147MB/s
Doing Search for Pattern(130bytes) into String(52428800bytes) as-one-line ...
Railgun_Swampshine precise performance: 17486MB/s; PRIMALposition=1; PRIMALlength=68
Railgun_Ennearch precise performance: 22800MB/s
Railgun_DecumanusBITified precise performance: 18631MB/s
Boyer_Moore-Horspool precise performance: 9085MB/s
Doing Search for Pattern(140bytes) into String(52428800bytes) as-one-line ...
Railgun_Swampshine precise performance: 17311MB/s; PRIMALposition=1; PRIMALlength=68
Railgun_Ennearch precise performance: 23872MB/s
Railgun_DecumanusBITified precise performance: 17926MB/s
Boyer_Moore-Horspool precise performance: 9343MB/s
Doing Search for Pattern(151bytes) into String(52428800bytes) as-one-line ...
Railgun_Swampshine precise performance: 17072MB/s; PRIMALposition=79; PRIMALlength=73
Railgun_Ennearch precise performance: 23983MB/s
Railgun_DecumanusBITified precise performance: 17933MB/s
Boyer_Moore-Horspool precise performance: 8591MB/s
Doing Search for Pattern(162bytes) into String(52428800bytes) as-one-line ...
Railgun_Swampshine precise performance: 17725MB/s; PRIMALposition=79; PRIMALlength=84
Railgun_Ennearch precise performance: 26535MB/s
Railgun_DecumanusBITified precise performance: 19854MB/s
Boyer_Moore-Horspool precise performance: 9072MB/s
Doing Search for Pattern(201bytes) into String(52428800bytes) as-one-line ...
Railgun_Swampshine precise performance: 17954MB/s; PRIMALposition=67; PRIMALlength=119
Railgun_Ennearch precise performance: 31708MB/s
Railgun_DecumanusBITified precise performance: 23392MB/s
Boyer_Moore-Horspool precise performance: 6983MB/s
Doing Search for Pattern(230bytes) into String(52428800bytes) as-one-line ...
Railgun_Swampshine precise performance: 15113MB/s; PRIMALposition=165; PRIMALlength=66
Railgun_Ennearch precise performance: 33098MB/s
Railgun_DecumanusBITified precise performance: 25518MB/s
Boyer_Moore-Horspool precise performance: 7599MB/s
Doing Search for Pattern(255bytes) into String(52428800bytes) as-one-line ...
Railgun_Swampshine precise performance: 14711MB/s; PRIMALposition=90; PRIMALlength=65
Railgun_Ennearch precise performance: 39911MB/s
Railgun_DecumanusBITified precise performance: 29435MB/s
Boyer_Moore-Horspool precise performance: 7602MB/s
Doing Search for Pattern(589bytes) into String(52428800bytes) as-one-line ...
Railgun_Swampshine precise performance: 7163MB/s; PRIMALposition=32; PRIMALlength=150
Railgun_Ennearch precise performance: 123574MB/s
Railgun_DecumanusBITified precise performance: 95773MB/s
Boyer_Moore-Horspool precise performance: 8370MB/s
Doing Search for Pattern(1080bytes) into String(52428800bytes) as-one-line ...
Railgun_Swampshine precise performance: 3713MB/s; PRIMALposition=32; PRIMALlength=150
Railgun_Ennearch precise performance: 196321MB/s
Railgun_DecumanusBITified precise performance: 178054MB/s
Boyer_Moore-Horspool precise performance: 11102MB/s
Doing Search for Pattern(1755bytes) into String(52428800bytes) as-one-line ...
Railgun_Swampshine precise performance: 2723MB/s; PRIMALposition=1506; PRIMALlength=125
Railgun_Ennearch precise performance: 287063MB/s
Railgun_DecumanusBITified precise performance: 267550MB/s
Boyer_Moore-Horspool precise performance: 13990MB/s
Doing Search for Pattern(2686bytes) into String(52428800bytes) as-one-line ...
Railgun_Swampshine precise performance: 800MB/s; PRIMALposition=2018; PRIMALlength=266
Railgun_Ennearch precise performance: 227827MB/s
Railgun_DecumanusBITified precise performance: 240171MB/s
Boyer_Moore-Horspool precise performance: 11565MB/s
Doing Search for Pattern(3867bytes) into String(52428800bytes) as-one-line ...
Railgun_Swampshine precise performance: 492MB/s; PRIMALposition=2714; PRIMALlength=216
Railgun_Ennearch precise performance: 267842MB/s
Railgun_DecumanusBITified precise performance: 279599MB/s
Boyer_Moore-Horspool precise performance: 11846MB/s
What can I say? Trolls fought back, I underestimated their filthiness.
Order 4 of 'Swampwalker' follows:
// Swampwalker heuristic order 4 (Needle should be bigger than 4) [
// Needle: 1234567890qwertyuiopasdfghjklzxcv PRIMALposition=01 PRIMALlength=33 '1234567890qwertyuiopasdfghjklzxcv'
// Needle: vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv PRIMALposition=29 PRIMALlength=04 'vvvv'
// Needle: vvvvvvvvvvBOOMSHAKALAKAvvvvvvvvvv PRIMALposition=08 PRIMALlength=20 'vvvBOOMSHAKALAKAvvvv'
// Needle: Trollland PRIMALposition=01 PRIMALlength=09 'Trollland'
// Needle: Swampwalker PRIMALposition=01 PRIMALlength=11 'Swampwalker'
// Needle: licenselessness PRIMALposition=01 PRIMALlength=15 'licenselessness'
// Needle: alfalfa PRIMALposition=02 PRIMALlength=06 'lfalfa'
// Needle: Sandokan PRIMALposition=01 PRIMALlength=08 'Sandokan'
// Needle: shazamish PRIMALposition=01 PRIMALlength=09 'shazamish'
// Needle: Simplicius Simplicissimus PRIMALposition=06 PRIMALlength=20 'icius Simplicissimus'
// Needle: domilliaquadringenquattuorquinquagintillion PRIMALposition=01 PRIMALlength=32 'domilliaquadringenquattuorquinqu'
// Needle: boom-boom PRIMALposition=02 PRIMALlength=08 'oom-boom'
// Needle: vvvvv PRIMALposition=01 PRIMALlength=04 'vvvv'
// Needle: 12345 PRIMALposition=01 PRIMALlength=05 '12345'
// Needle: likey-likey PRIMALposition=03 PRIMALlength=09 'key-likey'
// Needle: BOOOOOM PRIMALposition=03 PRIMALlength=05 'OOOOM'
// Needle: aaaaaBOOOOOM PRIMALposition=02 PRIMALlength=09 'aaaaBOOOO'
// Needle: BOOOOOMaaaaa PRIMALposition=03 PRIMALlength=09 'OOOOMaaaa'
PRIMALlength=0;
for (i=0+(1); i < cbPattern-((4)-1)+(1)-(1); i++) { // -(1) because the last BB order 4 has no counterpart(s)
FoundAtPosition = cbPattern - ((4)-1) + 1;
PRIMALpositionCANDIDATE=i;
while ( PRIMALpositionCANDIDATE <= (FoundAtPosition-1) ) {
j = PRIMALpositionCANDIDATE + 1;
while ( j <= (FoundAtPosition-1) ) {
if ( *(uint32_t *)(pbPattern+PRIMALpositionCANDIDATE-(1)) == *(uint32_t *)(pbPattern+j-(1)) ) FoundAtPosition = j;
j++;
}
PRIMALpositionCANDIDATE++;
}
PRIMALlengthCANDIDATE = (FoundAtPosition-1)-i+1 +((4)-1);
if (PRIMALlengthCANDIDATE >= PRIMALlength) {PRIMALposition=i; PRIMALlength = PRIMALlengthCANDIDATE;}
if (cbPattern-i+1 <= PRIMALlength) break;
}
// Swampwalker heuristic order 4 (Needle should be bigger than 4) ]
'Worst-Case-Scenarios' torture gave:
Haystack: In file 'WORST.TXT'
Type: Worst
Size: 200,000,609 bytes
D:\_KAZE\_KAZE_32bit_memmem-like_showdown_Swampshine>type WORST.TXT
[200,000,000 'Z' bytes]
ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
ZZZSHAKALAKAZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
ZZZZZZZZZZZZZZZ
Needle #1: In file 'WCS_PREFIXEDandPOSTFIXED_609chars.txt'
Needle #2: In file 'WCS_PREFIXED_309chars.txt'
Needle #3: In file 'WCS_POSTFIXED_309chars.txt'
D:\_KAZE\_KAZE_32bit_memmem-like_showdown_Swampshine>type WCS_PREFIXEDandPOSTFIXED_609chars.txt
ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
ZZZSHAKALAKAZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
ZZZZZZZZZZZZZZZ
D:\_KAZE\_KAZE_32bit_memmem-like_showdown_Swampshine>type WCS_PREFIXED_309chars.txt
ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
ZZZSHAKALAKA
D:\_KAZE\_KAZE_32bit_memmem-like_showdown_Swampshine>type WCS_POSTFIXED_309chars.txt
SHAKALAKAZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
ZZZZZZZZZZZZ
For needle in file 'WCS_PREFIXEDandPOSTFIXED_609chars.txt':
Doing Search for Pattern(609bytes) into String(200000609bytes) as-one-line ...
Railgun_Swampshine precise performance: 902MB/s
Railgun_Ennearch precise performance: 9MB/s
Railgun_DecumanusBITified precise performance: 14MB/s
Boyer_Moore-Horspool precise performance: 21MB/s
PRIMAL Needle: 'ZZZSHAKALAKAZZZZ'
PRIMALposition=298; PRIMALlength=16
For needle in file 'WCS_PREFIXED_309chars.txt':
Doing Search for Pattern(309bytes) into String(200000609bytes) as-one-line ...
Railgun_Swampshine precise performance: 329MB/s
Railgun_Ennearch precise performance: 155MB/s
Railgun_DecumanusBITified precise performance: 102MB/s
Boyer_Moore-Horspool precise performance: 3048MB/s
PRIMAL Needle: 'ZZZZSHAKALAKA'
PRIMALposition=297; PRIMALlength=13
For needle in file 'WCS_POSTFIXED_309chars.txt':
Doing Search for Pattern(309bytes) into String(200000609bytes) as-one-line ...
Railgun_Swampshine precise performance: 902MB/s
Railgun_Ennearch precise performance: 451MB/s
Railgun_DecumanusBITified precise performance: 225MB/s
Boyer_Moore-Horspool precise performance: 223MB/s
PRIMAL Needle: 'SHAKALAKAZZZZ'
PRIMALposition=1; PRIMALlength=13
Revision 2 of 'Swampshine' should remove the break.
One solution is to fall back to 'Gulliver', after the transformation (order 2, not 4) its order 2 skips will outperform Boyer_Moore-Horspool for sure, but this will happen only if I fail to come up with something else.
An add-on from 2014-Jan-23:
The latest best (for non-worst-case-scenarios) is 'Railgun_Ennearch' - even better than 'Railgun_Decumanus':
Doing Search for Pattern(589bytes) into String(52428800bytes) as-one-line ...
Railgun_Ennearch precise performance: 19797MB/s
Railgun_DecumanusBITified precise performance: 17807MB/s
Boyer_Moore-Horspool precise performance: 3252MB/s
Doing Search for Pattern(1080bytes) into String(52428800bytes) as-one-line ...
Railgun_Ennearch precise performance: 37175MB/s
Railgun_DecumanusBITified precise performance: 33822MB/s
Boyer_Moore-Horspool precise performance: 4473MB/s
Doing Search for Pattern(1755bytes) into String(52428800bytes) as-one-line ...
Railgun_Ennearch precise performance: 138014MB/s
Railgun_DecumanusBITified precise performance: 121397MB/s
Boyer_Moore-Horspool precise performance: 5731MB/s
Doing Search for Pattern(2686bytes) into String(52428800bytes) as-one-line ...
Railgun_Ennearch precise performance: 129381MB/s
Railgun_DecumanusBITified precise performance: 145128MB/s
Boyer_Moore-Horspool precise performance: 4541MB/s
Doing Search for Pattern(3867bytes) into String(52428800bytes) as-one-line ...
Railgun_Ennearch precise performance: 169574MB/s
Railgun_DecumanusBITified precise performance: 171716MB/s
Boyer_Moore-Horspool precise performance: 4514MB/s
For Wikipedia 52,428,800 bytes haystack and 3,867 bytes needle Railgun_Ennearch overpowers Boyer_Moore-Horspool by (169574-4514)/4514*100% = 3,656%, commentless.
The benchmark with its source at: http://www.sekireigan.com/_KAZE_32bit_memmem-like_showdown_Ennearch.7z.
Wanna traverse the Trolldom running not walking muchless crawling.
From the very start Railgun was intended as text searcher, later on it has been widened to hit non-textual data, while the Worst Case Scenarios were uncharted deadly swampy zones, until now.
I am after writing the first W.C.S. Railgun: 'Railgun_Trolldom'.
There is one very simple heuristic which won't hurt (speedwise) the rest sections.
It is to be embedded into pseudo-order 4 section (where needle is longer than 19, 'if ( cbPattern>NeedleThreshold2vs4Nonus )' condition).
It is to find the rightmost 2-bytes chunk that appears most times, immediately an example:
Let our Haystack is 200,000,000+33 bytes long, where the first 200,000,000 bytes are 'v' byte and the needle is last 33 bytes:
Haystack: ...vvvvvvvvvvBOOMSHAKALAKAvvvvvvvvvv 200,000,033 bytes long
Needle: vvvvvvvvvvBOOMSHAKALAKAvvvvvvvvvv 33 bytes long
Factorizing the needle, Total Building-Blocks are 33-2+1 BBs/suffixes:
BB/Suffix01: [vv]vvvvvvvvBOOMSHAKALAKAvvvvvvvvvv
BB/Suffix02: v[vv]vvvvvvvBOOMSHAKALAKAvvvvvvvvvv
...
BB/Suffix31: vvvvvvvvvvBOOMSHAKALAKAvvvvvvv[vv]v
BB/Suffix32: vvvvvvvvvvBOOMSHAKALAKAvvvvvvvv[vv]
Distinct BBs are 33-2+1-(8+1+1+9)=13:
01: [vv] ! 8 duplicates !
02: [vB]OOMSHAKALAKAv
03: [BO]OMSHAKALAKAv
04: [OO]MSHAKALAKAv
05: [OM]SHAKALAKAv
06: [MS]HAKALAKAv
07: [SH]AKALAKAv
08: [HA]KALAKAv
09: [AK]ALAKAv
10: [KA]LAKAv
11: [AL]AKAv
12: [LA]KAv
13: [AK]Av ! 1 duplicate !
14: [KA]v ! 1 duplicate !
15: [Av]
16: [vv] ! 9 duplicates !
The five questions giving green light to this (so-called 'Swamprunner') heuristic:
Question #1] Is the 'kick-in' (Total/Distinct elements in lookup table) ratio bigger than 1?
Question #2] Is the number of Distinct elements bigger than 1?
Question #3] What is THE most common & rightmost 2-byte suffix (so-called 'TROLLish' Building-Block)?
Question #4] What is THE longest & rightmost string NOT containing the TROLLish BB (so-called 'NewNeedle')?
Question #5] Is THE longest & rightmost string NOT containing the TROLLish BB bigger than 3?
Zeroing & Reloading the lookup table with new needle's values, REasking the above for the 'NewNeedle'
Let's answer these five questions:
Answer #1] Yes, (int)(33-2+1)/(13)=2 > 1
Answer #2] Yes, 13 > 1
Answer #3] 'vv'
Answer #4] 'vBOOMSHAKALAKAv'
Answer #5] Yes, 15 > 3
Zeroing & Reloading the lookup table with new needle's values, REanswering the above for the 'NewNeedle'
Five green lights, 'Swamprunner' is under way.
Searching using order 2, if found then compare 'LeftLeftover+NewNeedle+RightLeftover' at corresponding Haystack offset.
'Railgun_Ennearch' + 'Swamprunner' heuristic = 'Railgun_Trolldom'
Ha, just recalled one of the reasons Napoleon to meet his final defeat in the Battle of Waterloo, his artillery couldn't make splash damage due to the muddy terrain - the cannonballs simply dived into the mud.
Hopefully my railgun will produce MUDDY TSUNAMIS in the trollish swamps making countless trolls to ride the bullet.
Update:
Not happy with 'Swamprunner', I realized that it can't do much good, more aggressive heuristic must be implemented.
I had to sleep one night away in order to refine and simplify the overkillous 'Swamprunner', the result is shazamish (not shamazing), the new heuristic is called 'Swampwalker'.
'Swamprunner' is too weak and so unpractical compared to 'Swampwalker'.
The idea is to find THE rightmost & longest PRIMAL string within the needle, PRIMAL is quite the same as prime numbers i.e. it has to be unfactorizable.
Unfactorizablility guarantees EXCELLENT Skip Performance for BMH Order 2 and BMH pseudo-order 4.
If the longest PRIMAL string happens to be of length 2 (all bytes are the same) then standard search is performed.
The algorithm for this nifty heuristic is simple, it is applied for order 2, at each position the first duplicate is searched and duplicate's position constitutes the right boundary for next searches, of course the next BB is examined in the same way until right boundary is reached, and all this for one position at a time.
At each position 'PRIMALlength' is updated if it is EQUAL or bigger than the previous 'PRIMALlength'.
In the end we have two values: 'PRIMALposition' and 'PRIMALlength' defining the PRIMAL NewNeedle.
Again, at each STEP 'PRIMALlength' is updated if it is EQUAL or bigger than the previous 'PRIMALlength'.
Legend:
'[]' points to BB forming left or right boundary;
'{}' points to BB being searched for;
'()' position of duplicate and new right boundary;
Three examples follow:
00000000011111111112222222222333
12345678901234567890123456789012
Example #1 for Needle: 1234567890qwertyuiopasdfghjklzxcv NewNeedle = '1234567890qwertyuiopasdfghjklzxcv'
Example #2 for Needle: vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv NewNeedle = 'vv'
Example #3 for Needle: vvvvvvvvvvBOOMSHAKALAKAvvvvvvvvvv NewNeedle = 'vvBOOMSHAKALA'
Example #1:
PRIMALlength=00; FoundAtPosition=33;
Step 01_00: {}[12]34567890qwertyuiopasdfghjklzxc[v?] ! For position #01 the initial boundaries are PRIMALpositionCANDIDATE=LeftBoundary=01, RightBoundary=FoundAtPosition-1, the CANDIDATE PRIMAL string length is RightBoundary-LeftBoundary+(2)=(33-1)-01+(2)=33 !
Step 01_01: [{12}]34567890qwertyuiopasdfghjklzxc[v?] ! Searching for '12', FoundAtPosition = 33, PRIMALlengthCANDIDATE=RightBoundary-LeftBoundary+(2)=(33-1)-01+(2)=33 !
Step 01_02: [1{2]3}4567890qwertyuiopasdfghjklzxc[v?] ! Searching for '23', FoundAtPosition = 33, PRIMALlengthCANDIDATE=RightBoundary-LeftBoundary+(2)=(33-1)-01+(2)=33 !
...
Step 01_30: [12]34567890qwertyuiopasdfghjkl{zx}c[v?] ! Searching for 'zx', FoundAtPosition = 33, PRIMALlengthCANDIDATE=RightBoundary-LeftBoundary+(2)=(33-1)-01+(2)=33 !
Step 01_31: [12]34567890qwertyuiopasdfghjklz{xc}[v?] ! Searching for 'xc', FoundAtPosition = 33, PRIMALlengthCANDIDATE=RightBoundary-LeftBoundary+(2)=(33-1)-01+(2)=33 !
if (PRIMALlengthCANDIDATE >= PRIMALlength) {PRIMALposition=PRIMALpositionCANDIDATE; PRIMALlength = PRIMALlengthCANDIDATE;}
Step 02_00: {}1[23]4567890qwertyuiopasdfghjklzxc[v?] ! For position #02 the initial boundaries are PRIMALpositionCANDIDATE=LeftBoundary=02, RightBoundary=FoundAtPosition-1, the CANDIDATE PRIMAL string length is RightBoundary-LeftBoundary+(2)=(33-1)-02+(2)=32 !
Step 02_01: 1[{23}]4567890qwertyuiopasdfghjklzxc[v?] ! Searching for '23', FoundAtPosition = 33, PRIMALlengthCANDIDATE=RightBoundary-LeftBoundary+(2)=(33-1)-02+(2)=32 !
Step 02_02: 1[2{3]4}567890qwertyuiopasdfghjklzxc[v?] ! Searching for '34', FoundAtPosition = 33, PRIMALlengthCANDIDATE=RightBoundary-LeftBoundary+(2)=(33-1)-02+(2)=32 !
...
Step 02_29: 1[23]4567890qwertyuiopasdfghjkl{zx}c[v?] ! Searching for 'zx', FoundAtPosition = 33, PRIMALlengthCANDIDATE=RightBoundary-LeftBoundary+(2)=(33-1)-02+(2)=32 !
Step 02_30: 1[23]4567890qwertyuiopasdfghjklz{xc}[v?] ! Searching for 'xc', FoundAtPosition = 33, PRIMALlengthCANDIDATE=RightBoundary-LeftBoundary+(2)=(33-1)-02+(2)=32 !
if (PRIMALlengthCANDIDATE >= PRIMALlength) {PRIMALposition=PRIMALpositionCANDIDATE; PRIMALlength = PRIMALlengthCANDIDATE;}
...
Step 31_00: {}1234567890qwertyuiopasdfghjklz[xc][v?] ! For position #31 the initial boundaries are PRIMALpositionCANDIDATE=LeftBoundary=31, RightBoundary=FoundAtPosition-1, the CANDIDATE PRIMAL string length is RightBoundary-LeftBoundary+(2)=(33-1)-31+(2)=03 !
Step 31_01: 1234567890qwertyuiopasdfghjklz[{xc}][v?] ! Searching for 'xc', FoundAtPosition = 33, PRIMALlengthCANDIDATE=RightBoundary-LeftBoundary+(2)=(33-1)-31+(2)=03 !
if (PRIMALlengthCANDIDATE >= PRIMALlength) {PRIMALposition=PRIMALpositionCANDIDATE; PRIMALlength = PRIMALlengthCANDIDATE;}
Result:
PRIMALposition=01 PRIMALlength=33, NewNeedle = '1234567890qwertyuiopasdfghjklzxcv'
Example #2:
PRIMALlength=00; FoundAtPosition=33;
Step 01_00: {}[vv]vvvvvvvvvvvvvvvvvvvvvvvvvvvvvv[v?] ! For position #01 the initial boundaries are PRIMALpositionCANDIDATE=LeftBoundary=01, RightBoundary=FoundAtPosition-1, the CANDIDATE PRIMAL string length is RightBoundary-LeftBoundary+(2)=(33-1)-01+(2)=33 !
Step 01_01: [{v(v}]v)vvvvvvvvvvvvvvvvvvvvvvvvvvvvvv ! Searching for 'vv', FoundAtPosition = 02, PRIMALlengthCANDIDATE=RightBoundary-LeftBoundary+(2)=(02-1)-01+(2)=02 !
if (PRIMALlengthCANDIDATE >= PRIMALlength) {PRIMALposition=PRIMALpositionCANDIDATE; PRIMALlength = PRIMALlengthCANDIDATE;}
Step 02_00: {}v[vv]vvvvvvvvvvvvvvvvvvvvvvvvvvvvv[v?] ! For position #02 the initial boundaries are PRIMALpositionCANDIDATE=LeftBoundary=02, RightBoundary=FoundAtPosition-1, the CANDIDATE PRIMAL string length is RightBoundary-LeftBoundary+(2)=(33-1)-02+(2)=32 !
Step 02_01: v[{v(v}]v)vvvvvvvvvvvvvvvvvvvvvvvvvvvvv ! Searching for 'vv', FoundAtPosition = 03, PRIMALlengthCANDIDATE=RightBoundary-LeftBoundary+(2)=(03-1)-02+(2)=02 !
if (PRIMALlengthCANDIDATE >= PRIMALlength) {PRIMALposition=PRIMALpositionCANDIDATE; PRIMALlength = PRIMALlengthCANDIDATE;}
...
Step 31_00: {}vvvvvvvvvvvvvvvvvvvvvvvvvvvvvv[vv][v?] ! For position #31 the initial boundaries are PRIMALpositionCANDIDATE=LeftBoundary=31, RightBoundary=FoundAtPosition-1, the CANDIDATE PRIMAL string length is RightBoundary-LeftBoundary+(2)=(33-1)-31+(2)=03 !
Step 31_01: vvvvvvvvvvvvvvvvvvvvvvvvvvvvvv[{v(v}]v) ! Searching for 'vv', FoundAtPosition = 32, PRIMALlengthCANDIDATE=RightBoundary-LeftBoundary+(2)=(32-1)-31+(2)=02 !
if (PRIMALlengthCANDIDATE >= PRIMALlength) {PRIMALposition=PRIMALpositionCANDIDATE; PRIMALlength = PRIMALlengthCANDIDATE;}
Result:
PRIMALposition=31 PRIMALlength=02, NewNeedle = 'vv'
Example #3:
PRIMALlength=00; FoundAtPosition=33;
Step 01_00: {}[vv]vvvvvvvvBOOMSHAKALAKAvvvvvvvvv[v?] ! For position #01 the initial boundaries are PRIMALpositionCANDIDATE=LeftBoundary=01, RightBoundary=FoundAtPosition-1, the CANDIDATE PRIMAL string length is RightBoundary-LeftBoundary+(2)=(33-1)-01+(2)=33 !
Step 01_01: [{v(v}]v)vvvvvvvBOOMSHAKALAKAvvvvvvvvvv ! Searching for 'vv', FoundAtPosition = 02, PRIMALlengthCANDIDATE=RightBoundary-LeftBoundary+(2)=(02-1)-01+(2)=02 !
if (PRIMALlengthCANDIDATE >= PRIMALlength) {PRIMALposition=PRIMALpositionCANDIDATE; PRIMALlength = PRIMALlengthCANDIDATE;}
Step 02_00: {}v[vv]vvvvvvvBOOMSHAKALAKAvvvvvvvvv[v?] ! For position #02 the initial boundaries are PRIMALpositionCANDIDATE=LeftBoundary=02, RightBoundary=FoundAtPosition-1, the CANDIDATE PRIMAL string length is RightBoundary-LeftBoundary+(2)=(33-1)-02+(2)=32 !
Step 02_01: v[{v(v}]v)vvvvvvBOOMSHAKALAKAvvvvvvvvvv ! Searching for 'vv', FoundAtPosition = 03, PRIMALlengthCANDIDATE=RightBoundary-LeftBoundary+(2)=(03-1)-02+(2)=02 !
if (PRIMALlengthCANDIDATE >= PRIMALlength) {PRIMALposition=PRIMALpositionCANDIDATE; PRIMALlength = PRIMALlengthCANDIDATE;}
...
Step 09_00: {}vvvvvvvv[vv]BOOMSHAKALAKAvvvvvvvvv[v?] ! For position #09 the initial boundaries are PRIMALpositionCANDIDATE=LeftBoundary=09, RightBoundary=FoundAtPosition-1, the CANDIDATE PRIMAL string length is RightBoundary-LeftBoundary+(2)=(33-1)-09+(2)=25 !
Step 09_01: vvvvvvvv[{vv}]BOOMSHAKALAKA(vv)vvvvvvvv ! Searching for 'vv', FoundAtPosition = 24, PRIMALlengthCANDIDATE=RightBoundary-LeftBoundary+(2)=(24-1)-09+(2)=16 !
Step 09_02: vvvvvvvv[v{v]B}OOMSHAKALAKA[vv]vvvvvvvv ! Searching for 'vB', FoundAtPosition = 24, PRIMALlengthCANDIDATE=RightBoundary-LeftBoundary+(2)=(24-1)-09+(2)=16 !
Step 09_03: vvvvvvvv[vv]{BO}OMSHAKALAKA[vv]vvvvvvvv ! Searching for 'BO', FoundAtPosition = 24, PRIMALlengthCANDIDATE=RightBoundary-LeftBoundary+(2)=(24-1)-09+(2)=16 !
Step 09_04: vvvvvvvv[vv]B{OO}MSHAKALAKA[vv]vvvvvvvv ! Searching for 'OO', FoundAtPosition = 24, PRIMALlengthCANDIDATE=RightBoundary-LeftBoundary+(2)=(24-1)-09+(2)=16 !
Step 09_05: vvvvvvvv[vv]BO{OM}SHAKALAKA[vv]vvvvvvvv ! Searching for 'OM', FoundAtPosition = 24, PRIMALlengthCANDIDATE=RightBoundary-LeftBoundary+(2)=(24-1)-09+(2)=16 !
Step 09_06: vvvvvvvv[vv]BOO{MS}HAKALAKA[vv]vvvvvvvv ! Searching for 'MS', FoundAtPosition = 24, PRIMALlengthCANDIDATE=RightBoundary-LeftBoundary+(2)=(24-1)-09+(2)=16 !
Step 09_07: vvvvvvvv[vv]BOOM{SH}AKALAKA[vv]vvvvvvvv ! Searching for 'SH', FoundAtPosition = 24, PRIMALlengthCANDIDATE=RightBoundary-LeftBoundary+(2)=(24-1)-09+(2)=16 !
Step 09_08: vvvvvvvv[vv]BOOMS{HA}KALAKA[vv]vvvvvvvv ! Searching for 'HA', FoundAtPosition = 24, PRIMALlengthCANDIDATE=RightBoundary-LeftBoundary+(2)=(24-1)-09+(2)=16 !
Step 09_09: vvvvvvvv[vv]BOOMSH{AK}AL(AK)Avvvvvvvvvv ! Searching for 'AK', FoundAtPosition = 21, PRIMALlengthCANDIDATE=RightBoundary-LeftBoundary+(2)=(21-1)-09+(2)=13 !
Step 09_10: vvvvvvvv[vv]BOOMSHA{KA}L[AK]Avvvvvvvvvv ! Searching for 'KA', FoundAtPosition = 21, PRIMALlengthCANDIDATE=RightBoundary-LeftBoundary+(2)=(21-1)-09+(2)=13 !
Step 09_11: vvvvvvvv[vv]BOOMSHAK{AL}[AK]Avvvvvvvvvv ! Searching for 'AL', FoundAtPosition = 21, PRIMALlengthCANDIDATE=RightBoundary-LeftBoundary+(2)=(21-1)-09+(2)=13 !
Step 09_12: vvvvvvvv[vv]BOOMSHAKA{L[A}K]Avvvvvvvvvv ! Searching for 'LA', FoundAtPosition = 21, PRIMALlengthCANDIDATE=RightBoundary-LeftBoundary+(2)=(21-1)-09+(2)=13 !
if (PRIMALlengthCANDIDATE >= PRIMALlength) {PRIMALposition=PRIMALpositionCANDIDATE; PRIMALlength = PRIMALlengthCANDIDATE;}
...
Step 31_00: {}vvvvvvvv[vv]BOOMSHAKALAKAvvvvvvvvv[v?] ! For position #31 the initial boundaries are PRIMALpositionCANDIDATE=LeftBoundary=31, RightBoundary=FoundAtPosition-1, the CANDIDATE PRIMAL string length is RightBoundary-LeftBoundary+(2)=(33-1)-31+(2)=03 !
Step 31_01: vvvvvvvvvvBOOMSHAKALAKAvvvvvvv[{v(v}]v) ! Searching for 'vv', FoundAtPosition = 32, PRIMALlengthCANDIDATE=RightBoundary-LeftBoundary+(2)=(32-1)-31+(2)=02 !
if (PRIMALlengthCANDIDATE >= PRIMALlength) {PRIMALposition=PRIMALpositionCANDIDATE; PRIMALlength = PRIMALlengthCANDIDATE;}
Result:
PRIMALposition=09 PRIMALlength=13, NewNeedle = 'vvBOOMSHAKALA'
When unfolded, the heuristic looks a bit complicated, actually, when dressed in C it is the simplicity itself:
PRIMALlength=0;
for (i=0+(1); i < cbPattern-2+1+(1)-(1); i++) { FoundAtPosition = cbPattern;
PRIMALpositionCANDIDATE=i;
while ( PRIMALpositionCANDIDATE <= (FoundAtPosition-1) ) {
j = PRIMALpositionCANDIDATE + 1;
while ( j <= (FoundAtPosition-1) ) {
if ( *(unsigned short *)(pbPattern+PRIMALpositionCANDIDATE-(1)) == *(unsigned short *)(pbPattern+j-(1)) ) FoundAtPosition = j;
j++;
}
PRIMALpositionCANDIDATE++;
}
PRIMALlengthCANDIDATE = (FoundAtPosition-1)-i+(2);
if (PRIMALlengthCANDIDATE >= PRIMALlength) {PRIMALposition=i; PRIMALlength = PRIMALlengthCANDIDATE;}
}
Oh, and after searching for the longest number-name within the range 10^0..10^9999 the next one-and-only 43 letters long popped-up:
domilliaquadringenquattuorquinquagintillion
which stands for:
10^7365 one do-millia-quadringen-quattuor-quinquagin-tillion
And transforming some strings down to their PRIMAL strings:
Needle: 1234567890qwertyuiopasdfghjklzxcv PRIMALposition=01 PRIMALlength=33 '1234567890qwertyuiopasdfghjklzxcv'
Needle: vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv PRIMALposition=31 PRIMALlength=02 'vv'
Needle: vvvvvvvvvvBOOMSHAKALAKAvvvvvvvvvv PRIMALposition=09 PRIMALlength=13 'vvBOOMSHAKALA'
Needle: Trollland PRIMALposition=05 PRIMALlength=05 'lland'
Needle: Swampwalker PRIMALposition=03 PRIMALlength=09 'ampwalker'
Needle: licenselessness PRIMALposition=01 PRIMALlength=13 'licenselessne'
Needle: alfalfa PRIMALposition=04 PRIMALlength=04 'alfa'
Needle: Sandokan PRIMALposition=01 PRIMALlength=07 'Sandoka'
Needle: shazamish PRIMALposition=02 PRIMALlength=08 'hazamish'
Needle: Simplicius Simplicissimus PRIMALposition=08 PRIMALlength=15 'ius Simplicissi'
Needle: domilliaquadringenquattuorquinquagintillion PRIMALposition=01 PRIMALlength=19 'domilliaquadringenq'
Needle: DODO PRIMALposition=02 PRIMALlength=03 'ODO'
Needle: DODOD PRIMALposition=03 PRIMALlength=03 'DOD'
Needle: aaaDODO PRIMALposition=02 PRIMALlength=05 'aaDOD'
Needle: aaaDODOD PRIMALposition=02 PRIMALlength=05 'aaDOD'
Needle: DODOaaa PRIMALposition=02 PRIMALlength=05 'ODOaa'
Needle: DODODaaa PRIMALposition=03 PRIMALlength=05 'DODaa'
earthshine
n.
The sunlight reflected from the earth's surface that illuminates part of the moon not directly lighted by the sun. Also called earthlight.
/HERITAGE/
The beauty of 'Swampwalker' is in its in-place (no additional space) work, this will pay off especially with big needles, say, 8KB. I believe that 'Railgun_Ennearch' reinforced by 'Swampwalker' transformation will gun Worst-Case-Scenarios down, thus making linearity closer. This WCS hitter will be called Railgun_Swampshine, got it, the shooting in the trollish swamps will be so intense/massacreOUS that the generated plasma will illuminate part of the moon.
An add-on from 2014-Jan-12:
Time for decuman hitting.
By far Railgun_Decumanus is the fastest-and-strongest of all railguns, its Skip Performance is superior due to 1+(1+1+1) lookups of pseudo-order 4 as first and second fast checks - implemented as only two conditionals.
Decumanus checks 4 haystack's suffixes (within the 10 rightmost bytes of the window) for presence in the needle, these 4 checks give firm grip while maintaining excellent Speed Performance, in fact 'Decumanus' means something immense.
Railgun_Decumanus is strengthened so well (see the GENOMEsque torture graphs) that it devastates all known to me hitters.
The 3 out-of-chart values for the Wikipedia torture are:
Doing Search for Pattern(589bytes) into String(52428800bytes) as-one-line ...
Railgun_Decumanus precise performance: 20492MB/s
Boyer_Moore-Horspool precise performance: 3605MB/s
Doing Search for Pattern(1080bytes) into String(52428800bytes) as-one-line ...
Railgun_Decumanus precise performance: 36408MB/s
Boyer_Moore-Horspool precise performance: 4896MB/s
Doing Search for Pattern(1755bytes) into String(52428800bytes) as-one-line ...
Railgun_Decumanus precise performance: 86141MB/s
Boyer_Moore-Horspool precise performance: 6281MB/s
Take a minute and compare 86141 vs 6281, for needle 1755 bytes long the number of repetitive suffixes order 2 or/and 4 is significant, the suffix management has to be both speedy and efficient.
Boyer-Moore-Horspool with its order 1 lookups suffers badly because of the great number of repetitive bytes, and we speak for its NATIVE mode - the textual data.
Even 'Gulliver', my old Railgun being BMH order 2, is hurt with such long needles.
Better yet choked, Railgun_Sekireigan_Wolfram_Nozomi reaches the miserable 28238MB/s, 1271% boost over BMH - this is the power of 4 pseudo-order 4 lookups.
Genome benchmark:
Wikipedia benchmark:
I have left the section for needle lengths 0..3 vacant/unhandled on purpose, SSE2 rules on small needles, Mischa Sandberg knows best, simply the 128bit Assembly destroys everything in its path.
The benchmark with its source at: http://www.sekireigan.com/_KAZE_32bit_memmem-like_showdown_Decumanus.7z.
An add-on from 2013-Nov-12:
I have strengthened Shockeroo's Skip Performance (while maintaining same speed) by adding an additional lookup in the second level check while not adding additional branching!
The C source with its main loop in Assembly is given at Download Section. The printer friendly version Railgun_Wolfram_4-pages.pdf.
The body of 'Railgun_Sekireigan_WolfRAM' is this:
- Section #1: Simplified Karp-Rabin for shortest (2..3 bytes) needles;
- Section #2: Quadruplet for shorter (4 bytes and bigger) needles and for "short" haystacks;
- Section #3a: Boyer-Moore-Horspool order 2 (2)+(2+2) for short (4..22 bytes) needles and for "long" haystacks;
- Section #3b: Boyer-Moore-Horspool pseudo-order 4 (4)+(4) for long (23..711 bytes) needles and for "long" haystacks;
- Section #4: Simplified Horspool pseudo-order 12 for longest (712 bytes and bigger) needles and for "long" haystacks.
Few thoughts regarding an excellent Skip Performance etude:
Something "still" considered crazy: using BITwise order 3, not pseudo order 3, though! 2^24 = 16MB BYTEwise or 2^(24-3) = 2MB BITwise, when searching big haystacks e.g. Wikipedia 42GB with Kazahana then this 2MB lookup table seems not so atrocious.
ulHashTarget = *(uint32_t *)&pbTarget[i+cbPattern-4]; if ( bm_Horspool_Order3[ulHashTarget>>8] == 0 ) Gulliver = cbPattern-(3-1); else
if ( bm_Horspool_Order3[ulHashTarget&0xFFFFFF] == 0 ) Gulliver = cbPattern-(3-1)-1; else
{
}
When searching big haystacks, say the whole Wikipedia a 42GB file, I expect order 3 combined with multi-threading, say 16 threads as in my Kazahana tool, given that all threads use this 2MB lookup table SHARED, to overpower even 'WolfRAM', to be done.
An add-on from 2013-Nov-09:
It's shocktime.
The shock is bitter-sweet, I failed to better the Skip Performance in the external loop, while the internal became MUCH faster (even on my Core 2). Two major improvements in 'Railgun_Sekireigan_Shriekeroo' sections #3a/#3b led to appearance of 'Railgun_Sekireigan_Shockeroo' - the FASTEST MEMMEM, so far. The boost is twofold by reducing the iterations in both loops (external&internal). For external, better skip is sought from right-to-left, unsuccessful so far. For internal, the cycles/iterations are 4 times less. The main loop of Section #3a:
while (i <= cbTarget-cbPattern) {
Gulliver = 1; if ( bm_Horspool_Order2[*(unsigned short *)&pbTarget[i+cbPattern-1-1]] != 0 ) {
if ( bm_Horspool_Order2[*(unsigned short *)&pbTarget[i+cbPattern-1-1-2]] == 0 ) Gulliver = cbPattern-(2-1)-2; else {
if ( *(uint32_t *)&pbTarget[i] == ulHashPattern) {
count = cbPattern-4+1;
while ( count > 0 && *(uint32_t *)(pbPattern+count-1) ==
*(uint32_t *)(&pbTarget[i]+(count-1)) )
count = count-4; if ( count <= 0 ) return(pbTarget+i); else
if ( bm_Horspool_Order2[*(unsigned short *)&pbTarget[i+count-1]] ==
0 ) Gulliver = count; }
}
} else Gulliver = cbPattern-(2-1);
i = i + Gulliver;
}
Kind of disappointed after the boosting, on i7 I expect nothing less than 'BANG!'
The benchmark with its source at: http://www.sekireigan.com/_KAZE_32bit_memmem-like_showdown_Shockeroo.7z.
Since the test measures in milliseconds (enough for raw purposes) I added the total clocks for 'Railgun_Sekireigan_Shockeroo' which is executed 1024 times, the results on my laptop Core 2:
Doing Search for Pattern(7bytes) into String(52428800bytes) as-one-line ...
The first instance occurred at position 51265299 within next line:
Railgun_Sekireigan_Shockeroo_hits/Railgun_Sekireigan_Shockeroo_clocks: 0/19
Railgun_Sekireigan_Shockeroo performance: 2694KB/clock
Railgun_Sekireigan_Shockeroo TotalRoughSearchTime (lesser-the-better): 18642.000000clocks
...
Doing Search for Pattern(589bytes) into String(52428800bytes) as-one-line ...
The first instance occurred at position 52419431 within next line:
Railgun_Sekireigan_Shockeroo_hits/Railgun_Sekireigan_Shockeroo_clocks: 0/3
Railgun_Sekireigan_Shockeroo performance: 17066KB/clock
Railgun_Sekireigan_Shockeroo TotalRoughSearchTime (lesser-the-better): 2532.000000clocks
EDIT:
Or, 7 bytes long needle is found at 51,265,299KB/18642clocks = ((51265299/18642)*1000)/1024/1024= 2.622GB/s on Core 2 laptop!!!
...
Or, 589 bytes long needle is found at 52,419,431KB/2532clocks = ((52419431/2532)*1000)/1024/1024= 19+GB/s on Core 2 laptop!!!
And the same test on laptop Core i5 2430M @3000MHz 3MB L3 cache:
Doing Search for Pattern(7bytes) into String(52428800bytes) as-one-line ...
The first instance occurred at position 51265299 within next line:
Railgun_Sekireigan_Shockeroo_hits/Railgun_Sekireigan_Shockeroo_clocks: 0/13
Railgun_Sekireigan_Shockeroo performance: 3938KB/clock
Railgun_Sekireigan_Shockeroo TotalRoughSearchTime (lesser-the-better): 12345.000000clocks
...
Doing Search for Pattern(589bytes) into String(52428800bytes) as-one-line ...
The first instance occurred at position 52419431 within next line:
Railgun_Sekireigan_Shockeroo_hits/Railgun_Sekireigan_Shockeroo_clocks: 0/2
Railgun_Sekireigan_Shockeroo performance: 25600KB/clock
Railgun_Sekireigan_Shockeroo TotalRoughSearchTime (lesser-the-better): 1283.000000clocks
Or, 7 bytes long needle is found at 51,265,299KB/12345clocks = ((51265299/12345)*1000)/1024/1024= 3.960GB/s on Core i5 laptop!!!
...
Or, 589 bytes long needle is found at 52,419,431KB/1283clocks = ((52419431/1283)*1000)/1024/1024= 38+GB/s on Core i5 laptop!!!
For now, I cannot see how to boost 'Shockeroo', any ideas?
An add-on from 2013-Nov-07:
One step closer to Railgun_MIMINO...
An overlooked optimization in the main loop of BMH 2+2 and BMH 4+4 sections has been finally done, so here comes 'Railgun_Sekireigan_Shriekeroo' shriekingly faster than current fastest. The main loop of Section #3a:
while (i <= cbTarget-cbPattern) {
Gulliver = 1;
if ( bm_Horspool_Order2[*(unsigned short *)&pbTarget[i+cbPattern-1-1]] != 0 ) {
if ( bm_Horspool_Order2[*(unsigned short *)&pbTarget[i+cbPattern-1-1-2]] == 0 ) Gulliver = cbPattern-(2-1)-2; else {
if ( *(uint32_t *)&pbTarget[i] == ulHashPattern) {
count = countSTATIC; while ( count !=0 && *(char *)(pbPattern+
(countSTATIC-count)+4) == *(char *)(&pbTarget[i]+(countSTATIC-count)+4) )
count--;
if ( count == 0) return(pbTarget+i);
}
}
} else Gulliver = cbPattern-(2-1);
i = i + Gulliver;
}
The body of 'Railgun_Sekireigan_Shriekeroo' is this:
- Section #1: Simplified Karp-Rabin for shortest (2..3 bytes) needles;
- Section #2: Quadruplet for shorter (4 bytes and bigger) needles and for "short" haystacks;
- Section #3a: Boyer-Moore-Horspool order 2 (2+2) for short (4..22 bytes) needles and for "long" haystacks;
- Section #3b: Boyer-Moore-Horspool pseudo-order 4 (4+4) for long (23..711 bytes) needles and for "long" haystacks;
- Section #4: Simplified Horspool pseudo-order 12 for longest (712 bytes and bigger) needles and for "long" haystacks.
The benchmark with its source at: http://www.sekireigan.com/_KAZE_32bit_memmem-like_showdown_Shriekeroo.7z
Well, I have one more idea left (to compare backwards and to lookup the mismatched position, thus Gulliver won't be 1 but bigger) this revision will be named 'Railgun_Sekireigan_Shockeroo'.
An add-on from 2013-Nov-05:
FLY DUMBO FLY!
Long story short: 'Railgun_Sekireigan_TchittoGritto' dominates everywhere for XML/English texts, as for 'ACGT' i.e. GENOME files it is 200%-400% faster than classical BMH!
I have been stunned, the approach used in 'Piri' revision 2 named 'TchittoGritto' is so alacritous that it trumped even the gorgeous BNDM which excels at GENOME data!
For 'ACGT' data, after traversing 212MB:
Needle 20 chars:
BNDM_32 performance: 1235KB/clock
Railgun_Sekireigan_TchittoGritto performance: 2012KB/clock
Boyer_Moore-Horspool performance: 369KB/clock
Needle 30 chars:
BNDM_32 performance: 1659KB/clock
Railgun_Sekireigan_TchittoGritto performance: 2241KB/clock
Boyer_Moore-Horspool performance: 734KB/clock
What can I say, BNDM is maybe the most beautiful etude I have ever seen, but Railgun_Sekireigan_TchittoGritto outflies it, simply the hashing beats the DAWG - ONLY A BIG ENOUGH TABLE IS NEEDED TO DO SO! Moreover, for XML/English texts, BNDM can't stand a chance against 'TchittoGritto'. The speed stunning ends not here, with proper needle threshold tweaks the boost (compared to BMH) can reach the CCRRAAZZYY 700%, see further below.
BNDM stands for "Backward Nondeterministic DAWG Matching". DAWG "Directed Acyclic Word Graph": Definition (1): A directed acyclic graph representing the suffixes of a given string in which each edge is labeled with a character. The characters along a path from the root to a node are the substring which the node represents. Definition (2): A finite state machine that recognizes a set of words.
Out of curiosity, I wanted to juxtapose BMH order 1/2/4/12 implementations by throwing at them the 'Homo sapiens chromosome 1, alternate assembly HuRef, whole genome shotgun sequence' - a 212MB small alphabet (~6 distinct chars) file. It is well-known that BMH struggles with GENOMEsque data. Here the heavily reinforced 'Railgun_Sekireigan_TchittoGritto' proves the alacrity of higher orders BMH, being an order 4 but paired i.e. the rightmost 4+4 - a far cry from real order 8. The search is for 20/30/40/50/60/70 chars long needles located at the end of the file.
As you can see from the graph below, Railgun_Sekireigan_Bari (BMH order 12) comes screaming down gaining momentum for needles 60 and 70 chars long simply because:
#define NeedleThresholdBIGSekirei 12+40 // Should be bigger than 'HasherezadeOrder'. BMH2 works up to this value (inclusive).
The above threshold (52) decides to switch from order 2 to order 12 i.e. 'Hasherezade' for needles 53 and bigger. 'Railgun_Sekireigan_TchittoGritto' doesn't switch to order 12 until 712 (its threshold) which in turn is far more suitable for XML/English texts - it (order 4) outspeeds order 12 there.
For 'ACGT' data:
Needle 60 chars:
Railgun_Sekireigan_TchittoGritto performance: 2683KB/clock
Railgun_Sekireigan_Bari performance: 3813KB/clock
Railgun_Sekireigan_Bari_bit performance: 3813KB/clock
Boyer_Moore-Horspool performance: 442KB/clock
Needle 70 chars:
Railgun_Sekireigan_TchittoGritto performance: 2619KB/clock
Railgun_Sekireigan_Bari performance: 3952KB/clock
Railgun_Sekireigan_Bari_bit performance: 3952KB/clock
Boyer_Moore-Horspool performance: 488KB/clock
For XML data:
Needle 7 chars:
Railgun_Sekireigan_TchittoGritto performance: 2133KB/clock
Railgun_Sekireigan_Bari performance: 2133KB/clock
Needle 13 chars:
Railgun_Sekireigan_TchittoGritto performance: 3200KB/clock
Railgun_Sekireigan_Bari performance: 3200KB/clock
...
Needle 201 chars:
Railgun_Sekireigan_TchittoGritto performance: 7314KB/clock
Railgun_Sekireigan_Bari performance: 4654KB/clock
Needle 230 chars:
Railgun_Sekireigan_TchittoGritto performance: 8533KB/clock
Railgun_Sekireigan_Bari performance: 5120KB/clock
Needle 255 chars:
Railgun_Sekireigan_TchittoGritto performance: 8533KB/clock
Railgun_Sekireigan_Bari performance: 5688KB/clock
Needle 589 chars:
Railgun_Sekireigan_TchittoGritto performance: 17066KB/clock
Railgun_Sekireigan_Bari performance: 10240KB/clock
The benchmark with its source at: http://www.sekireigan.com/_KAZE_32bit_memmem-like_showdown_TchittoGritto.7z
- Что такое 'Чито Грито'?
- Птичка, птичка, невеличка, в общем ничего.
/Рубик и Мимино/
- What is this 'Tchitto Gritto'?
- Birdie, birdie, an unimportant one, in fact nothing special.
/Rubik and Mimino/
«Мимино» по-грузински მიმინო — «ястреб», in English Sparrowhawk, in Latin (Accipiter nisus); «сокол» — это «шевардени».
One day or rather night I will write the final mix: Railgun_MIMINO, but as in the song "my tomorrow never comes" it will be my swan song.
An add-on from 2013-Nov-03:
Two things.
First, thanks to Mr.Eiht, who helped me again, next results answer several interesting questions:
- How a modern CPU (i7 3930K @4.5GHz) executes 'Railgun';
- How BYTE vs BIT lookups behave;
- What the two 32bit (Microsoft VS2010 and Intel v12.1) C optimizers (Windows) can offer;
- How Core 2 vs i7 3rd generation stand side-by-side;
Needle 7 chars:
Railgun_Sekireigan_Bari performance: 5120KB/clock
Railgun_Sekireigan_Bari_bit performance: 3657KB/clock
Boyer_Moore-Horspool performance: 1765KB/clock
Needle 13 chars:
Railgun_Sekireigan_Bari performance: 6400KB/clock
Railgun_Sekireigan_Bari_bit performance: 5688KB/clock
Boyer_Moore-Horspool performance: 3011KB/clock
Needle 21 chars:
Railgun_Sekireigan_Bari performance: 7314KB/clock
Railgun_Sekireigan_Bari_bit performance: 7314KB/clock
Boyer_Moore-Horspool performance: 3657KB/clock
Needle 28 chars:
Railgun_Sekireigan_Bari performance: 8533KB/clock
Railgun_Sekireigan_Bari_bit performance: 7314KB/clock
Boyer_Moore-Horspool performance: 4266KB/clock
Needle 37 chars:
Railgun_Sekireigan_Bari performance: 8533KB/clock
Railgun_Sekireigan_Bari_bit performance: 7314KB/clock
Boyer_Moore-Horspool performance: 4654KB/clock
Needle 51 chars:
Railgun_Sekireigan_Bari performance: 10240KB/clock
Railgun_Sekireigan_Bari_bit performance: 8533KB/clock
Boyer_Moore-Horspool performance: 5120KB/clock
Quick answers:
* Railgun_Sekireigan_Bari dominates everywhere.
* Still the instructions needed to convert BYTE to BIT lookup take significant time.
* Intel compiler generates much faster code for BITwise version, yum-yum.
* Core 2 holds the line, it is still very good, given that my 'Bonboniera' is 2200MHz and Mr.Eiht's 'Redemption' is 4500MHz.
Second, just wanted to try an old idea once more (loading 4bytes at once at the right as in the very first Railgun where such fetch was done at the left), it resulted in appearance of 'Railgun_Sekireigan_Piri' - a much faster for 10+ long needles revision.
The body of 'Railgun_Sekireigan_Piri' is this:
- Section #1: Simplified Karp-Rabin for shortest (2..3 bytes) needles;
- Section #2: Quadruplet for shorter (4 bytes and bigger) needles and for "short" haystacks;
- Section #3: Boyer-Moore-Horspool pseudo-order 4 for short (4..198 bytes) needles and for "long" haystacks;
- Section #4: Simplified Horspool pseudo-order 12 for longest (199 bytes and bigger) needles and for "long" haystacks.
This first revision wags on one leg - reinforcing the stance (i.e. SKIP performance) by checking the previous 4 bytes (at left) should make Railgun to SCREAM deafeningly!
Possible improvements for Section #3:
- stronger "hash", the addition alone is the fastest;
- for needles shorter than 12 returning to order 2;
- increasing the 'NeedleThresholdBIGSekireiPiri' from 12+180 to 600;
- uniting the two lookup arrays and making order 4 section BITwise - thus one only 8KB array will be the overhead.
The benchmark with its source at:
http://www.sekireigan.com/_KAZE_32bit_memmem-like_showdown_Piri.7z
Section #3 is given below - the other three sections remain intact:
...
if ( cbPattern<=NeedleThresholdBIGSekireiPiri ) {
countSTATIC = cbPattern-2-2;
ulHashPattern = *(uint32_t *)(pbPattern); i=0;
for (a=0; a < 256*256; a++) {bm_Horspool_Order2[a]=0;}
for (j=0; j < cbPattern-4+1; j++) bm_Horspool_Order2[( (*(uint32_t *)
(pbPattern+j+0)>>16)+(*(uint32_t *)
(pbPattern+j+0)&0xFFFFF) ) & ( (1<<16)-1 )]=1;
while (i <= cbTarget-cbPattern) {
Gulliver = 1;
if ( bm_Horspool_Order2[( (*(uint32_t *)&pbTarget[i+cbPattern-1-1-2]>>16)+
(*(uint32_t *)&pbTarget[i+cbPattern-1-1-2]&0xFFFFF) ) & ( (1<<16)-1 )] != 0 ) {
if ( *(uint32_t *)&pbTarget[i] == ulHashPattern) {
count = countSTATIC; while ( count !=0 && *(char *)(pbPattern+(countSTATIC-
count)+4) == *(char *)(&pbTarget[i]+(countSTATIC-count)+4) )
count--;
if ( count == 0) return(pbTarget+i);
}
} else Gulliver = cbPattern-(2-1)-2; i = i + Gulliver;
}
return(NULL);
} else { ...
As the night is leaving - silently retreating down an empty hall
Suddenly a stirring - finally recurring where I let it fall
Following the wanderings of a dream - a dream that keeps my soul alive
Believing in an open sky - believing in a love.
Dancing with a stranger - careless of the danger there within his smile
While the dew was forming - breathing in the morning like a sleeping child.
If the memory of the light should fade - horizons reaching cold and blue
Until your heart is free to fly - then I will keep the sun for you
Until you touch the open sky - then I will keep the sun for you
Until we never say good-bye - then I will keep the sun for you.
/Giorgio Moroder - (Theme From) Midnight Express (vocal Chris Bennett) Lyrics/
For Piri.
An add-on from 2013-Oct-16:
It's time for the undisputed (for non-pathological cases, Mischa knows better) FASTEST MEMMEM named 'Railgun_Sekireigan_Bari'. Quite natural next step to 'Railgun_Sekireigan': now checking 2 more bytes at the left of our rightmost 2 bytes. This second check allows skips/jumps to be less often 1 byte, resembling Sekirei standing on two legs thus allowing much faster sway compared to (revision 1) where Sekirei relies/sways on one leg only (see the 'Comments & Discussions' section and video where speed of wagging is apparently lower on one leg). First and Second revision differ in one line only:
if ( bm_Horspool_Order2[*(unsigned short *)&pbTarget[i+cbPattern-1-1-2]] == 0 ) Gulliver = cbPattern-(2-1)-2;
A quick example:
Haystack: "...my friend Bari..."
Pattern: "my friend Piri"
Jump for r.1: "my friend Piri" +1
Jump for r.2: "my friend Piri" +11
Window: "Bari"
First Order 2 check: "ri" it is within the lookup table i.e. its slot holds 1, skip is 1 position
Second Order 2 check: "Ba" it is outwith the lookup table
i.e. its slot holds 0, skip is cbPattern-(2-1)-2=14-(2-1)-2=11 positions
Lookup table, next slots hold 1, rest 0:
'my'
'y '
' f'
'fr'
'ri'
'ie'
'en'
'nd'
'd '
' P'
'Pi'
'ir'
'ri'
Pattern Building-Block "Ba" holds 0.
The benchmark with its source at:
http://www.sekireigan.com/_KAZE_32bit_memmem-like_showdown.7z
A bitter cup for me and my mom these days, our fidel dog died in his (not its) prime, only 5 years old, how to say goodbye?
This function is dedicated to our animal friend Bari, one undistorted image of LIFE.
Dark the stars and dark the moon,
Hush the night and the morning loon,
Tell the horses and beat on your drum,
Gone their master, gone their son,
Dark the oceans, dark the sky,
Hush the whales and the ocean tide,
Tell the salt marsh and beat on your drum,
Gone their master, gone their son,
Dark to light and light to dark,
Three black carriages, three white carts,
What brings us together is what pulls us apart,
Gone our brother, gone our heart.
...
/'Gone' lyrics written and performed by Ioanna Gika/
An add-on from 2013-Oct-11:
It's time Sultaness of MEMMEM functions 'Hasherezade' to step back and 'Sekirei' to step in.
Right to the point, 'Sekireigan' stands for "Eye of the Sekirei". It is a technique in swordsmanship allowing to move extremely fast and attack from many angles simultaneously. Yukimura can utilize the Sekireigan to access extreme speed. As for 'Sekirei' it stands for 'Forest Wagtail', forest wagtail differs from its Motacilla relatives in its strange habit of swaying its tail from side to side, not wagging it up and down like other wagtails. Do you see the connection between lateral swaying (while removing the vertical swaying altogether) and the PLAINNESS of Boyer-Moore-Horspool order 2 maximum simplified main loop (no distraction to look up other skip values like in 'Hasherezade') which SWAYS left-to-right-to-left-... at SICK pace: Looking AT RIGHT of Haystack's window last 2 bytes for presence into BYTEwise 64KB table if findable then starting the comparison AT LEFT, and so forth. The key for this IMMENSE boost is the MAXIMUM possible simplification of Horspool order 2 section, the main loop became only 99 bytes!
Why didn't I make it 2 years ago, don't ask me, if you do the answer is - I didn't want to ... on purpose, the SKIP performance was good and I thought "it is better off slower but stronger", now this overkill is ended. In fact this simplification is the only thing that is different in 'Railgun_Sekireigan' compared to its overkill/counterpart 'Railgun_Hasherezade' revision 3.
Well, the only weak point lies in the increment/skip by 1 in BMH order 2 loop for patterns 5-~28 bytes long, to boost them the 64KB table should house (as it was) the skips for Building-Blocks instead of 0/1 as I simplified, this range is the only one where Mr.Volnitsky has some 300-400MB/s advantage. It is this way on purpose, maybe in future revisions I will make the BMH lookup table BITwise i.e. 8 times smaller which is 8KB - fully fittable in Vishera's 16KB L1 cache. My wish is to make a rich showdown on different CPUs (Haswell, Vishera mostly), sadly no opportunity in sight, if someone wants to help me and run the benchmark I will gladly share the results here.
How to compute TOTAL number of BBs, 'cbPattern - Order + 1' is the number of BBs for text 'cbPattern' bytes long. For example, for cbPattern=11 'fastest fox' and Order=8 we have BBs = 11-8+1=4:
'fastest '
'astest f'
'stest fo'
'test fox'
And for those who want to deepen their understanding and to have one supertool extracting the Building-Blocks from ANY file I included in next package my BBs ripper LeprechaunBBhex, written in C and compilable both for Windows/Linux it works on incredible speeds. If you run the batch 'RIP_BBs.bat' it will create and report this:
For our test file:
enwiki-20130904-pages-articles.7z.001: 52,428,800 bytes; 52,428,800 - Order + 1 Total 'Order' bytes long Building-Blocks
ENWIKI_50MB_order2.txt : 93,294 bytes; 15,549 Distinct Building-Blocks 2 bytes long
ENWIKI_50MB_order4.txt : 10,593,630 bytes; 1,059,363 Distinct Building-Blocks 4 bytes long
ENWIKI_50MB_order8.txt : 225,515,718 bytes; 12,528,651 Distinct Building-Blocks 8 bytes long
ENWIKI_50MB_order12.txt : 726,063,728 bytes; 27,925,528 Distinct Building-Blocks 12 bytes long
From above we can obtain the REDUNDANCY of WIKI 50MB dump:
- for order 2: 15,549/(52,428,800 - 2 + 1)*100% = 0.02%
- for order 4: 1,059,363/(52,428,800 - 4 + 1)*100% = 2.02%
- for order 8: 12,528,651/(52,428,800 - 8 + 1)*100% = 23.89%
- for order 12: 27,925,528/(52,428,800 - 12 + 1)*100% = 53.26%
The less the percentage the less load on hash function there will be, which is good due to less number of collisions. Given that order 2 has 15,549 distinct BBs and lookup table has 65,536 slots means that a good hash function could offer zero collisions. As for order 12, the default hash size is: #define HashTableSizeSekirei 17-1
which gives 27,925,528 keys vs 65,536 slots, or 426:1, not a pleasant ratio. In one test when the HashTableSizeSekirei
was 20 not 17-1 (which is 26:1) the speed drop (for 17-1) was of no real concern.
And to see what is the difference between XML (WIKIPEDIA) and TXT (English ebook Text) formats here come the stats for OSHO.TXT:
OSHO.TXT : 206,908,949 bytes; 206,908,949 - Order + 1 Total 'Order' bytes long Building-Blocks
OSHO_order2.txt : 26,544 bytes; 4,424 Distinct Building-Blocks 2 bytes long
OSHO_order4.txt : 2,480,190 bytes; 248,019 Distinct Building-Blocks 4 bytes long
OSHO_order8.txt : 161,216,928 bytes; 8,956,496 Distinct Building-Blocks 8 bytes long
OSHO_order12.txt: 1,138,861,490 bytes; 43,802,365 Distinct Building-Blocks 12 bytes long
From above we can obtain the REDUNDANCY of OSHO books:
- for order 2: 4,424/(206,908,949 - 2 + 1)*100% = 0.002%
- for order 4: 248,019/(206,908,949 - 4 + 1)*100% = 0.11%
- for order 8: 8,956,496/(206,908,949 - 8 + 1)*100% = 4.32%
- for order 12: 43,802,365/(206,908,949 - 12 + 1)*100% = 21.16%
Numbers say that REDUNDANCIES for our XML and TXT files are different by great margin. Roughly speaking, OSHO's orders 2 and 4 being four times less than WIKI's suggest hashers, hashing 2/4 bytes keys, are gonna be more effective for OSHO.
http://www.sanmayce.com/Downloads/_KAZE_32bit_memmem-like_showdown.7z
Scaling up is where MEMMEM screams, instead of needle/haystack, as here, 13/52,428,800 think or rather feel the upscaled case (the same proportion) when 52,428,800/211,444,543,803,076. Ouch, 200 terabytes.
I salute everyone who feels the beat of unstoppableness with 'KAT DELUNA - Unstoppable':
...
Save your pity for tomorrow
When I smash it like a ball in the club
On the news, I'll be rockin' like a rockstar
I'm unstoppable
Unstoppable
Unstoppable
Unstoppable
So supersonic almost ironic, I play it like a game of chess
You wanna ride like a queen, well here you got the best
I got the moves to the grooves, you like to imitate it
That's for free, yeah that's me! Uh-huh
Wanna whip me something crazy moving faster than the speed of light
Show you the way an entertainer always gets them right
Gonna need a back-up plan 'cause I'm givin you a shot
Watch me flippin now
Yeah that's me! Uh-huh
In my red shoes and my swags so sick I need med school
I never make a bad move but I can make her bed move
You like my attitude, ain't nobody else at my magnitude
See ain't nothing stopping me, I feel like ain't nothing after me
...
Now shake your body,
Like you got the jungle fever rushing through your veins
We gettin' crazy and the club is banging through your brain
When I say jump, you jump, jump is knockin' knockin'
I wanna see you jump like this beat is rockin'
Love bite on my bracelet, kinda fancy, but it's how I play
You wanna come along and see and holler at me
Maybe I lost it, maybe not, cause I got game
Like a burning flame, yeah that's me, uh huh
Fa-fa-fa-fa-fa-follow the leader
Fa-fa-fa-fa-fa-follow the leader
Fa-fa-fa-fa-fa-follow the leader
"... my swags so sick I need med school ...", ha-ha-ha-ha-ha.
An add-on from 2013-Oct-09:
Making some tweaks, listening to Chester and removing the 255 chars limitation for Needle lengths resulted in appearance of 'Railgun_Hasherezade' revision 3 - a full-fledged MEMMEM function, at last. This third revision now searches for Patterns/Needles longer than 255 chars and uses Horspool order 12 (the last 12 bytes of Haystack window are checked). Those 12 bytes are hashed, each slot is 0 or 1, a flag for presence. The hash table is adjustable through DEFINE, also whether it is Bitwise or Bytewise.
The skeleton of 'Railgun_Hasherezade' is this:
- Simplified Karp-Rabin for shortest needles;
- Quadruplet for shorter needles;
- Boyer-Moore-Horspool/Boyer-Moore-Sunday order 1 and Boyer-Moore-Horspool order 2 for short needles;
- Simplified Horspool order 12 for longest needles.
This variety of different search etudes makes the function highly customizable, especially when you need even higher order for Horspool, say 128, then the fastest ('FNV1A-penumbra' here 'FNV1A_YoshimitsuTRIAD' used) known to me hash function comes gun blazing, currently order 12 uses its main loop once.
With big patterns the bigger the order the bigger the skips/jumps are. For example if the pattern is 13 bytes and order is 12 the jump is 13-(12-1) or only 2 positions, but with 8192 bytes long pattern becomes 8192-(12-1). In fact order 12 (3x4bytes are hashed) is only the start when big blocks are to be compared, in next package I search 7/13/21/28/37/51/65/83/98/113/130/140/151/162/201/230/255/589/1080/1755 bytes long patterns into 'enwiki-20130904-pages-articles.7z.001' 52,428,800 bytes long.
http://www.sanmayce.com/Downloads/_KAZE_32bit_memmem-like_showdown.7z
Each pattern is located at the end of this 50MB file, that is one hit exists for each search.
The quest started as 'strstr' replacer of all old approaches for short haystacks/needles (line-by-line searching mostly) and reached a point that the function has become a full-fledged 'memmem' etude.
While the world is sleeping
I can hear the nightly quiet
I can feel the beating of the night
I can see some thieves
They are stealing my guitar
They can't play
The player is my heart
Don't look in the mirror
Hold the line
Have to win
Don't fool away your time
Just remember all the early days
Now come back
But please don´t lose your ways
While the world is dreaming
I can hear the sound of town
But this nighttime falling to the ground
In the midnight hours
I have killed all my hopes
Just my crime's been written on the walls
CHESTER - HOLD THE LINE (1987) is so incredible - an immortal song/etude.
In my eyes these etudes (no matter musical or programming) have so much in common, 'Finer Feelings' as Kylie sings.
An add-on from 2013-Oct-07:
Okay, since the article was intended as a semi-answer semi-question, the latter has found its answer thus completing the search for FASTEST strstr-like here 'memmem' C function, I think. Two years ago, having finished the last revision of my strstr quest I found by chance the site of Leonid Volnitsky, its description reads: "Fastest substring search algorithm for average case. Source, benchmark."
Immediately I wrote an email asking him to test my functions along with his, but for some reason no answer came back, I said to myself 'whatever' and dismissed the whole idea of clashing on. I waited to see whether he will find time to include my functions in his long list of rivals, but no and no, finally I wanted to juxtapose, for myself, his beautiful idea to use hash (I had had the exact idea around 2010 but didn't like the chaining hashing back then and still don't) and 'Railgun_Gulliver' and 'Railgun_Hasherezade'. My testbed is made of two 32bit executables - one compiled with Microsoft VS2010 (/Ox used), the other compiled with Intel 12.1 (/O3 used). These two executables search 12 patterns with lengths 021/028/037/051/065/083/098/113/130/140/151/162 chars into WIKIPEDIA dump, first 50MB. All the patterns occur in the end of the file within a same line, thus the traversal is 50MB for all searches. Warning! The used in the package 'volnitsky2' & 'volnitsky4' are not complete, they are for tests only. And I hate when results are given but they are not reproducible, therefore (as always) I included all in this benchmark package:
http://www.sanmayce.com/Downloads/_KAZE_32bit_strstr-like_showdown.7z
There you can find the assembler equivalent along with the C source as the compilers generated it:
strstr_SHORT-SHOWDOWN_LV_WIKI_32bit_MicrosoftV16.cod
strstr_SHORT-SHOWDOWN_LV_WIKI_32bit_IntelV12.cod
Obviously 'volnitsky2' dominates with up to (3413-2048)/2048*100% = 67% boost, very very good. With OSHO.TXT and shorter needles goes up to 280%, simply super! Below the dump for Wikipedia 50MB is given:
32bit_MicrosoftV16 executables:
Doing Search for Pattern(7bytes) into String(52428800bytes) as-one-line ...
Railgun_Quadruplet_7Gulliver performance: 1651KB/clock
Railgun_Hasherezade_O4_BIT_14bit performance: 1505KB/clock
Railgun_Hasherezade_O8_BYTE_14bit performance: 1462KB/clock
volnitsky2 performance: 2327KB/clock
volnitsky4 performance: 1651KB/clock
Doing Search for Pattern(13bytes) into String(52428800bytes) as-one-line ...
Railgun_Quadruplet_7Gulliver performance: 2327KB/clock
Railgun_Hasherezade_O4_BIT_14bit performance: 1969KB/clock
Railgun_Hasherezade_O8_BYTE_14bit performance: 1896KB/clock
volnitsky2 performance: 3200KB/clock
volnitsky4 performance: 3011KB/clock
Doing Search for Pattern(21bytes) into String(52428800bytes) as-one-line ...
Railgun_Quadruplet_7Gulliver performance: 3011KB/clock
Railgun_Hasherezade_O4_BIT_14bit performance: 2560KB/clock
Railgun_Hasherezade_O8_BYTE_14bit performance: 2560KB/clock
volnitsky2 performance: 3657KB/clock
volnitsky4 performance: 3657KB/clock
Doing Search for Pattern(28bytes) into String(52428800bytes) as-one-line ...
Railgun_Quadruplet_7Gulliver performance: 3200KB/clock
Railgun_Hasherezade_O4_BIT_14bit performance: 2844KB/clock
Railgun_Hasherezade_O8_BYTE_14bit performance: 2844KB/clock
volnitsky2 performance: 3657KB/clock
volnitsky4 performance: 3657KB/clock
Doing Search for Pattern(37bytes) into String(52428800bytes) as-one-line ...
Railgun_Quadruplet_7Gulliver performance: 3200KB/clock
Railgun_Hasherezade_O4_BIT_14bit performance: 3011KB/clock
Railgun_Hasherezade_O8_BYTE_14bit performance: 3011KB/clock
volnitsky2 performance: 3657KB/clock
volnitsky4 performance: 3657KB/clock
Doing Search for Pattern(51bytes) into String(52428800bytes) as-one-line ...
Railgun_Quadruplet_7Gulliver performance: 3413KB/clock
Railgun_Hasherezade_O4_BIT_14bit performance: 3200KB/clock
Railgun_Hasherezade_O8_BYTE_14bit performance: 3200KB/clock
volnitsky2 performance: 3657KB/clock
volnitsky4 performance: 3413KB/clock
Doing Search for Pattern(65bytes) into String(52428800bytes) as-one-line ...
Railgun_Quadruplet_7Gulliver performance: 3413KB/clock
Railgun_Hasherezade_O4_BIT_14bit performance: 3413KB/clock
Railgun_Hasherezade_O8_BYTE_14bit performance: 3200KB/clock
volnitsky2 performance: 3657KB/clock
volnitsky4 performance: 3657KB/clock
Doing Search for Pattern(83bytes) into String(52428800bytes) as-one-line ...
Railgun_Quadruplet_7Gulliver performance: 3413KB/clock
Railgun_Hasherezade_O4_BIT_14bit performance: 3200KB/clock
Railgun_Hasherezade_O8_BYTE_14bit performance: 3413KB/clock
volnitsky2 performance: 3657KB/clock
volnitsky4 performance: 3011KB/clock
Doing Search for Pattern(98bytes) into String(52428800bytes) as-one-line ...
Railgun_Quadruplet_7Gulliver performance: 3200KB/clock
Railgun_Hasherezade_O4_BIT_14bit performance: 3011KB/clock
Railgun_Hasherezade_O8_BYTE_14bit performance: 3200KB/clock
volnitsky2 performance: 3938KB/clock
volnitsky4 performance: 3657KB/clock
Doing Search for Pattern(113bytes) into String(52428800bytes) as-one-line ...
Railgun_Quadruplet_7Gulliver performance: 2438KB/clock
Railgun_Hasherezade_O4_BIT_14bit performance: 2560KB/clock
Railgun_Hasherezade_O8_BYTE_14bit performance: 2694KB/clock
volnitsky2 performance: 2694KB/clock
volnitsky4 performance: 3413KB/clock
Doing Search for Pattern(130bytes) into String(52428800bytes) as-one-line ...
Railgun_Quadruplet_7Gulliver performance: 1969KB/clock
Railgun_Hasherezade_O4_BIT_14bit performance: 2226KB/clock
Railgun_Hasherezade_O8_BYTE_14bit performance: 2226KB/clock
volnitsky2 performance: 3200KB/clock
volnitsky4 performance: 3413KB/clock
Doing Search for Pattern(140bytes) into String(52428800bytes) as-one-line ...
Railgun_Quadruplet_7Gulliver performance: 2048KB/clock
Railgun_Hasherezade_O4_BIT_14bit performance: 2226KB/clock
Railgun_Hasherezade_O8_BYTE_14bit performance: 2327KB/clock
volnitsky2 performance: 3413KB/clock
volnitsky4 performance: 3200KB/clock
Doing Search for Pattern(151bytes) into String(52428800bytes) as-one-line ...
Railgun_Quadruplet_7Gulliver performance: 2048KB/clock
Railgun_Hasherezade_O4_BIT_14bit performance: 2438KB/clock
Railgun_Hasherezade_O8_BYTE_14bit performance: 2327KB/clock
volnitsky2 performance: 3200KB/clock
volnitsky4 performance: 3200KB/clock
Doing Search for Pattern(162bytes) into String(52428800bytes) as-one-line ...
Railgun_Quadruplet_7Gulliver performance: 2133KB/clock
Railgun_Hasherezade_O4_BIT_14bit performance: 2560KB/clock
Railgun_Hasherezade_O8_BYTE_14bit performance: 2560KB/clock
volnitsky2 performance: 3413KB/clock
volnitsky4 performance: 3413KB/clock
The bottomline: Volnitsky wrote that what all guys have overlooked, bravo. Strange, not liking hashing with probing stopped even Sanmayce to explore further. I wish someone else could have done it before him, for one, me. Seeing how he doesn't appreciate the enigmatic/superfast C etude by Stephen R. van den Berg tells me much and enough not to write him again until he gets some appreciativeness - the most important thing known to me.
An add-on from 2012-Feb-01:
Just wanted to improve Skip-Performance by checking the window's 8 rightmost bytes (Order 8 Horspool) against all needle's 8 bytes suffixes, which resulted in Railgun_Quadruplet_7Hasherezade.
Here an example is given:
For needle 'and every day a continuous cleaning goes on' 43bytes long we have 43-8+1 Building-Blocks i.e. suffixes 8bytes long as follows:
and ever 117,013
nd every 108,604
d every 155,516
every d 170,959
every da 115,291
very day 073,191
ery day 097,042
ry day a 083,793
y day a 011,244
day a c 115,855
day a co 101,797
ay a con 222,568
y a cont 029,130
a conti 020,978
a contin 258,405
continu 252,691
continuo 123,607
ontinuou 056,546
ntinuous 135,857
tinuous 015,332
inuous c 250,584
nuous cl 048,224
uous cle 106,616
ous clea 137,020
us clean 035,751
s cleani 178,989
cleanin 213,855
cleaning 063,337
leaning 097,138
eaning g 062,366
aning go 247,590
ning goe 036,571
ing goes 041,142
ng goes 228,365
g goes o 229,696
goes on 176,852
Each number indicates which HashSlot is occupied (here 0..262143 slots i.e. 18bit used).
Okay here is a simplified example:
string: tumba-lumba-tumba-lumba-tumba-lumba-tumba-lumba-tumba-lumba-tumba-lumba-tumba-lumba
needle: and every day a continuous cleaning goes on
and every day a continuous cleaning goes on +14 Horspool Order 1
and every day a continuous cleaning goes on +36 Horspool Order 8
The goal: to jump when the rightmost 8bytes '-tumba-l' of window do not look like any of Needle's prefixes i.e. are not to be found. This maximum jump equals cbPattern-(Order-1) or 43-(8-1)=36.
Here the Horspool skip-table follows:
a n d e v e r y d a y a c o n t i n u o u s c l e a n i n g g o e s o n
12 09 32 02 04 37 04 35 30 02 32 12 30 02 12 02 15 01 09 23 10 09 18 01 18 03 02 15 14 04 12 09 10 09 06 02 06 01 04 03 02 01 09
We have 'l' vs 'n' where 'l' points to 14, however 'Hasherezade' is able to jump/skip 43-(8-1)=36positions in case of HashCode of '-tumba-l' doesn't equal any of HashCodes of Needle's suffixes. HashCode of '-tumba-l' is 083,467 it matches none of Needle's HashCodes so we skip 36 bytes. This boost is the only difference between 'Hasherezade' and 'Gulliver' revision 2.
// Scheherezade -> Hasherezade
// [Italian: buffa, feminine of buffo, comic.]
/*
First off: I am heavily disappointed from Speed-Performance of 'Hasherezade'
and comparing Skip-Performance of 'Gulliver' and 'Hasherezade' doesn't make me happy, either.
Pattern: fastest fox
Doing Search for Pattern(11bytes) into String(206908949bytes) as-one-line ...
Railgun_Quadruplet_7sun performance: 1,683KB/clock / 00,775%, 26,672,940 skips/iterations
Railgun_Quadruplet_7 performance: 1,642KB/clock / 00,669%, 30,925,578 skips/iterations
Railgun_Quadruplet_7sunhorse performance: 0,944KB/clock / 00,912%, 22,663,583 skips/iterations
Railgun_Quadruplet_7deuce performance: 0,801KB/clock / 00,797%, 25,945,709 skips/iterations
Railgun_Quadruplet_7Elsiane performance: 0,704KB/clock / 00,931%, 22,213,101 skips/iterations
Railgun_Quadruplet_7Gulliver performance: 2,104KB/clock / 00,977%, 21,166,516 skips/iterations
Railgun_Quadruplet_7Hasherezade performance: 1,496KB/clock / 00,980%, 21,112,657 skips/iterations 18bit HashTable
Railgun_Quadruplet_7Hasherezade performance: 1,642KB/clock / 00,980%, 21,112,646 skips/iterations 14bit HashTable
Railgun_Quadruplet_7Hasherezade performance: 1,642KB/clock / 00,980%, 21,112,735 skips/iterations 10bit HashTable
Boyer_Moore_Flensburg performance: 1,057KB/clock / 00,669%, 30,925,578 skips/iterations
Pattern: and every day a continuous cleaning goes on
Doing Search for Pattern(43bytes) into String(206908949bytes) as-one-line ...
Railgun_Quadruplet_7sun performance: 2,557KB/clock / 01,495%, 13,832,201 skips/iterations
Railgun_Quadruplet_7 performance: 2,590KB/clock / 01,404%, 14,731,326 skips/iterations
Railgun_Quadruplet_7sunhorse performance: 1,888KB/clock / 02,222%, 09,308,136 skips/iterations
Railgun_Quadruplet_7deuce performance: 1,741KB/clock / 01,971%, 10,494,460 skips/iterations
Railgun_Quadruplet_7Elsiane performance: 1,642KB/clock / 02,229%, 09,279,871 skips/iterations
Railgun_Quadruplet_7Gulliver performance: 3,157KB/clock / 03,795%, 05,450,890 skips/iterations
Railgun_Quadruplet_7Hasherezade performance: 2,322KB/clock / 04,100%, 05,046,049 skips/iterations 18bit HashTable
Railgun_Quadruplet_7Hasherezade performance: 2,971KB/clock / 04,100%, 05,046,445 skips/iterations 14bit HashTable
Railgun_Quadruplet_7Hasherezade performance: 3,015KB/clock / 04,090%, 05,058,667 skips/iterations 10bit HashTable
Boyer_Moore_Flensburg performance: 1,683KB/clock / 01,447%, 14,294,963 skips/iterations
Pattern: And let this be your very fundamental insight... about everything. Just for one year, don't choose.
Doing Search for Pattern(99bytes) into String(206908949bytes) as-one-line ...
Railgun_Quadruplet_7sun performance: 2,845KB/clock / 02,248%, 09,200,917 skips/iterations
Railgun_Quadruplet_7 performance: 2,928KB/clock / 02,105%, 09,827,389 skips/iterations
Railgun_Quadruplet_7sunhorse performance: 2,270KB/clock / 03,283%, 06,302,096 skips/iterations
Railgun_Quadruplet_7deuce performance: 2,149KB/clock / 03,196%, 06,472,407 skips/iterations
Railgun_Quadruplet_7Elsiane performance: 2,172KB/clock / 03,282%, 06,303,154 skips/iterations
Railgun_Quadruplet_7Gulliver performance: 2,928KB/clock / 07,820%, 02,645,814 skips/iterations
Railgun_Quadruplet_7Hasherezade performance: 2,494KB/clock / 09,575%, 02,160,890 skips/iterations 18bit HashTable
Railgun_Quadruplet_7Hasherezade performance: 3,015KB/clock / 09,568%, 02,162,493 skips/iterations 14bit HashTable
Railgun_Quadruplet_7Hasherezade performance: 3,015KB/clock / 09,438%, 02,192,204 skips/iterations 10bit HashTable
Boyer_Moore_Flensburg performance: 2,349KB/clock / 02,135%, 09,689,008 skips/iterations
Pattern: Then, singing among the savage branches, it impales itself upon the longest,
sharpest spine. And, dying, it rises above its own agony to outcarol the lark and the nightingale.
Doing Search for Pattern(175bytes) into String(206908949bytes) as-one-line ...
Railgun_Quadruplet_7sun performance: 2,658KB/clock / 03,142%, 06,583,682 skips/iterations
Railgun_Quadruplet_7 performance: 2,658KB/clock / 03,095%, 06,685,179 skips/iterations
Railgun_Quadruplet_7sunhorse performance: 2,377KB/clock / 04,776%, 04,331,865 skips/iterations
Railgun_Quadruplet_7deuce performance: 2,349KB/clock / 04,728%, 04,376,097 skips/iterations
Railgun_Quadruplet_7Elsiane performance: 2,525KB/clock / 04,778%, 04,330,200 skips/iterations
Railgun_Quadruplet_7Gulliver performance: 2,245KB/clock / 12,024%, 01,720,711 skips/iterations
Railgun_Quadruplet_7Hasherezade performance: 2,434KB/clock / 17,127%, 01,208,072 skips/iterations 18bit HashTable
Railgun_Quadruplet_7Hasherezade performance: 2,590KB/clock / 17,084%, 01,211,064 skips/iterations 14bit HashTable
Railgun_Quadruplet_7Hasherezade performance: 2,590KB/clock / 16,379%, 01,263,238 skips/iterations 10bit HashTable
Boyer_Moore_Flensburg performance: 2,525KB/clock / 03,198%, 06,469,280
skips/iterations Five times more main-cycles and faster than 'Hasherezade', pshaw!
*/
#define ROL(x, n) (((x) << (n)) | ((x) >> (32-(n))))
#define HashTableSize 18
// Revision: 1, 2012-Feb-01, the main disadvantage: the preprocessing overhead PLUS a hasher.
// Caution: For better speed the case 'if (cbPattern==1)' was removed, so Pattern must be longer than 1 char.
char * Railgun_Quadruplet_7Hasherezade_count_hits (char * pbTarget,
char * pbPattern, unsigned long cbTarget, unsigned long cbPattern)
{
char * pbTargetMax = pbTarget + cbTarget;
register unsigned long ulHashPattern;
register unsigned long ulHashTarget;
signed long count;
signed long countSTATIC;
unsigned char SINGLET;
unsigned long Quadruplet2nd;
unsigned long Quadruplet3rd;
unsigned long Quadruplet4th;
unsigned long AdvanceHopperGrass;
long i; //BMH needed
int a, j;
unsigned int bm_bc[256]; //BMH needed
unsigned int bm_bc2nd[256]; //BMS needed
unsigned char bm_Horspool_Order2[256*256]; //BMHSS(Elsiane) needed,
// 'char' limits patterns to 255, if 'long' then table becomes 256KB, grrr.
unsigned char bm_Hasherezade_HASH[1<<(HashTableSize)];
// Jesteressing 8bytes (Horspool Order 8) for fast lookup, should
// be bitwise (i.e. 8times smaller) since it says yes/no for presence.
uint32_t hash32;
unsigned long Gulliver; // or unsigned char or unsigned short
if (cbPattern > cbTarget)
return(NULL);
if ( cbPattern<4) {
pbTarget = pbTarget+cbPattern;
ulHashPattern = ( (*(char *)(pbPattern))<<8 ) + *(pbPattern+(cbPattern-1));
if ( cbPattern==3) {
for ( ;; ) {
if ( ulHashPattern == ( (*(char *)(pbTarget-3))<<8 ) + *(pbTarget-1) ) {
if ( *(char *)(pbPattern+1) ==
*(char *)(pbTarget-2) ) Railgunhits++; //return((pbTarget-3));
}
if ( (char)(ulHashPattern>>8) != *(pbTarget-2) ) pbTarget++;
pbTarget++;
if (pbTarget > pbTargetMax)
return(NULL);
}
} else {
}
for ( ;; ) {
if ( ulHashPattern == ( (*(char *)(pbTarget-2))<<8 ) + *(pbTarget-1) )
Railgunhits++; //return((pbTarget-2));
if ( (char)(ulHashPattern>>8) != *(pbTarget-1) ) pbTarget++;
pbTarget++;
if (pbTarget > pbTargetMax)
return(NULL);
}
} else { //if ( cbPattern<4)
if (cbTarget<961) { // This value is arbitrary(don't know how exactly),
// it ensures(at least must) better performance than 'Boyer_Moore_Horspool'.
/*
// A better strstr, with no asm code
// Written by Mischa Sandberg
// http://mischasan.wordpress.com
// static char const *
// scanstrm(char const *tgt, char const *pat, int len)
// {
// uint32_t head = MSBF32(pat), wind = 0, next;
//
// pat += 4, len -= 4;
// while ((next = *(uint8_t const*)tgt++)) {
// wind = ( wind << 8 ) + next;
// if (wind == head && !memcmp(tgt, pat, len))
// return tgt - 4;
// }
// return NULL;
//}
ulHashPattern = 0;
ulHashPattern = ( ulHashPattern << 8 ) + *(uint8_t const*)pbPattern++;
ulHashPattern = ( ulHashPattern << 8 ) + *(uint8_t const*)pbPattern++;
ulHashPattern = ( ulHashPattern << 8 ) + *(uint8_t const*)pbPattern++;
ulHashPattern = ( ulHashPattern << 8 ) + *(uint8_t const*)pbPattern++;
AdvanceHopperGrass = 0;
cbPattern -= 4;
while ((ulHashTarget = *(uint8_t const*)pbTarget++)) {
AdvanceHopperGrass = ( AdvanceHopperGrass << 8 ) + ulHashTarget;
if (AdvanceHopperGrass == ulHashPattern && !memcmp(pbTarget, pbPattern, cbPattern))
Railgunhits++; //return pbTarget - 4;
}
return NULL;
*/
pbTarget = pbTarget+cbPattern;
ulHashPattern = *(unsigned long *)(pbPattern);
SINGLET = ulHashPattern & 0xFF;
Quadruplet2nd = SINGLET<<8;
Quadruplet3rd = SINGLET<<16;
Quadruplet4th = SINGLET<<24;
for ( ;; ) {
AdvanceHopperGrass = 0;
ulHashTarget = *(unsigned long *)(pbTarget-cbPattern);
if ( ulHashPattern == ulHashTarget ) {
// Three unnecessary comparisons here, but 'AdvanceHopperGrass'
// must be calculated - it has a higher priority.
count = cbPattern-1;
while ( count && *(char *)(pbPattern+(cbPattern-count)) == *(char *)(pbTarget-count) ) {
if ( cbPattern-1==AdvanceHopperGrass+count &&
SINGLET != *(char *)(pbTarget-count) ) AdvanceHopperGrass++;
count--;
}
if ( count == 0) Railgunhits++; //return((pbTarget-cbPattern));
} else { // The goal here: to avoid memory accesses by stressing the registers.
if ( Quadruplet2nd != (ulHashTarget & 0x0000FF00) ) {
AdvanceHopperGrass++;
if ( Quadruplet3rd != (ulHashTarget & 0x00FF0000) ) {
AdvanceHopperGrass++;
if ( Quadruplet4th != (ulHashTarget & 0xFF000000) ) AdvanceHopperGrass++;
}
}
}
AdvanceHopperGrass++;
pbTarget = pbTarget + AdvanceHopperGrass;
if (pbTarget > pbTargetMax)
return(NULL);
}
} else { //if (cbTarget<961)
countSTATIC = cbPattern-2-2;
for (a=0; a < 256; a++) {bm_bc[a]=cbPattern; bm_bc2nd[a]=cbPattern+1;}
for (j=0; j < cbPattern-1; j++) bm_bc[pbPattern[j]]=cbPattern-j-1;
for (j=0; j < cbPattern; j++) bm_bc2nd[pbPattern[j]]=cbPattern-j;
ulHashPattern = *(unsigned long *)(pbPattern); // First four bytes
//ulHashTarget = *(unsigned short *)(pbPattern+cbPattern-1-1); // Last two bytes
AdvanceHopperGrass = 0;
i=0;
// Elsiane r.2 [
for (a=0; a < 256*256; a++) {bm_Horspool_Order2[a]= cbPattern-1;}
// cbPattern-(Order-1) for Horspool; 'memset' if not optimized
// alfalfa 7 long 6 BBs (al lf fa al lf fa) 3 distinct BBs (al lf fa)
// fast 4 0-1-2 fa as st
for (j=0; j < cbPattern-1; j++) bm_Horspool_Order2[*(unsigned short *)
(pbPattern+j)]=j; // Rightmost appearance/position is needed
// Elsiane r.2 ]
if ( cbPattern>10) {
// Hasherezade r.1 [
// OSHO.TXT has 00,046,486 03bytes distinct BBs
// OSHO.TXT has 00,248,019 04bytes distinct BBs
// OSHO.TXT has 00,855,682 05bytes distinct BBs
// OSHO.TXT has 02,236,138 06bytes distinct BBs
// OSHO.TXT has 04,803,152 07bytes distinct BBs
// OSHO.TXT has 08,956,496 08bytes distinct BBs
// to be hashed in 18bit i.e. 256KB i.e. 262,144 slots i.e. 34 vs 1.
// OSHO.TXT has 15,006,172 09bytes distinct BBs
// OSHO.TXT has 22,992,127 10bytes distinct BBs
// Note: BB stands for Building-Block (also suffix)
for (a=0; a < 1<<(HashTableSize); a++) {bm_Hasherezade_HASH[a]= 0;}
// to-do: bit to replace byte; 'memset' if not optimized
// cbPattern - Order + 1 i.e. number of BBs for 11 'fastest fox'
// 11-8+1=4: 'fastest ', 'astest f', 'stest fo', 'test fox'
for (j=0; j < cbPattern-8+1; j++) {
hash32 = (2166136261 ^ (ROL(*(uint32_t *)(pbPattern+j),5)^*(uint32_t *)(pbPattern+j+4))) * 709607;
bm_Hasherezade_HASH[( hash32 ^ (hash32 >> 16) ) & ( (1<<(HashTableSize))-1 )]=1;
/*
for (a=0; a<8; a++)
printf("%c",*(char *)(pbPattern+j+a) );
printf(" %lu\n",( hash32 ^ (hash32 >> 16) ) & ( (1<<(HashTableSize))-1 ));
//Input Pattern(up to 19+2000 chars): and every day a continuous cleaning goes on
//Doing Search for Pattern(43bytes) into String(206908949bytes) as-one-line ...
BBs Slot(HashCode for 18bit HashTable)
and ever 117013
nd every 108604
d every 155516
every d 170959
every da 115291
very day 73191
ery day 97042
ry day a 83793
y day a 11244
day a c 115855
day a co 101797
ay a con 222568
y a cont 29130
a conti 20978
a contin 258405
continu 252691
continuo 123607
ontinuou 56546
ntinuous 135857
tinuous 15332
inuous c 250584
nuous cl 48224
uous cle 106616
ous clea 137020
us clean 35751
s cleani 178989
cleanin 213855
cleaning 63337
leaning 97138
eaning g 62366
aning go 247590
ning goe 36571
ing goes 41142
ng goes 228365
g goes o 229696
goes on 176852
*/
}
// Hasherezade r.1 ]
while (i <= cbTarget-cbPattern-1) { // -1 because Sunday is used
Gulliver = bm_Horspool_Order2[*(unsigned short *)&pbTarget[i+cbPattern-1-1]];
if ( Gulliver == cbPattern-2 ) { // CASE #1: means the pair (char order 2) is found
if ( *(unsigned long *)&pbTarget[i] == ulHashPattern) {
count = countSTATIC; // Last two chars already matched, to be fixed with -2
while ( count !=0 && *(char *)(pbPattern+
(countSTATIC-count)+4) == *(char *)(&pbTarget[i]+(countSTATIC-count)+4) )
count--;
if ( count == 0) Railgunhits++; //return(pbTarget+i);
}
//i = i + 1; // r.1, obviuosly this is the worst skip so turning to 'SunHorse': lines below
if ( bm_bc[(unsigned char)pbTarget[i+cbPattern-1]] < bm_bc2nd[(unsigned char)pbTarget[i+(cbPattern)]] )
Gulliver = bm_bc2nd[(unsigned char)pbTarget[i+(cbPattern)]];
else
Gulliver = bm_bc[(unsigned char)pbTarget[i+cbPattern-1]];
} else if ( Gulliver != cbPattern-1 ) // CASE #2: if equal means the pair (char order 2)
// is not found i.e. Gulliver remains intact, skip the whole pattern
// and fall back (Order-1) chars i.e. one char for Order 2
Gulliver = cbPattern - Gulliver - 2;
// CASE #3: the pair is found and not as suffix i.e. rightmost position
// The goal: to jump when the rightmost 8bytes (Order 8 Horspool) of window do
// not look like any of Needle prefixes i.e. are not to be found. This maximum
// jump equals cbPattern-(Order-1) or 11-(8-1)=4 for 'fastest fox' - a small
// one but for Needle 31 bytes the jump equals 31-(8-1)=24
if (Gulliver < cbPattern-(8-1)) {
hash32 = (2166136261 ^ (ROL(*(uint32_t *)
(pbTarget+i+cbPattern-8),5)^*(uint32_t *)(pbTarget+i+cbPattern-8+4))) * 709607;
if ( bm_Hasherezade_HASH[( hash32 ^ (hash32 >> 16) ) & ( (1<<(HashTableSize))-1 )]==0 )
Gulliver = cbPattern-(8-1);
}
i = i + Gulliver;
AdvanceHopperGrass++;
/*
; 4155 : // The goal: to jump when the rightmost 8bytes (Order 8 Horspool) of window
// do not look like any of Needle prefixes i.e. are not to be found. This maximum jump equals
// cbPattern-(Order-1) or 11-(8-1)=4 for 'fastest fox' - a small one
// but for Needle 31 bytes the jump equals 31-(8-1)=24
; 4156 : if (Gulliver < cbPattern-(8-1)) {
01f16 8d 43 f9 lea eax, DWORD PTR [ebx-7]
01f19 3b c8 cmp ecx, eax
01f1b 73 30 jae SHORT $LN18@Railgun_Qu@8
; 4157 : hash32 = (2166136261 ^ (ROL(*(uint32_t *)
(pbTarget+i+cbPattern-8),5)^*(uint32_t *)(pbTarget+i+cbPattern-8+4))) * 709607;
01f1d 8b 44 32 f8 mov eax, DWORD PTR [edx+esi-8]
01f21 c1 c0 05 rol eax, 5
01f24 33 44 32 fc xor eax, DWORD PTR [edx+esi-4]
01f28 35 c5 9d 1c 81 xor eax, -2128831035 ; 811c9dc5H
01f2d 69 c0 e7 d3 0a
00 imul eax, 709607 ; 000ad3e7H
; 4158 : if ( bm_Hasherezade_HASH[( hash32 ^
(hash32 >> 16) ) & ( (1<<(HashTableSize))-1 )]==0 )
01f33 8b f8 mov edi, eax
01f35 c1 ef 10 shr edi, 16 ; 00000010H
01f38 33 f8 xor edi, eax
01f3a 81 e7 ff ff 03
00 and edi, 262143 ; 0003ffffH
01f40 80 bc 3c 28 08
01 00 00 cmp BYTE PTR _bm_Hasherezade_HASH$[esp+edi+329776], 0
01f48 75 03 jne SHORT $LN18@Railgun_Qu@8
; 4159 : Gulliver = cbPattern-(8-1);
01f4a 8d 4b f9 lea ecx, DWORD PTR [ebx-7]
$LN18@Railgun_Qu@8:
; 4160 : }
; 4161 : i = i + Gulliver;
; 4162 : AdvanceHopperGrass++;
*/
}
} else {
while (i <= cbTarget-cbPattern-1) { // -1 because Sunday is used
Gulliver = bm_Horspool_Order2[*(unsigned short *)&pbTarget[i+cbPattern-1-1]];
if ( Gulliver == cbPattern-2 ) { // CASE #1: means the pair (char order 2) is found
if ( *(unsigned long *)&pbTarget[i] == ulHashPattern) {
count = countSTATIC; // Last two chars already matched, to be fixed with -2
while ( count !=0 && *(char *)(pbPattern+
(countSTATIC-count)+4) == *(char *)(&pbTarget[i]+(countSTATIC-count)+4) )
count--;
if ( count == 0) Railgunhits++; //return(pbTarget+i);
}
//i = i + 1; // r.1, obviuosly this is the worst skip so turning to 'SunHorse': lines below
if ( bm_bc[(unsigned char)pbTarget[i+cbPattern-1]] < bm_bc2nd[(unsigned char)pbTarget[i+(cbPattern)]] )
Gulliver = bm_bc2nd[(unsigned char)pbTarget[i+(cbPattern)]];
else
Gulliver = bm_bc[(unsigned char)pbTarget[i+cbPattern-1]];
} else if ( Gulliver != cbPattern-1 )
// CASE #2: if equal means the pair (char order 2) is not found
// i.e. Gulliver remains intact, skip the whole pattern
// and fall back (Order-1) chars i.e. one char for Order 2
Gulliver = cbPattern - Gulliver - 2;
// CASE #3: the pair is found and not as suffix i.e. rightmost position
i = i + Gulliver;
// 32323218 Order 1 Horspool Skip-table A
// 01234568 Order 1 Horspool Skip-table B
// fa af fa af fa as st Order 2 Horspool Skip-table B
// 0 1 2 3 4 5 6
// HIKARIfast
// fafafast
// fafafast +2 Order 1 'a' vs 't'
// fafafast +2 = (cbPattern-SkipB-Order = 8-5-1 = 2) Order 1 'a' vs 't'
// fafafast +2 = (cbPattern-SkipB-Order = 8-4-2 = 2) Order 2 'fa' vs 'st' i.e. CASE #3
// 76543218 Order 1 Horspool
// lo on ng gp pa ac ce Order 2 Horspool
// 0 1 2 3 4 5 6
// HIKARIfast
// longpace
// longpace +2 Order 1 'a' vs 'e'
// longpace +7 = (cbPattern-(Order-1) = 8-(2-1) = 7) Order 2 'fa' vs 'ce' i.e. CASE #2
AdvanceHopperGrass++;
}
} //if ( cbPattern>10) {
if (i == cbTarget-cbPattern) {
if ( *(unsigned long *)&pbTarget[i] == ulHashPattern) {
count = countSTATIC;
while ( count !=0 && *(char *)(pbPattern+
(countSTATIC-count)+4) == *(char *)(&pbTarget[i]+(countSTATIC-count)+4) )
count--;
if ( count == 0) Railgunhits++; //return(pbTarget+i);
}
AdvanceHopperGrass++;
}
GlobalSP += (int)((double)cbTarget/AdvanceHopperGrass*100);
GlobalI += AdvanceHopperGrass;
printf("Skip-Performance(bigger-the-better): %d%%, %d skips/iterations\n",
(int)((double)cbTarget/AdvanceHopperGrass*100), AdvanceHopperGrass);
return(NULL);
} //if (cbTarget<961)
} //if ( cbPattern<4)
}
// ### Mix(2in1) of Karp-Rabin & Boyer-Moore-Sunday-Horspool algorithm ]
The Speed-Performance disappoints, also I don't see how to speed up this hash-boosted variant. Any ideas how to hasten?
An add-on from 2012-Jan-26:
Enter Railgun_Quadruplet_7Gulliver ... the most promising variant.
I am happy to share my latest etude - a result of a week-long sub-quest of mine saturated with music and program-mess-ing. The first search-function that breaks (effortlessly) the 3GB/s barrier on a computer with 3GB/s memory write.
// ### Mix(2in1) of Karp-Rabin & Boyer-Moore-Sunday-Horspool algorithm [
/*
Raising Horspool/Sunday order from 1 to 2 came quite naturally after some carefree wander of mine.
I consider adding a new 'else if' block for haystacks longer than 960 in order to avoid the
big skip table overhead, something like 512*1024, thus fastest r7sun will operate
on 961..512*1024 long haystacks whereas 'Gulliver' on bigger than 512*1024.
I am not happy with Skip-Performance of revision 1: this should be amended...
'Gulliver' makes giant skips/jumps on patterns some 8+ chars long but for short
ones falls behind. Here r.1 is given being a Horspool order 2, of course Sunday order 2 is coming.
I restrain myself of dreaming about power behind order 3,
for bigger orders I expect nothing less than Dream-Performance.
Of course the enhancements await advent, they are very easy
to be seen, but they will appear in next revisions of 'Gulliver'.
Tuning continues but the skeleton is built, I see 'Gulliver' as a really High-Performance etude.
And not to be empty-handed here the Gulliver's swiftness is benchmarked on String(206,908,949bytes) as-one-line:
Pattern: fast
Railgun_Quadruplet_7sun: 1,074KB/clock / 0,456%, 45,330,622 iterations
Railgun_Quadruplet_7: 1,000KB/clock / 0,377%, 54,788,054 iterations
Railgun_Quadruplet_7sunhorse: 0,651KB/clock / 0,480%, 43,103,056 iterations
Railgun_Quadruplet_7deuce: 0,575KB/clock / 0,389%, 53,138,919 iterations
Railgun_Quadruplet_7Elsiane: 0,511KB/clock / 0,551%, 37,541,955 iterations
Railgun_Quadruplet_7Gulliver: 0,649KB/clock / 0,298%, 69,403,843 iterations
Boyer_Moore_Flensburg: 0,491KB/clock / 0,377%, 54,788,139 iterations
Pattern: faster
Railgun_Quadruplet_7sun: 1,347KB/clock / 0,591%, 34,996,936 iterations
Railgun_Quadruplet_7: 1,329KB/clock / 0,514%, 40,194,194 iterations
Railgun_Quadruplet_7sunhorse: 0,762KB/clock / 0,656%, 31,504,148 iterations
Railgun_Quadruplet_7deuce: 0,662KB/clock / 0,567%, 36,434,006 iterations
Railgun_Quadruplet_7Elsiane: 0,538KB/clock / 0,710%, 29,101,626 iterations
Railgun_Quadruplet_7Gulliver: 1,010KB/clock / 0,491%, 42,066,438 iterations
Boyer_Moore_Flensburg: 0,687KB/clock / 0,514%, 40,194,282 iterations
Pattern: fastest
Railgun_Quadruplet_7sun: 1,554KB/clock / 0,687%, 30,084,306 iterations
Railgun_Quadruplet_7: 1,507KB/clock / 0,599%, 34,540,430 iterations
Railgun_Quadruplet_7sunhorse: 0,918KB/clock / 0,761%, 27,188,853 iterations
Railgun_Quadruplet_7deuce: 0,786KB/clock / 0,663%, 31,175,827 iterations
Railgun_Quadruplet_7Elsiane: 0,635KB/clock / 0,818%, 25,281,493 iterations
Railgun_Quadruplet_7Gulliver: 1,209KB/clock / 0,592%, 34,946,434 iterations
Boyer_Moore_Flensburg: 0,798KB/clock / 0,621%, 33,278,240 iterations
Pattern: fastest fox
Railgun_Quadruplet_7sun: 1,727KB/clock / 0,775%, 26,672,940 iterations
Railgun_Quadruplet_7: 1,669KB/clock / 0,669%, 30,925,578 iterations
Railgun_Quadruplet_7sunhorse: 0,957KB/clock / 0,912%, 22,663,583 iterations
Railgun_Quadruplet_7deuce: 0,821KB/clock / 0,797%, 25,945,709 iterations
Railgun_Quadruplet_7Elsiane: 0,719KB/clock / 0,931%, 22,213,101 iterations Gulliver outskips Elsiane!
Railgun_Quadruplet_7Gulliver: 1,804KB/clock / 0,977%, 21,167,180 iterations monstrous!
Boyer_Moore_Flensburg: 1,080KB/clock / 0,669%, 30,925,649 iterations
Pattern: fastest fox with biggest strides
Railgun_Quadruplet_7sun: 2,658KB/clock / 1,584%, 13,060,463 iterations
Railgun_Quadruplet_7: 2,767KB/clock / 1,511%, 13,689,243 iterations
Railgun_Quadruplet_7sunhorse: 1,788KB/clock / 2,138%, 09,677,267 iterations
Railgun_Quadruplet_7deuce: 1,656KB/clock / 2,053%, 10,075,650 iterations
Railgun_Quadruplet_7Elsiane: 1,542KB/clock / 2,143%, 09,652,548 iterations
Railgun_Quadruplet_7Gulliver: 3,108KB/clock / 2,917%, 07,091,357 iterations over the top!
Boyer_Moore_Flensburg: 1,820KB/clock / 1,554%, 13,307,181 iterations
Pattern: fastest fox with biggest strides known to me
Railgun_Quadruplet_7sun: 2,590KB/clock / 1,548%, 13,363,356 iterations
Railgun_Quadruplet_7: 2,658KB/clock / 1,447%, 14,292,419 iterations
Railgun_Quadruplet_7sunhorse: 1,870KB/clock / 2,234%, 09,259,505 iterations
Railgun_Quadruplet_7deuce: 1,757KB/clock / 2,011%, 10,287,584 iterations
Railgun_Quadruplet_7Elsiane: 1,656KB/clock / 2,240%, 09,236,188 iterations
Railgun_Quadruplet_7Gulliver: 3,108KB/clock / 3,821%, 05,413,684 iterations unstoppable!
Boyer_Moore_Flensburg: 1,712KB/clock / 1,540%, 13,431,751 iterations
Pattern: fastest fox with biggest strides known to me up to 2012 January 26 namely 'Gulliver'
Railgun_Quadruplet_7sun: 3,108KB/clock / 2,890%, 07,159,321 iterations
Railgun_Quadruplet_7: 3,157KB/clock / 2,742%, 07,545,141 iterations
Railgun_Quadruplet_7sunhorse: 2,557KB/clock / 4,138%, 04,999,777 iterations
Railgun_Quadruplet_7deuce: 2,494KB/clock / 4,029%, 05,135,444 iterations
Railgun_Quadruplet_7Elsiane: 2,494KB/clock / 4,141%, 04,995,463 iterations
Railgun_Quadruplet_7Gulliver: 3,108KB/clock / 7,218%, 02,866,342 iterations simply amazing!
Boyer_Moore_Flensburg: 2,767KB/clock / 2,745%, 07,536,097 iterations
Note:
[
In the above benchmark (on Pentium Merom 2166Mhz) 'Gulliver' is limited by memory bandwidth only, due to:
Pattern: fastest fox with biggest strides
Railgun_Quadruplet_7sun: 2,658KB/clock / 1,584%, 13,060,463 iterations
Railgun_Quadruplet_7Gulliver: 3,108KB/clock / 2,917%, 07,091,357 iterations
Pattern: fastest fox with biggest strides known to me
Railgun_Quadruplet_7sun: 2,590KB/clock / 1,548%, 13,363,356 iterations
Railgun_Quadruplet_7Gulliver: 3,108KB/clock / 3,821%, 05,413,684 iterations
Pattern: fastest fox with biggest strides known to me up to 2012 January 26 namely 'Gulliver'
Railgun_Quadruplet_7sun: 3,108KB/clock / 2,890%, 07,159,321 iterations
Railgun_Quadruplet_7Gulliver: 3,108KB/clock / 7,218%, 02,866,342 iterations
You can see how Railgun_Quadruplet_7sun struggles to catch up, for longest pattern
it achieves the maximum 3,108KB/clock but at a nasty cost: 7,159,321 iterations
whereas 'Gulliver' achieves same (in fact much higher) speed with heart-touching 2,866,342 iterations only.
For my Pentium T3400 2166 MHz Toshiba Satellite L305 Dual DDR2-667 5-5-5-13, 'Everest' benchmark program gives:
Memory Read: 4,605 MB/s
Memory Write: 3,323 MB/s
'Gulliver' would run at 5GB/s without any problem if it were not for
the hardware limitation, a fantastic performance in my eyes.
]
*/
//
// Benchmark (on Pentium Merom 2166MHz, Windows XP 32bit, CL v16 32bit):
// Two tests were done - the first/second '6x2'/'52' searches 12/52 patterns
// (for all appearances i.e. counting all hits) into 206,908,949 bytes long English text:
//
// Skip-Performance (measured in %, bigger-the-better; Iterations, lesser-the-better) Summary:
//
// '6x2' test (12 patterns 4-19 chars in length)
//
// Railgun_Quadruplet_7sun 6x2 i.e. average performance: 2,092KB/clock / 0,153,936% / 005,572,202,064
// Railgun_Quadruplet_7 6x2 i.e. average performance: 2,032KB/clock / 0,137,568% / 006,454,416,320
// Railgun_Quadruplet_7sunhorse 6x2 i.e. average performance: 1,262KB/clock / 0,173,840% / 005,145,288,080
// Railgun_Quadruplet_7deuce 6x2 i.e. average performance: 1,134KB/clock / 0,155,744% / 005,995,597,776
// Railgun_Quadruplet_7Elsiane 6x2 i.e. average performance: 0,938KB/clock / 0,184,096% / 004,710,045,936
// Boyer_Moore_Flensburg 6x2 i.e. average performance: 1,163KB/clock / 0,139,168% / 006,428,106,496
// Railgun_Quadruplet_7Gulliver 6x2 i.e. average performance: 1,552KB/clock / 0,152,912% / 006,959,446,480
// Brute_Force_Dummy 6x2 i.e. average performance: 0,343KB/clock / 0,019,200% / 039,726,516,640
//
// '52' test (52 patterns 4-175 chars in length)
//
// Railgun_Quadruplet_7sun 52 i.e. average performance: 1,652KB/clock / 0,768,416% / 022,362,482,640
// Railgun_Quadruplet_7 52 i.e. average performance: 1,636KB/clock / 0,708,768% / 025,368,049,712
// Railgun_Quadruplet_7sunhorse 52 i.e. average performance: 1,052KB/clock / 0,939,520% / 020,187,534,736
// Railgun_Quadruplet_7deuce 52 i.e. average performance: 0,929KB/clock / 0,864,048% / 023,292,952,992
// Railgun_Quadruplet_7Elsiane 52 i.e. average performance: 0,809KB/clock / 0,979,072% / 018,535,811,280
// Boyer_Moore_Flensburg 52 i.e. average performance: 0,928KB/clock / 0,720,272% / 025,227,269,760
// Railgun_Quadruplet_7Gulliver 52 i.e. average performance: 1,296KB/clock / 1,145,808% / 026,131,772,688
// Brute_Force_Dummy 52 i.e. average performance: 0,279KB/clock / 0,083,200% / 172,148,232,496
//
// Revision: 1, 2012-Jan-26, the main disadvantage: the preprocessing overhead.
// Caution: For better speed the case 'if (cbPattern==1)' was removed, so Pattern must be longer than 1 char.
char * Railgun_Quadruplet_7Gulliver_count_hits (char * pbTarget,
char * pbPattern, unsigned long cbTarget, unsigned long cbPattern)
{
char * pbTargetMax = pbTarget + cbTarget;
register unsigned long ulHashPattern;
register unsigned long ulHashTarget;
signed long count;
signed long countSTATIC;
unsigned char SINGLET;
unsigned long Quadruplet2nd;
unsigned long Quadruplet3rd;
unsigned long Quadruplet4th;
unsigned long AdvanceHopperGrass;
long i; //BMH needed
int a, j;
unsigned int bm_bc[256]; //BMH needed
unsigned int bm_bc2nd[256]; //BMS needed
//BMHSS(Elsiane) needed, 'char' limits patterns to 255, if 'long' then table becomes 256KB, grrr.
unsigned char bm_Sunday_Order2[256*256];
unsigned long Gulliver; // or unsigned char or unsigned short
if (cbPattern > cbTarget)
return(NULL);
if ( cbPattern<4) {
pbTarget = pbTarget+cbPattern;
ulHashPattern = ( (*(char *)(pbPattern))<<8 ) + *(pbPattern+(cbPattern-1));
if ( cbPattern==3) {
for ( ;; ) {
if ( ulHashPattern == ( (*(char *)(pbTarget-3))<<8 ) + *(pbTarget-1) ) {
if ( *(char *)(pbPattern+1) == *(char *)(pbTarget-2) ) Railgunhits++; //return((pbTarget-3));
}
if ( (char)(ulHashPattern>>8) != *(pbTarget-2) ) pbTarget++;
pbTarget++;
if (pbTarget > pbTargetMax)
return(NULL);
}
} else {
}
for ( ;; ) {
if ( ulHashPattern == ( (*(char *)(pbTarget-2))<<8 ) + *(pbTarget-1) )
Railgunhits++; //return((pbTarget-2));
if ( (char)(ulHashPattern>>8) != *(pbTarget-1) ) pbTarget++;
pbTarget++;
if (pbTarget > pbTargetMax)
return(NULL);
}
} else { //if ( cbPattern<4)
if (cbTarget<961) { // This value is arbitrary(don't know how exactly),
// it ensures(at least must) better performance than 'Boyer_Moore_Horspool'.
pbTarget = pbTarget+cbPattern;
ulHashPattern = *(unsigned long *)(pbPattern);
SINGLET = ulHashPattern & 0xFF;
Quadruplet2nd = SINGLET<<8;
Quadruplet3rd = SINGLET<<16;
Quadruplet4th = SINGLET<<24;
for ( ;; ) {
AdvanceHopperGrass = 0;
ulHashTarget = *(unsigned long *)(pbTarget-cbPattern);
if ( ulHashPattern == ulHashTarget ) {
// Three unnecessary comparisons here, but 'AdvanceHopperGrass'
// must be calculated - it has a higher priority.
count = cbPattern-1;
while ( count && *(char *)(pbPattern+(cbPattern-count)) == *(char *)(pbTarget-count) ) {
if ( cbPattern-1==AdvanceHopperGrass+count &&
SINGLET != *(char *)(pbTarget-count) ) AdvanceHopperGrass++;
count--;
}
if ( count == 0) Railgunhits++; //return((pbTarget-cbPattern));
} else { // The goal here: to avoid memory accesses by stressing the registers.
if ( Quadruplet2nd != (ulHashTarget & 0x0000FF00) ) {
AdvanceHopperGrass++;
if ( Quadruplet3rd != (ulHashTarget & 0x00FF0000) ) {
AdvanceHopperGrass++;
if ( Quadruplet4th != (ulHashTarget & 0xFF000000) ) AdvanceHopperGrass++;
}
}
}
AdvanceHopperGrass++;
pbTarget = pbTarget + AdvanceHopperGrass;
if (pbTarget > pbTargetMax)
return(NULL);
}
} else { //if (cbTarget<961)
countSTATIC = cbPattern-2-2;
for (a=0; a < 256; a++) {bm_bc[a]=cbPattern; bm_bc2nd[a]=cbPattern+1;}
for (j=0; j < cbPattern-1; j++) bm_bc[pbPattern[j]]=cbPattern-j-1;
for (j=0; j < cbPattern; j++) bm_bc2nd[pbPattern[j]]=cbPattern-j;
// Elsiane r.2 [
for (a=0; a < 256*256; a++) {bm_Sunday_Order2[a]= cbPattern-1;} // 'memset' if not optimized
// alfalfa 7 long 6 BBs (al lf fa al lf fa) 3 distinct BBs (al lf fa)
// fast 4 0-1-2 fa as st
for (j=0; j < cbPattern-1; j++)
bm_Sunday_Order2[*(unsigned short *)(pbPattern+j)]=j; // Rightmost appearance/position is needed
// Elsiane r.2 ]
ulHashPattern = *(unsigned short *)(pbPattern+cbPattern-1-1); // Last two bytes
ulHashTarget = *(unsigned long *)(pbPattern); // First four bytes
AdvanceHopperGrass = 0;
i=0;
while (i <= cbTarget-cbPattern-1-1) {
Gulliver = bm_Sunday_Order2[*(unsigned short *)&pbTarget[i+cbPattern-1-1]];
if ( Gulliver == cbPattern-2 ) { // CASE #1: means the pair (char order 2) is found
if ( *(unsigned long *)&pbTarget[i] == ulHashTarget) {
count = countSTATIC; // Last two chars already matched, to be fixed with -2
while ( count !=0 && *(char *)(pbPattern+
(countSTATIC-count)+4) == *(char *)(&pbTarget[i]+(countSTATIC-count)+4) )
count--;
if ( count == 0) Railgunhits++; //return(pbTarget+i);
}
i = i + 1;
} else if ( Gulliver == cbPattern-1 ) // CASE #2: means the pair (char order 2) is not found
i = i + Gulliver; // the pair is not found, skip the whole pattern and fall back one char
else
i = i + cbPattern - Gulliver - 2;
// CASE #3: the pair is found and not as suffix i.e. rightmost position
// 32323218 Order 1 Horspool
// fa af fa af fa as st Order 2 Horspool
// 0 1 2 3 4 5 6
// HIKARIfast
// fafafast
// fafafast +2 Order 1 'a' vs 't'
// fafafast +2 = (cbPattern-Gulliver-2 = 8-4-2 = 2) Order 2 'fa' vs 'st' i.e. CASE #3
// 76543218 Order 1 Horspool
// lo on ng gp pa ac ce Order 2 Horspool
// 0 1 2 3 4 5 6
// HIKARIfast
// longpace
// longpace +2 Order 1 'a' vs 'e'
// longpace +7 = (cbPattern-1 = 8-1 = 7) Order 2 'fa' vs 'ce' i.e. CASE #2
AdvanceHopperGrass++;
}
if (i == cbTarget-cbPattern-1) {
if ( *(unsigned long *)&pbTarget[i] == ulHashPattern) {
count = countSTATIC;
while ( count !=0 && *(char *)(pbPattern+(countSTATIC-count)+4)
== *(char *)(&pbTarget[i]+(countSTATIC-count)+4) )
count--;
if ( count == 0) Railgunhits++; //return(pbTarget+i);
}
i++;
AdvanceHopperGrass++;
}
if (i == cbTarget-cbPattern) {
if ( *(unsigned long *)&pbTarget[i] == ulHashPattern) {
count = countSTATIC;
while ( count !=0 && *(char *)(pbPattern+(countSTATIC-count)+4)
== *(char *)(&pbTarget[i]+(countSTATIC-count)+4) )
count--;
if ( count == 0) Railgunhits++; //return(pbTarget+i);
}
AdvanceHopperGrass++;
}
GlobalSP += (int)((double)cbTarget/AdvanceHopperGrass*100);
GlobalI += AdvanceHopperGrass;
printf("Skip-Performance(bigger-the-better): %d%%, %d skips/iterations\n",
(int)((double)cbTarget/AdvanceHopperGrass*100), AdvanceHopperGrass);
return(NULL);
} //if (cbTarget<961)
} //if ( cbPattern<4)
}
// ### Mix(2in1) of Karp-Rabin & Boyer-Moore-Sunday-Horspool algorithm ]
I was very touched (thus inspired) by this cuore song:
Strong enough to tell you why
I feel enough to tell you something that's very close to me
In the eye of my mind
Feel the reasons that we thought
You just feel all
Falling I'm standing across the stream
Fearing saying I
I'm strong enough to find my courage
To find the way I'm always near it's brewing in my mind
Falling I'm standing across the stream
Fearing saying I
I'm strong enough to find my courage
To find the way I'm always near it's growing in my mind
/Artist: Elsiane - Across The Stream/
I salute everyone who feels-and-keeps the artistic way near.
An add-on from 2012-Jan-24:
Just wanted to see what Sunday's technique (applied on two chars, order 2) holds, this resulted in appearance of Railgun_Quadruplet_7Elsiane, the biggest-strider so far, I dream (already) of Railgun_Quadruplet_7Gulliver...
Horspool-Sunday-Sunday jumps to:
--------------------------------\
|
Horspool-Sunday jumps to:
-------------------------------\
|
Horspool jumps to:
------------------------------\
|
String: faster
Needle: fast
Sunday skip-table:
43215
Horspool skip-table:
32144
For above example:
Horspool jumps to A where A=4 'e' position
Sunday jumps to B where B=5 'r' position
Horspool-Sunday ('SunHorse') jumps to max(A,B) i.e. 5
Horspool-Sunday-Sunday ('Elsiane') jumps to max(A,B,C) i.e. max(5,6)=6 since neither 'e' nor 'r' appear in the needle i.e. right after 'r'
char * Railgun_Quadruplet_7Elsiane_count_hits (char * pbTarget,
char * pbPattern, unsigned long cbTarget, unsigned long cbPattern)
{
char * pbTargetMax = pbTarget + cbTarget;
register unsigned long ulHashPattern;
register unsigned long ulHashTarget;
signed long count;
signed long countSTATIC;
unsigned char SINGLET;
unsigned long Quadruplet2nd;
unsigned long Quadruplet3rd;
unsigned long Quadruplet4th;
unsigned long AdvanceHopperGrass;
long i;
int a, j;
unsigned int bm_bc[256];
unsigned int bm_bc2nd[256];
unsigned char bm_Sunday_Order2[256*256];
if (cbPattern > cbTarget)
return(NULL);
if ( cbPattern<4) {
pbTarget = pbTarget+cbPattern;
ulHashPattern = ( (*(char *)(pbPattern))<<8 ) + *(pbPattern+(cbPattern-1));
if ( cbPattern==3) {
for ( ;; ) {
if ( ulHashPattern == ( (*(char *)(pbTarget-3))<<8 ) + *(pbTarget-1) ) {
if ( *(char *)(pbPattern+1) ==
*(char *)(pbTarget-2) ) Railgunhits++;
}
if ( (char)(ulHashPattern>>8) != *(pbTarget-2) ) pbTarget++;
pbTarget++;
if (pbTarget > pbTargetMax)
return(NULL);
}
} else {
}
for ( ;; ) {
if ( ulHashPattern == ( (*(char *)(pbTarget-2))<<8 ) + *(pbTarget-1) )
Railgunhits++;
if ( (char)(ulHashPattern>>8) != *(pbTarget-1) ) pbTarget++;
pbTarget++;
if (pbTarget > pbTargetMax)
return(NULL);
}
} else {
if (cbTarget<961) {
pbTarget = pbTarget+cbPattern;
ulHashPattern = *(unsigned long *)(pbPattern);
SINGLET = ulHashPattern & 0xFF;
Quadruplet2nd = SINGLET<<8;
Quadruplet3rd = SINGLET<<16;
Quadruplet4th = SINGLET<<24;
for ( ;; ) {
AdvanceHopperGrass = 0;
ulHashTarget = *(unsigned long *)(pbTarget-cbPattern);
if ( ulHashPattern == ulHashTarget ) {
count = cbPattern-1;
while ( count && *(char *)(pbPattern+(cbPattern-count)) == *(char *)(pbTarget-count) ) {
if ( cbPattern-1==AdvanceHopperGrass+count &&
SINGLET != *(char *)(pbTarget-count) ) AdvanceHopperGrass++;
count--;
}
if ( count == 0) Railgunhits++;
} else {
if ( Quadruplet2nd != (ulHashTarget & 0x0000FF00) ) {
AdvanceHopperGrass++;
if ( Quadruplet3rd != (ulHashTarget & 0x00FF0000) ) {
AdvanceHopperGrass++;
if ( Quadruplet4th != (ulHashTarget & 0xFF000000) ) AdvanceHopperGrass++;
}
}
}
AdvanceHopperGrass++;
pbTarget = pbTarget + AdvanceHopperGrass;
if (pbTarget > pbTargetMax)
return(NULL);
}
} else {
countSTATIC = cbPattern-2-2;
for (a=0; a < 256; a++) {bm_bc[a]=cbPattern; bm_bc2nd[a]=cbPattern+1;}
for (j=0; j < cbPattern-1; j++) bm_bc[pbPattern[j]]=cbPattern-j-1;
for (j=0; j < cbPattern; j++) bm_bc2nd[pbPattern[j]]=cbPattern-j;
for (a=0; a < 256*256; a++) {bm_Sunday_Order2[a]= 0;}
for (j=0; j < cbPattern; j++)
for (a=0; a < 256; a++) {
bm_Sunday_Order2[(a<<8) + pbPattern[j]]=1;
bm_Sunday_Order2[(pbPattern[j]<<8) + a]=1;
}
ulHashPattern = *(unsigned long *)(pbPattern);
AdvanceHopperGrass = 0;
i=0;
while (i <= cbTarget-cbPattern-1-1) {
if ( *(unsigned long *)&pbTarget[i] == ulHashPattern) {
count = countSTATIC;
while ( count !=0 && *(char *)(pbPattern+
(countSTATIC-count)+4) == *(char *)(&pbTarget[i]+(countSTATIC-count)+4) )
count--;
if ( count == 0) Railgunhits++;
}
if ( bm_Sunday_Order2[*(unsigned short *)&pbTarget[i+cbPattern]] == 0 )
i = i + cbPattern + 1 + 1;
else
if ( bm_bc[(unsigned char)pbTarget[i+cbPattern-1]] <
bm_bc2nd[(unsigned char)pbTarget[i+(cbPattern)]] )
i= i + bm_bc2nd[(unsigned char)pbTarget[i+(cbPattern)]];
else
i= i + bm_bc[(unsigned char)pbTarget[i+cbPattern-1]];
AdvanceHopperGrass++;
}
if (i == cbTarget-cbPattern-1) {
if ( *(unsigned long *)&pbTarget[i] == ulHashPattern) {
count = countSTATIC;
while ( count !=0 && *(char *)(pbPattern+
(countSTATIC-count)+4) == *(char *)(&pbTarget[i]+(countSTATIC-count)+4) )
count--;
if ( count == 0) Railgunhits++;
}
i++;
AdvanceHopperGrass++;
}
if (i == cbTarget-cbPattern) {
if ( *(unsigned long *)&pbTarget[i] == ulHashPattern) {
count = countSTATIC;
while ( count !=0 && *(char *)
(pbPattern+(countSTATIC-count)+4) == *(char *)
(&pbTarget[i]+(countSTATIC-count)+4) )
count--;
if ( count == 0) Railgunhits++;
}
AdvanceHopperGrass++;
}
GlobalSP += (int)((double)cbTarget/AdvanceHopperGrass*100);
GlobalI += AdvanceHopperGrass;
return(NULL);
}
}
}
I like a lot this look-ahead approach of D.M. Sunday, but can't see how to quicken it, any ideas?
An add-on from 2012-Jan-22:
[
A face-off
'SunHorse' vs Railgun_Quadruplet_7sun vs Railgun_Quadruplet_7 vs Boyer_Moore vs Brute_Force
Okay, here comes Railgun_Quadruplet_7sun - the new fastest hitter. Two tests were done - the first/second '6x2'/'52' searches 12/52 patterns (for all appearances i.e. counting all hits) into 206,908,949 bytes long English text.
First, 7deuce is in the past, a bad luck it was, yes. Wow, I have received collateral damage by the last shoot-out in which D.M. Sunday excels both in Speed-Performance and Skip-Performance.
Speed-Performance (measured in KB/clock, bigger-the-better) Summary:
Function '6x2' test (12 patterns 4-19 chars in length)
Railgun_Quadruplet_7sun 2,092KB/clock / 153,936% / 005,572,202,064 Iterations BONBON
Railgun_Quadruplet_7 2,030KB/clock / 137,568% / 006,454,416,320 Iterations Leaves-The-Town
Railgun_Quadruplet_7sunhorse 1,250KB/clock / 173,840% / 005,145,288,080 Iterations UNDERRATED
Boyer_Moore_Flensburg 1,154KB/clock / 139,168% / 006,428,115,568 Iterations Slow
Railgun_Quadruplet_7deuce 1,125KB/clock / 155,744% / 005,995,597,776 Iterations Slower
Brute_Force_Dummy 0,338KB/clock / 019,200% / 039,726,516,640 Iterations Slowest
Function '52' test (52 patterns 4-175 chars in length)
Railgun_Quadruplet_7sun 1,657KB/clock / 768,416% / 022,362,482,640 Iterations BONBON
Railgun_Quadruplet_7 1,637KB/clock / 708,768% / 025,368,049,712 Iterations Leaves-The-Town
Railgun_Quadruplet_7sunhorse 1,046KB/clock / 939,520% / 020,187,534,736 Iterations UNDERRATED
Railgun_Quadruplet_7deuce 0,923KB/clock / 864,048% / 023,292,952,992 Iterations Slow
Boyer_Moore_Flensburg 0,919KB/clock / 720,272% / 025,227,312,176 Iterations Slower
Brute_Force_Dummy 0,276KB/clock / 083,200% / 172,148,232,496 Iterations Slowest
Skip-Performance (measured in %, bigger-the-better; Iterations, lesser-the-better) Summary:
Function '6x2' test (12 patterns 4-19 chars in length)
Railgun_Quadruplet_7sunhorse 1,250KB/clock / 173,840% / 005,145,288,080 Iterations BONBON
Railgun_Quadruplet_7deuce 1,125KB/clock / 155,744% / 005,995,597,776 Iterations Leaves-The-Town
Railgun_Quadruplet_7sun 2,092KB/clock / 153,936% / 005,572,202,064 Iterations Bitter-Sweet
Boyer_Moore_Flensburg 1,154KB/clock / 139,168% / 006,428,115,568 Iterations Loopy
Railgun_Quadruplet_7 2,030KB/clock / 137,568% / 006,454,416,320 Iterations Loopier
Brute_Force_Dummy 0,338KB/clock / 019,200% / 039,726,516,640 Iterations Loopiest
Function '52' test (52 patterns 4-175 chars in length)
Railgun_Quadruplet_7sunhorse 1,046KB/clock / 939,520% / 020,187,534,736 Iterations BONBON
Railgun_Quadruplet_7deuce 0,923KB/clock / 864,048% / 023,292,952,992 Iterations Leaves-The-Town
Railgun_Quadruplet_7sun 1,657KB/clock / 768,416% / 022,362,482,640 Iterations Bitter-Sweet
Boyer_Moore_Flensburg 0,919KB/clock / 720,272% / 025,227,312,176 Iterations Loopy
Railgun_Quadruplet_7 1,637KB/clock / 708,768% / 025,368,049,712 Iterations Loopier
Brute_Force_Dummy 0,276KB/clock / 083,200% / 172,148,232,496 Iterations Loopiest
For English texts I would have no second thought which function to choose: it is Railgun_Quadruplet_7sunhorse (Boyer-Moore reinforced by Sunday-Horspool) - strongest of all Boyer-Moore variants (including the original), sadly [still] not among the fastest. I can't figure out the cause for such unproportionality: 'sunhorse' features amazing (for English texts) Skip-Performance i.e. number of pattern comparisions agaist the text is so low but Speed-Performance (on this CPU) lags behind.
The test package is downloadable at:
http://www.sanmayce.com/Downloads/_KAZE_strstr_SHORT-SHOWDOWN_7sun_vs_7_vs_7sunhorse_vs_7deuce_vs_BMF.7z
The test package contains:
D:\_KAZE_strstr_SHORT-SHOWDOWN_7sun_vs_7_vs_7sunhorse_vs_7deuce_vs_BMF>dir
01/22/2012 07:00 AM 59,198 FH_Flensburg.zip
01/22/2012 07:00 AM 206,908,949 OSHO.TXT
01/22/2012 07:00 AM 469,283 Results_Pentium_Merom_2166Mhz.txt
01/22/2012 07:00 AM 175 RUNME_IT_TAKES_some_20_MINUTES.BAT
01/22/2012 07:00 AM 191 strstr_COMPILE_Microsoft.bat
01/22/2012 07:00 AM 603,625 strstr_SHORT-SHOWDOWN.c
01/22/2012 07:00 AM 633,053 strstr_SHORT-SHOWDOWN.cod
01/22/2012 07:00 AM 108,032 strstr_SHORT-SHOWDOWN_Microsoft_v16_Ox.exe
D:\_KAZE_strstr_SHORT-SHOWDOWN_7sun_vs_7_vs_7sunhorse_vs_7deuce_vs_BMF>
A very easy-to-the-eyes/mind article at http://www.inf.fh-flensburg.de/lang/algorithmen/pattern/bm.htm. Here I say 'Triple THANKS' to Prof. Dr. Hans Werner Lang (Institut für Medieninformatik und Technische Informatik Fachhochschule Flensburg), I used your code (after a straightforward port). On his page (http://www.inf.fh-flensburg.de/lang/) there are very well presented algorithms - a very good resource indeed. I expect Railgun_Quadruplet_7sunhorse to become a gun of choice but only on CPUs with FAR more effective branching/memory handling than my Pentium Merom.
I still do not understand Turbo Boyer-Moore algorithm, maybe it holds a significant Skip-Performance boost. This project is under researching, that is why I have called it a quest. Anyone who can point out faster functions is welcomed to share them with us here.
My favorite UNDERRATED gun has main loop as follows:
; 3209 : i=0;
; 3210 : while (i <= cbTarget-cbPattern-1) {
010e6 8b 4c 24 20 mov ecx, DWORD PTR _cbTarget$GSCopy$[esp+2104]
010ea 89 44 24 28 mov DWORD PTR _ulHashPattern$[esp+2104], eax
010ee 33 c0 xor eax, eax
010f0 2b ca sub ecx, edx
010f2 89 4c 24 14 mov DWORD PTR tv554[esp+2104], ecx
010f6 8d 0c 13 lea ecx, DWORD PTR [ebx+edx]
010f9 89 44 24 10 mov DWORD PTR _AdvanceHopperGrass$[esp+2104], eax
010fd 89 4c 24 18 mov DWORD PTR tv451[esp+2104], ecx
$LN13@Railgun_Qu@5:
; 3211 : ulHashTarget=*(unsigned long *)&pbTarget[i];
; 3212 : if ( ulHashTarget == ulHashPattern)
01101 8b 54 24 28 mov edx, DWORD PTR _ulHashPattern$[esp+2104]
01105 39 14 18 cmp DWORD PTR [eax+ebx], edx
01108 75 41 jne SHORT $LN8@Railgun_Qu@5
; 3213 : {
; 3214 : count = countSTATIC;
0110a 8b 7c 24 1c mov edi, DWORD PTR _countSTATIC$[esp+2104]
; 3215 : while ( count !=0 && *(char *)(pbPattern+(countSTATIC-count)+4) ==
*(char *)(&pbTarget[i]+(countSTATIC-count)+4) ) {
0110e 85 ff test edi, edi
01110 74 2d je SHORT $LN85@Railgun_Qu@5
; 3213 : {
; 3214 : count = countSTATIC;
01112 8d 56 04 lea edx, DWORD PTR [esi+4]
01115 8d 74 18 04 lea esi, DWORD PTR [eax+ebx+4]
01119 8d a4 24 00 00
00 00 npad 7
$LL10@Railgun_Qu@5:
; 3215 : while ( count !=0 && *(char *)(pbPattern+(countSTATIC-count)+4) ==
*(char *)(&pbTarget[i]+(countSTATIC-count)+4) ) {
01120 8a 0a mov cl, BYTE PTR [edx]
01122 3a 0e cmp cl, BYTE PTR [esi]
01124 75 11 jne SHORT $LN9@Railgun_Qu@5
; 3216 : count--;
01126 46 inc esi
01127 42 inc edx
01128 4f dec edi
01129 75 f5 jne SHORT $LL10@Railgun_Qu@5
; 3217 : }
; 3218 : if ( count == 0) Railgunhits++;
0112b 8b 74 24 24 mov esi, DWORD PTR _pbPattern$GSCopy$[esp+2104]
0112f ff 05 00 00 00
00 inc DWORD PTR _Railgunhits
01135 eb 14 jmp SHORT $LN8@Railgun_Qu@5
$LN9@Railgun_Qu@5:
01137 85 ff test edi, edi
01139 75 0c jne SHORT $LN111@Railgun_Qu@5
0113b 8b 74 24 24 mov esi, DWORD PTR _pbPattern$GSCopy$[esp+2104]
$LN85@Railgun_Qu@5:
0113f ff 05 00 00 00
00 inc DWORD PTR _Railgunhits
01145 eb 04 jmp SHORT $LN8@Railgun_Qu@5
$LN111@Railgun_Qu@5:
01147 8b 74 24 24 mov esi, DWORD PTR _pbPattern$GSCopy$[esp+2104]
$LN8@Railgun_Qu@5:
; 3219 : }
; 3220 :
; 3221 :
; 3222 :
; 3223 :
; 3224 :
; 3225 :
; 3226 : if ( bm_bc[(unsigned char)pbTarget[i+cbPattern-1]]
< bm_bc2nd[(unsigned char)pbTarget[i+(cbPattern)]] )
0114b 8b 54 24 18 mov edx, DWORD PTR tv451[esp+2104]
0114f 0f b6 4c 02 ff movzx ecx, BYTE PTR [edx+eax-1]
01154 0f b6 14 02 movzx edx, BYTE PTR [edx+eax]
01158 8b 4c 8c 30 mov ecx, DWORD PTR _bm_bc$[esp+ecx*4+2104]
0115c 8b 94 94 30 04
00 00 mov edx, DWORD PTR _bm_bc2nd$[esp+edx*4+2104]
01163 3b ca cmp ecx, edx
01165 73 04 jae SHORT $LN7@Railgun_Qu@5
; 3227 : i= i + bm_bc2nd[(unsigned char)pbTarget[i+(cbPattern)]];
01167 03 c2 add eax, edx
; 3228 : else
01169 eb 02 jmp SHORT $LN6@Railgun_Qu@5
$LN7@Railgun_Qu@5:
; 3229 : i= i + bm_bc[(unsigned char)pbTarget[i+cbPattern-1]];
0116b 03 c1 add eax, ecx
$LN6@Railgun_Qu@5:
; 3207 :
; 3208 : AdvanceHopperGrass = 0;
; 3209 : i=0;
; 3210 : while (i <= cbTarget-cbPattern-1) {
0116d 8b 4c 24 14 mov ecx, DWORD PTR tv554[esp+2104]
; 3230 :
; 3231 :
; 3232 : AdvanceHopperGrass++;
01171 ff 44 24 10 inc DWORD PTR _AdvanceHopperGrass$[esp+2104]
01175 49 dec ecx
01176 3b c1 cmp eax, ecx
01178 76 87 jbe SHORT $LN13@Railgun_Qu@5
; 3233 : }
The size is:
01178 - 01101 +1 +1 -4 = 75bytes, -4 because 'inc DWORD PTR _AdvanceHopperGrass$[esp+2104]
' is only for statistical use.
]
Background
Function homepage: http://www.sanmayce.com/Railgun/index.html
Note: In order to reproduce the tests shown on the picture (below) you need the test package (57.2 MB) downloadable at function's homepage, thanks Randor for pointing out.
Very informative resource on strstr search techniques at:
Tests done with 16 patterns with 197MB haystack (being a pure English book-like text).
The machine is a mainstream laptop on 2.1GHz with Intel Merom CPU.
Using the Code
The code is pretty simple and straightforward.
char * Railgun_Quadruplet_6ppp (char * pbTarget,
char * pbPattern,
unsigned long cbTarget,
unsigned long cbPattern)
{
char * pbTargetMax = pbTarget + cbTarget;
register unsigned long ulHashPattern;
unsigned long ulHashTarget;
signed long count;
signed long countSTATIC;
unsigned char SINGLET;
unsigned long Quadruplet2nd;
unsigned long Quadruplet3rd;
unsigned long Quadruplet4th;
unsigned long AdvanceHopperGrass;
long i; int a, j, bm_bc[ASIZE]; unsigned char ch;
if (cbPattern > cbTarget)
return(NULL);
if ( cbPattern<4) { pbTarget = pbTarget+cbPattern;
ulHashPattern = ( (*(char *)(pbPattern))<<8 ) + *(pbPattern+(cbPattern-1));
if ( cbPattern==3) {
for ( ;; )
{
if ( ulHashPattern == ( (*(char *)(pbTarget-3))<<8 ) + *(pbTarget-1) ) {
if ( *(char *)(pbPattern+1) == *(char *)(pbTarget-2) ) return((pbTarget-3));
}
if ( (char)(ulHashPattern>>8) != *(pbTarget-2) ) pbTarget++;
pbTarget++;
if (pbTarget > pbTargetMax)
return(NULL);
}
} else {
}
for ( ;; )
{
if ( ulHashPattern == ( (*(char *)(pbTarget-2))<<8 ) + *(pbTarget-1) )
return((pbTarget-2));
if ( (char)(ulHashPattern>>8) != *(pbTarget-1) ) pbTarget++;
pbTarget++;
if (pbTarget > pbTargetMax)
return(NULL);
}
} else { if (cbTarget<961) {
pbTarget = pbTarget+cbPattern;
ulHashPattern = *(unsigned long *)(pbPattern);
SINGLET = ulHashPattern & 0xFF;
Quadruplet2nd = SINGLET<<8;
Quadruplet3rd = SINGLET<<16;
Quadruplet4th = SINGLET<<24;
for ( ;; )
{
AdvanceHopperGrass = 0;
ulHashTarget = *(unsigned long *)(pbTarget-cbPattern);
if ( ulHashPattern == ulHashTarget ) { count = cbPattern-1;
while ( count && *(char *)(pbPattern+(cbPattern-count)) ==
*(char *)(pbTarget-count) ) {
if ( cbPattern-1==AdvanceHopperGrass+count && SINGLET !=
*(char *)(pbTarget-count) ) AdvanceHopperGrass++;
count--;
}
if ( count == 0) return((pbTarget-cbPattern));
} else { if ( Quadruplet2nd != (ulHashTarget & 0x0000FF00) ) {
AdvanceHopperGrass++;
if ( Quadruplet3rd != (ulHashTarget & 0x00FF0000) ) {
AdvanceHopperGrass++;
if ( Quadruplet4th != (ulHashTarget & 0xFF000000) ) AdvanceHopperGrass++;
}
}
}
AdvanceHopperGrass++;
pbTarget = pbTarget + AdvanceHopperGrass;
if (pbTarget > pbTargetMax)
return(NULL);
}
} else { countSTATIC = cbPattern-2-2; ulHashPattern = *(unsigned long *)(pbPattern);
for (a=0; a < ASIZE; a++) bm_bc[a]=cbPattern;
for (j=0; j < cbPattern-1; j++) bm_bc[pbPattern[j]]=cbPattern-j-1;
i=0;
while (i <= cbTarget-cbPattern) {
ch=pbTarget[i+cbPattern-1];
if (ch == pbPattern[cbPattern-1] &&
*(long *)&pbTarget[i] == ulHashPattern) {
count = countSTATIC;
while ( count !=0 && *(char *)(pbPattern+1+3+(countSTATIC-count)) ==
*(char *)(&pbTarget[i]+1+3+(countSTATIC-count)) )
{ count--;
}
if ( count <= 0) return(pbTarget+i);
}
i+=bm_bc[ch];
}
return(NULL);
} } }
Stomp stomp I arrived, here comes the Railgun_Quadruplet
revision 7:
char * Railgun_Quadruplet_7 (char * pbTarget, char * pbPattern,
unsigned long cbTarget, unsigned long cbPattern)
{
char * pbTargetMax = pbTarget + cbTarget;
register unsigned long ulHashPattern;
unsigned long ulHashTarget;
signed long count;
signed long countSTATIC;
unsigned char SINGLET;
unsigned long Quadruplet2nd;
unsigned long Quadruplet3rd;
unsigned long Quadruplet4th;
unsigned long AdvanceHopperGrass;
long i;
int a, j, bm_bc[ASIZE];
unsigned char ch;
if (cbPattern > cbTarget) return(NULL);
if (cbPattern<4) {
pbTarget = pbTarget+cbPattern;
ulHashPattern = ( (*(char *)(pbPattern))<<8 ) +
*(pbPattern+(cbPattern-1));
if (cbPattern==3) {
for ( ;; ) {
if ( ulHashPattern ==
( (*(char *)(pbTarget-3))<<8 ) + *(pbTarget-1) ) {
if ( *(char *)(pbPattern+1) ==
*(char *)(pbTarget-2) ) return((pbTarget-3));
}
if ( (char)(ulHashPattern>>8) !=
*(pbTarget-2) ) pbTarget++;
pbTarget++;
if (pbTarget > pbTargetMax) return(NULL);
}
} else {
}
for ( ;; ) {
if ( ulHashPattern == ( (*(char *)(pbTarget-2))<<8 )
+ *(pbTarget-1) ) return((pbTarget-2));
if ( (char)(ulHashPattern>>8) != *(pbTarget-1) )
pbTarget++;
pbTarget++;
if (pbTarget > pbTargetMax) return(NULL);
}
} else {
if (cbTarget<961) {
pbTarget = pbTarget+cbPattern;
ulHashPattern = *(unsigned long *)(pbPattern);
SINGLET = ulHashPattern & 0xFF;
Quadruplet2nd = SINGLET<<8;
Quadruplet3rd = SINGLET<<16;
Quadruplet4th = SINGLET<<24;
for ( ;; ) {
AdvanceHopperGrass = 0;
ulHashTarget = *(unsigned long *)(pbTarget-cbPattern);
if ( ulHashPattern == ulHashTarget ) {
count = cbPattern-1;
while ( count && *(char *)
(pbPattern+(cbPattern-count)) ==
*(char *)(pbTarget-count) ) {
if ( cbPattern-1==
AdvanceHopperGrass+count &&
SINGLET != *(char *)(pbTarget-count)
) AdvanceHopperGrass++;
count--;
}
if ( count == 0)
return((pbTarget-cbPattern));
} else {
if ( Quadruplet2nd != (ulHashTarget
& 0x0000FF00) ) {
AdvanceHopperGrass++;
if ( Quadruplet3rd !=
(ulHashTarget & 0x00FF0000) ) {
AdvanceHopperGrass++;
if ( Quadruplet4th !=
(ulHashTarget & 0xFF000000)
) AdvanceHopperGrass++;
}
}
}
AdvanceHopperGrass++;
pbTarget = pbTarget + AdvanceHopperGrass;
if (pbTarget > pbTargetMax) return(NULL);
}
} else {
countSTATIC = cbPattern-2-2;
ulHashPattern = *(unsigned long *)(pbPattern);
for (a=0; a < ASIZE; a++) bm_bc[a]=cbPattern;
for (j=0; j < cbPattern-1; j++) bm_bc[pbPattern[j]]=
cbPattern-j-1;
i=0;
while (i <= cbTarget-cbPattern) {
if ( *(unsigned long *)&pbTarget[i] ==
ulHashPattern ) {
count = countSTATIC;
while ( count !=0 && *(char *)
(pbPattern+1+3+(countSTATIC-count)) ==
*(char *)(&pbTarget[i]+1+3+
(countSTATIC-count)) ) count--;
if ( count == 0) return(pbTarget+i);
}
i= i + bm_bc[(unsigned char)pbTarget[i+cbPattern-1]];
}
return(NULL);
}
}
}
The tweak (in BMH fragment) which speeds-up significantly the obsolete (already) r.6+++ is this:
countSTATIC = cbPattern-2-2; ulHashPattern = *(unsigned long *)(pbPattern);
for (a=0; a < ASIZE; a++) bm_bc[a]=cbPattern;
for (j=0; j < cbPattern-1; j++) bm_bc[pbPattern[j]]=cbPattern-j-1;
i=0;
while (i <= cbTarget-cbPattern) {
if ( *(unsigned long *)&pbTarget[i] == ulHashPattern ) {
count = countSTATIC;
while ( count !=0 && *(char *)(pbPattern+(countSTATIC-count)+4) ==
*(char *)(&pbTarget[i]+(countSTATIC-count)+4) )
{ count--;
}
if ( count == 0) return(pbTarget+i);
}
i= i + bm_bc[(unsigned char)pbTarget[i+cbPattern-1]];
}
return(NULL);
In order to see what path I had to traverse to make this tweak, you may see the discussion board (r.7m thread). The tests below show NOT the full boost simply because they average times for all lengths thus losing/diminishing the high speeds times (being 4-3-2 times lesser compared with pattern lengths 2-3-4-5-6).
D:\_KAZE_new-stuff>strstr_SHORT-SHOWDOWN_Microsoft_v16_Ox.exe
strstr_SHORT-SHOWDOWN, revision 7, written by Kaze.
Input Pattern(up to 19+2000 chars): frustration
Doing Search for Pattern(11bytes) into String(206908949bytes) as-one-line ...
WARNING! The next line works with BMH only for pattern 4[+] long,
otherwise (for 2 and 3) other searcher takes over!
Railgun_6pp_hits/Railgun_6pp_clocks: 1355/101
Railgun_6pp performance: 2000KB/clock
Doing Search for Pattern(11bytes) into String(206908949bytes) as-one-line ...
WARNING! The next line works with BMH only for pattern 4[+] long,
otherwise (for 2 and 3) other searcher takes over!
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 1355/91
Railgun_Quadruplet_7 performance: 2220KB/clock
Note: Executing the next two tests 256 times i.e. the search is for 256x8x2 patterns!
Doing Search for 8x2 Patterns into String(206908949bytes) as-one-line ...
Not-Counting-Hits-Just-Returns-First-One
Found ('an') at 157 position, Railgun_Quadruplet_6pp performance: 0KB/clock
Found ('to') at 853 position, Railgun_Quadruplet_6pp performance: 0KB/clock
Found ('TDK') at -4325408 position, Railgun_Quadruplet_6pp performance: 922KB/clock
Found ('the') at 511 position, Railgun_Quadruplet_6pp performance: 0KB/clock
Found ('fast') at 42381 position, Railgun_Quadruplet_6pp performance: 41KB/clock
Found ('easy') at 30163 position, Railgun_Quadruplet_6pp performance: 29KB/clock
Found ('grmbl') at -4325408 position, Railgun_Quadruplet_6pp performance: 1167KB/clock
Found ('email') at 69587982 position, Railgun_Quadruplet_6pp performance: 1078KB/clock
Found ('pasting') at 134600126 position, Railgun_Quadruplet_6pp performance: 1383KB/clock
Found ('amazing') at 1629806 position, Railgun_Quadruplet_6pp performance: 93KB/clock
Found ('underdog') at 137424498 position,
Railgun_Quadruplet_6pp performance: 1698KB/clock
Found ('superdog') at -4325408 position, Railgun_Quadruplet_6pp performance: 1603KB/clock
Found ('participants') at 16060 position, Railgun_Quadruplet_6pp performance: 15KB/clock
Found ('skillessness') at -4325408 position,
Railgun_Quadruplet_6pp performance: 1836KB/clock
Found ('I should have known') at 41831880 position,
Railgun_Quadruplet_6pp performance: 2403KB/clock
Found ('human consciousness') at 3863 position,
Railgun_Quadruplet_6pp performance: 3KB/clock
Railgun_Quadruplet_6pp 8x2 i.e. average performance: 1342KB/clock
ReallyTraversed: 310,477,846,528 bytes
Doing Search for 8x2 Patterns into String(206908949bytes) as-one-line ...
Not-Counting-Hits-Just-Returns-First-One
Found ('an') at 157 position, Railgun_Quadruplet_7 performance: 0KB/clock
Found ('to') at 853 position, Railgun_Quadruplet_7 performance: 0KB/clock
Found ('TDK') at -4325408 position, Railgun_Quadruplet_7 performance: 990KB/clock
Found ('the') at 511 position, Railgun_Quadruplet_7 performance: 0KB/clock
Found ('fast') at 42381 position, Railgun_Quadruplet_7 performance: 41KB/clock
Found ('easy') at 30163 position, Railgun_Quadruplet_7 performance: 29KB/clock
Found ('grmbl') at -4325408 position, Railgun_Quadruplet_7 performance: 1069KB/clock
Found ('email') at 69587982 position, Railgun_Quadruplet_7 performance: 1078KB/clock
Found ('pasting') at 134600126 position, Railgun_Quadruplet_7 performance: 1383KB/clock
Found ('amazing') at 1629806 position, Railgun_Quadruplet_7 performance: 1591KB/clock
Found ('underdog') at 137424498 position, Railgun_Quadruplet_7 performance: 1698KB/clock
Found ('superdog') at -4325408 position, Railgun_Quadruplet_7 performance: 1422KB/clock
Found ('participants') at 16060 position, Railgun_Quadruplet_7 performance: 15KB/clock
Found ('skillessness') at -4325408 position,
Railgun_Quadruplet_7 performance: 2149KB/clock
Found ('I should have known') at 41831880 position,
Railgun_Quadruplet_7 performance: 2403KB/clock
Found ('human consciousness') at 3863 position,
Railgun_Quadruplet_7 performance: 3KB/clock
Railgun_Quadruplet_7 8x2 i.e. average performance: 1354KB/clock
ReallyTraversed: 310,477,846,528 bytes
Doing Search for 8x2 Patterns into String(206908949bytes) as-one-line ...
Found ('an') 1987797 time(s), BM_HORSPOOL performance: 293KB/clock
Found ('to') 1076629 time(s), BM_HORSPOOL performance: 293KB/clock
Found ('TDK') 0 time(s), BM_HORSPOOL performance: 516KB/clock
Found ('the') 2114180 time(s), BM_HORSPOOL performance: 379KB/clock
Found ('fast') 5945 time(s), BM_HORSPOOL performance: 585KB/clock
Found ('easy') 5191 time(s), BM_HORSPOOL performance: 585KB/clock
Found ('grmbl') 0 time(s), BM_HORSPOOL performance: 759KB/clock
Found ('email') 1 time(s), BM_HORSPOOL performance: 756KB/clock
Found ('pasting') 2 time(s), BM_HORSPOOL performance: 918KB/clock
Found ('amazing') 323 time(s), BM_HORSPOOL performance: 1074KB/clock
Found ('underdog') 4 time(s), BM_HORSPOOL performance: 1069KB/clock
Found ('superdog') 0 time(s), BM_HORSPOOL performance: 1074KB/clock
Found ('participants') 147 time(s), BM_HORSPOOL performance: 1603KB/clock
Found ('skillessness') 0 time(s), BM_HORSPOOL performance: 1422KB/clock
Found ('I should have known') 1 time(s), BM_HORSPOOL performance: 1433KB/clock
Found ('human consciousness') 519 time(s), BM_HORSPOOL performance: 1820KB/clock
BM_HORSPOOL 8x2 i.e. average performance: 666KB/clock
Doing Search for 8x2 Patterns into String(206908949bytes) as-one-line ...
Found ('an') 1987797 time(s), Railgun_6pp performance: 678KB/clock
Found ('to') 1076629 time(s), Railgun_6pp performance: 643KB/clock
Found ('TDK') 0 time(s), Railgun_6pp performance: 1174KB/clock
Found ('the') 2114180 time(s), Railgun_6pp performance: 678KB/clock
Found ('fast') 5945 time(s), Railgun_6pp performance: 918KB/clock
Found ('easy') 5191 time(s), Railgun_6pp performance: 990KB/clock
Found ('grmbl') 0 time(s), Railgun_6pp performance: 1287KB/clock
Found ('email') 1 time(s), Railgun_6pp performance: 1167KB/clock
Found ('pasting') 2 time(s), Railgun_6pp performance: 1603KB/clock
Found ('amazing') 323 time(s), Railgun_6pp performance: 1603KB/clock
Found ('underdog') 4 time(s), Railgun_6pp performance: 1603KB/clock
Found ('superdog') 0 time(s), Railgun_6pp performance: 1820KB/clock
Found ('participants') 147 time(s), Railgun_6pp performance: 2149KB/clock
Found ('skillessness') 0 time(s), Railgun_6pp performance: 2126KB/clock
Found ('I should have known') 1 time(s), Railgun_6pp performance: 2557KB/clock
Found ('human consciousness') 519 time(s), Railgun_6pp performance: 2126KB/clock
Railgun_6pp 8x2 i.e. average performance: 1218KB/clock
Doing Search for 8x2 Patterns into String(206908949bytes) as-one-line ...
Found ('an') 1987797 time(s), Railgun_Quadruplet_7 performance: 643KB/clock
Found ('to') 1076629 time(s), Railgun_Quadruplet_7 performance: 678KB/clock
Found ('TDK') 0 time(s), Railgun_Quadruplet_7 performance: 1074KB/clock
Found ('the') 2114180 time(s), Railgun_Quadruplet_7 performance: 643KB/clock
Found ('fast') 5945 time(s), Railgun_Quadruplet_7 performance: 1074KB/clock
Found ('easy') 5191 time(s), Railgun_Quadruplet_7 performance: 1074KB/clock
Found ('grmbl') 0 time(s), Railgun_Quadruplet_7 performance: 1287KB/clock
Found ('email') 1 time(s), Railgun_Quadruplet_7 performance: 1167KB/clock
Found ('pasting') 2 time(s), Railgun_Quadruplet_7 performance: 1603KB/clock
Found ('amazing') 323 time(s), Railgun_Quadruplet_7 performance: 1603KB/clock
Found ('underdog') 4 time(s), Railgun_Quadruplet_7 performance: 1820KB/clock
Found ('superdog') 0 time(s), Railgun_Quadruplet_7 performance: 1603KB/clock
Found ('participants') 147 time(s), Railgun_Quadruplet_7 performance: 2557KB/clock
Found ('skillessness') 0 time(s), Railgun_Quadruplet_7 performance: 2126KB/clock
Found ('I should have known') 1 time(s), Railgun_Quadruplet_7 performance: 2557KB/clock
Found ('human consciousness') 519 time(s),
Railgun_Quadruplet_7 performance: 2557KB/clock
Railgun_Quadruplet_7 8x2 i.e. average performance: 1232KB/clock
^C
D:\_KAZE_new-stuff>
In my opinion, it is one of the seven commandments of speed: do not put the cart before the horse, i.e., for adjacent (32?) memory accesses first extract the byte with lowest address. strstr_SHORT-SHOWDOWN_Microsoft_v16_Ox.exe, revision 7, some quick inputs:
D:\_KAZE_new-stuff>strstr_SHORT-SHOWDOWN_Microsoft_v16_Ox.exe
strstr_SHORT-SHOWDOWN, revision 7, written by Kaze.
Input Pattern(up to 19+2000 chars): fast
Doing Search for Pattern(4bytes) into String(206908949bytes) as-one-line ...
WARNING! The next line works with BMH only for pattern 4[+] long,
otherwise (for 2 and 3) other searcher takes over!
Railgun_6pp_hits/Railgun_6pp_clocks: 5945/215
Railgun_6pp performance: 939KB/clock
Doing Search for Pattern(4bytes) into String(206908949bytes) as-one-line ...
WARNING! The next line works with BMH only for pattern 4[+] long,
otherwise (for 2 and 3) other searcher takes over!
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 5945/193
Railgun_Quadruplet_7 performance: 1046KB/clock
^C
...
Input Pattern(up to 19+2000 chars): from
Railgun_6pp_hits/Railgun_6pp_clocks: 83155/196
Railgun_6pp performance: 1030KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 83155/194
Railgun_Quadruplet_7 performance: 1041KB/clock
Input Pattern(up to 19+2000 chars): rabea
Railgun_6pp_hits/Railgun_6pp_clocks: 0/176
Railgun_6pp performance: 1148KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/162
Railgun_Quadruplet_7 performance: 1247KB/clock
Input Pattern(up to 19+2000 chars): makes
Railgun_6pp_hits/Railgun_6pp_clocks: 9281/172
Railgun_6pp performance: 1174KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 9281/162
Railgun_Quadruplet_7 performance: 1247KB/clock
Input Pattern(up to 19+2000 chars): monkey
Railgun_6pp_hits/Railgun_6pp_clocks: 1691/141
Railgun_6pp performance: 1433KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 1691/139
Railgun_Quadruplet_7 performance: 1453KB/clock
Input Pattern(up to 19+2000 chars): punny
Railgun_6pp_hits/Railgun_6pp_clocks: 0/156
Railgun_6pp performance: 1295KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/155
Railgun_Quadruplet_7 performance: 1303KB/clock
Input Pattern(up to 19+2000 chars): funny
Railgun_6pp_hits/Railgun_6pp_clocks: 179/157
Railgun_6pp performance: 1287KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 179/156
Railgun_Quadruplet_7 performance: 1295KB/clock
Input Pattern(up to 19+2000 chars): ramjet
Railgun_6pp_hits/Railgun_6pp_clocks: 0/152
Railgun_6pp performance: 1329KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/137
Railgun_Quadruplet_7 performance: 1474KB/clock
Input Pattern(up to 19+2000 chars): fallen
Railgun_6pp_hits/Railgun_6pp_clocks: 2578/151
Railgun_6pp performance: 1338KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 2578/138
Railgun_Quadruplet_7 performance: 1464KB/clock
Input Pattern(up to 19+2000 chars): elephant
Railgun_6pp_hits/Railgun_6pp_clocks: 1198/124
Railgun_6pp performance: 1629KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 1198/112
Railgun_Quadruplet_7 performance: 1804KB/clock
Input Pattern(up to 19+2000 chars): layalina
Railgun_6pp_hits/Railgun_6pp_clocks: 0/121
Railgun_6pp performance: 1669KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/110
Railgun_Quadruplet_7 performance: 1836KB/clock
Input Pattern(up to 19+2000 chars): I should have known
Railgun_6pp_hits/Railgun_6pp_clocks: 1/89
Railgun_6pp performance: 2270KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 1/84
Railgun_Quadruplet_7 performance: 2405KB/clock
Input Pattern(up to 19+2000 chars): human consciousness
Railgun_6pp_hits/Railgun_6pp_clocks: 519/80
Railgun_6pp performance: 2525KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 519/75
Railgun_Quadruplet_7 performance: 2694KB/clock
Input Pattern(up to 19+2000 chars): But he's looking right through me
Railgun_6pp_hits/Railgun_6pp_clocks: 0/84
Railgun_6pp performance: 2405KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/76
Railgun_Quadruplet_7 performance: 2658KB/clock
Input Pattern(up to 19+2000 chars): I'm living in an age that calls darkness light
Railgun_6pp_hits/Railgun_6pp_clocks: 0/72
Railgun_6pp performance: 2806KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/67
Railgun_Quadruplet_7 performance: 3015KB/clock
Input Pattern(up to 19+2000 chars): Following the wanderings of a dream -
a dream that keeps my soul alive
Railgun_6pp_hits/Railgun_6pp_clocks: 0/69
Railgun_6pp performance: 2928KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/63
Railgun_Quadruplet_7 performance: 3207KB/clock
Input Pattern(up to 19+2000 chars): I notice what matters and
I got nothing to lose but darkness and shadows
Railgun_6pp_hits/Railgun_6pp_clocks: 0/66
Railgun_6pp performance: 3061KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/63
Railgun_Quadruplet_7 performance: 3207KB/clock
Input Pattern(up to 19+2000 chars): Beth Ditto
Railgun_6pp_hits/Railgun_6pp_clocks: 0/118
Railgun_6pp performance: 1712KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/107
Railgun_Quadruplet_7 performance: 1888KB/clock
Input Pattern(up to 19+2000 chars): SR71
Railgun_6pp_hits/Railgun_6pp_clocks: 0/175
Railgun_6pp performance: 1154KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/182
Railgun_Quadruplet_7 performance: 1110KB/clock
Input Pattern(up to 19+2000 chars): Apache
Railgun_6pp_hits/Railgun_6pp_clocks: 1/162
Railgun_6pp performance: 1247KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 1/133
Railgun_Quadruplet_7 performance: 1519KB/clock
Input Pattern(up to 19+2000 chars): Fibonacci
Railgun_6pp_hits/Railgun_6pp_clocks: 0/106
Railgun_6pp performance: 1906KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/98
Railgun_Quadruplet_7 performance: 2061KB/clock
Input Pattern(up to 19+2000 chars): Tesla
Railgun_6pp_hits/Railgun_6pp_clocks: 0/175
Railgun_6pp performance: 1154KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/160
Railgun_Quadruplet_7 performance: 1262KB/clock
Input Pattern(up to 19+2000 chars): Albert
Railgun_6pp_hits/Railgun_6pp_clocks: 932/154
Railgun_6pp performance: 1312KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 932/137
Railgun_Quadruplet_7 performance: 1474KB/clock
Input Pattern(up to 19+2000 chars): Toshiba
Railgun_6pp_hits/Railgun_6pp_clocks: 0/130
Railgun_6pp performance: 1554KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/121
Railgun_Quadruplet_7 performance: 1669KB/clock
Input Pattern(up to 19+2000 chars): Orion
Railgun_6pp_hits/Railgun_6pp_clocks: 1/175
Railgun_6pp performance: 1154KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 1/160
Railgun_Quadruplet_7 performance: 1262KB/clock
Input Pattern(up to 19+2000 chars): pharaoh
Railgun_6pp_hits/Railgun_6pp_clocks: 26/129
Railgun_6pp performance: 1566KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 26/122
Railgun_Quadruplet_7 performance: 1656KB/clock
Input Pattern(up to 19+2000 chars): lemon
Railgun_6pp_hits/Railgun_6pp_clocks: 33/178
Railgun_6pp performance: 1135KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 33/161
Railgun_Quadruplet_7 performance: 1255KB/clock
Input Pattern(up to 19+2000 chars): grammar
Railgun_6pp_hits/Railgun_6pp_clocks: 218/123
Railgun_6pp performance: 1642KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 218/117
Railgun_Quadruplet_7 performance: 1727KB/clock
D:\_KAZE_new-stuff>
Grmbl, pattern 'SR71' compared to 'fast' gives me a hint how things are far from 'fastest'.
Points of Interest
The interesting part about tests I have done: 'memcmp
' and 'memcpy
' when inlined (properly) give much better results. The thing which pop-ups again-and-again: Never take anything for granted. A bug (which doesn't affect the previous results, correct) was crushed, now r.6+++ is ready to go/hit.
As you can see from the C (commentless and tabulated) and ASM sources (see the top of the page for downloading):
The Microsoft 16.00.30319.01 compiled Railgun_Quadruplet_6ppp is 00c6b-00a00+5 = 624 bytes long.
The Quadruplet (where cbTarget<961) fragment loop is 00bc3-00b17+5 = 177 bytes long.
The Boyer-Moore-Horspool (where cbTarget>=961) fragment loop is 00c61-00c20+2 = 67 bytes long.
An interesting question for more experienced coders: In case of reducing the BMH loop down by 3 bytes (i.e. to 64 bytes), can we obtain some additional/significant boost?
History
- Revision 6+++ written 2011-Sep-07
- Revision 7 written 2011-Sep-25
- Revision 7 'Sun' written 2012-Jan-22
- Revision 7 'Elsiane' written 2012-Jan-24
- Revision 7 'Gulliver' written 2012-Jan-26
- Revision 7 'Hasherezade' written 2012-Feb-01
- Revision 7 'Decumanus' written 2014-Jan-12