Click here to Skip to main content
65,938 articles
CodeProject is changing. Read more.
Articles / programming / optimization

Fastest strstr-like function in C!?

4.81/5 (39 votes)
22 Aug 2016CPOL38 min read 292.5K   3.5K  
Tuned function for searching a needle in a haystack

An add-on from 2016-Aug-22:

The fastest memmem(), known to me, is already the tuned Railgun_Swampshine_BailOut - avoiding second pattern comparison in BMH2 and pseudo-BMH4. It is called 'Trolldom', to signify its secondary purpose - to traverse fast via Trolllandia a.k.a. ugly/deformed haystacks causing quadratic behavior of most Boyer-Moore-Horspool variants. GNU/GLIBC memmem() is linearIC and sublinearIC at times while Railgun_Trolldom is mostly sublinearIC.
Image 1

Railgun_Trolldom is slightly faster (both with Intel & GCC) than Railgun_Swampshine, this is mostly due to adding a bitwise BMH order 2 (8KB table overhead instead of 64KB) path - for haystacks < 77777 bytes long - the warm-up is faster.
So the result on Core 2 Q9550s @2.83GHz DDR2 @666MHz / i5-2430M @3.00GHz DDR3 @666MHz:

C++
--------------------------------------------------------------------------------------------------------------------------------
| Searcher                                  | GNU/GLIBC memmem()        | Railgun_Swampshine       | Railgun_Trolldom          | 
|--------------------------------------------------------------------------------------------------|---------------------------|
| Testfile\Compiler                         | Intel 15.0 | GCC 5.10     | Intel 15.0 | GCC 5.10    | Intel 15.0  | GCC 5.10    |
|------------------------------------------------------------------------------------------------------------------------------|
| Size: 27,703 bytes                        |     4506/- |   5330/14725 |    13198/- | 11581/15171 | 19105/22449 | 15493/21642 |
| Name: An_Interview_with_Carlos_Castaneda.TXT                          |            |             |             |             |
| LATENCY-WISE: Number of 'memmem()' Invocations: 308,062               |            |             |             |             |
| THROUGHPUT-WISE: Number of Total bytes Traversed: 3,242,492,648       |            |             |             |             |
|------------------------------------------------------------------------------------------------------------------------------|
| Size: 2,347,772 bytes                     |      190/- |      226/244 |     1654/- |   1729/1806 |   1794/1822 |   1743/1809 |
| Name: Gutenberg_EBook_Don_Quixote_996_(ANSI).txt                      |            |             |             |             |
| LATENCY-WISE: Number of 'memmem()' Invocations: 14,316,954            |            |             |             |             |
| THROUGHPUT-WISE: Number of Total bytes Traversed: 6,663,594,719,173   |            |             |             |             |
|------------------------------------------------------------------------------------------------------------------------------|
| Size: 899,425 bytes                       |      582/- |      760/816 |     3094/- |   2898/3088 |   3255/3289 |   2915/3322 |
| Name: Gutenberg_EBook_Dokoe_by_Hakucho_Masamune_(Japanese_UTF8).txt   |            |             |             |             |
| LATENCY-WISE: Number of 'memmem()' Invocations: 3,465,806             |            |             |             |             |
| THROUGHPUT-WISE: Number of Total bytes Traversed: 848,276,034,315     |            |             |             |             |
|------------------------------------------------------------------------------------------------------------------------------|
| Size: 4,487,433 bytes                     |      104/- |      109/116 |      445/- |     458/417 |     450/411 |     467/425 |
| Name: Dragonfly_genome_shotgun_sequence_(ACGT_alphabet).fasta         |            |             |             |             |
| LATENCY-WISE: Number of 'memmem()' Invocations: 20,540,375            |            |             |             |             |
| THROUGHPUT-WISE: Number of Total bytes Traversed: 13,592,530,857,131  |            |             |             |             |
|------------------------------------------------------------------------------------------------------------------------------|
| Size: 954,035 bytes                       |       99/- |      144/144 |      629/- |     580/682 |     634/807 |     585/725 |
| Name: LAOTZU_Wu_Wei_(BINARY).pdf                                      |            |             |             |             |
| LATENCY-WISE: Number of 'memmem()' Invocations: 27,594,933            |            |             |             |             |
| THROUGHPUT-WISE: Number of Total bytes Traversed: 8,702,455,122,519   |            |             |             |             |
|------------------------------------------------------------------------------------------------------------------------------|
| Size: 15,583,440 bytes                    |        -/- |          -/- |        -/- |     663/771 |     675/778 |     663/757 |
| Name: Arabian_Nights_complete.html                                    |            |             |             |             |
| LATENCY-WISE: Number of 'memmem()' Invocations: 72,313,262            |            |             |             |             |
| THROUGHPUT-WISE: Number of Total bytes Traversed: 105,631,163,854,099 |            |             |             |             |
--------------------------------------------------------------------------------------------------------------------------------

The benchmark is downloadable at my Internet drive and my site:
https://1drv.ms/u/s!AmWWFXGMzDmEgl3izdwhSnIq0Lv5 www.sanmayce.com/Downloads/Nakamichi_Kintaro++_source_executables_64bit_(GCC510-vs-Intel150)_(TW-vs-RG)_BENCHMARK.zip

Note0: Railgun_Trolldom is slightly faster (both with Intel & GCC) than Railgun_Swampshine, this is mostly due to adding a bitwise BMH order 2 (8KB table overhead instead of 64KB) path - for haystacks < 77777 bytes long - the warm-up is faster.

Note1: The numbers represent the rate (bytes/s) at which patterns/needles 4,5,6,8,9,10,12,13,14,16,17,18,24 bytes long are memmemed into 4KB, 256KB, 1MB, 256MB long haystacks.
in fact, these numbers are the compression speed using LZSS and memmem() as matchfinder.

Note2: The Arabian Nights is downloadable at:
https://ebooks.adelaide.edu.au/b/burton/richard/b97b/complete.html

Note3: On i5-2430M, TW is catching up since this CPU crunches instructions faster while the RAM speed is almost the same, Railgun suffers from the slow RAM fetches - the prefetcher and such suck.

Note4: With a simple English text 'Tales of 1001 Nights', 15,583,440 bytes long, the cumulative size of traversed haystack data is nearly 100TB, 105,631,163,854,099 ~ 1024^4 = 1,099,511,627,776.

Note5: With a simple French text 'Agatha_Christie_85-ebooks_(French)_TXT.tar', 32,007,168 bytes long, the cumulative size of traversed haystack data is nearly 200TB ~ 234,427,099,834,376.

Since it is pure C, no intrinsics used, perhaps it will work unproblematically on most *nix systems.

To the attached archive I included the benchmark C source along with its listing in 28 pages in PDF format - to be printed and digested more easily from search techniques aficionados. ENFUN!

Here I digress because want to share my disappontment from not seeing Hamid's Turbo Family of superfast etudes utilized/appreciated. Next dumps reveal what the attached Benchmark (using Railgun_Trolldom, it parses 200+TB in this torture) does and how it compares to many optimized compressors of today.

The textual decompression picture given by:
- TurboBench Copyright (c) 2013-2016 Powturbo Feb 21 2016;
- zstd command line interface 64-bits v0.8.1, by Yann Collet;
- Nakamichi 'Kintaro++' written by Kaze.

This very benchmark (with it you can reproduce the dumps below) is at:
https://1drv.ms/u/s!AmWWFXGMzDmEgl6-_1mrjcTj07B1

Once the tarred Agatha's 85 texts into French are compressed by the benchmark let's see how the decompression rate fares against many "usual suspects" i.e. compressors, on i5-2430M @3.00GHz DDR3 @666MHz:

C++
D:\Nakamichi_Kintaro++_source_executables_64bit_(GCC510-vs-Intel150)_(TW-vs-RG)_BENCHMARK>Nakamichi_Kintaro++_Intel_15.0_64bit.exe Agatha_Christie_85-ebooks_(French)_TXT.tar
Nakamichi 'Kintaro++', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Note1: This compile can handle files up to 1711MB.
Note2: The matchfinder/memmem() is Railgun_Trolldom.
Current priority class is HIGH_PRIORITY_CLASS.
Compressing 32007168 bytes ...
|; Each rotation means 64KB are encoded; Done 100%; Compression Ratio: 3.53:1
NumberOfFullLiterals (lower-the-better): 164
NumberOfFlushLiteralsHeuristic (bigger-the-better): 184323
Legend: WindowSizes: 1/2/3/4=Tiny/Short/Medium/Long
NumberOf(Tiny)Matches[Short]Window (4)[2]: 226869
NumberOf(Short)Matches[Short]Window (8)[2]: 119810
NumberOf(Medium)Matches[Short]Window (12)[2]: 71202
NumberOf(Long)Matches[Short]Window (16)[2]: 31955
NumberOf(MaxLong)Matches[Short]Window (24)[2]: 7078
NumberOf(Tiny)Matches[Medium]Window (5)[3]: 257313
NumberOf(Short)Matches[Medium]Window (9)[3]: 526493
NumberOf(Medium)Matches[Medium]Window (13)[3]: 285579
NumberOf(Long)Matches[Medium]Window (17)[3]: 158873
NumberOf(MaxLong)Matches[Medium]Window (24)[3]: 51276
NumberOf(Tiny)Matches[Long]Window (6)[4]: 41075
NumberOf(Short)Matches[Long]Window (10)[4]: 240454
NumberOf(Medium)Matches[Long]Window (14)[4]: 258653
NumberOf(Long)Matches[Long]Window (18)[4]: 209007
NumberOf(MaxLong)Matches[Long]Window (24)[4]: 190929
RAM-to-RAM performance: 605 bytes/s.
Compressed to 9076876 bytes.
LATENCY-WISE: Number of 'memmem()' Invocations: 102,091,852
THROUGHPUT-WISE: Number of Total bytes Traversed: 234,427,099,834,376

D:\Benchmark_Goldenboy_vs_LzTurbo_vs_Zstd>dir

08/22/2016  07:07 AM         9,076,876 Agatha_Christie_85-ebooks_(French)_TXT.tar.Nakamichi
08/22/2016  07:44 AM           124,928 Nakamichi_Kintaro++_Intel_15.0_64bit.exe
08/16/2016  01:55 PM         9,191,476 turbobenchs.exe
08/18/2016  07:24 AM           841,297 zstd-windows-v0.8.1_win64.exe

D:\Benchmark_Goldenboy_vs_LzTurbo_vs_Zstd>RUNME.BAT

D:\Benchmark_Goldenboy_vs_LzTurbo_vs_Zstd>Nakamichi_Kintaro++_Intel_15.0_64bit.exe Agatha_Christie_85-ebooks_(French)_TXT.tar.Nakamichi /report
Nakamichi 'Kintaro++', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Note1: This compile can handle files up to 1711MB.
Note2: The matchfinder/memmem() is Railgun_Trolldom.
Current priority class is HIGH_PRIORITY_CLASS.
Decompressing 9076876 bytes ...
RAM-to-RAM performance: 512 MB/s.
Compression Ratio (bigger-the-better): 3.53:1

D:\Benchmark_Goldenboy_vs_LzTurbo_vs_Zstd>zstd-windows-v0.8.1_win64.exe -b12 "Agatha_Christie_85-ebooks_(French)_TXT.tar"
12#_(French)_TXT.tar :  32007168 ->   8965791 (3.570),   4.0 MB/s , 409.2 MB/s

D:\Benchmark_Goldenboy_vs_LzTurbo_vs_Zstd>zstd-windows-v0.8.1_win64.exe -b22 "Agatha_Christie_85-ebooks_(French)_TXT.tar"
22#_(French)_TXT.tar :  32007168 ->   6802321 (4.705),   1.0 MB/s , 267.4 MB/s

D:\Benchmark_Goldenboy_vs_LzTurbo_vs_Zstd>turbobench_singlefile_noNakamichi_2G.bat "Agatha_Christie_85-ebooks_(French)_TXT.tar"

D:\Benchmark_Goldenboy_vs_LzTurbo_vs_Zstd>turbobenchs.exe "Agatha_Christie_85-ebooks_(French)_TXT.tar" -ezlib,9/fastlz,2/lzturbo,19,29,39/bzip2/chameleon,2/snappy_c/zstd,12,21/density,3/lz4,16/lz5,15/lzham,4/brieflz/brotli,1
1/crush,2/lzma,9/zpaq,2/lzf/yappy/trle/memcpy/lzsse2,1,16 -g -B2G
Benchmark: 1 from 3
     6739693    21.1       0.89      74.96   lzma 9              Agatha_Christie_85-ebooks_(French)_TXT.tar
     6758120    21.1       0.75     159.97   lzham 4             Agatha_Christie_85-ebooks_(French)_TXT.tar
     6801784    21.3       0.88     380.07   lzturbo 39          Agatha_Christie_85-ebooks_(French)_TXT.tar
     6834372    21.4       1.15     280.52   zstd 21             Agatha_Christie_85-ebooks_(French)_TXT.tar
     6920292    21.6       0.34     187.25   brotli 11           Agatha_Christie_85-ebooks_(French)_TXT.tar
     7680438    24.0       8.73      19.14   bzip2               Agatha_Christie_85-ebooks_(French)_TXT.tar
     8897625    27.8       3.15      56.09   zpaq 2              Agatha_Christie_85-ebooks_(French)_TXT.tar
     8964510    28.0       4.17     397.51   zstd 12             Agatha_Christie_85-ebooks_(French)_TXT.tar
     8990840    28.1       0.89     590.17   lzturbo 29          Agatha_Christie_85-ebooks_(French)_TXT.tar
     9076876       ?          ?     512.00   Nakamichi Kintaro++ outside TurboBench
     9147309    28.6       0.13     223.49   crush 2             Agatha_Christie_85-ebooks_(French)_TXT.tar
    10104473    31.6       1.25     351.10   lz5 15              Agatha_Christie_85-ebooks_(French)_TXT.tar
    10691287    33.4       4.92    2013.23   lzsse2 16           Agatha_Christie_85-ebooks_(French)_TXT.tar
    10973383    34.3      10.13     214.94   zlib 9              Agatha_Christie_85-ebooks_(French)_TXT.tar
    12107431    37.8       0.94    2136.70   lzturbo 19          Agatha_Christie_85-ebooks_(French)_TXT.tar
    12266943    38.3      15.08    1615.99   lz4 16              Agatha_Christie_85-ebooks_(French)_TXT.tar
    13238993    41.4      75.25     149.55   brieflz             Agatha_Christie_85-ebooks_(French)_TXT.tar
    14364428    44.9     255.83     234.91   density 3           Agatha_Christie_85-ebooks_(French)_TXT.tar
    15448532    48.3      12.27    1503.99   lzsse2 1            Agatha_Christie_85-ebooks_(French)_TXT.tar
    17262978    53.9      56.14    1158.91   yappy               Agatha_Christie_85-ebooks_(French)_TXT.tar
    17678955    55.2     172.49     344.66   lzf                 Agatha_Christie_85-ebooks_(French)_TXT.tar
    17868118    55.8     177.74     303.39   fastlz 2            Agatha_Christie_85-ebooks_(French)_TXT.tar
    17915390    56.0     926.24    1851.55   chameleon 2         Agatha_Christie_85-ebooks_(French)_TXT.tar
    18694206    58.4     228.88     622.50   snappy_c            Agatha_Christie_85-ebooks_(French)_TXT.tar
    31952980    99.8     124.68    1230.32   trle                Agatha_Christie_85-ebooks_(French)_TXT.tar
    32007172   100.0    6096.18    5338.21   memcpy              Agatha_Christie_85-ebooks_(French)_TXT.tar

D:\Benchmark_Goldenboy_vs_LzTurbo_vs_Zstd>turbobenchs.exe -l2
...
Brotli v16-02                   Apache license  https://github.com/google/brotli
Bzip2 v1.06                     BSD like        http://www.bzip.org/downloads.html      https://github.com/asimonov-im/bzip2
Chameleon v15-03                Public Domain   http://cbloomrants.blogspot.de/2015/03/03-25-15-my-chameleon.html
Crush v1.0.0                    Public Domain   http://sourceforge.net/projects/crush
Density v0.13.0                 BSD license     https://github.com/centaurean/density
FastLz v0.1.0                   BSD like        http://fastlz.org       https://github.com/ariya/FastLZ
Lz4 v1.7.1                      BSD license     https://github.com/Cyan4973/lz4
Lz5 v1.4.1                      BSD license     https://github.com/inikep/lz5
Lzham v1.1                      MIT license     https://github.com/richgel999/lzham_codec_devel
Lzma v9.35                      Public Domain   http://7-zip.org        https://github.com/jljusten/LZMA-SDK
lzsse v16-02                    BSD license     https://github.com/ConorStokes/LZSSE
Snappy-c v1.1.2                 BSD Like        https://github.com/andikleen/snappy-c
Yappy v2011
zlib v1.2.8                     zlib license    http://zlib.net https://github.com/madler/zlib
ZSTD v0.5.1                     BSD license     https://github.com/Cyan4973/zstd
TurboRLE ESC v16-01                             https://sites.google.com/site/powturbo
TurboRLE v16-01                                 https://sites.google.com/site/powturbo
LzTurbo v1.3                                    https://sites.google.com/site/powturbo
...

D:\Benchmark_Goldenboy_vs_LzTurbo_vs_Zstd>

That's it, those 512MB/s are the result of the quantative approach in choosing literal lengths. Now, in order we to appreciate one superb (in fact FASTEST) performer, lzturbo 39 with its amazing 380.07MB/s has to be compared with latest Yann's Zstd and with supersimplistic Nakamichi:

Agatha_Christie_85-ebooks_(French)_TXT.tar when recreated:

6801784 -> 32,007,168 380.07 MB/s lzturbo 39
8965791 -> 32,007,168 409.20 MB/s zstd-windows-v0.8.1_win64.exe -b12
8990840 -> 32,007,168 590.17 MB/s lzturbo 29
9076876 -> 32,007,168 512.00 MB/s Nakamichi Kintaro++

Such, as LzTurbo's 29/39, modes are truly amazing such performance inspires one to look beyond current bloated software and rethink why superfast implementations are not mainstreamish, yet.

An add-on from 2016-Aug-13:

Two things.

First, the fix from the last time was buggy, my apologies, now fixed, quite embarrassing since it is a simple left/right boundary check. It doesn't affect the speed, it appears as rare pattern hit misses.
Since I don't believe in saying "sorry" but in making things right, here my attempt to further disgrace my amateurish work follows:
Two years ago, I didn't pay due attention to adding 'Swampwalker' heuristic to the Railgun_Ennearch, I mean, only quick test was done and no real proofing - this was due not to a blunder of mine, nor carelessness, but overconfidence in my ability to write "on the fly". Stupid, indeed, however, when a coder gets momentum in writing simple etudes he starts gaining false confidence of mastering the subject, not good for sure!
Hopefully, other coders will learn to avoid such full of neglect style.

For manual fix into the C code, change this line:

C++
if ( ((signed int)(i-(PRIMALposition-1)+(count-1)) >= 0) && (&pbTarget[i-(PRIMALposition-1)] <= pbTargetMax - 4) ) {

with this one:

C++
if ( ((signed int)(i-(PRIMALposition-1)) >= 0) && (&pbTarget[i-(PRIMALposition-1)]+((PRIMALlengthCANDIDATE-4+1)-1) <= pbTargetMax - 4) ) {


Second, wanted to present the heaviest testbed for search i.e. memmem() functions: it benefits the benchmarking (speed in real application) as well as bug-control.

The benchmark is downloadable at my INTERNET drive:
https://1drv.ms/u/s!AmWWFXGMzDmEglwjlUtnMJrfhosK

It also is zipped (the 4 testfiles are not) and attached at the top of this article.

The speed showdown has three facets:
- compares the 64bit code generated from GCC 5.10 versus Intel 15.0 compilers;
- compares two datasets - search speed through English texts versus Genome ACGT-type data;
- compares the tweaked Two-Way algorithm (implemented by Eric Blake) and adopted by GLIBC as memmem() versus my Railgun_Swampshine.

Note1: The GLIBC memmem() was taken from latest (2016-08-05) glibc 2.24 tar:
https://www.gnu.org/software/libc/
Note2: Eric Blake says that he enhanced the linearity of Two-Way by adding some sublinear paths, well, Railgun is all about sublinearity, so feel free to experiment with your own testfiles (worst-case-scenarios), just make such a file feed the compressor with it, then we will see how the LINEAR Two-Way behaves versus Railgun_Swampshine.
Note3: Just copy-and-paste 'Railgun_Swampshine' or 'Railgun_Ennearch' from the benchmark's source.

So the result on Core 2 Q9550s @2.83GHz:

C++
---------------------------------------------------------------------------------------------
| testfile\Searcher                         | GNU/GLIBC memmem()    | Railgun_Swampshine    | 
|-------------------------------------------------------------------------------------------|
| Compiler                                  | Intel 15.0 | GCC 5.10 | Intel 15.0 | GCC 5.10 |
|-------------------------------------------------------------------------------------------|
| The_Project_Gutenberg_EBook_of_Don        |        190 |      226 |       1654 |     1729 |
| _Quixote_996_(ANSI).txt                   |            |          |            |          |
| 2,347,772 bytes                           |            |          |            |          |
|-------------------------------------------------------------------------------------------|
| The_Project_Gutenberg_EBook_of_Dokoe      |        582 |      760 |       3094 |     2898 |
| _by_Hakucho_Masamune_(Japanese_UTF-8).txt |            |          |            |          |
| 899,425 bytes                             |            |          |            |          |
|-------------------------------------------------------------------------------------------|
| Dragonfly_genome_shotgun_sequence         |        104 |      109 |        445 |      458 |
| _(ACGT_alphabet).fasta                    |            |          |            |          |
| 4,487,433 bytes                           |            |          |            |          |
|-------------------------------------------------------------------------------------------|
| LAOTZU_Wu_Wei_(BINARY).pdf                |         99 |      144 |        629 |      580 |
| 954,035 bytes                             |            |          |            |          |
|-------------------------------------------------------------------------------------------|

Note: The numbers represent the rate (bytes/s) at which patterns/needles 4,5,6,8,9,10,12,13,14,16,17,18,24 bytes long are memmemed into 4KB, 256KB, 1MB, 256MB long haystacks.
in fact, these numbers are the compression speed using LZSS and memmem() as matchfinder.

Two-Way is significantly slower than BMH Order 2, the speed-down is in range:
- for TEXTUAL ANSI alphabets: 1729/226= 7.6x
- for TEXTUAL UTF8 alphabets: 2898/760= 3.8x
- for TEXTUAL ACGT alphabets: 458/109= 4.2x
- for BINARY-esque alphabets: 580/144= 4.0x

For faster RAM, than mine @666MHz, and for haystacks multimegabytes long, the speedup goes beyond 8x.

The benchmark shows the real behavior (both latency and raw speed) of the memmem variants, I added also the Thierry Lecroq's Two-Way implementation:
http://www-igm.univ-mlv.fr/~lecroq/string/node26.html#SECTION00260
However, Eric Blake's one is faster, so it was chosen for the speed showdown.

Once I measured the total length of traversed haystacks, and for files 100+MB long, it went ... quintillion of bytes i.e. petabytes - good torture it is.

Enfun!

Introduction

 

The C strstr-like function presented below (written by Georgi 'Kaze') is named Railgun_Quadruplet_6ppp and is based on well-known Boyer-Moore-Horspool and simplified Karp-Rabin algorithms. The main goal: to achieve fastest execution for all kinds of pattern/needle and string/haystack lengths (starting from 2 bytes).

Railgun_Quadruplet_6ppp is now obsolete due to the arrival of Railgun_Quadruplet_7 which boosts the BMH fragment by more than a percent. See further below for its commentless and tabulated source.

Railgun_Quadruplet_7 exploits the CPUs fast unaligned DWORD extract capability.
The heaviest and most precise test on 1+TB text (52 patterns vs 197MB English books text repeated 256 times results in 1000+ seconds, see strstr_SHORT-SHOWDOWN_r7_Heavy_Test_LOG.txt file for more information) gives:

[strstr_SHORT-SHOWDOWN_Microsoft_v16_Ox.exe:]
The variants returning the offset of first hit only:

  • Railgun_Quadruplet_6pp 52, i.e., average performance: 1584KB/clock
  • Railgun_Quadruplet_7 52, i.e., average performance: 1659KB/clock

The variants counting all hits:

  • Railgun_6pp 8x2, i.e., average performance: 1246KB/clock
  • Railgun_Quadruplet_7 8x2, i.e., average performance: 1266KB/clock

[strstr_SHORT-SHOWDOWN_Intel_v12_O3.exe:]
The variants returning the offset of first hit only:

  • Railgun_Quadruplet_6pp 52, i.e., average performance: 1773KB/clock
  • Railgun_Quadruplet_7 52, i.e., average performance: 1685KB/clock

The variants counting all hits:

  • Railgun_6pp 8x2, i.e., average performance: 1054KB/clock
  • Railgun_Quadruplet_7 8x2, i.e., average performance: 1076KB/clock

Note: For Railgun_Quadruplet_7 on CPUs with fast DWORD (compared to BYTE) memory fetching I expect significant boost, AFAIK the Intel i7 is one of those, unfortunately I lack the opportunity to play with it.

Or the two bottom-lines:
The FASTEST variant counting all hits:

  • Railgun_Quadruplet_7 8x2 with average performance: 1266KB/clock (best performance achieved with Microsoft_v16_Ox)!

The FASTEST variant returning the offset of first hit only:

  • Railgun_Quadruplet_6pp 52 with average performance: 1773KB/clock (1,943,883,635,456 bytes / 1,070,220 clocks = 1732 MB/s best performance achieved with Intel_v12_O3)!

An add-on from 2014-Apr-27:

My apologies.

The 'Swampwalker' heuristic has a nasty SIGNED/UNSIGNED bug which I illustrated several months ago in my fuzzy search article now is fixed here too:

The bug is this (the variables 'i' and 'PRIMALposition' are uint32_t):
Next line assumes -19 >= 0 is true:
if ( (i-(PRIMALposition-1)) >= 0) printf ("THE NASTY-CASTY BUG: %d >= 0\n", i-(PRIMALposition-1));
Next line assumes -19 >= 0 is false:
if ( (signed int)(i-(PRIMALposition-1)) >= 0) printf ("THE NASTY-CASTY BUG: %d >= 0\n", i-(PRIMALposition-1));

And the actual fix:

C++
...
if ( count <= 0 ) {
// I have to add out-of-range checks...
// i-(PRIMALposition-1) >= 0
// &pbTarget[i-(PRIMALposition-1)] <= pbTargetMax - 4
// i-(PRIMALposition-1)+(count-1) >= 0
// &pbTarget[i-(PRIMALposition-1)+(count-1)] <= pbTargetMax - 4
// FIX from 2014-Apr-27:
// Because (count-1) is negative, above fours are reduced to next twos:
// i-(PRIMALposition-1)+(count-1) >= 0
// &pbTarget[i-(PRIMALposition-1)] <= pbTargetMax - 4
    // The line below is BUGGY:
    //if ( (i-(PRIMALposition-1) >= 0) && (&pbTarget[i-(PRIMALposition-1)] <= pbTargetMax - 4) && (&pbTarget[i-(PRIMALposition-1)+(count-1)] <= pbTargetMax - 4) ) {
    // The line below is OKAY:
    if ( ((signed int)(i-(PRIMALposition-1)+(count-1)) >= 0) && (&pbTarget[i-(PRIMALposition-1)] <= pbTargetMax - 4) ) {
...

I checked the code one more time, it feels alright, love it.
The buggy 'Swampshine' function (in download section) has been replaced with fixed one.

An add-on from 2014-Feb-02:

Probably I am done with updates, here comes the latest one.

Railgun_Quadruplet/Railgun_Swampshine_.png

The benchmark with its source at: http://www.sekireigan.com/_KAZE_32bit_memmem-like_showdown_Swampshine_r2.7z.

And the beautiful main loop:

C++
; The main loop of pseudo Boyer-Moore-Horspool order 2 in C:

; while (i <= cbTarget-cbPattern) {
;     Gulliver = 1;
;     if ( bm_Horspool_Order2[*(unsigned short *)&pbTarget[i+cbPattern-1-1]] != 0 ) {
;         if ( bm_Horspool_Order2[*(unsigned short *)&pbTarget[i+cbPattern-1-1-2]] == 0 ) Gulliver = cbPattern-(2-1)-2; else {
;             if ( *(uint32_t *)&pbTarget[i] == ulHashPattern) {
;                 count = cbPattern-4+1; 
;                 while ( count > 0 && *(uint32_t *)(pbPattern+count-1) == *(uint32_t *)(&pbTarget[i]+(count-1)) )
;                     count = count-4;
;                 if ( count <= 0 ) return(pbTarget+i);
;             }
;         }
;     } else Gulliver = cbPattern-(2-1);
;     i = i + Gulliver;
; }

; The main loop of pseudo Boyer-Moore-Horspool order 2 in Assembly by Intel C++ Compiler XE for applications running on IA-32, Version 12.1.1.258, -O3 used:

.B2.28:                         
  0019f 0f b7 6c 02 fe   movzx ebp, WORD PTR [-2+edx+eax]       
  001a4 0f b6 34 2c      movzx esi, BYTE PTR [esp+ebp]          
  001a8 85 f6            test esi, esi                          
  001aa 0f 84 bd 00 00 
        00               je .B2.40 
.B2.29:                         
  001b0 0f b7 6c 02 fc   movzx ebp, WORD PTR [-4+edx+eax]       
  001b5 0f b6 34 2c      movzx esi, BYTE PTR [esp+ebp]          
  001b9 85 f6            test esi, esi                          
  001bb 75 0c            jne .B2.31 
.B2.30:                         
  001bd 8b ac 24 18 00 
        01 00            mov ebp, DWORD PTR [65560+esp]         
  001c4 e9 a6 00 00 00   jmp .B2.41 
.B2.31:                         
  001c9 8b b4 24 70 00 
        01 00            mov esi, DWORD PTR [65648+esp]         
  001d0 8b ac 24 14 00 
        01 00            mov ebp, DWORD PTR [65556+esp]         
  001d7 3b 2c 06         cmp ebp, DWORD PTR [esi+eax]           
  001da 0f 85 86 00 00 
        00               jne .B2.38 
.B2.32:                         
  001e0 8b b4 24 18 00 
        01 00            mov esi, DWORD PTR [65560+esp]         
  001e7 85 f6            test esi, esi                          
  001e9 0f 8e 42 06 00 
        00               jle .B2.110 
.B2.33:                         
  001ef 33 ff            xor edi, edi                           
  001f1 8d 2c 10         lea ebp, DWORD PTR [eax+edx]           
  001f4 89 8c 24 00 00 
        01 00            mov DWORD PTR [65536+esp], ecx         
  001fb 89 94 24 04 00 
        01 00            mov DWORD PTR [65540+esp], edx         
  00202 89 9c 24 0c 00 
        01 00            mov DWORD PTR [65548+esp], ebx         
  00209 89 84 24 08 00 
        01 00            mov DWORD PTR [65544+esp], eax         
  00210 8b d7            mov edx, edi                           
  00212 8b 8c 24 10 00 
        01 00            mov ecx, DWORD PTR [65552+esp]         
  00219 8b 9c 24 7c 00 
        01 00            mov ebx, DWORD PTR [65660+esp]         
.B2.34:                         
  00220 8b 44 0a fc      mov eax, DWORD PTR [-4+edx+ecx]        
  00224 3b 44 2a fc      cmp eax, DWORD PTR [-4+edx+ebp]        
  00228 75 11            jne .B2.36 
.B2.35:                         
  0022a 47               inc edi                                
  0022b 8d 74 1a f9      lea esi, DWORD PTR [-7+edx+ebx]        
  0022f 83 c2 fc         add edx, -4                            
  00232 3b bc 24 1c 00 
        01 00            cmp edi, DWORD PTR [65564+esp]         
  00239 72 e5            jb .B2.34 
.B2.36:                         
  0023b 89 8c 24 10 00 
        01 00            mov DWORD PTR [65552+esp], ecx         
  00242 8b 8c 24 00 00 
        01 00            mov ecx, DWORD PTR [65536+esp]         
  00249 8b 94 24 04 00 
        01 00            mov edx, DWORD PTR [65540+esp]         
  00250 8b 84 24 08 00 
        01 00            mov eax, DWORD PTR [65544+esp]         
  00257 8b 9c 24 0c 00 
        01 00            mov ebx, DWORD PTR [65548+esp]         
.B2.37:                         
  0025e 85 f6            test esi, esi                          
  00260 0f 8e cb 05 00 
        00               jle .B2.110 
.B2.38:                         
  00266 bd 01 00 00 00   mov ebp, 1                             
  0026b eb 02            jmp .B2.41 
.B2.40:                         
  0026d 8b eb            mov ebp, ebx                           
.B2.41:                         
  0026f 03 c5            add eax, ebp                           
  00271 3b c1            cmp eax, ecx                           
  00273 0f 86 26 ff ff 
        ff               jbe .B2.28 
.B2.42:                         

; The main loop of pseudo Boyer-Moore-Horspool order 2: 00273-0019f+6 = 218 bytes

An add-on from 2014-Jan-29:

Many thanks go to Harold, the fellow CP member who helped me to see how i7 4770K executes Railgun.

At last Railgun has become troll-proof.
More tweaks are needed, but let's not kill the joy and let's see what 'Swampshine' revision 1 holds.

Results from three type of tortures are given below:
1] 'Wikipedia' torture, XML data
2] 'GENOME' torture, 'ACGT' data
3] 'Worst-Case-Scenarios' torture, repetitive data

My verdict: Railgun_Swampshine is the weapon of choice for all kind of patterns up to ~100bytes, it won't fail you no matter how muddy the battlefield is.

Railgun_Quadruplet/Railgun_Swampshine_Full-Fledged_MEMMEM_WIKI_ragged.png

Railgun_Quadruplet/Railgun_Swampshine_Full-Fledged_MEMMEM_ACGT_ragged.png

Not happy with current results, some calibrations remain, some drags in first revision 'Swampshine':

1] Order 4 (instead of 2) of 'Swampwalker' is used.
2] Unoptimized case when needle is PREFIXED, (decent performance when PREFIXED&POSTFIXED and when POSTFIXED).
3] Code Size is not pretty: 00a8c-00000+4 = 2704 bytes
4] Much steeper speed degradation than expected with increasing the needle length.

Doing Search for Pattern(65bytes) into String(52428800bytes) as-one-line ...
Railgun_Swampshine precise performance:         18941MB/s; PRIMALposition=1; PRIMALlength=65
Railgun_Ennearch precise performance:           18859MB/s
Railgun_DecumanusBITified precise performance:  19136MB/s
Boyer_Moore-Horspool precise performance:        6959MB/s

Doing Search for Pattern(83bytes) into String(52428800bytes) as-one-line ...
Railgun_Swampshine precise performance:         17734MB/s; PRIMALposition=1; PRIMALlength=68
Railgun_Ennearch precise performance:           18818MB/s
Railgun_DecumanusBITified precise performance:  18941MB/s
Boyer_Moore-Horspool precise performance:        7871MB/s

Doing Search for Pattern(98bytes) into String(52428800bytes) as-one-line ...
Railgun_Swampshine precise performance:         17791MB/s; PRIMALposition=1; PRIMALlength=68
Railgun_Ennearch precise performance:           19654MB/s
Railgun_DecumanusBITified precise performance:  18948MB/s
Boyer_Moore-Horspool precise performance:        7917MB/s

Doing Search for Pattern(113bytes) into String(52428800bytes) as-one-line ...
Railgun_Swampshine precise performance:         17360MB/s; PRIMALposition=1; PRIMALlength=68
Railgun_Ennearch precise performance:           20802MB/s
Railgun_DecumanusBITified precise performance:  19013MB/s
Boyer_Moore-Horspool precise performance:        8147MB/s

Doing Search for Pattern(130bytes) into String(52428800bytes) as-one-line ...
Railgun_Swampshine precise performance:         17486MB/s; PRIMALposition=1; PRIMALlength=68
Railgun_Ennearch precise performance:           22800MB/s
Railgun_DecumanusBITified precise performance:  18631MB/s
Boyer_Moore-Horspool precise performance:        9085MB/s

Doing Search for Pattern(140bytes) into String(52428800bytes) as-one-line ...
Railgun_Swampshine precise performance:         17311MB/s; PRIMALposition=1; PRIMALlength=68
Railgun_Ennearch precise performance:           23872MB/s
Railgun_DecumanusBITified precise performance:  17926MB/s
Boyer_Moore-Horspool precise performance:        9343MB/s

Doing Search for Pattern(151bytes) into String(52428800bytes) as-one-line ...
Railgun_Swampshine precise performance:         17072MB/s; PRIMALposition=79; PRIMALlength=73
Railgun_Ennearch precise performance:           23983MB/s
Railgun_DecumanusBITified precise performance:  17933MB/s
Boyer_Moore-Horspool precise performance:        8591MB/s

Doing Search for Pattern(162bytes) into String(52428800bytes) as-one-line ...
Railgun_Swampshine precise performance:         17725MB/s; PRIMALposition=79; PRIMALlength=84
Railgun_Ennearch precise performance:           26535MB/s
Railgun_DecumanusBITified precise performance:  19854MB/s
Boyer_Moore-Horspool precise performance:        9072MB/s

Doing Search for Pattern(201bytes) into String(52428800bytes) as-one-line ...
Railgun_Swampshine precise performance:         17954MB/s; PRIMALposition=67; PRIMALlength=119
Railgun_Ennearch precise performance:           31708MB/s
Railgun_DecumanusBITified precise performance:  23392MB/s
Boyer_Moore-Horspool precise performance:        6983MB/s

Doing Search for Pattern(230bytes) into String(52428800bytes) as-one-line ...
Railgun_Swampshine precise performance:         15113MB/s; PRIMALposition=165; PRIMALlength=66
Railgun_Ennearch precise performance:           33098MB/s
Railgun_DecumanusBITified precise performance:  25518MB/s
Boyer_Moore-Horspool precise performance:        7599MB/s

Doing Search for Pattern(255bytes) into String(52428800bytes) as-one-line ...
Railgun_Swampshine precise performance:         14711MB/s; PRIMALposition=90; PRIMALlength=65
Railgun_Ennearch precise performance:           39911MB/s
Railgun_DecumanusBITified precise performance:  29435MB/s
Boyer_Moore-Horspool precise performance:        7602MB/s

Doing Search for Pattern(589bytes) into String(52428800bytes) as-one-line ...
Railgun_Swampshine precise performance:          7163MB/s; PRIMALposition=32; PRIMALlength=150
Railgun_Ennearch precise performance:          123574MB/s
Railgun_DecumanusBITified precise performance:  95773MB/s
Boyer_Moore-Horspool precise performance:        8370MB/s

Doing Search for Pattern(1080bytes) into String(52428800bytes) as-one-line ...
Railgun_Swampshine precise performance:          3713MB/s; PRIMALposition=32; PRIMALlength=150
Railgun_Ennearch precise performance:          196321MB/s
Railgun_DecumanusBITified precise performance: 178054MB/s
Boyer_Moore-Horspool precise performance:       11102MB/s

Doing Search for Pattern(1755bytes) into String(52428800bytes) as-one-line ...
Railgun_Swampshine precise performance:          2723MB/s; PRIMALposition=1506; PRIMALlength=125
Railgun_Ennearch precise performance:          287063MB/s
Railgun_DecumanusBITified precise performance: 267550MB/s
Boyer_Moore-Horspool precise performance:       13990MB/s

Doing Search for Pattern(2686bytes) into String(52428800bytes) as-one-line ...
Railgun_Swampshine precise performance:           800MB/s; PRIMALposition=2018; PRIMALlength=266
Railgun_Ennearch precise performance:          227827MB/s
Railgun_DecumanusBITified precise performance: 240171MB/s
Boyer_Moore-Horspool precise performance:       11565MB/s

Doing Search for Pattern(3867bytes) into String(52428800bytes) as-one-line ...
Railgun_Swampshine precise performance:           492MB/s; PRIMALposition=2714; PRIMALlength=216
Railgun_Ennearch precise performance:          267842MB/s
Railgun_DecumanusBITified precise performance: 279599MB/s
Boyer_Moore-Horspool precise performance:       11846MB/s

What can I say? Trolls fought back, I underestimated their filthiness.

Order 4 of 'Swampwalker' follows:

// Swampwalker heuristic order 4 (Needle should be bigger than 4) [
// Needle: 1234567890qwertyuiopasdfghjklzxcv            PRIMALposition=01 PRIMALlength=33  '1234567890qwertyuiopasdfghjklzxcv'
// Needle: vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv            PRIMALposition=29 PRIMALlength=04  'vvvv'
// Needle: vvvvvvvvvvBOOMSHAKALAKAvvvvvvvvvv            PRIMALposition=08 PRIMALlength=20  'vvvBOOMSHAKALAKAvvvv'
// Needle: Trollland                                    PRIMALposition=01 PRIMALlength=09  'Trollland'
// Needle: Swampwalker                                  PRIMALposition=01 PRIMALlength=11  'Swampwalker'
// Needle: licenselessness                              PRIMALposition=01 PRIMALlength=15  'licenselessness'
// Needle: alfalfa                                      PRIMALposition=02 PRIMALlength=06  'lfalfa'
// Needle: Sandokan                                     PRIMALposition=01 PRIMALlength=08  'Sandokan'
// Needle: shazamish                                    PRIMALposition=01 PRIMALlength=09  'shazamish'
// Needle: Simplicius Simplicissimus                    PRIMALposition=06 PRIMALlength=20  'icius Simplicissimus'
// Needle: domilliaquadringenquattuorquinquagintillion  PRIMALposition=01 PRIMALlength=32  'domilliaquadringenquattuorquinqu'
// Needle: boom-boom                                    PRIMALposition=02 PRIMALlength=08  'oom-boom'
// Needle: vvvvv                                        PRIMALposition=01 PRIMALlength=04  'vvvv'
// Needle: 12345                                        PRIMALposition=01 PRIMALlength=05  '12345'
// Needle: likey-likey                                  PRIMALposition=03 PRIMALlength=09  'key-likey'
// Needle: BOOOOOM                                      PRIMALposition=03 PRIMALlength=05  'OOOOM'
// Needle: aaaaaBOOOOOM                                 PRIMALposition=02 PRIMALlength=09  'aaaaBOOOO'
// Needle: BOOOOOMaaaaa                                 PRIMALposition=03 PRIMALlength=09  'OOOOMaaaa'
PRIMALlength=0;
for (i=0+(1); i < cbPattern-((4)-1)+(1)-(1); i++) { // -(1) because the last BB order 4 has no counterpart(s)
    FoundAtPosition = cbPattern - ((4)-1) + 1;
    PRIMALpositionCANDIDATE=i;
    while ( PRIMALpositionCANDIDATE <= (FoundAtPosition-1) ) {
        j = PRIMALpositionCANDIDATE + 1;
        while ( j <= (FoundAtPosition-1) ) {
            if ( *(uint32_t *)(pbPattern+PRIMALpositionCANDIDATE-(1)) == *(uint32_t *)(pbPattern+j-(1)) ) FoundAtPosition = j;
            j++;
        }
        PRIMALpositionCANDIDATE++;
    }
    PRIMALlengthCANDIDATE = (FoundAtPosition-1)-i+1 +((4)-1);
    if (PRIMALlengthCANDIDATE >= PRIMALlength) {PRIMALposition=i; PRIMALlength = PRIMALlengthCANDIDATE;}
    if (cbPattern-i+1 <= PRIMALlength) break;
}
// Swampwalker heuristic order 4 (Needle should be bigger than 4) ]

'Worst-Case-Scenarios' torture gave:

Haystack: In file 'WORST.TXT'
Type: Worst
Size: 200,000,609 bytes

D:\_KAZE\_KAZE_32bit_memmem-like_showdown_Swampshine>type WORST.TXT
[200,000,000 'Z' bytes]
ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
ZZZSHAKALAKAZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
ZZZZZZZZZZZZZZZ

Needle #1: In file 'WCS_PREFIXEDandPOSTFIXED_609chars.txt'
Needle #2: In file 'WCS_PREFIXED_309chars.txt'
Needle #3: In file 'WCS_POSTFIXED_309chars.txt'

D:\_KAZE\_KAZE_32bit_memmem-like_showdown_Swampshine>type WCS_PREFIXEDandPOSTFIXED_609chars.txt
ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
ZZZSHAKALAKAZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
ZZZZZZZZZZZZZZZ

D:\_KAZE\_KAZE_32bit_memmem-like_showdown_Swampshine>type WCS_PREFIXED_309chars.txt
ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
ZZZSHAKALAKA

D:\_KAZE\_KAZE_32bit_memmem-like_showdown_Swampshine>type WCS_POSTFIXED_309chars.txt
SHAKALAKAZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
ZZZZZZZZZZZZ

For needle in file 'WCS_PREFIXEDandPOSTFIXED_609chars.txt':

Doing Search for Pattern(609bytes) into String(200000609bytes) as-one-line ...
Railgun_Swampshine precise performance:         902MB/s
Railgun_Ennearch precise performance:             9MB/s
Railgun_DecumanusBITified precise performance:   14MB/s
Boyer_Moore-Horspool precise performance:        21MB/s

PRIMAL Needle: 'ZZZSHAKALAKAZZZZ'
PRIMALposition=298; PRIMALlength=16

For needle in file 'WCS_PREFIXED_309chars.txt':

Doing Search for Pattern(309bytes) into String(200000609bytes) as-one-line ...
Railgun_Swampshine precise performance:         329MB/s
Railgun_Ennearch precise performance:           155MB/s
Railgun_DecumanusBITified precise performance:  102MB/s
Boyer_Moore-Horspool precise performance:      3048MB/s

PRIMAL Needle: 'ZZZZSHAKALAKA'
PRIMALposition=297; PRIMALlength=13

For needle in file 'WCS_POSTFIXED_309chars.txt':

Doing Search for Pattern(309bytes) into String(200000609bytes) as-one-line ...
Railgun_Swampshine precise performance:         902MB/s
Railgun_Ennearch precise performance:           451MB/s
Railgun_DecumanusBITified precise performance:  225MB/s
Boyer_Moore-Horspool precise performance:       223MB/s

PRIMAL Needle: 'SHAKALAKAZZZZ'
PRIMALposition=1; PRIMALlength=13

Revision 2 of 'Swampshine' should remove the break.
One solution is to fall back to 'Gulliver', after the transformation (order 2, not 4) its order 2 skips will outperform Boyer_Moore-Horspool for sure, but this will happen only if I fail to come up with something else.

An add-on from 2014-Jan-23:

The latest best (for non-worst-case-scenarios) is 'Railgun_Ennearch' - even better than 'Railgun_Decumanus':

Doing Search for Pattern(589bytes) into String(52428800bytes) as-one-line ...
Railgun_Ennearch precise performance: 19797MB/s
Railgun_DecumanusBITified precise performance: 17807MB/s
Boyer_Moore-Horspool precise performance: 3252MB/s

Doing Search for Pattern(1080bytes) into String(52428800bytes) as-one-line ...
Railgun_Ennearch precise performance: 37175MB/s
Railgun_DecumanusBITified precise performance: 33822MB/s
Boyer_Moore-Horspool precise performance: 4473MB/s

Doing Search for Pattern(1755bytes) into String(52428800bytes) as-one-line ...
Railgun_Ennearch precise performance: 138014MB/s
Railgun_DecumanusBITified precise performance: 121397MB/s
Boyer_Moore-Horspool precise performance: 5731MB/s

Doing Search for Pattern(2686bytes) into String(52428800bytes) as-one-line ...
Railgun_Ennearch precise performance: 129381MB/s
Railgun_DecumanusBITified precise performance: 145128MB/s
Boyer_Moore-Horspool precise performance: 4541MB/s

Doing Search for Pattern(3867bytes) into String(52428800bytes) as-one-line ...
Railgun_Ennearch precise performance: 169574MB/s
Railgun_DecumanusBITified precise performance: 171716MB/s
Boyer_Moore-Horspool precise performance: 4514MB/s

For Wikipedia 52,428,800 bytes haystack and 3,867 bytes needle Railgun_Ennearch overpowers Boyer_Moore-Horspool by (169574-4514)/4514*100% = 3,656%, commentless.

The benchmark with its source at: http://www.sekireigan.com/_KAZE_32bit_memmem-like_showdown_Ennearch.7z.

Railgun_Quadruplet/Railgun_Ennearch_Full-Fledged_MEMMEM_ACGT_ragged_balloon.png

Wanna traverse the Trolldom running not walking muchless crawling.

From the very start Railgun was intended as text searcher, later on it has been widened to hit non-textual data, while the Worst Case Scenarios were uncharted deadly swampy zones, until now.
I am after writing the first W.C.S. Railgun: 'Railgun_Trolldom'.

There is one very simple heuristic which won't hurt (speedwise) the rest sections.
It is to be embedded into pseudo-order 4 section (where needle is longer than 19, 'if ( cbPattern>NeedleThreshold2vs4Nonus )' condition).
It is to find the rightmost 2-bytes chunk that appears most times, immediately an example:

Let our Haystack is 200,000,000+33 bytes long, where the first 200,000,000 bytes are 'v' byte and the needle is last 33 bytes:

Haystack: ...vvvvvvvvvvBOOMSHAKALAKAvvvvvvvvvv  200,000,033 bytes long
     Needle: vvvvvvvvvvBOOMSHAKALAKAvvvvvvvvvv           33 bytes long

Factorizing the needle, Total Building-Blocks are 33-2+1 BBs/suffixes:
BB/Suffix01: [vv]vvvvvvvvBOOMSHAKALAKAvvvvvvvvvv
BB/Suffix02: v[vv]vvvvvvvBOOMSHAKALAKAvvvvvvvvvv
...
BB/Suffix31: vvvvvvvvvvBOOMSHAKALAKAvvvvvvv[vv]v
BB/Suffix32: vvvvvvvvvvBOOMSHAKALAKAvvvvvvvv[vv]

Distinct BBs are 33-2+1-(8+1+1+9)=13:
01: [vv]            ! 8 duplicates !
02: [vB]OOMSHAKALAKAv
03: [BO]OMSHAKALAKAv
04: [OO]MSHAKALAKAv
05: [OM]SHAKALAKAv
06: [MS]HAKALAKAv
07: [SH]AKALAKAv
08: [HA]KALAKAv
09: [AK]ALAKAv
10: [KA]LAKAv
11: [AL]AKAv
12: [LA]KAv
13: [AK]Av            ! 1 duplicate !
14: [KA]v            ! 1 duplicate !
15: [Av]
16: [vv]            ! 9 duplicates !

The five questions giving green light to this (so-called 'Swamprunner') heuristic:
Question #1] Is the 'kick-in' (Total/Distinct elements in lookup table) ratio bigger than 1?
Question #2] Is the number of Distinct elements bigger than 1?
Question #3] What is THE most common & rightmost 2-byte suffix (so-called 'TROLLish' Building-Block)?
Question #4] What is THE longest & rightmost string NOT containing the TROLLish BB (so-called 'NewNeedle')?
Question #5] Is THE longest & rightmost string NOT containing the TROLLish BB bigger than 3?
Zeroing & Reloading the lookup table with new needle's values, REasking the above for the 'NewNeedle'

Let's answer these five questions:
Answer #1] Yes, (int)(33-2+1)/(13)=2 > 1
Answer #2] Yes, 13 > 1
Answer #3] 'vv'
Answer #4] 'vBOOMSHAKALAKAv'
Answer #5] Yes, 15 > 3
Zeroing & Reloading the lookup table with new needle's values, REanswering the above for the 'NewNeedle'

Five green lights, 'Swamprunner' is under way.
Searching using order 2, if found then compare 'LeftLeftover+NewNeedle+RightLeftover' at corresponding Haystack offset.

'Railgun_Ennearch' + 'Swamprunner' heuristic = 'Railgun_Trolldom'

Ha, just recalled one of the reasons Napoleon to meet his final defeat in the Battle of Waterloo, his artillery couldn't make splash damage due to the muddy terrain - the cannonballs simply dived into the mud.
Hopefully my railgun will produce MUDDY TSUNAMIS in the trollish swamps making countless trolls to ride the bullet.

Update:
Not happy with 'Swamprunner', I realized that it can't do much good, more aggressive heuristic must be implemented.
I had to sleep one night away in order to refine and simplify the overkillous 'Swamprunner', the result is shazamish (not shamazing), the new heuristic is called 'Swampwalker'.
'Swamprunner' is too weak and so unpractical compared to 'Swampwalker'.
The idea is to find THE rightmost & longest PRIMAL string within the needle, PRIMAL is quite the same as prime numbers i.e. it has to be unfactorizable.
Unfactorizablility guarantees EXCELLENT Skip Performance for BMH Order 2 and BMH pseudo-order 4.
If the longest PRIMAL string happens to be of length 2 (all bytes are the same) then standard search is performed.
The algorithm for this nifty heuristic is simple, it is applied for order 2, at each position the first duplicate is searched and duplicate's position constitutes the right boundary for next searches, of course the next BB is examined in the same way until right boundary is reached, and all this for one position at a time.
At each position 'PRIMALlength' is updated if it is EQUAL or bigger than the previous 'PRIMALlength'.
In the end we have two values: 'PRIMALposition' and 'PRIMALlength' defining the PRIMAL NewNeedle.

Again, at each STEP 'PRIMALlength' is updated if it is EQUAL or bigger than the previous 'PRIMALlength'.

Legend:
'[]' points to BB forming left or right boundary;
'{}' points to BB being searched for;
'()' position of duplicate and new right boundary;

Three examples follow:

                       00000000011111111112222222222333
                       12345678901234567890123456789012
Example #1 for Needle: 1234567890qwertyuiopasdfghjklzxcv  NewNeedle = '1234567890qwertyuiopasdfghjklzxcv'
Example #2 for Needle: vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv  NewNeedle = 'vv'
Example #3 for Needle: vvvvvvvvvvBOOMSHAKALAKAvvvvvvvvvv  NewNeedle = 'vvBOOMSHAKALA'

Example #1:

     PRIMALlength=00; FoundAtPosition=33; 
Step 01_00: {}[12]34567890qwertyuiopasdfghjklzxc[v?] ! For position #01 the initial boundaries are PRIMALpositionCANDIDATE=LeftBoundary=01, RightBoundary=FoundAtPosition-1, the CANDIDATE PRIMAL string length is RightBoundary-LeftBoundary+(2)=(33-1)-01+(2)=33 !
Step 01_01: [{12}]34567890qwertyuiopasdfghjklzxc[v?] ! Searching for '12', FoundAtPosition = 33, PRIMALlengthCANDIDATE=RightBoundary-LeftBoundary+(2)=(33-1)-01+(2)=33 ! 
Step 01_02: [1{2]3}4567890qwertyuiopasdfghjklzxc[v?] ! Searching for '23', FoundAtPosition = 33, PRIMALlengthCANDIDATE=RightBoundary-LeftBoundary+(2)=(33-1)-01+(2)=33 ! 
...
Step 01_30: [12]34567890qwertyuiopasdfghjkl{zx}c[v?] ! Searching for 'zx', FoundAtPosition = 33, PRIMALlengthCANDIDATE=RightBoundary-LeftBoundary+(2)=(33-1)-01+(2)=33 ! 
Step 01_31: [12]34567890qwertyuiopasdfghjklz{xc}[v?] ! Searching for 'xc', FoundAtPosition = 33, PRIMALlengthCANDIDATE=RightBoundary-LeftBoundary+(2)=(33-1)-01+(2)=33 ! 
     if (PRIMALlengthCANDIDATE >= PRIMALlength) {PRIMALposition=PRIMALpositionCANDIDATE; PRIMALlength = PRIMALlengthCANDIDATE;}
Step 02_00: {}1[23]4567890qwertyuiopasdfghjklzxc[v?] ! For position #02 the initial boundaries are PRIMALpositionCANDIDATE=LeftBoundary=02, RightBoundary=FoundAtPosition-1, the CANDIDATE PRIMAL string length is RightBoundary-LeftBoundary+(2)=(33-1)-02+(2)=32 !
Step 02_01: 1[{23}]4567890qwertyuiopasdfghjklzxc[v?] ! Searching for '23', FoundAtPosition = 33, PRIMALlengthCANDIDATE=RightBoundary-LeftBoundary+(2)=(33-1)-02+(2)=32 ! 
Step 02_02: 1[2{3]4}567890qwertyuiopasdfghjklzxc[v?] ! Searching for '34', FoundAtPosition = 33, PRIMALlengthCANDIDATE=RightBoundary-LeftBoundary+(2)=(33-1)-02+(2)=32 ! 
...
Step 02_29: 1[23]4567890qwertyuiopasdfghjkl{zx}c[v?] ! Searching for 'zx', FoundAtPosition = 33, PRIMALlengthCANDIDATE=RightBoundary-LeftBoundary+(2)=(33-1)-02+(2)=32 ! 
Step 02_30: 1[23]4567890qwertyuiopasdfghjklz{xc}[v?] ! Searching for 'xc', FoundAtPosition = 33, PRIMALlengthCANDIDATE=RightBoundary-LeftBoundary+(2)=(33-1)-02+(2)=32 ! 
     if (PRIMALlengthCANDIDATE >= PRIMALlength) {PRIMALposition=PRIMALpositionCANDIDATE; PRIMALlength = PRIMALlengthCANDIDATE;}
...
Step 31_00: {}1234567890qwertyuiopasdfghjklz[xc][v?] ! For position #31 the initial boundaries are PRIMALpositionCANDIDATE=LeftBoundary=31, RightBoundary=FoundAtPosition-1, the CANDIDATE PRIMAL string length is RightBoundary-LeftBoundary+(2)=(33-1)-31+(2)=03 !
Step 31_01: 1234567890qwertyuiopasdfghjklz[{xc}][v?] ! Searching for 'xc', FoundAtPosition = 33, PRIMALlengthCANDIDATE=RightBoundary-LeftBoundary+(2)=(33-1)-31+(2)=03 ! 
     if (PRIMALlengthCANDIDATE >= PRIMALlength) {PRIMALposition=PRIMALpositionCANDIDATE; PRIMALlength = PRIMALlengthCANDIDATE;}
     Result:
     PRIMALposition=01 PRIMALlength=33, NewNeedle = '1234567890qwertyuiopasdfghjklzxcv'

Example #2:

     PRIMALlength=00; FoundAtPosition=33; 
Step 01_00: {}[vv]vvvvvvvvvvvvvvvvvvvvvvvvvvvvvv[v?] ! For position #01 the initial boundaries are PRIMALpositionCANDIDATE=LeftBoundary=01, RightBoundary=FoundAtPosition-1, the CANDIDATE PRIMAL string length is RightBoundary-LeftBoundary+(2)=(33-1)-01+(2)=33 !
Step 01_01: [{v(v}]v)vvvvvvvvvvvvvvvvvvvvvvvvvvvvvv  ! Searching for 'vv', FoundAtPosition = 02, PRIMALlengthCANDIDATE=RightBoundary-LeftBoundary+(2)=(02-1)-01+(2)=02 ! 
     if (PRIMALlengthCANDIDATE >= PRIMALlength) {PRIMALposition=PRIMALpositionCANDIDATE; PRIMALlength = PRIMALlengthCANDIDATE;}
Step 02_00: {}v[vv]vvvvvvvvvvvvvvvvvvvvvvvvvvvvv[v?] ! For position #02 the initial boundaries are PRIMALpositionCANDIDATE=LeftBoundary=02, RightBoundary=FoundAtPosition-1, the CANDIDATE PRIMAL string length is RightBoundary-LeftBoundary+(2)=(33-1)-02+(2)=32 !
Step 02_01: v[{v(v}]v)vvvvvvvvvvvvvvvvvvvvvvvvvvvvv  ! Searching for 'vv', FoundAtPosition = 03, PRIMALlengthCANDIDATE=RightBoundary-LeftBoundary+(2)=(03-1)-02+(2)=02 ! 
     if (PRIMALlengthCANDIDATE >= PRIMALlength) {PRIMALposition=PRIMALpositionCANDIDATE; PRIMALlength = PRIMALlengthCANDIDATE;}
...
Step 31_00: {}vvvvvvvvvvvvvvvvvvvvvvvvvvvvvv[vv][v?] ! For position #31 the initial boundaries are PRIMALpositionCANDIDATE=LeftBoundary=31, RightBoundary=FoundAtPosition-1, the CANDIDATE PRIMAL string length is RightBoundary-LeftBoundary+(2)=(33-1)-31+(2)=03 !
Step 31_01: vvvvvvvvvvvvvvvvvvvvvvvvvvvvvv[{v(v}]v)  ! Searching for 'vv', FoundAtPosition = 32, PRIMALlengthCANDIDATE=RightBoundary-LeftBoundary+(2)=(32-1)-31+(2)=02 ! 
     if (PRIMALlengthCANDIDATE >= PRIMALlength) {PRIMALposition=PRIMALpositionCANDIDATE; PRIMALlength = PRIMALlengthCANDIDATE;}
     Result:
     PRIMALposition=31 PRIMALlength=02, NewNeedle = 'vv'

Example #3:

     PRIMALlength=00; FoundAtPosition=33; 
Step 01_00: {}[vv]vvvvvvvvBOOMSHAKALAKAvvvvvvvvv[v?] ! For position #01 the initial boundaries are PRIMALpositionCANDIDATE=LeftBoundary=01, RightBoundary=FoundAtPosition-1, the CANDIDATE PRIMAL string length is RightBoundary-LeftBoundary+(2)=(33-1)-01+(2)=33 !
Step 01_01: [{v(v}]v)vvvvvvvBOOMSHAKALAKAvvvvvvvvvv  ! Searching for 'vv', FoundAtPosition = 02, PRIMALlengthCANDIDATE=RightBoundary-LeftBoundary+(2)=(02-1)-01+(2)=02 ! 
     if (PRIMALlengthCANDIDATE >= PRIMALlength) {PRIMALposition=PRIMALpositionCANDIDATE; PRIMALlength = PRIMALlengthCANDIDATE;}
Step 02_00: {}v[vv]vvvvvvvBOOMSHAKALAKAvvvvvvvvv[v?] ! For position #02 the initial boundaries are PRIMALpositionCANDIDATE=LeftBoundary=02, RightBoundary=FoundAtPosition-1, the CANDIDATE PRIMAL string length is RightBoundary-LeftBoundary+(2)=(33-1)-02+(2)=32 !
Step 02_01: v[{v(v}]v)vvvvvvBOOMSHAKALAKAvvvvvvvvvv  ! Searching for 'vv', FoundAtPosition = 03, PRIMALlengthCANDIDATE=RightBoundary-LeftBoundary+(2)=(03-1)-02+(2)=02 ! 
     if (PRIMALlengthCANDIDATE >= PRIMALlength) {PRIMALposition=PRIMALpositionCANDIDATE; PRIMALlength = PRIMALlengthCANDIDATE;}
...
Step 09_00: {}vvvvvvvv[vv]BOOMSHAKALAKAvvvvvvvvv[v?] ! For position #09 the initial boundaries are PRIMALpositionCANDIDATE=LeftBoundary=09, RightBoundary=FoundAtPosition-1, the CANDIDATE PRIMAL string length is RightBoundary-LeftBoundary+(2)=(33-1)-09+(2)=25 !
Step 09_01: vvvvvvvv[{vv}]BOOMSHAKALAKA(vv)vvvvvvvv  ! Searching for 'vv', FoundAtPosition = 24, PRIMALlengthCANDIDATE=RightBoundary-LeftBoundary+(2)=(24-1)-09+(2)=16 ! 
Step 09_02: vvvvvvvv[v{v]B}OOMSHAKALAKA[vv]vvvvvvvv  ! Searching for 'vB', FoundAtPosition = 24, PRIMALlengthCANDIDATE=RightBoundary-LeftBoundary+(2)=(24-1)-09+(2)=16 ! 
Step 09_03: vvvvvvvv[vv]{BO}OMSHAKALAKA[vv]vvvvvvvv  ! Searching for 'BO', FoundAtPosition = 24, PRIMALlengthCANDIDATE=RightBoundary-LeftBoundary+(2)=(24-1)-09+(2)=16 ! 
Step 09_04: vvvvvvvv[vv]B{OO}MSHAKALAKA[vv]vvvvvvvv  ! Searching for 'OO', FoundAtPosition = 24, PRIMALlengthCANDIDATE=RightBoundary-LeftBoundary+(2)=(24-1)-09+(2)=16 ! 
Step 09_05: vvvvvvvv[vv]BO{OM}SHAKALAKA[vv]vvvvvvvv  ! Searching for 'OM', FoundAtPosition = 24, PRIMALlengthCANDIDATE=RightBoundary-LeftBoundary+(2)=(24-1)-09+(2)=16 ! 
Step 09_06: vvvvvvvv[vv]BOO{MS}HAKALAKA[vv]vvvvvvvv  ! Searching for 'MS', FoundAtPosition = 24, PRIMALlengthCANDIDATE=RightBoundary-LeftBoundary+(2)=(24-1)-09+(2)=16 ! 
Step 09_07: vvvvvvvv[vv]BOOM{SH}AKALAKA[vv]vvvvvvvv  ! Searching for 'SH', FoundAtPosition = 24, PRIMALlengthCANDIDATE=RightBoundary-LeftBoundary+(2)=(24-1)-09+(2)=16 ! 
Step 09_08: vvvvvvvv[vv]BOOMS{HA}KALAKA[vv]vvvvvvvv  ! Searching for 'HA', FoundAtPosition = 24, PRIMALlengthCANDIDATE=RightBoundary-LeftBoundary+(2)=(24-1)-09+(2)=16 ! 
Step 09_09: vvvvvvvv[vv]BOOMSH{AK}AL(AK)Avvvvvvvvvv  ! Searching for 'AK', FoundAtPosition = 21, PRIMALlengthCANDIDATE=RightBoundary-LeftBoundary+(2)=(21-1)-09+(2)=13 ! 
Step 09_10: vvvvvvvv[vv]BOOMSHA{KA}L[AK]Avvvvvvvvvv  ! Searching for 'KA', FoundAtPosition = 21, PRIMALlengthCANDIDATE=RightBoundary-LeftBoundary+(2)=(21-1)-09+(2)=13 ! 
Step 09_11: vvvvvvvv[vv]BOOMSHAK{AL}[AK]Avvvvvvvvvv  ! Searching for 'AL', FoundAtPosition = 21, PRIMALlengthCANDIDATE=RightBoundary-LeftBoundary+(2)=(21-1)-09+(2)=13 ! 
Step 09_12: vvvvvvvv[vv]BOOMSHAKA{L[A}K]Avvvvvvvvvv  ! Searching for 'LA', FoundAtPosition = 21, PRIMALlengthCANDIDATE=RightBoundary-LeftBoundary+(2)=(21-1)-09+(2)=13 ! 
     if (PRIMALlengthCANDIDATE >= PRIMALlength) {PRIMALposition=PRIMALpositionCANDIDATE; PRIMALlength = PRIMALlengthCANDIDATE;}
...
Step 31_00: {}vvvvvvvv[vv]BOOMSHAKALAKAvvvvvvvvv[v?] ! For position #31 the initial boundaries are PRIMALpositionCANDIDATE=LeftBoundary=31, RightBoundary=FoundAtPosition-1, the CANDIDATE PRIMAL string length is RightBoundary-LeftBoundary+(2)=(33-1)-31+(2)=03 !
Step 31_01: vvvvvvvvvvBOOMSHAKALAKAvvvvvvv[{v(v}]v)  ! Searching for 'vv', FoundAtPosition = 32, PRIMALlengthCANDIDATE=RightBoundary-LeftBoundary+(2)=(32-1)-31+(2)=02 ! 
     if (PRIMALlengthCANDIDATE >= PRIMALlength) {PRIMALposition=PRIMALpositionCANDIDATE; PRIMALlength = PRIMALlengthCANDIDATE;}
     Result:
     PRIMALposition=09 PRIMALlength=13, NewNeedle = 'vvBOOMSHAKALA'

When unfolded, the heuristic looks a bit complicated, actually, when dressed in C it is the simplicity itself:

C++
PRIMALlength=0;
for (i=0+(1); i < cbPattern-2+1+(1)-(1); i++) { // -(1) because the last BB order 2 has no counterpart(s)
    FoundAtPosition = cbPattern;
    PRIMALpositionCANDIDATE=i;
    while ( PRIMALpositionCANDIDATE <= (FoundAtPosition-1) ) {
        j = PRIMALpositionCANDIDATE + 1;
        while ( j <= (FoundAtPosition-1) ) {
            if ( *(unsigned short *)(pbPattern+PRIMALpositionCANDIDATE-(1)) == *(unsigned short *)(pbPattern+j-(1)) ) FoundAtPosition = j;
            j++;
        }
        PRIMALpositionCANDIDATE++;
    }
    PRIMALlengthCANDIDATE = (FoundAtPosition-1)-i+(2);
    if (PRIMALlengthCANDIDATE >= PRIMALlength) {PRIMALposition=i; PRIMALlength = PRIMALlengthCANDIDATE;}
}

Oh, and after searching for the longest number-name within the range 10^0..10^9999 the next one-and-only 43 letters long popped-up:
domilliaquadringenquattuorquinquagintillion
which stands for:
10^7365 one do-millia-quadringen-quattuor-quinquagin-tillion

And transforming some strings down to their PRIMAL strings:

Needle: 1234567890qwertyuiopasdfghjklzxcv            PRIMALposition=01 PRIMALlength=33  '1234567890qwertyuiopasdfghjklzxcv'
Needle: vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv            PRIMALposition=31 PRIMALlength=02  'vv'
Needle: vvvvvvvvvvBOOMSHAKALAKAvvvvvvvvvv            PRIMALposition=09 PRIMALlength=13  'vvBOOMSHAKALA'
Needle: Trollland                                    PRIMALposition=05 PRIMALlength=05  'lland'
Needle: Swampwalker                                  PRIMALposition=03 PRIMALlength=09  'ampwalker'
Needle: licenselessness                              PRIMALposition=01 PRIMALlength=13  'licenselessne'
Needle: alfalfa                                      PRIMALposition=04 PRIMALlength=04  'alfa'
Needle: Sandokan                                     PRIMALposition=01 PRIMALlength=07  'Sandoka'
Needle: shazamish                                    PRIMALposition=02 PRIMALlength=08  'hazamish'
Needle: Simplicius Simplicissimus                    PRIMALposition=08 PRIMALlength=15  'ius Simplicissi'
Needle: domilliaquadringenquattuorquinquagintillion  PRIMALposition=01 PRIMALlength=19  'domilliaquadringenq'
Needle: DODO                                         PRIMALposition=02 PRIMALlength=03  'ODO'
Needle: DODOD                                        PRIMALposition=03 PRIMALlength=03  'DOD'
Needle: aaaDODO                                      PRIMALposition=02 PRIMALlength=05  'aaDOD'
Needle: aaaDODOD                                     PRIMALposition=02 PRIMALlength=05  'aaDOD'
Needle: DODOaaa                                      PRIMALposition=02 PRIMALlength=05  'ODOaa'
Needle: DODODaaa                                     PRIMALposition=03 PRIMALlength=05  'DODaa'

earthshine
n.
The sunlight reflected from the earth's surface that illuminates part of the moon not directly lighted by the sun. Also called earthlight.
/HERITAGE/

The beauty of 'Swampwalker' is in its in-place (no additional space) work, this will pay off especially with big needles, say, 8KB. I believe that 'Railgun_Ennearch' reinforced by 'Swampwalker' transformation will gun Worst-Case-Scenarios down, thus making linearity closer. This WCS hitter will be called Railgun_Swampshine, got it, the shooting in the trollish swamps will be so intense/massacreOUS that the generated plasma will illuminate part of the moon.

 

An add-on from 2014-Jan-12:

Time for decuman hitting.

By far Railgun_Decumanus is the fastest-and-strongest of all railguns, its Skip Performance is superior due to 1+(1+1+1) lookups of pseudo-order 4 as first and second fast checks - implemented as only two conditionals.
Decumanus checks 4 haystack's suffixes (within the 10 rightmost bytes of the window) for presence in the needle, these 4 checks give firm grip while maintaining excellent Speed Performance, in fact 'Decumanus' means something immense.

Railgun_Decumanus is strengthened so well (see the GENOMEsque torture graphs) that it devastates all known to me hitters.

The 3 out-of-chart values for the Wikipedia torture are:

 

Doing Search for Pattern(589bytes) into String(52428800bytes) as-one-line ...
Railgun_Decumanus precise performance: 20492MB/s
Boyer_Moore-Horspool precise performance: 3605MB/s

Doing Search for Pattern(1080bytes) into String(52428800bytes) as-one-line ...
Railgun_Decumanus precise performance: 36408MB/s
Boyer_Moore-Horspool precise performance: 4896MB/s

Doing Search for Pattern(1755bytes) into String(52428800bytes) as-one-line ...
Railgun_Decumanus precise performance: 86141MB/s
Boyer_Moore-Horspool precise performance: 6281MB/s

Take a minute and compare 86141 vs 6281, for needle 1755 bytes long the number of repetitive suffixes order 2 or/and 4 is significant, the suffix management has to be both speedy and efficient.
Boyer-Moore-Horspool with its order 1 lookups suffers badly because of the great number of repetitive bytes, and we speak for its NATIVE mode - the textual data.
Even 'Gulliver', my old Railgun being BMH order 2, is hurt with such long needles.
Better yet choked, Railgun_Sekireigan_Wolfram_Nozomi reaches the miserable 28238MB/s, 1271% boost over BMH - this is the power of 4 pseudo-order 4 lookups.

Genome benchmark:

Railgun_Quadruplet/Railgun_Decumanus_Full-Fledged_MEMMEM_ACGT.png

Wikipedia benchmark:

Railgun_Quadruplet/Railgun_Decumanus_Full-Fledged_MEMMEM_WIKI.png

I have left the section for needle lengths 0..3 vacant/unhandled on purpose, SSE2 rules on small needles, Mischa Sandberg knows best, simply the 128bit Assembly destroys everything in its path.

The benchmark with its source at: http://www.sekireigan.com/_KAZE_32bit_memmem-like_showdown_Decumanus.7z.

An add-on from 2013-Nov-12:

I have strengthened Shockeroo's Skip Performance (while maintaining same speed) by adding an additional lookup in the second level check while not adding additional branching!
The C source with its main loop in Assembly is given at Download Section. The printer friendly version Railgun_Wolfram_4-pages.pdf.

The body of 'Railgun_Sekireigan_WolfRAM' is this:

  • Section #1: Simplified Karp-Rabin for shortest (2..3 bytes) needles;
  • Section #2: Quadruplet for shorter (4 bytes and bigger) needles and for "short" haystacks;
  • Section #3a: Boyer-Moore-Horspool order 2 (2)+(2+2) for short (4..22 bytes) needles and for "long" haystacks;
  • Section #3b: Boyer-Moore-Horspool pseudo-order 4 (4)+(4) for long (23..711 bytes) needles and for "long" haystacks;
  • Section #4: Simplified Horspool pseudo-order 12 for longest (712 bytes and bigger) needles and for "long" haystacks.

Few thoughts regarding an excellent Skip Performance etude:
Something "still" considered crazy: using BITwise order 3, not pseudo order 3, though! 2^24 = 16MB BYTEwise or 2^(24-3) = 2MB BITwise, when searching big haystacks e.g. Wikipedia 42GB with Kazahana then this 2MB lookup table seems not so atrocious.

C++
// The code is like:
ulHashTarget = *(uint32_t *)&pbTarget[i+cbPattern-4]; // One memory access instead of 2
if ( bm_Horspool_Order3[ulHashTarget>>8] == 0 ) Gulliver = cbPattern-(3-1); else
if ( bm_Horspool_Order3[ulHashTarget&0xFFFFFF] == 0 ) Gulliver = cbPattern-(3-1)-1; else
{
    // ...
}

When searching big haystacks, say the whole Wikipedia a 42GB file, I expect order 3 combined with multi-threading, say 16 threads as in my Kazahana tool, given that all threads use this 2MB lookup table SHARED, to overpower even 'WolfRAM', to be done.

An add-on from 2013-Nov-09:

It's shocktime.

The shock is bitter-sweet, I failed to better the Skip Performance in the external loop, while the internal became MUCH faster (even on my Core 2). Two major improvements in 'Railgun_Sekireigan_Shriekeroo' sections #3a/#3b led to appearance of 'Railgun_Sekireigan_Shockeroo' - the FASTEST MEMMEM, so far. The boost is twofold by reducing the iterations in both loops (external&internal). For external, better skip is sought from right-to-left, unsuccessful so far. For internal, the cycles/iterations are 4 times less. The main loop of Section #3a:

C++
while (i <= cbTarget-cbPattern) {
    Gulliver = 1; // 'Gulliver' is the skip
    if ( bm_Horspool_Order2[*(unsigned short *)&pbTarget[i+cbPattern-1-1]] != 0 ) {
        if ( bm_Horspool_Order2[*(unsigned short *)&pbTarget[i+cbPattern-1-1-2]] == 0 ) Gulliver = cbPattern-(2-1)-2; else {
            if ( *(uint32_t *)&pbTarget[i] == ulHashPattern) {
            // This fast check ensures not missing a match (for remainder) when going under 0 in loop below:
                // Order 4 [
            // Let's try something "outrageous" like comparing
            // with[out] overlap BBs 4bytes long instead of 1 byte back-to-back:
            // Inhere we are using order 4, 'cbPattern - Order + 1' is the number
            // of BBs for text 'cbPattern' bytes long, for example,
            // for cbPattern=11 'fastest fox' and Order=4 we have BBs = 11-4+1=8:
            //0:"fast" if the comparison failed here, 'count' is 1; 'Gulliver' is cbPattern-(4-1)-7
            //1:"aste" if the comparison failed here, 'count' is 2; 'Gulliver' is cbPattern-(4-1)-6
            //2:"stes" if the comparison failed here, 'count' is 3; 'Gulliver' is cbPattern-(4-1)-5
            //3:"test" if the comparison failed here, 'count' is 4; 'Gulliver' is cbPattern-(4-1)-4
            //4:"est " if the comparison failed here, 'count' is 5; 'Gulliver' is cbPattern-(4-1)-3
            //5:"st f" if the comparison failed here, 'count' is 6; 'Gulliver' is cbPattern-(4-1)-2
            //6:"t fo" if the comparison failed here, 'count' is 7; 'Gulliver' is cbPattern-(4-1)-1
            //7:" fox" if the comparison failed here, 'count' is 8; 'Gulliver' is cbPattern-(4-1)
                count = cbPattern-4+1; 
                while ( count > 0 && *(uint32_t *)(pbPattern+count-1) == 
                            *(uint32_t *)(&pbTarget[i]+(count-1)) )
                    //count--;
                    count = count-4; // - order, of course order 4 is much more SWEET&CHEAP - less loops
                if ( count <= 0 ) return(pbTarget+i); else
                    if ( bm_Horspool_Order2[*(unsigned short *)&pbTarget[i+count-1]] == 
                               0 ) Gulliver = count; // 1 or bigger, as it should
                // Order 4 ]
            }
        }
        // Trying to strengthen the Skip performance... here ONLY one additional
        // lookup, for better/longer jumps more such lookups, unrolled to be added. 
        //if ( bm_Horspool_Order2[*(unsigned short *)&pbTarget[i+cbPattern-1-1-2]] == 0 ) Gulliver = cbPattern-(2-1)-2;
        // Dummy me, the above line should play two roles (by putting
        // it before the comparison cycle - which is nastily slow) instead of one:
        // 1] If it is 0 that discards the need for further comparing - they cannot be equal.
        // 2] Strengthen the skip.
    } else Gulliver = cbPattern-(2-1);
    i = i + Gulliver;
    //GlobalI++; // Comment it, it is only for stats.
}

Kind of disappointed after the boosting, on i7 I expect nothing less than 'BANG!'

The benchmark with its source at: http://www.sekireigan.com/_KAZE_32bit_memmem-like_showdown_Shockeroo.7z.

Railgun_Quadruplet/Railgun_Sekireigan_Shockeroo_Full-Fledged_MEMMEM_WIKI.png

Since the test measures in milliseconds (enough for raw purposes) I added the total clocks for 'Railgun_Sekireigan_Shockeroo' which is executed 1024 times, the results on my laptop Core 2:

Doing Search for Pattern(7bytes) into String(52428800bytes) as-one-line ...
The first instance occurred at position 51265299 within next line:
Railgun_Sekireigan_Shockeroo_hits/Railgun_Sekireigan_Shockeroo_clocks: 0/19
Railgun_Sekireigan_Shockeroo performance: 2694KB/clock
Railgun_Sekireigan_Shockeroo TotalRoughSearchTime (lesser-the-better): 18642.000000clocks
...
Doing Search for Pattern(589bytes) into String(52428800bytes) as-one-line ...
The first instance occurred at position 52419431 within next line:
Railgun_Sekireigan_Shockeroo_hits/Railgun_Sekireigan_Shockeroo_clocks: 0/3
Railgun_Sekireigan_Shockeroo performance: 17066KB/clock
Railgun_Sekireigan_Shockeroo TotalRoughSearchTime (lesser-the-better): 2532.000000clocks

EDIT:

Or, 7 bytes long needle is found at 51,265,299KB/18642clocks = ((51265299/18642)*1000)/1024/1024= 2.622GB/s on Core 2 laptop!!!

...
Or, 589 bytes long needle is found at 52,419,431KB/2532clocks = ((52419431/2532)*1000)/1024/1024= 19+GB/s on Core 2 laptop!!!

And the same test on laptop Core i5 2430M @3000MHz 3MB L3 cache:

Doing Search for Pattern(7bytes) into String(52428800bytes) as-one-line ...
The first instance occurred at position 51265299 within next line:
Railgun_Sekireigan_Shockeroo_hits/Railgun_Sekireigan_Shockeroo_clocks: 0/13
Railgun_Sekireigan_Shockeroo performance: 3938KB/clock
Railgun_Sekireigan_Shockeroo TotalRoughSearchTime (lesser-the-better): 12345.000000clocks
...
Doing Search for Pattern(589bytes) into String(52428800bytes) as-one-line ...
The first instance occurred at position 52419431 within next line:
Railgun_Sekireigan_Shockeroo_hits/Railgun_Sekireigan_Shockeroo_clocks: 0/2
Railgun_Sekireigan_Shockeroo performance: 25600KB/clock
Railgun_Sekireigan_Shockeroo TotalRoughSearchTime (lesser-the-better): 1283.000000clocks
Or, 7 bytes long needle is found at 51,265,299KB/12345clocks = ((51265299/12345)*1000)/1024/1024= 3.960GB/s on Core i5 laptop!!!
...
Or, 589 bytes long needle is found at 52,419,431KB/1283clocks = ((52419431/1283)*1000)/1024/1024= 38+GB/s on Core i5 laptop!!!

For now, I cannot see how to boost 'Shockeroo', any ideas?

An add-on from 2013-Nov-07:

One step closer to Railgun_MIMINO...

An overlooked optimization in the main loop of BMH 2+2 and BMH 4+4 sections has been finally done, so here comes 'Railgun_Sekireigan_Shriekeroo' shriekingly faster than current fastest. The main loop of Section #3a:

C++
while (i <= cbTarget-cbPattern) {
    Gulliver = 1;
    if ( bm_Horspool_Order2[*(unsigned short *)&pbTarget[i+cbPattern-1-1]] != 0 ) {
        //if ( Gulliver == 1 ) { // Means the Building-Block order 2 is found somewhere i.e. NO MAXIMUM SKIP
        if ( bm_Horspool_Order2[*(unsigned short *)&pbTarget[i+cbPattern-1-1-2]] == 0 ) Gulliver = cbPattern-(2-1)-2; else {
            if ( *(uint32_t *)&pbTarget[i] == ulHashPattern) {
                count = countSTATIC; // Last two chars already matched, to be fixed with -2
                while ( count !=0 && *(char *)(pbPattern+
                       (countSTATIC-count)+4) == *(char *)(&pbTarget[i]+(countSTATIC-count)+4) )
                    count--;
                if ( count == 0) return(pbTarget+i);
            }
        }
        //}
        // Trying to strengthen the Skip performance... here ONLY one additional lookup,
        // for better/longer jumps more such lookups, unrolled to be added. 
        //if ( bm_Horspool_Order2[*(unsigned short *)&pbTarget[i+cbPattern-1-1-2]] == 0 ) Gulliver = cbPattern-(2-1)-2;
        // Dummy me, the above line should play two roles (by putting
        // it before the comparison cycle - which is nastily slow) instead of one:
        // 1] If it is 0 that discards the need for further comparing - they cannot be equal.
        // 2] Strengthen the skip.
    } else Gulliver = cbPattern-(2-1);
    i = i + Gulliver;
    //GlobalI++; // Comment it, it is only for stats.
}

The body of 'Railgun_Sekireigan_Shriekeroo' is this:

  • Section #1: Simplified Karp-Rabin for shortest (2..3 bytes) needles;
  • Section #2: Quadruplet for shorter (4 bytes and bigger) needles and for "short" haystacks;
  • Section #3a: Boyer-Moore-Horspool order 2 (2+2) for short (4..22 bytes) needles and for "long" haystacks;
  • Section #3b: Boyer-Moore-Horspool pseudo-order 4 (4+4) for long (23..711 bytes) needles and for "long" haystacks;
  • Section #4: Simplified Horspool pseudo-order 12 for longest (712 bytes and bigger) needles and for "long" haystacks.

The benchmark with its source at: http://www.sekireigan.com/_KAZE_32bit_memmem-like_showdown_Shriekeroo.7z

Railgun_Quadruplet/Railgun_Sekireigan_Shriekeroo_Full-Fledged_MEMMEM_WIKI.png

Railgun_Quadruplet/Railgun_Sekireigan_Shriekeroo_Full-Fledged_MEMMEM_ACGT.png

Well, I have one more idea left (to compare backwards and to lookup the mismatched position, thus Gulliver won't be 1 but bigger) this revision will be named 'Railgun_Sekireigan_Shockeroo'.

An add-on from 2013-Nov-05:

FLY DUMBO FLY!

Long story short: 'Railgun_Sekireigan_TchittoGritto' dominates everywhere for XML/English texts, as for 'ACGT' i.e. GENOME files it is 200%-400% faster than classical BMH!

I have been stunned, the approach used in 'Piri' revision 2 named 'TchittoGritto' is so alacritous that it trumped even the gorgeous BNDM which excels at GENOME data!

For 'ACGT' data, after traversing 212MB:

Needle 20 chars:
BNDM_32 performance: 1235KB/clock
Railgun_Sekireigan_TchittoGritto performance: 2012KB/clock
Boyer_Moore-Horspool performance: 369KB/clock

Needle 30 chars:
BNDM_32 performance: 1659KB/clock
Railgun_Sekireigan_TchittoGritto performance: 2241KB/clock
Boyer_Moore-Horspool performance: 734KB/clock


What can I say, BNDM is maybe the most beautiful etude I have ever seen, but Railgun_Sekireigan_TchittoGritto outflies it, simply the hashing beats the DAWG - ONLY A BIG ENOUGH TABLE IS NEEDED TO DO SO! Moreover, for XML/English texts, BNDM can't stand a chance against 'TchittoGritto'. The speed stunning ends not here, with proper needle threshold tweaks the boost (compared to BMH) can reach the CCRRAAZZYY 700%, see further below.

BNDM stands for "Backward Nondeterministic DAWG Matching". DAWG "Directed Acyclic Word Graph": Definition (1): A directed acyclic graph representing the suffixes of a given string in which each edge is labeled with a character. The characters along a path from the root to a node are the substring which the node represents. Definition (2): A finite state machine that recognizes a set of words.

Out of curiosity, I wanted to juxtapose BMH order 1/2/4/12 implementations by throwing at them the 'Homo sapiens chromosome 1, alternate assembly HuRef, whole genome shotgun sequence' - a 212MB small alphabet (~6 distinct chars) file. It is well-known that BMH struggles with GENOMEsque data. Here the heavily reinforced 'Railgun_Sekireigan_TchittoGritto' proves the alacrity of higher orders BMH, being an order 4 but paired i.e. the rightmost 4+4 - a far cry from real order 8. The search is for 20/30/40/50/60/70 chars long needles located at the end of the file.

As you can see from the graph below, Railgun_Sekireigan_Bari (BMH order 12) comes screaming down gaining momentum for needles 60 and 70 chars long simply because:

 

#define NeedleThresholdBIGSekirei 12+40 // Should be bigger than 'HasherezadeOrder'. BMH2 works up to this value (inclusive).

The above threshold (52) decides to switch from order 2 to order 12 i.e. 'Hasherezade' for needles 53 and bigger. 'Railgun_Sekireigan_TchittoGritto' doesn't switch to order 12 until 712 (its threshold) which in turn is far more suitable for XML/English texts - it (order 4) outspeeds order 12 there.

For 'ACGT' data:

 

Needle 60 chars:
Railgun_Sekireigan_TchittoGritto performance: 2683KB/clock
Railgun_Sekireigan_Bari performance: 3813KB/clock
Railgun_Sekireigan_Bari_bit performance: 3813KB/clock
Boyer_Moore-Horspool performance: 442KB/clock

Needle 70 chars:
Railgun_Sekireigan_TchittoGritto performance: 2619KB/clock
Railgun_Sekireigan_Bari performance: 3952KB/clock
Railgun_Sekireigan_Bari_bit performance: 3952KB/clock
Boyer_Moore-Horspool performance: 488KB/clock

For XML data:

Needle 7 chars:
Railgun_Sekireigan_TchittoGritto performance: 2133KB/clock
Railgun_Sekireigan_Bari performance: 2133KB/clock

Needle 13 chars:
Railgun_Sekireigan_TchittoGritto performance: 3200KB/clock
Railgun_Sekireigan_Bari performance: 3200KB/clock

...

Needle 201 chars:
Railgun_Sekireigan_TchittoGritto performance: 7314KB/clock
Railgun_Sekireigan_Bari performance: 4654KB/clock

Needle 230 chars:
Railgun_Sekireigan_TchittoGritto performance: 8533KB/clock
Railgun_Sekireigan_Bari performance: 5120KB/clock

Needle 255 chars:
Railgun_Sekireigan_TchittoGritto performance: 8533KB/clock
Railgun_Sekireigan_Bari performance: 5688KB/clock

Needle 589 chars:
Railgun_Sekireigan_TchittoGritto performance: 17066KB/clock
Railgun_Sekireigan_Bari performance: 10240KB/clock

The benchmark with its source at: http://www.sekireigan.com/_KAZE_32bit_memmem-like_showdown_TchittoGritto.7z

Railgun_Quadruplet/Railgun_Sekireigan_TG_Full-Fledged_MEMMEM_cut_shadow.png


- Что такое 'Чито Грито'?
- Птичка, птичка, невеличка, в общем ничего.
/Рубик и Мимино/

- What is this 'Tchitto Gritto'?
- Birdie, birdie, an unimportant one, in fact nothing special.
/Rubik and Mimino/

«Мимино» по-грузински მიმინო — «ястреб», in English Sparrowhawk, in Latin (Accipiter nisus); «сокол» — это «шевардени».

One day or rather night I will write the final mix: Railgun_MIMINO, but as in the song "my tomorrow never comes" it will be my swan song.

 

An add-on from 2013-Nov-03:

Two things.

First, thanks to Mr.Eiht, who helped me again, next results answer several interesting questions:

- How a modern CPU (i7 3930K @4.5GHz) executes 'Railgun';
- How BYTE vs BIT lookups behave;
- What the two 32bit (Microsoft VS2010 and Intel v12.1) C optimizers (Windows) can offer;
- How Core 2 vs i7 3rd generation stand side-by-side;

 

 

 

Needle 7 chars:
Railgun_Sekireigan_Bari performance: 5120KB/clock
Railgun_Sekireigan_Bari_bit performance: 3657KB/clock
Boyer_Moore-Horspool performance: 1765KB/clock

Needle 13 chars:
Railgun_Sekireigan_Bari performance: 6400KB/clock
Railgun_Sekireigan_Bari_bit performance: 5688KB/clock
Boyer_Moore-Horspool performance: 3011KB/clock

Needle 21 chars:
Railgun_Sekireigan_Bari performance: 7314KB/clock
Railgun_Sekireigan_Bari_bit performance: 7314KB/clock
Boyer_Moore-Horspool performance: 3657KB/clock

Needle 28 chars:
Railgun_Sekireigan_Bari performance: 8533KB/clock
Railgun_Sekireigan_Bari_bit performance: 7314KB/clock
Boyer_Moore-Horspool performance: 4266KB/clock

Needle 37 chars:
Railgun_Sekireigan_Bari performance: 8533KB/clock
Railgun_Sekireigan_Bari_bit performance: 7314KB/clock
Boyer_Moore-Horspool performance: 4654KB/clock

Needle 51 chars:
Railgun_Sekireigan_Bari performance: 10240KB/clock
Railgun_Sekireigan_Bari_bit performance: 8533KB/clock
Boyer_Moore-Horspool performance: 5120KB/clock


Quick answers:
* Railgun_Sekireigan_Bari dominates everywhere.
* Still the instructions needed to convert BYTE to BIT lookup take significant time.
* Intel compiler generates much faster code for BITwise version, yum-yum.
* Core 2 holds the line, it is still very good, given that my 'Bonboniera' is 2200MHz and Mr.Eiht's 'Redemption' is 4500MHz.

Second, just wanted to try an old idea once more (loading 4bytes at once at the right as in the very first Railgun where such fetch was done at the left), it resulted in appearance of 'Railgun_Sekireigan_Piri' - a much faster for 10+ long needles revision.

 

 

 

The body of 'Railgun_Sekireigan_Piri' is this:

 

  • Section #1: Simplified Karp-Rabin for shortest (2..3 bytes) needles;
  • Section #2: Quadruplet for shorter (4 bytes and bigger) needles and for "short" haystacks;
  • Section #3: Boyer-Moore-Horspool pseudo-order 4 for short (4..198 bytes) needles and for "long" haystacks;
  • Section #4: Simplified Horspool pseudo-order 12 for longest (199 bytes and bigger) needles and for "long" haystacks.


This first revision wags on one leg - reinforcing the stance (i.e. SKIP performance) by checking the previous 4 bytes (at left) should make Railgun to SCREAM deafeningly!

Possible improvements for Section #3:
- stronger "hash", the addition alone is the fastest;
- for needles shorter than 12 returning to order 2;
- increasing the 'NeedleThresholdBIGSekireiPiri' from 12+180 to 600;
- uniting the two lookup arrays and making order 4 section BITwise - thus one only 8KB array will be the overhead.

 

 

The benchmark with its source at:
http://www.sekireigan.com/_KAZE_32bit_memmem-like_showdown_Piri.7z

 

 

Railgun_Quadruplet/Railgun_Sekireigan_Piri_Full-Fledged_MEMMEM_cut_shadow.png

Section #3 is given below - the other three sections remain intact:

C++
...
if ( cbPattern<=NeedleThresholdBIGSekireiPiri ) { 
        countSTATIC = cbPattern-2-2;
        ulHashPattern = *(uint32_t *)(pbPattern); // First four bytes
        //ulHashTarget = *(unsigned short *)(pbPattern+cbPattern-1-1); // Last two bytes
        i=0;
        //for (a=0; a < 256*256; a++) {bm_Horspool_Order2[a]= cbPattern-1;} // cbPattern-(Order-1) for Horspool; 'memset' if not optimized
        //for (j=0; j < cbPattern-1; j++) bm_Horspool_Order2[*(unsigned short *)(pbPattern+j)]=j; // Rightmost appearance/position is needed
        for (a=0; a < 256*256; a++) {bm_Horspool_Order2[a]=0;}
        // In line below we "hash" 4bytes to 2bytes i.e. 16bit table, how to compute TOTAL number of BBs, 'cbPattern - Order + 1' is the number of BBs for text 'cbPattern' bytes long, for example, for cbPattern=11 'fastest fox' and Order=4 we have BBs = 11-4+1=8:
        //"fast"

        //"aste"

        //"stes"
        //"test"
        //"est "

        //"st f"

        //"t fo"

        //" fox"
        //for (j=0; j < cbPattern-4+1; j++) bm_Horspool_Order2[( *(unsigned short *)(pbPattern+j+0) + *(unsigned short *)(pbPattern+j+2) ) & ( (1<<16)-1 )]=1;
        for (j=0; j < cbPattern-4+1; j++) bm_Horspool_Order2[( (*(uint32_t *)
          (pbPattern+j+0)>>16)+(*(uint32_t *)
          (pbPattern+j+0)&0xFFFFF) ) & ( (1<<16)-1 )]=1;
        while (i <= cbTarget-cbPattern) {
            Gulliver = 1;
            //if ( bm_Horspool_Order2[*(unsigned short *)&pbTarget[i+cbPattern-1-1]] != 0 ) {
            //if ( bm_Horspool_Order2[( *(unsigned short *)&pbTarget[i+cbPattern-1-1-2] + 
            //      *(unsigned short *)&pbTarget[i+cbPattern-1-1] ) & ( (1<<16)-1 )] != 0 ) {
            //  001d2 0f b7 6c 01 fc   movzx ebp, WORD PTR [-4+ecx+eax]       
            //  001d7 0f b7 5c 01 fe   movzx ebx, WORD PTR [-2+ecx+eax]       
            //  001dc 03 eb            add ebp, ebx                           
            // TO-DO: The above line should be with one only memory access i.e.,
            // to fetch 4 bytes at once and then to mix the low and high part.
            if ( bm_Horspool_Order2[( (*(uint32_t *)&pbTarget[i+cbPattern-1-1-2]>>16)+
                   (*(uint32_t *)&pbTarget[i+cbPattern-1-1-2]&0xFFFFF) ) & ( (1<<16)-1 )] != 0 ) {
                //if ( Gulliver == 1 ) { // Means the Building-Block order 2 is found somewhere i.e. NO MAXIMUM SKIP
                    if ( *(uint32_t *)&pbTarget[i] == ulHashPattern) {
                        count = countSTATIC; // Last two chars already matched, to be fixed with -2
                        while ( count !=0 && *(char *)(pbPattern+(countSTATIC-
                               count)+4) == *(char *)(&pbTarget[i]+(countSTATIC-count)+4) )
                            count--;
                        if ( count == 0) return(pbTarget+i);
                    }
                //}
                // Trying to strengthen the Skip performance... here ONLY one additional
                // lookup, for better/longer jumps more such lookups, unrolled to be added. 
                // Since 'Piri' revision not trying anymore, instead trying to reduce main loop size while going to order 4.
                //if ( bm_Horspool_Order2[*(unsigned short *)&pbTarget[i+cbPattern-1-1-2]] == 0 ) Gulliver = cbPattern-(2-1)-2;
            } else Gulliver = cbPattern-(2-1)-2; // -2 because we check the 4 rightmost bytes not 2.
            i = i + Gulliver;
            //GlobalI++; // Comment it, it is only for stats.
        }
        return(NULL);
// The above loop in Assembly:
/*
; mark_description "Intel(R) C++ Compiler XE for applications 
; running on IA-32, Version 12.1.1.258 Build 20111011";
; mark_description "-O3 -FAcs -Festrstr_SHORT-SHOWDOWN_LV_WIKI_32bit_IntelV12";
.B2.29:                         
  001d2 8b 5c 01 fc      mov ebx, DWORD PTR [-4+ecx+eax]        
  001d6 8b eb            mov ebp, ebx                           
  001d8 c1 ed 10         shr ebp, 16                            
  001db 03 eb            add ebp, ebx                           
  001dd 0f b7 fd         movzx edi, bp                          
  001e0 0f b6 1c 3c      movzx ebx, BYTE PTR [esp+edi]          
  001e4 85 db            test ebx, ebx                          
  001e6 74 52            je .B2.38 
.B2.30:                         
  001e8 3b 14 06         cmp edx, DWORD PTR [esi+eax]           
  001eb 0f 85 3c 04 00 
        00               jne .B2.83 
.B2.31:                         
  001f1 8b 9c 24 08 00 
        01 00            mov ebx, DWORD PTR [65544+esp]         
  001f8 8b eb            mov ebp, ebx                           
  001fa f7 dd            neg ebp                                
  001fc 85 db            test ebx, ebx                          
  001fe 0f 84 21 04 00 
        00               je .B2.82 
.B2.32:                         
  00204 8b bc 24 04 00 
        01 00            mov edi, DWORD PTR [65540+esp]         
  0020b 8b b4 24 00 00 
        01 00            mov esi, DWORD PTR [65536+esp]         
  00212 03 f8            add edi, eax                           
.B2.33:                         
  00214 8a 54 35 04      mov dl, BYTE PTR [4+ebp+esi]           
  00218 3a 54 3d 04      cmp dl, BYTE PTR [4+ebp+edi]           
  0021c 0f 85 ee 03 00 
        00               jne .B2.81 
.B2.34:                         
  00222 45               inc ebp                                
  00223 4b               dec ebx                                
  00224 75 ee            jne .B2.33 
.B2.35:                         
  00226 8b b4 24 2c 00 
        01 00            mov esi, DWORD PTR [65580+esp]         
.B2.36:                         
  0022d 03 c6            add eax, esi                           
  0022f 81 c4 18 00 01 
        00               add esp, 65560                         
  00235 5d               pop ebp                                
  00236 5b               pop ebx                                
  00237 5f               pop edi                                
  00238 5e               pop esi                                
  00239 c3               ret                                    
.B2.38:                         
  0023a 8b 9c 24 10 00 
        01 00            mov ebx, DWORD PTR [65552+esp]         
.B2.39:                         
  00241 03 c3            add eax, ebx                           
  00243 3b 84 24 14 00 
        01 00            cmp eax, DWORD PTR [65556+esp]         
  0024a 76 86            jbe .B2.29 
.B2.40:                         
  0024c 33 c0            xor eax, eax                           
  0024e 81 c4 18 00 01 
        00               add esp, 65560                         
  00254 5d               pop ebp                                
  00255 5b               pop ebx                                
  00256 5f               pop edi                                
  00257 5e               pop esi                                
  00258 c3               ret                                    
.B2.41:                         
...
.B2.81:                         
  00610 89 b4 24 00 00 
        01 00            mov DWORD PTR [65536+esp], esi         
  00617 8b 94 24 0c 00 
        01 00            mov edx, DWORD PTR [65548+esp]         
  0061e 8b b4 24 2c 00 
        01 00            mov esi, DWORD PTR [65580+esp]         
.B2.82:                         
  00625 85 db            test ebx, ebx                          
  00627 0f 84 00 fc ff 
        ff               je .B2.36 
.B2.83:                         
  0062d bb 01 00 00 00   mov ebx, 1                             
  00632 e9 0a fc ff ff   jmp .B2.39 
; The whole section (main loop plus the exit) is 00258-001d2+1 + 00632-00610+5 = 135+39 bytes long.
*/
    } else { // if ( cbPattern<=NeedleThresholdBIGSekireiPiri )
...

As the night is leaving - silently retreating down an empty hall
Suddenly a stirring - finally recurring where I let it fall
Following the wanderings of a dream - a dream that keeps my soul alive
Believing in an open sky - believing in a love.

Dancing with a stranger - careless of the danger there within his smile
While the dew was forming - breathing in the morning like a sleeping child.

If the memory of the light should fade - horizons reaching cold and blue
Until your heart is free to fly - then I will keep the sun for you
Until you touch the open sky - then I will keep the sun for you
Until we never say good-bye - then I will keep the sun for you.

/Giorgio Moroder - (Theme From) Midnight Express (vocal Chris Bennett) Lyrics/

For Piri.

An add-on from 2013-Oct-16:

It's time for the undisputed (for non-pathological cases, Mischa knows better) FASTEST MEMMEM named 'Railgun_Sekireigan_Bari'. Quite natural next step to 'Railgun_Sekireigan': now checking 2 more bytes at the left of our rightmost 2 bytes. This second check allows skips/jumps to be less often 1 byte, resembling Sekirei standing on two legs thus allowing much faster sway compared to (revision 1) where Sekirei relies/sways on one leg only (see the 'Comments & Discussions' section and video where speed of wagging is apparently lower on one leg). First and Second revision differ in one line only:

if ( bm_Horspool_Order2[*(unsigned short *)&pbTarget[i+cbPattern-1-1-2]] == 0 ) Gulliver = cbPattern-(2-1)-2;

A quick example:

Haystack:   "...my friend Bari..."

Pattern:       "my friend Piri"
Jump for r.1:   "my friend Piri" +1
Jump for r.2:             "my friend Piri" +11
Window:                  "Bari"

First Order 2 check:       "ri" it is within the lookup table i.e. its slot holds 1, skip is 1 position
Second Order 2 check:    "Ba" it is outwith the lookup table
         i.e. its slot holds 0, skip is cbPattern-(2-1)-2=14-(2-1)-2=11 positions
Lookup table, next slots hold 1, rest 0:
'my'
'y '
' f'
'fr'
'ri'
'ie'
'en'
'nd'
'd '
' P'
'Pi'
'ir'
'ri'
Pattern Building-Block "Ba" holds 0.

The benchmark with its source at:
http://www.sekireigan.com/_KAZE_32bit_memmem-like_showdown.7z

Railgun_Quadruplet/Railgun_Sekireigan_Bari_Full-Fledged_MEMMEM_cut.png

A bitter cup for me and my mom these days, our fidel dog died in his (not its) prime, only 5 years old, how to say goodbye?
This function is dedicated to our animal friend Bari, one undistorted image of LIFE.

Dark the stars and dark the moon,
Hush the night and the morning loon,
Tell the horses and beat on your drum,
Gone their master, gone their son,

Dark the oceans, dark the sky,
Hush the whales and the ocean tide,
Tell the salt marsh and beat on your drum,
Gone their master, gone their son,

Dark to light and light to dark,
Three black carriages, three white carts,
What brings us together is what pulls us apart,
Gone our brother, gone our heart.
...

/'Gone' lyrics written and performed by Ioanna Gika/

An add-on from 2013-Oct-11:

It's time Sultaness of MEMMEM functions 'Hasherezade' to step back and 'Sekirei' to step in.

Right to the point, 'Sekireigan' stands for "Eye of the Sekirei". It is a technique in swordsmanship allowing to move extremely fast and attack from many angles simultaneously. Yukimura can utilize the Sekireigan to access extreme speed. As for 'Sekirei' it stands for 'Forest Wagtail', forest wagtail differs from its Motacilla relatives in its strange habit of swaying its tail from side to side, not wagging it up and down like other wagtails. Do you see the connection between lateral swaying (while removing the vertical swaying altogether) and the PLAINNESS of Boyer-Moore-Horspool order 2 maximum simplified main loop (no distraction to look up other skip values like in 'Hasherezade') which SWAYS left-to-right-to-left-... at SICK pace: Looking AT RIGHT of Haystack's window last 2 bytes for presence into BYTEwise 64KB table if findable then starting the comparison AT LEFT, and so forth. The key for this IMMENSE boost is the MAXIMUM possible simplification of Horspool order 2 section, the main loop became only 99 bytes!

Railgun_Quadruplet/Railgun_Sekireigan_Full-Fledged_MEMMEM_cut.png

Why didn't I make it 2 years ago, don't ask me, if you do the answer is - I didn't want to ... on purpose, the SKIP performance was good and I thought "it is better off slower but stronger", now this overkill is ended. In fact this simplification is the only thing that is different in 'Railgun_Sekireigan' compared to its overkill/counterpart 'Railgun_Hasherezade' revision 3.

Well, the only weak point lies in the increment/skip by 1 in BMH order 2 loop for patterns 5-~28 bytes long, to boost them the 64KB table should house (as it was) the skips for Building-Blocks instead of 0/1 as I simplified, this range is the only one where Mr.Volnitsky has some 300-400MB/s advantage. It is this way on purpose, maybe in future revisions I will make the BMH lookup table BITwise i.e. 8 times smaller which is 8KB - fully fittable in Vishera's 16KB L1 cache. My wish is to make a rich showdown on different CPUs (Haswell, Vishera mostly), sadly no opportunity in sight, if someone wants to help me and run the benchmark I will gladly share the results here.

How to compute TOTAL number of BBs, 'cbPattern - Order + 1' is the number of BBs for text 'cbPattern' bytes long. For example, for cbPattern=11 'fastest fox' and Order=8 we have BBs = 11-8+1=4:

'fastest '
'astest f'
'stest fo'
'test fox'

And for those who want to deepen their understanding and to have one supertool extracting the Building-Blocks from ANY file I included in next package my BBs ripper LeprechaunBBhex, written in C and compilable both for Windows/Linux it works on incredible speeds. If you run the batch 'RIP_BBs.bat' it will create and report this:

For our test file:

enwiki-20130904-pages-articles.7z.001:  52,428,800 bytes; 52,428,800 - Order + 1 Total 'Order' bytes long Building-Blocks
ENWIKI_50MB_order2.txt               :      93,294 bytes;                 15,549 Distinct Building-Blocks 2 bytes long
ENWIKI_50MB_order4.txt               :  10,593,630 bytes;              1,059,363 Distinct Building-Blocks 4 bytes long
ENWIKI_50MB_order8.txt               : 225,515,718 bytes;             12,528,651 Distinct Building-Blocks 8 bytes long
ENWIKI_50MB_order12.txt              : 726,063,728 bytes;             27,925,528 Distinct Building-Blocks 12 bytes long

From above we can obtain the REDUNDANCY of WIKI 50MB dump:
- for order  2:     15,549/(52,428,800 - 2 + 1)*100%  =  0.02%
- for order  4:  1,059,363/(52,428,800 - 4 + 1)*100%  =  2.02%
- for order  8: 12,528,651/(52,428,800 - 8 + 1)*100%  = 23.89%
- for order 12: 27,925,528/(52,428,800 - 12 + 1)*100% = 53.26%

The less the percentage the less load on hash function there will be, which is good due to less number of collisions. Given that order 2 has 15,549 distinct BBs and lookup table has 65,536 slots means that a good hash function could offer zero collisions. As for order 12, the default hash size is: #define HashTableSizeSekirei 17-1 which gives 27,925,528 keys vs 65,536 slots, or 426:1, not a pleasant ratio. In one test when the HashTableSizeSekirei was 20 not 17-1 (which is 26:1) the speed drop (for 17-1) was of no real concern.

And to see what is the difference between XML (WIKIPEDIA) and TXT (English ebook Text) formats here come the stats for OSHO.TXT:

OSHO.TXT        :   206,908,949 bytes; 206,908,949 - Order + 1 Total 'Order' bytes long Building-Blocks
OSHO_order2.txt :        26,544 bytes;                   4,424 Distinct Building-Blocks 2 bytes long
OSHO_order4.txt :     2,480,190 bytes;                 248,019 Distinct Building-Blocks 4 bytes long
OSHO_order8.txt :   161,216,928 bytes;               8,956,496 Distinct Building-Blocks 8 bytes long
OSHO_order12.txt: 1,138,861,490 bytes;              43,802,365 Distinct Building-Blocks 12 bytes long

From above we can obtain the REDUNDANCY of OSHO books:
- for order  2:      4,424/(206,908,949 - 2 + 1)*100%  =  0.002%
- for order  4:    248,019/(206,908,949 - 4 + 1)*100%  =  0.11%
- for order  8:  8,956,496/(206,908,949 - 8 + 1)*100%  =  4.32%
- for order 12: 43,802,365/(206,908,949 - 12 + 1)*100% = 21.16%

Numbers say that REDUNDANCIES for our XML and TXT files are different by great margin. Roughly speaking, OSHO's orders 2 and 4 being four times less than WIKI's suggest hashers, hashing 2/4 bytes keys, are gonna be more effective for OSHO.

http://www.sanmayce.com/Downloads/_KAZE_32bit_memmem-like_showdown.7z

Scaling up is where MEMMEM screams, instead of needle/haystack, as here, 13/52,428,800 think or rather feel the upscaled case (the same proportion) when 52,428,800/211,444,543,803,076. Ouch, 200 terabytes.

I salute everyone who feels the beat of unstoppableness with 'KAT DELUNA - Unstoppable':

...
Save your pity for tomorrow
When I smash it like a ball in the club
On the news, I'll be rockin' like a rockstar
I'm unstoppable
Unstoppable
Unstoppable
Unstoppable

So supersonic almost ironic, I play it like a game of chess
You wanna ride like a queen, well here you got the best
I got the moves to the grooves, you like to imitate it
That's for free, yeah that's me! Uh-huh

Wanna whip me something crazy moving faster than the speed of light
Show you the way an entertainer always gets them right
Gonna need a back-up plan 'cause I'm givin you a shot
Watch me flippin now
Yeah that's me! Uh-huh

In my red shoes and my swags so sick I need med school
I never make a bad move but I can make her bed move
You like my attitude, ain't nobody else at my magnitude
See ain't nothing stopping me, I feel like ain't nothing after me
...

Now shake your body,
Like you got the jungle fever rushing through your veins
We gettin' crazy and the club is banging through your brain
When I say jump, you jump, jump is knockin' knockin'
I wanna see you jump like this beat is rockin'

Love bite on my bracelet, kinda fancy, but it's how I play
You wanna come along and see and holler at me
Maybe I lost it, maybe not, cause I got game
Like a burning flame, yeah that's me, uh huh

Fa-fa-fa-fa-fa-follow the leader
Fa-fa-fa-fa-fa-follow the leader
Fa-fa-fa-fa-fa-follow the leader

"... my swags so sick I need med school ...", ha-ha-ha-ha-ha.

An add-on from 2013-Oct-09:

Making some tweaks, listening to Chester and removing the 255 chars limitation for Needle lengths resulted in appearance of 'Railgun_Hasherezade' revision 3 - a full-fledged MEMMEM function, at last. This third revision now searches for Patterns/Needles longer than 255 chars and uses Horspool order 12 (the last 12 bytes of Haystack window are checked). Those 12 bytes are hashed, each slot is 0 or 1, a flag for presence. The hash table is adjustable through DEFINE, also whether it is Bitwise or Bytewise.

The skeleton of 'Railgun_Hasherezade' is this:

  • Simplified Karp-Rabin for shortest needles;
  • Quadruplet for shorter needles;
  • Boyer-Moore-Horspool/Boyer-Moore-Sunday order 1 and Boyer-Moore-Horspool order 2 for short needles;
  • Simplified Horspool order 12 for longest needles.

This variety of different search etudes makes the function highly customizable, especially when you need even higher order for Horspool, say 128, then the fastest ('FNV1A-penumbra' here 'FNV1A_YoshimitsuTRIAD' used) known to me hash function comes gun blazing, currently order 12 uses its main loop once.

Railgun_Quadruplet/Railgun_Hasherezade_revision_3_Full-Fledged_MEMMEM.png

With big patterns the bigger the order the bigger the skips/jumps are. For example if the pattern is 13 bytes and order is 12 the jump is 13-(12-1) or only 2 positions, but with 8192 bytes long pattern becomes 8192-(12-1). In fact order 12 (3x4bytes are hashed) is only the start when big blocks are to be compared, in next package I search 7/13/21/28/37/51/65/83/98/113/130/140/151/162/201/230/255/589/1080/1755 bytes long patterns into 'enwiki-20130904-pages-articles.7z.001' 52,428,800 bytes long.

http://www.sanmayce.com/Downloads/_KAZE_32bit_memmem-like_showdown.7z

Each pattern is located at the end of this 50MB file, that is one hit exists for each search.

The quest started as 'strstr' replacer of all old approaches for short haystacks/needles (line-by-line searching mostly) and reached a point that the function has become a full-fledged 'memmem' etude.

While the world is sleeping
I can hear the nightly quiet
I can feel the beating of the night

I can see some thieves
They are stealing my guitar
They can't play
The player is my heart

Don't look in the mirror
Hold the line
Have to win
Don't fool away your time

Just remember all the early days
Now come back
But please don´t lose your ways

While the world is dreaming
I can hear the sound of town
But this nighttime falling to the ground

In the midnight hours
I have killed all my hopes
Just my crime's been written on the walls

CHESTER - HOLD THE LINE (1987) is so incredible - an immortal song/etude.
In my eyes these etudes (no matter musical or programming) have so much in common, 'Finer Feelings' as Kylie sings.

An add-on from 2013-Oct-07:

Okay, since the article was intended as a semi-answer semi-question, the latter has found its answer thus completing the search for FASTEST strstr-like here 'memmem' C function, I think. Two years ago, having finished the last revision of my strstr quest I found by chance the site of Leonid Volnitsky, its description reads: "Fastest substring search algorithm for average case. Source, benchmark."

Immediately I wrote an email asking him to test my functions along with his, but for some reason no answer came back, I said to myself 'whatever' and dismissed the whole idea of clashing on. I waited to see whether he will find time to include my functions in his long list of rivals, but no and no, finally I wanted to juxtapose, for myself, his beautiful idea to use hash (I had had the exact idea around 2010 but didn't like the chaining hashing back then and still don't) and 'Railgun_Gulliver' and 'Railgun_Hasherezade'. My testbed is made of two 32bit executables - one compiled with Microsoft VS2010 (/Ox used), the other compiled with Intel 12.1 (/O3 used). These two executables search 12 patterns with lengths 021/028/037/051/065/083/098/113/130/140/151/162 chars into WIKIPEDIA dump, first 50MB. All the patterns occur in the end of the file within a same line, thus the traversal is 50MB for all searches. Warning! The used in the package 'volnitsky2' & 'volnitsky4' are not complete, they are for tests only. And I hate when results are given but they are not reproducible, therefore (as always) I included all in this benchmark package:

http://www.sanmayce.com/Downloads/_KAZE_32bit_strstr-like_showdown.7z

Railgun_Quadruplet/Core2DuoT7500_WIKI.png

There you can find the assembler equivalent along with the C source as the compilers generated it:

strstr_SHORT-SHOWDOWN_LV_WIKI_32bit_MicrosoftV16.cod
strstr_SHORT-SHOWDOWN_LV_WIKI_32bit_IntelV12.cod

Railgun_Quadruplet/Core2DuoT7500_OSHO.png

Obviously 'volnitsky2' dominates with up to (3413-2048)/2048*100% = 67% boost, very very good. With OSHO.TXT and shorter needles goes up to 280%, simply super! Below the dump for Wikipedia 50MB is given:

32bit_MicrosoftV16 executables:

Doing Search for Pattern(7bytes) into String(52428800bytes) as-one-line ...
Railgun_Quadruplet_7Gulliver performance: 1651KB/clock
Railgun_Hasherezade_O4_BIT_14bit performance: 1505KB/clock
Railgun_Hasherezade_O8_BYTE_14bit performance: 1462KB/clock
volnitsky2 performance: 2327KB/clock
volnitsky4 performance: 1651KB/clock

Doing Search for Pattern(13bytes) into String(52428800bytes) as-one-line ...
Railgun_Quadruplet_7Gulliver performance: 2327KB/clock
Railgun_Hasherezade_O4_BIT_14bit performance: 1969KB/clock
Railgun_Hasherezade_O8_BYTE_14bit performance: 1896KB/clock
volnitsky2 performance: 3200KB/clock
volnitsky4 performance: 3011KB/clock

Doing Search for Pattern(21bytes) into String(52428800bytes) as-one-line ...
Railgun_Quadruplet_7Gulliver performance: 3011KB/clock
Railgun_Hasherezade_O4_BIT_14bit performance: 2560KB/clock
Railgun_Hasherezade_O8_BYTE_14bit performance: 2560KB/clock
volnitsky2 performance: 3657KB/clock
volnitsky4 performance: 3657KB/clock

Doing Search for Pattern(28bytes) into String(52428800bytes) as-one-line ...
Railgun_Quadruplet_7Gulliver performance: 3200KB/clock
Railgun_Hasherezade_O4_BIT_14bit performance: 2844KB/clock
Railgun_Hasherezade_O8_BYTE_14bit performance: 2844KB/clock
volnitsky2 performance: 3657KB/clock
volnitsky4 performance: 3657KB/clock

Doing Search for Pattern(37bytes) into String(52428800bytes) as-one-line ...
Railgun_Quadruplet_7Gulliver performance: 3200KB/clock
Railgun_Hasherezade_O4_BIT_14bit performance: 3011KB/clock
Railgun_Hasherezade_O8_BYTE_14bit performance: 3011KB/clock
volnitsky2 performance: 3657KB/clock
volnitsky4 performance: 3657KB/clock

Doing Search for Pattern(51bytes) into String(52428800bytes) as-one-line ...
Railgun_Quadruplet_7Gulliver performance: 3413KB/clock
Railgun_Hasherezade_O4_BIT_14bit performance: 3200KB/clock
Railgun_Hasherezade_O8_BYTE_14bit performance: 3200KB/clock
volnitsky2 performance: 3657KB/clock
volnitsky4 performance: 3413KB/clock

Doing Search for Pattern(65bytes) into String(52428800bytes) as-one-line ...
Railgun_Quadruplet_7Gulliver performance: 3413KB/clock
Railgun_Hasherezade_O4_BIT_14bit performance: 3413KB/clock
Railgun_Hasherezade_O8_BYTE_14bit performance: 3200KB/clock
volnitsky2 performance: 3657KB/clock
volnitsky4 performance: 3657KB/clock

Doing Search for Pattern(83bytes) into String(52428800bytes) as-one-line ...
Railgun_Quadruplet_7Gulliver performance: 3413KB/clock
Railgun_Hasherezade_O4_BIT_14bit performance: 3200KB/clock
Railgun_Hasherezade_O8_BYTE_14bit performance: 3413KB/clock
volnitsky2 performance: 3657KB/clock
volnitsky4 performance: 3011KB/clock

Doing Search for Pattern(98bytes) into String(52428800bytes) as-one-line ...
Railgun_Quadruplet_7Gulliver performance: 3200KB/clock
Railgun_Hasherezade_O4_BIT_14bit performance: 3011KB/clock
Railgun_Hasherezade_O8_BYTE_14bit performance: 3200KB/clock
volnitsky2 performance: 3938KB/clock
volnitsky4 performance: 3657KB/clock

Doing Search for Pattern(113bytes) into String(52428800bytes) as-one-line ...
Railgun_Quadruplet_7Gulliver performance: 2438KB/clock
Railgun_Hasherezade_O4_BIT_14bit performance: 2560KB/clock
Railgun_Hasherezade_O8_BYTE_14bit performance: 2694KB/clock
volnitsky2 performance: 2694KB/clock
volnitsky4 performance: 3413KB/clock

Doing Search for Pattern(130bytes) into String(52428800bytes) as-one-line ...
Railgun_Quadruplet_7Gulliver performance: 1969KB/clock
Railgun_Hasherezade_O4_BIT_14bit performance: 2226KB/clock
Railgun_Hasherezade_O8_BYTE_14bit performance: 2226KB/clock
volnitsky2 performance: 3200KB/clock
volnitsky4 performance: 3413KB/clock

Doing Search for Pattern(140bytes) into String(52428800bytes) as-one-line ...
Railgun_Quadruplet_7Gulliver performance: 2048KB/clock
Railgun_Hasherezade_O4_BIT_14bit performance: 2226KB/clock
Railgun_Hasherezade_O8_BYTE_14bit performance: 2327KB/clock
volnitsky2 performance: 3413KB/clock
volnitsky4 performance: 3200KB/clock

Doing Search for Pattern(151bytes) into String(52428800bytes) as-one-line ...
Railgun_Quadruplet_7Gulliver performance: 2048KB/clock
Railgun_Hasherezade_O4_BIT_14bit performance: 2438KB/clock
Railgun_Hasherezade_O8_BYTE_14bit performance: 2327KB/clock
volnitsky2 performance: 3200KB/clock
volnitsky4 performance: 3200KB/clock

Doing Search for Pattern(162bytes) into String(52428800bytes) as-one-line ...
Railgun_Quadruplet_7Gulliver performance: 2133KB/clock
Railgun_Hasherezade_O4_BIT_14bit performance: 2560KB/clock
Railgun_Hasherezade_O8_BYTE_14bit performance: 2560KB/clock
volnitsky2 performance: 3413KB/clock
volnitsky4 performance: 3413KB/clock

The bottomline: Volnitsky wrote that what all guys have overlooked, bravo. Strange, not liking hashing with probing stopped even Sanmayce to explore further. I wish someone else could have done it before him, for one, me. Seeing how he doesn't appreciate the enigmatic/superfast C etude by Stephen R. van den Berg tells me much and enough not to write him again until he gets some appreciativeness - the most important thing known to me.

An add-on from 2012-Feb-01:

Just wanted to improve Skip-Performance by checking the window's 8 rightmost bytes (Order 8 Horspool) against all needle's 8 bytes suffixes, which resulted in Railgun_Quadruplet_7Hasherezade.

Here an example is given:
For needle 'and every day a continuous cleaning goes on' 43bytes long we have 43-8+1 Building-Blocks i.e. suffixes 8bytes long as follows:

and ever 117,013
nd every 108,604
d every  155,516
 every d 170,959
every da 115,291
very day 073,191
ery day  097,042
ry day a 083,793
y day a  011,244
 day a c 115,855
day a co 101,797
ay a con 222,568
y a cont 029,130
 a conti 020,978
a contin 258,405
 continu 252,691
continuo 123,607
ontinuou 056,546
ntinuous 135,857
tinuous  015,332
inuous c 250,584
nuous cl 048,224
uous cle 106,616
ous clea 137,020
us clean 035,751
s cleani 178,989
 cleanin 213,855
cleaning 063,337
leaning  097,138
eaning g 062,366
aning go 247,590
ning goe 036,571
ing goes 041,142
ng goes  228,365
g goes o 229,696
 goes on 176,852

Each number indicates which HashSlot is occupied (here 0..262143 slots i.e. 18bit used).

Okay here is a simplified example:

string: tumba-lumba-tumba-lumba-tumba-lumba-tumba-lumba-tumba-lumba-tumba-lumba-tumba-lumba
needle: and every day a continuous cleaning goes on
                      and every day a continuous cleaning goes on                        +14 Horspool Order 1
                                            and every day a continuous cleaning goes on  +36 Horspool Order 8

The goal: to jump when the rightmost 8bytes '-tumba-l' of window do not look like any of Needle's prefixes i.e. are not to be found. This maximum jump equals cbPattern-(Order-1) or 43-(8-1)=36.

Here the Horspool skip-table follows:

a  n  d     e  v  e  r  y     d  a  y     a     c  o  n  t  i  n  u  o  u  s     c  l  e  a  n  i  n  g     g  o  e  s     o  n
12 09 32 02 04 37 04 35 30 02 32 12 30 02 12 02 15 01 09 23 10 09 18 01 18 03 02 15 14 04 12 09 10 09 06 02 06 01 04 03 02 01 09

We have 'l' vs 'n' where 'l' points to 14, however 'Hasherezade' is able to jump/skip 43-(8-1)=36positions in case of HashCode of '-tumba-l' doesn't equal any of HashCodes of Needle's suffixes. HashCode of '-tumba-l' is 083,467 it matches none of Needle's HashCodes so we skip 36 bytes. This boost is the only difference between 'Hasherezade' and 'Gulliver' revision 2.

// Scheherezade -> Hasherezade
// [Italian: buffa, feminine of buffo, comic.]
/*
First off: I am heavily disappointed from Speed-Performance of 'Hasherezade' 
   and comparing Skip-Performance of 'Gulliver' and 'Hasherezade' doesn't make me happy, either.

Pattern: fastest fox
Doing Search for Pattern(11bytes) into String(206908949bytes) as-one-line ...

Railgun_Quadruplet_7sun performance:         1,683KB/clock / 00,775%, 26,672,940 skips/iterations
Railgun_Quadruplet_7 performance:            1,642KB/clock / 00,669%, 30,925,578 skips/iterations
Railgun_Quadruplet_7sunhorse performance:    0,944KB/clock / 00,912%, 22,663,583 skips/iterations
Railgun_Quadruplet_7deuce performance:       0,801KB/clock / 00,797%, 25,945,709 skips/iterations
Railgun_Quadruplet_7Elsiane performance:     0,704KB/clock / 00,931%, 22,213,101 skips/iterations
Railgun_Quadruplet_7Gulliver performance:    2,104KB/clock / 00,977%, 21,166,516 skips/iterations
Railgun_Quadruplet_7Hasherezade performance: 1,496KB/clock / 00,980%, 21,112,657 skips/iterations  18bit HashTable
Railgun_Quadruplet_7Hasherezade performance: 1,642KB/clock / 00,980%, 21,112,646 skips/iterations  14bit HashTable
Railgun_Quadruplet_7Hasherezade performance: 1,642KB/clock / 00,980%, 21,112,735 skips/iterations  10bit HashTable
Boyer_Moore_Flensburg performance:           1,057KB/clock / 00,669%, 30,925,578 skips/iterations

Pattern: and every day a continuous cleaning goes on
Doing Search for Pattern(43bytes) into String(206908949bytes) as-one-line ...

Railgun_Quadruplet_7sun performance:         2,557KB/clock / 01,495%, 13,832,201 skips/iterations
Railgun_Quadruplet_7 performance:            2,590KB/clock / 01,404%, 14,731,326 skips/iterations
Railgun_Quadruplet_7sunhorse performance:    1,888KB/clock / 02,222%, 09,308,136 skips/iterations
Railgun_Quadruplet_7deuce performance:       1,741KB/clock / 01,971%, 10,494,460 skips/iterations
Railgun_Quadruplet_7Elsiane performance:     1,642KB/clock / 02,229%, 09,279,871 skips/iterations
Railgun_Quadruplet_7Gulliver performance:    3,157KB/clock / 03,795%, 05,450,890 skips/iterations
Railgun_Quadruplet_7Hasherezade performance: 2,322KB/clock / 04,100%, 05,046,049 skips/iterations  18bit HashTable
Railgun_Quadruplet_7Hasherezade performance: 2,971KB/clock / 04,100%, 05,046,445 skips/iterations  14bit HashTable
Railgun_Quadruplet_7Hasherezade performance: 3,015KB/clock / 04,090%, 05,058,667 skips/iterations  10bit HashTable
Boyer_Moore_Flensburg performance:           1,683KB/clock / 01,447%, 14,294,963 skips/iterations

Pattern: And let this be your very fundamental insight... about everything. Just for one year, don't choose.
Doing Search for Pattern(99bytes) into String(206908949bytes) as-one-line ...

Railgun_Quadruplet_7sun performance:         2,845KB/clock / 02,248%, 09,200,917 skips/iterations
Railgun_Quadruplet_7 performance:            2,928KB/clock / 02,105%, 09,827,389 skips/iterations
Railgun_Quadruplet_7sunhorse performance:    2,270KB/clock / 03,283%, 06,302,096 skips/iterations
Railgun_Quadruplet_7deuce performance:       2,149KB/clock / 03,196%, 06,472,407 skips/iterations
Railgun_Quadruplet_7Elsiane performance:     2,172KB/clock / 03,282%, 06,303,154 skips/iterations
Railgun_Quadruplet_7Gulliver performance:    2,928KB/clock / 07,820%, 02,645,814 skips/iterations
Railgun_Quadruplet_7Hasherezade performance: 2,494KB/clock / 09,575%, 02,160,890 skips/iterations  18bit HashTable
Railgun_Quadruplet_7Hasherezade performance: 3,015KB/clock / 09,568%, 02,162,493 skips/iterations  14bit HashTable
Railgun_Quadruplet_7Hasherezade performance: 3,015KB/clock / 09,438%, 02,192,204 skips/iterations  10bit HashTable
Boyer_Moore_Flensburg performance:           2,349KB/clock / 02,135%, 09,689,008 skips/iterations

Pattern: Then, singing among the savage branches, it impales itself upon the longest,
sharpest spine. And, dying, it rises above its own agony to outcarol the lark and the nightingale.
Doing Search for Pattern(175bytes) into String(206908949bytes) as-one-line ...

Railgun_Quadruplet_7sun performance:         2,658KB/clock / 03,142%, 06,583,682 skips/iterations
Railgun_Quadruplet_7 performance:            2,658KB/clock / 03,095%, 06,685,179 skips/iterations
Railgun_Quadruplet_7sunhorse performance:    2,377KB/clock / 04,776%, 04,331,865 skips/iterations
Railgun_Quadruplet_7deuce performance:       2,349KB/clock / 04,728%, 04,376,097 skips/iterations
Railgun_Quadruplet_7Elsiane performance:     2,525KB/clock / 04,778%, 04,330,200 skips/iterations
Railgun_Quadruplet_7Gulliver performance:    2,245KB/clock / 12,024%, 01,720,711 skips/iterations
Railgun_Quadruplet_7Hasherezade performance: 2,434KB/clock / 17,127%, 01,208,072 skips/iterations  18bit HashTable
Railgun_Quadruplet_7Hasherezade performance: 2,590KB/clock / 17,084%, 01,211,064 skips/iterations  14bit HashTable
Railgun_Quadruplet_7Hasherezade performance: 2,590KB/clock / 16,379%, 01,263,238 skips/iterations  10bit HashTable
Boyer_Moore_Flensburg performance:           2,525KB/clock / 03,198%, 06,469,280 
         skips/iterations  Five times more main-cycles and faster than 'Hasherezade', pshaw!
*/
#define ROL(x, n) (((x) << (n)) | ((x) >> (32-(n))))
#define HashTableSize 18
// Revision: 1, 2012-Feb-01, the main disadvantage: the preprocessing overhead PLUS a hasher.
// Caution: For better speed the case 'if (cbPattern==1)' was removed, so Pattern must be longer than 1 char.
char * Railgun_Quadruplet_7Hasherezade_count_hits (char * pbTarget, 
       char * pbPattern, unsigned long cbTarget, unsigned long cbPattern)
{
    char * pbTargetMax = pbTarget + cbTarget;
    register unsigned long ulHashPattern;
    register unsigned long ulHashTarget;
    signed long count;
    signed long countSTATIC;

    unsigned char SINGLET;
    unsigned long Quadruplet2nd;
    unsigned long Quadruplet3rd;
    unsigned long Quadruplet4th;

    unsigned long  AdvanceHopperGrass;

    long i; //BMH needed
    int a, j;
    unsigned int bm_bc[256]; //BMH needed
    unsigned int bm_bc2nd[256]; //BMS needed
    unsigned char bm_Horspool_Order2[256*256]; //BMHSS(Elsiane) needed,
    // 'char' limits patterns to 255, if 'long' then table becomes 256KB, grrr.
    unsigned char bm_Hasherezade_HASH[1<<(HashTableSize)];
    // Jesteressing 8bytes (Horspool Order 8) for fast lookup, should
    // be bitwise (i.e. 8times smaller) since it says yes/no for presence.
    uint32_t hash32;
    unsigned long Gulliver; // or unsigned char or unsigned short

    if (cbPattern > cbTarget)
        return(NULL);

    if ( cbPattern<4) { 
        pbTarget = pbTarget+cbPattern;
        ulHashPattern = ( (*(char *)(pbPattern))<<8 ) + *(pbPattern+(cbPattern-1));

        if ( cbPattern==3) {
            for ( ;; ) {
                if ( ulHashPattern == ( (*(char *)(pbTarget-3))<<8 ) + *(pbTarget-1) ) {
                    if ( *(char *)(pbPattern+1) == 
                         *(char *)(pbTarget-2) ) Railgunhits++; //return((pbTarget-3));
                }
                if ( (char)(ulHashPattern>>8) != *(pbTarget-2) ) pbTarget++;
                pbTarget++;
                if (pbTarget > pbTargetMax)
                    return(NULL);
            }
        } else {
        }
        for ( ;; ) {
            if ( ulHashPattern == ( (*(char *)(pbTarget-2))<<8 ) + *(pbTarget-1) )
                Railgunhits++; //return((pbTarget-2));
            if ( (char)(ulHashPattern>>8) != *(pbTarget-1) ) pbTarget++;
            pbTarget++;
            if (pbTarget > pbTargetMax)
                return(NULL);
        }
    } else { //if ( cbPattern<4)
        if (cbTarget<961) { // This value is arbitrary(don't know how exactly),
        // it ensures(at least must) better performance than 'Boyer_Moore_Horspool'.
/*
            // A better strstr, with no asm code
            // Written by Mischa Sandberg
            // http://mischasan.wordpress.com
            // static char const *
            // scanstrm(char const *tgt, char const *pat, int len)
            // {
            //     uint32_t head = MSBF32(pat), wind = 0, next;
            // 
            //     pat += 4, len -= 4;
            //     while ((next = *(uint8_t const*)tgt++)) {
            //         wind = ( wind << 8 ) + next;
            //         if (wind == head && !memcmp(tgt, pat, len))
            //             return tgt - 4;
            //     }
            //     return  NULL;
            //}
            ulHashPattern = 0;
            ulHashPattern = ( ulHashPattern << 8 ) + *(uint8_t const*)pbPattern++;
            ulHashPattern = ( ulHashPattern << 8 ) + *(uint8_t const*)pbPattern++;
            ulHashPattern = ( ulHashPattern << 8 ) + *(uint8_t const*)pbPattern++;
            ulHashPattern = ( ulHashPattern << 8 ) + *(uint8_t const*)pbPattern++;
            AdvanceHopperGrass = 0;
            cbPattern -= 4;
            while ((ulHashTarget = *(uint8_t const*)pbTarget++)) {
                AdvanceHopperGrass = ( AdvanceHopperGrass << 8 ) + ulHashTarget;
                if (AdvanceHopperGrass == ulHashPattern && !memcmp(pbTarget, pbPattern, cbPattern))
                Railgunhits++; //return pbTarget - 4;
            }
            return  NULL;
*/
            pbTarget = pbTarget+cbPattern;
            ulHashPattern = *(unsigned long *)(pbPattern);

            SINGLET = ulHashPattern & 0xFF;
            Quadruplet2nd = SINGLET<<8;
            Quadruplet3rd = SINGLET<<16;
            Quadruplet4th = SINGLET<<24;

            for ( ;; ) {
                AdvanceHopperGrass = 0;
                ulHashTarget = *(unsigned long *)(pbTarget-cbPattern);

                    if ( ulHashPattern == ulHashTarget ) {
                    // Three unnecessary comparisons here, but 'AdvanceHopperGrass'
                    // must be calculated - it has a higher priority.
                    count = cbPattern-1;
                    while ( count && *(char *)(pbPattern+(cbPattern-count)) == *(char *)(pbTarget-count) ) {
                        if ( cbPattern-1==AdvanceHopperGrass+count && 
                               SINGLET != *(char *)(pbTarget-count) ) AdvanceHopperGrass++;
                        count--;
                    }
                    if ( count == 0) Railgunhits++; //return((pbTarget-cbPattern));
                    } else { // The goal here: to avoid memory accesses by stressing the registers.
                    if ( Quadruplet2nd != (ulHashTarget & 0x0000FF00) ) {
                        AdvanceHopperGrass++;
                        if ( Quadruplet3rd != (ulHashTarget & 0x00FF0000) ) {
                            AdvanceHopperGrass++;
                            if ( Quadruplet4th != (ulHashTarget & 0xFF000000) ) AdvanceHopperGrass++;
                        }
                    }
                }

                AdvanceHopperGrass++;

                pbTarget = pbTarget + AdvanceHopperGrass;
                if (pbTarget > pbTargetMax)
                    return(NULL);
            }
        } else { //if (cbTarget<961)
            countSTATIC = cbPattern-2-2;

            for (a=0; a < 256; a++) {bm_bc[a]=cbPattern; bm_bc2nd[a]=cbPattern+1;}
            for (j=0; j < cbPattern-1; j++) bm_bc[pbPattern[j]]=cbPattern-j-1; 
            for (j=0; j < cbPattern; j++) bm_bc2nd[pbPattern[j]]=cbPattern-j; 

            ulHashPattern = *(unsigned long *)(pbPattern); // First four bytes
            //ulHashTarget = *(unsigned short *)(pbPattern+cbPattern-1-1); // Last two bytes
        
            AdvanceHopperGrass = 0;
            i=0;

            // Elsiane r.2  [
            for (a=0; a < 256*256; a++) {bm_Horspool_Order2[a]= cbPattern-1;}
            // cbPattern-(Order-1) for Horspool; 'memset' if not optimized

            // alfalfa 7 long 6 BBs (al lf fa al lf fa) 3 distinct BBs (al lf fa) 
            // fast 4 0-1-2 fa as st
            for (j=0; j < cbPattern-1; j++) bm_Horspool_Order2[*(unsigned short *)
                (pbPattern+j)]=j; // Rightmost appearance/position is needed

            // Elsiane r.2  ]

    if ( cbPattern>10) { 
            // Hasherezade r.1 [
            // OSHO.TXT has 00,046,486 03bytes distinct BBs
            // OSHO.TXT has 00,248,019 04bytes distinct BBs
            // OSHO.TXT has 00,855,682 05bytes distinct BBs
            // OSHO.TXT has 02,236,138 06bytes distinct BBs
            // OSHO.TXT has 04,803,152 07bytes distinct BBs
            // OSHO.TXT has 08,956,496 08bytes distinct BBs
            // to be hashed in 18bit i.e. 256KB i.e. 262,144 slots i.e. 34 vs 1.
            // OSHO.TXT has 15,006,172 09bytes distinct BBs
            // OSHO.TXT has 22,992,127 10bytes distinct BBs
            // Note: BB stands for Building-Block (also suffix)

            for (a=0; a < 1<<(HashTableSize); a++) {bm_Hasherezade_HASH[a]= 0;}
            // to-do: bit to replace byte; 'memset' if not optimized

            // cbPattern - Order + 1 i.e. number of BBs for 11 'fastest fox'
            // 11-8+1=4: 'fastest ', 'astest f', 'stest fo', 'test fox'
            for (j=0; j < cbPattern-8+1; j++) {
                hash32 = (2166136261 ^ (ROL(*(uint32_t *)(pbPattern+j),5)^*(uint32_t *)(pbPattern+j+4))) * 709607;        
                bm_Hasherezade_HASH[( hash32 ^ (hash32 >> 16) ) & ( (1<<(HashTableSize))-1 )]=1;
/*
for (a=0; a<8; a++)
printf("%c",*(char *)(pbPattern+j+a) );
printf(" %lu\n",( hash32 ^ (hash32 >> 16) ) & ( (1<<(HashTableSize))-1 ));
//Input Pattern(up to 19+2000 chars): and every day a continuous cleaning goes on
//Doing Search for Pattern(43bytes) into String(206908949bytes) as-one-line ...
BBs      Slot(HashCode for 18bit HashTable)
and ever 117013
nd every 108604
d every  155516
 every d 170959
every da 115291
very day 73191
ery day  97042
ry day a 83793
y day a  11244
 day a c 115855
day a co 101797
ay a con 222568
y a cont 29130
 a conti 20978
a contin 258405
 continu 252691
continuo 123607
ontinuou 56546
ntinuous 135857
tinuous  15332
inuous c 250584
nuous cl 48224
uous cle 106616
ous clea 137020
us clean 35751
s cleani 178989
 cleanin 213855
cleaning 63337
leaning  97138
eaning g 62366
aning go 247590
ning goe 36571
ing goes 41142
ng goes  228365
g goes o 229696
 goes on 176852
*/
            }
            // Hasherezade r.1 ]

            while (i <= cbTarget-cbPattern-1) { // -1 because Sunday is used
                Gulliver = bm_Horspool_Order2[*(unsigned short *)&pbTarget[i+cbPattern-1-1]];

                if ( Gulliver == cbPattern-2 ) { // CASE #1: means the pair (char order 2) is found
                    if ( *(unsigned long *)&pbTarget[i] == ulHashPattern) {
                        count = countSTATIC; // Last two chars already matched, to be fixed with -2
                        while ( count !=0 && *(char *)(pbPattern+
                          (countSTATIC-count)+4) == *(char *)(&pbTarget[i]+(countSTATIC-count)+4) )
                            count--;
                        if ( count == 0) Railgunhits++; //return(pbTarget+i);
                    }
                    //i = i + 1; // r.1, obviuosly this is the worst skip so turning to 'SunHorse': lines below
                    if ( bm_bc[(unsigned char)pbTarget[i+cbPattern-1]] < bm_bc2nd[(unsigned char)pbTarget[i+(cbPattern)]] )
                             Gulliver =  bm_bc2nd[(unsigned char)pbTarget[i+(cbPattern)]];
                    else
                             Gulliver =  bm_bc[(unsigned char)pbTarget[i+cbPattern-1]];
                } else if ( Gulliver != cbPattern-1 ) // CASE #2: if equal means the pair (char order 2)
                // is not found i.e. Gulliver remains intact, skip the whole pattern
                // and fall back (Order-1) chars i.e. one char for Order 2
                    Gulliver = cbPattern - Gulliver - 2;
                    // CASE #3: the pair is found and not as suffix i.e. rightmost position
            
                // The goal: to jump when the rightmost 8bytes (Order 8 Horspool) of window do
                // not look like any of Needle prefixes i.e. are not to be found. This maximum
                // jump equals cbPattern-(Order-1) or 11-(8-1)=4 for 'fastest fox' - a small
                // one but for Needle 31 bytes the jump equals 31-(8-1)=24
                if (Gulliver < cbPattern-(8-1)) { 
                    hash32 = (2166136261 ^ (ROL(*(uint32_t *)
                         (pbTarget+i+cbPattern-8),5)^*(uint32_t *)(pbTarget+i+cbPattern-8+4))) * 709607;        
                    if ( bm_Hasherezade_HASH[( hash32 ^ (hash32 >> 16) ) & ( (1<<(HashTableSize))-1 )]==0 )
                        Gulliver = cbPattern-(8-1);
                }
                    i = i + Gulliver;
                AdvanceHopperGrass++;
/*
; 4155 :                 // The goal: to jump when the rightmost 8bytes (Order 8 Horspool) of window
// do not look like any of Needle prefixes i.e. are not to be found. This maximum jump equals
// cbPattern-(Order-1) or 11-(8-1)=4 for 'fastest fox' - a small one
//           but for Needle 31 bytes the jump equals 31-(8-1)=24
; 4156 :                 if (Gulliver < cbPattern-(8-1)) { 

  01f16    8d 43 f9     lea     eax, DWORD PTR [ebx-7]
  01f19    3b c8         cmp     ecx, eax
  01f1b    73 30         jae     SHORT $LN18@Railgun_Qu@8

; 4157 :                     hash32 = (2166136261 ^ (ROL(*(uint32_t *)
  (pbTarget+i+cbPattern-8),5)^*(uint32_t *)(pbTarget+i+cbPattern-8+4))) * 709607;        

  01f1d    8b 44 32 f8     mov     eax, DWORD PTR [edx+esi-8]
  01f21    c1 c0 05     rol     eax, 5
  01f24    33 44 32 fc     xor     eax, DWORD PTR [edx+esi-4]
  01f28    35 c5 9d 1c 81     xor     eax, -2128831035    ; 811c9dc5H
  01f2d    69 c0 e7 d3 0a
    00         imul     eax, 709607        ; 000ad3e7H

; 4158 :                     if ( bm_Hasherezade_HASH[( hash32 ^ 
  (hash32 >> 16) ) & ( (1<<(HashTableSize))-1 )]==0 )

  01f33    8b f8         mov     edi, eax
  01f35    c1 ef 10     shr     edi, 16            ; 00000010H
  01f38    33 f8         xor     edi, eax
  01f3a    81 e7 ff ff 03
    00         and     edi, 262143        ; 0003ffffH
  01f40    80 bc 3c 28 08
    01 00 00     cmp     BYTE PTR _bm_Hasherezade_HASH$[esp+edi+329776], 0
  01f48    75 03         jne     SHORT $LN18@Railgun_Qu@8

; 4159 :                         Gulliver = cbPattern-(8-1);

  01f4a    8d 4b f9     lea     ecx, DWORD PTR [ebx-7]
$LN18@Railgun_Qu@8:

; 4160 :                 }
; 4161 :                     i = i + Gulliver;
; 4162 :                 AdvanceHopperGrass++;
*/
            }

    } else {
            while (i <= cbTarget-cbPattern-1) { // -1 because Sunday is used
                Gulliver = bm_Horspool_Order2[*(unsigned short *)&pbTarget[i+cbPattern-1-1]];

                if ( Gulliver == cbPattern-2 ) { // CASE #1: means the pair (char order 2) is found
                    if ( *(unsigned long *)&pbTarget[i] == ulHashPattern) {
                        count = countSTATIC; // Last two chars already matched, to be fixed with -2
                        while ( count !=0 && *(char *)(pbPattern+
                          (countSTATIC-count)+4) == *(char *)(&pbTarget[i]+(countSTATIC-count)+4) )
                            count--;
                        if ( count == 0) Railgunhits++; //return(pbTarget+i);
                    }
                    //i = i + 1; // r.1, obviuosly this is the worst skip so turning to 'SunHorse': lines below
                    if ( bm_bc[(unsigned char)pbTarget[i+cbPattern-1]] < bm_bc2nd[(unsigned char)pbTarget[i+(cbPattern)]] )
                             Gulliver =  bm_bc2nd[(unsigned char)pbTarget[i+(cbPattern)]];
                    else
                             Gulliver =  bm_bc[(unsigned char)pbTarget[i+cbPattern-1]];
                } else if ( Gulliver != cbPattern-1 )
                // CASE #2: if equal means the pair (char order 2) is not found
                // i.e. Gulliver remains intact, skip the whole pattern
                // and fall back (Order-1) chars i.e. one char for Order 2
                    Gulliver = cbPattern - Gulliver - 2;
                    // CASE #3: the pair is found and not as suffix i.e. rightmost position

                i = i + Gulliver;

// 32323218 Order 1 Horspool Skip-table A
// 01234568 Order 1 Horspool Skip-table B
// fa af fa af fa as st Order 2 Horspool Skip-table B
//  0  1  2  3  4  5  6
// HIKARIfast
// fafafast
//   fafafast +2 Order 1 'a' vs 't'
//   fafafast +2 = (cbPattern-SkipB-Order = 8-5-1 = 2) Order 1 'a' vs 't'
//   fafafast +2 = (cbPattern-SkipB-Order = 8-4-2 = 2) Order 2 'fa' vs 'st' i.e. CASE #3

// 76543218 Order 1 Horspool
// lo on ng gp pa ac ce Order 2 Horspool
//  0  1  2  3  4  5  6
// HIKARIfast
// longpace
//   longpace +2 Order 1 'a' vs 'e'
//        longpace +7 = (cbPattern-(Order-1) = 8-(2-1) = 7) Order 2 'fa' vs 'ce' i.e. CASE #2

                AdvanceHopperGrass++;
            }
    } //if ( cbPattern>10) { 

            if (i == cbTarget-cbPattern) {
                if ( *(unsigned long *)&pbTarget[i] == ulHashPattern) {
                    count = countSTATIC;
                    while ( count !=0 && *(char *)(pbPattern+
                           (countSTATIC-count)+4) == *(char *)(&pbTarget[i]+(countSTATIC-count)+4) )
                        count--;
                    if ( count == 0) Railgunhits++; //return(pbTarget+i);
                }
                AdvanceHopperGrass++;
            }

            GlobalSP += (int)((double)cbTarget/AdvanceHopperGrass*100);
            GlobalI += AdvanceHopperGrass;
            printf("Skip-Performance(bigger-the-better): %d%%, %d skips/iterations\n",
              (int)((double)cbTarget/AdvanceHopperGrass*100), AdvanceHopperGrass);
        
            return(NULL);
        } //if (cbTarget<961)
    } //if ( cbPattern<4)
}
// ### Mix(2in1) of Karp-Rabin & Boyer-Moore-Sunday-Horspool algorithm ]

The Speed-Performance disappoints, also I don't see how to speed up this hash-boosted variant. Any ideas how to hasten?

An add-on from 2012-Jan-26:

Enter Railgun_Quadruplet_7Gulliver ... the most promising variant.

I am happy to share my latest etude - a result of a week-long sub-quest of mine saturated with music and program-mess-ing. The first search-function that breaks (effortlessly) the 3GB/s barrier on a computer with 3GB/s memory write.

// ### Mix(2in1) of Karp-Rabin & Boyer-Moore-Sunday-Horspool algorithm [
/*
Raising Horspool/Sunday order from 1 to 2 came quite naturally after some carefree wander of mine.
I consider adding a new 'else if' block for haystacks longer than 960 in order to avoid the 
big skip table overhead, something like 512*1024, thus fastest r7sun will operate 
on 961..512*1024 long haystacks whereas 'Gulliver' on bigger than 512*1024.
I am not happy with Skip-Performance of revision 1: this should be amended...
'Gulliver' makes giant skips/jumps on patterns some 8+ chars long but for short 
ones falls behind. Here r.1 is given being a Horspool order 2, of course Sunday order 2 is coming.
I restrain myself of dreaming about power behind order 3, 
for bigger orders I expect nothing less than Dream-Performance.

Of course the enhancements await advent, they are very easy 
to be seen, but they will appear in next revisions of 'Gulliver'.
Tuning continues but the skeleton is built, I see 'Gulliver' as a really High-Performance etude.

And not to be empty-handed here the Gulliver's swiftness is benchmarked on String(206,908,949bytes) as-one-line:

Pattern: fast
Railgun_Quadruplet_7sun:       1,074KB/clock / 0,456%, 45,330,622 iterations
Railgun_Quadruplet_7:          1,000KB/clock / 0,377%, 54,788,054 iterations
Railgun_Quadruplet_7sunhorse:  0,651KB/clock / 0,480%, 43,103,056 iterations
Railgun_Quadruplet_7deuce:     0,575KB/clock / 0,389%, 53,138,919 iterations
Railgun_Quadruplet_7Elsiane:   0,511KB/clock / 0,551%, 37,541,955 iterations
Railgun_Quadruplet_7Gulliver:  0,649KB/clock / 0,298%, 69,403,843 iterations
Boyer_Moore_Flensburg:         0,491KB/clock / 0,377%, 54,788,139 iterations

Pattern: faster
Railgun_Quadruplet_7sun:       1,347KB/clock / 0,591%, 34,996,936 iterations
Railgun_Quadruplet_7:          1,329KB/clock / 0,514%, 40,194,194 iterations
Railgun_Quadruplet_7sunhorse:  0,762KB/clock / 0,656%, 31,504,148 iterations
Railgun_Quadruplet_7deuce:     0,662KB/clock / 0,567%, 36,434,006 iterations
Railgun_Quadruplet_7Elsiane:   0,538KB/clock / 0,710%, 29,101,626 iterations
Railgun_Quadruplet_7Gulliver:  1,010KB/clock / 0,491%, 42,066,438 iterations
Boyer_Moore_Flensburg:         0,687KB/clock / 0,514%, 40,194,282 iterations

Pattern: fastest
Railgun_Quadruplet_7sun:       1,554KB/clock / 0,687%, 30,084,306 iterations
Railgun_Quadruplet_7:          1,507KB/clock / 0,599%, 34,540,430 iterations
Railgun_Quadruplet_7sunhorse:  0,918KB/clock / 0,761%, 27,188,853 iterations
Railgun_Quadruplet_7deuce:     0,786KB/clock / 0,663%, 31,175,827 iterations
Railgun_Quadruplet_7Elsiane:   0,635KB/clock / 0,818%, 25,281,493 iterations
Railgun_Quadruplet_7Gulliver:  1,209KB/clock / 0,592%, 34,946,434 iterations
Boyer_Moore_Flensburg:         0,798KB/clock / 0,621%, 33,278,240 iterations

Pattern: fastest fox
Railgun_Quadruplet_7sun:       1,727KB/clock / 0,775%, 26,672,940 iterations
Railgun_Quadruplet_7:          1,669KB/clock / 0,669%, 30,925,578 iterations
Railgun_Quadruplet_7sunhorse:  0,957KB/clock / 0,912%, 22,663,583 iterations
Railgun_Quadruplet_7deuce:     0,821KB/clock / 0,797%, 25,945,709 iterations
Railgun_Quadruplet_7Elsiane:   0,719KB/clock / 0,931%, 22,213,101 iterations  Gulliver outskips Elsiane!
Railgun_Quadruplet_7Gulliver:  1,804KB/clock / 0,977%, 21,167,180 iterations  monstrous!
Boyer_Moore_Flensburg:         1,080KB/clock / 0,669%, 30,925,649 iterations

Pattern: fastest fox with biggest strides
Railgun_Quadruplet_7sun:       2,658KB/clock / 1,584%, 13,060,463 iterations
Railgun_Quadruplet_7:          2,767KB/clock / 1,511%, 13,689,243 iterations
Railgun_Quadruplet_7sunhorse:  1,788KB/clock / 2,138%, 09,677,267 iterations
Railgun_Quadruplet_7deuce:     1,656KB/clock / 2,053%, 10,075,650 iterations
Railgun_Quadruplet_7Elsiane:   1,542KB/clock / 2,143%, 09,652,548 iterations
Railgun_Quadruplet_7Gulliver:  3,108KB/clock / 2,917%, 07,091,357 iterations  over the top!
Boyer_Moore_Flensburg:         1,820KB/clock / 1,554%, 13,307,181 iterations

Pattern: fastest fox with biggest strides known to me
Railgun_Quadruplet_7sun:       2,590KB/clock / 1,548%, 13,363,356 iterations
Railgun_Quadruplet_7:          2,658KB/clock / 1,447%, 14,292,419 iterations
Railgun_Quadruplet_7sunhorse:  1,870KB/clock / 2,234%, 09,259,505 iterations
Railgun_Quadruplet_7deuce:     1,757KB/clock / 2,011%, 10,287,584 iterations
Railgun_Quadruplet_7Elsiane:   1,656KB/clock / 2,240%, 09,236,188 iterations
Railgun_Quadruplet_7Gulliver:  3,108KB/clock / 3,821%, 05,413,684 iterations  unstoppable!
Boyer_Moore_Flensburg:         1,712KB/clock / 1,540%, 13,431,751 iterations

Pattern: fastest fox with biggest strides known to me up to 2012 January 26 namely 'Gulliver'
Railgun_Quadruplet_7sun:       3,108KB/clock / 2,890%, 07,159,321 iterations
Railgun_Quadruplet_7:          3,157KB/clock / 2,742%, 07,545,141 iterations
Railgun_Quadruplet_7sunhorse:  2,557KB/clock / 4,138%, 04,999,777 iterations
Railgun_Quadruplet_7deuce:     2,494KB/clock / 4,029%, 05,135,444 iterations
Railgun_Quadruplet_7Elsiane:   2,494KB/clock / 4,141%, 04,995,463 iterations
Railgun_Quadruplet_7Gulliver:  3,108KB/clock / 7,218%, 02,866,342 iterations  simply amazing!
Boyer_Moore_Flensburg:         2,767KB/clock / 2,745%, 07,536,097 iterations

Note:
[
In the above benchmark (on Pentium Merom 2166Mhz) 'Gulliver' is limited by memory bandwidth only, due to:

Pattern: fastest fox with biggest strides
Railgun_Quadruplet_7sun:       2,658KB/clock / 1,584%, 13,060,463 iterations
Railgun_Quadruplet_7Gulliver:  3,108KB/clock / 2,917%, 07,091,357 iterations

Pattern: fastest fox with biggest strides known to me
Railgun_Quadruplet_7sun:       2,590KB/clock / 1,548%, 13,363,356 iterations
Railgun_Quadruplet_7Gulliver:  3,108KB/clock / 3,821%, 05,413,684 iterations

Pattern: fastest fox with biggest strides known to me up to 2012 January 26 namely 'Gulliver'
Railgun_Quadruplet_7sun:       3,108KB/clock / 2,890%, 07,159,321 iterations
Railgun_Quadruplet_7Gulliver:  3,108KB/clock / 7,218%, 02,866,342 iterations

You can see how Railgun_Quadruplet_7sun struggles to catch up, for longest pattern
it achieves the maximum 3,108KB/clock but at a nasty cost: 7,159,321 iterations 
whereas 'Gulliver' achieves same (in fact much higher) speed with heart-touching 2,866,342 iterations only.
For my Pentium T3400 2166 MHz Toshiba Satellite L305 Dual DDR2-667 5-5-5-13, 'Everest' benchmark program gives:
Memory Read:  4,605 MB/s
Memory Write: 3,323 MB/s
'Gulliver' would run at 5GB/s without any problem if it were not for 
the hardware limitation, a fantastic performance in my eyes.
]
*/
// 
// Benchmark (on Pentium Merom 2166MHz, Windows XP 32bit, CL v16 32bit):
// Two tests were done - the first/second '6x2'/'52' searches 12/52 patterns
// (for all appearances i.e. counting all hits) into 206,908,949 bytes long English text:
// 
// Skip-Performance (measured in %, bigger-the-better; Iterations, lesser-the-better) Summary:
// 
// '6x2' test (12 patterns 4-19 chars in length)
// 
// Railgun_Quadruplet_7sun 6x2 i.e. average performance:       2,092KB/clock / 0,153,936% / 005,572,202,064
// Railgun_Quadruplet_7 6x2 i.e. average performance:          2,032KB/clock / 0,137,568% / 006,454,416,320
// Railgun_Quadruplet_7sunhorse 6x2 i.e. average performance:  1,262KB/clock / 0,173,840% / 005,145,288,080
// Railgun_Quadruplet_7deuce 6x2 i.e. average performance:     1,134KB/clock / 0,155,744% / 005,995,597,776
// Railgun_Quadruplet_7Elsiane 6x2 i.e. average performance:   0,938KB/clock / 0,184,096% / 004,710,045,936
// Boyer_Moore_Flensburg 6x2 i.e. average performance:         1,163KB/clock / 0,139,168% / 006,428,106,496
// Railgun_Quadruplet_7Gulliver 6x2 i.e. average performance:  1,552KB/clock / 0,152,912% / 006,959,446,480
// Brute_Force_Dummy 6x2 i.e. average performance:             0,343KB/clock / 0,019,200% / 039,726,516,640
// 
// '52' test (52 patterns 4-175 chars in length)
// 
// Railgun_Quadruplet_7sun 52 i.e. average performance:        1,652KB/clock / 0,768,416% / 022,362,482,640
// Railgun_Quadruplet_7 52 i.e. average performance:           1,636KB/clock / 0,708,768% / 025,368,049,712
// Railgun_Quadruplet_7sunhorse 52 i.e. average performance:   1,052KB/clock / 0,939,520% / 020,187,534,736
// Railgun_Quadruplet_7deuce 52 i.e. average performance:      0,929KB/clock / 0,864,048% / 023,292,952,992
// Railgun_Quadruplet_7Elsiane 52 i.e. average performance:    0,809KB/clock / 0,979,072% / 018,535,811,280
// Boyer_Moore_Flensburg 52 i.e. average performance:          0,928KB/clock / 0,720,272% / 025,227,269,760
// Railgun_Quadruplet_7Gulliver 52 i.e. average performance:   1,296KB/clock / 1,145,808% / 026,131,772,688
// Brute_Force_Dummy 52 i.e. average performance:              0,279KB/clock / 0,083,200% / 172,148,232,496
// 
// Revision: 1, 2012-Jan-26, the main disadvantage: the preprocessing overhead.
// Caution: For better speed the case 'if (cbPattern==1)' was removed, so Pattern must be longer than 1 char.
char * Railgun_Quadruplet_7Gulliver_count_hits (char * pbTarget, 
         char * pbPattern, unsigned long cbTarget, unsigned long cbPattern)
{
    char * pbTargetMax = pbTarget + cbTarget;
    register unsigned long ulHashPattern;
    register unsigned long ulHashTarget;
    signed long count;
    signed long countSTATIC;

    unsigned char SINGLET;
    unsigned long Quadruplet2nd;
    unsigned long Quadruplet3rd;
    unsigned long Quadruplet4th;

    unsigned long  AdvanceHopperGrass;

    long i; //BMH needed
    int a, j;
    unsigned int bm_bc[256]; //BMH needed
    unsigned int bm_bc2nd[256]; //BMS needed
    //BMHSS(Elsiane) needed, 'char' limits patterns to 255, if 'long' then table becomes 256KB, grrr.
    unsigned char bm_Sunday_Order2[256*256];
    unsigned long Gulliver; // or unsigned char or unsigned short

    if (cbPattern > cbTarget)
        return(NULL);

    if ( cbPattern<4) { 
        pbTarget = pbTarget+cbPattern;
        ulHashPattern = ( (*(char *)(pbPattern))<<8 ) + *(pbPattern+(cbPattern-1));

        if ( cbPattern==3) {
            for ( ;; ) {
                if ( ulHashPattern == ( (*(char *)(pbTarget-3))<<8 ) + *(pbTarget-1) ) {
                    if ( *(char *)(pbPattern+1) == *(char *)(pbTarget-2) ) Railgunhits++; //return((pbTarget-3));
                }
                if ( (char)(ulHashPattern>>8) != *(pbTarget-2) ) pbTarget++;
                pbTarget++;
                if (pbTarget > pbTargetMax)
                    return(NULL);
            }
        } else {
        }
        for ( ;; ) {
            if ( ulHashPattern == ( (*(char *)(pbTarget-2))<<8 ) + *(pbTarget-1) )
                Railgunhits++; //return((pbTarget-2));
            if ( (char)(ulHashPattern>>8) != *(pbTarget-1) ) pbTarget++;
            pbTarget++;
            if (pbTarget > pbTargetMax)
                return(NULL);
        }
    } else { //if ( cbPattern<4)
        if (cbTarget<961) { // This value is arbitrary(don't know how exactly),
        // it ensures(at least must) better performance than 'Boyer_Moore_Horspool'.
            pbTarget = pbTarget+cbPattern;
            ulHashPattern = *(unsigned long *)(pbPattern);

            SINGLET = ulHashPattern & 0xFF;
            Quadruplet2nd = SINGLET<<8;
            Quadruplet3rd = SINGLET<<16;
            Quadruplet4th = SINGLET<<24;

            for ( ;; ) {
                AdvanceHopperGrass = 0;
                ulHashTarget = *(unsigned long *)(pbTarget-cbPattern);

                    if ( ulHashPattern == ulHashTarget ) {
                    // Three unnecessary comparisons here, but 'AdvanceHopperGrass'
                    // must be calculated - it has a higher priority.
                    count = cbPattern-1;
                    while ( count && *(char *)(pbPattern+(cbPattern-count)) == *(char *)(pbTarget-count) ) {
                        if ( cbPattern-1==AdvanceHopperGrass+count && 
                               SINGLET != *(char *)(pbTarget-count) ) AdvanceHopperGrass++;
                        count--;
                    }
                    if ( count == 0) Railgunhits++; //return((pbTarget-cbPattern));
                    } else { // The goal here: to avoid memory accesses by stressing the registers.
                    if ( Quadruplet2nd != (ulHashTarget & 0x0000FF00) ) {
                        AdvanceHopperGrass++;
                        if ( Quadruplet3rd != (ulHashTarget & 0x00FF0000) ) {
                            AdvanceHopperGrass++;
                            if ( Quadruplet4th != (ulHashTarget & 0xFF000000) ) AdvanceHopperGrass++;
                        }
                    }
                }

                AdvanceHopperGrass++;

                pbTarget = pbTarget + AdvanceHopperGrass;
                if (pbTarget > pbTargetMax)
                    return(NULL);
            }
        } else { //if (cbTarget<961)
            countSTATIC = cbPattern-2-2;

            for (a=0; a < 256; a++) {bm_bc[a]=cbPattern; bm_bc2nd[a]=cbPattern+1;}
            for (j=0; j < cbPattern-1; j++) bm_bc[pbPattern[j]]=cbPattern-j-1; 
            for (j=0; j < cbPattern; j++) bm_bc2nd[pbPattern[j]]=cbPattern-j; 

            // Elsiane r.2  [
            for (a=0; a < 256*256; a++) {bm_Sunday_Order2[a]= cbPattern-1;} // 'memset' if not optimized

            // alfalfa 7 long 6 BBs (al lf fa al lf fa) 3 distinct BBs (al lf fa) 
            // fast 4 0-1-2 fa as st
            for (j=0; j < cbPattern-1; j++) 
                bm_Sunday_Order2[*(unsigned short *)(pbPattern+j)]=j; // Rightmost appearance/position is needed

            // Elsiane r.2 ]

            ulHashPattern = *(unsigned short *)(pbPattern+cbPattern-1-1); // Last two bytes
            ulHashTarget = *(unsigned long *)(pbPattern); // First four bytes
        
            AdvanceHopperGrass = 0;
            i=0;
            while (i <= cbTarget-cbPattern-1-1) {
                Gulliver = bm_Sunday_Order2[*(unsigned short *)&pbTarget[i+cbPattern-1-1]];

                if ( Gulliver == cbPattern-2 ) { // CASE #1: means the pair (char order 2) is found
                    if ( *(unsigned long *)&pbTarget[i] == ulHashTarget) {
                        count = countSTATIC; // Last two chars already matched, to be fixed with -2
                        while ( count !=0 && *(char *)(pbPattern+
                               (countSTATIC-count)+4) == *(char *)(&pbTarget[i]+(countSTATIC-count)+4) )
                            count--;
                        if ( count == 0) Railgunhits++; //return(pbTarget+i);
                    }
                    i = i + 1;
                } else if ( Gulliver == cbPattern-1 ) // CASE #2: means the pair (char order 2) is not found
                    i = i + Gulliver; // the pair is not found, skip the whole pattern and fall back one char
                else
                    i = i +  cbPattern - Gulliver - 2;
                    // CASE #3: the pair is found and not as suffix i.e. rightmost position

// 32323218 Order 1 Horspool
// fa af fa af fa as st Order 2 Horspool
//  0  1  2  3  4  5  6
// HIKARIfast
// fafafast
//   fafafast +2 Order 1 'a' vs 't'
//   fafafast +2 = (cbPattern-Gulliver-2 = 8-4-2 = 2) Order 2 'fa' vs 'st' i.e. CASE #3

// 76543218 Order 1 Horspool
// lo on ng gp pa ac ce Order 2 Horspool
//  0  1  2  3  4  5  6
// HIKARIfast
// longpace
//   longpace +2 Order 1 'a' vs 'e'
//        longpace +7 = (cbPattern-1 = 8-1 = 7) Order 2 'fa' vs 'ce' i.e. CASE #2

                AdvanceHopperGrass++;
            }

            if (i == cbTarget-cbPattern-1) {
                if ( *(unsigned long *)&pbTarget[i] == ulHashPattern) {
                    count = countSTATIC;
                    while ( count !=0 && *(char *)(pbPattern+(countSTATIC-count)+4) 
                             == *(char *)(&pbTarget[i]+(countSTATIC-count)+4) )
                        count--;
                    if ( count == 0) Railgunhits++; //return(pbTarget+i);
                }
                i++;
                AdvanceHopperGrass++;
            }

            if (i == cbTarget-cbPattern) {
                if ( *(unsigned long *)&pbTarget[i] == ulHashPattern) {
                    count = countSTATIC;
                    while ( count !=0 && *(char *)(pbPattern+(countSTATIC-count)+4) 
                             == *(char *)(&pbTarget[i]+(countSTATIC-count)+4) )
                        count--;
                    if ( count == 0) Railgunhits++; //return(pbTarget+i);
                }
                AdvanceHopperGrass++;
            }

            GlobalSP += (int)((double)cbTarget/AdvanceHopperGrass*100);
            GlobalI += AdvanceHopperGrass;
            printf("Skip-Performance(bigger-the-better): %d%%, %d skips/iterations\n",
              (int)((double)cbTarget/AdvanceHopperGrass*100), AdvanceHopperGrass);
        
            return(NULL);
        } //if (cbTarget<961)
    } //if ( cbPattern<4)
}
// ### Mix(2in1) of Karp-Rabin & Boyer-Moore-Sunday-Horspool algorithm ]

I was very touched (thus inspired) by this cuore song:

Strong enough to tell you why
I feel enough to tell you something that's very close to me
In the eye of my mind
Feel the reasons that we thought
You just feel all
Falling I'm standing across the stream
Fearing saying I
I'm strong enough to find my courage
To find the way I'm always near it's brewing in my mind
Falling I'm standing across the stream
Fearing saying I
I'm strong enough to find my courage
To find the way I'm always near it's growing in my mind

/Artist: Elsiane - Across The Stream/

I salute everyone who feels-and-keeps the artistic way near.

An add-on from 2012-Jan-24:

Just wanted to see what Sunday's technique (applied on two chars, order 2) holds, this resulted in appearance of Railgun_Quadruplet_7Elsiane, the biggest-strider so far, I dream (already) of Railgun_Quadruplet_7Gulliver...

Horspool-Sunday-Sunday jumps to:
--------------------------------\
                                |
Horspool-Sunday jumps to:
-------------------------------\
                               |
Horspool jumps to:
------------------------------\
                              |
                  String: faster
                  Needle: fast

       Sunday skip-table:
                          43215
     Horspool skip-table:
                          32144

For above example:
Horspool jumps to A where A=4 'e' position
Sunday jumps to B where B=5 'r' position
Horspool-Sunday ('SunHorse') jumps to max(A,B) i.e. 5
Horspool-Sunday-Sunday ('Elsiane') jumps to max(A,B,C) i.e. max(5,6)=6 since neither 'e' nor 'r' appear in the needle i.e. right after 'r'

// ### Mix(2in1) of Karp-Rabin & Boyer-Moore-Sunday-Horspool algorithm [
// Note: 'Elsiane' is the strongest (i.e. with biggest-strides)
// Boyer-Moore-Sunday-Horspool variant known to me,
// named after a 'very close to me' duo(two musicians), viva Elsieanne Caplette!
// Benchmark (on Pentium Merom 2166MHz, Windows XP 32bit, CL v16 32bit):
// Two tests were done - the first/second '6x2'/'52' searches 12/52
// patterns (for all appearances i.e. counting all hits) into 206,908,949 bytes long English text:
// 
// Skip-Performance (measured in %, bigger-the-better; Iterations, lesser-the-better) Summary:
// 
// Function                      '6x2' test (12 patterns 4-19 chars in length)
// 
// Railgun_Quadruplet_7Elsiane   0,934KB/clock / 184,096% / 004,710,045,936 Iterations  Muffin
// Railgun_Quadruplet_7sunhorse  1,261KB/clock / 173,840% / 005,145,288,080 Iterations  BONBON
// Railgun_Quadruplet_7deuce     1,132KB/clock / 155,744% / 005,995,597,776 Iterations  Leaves-The-Town
// Railgun_Quadruplet_7sun       2,091KB/clock / 153,936% / 005,572,202,064 Iterations  Bitter-Sweet
// Boyer_Moore_Flensburg         1,162KB/clock / 139,168% / 006,428,106,496 Iterations  Loopy
// Railgun_Quadruplet_7          2,029KB/clock / 137,568% / 006,454,416,320 Iterations  Loopier
// Brute_Force_Dummy             0,342KB/clock / 019,200% / 039,726,516,640 Iterations  Loopiest
// 
// Function                      '52' test (52 patterns 4-175 chars in length)
// 
// Railgun_Quadruplet_7Elsiane   0,809KB/clock / 979,072% / 018,535,811,280 Iterations  Muffin
// Railgun_Quadruplet_7sunhorse  1,051KB/clock / 939,520% / 020,187,534,736 Iterations  BONBON
// Railgun_Quadruplet_7deuce     0,928KB/clock / 864,048% / 023,292,952,992 Iterations  Leaves-The-Town
// Railgun_Quadruplet_7sun       1,651KB/clock / 768,416% / 022,362,482,640 Iterations  Bitter-Sweet
// Boyer_Moore_Flensburg         0,928KB/clock / 720,272% / 025,227,269,760 Iterations  Loopy
// Railgun_Quadruplet_7          1,632KB/clock / 708,768% / 025,368,049,712 Iterations  Loopier
// Brute_Force_Dummy             0,279KB/clock / 083,200% / 172,148,232,496 Iterations  Loopiest
// 
// Revision: 1, 2012-Jan-24, the main disadvantage: the preprocessing overhead.
// Caution: For better speed the case 'if (cbPattern==1)' was removed, so Pattern must be longer than 1 char.
char * Railgun_Quadruplet_7Elsiane_count_hits (char * pbTarget,
    char * pbPattern, unsigned long cbTarget, unsigned long cbPattern)
{
    char * pbTargetMax = pbTarget + cbTarget;
    register unsigned long ulHashPattern;
    register unsigned long ulHashTarget;
    signed long count;
    signed long countSTATIC;

    unsigned char SINGLET;
    unsigned long Quadruplet2nd;
    unsigned long Quadruplet3rd;
    unsigned long Quadruplet4th;

    unsigned long  AdvanceHopperGrass;

    long i; //BMH needed
    int a, j;
    unsigned int bm_bc[256]; //BMH needed
    unsigned int bm_bc2nd[256]; //BMS needed
    unsigned char bm_Sunday_Order2[256*256];
    //BMHSS(Elsiane) needed, of course those 65536 bytes are
    // scary in next revision reduce them to 8192 i.e. bitwise it

    if (cbPattern > cbTarget)
        return(NULL);

    if ( cbPattern<4) { 
        pbTarget = pbTarget+cbPattern;
        ulHashPattern = ( (*(char *)(pbPattern))<<8 ) + *(pbPattern+(cbPattern-1));

        if ( cbPattern==3) {
            for ( ;; ) {
                if ( ulHashPattern == ( (*(char *)(pbTarget-3))<<8 ) + *(pbTarget-1) ) {
                    if ( *(char *)(pbPattern+1) == 
                           *(char *)(pbTarget-2) ) Railgunhits++; //return((pbTarget-3));
                }
                if ( (char)(ulHashPattern>>8) != *(pbTarget-2) ) pbTarget++;
                pbTarget++;
                if (pbTarget > pbTargetMax)
                    return(NULL);
            }
        } else {
        }
        for ( ;; ) {
            if ( ulHashPattern == ( (*(char *)(pbTarget-2))<<8 ) + *(pbTarget-1) )
                Railgunhits++; //return((pbTarget-2));
            if ( (char)(ulHashPattern>>8) != *(pbTarget-1) ) pbTarget++;
            pbTarget++;
            if (pbTarget > pbTargetMax)
                return(NULL);
        }
    } else { //if ( cbPattern<4)
        if (cbTarget<961) { // This value is arbitrary(don't k
        // now how exactly), it ensures(at least must) better performance than 'Boyer_Moore_Horspool'.
            pbTarget = pbTarget+cbPattern;
            ulHashPattern = *(unsigned long *)(pbPattern);

            SINGLET = ulHashPattern & 0xFF;
            Quadruplet2nd = SINGLET<<8;
            Quadruplet3rd = SINGLET<<16;
            Quadruplet4th = SINGLET<<24;

            for ( ;; ) {
                AdvanceHopperGrass = 0;
                ulHashTarget = *(unsigned long *)(pbTarget-cbPattern);

                    if ( ulHashPattern == ulHashTarget ) {
                    // Three unnecessary comparisons here,
                    // but 'AdvanceHopperGrass' must be calculated - it has a higher priority.
                    count = cbPattern-1;
                    while ( count && *(char *)(pbPattern+(cbPattern-count)) == *(char *)(pbTarget-count) ) {
                        if ( cbPattern-1==AdvanceHopperGrass+count && 
                                 SINGLET != *(char *)(pbTarget-count) ) AdvanceHopperGrass++;
                        count--;
                    }
                    if ( count == 0) Railgunhits++; //return((pbTarget-cbPattern));
                    } else { // The goal here: to avoid memory accesses by stressing the registers.
                    if ( Quadruplet2nd != (ulHashTarget & 0x0000FF00) ) {
                        AdvanceHopperGrass++;
                        if ( Quadruplet3rd != (ulHashTarget & 0x00FF0000) ) {
                            AdvanceHopperGrass++;
                            if ( Quadruplet4th != (ulHashTarget & 0xFF000000) ) AdvanceHopperGrass++;
                        }
                    }
                }

                AdvanceHopperGrass++;

                pbTarget = pbTarget + AdvanceHopperGrass;
                if (pbTarget > pbTargetMax)
                    return(NULL);
            }
        } else { //if (cbTarget<961)
            countSTATIC = cbPattern-2-2;

            for (a=0; a < 256; a++) {bm_bc[a]=cbPattern; bm_bc2nd[a]=cbPattern+1;}
            for (j=0; j < cbPattern-1; j++) bm_bc[pbPattern[j]]=cbPattern-j-1; 
            for (j=0; j < cbPattern; j++) bm_bc2nd[pbPattern[j]]=cbPattern-j; 

            // Elsiane [
            for (a=0; a < 256*256; a++) {bm_Sunday_Order2[a]= 0;} // 'memset' if not optimized

            // alfalfa 7 long 6 BBs (al lf fa al lf fa) 3 distinct BBs (al lf fa) 
            // fast 4 0-1-2 fa as st
            //for (j=0; j < cbPattern-1; j++) bm_Sunday_Order2[*(unsigned short *)(pbPattern+j)]=1;

            // The idea here is to use Order 2 of Sunday's approach:
            // It takes all rows and columns containing a char (e.g. A)
            // from the pattern to be marked (as 1 i.e. active)
            //    000 001 002 003 ...   A ... 255
            // 000                      * 
            // 001                      *
            // 002                      *
            // 003                      *
            // ...                      *
            //   A  *   *   *   *   *   *   *   *  
            // ...                      *
            // 255                      *
            // TO-DO: bytewise -> bitwise
            for (j=0; j < cbPattern; j++) 
                for (a=0; a < 256; a++) {
                    bm_Sunday_Order2[(a<<8) + pbPattern[j]]=1; // columns filling
                    bm_Sunday_Order2[(pbPattern[j]<<8) + a]=1; // rows filling
                }
            // Elsiane ]

            ulHashPattern = *(unsigned long *)(pbPattern);
        
            AdvanceHopperGrass = 0;
            i=0;
            while (i <= cbTarget-cbPattern-1-1) {
                if ( *(unsigned long *)&pbTarget[i] == ulHashPattern) {
                    count = countSTATIC;
                    while ( count !=0 && *(char *)(pbPattern+
                             (countSTATIC-count)+4) == *(char *)(&pbTarget[i]+(countSTATIC-count)+4) )
                        count--;
                    if ( count == 0) Railgunhits++; //return(pbTarget+i);
                }

                if ( bm_Sunday_Order2[*(unsigned short *)&pbTarget[i+cbPattern]] == 0 )
                    i = i + cbPattern + 1 + 1;
                else
                    if ( bm_bc[(unsigned char)pbTarget[i+cbPattern-1]] < 
                             bm_bc2nd[(unsigned char)pbTarget[i+(cbPattern)]] )
                        i= i + bm_bc2nd[(unsigned char)pbTarget[i+(cbPattern)]];
                    else
                        i= i + bm_bc[(unsigned char)pbTarget[i+cbPattern-1]];

                AdvanceHopperGrass++;
            }

            if (i == cbTarget-cbPattern-1) {
                if ( *(unsigned long *)&pbTarget[i] == ulHashPattern) {
                    count = countSTATIC;
                    while ( count !=0 && *(char *)(pbPattern+
                          (countSTATIC-count)+4) == *(char *)(&pbTarget[i]+(countSTATIC-count)+4) )
                        count--;
                    if ( count == 0) Railgunhits++; //return(pbTarget+i);
                }
                i++;
                AdvanceHopperGrass++;
            }

            if (i == cbTarget-cbPattern) {
                if ( *(unsigned long *)&pbTarget[i] == ulHashPattern) {
                    count = countSTATIC;
                    while ( count !=0 && *(char *)
                            (pbPattern+(countSTATIC-count)+4) == *(char *)
                            (&pbTarget[i]+(countSTATIC-count)+4) )
                        count--;
                    if ( count == 0) Railgunhits++; //return(pbTarget+i);
                }
                AdvanceHopperGrass++;
            }

            GlobalSP += (int)((double)cbTarget/AdvanceHopperGrass*100);
            GlobalI += AdvanceHopperGrass;
            //printf("Skip-Performance(bigger-the-better): %d%%, %d skips/iterations\n",
            //  (int)((double)cbTarget/AdvanceHopperGrass*100), AdvanceHopperGrass);
        
            return(NULL);
        } //if (cbTarget<961)
    } //if ( cbPattern<4)
}
// ### Mix(2in1) of Karp-Rabin & Boyer-Moore-Sunday-Horspool algorithm ]
/*
strstr_SHORT-SHOWDOWN, revision 7Elsiane_vs_7sun_vs_7_vs_7sunhorse_vs_7deuce_vs_BMF, written by Kaze.
Full credits to: R.S. Boyer, J.S. Moore, R.N. Horspool, D.M. Sunday.
Input Pattern(up to 19+2000 chars): faster

Doing Search for Pattern(6bytes) into String(206908949bytes) as-one-line ...
Skip-Performance(bigger-the-better): 591%, 34996936 skips/iterations
Railgun_Quadruplet_7sun_hits/Railgun_Quadruplet_7sun_clocks: 744/154
Railgun_Quadruplet_7sun performance: 1312KB/clock

Doing Search for Pattern(6bytes) into String(206908949bytes) as-one-line ...
Skip-Performance(bigger-the-better): 514%, 40194194 skips/iterations
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 744/158
Railgun_Quadruplet_7 performance: 1278KB/clock

Doing Search for Pattern(6bytes) into String(206908949bytes) as-one-line ...
Skip-Performance(bigger-the-better): 656%, 31504148 skips/iterations
Railgun_Quadruplet_7sunhorse_hits/Railgun_Quadruplet_7sunhorse_clocks: 744/272
Railgun_Quadruplet_7sunhorse performance: 742KB/clock

Doing Search for Pattern(6bytes) into String(206908949bytes) as-one-line ...
Skip-Performance(bigger-the-better): 567%, 36434006 skips/iterations
Railgun_Quadruplet_7deuce_hits/Railgun_Quadruplet_7deuce_clocks: 744/313
Railgun_Quadruplet_7deuce performance: 645KB/clock

Doing Search for Pattern(6bytes) into String(206908949bytes) as-one-line ...
Skip-Performance(bigger-the-better): 710%, 29101626 skips/iterations
Railgun_Quadruplet_7Elsiane_hits/Railgun_Quadruplet_7Elsiane_clocks: 744/383
Railgun_Quadruplet_7Elsiane performance: 527KB/clock

Doing Search for Pattern(6bytes) into String(206908949bytes) as-one-line ...
Skip-Performance(bigger-the-better): 514%, 40194216 skips/iterations
Boyer_Moore_Flensburg_hits/Boyer_Moore_Flensburg_clocks: 744/299
Boyer_Moore_Flensburg performance: 675KB/clock

Doing Search for Pattern(6bytes) into String(206908949bytes) as-one-line ...
Skip-Performance(bigger-the-better): 106%, 194548807 skips/iterations
Two-Way_hits/Two-Way_clocks: 744/836
Two-Way performance: 241KB/clock

D:\_KAZE_strstr_SHORT-SHOWDOWN_7Elsiane_7sun_vs_7_vs_7sunhorse_vs_7deuce_vs_BMF>

*/

I like a lot this look-ahead approach of D.M. Sunday, but can't see how to quicken it, any ideas?

An add-on from 2012-Jan-22:

[
A face-off
'SunHorse' vs Railgun_Quadruplet_7sun vs Railgun_Quadruplet_7 vs Boyer_Moore vs Brute_Force

Okay, here comes Railgun_Quadruplet_7sun - the new fastest hitter. Two tests were done - the first/second '6x2'/'52' searches 12/52 patterns (for all appearances i.e. counting all hits) into 206,908,949 bytes long English text.

First, 7deuce is in the past, a bad luck it was, yes. Wow, I have received collateral damage by the last shoot-out in which D.M. Sunday excels both in Speed-Performance and Skip-Performance.

Speed-Performance (measured in KB/clock, bigger-the-better) Summary:

Function                      '6x2' test (12 patterns 4-19 chars in length)

Railgun_Quadruplet_7sun       2,092KB/clock / 153,936% / 005,572,202,064 Iterations  BONBON
Railgun_Quadruplet_7          2,030KB/clock / 137,568% / 006,454,416,320 Iterations  Leaves-The-Town
Railgun_Quadruplet_7sunhorse  1,250KB/clock / 173,840% / 005,145,288,080 Iterations  UNDERRATED
Boyer_Moore_Flensburg         1,154KB/clock / 139,168% / 006,428,115,568 Iterations  Slow 
Railgun_Quadruplet_7deuce     1,125KB/clock / 155,744% / 005,995,597,776 Iterations  Slower 
Brute_Force_Dummy             0,338KB/clock / 019,200% / 039,726,516,640 Iterations  Slowest 

Function                      '52' test (52 patterns 4-175 chars in length)

Railgun_Quadruplet_7sun       1,657KB/clock / 768,416% / 022,362,482,640 Iterations  BONBON
Railgun_Quadruplet_7          1,637KB/clock / 708,768% / 025,368,049,712 Iterations  Leaves-The-Town
Railgun_Quadruplet_7sunhorse  1,046KB/clock / 939,520% / 020,187,534,736 Iterations  UNDERRATED
Railgun_Quadruplet_7deuce     0,923KB/clock / 864,048% / 023,292,952,992 Iterations  Slow 
Boyer_Moore_Flensburg         0,919KB/clock / 720,272% / 025,227,312,176 Iterations  Slower 
Brute_Force_Dummy             0,276KB/clock / 083,200% / 172,148,232,496 Iterations  Slowest 

Skip-Performance (measured in %, bigger-the-better; Iterations, lesser-the-better) Summary:

Function                      '6x2' test (12 patterns 4-19 chars in length)

Railgun_Quadruplet_7sunhorse  1,250KB/clock / 173,840% / 005,145,288,080 Iterations  BONBON
Railgun_Quadruplet_7deuce     1,125KB/clock / 155,744% / 005,995,597,776 Iterations  Leaves-The-Town
Railgun_Quadruplet_7sun       2,092KB/clock / 153,936% / 005,572,202,064 Iterations  Bitter-Sweet
Boyer_Moore_Flensburg         1,154KB/clock / 139,168% / 006,428,115,568 Iterations  Loopy
Railgun_Quadruplet_7          2,030KB/clock / 137,568% / 006,454,416,320 Iterations  Loopier
Brute_Force_Dummy             0,338KB/clock / 019,200% / 039,726,516,640 Iterations  Loopiest 

Function                      '52' test (52 patterns 4-175 chars in length)

Railgun_Quadruplet_7sunhorse  1,046KB/clock / 939,520% / 020,187,534,736 Iterations  BONBON
Railgun_Quadruplet_7deuce     0,923KB/clock / 864,048% / 023,292,952,992 Iterations  Leaves-The-Town
Railgun_Quadruplet_7sun       1,657KB/clock / 768,416% / 022,362,482,640 Iterations  Bitter-Sweet
Boyer_Moore_Flensburg         0,919KB/clock / 720,272% / 025,227,312,176 Iterations  Loopy
Railgun_Quadruplet_7          1,637KB/clock / 708,768% / 025,368,049,712 Iterations  Loopier
Brute_Force_Dummy             0,276KB/clock / 083,200% / 172,148,232,496 Iterations  Loopiest

For English texts I would have no second thought which function to choose: it is Railgun_Quadruplet_7sunhorse (Boyer-Moore reinforced by Sunday-Horspool) - strongest of all Boyer-Moore variants (including the original), sadly [still] not among the fastest. I can't figure out the cause for such unproportionality: 'sunhorse' features amazing (for English texts) Skip-Performance i.e. number of pattern comparisions agaist the text is so low but Speed-Performance (on this CPU) lags behind.

The test package is downloadable at:
http://www.sanmayce.com/Downloads/_KAZE_strstr_SHORT-SHOWDOWN_7sun_vs_7_vs_7sunhorse_vs_7deuce_vs_BMF.7z

The test package contains:

D:\_KAZE_strstr_SHORT-SHOWDOWN_7sun_vs_7_vs_7sunhorse_vs_7deuce_vs_BMF>dir

01/22/2012  07:00 AM            59,198 FH_Flensburg.zip
01/22/2012  07:00 AM       206,908,949 OSHO.TXT
01/22/2012  07:00 AM           469,283 Results_Pentium_Merom_2166Mhz.txt
01/22/2012  07:00 AM               175 RUNME_IT_TAKES_some_20_MINUTES.BAT
01/22/2012  07:00 AM               191 strstr_COMPILE_Microsoft.bat
01/22/2012  07:00 AM           603,625 strstr_SHORT-SHOWDOWN.c
01/22/2012  07:00 AM           633,053 strstr_SHORT-SHOWDOWN.cod
01/22/2012  07:00 AM           108,032 strstr_SHORT-SHOWDOWN_Microsoft_v16_Ox.exe

D:\_KAZE_strstr_SHORT-SHOWDOWN_7sun_vs_7_vs_7sunhorse_vs_7deuce_vs_BMF>

A very easy-to-the-eyes/mind article at http://www.inf.fh-flensburg.de/lang/algorithmen/pattern/bm.htm. Here I say 'Triple THANKS' to Prof. Dr. Hans Werner Lang (Institut für Medieninformatik und Technische Informatik Fachhochschule Flensburg), I used your code (after a straightforward port). On his page (http://www.inf.fh-flensburg.de/lang/) there are very well presented algorithms - a very good resource indeed. I expect Railgun_Quadruplet_7sunhorse to become a gun of choice but only on CPUs with FAR more effective branching/memory handling than my Pentium Merom.

I still do not understand Turbo Boyer-Moore algorithm, maybe it holds a significant Skip-Performance boost. This project is under researching, that is why I have called it a quest. Anyone who can point out faster functions is welcomed to share them with us here.

My favorite UNDERRATED gun has main loop as follows:

; 3209 :     i=0;
; 3210 :     while (i <= cbTarget-cbPattern-1) {

  010e6    8b 4c 24 20     mov     ecx, DWORD PTR _cbTarget$GSCopy$[esp+2104]
  010ea    89 44 24 28     mov     DWORD PTR _ulHashPattern$[esp+2104], eax
  010ee    33 c0         xor     eax, eax
  010f0    2b ca         sub     ecx, edx
  010f2    89 4c 24 14     mov     DWORD PTR tv554[esp+2104], ecx
  010f6    8d 0c 13     lea     ecx, DWORD PTR [ebx+edx]
  010f9    89 44 24 10     mov     DWORD PTR _AdvanceHopperGrass$[esp+2104], eax
  010fd    89 4c 24 18     mov     DWORD PTR tv451[esp+2104], ecx
$LN13@Railgun_Qu@5:

; 3211 :          ulHashTarget=*(unsigned long *)&pbTarget[i];
; 3212 :          if ( ulHashTarget == ulHashPattern)

  01101    8b 54 24 28     mov     edx, DWORD PTR _ulHashPattern$[esp+2104]
  01105    39 14 18     cmp     DWORD PTR [eax+ebx], edx
  01108    75 41         jne     SHORT $LN8@Railgun_Qu@5

; 3213 :              {
; 3214 :          count = countSTATIC;

  0110a    8b 7c 24 1c     mov     edi, DWORD PTR _countSTATIC$[esp+2104]

; 3215 :          while ( count !=0 && *(char *)(pbPattern+(countSTATIC-count)+4) == 
  *(char *)(&pbTarget[i]+(countSTATIC-count)+4) ) {
  // if pattern length is 4 or 5 we have count=-1 and count=0
  // respectively i.e. no need of comparing in-between chars.

  0110e    85 ff         test     edi, edi
  01110    74 2d         je     SHORT $LN85@Railgun_Qu@5

; 3213 :              {
; 3214 :          count = countSTATIC;

  01112    8d 56 04     lea     edx, DWORD PTR [esi+4]
  01115    8d 74 18 04     lea     esi, DWORD PTR [eax+ebx+4]
  01119    8d a4 24 00 00
    00 00         npad     7
$LL10@Railgun_Qu@5:

; 3215 :          while ( count !=0 && *(char *)(pbPattern+(countSTATIC-count)+4) == 
   *(char *)(&pbTarget[i]+(countSTATIC-count)+4) ) {
   // if pattern length is 4 or 5 we have count=-1 and count=0
   // respectively i.e. no need of comparing in-between chars.

  01120    8a 0a         mov     cl, BYTE PTR [edx]
  01122    3a 0e         cmp     cl, BYTE PTR [esi]
  01124    75 11         jne     SHORT $LN9@Railgun_Qu@5

; 3216 :                count--;

  01126    46         inc     esi
  01127    42         inc     edx
  01128    4f         dec     edi
  01129    75 f5         jne     SHORT $LL10@Railgun_Qu@5

; 3217 :          }
; 3218 :          if ( count == 0) Railgunhits++; //return(pbTarget+i);

  0112b    8b 74 24 24     mov     esi, DWORD PTR _pbPattern$GSCopy$[esp+2104]
  0112f    ff 05 00 00 00
    00         inc     DWORD PTR _Railgunhits
  01135    eb 14         jmp     SHORT $LN8@Railgun_Qu@5
$LN9@Railgun_Qu@5:
  01137    85 ff         test     edi, edi
  01139    75 0c         jne     SHORT $LN111@Railgun_Qu@5
  0113b    8b 74 24 24     mov     esi, DWORD PTR _pbPattern$GSCopy$[esp+2104]
$LN85@Railgun_Qu@5:
  0113f    ff 05 00 00 00
    00         inc     DWORD PTR _Railgunhits
  01145    eb 04         jmp     SHORT $LN8@Railgun_Qu@5
$LN111@Railgun_Qu@5:
  01147    8b 74 24 24     mov     esi, DWORD PTR _pbPattern$GSCopy$[esp+2104]
$LN8@Railgun_Qu@5:

; 3219 :          }
; 3220 : 
; 3221 : //         i+=cbPattern;
; 3222 : //         i= i - bm_bc2nd[(unsigned char)pbTarget[i]];
; 3223 : 
; 3224 : //if ( bm_bc[(unsigned char)pbTarget[i+cbPattern-1]] < 
         //   ((cbPattern) - bm_bc2nd[(unsigned char)pbTarget[i+(cbPattern)]]) )
; 3225 : //         i= i + ((cbPattern) - bm_bc2nd[(unsigned char)pbTarget[i+(cbPattern)]]);
; 3226 : if ( bm_bc[(unsigned char)pbTarget[i+cbPattern-1]]
               < bm_bc2nd[(unsigned char)pbTarget[i+(cbPattern)]] )

  0114b    8b 54 24 18     mov     edx, DWORD PTR tv451[esp+2104]
  0114f    0f b6 4c 02 ff     movzx     ecx, BYTE PTR [edx+eax-1]
  01154    0f b6 14 02     movzx     edx, BYTE PTR [edx+eax]
  01158    8b 4c 8c 30     mov     ecx, DWORD PTR _bm_bc$[esp+ecx*4+2104]
  0115c    8b 94 94 30 04
    00 00         mov     edx, DWORD PTR _bm_bc2nd$[esp+edx*4+2104]
  01163    3b ca         cmp     ecx, edx
  01165    73 04         jae     SHORT $LN7@Railgun_Qu@5

; 3227 :          i= i + bm_bc2nd[(unsigned char)pbTarget[i+(cbPattern)]];

  01167    03 c2         add     eax, edx

; 3228 : else

  01169    eb 02         jmp     SHORT $LN6@Railgun_Qu@5
$LN7@Railgun_Qu@5:

; 3229 :          i= i + bm_bc[(unsigned char)pbTarget[i+cbPattern-1]];

  0116b    03 c1         add     eax, ecx
$LN6@Railgun_Qu@5:

; 3207 : 
; 3208 :     AdvanceHopperGrass = 0;
; 3209 :     i=0;
; 3210 :     while (i <= cbTarget-cbPattern-1) {

  0116d    8b 4c 24 14     mov     ecx, DWORD PTR tv554[esp+2104]

; 3230 : 
; 3231 : 
; 3232 :          AdvanceHopperGrass++;

  01171    ff 44 24 10     inc     DWORD PTR _AdvanceHopperGrass$[esp+2104]
  01175    49         dec     ecx
  01176    3b c1         cmp     eax, ecx
  01178    76 87         jbe     SHORT $LN13@Railgun_Qu@5

; 3233 :     }

The size is:
01178 - 01101 +1 +1 -4 = 75bytes, -4 because 'inc DWORD PTR _AdvanceHopperGrass$[esp+2104]' is only for statistical use.
]

Background

Function homepage: http://www.sanmayce.com/Railgun/index.html

Note: In order to reproduce the tests shown on the picture (below) you need the test package (57.2 MB) downloadable at function's homepage, thanks Randor for pointing out.

Very informative resource on strstr search techniques at:

Tests done with 16 patterns with 197MB haystack (being a pure English book-like text).

Railgun_Quadruplet/Railgun_Benchmarked.png

The machine is a mainstream laptop on 2.1GHz with Intel Merom CPU.

Using the Code

The code is pretty simple and straightforward.

C++
// Here the strstr-like function Railgun_Quadruplet_6ppp is given,  
// written by Georgi 'Kaze', 2011-Sep-07.
//
// ### Mix(2in1) of Karp-Rabin & Boyer-Moore-Horspool algorithm [
// Caution: For better speed the case 'if (cbPattern==1)' was removed, 
// so Pattern must be longer than 1 char.
char * Railgun_Quadruplet_6ppp (char * pbTarget,
     char * pbPattern,
     unsigned long cbTarget,
     unsigned long cbPattern)
{
    char * pbTargetMax = pbTarget + cbTarget;
    register unsigned long  ulHashPattern;
    unsigned long ulHashTarget;
    //unsigned long count; //r.6+
    signed long count;
    //unsigned long countSTATIC; //r.6+
    signed long countSTATIC;
//  unsigned long countRemainder;

/*
    const unsigned char SINGLET = *(char *)(pbPattern);
    const unsigned long Quadruplet2nd = SINGLET<<8;
    const unsigned long Quadruplet3rd = SINGLET<<16;
    const unsigned long Quadruplet4th = SINGLET<<24;
*/
    unsigned char SINGLET;
    unsigned long Quadruplet2nd;
    unsigned long Quadruplet3rd;
    unsigned long Quadruplet4th;

    unsigned long  AdvanceHopperGrass;

    long i; //BMH needed
    int a, j, bm_bc[ASIZE]; //BMH needed
    unsigned char ch; //BMH needed
//    unsigned char lastch, firstch; //BMH needed

    if (cbPattern > cbTarget)
        return(NULL);

// Doesn't work when cbPattern = 1
// The next IF-fragment works very well with cbPattern>1, 
// OBVIOUSLY IT MUST BE UNROLLED(but crippled with less functionality) 
// SINCE either cbPattern=2 or cbPattern=3!
if ( cbPattern<4) {     // This IF makes me unhappy: it slows down from 
            // 390KB/clock to 367KB/clock for 'fast' pattern. 
            // This fragment(for 2..3 pattern lengths) is needed 
            // because I need a function different than strchr, but 
            // sticking to strstr i.e. lengths above 1 are to be handled.
        pbTarget = pbTarget+cbPattern;
        ulHashPattern = ( (*(char *)(pbPattern))<<8 ) + *(pbPattern+(cbPattern-1));
//        countSTATIC = cbPattern-2;

if ( cbPattern==3) {
    for ( ;; )
    {
        if ( ulHashPattern == ( (*(char *)(pbTarget-3))<<8 ) + *(pbTarget-1) ) {
         if ( *(char *)(pbPattern+1) == *(char *)(pbTarget-2) ) return((pbTarget-3));
        }
        if ( (char)(ulHashPattern>>8) != *(pbTarget-2) ) pbTarget++;
        pbTarget++;
        if (pbTarget > pbTargetMax)
            return(NULL);
    }
} else {
}
    for ( ;; )
    {
        // The line below gives for 'cbPattern'>=1:
        // Karp_Rabin_Kaze_4_OCTETS_hits/Karp_Rabin_Kaze_4_OCTETS_clocks: 4/543
        // Karp_Rabin_Kaze_4_OCTETS performance: 372KB/clock
/*
        if ( (ulHashPattern == ( (*(char *)(pbTarget-cbPattern))<<8 ) + 
        *(pbTarget-1)) && !memcmp(pbPattern, pbTarget-cbPattern, 
        (unsigned int)cbPattern) )
            return((long)(pbTarget-cbPattern));
*/

        // The fragment below gives for 'cbPattern'>=2:
        // Karp_Rabin_Kaze_4_OCTETS_hits/Karp_Rabin_Kaze_4_OCTETS_clocks: 4/546
        // Karp_Rabin_Kaze_4_OCTETS performance: 370KB/clock

/*
//For 2 and 3 [
        if ( ulHashPattern == ( (*(char *)(pbTarget-cbPattern))<<8 ) + *(pbTarget-1) ) {
//         count = countSTATIC;
         count = cbPattern-2;
//         while ( count && *(char *)(pbPattern+1+(countSTATIC-count)) == 
//         *(char *)(pbTarget-cbPattern+1+(countSTATIC-count)) ) {
         while ( count && *(char *)(pbPattern+1) == *(char *)(pbTarget-2) ) 
            { // Crippling i.e. only 2 and 3 chars are allowed!
               count--;
         }
         if ( count == 0) return((pbTarget-cbPattern));
        }
        if ( (char)(ulHashPattern>>8) != *(pbTarget-cbPattern+1) ) pbTarget++;
//For 2 and 3 ]
*/


        if ( ulHashPattern == ( (*(char *)(pbTarget-2))<<8 ) + *(pbTarget-1) )
            return((pbTarget-2));
        if ( (char)(ulHashPattern>>8) != *(pbTarget-1) ) pbTarget++;


        // The fragment below gives for 'cbPattern'>=2:
    // Karp_Rabin_Kaze_4_OCTETS_hits/Karp_Rabin_Kaze_4_OCTETS_clocks: 4/554
    // Karp_Rabin_Kaze_4_OCTETS performance: 364KB/clock
/*
        if ( ulHashPattern == ( (*(char *)(pbTarget-cbPattern))<<8 ) + *(pbTarget-1) ) {
         count = countSTATIC>>2;
         countRemainder = countSTATIC % 4;

         while ( count && *(unsigned long *)(pbPattern+1+((count-1)<<2)) == 
        *(unsigned long *)(pbTarget-cbPattern+1+((count-1)<<2)) ) {
               count--;
         }
     //if (count == 0) {      // Disastrous degradation only from this line
                // (317KB/clock when 1+2x4+2+1 bytes pattern: 
                // 'skillessness'; 312KB/clock when 1+1x4+2+1 bytes 
                // pattern: 'underdog'), otherwise 368KB/clock.
         while ( countRemainder && *(char *)(pbPattern+1+
        (countSTATIC-countRemainder)) == *(char *)(pbTarget-cbPattern+1+
        (countSTATIC-countRemainder)) ) {
               countRemainder--;
         }
         //if ( countRemainder == 0) return((long)(pbTarget-cbPattern));
         if ( count+countRemainder == 0) return((long)(pbTarget-cbPattern));
         //}
        }
*/

        pbTarget++;
        if (pbTarget > pbTargetMax)
            return(NULL);
    }
} else { //if ( cbPattern<4)
if (cbTarget<961) // This value is arbitrary(don't know how exactly), 
        // it ensures(at least must) 
        // better performance than 'Boyer_Moore_Horspool'.
{
        pbTarget = pbTarget+cbPattern;
        ulHashPattern = *(unsigned long *)(pbPattern);
//        countSTATIC = cbPattern-1;

    //SINGLET = *(char *)(pbPattern);
    SINGLET = ulHashPattern & 0xFF;
    Quadruplet2nd = SINGLET<<8;
    Quadruplet3rd = SINGLET<<16;
    Quadruplet4th = SINGLET<<24;

    for ( ;; )
    {
    AdvanceHopperGrass = 0;
    ulHashTarget = *(unsigned long *)(pbTarget-cbPattern);

        if ( ulHashPattern == ulHashTarget ) { // Three unnecessary comparisons here, 
        // but 'AdvanceHopperGrass' must be calculated - it has a higher priority.
//         count = countSTATIC;
//         while ( count && *(char *)(pbPattern+1+(countSTATIC-count)) == 
//        *(char *)(pbTarget-cbPattern+1+(countSTATIC-count)) ) {
//           if ( countSTATIC==AdvanceHopperGrass+count && SINGLET != 
//         *(char *)(pbTarget-cbPattern+1+(countSTATIC-count)) ) 
//        AdvanceHopperGrass++;
//               count--;
//         }
         count = cbPattern-1;
         while ( count && *(char *)(pbPattern+(cbPattern-count)) == 
                    *(char *)(pbTarget-count) ) {
           if ( cbPattern-1==AdvanceHopperGrass+count && SINGLET != 
            *(char *)(pbTarget-count) ) AdvanceHopperGrass++;
               count--;
         }
         if ( count == 0) return((pbTarget-cbPattern));
        } else { // The goal here: to avoid memory accesses by stressing the registers.
    if ( Quadruplet2nd != (ulHashTarget & 0x0000FF00) ) {
         AdvanceHopperGrass++;
         if ( Quadruplet3rd != (ulHashTarget & 0x00FF0000) ) {
              AdvanceHopperGrass++;
              if ( Quadruplet4th != (ulHashTarget & 0xFF000000) ) AdvanceHopperGrass++;
         }
    }
    }

    AdvanceHopperGrass++;

    pbTarget = pbTarget + AdvanceHopperGrass;
        if (pbTarget > pbTargetMax)
            return(NULL);
    }
} else { //if (cbTarget<961)
        //countSTATIC = cbPattern-2; //r.6+
        //countSTATIC = cbPattern-2-3; //r.6++
        countSTATIC = cbPattern-2-2; //r.6+++ This line fixes the bug from r.6++!
        ulHashPattern = *(unsigned long *)(pbPattern);
    /* Preprocessing */
    for (a=0; a < ASIZE; a++) bm_bc[a]=cbPattern;
    for (j=0; j < cbPattern-1; j++) bm_bc[pbPattern[j]]=cbPattern-j-1;

    /* Searching */
    //lastch=pbPattern[cbPattern-1];
    //firstch=pbPattern[0];
    i=0;
    while (i <= cbTarget-cbPattern) {
       ch=pbTarget[i+cbPattern-1];
       //if (ch ==lastch)
          //if (memcmp(&pbTarget[i],pbPattern,cbPattern-1) == 0) OUTPUT(i);
          //if (ch == lastch && pbTarget[i] == firstch && 
     //memcmp(&pbTarget[i],pbPattern,cbPattern-1) == 0) return(i);  
          // Kaze: The idea(to prevent execution of slower 'memcmp') 
          // is borrowed from Karp-Rabin i.e. to perform a slower check only 
          // when the target "looks like".
          //if (ch == pbPattern[cbPattern-1] && pbTarget[i] == pbPattern[0]) // r.6+
          //if (ch == pbPattern[cbPattern-1] && *(long *)&pbTarget[i] 
          // == *(long *)&pbPattern[0]) // No problema here since we have 4[+] 
          // long pattern here. Overlapping (1 byte recompared) when length=4, grmbl.
          if (ch == pbPattern[cbPattern-1] && 
        *(long *)&pbTarget[i] == ulHashPattern) // No problema here since 
            // we have 4[+] long pattern here. Overlapping 
            // (1 byte recompared) when length=4, grmbl.
             {
         count = countSTATIC;
         //while ( count && *(char *)(pbPattern+1+(countSTATIC-count)) == 
    //*(char *)(&pbTarget[i]+1+(countSTATIC-count)) ) { // r.6+
         while ( count !=0 && *(char *)(pbPattern+1+3+(countSTATIC-count)) == 
        *(char *)(&pbTarget[i]+1+3+(countSTATIC-count)) ) 
        {     // if pattern length is 4 or 5 we have count=-1 and count=0 
            // respectively i.e. no need of comparing in-between chars.
               count--;
         }
         if ( count <= 0) return(pbTarget+i);
         }
       i+=bm_bc[ch];
    }
    return(NULL);
} //if (cbTarget<961)
} //if ( cbPattern<4)
}
// ### Mix(2in1) of Karp-Rabin & Boyer-Moore-Horspool algorithm ]

Stomp stomp I arrived, here comes the Railgun_Quadruplet revision 7:

C++
// Caution: For better speed the case 'if (cbPattern==1)' was removed, 
// so Pattern must be longer than 1 char.
char * Railgun_Quadruplet_7 (char * pbTarget, char * pbPattern, 
        unsigned long cbTarget, unsigned long cbPattern)
{
    char * pbTargetMax = pbTarget + cbTarget;
    register unsigned long ulHashPattern;
    unsigned long ulHashTarget;
    signed long count;
    signed long countSTATIC;

    unsigned char SINGLET;
    unsigned long Quadruplet2nd;
    unsigned long Quadruplet3rd;
    unsigned long Quadruplet4th;
    unsigned long AdvanceHopperGrass;

    long i;
    int a, j, bm_bc[ASIZE];
    unsigned char ch;

    if (cbPattern > cbTarget) return(NULL);

    if (cbPattern<4) {
        pbTarget = pbTarget+cbPattern;
        ulHashPattern = ( (*(char *)(pbPattern))<<8 ) + 
                    *(pbPattern+(cbPattern-1));
        if (cbPattern==3) {
            for ( ;; ) {
                if ( ulHashPattern == 
                ( (*(char *)(pbTarget-3))<<8 ) + *(pbTarget-1) ) {
                    if ( *(char *)(pbPattern+1) == 
                    *(char *)(pbTarget-2) ) return((pbTarget-3));
                }
                if ( (char)(ulHashPattern>>8) != 
                        *(pbTarget-2) ) pbTarget++;
                pbTarget++;
                    if (pbTarget > pbTargetMax) return(NULL);
                }
        } else {
        }
            for ( ;; ) {
                if ( ulHashPattern == ( (*(char *)(pbTarget-2))<<8 ) 
                    + *(pbTarget-1) ) return((pbTarget-2));
                if ( (char)(ulHashPattern>>8) != *(pbTarget-1) ) 
                    pbTarget++;
                pbTarget++;
                if (pbTarget > pbTargetMax) return(NULL);
            }
    } else {
        if (cbTarget<961) {
            pbTarget = pbTarget+cbPattern;
            ulHashPattern = *(unsigned long *)(pbPattern);
            SINGLET = ulHashPattern & 0xFF;
            Quadruplet2nd = SINGLET<<8;
            Quadruplet3rd = SINGLET<<16;
            Quadruplet4th = SINGLET<<24;
            for ( ;; ) {
                AdvanceHopperGrass = 0;
                ulHashTarget = *(unsigned long *)(pbTarget-cbPattern);
                if ( ulHashPattern == ulHashTarget ) { 
                    count = cbPattern-1;
                    while ( count && *(char *)
                    (pbPattern+(cbPattern-count)) == 
                        *(char *)(pbTarget-count) ) {
                        if ( cbPattern-1==
                        AdvanceHopperGrass+count && 
                        SINGLET != *(char *)(pbTarget-count)
                             ) AdvanceHopperGrass++;
                        count--;
                    }
                    if ( count == 0) 
                        return((pbTarget-cbPattern));
                } else {
                    if ( Quadruplet2nd != (ulHashTarget 
                            & 0x0000FF00) ) {
                        AdvanceHopperGrass++;
                        if ( Quadruplet3rd != 
                        (ulHashTarget & 0x00FF0000) ) {
                            AdvanceHopperGrass++;
                            if ( Quadruplet4th != 
                            (ulHashTarget & 0xFF000000)
                             ) AdvanceHopperGrass++;
                        }
                    }
                }
                AdvanceHopperGrass++;
                pbTarget = pbTarget + AdvanceHopperGrass;
                if (pbTarget > pbTargetMax) return(NULL);
            }
        } else {
            countSTATIC = cbPattern-2-2; 
            ulHashPattern = *(unsigned long *)(pbPattern);
            for (a=0; a < ASIZE; a++) bm_bc[a]=cbPattern;
            for (j=0; j < cbPattern-1; j++) bm_bc[pbPattern[j]]=
                                cbPattern-j-1;
            i=0;
            while (i <= cbTarget-cbPattern) {
                if ( *(unsigned long *)&pbTarget[i] == 
                    ulHashPattern ) {
                    count = countSTATIC;
                    while ( count !=0 && *(char *)
                    (pbPattern+1+3+(countSTATIC-count)) == 
                    *(char *)(&pbTarget[i]+1+3+
                    (countSTATIC-count)) ) count--;
                    if ( count == 0) return(pbTarget+i);
                }
                i= i + bm_bc[(unsigned char)pbTarget[i+cbPattern-1]];
            }
            return(NULL);
        }
    }
}

The tweak (in BMH fragment) which speeds-up significantly the obsolete (already) r.6+++ is this:

C++
//countSTATIC = cbPattern-2; //r.6+
//countSTATIC = cbPattern-2-3;
//countSTATIC = cbPattern-2-2; // r.6+++ I suppose that the awful degradation 
// comes from 2bytes more (from either 'if (countSTATIC<0) 
// countSTATIC=0;' or 'count >0' fixes) which make the 
// function unfittable in code cache lines?!
//countSTATIC = cbPattern-2-3;     // r.7- At last no recompared bytes 
// in-between chars
countSTATIC = cbPattern-2-2; // r.7 
ulHashPattern = *(unsigned long *)(pbPattern);
//chPTR=(unsigned char *)&chchchch+3;
// Next line fixes the BUG from r.6++: but with awful speed degradation! 
// So the bug is fixed in the definitions by setting 'countSTATIC = cbPattern-2-2;', 
// bug appears only for patterns with lengths of 4, The setback is one unnecessary 
// comparison for 5bytes patterns, stupidly such setback exists (from before) for 
// 4bytes as well.
//if (countSTATIC<0) countSTATIC=0;
    /* Preprocessing */
    for (a=0; a < ASIZE; a++) bm_bc[a]=cbPattern;
    for (j=0; j < cbPattern-1; j++) bm_bc[pbPattern[j]]=cbPattern-j-1;

    /* Searching */
    //lastch=pbPattern[cbPattern-1];
    //firstch=pbPattern[0];
    i=0;
    while (i <= cbTarget-cbPattern) {
       //ch=pbTarget[i+cbPattern-1];
       //ch=pbTarget[i];
          //if ( pbTarget[i] == pbPattern[0] && *(unsigned long *)
          //&pbTarget[i+cbPattern-1-3] == ulHashPattern) // No problema here since 
          // we have 4[+] long pattern here. Overlapping 
          // (1 byte recompared) when length=4, grmbl.
          if ( *(unsigned long *)&pbTarget[i] == ulHashPattern ) // The lesson I learned 
            // from r.7- now applied in r.7: instead of extracting 'ch' 
            // having higher address now the lower address is extracted 
            // first in order (hopefully, the test confirms it) the next 
            // 32bytes (including 'ch') to be cached i.e. to comparison 
            // part is faster.
          {
              count = countSTATIC;
              while ( count !=0 && *(char *)(pbPattern+(countSTATIC-count)+4) == 
                      *(char *)(&pbTarget[i]+(countSTATIC-count)+4) )
              { // if pattern length 
                  // is 4 or 5 we have count=-1 and count=0 respectively i.e. no need of 
                  // comparing in-between chars.
                  count--;
              }
              if ( count == 0) return(pbTarget+i);
         }
         i= i + bm_bc[(unsigned char)pbTarget[i+cbPattern-1]];
    }
    return(NULL);

In order to see what path I had to traverse to make this tweak, you may see the discussion board (r.7m thread). The tests below show NOT the full boost simply because they average times for all lengths thus losing/diminishing the high speeds times (being 4-3-2 times lesser compared with pattern lengths 2-3-4-5-6).

D:\_KAZE_new-stuff>strstr_SHORT-SHOWDOWN_Microsoft_v16_Ox.exe
strstr_SHORT-SHOWDOWN, revision 7, written by Kaze.
Input Pattern(up to 19+2000 chars): frustration

Doing Search for Pattern(11bytes) into String(206908949bytes) as-one-line ...
WARNING! The next line works with BMH only for pattern 4[+] long, 
otherwise (for 2 and 3) other searcher takes over!
Railgun_6pp_hits/Railgun_6pp_clocks: 1355/101
Railgun_6pp performance: 2000KB/clock

Doing Search for Pattern(11bytes) into String(206908949bytes) as-one-line ...
WARNING! The next line works with BMH only for pattern 4[+] long, 
otherwise (for 2 and 3) other searcher takes over!
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 1355/91
Railgun_Quadruplet_7 performance: 2220KB/clock

Note: Executing the next two tests 256 times i.e. the search is for 256x8x2 patterns!

Doing Search for 8x2 Patterns into String(206908949bytes) as-one-line ... 
Not-Counting-Hits-Just-Returns-First-One
Found ('an') at 157 position, Railgun_Quadruplet_6pp performance: 0KB/clock
Found ('to') at 853 position, Railgun_Quadruplet_6pp performance: 0KB/clock
Found ('TDK') at -4325408 position, Railgun_Quadruplet_6pp performance: 922KB/clock
Found ('the') at 511 position, Railgun_Quadruplet_6pp performance: 0KB/clock
Found ('fast') at 42381 position, Railgun_Quadruplet_6pp performance: 41KB/clock
Found ('easy') at 30163 position, Railgun_Quadruplet_6pp performance: 29KB/clock
Found ('grmbl') at -4325408 position, Railgun_Quadruplet_6pp performance: 1167KB/clock
Found ('email') at 69587982 position, Railgun_Quadruplet_6pp performance: 1078KB/clock
Found ('pasting') at 134600126 position, Railgun_Quadruplet_6pp performance: 1383KB/clock
Found ('amazing') at 1629806 position, Railgun_Quadruplet_6pp performance: 93KB/clock
Found ('underdog') at 137424498 position, 
            Railgun_Quadruplet_6pp performance: 1698KB/clock
Found ('superdog') at -4325408 position, Railgun_Quadruplet_6pp performance: 1603KB/clock
Found ('participants') at 16060 position, Railgun_Quadruplet_6pp performance: 15KB/clock
Found ('skillessness') at -4325408 position, 
    Railgun_Quadruplet_6pp performance: 1836KB/clock
Found ('I should have known') at 41831880 position, 
    Railgun_Quadruplet_6pp performance: 2403KB/clock
Found ('human consciousness') at 3863 position, 
    Railgun_Quadruplet_6pp performance: 3KB/clock
Railgun_Quadruplet_6pp 8x2 i.e. average performance: 1342KB/clock
ReallyTraversed: 310,477,846,528 bytes

Doing Search for 8x2 Patterns into String(206908949bytes) as-one-line ... 
Not-Counting-Hits-Just-Returns-First-One
Found ('an') at 157 position, Railgun_Quadruplet_7 performance: 0KB/clock
Found ('to') at 853 position, Railgun_Quadruplet_7 performance: 0KB/clock
Found ('TDK') at -4325408 position, Railgun_Quadruplet_7 performance: 990KB/clock
Found ('the') at 511 position, Railgun_Quadruplet_7 performance: 0KB/clock
Found ('fast') at 42381 position, Railgun_Quadruplet_7 performance: 41KB/clock
Found ('easy') at 30163 position, Railgun_Quadruplet_7 performance: 29KB/clock
Found ('grmbl') at -4325408 position, Railgun_Quadruplet_7 performance: 1069KB/clock
Found ('email') at 69587982 position, Railgun_Quadruplet_7 performance: 1078KB/clock
Found ('pasting') at 134600126 position, Railgun_Quadruplet_7 performance: 1383KB/clock
Found ('amazing') at 1629806 position, Railgun_Quadruplet_7 performance: 1591KB/clock
Found ('underdog') at 137424498 position, Railgun_Quadruplet_7 performance: 1698KB/clock
Found ('superdog') at -4325408 position, Railgun_Quadruplet_7 performance: 1422KB/clock
Found ('participants') at 16060 position, Railgun_Quadruplet_7 performance: 15KB/clock
Found ('skillessness') at -4325408 position, 
    Railgun_Quadruplet_7 performance: 2149KB/clock
Found ('I should have known') at 41831880 position, 
    Railgun_Quadruplet_7 performance: 2403KB/clock
Found ('human consciousness') at 3863 position, 
    Railgun_Quadruplet_7 performance: 3KB/clock
Railgun_Quadruplet_7 8x2 i.e. average performance: 1354KB/clock
ReallyTraversed: 310,477,846,528 bytes

Doing Search for 8x2 Patterns into String(206908949bytes) as-one-line ...
Found ('an') 1987797 time(s), BM_HORSPOOL performance: 293KB/clock
Found ('to') 1076629 time(s), BM_HORSPOOL performance: 293KB/clock
Found ('TDK') 0 time(s), BM_HORSPOOL performance: 516KB/clock
Found ('the') 2114180 time(s), BM_HORSPOOL performance: 379KB/clock
Found ('fast') 5945 time(s), BM_HORSPOOL performance: 585KB/clock
Found ('easy') 5191 time(s), BM_HORSPOOL performance: 585KB/clock
Found ('grmbl') 0 time(s), BM_HORSPOOL performance: 759KB/clock
Found ('email') 1 time(s), BM_HORSPOOL performance: 756KB/clock
Found ('pasting') 2 time(s), BM_HORSPOOL performance: 918KB/clock
Found ('amazing') 323 time(s), BM_HORSPOOL performance: 1074KB/clock
Found ('underdog') 4 time(s), BM_HORSPOOL performance: 1069KB/clock
Found ('superdog') 0 time(s), BM_HORSPOOL performance: 1074KB/clock
Found ('participants') 147 time(s), BM_HORSPOOL performance: 1603KB/clock
Found ('skillessness') 0 time(s), BM_HORSPOOL performance: 1422KB/clock
Found ('I should have known') 1 time(s), BM_HORSPOOL performance: 1433KB/clock
Found ('human consciousness') 519 time(s), BM_HORSPOOL performance: 1820KB/clock
BM_HORSPOOL 8x2 i.e. average performance: 666KB/clock

Doing Search for 8x2 Patterns into String(206908949bytes) as-one-line ...
Found ('an') 1987797 time(s), Railgun_6pp performance: 678KB/clock
Found ('to') 1076629 time(s), Railgun_6pp performance: 643KB/clock
Found ('TDK') 0 time(s), Railgun_6pp performance: 1174KB/clock
Found ('the') 2114180 time(s), Railgun_6pp performance: 678KB/clock
Found ('fast') 5945 time(s), Railgun_6pp performance: 918KB/clock
Found ('easy') 5191 time(s), Railgun_6pp performance: 990KB/clock
Found ('grmbl') 0 time(s), Railgun_6pp performance: 1287KB/clock
Found ('email') 1 time(s), Railgun_6pp performance: 1167KB/clock
Found ('pasting') 2 time(s), Railgun_6pp performance: 1603KB/clock
Found ('amazing') 323 time(s), Railgun_6pp performance: 1603KB/clock
Found ('underdog') 4 time(s), Railgun_6pp performance: 1603KB/clock
Found ('superdog') 0 time(s), Railgun_6pp performance: 1820KB/clock
Found ('participants') 147 time(s), Railgun_6pp performance: 2149KB/clock
Found ('skillessness') 0 time(s), Railgun_6pp performance: 2126KB/clock
Found ('I should have known') 1 time(s), Railgun_6pp performance: 2557KB/clock
Found ('human consciousness') 519 time(s), Railgun_6pp performance: 2126KB/clock
Railgun_6pp 8x2 i.e. average performance: 1218KB/clock

Doing Search for 8x2 Patterns into String(206908949bytes) as-one-line ...
Found ('an') 1987797 time(s), Railgun_Quadruplet_7 performance: 643KB/clock
Found ('to') 1076629 time(s), Railgun_Quadruplet_7 performance: 678KB/clock
Found ('TDK') 0 time(s), Railgun_Quadruplet_7 performance: 1074KB/clock
Found ('the') 2114180 time(s), Railgun_Quadruplet_7 performance: 643KB/clock
Found ('fast') 5945 time(s), Railgun_Quadruplet_7 performance: 1074KB/clock
Found ('easy') 5191 time(s), Railgun_Quadruplet_7 performance: 1074KB/clock
Found ('grmbl') 0 time(s), Railgun_Quadruplet_7 performance: 1287KB/clock
Found ('email') 1 time(s), Railgun_Quadruplet_7 performance: 1167KB/clock
Found ('pasting') 2 time(s), Railgun_Quadruplet_7 performance: 1603KB/clock
Found ('amazing') 323 time(s), Railgun_Quadruplet_7 performance: 1603KB/clock
Found ('underdog') 4 time(s), Railgun_Quadruplet_7 performance: 1820KB/clock
Found ('superdog') 0 time(s), Railgun_Quadruplet_7 performance: 1603KB/clock
Found ('participants') 147 time(s), Railgun_Quadruplet_7 performance: 2557KB/clock
Found ('skillessness') 0 time(s), Railgun_Quadruplet_7 performance: 2126KB/clock
Found ('I should have known') 1 time(s), Railgun_Quadruplet_7 performance: 2557KB/clock
Found ('human consciousness') 519 time(s), 
        Railgun_Quadruplet_7 performance: 2557KB/clock
Railgun_Quadruplet_7 8x2 i.e. average performance: 1232KB/clock

^C
D:\_KAZE_new-stuff>

In my opinion, it is one of the seven commandments of speed: do not put the cart before the horse, i.e., for adjacent (32?) memory accesses first extract the byte with lowest address. strstr_SHORT-SHOWDOWN_Microsoft_v16_Ox.exe, revision 7, some quick inputs:

D:\_KAZE_new-stuff>strstr_SHORT-SHOWDOWN_Microsoft_v16_Ox.exe
strstr_SHORT-SHOWDOWN, revision 7, written by Kaze.
Input Pattern(up to 19+2000 chars): fast

Doing Search for Pattern(4bytes) into String(206908949bytes) as-one-line ...
WARNING! The next line works with BMH only for pattern 4[+] long, 
otherwise (for 2 and 3) other searcher takes over!
Railgun_6pp_hits/Railgun_6pp_clocks: 5945/215
Railgun_6pp performance: 939KB/clock

Doing Search for Pattern(4bytes) into String(206908949bytes) as-one-line ...
WARNING! The next line works with BMH only for pattern 4[+] long, 
otherwise (for 2 and 3) other searcher takes over!
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 5945/193
Railgun_Quadruplet_7 performance: 1046KB/clock

^C

...

Input Pattern(up to 19+2000 chars): from
Railgun_6pp_hits/Railgun_6pp_clocks: 83155/196
Railgun_6pp performance: 1030KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 83155/194
Railgun_Quadruplet_7 performance: 1041KB/clock

Input Pattern(up to 19+2000 chars): rabea
Railgun_6pp_hits/Railgun_6pp_clocks: 0/176
Railgun_6pp performance: 1148KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/162
Railgun_Quadruplet_7 performance: 1247KB/clock

Input Pattern(up to 19+2000 chars): makes
Railgun_6pp_hits/Railgun_6pp_clocks: 9281/172
Railgun_6pp performance: 1174KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 9281/162
Railgun_Quadruplet_7 performance: 1247KB/clock

Input Pattern(up to 19+2000 chars): monkey
Railgun_6pp_hits/Railgun_6pp_clocks: 1691/141
Railgun_6pp performance: 1433KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 1691/139
Railgun_Quadruplet_7 performance: 1453KB/clock

Input Pattern(up to 19+2000 chars): punny
Railgun_6pp_hits/Railgun_6pp_clocks: 0/156
Railgun_6pp performance: 1295KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/155
Railgun_Quadruplet_7 performance: 1303KB/clock

Input Pattern(up to 19+2000 chars): funny
Railgun_6pp_hits/Railgun_6pp_clocks: 179/157
Railgun_6pp performance: 1287KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 179/156
Railgun_Quadruplet_7 performance: 1295KB/clock

Input Pattern(up to 19+2000 chars): ramjet
Railgun_6pp_hits/Railgun_6pp_clocks: 0/152
Railgun_6pp performance: 1329KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/137
Railgun_Quadruplet_7 performance: 1474KB/clock

Input Pattern(up to 19+2000 chars): fallen
Railgun_6pp_hits/Railgun_6pp_clocks: 2578/151
Railgun_6pp performance: 1338KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 2578/138
Railgun_Quadruplet_7 performance: 1464KB/clock

Input Pattern(up to 19+2000 chars): elephant
Railgun_6pp_hits/Railgun_6pp_clocks: 1198/124
Railgun_6pp performance: 1629KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 1198/112
Railgun_Quadruplet_7 performance: 1804KB/clock

Input Pattern(up to 19+2000 chars): layalina
Railgun_6pp_hits/Railgun_6pp_clocks: 0/121
Railgun_6pp performance: 1669KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/110
Railgun_Quadruplet_7 performance: 1836KB/clock

Input Pattern(up to 19+2000 chars): I should have known
Railgun_6pp_hits/Railgun_6pp_clocks: 1/89
Railgun_6pp performance: 2270KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 1/84
Railgun_Quadruplet_7 performance: 2405KB/clock

Input Pattern(up to 19+2000 chars): human consciousness
Railgun_6pp_hits/Railgun_6pp_clocks: 519/80
Railgun_6pp performance: 2525KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 519/75
Railgun_Quadruplet_7 performance: 2694KB/clock

Input Pattern(up to 19+2000 chars): But he's looking right through me
Railgun_6pp_hits/Railgun_6pp_clocks: 0/84
Railgun_6pp performance: 2405KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/76
Railgun_Quadruplet_7 performance: 2658KB/clock

Input Pattern(up to 19+2000 chars): I'm living in an age that calls darkness light
Railgun_6pp_hits/Railgun_6pp_clocks: 0/72
Railgun_6pp performance: 2806KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/67
Railgun_Quadruplet_7 performance: 3015KB/clock

Input Pattern(up to 19+2000 chars): Following the wanderings of a dream - 
a dream that keeps my soul alive
Railgun_6pp_hits/Railgun_6pp_clocks: 0/69
Railgun_6pp performance: 2928KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/63
Railgun_Quadruplet_7 performance: 3207KB/clock

Input Pattern(up to 19+2000 chars): I notice what matters and 
I got nothing to lose but darkness and shadows
Railgun_6pp_hits/Railgun_6pp_clocks: 0/66
Railgun_6pp performance: 3061KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/63
Railgun_Quadruplet_7 performance: 3207KB/clock

Input Pattern(up to 19+2000 chars): Beth Ditto
Railgun_6pp_hits/Railgun_6pp_clocks: 0/118
Railgun_6pp performance: 1712KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/107
Railgun_Quadruplet_7 performance: 1888KB/clock

Input Pattern(up to 19+2000 chars): SR71
Railgun_6pp_hits/Railgun_6pp_clocks: 0/175
Railgun_6pp performance: 1154KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/182
Railgun_Quadruplet_7 performance: 1110KB/clock

Input Pattern(up to 19+2000 chars): Apache
Railgun_6pp_hits/Railgun_6pp_clocks: 1/162
Railgun_6pp performance: 1247KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 1/133
Railgun_Quadruplet_7 performance: 1519KB/clock

Input Pattern(up to 19+2000 chars): Fibonacci
Railgun_6pp_hits/Railgun_6pp_clocks: 0/106
Railgun_6pp performance: 1906KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/98
Railgun_Quadruplet_7 performance: 2061KB/clock

Input Pattern(up to 19+2000 chars): Tesla
Railgun_6pp_hits/Railgun_6pp_clocks: 0/175
Railgun_6pp performance: 1154KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/160
Railgun_Quadruplet_7 performance: 1262KB/clock

Input Pattern(up to 19+2000 chars): Albert
Railgun_6pp_hits/Railgun_6pp_clocks: 932/154
Railgun_6pp performance: 1312KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 932/137
Railgun_Quadruplet_7 performance: 1474KB/clock

Input Pattern(up to 19+2000 chars): Toshiba
Railgun_6pp_hits/Railgun_6pp_clocks: 0/130
Railgun_6pp performance: 1554KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/121
Railgun_Quadruplet_7 performance: 1669KB/clock

Input Pattern(up to 19+2000 chars): Orion
Railgun_6pp_hits/Railgun_6pp_clocks: 1/175
Railgun_6pp performance: 1154KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 1/160
Railgun_Quadruplet_7 performance: 1262KB/clock

Input Pattern(up to 19+2000 chars): pharaoh
Railgun_6pp_hits/Railgun_6pp_clocks: 26/129
Railgun_6pp performance: 1566KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 26/122
Railgun_Quadruplet_7 performance: 1656KB/clock

Input Pattern(up to 19+2000 chars): lemon
Railgun_6pp_hits/Railgun_6pp_clocks: 33/178
Railgun_6pp performance: 1135KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 33/161
Railgun_Quadruplet_7 performance: 1255KB/clock

Input Pattern(up to 19+2000 chars): grammar
Railgun_6pp_hits/Railgun_6pp_clocks: 218/123
Railgun_6pp performance: 1642KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 218/117
Railgun_Quadruplet_7 performance: 1727KB/clock

D:\_KAZE_new-stuff>

Grmbl, pattern 'SR71' compared to 'fast' gives me a hint how things are far from 'fastest'.

Points of Interest

The interesting part about tests I have done: 'memcmp' and 'memcpy' when inlined (properly) give much better results. The thing which pop-ups again-and-again: Never take anything for granted. A bug (which doesn't affect the previous results, correct) was crushed, now r.6+++ is ready to go/hit.

As you can see from the C (commentless and tabulated) and ASM sources (see the top of the page for downloading):

The Microsoft 16.00.30319.01 compiled Railgun_Quadruplet_6ppp is 00c6b-00a00+5 = 624 bytes long.
The Quadruplet (where cbTarget<961) fragment loop is 00bc3-00b17+5 = 177 bytes long.
The Boyer-Moore-Horspool (where cbTarget>=961) fragment loop is 00c61-00c20+2 = 67 bytes long.

An interesting question for more experienced coders: In case of reducing the BMH loop down by 3 bytes (i.e. to 64 bytes), can we obtain some additional/significant boost?

History

  • Revision 6+++ written 2011-Sep-07
  • Revision 7 written 2011-Sep-25
  • Revision 7 'Sun' written 2012-Jan-22
  • Revision 7 'Elsiane' written 2012-Jan-24
  • Revision 7 'Gulliver' written 2012-Jan-26
  • Revision 7 'Hasherezade' written 2012-Feb-01
  • Revision 7 'Decumanus' written 2014-Jan-12
You'll need Skype CreditFree via Skype
You'll need Skype CreditFree via Skype

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)