Click here to Skip to main content
Click here to Skip to main content

Fastest strstr-like Function in C!?

By , 1 Feb 2012
 

Introduction

The C strstr-like function presented below (written by Georgi 'Kaze') is named Railgun_Quadruplet_6ppp and is based on well-known Boyer-Moore-Horspool and simplified Karp-Rabin algorithms. The main goal: to achieve fastest execution for all kinds of pattern/needle and string/haystack lengths (starting from 2 bytes).

Railgun_Quadruplet_6ppp is now obsolete due to the arrival of Railgun_Quadruplet_7 which boosts the BMH fragment by more than a percent. See further below for its commentless and tabulated source.

Railgun_Quadruplet_7 exploits the CPUs fast unaligned DWORD extract capability.
The heaviest and most precise test on 1+TB text (52 patterns vs 197MB English books text repeated 256 times results in 1000+ seconds, see strstr_SHORT-SHOWDOWN_r7_Heavy_Test_LOG.txt file for more information) gives:

[strstr_SHORT-SHOWDOWN_Microsoft_v16_Ox.exe:]
The variants returning the offset of first hit only:

  • Railgun_Quadruplet_6pp 52, i.e., average performance: 1584KB/clock
  • Railgun_Quadruplet_7 52, i.e., average performance: 1659KB/clock

The variants counting all hits:

  • Railgun_6pp 8x2, i.e., average performance: 1246KB/clock
  • Railgun_Quadruplet_7 8x2, i.e., average performance: 1266KB/clock

[strstr_SHORT-SHOWDOWN_Intel_v12_O3.exe:]
The variants returning the offset of first hit only:

  • Railgun_Quadruplet_6pp 52, i.e., average performance: 1773KB/clock
  • Railgun_Quadruplet_7 52, i.e., average performance: 1685KB/clock

The variants counting all hits:

  • Railgun_6pp 8x2, i.e., average performance: 1054KB/clock
  • Railgun_Quadruplet_7 8x2, i.e., average performance: 1076KB/clock

Note: For Railgun_Quadruplet_7 on CPUs with fast DWORD (compared to BYTE) memory fetching I expect significant boost, AFAIK the Intel i7 is one of those, unfortunately I lack the opportunity to play with it.

Or the two bottom-lines:
The FASTEST variant counting all hits:

  • Railgun_Quadruplet_7 8x2 with average performance: 1266KB/clock (best performance achieved with Microsoft_v16_Ox)!

The FASTEST variant returning the offset of first hit only:

  • Railgun_Quadruplet_6pp 52 with average performance: 1773KB/clock (1,943,883,635,456 bytes / 1,070,220 clocks = 1732 MB/s best performance achieved with Intel_v12_O3)!

An add-on from 2012-Feb-01:

Just wanted to improve Skip-Performance by checking the window's 8 rightmost bytes (Order 8 Horspool) against all needle's 8 bytes suffixes, which resulted in Railgun_Quadruplet_7Hasherezade.

Here an example is given:
For needle 'and every day a continuous cleaning goes on' 43bytes long we have 43-8+1 Building-Blocks i.e. suffixes 8bytes long as follows:
and ever 117,013
nd every 108,604
d every  155,516
 every d 170,959
every da 115,291
very day 073,191
ery day  097,042
ry day a 083,793
y day a  011,244
 day a c 115,855
day a co 101,797
ay a con 222,568
y a cont 029,130
 a conti 020,978
a contin 258,405
 continu 252,691
continuo 123,607
ontinuou 056,546
ntinuous 135,857
tinuous  015,332
inuous c 250,584
nuous cl 048,224
uous cle 106,616
ous clea 137,020
us clean 035,751
s cleani 178,989
 cleanin 213,855
cleaning 063,337
leaning  097,138
eaning g 062,366
aning go 247,590
ning goe 036,571
ing goes 041,142
ng goes  228,365
g goes o 229,696
 goes on 176,852
Each number indicates which HashSlot is occupied (here 0..262143 slots i.e. 18bit used).

Okay here is a simplified example:
string: tumba-lumba-tumba-lumba-tumba-lumba-tumba-lumba-tumba-lumba-tumba-lumba-tumba-lumba
needle: and every day a continuous cleaning goes on
                      and every day a continuous cleaning goes on                        +14 Horspool Order 1
                                            and every day a continuous cleaning goes on  +36 Horspool Order 8
The goal: to jump when the rightmost 8bytes '-tumba-l' of window do not look like any of Needle's prefixes i.e. are not to be found. This maximum jump equals cbPattern-(Order-1) or 43-(8-1)=36.

Here the Horspool skip-table follows:
a  n  d     e  v  e  r  y     d  a  y     a     c  o  n  t  i  n  u  o  u  s     c  l  e  a  n  i  n  g     g  o  e  s     o  n
12 09 32 02 04 37 04 35 30 02 32 12 30 02 12 02 15 01 09 23 10 09 18 01 18 03 02 15 14 04 12 09 10 09 06 02 06 01 04 03 02 01 09
We have 'l' vs 'n' where 'l' points to 14, however 'Hasherezade' is able to jump/skip 43-(8-1)=36positions in case of HashCode of '-tumba-l' doesn't equal any of HashCodes of Needle's suffixes.
HashCode of '-tumba-l' is 083,467 it matches none of Needle's HashCodes so we skip 36 bytes.
This boost is the only difference between 'Hasherezade' and 'Gulliver' revision 2.
// Scheherezade -> Hasherezade
// [Italian: buffa, feminine of buffo, comic.]
/*
First off: I am heavily disappointed from Speed-Performance of 'Hasherezade' and comparing Skip-Performance of 'Gulliver' and 'Hasherezade' doesn't make me happy, either.

Pattern: fastest fox
Doing Search for Pattern(11bytes) into String(206908949bytes) as-one-line ...

Railgun_Quadruplet_7sun performance:         1,683KB/clock / 00,775%, 26,672,940 skips/iterations
Railgun_Quadruplet_7 performance:            1,642KB/clock / 00,669%, 30,925,578 skips/iterations
Railgun_Quadruplet_7sunhorse performance:    0,944KB/clock / 00,912%, 22,663,583 skips/iterations
Railgun_Quadruplet_7deuce performance:       0,801KB/clock / 00,797%, 25,945,709 skips/iterations
Railgun_Quadruplet_7Elsiane performance:     0,704KB/clock / 00,931%, 22,213,101 skips/iterations
Railgun_Quadruplet_7Gulliver performance:    2,104KB/clock / 00,977%, 21,166,516 skips/iterations
Railgun_Quadruplet_7Hasherezade performance: 1,496KB/clock / 00,980%, 21,112,657 skips/iterations  18bit HashTable
Railgun_Quadruplet_7Hasherezade performance: 1,642KB/clock / 00,980%, 21,112,646 skips/iterations  14bit HashTable
Railgun_Quadruplet_7Hasherezade performance: 1,642KB/clock / 00,980%, 21,112,735 skips/iterations  10bit HashTable
Boyer_Moore_Flensburg performance:           1,057KB/clock / 00,669%, 30,925,578 skips/iterations

Pattern: and every day a continuous cleaning goes on
Doing Search for Pattern(43bytes) into String(206908949bytes) as-one-line ...

Railgun_Quadruplet_7sun performance:         2,557KB/clock / 01,495%, 13,832,201 skips/iterations
Railgun_Quadruplet_7 performance:            2,590KB/clock / 01,404%, 14,731,326 skips/iterations
Railgun_Quadruplet_7sunhorse performance:    1,888KB/clock / 02,222%, 09,308,136 skips/iterations
Railgun_Quadruplet_7deuce performance:       1,741KB/clock / 01,971%, 10,494,460 skips/iterations
Railgun_Quadruplet_7Elsiane performance:     1,642KB/clock / 02,229%, 09,279,871 skips/iterations
Railgun_Quadruplet_7Gulliver performance:    3,157KB/clock / 03,795%, 05,450,890 skips/iterations
Railgun_Quadruplet_7Hasherezade performance: 2,322KB/clock / 04,100%, 05,046,049 skips/iterations  18bit HashTable
Railgun_Quadruplet_7Hasherezade performance: 2,971KB/clock / 04,100%, 05,046,445 skips/iterations  14bit HashTable
Railgun_Quadruplet_7Hasherezade performance: 3,015KB/clock / 04,090%, 05,058,667 skips/iterations  10bit HashTable
Boyer_Moore_Flensburg performance:           1,683KB/clock / 01,447%, 14,294,963 skips/iterations

Pattern: And let this be your very fundamental insight... about everything. Just for one year, don't choose.
Doing Search for Pattern(99bytes) into String(206908949bytes) as-one-line ...

Railgun_Quadruplet_7sun performance:         2,845KB/clock / 02,248%, 09,200,917 skips/iterations
Railgun_Quadruplet_7 performance:            2,928KB/clock / 02,105%, 09,827,389 skips/iterations
Railgun_Quadruplet_7sunhorse performance:    2,270KB/clock / 03,283%, 06,302,096 skips/iterations
Railgun_Quadruplet_7deuce performance:       2,149KB/clock / 03,196%, 06,472,407 skips/iterations
Railgun_Quadruplet_7Elsiane performance:     2,172KB/clock / 03,282%, 06,303,154 skips/iterations
Railgun_Quadruplet_7Gulliver performance:    2,928KB/clock / 07,820%, 02,645,814 skips/iterations
Railgun_Quadruplet_7Hasherezade performance: 2,494KB/clock / 09,575%, 02,160,890 skips/iterations  18bit HashTable
Railgun_Quadruplet_7Hasherezade performance: 3,015KB/clock / 09,568%, 02,162,493 skips/iterations  14bit HashTable
Railgun_Quadruplet_7Hasherezade performance: 3,015KB/clock / 09,438%, 02,192,204 skips/iterations  10bit HashTable
Boyer_Moore_Flensburg performance:           2,349KB/clock / 02,135%, 09,689,008 skips/iterations

Pattern: Then, singing among the savage branches, it impales itself upon the longest, sharpest spine. And, dying, it rises above its own agony to outcarol the lark and the nightingale.
Doing Search for Pattern(175bytes) into String(206908949bytes) as-one-line ...

Railgun_Quadruplet_7sun performance:         2,658KB/clock / 03,142%, 06,583,682 skips/iterations
Railgun_Quadruplet_7 performance:            2,658KB/clock / 03,095%, 06,685,179 skips/iterations
Railgun_Quadruplet_7sunhorse performance:    2,377KB/clock / 04,776%, 04,331,865 skips/iterations
Railgun_Quadruplet_7deuce performance:       2,349KB/clock / 04,728%, 04,376,097 skips/iterations
Railgun_Quadruplet_7Elsiane performance:     2,525KB/clock / 04,778%, 04,330,200 skips/iterations
Railgun_Quadruplet_7Gulliver performance:    2,245KB/clock / 12,024%, 01,720,711 skips/iterations
Railgun_Quadruplet_7Hasherezade performance: 2,434KB/clock / 17,127%, 01,208,072 skips/iterations  18bit HashTable
Railgun_Quadruplet_7Hasherezade performance: 2,590KB/clock / 17,084%, 01,211,064 skips/iterations  14bit HashTable
Railgun_Quadruplet_7Hasherezade performance: 2,590KB/clock / 16,379%, 01,263,238 skips/iterations  10bit HashTable
Boyer_Moore_Flensburg performance:           2,525KB/clock / 03,198%, 06,469,280 skips/iterations  Five times more main-cycles and faster than 'Hasherezade', pshaw!
*/
#define ROL(x, n) (((x) << (n)) | ((x) >> (32-(n))))
#define HashTableSize 18
// Revision: 1, 2012-Feb-01, the main disadvantage: the preprocessing overhead PLUS a hasher.
// Caution: For better speed the case 'if (cbPattern==1)' was removed, so Pattern must be longer than 1 char.
char * Railgun_Quadruplet_7Hasherezade_count_hits (char * pbTarget, char * pbPattern, unsigned long cbTarget, unsigned long cbPattern)
{
    char * pbTargetMax = pbTarget + cbTarget;
    register unsigned long ulHashPattern;
    register unsigned long ulHashTarget;
    signed long count;
    signed long countSTATIC;

    unsigned char SINGLET;
    unsigned long Quadruplet2nd;
    unsigned long Quadruplet3rd;
    unsigned long Quadruplet4th;

    unsigned long  AdvanceHopperGrass;

    long i; //BMH needed
    int a, j;
    unsigned int bm_bc[256]; //BMH needed
    unsigned int bm_bc2nd[256]; //BMS needed
    unsigned char bm_Horspool_Order2[256*256]; //BMHSS(Elsiane) needed, 'char' limits patterns to 255, if 'long' then table becomes 256KB, grrr.
    unsigned char bm_Hasherezade_HASH[1<<(HashTableSize)]; // Jesteressing 8bytes (Horspool Order 8) for fast lookup, should be bitwise (i.e. 8times smaller) since it says yes/no for presence.
    uint32_t hash32;
    unsigned long Gulliver; // or unsigned char or unsigned short

    if (cbPattern > cbTarget)
        return(NULL);

    if ( cbPattern<4) { 
        pbTarget = pbTarget+cbPattern;
        ulHashPattern = ( (*(char *)(pbPattern))<<8 ) + *(pbPattern+(cbPattern-1));

        if ( cbPattern==3) {
            for ( ;; ) {
                if ( ulHashPattern == ( (*(char *)(pbTarget-3))<<8 ) + *(pbTarget-1) ) {
                    if ( *(char *)(pbPattern+1) == *(char *)(pbTarget-2) ) Railgunhits++; //return((pbTarget-3));
                }
                if ( (char)(ulHashPattern>>8) != *(pbTarget-2) ) pbTarget++;
                pbTarget++;
                if (pbTarget > pbTargetMax)
                    return(NULL);
            }
        } else {
        }
        for ( ;; ) {
            if ( ulHashPattern == ( (*(char *)(pbTarget-2))<<8 ) + *(pbTarget-1) )
                Railgunhits++; //return((pbTarget-2));
            if ( (char)(ulHashPattern>>8) != *(pbTarget-1) ) pbTarget++;
            pbTarget++;
            if (pbTarget > pbTargetMax)
                return(NULL);
        }
    } else { //if ( cbPattern<4)
        if (cbTarget<961) { // This value is arbitrary(don't know how exactly), it ensures(at least must) better performance than 'Boyer_Moore_Horspool'.
/*
            // A better strstr, with no asm code
            // Written by Mischa Sandberg
            // http://mischasan.wordpress.com
            // static char const *
            // scanstrm(char const *tgt, char const *pat, int len)
            // {
            //     uint32_t head = MSBF32(pat), wind = 0, next;
            // 
            //     pat += 4, len -= 4;
            //     while ((next = *(uint8_t const*)tgt++)) {
            //         wind = ( wind << 8 ) + next;
            //         if (wind == head && !memcmp(tgt, pat, len))
            //             return tgt - 4;
            //     }
            //     return  NULL;
            //}
            ulHashPattern = 0;
            ulHashPattern = ( ulHashPattern << 8 ) + *(uint8_t const*)pbPattern++;
            ulHashPattern = ( ulHashPattern << 8 ) + *(uint8_t const*)pbPattern++;
            ulHashPattern = ( ulHashPattern << 8 ) + *(uint8_t const*)pbPattern++;
            ulHashPattern = ( ulHashPattern << 8 ) + *(uint8_t const*)pbPattern++;
            AdvanceHopperGrass = 0;
            cbPattern -= 4;
            while ((ulHashTarget = *(uint8_t const*)pbTarget++)) {
                AdvanceHopperGrass = ( AdvanceHopperGrass << 8 ) + ulHashTarget;
                if (AdvanceHopperGrass == ulHashPattern && !memcmp(pbTarget, pbPattern, cbPattern))
                Railgunhits++; //return pbTarget - 4;
            }
            return  NULL;
*/
            pbTarget = pbTarget+cbPattern;
            ulHashPattern = *(unsigned long *)(pbPattern);

            SINGLET = ulHashPattern & 0xFF;
            Quadruplet2nd = SINGLET<<8;
            Quadruplet3rd = SINGLET<<16;
            Quadruplet4th = SINGLET<<24;

            for ( ;; ) {
                AdvanceHopperGrass = 0;
                ulHashTarget = *(unsigned long *)(pbTarget-cbPattern);

                    if ( ulHashPattern == ulHashTarget ) { // Three unnecessary comparisons here, but 'AdvanceHopperGrass' must be calculated - it has a higher priority.
                    count = cbPattern-1;
                    while ( count && *(char *)(pbPattern+(cbPattern-count)) == *(char *)(pbTarget-count) ) {
                        if ( cbPattern-1==AdvanceHopperGrass+count && SINGLET != *(char *)(pbTarget-count) ) AdvanceHopperGrass++;
                        count--;
                    }
                    if ( count == 0) Railgunhits++; //return((pbTarget-cbPattern));
                    } else { // The goal here: to avoid memory accesses by stressing the registers.
                    if ( Quadruplet2nd != (ulHashTarget & 0x0000FF00) ) {
                        AdvanceHopperGrass++;
                        if ( Quadruplet3rd != (ulHashTarget & 0x00FF0000) ) {
                            AdvanceHopperGrass++;
                            if ( Quadruplet4th != (ulHashTarget & 0xFF000000) ) AdvanceHopperGrass++;
                        }
                    }
                }

                AdvanceHopperGrass++;

                pbTarget = pbTarget + AdvanceHopperGrass;
                if (pbTarget > pbTargetMax)
                    return(NULL);
            }
        } else { //if (cbTarget<961)
            countSTATIC = cbPattern-2-2;

            for (a=0; a < 256; a++) {bm_bc[a]=cbPattern; bm_bc2nd[a]=cbPattern+1;}
            for (j=0; j < cbPattern-1; j++) bm_bc[pbPattern[j]]=cbPattern-j-1; 
            for (j=0; j < cbPattern; j++) bm_bc2nd[pbPattern[j]]=cbPattern-j; 

            ulHashPattern = *(unsigned long *)(pbPattern); // First four bytes
            //ulHashTarget = *(unsigned short *)(pbPattern+cbPattern-1-1); // Last two bytes
        
            AdvanceHopperGrass = 0;
            i=0;

            // Elsiane r.2  [
            for (a=0; a < 256*256; a++) {bm_Horspool_Order2[a]= cbPattern-1;} // cbPattern-(Order-1) for Horspool; 'memset' if not optimized

            // alfalfa 7 long 6 BBs (al lf fa al lf fa) 3 distinct BBs (al lf fa) 
            // fast 4 0-1-2 fa as st
            for (j=0; j < cbPattern-1; j++) bm_Horspool_Order2[*(unsigned short *)(pbPattern+j)]=j; // Rightmost appearance/position is needed

            // Elsiane r.2  ]

    if ( cbPattern>10) { 
            // Hasherezade r.1 [
            // OSHO.TXT has 00,046,486 03bytes distinct BBs
            // OSHO.TXT has 00,248,019 04bytes distinct BBs
            // OSHO.TXT has 00,855,682 05bytes distinct BBs
            // OSHO.TXT has 02,236,138 06bytes distinct BBs
            // OSHO.TXT has 04,803,152 07bytes distinct BBs
            // OSHO.TXT has 08,956,496 08bytes distinct BBs to be hashed in 18bit i.e. 256KB i.e. 262,144 slots i.e. 34 vs 1.
            // OSHO.TXT has 15,006,172 09bytes distinct BBs
            // OSHO.TXT has 22,992,127 10bytes distinct BBs
            // Note: BB stands for Building-Block (also suffix)

            for (a=0; a < 1<<(HashTableSize); a++) {bm_Hasherezade_HASH[a]= 0;} // to-do: bit to replace byte; 'memset' if not optimized
            // cbPattern - Order + 1 i.e. number of BBs for 11 'fastest fox' 11-8+1=4: 'fastest ', 'astest f', 'stest fo', 'test fox'
            for (j=0; j < cbPattern-8+1; j++) {
                hash32 = (2166136261 ^ (ROL(*(uint32_t *)(pbPattern+j),5)^*(uint32_t *)(pbPattern+j+4))) * 709607;        
                bm_Hasherezade_HASH[( hash32 ^ (hash32 >> 16) ) & ( (1<<(HashTableSize))-1 )]=1;
/*
for (a=0; a<8; a++)
printf("%c",*(char *)(pbPattern+j+a) );
printf(" %lu\n",( hash32 ^ (hash32 >> 16) ) & ( (1<<(HashTableSize))-1 ));
//Input Pattern(up to 19+2000 chars): and every day a continuous cleaning goes on
//Doing Search for Pattern(43bytes) into String(206908949bytes) as-one-line ...
BBs      Slot(HashCode for 18bit HashTable)
and ever 117013
nd every 108604
d every  155516
 every d 170959
every da 115291
very day 73191
ery day  97042
ry day a 83793
y day a  11244
 day a c 115855
day a co 101797
ay a con 222568
y a cont 29130
 a conti 20978
a contin 258405
 continu 252691
continuo 123607
ontinuou 56546
ntinuous 135857
tinuous  15332
inuous c 250584
nuous cl 48224
uous cle 106616
ous clea 137020
us clean 35751
s cleani 178989
 cleanin 213855
cleaning 63337
leaning  97138
eaning g 62366
aning go 247590
ning goe 36571
ing goes 41142
ng goes  228365
g goes o 229696
 goes on 176852
*/
            }
            // Hasherezade r.1 ]

            while (i <= cbTarget-cbPattern-1) { // -1 because Sunday is used
                Gulliver = bm_Horspool_Order2[*(unsigned short *)&pbTarget[i+cbPattern-1-1]];

                if ( Gulliver == cbPattern-2 ) { // CASE #1: means the pair (char order 2) is found
                    if ( *(unsigned long *)&pbTarget[i] == ulHashPattern) {
                        count = countSTATIC; // Last two chars already matched, to be fixed with -2
                        while ( count !=0 && *(char *)(pbPattern+(countSTATIC-count)+4) == *(char *)(&pbTarget[i]+(countSTATIC-count)+4) )
                            count--;
                        if ( count == 0) Railgunhits++; //return(pbTarget+i);
                    }
                    //i = i + 1; // r.1, obviuosly this is the worst skip so turning to 'SunHorse': lines below
                    if ( bm_bc[(unsigned char)pbTarget[i+cbPattern-1]] < bm_bc2nd[(unsigned char)pbTarget[i+(cbPattern)]] )
                             Gulliver =  bm_bc2nd[(unsigned char)pbTarget[i+(cbPattern)]];
                    else
                             Gulliver =  bm_bc[(unsigned char)pbTarget[i+cbPattern-1]];
                } else if ( Gulliver != cbPattern-1 ) // CASE #2: if equal means the pair (char order 2) is not found i.e. Gulliver remains intact, skip the whole pattern and fall back (Order-1) chars i.e. one char for Order 2
                    Gulliver = cbPattern - Gulliver - 2; // CASE #3: the pair is found and not as suffix i.e. rightmost position
            
                // The goal: to jump when the rightmost 8bytes (Order 8 Horspool) of window do not look like any of Needle prefixes i.e. are not to be found. This maximum jump equals cbPattern-(Order-1) or 11-(8-1)=4 for 'fastest fox' - a small one but for Needle 31 bytes the jump equals 31-(8-1)=24
                if (Gulliver < cbPattern-(8-1)) { 
                    hash32 = (2166136261 ^ (ROL(*(uint32_t *)(pbTarget+i+cbPattern-8),5)^*(uint32_t *)(pbTarget+i+cbPattern-8+4))) * 709607;        
                    if ( bm_Hasherezade_HASH[( hash32 ^ (hash32 >> 16) ) & ( (1<<(HashTableSize))-1 )]==0 )
                        Gulliver = cbPattern-(8-1);
                }
                    i = i + Gulliver;
                AdvanceHopperGrass++;
/*
; 4155 :                 // The goal: to jump when the rightmost 8bytes (Order 8 Horspool) of window do not look like any of Needle prefixes i.e. are not to be found. This maximum jump equals cbPattern-(Order-1) or 11-(8-1)=4 for 'fastest fox' - a small one but for Needle 31 bytes the jump equals 31-(8-1)=24
; 4156 :                 if (Gulliver < cbPattern-(8-1)) { 

  01f16    8d 43 f9     lea     eax, DWORD PTR [ebx-7]
  01f19    3b c8         cmp     ecx, eax
  01f1b    73 30         jae     SHORT $LN18@Railgun_Qu@8

; 4157 :                     hash32 = (2166136261 ^ (ROL(*(uint32_t *)(pbTarget+i+cbPattern-8),5)^*(uint32_t *)(pbTarget+i+cbPattern-8+4))) * 709607;        

  01f1d    8b 44 32 f8     mov     eax, DWORD PTR [edx+esi-8]
  01f21    c1 c0 05     rol     eax, 5
  01f24    33 44 32 fc     xor     eax, DWORD PTR [edx+esi-4]
  01f28    35 c5 9d 1c 81     xor     eax, -2128831035    ; 811c9dc5H
  01f2d    69 c0 e7 d3 0a
    00         imul     eax, 709607        ; 000ad3e7H

; 4158 :                     if ( bm_Hasherezade_HASH[( hash32 ^ (hash32 >> 16) ) & ( (1<<(HashTableSize))-1 )]==0 )

  01f33    8b f8         mov     edi, eax
  01f35    c1 ef 10     shr     edi, 16            ; 00000010H
  01f38    33 f8         xor     edi, eax
  01f3a    81 e7 ff ff 03
    00         and     edi, 262143        ; 0003ffffH
  01f40    80 bc 3c 28 08
    01 00 00     cmp     BYTE PTR _bm_Hasherezade_HASH$[esp+edi+329776], 0
  01f48    75 03         jne     SHORT $LN18@Railgun_Qu@8

; 4159 :                         Gulliver = cbPattern-(8-1);

  01f4a    8d 4b f9     lea     ecx, DWORD PTR [ebx-7]
$LN18@Railgun_Qu@8:

; 4160 :                 }
; 4161 :                     i = i + Gulliver;
; 4162 :                 AdvanceHopperGrass++;
*/
            }

    } else {
            while (i <= cbTarget-cbPattern-1) { // -1 because Sunday is used
                Gulliver = bm_Horspool_Order2[*(unsigned short *)&pbTarget[i+cbPattern-1-1]];

                if ( Gulliver == cbPattern-2 ) { // CASE #1: means the pair (char order 2) is found
                    if ( *(unsigned long *)&pbTarget[i] == ulHashPattern) {
                        count = countSTATIC; // Last two chars already matched, to be fixed with -2
                        while ( count !=0 && *(char *)(pbPattern+(countSTATIC-count)+4) == *(char *)(&pbTarget[i]+(countSTATIC-count)+4) )
                            count--;
                        if ( count == 0) Railgunhits++; //return(pbTarget+i);
                    }
                    //i = i + 1; // r.1, obviuosly this is the worst skip so turning to 'SunHorse': lines below
                    if ( bm_bc[(unsigned char)pbTarget[i+cbPattern-1]] < bm_bc2nd[(unsigned char)pbTarget[i+(cbPattern)]] )
                             Gulliver =  bm_bc2nd[(unsigned char)pbTarget[i+(cbPattern)]];
                    else
                             Gulliver =  bm_bc[(unsigned char)pbTarget[i+cbPattern-1]];
                } else if ( Gulliver != cbPattern-1 ) // CASE #2: if equal means the pair (char order 2) is not found i.e. Gulliver remains intact, skip the whole pattern and fall back (Order-1) chars i.e. one char for Order 2
                    Gulliver = cbPattern - Gulliver - 2; // CASE #3: the pair is found and not as suffix i.e. rightmost position

                i = i + Gulliver;

// 32323218 Order 1 Horspool Skip-table A
// 01234568 Order 1 Horspool Skip-table B
// fa af fa af fa as st Order 2 Horspool Skip-table B
//  0  1  2  3  4  5  6
// HIKARIfast
// fafafast
//   fafafast +2 Order 1 'a' vs 't'
//   fafafast +2 = (cbPattern-SkipB-Order = 8-5-1 = 2) Order 1 'a' vs 't'
//   fafafast +2 = (cbPattern-SkipB-Order = 8-4-2 = 2) Order 2 'fa' vs 'st' i.e. CASE #3

// 76543218 Order 1 Horspool
// lo on ng gp pa ac ce Order 2 Horspool
//  0  1  2  3  4  5  6
// HIKARIfast
// longpace
//   longpace +2 Order 1 'a' vs 'e'
//        longpace +7 = (cbPattern-(Order-1) = 8-(2-1) = 7) Order 2 'fa' vs 'ce' i.e. CASE #2

                AdvanceHopperGrass++;
            }
    } //if ( cbPattern>10) { 

            if (i == cbTarget-cbPattern) {
                if ( *(unsigned long *)&pbTarget[i] == ulHashPattern) {
                    count = countSTATIC;
                    while ( count !=0 && *(char *)(pbPattern+(countSTATIC-count)+4) == *(char *)(&pbTarget[i]+(countSTATIC-count)+4) )
                        count--;
                    if ( count == 0) Railgunhits++; //return(pbTarget+i);
                }
                AdvanceHopperGrass++;
            }

            GlobalSP += (int)((double)cbTarget/AdvanceHopperGrass*100);
            GlobalI += AdvanceHopperGrass;
            printf("Skip-Performance(bigger-the-better): %d%%, %d skips/iterations\n",(int)((double)cbTarget/AdvanceHopperGrass*100), AdvanceHopperGrass);
        
            return(NULL);
        } //if (cbTarget<961)
    } //if ( cbPattern<4)
}
// ### Mix(2in1) of Karp-Rabin & Boyer-Moore-Sunday-Horspool algorithm ]
The Speed-Performance disappoints, also I don't see how to speed up this hash-boosted variant.
Any ideas how to hasten?

An add-on from 2012-Jan-26:

Enter Railgun_Quadruplet_7Gulliver ... the most promising variant.

I am happy to share my latest etude - a result of a week-long sub-quest of mine saturated with music and program-mess-ing.
The first search-function that breaks (effortlessly) the 3GB/s barrier on a computer with 3GB/s memory write.
// ### Mix(2in1) of Karp-Rabin & Boyer-Moore-Sunday-Horspool algorithm [
/*
Raising Horspool/Sunday order from 1 to 2 came quite naturally after some carefree wander of mine.
I consider adding a new 'else if' block for haystacks longer than 960 in order to avoid the big skip table overhead, something like 512*1024, thus fastest r7sun will operate on 961..512*1024 long haystacks whereas 'Gulliver' on bigger than 512*1024.
I am not happy with Skip-Performance of revision 1: this should be amended...
'Gulliver' makes giant skips/jumps on patterns some 8+ chars long but for short ones falls behind. Here r.1 is given being a Horspool order 2, of course Sunday order 2 is coming.
I restrain myself of dreaming about power behind order 3, for bigger orders I expect nothing less than Dream-Performance.

Of course the enhancements await advent, they are very easy to be seen, but they will appear in next revisions of 'Gulliver'.
Tuning continues but the skeleton is built, I see 'Gulliver' as a really High-Performance etude.

And not to be empty-handed here the Gulliver's swiftness is benchmarked on String(206,908,949bytes) as-one-line:

Pattern: fast
Railgun_Quadruplet_7sun:       1,074KB/clock / 0,456%, 45,330,622 iterations
Railgun_Quadruplet_7:          1,000KB/clock / 0,377%, 54,788,054 iterations
Railgun_Quadruplet_7sunhorse:  0,651KB/clock / 0,480%, 43,103,056 iterations
Railgun_Quadruplet_7deuce:     0,575KB/clock / 0,389%, 53,138,919 iterations
Railgun_Quadruplet_7Elsiane:   0,511KB/clock / 0,551%, 37,541,955 iterations
Railgun_Quadruplet_7Gulliver:  0,649KB/clock / 0,298%, 69,403,843 iterations
Boyer_Moore_Flensburg:         0,491KB/clock / 0,377%, 54,788,139 iterations

Pattern: faster
Railgun_Quadruplet_7sun:       1,347KB/clock / 0,591%, 34,996,936 iterations
Railgun_Quadruplet_7:          1,329KB/clock / 0,514%, 40,194,194 iterations
Railgun_Quadruplet_7sunhorse:  0,762KB/clock / 0,656%, 31,504,148 iterations
Railgun_Quadruplet_7deuce:     0,662KB/clock / 0,567%, 36,434,006 iterations
Railgun_Quadruplet_7Elsiane:   0,538KB/clock / 0,710%, 29,101,626 iterations
Railgun_Quadruplet_7Gulliver:  1,010KB/clock / 0,491%, 42,066,438 iterations
Boyer_Moore_Flensburg:         0,687KB/clock / 0,514%, 40,194,282 iterations

Pattern: fastest
Railgun_Quadruplet_7sun:       1,554KB/clock / 0,687%, 30,084,306 iterations
Railgun_Quadruplet_7:          1,507KB/clock / 0,599%, 34,540,430 iterations
Railgun_Quadruplet_7sunhorse:  0,918KB/clock / 0,761%, 27,188,853 iterations
Railgun_Quadruplet_7deuce:     0,786KB/clock / 0,663%, 31,175,827 iterations
Railgun_Quadruplet_7Elsiane:   0,635KB/clock / 0,818%, 25,281,493 iterations
Railgun_Quadruplet_7Gulliver:  1,209KB/clock / 0,592%, 34,946,434 iterations
Boyer_Moore_Flensburg:         0,798KB/clock / 0,621%, 33,278,240 iterations

Pattern: fastest fox
Railgun_Quadruplet_7sun:       1,727KB/clock / 0,775%, 26,672,940 iterations
Railgun_Quadruplet_7:          1,669KB/clock / 0,669%, 30,925,578 iterations
Railgun_Quadruplet_7sunhorse:  0,957KB/clock / 0,912%, 22,663,583 iterations
Railgun_Quadruplet_7deuce:     0,821KB/clock / 0,797%, 25,945,709 iterations
Railgun_Quadruplet_7Elsiane:   0,719KB/clock / 0,931%, 22,213,101 iterations  Gulliver outskips Elsiane!
Railgun_Quadruplet_7Gulliver:  1,804KB/clock / 0,977%, 21,167,180 iterations  monstrous!
Boyer_Moore_Flensburg:         1,080KB/clock / 0,669%, 30,925,649 iterations

Pattern: fastest fox with biggest strides
Railgun_Quadruplet_7sun:       2,658KB/clock / 1,584%, 13,060,463 iterations
Railgun_Quadruplet_7:          2,767KB/clock / 1,511%, 13,689,243 iterations
Railgun_Quadruplet_7sunhorse:  1,788KB/clock / 2,138%, 09,677,267 iterations
Railgun_Quadruplet_7deuce:     1,656KB/clock / 2,053%, 10,075,650 iterations
Railgun_Quadruplet_7Elsiane:   1,542KB/clock / 2,143%, 09,652,548 iterations
Railgun_Quadruplet_7Gulliver:  3,108KB/clock / 2,917%, 07,091,357 iterations  over the top!
Boyer_Moore_Flensburg:         1,820KB/clock / 1,554%, 13,307,181 iterations

Pattern: fastest fox with biggest strides known to me
Railgun_Quadruplet_7sun:       2,590KB/clock / 1,548%, 13,363,356 iterations
Railgun_Quadruplet_7:          2,658KB/clock / 1,447%, 14,292,419 iterations
Railgun_Quadruplet_7sunhorse:  1,870KB/clock / 2,234%, 09,259,505 iterations
Railgun_Quadruplet_7deuce:     1,757KB/clock / 2,011%, 10,287,584 iterations
Railgun_Quadruplet_7Elsiane:   1,656KB/clock / 2,240%, 09,236,188 iterations
Railgun_Quadruplet_7Gulliver:  3,108KB/clock / 3,821%, 05,413,684 iterations  unstoppable!
Boyer_Moore_Flensburg:         1,712KB/clock / 1,540%, 13,431,751 iterations

Pattern: fastest fox with biggest strides known to me up to 2012 January 26 namely 'Gulliver'
Railgun_Quadruplet_7sun:       3,108KB/clock / 2,890%, 07,159,321 iterations
Railgun_Quadruplet_7:          3,157KB/clock / 2,742%, 07,545,141 iterations
Railgun_Quadruplet_7sunhorse:  2,557KB/clock / 4,138%, 04,999,777 iterations
Railgun_Quadruplet_7deuce:     2,494KB/clock / 4,029%, 05,135,444 iterations
Railgun_Quadruplet_7Elsiane:   2,494KB/clock / 4,141%, 04,995,463 iterations
Railgun_Quadruplet_7Gulliver:  3,108KB/clock / 7,218%, 02,866,342 iterations  simply amazing!
Boyer_Moore_Flensburg:         2,767KB/clock / 2,745%, 07,536,097 iterations

Note:
[
In the above benchmark (on Pentium Merom 2166Mhz) 'Gulliver' is limited by memory bandwidth only, due to:

Pattern: fastest fox with biggest strides
Railgun_Quadruplet_7sun:       2,658KB/clock / 1,584%, 13,060,463 iterations
Railgun_Quadruplet_7Gulliver:  3,108KB/clock / 2,917%, 07,091,357 iterations

Pattern: fastest fox with biggest strides known to me
Railgun_Quadruplet_7sun:       2,590KB/clock / 1,548%, 13,363,356 iterations
Railgun_Quadruplet_7Gulliver:  3,108KB/clock / 3,821%, 05,413,684 iterations

Pattern: fastest fox with biggest strides known to me up to 2012 January 26 namely 'Gulliver'
Railgun_Quadruplet_7sun:       3,108KB/clock / 2,890%, 07,159,321 iterations
Railgun_Quadruplet_7Gulliver:  3,108KB/clock / 7,218%, 02,866,342 iterations

You can see how Railgun_Quadruplet_7sun struggles to catch up, for longest pattern it achieves the maximum 3,108KB/clock but at a nasty cost: 7,159,321 iterations whereas 'Gulliver' achieves same (in fact much higher) speed with heart-touching 2,866,342 iterations only.
For my Pentium T3400 2166 MHz Toshiba Satellite L305 Dual DDR2-667 5-5-5-13, 'Everest' benchmark program gives:
Memory Read:  4,605 MB/s
Memory Write: 3,323 MB/s
'Gulliver' would run at 5GB/s without any problem if it were not for the hardware limitation, a fantastic performance in my eyes.
]
*/
// 
// Benchmark (on Pentium Merom 2166MHz, Windows XP 32bit, CL v16 32bit):
// Two tests were done - the first/second '6x2'/'52' searches 12/52 patterns (for all appearances i.e. counting all hits) into 206,908,949 bytes long English text:
// 
// Skip-Performance (measured in %, bigger-the-better; Iterations, lesser-the-better) Summary:
// 
// '6x2' test (12 patterns 4-19 chars in length)
// 
// Railgun_Quadruplet_7sun 6x2 i.e. average performance:       2,092KB/clock / 0,153,936% / 005,572,202,064
// Railgun_Quadruplet_7 6x2 i.e. average performance:          2,032KB/clock / 0,137,568% / 006,454,416,320
// Railgun_Quadruplet_7sunhorse 6x2 i.e. average performance:  1,262KB/clock / 0,173,840% / 005,145,288,080
// Railgun_Quadruplet_7deuce 6x2 i.e. average performance:     1,134KB/clock / 0,155,744% / 005,995,597,776
// Railgun_Quadruplet_7Elsiane 6x2 i.e. average performance:   0,938KB/clock / 0,184,096% / 004,710,045,936
// Boyer_Moore_Flensburg 6x2 i.e. average performance:         1,163KB/clock / 0,139,168% / 006,428,106,496
// Railgun_Quadruplet_7Gulliver 6x2 i.e. average performance:  1,552KB/clock / 0,152,912% / 006,959,446,480
// Brute_Force_Dummy 6x2 i.e. average performance:             0,343KB/clock / 0,019,200% / 039,726,516,640
// 
// '52' test (52 patterns 4-175 chars in length)
// 
// Railgun_Quadruplet_7sun 52 i.e. average performance:        1,652KB/clock / 0,768,416% / 022,362,482,640
// Railgun_Quadruplet_7 52 i.e. average performance:           1,636KB/clock / 0,708,768% / 025,368,049,712
// Railgun_Quadruplet_7sunhorse 52 i.e. average performance:   1,052KB/clock / 0,939,520% / 020,187,534,736
// Railgun_Quadruplet_7deuce 52 i.e. average performance:      0,929KB/clock / 0,864,048% / 023,292,952,992
// Railgun_Quadruplet_7Elsiane 52 i.e. average performance:    0,809KB/clock / 0,979,072% / 018,535,811,280
// Boyer_Moore_Flensburg 52 i.e. average performance:          0,928KB/clock / 0,720,272% / 025,227,269,760
// Railgun_Quadruplet_7Gulliver 52 i.e. average performance:   1,296KB/clock / 1,145,808% / 026,131,772,688
// Brute_Force_Dummy 52 i.e. average performance:              0,279KB/clock / 0,083,200% / 172,148,232,496
// 
// Revision: 1, 2012-Jan-26, the main disadvantage: the preprocessing overhead.
// Caution: For better speed the case 'if (cbPattern==1)' was removed, so Pattern must be longer than 1 char.
char * Railgun_Quadruplet_7Gulliver_count_hits (char * pbTarget, char * pbPattern, unsigned long cbTarget, unsigned long cbPattern)
{
    char * pbTargetMax = pbTarget + cbTarget;
    register unsigned long ulHashPattern;
    register unsigned long ulHashTarget;
    signed long count;
    signed long countSTATIC;

    unsigned char SINGLET;
    unsigned long Quadruplet2nd;
    unsigned long Quadruplet3rd;
    unsigned long Quadruplet4th;

    unsigned long  AdvanceHopperGrass;

    long i; //BMH needed
    int a, j;
    unsigned int bm_bc[256]; //BMH needed
    unsigned int bm_bc2nd[256]; //BMS needed
    unsigned char bm_Sunday_Order2[256*256]; //BMHSS(Elsiane) needed, 'char' limits patterns to 255, if 'long' then table becomes 256KB, grrr.
    unsigned long Gulliver; // or unsigned char or unsigned short

    if (cbPattern > cbTarget)
        return(NULL);

    if ( cbPattern<4) { 
        pbTarget = pbTarget+cbPattern;
        ulHashPattern = ( (*(char *)(pbPattern))<<8 ) + *(pbPattern+(cbPattern-1));

        if ( cbPattern==3) {
            for ( ;; ) {
                if ( ulHashPattern == ( (*(char *)(pbTarget-3))<<8 ) + *(pbTarget-1) ) {
                    if ( *(char *)(pbPattern+1) == *(char *)(pbTarget-2) ) Railgunhits++; //return((pbTarget-3));
                }
                if ( (char)(ulHashPattern>>8) != *(pbTarget-2) ) pbTarget++;
                pbTarget++;
                if (pbTarget > pbTargetMax)
                    return(NULL);
            }
        } else {
        }
        for ( ;; ) {
            if ( ulHashPattern == ( (*(char *)(pbTarget-2))<<8 ) + *(pbTarget-1) )
                Railgunhits++; //return((pbTarget-2));
            if ( (char)(ulHashPattern>>8) != *(pbTarget-1) ) pbTarget++;
            pbTarget++;
            if (pbTarget > pbTargetMax)
                return(NULL);
        }
    } else { //if ( cbPattern<4)
        if (cbTarget<961) { // This value is arbitrary(don't know how exactly), it ensures(at least must) better performance than 'Boyer_Moore_Horspool'.
            pbTarget = pbTarget+cbPattern;
            ulHashPattern = *(unsigned long *)(pbPattern);

            SINGLET = ulHashPattern & 0xFF;
            Quadruplet2nd = SINGLET<<8;
            Quadruplet3rd = SINGLET<<16;
            Quadruplet4th = SINGLET<<24;

            for ( ;; ) {
                AdvanceHopperGrass = 0;
                ulHashTarget = *(unsigned long *)(pbTarget-cbPattern);

                    if ( ulHashPattern == ulHashTarget ) { // Three unnecessary comparisons here, but 'AdvanceHopperGrass' must be calculated - it has a higher priority.
                    count = cbPattern-1;
                    while ( count && *(char *)(pbPattern+(cbPattern-count)) == *(char *)(pbTarget-count) ) {
                        if ( cbPattern-1==AdvanceHopperGrass+count && SINGLET != *(char *)(pbTarget-count) ) AdvanceHopperGrass++;
                        count--;
                    }
                    if ( count == 0) Railgunhits++; //return((pbTarget-cbPattern));
                    } else { // The goal here: to avoid memory accesses by stressing the registers.
                    if ( Quadruplet2nd != (ulHashTarget & 0x0000FF00) ) {
                        AdvanceHopperGrass++;
                        if ( Quadruplet3rd != (ulHashTarget & 0x00FF0000) ) {
                            AdvanceHopperGrass++;
                            if ( Quadruplet4th != (ulHashTarget & 0xFF000000) ) AdvanceHopperGrass++;
                        }
                    }
                }

                AdvanceHopperGrass++;

                pbTarget = pbTarget + AdvanceHopperGrass;
                if (pbTarget > pbTargetMax)
                    return(NULL);
            }
        } else { //if (cbTarget<961)
            countSTATIC = cbPattern-2-2;

            for (a=0; a < 256; a++) {bm_bc[a]=cbPattern; bm_bc2nd[a]=cbPattern+1;}
            for (j=0; j < cbPattern-1; j++) bm_bc[pbPattern[j]]=cbPattern-j-1; 
            for (j=0; j < cbPattern; j++) bm_bc2nd[pbPattern[j]]=cbPattern-j; 

            // Elsiane r.2  [
            for (a=0; a < 256*256; a++) {bm_Sunday_Order2[a]= cbPattern-1;} // 'memset' if not optimized

            // alfalfa 7 long 6 BBs (al lf fa al lf fa) 3 distinct BBs (al lf fa) 
            // fast 4 0-1-2 fa as st
            for (j=0; j < cbPattern-1; j++) bm_Sunday_Order2[*(unsigned short *)(pbPattern+j)]=j; // Rightmost appearance/position is needed

            // Elsiane r.2 ]

            ulHashPattern = *(unsigned short *)(pbPattern+cbPattern-1-1); // Last two bytes
            ulHashTarget = *(unsigned long *)(pbPattern); // First four bytes
        
            AdvanceHopperGrass = 0;
            i=0;
            while (i <= cbTarget-cbPattern-1-1) {
                Gulliver = bm_Sunday_Order2[*(unsigned short *)&pbTarget[i+cbPattern-1-1]];

                if ( Gulliver == cbPattern-2 ) { // CASE #1: means the pair (char order 2) is found
                    if ( *(unsigned long *)&pbTarget[i] == ulHashTarget) {
                        count = countSTATIC; // Last two chars already matched, to be fixed with -2
                        while ( count !=0 && *(char *)(pbPattern+(countSTATIC-count)+4) == *(char *)(&pbTarget[i]+(countSTATIC-count)+4) )
                            count--;
                        if ( count == 0) Railgunhits++; //return(pbTarget+i);
                    }
                    i = i + 1;
                } else if ( Gulliver == cbPattern-1 ) // CASE #2: means the pair (char order 2) is not found
                    i = i + Gulliver; // the pair is not found, skip the whole pattern and fall back one char
                else
                    i = i +  cbPattern - Gulliver - 2; // CASE #3: the pair is found and not as suffix i.e. rightmost position

// 32323218 Order 1 Horspool
// fa af fa af fa as st Order 2 Horspool
//  0  1  2  3  4  5  6
// HIKARIfast
// fafafast
//   fafafast +2 Order 1 'a' vs 't'
//   fafafast +2 = (cbPattern-Gulliver-2 = 8-4-2 = 2) Order 2 'fa' vs 'st' i.e. CASE #3

// 76543218 Order 1 Horspool
// lo on ng gp pa ac ce Order 2 Horspool
//  0  1  2  3  4  5  6
// HIKARIfast
// longpace
//   longpace +2 Order 1 'a' vs 'e'
//        longpace +7 = (cbPattern-1 = 8-1 = 7) Order 2 'fa' vs 'ce' i.e. CASE #2

                AdvanceHopperGrass++;
            }

            if (i == cbTarget-cbPattern-1) {
                if ( *(unsigned long *)&pbTarget[i] == ulHashPattern) {
                    count = countSTATIC;
                    while ( count !=0 && *(char *)(pbPattern+(countSTATIC-count)+4) == *(char *)(&pbTarget[i]+(countSTATIC-count)+4) )
                        count--;
                    if ( count == 0) Railgunhits++; //return(pbTarget+i);
                }
                i++;
                AdvanceHopperGrass++;
            }

            if (i == cbTarget-cbPattern) {
                if ( *(unsigned long *)&pbTarget[i] == ulHashPattern) {
                    count = countSTATIC;
                    while ( count !=0 && *(char *)(pbPattern+(countSTATIC-count)+4) == *(char *)(&pbTarget[i]+(countSTATIC-count)+4) )
                        count--;
                    if ( count == 0) Railgunhits++; //return(pbTarget+i);
                }
                AdvanceHopperGrass++;
            }

            GlobalSP += (int)((double)cbTarget/AdvanceHopperGrass*100);
            GlobalI += AdvanceHopperGrass;
            printf("Skip-Performance(bigger-the-better): %d%%, %d skips/iterations\n",(int)((double)cbTarget/AdvanceHopperGrass*100), AdvanceHopperGrass);
        
            return(NULL);
        } //if (cbTarget<961)
    } //if ( cbPattern<4)
}
// ### Mix(2in1) of Karp-Rabin & Boyer-Moore-Sunday-Horspool algorithm ]
I was very touched (thus inspired) by this cuore song:

Strong enough to tell you why
I feel enough to tell you something that's very close to me
In the eye of my mind
Feel the reasons that we thought
You just feel all
Falling I'm standing across the stream
Fearing saying I
I'm strong enough to find my courage
To find the way I'm always near it's brewing in my mind
Falling I'm standing across the stream
Fearing saying I
I'm strong enough to find my courage
To find the way I'm always near it's growing in my mind

/Artist: Elsiane - Across The Stream/

I salute everyone who feels-and-keeps the artistic way near.

An add-on from 2012-Jan-24:

Just wanted to see what Sunday's technique (applied on two chars, order 2) holds, this resulted in appearance of Railgun_Quadruplet_7Elsiane, the biggest-strider so far, I dream (already) of Railgun_Quadruplet_7Gulliver...
Horspool-Sunday-Sunday jumps to:
--------------------------------\
                                |
Horspool-Sunday jumps to:
-------------------------------\
                               |
Horspool jumps to:
------------------------------\
                              |
                  String: faster
                  Needle: fast

       Sunday skip-table:
                          43215
     Horspool skip-table:
                          32144
For above example:
Horspool jumps to A where A=4 'e' position
Sunday jumps to B where B=5 'r' position
Horspool-Sunday ('SunHorse') jumps to max(A,B) i.e. 5
Horspool-Sunday-Sunday ('Elsiane') jumps to max(A,B,C) i.e. max(5,6)=6 since neither 'e' nor 'r' appear in the needle i.e. right after 'r'
// ### Mix(2in1) of Karp-Rabin & Boyer-Moore-Sunday-Horspool algorithm [
// Note: 'Elsiane' is the strongest (i.e. with biggest-strides) Boyer-Moore-Sunday-Horspool variant known to me, named after a 'very close to me' duo(two musicians), viva Elsieanne Caplette!
// Benchmark (on Pentium Merom 2166MHz, Windows XP 32bit, CL v16 32bit):
// Two tests were done - the first/second '6x2'/'52' searches 12/52 patterns (for all appearances i.e. counting all hits) into 206,908,949 bytes long English text:
// 
// Skip-Performance (measured in %, bigger-the-better; Iterations, lesser-the-better) Summary:
// 
// Function                      '6x2' test (12 patterns 4-19 chars in length)
// 
// Railgun_Quadruplet_7Elsiane   0,934KB/clock / 184,096% / 004,710,045,936 Iterations  Muffin
// Railgun_Quadruplet_7sunhorse  1,261KB/clock / 173,840% / 005,145,288,080 Iterations  BONBON
// Railgun_Quadruplet_7deuce     1,132KB/clock / 155,744% / 005,995,597,776 Iterations  Leaves-The-Town
// Railgun_Quadruplet_7sun       2,091KB/clock / 153,936% / 005,572,202,064 Iterations  Bitter-Sweet
// Boyer_Moore_Flensburg         1,162KB/clock / 139,168% / 006,428,106,496 Iterations  Loopy
// Railgun_Quadruplet_7          2,029KB/clock / 137,568% / 006,454,416,320 Iterations  Loopier
// Brute_Force_Dummy             0,342KB/clock / 019,200% / 039,726,516,640 Iterations  Loopiest
// 
// Function                      '52' test (52 patterns 4-175 chars in length)
// 
// Railgun_Quadruplet_7Elsiane   0,809KB/clock / 979,072% / 018,535,811,280 Iterations  Muffin
// Railgun_Quadruplet_7sunhorse  1,051KB/clock / 939,520% / 020,187,534,736 Iterations  BONBON
// Railgun_Quadruplet_7deuce     0,928KB/clock / 864,048% / 023,292,952,992 Iterations  Leaves-The-Town
// Railgun_Quadruplet_7sun       1,651KB/clock / 768,416% / 022,362,482,640 Iterations  Bitter-Sweet
// Boyer_Moore_Flensburg         0,928KB/clock / 720,272% / 025,227,269,760 Iterations  Loopy
// Railgun_Quadruplet_7          1,632KB/clock / 708,768% / 025,368,049,712 Iterations  Loopier
// Brute_Force_Dummy             0,279KB/clock / 083,200% / 172,148,232,496 Iterations  Loopiest
// 
// Revision: 1, 2012-Jan-24, the main disadvantage: the preprocessing overhead.
// Caution: For better speed the case 'if (cbPattern==1)' was removed, so Pattern must be longer than 1 char.
char * Railgun_Quadruplet_7Elsiane_count_hits (char * pbTarget,    char * pbPattern, unsigned long cbTarget, unsigned long cbPattern)
{
    char * pbTargetMax = pbTarget + cbTarget;
    register unsigned long ulHashPattern;
    register unsigned long ulHashTarget;
    signed long count;
    signed long countSTATIC;

    unsigned char SINGLET;
    unsigned long Quadruplet2nd;
    unsigned long Quadruplet3rd;
    unsigned long Quadruplet4th;

    unsigned long  AdvanceHopperGrass;

    long i; //BMH needed
    int a, j;
    unsigned int bm_bc[256]; //BMH needed
    unsigned int bm_bc2nd[256]; //BMS needed
    unsigned char bm_Sunday_Order2[256*256]; //BMHSS(Elsiane) needed, of course those 65536 bytes are scary in next revision reduce them to 8192 i.e. bitwise it

    if (cbPattern > cbTarget)
        return(NULL);

    if ( cbPattern<4) { 
        pbTarget = pbTarget+cbPattern;
        ulHashPattern = ( (*(char *)(pbPattern))<<8 ) + *(pbPattern+(cbPattern-1));

        if ( cbPattern==3) {
            for ( ;; ) {
                if ( ulHashPattern == ( (*(char *)(pbTarget-3))<<8 ) + *(pbTarget-1) ) {
                    if ( *(char *)(pbPattern+1) == *(char *)(pbTarget-2) ) Railgunhits++; //return((pbTarget-3));
                }
                if ( (char)(ulHashPattern>>8) != *(pbTarget-2) ) pbTarget++;
                pbTarget++;
                if (pbTarget > pbTargetMax)
                    return(NULL);
            }
        } else {
        }
        for ( ;; ) {
            if ( ulHashPattern == ( (*(char *)(pbTarget-2))<<8 ) + *(pbTarget-1) )
                Railgunhits++; //return((pbTarget-2));
            if ( (char)(ulHashPattern>>8) != *(pbTarget-1) ) pbTarget++;
            pbTarget++;
            if (pbTarget > pbTargetMax)
                return(NULL);
        }
    } else { //if ( cbPattern<4)
        if (cbTarget<961) { // This value is arbitrary(don't know how exactly), it ensures(at least must) better performance than 'Boyer_Moore_Horspool'.
            pbTarget = pbTarget+cbPattern;
            ulHashPattern = *(unsigned long *)(pbPattern);

            SINGLET = ulHashPattern & 0xFF;
            Quadruplet2nd = SINGLET<<8;
            Quadruplet3rd = SINGLET<<16;
            Quadruplet4th = SINGLET<<24;

            for ( ;; ) {
                AdvanceHopperGrass = 0;
                ulHashTarget = *(unsigned long *)(pbTarget-cbPattern);

                    if ( ulHashPattern == ulHashTarget ) { // Three unnecessary comparisons here, but 'AdvanceHopperGrass' must be calculated - it has a higher priority.
                    count = cbPattern-1;
                    while ( count && *(char *)(pbPattern+(cbPattern-count)) == *(char *)(pbTarget-count) ) {
                        if ( cbPattern-1==AdvanceHopperGrass+count && SINGLET != *(char *)(pbTarget-count) ) AdvanceHopperGrass++;
                        count--;
                    }
                    if ( count == 0) Railgunhits++; //return((pbTarget-cbPattern));
                    } else { // The goal here: to avoid memory accesses by stressing the registers.
                    if ( Quadruplet2nd != (ulHashTarget & 0x0000FF00) ) {
                        AdvanceHopperGrass++;
                        if ( Quadruplet3rd != (ulHashTarget & 0x00FF0000) ) {
                            AdvanceHopperGrass++;
                            if ( Quadruplet4th != (ulHashTarget & 0xFF000000) ) AdvanceHopperGrass++;
                        }
                    }
                }

                AdvanceHopperGrass++;

                pbTarget = pbTarget + AdvanceHopperGrass;
                if (pbTarget > pbTargetMax)
                    return(NULL);
            }
        } else { //if (cbTarget<961)
            countSTATIC = cbPattern-2-2;

            for (a=0; a < 256; a++) {bm_bc[a]=cbPattern; bm_bc2nd[a]=cbPattern+1;}
            for (j=0; j < cbPattern-1; j++) bm_bc[pbPattern[j]]=cbPattern-j-1; 
            for (j=0; j < cbPattern; j++) bm_bc2nd[pbPattern[j]]=cbPattern-j; 

            // Elsiane [
            for (a=0; a < 256*256; a++) {bm_Sunday_Order2[a]= 0;} // 'memset' if not optimized

            // alfalfa 7 long 6 BBs (al lf fa al lf fa) 3 distinct BBs (al lf fa) 
            // fast 4 0-1-2 fa as st
            //for (j=0; j < cbPattern-1; j++) bm_Sunday_Order2[*(unsigned short *)(pbPattern+j)]=1;

            // The idea here is to use Order 2 of Sunday's approach:
            // It takes all rows and columns containing a char (e.g. A) from the pattern to be marked (as 1 i.e. active)
            //    000 001 002 003 ...   A ... 255
            // 000                      * 
            // 001                      *
            // 002                      *
            // 003                      *
            // ...                      *
            //   A  *   *   *   *   *   *   *   *  
            // ...                      *
            // 255                      *
            // TO-DO: bytewise -> bitwise
            for (j=0; j < cbPattern; j++) 
                for (a=0; a < 256; a++) {
                    bm_Sunday_Order2[(a<<8) + pbPattern[j]]=1; // columns filling
                    bm_Sunday_Order2[(pbPattern[j]<<8) + a]=1; // rows filling
                }
            // Elsiane ]

            ulHashPattern = *(unsigned long *)(pbPattern);
        
            AdvanceHopperGrass = 0;
            i=0;
            while (i <= cbTarget-cbPattern-1-1) {
                if ( *(unsigned long *)&pbTarget[i] == ulHashPattern) {
                    count = countSTATIC;
                    while ( count !=0 && *(char *)(pbPattern+(countSTATIC-count)+4) == *(char *)(&pbTarget[i]+(countSTATIC-count)+4) )
                        count--;
                    if ( count == 0) Railgunhits++; //return(pbTarget+i);
                }

                if ( bm_Sunday_Order2[*(unsigned short *)&pbTarget[i+cbPattern]] == 0 )
                    i = i + cbPattern + 1 + 1;
                else
                    if ( bm_bc[(unsigned char)pbTarget[i+cbPattern-1]] < bm_bc2nd[(unsigned char)pbTarget[i+(cbPattern)]] )
                        i= i + bm_bc2nd[(unsigned char)pbTarget[i+(cbPattern)]];
                    else
                        i= i + bm_bc[(unsigned char)pbTarget[i+cbPattern-1]];

                AdvanceHopperGrass++;
            }

            if (i == cbTarget-cbPattern-1) {
                if ( *(unsigned long *)&pbTarget[i] == ulHashPattern) {
                    count = countSTATIC;
                    while ( count !=0 && *(char *)(pbPattern+(countSTATIC-count)+4) == *(char *)(&pbTarget[i]+(countSTATIC-count)+4) )
                        count--;
                    if ( count == 0) Railgunhits++; //return(pbTarget+i);
                }
                i++;
                AdvanceHopperGrass++;
            }

            if (i == cbTarget-cbPattern) {
                if ( *(unsigned long *)&pbTarget[i] == ulHashPattern) {
                    count = countSTATIC;
                    while ( count !=0 && *(char *)(pbPattern+(countSTATIC-count)+4) == *(char *)(&pbTarget[i]+(countSTATIC-count)+4) )
                        count--;
                    if ( count == 0) Railgunhits++; //return(pbTarget+i);
                }
                AdvanceHopperGrass++;
            }

            GlobalSP += (int)((double)cbTarget/AdvanceHopperGrass*100);
            GlobalI += AdvanceHopperGrass;
            //printf("Skip-Performance(bigger-the-better): %d%%, %d skips/iterations\n",(int)((double)cbTarget/AdvanceHopperGrass*100), AdvanceHopperGrass);
        
            return(NULL);
        } //if (cbTarget<961)
    } //if ( cbPattern<4)
}
// ### Mix(2in1) of Karp-Rabin & Boyer-Moore-Sunday-Horspool algorithm ]
/*
strstr_SHORT-SHOWDOWN, revision 7Elsiane_vs_7sun_vs_7_vs_7sunhorse_vs_7deuce_vs_BMF, written by Kaze.
Full credits to: R.S. Boyer, J.S. Moore, R.N. Horspool, D.M. Sunday.
Input Pattern(up to 19+2000 chars): faster

Doing Search for Pattern(6bytes) into String(206908949bytes) as-one-line ...
Skip-Performance(bigger-the-better): 591%, 34996936 skips/iterations
Railgun_Quadruplet_7sun_hits/Railgun_Quadruplet_7sun_clocks: 744/154
Railgun_Quadruplet_7sun performance: 1312KB/clock

Doing Search for Pattern(6bytes) into String(206908949bytes) as-one-line ...
Skip-Performance(bigger-the-better): 514%, 40194194 skips/iterations
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 744/158
Railgun_Quadruplet_7 performance: 1278KB/clock

Doing Search for Pattern(6bytes) into String(206908949bytes) as-one-line ...
Skip-Performance(bigger-the-better): 656%, 31504148 skips/iterations
Railgun_Quadruplet_7sunhorse_hits/Railgun_Quadruplet_7sunhorse_clocks: 744/272
Railgun_Quadruplet_7sunhorse performance: 742KB/clock

Doing Search for Pattern(6bytes) into String(206908949bytes) as-one-line ...
Skip-Performance(bigger-the-better): 567%, 36434006 skips/iterations
Railgun_Quadruplet_7deuce_hits/Railgun_Quadruplet_7deuce_clocks: 744/313
Railgun_Quadruplet_7deuce performance: 645KB/clock

Doing Search for Pattern(6bytes) into String(206908949bytes) as-one-line ...
Skip-Performance(bigger-the-better): 710%, 29101626 skips/iterations
Railgun_Quadruplet_7Elsiane_hits/Railgun_Quadruplet_7Elsiane_clocks: 744/383
Railgun_Quadruplet_7Elsiane performance: 527KB/clock

Doing Search for Pattern(6bytes) into String(206908949bytes) as-one-line ...
Skip-Performance(bigger-the-better): 514%, 40194216 skips/iterations
Boyer_Moore_Flensburg_hits/Boyer_Moore_Flensburg_clocks: 744/299
Boyer_Moore_Flensburg performance: 675KB/clock

Doing Search for Pattern(6bytes) into String(206908949bytes) as-one-line ...
Skip-Performance(bigger-the-better): 106%, 194548807 skips/iterations
Two-Way_hits/Two-Way_clocks: 744/836
Two-Way performance: 241KB/clock

D:\_KAZE_strstr_SHORT-SHOWDOWN_7Elsiane_7sun_vs_7_vs_7sunhorse_vs_7deuce_vs_BMF>
*/
I like a lot this look-ahead approach of D.M. Sunday, but can't see how to quicken it, any ideas?

An add-on from 2012-Jan-22:

[
A face-off
'SunHorse' vs Railgun_Quadruplet_7sun vs Railgun_Quadruplet_7 vs Boyer_Moore vs Brute_Force

Okay, here comes Railgun_Quadruplet_7sun - the new fastest hitter.
Two tests were done - the first/second '6x2'/'52' searches 12/52 patterns (for all appearances i.e. counting all hits) into 206,908,949 bytes long English text.

First, 7deuce is in the past, a bad luck it was, yes.
Wow, I have received collateral damage by the last shoot-out in which D.M. Sunday excels both in Speed-Performance and Skip-Performance.
Speed-Performance (measured in KB/clock, bigger-the-better) Summary:

Function                      '6x2' test (12 patterns 4-19 chars in length)

Railgun_Quadruplet_7sun       2,092KB/clock / 153,936% / 005,572,202,064 Iterations  BONBON
Railgun_Quadruplet_7          2,030KB/clock / 137,568% / 006,454,416,320 Iterations  Leaves-The-Town
Railgun_Quadruplet_7sunhorse  1,250KB/clock / 173,840% / 005,145,288,080 Iterations  UNDERRATED
Boyer_Moore_Flensburg         1,154KB/clock / 139,168% / 006,428,115,568 Iterations  Slow 
Railgun_Quadruplet_7deuce     1,125KB/clock / 155,744% / 005,995,597,776 Iterations  Slower 
Brute_Force_Dummy             0,338KB/clock / 019,200% / 039,726,516,640 Iterations  Slowest 

Function                      '52' test (52 patterns 4-175 chars in length)

Railgun_Quadruplet_7sun       1,657KB/clock / 768,416% / 022,362,482,640 Iterations  BONBON
Railgun_Quadruplet_7          1,637KB/clock / 708,768% / 025,368,049,712 Iterations  Leaves-The-Town
Railgun_Quadruplet_7sunhorse  1,046KB/clock / 939,520% / 020,187,534,736 Iterations  UNDERRATED
Railgun_Quadruplet_7deuce     0,923KB/clock / 864,048% / 023,292,952,992 Iterations  Slow 
Boyer_Moore_Flensburg         0,919KB/clock / 720,272% / 025,227,312,176 Iterations  Slower 
Brute_Force_Dummy             0,276KB/clock / 083,200% / 172,148,232,496 Iterations  Slowest 

Skip-Performance (measured in %, bigger-the-better; Iterations, lesser-the-better) Summary:

Function                      '6x2' test (12 patterns 4-19 chars in length)

Railgun_Quadruplet_7sunhorse  1,250KB/clock / 173,840% / 005,145,288,080 Iterations  BONBON
Railgun_Quadruplet_7deuce     1,125KB/clock / 155,744% / 005,995,597,776 Iterations  Leaves-The-Town
Railgun_Quadruplet_7sun       2,092KB/clock / 153,936% / 005,572,202,064 Iterations  Bitter-Sweet
Boyer_Moore_Flensburg         1,154KB/clock / 139,168% / 006,428,115,568 Iterations  Loopy
Railgun_Quadruplet_7          2,030KB/clock / 137,568% / 006,454,416,320 Iterations  Loopier
Brute_Force_Dummy             0,338KB/clock / 019,200% / 039,726,516,640 Iterations  Loopiest 

Function                      '52' test (52 patterns 4-175 chars in length)

Railgun_Quadruplet_7sunhorse  1,046KB/clock / 939,520% / 020,187,534,736 Iterations  BONBON
Railgun_Quadruplet_7deuce     0,923KB/clock / 864,048% / 023,292,952,992 Iterations  Leaves-The-Town
Railgun_Quadruplet_7sun       1,657KB/clock / 768,416% / 022,362,482,640 Iterations  Bitter-Sweet
Boyer_Moore_Flensburg         0,919KB/clock / 720,272% / 025,227,312,176 Iterations  Loopy
Railgun_Quadruplet_7          1,637KB/clock / 708,768% / 025,368,049,712 Iterations  Loopier
Brute_Force_Dummy             0,276KB/clock / 083,200% / 172,148,232,496 Iterations  Loopiest
For English texts I would have no second thought which function to choose: it is Railgun_Quadruplet_7sunhorse (Boyer-Moore reinforced by Sunday-Horspool) - strongest of all Boyer-Moore variants(including the original), sadly [still] not among the fastest.
I can't figure out the cause for such unproportionality: 'sunhorse' features amazing (for English texts) Skip-Performance i.e. number of pattern comparisions agaist the text is so low but Speed-Performance (on this CPU) lags behind.

The test package is downloadable at:
http://www.sanmayce.com/Downloads/_KAZE_strstr_SHORT-SHOWDOWN_7sun_vs_7_vs_7sunhorse_vs_7deuce_vs_BMF.7z

The test package contains:
D:\_KAZE_strstr_SHORT-SHOWDOWN_7sun_vs_7_vs_7sunhorse_vs_7deuce_vs_BMF>dir

01/22/2012  07:00 AM            59,198 FH_Flensburg.zip
01/22/2012  07:00 AM       206,908,949 OSHO.TXT
01/22/2012  07:00 AM           469,283 Results_Pentium_Merom_2166Mhz.txt
01/22/2012  07:00 AM               175 RUNME_IT_TAKES_some_20_MINUTES.BAT
01/22/2012  07:00 AM               191 strstr_COMPILE_Microsoft.bat
01/22/2012  07:00 AM           603,625 strstr_SHORT-SHOWDOWN.c
01/22/2012  07:00 AM           633,053 strstr_SHORT-SHOWDOWN.cod
01/22/2012  07:00 AM           108,032 strstr_SHORT-SHOWDOWN_Microsoft_v16_Ox.exe

D:\_KAZE_strstr_SHORT-SHOWDOWN_7sun_vs_7_vs_7sunhorse_vs_7deuce_vs_BMF>
A very easy-to-the-eyes/mind article at http://www.inf.fh-flensburg.de/lang/algorithmen/pattern/bm.htm
Here I say 'Triple THANKS' to Prof. Dr. Hans Werner Lang (Institut für Medieninformatik und Technische Informatik Fachhochschule Flensburg), I used your code (after a straightforward port). On his page (http://www.inf.fh-flensburg.de/lang/) there are very well presented algorithms - a very good resource indeed.
I expect Railgun_Quadruplet_7sunhorse to become a gun of choice but only on CPUs with FAR more effective branching/memory handling than my Pentium Merom.

I still do not understand Turbo Boyer-Moore algorithm, maybe it holds a significant Skip-Performance boost.
This project is under researching, that is why I have called it a quest. Anyone who can point out faster functions is welcomed to share them with us here.

My favorite UNDERRATED gun has main loop as follows:
; 3209 :     i=0;
; 3210 :     while (i <= cbTarget-cbPattern-1) {

  010e6    8b 4c 24 20     mov     ecx, DWORD PTR _cbTarget$GSCopy$[esp+2104]
  010ea    89 44 24 28     mov     DWORD PTR _ulHashPattern$[esp+2104], eax
  010ee    33 c0         xor     eax, eax
  010f0    2b ca         sub     ecx, edx
  010f2    89 4c 24 14     mov     DWORD PTR tv554[esp+2104], ecx
  010f6    8d 0c 13     lea     ecx, DWORD PTR [ebx+edx]
  010f9    89 44 24 10     mov     DWORD PTR _AdvanceHopperGrass$[esp+2104], eax
  010fd    89 4c 24 18     mov     DWORD PTR tv451[esp+2104], ecx
$LN13@Railgun_Qu@5:

; 3211 :          ulHashTarget=*(unsigned long *)&pbTarget[i];
; 3212 :          if ( ulHashTarget == ulHashPattern)

  01101    8b 54 24 28     mov     edx, DWORD PTR _ulHashPattern$[esp+2104]
  01105    39 14 18     cmp     DWORD PTR [eax+ebx], edx
  01108    75 41         jne     SHORT $LN8@Railgun_Qu@5

; 3213 :              {
; 3214 :          count = countSTATIC;

  0110a    8b 7c 24 1c     mov     edi, DWORD PTR _countSTATIC$[esp+2104]

; 3215 :          while ( count !=0 && *(char *)(pbPattern+(countSTATIC-count)+4) == *(char *)(&pbTarget[i]+(countSTATIC-count)+4) ) { // if pattern length is 4 or 5 we have count=-1 and count=0 respectively i.e. no need of comparing in-between chars.

  0110e    85 ff         test     edi, edi
  01110    74 2d         je     SHORT $LN85@Railgun_Qu@5

; 3213 :              {
; 3214 :          count = countSTATIC;

  01112    8d 56 04     lea     edx, DWORD PTR [esi+4]
  01115    8d 74 18 04     lea     esi, DWORD PTR [eax+ebx+4]
  01119    8d a4 24 00 00
    00 00         npad     7
$LL10@Railgun_Qu@5:

; 3215 :          while ( count !=0 && *(char *)(pbPattern+(countSTATIC-count)+4) == *(char *)(&pbTarget[i]+(countSTATIC-count)+4) ) { // if pattern length is 4 or 5 we have count=-1 and count=0 respectively i.e. no need of comparing in-between chars.

  01120    8a 0a         mov     cl, BYTE PTR [edx]
  01122    3a 0e         cmp     cl, BYTE PTR [esi]
  01124    75 11         jne     SHORT $LN9@Railgun_Qu@5

; 3216 :                count--;

  01126    46         inc     esi
  01127    42         inc     edx
  01128    4f         dec     edi
  01129    75 f5         jne     SHORT $LL10@Railgun_Qu@5

; 3217 :          }
; 3218 :          if ( count == 0) Railgunhits++; //return(pbTarget+i);

  0112b    8b 74 24 24     mov     esi, DWORD PTR _pbPattern$GSCopy$[esp+2104]
  0112f    ff 05 00 00 00
    00         inc     DWORD PTR _Railgunhits
  01135    eb 14         jmp     SHORT $LN8@Railgun_Qu@5
$LN9@Railgun_Qu@5:
  01137    85 ff         test     edi, edi
  01139    75 0c         jne     SHORT $LN111@Railgun_Qu@5
  0113b    8b 74 24 24     mov     esi, DWORD PTR _pbPattern$GSCopy$[esp+2104]
$LN85@Railgun_Qu@5:
  0113f    ff 05 00 00 00
    00         inc     DWORD PTR _Railgunhits
  01145    eb 04         jmp     SHORT $LN8@Railgun_Qu@5
$LN111@Railgun_Qu@5:
  01147    8b 74 24 24     mov     esi, DWORD PTR _pbPattern$GSCopy$[esp+2104]
$LN8@Railgun_Qu@5:

; 3219 :          }
; 3220 : 
; 3221 : //         i+=cbPattern;
; 3222 : //         i= i - bm_bc2nd[(unsigned char)pbTarget[i]];
; 3223 : 
; 3224 : //if ( bm_bc[(unsigned char)pbTarget[i+cbPattern-1]] < ((cbPattern) - bm_bc2nd[(unsigned char)pbTarget[i+(cbPattern)]]) )
; 3225 : //         i= i + ((cbPattern) - bm_bc2nd[(unsigned char)pbTarget[i+(cbPattern)]]);
; 3226 : if ( bm_bc[(unsigned char)pbTarget[i+cbPattern-1]] < bm_bc2nd[(unsigned char)pbTarget[i+(cbPattern)]] )

  0114b    8b 54 24 18     mov     edx, DWORD PTR tv451[esp+2104]
  0114f    0f b6 4c 02 ff     movzx     ecx, BYTE PTR [edx+eax-1]
  01154    0f b6 14 02     movzx     edx, BYTE PTR [edx+eax]
  01158    8b 4c 8c 30     mov     ecx, DWORD PTR _bm_bc$[esp+ecx*4+2104]
  0115c    8b 94 94 30 04
    00 00         mov     edx, DWORD PTR _bm_bc2nd$[esp+edx*4+2104]
  01163    3b ca         cmp     ecx, edx
  01165    73 04         jae     SHORT $LN7@Railgun_Qu@5

; 3227 :          i= i + bm_bc2nd[(unsigned char)pbTarget[i+(cbPattern)]];

  01167    03 c2         add     eax, edx

; 3228 : else

  01169    eb 02         jmp     SHORT $LN6@Railgun_Qu@5
$LN7@Railgun_Qu@5:

; 3229 :          i= i + bm_bc[(unsigned char)pbTarget[i+cbPattern-1]];

  0116b    03 c1         add     eax, ecx
$LN6@Railgun_Qu@5:

; 3207 : 
; 3208 :     AdvanceHopperGrass = 0;
; 3209 :     i=0;
; 3210 :     while (i <= cbTarget-cbPattern-1) {

  0116d    8b 4c 24 14     mov     ecx, DWORD PTR tv554[esp+2104]

; 3230 : 
; 3231 : 
; 3232 :          AdvanceHopperGrass++;

  01171    ff 44 24 10     inc     DWORD PTR _AdvanceHopperGrass$[esp+2104]
  01175    49         dec     ecx
  01176    3b c1         cmp     eax, ecx
  01178    76 87         jbe     SHORT $LN13@Railgun_Qu@5

; 3233 :     }
The size is:
01178 - 01101 +1 +1 -4 = 75bytes, -4 because 'inc DWORD PTR _AdvanceHopperGrass$[esp+2104]' is only for statistical use.
]

Background

Function homepage:
http://www.sanmayce.com/Railgun/index.html

Note: In order to reproduce the tests shown on the picture (below) you need the test package (57.2 MB) downloadable at function's homepage, thanks Randor for pointing out.

Very informative resource on strstr search techniques at:

Tests done with 16 patterns with 197MB haystack (being a pure English book-like text).

Railgun_Benchmarked.png

The machine is a mainstream laptop on 2.1GHz with Intel Merom CPU.

Using the Code

The code is pretty simple and straightforward.

// Here the strstr-like function Railgun_Quadruplet_6ppp is given,  
// written by Georgi 'Kaze', 2011-Sep-07.
//
// ### Mix(2in1) of Karp-Rabin & Boyer-Moore-Horspool algorithm [
// Caution: For better speed the case 'if (cbPattern==1)' was removed, 
// so Pattern must be longer than 1 char.
char * Railgun_Quadruplet_6ppp (char * pbTarget,
     char * pbPattern,
     unsigned long cbTarget,
     unsigned long cbPattern)
{
    char * pbTargetMax = pbTarget + cbTarget;
    register unsigned long  ulHashPattern;
    unsigned long ulHashTarget;
    //unsigned long count; //r.6+
    signed long count;
    //unsigned long countSTATIC; //r.6+
    signed long countSTATIC;
//  unsigned long countRemainder;

/*
    const unsigned char SINGLET = *(char *)(pbPattern);
    const unsigned long Quadruplet2nd = SINGLET<<8;
    const unsigned long Quadruplet3rd = SINGLET<<16;
    const unsigned long Quadruplet4th = SINGLET<<24;
*/
    unsigned char SINGLET;
    unsigned long Quadruplet2nd;
    unsigned long Quadruplet3rd;
    unsigned long Quadruplet4th;

    unsigned long  AdvanceHopperGrass;

    long i; //BMH needed
    int a, j, bm_bc[ASIZE]; //BMH needed
    unsigned char ch; //BMH needed
//    unsigned char lastch, firstch; //BMH needed

    if (cbPattern > cbTarget)
        return(NULL);

// Doesn't work when cbPattern = 1
// The next IF-fragment works very well with cbPattern>1, 
// OBVIOUSLY IT MUST BE UNROLLED(but crippled with less functionality) 
// SINCE either cbPattern=2 or cbPattern=3!
if ( cbPattern<4) {     // This IF makes me unhappy: it slows down from 
            // 390KB/clock to 367KB/clock for 'fast' pattern. 
            // This fragment(for 2..3 pattern lengths) is needed 
            // because I need a function different than strchr, but 
            // sticking to strstr i.e. lengths above 1 are to be handled.
        pbTarget = pbTarget+cbPattern;
        ulHashPattern = ( (*(char *)(pbPattern))<<8 ) + *(pbPattern+(cbPattern-1));
//        countSTATIC = cbPattern-2;

if ( cbPattern==3) {
    for ( ;; )
    {
        if ( ulHashPattern == ( (*(char *)(pbTarget-3))<<8 ) + *(pbTarget-1) ) {
         if ( *(char *)(pbPattern+1) == *(char *)(pbTarget-2) ) return((pbTarget-3));
        }
        if ( (char)(ulHashPattern>>8) != *(pbTarget-2) ) pbTarget++;
        pbTarget++;
        if (pbTarget > pbTargetMax)
            return(NULL);
    }
} else {
}
    for ( ;; )
    {
        // The line below gives for 'cbPattern'>=1:
        // Karp_Rabin_Kaze_4_OCTETS_hits/Karp_Rabin_Kaze_4_OCTETS_clocks: 4/543
        // Karp_Rabin_Kaze_4_OCTETS performance: 372KB/clock
/*
        if ( (ulHashPattern == ( (*(char *)(pbTarget-cbPattern))<<8 ) + 
        *(pbTarget-1)) && !memcmp(pbPattern, pbTarget-cbPattern, 
        (unsigned int)cbPattern) )
            return((long)(pbTarget-cbPattern));
*/

        // The fragment below gives for 'cbPattern'>=2:
        // Karp_Rabin_Kaze_4_OCTETS_hits/Karp_Rabin_Kaze_4_OCTETS_clocks: 4/546
        // Karp_Rabin_Kaze_4_OCTETS performance: 370KB/clock

/*
//For 2 and 3 [
        if ( ulHashPattern == ( (*(char *)(pbTarget-cbPattern))<<8 ) + *(pbTarget-1) ) {
//         count = countSTATIC;
         count = cbPattern-2;
//         while ( count && *(char *)(pbPattern+1+(countSTATIC-count)) == 
//         *(char *)(pbTarget-cbPattern+1+(countSTATIC-count)) ) {
         while ( count && *(char *)(pbPattern+1) == *(char *)(pbTarget-2) ) 
            { // Crippling i.e. only 2 and 3 chars are allowed!
               count--;
         }
         if ( count == 0) return((pbTarget-cbPattern));
        }
        if ( (char)(ulHashPattern>>8) != *(pbTarget-cbPattern+1) ) pbTarget++;
//For 2 and 3 ]
*/


        if ( ulHashPattern == ( (*(char *)(pbTarget-2))<<8 ) + *(pbTarget-1) )
            return((pbTarget-2));
        if ( (char)(ulHashPattern>>8) != *(pbTarget-1) ) pbTarget++;


        // The fragment below gives for 'cbPattern'>=2:
    // Karp_Rabin_Kaze_4_OCTETS_hits/Karp_Rabin_Kaze_4_OCTETS_clocks: 4/554
    // Karp_Rabin_Kaze_4_OCTETS performance: 364KB/clock
/*
        if ( ulHashPattern == ( (*(char *)(pbTarget-cbPattern))<<8 ) + *(pbTarget-1) ) {
         count = countSTATIC>>2;
         countRemainder = countSTATIC % 4;

         while ( count && *(unsigned long *)(pbPattern+1+((count-1)<<2)) == 
        *(unsigned long *)(pbTarget-cbPattern+1+((count-1)<<2)) ) {
               count--;
         }
     //if (count == 0) {      // Disastrous degradation only from this line
                // (317KB/clock when 1+2x4+2+1 bytes pattern: 
                // 'skillessness'; 312KB/clock when 1+1x4+2+1 bytes 
                // pattern: 'underdog'), otherwise 368KB/clock.
         while ( countRemainder && *(char *)(pbPattern+1+
        (countSTATIC-countRemainder)) == *(char *)(pbTarget-cbPattern+1+
        (countSTATIC-countRemainder)) ) {
               countRemainder--;
         }
         //if ( countRemainder == 0) return((long)(pbTarget-cbPattern));
         if ( count+countRemainder == 0) return((long)(pbTarget-cbPattern));
         //}
        }
*/

        pbTarget++;
        if (pbTarget > pbTargetMax)
            return(NULL);
    }
} else { //if ( cbPattern<4)
if (cbTarget<961) // This value is arbitrary(don't know how exactly), 
        // it ensures(at least must) 
        // better performance than 'Boyer_Moore_Horspool'.
{
        pbTarget = pbTarget+cbPattern;
        ulHashPattern = *(unsigned long *)(pbPattern);
//        countSTATIC = cbPattern-1;

    //SINGLET = *(char *)(pbPattern);
    SINGLET = ulHashPattern & 0xFF;
    Quadruplet2nd = SINGLET<<8;
    Quadruplet3rd = SINGLET<<16;
    Quadruplet4th = SINGLET<<24;

    for ( ;; )
    {
    AdvanceHopperGrass = 0;
    ulHashTarget = *(unsigned long *)(pbTarget-cbPattern);

        if ( ulHashPattern == ulHashTarget ) { // Three unnecessary comparisons here, 
        // but 'AdvanceHopperGrass' must be calculated - it has a higher priority.
//         count = countSTATIC;
//         while ( count && *(char *)(pbPattern+1+(countSTATIC-count)) == 
//        *(char *)(pbTarget-cbPattern+1+(countSTATIC-count)) ) {
//           if ( countSTATIC==AdvanceHopperGrass+count && SINGLET != 
//         *(char *)(pbTarget-cbPattern+1+(countSTATIC-count)) ) 
//        AdvanceHopperGrass++;
//               count--;
//         }
         count = cbPattern-1;
         while ( count && *(char *)(pbPattern+(cbPattern-count)) == 
                    *(char *)(pbTarget-count) ) {
           if ( cbPattern-1==AdvanceHopperGrass+count && SINGLET != 
            *(char *)(pbTarget-count) ) AdvanceHopperGrass++;
               count--;
         }
         if ( count == 0) return((pbTarget-cbPattern));
        } else { // The goal here: to avoid memory accesses by stressing the registers.
    if ( Quadruplet2nd != (ulHashTarget & 0x0000FF00) ) {
         AdvanceHopperGrass++;
         if ( Quadruplet3rd != (ulHashTarget & 0x00FF0000) ) {
              AdvanceHopperGrass++;
              if ( Quadruplet4th != (ulHashTarget & 0xFF000000) ) AdvanceHopperGrass++;
         }
    }
    }

    AdvanceHopperGrass++;

    pbTarget = pbTarget + AdvanceHopperGrass;
        if (pbTarget > pbTargetMax)
            return(NULL);
    }
} else { //if (cbTarget<961)
        //countSTATIC = cbPattern-2; //r.6+
        //countSTATIC = cbPattern-2-3; //r.6++
        countSTATIC = cbPattern-2-2; //r.6+++ This line fixes the bug from r.6++!
        ulHashPattern = *(unsigned long *)(pbPattern);
    /* Preprocessing */
    for (a=0; a < ASIZE; a++) bm_bc[a]=cbPattern;
    for (j=0; j < cbPattern-1; j++) bm_bc[pbPattern[j]]=cbPattern-j-1;

    /* Searching */
    //lastch=pbPattern[cbPattern-1];
    //firstch=pbPattern[0];
    i=0;
    while (i <= cbTarget-cbPattern) {
       ch=pbTarget[i+cbPattern-1];
       //if (ch ==lastch)
          //if (memcmp(&pbTarget[i],pbPattern,cbPattern-1) == 0) OUTPUT(i);
          //if (ch == lastch && pbTarget[i] == firstch && 
     //memcmp(&pbTarget[i],pbPattern,cbPattern-1) == 0) return(i);  
          // Kaze: The idea(to prevent execution of slower 'memcmp') 
          // is borrowed from Karp-Rabin i.e. to perform a slower check only 
          // when the target "looks like".
          //if (ch == pbPattern[cbPattern-1] && pbTarget[i] == pbPattern[0]) // r.6+
          //if (ch == pbPattern[cbPattern-1] && *(long *)&pbTarget[i] 
          // == *(long *)&pbPattern[0]) // No problema here since we have 4[+] 
          // long pattern here. Overlapping (1 byte recompared) when length=4, grmbl.
          if (ch == pbPattern[cbPattern-1] && 
        *(long *)&pbTarget[i] == ulHashPattern) // No problema here since 
            // we have 4[+] long pattern here. Overlapping 
            // (1 byte recompared) when length=4, grmbl.
             {
         count = countSTATIC;
         //while ( count && *(char *)(pbPattern+1+(countSTATIC-count)) == 
    //*(char *)(&pbTarget[i]+1+(countSTATIC-count)) ) { // r.6+
         while ( count !=0 && *(char *)(pbPattern+1+3+(countSTATIC-count)) == 
        *(char *)(&pbTarget[i]+1+3+(countSTATIC-count)) ) 
        {     // if pattern length is 4 or 5 we have count=-1 and count=0 
            // respectively i.e. no need of comparing in-between chars.
               count--;
         }
         if ( count <= 0) return(pbTarget+i);
         }
       i+=bm_bc[ch];
    }
    return(NULL);
} //if (cbTarget<961)
} //if ( cbPattern<4)
}
// ### Mix(2in1) of Karp-Rabin & Boyer-Moore-Horspool algorithm ]

Stomp stomp I arrived, here comes the Railgun_Quadruplet revision 7:

// Caution: For better speed the case 'if (cbPattern==1)' was removed, 
// so Pattern must be longer than 1 char.
char * Railgun_Quadruplet_7 (char * pbTarget, char * pbPattern, 
        unsigned long cbTarget, unsigned long cbPattern)
{
    char * pbTargetMax = pbTarget + cbTarget;
    register unsigned long ulHashPattern;
    unsigned long ulHashTarget;
    signed long count;
    signed long countSTATIC;

    unsigned char SINGLET;
    unsigned long Quadruplet2nd;
    unsigned long Quadruplet3rd;
    unsigned long Quadruplet4th;
    unsigned long AdvanceHopperGrass;

    long i;
    int a, j, bm_bc[ASIZE];
    unsigned char ch;

    if (cbPattern > cbTarget) return(NULL);

    if (cbPattern<4) {
        pbTarget = pbTarget+cbPattern;
        ulHashPattern = ( (*(char *)(pbPattern))<<8 ) + 
                    *(pbPattern+(cbPattern-1));
        if (cbPattern==3) {
            for ( ;; ) {
                if ( ulHashPattern == 
                ( (*(char *)(pbTarget-3))<<8 ) + *(pbTarget-1) ) {
                    if ( *(char *)(pbPattern+1) == 
                    *(char *)(pbTarget-2) ) return((pbTarget-3));
                }
                if ( (char)(ulHashPattern>>8) != 
                        *(pbTarget-2) ) pbTarget++;
                pbTarget++;
                    if (pbTarget > pbTargetMax) return(NULL);
                }
        } else {
        }
            for ( ;; ) {
                if ( ulHashPattern == ( (*(char *)(pbTarget-2))<<8 ) 
                    + *(pbTarget-1) ) return((pbTarget-2));
                if ( (char)(ulHashPattern>>8) != *(pbTarget-1) ) 
                    pbTarget++;
                pbTarget++;
                if (pbTarget > pbTargetMax) return(NULL);
            }
    } else {
        if (cbTarget<961) {
            pbTarget = pbTarget+cbPattern;
            ulHashPattern = *(unsigned long *)(pbPattern);
            SINGLET = ulHashPattern & 0xFF;
            Quadruplet2nd = SINGLET<<8;
            Quadruplet3rd = SINGLET<<16;
            Quadruplet4th = SINGLET<<24;
            for ( ;; ) {
                AdvanceHopperGrass = 0;
                ulHashTarget = *(unsigned long *)(pbTarget-cbPattern);
                if ( ulHashPattern == ulHashTarget ) { 
                    count = cbPattern-1;
                    while ( count && *(char *)
                    (pbPattern+(cbPattern-count)) == 
                        *(char *)(pbTarget-count) ) {
                        if ( cbPattern-1==
                        AdvanceHopperGrass+count && 
                        SINGLET != *(char *)(pbTarget-count)
                             ) AdvanceHopperGrass++;
                        count--;
                    }
                    if ( count == 0) 
                        return((pbTarget-cbPattern));
                } else {
                    if ( Quadruplet2nd != (ulHashTarget 
                            & 0x0000FF00) ) {
                        AdvanceHopperGrass++;
                        if ( Quadruplet3rd != 
                        (ulHashTarget & 0x00FF0000) ) {
                            AdvanceHopperGrass++;
                            if ( Quadruplet4th != 
                            (ulHashTarget & 0xFF000000)
                             ) AdvanceHopperGrass++;
                        }
                    }
                }
                AdvanceHopperGrass++;
                pbTarget = pbTarget + AdvanceHopperGrass;
                if (pbTarget > pbTargetMax) return(NULL);
            }
        } else {
            countSTATIC = cbPattern-2-2; 
            ulHashPattern = *(unsigned long *)(pbPattern);
            for (a=0; a < ASIZE; a++) bm_bc[a]=cbPattern;
            for (j=0; j < cbPattern-1; j++) bm_bc[pbPattern[j]]=
                                cbPattern-j-1;
            i=0;
            while (i <= cbTarget-cbPattern) {
                if ( *(unsigned long *)&pbTarget[i] == 
                    ulHashPattern ) {
                    count = countSTATIC;
                    while ( count !=0 && *(char *)
                    (pbPattern+1+3+(countSTATIC-count)) == 
                    *(char *)(&pbTarget[i]+1+3+
                    (countSTATIC-count)) ) count--;
                    if ( count == 0) return(pbTarget+i);
                }
                i= i + bm_bc[(unsigned char)pbTarget[i+cbPattern-1]];
            }
            return(NULL);
        }
    }
}

The tweak (in BMH fragment) which speeds-up significantly the obsolete (already) r.6+++ is this:

        //countSTATIC = cbPattern-2; //r.6+
        //countSTATIC = cbPattern-2-3;
        //countSTATIC = cbPattern-2-2; // r.6+++ I suppose that the awful degradation 
            // comes from 2bytes more (from either 'if (countSTATIC<0) 
            // countSTATIC=0;' or 'count >0' fixes) which make the 
            // function unfittable in code cache lines?!
        //countSTATIC = cbPattern-2-3;     // r.7- At last no recompared bytes 
                    // in-between chars
        countSTATIC = cbPattern-2-2; // r.7 
        ulHashPattern = *(unsigned long *)(pbPattern);
        //chPTR=(unsigned char *)&chchchch+3;
// Next line fixes the BUG from r.6++: but with awful speed degradation! 
// So the bug is fixed in the definitions by setting 'countSTATIC = cbPattern-2-2;', 
// bug appears only for patterns with lengths of 4, The setback is one unnecessary 
// comparison for 5bytes patterns, stupidly such setback exists (from before) for 
// 4bytes as well.
//if (countSTATIC<0) countSTATIC=0;
    /* Preprocessing */
    for (a=0; a < ASIZE; a++) bm_bc[a]=cbPattern;
    for (j=0; j < cbPattern-1; j++) bm_bc[pbPattern[j]]=cbPattern-j-1;

    /* Searching */
    //lastch=pbPattern[cbPattern-1];
    //firstch=pbPattern[0];
    i=0;
    while (i <= cbTarget-cbPattern) {
       //ch=pbTarget[i+cbPattern-1];
       //ch=pbTarget[i];
          //if ( pbTarget[i] == pbPattern[0] && *(unsigned long *)
    //&pbTarget[i+cbPattern-1-3] == ulHashPattern) // No problema here since 
            // we have 4[+] long pattern here. Overlapping 
            // (1 byte recompared) when length=4, grmbl.
          if ( *(unsigned long *)&pbTarget[i] == ulHashPattern ) // The lesson I learned 
            // from r.7- now applied in r.7: instead of extracting 'ch' 
            // having higher address now the lower address is extracted 
            // first in order (hopefully, the test confirms it) the next 
            // 32bytes (including 'ch') to be cached i.e. to comparison 
            // part is faster.
             {
         count = countSTATIC;
         while ( count !=0 && *(char *)(pbPattern+(countSTATIC-count)+4) == 
        *(char *)(&pbTarget[i]+(countSTATIC-count)+4) ) { // if pattern length 
        // is 4 or 5 we have count=-1 and count=0 respectively i.e. no need of 
        // comparing in-between chars.
               count--;
         }
         if ( count == 0) return(pbTarget+i);
         }
       i= i + bm_bc[(unsigned char)pbTarget[i+cbPattern-1]];
    }
    return(NULL);

In order to see what path I had to traverse to make this tweak, you may see the discussion board (r.7m thread). The tests below show NOT the full boost simply because they average times for all lengths thus losing/diminishing the high speeds times (being 4-3-2 times lesser compared with pattern lengths 2-3-4-5-6).

D:\_KAZE_new-stuff>strstr_SHORT-SHOWDOWN_Microsoft_v16_Ox.exe
strstr_SHORT-SHOWDOWN, revision 7, written by Kaze.
Input Pattern(up to 19+2000 chars): frustration

Doing Search for Pattern(11bytes) into String(206908949bytes) as-one-line ...
WARNING! The next line works with BMH only for pattern 4[+] long, 
otherwise (for 2 and 3) other searcher takes over!
Railgun_6pp_hits/Railgun_6pp_clocks: 1355/101
Railgun_6pp performance: 2000KB/clock

Doing Search for Pattern(11bytes) into String(206908949bytes) as-one-line ...
WARNING! The next line works with BMH only for pattern 4[+] long, 
otherwise (for 2 and 3) other searcher takes over!
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 1355/91
Railgun_Quadruplet_7 performance: 2220KB/clock

Note: Executing the next two tests 256 times i.e. the search is for 256x8x2 patterns!

Doing Search for 8x2 Patterns into String(206908949bytes) as-one-line ... 
Not-Counting-Hits-Just-Returns-First-One
Found ('an') at 157 position, Railgun_Quadruplet_6pp performance: 0KB/clock
Found ('to') at 853 position, Railgun_Quadruplet_6pp performance: 0KB/clock
Found ('TDK') at -4325408 position, Railgun_Quadruplet_6pp performance: 922KB/clock
Found ('the') at 511 position, Railgun_Quadruplet_6pp performance: 0KB/clock
Found ('fast') at 42381 position, Railgun_Quadruplet_6pp performance: 41KB/clock
Found ('easy') at 30163 position, Railgun_Quadruplet_6pp performance: 29KB/clock
Found ('grmbl') at -4325408 position, Railgun_Quadruplet_6pp performance: 1167KB/clock
Found ('email') at 69587982 position, Railgun_Quadruplet_6pp performance: 1078KB/clock
Found ('pasting') at 134600126 position, Railgun_Quadruplet_6pp performance: 1383KB/clock
Found ('amazing') at 1629806 position, Railgun_Quadruplet_6pp performance: 93KB/clock
Found ('underdog') at 137424498 position, 
            Railgun_Quadruplet_6pp performance: 1698KB/clock
Found ('superdog') at -4325408 position, Railgun_Quadruplet_6pp performance: 1603KB/clock
Found ('participants') at 16060 position, Railgun_Quadruplet_6pp performance: 15KB/clock
Found ('skillessness') at -4325408 position, 
    Railgun_Quadruplet_6pp performance: 1836KB/clock
Found ('I should have known') at 41831880 position, 
    Railgun_Quadruplet_6pp performance: 2403KB/clock
Found ('human consciousness') at 3863 position, 
    Railgun_Quadruplet_6pp performance: 3KB/clock
Railgun_Quadruplet_6pp 8x2 i.e. average performance: 1342KB/clock
ReallyTraversed: 310,477,846,528 bytes

Doing Search for 8x2 Patterns into String(206908949bytes) as-one-line ... 
Not-Counting-Hits-Just-Returns-First-One
Found ('an') at 157 position, Railgun_Quadruplet_7 performance: 0KB/clock
Found ('to') at 853 position, Railgun_Quadruplet_7 performance: 0KB/clock
Found ('TDK') at -4325408 position, Railgun_Quadruplet_7 performance: 990KB/clock
Found ('the') at 511 position, Railgun_Quadruplet_7 performance: 0KB/clock
Found ('fast') at 42381 position, Railgun_Quadruplet_7 performance: 41KB/clock
Found ('easy') at 30163 position, Railgun_Quadruplet_7 performance: 29KB/clock
Found ('grmbl') at -4325408 position, Railgun_Quadruplet_7 performance: 1069KB/clock
Found ('email') at 69587982 position, Railgun_Quadruplet_7 performance: 1078KB/clock
Found ('pasting') at 134600126 position, Railgun_Quadruplet_7 performance: 1383KB/clock
Found ('amazing') at 1629806 position, Railgun_Quadruplet_7 performance: 1591KB/clock
Found ('underdog') at 137424498 position, Railgun_Quadruplet_7 performance: 1698KB/clock
Found ('superdog') at -4325408 position, Railgun_Quadruplet_7 performance: 1422KB/clock
Found ('participants') at 16060 position, Railgun_Quadruplet_7 performance: 15KB/clock
Found ('skillessness') at -4325408 position, 
    Railgun_Quadruplet_7 performance: 2149KB/clock
Found ('I should have known') at 41831880 position, 
    Railgun_Quadruplet_7 performance: 2403KB/clock
Found ('human consciousness') at 3863 position, 
    Railgun_Quadruplet_7 performance: 3KB/clock
Railgun_Quadruplet_7 8x2 i.e. average performance: 1354KB/clock
ReallyTraversed: 310,477,846,528 bytes

Doing Search for 8x2 Patterns into String(206908949bytes) as-one-line ...
Found ('an') 1987797 time(s), BM_HORSPOOL performance: 293KB/clock
Found ('to') 1076629 time(s), BM_HORSPOOL performance: 293KB/clock
Found ('TDK') 0 time(s), BM_HORSPOOL performance: 516KB/clock
Found ('the') 2114180 time(s), BM_HORSPOOL performance: 379KB/clock
Found ('fast') 5945 time(s), BM_HORSPOOL performance: 585KB/clock
Found ('easy') 5191 time(s), BM_HORSPOOL performance: 585KB/clock
Found ('grmbl') 0 time(s), BM_HORSPOOL performance: 759KB/clock
Found ('email') 1 time(s), BM_HORSPOOL performance: 756KB/clock
Found ('pasting') 2 time(s), BM_HORSPOOL performance: 918KB/clock
Found ('amazing') 323 time(s), BM_HORSPOOL performance: 1074KB/clock
Found ('underdog') 4 time(s), BM_HORSPOOL performance: 1069KB/clock
Found ('superdog') 0 time(s), BM_HORSPOOL performance: 1074KB/clock
Found ('participants') 147 time(s), BM_HORSPOOL performance: 1603KB/clock
Found ('skillessness') 0 time(s), BM_HORSPOOL performance: 1422KB/clock
Found ('I should have known') 1 time(s), BM_HORSPOOL performance: 1433KB/clock
Found ('human consciousness') 519 time(s), BM_HORSPOOL performance: 1820KB/clock
BM_HORSPOOL 8x2 i.e. average performance: 666KB/clock

Doing Search for 8x2 Patterns into String(206908949bytes) as-one-line ...
Found ('an') 1987797 time(s), Railgun_6pp performance: 678KB/clock
Found ('to') 1076629 time(s), Railgun_6pp performance: 643KB/clock
Found ('TDK') 0 time(s), Railgun_6pp performance: 1174KB/clock
Found ('the') 2114180 time(s), Railgun_6pp performance: 678KB/clock
Found ('fast') 5945 time(s), Railgun_6pp performance: 918KB/clock
Found ('easy') 5191 time(s), Railgun_6pp performance: 990KB/clock
Found ('grmbl') 0 time(s), Railgun_6pp performance: 1287KB/clock
Found ('email') 1 time(s), Railgun_6pp performance: 1167KB/clock
Found ('pasting') 2 time(s), Railgun_6pp performance: 1603KB/clock
Found ('amazing') 323 time(s), Railgun_6pp performance: 1603KB/clock
Found ('underdog') 4 time(s), Railgun_6pp performance: 1603KB/clock
Found ('superdog') 0 time(s), Railgun_6pp performance: 1820KB/clock
Found ('participants') 147 time(s), Railgun_6pp performance: 2149KB/clock
Found ('skillessness') 0 time(s), Railgun_6pp performance: 2126KB/clock
Found ('I should have known') 1 time(s), Railgun_6pp performance: 2557KB/clock
Found ('human consciousness') 519 time(s), Railgun_6pp performance: 2126KB/clock
Railgun_6pp 8x2 i.e. average performance: 1218KB/clock

Doing Search for 8x2 Patterns into String(206908949bytes) as-one-line ...
Found ('an') 1987797 time(s), Railgun_Quadruplet_7 performance: 643KB/clock
Found ('to') 1076629 time(s), Railgun_Quadruplet_7 performance: 678KB/clock
Found ('TDK') 0 time(s), Railgun_Quadruplet_7 performance: 1074KB/clock
Found ('the') 2114180 time(s), Railgun_Quadruplet_7 performance: 643KB/clock
Found ('fast') 5945 time(s), Railgun_Quadruplet_7 performance: 1074KB/clock
Found ('easy') 5191 time(s), Railgun_Quadruplet_7 performance: 1074KB/clock
Found ('grmbl') 0 time(s), Railgun_Quadruplet_7 performance: 1287KB/clock
Found ('email') 1 time(s), Railgun_Quadruplet_7 performance: 1167KB/clock
Found ('pasting') 2 time(s), Railgun_Quadruplet_7 performance: 1603KB/clock
Found ('amazing') 323 time(s), Railgun_Quadruplet_7 performance: 1603KB/clock
Found ('underdog') 4 time(s), Railgun_Quadruplet_7 performance: 1820KB/clock
Found ('superdog') 0 time(s), Railgun_Quadruplet_7 performance: 1603KB/clock
Found ('participants') 147 time(s), Railgun_Quadruplet_7 performance: 2557KB/clock
Found ('skillessness') 0 time(s), Railgun_Quadruplet_7 performance: 2126KB/clock
Found ('I should have known') 1 time(s), Railgun_Quadruplet_7 performance: 2557KB/clock
Found ('human consciousness') 519 time(s), 
        Railgun_Quadruplet_7 performance: 2557KB/clock
Railgun_Quadruplet_7 8x2 i.e. average performance: 1232KB/clock

^C
D:\_KAZE_new-stuff>

In my opinion, it is one of the seven commandments of speed: do not put the cart before the horse, i.e., for adjacent (32?) memory accesses first extract the byte with lowest address. strstr_SHORT-SHOWDOWN_Microsoft_v16_Ox.exe, revision 7, some quick inputs:

D:\_KAZE_new-stuff>strstr_SHORT-SHOWDOWN_Microsoft_v16_Ox.exe
strstr_SHORT-SHOWDOWN, revision 7, written by Kaze.
Input Pattern(up to 19+2000 chars): fast

Doing Search for Pattern(4bytes) into String(206908949bytes) as-one-line ...
WARNING! The next line works with BMH only for pattern 4[+] long, 
otherwise (for 2 and 3) other searcher takes over!
Railgun_6pp_hits/Railgun_6pp_clocks: 5945/215
Railgun_6pp performance: 939KB/clock

Doing Search for Pattern(4bytes) into String(206908949bytes) as-one-line ...
WARNING! The next line works with BMH only for pattern 4[+] long, 
otherwise (for 2 and 3) other searcher takes over!
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 5945/193
Railgun_Quadruplet_7 performance: 1046KB/clock

^C

...

Input Pattern(up to 19+2000 chars): from
Railgun_6pp_hits/Railgun_6pp_clocks: 83155/196
Railgun_6pp performance: 1030KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 83155/194
Railgun_Quadruplet_7 performance: 1041KB/clock

Input Pattern(up to 19+2000 chars): rabea
Railgun_6pp_hits/Railgun_6pp_clocks: 0/176
Railgun_6pp performance: 1148KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/162
Railgun_Quadruplet_7 performance: 1247KB/clock

Input Pattern(up to 19+2000 chars): makes
Railgun_6pp_hits/Railgun_6pp_clocks: 9281/172
Railgun_6pp performance: 1174KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 9281/162
Railgun_Quadruplet_7 performance: 1247KB/clock

Input Pattern(up to 19+2000 chars): monkey
Railgun_6pp_hits/Railgun_6pp_clocks: 1691/141
Railgun_6pp performance: 1433KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 1691/139
Railgun_Quadruplet_7 performance: 1453KB/clock

Input Pattern(up to 19+2000 chars): punny
Railgun_6pp_hits/Railgun_6pp_clocks: 0/156
Railgun_6pp performance: 1295KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/155
Railgun_Quadruplet_7 performance: 1303KB/clock

Input Pattern(up to 19+2000 chars): funny
Railgun_6pp_hits/Railgun_6pp_clocks: 179/157
Railgun_6pp performance: 1287KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 179/156
Railgun_Quadruplet_7 performance: 1295KB/clock

Input Pattern(up to 19+2000 chars): ramjet
Railgun_6pp_hits/Railgun_6pp_clocks: 0/152
Railgun_6pp performance: 1329KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/137
Railgun_Quadruplet_7 performance: 1474KB/clock

Input Pattern(up to 19+2000 chars): fallen
Railgun_6pp_hits/Railgun_6pp_clocks: 2578/151
Railgun_6pp performance: 1338KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 2578/138
Railgun_Quadruplet_7 performance: 1464KB/clock

Input Pattern(up to 19+2000 chars): elephant
Railgun_6pp_hits/Railgun_6pp_clocks: 1198/124
Railgun_6pp performance: 1629KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 1198/112
Railgun_Quadruplet_7 performance: 1804KB/clock

Input Pattern(up to 19+2000 chars): layalina
Railgun_6pp_hits/Railgun_6pp_clocks: 0/121
Railgun_6pp performance: 1669KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/110
Railgun_Quadruplet_7 performance: 1836KB/clock

Input Pattern(up to 19+2000 chars): I should have known
Railgun_6pp_hits/Railgun_6pp_clocks: 1/89
Railgun_6pp performance: 2270KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 1/84
Railgun_Quadruplet_7 performance: 2405KB/clock

Input Pattern(up to 19+2000 chars): human consciousness
Railgun_6pp_hits/Railgun_6pp_clocks: 519/80
Railgun_6pp performance: 2525KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 519/75
Railgun_Quadruplet_7 performance: 2694KB/clock

Input Pattern(up to 19+2000 chars): But he's looking right through me
Railgun_6pp_hits/Railgun_6pp_clocks: 0/84
Railgun_6pp performance: 2405KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/76
Railgun_Quadruplet_7 performance: 2658KB/clock

Input Pattern(up to 19+2000 chars): I'm living in an age that calls darkness light
Railgun_6pp_hits/Railgun_6pp_clocks: 0/72
Railgun_6pp performance: 2806KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/67
Railgun_Quadruplet_7 performance: 3015KB/clock

Input Pattern(up to 19+2000 chars): Following the wanderings of a dream - 
a dream that keeps my soul alive
Railgun_6pp_hits/Railgun_6pp_clocks: 0/69
Railgun_6pp performance: 2928KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/63
Railgun_Quadruplet_7 performance: 3207KB/clock

Input Pattern(up to 19+2000 chars): I notice what matters and 
I got nothing to lose but darkness and shadows
Railgun_6pp_hits/Railgun_6pp_clocks: 0/66
Railgun_6pp performance: 3061KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/63
Railgun_Quadruplet_7 performance: 3207KB/clock

Input Pattern(up to 19+2000 chars): Beth Ditto
Railgun_6pp_hits/Railgun_6pp_clocks: 0/118
Railgun_6pp performance: 1712KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/107
Railgun_Quadruplet_7 performance: 1888KB/clock

Input Pattern(up to 19+2000 chars): SR71
Railgun_6pp_hits/Railgun_6pp_clocks: 0/175
Railgun_6pp performance: 1154KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/182
Railgun_Quadruplet_7 performance: 1110KB/clock

Input Pattern(up to 19+2000 chars): Apache
Railgun_6pp_hits/Railgun_6pp_clocks: 1/162
Railgun_6pp performance: 1247KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 1/133
Railgun_Quadruplet_7 performance: 1519KB/clock

Input Pattern(up to 19+2000 chars): Fibonacci
Railgun_6pp_hits/Railgun_6pp_clocks: 0/106
Railgun_6pp performance: 1906KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/98
Railgun_Quadruplet_7 performance: 2061KB/clock

Input Pattern(up to 19+2000 chars): Tesla
Railgun_6pp_hits/Railgun_6pp_clocks: 0/175
Railgun_6pp performance: 1154KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/160
Railgun_Quadruplet_7 performance: 1262KB/clock

Input Pattern(up to 19+2000 chars): Albert
Railgun_6pp_hits/Railgun_6pp_clocks: 932/154
Railgun_6pp performance: 1312KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 932/137
Railgun_Quadruplet_7 performance: 1474KB/clock

Input Pattern(up to 19+2000 chars): Toshiba
Railgun_6pp_hits/Railgun_6pp_clocks: 0/130
Railgun_6pp performance: 1554KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/121
Railgun_Quadruplet_7 performance: 1669KB/clock

Input Pattern(up to 19+2000 chars): Orion
Railgun_6pp_hits/Railgun_6pp_clocks: 1/175
Railgun_6pp performance: 1154KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 1/160
Railgun_Quadruplet_7 performance: 1262KB/clock

Input Pattern(up to 19+2000 chars): pharaoh
Railgun_6pp_hits/Railgun_6pp_clocks: 26/129
Railgun_6pp performance: 1566KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 26/122
Railgun_Quadruplet_7 performance: 1656KB/clock

Input Pattern(up to 19+2000 chars): lemon
Railgun_6pp_hits/Railgun_6pp_clocks: 33/178
Railgun_6pp performance: 1135KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 33/161
Railgun_Quadruplet_7 performance: 1255KB/clock

Input Pattern(up to 19+2000 chars): grammar
Railgun_6pp_hits/Railgun_6pp_clocks: 218/123
Railgun_6pp performance: 1642KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 218/117
Railgun_Quadruplet_7 performance: 1727KB/clock

D:\_KAZE_new-stuff>

Grmbl, pattern 'SR71' compared to 'fast' gives me a hint how things are far from 'fastest'.

Points of Interest

The interesting part about tests I have done: 'memcmp' and 'memcpy' when inlined (properly) give much better results. The thing which pop-ups again-and-again: Never take anything for granted. A bug (which doesn't affect the previous results, correct) was crushed, now r.6+++ is ready to go/hit.

As you can see from the C (commentless and tabulated) and ASM sources (see the top of the page for downloading):
The Microsoft 16.00.30319.01 compiled Railgun_Quadruplet_6ppp is 00c6b-00a00+5 = 624 bytes long.
The Quadruplet (where cbTarget<961) fragment loop is 00bc3-00b17+5 = 177 bytes long.
The Boyer-Moore-Horspool (where cbTarget>=961) fragment loop is 00c61-00c20+2 = 67 bytes long.
An interesting question for more experienced coders:
In case of reducing the BMH loop down by 3 bytes (i.e. to 64 bytes), can we obtain some additional/significant boost?

History

  • Revision 6+++ written 2011-Sep-07
  • Revision 7 written 2011-Sep-25
  • Revision 7 'Sun' written 2012-Jan-22
  • Revision 7 'Elsiane' written 2012-Jan-24
  • Revision 7 'Gulliver' written 2012-Jan-26
  • Revision 7 'Hasherezade' written 2012-Feb-01

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Sanmayce
Other
Bulgaria Bulgaria
Member
A Bulgarian old boy interested in search techniques, nothing special.

Sign Up to vote   Poor Excellent
Add a reason or comment to your vote: x
Votes of 3 or less require a comment

Comments and Discussions

 
You must Sign In to use this message board.
Search this forum  
    Spacing  Noise  Layout  Per page   
QuestionSuperfast 16-threaded exact searcher: 16 Railguns in action and more [modified] PinmemberSanmayce9 Feb '13 - 6:16 
Embrace yourself for the hit of 16 railguns in a second.
 
Recently I wrote a superfast fuzzy (GALADRIEL) console text searcher, and to put exact matching under one roof along with wildcards and Levenshtein Distance (Wagner-Fischer) was a spontaneous idea which in turn resulted in the fastest 'on the fly' text searcher known to me.
 
Since the exact matching was not supposed to enter among fuzzy guns it was invoked just as other two hitters (wildcards&fuzzy) for EACH LINE - which is very slow and for high speeds is highly not recommended, I did it that way in order to see how it would behave, just of curiosity.
But my laptop's 2 threads can do it better, so revision 1-++ showcases the full might of my top-gun text hitter: Railgun_Quadruplet_7Gulliver - the fastest text search function known to me, and on top of that multi-threaded.
I wrote it as mix of Boyer-Moore-Horspool-Sunday order 2 and some other nifty micro etudes.
This resulted in appearance of first in the INTERNET text searcher utilizing the I/O read bandwidth at its fullest!
 
To be more precise, the upper theoretical speed limit is 16 threads * 3 GB/s = 48 GB/s, those 3GB/s are nominal for 'Railgun_Quadruplet_7Gulliver' on my laptop. Of course, in reality, my estimation is 8-20GB/s.
Because of this monstrous bandwidth Kazahana proves to be typhoon class tool.
 
For those who don't believe I suggest to give it a try on their superfast I/O systems, nowadays mainstream drives offer the "miserable" 512MB/s, it would be very interesting those with 1++GB/s to share their results.
 
For example my humble laptop is equipped with Samsung 470 64GB which gives average linear read 241MB/s at 1MB blocks, obtained with 'Everest'.
 
From Wikipedia torture below you can see how full is my fullest: (241-232)/232*100% = 3.8% deviation:
 
To run the test below you need to download enwiki-20121201-pages-articles.xml (42,153,646,707 bytes) file from http://dumps.wikimedia.org/enwiki/.
 
Yes, that's right, you can extract lines of latest Wikipedia containing strings you need.
 
E:\_Kaze_Kazahana>timer "Kazahana_r1-++_HEXADECAD-Threads_IntelV12.exe" "metal fatigue" ..\enwiki-20121201-pages-articles.xml
Timer 9.01 : Igor Pavlov : Public domain : 2009-05-31
Kazahana, a superfast exact & wildcards & Levenshtein Distance (Wagner-Fischer) searcher, revision 1-++, copyleft Kaze 2013-Feb-09.
omp_get_num_procs( ) = 2
omp_get_max_threads( ) = 2
Enforcing HEXADECAD i.e. hexadecuple-threads ...
Allocating Master-Buffer 7MB ... OK
-; 00,000,244,543 bytes/clock
Kazahana: Dumped xgrams: 329
Kazahana: Performance: 238 KB/clock
Kazahana: Done.
 
Kernel Time  =    58.391 =   33%
User Time    =   148.902 =   86%
Process Time =   207.294 =  119%
Global Time  =   172.780 =  100%
 
E:\_Kaze_Kazahana>timer grep\grep -c "metal fatigue" ..\enwiki-20121201-pages-articles.xml
Timer 9.01 : Igor Pavlov : Public domain : 2009-05-31
329
 
Kernel Time  =    24.148 =   13%
User Time    =    78.718 =   44%
Process Time =   102.867 =   58%
Global Time  =   175.565 =  100%
 
E:\_Kaze_Kazahana>timer "Kazahana_r1-++_HEXADECAD-Threads_IntelV12.exe" "metal fatigue" ..\enwiki-20121201-pages-articles.xml
Timer 9.01 : Igor Pavlov : Public domain : 2009-05-31
Kazahana, a superfast exact & wildcards & Levenshtein Distance (Wagner-Fischer) searcher, revision 1-++, copyleft Kaze 2013-Feb-09.
omp_get_num_procs( ) = 2
omp_get_max_threads( ) = 2
Enforcing HEXADECAD i.e. hexadecuple-threads ...
Allocating Master-Buffer 7MB ... OK
-; 00,000,244,102 bytes/clock
Kazahana: Dumped xgrams: 329
Kazahana: Performance: 238 KB/clock
Kazahana: Done.
 
Kernel Time  =    59.108 =   34%
User Time    =   143.224 =   82%
Process Time =   202.333 =  116%
Global Time  =   173.508 =  100%
 
E:\_Kaze_Kazahana>"Kazahana_r1-++_HEXADECAD-Threads_IntelV12.exe"
Kazahana, a superfast exact & wildcards & Levenshtein Distance (Wagner-Fischer) searcher, revision 1-++, copyleft Kaze 2013-Feb-09.
Usage: Kazahana [AtMostLevenshteinDistance] string textualfile
Note1: There are three regimes: exact, wildcards and fuzzy searches. First two kick in when 2 parameters are given, fuzzy when 3.
Note2: What decides whether exact or wildcards? Of course presence of at least one wildcard. To see exact search see Example #4.
Note3: Exact search hits with 'Railgun_Quadruplet_7Gulliver'.
Note4: Incoming string is automatically lowercased for wildcards searches i.e. they are case insensitive.
Note5: Incoming string could be up to 21168/126 chars for exact&wildcards/Levenshtein respectively.
Note6: Incoming textualfile could be bigger than 4GB.
Note7: Each line should end with [CR]LF, that is Windows or/and UNIX style.
Note8: The dump goes to Kazahana.txt file.
Note9: Seven wildcards are available:
       wildcard '*' any character(s) or empty,
       wildcard '@'/'#' any character {or empty}/{and not empty},
       wildcard '^'/'$' any ALPHA character {or empty}/{and not empty},
       wildcard '|'/'~' any NON-ALPHA character {or empty}/{and not empty}.
Example1: E:\>Kazahana 0 ramjet MASAKARI_General-Purpose_Grade_English_Wordlist_r3_316423_words.wrd
Example2: E:\>Kazahana 3 psychedlicize MASAKARI_General-Purpose_Grade_English_Wordlist_r3_316423_words.wrd
Example3: E:\>Kazahana "psyched^^^^^^ize^" MASAKARI_General-Purpose_Grade_English_Wordlist_r3_316423_words.wrd
Example4: E:\>Kazahana "metal fatigue" enwiki-20121201-pages-articles.xml
Example5: E:\>Kazahana "out^^^^^^^^^^^^^ize*" MASAKARI_General-Purpose_Grade_English_Wordlist_r3_316423_words.wrd
          E:\>type Kazahana.txt
          [out^^^^^^^^^^^^^ize*] outhyperbolize /MASAKARI_General-Purpose_Grade_English_Wordlist_r3_316423_words.wrd/
          [out^^^^^^^^^^^^^ize*] outsize /MASAKARI_General-Purpose_Grade_English_Wordlist_r3_316423_words.wrd/
          [out^^^^^^^^^^^^^ize*] outsized /MASAKARI_General-Purpose_Grade_English_Wordlist_r3_316423_words.wrd/
          [out^^^^^^^^^^^^^ize*] outstrategize /MASAKARI_General-Purpose_Grade_English_Wordlist_r3_316423_words.wrd/
          [out^^^^^^^^^^^^^ize*] outtyrannize /MASAKARI_General-Purpose_Grade_English_Wordlist_r3_316423_words.wrd/
 
E:\_Kaze_Kazahana>
 
And to behold the real hit of only 2 railguns:
E:\_Kaze_Kazahana>timer Kazahana_r1-+_MONAD-Thread_IntelV12 ramjet 4andabove_Gamera.tar.2.sorted
Timer 9.01 : Igor Pavlov : Public domain : 2009-05-31
Kazahana, a superfast exact & wildcards & Levenshtein Distance (Wagner-Fischer) searcher, revision 1-++, copyleft Kaze 2013-Feb-09.
Enforcing MONAD i.e. single-thread ...
Allocating Master-Buffer 7MB ... OK
|; 00,000,639,411 bytes/clock
Kazahana: Dumped xgrams: 49
Kazahana: Performance: 625 KB/clock
Kazahana: Done.
 
Kernel Time  =     0.795 =   33%
User Time    =     1.513 =   63%
Process Time =     2.308 =   96%
Global Time  =     2.381 =  100%
 
E:\_Kaze_Kazahana>timer Kazahana_r1-+_HEXADECAD-Threads_IntelV12 ramjet 4andabove_Gamera.tar.2.sorted
Timer 9.01 : Igor Pavlov : Public domain : 2009-05-31
Kazahana, a superfast exact & wildcards & Levenshtein Distance (Wagner-Fischer) searcher, revision 1-++, copyleft Kaze 2013-Feb-09.
omp_get_num_procs( ) = 2
omp_get_max_threads( ) = 2
Enforcing HEXADECAD i.e. hexadecuple-threads ...
Allocating Master-Buffer 7MB ... OK
|; 00,000,729,181 bytes/clock
Kazahana: Dumped xgrams: 49
Kazahana: Performance: 703 KB/clock
Kazahana: Done.
 
Kernel Time  =     0.904 =   51%
User Time    =     1.778 =  100%
Process Time =     2.683 =  151%
Global Time  =     1.771 =  100%
 
E:\_Kaze_Kazahana>timer grep\grep ramjet 4andabove_Gamera.tar.2.sorted
Timer 9.01 : Igor Pavlov : Public domain : 2009-05-31
0,000,083       bussard_ramjet
0,000,051       the_ramjet
0,000,048       the_ramjets
0,000,046       a_ramjet
0,000,031       a_scramjet
0,000,027       the_scramjet
0,000,026       bussard_ramjets
0,000,018       interstellar_ramjet
0,000,014       ramjet_engine
0,000,012       scramjet_powered
0,000,012       ramjet_is
0,000,011       scramjet_engines
0,000,011       scramjet_engine
0,000,011       ramjet_engines
0,000,010       ramjets_were
0,000,010       combustion_ramjet
0,000,009       ramjet_and
0,000,008       ramjet_controls
0,000,007       combustion_ramjets
0,000,006       water_ramjet
0,000,006       scramjet_technology
0,000,006       ramjets_on
0,000,006       ramjet_will
0,000,006       ramjet_speeds
0,000,006       ramjet_ship
0,000,006       ramjet_rocket
0,000,006       ramjet_in
0,000,006       mode_scramjet
0,000,005       scramjets_can
0,000,005       ramjet_to
0,000,005       ramjet_scramjet
0,000,005       ramjet_operation
0,000,005       of_scramjets
0,000,005       of_scramjet
0,000,005       of_ramjets
0,000,005       by_ramjets
0,000,005       and_ramjets
0,000,005       and_ramjet
0,000,004       scramjet_to
0,000,004       scramjet_s
0,000,004       scramjet_is
0,000,004       scramjet_intake
0,000,004       ramjet_was
0,000,004       ramjet_a
0,000,004       raking_ramjets
0,000,004       or_scramjet
0,000,004       expander_ramjets
0,000,004       ejector_ramjet
0,000,004       a_turboramjet
 
Kernel Time  =     0.546 =   10%
User Time    =     4.258 =   82%
Process Time =     4.804 =   93%
Global Time  =     5.138 =  100%
 
E:\_Kaze_Kazahana>
 
Glad I will be to receive some feedback on how Kazahana performs on 1++GB/s I/O systems.
 
Forgot to give the link: http://www.sanmayce.com/Downloads/Kazahana_r1-++.zip[^]
Get down get down get down get it on show love and give it up
What are you waiting on?


modified 11 Feb '13 - 13:40.

QuestionFastest 32bit hash function in C PinmemberSanmayce29 Sep '12 - 7:25 
Just want to share my latest hash package, I was lucky to write the fastest hash function:
FNV1A_Yorikke
http://www.sanmayce.com/Fastest_Hash/index.html#FNV1A_Yorikke[^]
 
FNV1A_Yorikke outspeeds monstrously (lines below) CRC-32 while maintaining similar collisions.
 
In OSHO.TXT 'Building-Blocks' test with (85868050-70657880)/70657880*100% = 21%
In '5,000,000 Knight Tours' test with (9774609-5986428)/5986428*100% = 63%
In '100MB as one line' test with (764957-176506)/176506*100% = 333%
 
Of course T7500 limits her, on new CPUs like i5/i7 FNV1A_Yorikke simply 'dances on the water'.
 
Enjoy!
Get down get down get down get it on show love and give it up
What are you waiting on?

QuestionMy latest Windows benchmark package PinmemberSanmayce20 Mar '12 - 7:26 
At last I made an easy to be run test package (a NSIS installation):
http://www.sanmayce.com/Downloads/index.html#Jesters[^]
 
I initiated a thread on a cool (COLD yes) overclock maniacs forum at:
http://www.overclockaholics.com/forums/showthread.php?t=5132[^]
 
'Monstrous Jesters' benchmark package short overview:
 
This is my latest 32bit/64bit (strstr-showdown included) CPU/RAM benchmark package (a NSIS installation).
 
File: Monstrous_Jesters.exe
Size: 153 MB (161,009,933 bytes)
Size unpacked: 500 MB
Size needed: 1200 MB
 
After installation 5 shortcuts (tests) are placed on Desktop/Programs.
All tests are written in C (sources included), and compiled with latest Intel 12.1 and Microsoft 16 optimizers.
 
The MEMMEM (strstr-showdown) test takes some 21minutes to complete on Core2Duo_E7500_2.93Ghz.
Of course in order to obtain decent results stop all the concurrent processes before running the test.
Also enable 100% computing power.
 
Well, there are some additional tests (Intel 12.1 and Microsoft 16 executables included):
- lzpre a LZ77 32bit/64bit [de]compressor, written by Matt Mahoney;
- Yappy a LZ 32bit/64bit [de]compressor, written by IronPeter;
- Knight tour benchmark, finds first 9,000,000 tours (at rate some 1 billion per minute jumps), in fact tests/stresses only CPU clock;
- Quicksort 32bit/64bit used to sort 200,000,000+ pointers (pointing to 7bytes chunks).
 
Also I would be glad for some feedback and results on your machines.
 
Enjoy!
Get down get down get down get it on show love and give it up
What are you waiting on?

QuestionHorspool vs BNDM PinmemberMischa Sandberg4 Mar '12 - 18:06 
Just noticed your comment in an older post:
 
"Quick note3: 'Trident' being 200MB/s faster than BNDM_64 for text and 700MB/s slower than BNDM_64 for DNA, ouch!"
 
Yes, pretty much that exact statement is in "Flexible Pattern Matching...", the Navarro text.
The statistical analysis there says that BNDM on average beats Horspool for small alphabets like DNA,
Horspool beats BNDM for larger alphabets like English -- i.e. where the probability of any given character occurring is less and less.
BNDM immediately benefits from larger word sizes, Horspool doesn't.
 
For longer patterns, there's another algorithm (Backward Oracle Matching) that beats Horspool, but has a higher setup cost.
 
The 64bit tests I'm running mostly confirm this. Linux 2.6.32, GCC 4.3.4, Intel Core2 12-CPU 3.33GHz, L1 cache 12MB.
"Dreams come true, not free" -- S.Sondheim

AnswerRe: Horspool vs BNDM PinmemberSanmayce5 Mar '12 - 5:35 
GeneralRe: Unix PinmemberMischa Sandberg5 Mar '12 - 14:40 
AnswerRe: Horspool vs BNDM PinmemberSanmayce7 Mar '12 - 3:23 
AnswerBNDM_64 vs 'Tridentx64' PinmemberSanmayce7 Mar '12 - 5:05 
AnswerThe best practical on-line resource on String Matching PinmemberSanmayce7 Mar '12 - 5:41 
QuestionRailgun_Doublet_x64: the fastest mini-memmem known to me PinmemberSanmayce26 Feb '12 - 4:47 
Last night (2012 Feb 25) achieved 13%-20% boost of my strstr-like (i.e. short strings/patterns cases) function compared to GNU's Berg's strstr function - both compiled as 64bit code using Microsoft 16 and Intel 12.1 compilers. For full benchmark see the commented lines further below. After some ups-and-downs experienced in Windows 7 64bit with Intel 12.1 64bit and Microsoft 16 64bit C compilers I revised Railgun in order to fit it better in the new environment which resulted in appearance of Railgun_Doublet. The function is tiny: with 00cc5-00c71+1+1 = 86 bytes main-loop.
 
GNU Berg's performance (compiled with Microsoft 2010 64bit compiler): 376KB/clock+400KB/clock+564KB/clock+346KB/clock+540KB/clock+395KB/clock+535KB/clock+401KB/clock+542KB/clock+455KB/clock+510KB/clock+468KB/clock+541KB/clock+469KB/clock+557KB/clock+482KB/clock=7581
GNU Berg's performance (compiled with Intel C++ 64 Compiler XE 12.1): 296KB/clock+320KB/clock+421KB/clock+276KB/clock+417KB/clock+323KB/clock+415KB/clock+326KB/clock+415KB/clock+367KB/clock+405KB/clock+378KB/clock+415KB/clock+378KB/clock+418KB/clock+388KB/clock=5958

Railgun_Doublet performance (compiled with Microsoft 2010 64bit compiler): 421KB/clock+476KB/clock+551KB/clock+393KB/clock+550KB/clock+534KB/clock+553KB/clock+550KB/clock+559KB/clock+557KB/clock+553KB/clock+562KB/clock+568KB/clock+573KB/clock+587KB/clock+594KB/clock=8581
Railgun_Doublet performance (compiled with Intel C++ 64 Compiler XE 12.1): 348KB/clock+397KB/clock+464KB/clock+328KB/clock+463KB/clock+452KB/clock+465KB/clock+463KB/clock+469KB/clock+468KB/clock+464KB/clock+470KB/clock+477KB/clock+479KB/clock+490KB/clock+494KB/clock=7191

BNDM_32 performance (compiled with Microsoft 2010 64bit compiler): 267KB/clock+290KB/clock+462KB/clock+255KB/clock+390KB/clock+379KB/clock+451KB/clock+395KB/clock+400KB/clock+433KB/clock+419KB/clock+420KB/clock+436KB/clock+447KB/clock+435KB/clock+441KB/clock=6320
BNDM_32 performance (compiled with Intel C++ 64 Compiler XE 12.1): 241KB/clock+262KB/clock+409KB/clock+234KB/clock+352KB/clock+343KB/clock+402KB/clock+359KB/clock+367KB/clock+392KB/clock+383KB/clock+384KB/clock+398KB/clock+407KB/clock+400KB/clock+405KB/clock=5738

Summary for all 16 patterns:
GNU Berg's performance:      7581 Microsoft / 5958 Intel
Railgun_Doublet performance: 8581 Microsoft / 7191 Intel
BNDM_32 performance:         6320 Microsoft / 5738 Intel
 
Using Microsoft:
Railgun_Doublet is (8581-7581)/7581*100% = 13% faster than GNU Berg's
Railgun_Doublet is (8581-6320)/6320*100% = 35% faster than BNDM_32
 
Using Intel:
Railgun_Doublet is (7191-5958)/5958*100% = 20% faster than GNU Berg's
Railgun_Doublet is (7191-5738)/5738*100% = 25% faster than BNDM_32
For full benchmark dumps: Railgun homepage.
 
// Notes on 80x86 and x64, 2012-Feb-25:
// Three compilers were used (first two on Windows 7 64bit, the third on Windows XP 32bit):
// Intel(R) C++ 64 Compiler XE for applications running on Intel(R) 64, Version 12.1.1.258 Build 20111011
// Microsoft (R) C/C++ Optimizing Compiler Version 16.00.30319.01 for x64
// Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.30319.01 for 80x86
// I have been using x64 for more than 12 hours and quickly the picture has become clear: code written for 32bit must be replaced with dedicated 64bit counterpart, relying on former only is a gambling venture.
// For example Railgun_Doublet dominates when compiled as 64bit (both with Intel XE 2011 12.1 and Microsoft 2010 16.00.30319.01 for x64), but for 32bit it is inferior compared to Railgun_Quadruplet_8Triplet.
// Summary for strstr-like (i.e. memmem for short strings/patterns) usage:
// - for 32bit use Railgun_Quadruplet_8Triplet
// - for 64bit use Railgun_Doublet
// My interest has been shifted from strstr toward memmem, Railgun_Doublet will fill the gap (short strings/patterns cases) in Railgun_r8_Mimino_x64 whereas 'Trident2'+'Hasherezade' will deal with memmem part, 'Hasherezade' should be surely tuned for 64bit.
Get down get down get down get it on show love and give it up
What are you waiting on?

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Rant Rant    Admin Admin   

Permalink | Advertise | Privacy | Mobile
Web04 | 2.6.130516.1 | Last Updated 1 Feb 2012
Article Copyright 2011 by Sanmayce
Everything else Copyright © CodeProject, 1999-2013
Terms of Use
Layout: fixed | fluid