Click here to Skip to main content
15,885,366 members
Articles / General Programming / Optimization

Fastest strstr-like Function in C!?

Rate me:
Please Sign up or sign in to vote.
4.81/5 (39 votes)
25 Jan 2012CPOL38 min read 270.1K   3.5K   70  
Tuned function for searching a needle in a haystack
This is an old version of the currently published article.

Introduction

The C strstr-like function presented below (written by Georgi 'Kaze') is named Railgun_Quadruplet_6ppp and is based on well-known Boyer-Moore-Horspool and simplified Karp-Rabin algorithms. The main goal: to achieve fastest execution for all kinds of pattern/needle and string/haystack lengths (starting from 2 bytes).

Railgun_Quadruplet_6ppp is now obsolete due to the arrival of Railgun_Quadruplet_7 which boosts the BMH fragment by more than a percent. See further below for its commentless and tabulated source.

Railgun_Quadruplet_7 exploits the CPUs fast unaligned DWORD extract capability.
The heaviest and most precise test on 1+TB text (52 patterns vs 197MB English books text repeated 256 times results in 1000+ seconds, see strstr_SHORT-SHOWDOWN_r7_Heavy_Test_LOG.txt file for more information) gives:

[strstr_SHORT-SHOWDOWN_Microsoft_v16_Ox.exe:]
The variants returning the offset of first hit only:

  • Railgun_Quadruplet_6pp 52, i.e., average performance: 1584KB/clock
  • Railgun_Quadruplet_7 52, i.e., average performance: 1659KB/clock

The variants counting all hits:

  • Railgun_6pp 8x2, i.e., average performance: 1246KB/clock
  • Railgun_Quadruplet_7 8x2, i.e., average performance: 1266KB/clock

[strstr_SHORT-SHOWDOWN_Intel_v12_O3.exe:]
The variants returning the offset of first hit only:

  • Railgun_Quadruplet_6pp 52, i.e., average performance: 1773KB/clock
  • Railgun_Quadruplet_7 52, i.e., average performance: 1685KB/clock

The variants counting all hits:

  • Railgun_6pp 8x2, i.e., average performance: 1054KB/clock
  • Railgun_Quadruplet_7 8x2, i.e., average performance: 1076KB/clock

Note: For Railgun_Quadruplet_7 on CPUs with fast DWORD (compared to BYTE) memory fetching I expect significant boost, AFAIK the Intel i7 is one of those, unfortunately I lack the opportunity to play with it.

Or the two bottom-lines:
The FASTEST variant counting all hits:

  • Railgun_Quadruplet_7 8x2 with average performance: 1266KB/clock (best performance achieved with Microsoft_v16_Ox)!

The FASTEST variant returning the offset of first hit only:

  • Railgun_Quadruplet_6pp 52 with average performance: 1773KB/clock (1,943,883,635,456 bytes / 1,070,220 clocks = 1732 MB/s best performance achieved with Intel_v12_O3)!

An add-on from 2012-Jan-26:

Enter Railgun_Quadruplet_7Gulliver ... the most promising variant.

I am happy to share my latest etude - a result of a week-long sub-quest of mine saturated with music and program-mess-ing.
The first search-function that breaks (effortlessly) the 3GB/s barrier on a computer with 3GB/s memory write.
// ### Mix(2in1) of Karp-Rabin & Boyer-Moore-Sunday-Horspool algorithm [
/*
Raising Horspool/Sunday order from 1 to 2 came quite naturally after some carefree wander of mine.
I consider adding a new 'else if' block for haystacks longer than 960 in order to avoid the big skip table overhead, something like 512*1024*1024, thus fastest r7sun will operate on 961..512*1024*1024 long haystacks whereas 'Gulliver' on bigger than 512*1024*1024.
I am not happy with Skip-Performance of revision 1: this should be amended...
'Gulliver' makes giant skips/jumps on patterns some 8+ chars long but for short ones falls behind. Here r.1 is given being a Horspool order 2, of course Sunday order 2 is coming.
I restrain myself of dreaming about power behind order 3, for bigger orders I expect nothing less than Dream-Performance.

Of course the enhancements await advent, they are very easy to be seen, but they will appear in next revisions of 'Gulliver'.
Tuning continues but the skeleton is built, I see 'Gulliver' as a really High-Performance etude.

And not to be empty-handed here the Gulliver's swiftness is benchmarked on String(206,908,949bytes) as-one-line:

Pattern: fast
Railgun_Quadruplet_7sun:       1,074KB/clock / 0,456%, 45,330,622 iterations
Railgun_Quadruplet_7:          1,000KB/clock / 0,377%, 54,788,054 iterations
Railgun_Quadruplet_7sunhorse:  0,651KB/clock / 0,480%, 43,103,056 iterations
Railgun_Quadruplet_7deuce:     0,575KB/clock / 0,389%, 53,138,919 iterations
Railgun_Quadruplet_7Elsiane:   0,511KB/clock / 0,551%, 37,541,955 iterations
Railgun_Quadruplet_7Gulliver:  0,649KB/clock / 0,298%, 69,403,843 iterations
Boyer_Moore_Flensburg:         0,491KB/clock / 0,377%, 54,788,139 iterations

Pattern: faster
Railgun_Quadruplet_7sun:       1,347KB/clock / 0,591%, 34,996,936 iterations
Railgun_Quadruplet_7:          1,329KB/clock / 0,514%, 40,194,194 iterations
Railgun_Quadruplet_7sunhorse:  0,762KB/clock / 0,656%, 31,504,148 iterations
Railgun_Quadruplet_7deuce:     0,662KB/clock / 0,567%, 36,434,006 iterations
Railgun_Quadruplet_7Elsiane:   0,538KB/clock / 0,710%, 29,101,626 iterations
Railgun_Quadruplet_7Gulliver:  1,010KB/clock / 0,491%, 42,066,438 iterations
Boyer_Moore_Flensburg:         0,687KB/clock / 0,514%, 40,194,282 iterations

Pattern: fastest
Railgun_Quadruplet_7sun:       1,554KB/clock / 0,687%, 30,084,306 iterations
Railgun_Quadruplet_7:          1,507KB/clock / 0,599%, 34,540,430 iterations
Railgun_Quadruplet_7sunhorse:  0,918KB/clock / 0,761%, 27,188,853 iterations
Railgun_Quadruplet_7deuce:     0,786KB/clock / 0,663%, 31,175,827 iterations
Railgun_Quadruplet_7Elsiane:   0,635KB/clock / 0,818%, 25,281,493 iterations
Railgun_Quadruplet_7Gulliver:  1,209KB/clock / 0,592%, 34,946,434 iterations
Boyer_Moore_Flensburg:         0,798KB/clock / 0,621%, 33,278,240 iterations

Pattern: fastest fox
Railgun_Quadruplet_7sun:       1,727KB/clock / 0,775%, 26,672,940 iterations
Railgun_Quadruplet_7:          1,669KB/clock / 0,669%, 30,925,578 iterations
Railgun_Quadruplet_7sunhorse:  0,957KB/clock / 0,912%, 22,663,583 iterations
Railgun_Quadruplet_7deuce:     0,821KB/clock / 0,797%, 25,945,709 iterations
Railgun_Quadruplet_7Elsiane:   0,719KB/clock / 0,931%, 22,213,101 iterations  Gulliver outskips Elsiane!
Railgun_Quadruplet_7Gulliver:  1,804KB/clock / 0,977%, 21,167,180 iterations  monstrous!
Boyer_Moore_Flensburg:         1,080KB/clock / 0,669%, 30,925,649 iterations

Pattern: fastest fox with biggest strides
Railgun_Quadruplet_7sun:       2,658KB/clock / 1,584%, 13,060,463 iterations
Railgun_Quadruplet_7:          2,767KB/clock / 1,511%, 13,689,243 iterations
Railgun_Quadruplet_7sunhorse:  1,788KB/clock / 2,138%, 09,677,267 iterations
Railgun_Quadruplet_7deuce:     1,656KB/clock / 2,053%, 10,075,650 iterations
Railgun_Quadruplet_7Elsiane:   1,542KB/clock / 2,143%, 09,652,548 iterations
Railgun_Quadruplet_7Gulliver:  3,108KB/clock / 2,917%, 07,091,357 iterations  over the top!
Boyer_Moore_Flensburg:         1,820KB/clock / 1,554%, 13,307,181 iterations

Pattern: fastest fox with biggest strides known to me
Railgun_Quadruplet_7sun:       2,590KB/clock / 1,548%, 13,363,356 iterations
Railgun_Quadruplet_7:          2,658KB/clock / 1,447%, 14,292,419 iterations
Railgun_Quadruplet_7sunhorse:  1,870KB/clock / 2,234%, 09,259,505 iterations
Railgun_Quadruplet_7deuce:     1,757KB/clock / 2,011%, 10,287,584 iterations
Railgun_Quadruplet_7Elsiane:   1,656KB/clock / 2,240%, 09,236,188 iterations
Railgun_Quadruplet_7Gulliver:  3,108KB/clock / 3,821%, 05,413,684 iterations  unstoppable!
Boyer_Moore_Flensburg:         1,712KB/clock / 1,540%, 13,431,751 iterations

Pattern: fastest fox with biggest strides known to me up to 2012 January 26 namely 'Gulliver'
Railgun_Quadruplet_7sun:       3,108KB/clock / 2,890%, 07,159,321 iterations
Railgun_Quadruplet_7:          3,157KB/clock / 2,742%, 07,545,141 iterations
Railgun_Quadruplet_7sunhorse:  2,557KB/clock / 4,138%, 04,999,777 iterations
Railgun_Quadruplet_7deuce:     2,494KB/clock / 4,029%, 05,135,444 iterations
Railgun_Quadruplet_7Elsiane:   2,494KB/clock / 4,141%, 04,995,463 iterations
Railgun_Quadruplet_7Gulliver:  3,108KB/clock / 7,218%, 02,866,342 iterations  simply amazing!
Boyer_Moore_Flensburg:         2,767KB/clock / 2,745%, 07,536,097 iterations

Note:
[
In the above benchmark (on Pentium Merom 2166Mhz) 'Gulliver' is limited by memory bandwidth only, due to:

Pattern: fastest fox with biggest strides
Railgun_Quadruplet_7sun:       2,658KB/clock / 1,584%, 13,060,463 iterations
Railgun_Quadruplet_7Gulliver:  3,108KB/clock / 2,917%, 07,091,357 iterations

Pattern: fastest fox with biggest strides known to me
Railgun_Quadruplet_7sun:       2,590KB/clock / 1,548%, 13,363,356 iterations
Railgun_Quadruplet_7Gulliver:  3,108KB/clock / 3,821%, 05,413,684 iterations

Pattern: fastest fox with biggest strides known to me up to 2012 January 26 namely 'Gulliver'
Railgun_Quadruplet_7sun:       3,108KB/clock / 2,890%, 07,159,321 iterations
Railgun_Quadruplet_7Gulliver:  3,108KB/clock / 7,218%, 02,866,342 iterations

You can see how Railgun_Quadruplet_7sun struggles to catch up, for longest pattern it achieves the maximum 3,108KB/clock but at a nasty cost: 7,159,321 iterations whereas 'Gulliver' achieves same (in fact much higher) speed with heart-touching 2,866,342 iterations only.
For my Pentium T3400 2166 MHz Toshiba Satellite L305 Dual DDR2-667 5-5-5-13, 'Everest' benchmark program gives:
Memory Read:  4,605 MB/s
Memory Write: 3,323 MB/s
'Gulliver' would run at 5GB/s without any problem if it were not for the hardware limitation, a fantastic performance in my eyes.
]
*/
// 
// Benchmark (on Pentium Merom 2166MHz, Windows XP 32bit, CL v16 32bit):
// Two tests were done - the first/second '6x2'/'52' searches 12/52 patterns (for all appearances i.e. counting all hits) into 206,908,949 bytes long English text:
// 
// Skip-Performance (measured in %, bigger-the-better; Iterations, lesser-the-better) Summary:
// 
// '6x2' test (12 patterns 4-19 chars in length)
// 
// Railgun_Quadruplet_7sun 6x2 i.e. average performance:       2,092KB/clock / 0,153,936% / 005,572,202,064
// Railgun_Quadruplet_7 6x2 i.e. average performance:          2,032KB/clock / 0,137,568% / 006,454,416,320
// Railgun_Quadruplet_7sunhorse 6x2 i.e. average performance:  1,262KB/clock / 0,173,840% / 005,145,288,080
// Railgun_Quadruplet_7deuce 6x2 i.e. average performance:     1,134KB/clock / 0,155,744% / 005,995,597,776
// Railgun_Quadruplet_7Elsiane 6x2 i.e. average performance:   0,938KB/clock / 0,184,096% / 004,710,045,936
// Boyer_Moore_Flensburg 6x2 i.e. average performance:         1,163KB/clock / 0,139,168% / 006,428,106,496
// Railgun_Quadruplet_7Gulliver 6x2 i.e. average performance:  1,552KB/clock / 0,152,912% / 006,959,446,480
// Brute_Force_Dummy 6x2 i.e. average performance:             0,343KB/clock / 0,019,200% / 039,726,516,640
// 
// '52' test (52 patterns 4-175 chars in length)
// 
// Railgun_Quadruplet_7sun 52 i.e. average performance:        1,652KB/clock / 0,768,416% / 022,362,482,640
// Railgun_Quadruplet_7 52 i.e. average performance:           1,636KB/clock / 0,708,768% / 025,368,049,712
// Railgun_Quadruplet_7sunhorse 52 i.e. average performance:   1,052KB/clock / 0,939,520% / 020,187,534,736
// Railgun_Quadruplet_7deuce 52 i.e. average performance:      0,929KB/clock / 0,864,048% / 023,292,952,992
// Railgun_Quadruplet_7Elsiane 52 i.e. average performance:    0,809KB/clock / 0,979,072% / 018,535,811,280
// Boyer_Moore_Flensburg 52 i.e. average performance:          0,928KB/clock / 0,720,272% / 025,227,269,760
// Railgun_Quadruplet_7Gulliver 52 i.e. average performance:   1,296KB/clock / 1,145,808% / 026,131,772,688
// Brute_Force_Dummy 52 i.e. average performance:              0,279KB/clock / 0,083,200% / 172,148,232,496
// 
// Revision: 1, 2012-Jan-26, the main disadvantage: the preprocessing overhead.
// Caution: For better speed the case 'if (cbPattern==1)' was removed, so Pattern must be longer than 1 char.
char * Railgun_Quadruplet_7Gulliver_count_hits (char * pbTarget, char * pbPattern, unsigned long cbTarget, unsigned long cbPattern)
{
    char * pbTargetMax = pbTarget + cbTarget;
    register unsigned long ulHashPattern;
    register unsigned long ulHashTarget;
    signed long count;
    signed long countSTATIC;

    unsigned char SINGLET;
    unsigned long Quadruplet2nd;
    unsigned long Quadruplet3rd;
    unsigned long Quadruplet4th;

    unsigned long  AdvanceHopperGrass;

    long i; //BMH needed
    int a, j;
    unsigned int bm_bc[256]; //BMH needed
    unsigned int bm_bc2nd[256]; //BMS needed
    unsigned char bm_Sunday_Order2[256*256]; //BMHSS(Elsiane) needed, 'char' limits patterns to 255, if 'long' then table becomes 256KB, grrr.
    unsigned long Gulliver; // or unsigned char or unsigned short

    if (cbPattern > cbTarget)
        return(NULL);

    if ( cbPattern<4) { 
        pbTarget = pbTarget+cbPattern;
        ulHashPattern = ( (*(char *)(pbPattern))<<8 ) + *(pbPattern+(cbPattern-1));

        if ( cbPattern==3) {
            for ( ;; ) {
                if ( ulHashPattern == ( (*(char *)(pbTarget-3))<<8 ) + *(pbTarget-1) ) {
                    if ( *(char *)(pbPattern+1) == *(char *)(pbTarget-2) ) Railgunhits++; //return((pbTarget-3));
                }
                if ( (char)(ulHashPattern>>8) != *(pbTarget-2) ) pbTarget++;
                pbTarget++;
                if (pbTarget > pbTargetMax)
                    return(NULL);
            }
        } else {
        }
        for ( ;; ) {
            if ( ulHashPattern == ( (*(char *)(pbTarget-2))<<8 ) + *(pbTarget-1) )
                Railgunhits++; //return((pbTarget-2));
            if ( (char)(ulHashPattern>>8) != *(pbTarget-1) ) pbTarget++;
            pbTarget++;
            if (pbTarget > pbTargetMax)
                return(NULL);
        }
    } else { //if ( cbPattern<4)
        if (cbTarget<961) { // This value is arbitrary(don't know how exactly), it ensures(at least must) better performance than 'Boyer_Moore_Horspool'.
            pbTarget = pbTarget+cbPattern;
            ulHashPattern = *(unsigned long *)(pbPattern);

            SINGLET = ulHashPattern & 0xFF;
            Quadruplet2nd = SINGLET<<8;
            Quadruplet3rd = SINGLET<<16;
            Quadruplet4th = SINGLET<<24;

            for ( ;; ) {
                AdvanceHopperGrass = 0;
                ulHashTarget = *(unsigned long *)(pbTarget-cbPattern);

                    if ( ulHashPattern == ulHashTarget ) { // Three unnecessary comparisons here, but 'AdvanceHopperGrass' must be calculated - it has a higher priority.
                    count = cbPattern-1;
                    while ( count && *(char *)(pbPattern+(cbPattern-count)) == *(char *)(pbTarget-count) ) {
                        if ( cbPattern-1==AdvanceHopperGrass+count && SINGLET != *(char *)(pbTarget-count) ) AdvanceHopperGrass++;
                        count--;
                    }
                    if ( count == 0) Railgunhits++; //return((pbTarget-cbPattern));
                    } else { // The goal here: to avoid memory accesses by stressing the registers.
                    if ( Quadruplet2nd != (ulHashTarget & 0x0000FF00) ) {
                        AdvanceHopperGrass++;
                        if ( Quadruplet3rd != (ulHashTarget & 0x00FF0000) ) {
                            AdvanceHopperGrass++;
                            if ( Quadruplet4th != (ulHashTarget & 0xFF000000) ) AdvanceHopperGrass++;
                        }
                    }
                }

                AdvanceHopperGrass++;

                pbTarget = pbTarget + AdvanceHopperGrass;
                if (pbTarget > pbTargetMax)
                    return(NULL);
            }
        } else { //if (cbTarget<961)
            countSTATIC = cbPattern-2-2;

            for (a=0; a < 256; a++) {bm_bc[a]=cbPattern; bm_bc2nd[a]=cbPattern+1;}
            for (j=0; j < cbPattern-1; j++) bm_bc[pbPattern[j]]=cbPattern-j-1; 
            for (j=0; j < cbPattern; j++) bm_bc2nd[pbPattern[j]]=cbPattern-j; 

            // Elsiane r.2  [
            for (a=0; a < 256*256; a++) {bm_Sunday_Order2[a]= cbPattern-1;} // 'memset' if not optimized

            // alfalfa 7 long 6 BBs (al lf fa al lf fa) 3 distinct BBs (al lf fa) 
            // fast 4 0-1-2 fa as st
            for (j=0; j < cbPattern-1; j++) bm_Sunday_Order2[*(unsigned short *)(pbPattern+j)]=j; // Rightmost appearance/position is needed

            // Elsiane r.2 ]

            ulHashPattern = *(unsigned short *)(pbPattern+cbPattern-1-1); // Last two bytes
            ulHashTarget = *(unsigned long *)(pbPattern); // First four bytes
        
            AdvanceHopperGrass = 0;
            i=0;
            while (i <= cbTarget-cbPattern-1-1) {
                Gulliver = bm_Sunday_Order2[*(unsigned short *)&pbTarget[i+cbPattern-1-1]];

                if ( Gulliver == cbPattern-2 ) { // CASE #1: means the pair (char order 2) is found
                    if ( *(unsigned long *)&pbTarget[i] == ulHashTarget) {
                        count = countSTATIC; // Last two chars already matched, to be fixed with -2
                        while ( count !=0 && *(char *)(pbPattern+(countSTATIC-count)+4) == *(char *)(&pbTarget[i]+(countSTATIC-count)+4) )
                            count--;
                        if ( count == 0) Railgunhits++; //return(pbTarget+i);
                    }
                    i = i + 1;
                } else if ( Gulliver == cbPattern-1 ) // CASE #2: means the pair (char order 2) is found
                    i = i + Gulliver; // the pair is not found, skip the whole pattern and fall back one char
                else
                    i = i +  cbPattern - Gulliver - 2; // CASE #3: the pair is found and not as suffix i.e. rightmost position

// 32323218 Order 1 Horspool
// fa af fa af fa as st Order 2 Horspool
//  0  1  2  3  4  5  6
// HIKARIfast
// fafafast
//   fafafast +2 Order 1 'a' vs 't'
//   fafafast +2 = (cbPattern-Gulliver-2 = 8-4-2 = 2) Order 2 'fa' vs 'st' i.e. CASE #3

// 76543218 Order 1 Horspool
// lo on ng gp pa ac ce Order 2 Horspool
//  0  1  2  3  4  5  6
// HIKARIfast
// longpace
//   longpace +2 Order 1 'a' vs 'e'
//        longpace +7 = (cbPattern-1 = 8-1 = 7) Order 2 'fa' vs 'ce' i.e. CASE #2

                AdvanceHopperGrass++;
            }

            if (i == cbTarget-cbPattern-1) {
                if ( *(unsigned long *)&pbTarget[i] == ulHashPattern) {
                    count = countSTATIC;
                    while ( count !=0 && *(char *)(pbPattern+(countSTATIC-count)+4) == *(char *)(&pbTarget[i]+(countSTATIC-count)+4) )
                        count--;
                    if ( count == 0) Railgunhits++; //return(pbTarget+i);
                }
                i++;
                AdvanceHopperGrass++;
            }

            if (i == cbTarget-cbPattern) {
                if ( *(unsigned long *)&pbTarget[i] == ulHashPattern) {
                    count = countSTATIC;
                    while ( count !=0 && *(char *)(pbPattern+(countSTATIC-count)+4) == *(char *)(&pbTarget[i]+(countSTATIC-count)+4) )
                        count--;
                    if ( count == 0) Railgunhits++; //return(pbTarget+i);
                }
                AdvanceHopperGrass++;
            }

            GlobalSP += (int)((double)cbTarget/AdvanceHopperGrass*100);
            GlobalI += AdvanceHopperGrass;
            printf("Skip-Performance(bigger-the-better): %d%%, %d skips/iterations\n",(int)((double)cbTarget/AdvanceHopperGrass*100), AdvanceHopperGrass);
        
            return(NULL);
        } //if (cbTarget<961)
    } //if ( cbPattern<4)
}
// ### Mix(2in1) of Karp-Rabin & Boyer-Moore-Sunday-Horspool algorithm ]
I was very touched (thus inspired) by this cuore song:

Strong enough to tell you why
I feel enough to tell you something that's very close to me
In the eye of my mind
Feel the reasons that we thought
You just feel all
Falling I'm standing across the stream
Fearing saying I
I'm strong enough to find my courage
To find the way I'm always near it's brewing in my mind
Falling I'm standing across the stream
Fearing saying I
I'm strong enough to find my courage
To find the way I'm always near it's growing in my mind

/Artist: Elsiane - Across The Stream/

I salute everyone who feels-and-keeps the artistic way near.

An add-on from 2012-Jan-24:

Just wanted to see what Sunday's technique (applied on two chars, order 2) holds, this resulted in appearance of Railgun_Quadruplet_7Elsiane, the biggest-strider so far, I dream (already) of Railgun_Quadruplet_7Gulliver...
Horspool-Sunday-Sunday jumps to:
--------------------------------\
                                |
Horspool-Sunday jumps to:
-------------------------------\
                               |
Horspool jumps to:
------------------------------\
                              |
                  String: faster
                  Needle: fast

       Sunday skip-table:
                          43215
     Horspool skip-table:
                          32144
For above example:
Horspool jumps to A where A=4 'e' position
Sunday jumps to B where B=5 'r' position
Horspool-Sunday ('SunHorse') jumps to max(A,B) i.e. 5
Horspool-Sunday-Sunday ('Elsiane') jumps to max(A,B,C) i.e. max(5,6)=6 since neither 'e' nor 'r' appear in the needle i.e. right after 'r'
// ### Mix(2in1) of Karp-Rabin & Boyer-Moore-Sunday-Horspool algorithm [
// Note: 'Elsiane' is the strongest (i.e. with biggest-strides) Boyer-Moore-Sunday-Horspool variant known to me, named after a 'very close to me' duo(two musicians), viva Elsieanne Caplette!
// Benchmark (on Pentium Merom 2166MHz, Windows XP 32bit, CL v16 32bit):
// Two tests were done - the first/second '6x2'/'52' searches 12/52 patterns (for all appearances i.e. counting all hits) into 206,908,949 bytes long English text:
// 
// Skip-Performance (measured in %, bigger-the-better; Iterations, lesser-the-better) Summary:
// 
// Function                      '6x2' test (12 patterns 4-19 chars in length)
// 
// Railgun_Quadruplet_7Elsiane   0,934KB/clock / 184,096% / 004,710,045,936 Iterations  Muffin
// Railgun_Quadruplet_7sunhorse  1,261KB/clock / 173,840% / 005,145,288,080 Iterations  BONBON
// Railgun_Quadruplet_7deuce     1,132KB/clock / 155,744% / 005,995,597,776 Iterations  Leaves-The-Town
// Railgun_Quadruplet_7sun       2,091KB/clock / 153,936% / 005,572,202,064 Iterations  Bitter-Sweet
// Boyer_Moore_Flensburg         1,162KB/clock / 139,168% / 006,428,106,496 Iterations  Loopy
// Railgun_Quadruplet_7          2,029KB/clock / 137,568% / 006,454,416,320 Iterations  Loopier
// Brute_Force_Dummy             0,342KB/clock / 019,200% / 039,726,516,640 Iterations  Loopiest
// 
// Function                      '52' test (52 patterns 4-175 chars in length)
// 
// Railgun_Quadruplet_7Elsiane   0,809KB/clock / 979,072% / 018,535,811,280 Iterations  Muffin
// Railgun_Quadruplet_7sunhorse  1,051KB/clock / 939,520% / 020,187,534,736 Iterations  BONBON
// Railgun_Quadruplet_7deuce     0,928KB/clock / 864,048% / 023,292,952,992 Iterations  Leaves-The-Town
// Railgun_Quadruplet_7sun       1,651KB/clock / 768,416% / 022,362,482,640 Iterations  Bitter-Sweet
// Boyer_Moore_Flensburg         0,928KB/clock / 720,272% / 025,227,269,760 Iterations  Loopy
// Railgun_Quadruplet_7          1,632KB/clock / 708,768% / 025,368,049,712 Iterations  Loopier
// Brute_Force_Dummy             0,279KB/clock / 083,200% / 172,148,232,496 Iterations  Loopiest
// 
// Revision: 1, 2012-Jan-24, the main disadvantage: the preprocessing overhead.
// Caution: For better speed the case 'if (cbPattern==1)' was removed, so Pattern must be longer than 1 char.
char * Railgun_Quadruplet_7Elsiane_count_hits (char * pbTarget,    char * pbPattern, unsigned long cbTarget, unsigned long cbPattern)
{
    char * pbTargetMax = pbTarget + cbTarget;
    register unsigned long ulHashPattern;
    register unsigned long ulHashTarget;
    signed long count;
    signed long countSTATIC;

    unsigned char SINGLET;
    unsigned long Quadruplet2nd;
    unsigned long Quadruplet3rd;
    unsigned long Quadruplet4th;

    unsigned long  AdvanceHopperGrass;

    long i; //BMH needed
    int a, j;
    unsigned int bm_bc[256]; //BMH needed
    unsigned int bm_bc2nd[256]; //BMS needed
    unsigned char bm_Sunday_Order2[256*256]; //BMHSS(Elsiane) needed, of course those 65536 bytes are scary in next revision reduce them to 8192 i.e. bitwise it

    if (cbPattern > cbTarget)
        return(NULL);

    if ( cbPattern<4) { 
        pbTarget = pbTarget+cbPattern;
        ulHashPattern = ( (*(char *)(pbPattern))<<8 ) + *(pbPattern+(cbPattern-1));

        if ( cbPattern==3) {
            for ( ;; ) {
                if ( ulHashPattern == ( (*(char *)(pbTarget-3))<<8 ) + *(pbTarget-1) ) {
                    if ( *(char *)(pbPattern+1) == *(char *)(pbTarget-2) ) Railgunhits++; //return((pbTarget-3));
                }
                if ( (char)(ulHashPattern>>8) != *(pbTarget-2) ) pbTarget++;
                pbTarget++;
                if (pbTarget > pbTargetMax)
                    return(NULL);
            }
        } else {
        }
        for ( ;; ) {
            if ( ulHashPattern == ( (*(char *)(pbTarget-2))<<8 ) + *(pbTarget-1) )
                Railgunhits++; //return((pbTarget-2));
            if ( (char)(ulHashPattern>>8) != *(pbTarget-1) ) pbTarget++;
            pbTarget++;
            if (pbTarget > pbTargetMax)
                return(NULL);
        }
    } else { //if ( cbPattern<4)
        if (cbTarget<961) { // This value is arbitrary(don't know how exactly), it ensures(at least must) better performance than 'Boyer_Moore_Horspool'.
            pbTarget = pbTarget+cbPattern;
            ulHashPattern = *(unsigned long *)(pbPattern);

            SINGLET = ulHashPattern & 0xFF;
            Quadruplet2nd = SINGLET<<8;
            Quadruplet3rd = SINGLET<<16;
            Quadruplet4th = SINGLET<<24;

            for ( ;; ) {
                AdvanceHopperGrass = 0;
                ulHashTarget = *(unsigned long *)(pbTarget-cbPattern);

                    if ( ulHashPattern == ulHashTarget ) { // Three unnecessary comparisons here, but 'AdvanceHopperGrass' must be calculated - it has a higher priority.
                    count = cbPattern-1;
                    while ( count && *(char *)(pbPattern+(cbPattern-count)) == *(char *)(pbTarget-count) ) {
                        if ( cbPattern-1==AdvanceHopperGrass+count && SINGLET != *(char *)(pbTarget-count) ) AdvanceHopperGrass++;
                        count--;
                    }
                    if ( count == 0) Railgunhits++; //return((pbTarget-cbPattern));
                    } else { // The goal here: to avoid memory accesses by stressing the registers.
                    if ( Quadruplet2nd != (ulHashTarget & 0x0000FF00) ) {
                        AdvanceHopperGrass++;
                        if ( Quadruplet3rd != (ulHashTarget & 0x00FF0000) ) {
                            AdvanceHopperGrass++;
                            if ( Quadruplet4th != (ulHashTarget & 0xFF000000) ) AdvanceHopperGrass++;
                        }
                    }
                }

                AdvanceHopperGrass++;

                pbTarget = pbTarget + AdvanceHopperGrass;
                if (pbTarget > pbTargetMax)
                    return(NULL);
            }
        } else { //if (cbTarget<961)
            countSTATIC = cbPattern-2-2;

            for (a=0; a < 256; a++) {bm_bc[a]=cbPattern; bm_bc2nd[a]=cbPattern+1;}
            for (j=0; j < cbPattern-1; j++) bm_bc[pbPattern[j]]=cbPattern-j-1; 
            for (j=0; j < cbPattern; j++) bm_bc2nd[pbPattern[j]]=cbPattern-j; 

            // Elsiane [
            for (a=0; a < 256*256; a++) {bm_Sunday_Order2[a]= 0;} // 'memset' if not optimized

            // alfalfa 7 long 6 BBs (al lf fa al lf fa) 3 distinct BBs (al lf fa) 
            // fast 4 0-1-2 fa as st
            //for (j=0; j < cbPattern-1; j++) bm_Sunday_Order2[*(unsigned short *)(pbPattern+j)]=1;

            // The idea here is to use Order 2 of Sunday's approach:
            // It takes all rows and columns containing a char (e.g. A) from the pattern to be marked (as 1 i.e. active)
            //    000 001 002 003 ...   A ... 255
            // 000                      * 
            // 001                      *
            // 002                      *
            // 003                      *
            // ...                      *
            //   A  *   *   *   *   *   *   *   *  
            // ...                      *
            // 255                      *
            // TO-DO: bytewise -> bitwise
            for (j=0; j < cbPattern; j++) 
                for (a=0; a < 256; a++) {
                    bm_Sunday_Order2[(a<<8) + pbPattern[j]]=1; // columns filling
                    bm_Sunday_Order2[(pbPattern[j]<<8) + a]=1; // rows filling
                }
            // Elsiane ]

            ulHashPattern = *(unsigned long *)(pbPattern);
        
            AdvanceHopperGrass = 0;
            i=0;
            while (i <= cbTarget-cbPattern-1-1) {
                if ( *(unsigned long *)&pbTarget[i] == ulHashPattern) {
                    count = countSTATIC;
                    while ( count !=0 && *(char *)(pbPattern+(countSTATIC-count)+4) == *(char *)(&pbTarget[i]+(countSTATIC-count)+4) )
                        count--;
                    if ( count == 0) Railgunhits++; //return(pbTarget+i);
                }

                if ( bm_Sunday_Order2[*(unsigned short *)&pbTarget[i+cbPattern]] == 0 )
                    i = i + cbPattern + 1 + 1;
                else
                    if ( bm_bc[(unsigned char)pbTarget[i+cbPattern-1]] < bm_bc2nd[(unsigned char)pbTarget[i+(cbPattern)]] )
                        i= i + bm_bc2nd[(unsigned char)pbTarget[i+(cbPattern)]];
                    else
                        i= i + bm_bc[(unsigned char)pbTarget[i+cbPattern-1]];

                AdvanceHopperGrass++;
            }

            if (i == cbTarget-cbPattern-1) {
                if ( *(unsigned long *)&pbTarget[i] == ulHashPattern) {
                    count = countSTATIC;
                    while ( count !=0 && *(char *)(pbPattern+(countSTATIC-count)+4) == *(char *)(&pbTarget[i]+(countSTATIC-count)+4) )
                        count--;
                    if ( count == 0) Railgunhits++; //return(pbTarget+i);
                }
                i++;
                AdvanceHopperGrass++;
            }

            if (i == cbTarget-cbPattern) {
                if ( *(unsigned long *)&pbTarget[i] == ulHashPattern) {
                    count = countSTATIC;
                    while ( count !=0 && *(char *)(pbPattern+(countSTATIC-count)+4) == *(char *)(&pbTarget[i]+(countSTATIC-count)+4) )
                        count--;
                    if ( count == 0) Railgunhits++; //return(pbTarget+i);
                }
                AdvanceHopperGrass++;
            }

            GlobalSP += (int)((double)cbTarget/AdvanceHopperGrass*100);
            GlobalI += AdvanceHopperGrass;
            //printf("Skip-Performance(bigger-the-better): %d%%, %d skips/iterations\n",(int)((double)cbTarget/AdvanceHopperGrass*100), AdvanceHopperGrass);
        
            return(NULL);
        } //if (cbTarget<961)
    } //if ( cbPattern<4)
}
// ### Mix(2in1) of Karp-Rabin & Boyer-Moore-Sunday-Horspool algorithm ]
/*
strstr_SHORT-SHOWDOWN, revision 7Elsiane_vs_7sun_vs_7_vs_7sunhorse_vs_7deuce_vs_BMF, written by Kaze.
Full credits to: R.S. Boyer, J.S. Moore, R.N. Horspool, D.M. Sunday.
Input Pattern(up to 19+2000 chars): faster

Doing Search for Pattern(6bytes) into String(206908949bytes) as-one-line ...
Skip-Performance(bigger-the-better): 591%, 34996936 skips/iterations
Railgun_Quadruplet_7sun_hits/Railgun_Quadruplet_7sun_clocks: 744/154
Railgun_Quadruplet_7sun performance: 1312KB/clock

Doing Search for Pattern(6bytes) into String(206908949bytes) as-one-line ...
Skip-Performance(bigger-the-better): 514%, 40194194 skips/iterations
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 744/158
Railgun_Quadruplet_7 performance: 1278KB/clock

Doing Search for Pattern(6bytes) into String(206908949bytes) as-one-line ...
Skip-Performance(bigger-the-better): 656%, 31504148 skips/iterations
Railgun_Quadruplet_7sunhorse_hits/Railgun_Quadruplet_7sunhorse_clocks: 744/272
Railgun_Quadruplet_7sunhorse performance: 742KB/clock

Doing Search for Pattern(6bytes) into String(206908949bytes) as-one-line ...
Skip-Performance(bigger-the-better): 567%, 36434006 skips/iterations
Railgun_Quadruplet_7deuce_hits/Railgun_Quadruplet_7deuce_clocks: 744/313
Railgun_Quadruplet_7deuce performance: 645KB/clock

Doing Search for Pattern(6bytes) into String(206908949bytes) as-one-line ...
Skip-Performance(bigger-the-better): 710%, 29101626 skips/iterations
Railgun_Quadruplet_7Elsiane_hits/Railgun_Quadruplet_7Elsiane_clocks: 744/383
Railgun_Quadruplet_7Elsiane performance: 527KB/clock

Doing Search for Pattern(6bytes) into String(206908949bytes) as-one-line ...
Skip-Performance(bigger-the-better): 514%, 40194216 skips/iterations
Boyer_Moore_Flensburg_hits/Boyer_Moore_Flensburg_clocks: 744/299
Boyer_Moore_Flensburg performance: 675KB/clock

Doing Search for Pattern(6bytes) into String(206908949bytes) as-one-line ...
Skip-Performance(bigger-the-better): 106%, 194548807 skips/iterations
Two-Way_hits/Two-Way_clocks: 744/836
Two-Way performance: 241KB/clock

D:\_KAZE_strstr_SHORT-SHOWDOWN_7Elsiane_7sun_vs_7_vs_7sunhorse_vs_7deuce_vs_BMF>
*/
I like a lot this look-ahead approach of D.M. Sunday, but can't see how to quicken it, any ideas?

An add-on from 2012-Jan-22:

[
A face-off
'SunHorse' vs Railgun_Quadruplet_7sun vs Railgun_Quadruplet_7 vs Boyer_Moore vs Brute_Force

Okay, here comes Railgun_Quadruplet_7sun - the new fastest hitter.
Two tests were done - the first/second '6x2'/'52' searches 12/52 patterns (for all appearances i.e. counting all hits) into 206,908,949 bytes long English text.

First, 7deuce is in the past, a bad luck it was, yes.
Wow, I have received collateral damage by the last shoot-out in which D.M. Sunday excels both in Speed-Performance and Skip-Performance.
Speed-Performance (measured in KB/clock, bigger-the-better) Summary:

Function                      '6x2' test (12 patterns 4-19 chars in length)

Railgun_Quadruplet_7sun       2,092KB/clock / 153,936% / 005,572,202,064 Iterations  BONBON
Railgun_Quadruplet_7          2,030KB/clock / 137,568% / 006,454,416,320 Iterations  Leaves-The-Town
Railgun_Quadruplet_7sunhorse  1,250KB/clock / 173,840% / 005,145,288,080 Iterations  UNDERRATED
Boyer_Moore_Flensburg         1,154KB/clock / 139,168% / 006,428,115,568 Iterations  Slow 
Railgun_Quadruplet_7deuce     1,125KB/clock / 155,744% / 005,995,597,776 Iterations  Slower 
Brute_Force_Dummy             0,338KB/clock / 019,200% / 039,726,516,640 Iterations  Slowest 

Function                      '52' test (52 patterns 4-175 chars in length)

Railgun_Quadruplet_7sun       1,657KB/clock / 768,416% / 022,362,482,640 Iterations  BONBON
Railgun_Quadruplet_7          1,637KB/clock / 708,768% / 025,368,049,712 Iterations  Leaves-The-Town
Railgun_Quadruplet_7sunhorse  1,046KB/clock / 939,520% / 020,187,534,736 Iterations  UNDERRATED
Railgun_Quadruplet_7deuce     0,923KB/clock / 864,048% / 023,292,952,992 Iterations  Slow 
Boyer_Moore_Flensburg         0,919KB/clock / 720,272% / 025,227,312,176 Iterations  Slower 
Brute_Force_Dummy             0,276KB/clock / 083,200% / 172,148,232,496 Iterations  Slowest 

Skip-Performance (measured in %, bigger-the-better; Iterations, lesser-the-better) Summary:

Function                      '6x2' test (12 patterns 4-19 chars in length)

Railgun_Quadruplet_7sunhorse  1,250KB/clock / 173,840% / 005,145,288,080 Iterations  BONBON
Railgun_Quadruplet_7deuce     1,125KB/clock / 155,744% / 005,995,597,776 Iterations  Leaves-The-Town
Railgun_Quadruplet_7sun       2,092KB/clock / 153,936% / 005,572,202,064 Iterations  Bitter-Sweet
Boyer_Moore_Flensburg         1,154KB/clock / 139,168% / 006,428,115,568 Iterations  Loopy
Railgun_Quadruplet_7          2,030KB/clock / 137,568% / 006,454,416,320 Iterations  Loopier
Brute_Force_Dummy             0,338KB/clock / 019,200% / 039,726,516,640 Iterations  Loopiest 

Function                      '52' test (52 patterns 4-175 chars in length)

Railgun_Quadruplet_7sunhorse  1,046KB/clock / 939,520% / 020,187,534,736 Iterations  BONBON
Railgun_Quadruplet_7deuce     0,923KB/clock / 864,048% / 023,292,952,992 Iterations  Leaves-The-Town
Railgun_Quadruplet_7sun       1,657KB/clock / 768,416% / 022,362,482,640 Iterations  Bitter-Sweet
Boyer_Moore_Flensburg         0,919KB/clock / 720,272% / 025,227,312,176 Iterations  Loopy
Railgun_Quadruplet_7          1,637KB/clock / 708,768% / 025,368,049,712 Iterations  Loopier
Brute_Force_Dummy             0,276KB/clock / 083,200% / 172,148,232,496 Iterations  Loopiest
For English texts I would have no second thought which function to choose: it is Railgun_Quadruplet_7sunhorse (Boyer-Moore reinforced by Sunday-Horspool) - strongest of all Boyer-Moore variants(including the original), sadly [still] not among the fastest.
I can't figure out the cause for such unproportionality: 'sunhorse' features amazing (for English texts) Skip-Performance i.e. number of pattern comparisions agaist the text is so low but Speed-Performance (on this CPU) lags behind.

The test package is downloadable at:
http://www.sanmayce.com/Downloads/_KAZE_strstr_SHORT-SHOWDOWN_7sun_vs_7_vs_7sunhorse_vs_7deuce_vs_BMF.7z

The test package contains:
D:\_KAZE_strstr_SHORT-SHOWDOWN_7sun_vs_7_vs_7sunhorse_vs_7deuce_vs_BMF>dir

01/22/2012  07:00 AM            59,198 FH_Flensburg.zip
01/22/2012  07:00 AM       206,908,949 OSHO.TXT
01/22/2012  07:00 AM           469,283 Results_Pentium_Merom_2166Mhz.txt
01/22/2012  07:00 AM               175 RUNME_IT_TAKES_some_20_MINUTES.BAT
01/22/2012  07:00 AM               191 strstr_COMPILE_Microsoft.bat
01/22/2012  07:00 AM           603,625 strstr_SHORT-SHOWDOWN.c
01/22/2012  07:00 AM           633,053 strstr_SHORT-SHOWDOWN.cod
01/22/2012  07:00 AM           108,032 strstr_SHORT-SHOWDOWN_Microsoft_v16_Ox.exe

D:\_KAZE_strstr_SHORT-SHOWDOWN_7sun_vs_7_vs_7sunhorse_vs_7deuce_vs_BMF>
A very easy-to-the-eyes/mind article at http://www.inf.fh-flensburg.de/lang/algorithmen/pattern/bm.htm
Here I say 'Triple THANKS' to Prof. Dr. Hans Werner Lang (Institut für Medieninformatik und Technische Informatik Fachhochschule Flensburg), I used your code (after a straightforward port). On his page (http://www.inf.fh-flensburg.de/lang/) there are very well presented algorithms - a very good resource indeed.
I expect Railgun_Quadruplet_7sunhorse to become a gun of choice but only on CPUs with FAR more effective branching/memory handling than my Pentium Merom.

I still do not understand Turbo Boyer-Moore algorithm, maybe it holds a significant Skip-Performance boost.
This project is under researching, that is why I have called it a quest. Anyone who can point out faster functions is welcomed to share them with us here.

My favorite UNDERRATED gun has main loop as follows:
; 3209 :     i=0;
; 3210 :     while (i <= cbTarget-cbPattern-1) {

  010e6    8b 4c 24 20     mov     ecx, DWORD PTR _cbTarget$GSCopy$[esp+2104]
  010ea    89 44 24 28     mov     DWORD PTR _ulHashPattern$[esp+2104], eax
  010ee    33 c0         xor     eax, eax
  010f0    2b ca         sub     ecx, edx
  010f2    89 4c 24 14     mov     DWORD PTR tv554[esp+2104], ecx
  010f6    8d 0c 13     lea     ecx, DWORD PTR [ebx+edx]
  010f9    89 44 24 10     mov     DWORD PTR _AdvanceHopperGrass$[esp+2104], eax
  010fd    89 4c 24 18     mov     DWORD PTR tv451[esp+2104], ecx
$LN13@Railgun_Qu@5:

; 3211 :          ulHashTarget=*(unsigned long *)&pbTarget[i];
; 3212 :          if ( ulHashTarget == ulHashPattern)

  01101    8b 54 24 28     mov     edx, DWORD PTR _ulHashPattern$[esp+2104]
  01105    39 14 18     cmp     DWORD PTR [eax+ebx], edx
  01108    75 41         jne     SHORT $LN8@Railgun_Qu@5

; 3213 :              {
; 3214 :          count = countSTATIC;

  0110a    8b 7c 24 1c     mov     edi, DWORD PTR _countSTATIC$[esp+2104]

; 3215 :          while ( count !=0 && *(char *)(pbPattern+(countSTATIC-count)+4) == *(char *)(&pbTarget[i]+(countSTATIC-count)+4) ) { // if pattern length is 4 or 5 we have count=-1 and count=0 respectively i.e. no need of comparing in-between chars.

  0110e    85 ff         test     edi, edi
  01110    74 2d         je     SHORT $LN85@Railgun_Qu@5

; 3213 :              {
; 3214 :          count = countSTATIC;

  01112    8d 56 04     lea     edx, DWORD PTR [esi+4]
  01115    8d 74 18 04     lea     esi, DWORD PTR [eax+ebx+4]
  01119    8d a4 24 00 00
    00 00         npad     7
$LL10@Railgun_Qu@5:

; 3215 :          while ( count !=0 && *(char *)(pbPattern+(countSTATIC-count)+4) == *(char *)(&pbTarget[i]+(countSTATIC-count)+4) ) { // if pattern length is 4 or 5 we have count=-1 and count=0 respectively i.e. no need of comparing in-between chars.

  01120    8a 0a         mov     cl, BYTE PTR [edx]
  01122    3a 0e         cmp     cl, BYTE PTR [esi]
  01124    75 11         jne     SHORT $LN9@Railgun_Qu@5

; 3216 :                count--;

  01126    46         inc     esi
  01127    42         inc     edx
  01128    4f         dec     edi
  01129    75 f5         jne     SHORT $LL10@Railgun_Qu@5

; 3217 :          }
; 3218 :          if ( count == 0) Railgunhits++; //return(pbTarget+i);

  0112b    8b 74 24 24     mov     esi, DWORD PTR _pbPattern$GSCopy$[esp+2104]
  0112f    ff 05 00 00 00
    00         inc     DWORD PTR _Railgunhits
  01135    eb 14         jmp     SHORT $LN8@Railgun_Qu@5
$LN9@Railgun_Qu@5:
  01137    85 ff         test     edi, edi
  01139    75 0c         jne     SHORT $LN111@Railgun_Qu@5
  0113b    8b 74 24 24     mov     esi, DWORD PTR _pbPattern$GSCopy$[esp+2104]
$LN85@Railgun_Qu@5:
  0113f    ff 05 00 00 00
    00         inc     DWORD PTR _Railgunhits
  01145    eb 04         jmp     SHORT $LN8@Railgun_Qu@5
$LN111@Railgun_Qu@5:
  01147    8b 74 24 24     mov     esi, DWORD PTR _pbPattern$GSCopy$[esp+2104]
$LN8@Railgun_Qu@5:

; 3219 :          }
; 3220 : 
; 3221 : //         i+=cbPattern;
; 3222 : //         i= i - bm_bc2nd[(unsigned char)pbTarget[i]];
; 3223 : 
; 3224 : //if ( bm_bc[(unsigned char)pbTarget[i+cbPattern-1]] < ((cbPattern) - bm_bc2nd[(unsigned char)pbTarget[i+(cbPattern)]]) )
; 3225 : //         i= i + ((cbPattern) - bm_bc2nd[(unsigned char)pbTarget[i+(cbPattern)]]);
; 3226 : if ( bm_bc[(unsigned char)pbTarget[i+cbPattern-1]] < bm_bc2nd[(unsigned char)pbTarget[i+(cbPattern)]] )

  0114b    8b 54 24 18     mov     edx, DWORD PTR tv451[esp+2104]
  0114f    0f b6 4c 02 ff     movzx     ecx, BYTE PTR [edx+eax-1]
  01154    0f b6 14 02     movzx     edx, BYTE PTR [edx+eax]
  01158    8b 4c 8c 30     mov     ecx, DWORD PTR _bm_bc$[esp+ecx*4+2104]
  0115c    8b 94 94 30 04
    00 00         mov     edx, DWORD PTR _bm_bc2nd$[esp+edx*4+2104]
  01163    3b ca         cmp     ecx, edx
  01165    73 04         jae     SHORT $LN7@Railgun_Qu@5

; 3227 :          i= i + bm_bc2nd[(unsigned char)pbTarget[i+(cbPattern)]];

  01167    03 c2         add     eax, edx

; 3228 : else

  01169    eb 02         jmp     SHORT $LN6@Railgun_Qu@5
$LN7@Railgun_Qu@5:

; 3229 :          i= i + bm_bc[(unsigned char)pbTarget[i+cbPattern-1]];

  0116b    03 c1         add     eax, ecx
$LN6@Railgun_Qu@5:

; 3207 : 
; 3208 :     AdvanceHopperGrass = 0;
; 3209 :     i=0;
; 3210 :     while (i <= cbTarget-cbPattern-1) {

  0116d    8b 4c 24 14     mov     ecx, DWORD PTR tv554[esp+2104]

; 3230 : 
; 3231 : 
; 3232 :          AdvanceHopperGrass++;

  01171    ff 44 24 10     inc     DWORD PTR _AdvanceHopperGrass$[esp+2104]
  01175    49         dec     ecx
  01176    3b c1         cmp     eax, ecx
  01178    76 87         jbe     SHORT $LN13@Railgun_Qu@5

; 3233 :     }
The size is:
01178 - 01101 +1 +1 -4 = 75bytes, -4 because 'inc DWORD PTR _AdvanceHopperGrass$[esp+2104]' is only for statistical use.
]

Background

Function homepage:
http://www.sanmayce.com/Railgun/index.html

Note: In order to reproduce the tests shown on the picture (below) you need the test package (57.2 MB) downloadable at function's homepage, thanks Randor for pointing out.

Very informative resource on strstr search techniques at:

Tests done with 16 patterns with 197MB haystack (being a pure English book-like text).

Railgun_Benchmarked.png

The machine is a mainstream laptop on 2.1GHz with Intel Merom CPU.

Using the Code

The code is pretty simple and straightforward.

C++
// Here the strstr-like function Railgun_Quadruplet_6ppp is given,  
// written by Georgi 'Kaze', 2011-Sep-07.
//
// ### Mix(2in1) of Karp-Rabin & Boyer-Moore-Horspool algorithm [
// Caution: For better speed the case 'if (cbPattern==1)' was removed, 
// so Pattern must be longer than 1 char.
char * Railgun_Quadruplet_6ppp (char * pbTarget,
     char * pbPattern,
     unsigned long cbTarget,
     unsigned long cbPattern)
{
    char * pbTargetMax = pbTarget + cbTarget;
    register unsigned long  ulHashPattern;
    unsigned long ulHashTarget;
    //unsigned long count; //r.6+
    signed long count;
    //unsigned long countSTATIC; //r.6+
    signed long countSTATIC;
//  unsigned long countRemainder;

/*
    const unsigned char SINGLET = *(char *)(pbPattern);
    const unsigned long Quadruplet2nd = SINGLET<<8;
    const unsigned long Quadruplet3rd = SINGLET<<16;
    const unsigned long Quadruplet4th = SINGLET<<24;
*/
    unsigned char SINGLET;
    unsigned long Quadruplet2nd;
    unsigned long Quadruplet3rd;
    unsigned long Quadruplet4th;

    unsigned long  AdvanceHopperGrass;

    long i; //BMH needed
    int a, j, bm_bc[ASIZE]; //BMH needed
    unsigned char ch; //BMH needed
//    unsigned char lastch, firstch; //BMH needed

    if (cbPattern > cbTarget)
        return(NULL);

// Doesn't work when cbPattern = 1
// The next IF-fragment works very well with cbPattern>1, 
// OBVIOUSLY IT MUST BE UNROLLED(but crippled with less functionality) 
// SINCE either cbPattern=2 or cbPattern=3!
if ( cbPattern<4) {     // This IF makes me unhappy: it slows down from 
            // 390KB/clock to 367KB/clock for 'fast' pattern. 
            // This fragment(for 2..3 pattern lengths) is needed 
            // because I need a function different than strchr, but 
            // sticking to strstr i.e. lengths above 1 are to be handled.
        pbTarget = pbTarget+cbPattern;
        ulHashPattern = ( (*(char *)(pbPattern))<<8 ) + *(pbPattern+(cbPattern-1));
//        countSTATIC = cbPattern-2;

if ( cbPattern==3) {
    for ( ;; )
    {
        if ( ulHashPattern == ( (*(char *)(pbTarget-3))<<8 ) + *(pbTarget-1) ) {
         if ( *(char *)(pbPattern+1) == *(char *)(pbTarget-2) ) return((pbTarget-3));
        }
        if ( (char)(ulHashPattern>>8) != *(pbTarget-2) ) pbTarget++;
        pbTarget++;
        if (pbTarget > pbTargetMax)
            return(NULL);
    }
} else {
}
    for ( ;; )
    {
        // The line below gives for 'cbPattern'>=1:
        // Karp_Rabin_Kaze_4_OCTETS_hits/Karp_Rabin_Kaze_4_OCTETS_clocks: 4/543
        // Karp_Rabin_Kaze_4_OCTETS performance: 372KB/clock
/*
        if ( (ulHashPattern == ( (*(char *)(pbTarget-cbPattern))<<8 ) + 
        *(pbTarget-1)) && !memcmp(pbPattern, pbTarget-cbPattern, 
        (unsigned int)cbPattern) )
            return((long)(pbTarget-cbPattern));
*/

        // The fragment below gives for 'cbPattern'>=2:
        // Karp_Rabin_Kaze_4_OCTETS_hits/Karp_Rabin_Kaze_4_OCTETS_clocks: 4/546
        // Karp_Rabin_Kaze_4_OCTETS performance: 370KB/clock

/*
//For 2 and 3 [
        if ( ulHashPattern == ( (*(char *)(pbTarget-cbPattern))<<8 ) + *(pbTarget-1) ) {
//         count = countSTATIC;
         count = cbPattern-2;
//         while ( count && *(char *)(pbPattern+1+(countSTATIC-count)) == 
//         *(char *)(pbTarget-cbPattern+1+(countSTATIC-count)) ) {
         while ( count && *(char *)(pbPattern+1) == *(char *)(pbTarget-2) ) 
            { // Crippling i.e. only 2 and 3 chars are allowed!
               count--;
         }
         if ( count == 0) return((pbTarget-cbPattern));
        }
        if ( (char)(ulHashPattern>>8) != *(pbTarget-cbPattern+1) ) pbTarget++;
//For 2 and 3 ]
*/


        if ( ulHashPattern == ( (*(char *)(pbTarget-2))<<8 ) + *(pbTarget-1) )
            return((pbTarget-2));
        if ( (char)(ulHashPattern>>8) != *(pbTarget-1) ) pbTarget++;


        // The fragment below gives for 'cbPattern'>=2:
    // Karp_Rabin_Kaze_4_OCTETS_hits/Karp_Rabin_Kaze_4_OCTETS_clocks: 4/554
    // Karp_Rabin_Kaze_4_OCTETS performance: 364KB/clock
/*
        if ( ulHashPattern == ( (*(char *)(pbTarget-cbPattern))<<8 ) + *(pbTarget-1) ) {
         count = countSTATIC>>2;
         countRemainder = countSTATIC % 4;

         while ( count && *(unsigned long *)(pbPattern+1+((count-1)<<2)) == 
        *(unsigned long *)(pbTarget-cbPattern+1+((count-1)<<2)) ) {
               count--;
         }
     //if (count == 0) {      // Disastrous degradation only from this line
                // (317KB/clock when 1+2x4+2+1 bytes pattern: 
                // 'skillessness'; 312KB/clock when 1+1x4+2+1 bytes 
                // pattern: 'underdog'), otherwise 368KB/clock.
         while ( countRemainder && *(char *)(pbPattern+1+
        (countSTATIC-countRemainder)) == *(char *)(pbTarget-cbPattern+1+
        (countSTATIC-countRemainder)) ) {
               countRemainder--;
         }
         //if ( countRemainder == 0) return((long)(pbTarget-cbPattern));
         if ( count+countRemainder == 0) return((long)(pbTarget-cbPattern));
         //}
        }
*/

        pbTarget++;
        if (pbTarget > pbTargetMax)
            return(NULL);
    }
} else { //if ( cbPattern<4)
if (cbTarget<961) // This value is arbitrary(don't know how exactly), 
        // it ensures(at least must) 
        // better performance than 'Boyer_Moore_Horspool'.
{
        pbTarget = pbTarget+cbPattern;
        ulHashPattern = *(unsigned long *)(pbPattern);
//        countSTATIC = cbPattern-1;

    //SINGLET = *(char *)(pbPattern);
    SINGLET = ulHashPattern & 0xFF;
    Quadruplet2nd = SINGLET<<8;
    Quadruplet3rd = SINGLET<<16;
    Quadruplet4th = SINGLET<<24;

    for ( ;; )
    {
    AdvanceHopperGrass = 0;
    ulHashTarget = *(unsigned long *)(pbTarget-cbPattern);

        if ( ulHashPattern == ulHashTarget ) { // Three unnecessary comparisons here, 
        // but 'AdvanceHopperGrass' must be calculated - it has a higher priority.
//         count = countSTATIC;
//         while ( count && *(char *)(pbPattern+1+(countSTATIC-count)) == 
//        *(char *)(pbTarget-cbPattern+1+(countSTATIC-count)) ) {
//           if ( countSTATIC==AdvanceHopperGrass+count && SINGLET != 
//         *(char *)(pbTarget-cbPattern+1+(countSTATIC-count)) ) 
//        AdvanceHopperGrass++;
//               count--;
//         }
         count = cbPattern-1;
         while ( count && *(char *)(pbPattern+(cbPattern-count)) == 
                    *(char *)(pbTarget-count) ) {
           if ( cbPattern-1==AdvanceHopperGrass+count && SINGLET != 
            *(char *)(pbTarget-count) ) AdvanceHopperGrass++;
               count--;
         }
         if ( count == 0) return((pbTarget-cbPattern));
        } else { // The goal here: to avoid memory accesses by stressing the registers.
    if ( Quadruplet2nd != (ulHashTarget & 0x0000FF00) ) {
         AdvanceHopperGrass++;
         if ( Quadruplet3rd != (ulHashTarget & 0x00FF0000) ) {
              AdvanceHopperGrass++;
              if ( Quadruplet4th != (ulHashTarget & 0xFF000000) ) AdvanceHopperGrass++;
         }
    }
    }

    AdvanceHopperGrass++;

    pbTarget = pbTarget + AdvanceHopperGrass;
        if (pbTarget > pbTargetMax)
            return(NULL);
    }
} else { //if (cbTarget<961)
        //countSTATIC = cbPattern-2; //r.6+
        //countSTATIC = cbPattern-2-3; //r.6++
        countSTATIC = cbPattern-2-2; //r.6+++ This line fixes the bug from r.6++!
        ulHashPattern = *(unsigned long *)(pbPattern);
    /* Preprocessing */
    for (a=0; a < ASIZE; a++) bm_bc[a]=cbPattern;
    for (j=0; j < cbPattern-1; j++) bm_bc[pbPattern[j]]=cbPattern-j-1;

    /* Searching */
    //lastch=pbPattern[cbPattern-1];
    //firstch=pbPattern[0];
    i=0;
    while (i <= cbTarget-cbPattern) {
       ch=pbTarget[i+cbPattern-1];
       //if (ch ==lastch)
          //if (memcmp(&pbTarget[i],pbPattern,cbPattern-1) == 0) OUTPUT(i);
          //if (ch == lastch && pbTarget[i] == firstch && 
     //memcmp(&pbTarget[i],pbPattern,cbPattern-1) == 0) return(i);  
          // Kaze: The idea(to prevent execution of slower 'memcmp') 
          // is borrowed from Karp-Rabin i.e. to perform a slower check only 
          // when the target "looks like".
          //if (ch == pbPattern[cbPattern-1] && pbTarget[i] == pbPattern[0]) // r.6+
          //if (ch == pbPattern[cbPattern-1] && *(long *)&pbTarget[i] 
          // == *(long *)&pbPattern[0]) // No problema here since we have 4[+] 
          // long pattern here. Overlapping (1 byte recompared) when length=4, grmbl.
          if (ch == pbPattern[cbPattern-1] && 
        *(long *)&pbTarget[i] == ulHashPattern) // No problema here since 
            // we have 4[+] long pattern here. Overlapping 
            // (1 byte recompared) when length=4, grmbl.
             {
         count = countSTATIC;
         //while ( count && *(char *)(pbPattern+1+(countSTATIC-count)) == 
    //*(char *)(&pbTarget[i]+1+(countSTATIC-count)) ) { // r.6+
         while ( count !=0 && *(char *)(pbPattern+1+3+(countSTATIC-count)) == 
        *(char *)(&pbTarget[i]+1+3+(countSTATIC-count)) ) 
        {     // if pattern length is 4 or 5 we have count=-1 and count=0 
            // respectively i.e. no need of comparing in-between chars.
               count--;
         }
         if ( count <= 0) return(pbTarget+i);
         }
       i+=bm_bc[ch];
    }
    return(NULL);
} //if (cbTarget<961)
} //if ( cbPattern<4)
}
// ### Mix(2in1) of Karp-Rabin & Boyer-Moore-Horspool algorithm ]

Stomp stomp I arrived, here comes the Railgun_Quadruplet revision 7:

C++
// Caution: For better speed the case 'if (cbPattern==1)' was removed, 
// so Pattern must be longer than 1 char.
char * Railgun_Quadruplet_7 (char * pbTarget, char * pbPattern, 
        unsigned long cbTarget, unsigned long cbPattern)
{
    char * pbTargetMax = pbTarget + cbTarget;
    register unsigned long ulHashPattern;
    unsigned long ulHashTarget;
    signed long count;
    signed long countSTATIC;

    unsigned char SINGLET;
    unsigned long Quadruplet2nd;
    unsigned long Quadruplet3rd;
    unsigned long Quadruplet4th;
    unsigned long AdvanceHopperGrass;

    long i;
    int a, j, bm_bc[ASIZE];
    unsigned char ch;

    if (cbPattern > cbTarget) return(NULL);

    if (cbPattern<4) {
        pbTarget = pbTarget+cbPattern;
        ulHashPattern = ( (*(char *)(pbPattern))<<8 ) + 
                    *(pbPattern+(cbPattern-1));
        if (cbPattern==3) {
            for ( ;; ) {
                if ( ulHashPattern == 
                ( (*(char *)(pbTarget-3))<<8 ) + *(pbTarget-1) ) {
                    if ( *(char *)(pbPattern+1) == 
                    *(char *)(pbTarget-2) ) return((pbTarget-3));
                }
                if ( (char)(ulHashPattern>>8) != 
                        *(pbTarget-2) ) pbTarget++;
                pbTarget++;
                    if (pbTarget > pbTargetMax) return(NULL);
                }
        } else {
        }
            for ( ;; ) {
                if ( ulHashPattern == ( (*(char *)(pbTarget-2))<<8 ) 
                    + *(pbTarget-1) ) return((pbTarget-2));
                if ( (char)(ulHashPattern>>8) != *(pbTarget-1) ) 
                    pbTarget++;
                pbTarget++;
                if (pbTarget > pbTargetMax) return(NULL);
            }
    } else {
        if (cbTarget<961) {
            pbTarget = pbTarget+cbPattern;
            ulHashPattern = *(unsigned long *)(pbPattern);
            SINGLET = ulHashPattern & 0xFF;
            Quadruplet2nd = SINGLET<<8;
            Quadruplet3rd = SINGLET<<16;
            Quadruplet4th = SINGLET<<24;
            for ( ;; ) {
                AdvanceHopperGrass = 0;
                ulHashTarget = *(unsigned long *)(pbTarget-cbPattern);
                if ( ulHashPattern == ulHashTarget ) { 
                    count = cbPattern-1;
                    while ( count && *(char *)
                    (pbPattern+(cbPattern-count)) == 
                        *(char *)(pbTarget-count) ) {
                        if ( cbPattern-1==
                        AdvanceHopperGrass+count && 
                        SINGLET != *(char *)(pbTarget-count)
                             ) AdvanceHopperGrass++;
                        count--;
                    }
                    if ( count == 0) 
                        return((pbTarget-cbPattern));
                } else {
                    if ( Quadruplet2nd != (ulHashTarget 
                            & 0x0000FF00) ) {
                        AdvanceHopperGrass++;
                        if ( Quadruplet3rd != 
                        (ulHashTarget & 0x00FF0000) ) {
                            AdvanceHopperGrass++;
                            if ( Quadruplet4th != 
                            (ulHashTarget & 0xFF000000)
                             ) AdvanceHopperGrass++;
                        }
                    }
                }
                AdvanceHopperGrass++;
                pbTarget = pbTarget + AdvanceHopperGrass;
                if (pbTarget > pbTargetMax) return(NULL);
            }
        } else {
            countSTATIC = cbPattern-2-2; 
            ulHashPattern = *(unsigned long *)(pbPattern);
            for (a=0; a < ASIZE; a++) bm_bc[a]=cbPattern;
            for (j=0; j < cbPattern-1; j++) bm_bc[pbPattern[j]]=
                                cbPattern-j-1;
            i=0;
            while (i <= cbTarget-cbPattern) {
                if ( *(unsigned long *)&pbTarget[i] == 
                    ulHashPattern ) {
                    count = countSTATIC;
                    while ( count !=0 && *(char *)
                    (pbPattern+1+3+(countSTATIC-count)) == 
                    *(char *)(&pbTarget[i]+1+3+
                    (countSTATIC-count)) ) count--;
                    if ( count == 0) return(pbTarget+i);
                }
                i= i + bm_bc[(unsigned char)pbTarget[i+cbPattern-1]];
            }
            return(NULL);
        }
    }
}

The tweak (in BMH fragment) which speeds-up significantly the obsolete (already) r.6+++ is this:

C++
        //countSTATIC = cbPattern-2; //r.6+
        //countSTATIC = cbPattern-2-3;
        //countSTATIC = cbPattern-2-2; // r.6+++ I suppose that the awful degradation 
            // comes from 2bytes more (from either 'if (countSTATIC<0) 
            // countSTATIC=0;' or 'count >0' fixes) which make the 
            // function unfittable in code cache lines?!
        //countSTATIC = cbPattern-2-3;     // r.7- At last no recompared bytes 
                    // in-between chars
        countSTATIC = cbPattern-2-2; // r.7 
        ulHashPattern = *(unsigned long *)(pbPattern);
        //chPTR=(unsigned char *)&chchchch+3;
// Next line fixes the BUG from r.6++: but with awful speed degradation! 
// So the bug is fixed in the definitions by setting 'countSTATIC = cbPattern-2-2;', 
// bug appears only for patterns with lengths of 4, The setback is one unnecessary 
// comparison for 5bytes patterns, stupidly such setback exists (from before) for 
// 4bytes as well.
//if (countSTATIC<0) countSTATIC=0;
    /* Preprocessing */
    for (a=0; a < ASIZE; a++) bm_bc[a]=cbPattern;
    for (j=0; j < cbPattern-1; j++) bm_bc[pbPattern[j]]=cbPattern-j-1;

    /* Searching */
    //lastch=pbPattern[cbPattern-1];
    //firstch=pbPattern[0];
    i=0;
    while (i <= cbTarget-cbPattern) {
       //ch=pbTarget[i+cbPattern-1];
       //ch=pbTarget[i];
          //if ( pbTarget[i] == pbPattern[0] && *(unsigned long *)
    //&pbTarget[i+cbPattern-1-3] == ulHashPattern) // No problema here since 
            // we have 4[+] long pattern here. Overlapping 
            // (1 byte recompared) when length=4, grmbl.
          if ( *(unsigned long *)&pbTarget[i] == ulHashPattern ) // The lesson I learned 
            // from r.7- now applied in r.7: instead of extracting 'ch' 
            // having higher address now the lower address is extracted 
            // first in order (hopefully, the test confirms it) the next 
            // 32bytes (including 'ch') to be cached i.e. to comparison 
            // part is faster.
             {
         count = countSTATIC;
         while ( count !=0 && *(char *)(pbPattern+(countSTATIC-count)+4) == 
        *(char *)(&pbTarget[i]+(countSTATIC-count)+4) ) { // if pattern length 
        // is 4 or 5 we have count=-1 and count=0 respectively i.e. no need of 
        // comparing in-between chars.
               count--;
         }
         if ( count == 0) return(pbTarget+i);
         }
       i= i + bm_bc[(unsigned char)pbTarget[i+cbPattern-1]];
    }
    return(NULL);

In order to see what path I had to traverse to make this tweak, you may see the discussion board (r.7m thread). The tests below show NOT the full boost simply because they average times for all lengths thus losing/diminishing the high speeds times (being 4-3-2 times lesser compared with pattern lengths 2-3-4-5-6).

D:\_KAZE_new-stuff>strstr_SHORT-SHOWDOWN_Microsoft_v16_Ox.exe
strstr_SHORT-SHOWDOWN, revision 7, written by Kaze.
Input Pattern(up to 19+2000 chars): frustration

Doing Search for Pattern(11bytes) into String(206908949bytes) as-one-line ...
WARNING! The next line works with BMH only for pattern 4[+] long, 
otherwise (for 2 and 3) other searcher takes over!
Railgun_6pp_hits/Railgun_6pp_clocks: 1355/101
Railgun_6pp performance: 2000KB/clock

Doing Search for Pattern(11bytes) into String(206908949bytes) as-one-line ...
WARNING! The next line works with BMH only for pattern 4[+] long, 
otherwise (for 2 and 3) other searcher takes over!
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 1355/91
Railgun_Quadruplet_7 performance: 2220KB/clock

Note: Executing the next two tests 256 times i.e. the search is for 256x8x2 patterns!

Doing Search for 8x2 Patterns into String(206908949bytes) as-one-line ... 
Not-Counting-Hits-Just-Returns-First-One
Found ('an') at 157 position, Railgun_Quadruplet_6pp performance: 0KB/clock
Found ('to') at 853 position, Railgun_Quadruplet_6pp performance: 0KB/clock
Found ('TDK') at -4325408 position, Railgun_Quadruplet_6pp performance: 922KB/clock
Found ('the') at 511 position, Railgun_Quadruplet_6pp performance: 0KB/clock
Found ('fast') at 42381 position, Railgun_Quadruplet_6pp performance: 41KB/clock
Found ('easy') at 30163 position, Railgun_Quadruplet_6pp performance: 29KB/clock
Found ('grmbl') at -4325408 position, Railgun_Quadruplet_6pp performance: 1167KB/clock
Found ('email') at 69587982 position, Railgun_Quadruplet_6pp performance: 1078KB/clock
Found ('pasting') at 134600126 position, Railgun_Quadruplet_6pp performance: 1383KB/clock
Found ('amazing') at 1629806 position, Railgun_Quadruplet_6pp performance: 93KB/clock
Found ('underdog') at 137424498 position, 
            Railgun_Quadruplet_6pp performance: 1698KB/clock
Found ('superdog') at -4325408 position, Railgun_Quadruplet_6pp performance: 1603KB/clock
Found ('participants') at 16060 position, Railgun_Quadruplet_6pp performance: 15KB/clock
Found ('skillessness') at -4325408 position, 
    Railgun_Quadruplet_6pp performance: 1836KB/clock
Found ('I should have known') at 41831880 position, 
    Railgun_Quadruplet_6pp performance: 2403KB/clock
Found ('human consciousness') at 3863 position, 
    Railgun_Quadruplet_6pp performance: 3KB/clock
Railgun_Quadruplet_6pp 8x2 i.e. average performance: 1342KB/clock
ReallyTraversed: 310,477,846,528 bytes

Doing Search for 8x2 Patterns into String(206908949bytes) as-one-line ... 
Not-Counting-Hits-Just-Returns-First-One
Found ('an') at 157 position, Railgun_Quadruplet_7 performance: 0KB/clock
Found ('to') at 853 position, Railgun_Quadruplet_7 performance: 0KB/clock
Found ('TDK') at -4325408 position, Railgun_Quadruplet_7 performance: 990KB/clock
Found ('the') at 511 position, Railgun_Quadruplet_7 performance: 0KB/clock
Found ('fast') at 42381 position, Railgun_Quadruplet_7 performance: 41KB/clock
Found ('easy') at 30163 position, Railgun_Quadruplet_7 performance: 29KB/clock
Found ('grmbl') at -4325408 position, Railgun_Quadruplet_7 performance: 1069KB/clock
Found ('email') at 69587982 position, Railgun_Quadruplet_7 performance: 1078KB/clock
Found ('pasting') at 134600126 position, Railgun_Quadruplet_7 performance: 1383KB/clock
Found ('amazing') at 1629806 position, Railgun_Quadruplet_7 performance: 1591KB/clock
Found ('underdog') at 137424498 position, Railgun_Quadruplet_7 performance: 1698KB/clock
Found ('superdog') at -4325408 position, Railgun_Quadruplet_7 performance: 1422KB/clock
Found ('participants') at 16060 position, Railgun_Quadruplet_7 performance: 15KB/clock
Found ('skillessness') at -4325408 position, 
    Railgun_Quadruplet_7 performance: 2149KB/clock
Found ('I should have known') at 41831880 position, 
    Railgun_Quadruplet_7 performance: 2403KB/clock
Found ('human consciousness') at 3863 position, 
    Railgun_Quadruplet_7 performance: 3KB/clock
Railgun_Quadruplet_7 8x2 i.e. average performance: 1354KB/clock
ReallyTraversed: 310,477,846,528 bytes

Doing Search for 8x2 Patterns into String(206908949bytes) as-one-line ...
Found ('an') 1987797 time(s), BM_HORSPOOL performance: 293KB/clock
Found ('to') 1076629 time(s), BM_HORSPOOL performance: 293KB/clock
Found ('TDK') 0 time(s), BM_HORSPOOL performance: 516KB/clock
Found ('the') 2114180 time(s), BM_HORSPOOL performance: 379KB/clock
Found ('fast') 5945 time(s), BM_HORSPOOL performance: 585KB/clock
Found ('easy') 5191 time(s), BM_HORSPOOL performance: 585KB/clock
Found ('grmbl') 0 time(s), BM_HORSPOOL performance: 759KB/clock
Found ('email') 1 time(s), BM_HORSPOOL performance: 756KB/clock
Found ('pasting') 2 time(s), BM_HORSPOOL performance: 918KB/clock
Found ('amazing') 323 time(s), BM_HORSPOOL performance: 1074KB/clock
Found ('underdog') 4 time(s), BM_HORSPOOL performance: 1069KB/clock
Found ('superdog') 0 time(s), BM_HORSPOOL performance: 1074KB/clock
Found ('participants') 147 time(s), BM_HORSPOOL performance: 1603KB/clock
Found ('skillessness') 0 time(s), BM_HORSPOOL performance: 1422KB/clock
Found ('I should have known') 1 time(s), BM_HORSPOOL performance: 1433KB/clock
Found ('human consciousness') 519 time(s), BM_HORSPOOL performance: 1820KB/clock
BM_HORSPOOL 8x2 i.e. average performance: 666KB/clock

Doing Search for 8x2 Patterns into String(206908949bytes) as-one-line ...
Found ('an') 1987797 time(s), Railgun_6pp performance: 678KB/clock
Found ('to') 1076629 time(s), Railgun_6pp performance: 643KB/clock
Found ('TDK') 0 time(s), Railgun_6pp performance: 1174KB/clock
Found ('the') 2114180 time(s), Railgun_6pp performance: 678KB/clock
Found ('fast') 5945 time(s), Railgun_6pp performance: 918KB/clock
Found ('easy') 5191 time(s), Railgun_6pp performance: 990KB/clock
Found ('grmbl') 0 time(s), Railgun_6pp performance: 1287KB/clock
Found ('email') 1 time(s), Railgun_6pp performance: 1167KB/clock
Found ('pasting') 2 time(s), Railgun_6pp performance: 1603KB/clock
Found ('amazing') 323 time(s), Railgun_6pp performance: 1603KB/clock
Found ('underdog') 4 time(s), Railgun_6pp performance: 1603KB/clock
Found ('superdog') 0 time(s), Railgun_6pp performance: 1820KB/clock
Found ('participants') 147 time(s), Railgun_6pp performance: 2149KB/clock
Found ('skillessness') 0 time(s), Railgun_6pp performance: 2126KB/clock
Found ('I should have known') 1 time(s), Railgun_6pp performance: 2557KB/clock
Found ('human consciousness') 519 time(s), Railgun_6pp performance: 2126KB/clock
Railgun_6pp 8x2 i.e. average performance: 1218KB/clock

Doing Search for 8x2 Patterns into String(206908949bytes) as-one-line ...
Found ('an') 1987797 time(s), Railgun_Quadruplet_7 performance: 643KB/clock
Found ('to') 1076629 time(s), Railgun_Quadruplet_7 performance: 678KB/clock
Found ('TDK') 0 time(s), Railgun_Quadruplet_7 performance: 1074KB/clock
Found ('the') 2114180 time(s), Railgun_Quadruplet_7 performance: 643KB/clock
Found ('fast') 5945 time(s), Railgun_Quadruplet_7 performance: 1074KB/clock
Found ('easy') 5191 time(s), Railgun_Quadruplet_7 performance: 1074KB/clock
Found ('grmbl') 0 time(s), Railgun_Quadruplet_7 performance: 1287KB/clock
Found ('email') 1 time(s), Railgun_Quadruplet_7 performance: 1167KB/clock
Found ('pasting') 2 time(s), Railgun_Quadruplet_7 performance: 1603KB/clock
Found ('amazing') 323 time(s), Railgun_Quadruplet_7 performance: 1603KB/clock
Found ('underdog') 4 time(s), Railgun_Quadruplet_7 performance: 1820KB/clock
Found ('superdog') 0 time(s), Railgun_Quadruplet_7 performance: 1603KB/clock
Found ('participants') 147 time(s), Railgun_Quadruplet_7 performance: 2557KB/clock
Found ('skillessness') 0 time(s), Railgun_Quadruplet_7 performance: 2126KB/clock
Found ('I should have known') 1 time(s), Railgun_Quadruplet_7 performance: 2557KB/clock
Found ('human consciousness') 519 time(s), 
        Railgun_Quadruplet_7 performance: 2557KB/clock
Railgun_Quadruplet_7 8x2 i.e. average performance: 1232KB/clock

^C
D:\_KAZE_new-stuff>

In my opinion, it is one of the seven commandments of speed: do not put the cart before the horse, i.e., for adjacent (32?) memory accesses first extract the byte with lowest address. strstr_SHORT-SHOWDOWN_Microsoft_v16_Ox.exe, revision 7, some quick inputs:

D:\_KAZE_new-stuff>strstr_SHORT-SHOWDOWN_Microsoft_v16_Ox.exe
strstr_SHORT-SHOWDOWN, revision 7, written by Kaze.
Input Pattern(up to 19+2000 chars): fast

Doing Search for Pattern(4bytes) into String(206908949bytes) as-one-line ...
WARNING! The next line works with BMH only for pattern 4[+] long, 
otherwise (for 2 and 3) other searcher takes over!
Railgun_6pp_hits/Railgun_6pp_clocks: 5945/215
Railgun_6pp performance: 939KB/clock

Doing Search for Pattern(4bytes) into String(206908949bytes) as-one-line ...
WARNING! The next line works with BMH only for pattern 4[+] long, 
otherwise (for 2 and 3) other searcher takes over!
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 5945/193
Railgun_Quadruplet_7 performance: 1046KB/clock

^C

...

Input Pattern(up to 19+2000 chars): from
Railgun_6pp_hits/Railgun_6pp_clocks: 83155/196
Railgun_6pp performance: 1030KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 83155/194
Railgun_Quadruplet_7 performance: 1041KB/clock

Input Pattern(up to 19+2000 chars): rabea
Railgun_6pp_hits/Railgun_6pp_clocks: 0/176
Railgun_6pp performance: 1148KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/162
Railgun_Quadruplet_7 performance: 1247KB/clock

Input Pattern(up to 19+2000 chars): makes
Railgun_6pp_hits/Railgun_6pp_clocks: 9281/172
Railgun_6pp performance: 1174KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 9281/162
Railgun_Quadruplet_7 performance: 1247KB/clock

Input Pattern(up to 19+2000 chars): monkey
Railgun_6pp_hits/Railgun_6pp_clocks: 1691/141
Railgun_6pp performance: 1433KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 1691/139
Railgun_Quadruplet_7 performance: 1453KB/clock

Input Pattern(up to 19+2000 chars): punny
Railgun_6pp_hits/Railgun_6pp_clocks: 0/156
Railgun_6pp performance: 1295KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/155
Railgun_Quadruplet_7 performance: 1303KB/clock

Input Pattern(up to 19+2000 chars): funny
Railgun_6pp_hits/Railgun_6pp_clocks: 179/157
Railgun_6pp performance: 1287KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 179/156
Railgun_Quadruplet_7 performance: 1295KB/clock

Input Pattern(up to 19+2000 chars): ramjet
Railgun_6pp_hits/Railgun_6pp_clocks: 0/152
Railgun_6pp performance: 1329KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/137
Railgun_Quadruplet_7 performance: 1474KB/clock

Input Pattern(up to 19+2000 chars): fallen
Railgun_6pp_hits/Railgun_6pp_clocks: 2578/151
Railgun_6pp performance: 1338KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 2578/138
Railgun_Quadruplet_7 performance: 1464KB/clock

Input Pattern(up to 19+2000 chars): elephant
Railgun_6pp_hits/Railgun_6pp_clocks: 1198/124
Railgun_6pp performance: 1629KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 1198/112
Railgun_Quadruplet_7 performance: 1804KB/clock

Input Pattern(up to 19+2000 chars): layalina
Railgun_6pp_hits/Railgun_6pp_clocks: 0/121
Railgun_6pp performance: 1669KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/110
Railgun_Quadruplet_7 performance: 1836KB/clock

Input Pattern(up to 19+2000 chars): I should have known
Railgun_6pp_hits/Railgun_6pp_clocks: 1/89
Railgun_6pp performance: 2270KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 1/84
Railgun_Quadruplet_7 performance: 2405KB/clock

Input Pattern(up to 19+2000 chars): human consciousness
Railgun_6pp_hits/Railgun_6pp_clocks: 519/80
Railgun_6pp performance: 2525KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 519/75
Railgun_Quadruplet_7 performance: 2694KB/clock

Input Pattern(up to 19+2000 chars): But he's looking right through me
Railgun_6pp_hits/Railgun_6pp_clocks: 0/84
Railgun_6pp performance: 2405KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/76
Railgun_Quadruplet_7 performance: 2658KB/clock

Input Pattern(up to 19+2000 chars): I'm living in an age that calls darkness light
Railgun_6pp_hits/Railgun_6pp_clocks: 0/72
Railgun_6pp performance: 2806KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/67
Railgun_Quadruplet_7 performance: 3015KB/clock

Input Pattern(up to 19+2000 chars): Following the wanderings of a dream - 
a dream that keeps my soul alive
Railgun_6pp_hits/Railgun_6pp_clocks: 0/69
Railgun_6pp performance: 2928KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/63
Railgun_Quadruplet_7 performance: 3207KB/clock

Input Pattern(up to 19+2000 chars): I notice what matters and 
I got nothing to lose but darkness and shadows
Railgun_6pp_hits/Railgun_6pp_clocks: 0/66
Railgun_6pp performance: 3061KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/63
Railgun_Quadruplet_7 performance: 3207KB/clock

Input Pattern(up to 19+2000 chars): Beth Ditto
Railgun_6pp_hits/Railgun_6pp_clocks: 0/118
Railgun_6pp performance: 1712KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/107
Railgun_Quadruplet_7 performance: 1888KB/clock

Input Pattern(up to 19+2000 chars): SR71
Railgun_6pp_hits/Railgun_6pp_clocks: 0/175
Railgun_6pp performance: 1154KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/182
Railgun_Quadruplet_7 performance: 1110KB/clock

Input Pattern(up to 19+2000 chars): Apache
Railgun_6pp_hits/Railgun_6pp_clocks: 1/162
Railgun_6pp performance: 1247KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 1/133
Railgun_Quadruplet_7 performance: 1519KB/clock

Input Pattern(up to 19+2000 chars): Fibonacci
Railgun_6pp_hits/Railgun_6pp_clocks: 0/106
Railgun_6pp performance: 1906KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/98
Railgun_Quadruplet_7 performance: 2061KB/clock

Input Pattern(up to 19+2000 chars): Tesla
Railgun_6pp_hits/Railgun_6pp_clocks: 0/175
Railgun_6pp performance: 1154KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/160
Railgun_Quadruplet_7 performance: 1262KB/clock

Input Pattern(up to 19+2000 chars): Albert
Railgun_6pp_hits/Railgun_6pp_clocks: 932/154
Railgun_6pp performance: 1312KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 932/137
Railgun_Quadruplet_7 performance: 1474KB/clock

Input Pattern(up to 19+2000 chars): Toshiba
Railgun_6pp_hits/Railgun_6pp_clocks: 0/130
Railgun_6pp performance: 1554KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/121
Railgun_Quadruplet_7 performance: 1669KB/clock

Input Pattern(up to 19+2000 chars): Orion
Railgun_6pp_hits/Railgun_6pp_clocks: 1/175
Railgun_6pp performance: 1154KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 1/160
Railgun_Quadruplet_7 performance: 1262KB/clock

Input Pattern(up to 19+2000 chars): pharaoh
Railgun_6pp_hits/Railgun_6pp_clocks: 26/129
Railgun_6pp performance: 1566KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 26/122
Railgun_Quadruplet_7 performance: 1656KB/clock

Input Pattern(up to 19+2000 chars): lemon
Railgun_6pp_hits/Railgun_6pp_clocks: 33/178
Railgun_6pp performance: 1135KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 33/161
Railgun_Quadruplet_7 performance: 1255KB/clock

Input Pattern(up to 19+2000 chars): grammar
Railgun_6pp_hits/Railgun_6pp_clocks: 218/123
Railgun_6pp performance: 1642KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 218/117
Railgun_Quadruplet_7 performance: 1727KB/clock

D:\_KAZE_new-stuff>

Grmbl, pattern 'SR71' compared to 'fast' gives me a hint how things are far from 'fastest'.

Points of Interest

The interesting part about tests I have done: 'memcmp' and 'memcpy' when inlined (properly) give much better results. The thing which pop-ups again-and-again: Never take anything for granted. A bug (which doesn't affect the previous results, correct) was crushed, now r.6+++ is ready to go/hit.

As you can see from the C (commentless and tabulated) and ASM sources (see the top of the page for downloading):
The Microsoft 16.00.30319.01 compiled Railgun_Quadruplet_6ppp is 00c6b-00a00+5 = 624 bytes long.
The Quadruplet (where cbTarget<961) fragment loop is 00bc3-00b17+5 = 177 bytes long.
The Boyer-Moore-Horspool (where cbTarget>=961) fragment loop is 00c61-00c20+2 = 67 bytes long.
An interesting question for more experienced coders:
In case of reducing the BMH loop down by 3 bytes (i.e. to 64 bytes), can we obtain some additional/significant boost?

History

  • Revision 6+++ written 2011-Sep-07
  • Revision 7 written 2011-Sep-25
  • Revision 7 'Sun' written 2012-Jan-22
  • Revision 7 'Elsiane' written 2012-Jan-24
  • Revision 7 'Gulliver' written 2012-Jan-26

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Other
Bulgaria Bulgaria
A Bulgarian old boy interested in search techniques, nothing special.

Comments and Discussions

Discussions on this specific version of this article. Add your comments on how to improve this article here. These comments will not be visible on the final published version of this article.