12,547,922 members (48,037 online)
Article
alternative version

127.4K views
69 bookmarked
Posted

# Fastest strstr-like function in C!?

, 25 Sep 2011 CPOL
 Rate this:
Tuned function for searching a needle in a haystack.
This is an old version of the currently published article.

## Introduction

The C strstr-like function presented below (written by Georgi 'Kaze') is named Railgun_Quadruplet_6ppp and is based on well-known Boyer-Moore-Horspool and simplified Karp-Rabin algorithms. The main goal: to achieve fastest execution for all kind of pattern/needle and string/haystack lengths (starting from 2 bytes).

Railgun_Quadruplet_6ppp is now obsolete due to arrival of Railgun_Quadruplet_7 which boosts the BMH fragment by more than a percent. See further below for its commentless and tabulated source.

## Background

Function homepage:
http://www.sanmayce.com/Railgun/index.html

Note: In order to reproduce the tests shown on the picture (below) you need the test package (57.2 MB) downloadable at function's homepage, thanks Randor for pointing out.

Very informative resource on strstr search techniques at:
http://www-igm.univ-mlv.fr/~lecroq/string/

#### Tests done with 16 patterns with 197MB haystack (being a pure English book-like text)

The machine is a mainstream laptop on 2.1GHz with Intel Merom CPU.

## Using the Code

The code is pretty simple and straightforward.

```// Here the strstr-like function Railgun_Quadruplet_6ppp is given,  written by Georgi 'Kaze', 2011-Sep-07.
//
// ### Mix(2in1) of Karp-Rabin & Boyer-Moore-Horspool algorithm [
// Caution: For better speed the case 'if (cbPattern==1)' was removed, so Pattern must be longer than 1 char.
char * Railgun_Quadruplet_6ppp (char * pbTarget,
char * pbPattern,
unsigned long cbTarget,
unsigned long cbPattern)
{
char * pbTargetMax = pbTarget + cbTarget;
register unsigned long  ulHashPattern;
unsigned long ulHashTarget;
//unsigned long count; //r.6+
signed long count;
//unsigned long countSTATIC; //r.6+
signed long countSTATIC;
//  unsigned long countRemainder;

/*
const unsigned char SINGLET = *(char *)(pbPattern);
const unsigned long Quadruplet2nd = SINGLET<<8;
const unsigned long Quadruplet3rd = SINGLET<<16;
const unsigned long Quadruplet4th = SINGLET<<24;
*/
unsigned char SINGLET;

long i; //BMH needed
int a, j, bm_bc[ASIZE]; //BMH needed
unsigned char ch; //BMH needed
//    unsigned char lastch, firstch; //BMH needed

if (cbPattern > cbTarget)
return(NULL);

// Doesn't work when cbPattern = 1
// The next IF-fragment works very well with cbPattern>1, OBVIOUSLY IT MUST BE UNROLLED(but crippled with less functionality) SINCE either cbPattern=2 or cbPattern=3!
if ( cbPattern<4) { // This IF makes me unhappy: it slows down from 390KB/clock to 367KB/clock for 'fast' pattern. This fragment(for 2..3 pattern lengths) is needed because I need a function different than strchr but sticking to strstr i.e. lengths above 1 are to be handled.
pbTarget = pbTarget+cbPattern;
ulHashPattern = ( (*(char *)(pbPattern))<<8 ) + *(pbPattern+(cbPattern-1));
//        countSTATIC = cbPattern-2;

if ( cbPattern==3) {
for ( ;; )
{
if ( ulHashPattern == ( (*(char *)(pbTarget-3))<<8 ) + *(pbTarget-1) ) {
if ( *(char *)(pbPattern+1) == *(char *)(pbTarget-2) ) return((pbTarget-3));
}
if ( (char)(ulHashPattern>>8) != *(pbTarget-2) ) pbTarget++;
pbTarget++;
if (pbTarget > pbTargetMax)
return(NULL);
}
} else {
}
for ( ;; )
{
// The line below gives for 'cbPattern'>=1:
// Karp_Rabin_Kaze_4_OCTETS_hits/Karp_Rabin_Kaze_4_OCTETS_clocks: 4/543
// Karp_Rabin_Kaze_4_OCTETS performance: 372KB/clock
/*
if ( (ulHashPattern == ( (*(char *)(pbTarget-cbPattern))<<8 ) + *(pbTarget-1)) && !memcmp(pbPattern, pbTarget-cbPattern, (unsigned int)cbPattern) )
return((long)(pbTarget-cbPattern));
*/

// The fragment below gives for 'cbPattern'>=2:
// Karp_Rabin_Kaze_4_OCTETS_hits/Karp_Rabin_Kaze_4_OCTETS_clocks: 4/546
// Karp_Rabin_Kaze_4_OCTETS performance: 370KB/clock

/*
//For 2 and 3 [
if ( ulHashPattern == ( (*(char *)(pbTarget-cbPattern))<<8 ) + *(pbTarget-1) ) {
//         count = countSTATIC;
count = cbPattern-2;
//         while ( count && *(char *)(pbPattern+1+(countSTATIC-count)) == *(char *)(pbTarget-cbPattern+1+(countSTATIC-count)) ) {
while ( count && *(char *)(pbPattern+1) == *(char *)(pbTarget-2) ) { // Crippling i.e. only 2 and 3 chars are allowed!
count--;
}
if ( count == 0) return((pbTarget-cbPattern));
}
if ( (char)(ulHashPattern>>8) != *(pbTarget-cbPattern+1) ) pbTarget++;
//For 2 and 3 ]
*/

if ( ulHashPattern == ( (*(char *)(pbTarget-2))<<8 ) + *(pbTarget-1) )
return((pbTarget-2));
if ( (char)(ulHashPattern>>8) != *(pbTarget-1) ) pbTarget++;

// The fragment below gives for 'cbPattern'>=2:
// Karp_Rabin_Kaze_4_OCTETS_hits/Karp_Rabin_Kaze_4_OCTETS_clocks: 4/554
// Karp_Rabin_Kaze_4_OCTETS performance: 364KB/clock
/*
if ( ulHashPattern == ( (*(char *)(pbTarget-cbPattern))<<8 ) + *(pbTarget-1) ) {
count = countSTATIC>>2;
countRemainder = countSTATIC % 4;

while ( count && *(unsigned long *)(pbPattern+1+((count-1)<<2)) == *(unsigned long *)(pbTarget-cbPattern+1+((count-1)<<2)) ) {
count--;
}
//if (count == 0) {  // Disastrous degradation only from this line(317KB/clock when 1+2x4+2+1 bytes pattern: 'skillessness'; 312KB/clock when 1+1x4+2+1 bytes pattern: 'underdog'), otherwise 368KB/clock.
while ( countRemainder && *(char *)(pbPattern+1+(countSTATIC-countRemainder)) == *(char *)(pbTarget-cbPattern+1+(countSTATIC-countRemainder)) ) {
countRemainder--;
}
//if ( countRemainder == 0) return((long)(pbTarget-cbPattern));
if ( count+countRemainder == 0) return((long)(pbTarget-cbPattern));
//}
}
*/

pbTarget++;
if (pbTarget > pbTargetMax)
return(NULL);
}
} else { //if ( cbPattern<4)
if (cbTarget<961) // This value is arbitrary(don't know how exactly), it ensures(at least must) better performance than 'Boyer_Moore_Horspool'.
{
pbTarget = pbTarget+cbPattern;
ulHashPattern = *(unsigned long *)(pbPattern);
//        countSTATIC = cbPattern-1;

//SINGLET = *(char *)(pbPattern);
SINGLET = ulHashPattern & 0xFF;

for ( ;; )
{
ulHashTarget = *(unsigned long *)(pbTarget-cbPattern);

if ( ulHashPattern == ulHashTarget ) { // Three unnecessary comparisons here, but 'AdvanceHopperGrass' must be calculated - it has a higher priority.
//         count = countSTATIC;
//         while ( count && *(char *)(pbPattern+1+(countSTATIC-count)) == *(char *)(pbTarget-cbPattern+1+(countSTATIC-count)) ) {
//               count--;
//         }
count = cbPattern-1;
while ( count && *(char *)(pbPattern+(cbPattern-count)) == *(char *)(pbTarget-count) ) {
count--;
}
if ( count == 0) return((pbTarget-cbPattern));
} else { // The goal here: to avoid memory accesses by stressing the registers.
if ( Quadruplet2nd != (ulHashTarget & 0x0000FF00) ) {
if ( Quadruplet3rd != (ulHashTarget & 0x00FF0000) ) {
}
}
}

if (pbTarget > pbTargetMax)
return(NULL);
}
} else { //if (cbTarget<961)
//countSTATIC = cbPattern-2; //r.6+
//countSTATIC = cbPattern-2-3; //r.6++
countSTATIC = cbPattern-2-2; //r.6+++ This line fixes the bug from r.6++!
ulHashPattern = *(unsigned long *)(pbPattern);
/* Preprocessing */
for (a=0; a < ASIZE; a++) bm_bc[a]=cbPattern;
for (j=0; j < cbPattern-1; j++) bm_bc[pbPattern[j]]=cbPattern-j-1;

/* Searching */
//lastch=pbPattern[cbPattern-1];
//firstch=pbPattern[0];
i=0;
while (i <= cbTarget-cbPattern) {
ch=pbTarget[i+cbPattern-1];
//if (ch ==lastch)
//if (memcmp(&pbTarget[i],pbPattern,cbPattern-1) == 0) OUTPUT(i);
//if (ch == lastch && pbTarget[i] == firstch && memcmp(&pbTarget[i],pbPattern,cbPattern-1) == 0) return(i);  // Kaze: The idea(to prevent execution of slower 'memcmp') is borrowed from Karp-Rabin i.e. to perform a slower check only when the target "looks like".
//if (ch == pbPattern[cbPattern-1] && pbTarget[i] == pbPattern[0]) // r.6+
//if (ch == pbPattern[cbPattern-1] && *(long *)&pbTarget[i] == *(long *)&pbPattern[0]) // No problema here since we have 4[+] long pattern here. Overlapping (1 byte recompared) when length=4, grmbl.
if (ch == pbPattern[cbPattern-1] && *(long *)&pbTarget[i] == ulHashPattern) // No problema here since we have 4[+] long pattern here. Overlapping (1 byte recompared) when length=4, grmbl.
{
count = countSTATIC;
//while ( count && *(char *)(pbPattern+1+(countSTATIC-count)) == *(char *)(&pbTarget[i]+1+(countSTATIC-count)) ) { // r.6+
while ( count !=0 && *(char *)(pbPattern+1+3+(countSTATIC-count)) == *(char *)(&pbTarget[i]+1+3+(countSTATIC-count)) ) { // if pattern length is 4 or 5 we have count=-1 and count=0 respectively i.e. no need of comparing in-between chars.
count--;
}
if ( count <= 0) return(pbTarget+i);
}
i+=bm_bc[ch];
}
return(NULL);
} //if (cbTarget<961)
} //if ( cbPattern<4)
}
// ### Mix(2in1) of Karp-Rabin & Boyer-Moore-Horspool algorithm ]```
Stomp stomp I arrived, here comes the Railgun_Quadruplet revision 7:
```// Caution: For better speed the case 'if (cbPattern==1)' was removed, so Pattern must be longer than 1 char.
char * Railgun_Quadruplet_7 (char * pbTarget, char * pbPattern, unsigned long cbTarget, unsigned long cbPattern)
{
char * pbTargetMax = pbTarget + cbTarget;
register unsigned long ulHashPattern;
unsigned long ulHashTarget;
signed long count;
signed long countSTATIC;

unsigned char SINGLET;

long i;
int a, j, bm_bc[ASIZE];
unsigned char ch;

if (cbPattern > cbTarget) return(NULL);

if (cbPattern<4) {
pbTarget = pbTarget+cbPattern;
ulHashPattern = ( (*(char *)(pbPattern))<<8 ) + *(pbPattern+(cbPattern-1));
if (cbPattern==3) {
for ( ;; ) {
if ( ulHashPattern == ( (*(char *)(pbTarget-3))<<8 ) + *(pbTarget-1) ) {
if ( *(char *)(pbPattern+1) == *(char *)(pbTarget-2) ) return((pbTarget-3));
}
if ( (char)(ulHashPattern>>8) != *(pbTarget-2) ) pbTarget++;
pbTarget++;
if (pbTarget > pbTargetMax) return(NULL);
}
} else {
}
for ( ;; ) {
if ( ulHashPattern == ( (*(char *)(pbTarget-2))<<8 ) + *(pbTarget-1) ) return((pbTarget-2));
if ( (char)(ulHashPattern>>8) != *(pbTarget-1) ) pbTarget++;
pbTarget++;
if (pbTarget > pbTargetMax) return(NULL);
}
} else {
if (cbTarget<961) {
pbTarget = pbTarget+cbPattern;
ulHashPattern = *(unsigned long *)(pbPattern);
SINGLET = ulHashPattern & 0xFF;
for ( ;; ) {
ulHashTarget = *(unsigned long *)(pbTarget-cbPattern);
if ( ulHashPattern == ulHashTarget ) {
count = cbPattern-1;
while ( count && *(char *)(pbPattern+(cbPattern-count)) == *(char *)(pbTarget-count) ) {
count--;
}
if ( count == 0) return((pbTarget-cbPattern));
} else {
if ( Quadruplet2nd != (ulHashTarget & 0x0000FF00) ) {
if ( Quadruplet3rd != (ulHashTarget & 0x00FF0000) ) {
}
}
}
if (pbTarget > pbTargetMax) return(NULL);
}
} else {
countSTATIC = cbPattern-2-2;
ulHashPattern = *(unsigned long *)(pbPattern);
for (a=0; a < ASIZE; a++) bm_bc[a]=cbPattern;
for (j=0; j < cbPattern-1; j++) bm_bc[pbPattern[j]]=cbPattern-j-1;
i=0;
while (i <= cbTarget-cbPattern) {
if ( *(unsigned long *)&pbTarget[i] == ulHashPattern ) {
count = countSTATIC;
while ( count !=0 && *(char *)(pbPattern+1+3+(countSTATIC-count)) == *(char *)(&pbTarget[i]+1+3+(countSTATIC-count)) ) count--;
if ( count == 0) return(pbTarget+i);
}
i= i + bm_bc[(unsigned char)pbTarget[i+cbPattern-1]];
}
return(NULL);
}
}
}
```
The tweak (in BMH fragment) which speed-ups significantly the obsolete (already) r.6+++ is this:
```        //countSTATIC = cbPattern-2; //r.6+
//countSTATIC = cbPattern-2-3;
//countSTATIC = cbPattern-2-2; // r.6+++ I suppose that the awful degradation comes from 2bytes more (from either 'if (countSTATIC<0) countSTATIC=0;' or 'count >0' fixes) which make the function unfittable in code cache lines?!
//countSTATIC = cbPattern-2-3; // r.7- At last no recompared bytes in-between chars
countSTATIC = cbPattern-2-2; // r.7
ulHashPattern = *(unsigned long *)(pbPattern);
//chPTR=(unsigned char *)&chchchch+3;
// Next line fixes the BUG from r.6++: but with awful speed degradation! So the bug is fixed in the definitions by setting 'countSTATIC = cbPattern-2-2;', bug appears only for patterns with lengths of 4, The setback is one unnecessary comparison for 5bytes patterns, stupidly such setback exists (from before) for 4bytes as well.
//if (countSTATIC<0) countSTATIC=0;
/* Preprocessing */
for (a=0; a < ASIZE; a++) bm_bc[a]=cbPattern;
for (j=0; j < cbPattern-1; j++) bm_bc[pbPattern[j]]=cbPattern-j-1;

/* Searching */
//lastch=pbPattern[cbPattern-1];
//firstch=pbPattern[0];
i=0;
while (i <= cbTarget-cbPattern) {
//ch=pbTarget[i+cbPattern-1];
//ch=pbTarget[i];
//if ( pbTarget[i] == pbPattern[0] && *(unsigned long *)&pbTarget[i+cbPattern-1-3] == ulHashPattern) // No problema here since we have 4[+] long pattern here. Overlapping (1 byte recompared) when length=4, grmbl.
if ( *(unsigned long *)&pbTarget[i] == ulHashPattern ) // The lesson I learned from r.7- now applied in r.7: instead of extracting 'ch' having higher address now the lower address is extracted first in order (hopefully, the test confirms it) the next 32bytes (including 'ch') to be cached i.e. to comparison part is faster.
{
count = countSTATIC;
while ( count !=0 && *(char *)(pbPattern+(countSTATIC-count)+4) == *(char *)(&pbTarget[i]+(countSTATIC-count)+4) ) { // if pattern length is 4 or 5 we have count=-1 and count=0 respectively i.e. no need of comparing in-between chars.
count--;
}
if ( count == 0) return(pbTarget+i);
}
i= i + bm_bc[(unsigned char)pbTarget[i+cbPattern-1]];
}
return(NULL);
```
In order to see what path I had to traverse to make this tweak you may see the discussion board (r.7m thread). The tests below show NOT the full boost simply because they average times for all lenghts thus losing/diminishing the high speeds times (being 4-3-2 times lesser compared with pattern lenghts 2-3-4-5-6).
```D:\_KAZE_new-stuff>strstr_SHORT-SHOWDOWN_Microsoft_v16_Ox.exe
strstr_SHORT-SHOWDOWN, revision 7, written by Kaze.
Input Pattern(up to 19+2000 chars): frustration

Doing Search for Pattern(11bytes) into String(206908949bytes) as-one-line ...
WARNING! The next line works with BMH only for pattern 4[+] long, otherwise (for 2 and 3) other searcher takes over!
Railgun_6pp_hits/Railgun_6pp_clocks: 1355/101
Railgun_6pp performance: 2000KB/clock

Doing Search for Pattern(11bytes) into String(206908949bytes) as-one-line ...
WARNING! The next line works with BMH only for pattern 4[+] long, otherwise (for 2 and 3) other searcher takes over!

Note: Executing the next two tests 256 times i.e. the search is for 256x8x2 patterns!

Doing Search for 8x2 Patterns into String(206908949bytes) as-one-line ... Not-Counting-Hits-Just-Returns-First-One
Found ('an') at 157 position, Railgun_Quadruplet_6pp performance: 0KB/clock
Found ('to') at 853 position, Railgun_Quadruplet_6pp performance: 0KB/clock
Found ('TDK') at -4325408 position, Railgun_Quadruplet_6pp performance: 922KB/clock
Found ('the') at 511 position, Railgun_Quadruplet_6pp performance: 0KB/clock
Found ('fast') at 42381 position, Railgun_Quadruplet_6pp performance: 41KB/clock
Found ('easy') at 30163 position, Railgun_Quadruplet_6pp performance: 29KB/clock
Found ('grmbl') at -4325408 position, Railgun_Quadruplet_6pp performance: 1167KB/clock
Found ('email') at 69587982 position, Railgun_Quadruplet_6pp performance: 1078KB/clock
Found ('pasting') at 134600126 position, Railgun_Quadruplet_6pp performance: 1383KB/clock
Found ('amazing') at 1629806 position, Railgun_Quadruplet_6pp performance: 93KB/clock
Found ('underdog') at 137424498 position, Railgun_Quadruplet_6pp performance: 1698KB/clock
Found ('superdog') at -4325408 position, Railgun_Quadruplet_6pp performance: 1603KB/clock
Found ('participants') at 16060 position, Railgun_Quadruplet_6pp performance: 15KB/clock
Found ('skillessness') at -4325408 position, Railgun_Quadruplet_6pp performance: 1836KB/clock
Found ('I should have known') at 41831880 position, Railgun_Quadruplet_6pp performance: 2403KB/clock
Found ('human consciousness') at 3863 position, Railgun_Quadruplet_6pp performance: 3KB/clock
Railgun_Quadruplet_6pp 8x2 i.e. average performance: 1342KB/clock
ReallyTraversed: 310,477,846,528 bytes

Doing Search for 8x2 Patterns into String(206908949bytes) as-one-line ... Not-Counting-Hits-Just-Returns-First-One
Found ('an') at 157 position, Railgun_Quadruplet_7 performance: 0KB/clock
Found ('to') at 853 position, Railgun_Quadruplet_7 performance: 0KB/clock
Found ('TDK') at -4325408 position, Railgun_Quadruplet_7 performance: 990KB/clock
Found ('the') at 511 position, Railgun_Quadruplet_7 performance: 0KB/clock
Found ('fast') at 42381 position, Railgun_Quadruplet_7 performance: 41KB/clock
Found ('easy') at 30163 position, Railgun_Quadruplet_7 performance: 29KB/clock
Found ('grmbl') at -4325408 position, Railgun_Quadruplet_7 performance: 1069KB/clock
Found ('email') at 69587982 position, Railgun_Quadruplet_7 performance: 1078KB/clock
Found ('pasting') at 134600126 position, Railgun_Quadruplet_7 performance: 1383KB/clock
Found ('amazing') at 1629806 position, Railgun_Quadruplet_7 performance: 1591KB/clock
Found ('underdog') at 137424498 position, Railgun_Quadruplet_7 performance: 1698KB/clock
Found ('superdog') at -4325408 position, Railgun_Quadruplet_7 performance: 1422KB/clock
Found ('participants') at 16060 position, Railgun_Quadruplet_7 performance: 15KB/clock
Found ('skillessness') at -4325408 position, Railgun_Quadruplet_7 performance: 2149KB/clock
Found ('I should have known') at 41831880 position, Railgun_Quadruplet_7 performance: 2403KB/clock
Found ('human consciousness') at 3863 position, Railgun_Quadruplet_7 performance: 3KB/clock
Railgun_Quadruplet_7 8x2 i.e. average performance: 1354KB/clock
ReallyTraversed: 310,477,846,528 bytes

Doing Search for 8x2 Patterns into String(206908949bytes) as-one-line ...
Found ('an') 1987797 time(s), BM_HORSPOOL performance: 293KB/clock
Found ('to') 1076629 time(s), BM_HORSPOOL performance: 293KB/clock
Found ('TDK') 0 time(s), BM_HORSPOOL performance: 516KB/clock
Found ('the') 2114180 time(s), BM_HORSPOOL performance: 379KB/clock
Found ('fast') 5945 time(s), BM_HORSPOOL performance: 585KB/clock
Found ('easy') 5191 time(s), BM_HORSPOOL performance: 585KB/clock
Found ('grmbl') 0 time(s), BM_HORSPOOL performance: 759KB/clock
Found ('email') 1 time(s), BM_HORSPOOL performance: 756KB/clock
Found ('pasting') 2 time(s), BM_HORSPOOL performance: 918KB/clock
Found ('amazing') 323 time(s), BM_HORSPOOL performance: 1074KB/clock
Found ('underdog') 4 time(s), BM_HORSPOOL performance: 1069KB/clock
Found ('superdog') 0 time(s), BM_HORSPOOL performance: 1074KB/clock
Found ('participants') 147 time(s), BM_HORSPOOL performance: 1603KB/clock
Found ('skillessness') 0 time(s), BM_HORSPOOL performance: 1422KB/clock
Found ('I should have known') 1 time(s), BM_HORSPOOL performance: 1433KB/clock
Found ('human consciousness') 519 time(s), BM_HORSPOOL performance: 1820KB/clock
BM_HORSPOOL 8x2 i.e. average performance: 666KB/clock

Doing Search for 8x2 Patterns into String(206908949bytes) as-one-line ...
Found ('an') 1987797 time(s), Railgun_6pp performance: 678KB/clock
Found ('to') 1076629 time(s), Railgun_6pp performance: 643KB/clock
Found ('TDK') 0 time(s), Railgun_6pp performance: 1174KB/clock
Found ('the') 2114180 time(s), Railgun_6pp performance: 678KB/clock
Found ('fast') 5945 time(s), Railgun_6pp performance: 918KB/clock
Found ('easy') 5191 time(s), Railgun_6pp performance: 990KB/clock
Found ('grmbl') 0 time(s), Railgun_6pp performance: 1287KB/clock
Found ('email') 1 time(s), Railgun_6pp performance: 1167KB/clock
Found ('pasting') 2 time(s), Railgun_6pp performance: 1603KB/clock
Found ('amazing') 323 time(s), Railgun_6pp performance: 1603KB/clock
Found ('underdog') 4 time(s), Railgun_6pp performance: 1603KB/clock
Found ('superdog') 0 time(s), Railgun_6pp performance: 1820KB/clock
Found ('participants') 147 time(s), Railgun_6pp performance: 2149KB/clock
Found ('skillessness') 0 time(s), Railgun_6pp performance: 2126KB/clock
Found ('I should have known') 1 time(s), Railgun_6pp performance: 2557KB/clock
Found ('human consciousness') 519 time(s), Railgun_6pp performance: 2126KB/clock
Railgun_6pp 8x2 i.e. average performance: 1218KB/clock

Doing Search for 8x2 Patterns into String(206908949bytes) as-one-line ...
Found ('an') 1987797 time(s), Railgun_Quadruplet_7 performance: 643KB/clock
Found ('to') 1076629 time(s), Railgun_Quadruplet_7 performance: 678KB/clock
Found ('TDK') 0 time(s), Railgun_Quadruplet_7 performance: 1074KB/clock
Found ('the') 2114180 time(s), Railgun_Quadruplet_7 performance: 643KB/clock
Found ('fast') 5945 time(s), Railgun_Quadruplet_7 performance: 1074KB/clock
Found ('easy') 5191 time(s), Railgun_Quadruplet_7 performance: 1074KB/clock
Found ('grmbl') 0 time(s), Railgun_Quadruplet_7 performance: 1287KB/clock
Found ('email') 1 time(s), Railgun_Quadruplet_7 performance: 1167KB/clock
Found ('pasting') 2 time(s), Railgun_Quadruplet_7 performance: 1603KB/clock
Found ('amazing') 323 time(s), Railgun_Quadruplet_7 performance: 1603KB/clock
Found ('underdog') 4 time(s), Railgun_Quadruplet_7 performance: 1820KB/clock
Found ('superdog') 0 time(s), Railgun_Quadruplet_7 performance: 1603KB/clock
Found ('participants') 147 time(s), Railgun_Quadruplet_7 performance: 2557KB/clock
Found ('skillessness') 0 time(s), Railgun_Quadruplet_7 performance: 2126KB/clock
Found ('I should have known') 1 time(s), Railgun_Quadruplet_7 performance: 2557KB/clock
Found ('human consciousness') 519 time(s), Railgun_Quadruplet_7 performance: 2557KB/clock
Railgun_Quadruplet_7 8x2 i.e. average performance: 1232KB/clock

^C
D:\_KAZE_new-stuff>
```
In my opinion it is one of the seven commandments of speed: do not put the cart before the horse i.e. for adjacent (32?) memory accesses first extract the byte with lowest address. strstr_SHORT-SHOWDOWN_Microsoft_v16_Ox.exe, revision 7, some quick inputs:
```D:\_KAZE_new-stuff>strstr_SHORT-SHOWDOWN_Microsoft_v16_Ox.exe
strstr_SHORT-SHOWDOWN, revision 7, written by Kaze.
Input Pattern(up to 19+2000 chars): fast

Doing Search for Pattern(4bytes) into String(206908949bytes) as-one-line ...
WARNING! The next line works with BMH only for pattern 4[+] long, otherwise (for 2 and 3) other searcher takes over!
Railgun_6pp_hits/Railgun_6pp_clocks: 5945/215
Railgun_6pp performance: 939KB/clock

Doing Search for Pattern(4bytes) into String(206908949bytes) as-one-line ...
WARNING! The next line works with BMH only for pattern 4[+] long, otherwise (for 2 and 3) other searcher takes over!

^C

...

Input Pattern(up to 19+2000 chars): from
Railgun_6pp_hits/Railgun_6pp_clocks: 83155/196
Railgun_6pp performance: 1030KB/clock

Input Pattern(up to 19+2000 chars): rabea
Railgun_6pp_hits/Railgun_6pp_clocks: 0/176
Railgun_6pp performance: 1148KB/clock

Input Pattern(up to 19+2000 chars): makes
Railgun_6pp_hits/Railgun_6pp_clocks: 9281/172
Railgun_6pp performance: 1174KB/clock

Input Pattern(up to 19+2000 chars): monkey
Railgun_6pp_hits/Railgun_6pp_clocks: 1691/141
Railgun_6pp performance: 1433KB/clock

Input Pattern(up to 19+2000 chars): punny
Railgun_6pp_hits/Railgun_6pp_clocks: 0/156
Railgun_6pp performance: 1295KB/clock

Input Pattern(up to 19+2000 chars): funny
Railgun_6pp_hits/Railgun_6pp_clocks: 179/157
Railgun_6pp performance: 1287KB/clock

Input Pattern(up to 19+2000 chars): ramjet
Railgun_6pp_hits/Railgun_6pp_clocks: 0/152
Railgun_6pp performance: 1329KB/clock

Input Pattern(up to 19+2000 chars): fallen
Railgun_6pp_hits/Railgun_6pp_clocks: 2578/151
Railgun_6pp performance: 1338KB/clock

Input Pattern(up to 19+2000 chars): elephant
Railgun_6pp_hits/Railgun_6pp_clocks: 1198/124
Railgun_6pp performance: 1629KB/clock

Input Pattern(up to 19+2000 chars): layalina
Railgun_6pp_hits/Railgun_6pp_clocks: 0/121
Railgun_6pp performance: 1669KB/clock

Input Pattern(up to 19+2000 chars): I should have known
Railgun_6pp_hits/Railgun_6pp_clocks: 1/89
Railgun_6pp performance: 2270KB/clock

Input Pattern(up to 19+2000 chars): human consciousness
Railgun_6pp_hits/Railgun_6pp_clocks: 519/80
Railgun_6pp performance: 2525KB/clock

Input Pattern(up to 19+2000 chars): But he's looking right through me
Railgun_6pp_hits/Railgun_6pp_clocks: 0/84
Railgun_6pp performance: 2405KB/clock

Input Pattern(up to 19+2000 chars): I'm living in an age that calls darkness light
Railgun_6pp_hits/Railgun_6pp_clocks: 0/72
Railgun_6pp performance: 2806KB/clock

Input Pattern(up to 19+2000 chars): Following the wanderings of a dream - a dream that keeps my soul alive
Railgun_6pp_hits/Railgun_6pp_clocks: 0/69
Railgun_6pp performance: 2928KB/clock

Input Pattern(up to 19+2000 chars): I notice what matters and I got nothing to lose but darkness and shadows
Railgun_6pp_hits/Railgun_6pp_clocks: 0/66
Railgun_6pp performance: 3061KB/clock

Input Pattern(up to 19+2000 chars): Beth Ditto
Railgun_6pp_hits/Railgun_6pp_clocks: 0/118
Railgun_6pp performance: 1712KB/clock

Input Pattern(up to 19+2000 chars): SR71
Railgun_6pp_hits/Railgun_6pp_clocks: 0/175
Railgun_6pp performance: 1154KB/clock

Input Pattern(up to 19+2000 chars): Apache
Railgun_6pp_hits/Railgun_6pp_clocks: 1/162
Railgun_6pp performance: 1247KB/clock

Input Pattern(up to 19+2000 chars): Fibonacci
Railgun_6pp_hits/Railgun_6pp_clocks: 0/106
Railgun_6pp performance: 1906KB/clock

Input Pattern(up to 19+2000 chars): Tesla
Railgun_6pp_hits/Railgun_6pp_clocks: 0/175
Railgun_6pp performance: 1154KB/clock

Input Pattern(up to 19+2000 chars): Albert
Railgun_6pp_hits/Railgun_6pp_clocks: 932/154
Railgun_6pp performance: 1312KB/clock

Input Pattern(up to 19+2000 chars): Toshiba
Railgun_6pp_hits/Railgun_6pp_clocks: 0/130
Railgun_6pp performance: 1554KB/clock

Input Pattern(up to 19+2000 chars): Orion
Railgun_6pp_hits/Railgun_6pp_clocks: 1/175
Railgun_6pp performance: 1154KB/clock

Input Pattern(up to 19+2000 chars): pharaoh
Railgun_6pp_hits/Railgun_6pp_clocks: 26/129
Railgun_6pp performance: 1566KB/clock

Input Pattern(up to 19+2000 chars): lemon
Railgun_6pp_hits/Railgun_6pp_clocks: 33/178
Railgun_6pp performance: 1135KB/clock

Input Pattern(up to 19+2000 chars): grammar
Railgun_6pp_hits/Railgun_6pp_clocks: 218/123
Railgun_6pp performance: 1642KB/clock

D:\_KAZE_new-stuff>
```
Grmbl, pattern 'SR71' compared to 'fast' gives me a hint how things are far from 'fastest'.

## Points of Interest

The interesting part about tests I have done: 'memcmp' and 'memcpy' when inlined (properly) give much better results. The thing which pop-ups again-and-again: Never take anything for granted. A bug (which doesn't affect the previous results, correct) was crushed, now r.6+++ is ready to go.

As you can see from the C (commentless and tabulated) and ASM sources (see the top of the page for downloading):
The Microsoft 16.00.30319.01 compiled Railgun_Quadruplet_6ppp is 00c6b-00a00+5 = 624 bytes long.
The Quadruplet (where cbTarget<961) fragment loop is 00bc3-00b17+5 = 177 bytes long.
The Boyer-Moore-Horspool (where cbTarget>=961) fragment loop is 00c61-00c20+2 = 67 bytes long.
An interesting question for more experienced coders:
In case of reducing the BMH loop down by 3 bytes (i.e. to 64 bytes), can we obtain some additional/significant boost?

## History

This is revision 6+++ written 2011-Sep-07.
Also revision 7 written 2011-Sep-25.

## Share

 Other Bulgaria
A Bulgarian old boy interested in search techniques, nothing special.

## You may also be interested in...

 Pro Pro

Discussions posted for the Published version of this article. Posting a message here will take you to the publicly available article in order to continue your conversation in public.

 First PrevNext
 Excellent Manikandan1018-May-14 3:09 Manikandan10 18-May-14 3:09
 Re: Excellent Sanmayce18-May-14 6:35 Sanmayce 18-May-14 6:35
 Re: Excellent Manikandan1020-May-14 22:55 Manikandan10 20-May-14 22:55
 Railgun_Swampshine_BailOut used as match finder in LZSS Sanmayce15-Apr-14 8:32 Sanmayce 15-Apr-14 8:32
 Round #2 Sanmayce17-Apr-14 5:46 Sanmayce 17-Apr-14 5:46
 Round #2, recuperation Sanmayce19-Apr-14 8:46 Sanmayce 19-Apr-14 8:46
 Round #3 Sanmayce21-Apr-14 6:35 Sanmayce 21-Apr-14 6:35
 Round #4 Sanmayce25-Apr-14 5:27 Sanmayce 25-Apr-14 5:27
 Round #5, secret weapon deployed Sanmayce27-Apr-14 4:43 Sanmayce 27-Apr-14 4:43
 Round #5, Aftermath: the supremacy of Nakamichi 'Kaidanji' Sanmayce2-May-14 5:58 Sanmayce 2-May-14 5:58
 Some question KarstenK4-Nov-13 22:43 KarstenK 4-Nov-13 22:43
 Re: Some question Sanmayce5-Nov-13 4:05 Sanmayce 5-Nov-13 4:05
 My vote of 4 Sanmayce3-Nov-13 7:19 Sanmayce 3-Nov-13 7:19
 My vote of 5 Volynsky Alex16-Oct-13 8:06 Volynsky Alex 16-Oct-13 8:06
 Re: My vote of 5 Sanmayce16-Oct-13 8:24 Sanmayce 16-Oct-13 8:24
 Re: My vote of 5 Volynsky Alex16-Oct-13 9:25 Volynsky Alex 16-Oct-13 9:25
 Where to find INSPIRATION? Sanmayce12-Oct-13 7:51 Sanmayce 12-Oct-13 7:51
 Superfast 16-threaded exact searcher: 16 Railguns in action and more Sanmayce9-Feb-13 6:16 Sanmayce 9-Feb-13 6:16
 Fastest 32bit hash function in C Sanmayce29-Sep-12 7:25 Sanmayce 29-Sep-12 7:25
 My latest Windows benchmark package Sanmayce20-Mar-12 7:26 Sanmayce 20-Mar-12 7:26
 Horspool vs BNDM Mischa Sandberg4-Mar-12 18:06 Mischa Sandberg 4-Mar-12 18:06
 Re: Horspool vs BNDM Sanmayce5-Mar-12 5:35 Sanmayce 5-Mar-12 5:35
 Re: Unix Mischa Sandberg5-Mar-12 14:40 Mischa Sandberg 5-Mar-12 14:40
 Re: Horspool vs BNDM Sanmayce7-Mar-12 3:23 Sanmayce 7-Mar-12 3:23
 BNDM_64 vs 'Tridentx64' Sanmayce7-Mar-12 5:05 Sanmayce 7-Mar-12 5:05
 The best practical on-line resource on String Matching Sanmayce7-Mar-12 5:41 Sanmayce 7-Mar-12 5:41
 Railgun_Doublet_x64: the fastest mini-memmem known to me Sanmayce26-Feb-12 4:47 Sanmayce 26-Feb-12 4:47
 Re: Railgun_Doublet_x64: the fastest mini-memmem known to me Mischa Sandberg4-Mar-12 17:52 Mischa Sandberg 4-Mar-12 17:52
 Re: Railgun_Doublet_x64: the fastest mini-memmem known to me Sanmayce5-Mar-12 5:02 Sanmayce 5-Mar-12 5:02
 Decision threshold for switching to 2byte horspool Mischa Sandberg20-Feb-12 19:38 Mischa Sandberg 20-Feb-12 19:38
 Re: Decision threshold for switching to 2byte horspool Sanmayce21-Feb-12 5:30 Sanmayce 21-Feb-12 5:30
 Try this test ... Mischa Sandberg17-Feb-12 10:14 Mischa Sandberg 17-Feb-12 10:14
 Re: Try this test ... Sanmayce18-Feb-12 6:07 Sanmayce 18-Feb-12 6:07
 Re: Try this test ... Mischa Sandberg19-Feb-12 12:43 Mischa Sandberg 19-Feb-12 12:43
 Hasherezade Mischa Sandberg1-Feb-12 10:45 Mischa Sandberg 1-Feb-12 10:45
 Re: Hasherezade Mischa Sandberg1-Feb-12 11:36 Mischa Sandberg 1-Feb-12 11:36
 Re: Hasherezade Sanmayce3-Feb-12 5:13 Sanmayce 3-Feb-12 5:13
 BNDM: more than a Top Gun Sanmayce7-Feb-12 6:25 Sanmayce 7-Feb-12 6:25
 'Gulliver' with Order 3 presence check Sanmayce9-Feb-12 1:20 Sanmayce 9-Feb-12 1:20
 GNU's strstr strikes back Sanmayce16-Feb-12 1:53 Sanmayce 16-Feb-12 1:53
 Re: GNU's strstr strikes back Mischa Sandberg16-Feb-12 11:39 Mischa Sandberg 16-Feb-12 11:39
 Railgun_Quadruplet_r8Triplet fastest for 3bytes patterns Sanmayce18-Feb-12 4:10 Sanmayce 18-Feb-12 4:10
 Re: Railgun_Quadruplet_r8Triplet fastest for 3bytes patterns Mischa Sandberg21-Feb-12 8:47 Mischa Sandberg 21-Feb-12 8:47
 Re: Railgun_Quadruplet_r8Triplet fastest for 3bytes patterns Sanmayce23-Feb-12 6:51 Sanmayce 23-Feb-12 6:51
 Re: Railgun_Quadruplet_r8Triplet fastest for 3bytes patterns Mischa Sandberg23-Feb-12 7:37 Mischa Sandberg 23-Feb-12 7:37
 Re: Railgun_Quadruplet_r8Triplet fastest for 3bytes patterns Sanmayce23-Feb-12 8:47 Sanmayce 23-Feb-12 8:47
 Re: Railgun_Quadruplet_r8Triplet fastest for 3bytes patterns Mischa Sandberg23-Feb-12 9:56 Mischa Sandberg 23-Feb-12 9:56
 Homo sapiens chromosome 1 under 'Hasherezade' fire Sanmayce21-Feb-12 4:25 Sanmayce 21-Feb-12 4:25
 32bit vs 64bit: my first knock-down Sanmayce24-Feb-12 7:24 Sanmayce 24-Feb-12 7:24
 Re: 32bit vs 64bit: my first knock-down Mischa Sandberg24-Feb-12 11:31 Mischa Sandberg 24-Feb-12 11:31
 Last Visit: 31-Dec-99 18:00     Last Update: 21-Oct-16 15:32 Refresh 123 Next »