12,998,875 members (31,901 online)
Article
alternative version

#### Stats

138.9K views
69 bookmarked
Posted 9 Sep 2011

# Fastest strstr-like function in C!?

, 13 Sep 2011
 Rate this:
Tuned function for searching a needle in a haystack.
This is an old version of the currently published article.

## Introduction

The C strstr-like function presented below (written by Georgi 'Kaze') is named Railgun_Quadruplet_6ppp and is based on well-known Boyer-Moore-Horspool and simplified Karp-Rabin algorithms. The main goal: to achieve fastest execution for all kind of pattern/needle and string/haystack lengths (starting from 2 bytes).

## Background

Function homepage:
http://www.sanmayce.com/Railgun/index.html

Very informative resource on strstr search techniques at:
http://www-igm.univ-mlv.fr/~lecroq/string/

#### Tests done with 16 patterns with 197MB haystack (being a pure English book-like text)

The machine is a mainstream laptop on 2.1GHz with Intel Merom CPU.

## Using the Code

The code is pretty simple and straightforward.

```// Here the strstr-like function Railgun_Quadruplet_6ppp is given,  written by Georgi 'Kaze', 2011-Sep-07.
//
// ### Mix(2in1) of Karp-Rabin & Boyer-Moore-Horspool algorithm [
// Caution: For better speed the case 'if (cbPattern==1)' was removed, so Pattern must be longer than 1 char.
char * Railgun_Quadruplet_6ppp (char * pbTarget,
char * pbPattern,
unsigned long cbTarget,
unsigned long cbPattern)
{
char * pbTargetMax = pbTarget + cbTarget;
register unsigned long  ulHashPattern;
unsigned long ulHashTarget;
//unsigned long count; //r.6+
signed long count;
//unsigned long countSTATIC; //r.6+
signed long countSTATIC;
//  unsigned long countRemainder;

/*
const unsigned char SINGLET = *(char *)(pbPattern);
const unsigned long Quadruplet2nd = SINGLET<<8;
const unsigned long Quadruplet3rd = SINGLET<<16;
const unsigned long Quadruplet4th = SINGLET<<24;
*/
unsigned char SINGLET;

long i; //BMH needed
int a, j, bm_bc[ASIZE]; //BMH needed
unsigned char ch; //BMH needed
//    unsigned char lastch, firstch; //BMH needed

if (cbPattern > cbTarget)
return(NULL);

// Doesn't work when cbPattern = 1
// The next IF-fragment works very well with cbPattern>1, OBVIOUSLY IT MUST BE UNROLLED(but crippled with less functionality) SINCE either cbPattern=2 or cbPattern=3!
if ( cbPattern<4) { // This IF makes me unhappy: it slows down from 390KB/clock to 367KB/clock for 'fast' pattern. This fragment(for 2..3 pattern lengths) is needed because I need a function different than strchr but sticking to strstr i.e. lengths above 1 are to be handled.
pbTarget = pbTarget+cbPattern;
ulHashPattern = ( (*(char *)(pbPattern))<<8 ) + *(pbPattern+(cbPattern-1));
//        countSTATIC = cbPattern-2;

if ( cbPattern==3) {
for ( ;; )
{
if ( ulHashPattern == ( (*(char *)(pbTarget-3))<<8 ) + *(pbTarget-1) ) {
if ( *(char *)(pbPattern+1) == *(char *)(pbTarget-2) ) return((pbTarget-3));
}
if ( (char)(ulHashPattern>>8) != *(pbTarget-2) ) pbTarget++;
pbTarget++;
if (pbTarget > pbTargetMax)
return(NULL);
}
} else {
}
for ( ;; )
{
// The line below gives for 'cbPattern'>=1:
// Karp_Rabin_Kaze_4_OCTETS_hits/Karp_Rabin_Kaze_4_OCTETS_clocks: 4/543
// Karp_Rabin_Kaze_4_OCTETS performance: 372KB/clock
/*
if ( (ulHashPattern == ( (*(char *)(pbTarget-cbPattern))<<8 ) + *(pbTarget-1)) && !memcmp(pbPattern, pbTarget-cbPattern, (unsigned int)cbPattern) )
return((long)(pbTarget-cbPattern));
*/

// The fragment below gives for 'cbPattern'>=2:
// Karp_Rabin_Kaze_4_OCTETS_hits/Karp_Rabin_Kaze_4_OCTETS_clocks: 4/546
// Karp_Rabin_Kaze_4_OCTETS performance: 370KB/clock

/*
//For 2 and 3 [
if ( ulHashPattern == ( (*(char *)(pbTarget-cbPattern))<<8 ) + *(pbTarget-1) ) {
//         count = countSTATIC;
count = cbPattern-2;
//         while ( count && *(char *)(pbPattern+1+(countSTATIC-count)) == *(char *)(pbTarget-cbPattern+1+(countSTATIC-count)) ) {
while ( count && *(char *)(pbPattern+1) == *(char *)(pbTarget-2) ) { // Crippling i.e. only 2 and 3 chars are allowed!
count--;
}
if ( count == 0) return((pbTarget-cbPattern));
}
if ( (char)(ulHashPattern>>8) != *(pbTarget-cbPattern+1) ) pbTarget++;
//For 2 and 3 ]
*/

if ( ulHashPattern == ( (*(char *)(pbTarget-2))<<8 ) + *(pbTarget-1) )
return((pbTarget-2));
if ( (char)(ulHashPattern>>8) != *(pbTarget-1) ) pbTarget++;

// The fragment below gives for 'cbPattern'>=2:
// Karp_Rabin_Kaze_4_OCTETS_hits/Karp_Rabin_Kaze_4_OCTETS_clocks: 4/554
// Karp_Rabin_Kaze_4_OCTETS performance: 364KB/clock
/*
if ( ulHashPattern == ( (*(char *)(pbTarget-cbPattern))<<8 ) + *(pbTarget-1) ) {
count = countSTATIC>>2;
countRemainder = countSTATIC % 4;

while ( count && *(unsigned long *)(pbPattern+1+((count-1)<<2)) == *(unsigned long *)(pbTarget-cbPattern+1+((count-1)<<2)) ) {
count--;
}
//if (count == 0) {  // Disastrous degradation only from this line(317KB/clock when 1+2x4+2+1 bytes pattern: 'skillessness'; 312KB/clock when 1+1x4+2+1 bytes pattern: 'underdog'), otherwise 368KB/clock.
while ( countRemainder && *(char *)(pbPattern+1+(countSTATIC-countRemainder)) == *(char *)(pbTarget-cbPattern+1+(countSTATIC-countRemainder)) ) {
countRemainder--;
}
//if ( countRemainder == 0) return((long)(pbTarget-cbPattern));
if ( count+countRemainder == 0) return((long)(pbTarget-cbPattern));
//}
}
*/

pbTarget++;
if (pbTarget > pbTargetMax)
return(NULL);
}
} else { //if ( cbPattern<4)
if (cbTarget<961) // This value is arbitrary(don't know how exactly), it ensures(at least must) better performance than 'Boyer_Moore_Horspool'.
{
pbTarget = pbTarget+cbPattern;
ulHashPattern = *(unsigned long *)(pbPattern);
//        countSTATIC = cbPattern-1;

//SINGLET = *(char *)(pbPattern);
SINGLET = ulHashPattern & 0xFF;

for ( ;; )
{
ulHashTarget = *(unsigned long *)(pbTarget-cbPattern);

if ( ulHashPattern == ulHashTarget ) { // Three unnecessary comparisons here, but 'AdvanceHopperGrass' must be calculated - it has a higher priority.
//         count = countSTATIC;
//         while ( count && *(char *)(pbPattern+1+(countSTATIC-count)) == *(char *)(pbTarget-cbPattern+1+(countSTATIC-count)) ) {
//               count--;
//         }
count = cbPattern-1;
while ( count && *(char *)(pbPattern+(cbPattern-count)) == *(char *)(pbTarget-count) ) {
count--;
}
if ( count == 0) return((pbTarget-cbPattern));
} else { // The goal here: to avoid memory accesses by stressing the registers.
if ( Quadruplet2nd != (ulHashTarget & 0x0000FF00) ) {
if ( Quadruplet3rd != (ulHashTarget & 0x00FF0000) ) {
}
}
}

if (pbTarget > pbTargetMax)
return(NULL);
}
} else { //if (cbTarget<961)
//countSTATIC = cbPattern-2; //r.6+
//countSTATIC = cbPattern-2-3; //r.6++
countSTATIC = cbPattern-2-2; //r.6+++ This line fixes the bug from r.6++!
ulHashPattern = *(unsigned long *)(pbPattern);
/* Preprocessing */
for (a=0; a < ASIZE; a++) bm_bc[a]=cbPattern;
for (j=0; j < cbPattern-1; j++) bm_bc[pbPattern[j]]=cbPattern-j-1;

/* Searching */
//lastch=pbPattern[cbPattern-1];
//firstch=pbPattern[0];
i=0;
while (i <= cbTarget-cbPattern) {
ch=pbTarget[i+cbPattern-1];
//if (ch ==lastch)
//if (memcmp(&pbTarget[i],pbPattern,cbPattern-1) == 0) OUTPUT(i);
//if (ch == lastch && pbTarget[i] == firstch && memcmp(&pbTarget[i],pbPattern,cbPattern-1) == 0) return(i);  // Kaze: The idea(to prevent execution of slower 'memcmp') is borrowed from Karp-Rabin i.e. to perform a slower check only when the target "looks like".
//if (ch == pbPattern[cbPattern-1] && pbTarget[i] == pbPattern[0]) // r.6+
//if (ch == pbPattern[cbPattern-1] && *(long *)&pbTarget[i] == *(long *)&pbPattern[0]) // No problema here since we have 4[+] long pattern here. Overlapping (1 byte recompared) when length=4, grmbl.
if (ch == pbPattern[cbPattern-1] && *(long *)&pbTarget[i] == ulHashPattern) // No problema here since we have 4[+] long pattern here. Overlapping (1 byte recompared) when length=4, grmbl.
{
count = countSTATIC;
//while ( count && *(char *)(pbPattern+1+(countSTATIC-count)) == *(char *)(&pbTarget[i]+1+(countSTATIC-count)) ) { // r.6+
while ( count !=0 && *(char *)(pbPattern+1+3+(countSTATIC-count)) == *(char *)(&pbTarget[i]+1+3+(countSTATIC-count)) ) { // if pattern length is 4 or 5 we have count=-1 and count=0 respectively i.e. no need of comparing in-between chars.
count--;
}
if ( count <= 0) return(pbTarget+i);
}
i+=bm_bc[ch];
}
return(NULL);
} //if (cbTarget<961)
} //if ( cbPattern<4)
}
// ### Mix(2in1) of Karp-Rabin & Boyer-Moore-Horspool algorithm ]```

## Points of Interest

The interesting part about tests I have done: 'memcmp' and 'memcpy' when inlined (properly) give much better results. The thing which pop-ups again-and-again: Never take anything for granted. A bug (which doesn't affect the previous results, correct) was crushed, now r.6+++ is ready to go.

As you can see from the C (commentless and tabulated) and ASM sources (see the top of the page for downloading):
The Microsoft 16.00.30319.01 compiled Railgun_Quadruplet_6ppp is 00c6b-00a00+5 = 624 bytes long.
The Quadruplet (where cbTarget<961) fragment loop is 00bc3-00b17+5 = 177 bytes long.
The Boyer-Moore-Horspool (where cbTarget>=961) fragment loop is 00c61-00c20+2 = 67 bytes long.
An interesting question for more experienced coders:
In case of reducing the BMH loop down by 3 bytes (i.e. to 64 bytes), can we obtain some additional/significant boost?

## History

This is revision 6+++ written 2011-Sep-07:

## Share

 Other Bulgaria
A Bulgarian old boy interested in search techniques, nothing special.

## You may also be interested in...

Discussions posted for the Published version of this article. Posting a message here will take you to the publicly available article in order to continue your conversation in public.

 First PrevNext
 Excellent Manikandan1018-May-14 3:09 Manikandan10 18-May-14 3:09
 Re: Excellent Sanmayce18-May-14 6:35 Sanmayce 18-May-14 6:35
 Re: Excellent Manikandan1020-May-14 22:55 Manikandan10 20-May-14 22:55
 Railgun_Swampshine_BailOut used as match finder in LZSS Sanmayce15-Apr-14 8:32 Sanmayce 15-Apr-14 8:32
 Round #2 Sanmayce17-Apr-14 5:46 Sanmayce 17-Apr-14 5:46
 Round #2, recuperation Sanmayce19-Apr-14 8:46 Sanmayce 19-Apr-14 8:46
 Round #3 Sanmayce21-Apr-14 6:35 Sanmayce 21-Apr-14 6:35
 Round #4 Sanmayce25-Apr-14 5:27 Sanmayce 25-Apr-14 5:27
 Round #5, secret weapon deployed Sanmayce27-Apr-14 4:43 Sanmayce 27-Apr-14 4:43
 Round #5, Aftermath: the supremacy of Nakamichi 'Kaidanji' Sanmayce2-May-14 5:58 Sanmayce 2-May-14 5:58
 Some question KarstenK4-Nov-13 22:43 KarstenK 4-Nov-13 22:43
 Re: Some question Sanmayce5-Nov-13 4:05 Sanmayce 5-Nov-13 4:05
 My vote of 4 Sanmayce3-Nov-13 7:19 Sanmayce 3-Nov-13 7:19
 My vote of 5 Volynsky Alex16-Oct-13 8:06 Volynsky Alex 16-Oct-13 8:06
 Re: My vote of 5 Sanmayce16-Oct-13 8:24 Sanmayce 16-Oct-13 8:24
 Re: My vote of 5 Volynsky Alex16-Oct-13 9:25 Volynsky Alex 16-Oct-13 9:25
 Where to find INSPIRATION? Sanmayce12-Oct-13 7:51 Sanmayce 12-Oct-13 7:51
 Superfast 16-threaded exact searcher: 16 Railguns in action and more Sanmayce9-Feb-13 6:16 Sanmayce 9-Feb-13 6:16
 Fastest 32bit hash function in C Sanmayce29-Sep-12 7:25 Sanmayce 29-Sep-12 7:25
 My latest Windows benchmark package Sanmayce20-Mar-12 7:26 Sanmayce 20-Mar-12 7:26
 Horspool vs BNDM Mischa Sandberg4-Mar-12 18:06 Mischa Sandberg 4-Mar-12 18:06
 Re: Horspool vs BNDM Sanmayce5-Mar-12 5:35 Sanmayce 5-Mar-12 5:35
 Re: Unix Mischa Sandberg5-Mar-12 14:40 Mischa Sandberg 5-Mar-12 14:40
 Re: Horspool vs BNDM Sanmayce7-Mar-12 3:23 Sanmayce 7-Mar-12 3:23
 BNDM_64 vs 'Tridentx64' Sanmayce7-Mar-12 5:05 Sanmayce 7-Mar-12 5:05
 Last Visit: 31-Dec-99 18:00     Last Update: 23-Jun-17 20:27 Refresh 12345 Next »