Introduction
The C strstr-like function presented below (written by Georgi 'Kaze') is named Railgun_Quadruplet_6ppp and is based on well-known Boyer-Moore-Horspool and simplified Karp-Rabin algorithms. The main goal: to achieve fastest execution for all kinds of pattern/needle and string/haystack lengths (starting from 2 bytes).
Railgun_Quadruplet_6ppp is now obsolete due to the arrival of Railgun_Quadruplet_7 which boosts the BMH fragment by more than a percent. See further below for its commentless and tabulated source.
Railgun_Quadruplet_7 exploits the CPUs fast unaligned DWORD extract capability.
The heaviest and most precise test on 1+TB text (52 patterns vs 197MB English books text repeated 256 times results in 1000+ seconds, see strstr_SHORT-SHOWDOWN_r7_Heavy_Test_LOG.txt file for more information) gives:
[strstr_SHORT-SHOWDOWN_Microsoft_v16_Ox.exe:]
The variants returning the offset of first hit only:
Railgun_Quadruplet_6pp 52, i.e., average performance: 1584KB/clock
Railgun_Quadruplet_7 52, i.e., average performance: 1659KB/clock
The variants counting all hits:
Railgun_6pp 8x2, i.e., average performance: 1246KB/clock
Railgun_Quadruplet_7 8x2, i.e., average performance: 1266KB/clock
[strstr_SHORT-SHOWDOWN_Intel_v12_O3.exe:]
The variants returning the offset of first hit only:
Railgun_Quadruplet_6pp 52, i.e., average performance: 1773KB/clock
Railgun_Quadruplet_7 52, i.e., average performance: 1685KB/clock
The variants counting all hits:
Railgun_6pp 8x2, i.e., average performance: 1054KB/clock
Railgun_Quadruplet_7 8x2, i.e., average performance: 1076KB/clock
Note: For Railgun_Quadruplet_7 on CPUs with fast DWORD (compared to BYTE) memory fetching I expect significant boost, AFAIK the Intel i7 is one of those, unfortunately I lack the opportunity to play with it.
Or the two bottom-lines:
The FASTEST variant counting all hits:
Railgun_Quadruplet_7 8x2 with average performance: 1266KB/clock (best performance achieved with Microsoft_v16_Ox)!
The FASTEST variant returning the offset of first hit only:
Railgun_Quadruplet_6pp 52 with average performance: 1773KB/clock (1,943,883,635,456 bytes / 1,070,220 clocks = 1732 MB/s best performance achieved with Intel_v12_O3)!
An add-on from 2012-Jan-20:
[
Railgun_Quadruplet_7 vs
Railgun_Quadruplet_7deuce vs
Boyer_Moore vs
Brute_Force: a face-off
Okay, nothing new, only a must-see benchmark against the original Boyer-Moore implementation.
Two tests were done - the first/second '6x2'/'52' searches 12/52 patterns (for all appearances i.e. counting all hits) into 206,908,949 bytes long English text.
Speed-Performance (measured in KB/clock, bigger-the-better) Summary:
Function 6x2 test
Railgun_Quadruplet_7 2,039KB/clock / 137,568% / 006,454,416,320 BONBON
Boyer_Moore_Flensburg 1,239KB/clock / 139,168% / 006,428,115,568 Slow
Railgun_Quadruplet_7deuce 1,120KB/clock / 155,744% / 005,995,597,776 Slower
Brute_Force_Dummy 0,341KB/clock / 019,200% / 039,726,516,640 Slowest
Speed-Performance (measured in KB/clock, bigger-the-better) Summary:
Function 52 test
Railgun_Quadruplet_7 1,645KB/clock / 708,768% / 025,368,049,712 BONBON
Boyer_Moore_Flensburg 0,987KB/clock / 720,272% / 025,227,312,176 Slow
Railgun_Quadruplet_7deuce 0,919KB/clock / 864,048% / 023,292,952,992 Slower
Brute_Force_Dummy 0,277KB/clock / 083,200% / 172,148,232,496 Slowest
Skip-Performance (measured in %, bigger-the-better) Summary:
Function 6x2 test
Railgun_Quadruplet_7deuce 1,120KB/clock / 155,744% / 005,995,597,776 BONBON
Boyer_Moore_Flensburg 1,239KB/clock / 139,168% / 006,428,115,568 loopy
Railgun_Quadruplet_7 2,039KB/clock / 137,568% / 006,454,416,320 loopier
Brute_Force_Dummy 0,341KB/clock / 019,200% / 039,726,516,640 loopiest
Skip-Performance (measured in %, bigger-the-better) Summary:
Function 52 test
Railgun_Quadruplet_7deuce 0,919KB/clock / 864,048% / 023,292,952,992 BONBON
Boyer_Moore_Flensburg 0,987KB/clock / 720,272% / 025,227,312,176 loopy
Railgun_Quadruplet_7 1,645KB/clock / 708,768% / 025,368,049,712 loopier
Brute_Force_Dummy 0,277KB/clock / 083,200% / 172,148,232,496 loopiest
A very easy-to-the-eyes/mind article at http://www.inf.fh-flensburg.de/lang/algorithmen/pattern/bm.htm
Here I say 'Triple THANKS' to Mr. Prof. Dr. Hans Werner Lang (Institut für Medieninformatik und Technische Informatik
Fachhochschule Flensburg), I used your code (after a straightforward port). On his page (http://www.inf.fh-flensburg.de/lang/) there are very well presented algorithms - a very good resource indeed.
I see that Sunday algorithm has a potential, soon I will play with it.
I expect
Railgun_Quadruplet_7deuce to become a gun of choice but only on CPUs with FAR more effective memory handling than my Pentium Merom.
]
Background
Function homepage:
http://www.sanmayce.com/Railgun/index.html
Note: In order to reproduce the tests shown on the picture (below) you need the test package (57.2 MB) downloadable at function's homepage, thanks Randor for pointing out.
Very informative resource on strstr search techniques at:
Tests done with 16 patterns with 197MB haystack (being a pure English book-like text).
The machine is a mainstream laptop on 2.1GHz with Intel Merom CPU.
Using the Code
The code is pretty simple and straightforward.
char * Railgun_Quadruplet_6ppp (char * pbTarget,
char * pbPattern,
unsigned long cbTarget,
unsigned long cbPattern)
{
char * pbTargetMax = pbTarget + cbTarget;
register unsigned long ulHashPattern;
unsigned long ulHashTarget;
signed long count;
signed long countSTATIC;
unsigned char SINGLET;
unsigned long Quadruplet2nd;
unsigned long Quadruplet3rd;
unsigned long Quadruplet4th;
unsigned long AdvanceHopperGrass;
long i; int a, j, bm_bc[ASIZE]; unsigned char ch;
if (cbPattern > cbTarget)
return(NULL);
if ( cbPattern<4) { pbTarget = pbTarget+cbPattern;
ulHashPattern = ( (*(char *)(pbPattern))<<8 ) + *(pbPattern+(cbPattern-1));
if ( cbPattern==3) {
for ( ;; )
{
if ( ulHashPattern == ( (*(char *)(pbTarget-3))<<8 ) + *(pbTarget-1) ) {
if ( *(char *)(pbPattern+1) == *(char *)(pbTarget-2) ) return((pbTarget-3));
}
if ( (char)(ulHashPattern>>8) != *(pbTarget-2) ) pbTarget++;
pbTarget++;
if (pbTarget > pbTargetMax)
return(NULL);
}
} else {
}
for ( ;; )
{
if ( ulHashPattern == ( (*(char *)(pbTarget-2))<<8 ) + *(pbTarget-1) )
return((pbTarget-2));
if ( (char)(ulHashPattern>>8) != *(pbTarget-1) ) pbTarget++;
pbTarget++;
if (pbTarget > pbTargetMax)
return(NULL);
}
} else { if (cbTarget<961) {
pbTarget = pbTarget+cbPattern;
ulHashPattern = *(unsigned long *)(pbPattern);
SINGLET = ulHashPattern & 0xFF;
Quadruplet2nd = SINGLET<<8;
Quadruplet3rd = SINGLET<<16;
Quadruplet4th = SINGLET<<24;
for ( ;; )
{
AdvanceHopperGrass = 0;
ulHashTarget = *(unsigned long *)(pbTarget-cbPattern);
if ( ulHashPattern == ulHashTarget ) { count = cbPattern-1;
while ( count && *(char *)(pbPattern+(cbPattern-count)) ==
*(char *)(pbTarget-count) ) {
if ( cbPattern-1==AdvanceHopperGrass+count && SINGLET !=
*(char *)(pbTarget-count) ) AdvanceHopperGrass++;
count--;
}
if ( count == 0) return((pbTarget-cbPattern));
} else { if ( Quadruplet2nd != (ulHashTarget & 0x0000FF00) ) {
AdvanceHopperGrass++;
if ( Quadruplet3rd != (ulHashTarget & 0x00FF0000) ) {
AdvanceHopperGrass++;
if ( Quadruplet4th != (ulHashTarget & 0xFF000000) ) AdvanceHopperGrass++;
}
}
}
AdvanceHopperGrass++;
pbTarget = pbTarget + AdvanceHopperGrass;
if (pbTarget > pbTargetMax)
return(NULL);
}
} else { countSTATIC = cbPattern-2-2; ulHashPattern = *(unsigned long *)(pbPattern);
for (a=0; a < ASIZE; a++) bm_bc[a]=cbPattern;
for (j=0; j < cbPattern-1; j++) bm_bc[pbPattern[j]]=cbPattern-j-1;
i=0;
while (i <= cbTarget-cbPattern) {
ch=pbTarget[i+cbPattern-1];
if (ch == pbPattern[cbPattern-1] &&
*(long *)&pbTarget[i] == ulHashPattern) {
count = countSTATIC;
while ( count !=0 && *(char *)(pbPattern+1+3+(countSTATIC-count)) ==
*(char *)(&pbTarget[i]+1+3+(countSTATIC-count)) )
{ count--;
}
if ( count <= 0) return(pbTarget+i);
}
i+=bm_bc[ch];
}
return(NULL);
} } }
Stomp stomp I arrived, here comes the Railgun_Quadruplet revision 7:
char * Railgun_Quadruplet_7 (char * pbTarget, char * pbPattern,
unsigned long cbTarget, unsigned long cbPattern)
{
char * pbTargetMax = pbTarget + cbTarget;
register unsigned long ulHashPattern;
unsigned long ulHashTarget;
signed long count;
signed long countSTATIC;
unsigned char SINGLET;
unsigned long Quadruplet2nd;
unsigned long Quadruplet3rd;
unsigned long Quadruplet4th;
unsigned long AdvanceHopperGrass;
long i;
int a, j, bm_bc[ASIZE];
unsigned char ch;
if (cbPattern > cbTarget) return(NULL);
if (cbPattern<4) {
pbTarget = pbTarget+cbPattern;
ulHashPattern = ( (*(char *)(pbPattern))<<8 ) +
*(pbPattern+(cbPattern-1));
if (cbPattern==3) {
for ( ;; ) {
if ( ulHashPattern ==
( (*(char *)(pbTarget-3))<<8 ) + *(pbTarget-1) ) {
if ( *(char *)(pbPattern+1) ==
*(char *)(pbTarget-2) ) return((pbTarget-3));
}
if ( (char)(ulHashPattern>>8) !=
*(pbTarget-2) ) pbTarget++;
pbTarget++;
if (pbTarget > pbTargetMax) return(NULL);
}
} else {
}
for ( ;; ) {
if ( ulHashPattern == ( (*(char *)(pbTarget-2))<<8 )
+ *(pbTarget-1) ) return((pbTarget-2));
if ( (char)(ulHashPattern>>8) != *(pbTarget-1) )
pbTarget++;
pbTarget++;
if (pbTarget > pbTargetMax) return(NULL);
}
} else {
if (cbTarget<961) {
pbTarget = pbTarget+cbPattern;
ulHashPattern = *(unsigned long *)(pbPattern);
SINGLET = ulHashPattern & 0xFF;
Quadruplet2nd = SINGLET<<8;
Quadruplet3rd = SINGLET<<16;
Quadruplet4th = SINGLET<<24;
for ( ;; ) {
AdvanceHopperGrass = 0;
ulHashTarget = *(unsigned long *)(pbTarget-cbPattern);
if ( ulHashPattern == ulHashTarget ) {
count = cbPattern-1;
while ( count && *(char *)
(pbPattern+(cbPattern-count)) ==
*(char *)(pbTarget-count) ) {
if ( cbPattern-1==
AdvanceHopperGrass+count &&
SINGLET != *(char *)(pbTarget-count)
) AdvanceHopperGrass++;
count--;
}
if ( count == 0)
return((pbTarget-cbPattern));
} else {
if ( Quadruplet2nd != (ulHashTarget
& 0x0000FF00) ) {
AdvanceHopperGrass++;
if ( Quadruplet3rd !=
(ulHashTarget & 0x00FF0000) ) {
AdvanceHopperGrass++;
if ( Quadruplet4th !=
(ulHashTarget & 0xFF000000)
) AdvanceHopperGrass++;
}
}
}
AdvanceHopperGrass++;
pbTarget = pbTarget + AdvanceHopperGrass;
if (pbTarget > pbTargetMax) return(NULL);
}
} else {
countSTATIC = cbPattern-2-2;
ulHashPattern = *(unsigned long *)(pbPattern);
for (a=0; a < ASIZE; a++) bm_bc[a]=cbPattern;
for (j=0; j < cbPattern-1; j++) bm_bc[pbPattern[j]]=
cbPattern-j-1;
i=0;
while (i <= cbTarget-cbPattern) {
if ( *(unsigned long *)&pbTarget[i] ==
ulHashPattern ) {
count = countSTATIC;
while ( count !=0 && *(char *)
(pbPattern+1+3+(countSTATIC-count)) ==
*(char *)(&pbTarget[i]+1+3+
(countSTATIC-count)) ) count--;
if ( count == 0) return(pbTarget+i);
}
i= i + bm_bc[(unsigned char)pbTarget[i+cbPattern-1]];
}
return(NULL);
}
}
}
The tweak (in BMH fragment) which speeds-up significantly the obsolete (already) r.6+++ is this:
countSTATIC = cbPattern-2-2; ulHashPattern = *(unsigned long *)(pbPattern);
for (a=0; a < ASIZE; a++) bm_bc[a]=cbPattern;
for (j=0; j < cbPattern-1; j++) bm_bc[pbPattern[j]]=cbPattern-j-1;
i=0;
while (i <= cbTarget-cbPattern) {
if ( *(unsigned long *)&pbTarget[i] == ulHashPattern ) {
count = countSTATIC;
while ( count !=0 && *(char *)(pbPattern+(countSTATIC-count)+4) ==
*(char *)(&pbTarget[i]+(countSTATIC-count)+4) ) { count--;
}
if ( count == 0) return(pbTarget+i);
}
i= i + bm_bc[(unsigned char)pbTarget[i+cbPattern-1]];
}
return(NULL);
In order to see what path I had to traverse to make this tweak, you may see the discussion board (r.7m thread). The tests below show NOT the full boost simply because they average times for all lengths thus losing/diminishing the high speeds times (being 4-3-2 times lesser compared with pattern lengths 2-3-4-5-6).
D:\_KAZE_new-stuff>strstr_SHORT-SHOWDOWN_Microsoft_v16_Ox.exe
strstr_SHORT-SHOWDOWN, revision 7, written by Kaze.
Input Pattern(up to 19+2000 chars): frustration
Doing Search for Pattern(11bytes) into String(206908949bytes) as-one-line ...
WARNING! The next line works with BMH only for pattern 4[+] long,
otherwise (for 2 and 3) other searcher takes over!
Railgun_6pp_hits/Railgun_6pp_clocks: 1355/101
Railgun_6pp performance: 2000KB/clock
Doing Search for Pattern(11bytes) into String(206908949bytes) as-one-line ...
WARNING! The next line works with BMH only for pattern 4[+] long,
otherwise (for 2 and 3) other searcher takes over!
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 1355/91
Railgun_Quadruplet_7 performance: 2220KB/clock
Note: Executing the next two tests 256 times i.e. the search is for 256x8x2 patterns!
Doing Search for 8x2 Patterns into String(206908949bytes) as-one-line ...
Not-Counting-Hits-Just-Returns-First-One
Found ('an') at 157 position, Railgun_Quadruplet_6pp performance: 0KB/clock
Found ('to') at 853 position, Railgun_Quadruplet_6pp performance: 0KB/clock
Found ('TDK') at -4325408 position, Railgun_Quadruplet_6pp performance: 922KB/clock
Found ('the') at 511 position, Railgun_Quadruplet_6pp performance: 0KB/clock
Found ('fast') at 42381 position, Railgun_Quadruplet_6pp performance: 41KB/clock
Found ('easy') at 30163 position, Railgun_Quadruplet_6pp performance: 29KB/clock
Found ('grmbl') at -4325408 position, Railgun_Quadruplet_6pp performance: 1167KB/clock
Found ('email') at 69587982 position, Railgun_Quadruplet_6pp performance: 1078KB/clock
Found ('pasting') at 134600126 position, Railgun_Quadruplet_6pp performance: 1383KB/clock
Found ('amazing') at 1629806 position, Railgun_Quadruplet_6pp performance: 93KB/clock
Found ('underdog') at 137424498 position,
Railgun_Quadruplet_6pp performance: 1698KB/clock
Found ('superdog') at -4325408 position, Railgun_Quadruplet_6pp performance: 1603KB/clock
Found ('participants') at 16060 position, Railgun_Quadruplet_6pp performance: 15KB/clock
Found ('skillessness') at -4325408 position,
Railgun_Quadruplet_6pp performance: 1836KB/clock
Found ('I should have known') at 41831880 position,
Railgun_Quadruplet_6pp performance: 2403KB/clock
Found ('human consciousness') at 3863 position,
Railgun_Quadruplet_6pp performance: 3KB/clock
Railgun_Quadruplet_6pp 8x2 i.e. average performance: 1342KB/clock
ReallyTraversed: 310,477,846,528 bytes
Doing Search for 8x2 Patterns into String(206908949bytes) as-one-line ...
Not-Counting-Hits-Just-Returns-First-One
Found ('an') at 157 position, Railgun_Quadruplet_7 performance: 0KB/clock
Found ('to') at 853 position, Railgun_Quadruplet_7 performance: 0KB/clock
Found ('TDK') at -4325408 position, Railgun_Quadruplet_7 performance: 990KB/clock
Found ('the') at 511 position, Railgun_Quadruplet_7 performance: 0KB/clock
Found ('fast') at 42381 position, Railgun_Quadruplet_7 performance: 41KB/clock
Found ('easy') at 30163 position, Railgun_Quadruplet_7 performance: 29KB/clock
Found ('grmbl') at -4325408 position, Railgun_Quadruplet_7 performance: 1069KB/clock
Found ('email') at 69587982 position, Railgun_Quadruplet_7 performance: 1078KB/clock
Found ('pasting') at 134600126 position, Railgun_Quadruplet_7 performance: 1383KB/clock
Found ('amazing') at 1629806 position, Railgun_Quadruplet_7 performance: 1591KB/clock
Found ('underdog') at 137424498 position, Railgun_Quadruplet_7 performance: 1698KB/clock
Found ('superdog') at -4325408 position, Railgun_Quadruplet_7 performance: 1422KB/clock
Found ('participants') at 16060 position, Railgun_Quadruplet_7 performance: 15KB/clock
Found ('skillessness') at -4325408 position,
Railgun_Quadruplet_7 performance: 2149KB/clock
Found ('I should have known') at 41831880 position,
Railgun_Quadruplet_7 performance: 2403KB/clock
Found ('human consciousness') at 3863 position,
Railgun_Quadruplet_7 performance: 3KB/clock
Railgun_Quadruplet_7 8x2 i.e. average performance: 1354KB/clock
ReallyTraversed: 310,477,846,528 bytes
Doing Search for 8x2 Patterns into String(206908949bytes) as-one-line ...
Found ('an') 1987797 time(s), BM_HORSPOOL performance: 293KB/clock
Found ('to') 1076629 time(s), BM_HORSPOOL performance: 293KB/clock
Found ('TDK') 0 time(s), BM_HORSPOOL performance: 516KB/clock
Found ('the') 2114180 time(s), BM_HORSPOOL performance: 379KB/clock
Found ('fast') 5945 time(s), BM_HORSPOOL performance: 585KB/clock
Found ('easy') 5191 time(s), BM_HORSPOOL performance: 585KB/clock
Found ('grmbl') 0 time(s), BM_HORSPOOL performance: 759KB/clock
Found ('email') 1 time(s), BM_HORSPOOL performance: 756KB/clock
Found ('pasting') 2 time(s), BM_HORSPOOL performance: 918KB/clock
Found ('amazing') 323 time(s), BM_HORSPOOL performance: 1074KB/clock
Found ('underdog') 4 time(s), BM_HORSPOOL performance: 1069KB/clock
Found ('superdog') 0 time(s), BM_HORSPOOL performance: 1074KB/clock
Found ('participants') 147 time(s), BM_HORSPOOL performance: 1603KB/clock
Found ('skillessness') 0 time(s), BM_HORSPOOL performance: 1422KB/clock
Found ('I should have known') 1 time(s), BM_HORSPOOL performance: 1433KB/clock
Found ('human consciousness') 519 time(s), BM_HORSPOOL performance: 1820KB/clock
BM_HORSPOOL 8x2 i.e. average performance: 666KB/clock
Doing Search for 8x2 Patterns into String(206908949bytes) as-one-line ...
Found ('an') 1987797 time(s), Railgun_6pp performance: 678KB/clock
Found ('to') 1076629 time(s), Railgun_6pp performance: 643KB/clock
Found ('TDK') 0 time(s), Railgun_6pp performance: 1174KB/clock
Found ('the') 2114180 time(s), Railgun_6pp performance: 678KB/clock
Found ('fast') 5945 time(s), Railgun_6pp performance: 918KB/clock
Found ('easy') 5191 time(s), Railgun_6pp performance: 990KB/clock
Found ('grmbl') 0 time(s), Railgun_6pp performance: 1287KB/clock
Found ('email') 1 time(s), Railgun_6pp performance: 1167KB/clock
Found ('pasting') 2 time(s), Railgun_6pp performance: 1603KB/clock
Found ('amazing') 323 time(s), Railgun_6pp performance: 1603KB/clock
Found ('underdog') 4 time(s), Railgun_6pp performance: 1603KB/clock
Found ('superdog') 0 time(s), Railgun_6pp performance: 1820KB/clock
Found ('participants') 147 time(s), Railgun_6pp performance: 2149KB/clock
Found ('skillessness') 0 time(s), Railgun_6pp performance: 2126KB/clock
Found ('I should have known') 1 time(s), Railgun_6pp performance: 2557KB/clock
Found ('human consciousness') 519 time(s), Railgun_6pp performance: 2126KB/clock
Railgun_6pp 8x2 i.e. average performance: 1218KB/clock
Doing Search for 8x2 Patterns into String(206908949bytes) as-one-line ...
Found ('an') 1987797 time(s), Railgun_Quadruplet_7 performance: 643KB/clock
Found ('to') 1076629 time(s), Railgun_Quadruplet_7 performance: 678KB/clock
Found ('TDK') 0 time(s), Railgun_Quadruplet_7 performance: 1074KB/clock
Found ('the') 2114180 time(s), Railgun_Quadruplet_7 performance: 643KB/clock
Found ('fast') 5945 time(s), Railgun_Quadruplet_7 performance: 1074KB/clock
Found ('easy') 5191 time(s), Railgun_Quadruplet_7 performance: 1074KB/clock
Found ('grmbl') 0 time(s), Railgun_Quadruplet_7 performance: 1287KB/clock
Found ('email') 1 time(s), Railgun_Quadruplet_7 performance: 1167KB/clock
Found ('pasting') 2 time(s), Railgun_Quadruplet_7 performance: 1603KB/clock
Found ('amazing') 323 time(s), Railgun_Quadruplet_7 performance: 1603KB/clock
Found ('underdog') 4 time(s), Railgun_Quadruplet_7 performance: 1820KB/clock
Found ('superdog') 0 time(s), Railgun_Quadruplet_7 performance: 1603KB/clock
Found ('participants') 147 time(s), Railgun_Quadruplet_7 performance: 2557KB/clock
Found ('skillessness') 0 time(s), Railgun_Quadruplet_7 performance: 2126KB/clock
Found ('I should have known') 1 time(s), Railgun_Quadruplet_7 performance: 2557KB/clock
Found ('human consciousness') 519 time(s),
Railgun_Quadruplet_7 performance: 2557KB/clock
Railgun_Quadruplet_7 8x2 i.e. average performance: 1232KB/clock
^C
D:\_KAZE_new-stuff>
In my opinion, it is one of the seven commandments of speed: do not put the cart before the horse, i.e., for adjacent (32?) memory accesses first extract the byte with lowest address. strstr_SHORT-SHOWDOWN_Microsoft_v16_Ox.exe, revision 7, some quick inputs:
D:\_KAZE_new-stuff>strstr_SHORT-SHOWDOWN_Microsoft_v16_Ox.exe
strstr_SHORT-SHOWDOWN, revision 7, written by Kaze.
Input Pattern(up to 19+2000 chars): fast
Doing Search for Pattern(4bytes) into String(206908949bytes) as-one-line ...
WARNING! The next line works with BMH only for pattern 4[+] long,
otherwise (for 2 and 3) other searcher takes over!
Railgun_6pp_hits/Railgun_6pp_clocks: 5945/215
Railgun_6pp performance: 939KB/clock
Doing Search for Pattern(4bytes) into String(206908949bytes) as-one-line ...
WARNING! The next line works with BMH only for pattern 4[+] long,
otherwise (for 2 and 3) other searcher takes over!
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 5945/193
Railgun_Quadruplet_7 performance: 1046KB/clock
^C
...
Input Pattern(up to 19+2000 chars): from
Railgun_6pp_hits/Railgun_6pp_clocks: 83155/196
Railgun_6pp performance: 1030KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 83155/194
Railgun_Quadruplet_7 performance: 1041KB/clock
Input Pattern(up to 19+2000 chars): rabea
Railgun_6pp_hits/Railgun_6pp_clocks: 0/176
Railgun_6pp performance: 1148KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/162
Railgun_Quadruplet_7 performance: 1247KB/clock
Input Pattern(up to 19+2000 chars): makes
Railgun_6pp_hits/Railgun_6pp_clocks: 9281/172
Railgun_6pp performance: 1174KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 9281/162
Railgun_Quadruplet_7 performance: 1247KB/clock
Input Pattern(up to 19+2000 chars): monkey
Railgun_6pp_hits/Railgun_6pp_clocks: 1691/141
Railgun_6pp performance: 1433KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 1691/139
Railgun_Quadruplet_7 performance: 1453KB/clock
Input Pattern(up to 19+2000 chars): punny
Railgun_6pp_hits/Railgun_6pp_clocks: 0/156
Railgun_6pp performance: 1295KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/155
Railgun_Quadruplet_7 performance: 1303KB/clock
Input Pattern(up to 19+2000 chars): funny
Railgun_6pp_hits/Railgun_6pp_clocks: 179/157
Railgun_6pp performance: 1287KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 179/156
Railgun_Quadruplet_7 performance: 1295KB/clock
Input Pattern(up to 19+2000 chars): ramjet
Railgun_6pp_hits/Railgun_6pp_clocks: 0/152
Railgun_6pp performance: 1329KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/137
Railgun_Quadruplet_7 performance: 1474KB/clock
Input Pattern(up to 19+2000 chars): fallen
Railgun_6pp_hits/Railgun_6pp_clocks: 2578/151
Railgun_6pp performance: 1338KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 2578/138
Railgun_Quadruplet_7 performance: 1464KB/clock
Input Pattern(up to 19+2000 chars): elephant
Railgun_6pp_hits/Railgun_6pp_clocks: 1198/124
Railgun_6pp performance: 1629KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 1198/112
Railgun_Quadruplet_7 performance: 1804KB/clock
Input Pattern(up to 19+2000 chars): layalina
Railgun_6pp_hits/Railgun_6pp_clocks: 0/121
Railgun_6pp performance: 1669KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/110
Railgun_Quadruplet_7 performance: 1836KB/clock
Input Pattern(up to 19+2000 chars): I should have known
Railgun_6pp_hits/Railgun_6pp_clocks: 1/89
Railgun_6pp performance: 2270KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 1/84
Railgun_Quadruplet_7 performance: 2405KB/clock
Input Pattern(up to 19+2000 chars): human consciousness
Railgun_6pp_hits/Railgun_6pp_clocks: 519/80
Railgun_6pp performance: 2525KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 519/75
Railgun_Quadruplet_7 performance: 2694KB/clock
Input Pattern(up to 19+2000 chars): But he's looking right through me
Railgun_6pp_hits/Railgun_6pp_clocks: 0/84
Railgun_6pp performance: 2405KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/76
Railgun_Quadruplet_7 performance: 2658KB/clock
Input Pattern(up to 19+2000 chars): I'm living in an age that calls darkness light
Railgun_6pp_hits/Railgun_6pp_clocks: 0/72
Railgun_6pp performance: 2806KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/67
Railgun_Quadruplet_7 performance: 3015KB/clock
Input Pattern(up to 19+2000 chars): Following the wanderings of a dream -
a dream that keeps my soul alive
Railgun_6pp_hits/Railgun_6pp_clocks: 0/69
Railgun_6pp performance: 2928KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/63
Railgun_Quadruplet_7 performance: 3207KB/clock
Input Pattern(up to 19+2000 chars): I notice what matters and
I got nothing to lose but darkness and shadows
Railgun_6pp_hits/Railgun_6pp_clocks: 0/66
Railgun_6pp performance: 3061KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/63
Railgun_Quadruplet_7 performance: 3207KB/clock
Input Pattern(up to 19+2000 chars): Beth Ditto
Railgun_6pp_hits/Railgun_6pp_clocks: 0/118
Railgun_6pp performance: 1712KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/107
Railgun_Quadruplet_7 performance: 1888KB/clock
Input Pattern(up to 19+2000 chars): SR71
Railgun_6pp_hits/Railgun_6pp_clocks: 0/175
Railgun_6pp performance: 1154KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/182
Railgun_Quadruplet_7 performance: 1110KB/clock
Input Pattern(up to 19+2000 chars): Apache
Railgun_6pp_hits/Railgun_6pp_clocks: 1/162
Railgun_6pp performance: 1247KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 1/133
Railgun_Quadruplet_7 performance: 1519KB/clock
Input Pattern(up to 19+2000 chars): Fibonacci
Railgun_6pp_hits/Railgun_6pp_clocks: 0/106
Railgun_6pp performance: 1906KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/98
Railgun_Quadruplet_7 performance: 2061KB/clock
Input Pattern(up to 19+2000 chars): Tesla
Railgun_6pp_hits/Railgun_6pp_clocks: 0/175
Railgun_6pp performance: 1154KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/160
Railgun_Quadruplet_7 performance: 1262KB/clock
Input Pattern(up to 19+2000 chars): Albert
Railgun_6pp_hits/Railgun_6pp_clocks: 932/154
Railgun_6pp performance: 1312KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 932/137
Railgun_Quadruplet_7 performance: 1474KB/clock
Input Pattern(up to 19+2000 chars): Toshiba
Railgun_6pp_hits/Railgun_6pp_clocks: 0/130
Railgun_6pp performance: 1554KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/121
Railgun_Quadruplet_7 performance: 1669KB/clock
Input Pattern(up to 19+2000 chars): Orion
Railgun_6pp_hits/Railgun_6pp_clocks: 1/175
Railgun_6pp performance: 1154KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 1/160
Railgun_Quadruplet_7 performance: 1262KB/clock
Input Pattern(up to 19+2000 chars): pharaoh
Railgun_6pp_hits/Railgun_6pp_clocks: 26/129
Railgun_6pp performance: 1566KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 26/122
Railgun_Quadruplet_7 performance: 1656KB/clock
Input Pattern(up to 19+2000 chars): lemon
Railgun_6pp_hits/Railgun_6pp_clocks: 33/178
Railgun_6pp performance: 1135KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 33/161
Railgun_Quadruplet_7 performance: 1255KB/clock
Input Pattern(up to 19+2000 chars): grammar
Railgun_6pp_hits/Railgun_6pp_clocks: 218/123
Railgun_6pp performance: 1642KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 218/117
Railgun_Quadruplet_7 performance: 1727KB/clock
D:\_KAZE_new-stuff>
Grmbl, pattern 'SR71' compared to 'fast' gives me a hint how things are far from 'fastest'.
Points of Interest
The interesting part about tests I have done: 'memcmp' and 'memcpy' when inlined (properly) give much better results. The thing which pop-ups again-and-again: Never take anything for granted. A bug (which doesn't affect the previous results, correct) was crushed, now r.6+++ is ready to go/hit.
As you can see from the C (commentless and tabulated) and ASM sources (see the top of the page for downloading):
The Microsoft 16.00.30319.01 compiled Railgun_Quadruplet_6ppp is 00c6b-00a00+5 = 624 bytes long.
The Quadruplet (where cbTarget<961) fragment loop is 00bc3-00b17+5 = 177 bytes long.
The Boyer-Moore-Horspool (where cbTarget>=961) fragment loop is 00c61-00c20+2 = 67 bytes long.
An interesting question for more experienced coders:
In case of reducing the BMH loop down by 3 bytes (i.e. to 64 bytes), can we obtain some additional/significant boost?
History
- Revision 6+++ written 2011-Sep-07
- Revision 7 written 2011-Sep-25