Introduction
The C strstr-like function presented below (written by Georgi 'Kaze') is named Railgun_Quadruplet_6ppp
and is based on well-known Boyer-Moore-Horspool and simplified Karp-Rabin algorithms. The main goal: to achieve fastest execution for all kinds of pattern/needle and string/haystack lengths (starting from 2 bytes).
Railgun_Quadruplet_6ppp
is now obsolete due to the arrival of Railgun_Quadruplet_7
which boosts the BMH fragment by more than a percent. See further below for its commentless and tabulated source.
Railgun_Quadruplet_7
exploits the CPUs fast unaligned DWORD
extract capability.
The heaviest and most precise test on 1+TB text (52 patterns vs 197MB English books text repeated 256 times results in 1000+ seconds, see strstr_SHORT-SHOWDOWN_r7_Heavy_Test_LOG.txt file for more information) gives:
[strstr_SHORT-SHOWDOWN_Microsoft_v16_Ox.exe:]
The variants returning the offset of first hit only:
Railgun_Quadruplet_6pp
52, i.e., average performance: 1584KB/clockRailgun_Quadruplet_7
52, i.e., average performance: 1659KB/clock
The variants counting all hits:
Railgun_6pp
8x2, i.e., average performance: 1246KB/clockRailgun_Quadruplet_7
8x2, i.e., average performance: 1266KB/clock
[strstr_SHORT-SHOWDOWN_Intel_v12_O3.exe:]
The variants returning the offset of first hit only:
Railgun_Quadruplet_6pp
52, i.e., average performance: 1773KB/clockRailgun_Quadruplet_7
52, i.e., average performance: 1685KB/clock
The variants counting all hits:
Railgun_6pp
8x2, i.e., average performance: 1054KB/clockRailgun_Quadruplet_7
8x2, i.e., average performance: 1076KB/clock
Note: For Railgun_Quadruplet_7
on CPUs with fast DWORD
(compared to BYTE
) memory fetching I expect significant boost, AFAIK the Intel i7 is one of those, unfortunately I lack the opportunity to play with it.
Or the two bottom-lines:
The FASTEST variant counting all hits:
Railgun_Quadruplet_7
8x2 with average performance: 1266KB/clock (best performance achieved with Microsoft_v16_Ox)!
The FASTEST variant returning the offset of first hit only:
Railgun_Quadruplet_6pp
52 with average performance: 1773KB/clock (1,943,883,635,456 bytes / 1,070,220 clocks = 1732 MB/s best performance achieved with Intel_v12_O3)!
An add-on from 2012-Jan-26:
Enter
Railgun_Quadruplet_7Gulliver ... the most promising variant.
I am happy to share my latest etude - a result of a week-long sub-quest of mine saturated with music and program-mess-ing.
The first search-function that breaks (effortlessly) the 3GB/s barrier on a computer with 3GB/s memory write.
char * Railgun_Quadruplet_7Gulliver_count_hits (char * pbTarget, char * pbPattern, unsigned long cbTarget, unsigned long cbPattern)
{
char * pbTargetMax = pbTarget + cbTarget;
register unsigned long ulHashPattern;
register unsigned long ulHashTarget;
signed long count;
signed long countSTATIC;
unsigned char SINGLET;
unsigned long Quadruplet2nd;
unsigned long Quadruplet3rd;
unsigned long Quadruplet4th;
unsigned long AdvanceHopperGrass;
long i;
int a, j;
unsigned int bm_bc[256];
unsigned int bm_bc2nd[256];
unsigned char bm_Sunday_Order2[256*256];
unsigned long Gulliver;
if (cbPattern > cbTarget)
return(NULL);
if ( cbPattern<4) {
pbTarget = pbTarget+cbPattern;
ulHashPattern = ( (*(char *)(pbPattern))<<8 ) + *(pbPattern+(cbPattern-1));
if ( cbPattern==3) {
for ( ;; ) {
if ( ulHashPattern == ( (*(char *)(pbTarget-3))<<8 ) + *(pbTarget-1) ) {
if ( *(char *)(pbPattern+1) == *(char *)(pbTarget-2) ) Railgunhits++;
}
if ( (char)(ulHashPattern>>8) != *(pbTarget-2) ) pbTarget++;
pbTarget++;
if (pbTarget > pbTargetMax)
return(NULL);
}
} else {
}
for ( ;; ) {
if ( ulHashPattern == ( (*(char *)(pbTarget-2))<<8 ) + *(pbTarget-1) )
Railgunhits++;
if ( (char)(ulHashPattern>>8) != *(pbTarget-1) ) pbTarget++;
pbTarget++;
if (pbTarget > pbTargetMax)
return(NULL);
}
} else {
if (cbTarget<961) {
pbTarget = pbTarget+cbPattern;
ulHashPattern = *(unsigned long *)(pbPattern);
SINGLET = ulHashPattern & 0xFF;
Quadruplet2nd = SINGLET<<8;
Quadruplet3rd = SINGLET<<16;
Quadruplet4th = SINGLET<<24;
for ( ;; ) {
AdvanceHopperGrass = 0;
ulHashTarget = *(unsigned long *)(pbTarget-cbPattern);
if ( ulHashPattern == ulHashTarget ) {
count = cbPattern-1;
while ( count && *(char *)(pbPattern+(cbPattern-count)) == *(char *)(pbTarget-count) ) {
if ( cbPattern-1==AdvanceHopperGrass+count && SINGLET != *(char *)(pbTarget-count) ) AdvanceHopperGrass++;
count--;
}
if ( count == 0) Railgunhits++;
} else {
if ( Quadruplet2nd != (ulHashTarget & 0x0000FF00) ) {
AdvanceHopperGrass++;
if ( Quadruplet3rd != (ulHashTarget & 0x00FF0000) ) {
AdvanceHopperGrass++;
if ( Quadruplet4th != (ulHashTarget & 0xFF000000) ) AdvanceHopperGrass++;
}
}
}
AdvanceHopperGrass++;
pbTarget = pbTarget + AdvanceHopperGrass;
if (pbTarget > pbTargetMax)
return(NULL);
}
} else {
countSTATIC = cbPattern-2-2;
for (a=0; a < 256; a++) {bm_bc[a]=cbPattern; bm_bc2nd[a]=cbPattern+1;}
for (j=0; j < cbPattern-1; j++) bm_bc[pbPattern[j]]=cbPattern-j-1;
for (j=0; j < cbPattern; j++) bm_bc2nd[pbPattern[j]]=cbPattern-j;
for (a=0; a < 256*256; a++) {bm_Sunday_Order2[a]= cbPattern-1;}
for (j=0; j < cbPattern-1; j++) bm_Sunday_Order2[*(unsigned short *)(pbPattern+j)]=j;
ulHashPattern = *(unsigned short *)(pbPattern+cbPattern-1-1);
ulHashTarget = *(unsigned long *)(pbPattern);
AdvanceHopperGrass = 0;
i=0;
while (i <= cbTarget-cbPattern-1-1) {
Gulliver = bm_Sunday_Order2[*(unsigned short *)&pbTarget[i+cbPattern-1-1]];
if ( Gulliver == cbPattern-2 ) {
if ( *(unsigned long *)&pbTarget[i] == ulHashTarget) {
count = countSTATIC;
while ( count !=0 && *(char *)(pbPattern+(countSTATIC-count)+4) == *(char *)(&pbTarget[i]+(countSTATIC-count)+4) )
count--;
if ( count == 0) Railgunhits++;
}
i = i + 1;
} else if ( Gulliver == cbPattern-1 )
i = i + Gulliver;
else
i = i + cbPattern - Gulliver - 2;
AdvanceHopperGrass++;
}
if (i == cbTarget-cbPattern-1) {
if ( *(unsigned long *)&pbTarget[i] == ulHashPattern) {
count = countSTATIC;
while ( count !=0 && *(char *)(pbPattern+(countSTATIC-count)+4) == *(char *)(&pbTarget[i]+(countSTATIC-count)+4) )
count--;
if ( count == 0) Railgunhits++;
}
i++;
AdvanceHopperGrass++;
}
if (i == cbTarget-cbPattern) {
if ( *(unsigned long *)&pbTarget[i] == ulHashPattern) {
count = countSTATIC;
while ( count !=0 && *(char *)(pbPattern+(countSTATIC-count)+4) == *(char *)(&pbTarget[i]+(countSTATIC-count)+4) )
count--;
if ( count == 0) Railgunhits++;
}
AdvanceHopperGrass++;
}
GlobalSP += (int)((double)cbTarget/AdvanceHopperGrass*100);
GlobalI += AdvanceHopperGrass;
printf("Skip-Performance(bigger-the-better): %d%%, %d skips/iterations\n",(int)((double)cbTarget/AdvanceHopperGrass*100), AdvanceHopperGrass);
return(NULL);
}
}
}
I was very touched (thus inspired) by this cuore song:
Strong enough to tell you why
I feel enough to tell you something that's very close to me
In the eye of my mind
Feel the reasons that we thought
You just feel all
Falling I'm standing across the stream
Fearing saying I
I'm strong enough to find my courage
To find the way I'm always near it's brewing in my mind
Falling I'm standing across the stream
Fearing saying I
I'm strong enough to find my courage
To find the way I'm always near it's growing in my mind
/Artist: Elsiane - Across The Stream/
I salute everyone who feels-and-keeps the artistic way near.
An add-on from 2012-Jan-24:
Just wanted to see what Sunday's technique (applied on two chars, order 2) holds, this resulted in appearance of Railgun_Quadruplet_7Elsiane, the biggest-strider so far, I dream (already) of Railgun_Quadruplet_7Gulliver...
Horspool-Sunday-Sunday jumps to:
--------------------------------\
|
Horspool-Sunday jumps to:
-------------------------------\
|
Horspool jumps to:
------------------------------\
|
String: faster
Needle: fast
Sunday skip-table:
43215
Horspool skip-table:
32144
For above example:
Horspool jumps to A where A=4 'e' position
Sunday jumps to B where B=5 'r' position
Horspool-Sunday ('SunHorse') jumps to max(A,B) i.e. 5
Horspool-Sunday-Sunday ('Elsiane') jumps to max(A,B,C) i.e. max(5,6)=6 since neither 'e' nor 'r' appear in the needle i.e. right after 'r'
char * Railgun_Quadruplet_7Elsiane_count_hits (char * pbTarget, char * pbPattern, unsigned long cbTarget, unsigned long cbPattern)
{
char * pbTargetMax = pbTarget + cbTarget;
register unsigned long ulHashPattern;
register unsigned long ulHashTarget;
signed long count;
signed long countSTATIC;
unsigned char SINGLET;
unsigned long Quadruplet2nd;
unsigned long Quadruplet3rd;
unsigned long Quadruplet4th;
unsigned long AdvanceHopperGrass;
long i;
int a, j;
unsigned int bm_bc[256];
unsigned int bm_bc2nd[256];
unsigned char bm_Sunday_Order2[256*256];
if (cbPattern > cbTarget)
return(NULL);
if ( cbPattern<4) {
pbTarget = pbTarget+cbPattern;
ulHashPattern = ( (*(char *)(pbPattern))<<8 ) + *(pbPattern+(cbPattern-1));
if ( cbPattern==3) {
for ( ;; ) {
if ( ulHashPattern == ( (*(char *)(pbTarget-3))<<8 ) + *(pbTarget-1) ) {
if ( *(char *)(pbPattern+1) == *(char *)(pbTarget-2) ) Railgunhits++;
}
if ( (char)(ulHashPattern>>8) != *(pbTarget-2) ) pbTarget++;
pbTarget++;
if (pbTarget > pbTargetMax)
return(NULL);
}
} else {
}
for ( ;; ) {
if ( ulHashPattern == ( (*(char *)(pbTarget-2))<<8 ) + *(pbTarget-1) )
Railgunhits++;
if ( (char)(ulHashPattern>>8) != *(pbTarget-1) ) pbTarget++;
pbTarget++;
if (pbTarget > pbTargetMax)
return(NULL);
}
} else {
if (cbTarget<961) {
pbTarget = pbTarget+cbPattern;
ulHashPattern = *(unsigned long *)(pbPattern);
SINGLET = ulHashPattern & 0xFF;
Quadruplet2nd = SINGLET<<8;
Quadruplet3rd = SINGLET<<16;
Quadruplet4th = SINGLET<<24;
for ( ;; ) {
AdvanceHopperGrass = 0;
ulHashTarget = *(unsigned long *)(pbTarget-cbPattern);
if ( ulHashPattern == ulHashTarget ) {
count = cbPattern-1;
while ( count && *(char *)(pbPattern+(cbPattern-count)) == *(char *)(pbTarget-count) ) {
if ( cbPattern-1==AdvanceHopperGrass+count && SINGLET != *(char *)(pbTarget-count) ) AdvanceHopperGrass++;
count--;
}
if ( count == 0) Railgunhits++;
} else {
if ( Quadruplet2nd != (ulHashTarget & 0x0000FF00) ) {
AdvanceHopperGrass++;
if ( Quadruplet3rd != (ulHashTarget & 0x00FF0000) ) {
AdvanceHopperGrass++;
if ( Quadruplet4th != (ulHashTarget & 0xFF000000) ) AdvanceHopperGrass++;
}
}
}
AdvanceHopperGrass++;
pbTarget = pbTarget + AdvanceHopperGrass;
if (pbTarget > pbTargetMax)
return(NULL);
}
} else {
countSTATIC = cbPattern-2-2;
for (a=0; a < 256; a++) {bm_bc[a]=cbPattern; bm_bc2nd[a]=cbPattern+1;}
for (j=0; j < cbPattern-1; j++) bm_bc[pbPattern[j]]=cbPattern-j-1;
for (j=0; j < cbPattern; j++) bm_bc2nd[pbPattern[j]]=cbPattern-j;
for (a=0; a < 256*256; a++) {bm_Sunday_Order2[a]= 0;}
for (j=0; j < cbPattern; j++)
for (a=0; a < 256; a++) {
bm_Sunday_Order2[(a<<8) + pbPattern[j]]=1;
bm_Sunday_Order2[(pbPattern[j]<<8) + a]=1;
}
ulHashPattern = *(unsigned long *)(pbPattern);
AdvanceHopperGrass = 0;
i=0;
while (i <= cbTarget-cbPattern-1-1) {
if ( *(unsigned long *)&pbTarget[i] == ulHashPattern) {
count = countSTATIC;
while ( count !=0 && *(char *)(pbPattern+(countSTATIC-count)+4) == *(char *)(&pbTarget[i]+(countSTATIC-count)+4) )
count--;
if ( count == 0) Railgunhits++;
}
if ( bm_Sunday_Order2[*(unsigned short *)&pbTarget[i+cbPattern]] == 0 )
i = i + cbPattern + 1 + 1;
else
if ( bm_bc[(unsigned char)pbTarget[i+cbPattern-1]] < bm_bc2nd[(unsigned char)pbTarget[i+(cbPattern)]] )
i= i + bm_bc2nd[(unsigned char)pbTarget[i+(cbPattern)]];
else
i= i + bm_bc[(unsigned char)pbTarget[i+cbPattern-1]];
AdvanceHopperGrass++;
}
if (i == cbTarget-cbPattern-1) {
if ( *(unsigned long *)&pbTarget[i] == ulHashPattern) {
count = countSTATIC;
while ( count !=0 && *(char *)(pbPattern+(countSTATIC-count)+4) == *(char *)(&pbTarget[i]+(countSTATIC-count)+4) )
count--;
if ( count == 0) Railgunhits++;
}
i++;
AdvanceHopperGrass++;
}
if (i == cbTarget-cbPattern) {
if ( *(unsigned long *)&pbTarget[i] == ulHashPattern) {
count = countSTATIC;
while ( count !=0 && *(char *)(pbPattern+(countSTATIC-count)+4) == *(char *)(&pbTarget[i]+(countSTATIC-count)+4) )
count--;
if ( count == 0) Railgunhits++;
}
AdvanceHopperGrass++;
}
GlobalSP += (int)((double)cbTarget/AdvanceHopperGrass*100);
GlobalI += AdvanceHopperGrass;
return(NULL);
}
}
}
I like a lot this look-ahead approach of D.M. Sunday, but can't see how to quicken it, any ideas?
An add-on from 2012-Jan-22:
[
A face-off
'SunHorse' vs
Railgun_Quadruplet_7sun vs
Railgun_Quadruplet_7 vs
Boyer_Moore vs
Brute_Force
Okay, here comes
Railgun_Quadruplet_7sun - the new fastest hitter.
Two tests were done - the first/second '6x2'/'52' searches 12/52 patterns (for all appearances i.e. counting all hits) into 206,908,949 bytes long English text.
First, 7deuce is in the past, a bad luck it was, yes.
Wow, I have received collateral damage by the last shoot-out in which D.M. Sunday excels both in Speed-Performance and Skip-Performance.
Speed-Performance (measured in KB/clock, bigger-the-better) Summary:
Function '6x2' test (12 patterns 4-19 chars in length)
Railgun_Quadruplet_7sun 2,092KB/clock / 153,936% / 005,572,202,064 Iterations BONBON
Railgun_Quadruplet_7 2,030KB/clock / 137,568% / 006,454,416,320 Iterations Leaves-The-Town
Railgun_Quadruplet_7sunhorse 1,250KB/clock / 173,840% / 005,145,288,080 Iterations UNDERRATED
Boyer_Moore_Flensburg 1,154KB/clock / 139,168% / 006,428,115,568 Iterations Slow
Railgun_Quadruplet_7deuce 1,125KB/clock / 155,744% / 005,995,597,776 Iterations Slower
Brute_Force_Dummy 0,338KB/clock / 019,200% / 039,726,516,640 Iterations Slowest
Function '52' test (52 patterns 4-175 chars in length)
Railgun_Quadruplet_7sun 1,657KB/clock / 768,416% / 022,362,482,640 Iterations BONBON
Railgun_Quadruplet_7 1,637KB/clock / 708,768% / 025,368,049,712 Iterations Leaves-The-Town
Railgun_Quadruplet_7sunhorse 1,046KB/clock / 939,520% / 020,187,534,736 Iterations UNDERRATED
Railgun_Quadruplet_7deuce 0,923KB/clock / 864,048% / 023,292,952,992 Iterations Slow
Boyer_Moore_Flensburg 0,919KB/clock / 720,272% / 025,227,312,176 Iterations Slower
Brute_Force_Dummy 0,276KB/clock / 083,200% / 172,148,232,496 Iterations Slowest
Skip-Performance (measured in %, bigger-the-better; Iterations, lesser-the-better) Summary:
Function '6x2' test (12 patterns 4-19 chars in length)
Railgun_Quadruplet_7sunhorse 1,250KB/clock / 173,840% / 005,145,288,080 Iterations BONBON
Railgun_Quadruplet_7deuce 1,125KB/clock / 155,744% / 005,995,597,776 Iterations Leaves-The-Town
Railgun_Quadruplet_7sun 2,092KB/clock / 153,936% / 005,572,202,064 Iterations Bitter-Sweet
Boyer_Moore_Flensburg 1,154KB/clock / 139,168% / 006,428,115,568 Iterations Loopy
Railgun_Quadruplet_7 2,030KB/clock / 137,568% / 006,454,416,320 Iterations Loopier
Brute_Force_Dummy 0,338KB/clock / 019,200% / 039,726,516,640 Iterations Loopiest
Function '52' test (52 patterns 4-175 chars in length)
Railgun_Quadruplet_7sunhorse 1,046KB/clock / 939,520% / 020,187,534,736 Iterations BONBON
Railgun_Quadruplet_7deuce 0,923KB/clock / 864,048% / 023,292,952,992 Iterations Leaves-The-Town
Railgun_Quadruplet_7sun 1,657KB/clock / 768,416% / 022,362,482,640 Iterations Bitter-Sweet
Boyer_Moore_Flensburg 0,919KB/clock / 720,272% / 025,227,312,176 Iterations Loopy
Railgun_Quadruplet_7 1,637KB/clock / 708,768% / 025,368,049,712 Iterations Loopier
Brute_Force_Dummy 0,276KB/clock / 083,200% / 172,148,232,496 Iterations Loopiest
For English texts I would have no second thought which function to choose: it is Railgun_Quadruplet_7sunhorse (Boyer-Moore reinforced by Sunday-Horspool) - strongest of all Boyer-Moore variants(including the original), sadly [still] not among the fastest.
I can't figure out the cause for such unproportionality: 'sunhorse' features amazing (for English texts) Skip-Performance i.e. number of pattern comparisions agaist the text is so low but Speed-Performance (on this CPU) lags behind.
The test package is downloadable at:
http://www.sanmayce.com/Downloads/_KAZE_strstr_SHORT-SHOWDOWN_7sun_vs_7_vs_7sunhorse_vs_7deuce_vs_BMF.7z
The test package contains:
D:\_KAZE_strstr_SHORT-SHOWDOWN_7sun_vs_7_vs_7sunhorse_vs_7deuce_vs_BMF>dir
01/22/2012 07:00 AM 59,198 FH_Flensburg.zip
01/22/2012 07:00 AM 206,908,949 OSHO.TXT
01/22/2012 07:00 AM 469,283 Results_Pentium_Merom_2166Mhz.txt
01/22/2012 07:00 AM 175 RUNME_IT_TAKES_some_20_MINUTES.BAT
01/22/2012 07:00 AM 191 strstr_COMPILE_Microsoft.bat
01/22/2012 07:00 AM 603,625 strstr_SHORT-SHOWDOWN.c
01/22/2012 07:00 AM 633,053 strstr_SHORT-SHOWDOWN.cod
01/22/2012 07:00 AM 108,032 strstr_SHORT-SHOWDOWN_Microsoft_v16_Ox.exe
D:\_KAZE_strstr_SHORT-SHOWDOWN_7sun_vs_7_vs_7sunhorse_vs_7deuce_vs_BMF>
A very easy-to-the-eyes/mind article at http://www.inf.fh-flensburg.de/lang/algorithmen/pattern/bm.htm
Here I say 'Triple THANKS' to Prof. Dr. Hans Werner Lang (Institut für Medieninformatik und Technische Informatik Fachhochschule Flensburg), I used your code (after a straightforward port). On his page (http://www.inf.fh-flensburg.de/lang/) there are very well presented algorithms - a very good resource indeed.
I expect Railgun_Quadruplet_7sunhorse to become a gun of choice but only on CPUs with FAR more effective branching/memory handling than my Pentium Merom.
I still do not understand Turbo Boyer-Moore algorithm, maybe it holds a significant Skip-Performance boost.
This project is under researching, that is why I have called it a quest. Anyone who can point out faster functions is welcomed to share them with us here.
My favorite UNDERRATED gun has main loop as follows:
; 3209 : i=0;
; 3210 : while (i <= cbTarget-cbPattern-1) {
010e6 8b 4c 24 20 mov ecx, DWORD PTR _cbTarget$GSCopy$[esp+2104]
010ea 89 44 24 28 mov DWORD PTR _ulHashPattern$[esp+2104], eax
010ee 33 c0 xor eax, eax
010f0 2b ca sub ecx, edx
010f2 89 4c 24 14 mov DWORD PTR tv554[esp+2104], ecx
010f6 8d 0c 13 lea ecx, DWORD PTR [ebx+edx]
010f9 89 44 24 10 mov DWORD PTR _AdvanceHopperGrass$[esp+2104], eax
010fd 89 4c 24 18 mov DWORD PTR tv451[esp+2104], ecx
$LN13@Railgun_Qu@5:
; 3211 : ulHashTarget=*(unsigned long *)&pbTarget[i];
; 3212 : if ( ulHashTarget == ulHashPattern)
01101 8b 54 24 28 mov edx, DWORD PTR _ulHashPattern$[esp+2104]
01105 39 14 18 cmp DWORD PTR [eax+ebx], edx
01108 75 41 jne SHORT $LN8@Railgun_Qu@5
; 3213 : {
; 3214 : count = countSTATIC;
0110a 8b 7c 24 1c mov edi, DWORD PTR _countSTATIC$[esp+2104]
; 3215 : while ( count !=0 && *(char *)(pbPattern+(countSTATIC-count)+4) == *(char *)(&pbTarget[i]+(countSTATIC-count)+4) ) {
0110e 85 ff test edi, edi
01110 74 2d je SHORT $LN85@Railgun_Qu@5
; 3213 : {
; 3214 : count = countSTATIC;
01112 8d 56 04 lea edx, DWORD PTR [esi+4]
01115 8d 74 18 04 lea esi, DWORD PTR [eax+ebx+4]
01119 8d a4 24 00 00
00 00 npad 7
$LL10@Railgun_Qu@5:
; 3215 : while ( count !=0 && *(char *)(pbPattern+(countSTATIC-count)+4) == *(char *)(&pbTarget[i]+(countSTATIC-count)+4) ) {
01120 8a 0a mov cl, BYTE PTR [edx]
01122 3a 0e cmp cl, BYTE PTR [esi]
01124 75 11 jne SHORT $LN9@Railgun_Qu@5
; 3216 : count--;
01126 46 inc esi
01127 42 inc edx
01128 4f dec edi
01129 75 f5 jne SHORT $LL10@Railgun_Qu@5
; 3217 : }
; 3218 : if ( count == 0) Railgunhits++;
0112b 8b 74 24 24 mov esi, DWORD PTR _pbPattern$GSCopy$[esp+2104]
0112f ff 05 00 00 00
00 inc DWORD PTR _Railgunhits
01135 eb 14 jmp SHORT $LN8@Railgun_Qu@5
$LN9@Railgun_Qu@5:
01137 85 ff test edi, edi
01139 75 0c jne SHORT $LN111@Railgun_Qu@5
0113b 8b 74 24 24 mov esi, DWORD PTR _pbPattern$GSCopy$[esp+2104]
$LN85@Railgun_Qu@5:
0113f ff 05 00 00 00
00 inc DWORD PTR _Railgunhits
01145 eb 04 jmp SHORT $LN8@Railgun_Qu@5
$LN111@Railgun_Qu@5:
01147 8b 74 24 24 mov esi, DWORD PTR _pbPattern$GSCopy$[esp+2104]
$LN8@Railgun_Qu@5:
; 3219 : }
; 3220 :
; 3221 :
; 3222 :
; 3223 :
; 3224 :
; 3225 :
; 3226 : if ( bm_bc[(unsigned char)pbTarget[i+cbPattern-1]] < bm_bc2nd[(unsigned char)pbTarget[i+(cbPattern)]] )
0114b 8b 54 24 18 mov edx, DWORD PTR tv451[esp+2104]
0114f 0f b6 4c 02 ff movzx ecx, BYTE PTR [edx+eax-1]
01154 0f b6 14 02 movzx edx, BYTE PTR [edx+eax]
01158 8b 4c 8c 30 mov ecx, DWORD PTR _bm_bc$[esp+ecx*4+2104]
0115c 8b 94 94 30 04
00 00 mov edx, DWORD PTR _bm_bc2nd$[esp+edx*4+2104]
01163 3b ca cmp ecx, edx
01165 73 04 jae SHORT $LN7@Railgun_Qu@5
; 3227 : i= i + bm_bc2nd[(unsigned char)pbTarget[i+(cbPattern)]];
01167 03 c2 add eax, edx
; 3228 : else
01169 eb 02 jmp SHORT $LN6@Railgun_Qu@5
$LN7@Railgun_Qu@5:
; 3229 : i= i + bm_bc[(unsigned char)pbTarget[i+cbPattern-1]];
0116b 03 c1 add eax, ecx
$LN6@Railgun_Qu@5:
; 3207 :
; 3208 : AdvanceHopperGrass = 0;
; 3209 : i=0;
; 3210 : while (i <= cbTarget-cbPattern-1) {
0116d 8b 4c 24 14 mov ecx, DWORD PTR tv554[esp+2104]
; 3230 :
; 3231 :
; 3232 : AdvanceHopperGrass++;
01171 ff 44 24 10 inc DWORD PTR _AdvanceHopperGrass$[esp+2104]
01175 49 dec ecx
01176 3b c1 cmp eax, ecx
01178 76 87 jbe SHORT $LN13@Railgun_Qu@5
; 3233 : }
The size is:
01178 - 01101 +1 +1 -4 = 75bytes, -4 because 'inc DWORD PTR _AdvanceHopperGrass$[esp+2104]' is only for statistical use.
]
Background
Function homepage:
http://www.sanmayce.com/Railgun/index.html
Note: In order to reproduce the tests shown on the picture (below) you need the test package (57.2 MB) downloadable at function's homepage, thanks Randor for pointing out.
Very informative resource on strstr search techniques at:
Tests done with 16 patterns with 197MB haystack (being a pure English book-like text).
The machine is a mainstream laptop on 2.1GHz with Intel Merom CPU.
Using the Code
The code is pretty simple and straightforward.
char * Railgun_Quadruplet_6ppp (char * pbTarget,
char * pbPattern,
unsigned long cbTarget,
unsigned long cbPattern)
{
char * pbTargetMax = pbTarget + cbTarget;
register unsigned long ulHashPattern;
unsigned long ulHashTarget;
signed long count;
signed long countSTATIC;
unsigned char SINGLET;
unsigned long Quadruplet2nd;
unsigned long Quadruplet3rd;
unsigned long Quadruplet4th;
unsigned long AdvanceHopperGrass;
long i; int a, j, bm_bc[ASIZE]; unsigned char ch;
if (cbPattern > cbTarget)
return(NULL);
if ( cbPattern<4) { pbTarget = pbTarget+cbPattern;
ulHashPattern = ( (*(char *)(pbPattern))<<8 ) + *(pbPattern+(cbPattern-1));
if ( cbPattern==3) {
for ( ;; )
{
if ( ulHashPattern == ( (*(char *)(pbTarget-3))<<8 ) + *(pbTarget-1) ) {
if ( *(char *)(pbPattern+1) == *(char *)(pbTarget-2) ) return((pbTarget-3));
}
if ( (char)(ulHashPattern>>8) != *(pbTarget-2) ) pbTarget++;
pbTarget++;
if (pbTarget > pbTargetMax)
return(NULL);
}
} else {
}
for ( ;; )
{
if ( ulHashPattern == ( (*(char *)(pbTarget-2))<<8 ) + *(pbTarget-1) )
return((pbTarget-2));
if ( (char)(ulHashPattern>>8) != *(pbTarget-1) ) pbTarget++;
pbTarget++;
if (pbTarget > pbTargetMax)
return(NULL);
}
} else { if (cbTarget<961) {
pbTarget = pbTarget+cbPattern;
ulHashPattern = *(unsigned long *)(pbPattern);
SINGLET = ulHashPattern & 0xFF;
Quadruplet2nd = SINGLET<<8;
Quadruplet3rd = SINGLET<<16;
Quadruplet4th = SINGLET<<24;
for ( ;; )
{
AdvanceHopperGrass = 0;
ulHashTarget = *(unsigned long *)(pbTarget-cbPattern);
if ( ulHashPattern == ulHashTarget ) { count = cbPattern-1;
while ( count && *(char *)(pbPattern+(cbPattern-count)) ==
*(char *)(pbTarget-count) ) {
if ( cbPattern-1==AdvanceHopperGrass+count && SINGLET !=
*(char *)(pbTarget-count) ) AdvanceHopperGrass++;
count--;
}
if ( count == 0) return((pbTarget-cbPattern));
} else { if ( Quadruplet2nd != (ulHashTarget & 0x0000FF00) ) {
AdvanceHopperGrass++;
if ( Quadruplet3rd != (ulHashTarget & 0x00FF0000) ) {
AdvanceHopperGrass++;
if ( Quadruplet4th != (ulHashTarget & 0xFF000000) ) AdvanceHopperGrass++;
}
}
}
AdvanceHopperGrass++;
pbTarget = pbTarget + AdvanceHopperGrass;
if (pbTarget > pbTargetMax)
return(NULL);
}
} else { countSTATIC = cbPattern-2-2; ulHashPattern = *(unsigned long *)(pbPattern);
for (a=0; a < ASIZE; a++) bm_bc[a]=cbPattern;
for (j=0; j < cbPattern-1; j++) bm_bc[pbPattern[j]]=cbPattern-j-1;
i=0;
while (i <= cbTarget-cbPattern) {
ch=pbTarget[i+cbPattern-1];
if (ch == pbPattern[cbPattern-1] &&
*(long *)&pbTarget[i] == ulHashPattern) {
count = countSTATIC;
while ( count !=0 && *(char *)(pbPattern+1+3+(countSTATIC-count)) ==
*(char *)(&pbTarget[i]+1+3+(countSTATIC-count)) )
{ count--;
}
if ( count <= 0) return(pbTarget+i);
}
i+=bm_bc[ch];
}
return(NULL);
} } }
Stomp stomp I arrived, here comes the Railgun_Quadruplet
revision 7:
char * Railgun_Quadruplet_7 (char * pbTarget, char * pbPattern,
unsigned long cbTarget, unsigned long cbPattern)
{
char * pbTargetMax = pbTarget + cbTarget;
register unsigned long ulHashPattern;
unsigned long ulHashTarget;
signed long count;
signed long countSTATIC;
unsigned char SINGLET;
unsigned long Quadruplet2nd;
unsigned long Quadruplet3rd;
unsigned long Quadruplet4th;
unsigned long AdvanceHopperGrass;
long i;
int a, j, bm_bc[ASIZE];
unsigned char ch;
if (cbPattern > cbTarget) return(NULL);
if (cbPattern<4) {
pbTarget = pbTarget+cbPattern;
ulHashPattern = ( (*(char *)(pbPattern))<<8 ) +
*(pbPattern+(cbPattern-1));
if (cbPattern==3) {
for ( ;; ) {
if ( ulHashPattern ==
( (*(char *)(pbTarget-3))<<8 ) + *(pbTarget-1) ) {
if ( *(char *)(pbPattern+1) ==
*(char *)(pbTarget-2) ) return((pbTarget-3));
}
if ( (char)(ulHashPattern>>8) !=
*(pbTarget-2) ) pbTarget++;
pbTarget++;
if (pbTarget > pbTargetMax) return(NULL);
}
} else {
}
for ( ;; ) {
if ( ulHashPattern == ( (*(char *)(pbTarget-2))<<8 )
+ *(pbTarget-1) ) return((pbTarget-2));
if ( (char)(ulHashPattern>>8) != *(pbTarget-1) )
pbTarget++;
pbTarget++;
if (pbTarget > pbTargetMax) return(NULL);
}
} else {
if (cbTarget<961) {
pbTarget = pbTarget+cbPattern;
ulHashPattern = *(unsigned long *)(pbPattern);
SINGLET = ulHashPattern & 0xFF;
Quadruplet2nd = SINGLET<<8;
Quadruplet3rd = SINGLET<<16;
Quadruplet4th = SINGLET<<24;
for ( ;; ) {
AdvanceHopperGrass = 0;
ulHashTarget = *(unsigned long *)(pbTarget-cbPattern);
if ( ulHashPattern == ulHashTarget ) {
count = cbPattern-1;
while ( count && *(char *)
(pbPattern+(cbPattern-count)) ==
*(char *)(pbTarget-count) ) {
if ( cbPattern-1==
AdvanceHopperGrass+count &&
SINGLET != *(char *)(pbTarget-count)
) AdvanceHopperGrass++;
count--;
}
if ( count == 0)
return((pbTarget-cbPattern));
} else {
if ( Quadruplet2nd != (ulHashTarget
& 0x0000FF00) ) {
AdvanceHopperGrass++;
if ( Quadruplet3rd !=
(ulHashTarget & 0x00FF0000) ) {
AdvanceHopperGrass++;
if ( Quadruplet4th !=
(ulHashTarget & 0xFF000000)
) AdvanceHopperGrass++;
}
}
}
AdvanceHopperGrass++;
pbTarget = pbTarget + AdvanceHopperGrass;
if (pbTarget > pbTargetMax) return(NULL);
}
} else {
countSTATIC = cbPattern-2-2;
ulHashPattern = *(unsigned long *)(pbPattern);
for (a=0; a < ASIZE; a++) bm_bc[a]=cbPattern;
for (j=0; j < cbPattern-1; j++) bm_bc[pbPattern[j]]=
cbPattern-j-1;
i=0;
while (i <= cbTarget-cbPattern) {
if ( *(unsigned long *)&pbTarget[i] ==
ulHashPattern ) {
count = countSTATIC;
while ( count !=0 && *(char *)
(pbPattern+1+3+(countSTATIC-count)) ==
*(char *)(&pbTarget[i]+1+3+
(countSTATIC-count)) ) count--;
if ( count == 0) return(pbTarget+i);
}
i= i + bm_bc[(unsigned char)pbTarget[i+cbPattern-1]];
}
return(NULL);
}
}
}
The tweak (in BMH fragment) which speeds-up significantly the obsolete (already) r.6+++ is this:
countSTATIC = cbPattern-2-2; ulHashPattern = *(unsigned long *)(pbPattern);
for (a=0; a < ASIZE; a++) bm_bc[a]=cbPattern;
for (j=0; j < cbPattern-1; j++) bm_bc[pbPattern[j]]=cbPattern-j-1;
i=0;
while (i <= cbTarget-cbPattern) {
if ( *(unsigned long *)&pbTarget[i] == ulHashPattern ) {
count = countSTATIC;
while ( count !=0 && *(char *)(pbPattern+(countSTATIC-count)+4) ==
*(char *)(&pbTarget[i]+(countSTATIC-count)+4) ) { count--;
}
if ( count == 0) return(pbTarget+i);
}
i= i + bm_bc[(unsigned char)pbTarget[i+cbPattern-1]];
}
return(NULL);
In order to see what path I had to traverse to make this tweak, you may see the discussion board (r.7m thread). The tests below show NOT the full boost simply because they average times for all lengths thus losing/diminishing the high speeds times (being 4-3-2 times lesser compared with pattern lengths 2-3-4-5-6).
D:\_KAZE_new-stuff>strstr_SHORT-SHOWDOWN_Microsoft_v16_Ox.exe
strstr_SHORT-SHOWDOWN, revision 7, written by Kaze.
Input Pattern(up to 19+2000 chars): frustration
Doing Search for Pattern(11bytes) into String(206908949bytes) as-one-line ...
WARNING! The next line works with BMH only for pattern 4[+] long,
otherwise (for 2 and 3) other searcher takes over!
Railgun_6pp_hits/Railgun_6pp_clocks: 1355/101
Railgun_6pp performance: 2000KB/clock
Doing Search for Pattern(11bytes) into String(206908949bytes) as-one-line ...
WARNING! The next line works with BMH only for pattern 4[+] long,
otherwise (for 2 and 3) other searcher takes over!
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 1355/91
Railgun_Quadruplet_7 performance: 2220KB/clock
Note: Executing the next two tests 256 times i.e. the search is for 256x8x2 patterns!
Doing Search for 8x2 Patterns into String(206908949bytes) as-one-line ...
Not-Counting-Hits-Just-Returns-First-One
Found ('an') at 157 position, Railgun_Quadruplet_6pp performance: 0KB/clock
Found ('to') at 853 position, Railgun_Quadruplet_6pp performance: 0KB/clock
Found ('TDK') at -4325408 position, Railgun_Quadruplet_6pp performance: 922KB/clock
Found ('the') at 511 position, Railgun_Quadruplet_6pp performance: 0KB/clock
Found ('fast') at 42381 position, Railgun_Quadruplet_6pp performance: 41KB/clock
Found ('easy') at 30163 position, Railgun_Quadruplet_6pp performance: 29KB/clock
Found ('grmbl') at -4325408 position, Railgun_Quadruplet_6pp performance: 1167KB/clock
Found ('email') at 69587982 position, Railgun_Quadruplet_6pp performance: 1078KB/clock
Found ('pasting') at 134600126 position, Railgun_Quadruplet_6pp performance: 1383KB/clock
Found ('amazing') at 1629806 position, Railgun_Quadruplet_6pp performance: 93KB/clock
Found ('underdog') at 137424498 position,
Railgun_Quadruplet_6pp performance: 1698KB/clock
Found ('superdog') at -4325408 position, Railgun_Quadruplet_6pp performance: 1603KB/clock
Found ('participants') at 16060 position, Railgun_Quadruplet_6pp performance: 15KB/clock
Found ('skillessness') at -4325408 position,
Railgun_Quadruplet_6pp performance: 1836KB/clock
Found ('I should have known') at 41831880 position,
Railgun_Quadruplet_6pp performance: 2403KB/clock
Found ('human consciousness') at 3863 position,
Railgun_Quadruplet_6pp performance: 3KB/clock
Railgun_Quadruplet_6pp 8x2 i.e. average performance: 1342KB/clock
ReallyTraversed: 310,477,846,528 bytes
Doing Search for 8x2 Patterns into String(206908949bytes) as-one-line ...
Not-Counting-Hits-Just-Returns-First-One
Found ('an') at 157 position, Railgun_Quadruplet_7 performance: 0KB/clock
Found ('to') at 853 position, Railgun_Quadruplet_7 performance: 0KB/clock
Found ('TDK') at -4325408 position, Railgun_Quadruplet_7 performance: 990KB/clock
Found ('the') at 511 position, Railgun_Quadruplet_7 performance: 0KB/clock
Found ('fast') at 42381 position, Railgun_Quadruplet_7 performance: 41KB/clock
Found ('easy') at 30163 position, Railgun_Quadruplet_7 performance: 29KB/clock
Found ('grmbl') at -4325408 position, Railgun_Quadruplet_7 performance: 1069KB/clock
Found ('email') at 69587982 position, Railgun_Quadruplet_7 performance: 1078KB/clock
Found ('pasting') at 134600126 position, Railgun_Quadruplet_7 performance: 1383KB/clock
Found ('amazing') at 1629806 position, Railgun_Quadruplet_7 performance: 1591KB/clock
Found ('underdog') at 137424498 position, Railgun_Quadruplet_7 performance: 1698KB/clock
Found ('superdog') at -4325408 position, Railgun_Quadruplet_7 performance: 1422KB/clock
Found ('participants') at 16060 position, Railgun_Quadruplet_7 performance: 15KB/clock
Found ('skillessness') at -4325408 position,
Railgun_Quadruplet_7 performance: 2149KB/clock
Found ('I should have known') at 41831880 position,
Railgun_Quadruplet_7 performance: 2403KB/clock
Found ('human consciousness') at 3863 position,
Railgun_Quadruplet_7 performance: 3KB/clock
Railgun_Quadruplet_7 8x2 i.e. average performance: 1354KB/clock
ReallyTraversed: 310,477,846,528 bytes
Doing Search for 8x2 Patterns into String(206908949bytes) as-one-line ...
Found ('an') 1987797 time(s), BM_HORSPOOL performance: 293KB/clock
Found ('to') 1076629 time(s), BM_HORSPOOL performance: 293KB/clock
Found ('TDK') 0 time(s), BM_HORSPOOL performance: 516KB/clock
Found ('the') 2114180 time(s), BM_HORSPOOL performance: 379KB/clock
Found ('fast') 5945 time(s), BM_HORSPOOL performance: 585KB/clock
Found ('easy') 5191 time(s), BM_HORSPOOL performance: 585KB/clock
Found ('grmbl') 0 time(s), BM_HORSPOOL performance: 759KB/clock
Found ('email') 1 time(s), BM_HORSPOOL performance: 756KB/clock
Found ('pasting') 2 time(s), BM_HORSPOOL performance: 918KB/clock
Found ('amazing') 323 time(s), BM_HORSPOOL performance: 1074KB/clock
Found ('underdog') 4 time(s), BM_HORSPOOL performance: 1069KB/clock
Found ('superdog') 0 time(s), BM_HORSPOOL performance: 1074KB/clock
Found ('participants') 147 time(s), BM_HORSPOOL performance: 1603KB/clock
Found ('skillessness') 0 time(s), BM_HORSPOOL performance: 1422KB/clock
Found ('I should have known') 1 time(s), BM_HORSPOOL performance: 1433KB/clock
Found ('human consciousness') 519 time(s), BM_HORSPOOL performance: 1820KB/clock
BM_HORSPOOL 8x2 i.e. average performance: 666KB/clock
Doing Search for 8x2 Patterns into String(206908949bytes) as-one-line ...
Found ('an') 1987797 time(s), Railgun_6pp performance: 678KB/clock
Found ('to') 1076629 time(s), Railgun_6pp performance: 643KB/clock
Found ('TDK') 0 time(s), Railgun_6pp performance: 1174KB/clock
Found ('the') 2114180 time(s), Railgun_6pp performance: 678KB/clock
Found ('fast') 5945 time(s), Railgun_6pp performance: 918KB/clock
Found ('easy') 5191 time(s), Railgun_6pp performance: 990KB/clock
Found ('grmbl') 0 time(s), Railgun_6pp performance: 1287KB/clock
Found ('email') 1 time(s), Railgun_6pp performance: 1167KB/clock
Found ('pasting') 2 time(s), Railgun_6pp performance: 1603KB/clock
Found ('amazing') 323 time(s), Railgun_6pp performance: 1603KB/clock
Found ('underdog') 4 time(s), Railgun_6pp performance: 1603KB/clock
Found ('superdog') 0 time(s), Railgun_6pp performance: 1820KB/clock
Found ('participants') 147 time(s), Railgun_6pp performance: 2149KB/clock
Found ('skillessness') 0 time(s), Railgun_6pp performance: 2126KB/clock
Found ('I should have known') 1 time(s), Railgun_6pp performance: 2557KB/clock
Found ('human consciousness') 519 time(s), Railgun_6pp performance: 2126KB/clock
Railgun_6pp 8x2 i.e. average performance: 1218KB/clock
Doing Search for 8x2 Patterns into String(206908949bytes) as-one-line ...
Found ('an') 1987797 time(s), Railgun_Quadruplet_7 performance: 643KB/clock
Found ('to') 1076629 time(s), Railgun_Quadruplet_7 performance: 678KB/clock
Found ('TDK') 0 time(s), Railgun_Quadruplet_7 performance: 1074KB/clock
Found ('the') 2114180 time(s), Railgun_Quadruplet_7 performance: 643KB/clock
Found ('fast') 5945 time(s), Railgun_Quadruplet_7 performance: 1074KB/clock
Found ('easy') 5191 time(s), Railgun_Quadruplet_7 performance: 1074KB/clock
Found ('grmbl') 0 time(s), Railgun_Quadruplet_7 performance: 1287KB/clock
Found ('email') 1 time(s), Railgun_Quadruplet_7 performance: 1167KB/clock
Found ('pasting') 2 time(s), Railgun_Quadruplet_7 performance: 1603KB/clock
Found ('amazing') 323 time(s), Railgun_Quadruplet_7 performance: 1603KB/clock
Found ('underdog') 4 time(s), Railgun_Quadruplet_7 performance: 1820KB/clock
Found ('superdog') 0 time(s), Railgun_Quadruplet_7 performance: 1603KB/clock
Found ('participants') 147 time(s), Railgun_Quadruplet_7 performance: 2557KB/clock
Found ('skillessness') 0 time(s), Railgun_Quadruplet_7 performance: 2126KB/clock
Found ('I should have known') 1 time(s), Railgun_Quadruplet_7 performance: 2557KB/clock
Found ('human consciousness') 519 time(s),
Railgun_Quadruplet_7 performance: 2557KB/clock
Railgun_Quadruplet_7 8x2 i.e. average performance: 1232KB/clock
^C
D:\_KAZE_new-stuff>
In my opinion, it is one of the seven commandments of speed: do not put the cart before the horse, i.e., for adjacent (32?) memory accesses first extract the byte with lowest address. strstr_SHORT-SHOWDOWN_Microsoft_v16_Ox.exe, revision 7, some quick inputs:
D:\_KAZE_new-stuff>strstr_SHORT-SHOWDOWN_Microsoft_v16_Ox.exe
strstr_SHORT-SHOWDOWN, revision 7, written by Kaze.
Input Pattern(up to 19+2000 chars): fast
Doing Search for Pattern(4bytes) into String(206908949bytes) as-one-line ...
WARNING! The next line works with BMH only for pattern 4[+] long,
otherwise (for 2 and 3) other searcher takes over!
Railgun_6pp_hits/Railgun_6pp_clocks: 5945/215
Railgun_6pp performance: 939KB/clock
Doing Search for Pattern(4bytes) into String(206908949bytes) as-one-line ...
WARNING! The next line works with BMH only for pattern 4[+] long,
otherwise (for 2 and 3) other searcher takes over!
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 5945/193
Railgun_Quadruplet_7 performance: 1046KB/clock
^C
...
Input Pattern(up to 19+2000 chars): from
Railgun_6pp_hits/Railgun_6pp_clocks: 83155/196
Railgun_6pp performance: 1030KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 83155/194
Railgun_Quadruplet_7 performance: 1041KB/clock
Input Pattern(up to 19+2000 chars): rabea
Railgun_6pp_hits/Railgun_6pp_clocks: 0/176
Railgun_6pp performance: 1148KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/162
Railgun_Quadruplet_7 performance: 1247KB/clock
Input Pattern(up to 19+2000 chars): makes
Railgun_6pp_hits/Railgun_6pp_clocks: 9281/172
Railgun_6pp performance: 1174KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 9281/162
Railgun_Quadruplet_7 performance: 1247KB/clock
Input Pattern(up to 19+2000 chars): monkey
Railgun_6pp_hits/Railgun_6pp_clocks: 1691/141
Railgun_6pp performance: 1433KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 1691/139
Railgun_Quadruplet_7 performance: 1453KB/clock
Input Pattern(up to 19+2000 chars): punny
Railgun_6pp_hits/Railgun_6pp_clocks: 0/156
Railgun_6pp performance: 1295KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/155
Railgun_Quadruplet_7 performance: 1303KB/clock
Input Pattern(up to 19+2000 chars): funny
Railgun_6pp_hits/Railgun_6pp_clocks: 179/157
Railgun_6pp performance: 1287KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 179/156
Railgun_Quadruplet_7 performance: 1295KB/clock
Input Pattern(up to 19+2000 chars): ramjet
Railgun_6pp_hits/Railgun_6pp_clocks: 0/152
Railgun_6pp performance: 1329KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/137
Railgun_Quadruplet_7 performance: 1474KB/clock
Input Pattern(up to 19+2000 chars): fallen
Railgun_6pp_hits/Railgun_6pp_clocks: 2578/151
Railgun_6pp performance: 1338KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 2578/138
Railgun_Quadruplet_7 performance: 1464KB/clock
Input Pattern(up to 19+2000 chars): elephant
Railgun_6pp_hits/Railgun_6pp_clocks: 1198/124
Railgun_6pp performance: 1629KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 1198/112
Railgun_Quadruplet_7 performance: 1804KB/clock
Input Pattern(up to 19+2000 chars): layalina
Railgun_6pp_hits/Railgun_6pp_clocks: 0/121
Railgun_6pp performance: 1669KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/110
Railgun_Quadruplet_7 performance: 1836KB/clock
Input Pattern(up to 19+2000 chars): I should have known
Railgun_6pp_hits/Railgun_6pp_clocks: 1/89
Railgun_6pp performance: 2270KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 1/84
Railgun_Quadruplet_7 performance: 2405KB/clock
Input Pattern(up to 19+2000 chars): human consciousness
Railgun_6pp_hits/Railgun_6pp_clocks: 519/80
Railgun_6pp performance: 2525KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 519/75
Railgun_Quadruplet_7 performance: 2694KB/clock
Input Pattern(up to 19+2000 chars): But he's looking right through me
Railgun_6pp_hits/Railgun_6pp_clocks: 0/84
Railgun_6pp performance: 2405KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/76
Railgun_Quadruplet_7 performance: 2658KB/clock
Input Pattern(up to 19+2000 chars): I'm living in an age that calls darkness light
Railgun_6pp_hits/Railgun_6pp_clocks: 0/72
Railgun_6pp performance: 2806KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/67
Railgun_Quadruplet_7 performance: 3015KB/clock
Input Pattern(up to 19+2000 chars): Following the wanderings of a dream -
a dream that keeps my soul alive
Railgun_6pp_hits/Railgun_6pp_clocks: 0/69
Railgun_6pp performance: 2928KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/63
Railgun_Quadruplet_7 performance: 3207KB/clock
Input Pattern(up to 19+2000 chars): I notice what matters and
I got nothing to lose but darkness and shadows
Railgun_6pp_hits/Railgun_6pp_clocks: 0/66
Railgun_6pp performance: 3061KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/63
Railgun_Quadruplet_7 performance: 3207KB/clock
Input Pattern(up to 19+2000 chars): Beth Ditto
Railgun_6pp_hits/Railgun_6pp_clocks: 0/118
Railgun_6pp performance: 1712KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/107
Railgun_Quadruplet_7 performance: 1888KB/clock
Input Pattern(up to 19+2000 chars): SR71
Railgun_6pp_hits/Railgun_6pp_clocks: 0/175
Railgun_6pp performance: 1154KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/182
Railgun_Quadruplet_7 performance: 1110KB/clock
Input Pattern(up to 19+2000 chars): Apache
Railgun_6pp_hits/Railgun_6pp_clocks: 1/162
Railgun_6pp performance: 1247KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 1/133
Railgun_Quadruplet_7 performance: 1519KB/clock
Input Pattern(up to 19+2000 chars): Fibonacci
Railgun_6pp_hits/Railgun_6pp_clocks: 0/106
Railgun_6pp performance: 1906KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/98
Railgun_Quadruplet_7 performance: 2061KB/clock
Input Pattern(up to 19+2000 chars): Tesla
Railgun_6pp_hits/Railgun_6pp_clocks: 0/175
Railgun_6pp performance: 1154KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/160
Railgun_Quadruplet_7 performance: 1262KB/clock
Input Pattern(up to 19+2000 chars): Albert
Railgun_6pp_hits/Railgun_6pp_clocks: 932/154
Railgun_6pp performance: 1312KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 932/137
Railgun_Quadruplet_7 performance: 1474KB/clock
Input Pattern(up to 19+2000 chars): Toshiba
Railgun_6pp_hits/Railgun_6pp_clocks: 0/130
Railgun_6pp performance: 1554KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/121
Railgun_Quadruplet_7 performance: 1669KB/clock
Input Pattern(up to 19+2000 chars): Orion
Railgun_6pp_hits/Railgun_6pp_clocks: 1/175
Railgun_6pp performance: 1154KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 1/160
Railgun_Quadruplet_7 performance: 1262KB/clock
Input Pattern(up to 19+2000 chars): pharaoh
Railgun_6pp_hits/Railgun_6pp_clocks: 26/129
Railgun_6pp performance: 1566KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 26/122
Railgun_Quadruplet_7 performance: 1656KB/clock
Input Pattern(up to 19+2000 chars): lemon
Railgun_6pp_hits/Railgun_6pp_clocks: 33/178
Railgun_6pp performance: 1135KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 33/161
Railgun_Quadruplet_7 performance: 1255KB/clock
Input Pattern(up to 19+2000 chars): grammar
Railgun_6pp_hits/Railgun_6pp_clocks: 218/123
Railgun_6pp performance: 1642KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 218/117
Railgun_Quadruplet_7 performance: 1727KB/clock
D:\_KAZE_new-stuff>
Grmbl, pattern 'SR71' compared to 'fast' gives me a hint how things are far from 'fastest'.
Points of Interest
The interesting part about tests I have done: 'memcmp
' and 'memcpy
' when inlined (properly) give much better results. The thing which pop-ups again-and-again: Never take anything for granted. A bug (which doesn't affect the previous results, correct) was crushed, now r.6+++ is ready to go/hit.
As you can see from the C (commentless and tabulated) and ASM sources (see the top of the page for downloading):
The Microsoft 16.00.30319.01 compiled Railgun_Quadruplet_6ppp
is 00c6b-00a00+5 = 624 bytes long.
The Quadruplet (where cbTarget<961) fragment loop is 00bc3-00b17+5 = 177 bytes long.
The Boyer-Moore-Horspool (where cbTarget>=961) fragment loop is 00c61-00c20+2 = 67 bytes long.
An interesting question for more experienced coders:
In case of reducing the BMH loop down by 3 bytes (i.e. to 64 bytes), can we obtain some additional/significant boost?
History
- Revision 6+++ written 2011-Sep-07
- Revision 7 written 2011-Sep-25
- Revision 7 'Sun' written 2012-Jan-22
- Revision 7 'Elsiane' written 2012-Jan-24
- Revision 7 'Gulliver' written 2012-Jan-26