Click here to Skip to main content
Click here to Skip to main content

The ReplaceMany Method

, 9 Jan 2012 CPOL
Rate this:
Please Sign up or sign in to vote.
Performing many replacements in a string at one time.
This is an old version of the currently published article.

performance1.gif

Introduction

This article introduces a String.ReplaceMany extension method, which performs multiple string replacements at one time. A single ReplaceMany method invocation is equivalent to a sequence of calls to the String.Replace[^]. The following snippet:

string str = "abab cdecde ab";
string result = str.ReplaceMany(new [] {"ab", "cde"}, new [] {"x", "y" });

will give the same result as

result = str.Replace("ab", "x").Replace("cde", "y");

Performance is the main difference between those two methods. In the best (and predictable) scenarios, the ReplaceMany is 10 times faster than the .NET Framework Replace. In the worst-case scenarios, the ReplaceMany method performs slower for a small number of replace strings, however it catches up when there are more of them. Generally, in most scenarios the introduced method's performance is the same or better.

The project targets .NET 4.0, but it will compile under .NET 2.0, after removing the this keyword from the method's signature.

Background

The idea came up while developing a code which took a template text, which contained markers in form @variable. These markers was then replaced with actual content, using the String.Replace method. I have realized that performance had been dropping as the project was developed and I was looking for ways to optimize the code. One thing which has caught my eye was a sequence of Replace method invocations.

Using the code

A full method signature is

string ReplaceMany(this string str, string[] oldValues, string[] newValues, 
    int resultStringLength = 0)

It is an extenstion method of a String class, which takes two obligatory parameters and one optional.

  • oldValues - an array of strings which should be searched for in the processed string,
  • newValues - an array of replacement strings,
  • resultStringLength - a tip for the method saying how big the result string will probably be. This parameter should be overestimated. The idea is to prevent a buffer reallocation and therefore improve performace by removing an unnecessary copying. If the result string length is not easily predictible, then this parameter should be omitted.

That would be all about the usage of the method. Now let's dig into an implementation.

Implementation

The implementation is a bit tricky and contains a plenty of optimizations, including unsafe code blocks and calls to native memory management functions. The first version used fixed blocks[^] inside a loop, but it caused an unexpected performance drop. In the current version, strings are pinned at the beginning and freed at the method's exit point. By "pinning", I mean telling GC (a garbage collector, the .NET Framework memory management engine) not to move an object around in a memory while operating on it. It is necessary when using pointers (unsafe code) and can be accoplished by using fixed statements or a GCHandle structure[^]. For comparing characters, the memcmp function[^] is used. Memory copying is performed by the memcpy[^].

public static unsafe string ReplaceMany(this string str, 
    string[] oldValues, string[] newValues, int resultStringLength)
{
    int oldCount = oldValues.Length;
    int newCount = newValues.Length;
    if (oldCount != newCount)
        throw new ArgumentException("Each old value must match exactly one new value");
    for (int i = 0; i < oldCount; i++)
    {
        if (string.IsNullOrEmpty(oldValues[i]))
            throw new ArgumentException("Old value may not be null or empty.", "oldValues");
        if (newValues[i] == null)
            throw new ArgumentException("New value may be null.", "newValues");
    }
    // cheating for a small number of replacement strings
    if (oldCount == 1)
        return str.Replace(oldValues[0], newValues[0]);
    if (oldCount == 2)
        return str.Replace(oldValues[0], newValues[0]).Replace(oldValues[1], newValues[1]);
    int strLen = str.Length;
    int buildSpace = resultStringLength == 0 ? strLen << 1 : resultStringLength;
    // A hand-made StringBuilder is here
    char[] buildArr = new char[buildSpace];        
    // cached pinned pointers
    GCHandle buildHandle = GCHandle.Alloc(buildArr, GCHandleType.Pinned);
    GCHandle[] oldHandles = new GCHandle[oldCount];
    GCHandle[] newHandles = new GCHandle[newCount];
    int* newLens = stackalloc int[newCount];
    int* oldLens = stackalloc int[newCount];
    char** oldPtrs = stackalloc char*[newCount];
    char** newPtrs = stackalloc char*[newCount];
    // other caches
    for (int i = 0; i < oldCount; i++)
    {
        oldHandles[i] = GCHandle.Alloc(oldValues[i], GCHandleType.Pinned);
        newHandles[i] = GCHandle.Alloc(newValues[i], GCHandleType.Pinned);
        oldPtrs[i] = (char*)oldHandles[i].AddrOfPinnedObject();
        newPtrs[i] = (char*)newHandles[i].AddrOfPinnedObject();
        newLens[i] = newValues[i].Length;
        oldLens[i] = oldValues[i].Length;
    }
    int buildIndex = 0;
    fixed (char* _strFix = str)
    {
        char* build = (char*)buildHandle.AddrOfPinnedObject();
        char* pBuild = build;
        char* pStr = _strFix;
        char* endStr = pStr + strLen;
        char* copyStartPos = pStr;
        while (pStr != endStr)
        {
            bool find = false;
            for (int i = 0; i < oldCount; ++i)
            {
                int oldValLen = *(oldLens+i);
                // if the string to find does not exceed the original string
                if (oldValLen > 0 && pStr + oldValLen <= endStr)
                {
                    char* _oldFix = *(oldPtrs + i);
                    if (*pStr == *_oldFix) // check the first char
                    {
                        // compare the rest. First, compare the second character
                        find = oldValLen == 1;
                        if (!find)
                        {
                            if (*(pStr + 1) == *(_oldFix + 1))
                                find = oldValLen == 2
                                // use native memcmp function.
                                || 0 == memcmp((byte*)(pStr + 2), (byte*)(_oldFix + 2), (oldValLen - 2) << 1);
                        }
                                
                        if (find)
                        {
                            int newValLen = newLens[i];
                            char* newFix = newPtrs[i];
                            int copyLen = (int)(pStr - copyStartPos);
                            // allocate new space if needed.
                            if (buildIndex + newValLen + copyLen > buildSpace)
                            {
                                buildHandle.Free();
                                int oldSpace = buildSpace;
                                buildSpace = Math.Max((int)(buildIndex + newValLen + copyLen), buildSpace << 1);
                                buildArr = ExpandArray(buildArr, oldSpace, buildSpace);
                                buildHandle = GCHandle.Alloc(buildArr, GCHandleType.Pinned);
                                build = (char*)buildHandle.AddrOfPinnedObject();
                                pBuild = build + buildIndex;
                            }
                            // if there is a part from the original string to copy, then do it.
                            if (copyLen > 0)
                            {
                                memcpy((byte*)(pBuild), (byte*)copyStartPos, copyLen << 1);
                                buildIndex += copyLen;
                                pBuild = build + buildIndex;
                            }
                            // append the replacement to the builder
                            memcpy((byte*)(pBuild), (byte*)newFix, newValLen << 1);
                            pBuild += newValLen;
                            buildIndex += newValLen;
                            pStr += oldValLen;
                            copyStartPos = pStr;
                        }
                    }
                }
            }
            // if not found, just increment the pointer within the main string
            if (!find)
                pStr++;
        }
        // if there is a part from the original string to copy, then do it.
        if (copyStartPos != pStr)
        {
            int copyLen = (int)(pStr - copyStartPos);
            // again, allocate new space if needed
            if (buildIndex + copyLen > buildSpace)
            {
                buildHandle.Free();
                int oldSpace = buildSpace;
                buildSpace = Math.Max((int)(buildIndex + copyLen), buildSpace << 1);
                buildArr = ExpandArray(buildArr, oldSpace, buildSpace);
                buildHandle = GCHandle.Alloc(buildArr, GCHandleType.Pinned);
                build = (char*)buildHandle.AddrOfPinnedObject();
                pBuild = build + buildIndex;
            }
            // append the ending
            memcpy((byte*)(pBuild), (byte*)copyStartPos, copyLen << 1);
            buildIndex += copyLen;
        }
    }
    // unpin string handles
    for (int i = 0; i < newCount; i++)
    {
        oldHandles[i].Free();
        newHandles[i].Free();
    }
    buildHandle.Free();
    return new string(buildArr, 0, buildIndex);
}

The ExpandArray method is used to allocate extra space for a buffer.

public static unsafe char[] ExpandArray(char[] array, int oldSize, int newSize)
{
    if (oldSize > newSize || oldSize > array.Length)
        throw new ArgumentOutOfRangeException();
    char[] bigger;
    bigger = new char[newSize];
    fixed (char* bpt = bigger, apt = array)
        memcpy((byte*)bpt, (byte*)apt, oldSize<<1);
    return bigger;
}

The C++ functions exposed by msvcrt.dll are called using P/Invoke[^].

[DllImport("msvcrt.dll", CallingConvention = CallingConvention.Cdecl)]
private static extern unsafe int memcmp(byte* b1, byte* b2, int count);
[DllImport("msvcrt.dll", CallingConvention = CallingConvention.Cdecl)]
private static extern unsafe int memcpy(byte* dest, byte* src, int count);

Performance tests

I have done a lot of testing to ensure that I haven't implemented a method which is actually slower that sequential calling String.Replace. Here are some interesing results.

The ReplaceMany method performs best when small strings are replaced with big strings. In a test environment, I'd been using 10-character long strings as "old values" and 100-character strings as "new values". The variable parameter was a number of replace strings, aka a number of String.Replace invocations. The results were as following:

Replacement count, ReplaceMany, .NET Replace
1	0,31	0,69
2	2,20	2,77
3	4,33	6,44
4	5,30	11,87
5	6,65	17,95
6	7,53	26,05
7	8,87	38,90
8	10,03	49,83
9	11,39	59,98
10	12,14	72,69
20	24,57	289,05
30	39,17	638,13
40	62,23	1 135,73
50	79,46	1 769,71
60	105,82	2 562,58
70	128,08	3 493,79
80	148,00	4 540,07
90	176,70	5 752,98
100	203,57	7 105,83

All results are in miliseconds and multiplied by 1000, for clarity. The ReplaceMany method starts to win when there are 3 replacement strings (the first two lines are result of cheating in the implementation). With 100 strings, there is almost a 10x performance increase compared to a standard .NET code.

The testing code is very striaghtforward. A code which generates problem instances:

old = new string[repcount];
newv = new string[repcount];
build.Clear();
o = new StringBuilder(101);
n = new StringBuilder(101);
for (int i = 0; i < repcount; i++)
{
    for (int j = 0; j < oldLength; j++)
    {
        o.Append((char)r.Next(48, 70));
    }
    for (int j = 0; j < newLength; j++)
    {
        n.Append((char)r.Next(48, 70));
    }
    old[i] = o.ToString();
    newv[i] = n.ToString();
    for (int j = 0; j < extra; j++)
    {
        build.Append((char)r.Next(48, 70));
    }
    build.Append(o.ToString()).Append(" ").Append(o.ToString());
    o.Clear();
    n.Clear();
}
str = build.ToString();

Actual testing:

// (create test instance)
stopwatch.Reset();
stopwatch.Start();
for (int i = 0; i < 1000; i++)
    str.ReplaceMany(old, newv);
stopwatch.Stop();
TimeSpan me = stopwatch.Elapsed;
Console.Write(" {0}", me.TotalMilliseconds);

stopwatch.Reset();
stopwatch.Start();
for (int i = 0; i < 1000; i++)
{
    for (int j = 0; j < repcount; j++)
    {
        str = str.Replace(old[j], newv[j]);
    }
}
stopwatch.Stop();
TimeSpan their = stopwatch.Elapsed;

Some other results:

In this example, replacement strings had the same length as the old ones.

Old strings' lengths: 10
Replacement strings' lengths: 10
Extra characters added to test string: 0

Replacement count,	ReplaceMany,	.NET Replace
1	0,18	0,09
2	0,58	0,30
3	2,67	0,78
4	4,43	1,58
5	4,78	1,86
6	5,20	2,79
7	6,27	3,94
8	7,00	5,13
9	7,81	7,29
10	9,38	7,92
20	19,32	30,88   <<< ReplaceMany starts to win
30	31,37	68,43
40	48,19	119,84
50	66,90	187,91
60	83,35	266,79
70	105,66	378,26
80	131,70	498,32
90	156,60	614,43
100	177,36	759,50
150	333,09	1 721,25
200	510,63	3 002,31
250	776,97	4 727,57
300	1 049,78	6 716,75

Here the Replace was figuratively dead.

Old strings' lengths: 10		
Replacement strings' lengths: 1000		
		
Extra characters added to test string: 100		
Replacement count	ReplaceMany	.NET Replace
1	7,25	7,51
2	10,50	30,85
3	20,28	70,18
4	30,21	124,02
5	40,42	192,79
6	52,00	276,55
7	69,28	365,53
8	79,49	478,08
9	97,92	605,11
10	114,78	748,88
20	386,18	3 015,35
30	817,37	6 701,57
40	1 294,36	12 204,29
50	2 099,41	18 854,68
60	2 662,90	27 963,31
70	3 574,89	37 529,65
80	4 803,84	49 003,12
90	5 677,84	61 244,01
100	7 182,78	76 841,37

This is the worst-case scenario, and the .NET code wins. Extra characters added to a test string ensure that there are no buffer reallocations. It seems that the .NET's Replace uses a slow algorithm for that. A difference decreases as a number of replacement strings increases but still the ReplaceMany method is slower.

Old strings' lengths: 10		
Replacement strings' lengths: 10		
		
Extra characters added to test string: 100		
Replacement count	ReplaceMany	.NET Replace
1	0,88	0,47
2	3,29	1,81
3	10,19	4,97
4	17,20	7,27
5	21,53	10,65
6	31,20	16,17
7	38,30	22,71
8	57,25	31,58
9	54,21	34,68
10	84,23	47,56
20	256,46	189,46
30	568,25	409,81
40	913,33	693,08
50	1 363,18	1 084,12
60	1 918,29	1 549,88
70	2 615,43	2 111,76
80	3 320,37	2 755,09
90	4 193,21	3 484,52
100	5 164,74	4 306,08

When replacing 10-character strings with single characters, ReplaceMany looses, too.

Old strings' lengths: 10		
Replacement strings' lengths: 1		
		
Extra characters added to test string: 0		
Replacement count	ReplaceMany	.NET Replace
1	0,16	0,06
2	0,44	0,12
3	3,80	0,17
4	4,51	0,25
5	6,19	0,38
6	5,68	0,79
7	7,00	0,68
8	9,44	0,87
9	10,71	1,49
10	9,58	1,59
20	22,00	5,50
30	34,80	9,15
40	45,64	15,61
50	62,99	27,36
60	88,32	33,80
70	96,93	45,69
80	115,93	60,93
90	140,77	77,02
100	162,57	95,69
150	290,59	200,42
200	453,33	348,31
250	662,53	540,45
300	884,62	783,57
350	1 153,41	1 047,60
400	1 473,32	1 363,51
450	1 810,70	1 782,50
500	2 208,63	2 186,38
550	2 691,38	2 635,10
600	3 367,29	3 149,86
650	3 959,81	3 655,76
700	4 585,45	4 260,98
750	5 176,85	4 854,98
800	6 012,21	5 769,26

In this example, single characters was replaced with two-character strings. The String.Replace crashed every time! It was also very unstable.

Old strings' lengths: 1		
Replacement strings' lengths: 2		
		
Extra characters added to test string: 0		
Replacement count	ReplaceMany	.NET Replace
1	0,21	0,11
2	0,43	0,22
3	3,02	14,07
4	4,51	1,08
5	5,92	193,21
6	7,67	99,14
7	8,88	5,08
8	11,39	7,68
9	12,78	8,48
10	14,98	10,62
20	36,87	579,65
30	64,54	          <<< String.Replace crashes
Unhandled	Exception:	OutOfMemoryException.

Similarily, when 2-character pairs was replaced with 4-character strings:

Old strings' lengths: 2		
Replacement strings' lengths: 4		
		
Extra characters added to test string: 0		
Replacement count	ReplaceMany	.NET Replace
1	0,13	0,08
2	0,34	0,23
3	2,79	0,48
4	3,54	0,85
5	4,59	1,10
6	4,99	1,31
7	6,04	2,00
8	6,90	2,39
9	8,37	2,90
10	8,43	3,68
20	18,10	13,87
30	29,36	230,67 <<<
40	51,07	64,50
50	58,87	100,51
60	81,64	156,22
70	99,94	2 415,98
80	115,88	289,54
90	135,12	1525,4413
100	159,60	24 188,11  << 24 seconds WTF? (reproducable)

Summary

Concluding, the ReplaceMany is best when you have to replace many different short strings with long strings. Filling a script template with data is a good example of a practical application. Moreover, using the ReplaceMany method seems to be safer in some scenarios, when the .NET method either crashes or behaves unstable.

History

  • 04 Jan 2012 -- the first version posted.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Share

About the Author

Jacek Gajek
Software Developer (Junior)
Poland Poland
My name is Jacek Gajek. I have graduated in computer science with a master's degree from Polibuda in Wrocław. I like C# and Monthy Python's sense of humour.

Comments and Discussions


Discussions posted for the Published version of this article. Posting a message here will take you to the publicly available article in order to continue your conversation in public.
 
QuestionNice article PinmemberNchantim26-Mar-14 6:10 
AnswerRe: Nice article PinmemberJacek Gajek26-Mar-14 7:23 
Questionfine article PinmemberCIDev13-Feb-12 11:18 
GeneralMy vote of 5 PinmemberMihai MOGA8-Feb-12 7:38 
GeneralMy vote of 5 PinmemberFernandoUY2-Feb-12 9:04 
Questiongood idea PinmemberDean Oliver25-Jan-12 6:16 
GeneralMy vote of 5 Pinmemberdojohansen24-Jan-12 0:04 
GeneralMy vote of 5 Pinmemberii_noname_ii13-Jan-12 4:47 
SuggestionA different approach to the problem Pinmemberdojohansen13-Jan-12 4:25 
GeneralRe: A different approach to the problem PinmemberJacek Gajek13-Jan-12 6:30 
QuestionMy vote of 3 Pinmemberdojohansen13-Jan-12 3:22 
AnswerRe: My vote of 3 [modified] PinmemberJacek Gajek13-Jan-12 6:23 
AnswerRe: My vote of 3 PinmemberJacek Gajek15-Jan-12 3:53 
AnswerRe: My vote of 3 PinmemberJacek Gajek16-Jan-12 7:35 
GeneralRe: My vote of 3 Pinmemberdojohansen24-Jan-12 0:06 
GeneralMy vote of 5 PinmemberSteppenwolfe10-Jan-12 10:04 
GeneralMy vote of 5 PinmemberHendrion10-Jan-12 8:23 
GeneralMy vote of 1 Pinmembermichaelschuer9-Jan-12 23:43 
GeneralRe: My vote of 1 [modified] PinmemberJacek Gajek10-Jan-12 0:30 
GeneralRe: My vote of 1 Pinmemberdojohansen13-Jan-12 3:31 
GeneralRe: My vote of 1 PinmemberJacek Gajek13-Jan-12 6:32 
AnswerRe: My vote of 1 Pinmvpthatraja18-Jan-12 0:54 
GeneralRe: My vote of 1 PinmemberPaul8910-Jan-12 10:32 
GeneralRe: My vote of 1 PinmemberJacek Gajek16-Jan-12 7:40 
GeneralNice one Pinmvpthatraja9-Jan-12 0:50 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.

| Advertise | Privacy | Terms of Use | Mobile
Web04 | 2.8.1411023.1 | Last Updated 9 Jan 2012
Article Copyright 2012 by Jacek Gajek
Everything else Copyright © CodeProject, 1999-2014
Layout: fixed | fluid