Optimizing string operations in C#

Kubajzz

4.78/5 (20 votes)

11 Jan 2010CPOL9 min read

93.1K

622

Should you use Regular Expressions, built-in .NET methods, or your own solution written from scratch?

Download performance test - 4.51 KB

Introduction

A programmer usually wants to write easily readable, maintainable, and extensible code. However, in some cases, performance becomes the most important thing. This article provides several useful tips on boosting the performance of common string operations without using unsafe code.

Background

I recently spent some time working on a simple code editor, and one of the key features of the app is syntax highlighting. When implementing a function like this, performance is crucial, and it took a while for me to optimize my code. I learned a lot when working on that project, and I would like to share my experience with all of you.

The project attached to this article contains a simple benchmark application to compare all the methods described below.

Searching for a word in a string

Searching for something inside a string is a very common task. There are several ways to do this, and the following section discusses each option:

Using Regular Expressions

Regular Expressions are a very powerful, useful, and often, very fast tool for data validation and string searching. However, when performance is important, Regular Expressions might become a nightmare, and there are several things you can do to make your code run faster.

Rule #1: Write a good expression

It's easy to write a Regular Expression, but sometimes, it's a challenge to write an efficient Regular Expression. Optimizing Regular Expressions is beyond the scope of this article, and there are many articles on the web and books that cover this topic. The most important rule is: keep it simple. Complex Regular Expressions that include a lot of alternations usually take too long to execute.

Rule #2: Do not use static methods of the Regex class

The Regex class provides several static methods that enable basic operations. The following code finds out whether the string variable input contains a substring given by the variable pattern.

string input = "The quick brown fox jumps over the lazy dog";
string pattern = "fox";
if (Regex.IsMatch(input, pattern)) {
  /* More code here... */
}

The IsMatch method creates a Regex object from the pattern variable, and then tries to match the input string. Performance-wise, this process is slow, and it is only useful if the Regular Expression is not used repeatedly.

Rule #3: Re-use Regex objects if you can

As stated above, the creation of a Regex object takes a while, and you should avoid doing it too often. In some scenarios, you can initialize all necessary Regular Expressions during the startup of your application and then use them multiple times when processing a long text. This will increase the performance significantly.

Rule # 4: Consider using compiled Regular Expressions

The performance may be even better if you use the Compiled option when creating a Regex object.

Regex pattern = new Regex("SomePattern", RegexOptions.Compiled);

There are several disadvantages though - compiled Regular Expressions increase the startup time of your application and cause more memory use. Jeff Atwood wrote a brilliant article that discusses the advantages and the disadvantages of compiled Regular Expressions: http://www.codinghorror.com/blog/archives/000228.html.

Using the IndexOf() method

The IndexOf method of the String type is very useful when searching for a sub-string inside a string. There are a few things you might want to know to use this method efficiently.

Use the char type if you can

There are two basic overloads of the IndexOf method - the first parameter can be a character or a string. The char version is much faster though. If you know the string you are searching for is only one character long, use the char type instead.

An unsuccessful search takes longer than a successful search

This fact is not very surprising - if the IndexOf method succeeds, it returns the position of the first occurrence of the given string or character, and it ignores the rest of the input string. Nonetheless, if the input string does not contain the specified character or string, the IndexOf function has to search the whole input. What's the conclusion?

There is often a way to eliminate unsuccessful searches. For example, if you want to know whether an input string contains any of a set of words, you can first search for the words that are more likely to appear in the input string.

What about writing a text-search method from scratch?

Now, I'm getting to the most interesting part. The IndexOf method is quite fast when searching for a character, but it's slow when searching for a word. What if there is a faster way to search for a word inside a string?

I spent some time looking for the fastest possible solution, and the following method is what I came up with:

static int FastIndexOf(string source, string pattern) {
  if (pattern == null) throw new ArgumentNullException();
  if (pattern.Length == 0) return 0;
  if (pattern.Length == 1) return source.IndexOf(pattern[0]);
 bool found;
  int limit = source.Length - pattern.Length + 1;
  if (limit < 1) return -1;
 // Store the first 2 characters of "pattern"
  char c0 = pattern[0];
  char c1 = pattern[1];
 // Find the first occurrence of the first character
  int first = source.IndexOf(c0, 0, limit);
 while (first != -1) {
   // Check if the following character is the same like
    // the 2nd character of "pattern"
    if (source[first + 1] != c1) {
      first = source.IndexOf(c0, ++first, limit - first);
      continue;
    }
   // Check the rest of "pattern" (starting with the 3rd character)
    found = true;
    for (int j = 2; j < pattern.Length; j++)
      if (source[first + j] != pattern[j]) {
        found = false;
        break;
      }
   // If the whole word was found, return its index, otherwise try again
    if (found) return first;
    first = source.IndexOf(c0, ++first, limit - first);
  }
  return -1;
}

This function is based on the fact that the IndexOf method is very fast when searching for a character.

The basic idea is simple - the method searches for the first character of the word, and once it is found, it checks for the following characters form the rest of the word. The code is a little messy since it includes a bunch of small optimizations. For example, I found out that the average performance is even better if the method checks the second character (after the first one has been found) before starting the "for" loop that goes through the rest of the word.

Note:

The behavior of this function is very similar to the String.IndexOf() function, but in some exceptional cases, the results may vary. If you need to perform culture-sensitive string searches, or you need to work with characters that are not in the Basic Multilingual Plane of the Unicode standard, the method provided above might not return accurate results.

Comparison

I've done a lot of testing on randomly generated strings to compare the three options listed above, and here are the results:

The IndexOf function was always approximately 20% slower than my custom FastIndexOf method when searching for a word.
Regular Expressions are the fastest way to find a word inside a string if both the word and the input string are long enough.

The graph below is based on the values I measured, and it shows which of the two fastest options is better depending on the length of the input strings:

The vertical axis indicates the length of the word being searched for (in characters), and the horizontal axis shows the length of the input string (in characters). The striped area is where a custom search method outperforms Regular Expressions.

Please note that all measured values are approximate.

Replacing characters with escape sequences

Another common task when working with strings is to replace a set of characters with a set of escape sequences. Sometimes the replacement is very easy - you only have to place a backslash (or another character) before every occurrence of an escaped character.

Imagine the following scenario: you are building an RTF file and you want to insert a string into the file. The "{", "}", and "\" characters have a special meaning in RTF and, therefore, must be preceded with a backslash. The question is: what is the fastest way to replace each of the characters with a corresponding escape sequence?

Regular Expressions

Using the Replace method of the Regex class is one of the possible solutions. However, Regular Expressions are by far the slowest option in this case.

The Replace() method of String

Another simple option is to use the Replace method of the string type. When building an RTF file, the replace procedure would be the following:

string input;
 
/* .... */
 
input.Replace("\\", "\\\\").Replace("{", "\\{").Replace("}", "\\}");

It's easily readable, short, and clean, but it's not always the fastest option.

Writing a custom method from scratch

Just like in the previous case, I tried to write a custom method that does the same task. Here is the code:

static string Escape1(string source, char[] escapeChars, char escape) {
  int i = source.IndexOfAny(escapeChars);
  while (i != -1) {
    source = source.Insert(i, escape.ToString());
    i = source.IndexOfAny(escapeChars, i + 2);
  }
  return source.ToString();
}

The first parameter is the input string, the second one is an array of characters that should be escaped, and the last parameter is the escape character. As you probably know, editing a string directly is not very efficient if you need to do many operations in a row. The StringBuilder class provides an efficient way to edit strings although it takes a bit of time to initialize a StringBuilder object, and it's only reasonable to use a StringBuilder if you know that you will need to edit the string repeatedly.

I created another version of the method above using StringBuilder:

static string Escape2(string source, char[] escapeChars, char escape) {
  StringBuilder s = new StringBuilder();
  int j = 0;
  int i = source.IndexOfAny(escapeChars);
  while (i != -1) {
    s.Append(source.Substring(j, i - j));
    s.Append(escape);
    j = i;
    i = source.IndexOfAny(escapeChars, j + 1);
  }
  s.Append(source.Substring(j));
  return s.ToString();
}

Comparison

I used the three methods described above to replace a set of three characters with an escape sequence. The graph below shows which method is the fastest, depending on the input string:

The vertical axis shows the number of escaped characters in a string, while the horizontal axis shows the frequency of escape characters in %. A frequency of 50 means that every other character in the input string needs to be escaped, and a frequency of 1 means that only 1% of the characters in the input string (every 100^th character) is escaped.

The String.Replace method shows to be the fastest one if there are enough characters to be escaped and if the frequency of escaped characters is high enough (the striped area in the top-left corner of the graph).

As the rest of the graph shows, a custom character-escaping method is the fastest option in most cases. As expected, the StringBuilder version of the method is faster if the number of replaced characters is high enough, in this case, approximately 5 characters.

If you want to know a little bit more about String/StringBuilder performance comparison, I recommend the following two resources:

Conclusion

If you want an excellent performance, you have to know what kind of data you are going to work with, and you will have to spend some time comparing all available options, just like I did.

I hope this article pushed you in the right direction and helped you to increase the performance of your code. Any feedback is greatly appreciated!

Sample benchmark application

Here is the source code of the project I used to compare the performance of all the methods listed above. Feel free to download it and use it any way you like:

Download performance test - 4.51 KB

History

Edited: 2010-01-10: Added a note about the differences between the FastIndexOf and IndexOf functions.
Edited: 2010-01-07: The FastIndexOf function has been improved a little bit. Thanks to James Curran for his suggestion.
Posted: 2010-01-06.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)