Click here to Skip to main content
15,867,765 members
Articles / Programming Languages / C#
Article

Boyer-Moore and related exact string matching algorithms

Rate me:
Please Sign up or sign in to vote.
4.44/5 (20 votes)
20 Jan 2006CPOL3 min read 108.3K   4K   57   19
Implementations of Boyer-Moore and several related string matching algorithms in C# 2.0.

Introduction

This article presents a straightforward implementation of the Boyer-Moore string matching algorithm and a number of related algorithms.

Background

When searching for one string within another, .NET programmers normally reach for String.IndexOf. For the vast majority of applications, that's just fine. However, when searching very long strings, there are other algorithms that deliver more performance than the basic algorithm that .NET provides. This article gives simple implementations of several of those algorithms.

It's worth noting that String.IndexOf performs a much more complex comparison than these algorithms provide, since String.IndexOf performs a culture-sensitive comparison - this is a highly non-trivial matter when a case insensitive match is desired! If you need culture sensitive matching, stick to String.IndexOf.

What this article doesn't cover

This article does not attempt to teach you how to write high performance string matching algorithms, nor does it try to explain the algorithms that are presented. If you're looking for that kind of information, consult an algorithms book such as Knuth or Cormen, et al. An excellent online reference for string matching algorithms is the Handbook of Exact String Matching Algorithms.

Using the code

The BoyerMooreSearch class provides implementations of the original Boyer-Moore algorithm plus a number of related algorithms. An implementation that uses String.IndexOf is also provided for comparison.

The text to be found is passed to the BoyerMooreSearch constructor. Each searching algorithm is presented using a common IEnumerable<int> model.

C#
namespace StringMatch
{
  public class BoyerMoore
  {
    /// <summary>
    /// Constructor
    /// </summary>
    /// <param name="pattern">Pattern for search</param>
    public BoyerMoore(string pattern) { ... }

    /// <summary>
    /// Return all matched of the pattern
    /// in the specified text using algorithm name
    /// </summary>
    /// <param name="text">text to be searched</param>
    /// <param name="startingIndex">Index at 
    ///       which search begins</param>
    /// <returns>IEnumerable which returns
    /// the indexes of pattern matches</returns>
    public IEnumerable<int> AlgorithmNameMatch(string text, 
                                 int startingIndex) { ... }

    /// <summary>
    /// Return all matched of the pattern in the
    /// specified text using algorithm name
    /// </summary>
    /// <param name="text">text to be searched</param>
    /// <returns>IEnumerable which returns the indexes of pattern matches</returns>
    public IEnumerable<int> AlgorithmNameMatch(string text) { ... }
  }
}

Each algorithm returns an IEnumerable<int> that yields all of the indexes where the pattern (passed to the constructor) is found in the text (passed to the algorithm). This design allows the pre-computation that all of the Boyer-Moore related algorithms make use of to be shared across any number of searches for a specific pattern.

C#
BoyerMoore bm = new BoyerMoore("abcd");
foreach (int index in bm.BoyerMooreMatch("abcdsflkjwertabc" + 
         "dfglkjdfgcdjweertopabjsdvacbwer" + 
         "kjhsdfkljhacbsdflkjsdflkjabcd"))
  Console.WriteLine("Matched at index {0}", index);

This example will list matches at 0, 13, and 72.

The Algorithms

The BoyerMoore class implements the following algorithms:

  • BCLMatch - search using String.IndexOf.
  • HorspoolMatch - search using the Horspool algorithm. This is functionally identical to the "Quick Search" algorithm and is presented in some algorithms books as being the Boyer-Moore algorithm. Due to its simplicity, this is often the highest performing of the Boyer-Moore-like algorithms even though others have better theoretical performance.
  • BoyerMooreMatch - search using the Boyer-Moore algorithm.
  • TurboBoyerMooreMatch - search using the Turbo Boyer-Moore algorithm.
  • ApostolicoGiancarloMatch - search using the Apostolico-Giancarlo algorithm. Although this algorithm has the best theoretical performance (when considering only character comparisons, as is traditional when evaluating matching algorithms), it is often the slowest in practice due to the complexity of the code.

The demo program

The supplied demo program is a simple console application that searches a specified file for all occurrences of a string, reporting the time taken by each algorithm. Running the demo application on my machine (3Ghz P4) searching a large text file for a frequently occurring string, yielded the following times:

Finding all occurrences of pattern 
         in path (7324800 characters in length)
String.indexOf found 87473 matches in 0.0579839 seconds
Horspool found 21832 matches in 0.0180492 seconds
Boyer-Moore found 87473 matches in 0.0196312 seconds
Turbo Boyer-Moore found 87473 matches in 0.0282827 seconds
Apostolico-GiancarloMatch found 87473 matches in 0.0563225 seconds

I my experience, the ratios shown here are typical, but your mileage will vary depending on what you're looking for and how many partial matches there are in the searched text.

History

  • 17-Jan-2006 - Initial version.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
United States United States
This member has not yet provided a Biography. Assume it's interesting and varied, and probably something to do with programming.

Comments and Discussions

 
SuggestionChange "chunkIndex" datatype to long. Pin
2xman4-Mar-17 16:48
2xman4-Mar-17 16:48 
GeneralMy vote of 5 Pin
Armando de la Torre4-Sep-11 9:59
Armando de la Torre4-Sep-11 9:59 
GeneralNot able to cope with Unicode? Pin
rivermsfly11-Jun-11 22:37
rivermsfly11-Jun-11 22:37 
GeneralRe: Not able to cope with Unicode? Pin
Carl Daniel12-Jun-11 4:53
Carl Daniel12-Jun-11 4:53 
GeneralWarning about the case-sensitiveness Pin
Fernando Nájera3-Dec-10 0:08
Fernando Nájera3-Dec-10 0:08 
GeneralRe: Warning about the case-sensitiveness Pin
Carl Daniel3-Dec-10 4:48
Carl Daniel3-Dec-10 4:48 
GeneralActual Perf Data (and Horspool bug) Pin
mpgflx22-Jun-10 6:27
mpgflx22-Jun-10 6:27 
GeneralRe: Actual Perf Data (and Horspool bug) Pin
Fernando Nájera2-Dec-10 23:57
Fernando Nájera2-Dec-10 23:57 
QuestionAbout the License of BoyerMooreSearch? Pin
neesha_anj17-Jun-10 11:16
neesha_anj17-Jun-10 11:16 
AnswerRe: About the License of BoyerMooreSearch? Pin
Carl Daniel17-Jun-10 11:31
Carl Daniel17-Jun-10 11:31 
GeneralRe: About the License of BoyerMooreSearch? Pin
neesha_anj17-Jun-10 11:46
neesha_anj17-Jun-10 11:46 
QuestionIndexOf performance improvement? Pin
summersgill30-Nov-09 4:00
summersgill30-Nov-09 4:00 
AnswerRe: IndexOf performance improvement? Pin
Carl Daniel17-Jun-10 11:36
Carl Daniel17-Jun-10 11:36 
GeneralThis code doesn't work at all Pin
egr26-Mar-09 8:50
egr26-Mar-09 8:50 
GeneralRe: This code doesn't work at all Pin
Carl Daniel17-Jun-10 11:34
Carl Daniel17-Jun-10 11:34 
GeneralBug in horspool Pin
Messenjah8-Sep-07 3:07
Messenjah8-Sep-07 3:07 
GeneralRe: Bug in horspool Pin
Fernando Nájera2-Dec-10 23:57
Fernando Nájera2-Dec-10 23:57 
QuestionBug: Non - ASCII UNICODE character cause misbehave Pin
cemmerven17-Apr-07 11:30
cemmerven17-Apr-07 11:30 
AnswerRe: Bug: Non - ASCII UNICODE character cause misbehave Pin
Carl Daniel17-Jun-10 11:33
Carl Daniel17-Jun-10 11:33 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.