Click here to Skip to main content
Click here to Skip to main content

Boyer-Moore and related exact string matching algorithms

, 20 Jan 2006
Rate this:
Please Sign up or sign in to vote.
Implementations of Boyer-Moore and several related string matching algorithms in C# 2.0.

Introduction

This article presents a straightforward implementation of the Boyer-Moore string matching algorithm and a number of related algorithms.

Background

When searching for one string within another, .NET programmers normally reach for String.IndexOf. For the vast majority of applications, that's just fine. However, when searching very long strings, there are other algorithms that deliver more performance than the basic algorithm that .NET provides. This article gives simple implementations of several of those algorithms.

It's worth noting that String.IndexOf performs a much more complex comparison than these algorithms provide, since String.IndexOf performs a culture-sensitive comparison - this is a highly non-trivial matter when a case insensitive match is desired! If you need culture sensitive matching, stick to String.IndexOf.

What this article doesn't cover

This article does not attempt to teach you how to write high performance string matching algorithms, nor does it try to explain the algorithms that are presented. If you're looking for that kind of information, consult an algorithms book such as Knuth or Cormen, et al. An excellent online reference for string matching algorithms is the Handbook of Exact String Matching Algorithms.

Using the code

The BoyerMooreSearch class provides implementations of the original Boyer-Moore algorithm plus a number of related algorithms. An implementation that uses String.IndexOf is also provided for comparison.

The text to be found is passed to the BoyerMooreSearch constructor. Each searching algorithm is presented using a common IEnumerable<int> model.

namespace StringMatch
{
  public class BoyerMoore
  {
    /// <summary>
    /// Constructor
    /// </summary>
    /// <param name="pattern">Pattern for search</param>
    public BoyerMoore(string pattern) { ... }

    /// <summary>
    /// Return all matched of the pattern
    /// in the specified text using algorithm name
    /// </summary>
    /// <param name="text">text to be searched</param>
    /// <param name="startingIndex">Index at 
    ///       which search begins</param>
    /// <returns>IEnumerable which returns
    /// the indexes of pattern matches</returns>
    public IEnumerable<int> AlgorithmNameMatch(string text, 
                                 int startingIndex) { ... }

    /// <summary>
    /// Return all matched of the pattern in the
    /// specified text using algorithm name
    /// </summary>
    /// <param name="text">text to be searched</param>
    /// <returns>IEnumerable which returns the indexes of pattern matches</returns>
    public IEnumerable<int> AlgorithmNameMatch(string text) { ... }
  }
}

Each algorithm returns an IEnumerable<int> that yields all of the indexes where the pattern (passed to the constructor) is found in the text (passed to the algorithm). This design allows the pre-computation that all of the Boyer-Moore related algorithms make use of to be shared across any number of searches for a specific pattern.

BoyerMoore bm = new BoyerMoore("abcd");
foreach (int index in bm.BoyerMooreMatch("abcdsflkjwertabc" + 
         "dfglkjdfgcdjweertopabjsdvacbwer" + 
         "kjhsdfkljhacbsdflkjsdflkjabcd"))
  Console.WriteLine("Matched at index {0}", index);

This example will list matches at 0, 13, and 72.

The Algorithms

The BoyerMoore class implements the following algorithms:

  • BCLMatch - search using String.IndexOf.
  • HorspoolMatch - search using the Horspool algorithm. This is functionally identical to the "Quick Search" algorithm and is presented in some algorithms books as being the Boyer-Moore algorithm. Due to its simplicity, this is often the highest performing of the Boyer-Moore-like algorithms even though others have better theoretical performance.
  • BoyerMooreMatch - search using the Boyer-Moore algorithm.
  • TurboBoyerMooreMatch - search using the Turbo Boyer-Moore algorithm.
  • ApostolicoGiancarloMatch - search using the Apostolico-Giancarlo algorithm. Although this algorithm has the best theoretical performance (when considering only character comparisons, as is traditional when evaluating matching algorithms), it is often the slowest in practice due to the complexity of the code.

The demo program

The supplied demo program is a simple console application that searches a specified file for all occurrences of a string, reporting the time taken by each algorithm. Running the demo application on my machine (3Ghz P4) searching a large text file for a frequently occurring string, yielded the following times:

Finding all occurrences of pattern 
         in path (7324800 characters in length)
String.indexOf found 87473 matches in 0.0579839 seconds
Horspool found 21832 matches in 0.0180492 seconds
Boyer-Moore found 87473 matches in 0.0196312 seconds
Turbo Boyer-Moore found 87473 matches in 0.0282827 seconds
Apostolico-GiancarloMatch found 87473 matches in 0.0563225 seconds

I my experience, the ratios shown here are typical, but your mileage will vary depending on what you're looking for and how many partial matches there are in the searched text.

History

  • 17-Jan-2006 - Initial version.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Carl Daniel

United States United States
No Biography provided

Comments and Discussions

 
GeneralMy vote of 5 PinmemberArmando de la Torre4-Sep-11 9:59 
GeneralNot able to cope with Unicode? Pinmemberrivermsfly11-Jun-11 22:37 
I noticed
 
int[] badCharacterShift = new int[256];
 
in method BuildBadCharacterShift
 
Does "256" mean Unicode is not covered by these algorithms?
GeneralRe: Not able to cope with Unicode? PinmemberCarl Daniel12-Jun-11 4:53 
GeneralWarning about the case-sensitiveness PinmemberFernando Nájera3-Dec-10 0:08 
GeneralRe: Warning about the case-sensitiveness PinmemberCarl Daniel3-Dec-10 4:48 
GeneralActual Perf Data (and Horspool bug) Pinmembermpgflx22-Jun-10 6:27 
GeneralRe: Actual Perf Data (and Horspool bug) PinmemberFernando Nájera2-Dec-10 23:57 
QuestionAbout the License of BoyerMooreSearch? Pinmemberneesha_anj17-Jun-10 11:16 
AnswerRe: About the License of BoyerMooreSearch? PinmemberCarl Daniel17-Jun-10 11:31 
GeneralRe: About the License of BoyerMooreSearch? Pinmemberneesha_anj17-Jun-10 11:46 
QuestionIndexOf performance improvement? Pinmembersummersgill30-Nov-09 4:00 
AnswerRe: IndexOf performance improvement? PinmemberCarl Daniel17-Jun-10 11:36 
GeneralThis code doesn't work at all Pinmemberegr26-Mar-09 8:50 
GeneralRe: This code doesn't work at all PinmemberCarl Daniel17-Jun-10 11:34 
GeneralBug in horspool PinmemberMessenjah8-Sep-07 3:07 
GeneralRe: Bug in horspool PinmemberFernando Nájera2-Dec-10 23:57 
QuestionBug: Non - ASCII UNICODE character cause misbehave Pinmembercemmerven17-Apr-07 11:30 
AnswerRe: Bug: Non - ASCII UNICODE character cause misbehave PinmemberCarl Daniel17-Jun-10 11:33 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.

| Advertise | Privacy | Mobile
Web01 | 2.8.140721.1 | Last Updated 20 Jan 2006
Article Copyright 2006 by Carl Daniel
Everything else Copyright © CodeProject, 1999-2014
Terms of Service
Layout: fixed | fluid