11,639,126 members (68,213 online)

# Efficient Boyer-Moore Search in Unicode Strings

, 30 Jun 2005 81.2K 1.1K 78
 Rate this:
An article on implementing Boyer-Moore algorithm for Unicode strings in C#.

## Introduction

Suppose you have to perform a case-insensitive text search over a Unicode string. You probably know that Boyer-Moore algorithm is the most efficient algorithm for such a task. This article shows how to implement the Boyer-Moore algorithm for Unicode, using C#.

## Background

The efficiency of a search algorithm can be estimated by the number of symbol comparisons it takes to scan the whole text in search for the pattern match. The Boyer-Moore algorithm is efficient because it avoids some unnecessary comparisons and produces longer shifts of the pattern along the text. With Boyer-Moore, it is also easy to perform efficient case-insensitive searches. I will not go into details of the Boyer-Moore algorithm as you can find them elsewhere. It is interesting to note however that there are several implementations of the main idea of the algorithm that differ in complexity. Simple implementations that use one-dimensional shift tables are not very efficient when searching for complex patterns that contain several repeating sub-patterns. The implementation I show here uses a two-dimensional shift table. The best way to explain an algorithm is to explain the data structures it uses, and the best way to explain a data structure is to demonstrate it using a simple example. Following is a simple example of the two-dimensional shift table for Boyer-Moore. Suppose we search a string that consists only of characters a, b, c, d, e (in any number and sequence) for a pattern "abdab" (note that "ab" sequence occurs twice in this pattern). The shift-table for the pattern "abdab" would look like this:

The number of columns in the table is equal to the number of characters in the pattern and the number of rows is equal to the number of characters in the charset. As comparison in Boyer-Moore starts from the last character of the pattern, the table is built from right to left. Each table cell contains values by which the pattern should be shifted given that all the previous pattern characters (the characters to the right of the current character) match corresponding text characters. When we start to compare pattern characters to substring characters from right to left we find the shift value for the pattern in the cell whose column corresponds to the pattern character and whose row corresponds to the substring character.

The shift values are calculated for every character that may occur in the string where we search. You may have noticed that zero shifts occur in the cells for which pattern characters match the charset characters. This means that while the pattern matches the substring, the pattern shouldn't be shifted. If the table was scanned from right to left and in every column we get zero shift for the corresponding substring then we have found the substring that matches the pattern. Other shift values are calculated so as to produce maximum shifts without skipping sequences that might be parts of the matching substring. The shift of five, for example, is the "full" shift of the five-character pattern and the shift of three reflects the fact that the two last characters of the pattern coincide with the two first characters (so if the two last characters of the substring match the pattern while others do not, we shift the pattern by three, and not by five, because these two characters may be the beginning of the matching substring). From a broader perspective, we may view the shift table as a representation of a finite automaton whose states are the values stored in the table.

It is easy to build a shift table with the number of rows equal to the number of charset characters if the charset is single-byte. But for Unicode charset, such a table would be quite large and inefficient. It is easy to solve this problem if we notice that table rows differ from each other only for the characters that actually appear in the pattern. In the table shown above, you can see that the rows for the characters c and e contain the same set of shift values and this set would be valid for any other character that doesn't occur in the pattern. So we might store table rows corresponding to characters found in the pattern in some fast-access structure, say a hash-table, and keep a separate set of shift values for all other characters that might appear in the string.

## The code

Now when we have discussed how the shift table is built and how it works, let's have a look at the code that builds the thing. There is `BMSearcher` class in the demo project that performs the actual search. The class' constructor takes a pattern and builds a shift table for it.

```public class BMSearcher
{
// Shift table for chars present in the pattern
protected Hashtable PatternCharShifts;
// Shifts for all other chars
private int[] OtherCharShifts;
// Length of the search pattern
private int PatternLength;

public BMSearcher(string Pattern)
{
PatternCharShifts = new Hashtable();
// Building shift table
PatternLength = Pattern.Length;
int MaxShift = PatternLength;
// Constructing the table where number
// of columns is equal to PatternLength
// and number of rows is equal to the
// number of distinct chars in the pattern
for (int i = 0; i < PatternLength; i++)
if (!PatternCharShifts.ContainsKey(Pattern[i]))
OtherCharShifts = new int[PatternLength];
// Filling the last column of the
// table with maximum shifts (pattern length)
foreach(DictionaryEntry Row in PatternCharShifts)
((int[])Row.Value)[PatternLength - 1] = MaxShift;
OtherCharShifts[PatternLength - 1] = MaxShift;
// Calculating other shifts (filling each column
// from PatternLength - 2 to 0 (from right to left)
for(int i = PatternLength - 1; i >= 0; i--)
{
// Suffix string contains the characters
// right to the character being processsed
string Suffix = new String(Pattern.ToCharArray(),
i + 1,  PatternLength - i - 1);
// if Pattern begins with Suffix
// the maximum shift is equal to i + 1
if (Pattern.StartsWith(Suffix))
MaxShift = i + 1;
// Store shift for characters not present in the pattern
OtherCharShifts[i] = MaxShift;
// We shorten patter by one char in NewPattern.
string NewPattern = new string(Pattern.ToCharArray(),
0, Pattern.Length -1);
if ((NewPattern.LastIndexOf(Suffix) > 0) || (Suffix.Length == 0))
foreach(DictionaryEntry Row in PatternCharShifts)
{
string NewSuffix  = (char)Row.Key + Suffix;
// Calculate shifts:
//Check if there are other occurences
//of the new suffix in the pattern
// If several occurences exist, we need the rightmost one
int NewSuffixPos = NewPattern.LastIndexOf(NewSuffix);
if (NewSuffixPos >= 0)
((int[])Row.Value)[i] = i - NewSuffixPos;
else
((int[])Row.Value)[i] = MaxShift;
// Storing 0 if characters
// in a row and a columnt are the same
if ((char)Row.Key == Pattern[i])
((int[])Row.Value)[i] = 0;
}
else
foreach(DictionaryEntry Row in PatternCharShifts)
{
// if Suffix doesn't occure in NewPattern
// we simply use previous shift value
((int[])Row.Value)[i] = MaxShift;
if ((char)Row.Key == Pattern[i])
((int[])Row.Value)[i] = 0;
}
}
}```

Constructor stores the table in a `PatternCharShifts` object while shift values for characters not present in the pattern are stored in an `OtherCharShifts` array. `PatternCharShifts` member is declared `protected`. The reason for this will be shown later. There is a `GetTable` method in `BMSearcher` that returns the string representation of the table built by the constructor (the rows in the table are separated by line breaks). This method is used in the attached demo. The code shown above uses `System.Collections.Hashtable` for storing shifts as I did in the first version of the demo. Now I reimplemented the demo with the custom hash table class `BMHashTable` to avoid performance penalty caused by boxing the arguments. It doesn't make any difference to the algorithm itself. Let's now look at the `Search` method that does the actual search. This method takes two arguments: the string where the search should be performed and position in the string where the search should start from. The method returns the index of the first (starting from the `Pos`) matching substring. The index is relative to the beginning of the string. The method returns -1 if no match is found.

```public int Search(string Text, int StartPos)
{
int Pos = StartPos;
while (Pos <= Text.Length - PatternLength)
for(int i = PatternLength - 1; i >= 0; i--)
{
if(PatternCharShifts.ContainsKey((object)Text[Pos + i]))
{
// pattern contains char Text[Pos + i]
int Shift =
((int[])PatternCharShifts[(object)Text[Pos + i]])[i];
if (Shift != 0)
{
Pos += Shift; // shifting
break;
}
else
if (i == 0)
// we came to the leftmost pattern
// character so pattern matches
return Pos;
// return matching substring start index
}
else
{
Pos += OtherCharShifts[i]; // shifting
break;
}
}
// Nothing is found
return -1;
}```

The `BMSearcher` class performs case-sensitive search. Implementing case-insensitive search with Boyer-Moore algorithm is quite easy - all we have to do is modify our shift table. We need to add new rows to the table so that the table would contain rows representing pattern characters in both upper and lower case. For example, if we have a pattern "Abc", we need to build the table as described above and then add three rows for a, B, and C to implement the case-insensitive search. Since the shift values should be the same for the same characters in different cases, we may build a table for case-sensitive search and then simply copy rows for the case-complement characters. If the shift table is modified this way, we can use the same search routine for the case-insensitive search.

This technique is implemented in a class `CIBMSearcher` derived from `BMSearcher`. Here is the constructor for the class:

```public CIBMSearcher(string Pattern, bool CaseSensitive) : base(Pattern)
{
if(!CaseSensitive)
{
Hashtable TmpHT = new Hashtable();
foreach(DictionaryEntry Row in PatternCharShifts)
{
if(!TmpHT.ContainsKey(Row.Key))
char ch = (char)Row.Key;
if (char.IsLower(ch))
ch = char.ToUpper(ch);
else
ch = char.ToLower(ch);
if(!TmpHT.ContainsKey(ch))
}
PatternCharShifts = TmpHT;
}
}```

This constructor calls the base constructor for building case-sensitive Boyer-Moore table and then, if `CaseSensitive` value is `false`, adds new rows to the table.

A list of licenses authors might use can be found here

## Share

 Web Developer Russian Federation
No Biography provided

## You may also be interested in...

 First Prev Next
 Compare to string.IndexOf alsan wong30-Nov-07 13:23 alsan wong 30-Nov-07 13:23
 Re: Compare to string.IndexOf PIEBALDconsult30-Nov-07 13:38 PIEBALDconsult 30-Nov-07 13:38
 Re: Compare to string.IndexOf alsan wong30-Nov-07 21:22 alsan wong 30-Nov-07 21:22
 Re: Compare to string.IndexOf [modified] alsan wong30-Nov-07 21:57 alsan wong 30-Nov-07 21:57
 Re: Compare to string.IndexOf Jonathan Wood8-Feb-09 10:43 Jonathan Wood 8-Feb-09 10:43
 Can anyone verify? Jon Gohr1-Aug-05 3:21 Jon Gohr 1-Aug-05 3:21
 Re: Can anyone verify? Jon Gohr1-Aug-05 3:37 Jon Gohr 1-Aug-05 3:37
 Re: Can anyone verify? leseul11-Aug-05 12:29 leseul 11-Aug-05 12:29
 Accent Insensitivity Fábio Batista9-Jul-05 9:59 Fábio Batista 9-Jul-05 9:59
 Re: Accent Insensitivity Anonymous9-Jul-05 10:28 Anonymous 9-Jul-05 10:28
 Boxing Jeffrey Sax24-Jun-05 2:24 Jeffrey Sax 24-Jun-05 2:24
 Re: Boxing leseul24-Jun-05 20:52 leseul 24-Jun-05 20:52
 Re: Boxing leseul30-Jun-05 22:43 leseul 30-Jun-05 22:43
 performance Unruled Boy13-Jun-05 20:50 Unruled Boy 13-Jun-05 20:50
 Re: performance leseul14-Jun-05 1:22 leseul 14-Jun-05 1:22
 Re: performance Unruled Boy4-Jul-05 5:13 Unruled Boy 4-Jul-05 5:13
 Re: performance Unruled Boy4-Jul-05 16:43 Unruled Boy 4-Jul-05 16:43
 Re: performance leseul4-Jul-05 23:21 leseul 4-Jul-05 23:21
 Last Visit: 31-Dec-99 18:00     Last Update: 31-Jul-15 1:52 Refresh 1