Click here to Skip to main content
Click here to Skip to main content

Using Morris Pratt String Searching In Byte Arrays

, 17 Feb 2014
Rate this:
Please Sign up or sign in to vote.
A short discussion of string searching in raw data.

Introduction

This is a tip about how I use Morris Pratt string searching algorithm in byte arrays to get the same efficient result. I'm assuming that you've understood the basic concept of this algorithm so it won't be repeated in this topic. Also, you should already know extension methods before you get started with this topic.

Background

In 2013, I built an extension method of Morris Pratt searching algorithm to search specific string in text file. But later, I had an idea and tried to apply it on every enumerable-related type if each element in it is equatable. So I expanded my method and then get the same efficient result if I search it in byte array, which means I can directly search string in raw text file and don't have to use ASCII or Unicode to decode it at first. The program became more efficient in text searching.

Using the Code

It's easy to use these extensions. First of all, let's take a look of their prototypes:

        public static int MorrisPrattSearchFirst<T>(this T[] t, T[] p)
            where T : IEquatable<T> { }

        public static int MorrisPrattSearchFirst<T>(this T[] t, T[] p, int startIndex)
            where T : IEquatable<T> { } 

where t is the parent array and p is the specified array for searching. The second method allows users to search p in t starting at non-zero index. Constraint on type parameter T makes sure that each element in both arrays is equatable. Returned value represents the index of first matched result (-1 if not found). Both of them are defined under the CollectionAssistant.IEnumerableExtentions class.

Here's a complete example of using them. First, we search key word "Ishmael" in character arrays from the text file MobyDick.txt. And next search it in ASCII-encoded byte arrays. Then compare the efficiency.

using System;
using System.Diagnostics;
using System.IO;
using System.Text;

namespace CollectionAssistant.Demo
{
    class Program
    {
        static void Main(string[] args)
        {
            // This is a example which demonstrates the differences 
            // between searching text in characters and in raw byte-array.
            string keyWord = "Ishmael";

            // Searching key word in character array. 
            char[] originalCharArray = File.ReadAllText("MobyDick.txt").ToCharArray();
            char[] keyWordCharArray = keyWord.ToCharArray();

            int index = -1;
            int pos = 0;

            Console.WriteLine("Start searching key word in character array.");

            Stopwatch watch = Stopwatch.StartNew();

            do
            {
                // Using the extension method.
                index = originalCharArray.MorrisPrattSearchFirst(keyWordCharArray, pos);

                if (index >= 0)
                {
                    Console.WriteLine("Key word \"{0}\" found. Index: {1}", keyWord, index);
                    pos += (index + keyWordCharArray.Length);
                }
            } while (index >= 0 && pos < originalCharArray.Length);
             
            watch.Stop();

            Console.WriteLine("Elapsed ticks: {0}", watch.ElapsedTicks); 
            Console.ReadKey(true);

            pos = 0; 
             
            Console.WriteLine("Start searching key word in byte array.");

            // Searching key word in byte array.
            byte[] originalByteArray = File.ReadAllBytes("MobyDick.txt");
            byte[] keyWordByteArray = Encoding.Default.GetBytes(keyWord);

            watch.Restart();

            do
            {
                // Using the extension method.
                index = originalByteArray.MorrisPrattSearchFirst(keyWordByteArray, pos);

                if (index >= 0)
                {
                    Console.WriteLine("Key word \"{0}\" found. Index: {1}", keyWord, index);
                    pos += (index + keyWordByteArray.Length);
                }
            } while (index >= 0 && pos < originalByteArray.Length);

            watch.Stop();

            Console.WriteLine("Elapsed ticks: {0}", watch.ElapsedTicks); 
            Console.ReadKey(true);
        }
    }
}

The result shows that both of them performed almost same efficiently: about 3000 ~ 8000 ticks in this case.

Test result.

Conclusion

Searching text in character arrays and in byte arrays are both efficient while using Morris Pratt algorithm. But sometimes, it's better in byte arrays especially for some situations such as text file or multipart/form-data - which means parent byte array possibly contains both text and non-text data and can't be decoded into character array or string.

You can obtain the complete source code from here.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Robert Vandenberg Huang
Software Developer
Taiwan Taiwan
Software developer, video gamer and drummer(and BTW a huge Jazz Music fan). Been working on software developing since college age. Believing to play songs on drums and to craft systems are the same things: both require the view of art and logic.
Follow on   Twitter   LinkedIn

Comments and Discussions

 
-- There are no messages in this forum --
| Advertise | Privacy | Mobile
Web02 | 2.8.140721.1 | Last Updated 18 Feb 2014
Article Copyright 2014 by Robert Vandenberg Huang
Everything else Copyright © CodeProject, 1999-2014
Terms of Service
Layout: fixed | fluid