Click here to Skip to main content
11,579,093 members (65,708 online)
Click here to Skip to main content

Fuzzy Search

, 2 Jun 2009 CPOL 69.6K 3.4K 115
Rate this:
Please Sign up or sign in to vote.
A simple implementation of the fuzzy string search.

Introduction

This article describes a simple implementation of the fuzzy (string) search. It can be used for approximate string matching (for more information, see http://en.wikipedia.org/wiki/Fuzzy_string_searching).

Other algorithms for approximate string searching exist (e.g., Soundex), but those aren't as easy to implement. The algorithm in this article is easy to implement, and can be used for tasks where approximate string searching is used in an easy way.

A List<string> is used for searching, and therefore it's quite easy to search a database.

Background

The algorithm used the Levenshtein-distance for determining how exact a string from a word list matches the word to be found. Information about the Levenshtein-distance can be found at http://en.wikipedia.org/wiki/Levenshtein_distance.

Using the code

The following example will show how simply the class can be used.

static void Main(string[] args)
{
    string word = "Code Project";
    List<string> wordList = new List<string>
    {
        "Code Project",
        "Code project",
        "codeproject",
        "Code Projekt",
        "Kode Project",
        "Other Project"
    };

    List<string> foundWords = FuzzySearch.Search(
        word,
        wordList,
        0.70);

    foundWords.ForEach(i => Console.WriteLine(i));
    Console.ReadKey();
}

Output:

Code Project
Code project
codeproject
Code Projekt
Kode Project

Implementation

A basic approach is shown. Instead of the Levenshtein-distance, a more optimized algorithm could be used - but here, a quite simple implementation is given for clarity reasons.

Levenshtein-distance

For computing the Levenshtein-distance, I use the following algorithm:

public static int LevenshteinDistance(string src, string dest)
{
    int[,] d = new int[src.Length + 1, dest.Length + 1];
    int i, j, cost;
    char[] str1 = src.ToCharArray();
    char[] str2 = dest.ToCharArray();

    for (i = 0; i <= str1.Length; i++)
    {
        d[i, 0] = i;
    }
    for (j = 0; j <= str2.Length; j++)
    {
        d[0, j] = j;
    }
    for (i = 1; i <= str1.Length; i++)
    {
        for (j = 1; j <= str2.Length; j++)
        {

            if (str1[i - 1] == str2[j - 1])
                cost = 0;
            else
                cost = 1;

            d[i, j] =
                Math.Min(
                    d[i - 1, j] + 1,              // Deletion
                    Math.Min(
                        d[i, j - 1] + 1,          // Insertion
                        d[i - 1, j - 1] + cost)); // Substitution

            if ((i > 1) && (j > 1) && (str1[i - 1] == 
                str2[j - 2]) && (str1[i - 2] == str2[j - 1]))
            {
                d[i, j] = Math.Min(d[i, j], d[i - 2, j - 2] + cost);
            }
        }
    }

    return d[str1.Length, str2.Length];
}

The Searching

In the search process, for each word in the wordlist, the Levenshtein-distance is computed, and with this distance, a score. This score represents how good the strings match. The input argument fuzzyness determines how much the strings can differ.

public static List<string> Search(
    string word,
    List<string> wordList,
    double fuzzyness)
{
    List<string> foundWords = new List<string>();

    foreach (string s in wordList)
    {
        // Calculate the Levenshtein-distance:
        int levenshteinDistance =
            LevenshteinDistance(word, s);

        // Length of the longer string:
        int length = Math.Max(word.Length, s.Length);

        // Calculate the score:
        double score = 1.0 - (double)levenshteinDistance / length;

        // Match?
        if (score > fuzzyness)
            foundWords.Add(s);
    }
    return foundWords;
}

LINQ-variant

This piece of code could be written in LINQ too.

public static List<string> Search(
    string word,
    List<string> wordList,
    double fuzzyness)
{
    // Tests have prove that the !LINQ-variant is about 3 times
    // faster!
    List<string> foundWords =
        (
            from s in wordList
            let levenshteinDistance = LevenshteinDistance(word, s)
            let length = Math.Max(s.Length, word.Length)
            let score = 1.0 - (double)levenshteinDistance / length
            where score > fuzzyness
            select s
        ).ToList();

    return foundWords;
}

History

  • 2009 June 1st: Initial release.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Share

About the Author

Günther M. FOIDL
Software Developer (Senior) Foidl Günther
Austria Austria
Engineer in combustion engine development.
Programming languages: C#, FORTRAN 95, Matlab

FIS-overall worldcup winner in Speedski (Downhill) 2008/09 and 2009/10.

You may also be interested in...

Comments and Discussions

 
Questionqwestion Pin
scheee17-Jan-13 1:42
memberscheee17-Jan-13 1:42 
Hi!
I can't understand why we need a "score" parametr.
I thought that we schould compare the "levenshteinDistance" with number of mistales.
Could you explain the differense?
SuggestionParallelize may help but it is quite limited Pin
Member 95751485-Nov-12 18:29
memberMember 95751485-Nov-12 18:29 
GeneralMy vote of 5 Pin
Petr Abdulin18-Apr-12 19:36
memberPetr Abdulin18-Apr-12 19:36 
GeneralMy vote of 5 Pin
Arlen Navasartian9-Apr-12 8:26
memberArlen Navasartian9-Apr-12 8:26 
GeneralMy vote of 5 Pin
manoj kumar choubey26-Feb-12 21:32
membermanoj kumar choubey26-Feb-12 21:32 
GeneralWorks Great ! Pin
damon8810-Dec-09 19:27
memberdamon8810-Dec-09 19:27 
GeneralRe: Works Great ! Pin
Günther M. FOIDL29-Dec-09 13:33
memberGünther M. FOIDL29-Dec-09 13:33 
GeneralAbout dictionary Pin
Win32nipuh21-Oct-09 21:54
memberWin32nipuh21-Oct-09 21:54 
GeneralRe: About dictionary Pin
Günther M. FOIDL22-Oct-09 10:55
memberGünther M. FOIDL22-Oct-09 10:55 
Generalgreat explanation Pin
Win32nipuh8-Jun-09 20:29
memberWin32nipuh8-Jun-09 20:29 
GeneralRe: great explanation Pin
Günther M. FOIDL9-Jun-09 14:31
memberGünther M. FOIDL9-Jun-09 14:31 
GeneralRe: great explanation Pin
coduresearch2-Sep-09 7:45
membercoduresearch2-Sep-09 7:45 
GeneralRe: great explanation Pin
Günther M. FOIDL10-Sep-09 6:22
memberGünther M. FOIDL10-Sep-09 6:22 
GeneralRe: great explanation Pin
Nicolas Dorier21-Oct-09 8:28
memberNicolas Dorier21-Oct-09 8:28 
GeneralSearching large text Pin
daniel.zolnjan4-Jun-09 5:56
memberdaniel.zolnjan4-Jun-09 5:56 
GeneralRe: Searching large text Pin
Günther M. FOIDL4-Jun-09 9:25
memberGünther M. FOIDL4-Jun-09 9:25 
GeneralI would also have used extension methods to make my code more readable Pin
Moshe Plotkin1-Jun-09 14:06
memberMoshe Plotkin1-Jun-09 14:06 
GeneralRe: I would also have used extension methods to make my code more readable Pin
Günther M. FOIDL1-Jun-09 21:46
memberGünther M. FOIDL1-Jun-09 21:46 
GeneralLinq does not seam to be slower [modified] Pin
Moshe Plotkin1-Jun-09 13:55
memberMoshe Plotkin1-Jun-09 13:55 
GeneralRe: Linq does not seam to be slower Pin
Günther M. FOIDL1-Jun-09 21:44
memberGünther M. FOIDL1-Jun-09 21:44 
GeneralRe: Linq does not seam to be slower Pin
Moshe Plotkin2-Jun-09 2:15
memberMoshe Plotkin2-Jun-09 2:15 
GeneralRe: Linq does not seam to be slower Pin
Günther M. FOIDL2-Jun-09 3:14
memberGünther M. FOIDL2-Jun-09 3:14 
GeneralRe: Linq does not seam to be slower Pin
riskka10-Jun-09 2:28
memberriskka10-Jun-09 2:28 
Questionmore optimized algorithm? Pin
Unruled Boy1-Jun-09 3:11
memberUnruled Boy1-Jun-09 3:11 
AnswerRe: more optimized algorithm? Pin
Günther M. FOIDL1-Jun-09 3:22
memberGünther M. FOIDL1-Jun-09 3:22 
GeneralRe: more optimized algorithm? Pin
gstolarov10-Jun-09 12:18
membergstolarov10-Jun-09 12:18 
SuggestionRe: more optimized algorithm? Pin
Psycho_Coder22-May-14 19:10
professionalPsycho_Coder22-May-14 19:10 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.

| Advertise | Privacy | Terms of Use | Mobile
Web03 | 2.8.150603.1 | Last Updated 2 Jun 2009
Article Copyright 2009 by Günther M. FOIDL
Everything else Copyright © CodeProject, 1999-2015
Layout: fixed | fluid