5,696,576 members and growing! (14,890 online)
Email Password   helpLost your password?
General Programming » String handling » Regular Expressions     Beginner License: The Code Project Open License (CPOL)

A Naive String Comparer

By Pete O'Hanlon

A class to perform a "naive" comparison of two chunks of text to see if they look to be the same.
C# (C# 2.0, C# 3.0, C#), ASP.NET

Posted: 17 Apr 2008
Updated: 6 May 2008
Views: 7,428
Bookmarked: 7 times
Announcements
Loading...



Search    
Advanced Search
Sitemap
15 votes for this Article.
Popularity: 4.90 Rating: 4.17 out of 5
1 vote, 6.7%
1
3 votes, 20.0%
2
2 votes, 13.3%
3
3 votes, 20.0%
4
6 votes, 40.0%
5
Note: This is an unedited contribution. If this article is inappropriate, needs attention or copies someone else's work without reference then please Report This Article

Introduction

A while ago, we wrote an application for a client who were running a competition that included a tie breaker question. The tie breaker had to be unique so there was a need to deduplicate the entries to remove multiple matches (due to a quaint little law where if you awarded a prize to one person based on their tie breaker and somebody else could prove they had entered with the same tie breaker then you had to award them the prize as well; providing they satisfied all other prize winning criteria). The remaining entries would be judged by a panel to see which ones were the best.

As you can imagine, the fact that the tie breakers were input manually made this potentially fraught due to the possibility of data capture miskeys or incorrect spelling by the entrant. To get round this, the client requested a naive comparer which used some simple rules to identify potential duplicate entries. This article is a version of that comparer.

The basics

The first thing you need to do is load in a the string you want to compare against and set up whether or not you want to strip HTML tags from the text when performing comparisons. This is a little extension that I added to the original design so that it could be used to detect whether or not an article here on CP was just a copy of the base template. In the original, HTML tags weren't removed.

The basic algorithm is to remove HTML tags using the following regular expression:

</?\w+((\s+\w+(\s*=\s*(?:"".*?""|'.*?'|[^'"">\s]+))?)+\s*|\s*)/?>

This expression simply grabs all text inside angle brackets, and the application replaces them with an empty string using a Regex.Replace.

Using a similar technique, all vowels, whitespace and special characters are removed from the text. The text can then be compared.

The method to actually remove the "offending" tags is shown here:

/// <summary>
/// Break the relevant text item down into a string that is stripped of all HTML tags, vowels and 
/// whitespace characters.
/// </summary>
/// <param name="source">The item that needs to be "cleansed" using the relevant regular expressions.</param>
/// <returns>The "cleansed" string.</returns>
/// <remarks>This implementation starts off by removing all HTML tags, before removing any vowels and
/// whitespace characters.</remarks>
private string BreakdownText(string source)
{
  if (string.IsNullOrEmpty(source))
    throw new ArgumentNullException("source");
  // First of all, strip the tags out of the source.
  if (_tagType == StripTagType.StripTags)
  {
    source = ReplaceCharacters(source, _tagRemover, string.Empty);
  }
  source = ReplaceCharacters(source, _chars, string.Empty);
  return ReplaceCharacters(source, _vowels, string.Empty).ToLowerInvariant();
}
/// <summary>
/// Method to replace one set of characters with another.
/// </summary>
/// <param name="source">The source text to replace the characters in.</param>
/// <param name="regexText">The regular expression defining the characters to replace.</param>
/// <param name="replaceText"></param>
/// <returns></returns>
private string ReplaceCharacters(string source, string regexText, string replaceText)
{
  Regex regex = new Regex(regexText,
  RegexOptions.IgnoreCase | RegexOptions.Multiline | RegexOptions.IgnorePatternWhitespace);
  return regex.Replace(source, replaceText);
}

The order in which items are removed is important. You want to strip the HTML tags out before you remove the none-alphabetic and vowel characters. By the way, the _vowels and _char regular expressions are defined as [eiouy] and [^b-z] respectively.

If the two sanitized pieces of text you want to compare match, then you have a duplicate. However, as we have said, there's always the possibility of there being a misspelling so the comparer goes further and looks through the two pieces of text to identify characters that don't match. If the number of characters falls outside a particular threshold then we know that the two pieces of text are sufficiently different to count them as unique.

/// <summary>
/// This method calculates the number of unmatched characters in the source string and determines
/// whether or not it falls within the target threshold.
/// </summary>
/// <param name="source">The source to check.</param>
/// <param name="target">The target string to check against.</param>
/// <returns>True if the string matches (or matches to within the tolerance), false otherwise.</returns>
private bool Compare(string source, string target)
{
  int targetLength = target.Length; 
  StringBuilder unmatched = new StringBuilder();
  while (target.Length > 0 && source.Length > 0 && unmatched.Length <= _diffCharacters)
  {
    if (target[0] == source[0])
    {
      target = target.Remove(0, 1);
      source = source.Remove(0, 1);
    }
    else
    {
      // Now, look for the next instance of this letter in source.
      int index = source.IndexOf(target[0]);
      if (index > 0)
      {
        unmatched.Append(source.Substring(0, index));
        source = source.Remove(0, index);
      }
    }
  }
  return unmatched.Length <= _diffCharacters;
}

And that's it - a naive comparer. To run it, all you need to do is this:

static void Main(string[] args)
{
  string s = "<H1>This is a sample.</H1>";
  NaiveComparer c = new NaiveComparer(s, StripTagType.StripTags, 10);
  if (c.Compare("Hello" + s + "aeiou123"))
  {
    Console.WriteLine("It matches");
  }
  if (!c.Compare("Hellot there from this completely arbitrary " + 
    "and totally different way of matching <<<< <<<<<< " + s + 
    ">>>>>>>>>>>>>>-----"))
    Console.WriteLine("No match");
  Console.ReadLine();
}

I hope that this is useful for you, in those occasions where you need to compare to see if items are sufficiently different to allow them through.

Edit History

07 May 08 Fixed bug found by Chris Maunder. Thanks Chris.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Pete O'Hanlon


Mvp
A developer for more years than I care to remember. I live in the North East of England with 2 wonderful daughters and a wonderful wife.
Occupation: CEO
Location: United Kingdom United Kingdom

Other popular String handling articles:

Article Top
Sign Up to vote for this article
You must Sign In to use this message board.
FAQ FAQ Noise ToleranceSearch Search Messages 
 Layout  Per page   
 Msgs 1 to 11 of 11 (Total in Forum: 11) (Refresh)FirstPrevNext
GeneralBug reportadminChris Maunder15:30 6 May '08  
GeneralRe: Bug reportmvpPete O'Hanlon23:53 6 May '08  
QuestionWhy didn't you use Soundex?memberpintel20:22 22 Apr '08  
GeneralRe: Why didn't you use Soundex?mvpPete O'Hanlon23:10 22 Apr '08  
GeneralDumb questionadminChris Maunder2:37 18 Apr '08  
GeneralRe: Dumb questionmvpPete O'Hanlon3:20 18 Apr '08  
GeneralRe: Dumb questionadminChris Maunder4:00 18 Apr '08  
GeneralRe: Dumb questionmvpPete O'Hanlon5:11 18 Apr '08  
GeneralRe: Dumb questionsupporterpeterchen1:00 7 May '08  
GeneralFormattingmembernorm .net22:01 17 Apr '08  
GeneralRe: FormattingmvpPete O'Hanlon0:18 18 Apr '08  

General General    News News    Question Question    Answer Answer    Joke Joke    Rant Rant    Admin Admin   

PermaLink | Privacy | Terms of Use
Last Updated: 6 May 2008
Editor:
Copyright 2008 by Pete O'Hanlon
Everything else Copyright © CodeProject, 1999-2008
Web17 | Advertise on the Code Project