Click here to Skip to main content
Licence CPOL
First Posted 17 Apr 2008
Views 20,287
Downloads 68
Bookmarked 15 times

A Naive String Comparer

By | 6 May 2008 | Article
A class to perform a "naive" comparison of two chunks of text to see if they look to be the same.

Introduction

A while ago, we wrote an application for a client who was running a competition that included a tie breaker question. The tie breaker had to be unique so there was a need to deduplicate the entries to remove multiple matches (due to a quaint little law where if you awarded a prize to one person based on their tie breaker and somebody else could prove they had entered with the same tie breaker, then you had to award them the prize as well; providing they satisfied all other prize winning criteria). The remaining entries would be judged by a panel to see which ones were the best.

As you can imagine, the fact that the tie breakers were input manually made this potentially fraught due to the possibility of data capture miskeys or incorrect spelling by the entrant. To get round this, the client requested a naive comparer which used some simple rules to identify potential duplicate entries. This article is a version of that comparer.

The Basics

The first thing you need to do is load in the string you want to compare against and set up whether or not you want to strip HTML tags from the text when performing comparisons. This is a little extension that I added to the original design so that it could be used to detect whether or not an article here on CodeProject was just a copy of the base template. In the original, HTML tags weren't removed.

The basic algorithm is to remove HTML tags using the following regular expression:

</?\w+((\s+\w+(\s*=\s*(?:"".*?""|'.*?'|[^'"">\s]+))?)+\s*|\s*)/?>

This expression simply grabs all text inside angle brackets, and the application replaces them with an empty string using a Regex.Replace.

Using a similar technique, all vowels, whitespace and special characters are removed from the text. The text can then be compared.

The method to actually remove the "offending" tags is shown here:

/// <summary>
/// Break the relevant text item down into a string 
/// that is stripped of all HTML tags, vowels and 
/// whitespace characters.
/// </summary>
/// <param name="source">The item that needs to be "cleansed" 
/// using the relevant regular expressions.</param>
/// <returns>The "cleansed" string.</returns>
/// <remarks>This implementation starts off by removing all HTML tags, 
/// before removing any vowels and
/// whitespace characters.</remarks>
private string BreakdownText(string source)
{
  if (string.IsNullOrEmpty(source))
    throw new ArgumentNullException("source");
  // First of all, strip the tags out of the source.
  if (_tagType == StripTagType.StripTags)
  {
    source = ReplaceCharacters(source, _tagRemover, string.Empty);
  }
  source = ReplaceCharacters(source, _chars, string.Empty);
  return ReplaceCharacters(source, _vowels, string.Empty).ToLowerInvariant();
}
/// <summary>
/// Method to replace one set of characters with another.
/// </summary>
/// <param name="source">The source text to replace the characters in.</param>
/// <param name="regexText">The regular expression defining the characters to replace.
/// </param>
/// <param name="replaceText"></param>
/// <returns></returns>
private string ReplaceCharacters(string source, string regexText, string replaceText)
{
  Regex regex = new Regex(regexText,
  RegexOptions.IgnoreCase | RegexOptions.Multiline | 
		RegexOptions.IgnorePatternWhitespace);
  return regex.Replace(source, replaceText);
}

The order in which items are removed is important. You want to strip the HTML tags out before you remove the none-alphabetic and vowel characters. By the way, the _vowels and _char regular expressions are defined as [eiouy] and [^b-z] respectively.

If the two sanitized pieces of text you want to compare match, then you have a duplicate. However, as we have said, there's always the possibility of there being a misspelling so the comparer goes further and looks through the two pieces of text to identify characters that don't match. If the number of characters falls outside a particular threshold, then we know that the two pieces of text are sufficiently different to count them as unique.

/// <summary>
/// This method calculates the number of unmatched characters 
/// in the source string and determines
/// whether or not it falls within the target threshold.
/// </summary>
/// <param name="source">The source to check.</param>
/// <param name="target">The target string to check against.</param>
/// <returns>True if the string matches (or matches to within the tolerance), 
/// false otherwise.</returns>
private bool Compare(string source, string target)
{
  int targetLength = target.Length; 
  StringBuilder unmatched = new StringBuilder();
  while (target.Length > 0 && source.Length > 0 && unmatched.Length <= _diffCharacters)
  {
    if (target[0] == source[0])
    {
      target = target.Remove(0, 1);
      source = source.Remove(0, 1);
    }
    else
    {
      // Now, look for the next instance of this letter in source.
      int index = source.IndexOf(target[0]);
      if (index > 0)
      {
        unmatched.Append(source.Substring(0, index));
        source = source.Remove(0, index);
      }
    }
  }
  return unmatched.Length <= _diffCharacters;
}

And that's it - a naive comparer. To run it, all you need to do is this:

static void Main(string[] args)
{
  string s = "<H1>This is a sample.</H1>";
  NaiveComparer c = new NaiveComparer(s, StripTagType.StripTags, 10);
  if (c.Compare("Hello" + s + "aeiou123"))
  {
    Console.WriteLine("It matches");
  }
  if (!c.Compare("Hellot there from this completely arbitrary " + 
    "and totally different way of matching <<<< <<<<<< " + s + 
    ">>>>>>>>>>>>>>-----"))
    Console.WriteLine("No match");
  Console.ReadLine();
}

I hope that this is useful for you, on those occasions where you need to compare to see if items are sufficiently different to allow them through.

Edit History

  • 07 May 08 Fixed bug found by Chris Maunder. Thanks Chris.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Pete O'Hanlon

CEO

United Kingdom United Kingdom

Member

Follow on Twitter Follow on Twitter
Google+
A developer for more years than I care to remember. I live in the North East of England with 2 wonderful daughters and a wonderful wife.
 
I am not the Stig, but I do wish I had Lotus Tuned Suspension.

Sign Up to vote   Poor Excellent
Add a reason or comment to your vote: x
Votes of 3 or less require a comment

Comments and Discussions

 
You must Sign In to use this message board. (secure sign-in)
 
Search this forum  
 FAQ
    Noise  Layout  Per page   
  Refresh
Generalusing word dll Pinmemberanshhhhh20:35 25 Jun '09  
GeneralBug report PinadminChris Maunder14:30 6 May '08  
GeneralRe: Bug report PinmvpPete O'Hanlon22:53 6 May '08  
Good catch Chris. Thanks - I've uploaded the modified version.
 

Deja View - the feeling that you've seen this post before.
 

My blog | My articles



QuestionWhy didn't you use Soundex? Pinmemberpintel19:22 22 Apr '08  
GeneralRe: Why didn't you use Soundex? PinmvpPete O'Hanlon22:10 22 Apr '08  
GeneralDumb question PinadminChris Maunder1:37 18 Apr '08  
GeneralRe: Dumb question PinmvpPete O'Hanlon2:20 18 Apr '08  
GeneralRe: Dumb question PinadminChris Maunder3:00 18 Apr '08  
GeneralRe: Dumb question PinmvpPete O'Hanlon4:11 18 Apr '08  
GeneralRe: Dumb question Pinmemberpeterchen0:00 7 May '08  
GeneralFormatting Pinmembernorm .net21:01 17 Apr '08  
GeneralRe: Formatting PinmvpPete O'Hanlon23:18 17 Apr '08  

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.

Permalink | Advertise | Privacy | Mobile
Web04 | 2.5.120529.1 | Last Updated 7 May 2008
Article Copyright 2008 by Pete O'Hanlon
Everything else Copyright © CodeProject, 1999-2012
Terms of Use
Layout: fixed | fluid