Hello all,

OK, I have been banging my head against the wall for a while now, trying different techniques. None of them is working well.

I have two strings. I need to compare them and get a percentage match,

e.g. "four score and seven years ago" against "for scor and sevn yeres ago"

Well, I first started by comparing every word to every word, tracking every hit, with percentage = count / numOfWords. Nope, that didn't take misspelled words into account ("four" <> "for" even though they are close).

Then I tried comparing the strings character by character, skipping ahead when a character did not match (to allow for misspellings). But I would get false hits, because the first string could contain every character of the second without having them in the same order. For example, "stuff avail" <> "stu vail", but it would come back as a match: a low percentage, but still a hit (9 / 11 = 81%).

So I then tried comparing PAIRS of chars in each string. If string1[i] = string2[k] AND string1[i+1] = string2[k+1], increment the count, and increment "k" when they don't match (to allow for misspellings; "for" and "four" should come back as a 75% hit). That doesn't seem to work either. It is getting closer, but even an exact match only returns 94%, and it really gets screwed up when something is badly misspelled. (Code at the bottom.)

Any ideas or directions to go?

Thanks,

Josh


count = 0
j = 0
k = 0
While j < strTempName.Length - 2 And k < strTempFile.Length - 2
    ' To ignore non letters or digits '
    If Not Char.IsLetter(strTempName(j)) Then
        j += 1
    End If

    ' To ignore non letters or digits '
    If Not Char.IsLetter(strTempFile(k)) Then
        k += 1
    End If

    ' compare pair of chars '
    While (strTempName(j) <> strTempFile(k) And _
           strTempName(j + 1) <> strTempFile(k + 1) And _
           k < strTempFile.Length - 2)
        k += 1
    End While
    count += 1
    j += 1
    k += 1

End While

perc = count / (strTempName.Length - 1)
Posted
Updated 28-Mar-16 12:47pm

You can use the Levenshtein distance algorithm. It is a very well-known algorithm and is easy to implement.
This page contains Java/C++/VB implementations of the algorithm.
And here you can find a generic implementation of this algorithm (this time in C#, but converting it to VB.NET should not be a problem).
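
As a rough illustration of how the Levenshtein distance can be turned into the percentage the question asks for, here is a minimal VB.NET sketch. It is not taken from the linked pages. The names `SimilaritySketch`, `LevenshteinDistance` and `Similarity` are made up for this example, and dividing by the length of the longer string is just one common normalization (an assumption, not part of the algorithm itself).

```vb
Imports System

Module SimilaritySketch
    ' Textbook dynamic-programming Levenshtein distance: the minimum number
    ' of single-character insertions, deletions and substitutions needed
    ' to turn s into t.
    Function LevenshteinDistance(ByVal s As String, ByVal t As String) As Integer
        Dim n As Integer = s.Length
        Dim m As Integer = t.Length
        ' VB sizes arrays by upper bound, so this has (n + 1) x (m + 1) cells.
        Dim d(n, m) As Integer

        ' Turning a prefix into the empty string costs one deletion per char.
        For i As Integer = 0 To n
            d(i, 0) = i
        Next
        For j As Integer = 0 To m
            d(0, j) = j
        Next

        For i As Integer = 1 To n
            For j As Integer = 1 To m
                Dim cost As Integer = If(s(i - 1) = t(j - 1), 0, 1)
                Dim deletion As Integer = d(i - 1, j) + 1
                Dim insertion As Integer = d(i, j - 1) + 1
                Dim substitution As Integer = d(i - 1, j - 1) + cost
                d(i, j) = Math.Min(Math.Min(deletion, insertion), substitution)
            Next
        Next
        Return d(n, m)
    End Function

    ' Hypothetical helper: maps the distance to a 0..1 score by dividing
    ' by the length of the longer string (an assumed normalization).
    Function Similarity(ByVal s As String, ByVal t As String) As Double
        Dim longest As Integer = Math.Max(s.Length, t.Length)
        If longest = 0 Then Return 1.0 ' two empty strings are identical
        Return 1.0 - LevenshteinDistance(s, t) / longest
    End Function

    Sub Main()
        ' "four" -> "for" needs one deletion over 4 characters: 1 - 1/4
        Console.WriteLine(Similarity("four", "for"))
    End Sub
End Module
```

With this normalization, "four" against "for" scores 0.75, which is the figure the question expects for that pair. The comparison is case-sensitive and counts spaces and punctuation like any other character, so you may want to normalize both strings first.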

I hope this helps. :)
 
Comments
#realJSOP 19-Jan-11 15:07pm    
Proposed as answer
Nuri Ismail 20-Jan-11 3:02am    
Thank you John!
Maciej Los 9-Mar-12 15:28pm    
Good answer, good link. My 5!
Please see my question. Would you like to join the discussion?
Maybe this will help as a basis. You will need to modify it.

Points to remember:
1) It compares character by character
2) It skips characters until the next match
3) It waits at the end of a word
4) It jumps to the next word when a new word starts in the first string

VB
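Function Compare(ByVal str1 As String, ByVal str2 As String) As Double
  ' Score against the longer string so extra characters lower the result
  Dim count As Integer = Math.Max(str1.Length, str2.Length)
  Dim hits As Integer = 0
  Dim j As Integer = 0
  For i As Integer = 0 To str1.Length - 1
    ' New word in str1: jump to the start of the next word in str2
    ' and count the space itself as a hit
    If str1.Chars(i) = " "c Then
      i += 1
      j = str2.IndexOf(" "c, j) + 1
      hits += 1
      If i >= str1.Length Then Exit For ' str1 ended with a space
    End If
    ' Skip characters in the current word of str2 until one matches
    While j < str2.Length AndAlso str2.Chars(j) <> " "c
      If str1.Chars(i) = str2.Chars(j) Then
        hits += 1
        j += 1
        Exit While
      Else
        j += 1
      End If
    End While
    ' At the end of a word (or of str2), step back and wait there
    If Not (j < str2.Length AndAlso str2.Chars(j) <> " "c) Then
      j -= 1
    End If
  Next
  Return Math.Round(hits / count, 2)
End Function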


Sample Output:
"four"<->"for" = 0.75
"four stud"<->"for studs" = 0.89
 
Comments
Maciej Los 9-Mar-12 15:27pm    
Interesting solution... My 5!
Please see my question. Would you like to join the discussion?

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


