Click here to Skip to main content
15,847,077 members
Please Sign up or sign in to vote.
3.86/5 (4 votes)
See more:
For example, i have 5 to 10 names, like:
JOHN COMP.<br />
JOHN COMPANY<br />
JOHN COMPANY.<br />
JOHN CO.<br />
JOHN COMPN


How to compare these strings (text) and get the similarity in percentage?
Do exists any algorithm for this comparision?

[EDIT]Dear SAKryukov, below is answer on your questions[/EDIT]
I have simple database in ms Excel (near 1000 records). The data are sorted by the name of company. The other column are: post-code, city, etc. There is only one criterium: compare the closest names and return the similarity of them. If they are almost the same - show them.

I was wondering abuot "simple" algorithm based on text-weights, like this (do not look at well-formed code, becouse it was quickly created):
VB
Function GetTextWeight(ByVal sText As String) As Long
Dim sWeights As String
Dim i As Integer, j As Integer, lWeight As Long

On Error GoTo Err_GetTextWeight

Do While Len(sWeights) < Len(sText)
    sWeights = sWeights & "3579"
Loop

For i = 1 To Len(sText)
    lWeight = lWeight + CLng(Asc(Mid(sText, i, 1)) * CInt(Mid(sWeights, i, 1)))
Next i

Do While lWeight > 10
    lWeight = lWeight / 10
Loop

Exit_GetTextWeight:
    GetTextWeight = lWeight
    Exit Function

Err_GetTextWeight:
    lWeight = 0
    Resume Exit_GetTextWeight

End Function


What i'm trying to do, is to find algorithm which can return values:

IDTextSimilarity (Weight)
1ALIOR BANK4
1ALIOR BANK4
5ALIOR BANK S.A.6
19ALIOR BANK SA6
100ALIOR BANK S.A6
77ALIOR BANK S A6


I'm NOT trying to compare two strings in code. I'm trying to find a hint, the hint for user, which should analyze similar strings ("duplicates"). If he decide to merge id's for similar strings, he can do that.
Posted
Updated 3-Mar-12 14:10pm
v3
Comments
Sergey Alexandrovich Kryukov 3-Mar-12 16:18pm    
Before approaching text similarity algorithm, you need to define text similarity criteria.
There is no "default" or "assumed" definition. It should work for different length and can be very different.

For example, in the primitive criterion
"The Beatles" and "The John company"
are close then the other combinations of your samples, but
"John company" and "The John company" are not close, even though the first pair is close in alphabetic order.

It is not clear what criteria do you want. First of all, because you don't explain the purpose.
If this is just because someone ordered you to solve this problem, the problem has no solution because it is not defined.

--SA
Maciej Los 3-Mar-12 18:20pm    
Thank you for your reply. Please, take a look at my question. I've made some changes, i've added explanations.
Sergey Alexandrovich Kryukov 3-Mar-12 18:44pm    
Nothing is clear. Comparison involved two strings, and your function accepts only one argument. From the very beginning -- it makes no sense. So The code is simply illiterate -- who uses GOTO, ever?!
You table also does not show which string to you compare all of the rest. You weight cannot be the attribute of the string: if should be a function of the two...

Again, why doing all that? Search, search relevance..?
--SA
Maciej Los 3-Mar-12 19:01pm    
Do not shout on me. I'm conscious programer... As i said, i'm looking for similarity... Right now i'm using MS Excel VBA, that's why i'm using GoTo instruction. Why i use tags: VB, VBA, VB.NET? Becouse these programming languages are similar...

I'm NOT trying to compare two strings in code. I'm trying to find a hint, the hint for user, which should analyze similar strings ("duplicates"). If he decide to merge id's for similar strings, he can do that. I don't know how to explain more...
Sergey Alexandrovich Kryukov 3-Mar-12 19:43pm    
Who shouts on you, me? Are you serious?

By the way, similarity between VB and VB.NET is as concerned as as your samples. VB.NET is damn far from VB, much closer to C#...

So, if you are not trying to compare strings, it contradicts to the idea of "similarity". Similarity is a function of two strings, by definition. Even after your explanation above it remains the true. Otherwise -- sorry, I fail to understand what are you trying to do. Especially if you don't know how to explain more...

As to GOTO, it is irrelevant who you thing you are and what did you use. Nobody wants to attack you -- just don't do it. If you want help and advice, of course...

--SA

You may be looking for the Gestalt Approach as described in Dr. Dobbs Article of July 1988: Pattern Matching: the Gestalt Approach[^].
Maybe, this[^] article helps too.

Cheers

Andi
 
Share this answer
 
Comments
Maciej Los 4-Mar-12 6:41am    
Thank you very much! It was very helpful!
Sergey Alexandrovich Kryukov 4-Mar-12 12:53pm    
Please see my last comments to your comment and the one by OP. Based on my present understanding of the problem, this is not a perfect solution, but certainly a decent attempt which deserves my 5, so I up-voted the answer.
--SA
Sergey Alexandrovich Kryukov 5-Mar-12 21:20pm    
As I promised to try, I published my answer, please see. I would be interested to see your feedback.
For some disclaimer, please see my last comments to the question.
--SA
Sergey Alexandrovich Kryukov 5-Mar-12 21:23pm    
Andi,

By the way, it was so nice of you to address to me directly via LinkedIn. I accepted your call for a contact.

If you are interested, you could also contact me very directly via my Web site you can find through my CodeProject profile. It has a "contact me" page. Many CodeProject people already found it out and sent me some questions or feedback.
--SA
Andreas Gieriet 5-Mar-12 22:22pm    
Thanks for accepting. BTW: I'm currently very busy on the job - I will look at your Solution #2 as soon as I got some air to breath...
Cheers
Andi
Here is my idea:

The algorithm should accept three things on input: to string to be compared and the meta-data. The meta-data would contain parameters for comparison: weight factors to be used, character sets, maybe even dictionaries; please see below on optionality. Basically, you would need to play with set of parameters and experiment with real data samples, to determine the best set of parameters, which can also vary depending on cultures, industry, etc.

The algorithm should return the similarity factor as a floating-point number: the more the factor, the more similar. For certainty, let's assume this is System.Double.

For the first step, strings are compared for being 100% identical, in that case it returns double.PositiveInfinity. This value is very convenient to avoid messing up with normalization of distribution. Is is also convenient because infinite values can be correctly compared with "regular" non-NaN values using '>=', '<=', '>', '>' or '==' operators.

Let's define the set of delimiter characters in the form of array of characters. This is a delicate decision, I'll try to discuss is later. For first approximation, let's include all punctuation (including Unicode punctuation like «, », ', —, –, “, ”, etc.), also some symbols like © or ™, etc., and — importantly — a blank space in all its forms. Let's assume we have this set as char[] delimiters.

On next step, let's split each input string input into lexemes like this:
C#
string[] lexemes = input.Split(delimiters, System.StringSplitOptions.RemoveEmptyEntries);
Here, StringSplitOptions.RemoveEmptyEntries is very important.

This is a very important step. For the first thing, it expresses the following: we unify all delimiters (to the minimal information: one ore more delimiter in certain place), merge consecutive delimiters together and give them the least priority; in other words, we shall not consider differences between those delimiters. From this point, the delimiters go out of consideration.

Now, the core of the algorithm: count the number of matches between the lexemes without taking into account the order, taking each match with some weight factor depending on the length of the lexeme. This is the first step where we use meta-data. I don't discuss what is the "match". For a first approximation, the match can be the case-insensitive string comparison. It could be two comparisons: case sensitive and case-insensitive; the case of 100% case-sensitive match going with higher weight factor.

Optional improvement of lexeme match algorithm #1: in the case-insensitive match, count the number of case-sensitive matches character-by-character and modify the weight factor of the match depending on the percentage of case-sensitive character matches.

Optional improvement of lexeme match algorithm #2: compare lexemes using Gestalt-approach string comparison suggested by Andi.

Basically, that's it. There can be different options on top of it. For example, the core algorithm could be improved by adding positional comparison of the lexemes: a match can be given an elevated weight factor is the match happens at the same or close position.

Now, about the dictionaries. You could have a dictionary of low-priority words, such as articles and prepositions. If a lexeme match is found, the weight factor could be multiplied by low (<1) factor of a "low-priority word". This method is widely used these day for Web search. I must note that this method could work well in case of 1-3 different languages, mostly from Germanic/Roman-based cultures but might have negative impact in case of very different cultures or many different languages. From the other hand, the vast dictionary could work in the following way: if a match is found, and a word is not found in a dictionary, it can be given an elevated weight factor, as for a "unique word".

Again, every approach we discussed on this page requires extensive research on the real data samples.

If you decide to lead this work to some reasonable working result, if would be very nice if you publish the results of your research. I personally would be very interested to learn the results. In case you do, I would expect you notify Andi and myself.

Good luck,
—SA
 
Share this answer
 
v4
Comments
Maciej Los 6-Mar-12 16:51pm    
WOW! Thank you very much! Great idea! My 5!
I'm little confused, becouse this is the area which i'm never explored. So, i need a little bit of time to understand what you are trying to tell me. I promise to contact with You and Andi if i'll ever find the solution. Sorry for my language ;)
Joezer BH 1-Aug-13 2:33am    
5+!!

Wow, probably the longest thread on a Q&A I've seen so far in CP
Sergey Alexandrovich Kryukov 6-Mar-12 17:23pm    
Great. And you are very welcome.

You should understand this is a kind of preliminary plan which needs some research to justify. Nevertheless, I was coming back to ideas on this topic from time to time during couple of days. Interesting topic, you know.

I am not even talking about your final solution, I wold be interested to know if your research would show some results on real-life representative data samples. Feel free to post a comment if you need some assistance in code development or other concerns -- I'll try to help.

Good luck, would be glad to hear from you later.
--SA
Sergey Alexandrovich Kryukov 1-Aug-13 8:11am    
Yee, I guess...
Thank you very much.
—SA
Sergey Alexandrovich Kryukov 6-Mar-12 22:20pm    
Buy the way, I think you could consider accepting this answer formally (I mean, green "Accept" button) -- you can always accept more then one.

Thank you for your contact through my Web site. I don't mind to keep the contact and discuss the matters you mentioned from time to time. Will be glad if it helps you. You contact me the same way; I'll reply to you directly as soon as we have something to discuss. Please write...
--SA

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)



CodeProject, 20 Bay Street, 11th Floor Toronto, Ontario, Canada M5J 2N8 +1 (416) 849-8900