Text similarity algorithm

Question

For example, I have 5 to 10 names, like:

JOHN COMP.
JOHN COMPANY
JOHN COMPANY.
JOHN CO.
JOHN COMPN

How can I compare these strings (text) and get the similarity as a percentage?
Does an algorithm exist for this comparison?

[EDIT]Dear SAKryukov, below are the answers to your questions.[/EDIT]
I have a simple database in MS Excel (about 1000 records). The data is sorted by company name. The other columns are post-code, city, etc. There is only one criterion: compare the closest names and return their similarity. If they are almost the same, show them.

I was wondering about a "simple" algorithm based on text weights, like this (don't mind the code quality; it was written quickly):

VB

Function GetTextWeight(ByVal sText As String) As Long
    Dim sWeights As String
    Dim i As Integer, lWeight As Long

    On Error GoTo Err_GetTextWeight

    ' Build a weight pattern "35793579..." at least as long as the text
    Do While Len(sWeights) < Len(sText)
        sWeights = sWeights & "3579"
    Loop

    ' Sum each character code multiplied by its positional weight
    For i = 1 To Len(sText)
        lWeight = lWeight + CLng(Asc(Mid(sText, i, 1)) * CInt(Mid(sWeights, i, 1)))
    Next i

    ' Scale the sum down until it is at most 10
    Do While lWeight > 10
        lWeight = lWeight / 10
    Loop

Exit_GetTextWeight:
    GetTextWeight = lWeight
    Exit Function

Err_GetTextWeight:
    lWeight = 0
    Resume Exit_GetTextWeight

End Function
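For readers who do not use VBA, here is a rough Python port of the function above. It is only a sketch, not the OP's code: it assumes plain ASCII input (VBA's Asc works on the ANSI code page, while Python's ord returns Unicode code points) and relies on Python's round() applying the same round-half-to-even rule that VBA uses when a Double is assigned to a Long. The name text_weight is made up for this example.

```python
def text_weight(text: str) -> int:
    """Rough port of GetTextWeight: one number per string, not per pair."""
    factors = (3, 5, 7, 9)  # the repeating "3579" weight pattern
    # Sum each character code multiplied by its positional weight
    weight = sum(ord(ch) * factors[i % 4] for i, ch in enumerate(text))
    # Scale down until the value is at most 10, rounding at every step
    while weight > 10:
        weight = round(weight / 10)
    return weight


for name in ("ALIOR BANK", "ALIOR BANK SA", "JOHN COMPANY"):
    print(name, text_weight(name))
```

Since the result is squeezed into the range 0 to 10 and depends on one string only, unrelated names can easily share a weight, which is the objection raised in the comments.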

What I'm trying to do is find an algorithm that returns values like these:

ID	Text	Similarity (Weight)
1	ALIOR BANK	4
1	ALIOR BANK	4
5	ALIOR BANK S.A.	6
19	ALIOR BANK SA	6
100	ALIOR BANK S.A	6
77	ALIOR BANK S A	6

I'm NOT trying to compare two strings in code. I'm trying to produce a hint for the user, who should review similar strings ("duplicates"). If he decides to merge the IDs of similar strings, he can do that.

Posted 3-Mar-12 10:10am

Maciej Los

Updated 3-Mar-12 13:10pm

v3

Comments

Sergey Alexandrovich Kryukov 3-Mar-12 16:18pm

Before approaching a text similarity algorithm, you need to define the text similarity criteria.
There is no "default" or "assumed" definition. It should work for strings of different lengths, and the criteria can be very different.

For example, under a primitive criterion,
"The Beatles" and "The John company"
are closer than the other combinations of your samples, but
"John company" and "The John company" are not close, even though the first pair is close in alphabetic order.

It is not clear what criteria you want, first of all because you don't explain the purpose.
If this is just because someone ordered you to solve this problem, the problem has no solution because it is not defined.

--SA

Maciej Los 3-Mar-12 18:20pm

Thank you for your reply. Please, take a look at my question. I've made some changes, i've added explanations.

Sergey Alexandrovich Kryukov 3-Mar-12 18:44pm

Nothing is clear. A comparison involves two strings, and your function accepts only one argument. From the very beginning, it makes no sense. Also, the code is simply illiterate -- who uses GOTO, ever?!
Your table also does not show which string you compare all the rest to. Your weight cannot be an attribute of one string: it should be a function of the two...

Again, why doing all that? Search, search relevance..?
--SA

Maciej Los 3-Mar-12 19:01pm

Do not shout at me. I'm a conscious programmer... As I said, I'm looking for similarity... Right now I'm using MS Excel VBA; that's why I'm using the GoTo instruction. Why do I use the tags VB, VBA, VB.NET? Because these programming languages are similar...

I'm NOT trying to compare two strings in code. I'm trying to produce a hint for the user, who should review similar strings ("duplicates"). If he decides to merge the IDs of similar strings, he can do that. I don't know how to explain more...

Sergey Alexandrovich Kryukov 3-Mar-12 19:43pm

Who shouts at you, me? Are you serious?

By the way, the similarity between VB and VB.NET is as questionable as that of your samples. VB.NET is damn far from VB, much closer to C#...

So, if you are not trying to compare strings, it contradicts the idea of "similarity". Similarity is a function of two strings, by definition. Even after your explanation above, this remains true. Otherwise -- sorry, I fail to understand what you are trying to do. Especially if you don't know how to explain more...

As to GOTO, it is irrelevant who you think you are and what you used. Nobody wants to attack you -- just don't do it. If you want help and advice, of course...

--SA

Andreas Gieriet 3-Mar-12 21:34pm

Hey SAKryukov,
this guy asks for some advice on checking similarity of text, obviously not similarity of semantics. Please let the OP decide for himself whether this makes sense for his task at hand. I very much "dislike" it too when I have to tell people *why* I came to the question - it should not matter.

E.g. I had a similar situation in the past, and the approach described in Solution #1 proved sufficient for the particular problem. Why the heck do you want to tell anyone that a question doesn't make sense!? The table is admittedly a bit odd, but with a bit of imagination, you can assume that it compares to a (not listed) reference text.

Whether one should employ "goto" is not part of the question.

BTW: You do not shout, but you pee on his leg. No big difference - both come over quite aggressively.

Cheers

Andi

Sergey Alexandrovich Kryukov 4-Mar-12 4:02am

Aggressively? I can't understand you guys. Probably this is beyond me. If someone shows me my mistake, I either say thank you and fix it or try to prove it wasn't a mistake. Everything else is not helping. Do you want me to ignore problems and hence not help? And there is a big cultural difference. For example, to me, the word "pee" sounds very rude, but I usually don't blame people for this (this very case is just an example).

Now, what makes sense and what does not is not always a matter of choice. Similarity criteria are always a function of two objects, not one. Don't you agree? If I am missing something, please explain it.

It looks like you argue that the concrete criteria do not matter and that I don't need to discuss whether they make sense or not. Then I would agree. I discussed it in my first comment. I said that the OP needs to define them. So my "does not make sense" means this: not that the criteria are wrong, but that they are still not defined. I feel that you did not get it, hence your comment above.

By the way, fundamentally, there is no difference between semantic and non-semantic criteria. They are all semantic, if you will. They should be, that's it. What makes sense depends on the ultimate goal and other factors.
--SA

Andreas Gieriet 4-Mar-12 9:02am

C'est le ton qui fait la musique ;-) - and I did not find the appropriate wording either. :-( My apologies for that.

On the subject: I understand from the title and the few strings he shows that the OP has a pile of data that he has to "sieve" somehow, picking the low-hanging fruit first, which are the records with "similar" texts. "Similar" is an uncertain term, but the question is quite clear in my eyes from a user's point of view. So the question still is whether there is something that shows similarity the way a user would see it instantly. There are some algorithms; one of these is the "Gestalt" approach. I think we can assume that there is no unique "similarity check" algorithm (the same holds for two individual persons: one finds some texts closer to each other than the other person does). The "Gestalt" approach gives a percentage of similarity based on the number of characters in common substrings. This works well most of the time, but not always (it gives a "pessimistic" number)... At least it allows picking the low-hanging fruit first and then focusing on the hard stuff.

My apologies again.

Regards

Andi
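To make the "characters in common substrings" idea from the comment above concrete, here is a minimal Python sketch of the Gestalt approach (also known as Ratcliff/Obershelp pattern matching). It is an illustration written for this thread, not Andi's C# code; the function names are made up, comparison is case-sensitive, and ties between equally long common substrings are resolved by taking the first one found, so other implementations may differ slightly.

```python
def matching_chars(a: str, b: str) -> int:
    """Count characters in common substrings, Ratcliff/Obershelp style."""
    if not a or not b:
        return 0
    # Find the longest common substring with a rolling dynamic-programming row
    best_len = best_a = best_b = 0
    prev = [0] * (len(b) + 1)
    for i, ca in enumerate(a, 1):
        cur = [0] * (len(b) + 1)
        for j, cb in enumerate(b, 1):
            if ca == cb:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    # i and j are the (exclusive) end positions of the match
                    best_len, best_a, best_b = cur[j], i, j
        prev = cur
    if best_len == 0:
        return 0
    # Recurse on the pieces to the left and to the right of the match
    left = matching_chars(a[:best_a - best_len], b[:best_b - best_len])
    right = matching_chars(a[best_a:], b[best_b:])
    return best_len + left + right


def gestalt_ratio(a: str, b: str) -> float:
    """Similarity in [0, 1]: 2 * matching characters / total length."""
    if not a and not b:
        return 1.0
    return 2.0 * matching_chars(a, b) / (len(a) + len(b))
```

With this sketch, gestalt_ratio("JOHN COMPANY", "JOHN CO.") evaluates to 0.70: the common part "JOHN CO" has 7 characters and the two strings have 20 characters in total.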

Sergey Alexandrovich Kryukov 4-Mar-12 12:39pm

Je ne parle pas encore Français. :-)
This is another time I'm sorry I gave up my second foreign language in the university many years ago. Still remember what I remember from my first credit lesson, no more. Thank you for the discussion, no problems, I hope...

I'm discussing the problem with the OP right now, trying to better understand the purpose of this relationship and its expected properties, as this is not yet clear. What you think is uncertain, I agree is uncertain. What you think is clear is not really clear to me. At least the description of the purpose could be better. If the purpose is just to remove "nearly duplicate" data, there can be false positives. Sometimes a mismatch in one letter clearly indicates that a company is different; sometimes it is not important: for example, an abbreviated word can come with a dot or without. One option is that the system says: "Hey, you are responsible for the data; you should know that the collection of records suggests there are duplicate companies: look at the company names, they are similar: this and this, this and that. What shall we do with them?" -- and the human operator decides. If the system removes duplicates itself, this is a different story (and I would not trust any algorithm except one indicating 100% identical records). Also, I don't know whether the similarity should be used in some other way or not. Alternatively, the requirements for the algorithm could be considered as a set of criteria. For example: "two strings with one mismatched letter in the middle should be closer than two strings with a mismatch in the first letter", but it should also depend on the lengths -- and so on. This is difficult to formulate, in fact. In general, I think the problem is very difficult. It also depends on how much we can tolerate possible false negatives and false positives. This is a real problem.

Thanks for the explanations on your answer. I actually took a look at the article you referenced; it could be a viable solution. It's certainly not perfect if you compare it with a person with a proper background in the specific language. To take it seriously, it should be considered a difficult linguistic problem, and such problems are being intensively researched right now; one of the applications and points of competition is Web search. But can we get a better solution than the "Gestalt" approach? I don't know. At least it needs some thinking.
--SA

Andreas Gieriet 4-Mar-12 14:35pm

My native language is not French either, but we use that proverb too - we have many French words mixed into daily use in Swiss German anyway.

Regarding similarity: I personally would never let the system remove duplicates - all these potential duplicates need a human to decide based on more background information. Usually, one is not willing to add all the needed information to the system to let it decide. So my expectation for such an algorithm is that it provides some correlation figures that might help to decide whether some text is to be considered a duplicate. If you have more data in the system that might help in calculating the similarity, like the address, etc., then that information could be used too. E.g. concatenate all address fields per record into one string and calculate the correlation by the Gestalt approach.

Concerning the weighting of the character position: I've observed with the Gestalt approach that it does not really matter. You get quite accurate numbers if you can add enough data to the calculation.

E.g. the hitlist for the given strings looks as shown below:


96%  JOHN COMPANY  - JOHN COMPANY.
91%  JOHN COMPANY  - JOHN COMPN
90%  JOHN COMP.    - JOHN COMPN
88%  JOHN COMP.    - JOHN CO.
87%  JOHN COMPANY. - JOHN COMPN
87%  JOHN COMP.    - JOHN COMPANY.
82%  JOHN COMP.    - JOHN COMPANY
78%  JOHN CO.      - JOHN COMPN
76%  JOHN COMPANY. - JOHN CO.
70%  JOHN COMPANY  - JOHN CO.

Andi
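A hit list like the one above can be sketched in Python with the standard library. This is an assumption-laden stand-in, not the C# code that produced Andi's figures: difflib.SequenceMatcher is closely related to the Gestalt approach and its ratio() is 2 * matches / total length, so the percentages should land close to the list above without necessarily being identical. The helper names similarity and hit_list are made up for this example.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    # autojunk=False switches off difflib's heuristic for long inputs
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def hit_list(names):
    """Score all unordered pairs and return them best match first."""
    pairs = [(similarity(a, b), a, b) for a, b in combinations(names, 2)]
    return sorted(pairs, reverse=True)


names = ["JOHN COMP.", "JOHN COMPANY", "JOHN COMPANY.", "JOHN CO.", "JOHN COMPN"]
for score, a, b in hit_list(names):
    print(f"{score:4.0%}  {a:<13} - {b}")
```

For roughly 1000 records, scoring every pair means about half a million comparisons, so in practice one would probably compare only neighbouring or pre-grouped records.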

Sergey Alexandrovich Kryukov 4-Mar-12 14:44pm

Your speculations look reasonable. I'm coming to some other ideas while the OP might think about my concerns. If I decide to post my own attempt at a solution, I'll notify you -- I would be interested in your opinion.

My question is: as you produced the figures above, does it mean you are using some code? Is it .NET code, as the OP needs?

--SA

P.S.:
The tag <pre> does not work in comments, but <code> does.

Andreas Gieriet 4-Mar-12 16:32pm

Thanks for the hint - I've adjusted the comment.

The figures are produced by some C# code that has lingered in my project scratch pad for some years. It's not ready for publication, though.

Sergey Alexandrovich Kryukov 4-Mar-12 4:09am

Yes, on second thought, I would agree that GetTextWeight formally defines some criterion of "similarity": the similarity between two strings would be defined as the difference between their weights.

If so, I wish the OP had explained that in the first place, and I wish to understand the meaning of such a comparison. It resembles a hash. (But normally a hash is designed to be very different if the input data is changed minimally.)

--SA

Maciej Los 4-Mar-12 7:04am

Thank you for the second try ;) SA, I know how to find duplicates, but I'm trying to find a function that can help to find similar records. That means: if the address of the company (post-code, city) is the same and the names are grouped in alphabetical order, it's more than likely that the company names are the same. Is it clearer now?
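The idea in this comment (same post-code and city, similar names) could be turned into user hints roughly as follows. This is a sketch under several assumptions: the record layout (id, name, post_code, city), the 0.75 threshold and the sample rows with their addresses are all invented for illustration, and difflib.SequenceMatcher stands in for whatever similarity function is finally chosen.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def duplicate_hints(records, threshold=0.75):
    """Yield (id_a, id_b, score) for similar names sharing post-code and city."""
    # Group records by address so that only plausible candidates are compared
    groups = defaultdict(list)
    for rec_id, name, post_code, city in records:
        groups[(post_code.strip(), city.strip().upper())].append((rec_id, name))
    for members in groups.values():
        for (id_a, name_a), (id_b, name_b) in combinations(members, 2):
            score = SequenceMatcher(None, name_a.upper(), name_b.upper(),
                                    autojunk=False).ratio()
            if score >= threshold:
                yield id_a, id_b, score


# Made-up sample rows; post-codes and cities are placeholders
records = [
    (1, "ALIOR BANK", "00-001", "Warszawa"),
    (5, "ALIOR BANK S.A.", "00-001", "Warszawa"),
    (19, "ALIOR BANK SA", "00-001", "Warszawa"),
    (42, "JOHN COMPANY", "30-002", "Krakow"),
]
hints = list(duplicate_hints(records))
for id_a, id_b, score in hints:
    print(f"IDs {id_a} and {id_b} look similar ({score:.0%})")
```

The function only reports candidate pairs; deciding whether to merge the IDs stays with the user, as the OP intends.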

Maciej Los 4-Mar-12 6:49am

SA, OK, maybe I'm tired... Sorry. I agree with you: "Similarity is a function of two strings, by definition". Believe me, I really need your help. So I'm thankful for your advice now and for any advice in the future.

Sergey Alexandrovich Kryukov 4-Mar-12 12:51pm

No problem at all. Except your problem with the similarity. Please see my answer to Andi above. (The one started with French phrase, where I explain why I think the problem is difficult and what makes me think it is still uncertain, discussion of possible purpose, false positive and false negatives.)

My conclusions at this point:

1) If you want advice on a similarity algorithm from scratch, you probably need to give more information on the requirements for the comparison function, which is difficult, or explain more in terms of its ultimate purpose. Please see the comment I mentioned above, where I speculate on how it could be used. At the same time, I am not sure it can help, just because I think the problem is very difficult. Maybe you need to settle for some limited quality of solution.

2) If you would like to evaluate and possibly fix the algorithm you have shown: I cannot even understand it, neither the purpose nor the meaning of those calculations of density.

3) Gestalt approach algorithm suggested by Andi is certainly not perfect (at least considering my understanding of the problem I tried to explain in that comment), but I'm not sure that you can devise any better. It could be useful to create the code based on it and experiment with it.

--SA

Sergey Alexandrovich Kryukov 4-Mar-12 14:40pm

OK, let's try to give it a shot. I think I'm coming to some ideas. Let me think some more; and in the meanwhile you please try to address my concerns about the purpose of all this (yes, up to the business settings) and to give other relevant considerations. I'll think if I can post an answer; and we'll discuss it some more. Deal?
--SA

Maciej Los 4-Mar-12 17:34pm

Dear SAKryukov, Dear Andreas,
Thank you very much for your opinions and comments.
First of all, I'm sorry for each mistake I've made; I've got a lot to learn about the English language.
If you allow, I'll move the discussion to the forum, and I'll be thankful for your activity.

Sergey Alexandrovich Kryukov 5-Mar-12 21:12pm

[Double post, by mistake, to be removed --SA]

Sergey Alexandrovich Kryukov 5-Mar-12 21:15pm

As I promised to try, I published my advice on the topic. Please see.

I do understand that it can be challenging in terms of the performance of the implementation. I also realize that this is a kitchen-made approach where the big science is being applied, so it could be somewhat naive. However, I knew the cases where my home-made approach worked well. I would be interested to see it.

Good luck, please send some feedback.
--SA

2 solutions

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Andreas Gieriet · Accepted Answer · 2012-03-03T15:20:00

Solution 1

You may be looking for the Gestalt Approach, as described in the Dr. Dobb's article of July 1988: Pattern Matching: the Gestalt Approach.
Maybe this article helps too.

Cheers

Andi

Posted 3-Mar-12 15:20pm

Andreas Gieriet

Comments

Maciej Los 4-Mar-12 6:41am

Thank you very much! It was very helpful!

Sergey Alexandrovich Kryukov 4-Mar-12 12:53pm

Please see my last comments to your comment and the one by OP. Based on my present understanding of the problem, this is not a perfect solution, but certainly a decent attempt which deserves my 5, so I up-voted the answer.
--SA

Sergey Alexandrovich Kryukov 5-Mar-12 21:20pm

As I promised to try, I published my answer, please see. I would be interested to see your feedback.
For some disclaimer, please see my last comments to the question.
--SA

Sergey Alexandrovich Kryukov 5-Mar-12 21:23pm

Andi,

By the way, it was so nice of you to address to me directly via LinkedIn. I accepted your call for a contact.

If you are interested, you could also contact me very directly via my Web site you can find through my CodeProject profile. It has a "contact me" page. Many CodeProject people already found it out and sent me some questions or feedback.
--SA

Andreas Gieriet 5-Mar-12 22:22pm

Thanks for accepting. BTW: I'm currently very busy on the job - I will look at your Solution #2 as soon as I got some air to breath...
Cheers
Andi

RaisKazi 6-Mar-12 20:45pm

My 5.

Andreas Gieriet 7-Mar-12 3:34am

Thanks!

Sergey Alexandrovich Kryukov · Accepted Answer · 2012-03-05T15:08:00

Solution 2

Here is my idea:

The algorithm should accept three things as input: the two strings to be compared and the meta-data. The meta-data would contain the parameters for comparison: weight factors to be used, character sets, maybe even dictionaries; please see below on what is optional. Basically, you would need to play with the set of parameters and experiment with real data samples to determine the best set of parameters, which can also vary depending on culture, industry, etc.

The algorithm should return the similarity factor as a floating-point number: the higher the factor, the more similar the strings. For certainty, let's assume this is System.Double.

As the first step, the strings are checked for being 100% identical; in that case the algorithm returns double.PositiveInfinity. This value is very convenient because it avoids messing with the normalization of the distribution. It is also convenient because infinite values can be correctly compared with "regular" non-NaN values using the '>=', '<=', '>', '<' or '==' operators.
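The "infinity for identical strings" convention can be illustrated in Python, whose float is an IEEE 754 double like System.Double. This is only a sketch of the idea, not SA's implementation; similarity_factor and its score parameter are hypothetical names, and the actual scoring function is left to the caller.

```python
import math


def similarity_factor(a: str, b: str, score) -> float:
    """Return +inf for identical strings, else whatever score(a, b) yields."""
    if a == b:
        return math.inf
    return float(score(a, b))


# Infinity sorts above every finite factor, so identical pairs come first
factors = [similarity_factor("JOHN CO.", "JOHN CO.", lambda a, b: 0.0),
           similarity_factor("JOHN CO.", "JOHN COMP.", lambda a, b: 12.5)]
print(sorted(factors, reverse=True))
```

The finite factors then need no normalization to a fixed range, since exact matches are already singled out by the infinite value.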

Let's define the set of delimiter characters in the form of an array of characters. This is a delicate decision; I'll try to discuss it later. As a first approximation, let's include all punctuation (including Unicode punctuation like «, », ', —, –, “, ”, etc.), some symbols like © or ™, etc., and, importantly, a blank space in all its forms. Let's assume we have this set as char[] delimiters.

In the next step, let's split each input string into lexemes like this:

C#

string[] lexemes = input.Split(delimiters, System.StringSplitOptions.RemoveEmptyEntries);

Here, StringSplitOptions.RemoveEmptyEntries is very important.

This is a very important step. First, it expresses the following: we unify all delimiters (reducing them to the minimal information: one or more delimiters in a certain place), merge consecutive delimiters together and give them the lowest priority; in other words, we shall not consider differences between those delimiters. From this point on, the delimiters are out of consideration.
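For comparison, the same splitting step might look like this in Python. Instead of an explicit delimiter array, this sketch assumes that every character that is whitespace or falls into a Unicode punctuation (P*) or symbol (S*) category counts as a delimiter; that choice, and the function names, are assumptions made for the example.

```python
import unicodedata


def is_delimiter(ch: str) -> bool:
    """Treat whitespace, punctuation (P*) and symbols (S*) as delimiters."""
    return ch.isspace() or unicodedata.category(ch)[0] in ("P", "S")


def split_lexemes(text: str) -> list:
    """Split on runs of delimiters and drop empty entries."""
    # Map every delimiter to a plain space, then let split() merge the runs
    cleaned = "".join(" " if is_delimiter(ch) else ch for ch in text)
    return cleaned.split()


print(split_lexemes("ALIOR BANK S.A."))
```

Calling split() without arguments plays the role of StringSplitOptions.RemoveEmptyEntries: consecutive delimiters never produce empty lexemes.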

Now, the core of the algorithm: count the number of matches between the lexemes without taking the order into account, taking each match with some weight factor depending on the length of the lexeme. This is the first step where we use the meta-data. I don't discuss here what a "match" is. As a first approximation, a match can be a case-insensitive string comparison. It could also be two comparisons, case-sensitive and case-insensitive, with a 100% case-sensitive match given a higher weight factor.

Optional improvement of lexeme match algorithm #1: in the case-insensitive match, count the number of case-sensitive matches character-by-character and modify the weight factor of the match depending on the percentage of case-sensitive character matches.

Optional improvement of lexeme match algorithm #2: compare lexemes using Gestalt-approach string comparison suggested by Andi.

Basically, that's it. There can be different options on top of it. For example, the core algorithm could be improved by adding positional comparison of the lexemes: a match can be given an elevated weight factor if the match happens at the same or a close position.
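One possible reading of the core step (order-independent lexeme matches, weighted by length, with case-sensitive matches counting more) is sketched below in Python. The weight of a match is assumed to be simply the lexeme length, and the 1.5 bonus for exact-case matches is an arbitrary placeholder; both would come from the meta-data and need tuning on real data. Positional weighting is not included.

```python
def lexeme_similarity(lexemes_a, lexemes_b, exact_case_bonus=1.5):
    """Order-independent match score; longer lexemes weigh more."""
    left = list(lexemes_a)
    right = list(lexemes_b)
    score = 0.0
    # Pass 1: case-sensitive matches get the bonus factor
    for lex in list(left):
        if lex in right:
            right.remove(lex)
            left.remove(lex)
            score += len(lex) * exact_case_bonus
    # Pass 2: the remaining lexemes are compared case-insensitively
    folded_right = [r.casefold() for r in right]
    for lex in left:
        key = lex.casefold()
        if key in folded_right:
            folded_right.remove(key)  # each lexeme can be matched only once
            score += len(lex)
    return score


print(lexeme_similarity(["JOHN", "COMPANY"], ["Company", "JOHN"]))
```

The result is an unnormalized factor, in line with the description above: higher means more similar, and only the relative order of the values matters.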

Now, about the dictionaries. You could have a dictionary of low-priority words, such as articles and prepositions. If a lexeme match is found, the weight factor could be multiplied by a low (<1) "low-priority word" factor. This method is widely used these days for Web search. I must note that this method could work well in the case of 1-3 different languages, mostly from Germanic/Roman-based cultures, but might have a negative impact in the case of very different cultures or many different languages. On the other hand, a vast dictionary could work in the following way: if a match is found and the word is not found in the dictionary, it can be given an elevated weight factor, as a "unique word".
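The dictionary-based weighting could be sketched in Python like this. Everything concrete here is an assumption for illustration: the tiny English word list, the factors 0.2 and 2.0, and the function name match_weight. A real word list and real factors would have to come from experiments with actual data.

```python
# Illustrative low-priority words only (English articles and prepositions)
LOW_PRIORITY = {"the", "a", "an", "of", "in", "for", "at"}


def match_weight(lexeme, known_words=frozenset(), low_factor=0.2, unique_factor=2.0):
    """Weight of a matched lexeme: length, scaled by dictionary lookups."""
    base = float(len(lexeme))
    key = lexeme.casefold()
    if key in LOW_PRIORITY:
        return base * low_factor      # common word: the match means little
    if known_words and key not in known_words:
        return base * unique_factor   # not in the big dictionary: "unique word"
    return base


print(match_weight("The"), match_weight("Alior", known_words={"bank"}))
```

A function like this would replace the plain length-based weight whenever a lexeme match is found.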

Again, every approach we discussed on this page requires extensive research on the real data samples.

If you decide to lead this work to some reasonable working result, it would be very nice if you published the results of your research. I personally would be very interested to learn the results. In case you do, I would expect you to notify Andi and me.

Good luck,

—SA

Posted 5-Mar-12 15:08pm

Sergey Alexandrovich Kryukov

Updated 6-Mar-12 13:50pm

v4

Comments

Maciej Los 6-Mar-12 16:51pm

WOW! Thank you very much! Great idea! My 5!
I'm a little confused, because this is an area I've never explored. So I need a little bit of time to understand what you are trying to tell me. I promise to contact you and Andi if I ever find the solution. Sorry for my language ;)

Joezer BH 1-Aug-13 2:33am

5+!!

Wow, probably the longest thread on a Q&A I've seen so far in CP

Sergey Alexandrovich Kryukov 6-Mar-12 17:23pm

Great. And you are very welcome.

You should understand this is a kind of preliminary plan which needs some research to justify. Nevertheless, I kept coming back to ideas on this topic from time to time over a couple of days. Interesting topic, you know.

I am not even talking about your final solution; I would be interested to know whether your research shows some results on real-life representative data samples. Feel free to post a comment if you need some assistance in code development or have other concerns -- I'll try to help.

Good luck, would be glad to hear from you later.
--SA

Sergey Alexandrovich Kryukov 1-Aug-13 8:11am

Yee, I guess...
Thank you very much.
—SA

Sergey Alexandrovich Kryukov 6-Mar-12 22:20pm

By the way, I think you could consider accepting this answer formally (I mean, the green "Accept" button) -- you can always accept more than one.

Thank you for your contact through my Web site. I don't mind to keep the contact and discuss the matters you mentioned from time to time. Will be glad if it helps you. You contact me the same way; I'll reply to you directly as soon as we have something to discuss. Please write...
--SA

Maciej Los 7-Mar-12 10:27am

SA, Thank you for everything what you've done until now and for promise to help.
Answer formally accepted ;)

Sergey Alexandrovich Kryukov 7-Mar-12 19:33pm

Thank you. You are very welcome, as well as your future messages.
Good luck.
--SA

RaisKazi 6-Mar-12 20:46pm

Certainly 5.
Edit - In fact 5+. :)

Sergey Alexandrovich Kryukov 6-Mar-12 22:14pm

:-)
Thank you, Rais.

By the way, It was nice of you to call me to establish a LinkedIn contact -- I accepted it.
--SA

Andreas Gieriet 7-Mar-12 5:13am

Some quick note: This approach is close to the technique I use to analyze textual data e.g. bash shell (on Microsoft powershell it's possible too, but a bit more verbose):


sed s/.../.../ig a.txt | sort -u > x.txt
sed s/.../.../ig b.txt | sort -u > y.txt
diff -w x.txt y.txt > diff.txt

The big difference is the "diff" ("sed" is used as filter, like you do with the delimiter and dictionary, mainly remove irrelevant data) - diff is equal or not-equal, but nothing in between.

PS: I'll give more comments on your idea in the next comment.

Andreas Gieriet 7-Mar-12 6:15am

I think we are talking here about two different "worlds":
A) the question for a C# function that gives a normalized correlation for two strings
(as opposed to "Equals"):

     // returns 0.0 for zero match, ..., 1.0 for full match
     double Correlation(string a, string b);

     // returns false for not equal, true for equal
     bool Equals(string a, string b);

B) a general-purpose "relevance" function as needed and used in search engines,
based on indices/dictionaries, weights, cultures,
ordering of lexemes (or lack thereof), etc.

Your algorithm idea is more of the B) case than of the A) case, whereas I understand the OP was asking for an A) case answer.

Nonetheless, I think your idea is interesting, but as you say, it needs investigation. I'm not fluent in the current state of search engine algorithms (maybe I should go over and ask the guys at Google here in Zurich, who have their office a few blocks away from mine ;-).

I assume some of your idea is used in search engines today, like the "normalizing" stuff: ignoring some "delimiters", weighting (and even ignoring some of) the remaining "words" based on some heuristics, maybe ignoring the sequence and duplication of words, and finally calculating a relative order for the given "search text".

Coming back to the OP's question: I think that the Gestalt Approach is a useful tool for everyday problems in fuzzy string comparison (as an alternative to using "Equals").

I'm considering posting an article on a C# implementation of the Gestalt Approach - but this must wait a bit...

Andreas Gieriet 7-Mar-12 6:20am

For fun:

The Gestalt comparison of your first four paragraphs of this solution result in:


|rank|correlat.|-------------first------------|-----------second--------------| 
|  1 |  29.68% | The algorithm should retu... | Let's define the set of de... |
|  2 |  26.95% | The algorithm should acce... | For the first step, string... |
|  3 |  25.40% | For the first step, strin... | Let's define the set of de... |
|  4 |  24.51% | The algorithm should acce... | The algorithm should retur... |
|  5 |  19.43% | The algorithm should acce... | Let's define the set of de... |
|  6 |   9.30% | The algorithm should retu... | For the first step, string... |
|----|---------|------------------------------|-------------------------------|

more in the next comment - the comment size is limited, as I found out... :-(

Andreas Gieriet 7-Mar-12 6:39am

Comparing variants of your name, my name (with some typos) and the name of the city I live gives the following hit list (all permutations considered):


|rank|correlat.|-------------first------------|-----------second--------------| 
|  1 | 100.00% | Zürich                       | ZÜRICH                        |
|  2 |  96.55% | Andreas Gieriet              | Andreas Giriet                |
|  3 |  92.31% | Gieriet                      | Giriet                        |
|  4 |  92.31% | Zurich                       | Zuerich                       |
|  5 |  90.32% | Sergey Kryukov               | Sergey A. Kryukov             |
|  6 |  83.33% | Zürich                       | Zurich                        |
|  7 |  83.33% | Zurich                       | ZÜRICH                        |
|  8 |  76.92% | Zürich                       | Zuerich                       |
|  9 |  76.92% | Zuerich                      | ZÜRICH                        |
| 10 |  69.57% | SAKryukov                    | Sergey Kryukov                |
| 11 |  69.23% | SAKryukov                    | Sergey A. Kryukov             |
| 12 |  63.64% | Gieriet                      | Andreas Gieriet               |
| 13 |  60.00% | Giriet                       | Andreas Giriet                |
| 14 |  57.14% | Gieriet                      | Andreas Giriet                |
| 15 |  57.14% | Giriet                       | Andreas Gieriet               |
| 16 |  42.86% | Gieriet                      | Zuerich                       |
| 17 |  36.36% | SA                           | SAKryukov                     |
| 18 |  33.33% | Giriet                       | Zürich                        |
| 19 |  33.33% | Giriet                       | Zurich                        |
| 20 |  33.33% | Giriet                       | ZÜRICH                        |
| 21 |  30.77% | Gieriet                      | Zürich                        |
| 22 |  30.77% | Gieriet                      | Zurich                        |
| 23 |  30.77% | Gieriet                      | ZÜRICH                        |
| 24 |  30.77% | Giriet                       | Zuerich                       |
| 25 |  28.57% | Sergey Kryukov               | Gieriet                       |
| 26 |  28.57% | Sergey Kryukov               | Andreas Giriet                |
| 27 |  28.57% | Andreas Giriet               | Zuerich                       |
| 28 |  27.59% | Sergey Kryukov               | Andreas Gieriet               |
| 29 |  27.27% | Andreas Gieriet              | Zuerich                       |
| 30 |  25.00% | Sergey A. Kryukov            | Gieriet                       |
| 31 |  25.00% | Sergey A. Kryukov            | Andreas Gieriet               |
| 32 |  21.05% | SA                           | Sergey A. Kryukov             |
| 33 |  20.00% | Sergey Kryukov               | Giriet                        |
| 34 |  20.00% | Andreas Giriet               | Zürich                        |
| 35 |  20.00% | Andreas Giriet               | Zurich                        |
...

Andreas Gieriet 7-Mar-12 6:40am

... final part...

My fuzzy evaluation of the fuzzy values ;-) are:
- Correlation 0%-50% can be considered as mismatch.
- Correlation 50%-60% has some similarity.
- Correlation 60%-75% has some relevant common parts
- Correlation 75%-100% are quite closely related or equal.

And you see that abbreviations and acronyms do not easily match with the full text. Reason for that is, that the correlation normalization is:
matching_characters * 2 / (length(a) + length(b))
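The normalization formula and the four fuzzy bands above can be put together in a short Python sketch. difflib.SequenceMatcher is used as a stand-in for the Gestalt implementation (its ratio() is also 2 * matches / total length); comparing case-insensitively and treating each band's lower bound as inclusive are assumptions of this example, as are the function names.

```python
from difflib import SequenceMatcher


def correlation(a: str, b: str) -> float:
    """2 * matching_characters / (len(a) + len(b)), ignoring case."""
    return SequenceMatcher(None, a.casefold(), b.casefold(),
                           autojunk=False).ratio()


def band(score: float) -> str:
    """Map a correlation to the rough categories from the comment above."""
    if score < 0.50:
        return "mismatch"
    if score < 0.60:
        return "some similarity"
    if score < 0.75:
        return "some relevant common parts"
    return "closely related or equal"


for a, b in [("Zürich", "ZÜRICH"), ("Gieriet", "Giriet"), ("SA", "SAKryukov")]:
    score = correlation(a, b)
    print(f"{score:7.2%}  {band(score):<28} {a} / {b}")
```

The last pair shows the acronym effect: "SA" matches only 2 characters, and 2 * 2 / (2 + 9) is about 36%, so the short form is rated a mismatch even though it is fully contained in the longer name.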

Maciej Los 7-Mar-12 10:31am

Andreas, your insights are very valuable. I have similar thoughts. If you allow, can i count on your help and valuable comments?

Andreas Gieriet 7-Mar-12 17:12pm

Ask and I'll try to response - response time may vary, though.
Cheers
Andi

Sergey Alexandrovich Kryukov 7-Mar-12 18:23pm

Thank you for the interesting comments. Very reasonable considerations. As to the A) and B) views on the problem, I would agree that your fuzzy classification of those tasks makes sense. Even though I largely agree that my approach is closer to B), when I offered my idea, my feeling was that B) is more adequate, even if it is applied to the problem of duplicates alone. It looks like we agree on the need for further research based on real-life data samples.

Thank you for interesting discussion, your information and insight.
--SA

Andreas Gieriet 11-Mar-12 8:16am

I just came across another usage of the Gestalt approach: comparing test results.
I have a list of test run results over multiple version of a piece of software. Each test run consists of several thousands of test method calls, each with several Assert statements (error message and stack dump).
Comparing visually these test run results is a tedious and error prone task.
This can be eased by the following (pseudocode):


// get results
var result1 = GetResult(version1);
var result2 = GetResult(version2);
// get all test methods with errors that are new, fixed, or still broken
var newErrors    = Complement(result2.Errors, result1.Errors);
var fixedErrors  = Complement(result1.Errors, result2.Errors);
var commonErrorPairs = Intersect(result1.Errors, result2.Errors);
// set correlation: message and stack
foreach (var pair in commonErrorPairs)
{
    pair.msgCor   = Gestalt.Correlation(pair.a.msg, pair.b.msg);
    pair.stackCor = Gestalt.Correlation(pair.a.stack, pair.b.stack);
}
// report
var TH = 0.9; // threshold for "same" or different error message or stack dump
WriteResults("Fixed errors",   from e in fixedErrors select e.name);
WriteResults("New errors",     from e in newErrors select e.name);
WriteResults("Same errors",    from p in commonErrorPairs
                               where p.msgCor == 1.0 && p.stackCor == 1.0
                               select p.a.name);
WriteResults("Similar errors", from p in commonErrorPairs
                               where p.msgCor > TH && p.stackCor > TH
                               select p.a.name);
WriteResults("Changed errors", from p in commonErrorPairs
                               where p.msgCor <= TH || p.stackCor <= TH
                               select p.a.name);

Sergey Alexandrovich Kryukov 12-Mar-12 12:02pm

Yes, interesting. Thank you, Andi.
--SA