 |
|
 |
It's always good to see other string/patter matching algorithms.
I use Levenshtein, Ratcliff/Obsershelp, Soundex (and variants), Double Metaphone, NYSIIS encoding and various other of my own routines with cunning names such as Approx.
|
|
|
|
 |
|
 |
This function is really great! This is what I need for my project.
Momentary I'm doing my internship and i need to make a webapplication that compares two strings with eachother.
Is there any way that i can find a vb.net version? Or a easy way to transform it?
To be honest I have never wrote something in C...
Anyway keep up the good work!
|
|
|
|
 |
|
 |
iviruz.be wrote: Or a easy way to transform it?
I think it should be very easy to transform it. The function is small, just two nested loops in few lines and a swap of 2 arrays. Take a look at the code.
I don't know the details of how you'd do it in vb.net. To be honest I have never wrote something in vb.net... But I think the language is powerful enough to support what my algorithm does.
Rui A. Rebelo
I don't smoke, don't gamble, don't sniff, don't drink and don't womanize.
My only defect is that I lie just a little bit, sometimes.
Tim Maia (brazilian pop singer)
|
|
|
|
 |
|
|
 |
|
 |
> If you studied algorithms in any university and they didn’t teach you it, ask your money back.
I have never heard of DP before. Mind you, I learned very little in CS at uni, apart from Chomsky Normal Form and Big O notation
I would appreciate a little more explanation of DP than the single line you've given. I'm intrigued.
I think though that two strings should not be considered to be 100% similar unless they are identical. So that
"Codeproject" should be more similar to "Codeproject" than to "Code the project". (This would have the practical application that you could sort by similarity and present the most likely option).
And as others have said, it should be commutative:
simil(A, B) == simil(B, A).
|
|
|
|
 |
|
 |
Thanks for the feedback, Don.
The suggestions you made are already in my list.
I will implement them soon.
Cheers,
Rui A. Rebelo
Computers are useless, they can only provide answers.
Pablo Picasso
|
|
|
|
 |
|
|
 |
|
 |
Hi Mohamed,
I've always associated hashing with hash tables,
databases indexes, etc.
You gave an interesting new use for it.
I will try to see if I can somehow use your speed
in my approach.
Thanks,
Rui A. Rebelo
Computers are useless, they can only provide answers.
Pablo Picasso
|
|
|
|
 |
|
 |
Probably explaining how the code works in detail. Maybe wrap code in C++?
|
|
|
|
 |
|
 |
> Probably explaining how the code works in detail.
Gulp! I was afraid of hearing that. Dynamic Programming is not easy to explain, especially in a short web article. But I'll try.
> Maybe wrap code in C++?
No problem. Any design sugestion?
Thanks,
Rui A. Rebelo
Computers are useless, they can only provide answers.
Pablo Picasso
|
|
|
|
 |
|
 |
Thanks for keeping it in plain C. This was exactly what I needed for my current project
I planned to write something on my own but I don't have so much experience in algorithms and
so my solution would have been far less effective than yours. But you made me curious about
this dynamic programming thing. I guess I'll read something about it (maybe your article if
it comes out) or buy the book you suggested.
Many thanks,
B!GF00T
PS: A C++ wrapper would of course still be fine
I was only very happy since my current project can
only use C, since it will be part of something which
will run on a car computer and the rest is all C..
|
|
|
|
 |
|
 |
Another way: calculate the Levenshtein distance. Although this algorithm doesn't focus on the longest equal part(s) of two strings, it can cope with typos instead. The algorithm is very short and there are a lot of efficient solutions around on the web. The result of the Levenshtein distance is an absolute value for the similarity, but it can also be calculated relative to the length of the strings compared.
|
|
|
|
 |
|
 |
The "simil" function proposed in article looks very similar to Levenshtein distance (it may even be perfectly equivalent, I didn't check), except that it returns the distance divided by the first string length.
I feel this is a bad idea as simil("hello world","h") and simil("h","hello world") will return different values.
DynaBits
Mechanical Design Automation Software
|
|
|
|
 |
|
 |
khb wrote:
The algorithm is very short and there are a lot of efficient solutions around on the web.
Is there a C++ implementation you can recomend? If you are using LD can you give us some idea of the type of application. Thanks.
Neville Franks, Author of ED for Windows www.getsoft.com and coming soon: Surfulater www.surfulater.com
|
|
|
|
 |
|
 |
Neville Franks wrote:
Is there a C++ implementation you can recomend?
I know of this one[^]. Personally, I would "templatize" it to work with basic_string rather that string and use boost::multi_array for the matrix to reduce memory overhead, but it seems to be usable as-is.
My programming blahblahblah blog. If you ever find anything useful here, please let me know to remove it.
|
|
|
|
 |
|
 |
That's funny. This is exactly the implementation I used some time ago to find files which have similar file names...
|
|
|
|
 |
|
 |
Thanks Nemanja. I'd already been to merriampark.com but missed this C++ implementation.
Neville Franks, Author of ED for Windows www.getsoft.com and coming soon: Surfulater www.surfulater.com
|
|
|
|
 |
|
 |
Levenshtein! Gee, I'm surprised on how sofisticated the comments are. I didn't expect the discussion would become so nice.
Actually I developed this algorithm when I was studying Levenshtein's. As I wrote in the article I was studying edit distance (another common name for Levenshtein distance) and this is a simpler (dumber?) version of Levenshtein's edit distance algorithm.
Levenshtein "seems" to not focus in the longest common subsequence but it does, indirectly. If you take the Levenshtein distance divide it by 2 and subtract if from the length of the shortest string you will get exactly the length of the LCS (that is, if the string has no inversions like "Codeproject" x "projectCode"). Note that when there is a match of the characters of a column and row in a Levenshtein's matrix the value of the cell is decreased. So the distance decreases when there is a match. I do the opposite, increase it because I am measuring the similarity (or proximity, length of LCS, whatever...).
I believe that full blown Lovenshtein is an overkill for simple string comparisons in strings of less than 100 bytes. The inversion problem is important in some areas (DNA alignment) but not so common in handling simple strings. I took this approach thinking about servers (database and web) and worried (too much?) about scalability.
Thanks,
Rui A. Rebelo
Computers are useless, they can only provide answers.
Pablo Picasso
|
|
|
|
 |
|
 |
See Ratcliff/Obershelp pattern recognition at http://www.nist.gov/dads/HTML/ratcliffObershelp.html.
We have used a variant of the R/O algorithm for "gestalt" matching of strings in our programs with good results for many years. Your approach sounds similar but does it pick up multiple partial matches and character transpositions as well as R/O?
gjr
|
|
|
|
 |
|
|
 |
|
 |
The algorithm originally gained some popularity in educational apps so that children could (mis)type some text into a quiz without being totally zapped just for their spelling.
Q: "What is the capital of Australia?"
A: "Canburra" should be OK even if the correct answer is "PollyLand" (sorry, "Canberra").
We use it in our Construction industry apps so that users do not have to correctly enter alphanumeric codes or descriptions to find records, help etc.
It produces a much more friendly user environment and provides the capability to sort picklists by "similarity" even when what is entered is not the leading portion of the text. (Anyone really interested in the details can check our website at http://www.procon.com.au/)
My variant of R/O is written in X86 assembler as this proved to be three times faster than a 'C' version and calculating a simil value for many records can be slow. (Subsequently rewritten in C++ for our pending Windows apps but I have not done any timimg tests on the assumption that people running Windows are used to slow list sorting!) We also needed to modify the algorithm for fixed length records so that some weighting was given to trailing spaces, position, text case etc.
I recommend the algorithm as users are very impressed when they can (mis)type "eletrc" into a field and get a picklist from hundreds of items "sorted by similarity" so that it brings items containing the words electrical, etc to the top of the list.
gjr
|
|
|
|
 |
|
 |
Hi gpj,
Thanks for the tip on R/O; I must confess I didn't know this algorithm.I looked for documentation about it and found an article by the creator John Ratclift in the July/1988 edition of Dr. Dobbs Journal. That's why it took me so much to answer you.
You mentioned that my approach is similar to R/O; it has some in common, indeed in the fact that both base the analysis in the length of the longest common subsequence. The greatest difference is in how both programs calculate this length and what they do with it. I use 2 cascated loops. RO uses recursion. Their algorithm finds first the greatest contiguous subsequence and then finds other sequences in the non-matching parts.
Ratclift claims that his algorithm has a quadratic ( O(n^2)) performance boundary. I found it surprising but the data he presented really corroborates it.
I have no doubt that my program will also perform in a quadratic bound, since it is simply one loop inside the other, I need to make some tests to see how both approaches rate. I will try to find an implementation of RO in C or do it myself.
Ratclift implements his algorith in assembly in DDJ. My assembly is a litle rusty, but from what I understood from the code and from his explanations both algorithms would handle the conditions you suggested.
But there is one situation that both my approach and RO (it seems) don't handle graciously: the inversions. Take the following example: "Codeproject" x "projectCode". Both algorithms would allign the LCS "project" and give the rattings based solelly in this sequence, ignoring that "Code" alligns crosswise. In the traditional edit distance problem I should align "Code", assign a weight for the inversion and deduce this weight from the indice of the whole alignment. For some very specific problems this is very important (in Bioinformatics is fundamental).
The other difference is that my function divides the lenght of the longest common subsequence by the length in the first string (as I wrote in the article). RO divides the double of the LCS by the sum of both strings.
Some people complained here about my approach. It seemed natural to me but I think they might have a point. Actually it is not hard to convert my results to RO's results. You can simply call the function like this:
sim= simil( str1, str2)*2*strlen(str1)/(strlen(str1)+strlen(str2));
Or change the last line in the simil() function from:
return lenLCS/=len1;
to:
return lenLCS*=2.0/(len1+len2);
Both RO and my program should give the same values.
I might address these issues in a newer version if this article raises some interest.
Thanks
Rui A. Rebelo
Computers are useless, they can only provide answers.
Pablo Picasso
|
|
|
|
 |
|
|
 |
|
 |
Thank you Neville.;)
I am checking it now.
Rui A. Rebelo
Computers are useless, they can only provide answers.
Pablo Picasso
|
|
|
|
 |