Click here to Skip to main content
15,887,027 members
Articles / Mobile Apps
Article

More than strcmp(): similarity in strings

Rate me:
Please Sign up or sign in to vote.
4.88/5 (19 votes)
12 Sep 20045 min read 117.1K   2.3K   38   27
A function which returns the similarity between two strings (how much they're equal).

Sample Image - SimilarStrings.gif

Introduction

You already saw this in spellcheckers: a list with correct words which are similar to the misspelled word you've typed. Wouldn’t it be nice to incorporate this notion of similar words as a functionality in your applications? There are lots of possible uses for it:

  • The most obvious case is searching for a name in a database. People with an unusual name always need to patiently spell it and hope it is written correctly in the database. Now, imagine you could scan a database and search for names that are similar (look alike) to any given name and rank them by similarity, so the user could choose them. Cool, eh?
  • Show a list of web addresses similar to the one the user typed wrongly and was not found. Or use it in a search for "approximate" keywords in your site.
  • You may also use this idea to check similarities between written text (academic papers, source code of programs, etc.) to detect plagiarism, for instance.
  • Hey, what about comparing genetic sequences, and to discover evolutionary and functional similarities among genes or proteins? Actually, this was done with mitochondrial DNA to trace all the human ethnical evolution and create the concept of Mitochondrial Eve… OK, OK, insanely off-topic...

Here, I offer a possible solution for implementing such comparisons: a function that returns how much two strings are “alike” or similar. But first, let’s define the problem more precisely.

Background

Similarity of strings can be defined in more than one way. Here, I will use a very simple one: the length of the longest common sequence divided by the length of the first string. Suppose you have two strings (e.g.: “consecutivelly” and “successfully”). Now see that you have several sub sequences that appear in both strings, but not necessarily consecutively (e.g.: “c-s-u-l-l-y”, “s-c-u-l-l-y”, “s-e-u-l-l-y”, "s-u-e-l-l-y", etc.). The measure of similarity I propose is the division of the length of the longest of these subsequences by the length of the first parameter string (e.g., strlen(“suelly”)/strlen(“consecutively”), which is approximately 42.85%). So far this is simple, right?

Now, the hard part seems to be how to discover efficiently that longest subsequence. Well, actually this is not so hard: we can use a classical algorithmic technique called dynamic programming. If you studied algorithms in any university and they didn’t teach you it, ask your money back. Very grossly, you can think of dynamic programming (or DP) as a kind of reversed recursion used in optimization problems. Some optimization problems have a recursive solution. However, this recursive solution may involve the computation of some sub solutions which can appear repeated times. To such problems, we can use DP with great gains of performance. It would be too lengthy to explain here this technique in detail, so I strongly suggest you to take a look at a good book on algorithms (I suggest one at the bottom in the Points of Interest section). By the way, “programming” in “dynamic programming” doesn’t refer to computer programming but to programming in a general mathematical sense; i.e.: a sequence of steps which lead to a solution.

Using the code

The code is very simple to use. The function which returns the similarity has the following signature (in string.h):

float simil( char *str1, char *str2);

The return value is a number between 0 and 1. This value, as I mentioned before, is the length of the longest common sequence divided by the length of str1. 0 means the two strings have nothing in common. 1 means the first string is totally contained in the second. It is up to you to decide which string to use as first or second. But please take notice that using the longest string as the first parameter yields lower similarity values and using the shortest yields higher (look at the picture above). Obviously, the decision on how you make the call and which value will be the minimum acceptable will depend on the problem you are trying to solve. If you are scanning names in a database, you might want to let the user adjust this value. If you are building a script for a 404 response page in a website then you will adjust this value comparing the most common errors to a list of addresses in your site.

I also provide another function you might be interested in:

char *LCS( char *str1, char *str2);

This function returns a pointer to a static array of char (with maximum size MAX_LCS, defined in string.h) containing one of the longest sequences. All you have to do is to compile and link simil.c to your code and leave simil.h in a folder in your include path.

Points of Interest

I made this code when studying a more generic problem called edit distance, which is a measure of how much a string (or DNA sequence, throughout evolution and by genetic mutations) must be changed to generate another one. This is a typical problem for dynamic programming. You can read a deep and complete explanation of what is dynamic programming in this book: Introduction to Algorithms, 2nd edition; by Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest and Clifford Stein. I have a love/hate relationship with this book. It is one of the two most complete books on algorithms I've ever seen (the other is Knuth's). But it is also one of the most pedantic books I've ever read. It is the typical academic book: deep knowledge, lots of proofs, and convoluted formalisms. Robert Sedgewick's "Algorithms in C" also has an introduction to DP, but not so complete. If you have another suggestion of book or site on dynamic programming, please post it in the comments. :-)

History

  • 12/September/2004: Version 1.0.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

A list of licenses authors might use can be found here


Written By
Engineer
Canada Canada
I live in Edmonton, Alberta, Canada with my wife and son. That's after Portugal (was born there, in Lisbon), Angola but mostly Brazil.
I work as a Software Engineer at here.

Comments and Discussions

 
QuestionUse like DLL Pin
Célio13-Aug-14 4:47
Célio13-Aug-14 4:47 
QuestionUsing this in a DLL Pin
Member 1082542517-May-14 11:51
Member 1082542517-May-14 11:51 
AnswerRe: Using this in a DLL Pin
Célio13-Aug-14 4:42
Célio13-Aug-14 4:42 
GeneralThanks Pin
RugbyLeague28-Apr-09 22:45
RugbyLeague28-Apr-09 22:45 
GeneralGreat! Pin
iviruz.be1-Apr-08 11:53
iviruz.be1-Apr-08 11:53 
GeneralRe: Great! Pin
Rui A. Rebelo2-Apr-08 2:15
Rui A. Rebelo2-Apr-08 2:15 
GeneralSoundex Pin
Robert F20-Sep-04 0:23
sussRobert F20-Sep-04 0:23 
GeneralI want my money back... Pin
Don Clugston14-Sep-04 21:24
Don Clugston14-Sep-04 21:24 
GeneralRe: I want my money back... Pin
Rui A. Rebelo15-Sep-04 7:57
Rui A. Rebelo15-Sep-04 7:57 
Generalneumerical analysis Pin
Mo Hossny13-Sep-04 22:51
Mo Hossny13-Sep-04 22:51 
GeneralRe: neumerical analysis Pin
Rui A. Rebelo15-Sep-04 8:17
Rui A. Rebelo15-Sep-04 8:17 
GeneralWorth Pin
NormDroid12-Sep-04 21:23
professionalNormDroid12-Sep-04 21:23 
GeneralRe: Worth Pin
Rui A. Rebelo13-Sep-04 17:40
Rui A. Rebelo13-Sep-04 17:40 
GeneralRe: Worth Pin
B!GFooT30-Mar-06 20:00
B!GFooT30-Mar-06 20:00 
GeneralLevenshtein distance Pin
khb12-Sep-04 20:49
khb12-Sep-04 20:49 
GeneralRe: Levenshtein distance Pin
Dr. Goulu12-Sep-04 21:04
professionalDr. Goulu12-Sep-04 21:04 
GeneralRe: Levenshtein distance Pin
Neville Franks13-Sep-04 2:09
Neville Franks13-Sep-04 2:09 
GeneralRe: Levenshtein distance Pin
Nemanja Trifunovic13-Sep-04 2:54
Nemanja Trifunovic13-Sep-04 2:54 
GeneralRe: Levenshtein distance Pin
khb13-Sep-04 3:35
khb13-Sep-04 3:35 
GeneralRe: Levenshtein distance Pin
Neville Franks13-Sep-04 23:51
Neville Franks13-Sep-04 23:51 
GeneralRe: Levenshtein distance Pin
Rui A. Rebelo13-Sep-04 17:30
Rui A. Rebelo13-Sep-04 17:30 
GeneralRatcliff/Obershelp pattern recognition Pin
gjr12-Sep-04 16:54
gjr12-Sep-04 16:54 
GeneralRe: Ratcliff/Obershelp pattern recognition Pin
Neville Franks13-Sep-04 2:07
Neville Franks13-Sep-04 2:07 
GeneralRe: Ratcliff/Obershelp pattern recognition Pin
gjr14-Sep-04 16:52
gjr14-Sep-04 16:52 
GeneralRe: Ratcliff/Obershelp pattern recognition Pin
Rui A. Rebelo13-Sep-04 17:11
Rui A. Rebelo13-Sep-04 17:11 
Hi gpj,
Thanks for the tip on R/O; I must confess I didn't know this algorithm.I looked for documentation about it and found an article by the creator John Ratclift in the July/1988 edition of Dr. Dobbs Journal. That's why it took me so much to answer you.

You mentioned that my approach is similar to R/O; it has some in common, indeed in the fact that both base the analysis in the length of the longest common subsequence. The greatest difference is in how both programs calculate this length and what they do with it. I use 2 cascated loops. RO uses recursion. Their algorithm finds first the greatest contiguous subsequence and then finds other sequences in the non-matching parts.

Ratclift claims that his algorithm has a quadratic ( O(n^2)) performance boundary. I found it surprising but the data he presented really corroborates it.

I have no doubt that my program will also perform in a quadratic bound, since it is simply one loop inside the other, I need to make some tests to see how both approaches rate. I will try to find an implementation of RO in C or do it myself.

Ratclift implements his algorith in assembly in DDJ. My assembly is a litle rusty, but from what I understood from the code and from his explanations both algorithms would handle the conditions you suggested.

But there is one situation that both my approach and RO (it seems) don't handle graciously: the inversions. Take the following example: "Codeproject" x "projectCode". Both algorithms would allign the LCS "project" and give the rattings based solelly in this sequence, ignoring that "Code" alligns crosswise. In the traditional edit distance problem I should align "Code", assign a weight for the inversion and deduce this weight from the indice of the whole alignment. For some very specific problems this is very important (in Bioinformatics is fundamental).

The other difference is that my function divides the lenght of the longest common subsequence by the length in the first string (as I wrote in the article). RO divides the double of the LCS by the sum of both strings.
Some people complained here about my approach. It seemed natural to me but I think they might have a point. Actually it is not hard to convert my results to RO's results. You can simply call the function like this:
sim= simil( str1, str2)*2*strlen(str1)/(strlen(str1)+strlen(str2));


Or change the last line in the simil() function from:
return lenLCS/=len1;

to:
return lenLCS*=2.0/(len1+len2);

Both RO and my program should give the same values.

I might address these issues in a newer version if this article raises some interest.
Thanks

Rui A. Rebelo

Computers are useless, they can only provide answers.
Pablo Picasso

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.