Click here to Skip to main content
15,861,125 members
Articles / Mobile Apps
Article

More than strcmp(): similarity in strings

Rate me:
Please Sign up or sign in to vote.
4.88/5 (19 votes)
12 Sep 20045 min read 116.5K   2.3K   38   27
A function which returns the similarity between two strings (how much they're equal).

Sample Image - SimilarStrings.gif

Introduction

You already saw this in spellcheckers: a list with correct words which are similar to the misspelled word you've typed. Wouldn’t it be nice to incorporate this notion of similar words as a functionality in your applications? There are lots of possible uses for it:

  • The most obvious case is searching for a name in a database. People with an unusual name always need to patiently spell it and hope it is written correctly in the database. Now, imagine you could scan a database and search for names that are similar (look alike) to any given name and rank them by similarity, so the user could choose them. Cool, eh?
  • Show a list of web addresses similar to the one the user typed wrongly and was not found. Or use it in a search for "approximate" keywords in your site.
  • You may also use this idea to check similarities between written text (academic papers, source code of programs, etc.) to detect plagiarism, for instance.
  • Hey, what about comparing genetic sequences, and to discover evolutionary and functional similarities among genes or proteins? Actually, this was done with mitochondrial DNA to trace all the human ethnical evolution and create the concept of Mitochondrial Eve… OK, OK, insanely off-topic...

Here, I offer a possible solution for implementing such comparisons: a function that returns how much two strings are “alike” or similar. But first, let’s define the problem more precisely.

Background

Similarity of strings can be defined in more than one way. Here, I will use a very simple one: the length of the longest common sequence divided by the length of the first string. Suppose you have two strings (e.g.: “consecutivelly” and “successfully”). Now see that you have several sub sequences that appear in both strings, but not necessarily consecutively (e.g.: “c-s-u-l-l-y”, “s-c-u-l-l-y”, “s-e-u-l-l-y”, "s-u-e-l-l-y", etc.). The measure of similarity I propose is the division of the length of the longest of these subsequences by the length of the first parameter string (e.g., strlen(“suelly”)/strlen(“consecutively”), which is approximately 42.85%). So far this is simple, right?

Now, the hard part seems to be how to discover efficiently that longest subsequence. Well, actually this is not so hard: we can use a classical algorithmic technique called dynamic programming. If you studied algorithms in any university and they didn’t teach you it, ask your money back. Very grossly, you can think of dynamic programming (or DP) as a kind of reversed recursion used in optimization problems. Some optimization problems have a recursive solution. However, this recursive solution may involve the computation of some sub solutions which can appear repeated times. To such problems, we can use DP with great gains of performance. It would be too lengthy to explain here this technique in detail, so I strongly suggest you to take a look at a good book on algorithms (I suggest one at the bottom in the Points of Interest section). By the way, “programming” in “dynamic programming” doesn’t refer to computer programming but to programming in a general mathematical sense; i.e.: a sequence of steps which lead to a solution.

Using the code

The code is very simple to use. The function which returns the similarity has the following signature (in string.h):

float simil( char *str1, char *str2);

The return value is a number between 0 and 1. This value, as I mentioned before, is the length of the longest common sequence divided by the length of str1. 0 means the two strings have nothing in common. 1 means the first string is totally contained in the second. It is up to you to decide which string to use as first or second. But please take notice that using the longest string as the first parameter yields lower similarity values and using the shortest yields higher (look at the picture above). Obviously, the decision on how you make the call and which value will be the minimum acceptable will depend on the problem you are trying to solve. If you are scanning names in a database, you might want to let the user adjust this value. If you are building a script for a 404 response page in a website then you will adjust this value comparing the most common errors to a list of addresses in your site.

I also provide another function you might be interested in:

char *LCS( char *str1, char *str2);

This function returns a pointer to a static array of char (with maximum size MAX_LCS, defined in string.h) containing one of the longest sequences. All you have to do is to compile and link simil.c to your code and leave simil.h in a folder in your include path.

Points of Interest

I made this code when studying a more generic problem called edit distance, which is a measure of how much a string (or DNA sequence, throughout evolution and by genetic mutations) must be changed to generate another one. This is a typical problem for dynamic programming. You can read a deep and complete explanation of what is dynamic programming in this book: Introduction to Algorithms, 2nd edition; by Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest and Clifford Stein. I have a love/hate relationship with this book. It is one of the two most complete books on algorithms I've ever seen (the other is Knuth's). But it is also one of the most pedantic books I've ever read. It is the typical academic book: deep knowledge, lots of proofs, and convoluted formalisms. Robert Sedgewick's "Algorithms in C" also has an introduction to DP, but not so complete. If you have another suggestion of book or site on dynamic programming, please post it in the comments. :-)

History

  • 12/September/2004: Version 1.0.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

A list of licenses authors might use can be found here


Written By
Engineer
Canada Canada
I live in Edmonton, Alberta, Canada with my wife and son. That's after Portugal (was born there, in Lisbon), Angola but mostly Brazil.
I work as a Software Engineer at here.

Comments and Discussions

 
QuestionUse like DLL Pin
Célio13-Aug-14 4:47
Célio13-Aug-14 4:47 
QuestionUsing this in a DLL Pin
Member 1082542517-May-14 11:51
Member 1082542517-May-14 11:51 
AnswerRe: Using this in a DLL Pin
Célio13-Aug-14 4:42
Célio13-Aug-14 4:42 
GeneralThanks Pin
RugbyLeague28-Apr-09 22:45
RugbyLeague28-Apr-09 22:45 
GeneralGreat! Pin
iviruz.be1-Apr-08 11:53
iviruz.be1-Apr-08 11:53 
GeneralRe: Great! Pin
Rui A. Rebelo2-Apr-08 2:15
Rui A. Rebelo2-Apr-08 2:15 
iviruz.be wrote:
Or a easy way to transform it?


I think it should be very easy to transform it. The function is small, just two nested loops in few lines and a swap of 2 arrays. Take a look at the code.

I don't know the details of how you'd do it in vb.net. To be honest I have never wrote something in vb.net... Wink | ;) But I think the language is powerful enough to support what my algorithm does.

Rui A. Rebelo
I don't smoke, don't gamble, don't sniff, don't drink and don't womanize.
My only defect is that I lie just a little bit, sometimes.

Tim Maia (brazilian pop singer)

GeneralSoundex Pin
Robert F20-Sep-04 0:23
sussRobert F20-Sep-04 0:23 
GeneralI want my money back... Pin
Don Clugston14-Sep-04 21:24
Don Clugston14-Sep-04 21:24 
GeneralRe: I want my money back... Pin
Rui A. Rebelo15-Sep-04 7:57
Rui A. Rebelo15-Sep-04 7:57 
Generalneumerical analysis Pin
Mo Hossny13-Sep-04 22:51
Mo Hossny13-Sep-04 22:51 
GeneralRe: neumerical analysis Pin
Rui A. Rebelo15-Sep-04 8:17
Rui A. Rebelo15-Sep-04 8:17 
GeneralWorth Pin
NormDroid12-Sep-04 21:23
professionalNormDroid12-Sep-04 21:23 
GeneralRe: Worth Pin
Rui A. Rebelo13-Sep-04 17:40
Rui A. Rebelo13-Sep-04 17:40 
GeneralRe: Worth Pin
B!GFooT30-Mar-06 20:00
B!GFooT30-Mar-06 20:00 
GeneralLevenshtein distance Pin
khb12-Sep-04 20:49
khb12-Sep-04 20:49 
GeneralRe: Levenshtein distance Pin
Dr. Goulu12-Sep-04 21:04
professionalDr. Goulu12-Sep-04 21:04 
GeneralRe: Levenshtein distance Pin
Neville Franks13-Sep-04 2:09
Neville Franks13-Sep-04 2:09 
GeneralRe: Levenshtein distance Pin
Nemanja Trifunovic13-Sep-04 2:54
Nemanja Trifunovic13-Sep-04 2:54 
GeneralRe: Levenshtein distance Pin
khb13-Sep-04 3:35
khb13-Sep-04 3:35 
GeneralRe: Levenshtein distance Pin
Neville Franks13-Sep-04 23:51
Neville Franks13-Sep-04 23:51 
GeneralRe: Levenshtein distance Pin
Rui A. Rebelo13-Sep-04 17:30
Rui A. Rebelo13-Sep-04 17:30 
GeneralRatcliff/Obershelp pattern recognition Pin
gjr12-Sep-04 16:54
gjr12-Sep-04 16:54 
GeneralRe: Ratcliff/Obershelp pattern recognition Pin
Neville Franks13-Sep-04 2:07
Neville Franks13-Sep-04 2:07 
GeneralRe: Ratcliff/Obershelp pattern recognition Pin
gjr14-Sep-04 16:52
gjr14-Sep-04 16:52 
GeneralRe: Ratcliff/Obershelp pattern recognition Pin
Rui A. Rebelo13-Sep-04 17:11
Rui A. Rebelo13-Sep-04 17:11 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.