11,433,486 members (56,775 online)

# An improvement on capturing similarity between strings

, 5 Aug 2005
 Rate this:
An improvement on capturing similarity between strings. The algorithm was developed in C#.

## Introduction

This article describes a way of capturing the similarity between two strings (or words). String similarity is a confidence score that reflects the relation between the meanings of two strings, which usually consists of multiple words or acronyms. Currently, in this approach I am more concerned on the measurement which reflects the relation between the patterns of the two strings, rather than the meaning of the words.

I implemented this algorithm when I was developing a tool to make the matching between XML schemas semi-automatic.

## Preparing the ground

In this paper, I have implemented two algorithms:

• Levenshtein algorithm[1]
• The Kuhn-Munkres algorithm (also known as the Hungarian method)[2]

Without going "deep into theory", if you want to understand these algorithms, please read about them in the algorithm books. Other information about them can be reached at:

## Problem

The string similarity algorithm was developed to satisfy the following requirements:

• A true reflection of lexical similarity - strings with small differences should be recognized as being similar. In particular, a significant sub-string overlap should point to a high level of similarity between the strings.
• A robustness to changes of word order- two strings which contain the same words, but in a different order, should be recognized as being similar. On the other hand, if one string is just a random anagram of the characters contained in the other, then it should (usually) be recognized as dissimilar.
• Language independence - the algorithm should work not only in English, but also in many different languages.

## Solution

The similarity is calculated in three steps:

• Partition each string into a list of tokens.
• Computing the similarity between tokens by using a string edit-distance algorithm (extension feature: semantic similarity measurement using the WordNet library).
• Computing the similarity between two token lists.

### Tokenization

A string is a list of words or abbreviations, it may be composed to follow the Camel or Pascal casing without separator characters. Take an example: ‘fileName’ is being tokenized as “file” and “Name”. This work is done by the tokeniser.cs class.

## Compute the similarity between two words

The first method uses an edit-distance string matching algorithm: Levenshtein. The string edit distance is the total cost of transforming one string into another using a set of edit rules, each of which has an associated cost. Levenshtein distance is obtained by finding the cheapest way to transform one string into another. Transformations are the one-step operations of (single-phone) insertion, deletion and substitution. In the simplest version substitutions cost about two units except when the source and target are identical, in which case the cost is zero. Insertions and deletions costs half that of substitutions.

This code shows a dynamic implementation of the algorithm:

```private int ComputeDistance (string s, string t)
{
int n=s.Length;
int m=t.Length;
int[,] distance=new int[n + 1, m + 1]; // matrix
int cost=0;
if(n == 0) return m;
if(m == 0) return n;
//init1
for(int i=0; i <= n; distance[i, 0]=i++);
for(int j=0; j <= m; distance[0, j]=j++);
//find min distance
for(int i=1; i <= n; i++)
{
for(int j=1; j <= m;j++)
{
cost=(t.Substring(j - 1, 1) ==
s.Substring(i - 1, 1) ? 0 : 1);
distance[i,j]=Min3(distance[i - 1, j] + 1,
distance[i, j - 1] + 1,
distance[i - 1, j - 1] + cost);
}
}
return distance[n, m];
}```

### The similarity score

I assume that all relation scores are in the [0, 1] range, which means that if the score gets a maximum value (equal to 1) then the two string are absolutely similar.

```public float GetSimilarity(string string1, string string2)
{
float dis=ComputeDistance(string1, string2);
float maxLen=string1.Length;
if (maxLen < string2.Length)
maxLen = string2.Length;
if (maxLen == 0.0F)
return 1.0F;
else
return 1.0F - dis/maxLen;
}```

### Similarity between two token lists

After splitting each string into token lists, we capture the similarity between two strings by computing the similarity of those two token lists, which is reduced to the bipartite graph matching problem. A related classical problem on matching in bipartite graphs is the assignment problem, which is the quest to find the optimal assignment of workers to jobs that maximizes the sum of ratings, given all non-negative ratings Cost[i,j] of each worker i to each job j.

### The problem can now be described as follows

Given a graph G(V,E), G can be partitioned into two sets of disjoint nodes X(left) and Y (right) such that every edge connects a node in X with a node in Y, and each edge has a non-negative weight.

• X is the set of the first list of tokens.
• Y is the set of the second list of tokens.

E is a set of edges connecting between each couple of vertex (X ,Y), the weight of each edge which connects an x1 to a y1 is computed by the similarity of x1 token and y1 token (using the `GetSimilarity` function).

`Example: x<SUB>1</SUB>= “test”, y<SUB>1</SUB>=”tet”, e(x<SUB>1</SUB>, y<SUB>1</SUB>) = 0.75`

The task is to find a subset of node-disjoint edges that has the maximum total weight. The similarity of two strings is computed by the number of matched strings.

### Initialize the weight of the edges

The results of `GetSimilarity` function are used to compute the weight of edges.

`w(x,y) = GetSimilarity(token[x], token[y])`

### Connecting the edges to maximize the total weight

We use the Hungarian method to solve the maximum total weight of bipartite matching problem. The class used to implement this algorithm is MatchsMaker.cs.

## Future improvements

### Computing semantic similarity between words using the WordNet

The above section allows us to get the similarity score between patterns of strings. However, sometimes we need a semantic measurement. This problem leads us to find a semantic similarity. Now, I am experimenting with using WordNet[1] to compute the similarity between words inside the English dictionary.

#### The WordNet

WordNet is a lexical database which is available online and provides a large repository of English lexical items. The WordNet was designed to establish connections between four types of POS (Parts of Speech), noun, verb, adjective, and adverb. The smallest unit in WordNet is synset, which represents a specific meaning of a word. It includes the word, its explanation, and the synonyms of its meaning. The specific meaning of one word under one type of POS is called a sense. Each sense of a word is in a different synset. Synsets are equivalent to senses = structures containing sets of terms with synonymous meanings. Each synset has a gloss that defines the concept it represents. For example the words night, nighttime and dark constitute a single synset that has the following gloss: the time after sunset and before sunrise while it is dark outside. Synsets are connected to one another through explicit semantic relations. Some of these relations (hypernymy, hyponymy for nouns and hypernymy and troponymy for verbs) constitute kind-of and part-of (holonymy and meronymy for nouns) hierarchies. For example, tree is a kind of plant, tree is a hyponym of plant and plant is hypernym of tree. Analogously trunk is a part of tree and we have that trunk is meronym of tree and tree is holonym of trunk. For one word and one type of POS, if there are more than one sense, WordNet organizes them in the order of the most frequently used to the least frequently used.

Base on WordNet and its .NET API (that is provided by Troy Simpson); I use synonym and hypernym relations to capture the semantic similarities of tokens. Given a pair of words, once a path that connects the two words is found, I determine their similarity based on two factors: the length of the path and the order of the sense involved in this path.

To find a path connection between words we use the breadth-first search to check if they are synonyms. Searching the connection between two words in WordNet is an expensive operation due to the large searching space. We define two restrictions in order to reduce the computational time. The first one is that only synonym relations are considered (hyponym and hypernym will be considered later), since exhausting all the relations is too costly. This restriction is also adopted in some related works. Another restriction is to limit the length of the searching path. If the searcher cannot find a path that has been connected within a limited length, we stop searching and give the result as no path found. Hence, this case should have a substitution, the edit-distance instead.

#### The similarity score

Due to the synonym consideration, first we use the following formula to calculate the semantic similarity of words score:

`WordSim(s,t)=SenWeight(s)*SenWeight(t) / PathLength`
• Where `s` and `t`: denote the source and target words being compared.
• `SenseWeight`: denotes a weight calculated according to the order of this sense and the count of total senses.
• `PathLength`: denotes the length of the connection path from `s` to `t`.

(This work is an experiment and is not available in this version, it will soon be released.)

## Using the code

Add the similarity project to your workspace. Create a method like the following:

```public Test()
{
string s1="The code project's article";
string s2="Article of The CodeProject";
MatchsMaker match=new MatchsMaker(s1, s2) ;
Console.WriteLine(match.Score) ;
}```

where `s1`, `s2` are two strings for capturing similarity. The `match.Score` is the similarity score of the two strings. In this example, the score is 0.94.

## References

A list of licenses authors might use can be found here

## Share

Software Developer
Vietnam
I'm still alive...but temporarily moved to work on mobile & web stuffs(j2me/brew/php/flash...something not M\$). things have just been very busy, and probably will continue...so don't have chance to maintain & respond. Hope will have time to try to write again, because many ideas with WPF &silver light are waiting. wish me luck

FYI:
- MESHSimPack project(c# library for measuring similarity among concepts of the MESH ontology):
http://sourceforge.net/projects/meshsimpack.

 First PrevNext
 The Compiled Library is here (.DLL) luke reiner30-Apr-15 11:31 luke reiner 30-Apr-15 11:31
 My vote of 4 Amir Mehrabi-Jorshari13-Nov-13 19:59 Amir Mehrabi-Jorshari 13-Nov-13 19:59
 Comparing a name inputted against a database of names JacksonVF2-Nov-13 0:23 JacksonVF 2-Nov-13 0:23
 My vote of 5 ricardmag7-Mar-13 9:09 ricardmag 7-Mar-13 9:09
 My vote of 5 manoj kumar choubey16-Feb-12 0:36 manoj kumar choubey 16-Feb-12 0:36
 doesnt run in c# 2010 mountainman3362-Feb-12 23:03 mountainman336 2-Feb-12 23:03
 Heres an unsafe version 2 to 3x faster [modified] TheToid29-Jun-11 19:55 TheToid 29-Jun-11 19:55
 Re: Heres an unsafe version 2 to 3x faster [modified] markus folius13-Dec-11 17:52 markus folius 13-Dec-11 17:52
 Re: Heres an unsafe version 2 to 3x faster [modified] xandeq1-Jun-13 7:42 xandeq 1-Jun-13 7:42
 Similarity between two strings Santo15-Dec-07 23:36 Santo 15-Dec-07 23:36
 Re: Similarity between two strings TheToid29-Jun-11 19:57 TheToid 29-Jun-11 19:57
 Thanks! Daniel Carvalho Liedke5-Nov-07 14:15 Daniel Carvalho Liedke 5-Nov-07 14:15
 Very Useful icould13-Oct-06 1:57 icould 13-Oct-06 1:57
 I receive different score for Test value ike2010201023-Aug-06 6:04 ike20102010 23-Aug-06 6:04
 Re: I receive different score for Test value VCSKicks20-Feb-09 9:50 VCSKicks 20-Feb-09 9:50
 Re: I receive different score for Test value swati gapat10-Mar-14 18:08 swati gapat 10-Mar-14 18:08
 I need your email abdo12345678923-Jun-06 16:16 abdo123456789 23-Jun-06 16:16
 Strange behavior? [modified] Ty Hansen23-May-06 12:09 Ty Hansen 23-May-06 12:09
 Re: Strange behavior? Ty Hansen23-May-06 16:38 Ty Hansen 23-May-06 16:38
 Latest code Thanh Dao24-May-06 5:11 Thanh Dao 24-May-06 5:11
 help me with implementation boumedian14-May-06 22:18 boumedian 14-May-06 22:18
 Re: help me with implementation Thanh Dao16-May-06 0:57 Thanh Dao 16-May-06 0:57
 Re: help me with implementation Behind The Scene7-Aug-06 5:49 Behind The Scene 7-Aug-06 5:49
 Request Paula Jones9-May-06 0:48 Paula Jones 9-May-06 0:48
 Re: Request Thanh Dao12-May-06 7:06 Thanh Dao 12-May-06 7:06
 How to run your code? gauravgupta1235-Mar-06 9:46 gauravgupta123 5-Mar-06 9:46
 String Similarity Open source Thanh Dao5-Oct-05 17:46 Thanh Dao 5-Oct-05 17:46
 Hopcroft-Karp Thanh Dao11-Sep-05 0:47 Thanh Dao 11-Sep-05 0:47
 Re: Hopcroft-Karp Super Lloyd28-Sep-05 1:38 Super Lloyd 28-Sep-05 1:38
 Re: Hopcroft-Karp Thanh Dao28-Sep-05 3:02 Thanh Dao 28-Sep-05 3:02
 I just saw and it looks very interesting, and I will test it soon . I don't stress the complexity of matching algorithm, this is not an issue. Anyway, we can apply a simple fast and acceptable heuristic method instead of using a graph matching algorithm(I will present it in a new article Semantic similarity by 1-oct). In some particular situations, hopcroft-karp sounds good, but we need to make some changes for maximal total weight problem, because it was original designed for maximizing number of matching pairs(mariage theorem). I just personally made some greedy modifications and have yet tested carefully. Thank you very much for the good work, Thanh Dao
 Re: Hopcroft-Karp Super Lloyd28-Sep-05 3:56 Super Lloyd 28-Sep-05 3:56
 Re: Hopcroft-Karp Thanh Dao5-Oct-05 17:50 Thanh Dao 5-Oct-05 17:50
 strange not winning the prize of the month Unruled Boy5-Sep-05 6:23 Unruled Boy 5-Sep-05 6:23
 Re: strange not winning the prize of the month Thanh Dao5-Sep-05 16:54 Thanh Dao 5-Sep-05 16:54
 invalid character in StripAccents() Unruled Boy14-Aug-05 21:54 Unruled Boy 14-Aug-05 21:54
 Re: invalid character in StripAccents() Thanh Dao15-Aug-05 1:39 Thanh Dao 15-Aug-05 1:39
 Remove stop words Martin a. Warin12-Aug-05 18:04 Martin a. Warin 12-Aug-05 18:04
 Re: Remove stop words Martin A. Warin12-Aug-05 18:12 Martin A. Warin 12-Aug-05 18:12
 Re: Remove stop words Thanh Dao14-Aug-05 4:45 Thanh Dao 14-Aug-05 4:45
 A suggestion! Carl Mercier11-Aug-05 17:24 Carl Mercier 11-Aug-05 17:24
 Re: A suggestion! Thanh Dao11-Aug-05 22:50 Thanh Dao 11-Aug-05 22:50
 Re: A suggestion! Carl Mercier12-Aug-05 9:17 Carl Mercier 12-Aug-05 9:17
 Re: A suggestion! Thanh Dao14-Aug-05 4:12 Thanh Dao 14-Aug-05 4:12
 Re: Interesting Thanh Dao5-Aug-05 17:11 Thanh Dao 5-Aug-05 17:11
 excellent & more... Unruled Boy5-Aug-05 3:25 Unruled Boy 5-Aug-05 3:25
 Re: excellent & more... Thanh Dao5-Aug-05 17:10 Thanh Dao 5-Aug-05 17:10
 Re: excellent & more... Troy Simpson10-Aug-05 15:24 Troy Simpson 10-Aug-05 15:24
 Re: excellent & more... Thanh Dao10-Aug-05 17:45 Thanh Dao 10-Aug-05 17:45
 Thing... Chistian Lemon5-Aug-05 3:20 Chistian Lemon 5-Aug-05 3:20
 Last Visit: 31-Dec-99 19:00     Last Update: 4-May-15 23:47 Refresh 12 Next »