Click here to Skip to main content
Click here to Skip to main content

An improvement on capturing similarity between strings

By , 5 Aug 2005
 

Introduction

This article describes a way of capturing the similarity between two strings (or words). String similarity is a confidence score that reflects the relation between the meanings of two strings, which usually consists of multiple words or acronyms. Currently, in this approach I am more concerned on the measurement which reflects the relation between the patterns of the two strings, rather than the meaning of the words.

I implemented this algorithm when I was developing a tool to make the matching between XML schemas semi-automatic.

Preparing the ground

In this paper, I have implemented two algorithms:

  • Levenshtein algorithm[1]
  • The Kuhn-Munkres algorithm (also known as the Hungarian method)[2]

Without going "deep into theory", if you want to understand these algorithms, please read about them in the algorithm books. Other information about them can be reached at:

  1. Levenshtein
  2. The optimal assignment problem

Problem

The string similarity algorithm was developed to satisfy the following requirements:

  • A true reflection of lexical similarity - strings with small differences should be recognized as being similar. In particular, a significant sub-string overlap should point to a high level of similarity between the strings.
  • A robustness to changes of word order- two strings which contain the same words, but in a different order, should be recognized as being similar. On the other hand, if one string is just a random anagram of the characters contained in the other, then it should (usually) be recognized as dissimilar.
  • Language independence - the algorithm should work not only in English, but also in many different languages.

Solution

The similarity is calculated in three steps:

  • Partition each string into a list of tokens.
  • Computing the similarity between tokens by using a string edit-distance algorithm (extension feature: semantic similarity measurement using the WordNet library).
  • Computing the similarity between two token lists.

Tokenization

A string is a list of words or abbreviations, it may be composed to follow the Camel or Pascal casing without separator characters. Take an example: ‘fileName’ is being tokenized as “file” and “Name”. This work is done by the tokeniser.cs class.

Compute the similarity between two words

The first method uses an edit-distance string matching algorithm: Levenshtein. The string edit distance is the total cost of transforming one string into another using a set of edit rules, each of which has an associated cost. Levenshtein distance is obtained by finding the cheapest way to transform one string into another. Transformations are the one-step operations of (single-phone) insertion, deletion and substitution. In the simplest version substitutions cost about two units except when the source and target are identical, in which case the cost is zero. Insertions and deletions costs half that of substitutions.

This code shows a dynamic implementation of the algorithm:

private int ComputeDistance (string s, string t)
{
    int n=s.Length;
    int m=t.Length;
    int[,] distance=new int[n + 1, m + 1]; // matrix
    int cost=0;
    if(n == 0) return m;
    if(m == 0) return n;
    //init1
    for(int i=0; i <= n; distance[i, 0]=i++);
    for(int j=0; j <= m; distance[0, j]=j++);
    //find min distance
    for(int i=1; i <= n; i++)
    {
        for(int j=1; j <= m;j++)
        {
            cost=(t.Substring(j - 1, 1) == 
                s.Substring(i - 1, 1) ? 0 : 1);
            distance[i,j]=Min3(distance[i - 1, j] + 1,
            distance[i, j - 1] + 1,
            distance[i - 1, j - 1] + cost);
        }
    }
    return distance[n, m];
}

The similarity score

I assume that all relation scores are in the [0, 1] range, which means that if the score gets a maximum value (equal to 1) then the two string are absolutely similar.

public float GetSimilarity(string string1, string string2) 
{ 
    float dis=ComputeDistance(string1, string2);
    float maxLen=string1.Length;
    if (maxLen < string2.Length)
    maxLen = string2.Length;
    if (maxLen == 0.0F)
    return 1.0F;
    else
    return 1.0F - dis/maxLen; 
}

Similarity between two token lists

After splitting each string into token lists, we capture the similarity between two strings by computing the similarity of those two token lists, which is reduced to the bipartite graph matching problem. A related classical problem on matching in bipartite graphs is the assignment problem, which is the quest to find the optimal assignment of workers to jobs that maximizes the sum of ratings, given all non-negative ratings Cost[i,j] of each worker i to each job j.

The problem can now be described as follows

Given a graph G(V,E), G can be partitioned into two sets of disjoint nodes X(left) and Y (right) such that every edge connects a node in X with a node in Y, and each edge has a non-negative weight.

  • X is the set of the first list of tokens.
  • Y is the set of the second list of tokens.

E is a set of edges connecting between each couple of vertex (X ,Y), the weight of each edge which connects an x1 to a y1 is computed by the similarity of x1 token and y1 token (using the GetSimilarity function).

Example: x<SUB>1</SUB>= “test”, y<SUB>1</SUB>=”tet”, e(x<SUB>1</SUB>, y<SUB>1</SUB>) = 0.75

The task is to find a subset of node-disjoint edges that has the maximum total weight. The similarity of two strings is computed by the number of matched strings.

Initialize the weight of the edges

The results of GetSimilarity function are used to compute the weight of edges.

w(x,y) = GetSimilarity(token[x], token[y])

Connecting the edges to maximize the total weight

We use the Hungarian method to solve the maximum total weight of bipartite matching problem. The class used to implement this algorithm is MatchsMaker.cs.

Future improvements

Computing semantic similarity between words using the WordNet

The above section allows us to get the similarity score between patterns of strings. However, sometimes we need a semantic measurement. This problem leads us to find a semantic similarity. Now, I am experimenting with using WordNet[1] to compute the similarity between words inside the English dictionary.

The WordNet

WordNet is a lexical database which is available online and provides a large repository of English lexical items. The WordNet was designed to establish connections between four types of POS (Parts of Speech), noun, verb, adjective, and adverb. The smallest unit in WordNet is synset, which represents a specific meaning of a word. It includes the word, its explanation, and the synonyms of its meaning. The specific meaning of one word under one type of POS is called a sense. Each sense of a word is in a different synset. Synsets are equivalent to senses = structures containing sets of terms with synonymous meanings. Each synset has a gloss that defines the concept it represents. For example the words night, nighttime and dark constitute a single synset that has the following gloss: the time after sunset and before sunrise while it is dark outside. Synsets are connected to one another through explicit semantic relations. Some of these relations (hypernymy, hyponymy for nouns and hypernymy and troponymy for verbs) constitute kind-of and part-of (holonymy and meronymy for nouns) hierarchies. For example, tree is a kind of plant, tree is a hyponym of plant and plant is hypernym of tree. Analogously trunk is a part of tree and we have that trunk is meronym of tree and tree is holonym of trunk. For one word and one type of POS, if there are more than one sense, WordNet organizes them in the order of the most frequently used to the least frequently used.

Base on WordNet and its .NET API (that is provided by Troy Simpson); I use synonym and hypernym relations to capture the semantic similarities of tokens. Given a pair of words, once a path that connects the two words is found, I determine their similarity based on two factors: the length of the path and the order of the sense involved in this path.

To find a path connection between words we use the breadth-first search to check if they are synonyms. Searching the connection between two words in WordNet is an expensive operation due to the large searching space. We define two restrictions in order to reduce the computational time. The first one is that only synonym relations are considered (hyponym and hypernym will be considered later), since exhausting all the relations is too costly. This restriction is also adopted in some related works. Another restriction is to limit the length of the searching path. If the searcher cannot find a path that has been connected within a limited length, we stop searching and give the result as no path found. Hence, this case should have a substitution, the edit-distance instead.

The similarity score

Due to the synonym consideration, first we use the following formula to calculate the semantic similarity of words score:

WordSim(s,t)=SenWeight(s)*SenWeight(t) / PathLength
  • Where s and t: denote the source and target words being compared.
  • SenseWeight: denotes a weight calculated according to the order of this sense and the count of total senses.
  • PathLength: denotes the length of the connection path from s to t.

(This work is an experiment and is not available in this version, it will soon be released.)

Using the code

Add the similarity project to your workspace. Create a method like the following:

public Test()
{
   string s1="The code project's article";
   string s2="Article of The CodeProject";
   MatchsMaker match=new MatchsMaker(s1, s2) ;
   Console.WriteLine(match.Score) ;
}

where s1, s2 are two strings for capturing similarity. The match.Score is the similarity score of the two strings. In this example, the score is 0.94.

References

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

A list of licenses authors might use can be found here

About the Author

Thanh Dao
Software Developer
Vietnam Vietnam
Member
I'm still alive...but temporarily moved to work on mobile & web stuffs(j2me/brew/php/flash...something not M$). things have just been very busy, and probably will continue...so don't have chance to maintain & respond. Hope will have time to try to write again, because many ideas with WPF &silver light are waiting. wish me luck Smile | :)
 
FYI:
- MESHSimPack project(c# library for measuring similarity among concepts of the MESH ontology):
http://sourceforge.net/projects/meshsimpack.

Sign Up to vote   Poor Excellent
Add a reason or comment to your vote: x
Votes of 3 or less require a comment

Comments and Discussions

 
You must Sign In to use this message board.
Search this forum  
    Spacing  Noise  Layout  Per page   
GeneralMy vote of 5memberricardmag7 Mar '13 - 8:09 
Great work!!! thanks
GeneralMy vote of 5membermanoj kumar choubey15 Feb '12 - 23:36 
Nice
Questiondoesnt run in c# 2010membermountainman3362 Feb '12 - 22:03 
just flashes the console never appears!
SuggestionHeres an unsafe version 2 to 3x faster [modified]memberTheToid29 Jun '11 - 18:55 
Thanks for this code it works really nicely, heres an unsafe version of the distance function, it runs about 2-3x faster! Smile | :)
 
        public static unsafe int StringDistance(string s1, string s2)
        {
            int liLengthS1 = s1.Length;
            int liLengthS2 = s2.Length;
            int liColWidth = liLengthS2 + 1;
            int[,] distance = new int[liLengthS1 + 1, liColWidth];
 
            int liCost = 0;
            if (liLengthS1 == 0) return liLengthS2;
            if (liLengthS2 == 0) return liLengthS1;
 
            fixed (char* lpchS1 = s1, lpchS2 = s2)
            fixed (int* lpiDistance = distance)
            {
                for (int liRow = 0; liRow <= liLengthS1; lpiDistance[liRow * liColWidth] = liRow++) ;
                for (int liCol = 0; liCol <= liLengthS2; lpiDistance[liCol] = liCol++) ;
 
                int* lpiRow = lpiDistance;
                int* lpiRowBefore = lpiDistance;
 
                for (int liRow = 1; liRow <= liLengthS1; liRow++)
                {
                    lpiRow += liColWidth;
 
                    for (int liCol = 1; liCol <= liLengthS2; liCol++)
                    {
                        liCost = (lpchS2[liCol - 1] == lpchS1[liRow - 1] ? 0 : 1);
 
                        lpiRow[liCol] =
                            Min3(
                                lpiRowBefore[liCol] + 1,
                                lpiRow[liCol - 1] + 1,
                                lpiRowBefore[liCol - 1] + liCost
                            );
                    }
 
                    lpiRowBefore += liColWidth;
                }
            }
 
            return distance[liLengthS1, liLengthS2];
        }

modified on Thursday, June 30, 2011 10:53 AM

GeneralRe: Heres an unsafe version 2 to 3x faster [modified]membermarkus folius13 Dec '11 - 16:52 
That's amazing!   Some tests I did confirmed it was several times faster.   I haven't looked at unsafe code before - this opens my eyes... Smile | :)
GeneralSimilarity between two stringsmemberSanto15 Dec '07 - 22:36 
Hi,
 
I have a problem to compile the code you presented above. The compiler could not recognize Min3
function in the first method. could you please help me out.
 
Regards
Peter
peter

GeneralRe: Similarity between two stringsmemberTheToid29 Jun '11 - 18:57 
If you look at the code you will see that Min3 just returns the minimum of 3 integers:
 
        static int Min3(int a, int b, int c)
        {
            return Math.Min(Math.Min(a, b), c);
        }
 
should do it nicely.
GeneralThanks!memberDaniel Carvalho Liedke5 Nov '07 - 13:15 
Thanks a lot for the code, very nice and useful Big Grin | :-D
GeneralVery Usefulmembericould13 Oct '06 - 0:57 
Thank you very much for sharing this code.
 
Aykut
GeneralI receive different score for Test valuememberike2010201023 Aug '06 - 5:04 
I receive a value of .9130435 using the example of:
 
public Test()
{
string s1="The code project's article";
string s2="Article of The CodeProject";
MatchsMaker match=new MatchsMaker(s1, s2) ;
Console.WriteLine(match.Score) ;
}
 
The only changes I've made to the code is the namespace.
GeneralRe: I receive different score for Test valuememberVCSKicks20 Feb '09 - 8:50 
Same here.
 
Still pretty cool article
 
Visual C# Kicks - Free C#.NET articles, resources, and downloads at
http://www.vcskicks.com

GeneralI need your emailmemberabdo12345678923 Jun '06 - 15:16 
please i need your email to discuss subjects about text similarity
 
dfds
QuestionStrange behavior? [modified]memberTy Hansen23 May '06 - 11:09 
First of all, I am really excited about this functionality, it is much better than my own attempts to do the same thing. However I am experiancing a strange bug. I use Visual Studio 2003, and the code works flawlessly when my build configuration is 'Debug', but when i recompile in 'Release' mode, the similarity scores are completely wrong, even though they were right before. Does anyone have any idea why this could be happening.
Thanks
-Ty
 
-- modified at 17:17 Tuesday 23rd May, 2006
 
Here are some more details.
I have isolated the build options that affect how result of the similarity calculation. In order for it to work correctly, my project needs to have the Optimize Code option set to false, and the Generate Debugging Information option set to true.
AnswerRe: Strange behavior?memberTy Hansen23 May '06 - 15:38 
Strange Solution?
I got it working finally.. though I don't quite understand how.
I achieved success when i commented out the following lines of code in the GetTotal() method of the BipartiteMatcher class.
 

Trace.WriteLine (_leftTokens[i] + " <-> " + _rightTokens [_outgoing[i]] + " : " + _cost[i, _outgoing[i]]) ;
 
float a=1.0F - System.Math.Max(_leftTokens[i].Length , _rightTokens[_outgoing[i]].Length ) != 0 ? _cost[i, _outgoing[i]]/System.Math.Max(_leftTokens[i].Length , _rightTokens[_outgoing[i]].Length ) : 1;
 
nA += a;

 
These lines don's seem to do anything but set the value of a variable that isn't used anywhere... yet somehow it was changing the outcome of the calculation ... but only when compiled in release mode. I am happy to have this working but no less confused as to the cause.

GeneralLatest codememberThanh Dao24 May '06 - 4:11 
The implementation attached in this article was not optimized. I suggest you checking out the following code where I did some improvement. Under the /Matcher and /TextHelper folder.
 
http://opensvn.csie.org/WordNetDotNet/trunk/Projects/Thanh/[^]
 
Thanh Dao
Generalhelp me with implementationmemberboumedian14 May '06 - 21:18 
am unable to run the code that is avaiable for download with your article. It's not that there is a problem in the code. It's just that I do not know how to go about it. I am using Visual studio.net 2003. Please get me started. Thanks a lot.
 

GeneralRe: help me with implementationmemberThanh Dao15 May '06 - 23:57 
It should run normally. Take a look at test code, it is a console application.
 
Thanh Dao
GeneralRe: help me with implementationmemberBehind The Scene7 Aug '06 - 4:49 
He's a noob/newbie. I think he's asking you to be his teacher. Sigh | :sigh:
 
ROFLOLMFAO

GeneralRequestmemberPaula Jones8 May '06 - 23:48 
Dear Sir,
This article("http://www.thecodeproject.com/csharp/improvestringsimilarity.asp") ,is an interesting concept and idea. Was a good read and thoroughly enjoyed going through it. Although i have to admit didn't understand most of the coding in the program, being an englishs' student at uni myself , but managed to get in running in the end somehow by my brothers help.
 
I need to write an article in the next two weeks, and think this would be an interesting topic for someone not from the computing field and a good discussion point, also because this a working model, so its not just some proposal but it has actually been implemented.
 
I would like to request you, if you would be grateful enough and if its possible for you, that could send me a sort of brief step by step explanation of the stages that this program works in/ processes data and then reaches to the end result between 0-1, on purely conceptual basis(simple english) in context to this program.
I would indeed be quite grateful to you.
My mail id: paulajone@googlemail.com
Thanking you.
- Paula.
GeneralRe: RequestmemberThanh Dao12 May '06 - 6:06 
Hello Paula,
 
I glad that you enjoyed it. As you can read in the article, here are three major steps:
- Split up a sentence into word/token. (assuming that in most natural language, a sentence is a sequence of words and delimiters.)
- Compute the similarity of two words. (you know what does the similarity mean... )
- Compute the similarity of two bag of words.
 
firstly I want to explain little bit about string edit distance problem.
Let give a simple game: given two words. and you have three operators to do with a word : replace, insert, delete. If you can find out the least number of steps to transform the first word to the second one, you win. So how will we go about finding the best solution ? Mr, Levenstein has done it.
 
private int ComputeDistance (string s, string t)
 
{
 
int n=s.Length;
 
int m=t.Length;
 
Step 0: Create an NxM matrix called the distance matrix in which each element [i,j] represents the cheapest cost (least number of edit step) of transforming the sequence of first i character (in the s string) to another one sequence of first j character (in the t string).
 
int[,] distance=new int[n + 1, m + 1]; // matrix
 
int cost=0;
 
if(n == 0) return m;
 
if(m == 0) return n;
 
Step 1: Initialize the distance matrix [i, 0] :For each row of the distance matrix, calculate value for the element distance[i, 0] = i. That means the cheapest cost to transform a string of length "i" to a empty string is the total steps of using the delete operation.
 
for(int i=0; i <= n; distance[i, 0]=i++);
 
Step 2: Initialize the distance matrix [0, j] :For each row of the distance matrix, calculate value for the element distance[ 0, j] = j. That means the cheapest cost to transform a string of length "j" to a empty string is the total steps of using the delete operation.
 
for(int j=0; j <= m; distance[0, j]=j++);
 
Step 3: Finding the solution for the problem of size [i,j] . The idea behind this code is bottom-up dynamic programming technique. For each element [i,j] we compute the solution for problem of [i,j] bases on the previous smaller problem. We trarvese the matrix programmatically from smaller to bigger. Repeat step 4 until reach the problem size NxM
 
for(int i=1; i <= n; i++)
 
{
 
for(int j=1; j <= m;j++)
 
{
 
Step 4: Compute the solution for problem of size [i,j] bases on previous smaller problem : [i-1, j], [i,j-1], [i-1, j-1]
 
cost=(t.Substring(j - 1, 1) == s.Substring(i - 1, 1) ? 0 : 1);
distance[i,j]=Min3(distance[i - 1, j] + 1,
distance[i, j - 1] + 1,
distance[i - 1, j - 1] + cost);
 

distance[i,j]=Min3(distance[i - 1, j] + 1, // if we base the problem [i-1, j] we need to insert one character. for example : from ABC to ABCD we need to append "D" to the first string.
distance[i, j - 1] + 1, // if we base the problem [i, j - 1] we need to delete one character. for example : from ABCD to ABC we need to delete "D".
 
distance[i - 1, j - 1] + cost);//if we base the problem [i - 1, j - 1] , if the S_characterAt(i-1) = S_characterAt(j-1) we have nothing to do, otherwise if they are different then we need to replace. for example ABCD to ABCE we need to replace "D" to "E".
 
}
 
}
 
return distance[n, m];
 
}
 
The smaller number of distance [n, m] the more similar two strings are.
There are several ways to compute the similarity value, may use some co-efficient. In example : Sim = (maxLen - distance[n, m]) / maxlen.
 
The idea of dynamic programming is quite simple. If the perfect solution for the smaller (previous) problem does not affect to the solution for the bigger problem (next) we can compute the prefect solution of the bigger problem based on solution of the previous smaller problems. Let you imagine the representation of problems on a hierarchical tree. A parent node is bigger problem, and a smaller is child node. To compute the solution for the parent node you do not need to look at its descendant just need to care its children.
 
Let me know if I am misunderstand your request and bear with me. I will be back to say about maximize bipartite matching algorithm. (I am currently sinking into research paper Sleepy | :zzz: ).

 

Thanh Dao
 


QuestionHow to run your code?membergauravgupta1235 Mar '06 - 8:46 
I am unable to run the code that is avaiable for download with your article. It's not that there is a problem in the code. It's just that I do not know how to go about it. I am using Visual studio.net 2003. Please get me started. Thanks a lot.
 
Gaurav Gupta

GeneralString Similarity Open sourcememberThanh Dao5 Oct '05 - 16:46 
If you intend to work on this topic, I strongly suggest you visit this library :
http://www.dcs.shef.ac.uk/~sam/stringmetrics.html[^]
 
There is lot of metrics there. I've also supplemented the TF/IDF measure which was missing in the library.
If you have any confusion on how to aggregrate/combinate multiple similarity metrics into a single one measurement, then feel free to drop me a line or can directly quiz the author of the above open source .
 
Thanh Dao
 

 

-- modified at 9:23 Saturday 29th October, 2005
GeneralHopcroft-KarpmemberThanh Dao10 Sep '05 - 23:47 
The modified version of this algorithm gives us a lower complexity
Thanh Dao
 

 
-- modified at 6:45 Thursday 29th September, 2005
GeneralRe: Hopcroft-KarpmemberSuper Lloyd28 Sep '05 - 0:38 
hu? what about 'Hopcroft-Karp' ?
 
Anyway I did some carefull reading of your code this time (instead superficial amazed test as before).
 
I already made some 'improvment' I hope you will like.
here they are:
=================================================
public class Levenshtein
{
public static int Min3(int a, int b, int c)
{
return a < b ? (a < c ? a : c) : (b < c ? b : c);
}
///
/// this distance goes from 0 to 1 + Max(s.Length, t.Length).
/// if they have different size they would be different
///

public static int GetMaxDistance(string s, string t)
{
int n = s.Length;
int m = t.Length;
int[,] distance = new int[n + 1, m + 1]; // matrix
int cost = 0;
 
if (n == 0) return m;
if (m == 0) return n;
 
for (int i = 0; i <= n; distance[i, 0] = i++) ;
for (int j = 0; j <= m; distance[0, j] = j++) ;
 
//find min distance
for (int i = 1; i <= n; i++)
{
for (int j = 1; j <= m; j++)
{
cost = t[j - 1] == s[i - 1] ? 0 : 1;
distance[i, j] = Min3(distance[i - 1, j] + 1,
distance[i, j - 1] + 1,
distance[i - 1, j - 1] + cost);
}
}
 
return distance[n, m];
}
///
/// this distance goes from 0 to 1 + Min(s.Length, t.Length).
/// it will be 1 or 0 if one word is caontained in the other
///

public static int GetMinDistance(string s, string t)
{
int n = s.Length;
int m = t.Length;
int[,] distance = new int[n + 1, m + 1]; // matrix
int cost = 0;
 
if (n == 0) return m;
if (m == 0) return n;
 
for (int i = 0; i <= n; distance[i, 0] = i++) ;
for (int j = 0; j <= m; distance[0, j] = j++) ;
 
//find min distance
for (int i = 1; i <= n; i++)
{
for (int j = 1; j <= m; j++)
{
cost = t[j - 1] == s[i - 1] ? 0 : 1;
distance[i, j] = Min3(distance[i - 1, j] + 1,
distance[i, j - 1] + 1,
distance[i - 1, j - 1] + cost);
}
}
 
if (n < m)
{
int ret = distance[n, n];
for (int i = n + 1; i <= m; i++)
{
int j = distance[n, i] - (i - n);
if (j < ret)
ret = j;
}
return ret;
}
else
{
int ret = distance[m, m];
for (int i = m + 1; i <= n; i++)
{
int j = distance[i, m] - (i - m);
if (j < ret)
ret = j;
}
return ret;
}
 
return distance[n, m];
}
=================================================
public class Tokeniser
{
static Dictionary accent;
static Tokeniser()
{
accent = new Dictionary();
accent['à'] = 'a';
accent['â'] = 'a';
accent['ä'] = 'a';
accent['á'] = 'a';
accent['é'] = 'e';
accent['è'] = 'e';
accent['ê'] = 'e';
accent['ë'] = 'e';
accent['ì'] = 'i';
accent['î'] = 'i';
accent['ï'] = 'i';
accent['ò'] = 'o';
accent['ô'] = 'o';
accent['ö'] = 'o';
accent['ù'] = 'u';
accent['û'] = 'u';
accent['ü'] = 'u';
accent['ç'] = 'c';
accent['’'] = '\'';
accent['ñ'] = 'n';
}
 
public static List Tokenize(string str)
{
Set delimiters = new Set();
delimiters.AddRange(new char[] {
' ', '\t', '\r', '\n', ',',
'.', ',', ':', ';', '?', '!',
'(', ')', '{', '}', '[', ']' });
return Tokenize(str, delimiters);
}
public static List Tokenize(string str, Set delimiters)
{
List ret = new List();
StringBuilder sb = new StringBuilder();
foreach (char c in str)
{
char c2 = Simplify(c);
 
if (delimiters.Contains(c2))
{
if (sb.Length > 0)
ret.Add(sb.ToString());
sb.Length = 0;
}
else
sb.Append(c2);
}
if (sb.Length > 0)
ret.Add(sb.ToString());
 
return ret;
}
public static char Simplify(char c)
{
if (Char.IsSeparator(c))
return ' ';
c = Char.ToLower(c);
if (accent.ContainsKey(c))
c = accent[c];
return c;
}
public static string Simplify(string s)
{
StringBuilder sb = new StringBuilder();
foreach (char c in s)
sb.Append(Simplify(c));
return sb.ToString();
}
}
=================================================
 

-- modified at 7:20 Wednesday 28th September, 2005
GeneralRe: Hopcroft-KarpmemberThanh Dao28 Sep '05 - 2:02 

I just saw and it looks very interesting, and I will test it soon Smile | :) .
 
I don't stress the complexity of matching algorithm, this is not an issue. Anyway, we can apply a simple fast and acceptable heuristic method instead of using a graph matching algorithm(I will present it in a new article Semantic similarity by 1-oct). In some particular situations, hopcroft-karp sounds good, but we need to make some changes for maximal total weight problem, because it was original designed for maximizing number of matching pairs(mariage theorem). I just personally made some greedy modifications and have yet tested carefully.
 
Thank you very much for the good work,
 

 
Thanh Dao
GeneralRe: Hopcroft-KarpmemberSuper Lloyd28 Sep '05 - 2:56 
Please note that:
 
1. I submit my change in a hurry but there was a bug... I fixed it inline 50 minutes later...
Now GetMinDistance() work very nicely for functionality as the "look for" text box in the SDK documentation (even better dare I say)
 
2. for the Tokenizer I use a class Set found in the PowerCollection, free & OpenSource library available there:
http://www.wintellect.com/powercollections/[^]
GeneralRe: Hopcroft-KarpmemberThanh Dao5 Oct '05 - 16:50 
I am sorry if it took so long to get back, i've tested your code yet, but your tokenizer seems only work only VS 2005.
May be a next few days I will update a new efficient tokenizer class which works better than the current, also a combination trigram, bigram, and soundex, affixes (prefix -sufix).
 
Thanh Dao
 

 

 


Generalstrange not winning the prize of the monthmemberUnruled Boy5 Sep '05 - 5:23 
D'Oh! | :doh:
 
Regards,
unruledboy@hotmail.com
GeneralRe: strange not winning the prize of the monthmemberThanh Dao5 Sep '05 - 15:54 
Smile | :)
 
Thanh Dao
Generalinvalid character in StripAccents()memberUnruled Boy14 Aug '05 - 20:54 
it becomes:
 
string beforeConversion = "嗬饴淠崃樯枞晔胨焯钗锵蛞粼鲋缜採";
string afterConversion = "aAaAaAaAeEeEeEeEiIiIiIoOoOoOuUuUuUcC'n";
 
System.Text.StringBuilder sb = new System.Text.StringBuilder(input);
 
for (int i = 0; i < beforeConversion.Length; i++)
{
char beforeChar = beforeConversion[i];
char afterChar = afterConversion[i];
 
sb.Replace(beforeChar, afterChar);
}
 
sb.Replace("?, "oe");
sb.Replace("?, "ae");
 

I suggest that you handle these characters but their char number, not the char itself.
 
Regards,
unruledboy@hotmail.com
GeneralRe: invalid character in StripAccents()memberThanh Dao15 Aug '05 - 0:39 
Do you mean that "" are confused ? or the string beforConversion self-became that when module was running under a chinese OS ?
I think that 3 above chars look alike but they could not be treated as the same since they have the different order : 58058, 58248, 58343 , respectively. I just think that: Does the string class treat this problem automatically ?
 
if the code sb.Replace("œ", "oe"); self-becomes sb.Replace("?, "oe"); , this problem is a serious error.
 
Thanh Dao
GeneralRemove stop wordssussMartin a. Warin12 Aug '05 - 17:04 
Thanh
 
I just saw and though it is obvious that many of words in a string/sentence do not describle the string content. words like THE,IS ,AN , OF...are small, frequently occuring words and often ignored.
Many search engine/IR using automatic document indexing those non significant words are reomoved from the document vector, so document will only be represented by content bearing words.
 

GeneralRe: Remove stop wordssussMartin A. Warin12 Aug '05 - 17:12 
Suggest using probalistic indexing ranks so the stop word are modeled by a poison distribution.
 

 
Martin
GeneralRe: Remove stop wordsmemberThanh Dao14 Aug '05 - 3:45 
Hi Martin,
 
Thank you very much. That's great. I think that the number of words in a string(which in this case) is not high enought to use probalistic indexing ranks so it is difficult to implement term frequency in automatic indexing. Instead, I will use a common stop words list to remove the high frequency words.
 

 
Thanh Dao
GeneralA suggestion!memberCarl Mercier11 Aug '05 - 16:24 
Wildcards would be great! Example:
 
Doc* would be a perfect match for Doctor
 
Anyone thinks its worth adding?

GeneralRe: A suggestion!memberThanh Dao11 Aug '05 - 21:50 
This looks like a regular expression. Personally I was thinking of prefix/suffix matcher that mean if a word is prefix or postfix of another word then say that they're equivalent.
Prefix is efficient in case similar acronyms or cognate words.
These are syntactic similarity and it may not imply semantic relateness.
i.e: "net" and "network" but also the same with "hot" and "hotel".
 
Thanh Dao
GeneralRe: A suggestion!memberCarl Mercier12 Aug '05 - 8:17 
Yeah.. maybe that would work.
 
Wouldn't be great if
 
abcd*ghi would match abcdefghi at 100%? I'm just throwing ideas around to improve this thing. Very useful code that you have here!
GeneralRe: A suggestion!memberThanh Dao14 Aug '05 - 3:12 
Hi Carl,
 
That's great but I need to think more about working with regular expression.
 
Thank you, have a nice daySmile | :)
 
Thanh Dao
GeneralInterestingmemberLOTSAD5 Aug '05 - 6:28 
Not something that I would ever need to do but I enjoyed reading about it.
GeneralRe: InterestingmemberThanh Dao5 Aug '05 - 16:11 
Thank you ! I' glad to see that.
 
Thanh Dao
Generalexcellent &amp; more...memberUnruled Boy5 Aug '05 - 2:25 
hey, exactly what i want! simply excellent!
 
btw, where is latest source code of WordNet.NET? I did not find it at http://opensource.ebswift.com/WordNet.Net/[^]D'Oh! | :doh:
 
Regards,
unruledboy@hotmail.com
GeneralRe: excellent & more...memberThanh Dao5 Aug '05 - 16:10 
I' glad to know that's useful for you.
 
Here is the link to WordNet C# port:
http://opensvn.csie.org/WordNetDotNet/trunk/
 
Thanks
 
Thanh Dao
GeneralRe: excellent & more...memberTroy Simpson10 Aug '05 - 14:24 
Unruled Boy wrote:
btw, where is latest source code of WordNet.NET? I did not find it at http://opensource.ebswift.com/WordNet.Net/[^]
 
Hello, I'm the one who released WordNet.Net to LGPL (based on the port of Malcom Crowe), great work Thanh!
 
The source is actually linked to the 'Subversion Repository' link on that page. The link is a version controlled source repository, and as you can see if you just go to that web page you would have to be downloading lots of different source files. The way to do it is to get yourself a subversion client and let it do the leg work for you. Once you download the source you can keep it up-to-date with nothing more than a right-click and a menu selection from Windows Explorer.
 
I have made a wiki page on how to use the repository with TortoiseSVN here: http://www.ebswift.com/Wiki/wikka.php?wakka=WordNetDotNetRepository[^].
 
~Troy Simpson
 
http://www.ebswift.com
GeneralRe: excellent &amp; more...memberThanh Dao10 Aug '05 - 16:45 
Hi Troy,
 
Nice to see you here, Thank you very much!
 

 
Thanh Dao
GeneralThing...memberChistian Lemon5 Aug '05 - 2:20 
This version now looks clear to understand and works better.
When would you release the semantic similairty measurement ?
 

Thanks,

 
Christ
GeneralSemanticmemberThanh Dao5 Aug '05 - 16:20 
the example will come with another article.
just a few day
GeneralRe: Thing...memberChistian Lemon8 Aug '05 - 4:56 
Did you try with Resnik method ?
this is a link to his publications:
http://www.umiacs.umd.edu/users/resnik/pubs.html
 
Let me know something if i can help you. thanks again for this wonderful effort, cool guy Smile | :)

 
Christ
GeneralRe: Thing...memberThanh Dao9 Aug '05 - 21:31 
Yes, I've already read it. I am interested in what he said "the shorter connection path between nodes, the more similar they're...".
 
yeah, let code and help me ...i will mail you in private Smile | :)
 
thanks
 

 
Thanh Dao
GeneralInfinite Loopmemberharisphang3 Aug '05 - 13:51 
Hi, it's great job to have this similarity code. However, I am not sure why the comparison of these two string causes infinite loop in the code. Could u help me to find out why?
 
here are the code that i put on the test.cs
public Test()
{
string s1="2 JALAN PEMBERITA U1/49 \nTEMASYA INDUSTRIAL PARK \nGLENMARIE 40150 SHAH ALAM SEL";
string s2="D-3-2 ARCADIA APARTMENT \nUSJ 11/1 SUBANG JAYA \n47600 PETALING JAYA";
 

MatchsMaker match=new MatchsMaker(s1, s2) ;
Console.WriteLine(match.Score) ;
Console.ReadLine() ;
}

 
Haris.
GeneralRe: Infinite Loopmemberharisphang3 Aug '05 - 14:05 
I tried to cut down the string by deleting few words to be like this. And it works perfectly.
 
string s1="2 JALAN PEMBERITA U1 49 TEMASYA INDUSTRIAL PARK GLENMARIE ";
string s2="D 3 2 ARCADIA APARTMENT USJ 11 1 SUBANG JAYA 47600 PETALING JAYA";

 
I guess it seems that the code can't get too long words or strings. Can anyone overcome this problem?

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Rant Rant    Admin Admin   

Permalink | Advertise | Privacy | Mobile
Web01 | 2.6.130523.1 | Last Updated 5 Aug 2005
Article Copyright 2005 by Thanh Dao
Everything else Copyright © CodeProject, 1999-2013
Terms of Use
Layout: fixed | fluid