Add your own alternative version
Stats
246.5K views 4.9K downloads 135 bookmarked
Posted
28 Jul 2005

Comments and Discussions



Hello Paula,
I glad that you enjoyed it. As you can read in the article, here are three major steps:
 Split up a sentence into word/token. (assuming that in most natural language, a sentence is a sequence of words and delimiters.)
 Compute the similarity of two words. (you know what does the similarity mean... )
 Compute the similarity of two bag of words.
firstly I want to explain little bit about string edit distance problem.
Let give a simple game: given two words. and you have three operators to do with a word : replace, insert, delete. If you can find out the least number of steps to transform the first word to the second one, you win. So how will we go about finding the best solution ? Mr, Levenstein has done it.
private int ComputeDistance (string s, string t)
{
int n=s.Length;
int m=t.Length;
Step 0: Create an NxM matrix called the distance matrix in which each element [i,j] represents the cheapest cost (least number of edit step) of transforming the sequence of first i character (in the s string) to another one sequence of first j character (in the t string).
int[,] distance=new int[n + 1, m + 1]; // matrix
int cost=0;
if(n == 0) return m;
if(m == 0) return n;
Step 1: Initialize the distance matrix [i, 0] :For each row of the distance matrix, calculate value for the element distance[i, 0] = i. That means the cheapest cost to transform a string of length "i" to a empty string is the total steps of using the delete operation.
for(int i=0; i <= n; distance[i, 0]=i++);
Step 2: Initialize the distance matrix [0, j] :For each row of the distance matrix, calculate value for the element distance[ 0, j] = j. That means the cheapest cost to transform a string of length "j" to a empty string is the total steps of using the delete operation.
for(int j=0; j <= m; distance[0, j]=j++);
Step 3: Finding the solution for the problem of size [i,j] . The idea behind this code is bottomup dynamic programming technique. For each element [i,j] we compute the solution for problem of [i,j] bases on the previous smaller problem. We trarvese the matrix programmatically from smaller to bigger. Repeat step 4 until reach the problem size NxM
for(int i=1; i <= n; i++)
{
for(int j=1; j <= m;j++)
{
Step 4: Compute the solution for problem of size [i,j] bases on previous smaller problem : [i1, j], [i,j1], [i1, j1]
cost=(t.Substring(j  1, 1) == s.Substring(i  1, 1) ? 0 : 1);
distance[i,j]=Min3(distance[i  1, j] + 1,
distance[i, j  1] + 1,
distance[i  1, j  1] + cost);
distance[i,j]=Min3(distance[i  1, j] + 1, // if we base the problem [i1, j] we need to insert one character. for example : from ABC to ABCD we need to append "D" to the first string.
distance[i, j  1] + 1, // if we base the problem [i, j  1] we need to delete one character. for example : from ABCD to ABC we need to delete "D".
distance[i  1, j  1] + cost);//if we base the problem [i  1, j  1] , if the S_characterAt(i1) = S_characterAt(j1) we have nothing to do, otherwise if they are different then we need to replace. for example ABCD to ABCE we need to replace "D" to "E".
}
}
return distance[n, m];
}
The smaller number of distance [n, m] the more similar two strings are.
There are several ways to compute the similarity value, may use some coefficient. In example : Sim = (maxLen  distance[n, m]) / maxlen.
The idea of dynamic programming is quite simple. If the perfect solution for the smaller (previous) problem does not affect to the solution for the bigger problem (next) we can compute the prefect solution of the bigger problem based on solution of the previous smaller problems. Let you imagine the representation of problems on a hierarchical tree. A parent node is bigger problem, and a smaller is child node. To compute the solution for the parent node you do not need to look at its descendant just need to care its children.
Let me know if I am misunderstand your request and bear with me. I will be back to say about maximize bipartite matching algorithm. (I am currently sinking into research paper ).
Thanh Dao





I am unable to run the code that is avaiable for download with your article. It's not that there is a problem in the code. It's just that I do not know how to go about it. I am using Visual studio.net 2003. Please get me started. Thanks a lot.
Gaurav Gupta





If you intend to work on this topic, I strongly suggest you visit this library :
http://www.dcs.shef.ac.uk/~sam/stringmetrics.html[^]
There is lot of metrics there. I've also supplemented the TF/IDF measure which was missing in the library.
If you have any confusion on how to aggregrate/combinate multiple similarity metrics into a single one measurement, then feel free to drop me a line or can directly quiz the author of the above open source .
Thanh Dao
 modified at 9:23 Saturday 29th October, 2005





The modified version of this algorithm gives us a lower complexity
Thanh Dao
 modified at 6:45 Thursday 29th September, 2005





hu? what about 'HopcroftKarp' ?
Anyway I did some carefull reading of your code this time (instead superficial amazed test as before).
I already made some 'improvment' I hope you will like.
here they are:
=================================================
public class Levenshtein
{
public static int Min3(int a, int b, int c)
{
return a < b ? (a < c ? a : c) : (b < c ? b : c);
}
/// <summary>
/// this distance goes from 0 to 1 + Max(s.Length, t.Length).
/// if they have different size they would be different
///
public static int GetMaxDistance(string s, string t)
{
int n = s.Length;
int m = t.Length;
int[,] distance = new int[n + 1, m + 1]; // matrix
int cost = 0;
if (n == 0) return m;
if (m == 0) return n;
for (int i = 0; i <= n; distance[i, 0] = i++) ;
for (int j = 0; j <= m; distance[0, j] = j++) ;
//find min distance
for (int i = 1; i <= n; i++)
{
for (int j = 1; j <= m; j++)
{
cost = t[j  1] == s[i  1] ? 0 : 1;
distance[i, j] = Min3(distance[i  1, j] + 1,
distance[i, j  1] + 1,
distance[i  1, j  1] + cost);
}
}
return distance[n, m];
}
/// <summary>
/// this distance goes from 0 to 1 + Min(s.Length, t.Length).
/// it will be 1 or 0 if one word is caontained in the other
///
public static int GetMinDistance(string s, string t)
{
int n = s.Length;
int m = t.Length;
int[,] distance = new int[n + 1, m + 1]; // matrix
int cost = 0;
if (n == 0) return m;
if (m == 0) return n;
for (int i = 0; i <= n; distance[i, 0] = i++) ;
for (int j = 0; j <= m; distance[0, j] = j++) ;
//find min distance
for (int i = 1; i <= n; i++)
{
for (int j = 1; j <= m; j++)
{
cost = t[j  1] == s[i  1] ? 0 : 1;
distance[i, j] = Min3(distance[i  1, j] + 1,
distance[i, j  1] + 1,
distance[i  1, j  1] + cost);
}
}
if (n < m)
{
int ret = distance[n, n];
for (int i = n + 1; i <= m; i++)
{
int j = distance[n, i]  (i  n);
if (j < ret)
ret = j;
}
return ret;
}
else
{
int ret = distance[m, m];
for (int i = m + 1; i <= n; i++)
{
int j = distance[i, m]  (i  m);
if (j < ret)
ret = j;
}
return ret;
}
return distance[n, m];
}
=================================================
public class Tokeniser
{
static Dictionary<char, char> accent;
static Tokeniser()
{
accent = new Dictionary<char, char>();
accent['à'] = 'a';
accent['â'] = 'a';
accent['ä'] = 'a';
accent['á'] = 'a';
accent['é'] = 'e';
accent['è'] = 'e';
accent['ê'] = 'e';
accent['ë'] = 'e';
accent['ì'] = 'i';
accent['î'] = 'i';
accent['ï'] = 'i';
accent['ò'] = 'o';
accent['ô'] = 'o';
accent['ö'] = 'o';
accent['ù'] = 'u';
accent['û'] = 'u';
accent['ü'] = 'u';
accent['ç'] = 'c';
accent['’'] = '\'';
accent['ñ'] = 'n';
}
public static List<string> Tokenize(string str)
{
Set<char> delimiters = new Set<char>();
delimiters.AddRange(new char[] {
' ', '\t', '\r', '\n', ',',
'.', ',', ':', ';', '?', '!',
'(', ')', '{', '}', '[', ']' });
return Tokenize(str, delimiters);
}
public static List<string> Tokenize(string str, Set<char> delimiters)
{
List<string> ret = new List<string>();
StringBuilder sb = new StringBuilder();
foreach (char c in str)
{
char c2 = Simplify(c);
if (delimiters.Contains(c2))
{
if (sb.Length > 0)
ret.Add(sb.ToString());
sb.Length = 0;
}
else
sb.Append(c2);
}
if (sb.Length > 0)
ret.Add(sb.ToString());
return ret;
}
public static char Simplify(char c)
{
if (Char.IsSeparator(c))
return ' ';
c = Char.ToLower(c);
if (accent.ContainsKey(c))
c = accent[c];
return c;
}
public static string Simplify(string s)
{
StringBuilder sb = new StringBuilder();
foreach (char c in s)
sb.Append(Simplify(c));
return sb.ToString();
}
}
=================================================
 modified at 7:20 Wednesday 28th September, 2005





I just saw and it looks very interesting, and I will test it soon .
I don't stress the complexity of matching algorithm, this is not an issue. Anyway, we can apply a simple fast and acceptable heuristic method instead of using a graph matching algorithm(I will present it in a new article Semantic similarity by 1oct). In some particular situations, hopcroftkarp sounds good, but we need to make some changes for maximal total weight problem, because it was original designed for maximizing number of matching pairs(mariage theorem). I just personally made some greedy modifications and have yet tested carefully.
Thank you very much for the good work,
Thanh Dao





Please note that:
1. I submit my change in a hurry but there was a bug... I fixed it inline 50 minutes later...
Now GetMinDistance() work very nicely for functionality as the "look for" text box in the SDK documentation (even better dare I say)
2. for the Tokenizer I use a class Set<t> found in the PowerCollection, free & OpenSource library available there:
http://www.wintellect.com/powercollections/[^]





I am sorry if it took so long to get back, i've tested your code yet, but your tokenizer seems only work only VS 2005.
May be a next few days I will update a new efficient tokenizer class which works better than the current, also a combination trigram, bigram, and soundex, affixes (prefix sufix).
Thanh Dao





Regards,
unruledboy@hotmail.com





Thanh Dao





it becomes:
string beforeConversion = "嗬饴淠崃樯枞晔胨焯钗锵蛞粼鲋缜採";
string afterConversion = "aAaAaAaAeEeEeEeEiIiIiIoOoOoOuUuUuUcC'n";
System.Text.StringBuilder sb = new System.Text.StringBuilder(input);
for (int i = 0; i < beforeConversion.Length; i++)
{
char beforeChar = beforeConversion[i];
char afterChar = afterConversion[i];
sb.Replace(beforeChar, afterChar);
}
sb.Replace("?, "oe");
sb.Replace("?, "ae");
I suggest that you handle these characters but their char number, not the char itself.
Regards,
unruledboy@hotmail.com





Do you mean that "" are confused ? or the string beforConversion selfbecame that when module was running under a chinese OS ?
I think that 3 above chars look alike but they could not be treated as the same since they have the different order : 58058, 58248, 58343 , respectively. I just think that: Does the string class treat this problem automatically ?
if the code sb.Replace("œ", "oe"); selfbecomes sb.Replace("?, "oe"); , this problem is a serious error.
Thanh Dao





Thanh
I just saw and though it is obvious that many of words in a string/sentence do not describle the string content. words like THE,IS ,AN , OF...are small, frequently occuring words and often ignored.
Many search engine/IR using automatic document indexing those non significant words are reomoved from the document vector, so document will only be represented by content bearing words.





Suggest using probalistic indexing ranks so the stop word are modeled by a poison distribution.
Martin





Hi Martin,
Thank you very much. That's great. I think that the number of words in a string(which in this case) is not high enought to use probalistic indexing ranks so it is difficult to implement term frequency in automatic indexing. Instead, I will use a common stop words list to remove the high frequency words.
Thanh Dao





Wildcards would be great! Example:
Doc* would be a perfect match for Doctor
Anyone thinks its worth adding?





This looks like a regular expression. Personally I was thinking of prefix/suffix matcher that mean if a word is prefix or postfix of another word then say that they're equivalent.
Prefix is efficient in case similar acronyms or cognate words.
These are syntactic similarity and it may not imply semantic relateness.
i.e: "net" and "network" but also the same with "hot" and "hotel".
Thanh Dao





Yeah.. maybe that would work.
Wouldn't be great if
abcd*ghi would match abcdefghi at 100%? I'm just throwing ideas around to improve this thing. Very useful code that you have here!





Hi Carl,
That's great but I need to think more about working with regular expression.
Thank you, have a nice day
Thanh Dao





Not something that I would ever need to do but I enjoyed reading about it.





Thank you ! I' glad to see that.
Thanh Dao






I' glad to know that's useful for you.
Here is the link to WordNet C# port:
http://opensvn.csie.org/WordNetDotNet/trunk/
Thanks
Thanh Dao





Unruled Boy wrote:
btw, where is latest source code of WordNet.NET? I did not find it at http://opensource.ebswift.com/WordNet.Net/[^]
Hello, I'm the one who released WordNet.Net to LGPL (based on the port of Malcom Crowe), great work Thanh!
The source is actually linked to the 'Subversion Repository' link on that page. The link is a version controlled source repository, and as you can see if you just go to that web page you would have to be downloading lots of different source files. The way to do it is to get yourself a subversion client and let it do the leg work for you. Once you download the source you can keep it uptodate with nothing more than a rightclick and a menu selection from Windows Explorer.
I have made a wiki page on how to use the repository with TortoiseSVN here: http://www.ebswift.com/Wiki/wikka.php?wakka=WordNetDotNetRepository[^].
~Troy Simpson
http://www.ebswift.com





Hi Troy,
Nice to see you here, Thank you very much!
Thanh Dao







General News Suggestion Question Bug Answer Joke Praise Rant Admin Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.

