Add your own alternative version
Stats
245.4K views 4.8K downloads 135 bookmarked
Posted
29 Jul 2005

Comments and Discussions



Hi,
I have a problem to compile the code you presented above. The compiler could not recognize Min3 function in the first method. could you please help me out.
Regards Peter
peter





If you look at the code you will see that Min3 just returns the minimum of 3 integers:
static int Min3(int a, int b, int c)
{
return Math.Min(Math.Min(a, b), c);
}
should do it nicely.





Thanks a lot for the code, very nice and useful





Thank you very much for sharing this code.
Aykut





I receive a value of .9130435 using the example of:
public Test() { string s1="The code project's article"; string s2="Article of The CodeProject"; MatchsMaker match=new MatchsMaker(s1, s2) ; Console.WriteLine(match.Score) ; }
The only changes I've made to the code is the namespace.





Same here.
Still pretty cool article





i have executed this code i have question project can use for automatic answer checking in my





please i need your email to discuss subjects about text similarity
dfds





First of all, I am really excited about this functionality, it is much better than my own attempts to do the same thing. However I am experiancing a strange bug. I use Visual Studio 2003, and the code works flawlessly when my build configuration is 'Debug', but when i recompile in 'Release' mode, the similarity scores are completely wrong, even though they were right before. Does anyone have any idea why this could be happening. Thanks Ty
 modified at 17:17 Tuesday 23rd May, 2006
Here are some more details. I have isolated the build options that affect how result of the similarity calculation. In order for it to work correctly, my project needs to have the Optimize Code option set to false, and the Generate Debugging Information option set to true.





Strange Solution? I got it working finally.. though I don't quite understand how. I achieved success when i commented out the following lines of code in the GetTotal() method of the BipartiteMatcher class.
Trace.WriteLine (_leftTokens[i] + " <> " + _rightTokens [_outgoing[i]] + " : " + _cost[i, _outgoing[i]]) ;
float a=1.0F  System.Math.Max(_leftTokens[i].Length , _rightTokens[_outgoing[i]].Length ) != 0 ? _cost[i, _outgoing[i]]/System.Math.Max(_leftTokens[i].Length , _rightTokens[_outgoing[i]].Length ) : 1;
nA += a;
These lines don's seem to do anything but set the value of a variable that isn't used anywhere... yet somehow it was changing the outcome of the calculation ... but only when compiled in release mode. I am happy to have this working but no less confused as to the cause.






am unable to run the code that is avaiable for download with your article. It's not that there is a problem in the code. It's just that I do not know how to go about it. I am using Visual studio.net 2003. Please get me started. Thanks a lot.





It should run normally. Take a look at test code, it is a console application.
Thanh Dao





He's a noob/newbie. I think he's asking you to be his teacher.
ROFLOLMFAO





Dear Sir, This article("http://www.thecodeproject.com/csharp/improvestringsimilarity.asp") ,is an interesting concept and idea. Was a good read and thoroughly enjoyed going through it. Although i have to admit didn't understand most of the coding in the program, being an englishs' student at uni myself , but managed to get in running in the end somehow by my brothers help.
I need to write an article in the next two weeks, and think this would be an interesting topic for someone not from the computing field and a good discussion point, also because this a working model, so its not just some proposal but it has actually been implemented.
I would like to request you, if you would be grateful enough and if its possible for you, that could send me a sort of brief step by step explanation of the stages that this program works in/ processes data and then reaches to the end result between 01, on purely conceptual basis(simple english) in context to this program. I would indeed be quite grateful to you. My mail id: paulajone@googlemail.com Thanking you.  Paula.





Hello Paula,
I glad that you enjoyed it. As you can read in the article, here are three major steps:  Split up a sentence into word/token. (assuming that in most natural language, a sentence is a sequence of words and delimiters.)  Compute the similarity of two words. (you know what does the similarity mean... )  Compute the similarity of two bag of words.
firstly I want to explain little bit about string edit distance problem. Let give a simple game: given two words. and you have three operators to do with a word : replace, insert, delete. If you can find out the least number of steps to transform the first word to the second one, you win. So how will we go about finding the best solution ? Mr, Levenstein has done it.
private int ComputeDistance (string s, string t)
{
int n=s.Length;
int m=t.Length;
Step 0: Create an NxM matrix called the distance matrix in which each element [i,j] represents the cheapest cost (least number of edit step) of transforming the sequence of first i character (in the s string) to another one sequence of first j character (in the t string).
int[,] distance=new int[n + 1, m + 1]; // matrix
int cost=0;
if(n == 0) return m;
if(m == 0) return n;
Step 1: Initialize the distance matrix [i, 0] :For each row of the distance matrix, calculate value for the element distance[i, 0] = i. That means the cheapest cost to transform a string of length "i" to a empty string is the total steps of using the delete operation.
for(int i=0; i <= n; distance[i, 0]=i++);
Step 2: Initialize the distance matrix [0, j] :For each row of the distance matrix, calculate value for the element distance[ 0, j] = j. That means the cheapest cost to transform a string of length "j" to a empty string is the total steps of using the delete operation.
for(int j=0; j <= m; distance[0, j]=j++);
Step 3: Finding the solution for the problem of size [i,j] . The idea behind this code is bottomup dynamic programming technique. For each element [i,j] we compute the solution for problem of [i,j] bases on the previous smaller problem. We trarvese the matrix programmatically from smaller to bigger. Repeat step 4 until reach the problem size NxM
for(int i=1; i <= n; i++)
{
for(int j=1; j <= m;j++)
{
Step 4: Compute the solution for problem of size [i,j] bases on previous smaller problem : [i1, j], [i,j1], [i1, j1]
cost=(t.Substring(j  1, 1) == s.Substring(i  1, 1) ? 0 : 1); distance[i,j]=Min3(distance[i  1, j] + 1, distance[i, j  1] + 1, distance[i  1, j  1] + cost);
distance[i,j]=Min3(distance[i  1, j] + 1, // if we base the problem [i1, j] we need to insert one character. for example : from ABC to ABCD we need to append "D" to the first string. distance[i, j  1] + 1, // if we base the problem [i, j  1] we need to delete one character. for example : from ABCD to ABC we need to delete "D".
distance[i  1, j  1] + cost);//if we base the problem [i  1, j  1] , if the S_characterAt(i1) = S_characterAt(j1) we have nothing to do, otherwise if they are different then we need to replace. for example ABCD to ABCE we need to replace "D" to "E".
}
}
return distance[n, m];
}
The smaller number of distance [n, m] the more similar two strings are. There are several ways to compute the similarity value, may use some coefficient. In example : Sim = (maxLen  distance[n, m]) / maxlen.
The idea of dynamic programming is quite simple. If the perfect solution for the smaller (previous) problem does not affect to the solution for the bigger problem (next) we can compute the prefect solution of the bigger problem based on solution of the previous smaller problems. Let you imagine the representation of problems on a hierarchical tree. A parent node is bigger problem, and a smaller is child node. To compute the solution for the parent node you do not need to look at its descendant just need to care its children.
Let me know if I am misunderstand your request and bear with me. I will be back to say about maximize bipartite matching algorithm. (I am currently sinking into research paper ).
Thanh Dao





I am unable to run the code that is avaiable for download with your article. It's not that there is a problem in the code. It's just that I do not know how to go about it. I am using Visual studio.net 2003. Please get me started. Thanks a lot.
Gaurav Gupta





If you intend to work on this topic, I strongly suggest you visit this library : http://www.dcs.shef.ac.uk/~sam/stringmetrics.html[^]
There is lot of metrics there. I've also supplemented the TF/IDF measure which was missing in the library. If you have any confusion on how to aggregrate/combinate multiple similarity metrics into a single one measurement, then feel free to drop me a line or can directly quiz the author of the above open source .
Thanh Dao
 modified at 9:23 Saturday 29th October, 2005





The modified version of this algorithm gives us a lower complexity Thanh Dao
 modified at 6:45 Thursday 29th September, 2005





hu? what about 'HopcroftKarp' ?
Anyway I did some carefull reading of your code this time (instead superficial amazed test as before).
I already made some 'improvment' I hope you will like. here they are: ================================================= public class Levenshtein { public static int Min3(int a, int b, int c) { return a < b ? (a < c ? a : c) : (b < c ? b : c); } /// <summary> /// this distance goes from 0 to 1 + Max(s.Length, t.Length). /// if they have different size they would be different /// public static int GetMaxDistance(string s, string t) { int n = s.Length; int m = t.Length; int[,] distance = new int[n + 1, m + 1]; // matrix int cost = 0;
if (n == 0) return m; if (m == 0) return n;
for (int i = 0; i <= n; distance[i, 0] = i++) ; for (int j = 0; j <= m; distance[0, j] = j++) ;
//find min distance for (int i = 1; i <= n; i++) { for (int j = 1; j <= m; j++) { cost = t[j  1] == s[i  1] ? 0 : 1; distance[i, j] = Min3(distance[i  1, j] + 1, distance[i, j  1] + 1, distance[i  1, j  1] + cost); } }
return distance[n, m]; } /// <summary> /// this distance goes from 0 to 1 + Min(s.Length, t.Length). /// it will be 1 or 0 if one word is caontained in the other /// public static int GetMinDistance(string s, string t) { int n = s.Length; int m = t.Length; int[,] distance = new int[n + 1, m + 1]; // matrix int cost = 0;
if (n == 0) return m; if (m == 0) return n;
for (int i = 0; i <= n; distance[i, 0] = i++) ; for (int j = 0; j <= m; distance[0, j] = j++) ;
//find min distance for (int i = 1; i <= n; i++) { for (int j = 1; j <= m; j++) { cost = t[j  1] == s[i  1] ? 0 : 1; distance[i, j] = Min3(distance[i  1, j] + 1, distance[i, j  1] + 1, distance[i  1, j  1] + cost); } }
if (n < m) { int ret = distance[n, n]; for (int i = n + 1; i <= m; i++) { int j = distance[n, i]  (i  n); if (j < ret) ret = j; } return ret; } else { int ret = distance[m, m]; for (int i = m + 1; i <= n; i++) { int j = distance[i, m]  (i  m); if (j < ret) ret = j; } return ret; }
return distance[n, m]; } ================================================= public class Tokeniser { static Dictionary<char, char> accent; static Tokeniser() { accent = new Dictionary<char, char>(); accent['à'] = 'a'; accent['â'] = 'a'; accent['ä'] = 'a'; accent['á'] = 'a'; accent['é'] = 'e'; accent['è'] = 'e'; accent['ê'] = 'e'; accent['ë'] = 'e'; accent['ì'] = 'i'; accent['î'] = 'i'; accent['ï'] = 'i'; accent['ò'] = 'o'; accent['ô'] = 'o'; accent['ö'] = 'o'; accent['ù'] = 'u'; accent['û'] = 'u'; accent['ü'] = 'u'; accent['ç'] = 'c'; accent['’'] = '\''; accent['ñ'] = 'n'; }
public static List<string> Tokenize(string str) { Set<char> delimiters = new Set<char>(); delimiters.AddRange(new char[] { ' ', '\t', '\r', '\n', ',', '.', ',', ':', ';', '?', '!', '(', ')', '{', '}', '[', ']' }); return Tokenize(str, delimiters); } public static List<string> Tokenize(string str, Set<char> delimiters) { List<string> ret = new List<string>(); StringBuilder sb = new StringBuilder(); foreach (char c in str) { char c2 = Simplify(c);
if (delimiters.Contains(c2)) { if (sb.Length > 0) ret.Add(sb.ToString()); sb.Length = 0; } else sb.Append(c2); } if (sb.Length > 0) ret.Add(sb.ToString());
return ret; } public static char Simplify(char c) { if (Char.IsSeparator(c)) return ' '; c = Char.ToLower(c); if (accent.ContainsKey(c)) c = accent[c]; return c; } public static string Simplify(string s) { StringBuilder sb = new StringBuilder(); foreach (char c in s) sb.Append(Simplify(c)); return sb.ToString(); } } =================================================
 modified at 7:20 Wednesday 28th September, 2005





I just saw and it looks very interesting, and I will test it soon .
I don't stress the complexity of matching algorithm, this is not an issue. Anyway, we can apply a simple fast and acceptable heuristic method instead of using a graph matching algorithm(I will present it in a new article Semantic similarity by 1oct). In some particular situations, hopcroftkarp sounds good, but we need to make some changes for maximal total weight problem, because it was original designed for maximizing number of matching pairs(mariage theorem). I just personally made some greedy modifications and have yet tested carefully.
Thank you very much for the good work,
Thanh Dao





Please note that:
1. I submit my change in a hurry but there was a bug... I fixed it inline 50 minutes later... Now GetMinDistance() work very nicely for functionality as the "look for" text box in the SDK documentation (even better dare I say)
2. for the Tokenizer I use a class Set<t> found in the PowerCollection, free & OpenSource library available there: http://www.wintellect.com/powercollections/[^]





I am sorry if it took so long to get back, i've tested your code yet, but your tokenizer seems only work only VS 2005. May be a next few days I will update a new efficient tokenizer class which works better than the current, also a combination trigram, bigram, and soundex, affixes (prefix sufix).
Thanh Dao





Regards, unruledboy@hotmail.com





Thanh Dao







General News Suggestion Question Bug Answer Joke Praise Rant Admin Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.

