|
|
Comments and Discussions
|
|
 |
|
|

|
just flashes the console never appears!
|
|
|
|

|
Thanks for this code it works really nicely, heres an unsafe version of the distance function, it runs about 2-3x faster!
public static unsafe int StringDistance(string s1, string s2)
{
int liLengthS1 = s1.Length;
int liLengthS2 = s2.Length;
int liColWidth = liLengthS2 + 1;
int[,] distance = new int[liLengthS1 + 1, liColWidth];
int liCost = 0;
if (liLengthS1 == 0) return liLengthS2;
if (liLengthS2 == 0) return liLengthS1;
fixed (char* lpchS1 = s1, lpchS2 = s2)
fixed (int* lpiDistance = distance)
{
for (int liRow = 0; liRow <= liLengthS1; lpiDistance[liRow * liColWidth] = liRow++) ;
for (int liCol = 0; liCol <= liLengthS2; lpiDistance[liCol] = liCol++) ;
int* lpiRow = lpiDistance;
int* lpiRowBefore = lpiDistance;
for (int liRow = 1; liRow <= liLengthS1; liRow++)
{
lpiRow += liColWidth;
for (int liCol = 1; liCol <= liLengthS2; liCol++)
{
liCost = (lpchS2[liCol - 1] == lpchS1[liRow - 1] ? 0 : 1);
lpiRow[liCol] =
Min3(
lpiRowBefore[liCol] + 1,
lpiRow[liCol - 1] + 1,
lpiRowBefore[liCol - 1] + liCost
);
}
lpiRowBefore += liColWidth;
}
}
return distance[liLengthS1, liLengthS2];
}
modified on Thursday, June 30, 2011 10:53 AM
|
|
|
|

|
That's amazing! Some tests I did confirmed it was several times faster. I haven't looked at unsafe code before - this opens my eyes...
|
|
|
|

|
Does it calculates a number between 0 and 1 ? Because the similarity article code return to me numbers like 0 to 6, never the maximum value.
|
|
|
|

|
Hi,
I have a problem to compile the code you presented above. The compiler could not recognize Min3
function in the first method. could you please help me out.
Regards
Peter
peter
|
|
|
|

|
If you look at the code you will see that Min3 just returns the minimum of 3 integers:
static int Min3(int a, int b, int c)
{
return Math.Min(Math.Min(a, b), c);
}
should do it nicely.
|
|
|
|

|
Thanks a lot for the code, very nice and useful
|
|
|
|

|
Thank you very much for sharing this code.
Aykut
|
|
|
|

|
I receive a value of .9130435 using the example of:
public Test()
{
string s1="The code project's article";
string s2="Article of The CodeProject";
MatchsMaker match=new MatchsMaker(s1, s2) ;
Console.WriteLine(match.Score) ;
}
The only changes I've made to the code is the namespace.
|
|
|
|

|
Same here.
Still pretty cool article
|
|
|
|

|
please i need your email to discuss subjects about text similarity
dfds
|
|
|
|

|
First of all, I am really excited about this functionality, it is much better than my own attempts to do the same thing. However I am experiancing a strange bug. I use Visual Studio 2003, and the code works flawlessly when my build configuration is 'Debug', but when i recompile in 'Release' mode, the similarity scores are completely wrong, even though they were right before. Does anyone have any idea why this could be happening.
Thanks
-Ty
-- modified at 17:17 Tuesday 23rd May, 2006
Here are some more details.
I have isolated the build options that affect how result of the similarity calculation. In order for it to work correctly, my project needs to have the Optimize Code option set to false, and the Generate Debugging Information option set to true.
|
|
|
|

|
Strange Solution?
I got it working finally.. though I don't quite understand how.
I achieved success when i commented out the following lines of code in the GetTotal() method of the BipartiteMatcher class.
Trace.WriteLine (_leftTokens[i] + " <-> " + _rightTokens [_outgoing[i]] + " : " + _cost[i, _outgoing[i]]) ;
float a=1.0F - System.Math.Max(_leftTokens[i].Length , _rightTokens[_outgoing[i]].Length ) != 0 ? _cost[i, _outgoing[i]]/System.Math.Max(_leftTokens[i].Length , _rightTokens[_outgoing[i]].Length ) : 1;
nA += a;
These lines don's seem to do anything but set the value of a variable that isn't used anywhere... yet somehow it was changing the outcome of the calculation ... but only when compiled in release mode. I am happy to have this working but no less confused as to the cause.
|
|
|
|
|

|
am unable to run the code that is avaiable for download with your article. It's not that there is a problem in the code. It's just that I do not know how to go about it. I am using Visual studio.net 2003. Please get me started. Thanks a lot.
|
|
|
|

|
It should run normally. Take a look at test code, it is a console application.
Thanh Dao
|
|
|
|

|
He's a noob/newbie. I think he's asking you to be his teacher.
ROFLOLMFAO
|
|
|
|

|
Dear Sir,
This article("http://www.thecodeproject.com/csharp/improvestringsimilarity.asp") ,is an interesting concept and idea. Was a good read and thoroughly enjoyed going through it. Although i have to admit didn't understand most of the coding in the program, being an englishs' student at uni myself , but managed to get in running in the end somehow by my brothers help.
I need to write an article in the next two weeks, and think this would be an interesting topic for someone not from the computing field and a good discussion point, also because this a working model, so its not just some proposal but it has actually been implemented.
I would like to request you, if you would be grateful enough and if its possible for you, that could send me a sort of brief step by step explanation of the stages that this program works in/ processes data and then reaches to the end result between 0-1, on purely conceptual basis(simple english) in context to this program.
I would indeed be quite grateful to you.
My mail id: paulajone@googlemail.com
Thanking you.
- Paula.
|
|
|
|

|
Hello Paula,
I glad that you enjoyed it. As you can read in the article, here are three major steps:
- Split up a sentence into word/token. (assuming that in most natural language, a sentence is a sequence of words and delimiters.)
- Compute the similarity of two words. (you know what does the similarity mean... )
- Compute the similarity of two bag of words.
firstly I want to explain little bit about string edit distance problem.
Let give a simple game: given two words. and you have three operators to do with a word : replace, insert, delete. If you can find out the least number of steps to transform the first word to the second one, you win. So how will we go about finding the best solution ? Mr, Levenstein has done it.
private int ComputeDistance (string s, string t)
{
int n=s.Length;
int m=t.Length;
Step 0: Create an NxM matrix called the distance matrix in which each element [i,j] represents the cheapest cost (least number of edit step) of transforming the sequence of first i character (in the s string) to another one sequence of first j character (in the t string).
int[,] distance=new int[n + 1, m + 1]; // matrix
int cost=0;
if(n == 0) return m;
if(m == 0) return n;
Step 1: Initialize the distance matrix [i, 0] :For each row of the distance matrix, calculate value for the element distance[i, 0] = i. That means the cheapest cost to transform a string of length "i" to a empty string is the total steps of using the delete operation.
for(int i=0; i <= n; distance[i, 0]=i++);
Step 2: Initialize the distance matrix [0, j] :For each row of the distance matrix, calculate value for the element distance[ 0, j] = j. That means the cheapest cost to transform a string of length "j" to a empty string is the total steps of using the delete operation.
for(int j=0; j <= m; distance[0, j]=j++);
Step 3: Finding the solution for the problem of size [i,j] . The idea behind this code is bottom-up dynamic programming technique. For each element [i,j] we compute the solution for problem of [i,j] bases on the previous smaller problem. We trarvese the matrix programmatically from smaller to bigger. Repeat step 4 until reach the problem size NxM
for(int i=1; i <= n; i++)
{
for(int j=1; j <= m;j++)
{
Step 4: Compute the solution for problem of size [i,j] bases on previous smaller problem : [i-1, j], [i,j-1], [i-1, j-1]
cost=(t.Substring(j - 1, 1) == s.Substring(i - 1, 1) ? 0 : 1);
distance[i,j]=Min3(distance[i - 1, j] + 1,
distance[i, j - 1] + 1,
distance[i - 1, j - 1] + cost);
distance[i,j]=Min3(distance[i - 1, j] + 1, // if we base the problem [i-1, j] we need to insert one character. for example : from ABC to ABCD we need to append "D" to the first string.
distance[i, j - 1] + 1, // if we base the problem [i, j - 1] we need to delete one character. for example : from ABCD to ABC we need to delete "D".
distance[i - 1, j - 1] + cost);//if we base the problem [i - 1, j - 1] , if the S_characterAt(i-1) = S_characterAt(j-1) we have nothing to do, otherwise if they are different then we need to replace. for example ABCD to ABCE we need to replace "D" to "E".
}
}
return distance[n, m];
}
The smaller number of distance [n, m] the more similar two strings are.
There are several ways to compute the similarity value, may use some co-efficient. In example : Sim = (maxLen - distance[n, m]) / maxlen.
The idea of dynamic programming is quite simple. If the perfect solution for the smaller (previous) problem does not affect to the solution for the bigger problem (next) we can compute the prefect solution of the bigger problem based on solution of the previous smaller problems. Let you imagine the representation of problems on a hierarchical tree. A parent node is bigger problem, and a smaller is child node. To compute the solution for the parent node you do not need to look at its descendant just need to care its children.
Let me know if I am misunderstand your request and bear with me. I will be back to say about maximize bipartite matching algorithm. (I am currently sinking into research paper ).
Thanh Dao
|
|
|
|

|
I am unable to run the code that is avaiable for download with your article. It's not that there is a problem in the code. It's just that I do not know how to go about it. I am using Visual studio.net 2003. Please get me started. Thanks a lot.
Gaurav Gupta
|
|
|
|

|
If you intend to work on this topic, I strongly suggest you visit this library :
http://www.dcs.shef.ac.uk/~sam/stringmetrics.html[^]
There is lot of metrics there. I've also supplemented the TF/IDF measure which was missing in the library.
If you have any confusion on how to aggregrate/combinate multiple similarity metrics into a single one measurement, then feel free to drop me a line or can directly quiz the author of the above open source .
Thanh Dao
-- modified at 9:23 Saturday 29th October, 2005
|
|
|
|

|
The modified version of this algorithm gives us a lower complexity
Thanh Dao
-- modified at 6:45 Thursday 29th September, 2005
|
|
|
|

|
hu? what about 'Hopcroft-Karp' ?
Anyway I did some carefull reading of your code this time (instead superficial amazed test as before).
I already made some 'improvment' I hope you will like.
here they are:
=================================================
public class Levenshtein
{
public static int Min3(int a, int b, int c)
{
return a < b ? (a < c ? a : c) : (b < c ? b : c);
}
///
/// this distance goes from 0 to 1 + Max(s.Length, t.Length).
/// if they have different size they would be different
///
public static int GetMaxDistance(string s, string t)
{
int n = s.Length;
int m = t.Length;
int[,] distance = new int[n + 1, m + 1]; // matrix
int cost = 0;
if (n == 0) return m;
if (m == 0) return n;
for (int i = 0; i <= n; distance[i, 0] = i++) ;
for (int j = 0; j <= m; distance[0, j] = j++) ;
//find min distance
for (int i = 1; i <= n; i++)
{
for (int j = 1; j <= m; j++)
{
cost = t[j - 1] == s[i - 1] ? 0 : 1;
distance[i, j] = Min3(distance[i - 1, j] + 1,
distance[i, j - 1] + 1,
distance[i - 1, j - 1] + cost);
}
}
return distance[n, m];
}
///
/// this distance goes from 0 to 1 + Min(s.Length, t.Length).
/// it will be 1 or 0 if one word is caontained in the other
///
public static int GetMinDistance(string s, string t)
{
int n = s.Length;
int m = t.Length;
int[,] distance = new int[n + 1, m + 1]; // matrix
int cost = 0;
if (n == 0) return m;
if (m == 0) return n;
for (int i = 0; i <= n; distance[i, 0] = i++) ;
for (int j = 0; j <= m; distance[0, j] = j++) ;
//find min distance
for (int i = 1; i <= n; i++)
{
for (int j = 1; j <= m; j++)
{
cost = t[j - 1] == s[i - 1] ? 0 : 1;
distance[i, j] = Min3(distance[i - 1, j] + 1,
distance[i, j - 1] + 1,
distance[i - 1, j - 1] + cost);
}
}
if (n < m)
{
int ret = distance[n, n];
for (int i = n + 1; i <= m; i++)
{
int j = distance[n, i] - (i - n);
if (j < ret)
ret = j;
}
return ret;
}
else
{
int ret = distance[m, m];
for (int i = m + 1; i <= n; i++)
{
int j = distance[i, m] - (i - m);
if (j < ret)
ret = j;
}
return ret;
}
return distance[n, m];
}
=================================================
public class Tokeniser
{
static Dictionary accent;
static Tokeniser()
{
accent = new Dictionary();
accent['à'] = 'a';
accent['â'] = 'a';
accent['ä'] = 'a';
accent['á'] = 'a';
accent['é'] = 'e';
accent['è'] = 'e';
accent['ê'] = 'e';
accent['ë'] = 'e';
accent['ì'] = 'i';
accent['î'] = 'i';
accent['ï'] = 'i';
accent['ò'] = 'o';
accent['ô'] = 'o';
accent['ö'] = 'o';
accent['ù'] = 'u';
accent['û'] = 'u';
accent['ü'] = 'u';
accent['ç'] = 'c';
accent['’'] = '\'';
accent['ñ'] = 'n';
}
public static List Tokenize(string str)
{
Set delimiters = new Set();
delimiters.AddRange(new char[] {
' ', '\t', '\r', '\n', ',',
'.', ',', ':', ';', '?', '!',
'(', ')', '{', '}', '[', ']' });
return Tokenize(str, delimiters);
}
public static List Tokenize(string str, Set delimiters)
{
List ret = new List();
StringBuilder sb = new StringBuilder();
foreach (char c in str)
{
char c2 = Simplify(c);
if (delimiters.Contains(c2))
{
if (sb.Length > 0)
ret.Add(sb.ToString());
sb.Length = 0;
}
else
sb.Append(c2);
}
if (sb.Length > 0)
ret.Add(sb.ToString());
return ret;
}
public static char Simplify(char c)
{
if (Char.IsSeparator(c))
return ' ';
c = Char.ToLower(c);
if (accent.ContainsKey(c))
c = accent[c];
return c;
}
public static string Simplify(string s)
{
StringBuilder sb = new StringBuilder();
foreach (char c in s)
sb.Append(Simplify(c));
return sb.ToString();
}
}
=================================================
-- modified at 7:20 Wednesday 28th September, 2005
|
|
|
|
 |
|
|
General News Suggestion Question Bug Answer Joke Rant Admin
Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.
|
An improvement on capturing similarity between strings. The algorithm was developed in C#.
| Type | Article |
| Licence | |
| First Posted | 28 Jul 2005 |
| Views | 166,449 |
| Bookmarked | 126 times |
|
|