|
|
Comments and Discussions
|
|
 |
|
|

|
just flashes the console never appears!
|
|
|
|

|
Thanks for this code it works really nicely, heres an unsafe version of the distance function, it runs about 2-3x faster!
public static unsafe int StringDistance(string s1, string s2)
{
int liLengthS1 = s1.Length;
int liLengthS2 = s2.Length;
int liColWidth = liLengthS2 + 1;
int[,] distance = new int[liLengthS1 + 1, liColWidth];
int liCost = 0;
if (liLengthS1 == 0) return liLengthS2;
if (liLengthS2 == 0) return liLengthS1;
fixed (char* lpchS1 = s1, lpchS2 = s2)
fixed (int* lpiDistance = distance)
{
for (int liRow = 0; liRow <= liLengthS1; lpiDistance[liRow * liColWidth] = liRow++) ;
for (int liCol = 0; liCol <= liLengthS2; lpiDistance[liCol] = liCol++) ;
int* lpiRow = lpiDistance;
int* lpiRowBefore = lpiDistance;
for (int liRow = 1; liRow <= liLengthS1; liRow++)
{
lpiRow += liColWidth;
for (int liCol = 1; liCol <= liLengthS2; liCol++)
{
liCost = (lpchS2[liCol - 1] == lpchS1[liRow - 1] ? 0 : 1);
lpiRow[liCol] =
Min3(
lpiRowBefore[liCol] + 1,
lpiRow[liCol - 1] + 1,
lpiRowBefore[liCol - 1] + liCost
);
}
lpiRowBefore += liColWidth;
}
}
return distance[liLengthS1, liLengthS2];
}
modified on Thursday, June 30, 2011 10:53 AM
|
|
|
|

|
That's amazing! Some tests I did confirmed it was several times faster. I haven't looked at unsafe code before - this opens my eyes...
|
|
|
|

|
Hi,
I have a problem to compile the code you presented above. The compiler could not recognize Min3
function in the first method. could you please help me out.
Regards
Peter
peter
|
|
|
|

|
If you look at the code you will see that Min3 just returns the minimum of 3 integers:
static int Min3(int a, int b, int c)
{
return Math.Min(Math.Min(a, b), c);
}
should do it nicely.
|
|
|
|

|
Thanks a lot for the code, very nice and useful
|
|
|
|

|
Thank you very much for sharing this code.
Aykut
|
|
|
|

|
I receive a value of .9130435 using the example of:
public Test()
{
string s1="The code project's article";
string s2="Article of The CodeProject";
MatchsMaker match=new MatchsMaker(s1, s2) ;
Console.WriteLine(match.Score) ;
}
The only changes I've made to the code is the namespace.
|
|
|
|

|
Same here.
Still pretty cool article
|
|
|
|

|
please i need your email to discuss subjects about text similarity
dfds
|
|
|
|

|
First of all, I am really excited about this functionality, it is much better than my own attempts to do the same thing. However I am experiancing a strange bug. I use Visual Studio 2003, and the code works flawlessly when my build configuration is 'Debug', but when i recompile in 'Release' mode, the similarity scores are completely wrong, even though they were right before. Does anyone have any idea why this could be happening.
Thanks
-Ty
-- modified at 17:17 Tuesday 23rd May, 2006
Here are some more details.
I have isolated the build options that affect how result of the similarity calculation. In order for it to work correctly, my project needs to have the Optimize Code option set to false, and the Generate Debugging Information option set to true.
|
|
|
|

|
Strange Solution?
I got it working finally.. though I don't quite understand how.
I achieved success when i commented out the following lines of code in the GetTotal() method of the BipartiteMatcher class.
Trace.WriteLine (_leftTokens[i] + " <-> " + _rightTokens [_outgoing[i]] + " : " + _cost[i, _outgoing[i]]) ;
float a=1.0F - System.Math.Max(_leftTokens[i].Length , _rightTokens[_outgoing[i]].Length ) != 0 ? _cost[i, _outgoing[i]]/System.Math.Max(_leftTokens[i].Length , _rightTokens[_outgoing[i]].Length ) : 1;
nA += a;
These lines don's seem to do anything but set the value of a variable that isn't used anywhere... yet somehow it was changing the outcome of the calculation ... but only when compiled in release mode. I am happy to have this working but no less confused as to the cause.
|
|
|
|
|

|
am unable to run the code that is avaiable for download with your article. It's not that there is a problem in the code. It's just that I do not know how to go about it. I am using Visual studio.net 2003. Please get me started. Thanks a lot.
|
|
|
|

|
It should run normally. Take a look at test code, it is a console application.
Thanh Dao
|
|
|
|

|
He's a noob/newbie. I think he's asking you to be his teacher.
ROFLOLMFAO
|
|
|
|

|
Dear Sir,
This article("http://www.thecodeproject.com/csharp/improvestringsimilarity.asp") ,is an interesting concept and idea. Was a good read and thoroughly enjoyed going through it. Although i have to admit didn't understand most of the coding in the program, being an englishs' student at uni myself , but managed to get in running in the end somehow by my brothers help.
I need to write an article in the next two weeks, and think this would be an interesting topic for someone not from the computing field and a good discussion point, also because this a working model, so its not just some proposal but it has actually been implemented.
I would like to request you, if you would be grateful enough and if its possible for you, that could send me a sort of brief step by step explanation of the stages that this program works in/ processes data and then reaches to the end result between 0-1, on purely conceptual basis(simple english) in context to this program.
I would indeed be quite grateful to you.
My mail id: paulajone@googlemail.com
Thanking you.
- Paula.
|
|
|
|

|
Hello Paula,
I glad that you enjoyed it. As you can read in the article, here are three major steps:
- Split up a sentence into word/token. (assuming that in most natural language, a sentence is a sequence of words and delimiters.)
- Compute the similarity of two words. (you know what does the similarity mean... )
- Compute the similarity of two bag of words.
firstly I want to explain little bit about string edit distance problem.
Let give a simple game: given two words. and you have three operators to do with a word : replace, insert, delete. If you can find out the least number of steps to transform the first word to the second one, you win. So how will we go about finding the best solution ? Mr, Levenstein has done it.
private int ComputeDistance (string s, string t)
{
int n=s.Length;
int m=t.Length;
Step 0: Create an NxM matrix called the distance matrix in which each element [i,j] represents the cheapest cost (least number of edit step) of transforming the sequence of first i character (in the s string) to another one sequence of first j character (in the t string).
int[,] distance=new int[n + 1, m + 1]; // matrix
int cost=0;
if(n == 0) return m;
if(m == 0) return n;
Step 1: Initialize the distance matrix [i, 0] :For each row of the distance matrix, calculate value for the element distance[i, 0] = i. That means the cheapest cost to transform a string of length "i" to a empty string is the total steps of using the delete operation.
for(int i=0; i <= n; distance[i, 0]=i++);
Step 2: Initialize the distance matrix [0, j] :For each row of the distance matrix, calculate value for the element distance[ 0, j] = j. That means the cheapest cost to transform a string of length "j" to a empty string is the total steps of using the delete operation.
for(int j=0; j <= m; distance[0, j]=j++);
Step 3: Finding the solution for the problem of size [i,j] . The idea behind this code is bottom-up dynamic programming technique. For each element [i,j] we compute the solution for problem of [i,j] bases on the previous smaller problem. We trarvese the matrix programmatically from smaller to bigger. Repeat step 4 until reach the problem size NxM
for(int i=1; i <= n; i++)
{
for(int j=1; j <= m;j++)
{
Step 4: Compute the solution for problem of size [i,j] bases on previous smaller problem : [i-1, j], [i,j-1], [i-1, j-1]
cost=(t.Substring(j - 1, 1) == s.Substring(i - 1, 1) ? 0 : 1);
distance[i,j]=Min3(distance[i - 1, j] + 1,
distance[i, j - 1] + 1,
distance[i - 1, j - 1] + cost);
distance[i,j]=Min3(distance[i - 1, j] + 1, // if we base the problem [i-1, j] we need to insert one character. for example : from ABC to ABCD we need to append "D" to the first string.
distance[i, j - 1] + 1, // if we base the problem [i, j - 1] we need to delete one character. for example : from ABCD to ABC we need to delete "D".
distance[i - 1, j - 1] + cost);//if we base the problem [i - 1, j - 1] , if the S_characterAt(i-1) = S_characterAt(j-1) we have nothing to do, otherwise if they are different then we need to replace. for example ABCD to ABCE we need to replace "D" to "E".
}
}
return distance[n, m];
}
The smaller number of distance [n, m] the more similar two strings are.
There are several ways to compute the similarity value, may use some co-efficient. In example : Sim = (maxLen - distance[n, m]) / maxlen.
The idea of dynamic programming is quite simple. If the perfect solution for the smaller (previous) problem does not affect to the solution for the bigger problem (next) we can compute the prefect solution of the bigger problem based on solution of the previous smaller problems. Let you imagine the representation of problems on a hierarchical tree. A parent node is bigger problem, and a smaller is child node. To compute the solution for the parent node you do not need to look at its descendant just need to care its children.
Let me know if I am misunderstand your request and bear with me. I will be back to say about maximize bipartite matching algorithm. (I am currently sinking into research paper ).
Thanh Dao
|
|
|
|

|
I am unable to run the code that is avaiable for download with your article. It's not that there is a problem in the code. It's just that I do not know how to go about it. I am using Visual studio.net 2003. Please get me started. Thanks a lot.
Gaurav Gupta
|
|
|
|

|
If you intend to work on this topic, I strongly suggest you visit this library :
http://www.dcs.shef.ac.uk/~sam/stringmetrics.html[^]
There is lot of metrics there. I've also supplemented the TF/IDF measure which was missing in the library.
If you have any confusion on how to aggregrate/combinate multiple similarity metrics into a single one measurement, then feel free to drop me a line or can directly quiz the author of the above open source .
Thanh Dao
-- modified at 9:23 Saturday 29th October, 2005
|
|
|
|

|
The modified version of this algorithm gives us a lower complexity
Thanh Dao
-- modified at 6:45 Thursday 29th September, 2005
|
|
|
|

|
hu? what about 'Hopcroft-Karp' ?
Anyway I did some carefull reading of your code this time (instead superficial amazed test as before).
I already made some 'improvment' I hope you will like.
here they are:
=================================================
public class Levenshtein
{
public static int Min3(int a, int b, int c)
{
return a < b ? (a < c ? a : c) : (b < c ? b : c);
}
///
/// this distance goes from 0 to 1 + Max(s.Length, t.Length).
/// if they have different size they would be different
///
public static int GetMaxDistance(string s, string t)
{
int n = s.Length;
int m = t.Length;
int[,] distance = new int[n + 1, m + 1]; // matrix
int cost = 0;
if (n == 0) return m;
if (m == 0) return n;
for (int i = 0; i <= n; distance[i, 0] = i++) ;
for (int j = 0; j <= m; distance[0, j] = j++) ;
//find min distance
for (int i = 1; i <= n; i++)
{
for (int j = 1; j <= m; j++)
{
cost = t[j - 1] == s[i - 1] ? 0 : 1;
distance[i, j] = Min3(distance[i - 1, j] + 1,
distance[i, j - 1] + 1,
distance[i - 1, j - 1] + cost);
}
}
return distance[n, m];
}
///
/// this distance goes from 0 to 1 + Min(s.Length, t.Length).
/// it will be 1 or 0 if one word is caontained in the other
///
public static int GetMinDistance(string s, string t)
{
int n = s.Length;
int m = t.Length;
int[,] distance = new int[n + 1, m + 1]; // matrix
int cost = 0;
if (n == 0) return m;
if (m == 0) return n;
for (int i = 0; i <= n; distance[i, 0] = i++) ;
for (int j = 0; j <= m; distance[0, j] = j++) ;
//find min distance
for (int i = 1; i <= n; i++)
{
for (int j = 1; j <= m; j++)
{
cost = t[j - 1] == s[i - 1] ? 0 : 1;
distance[i, j] = Min3(distance[i - 1, j] + 1,
distance[i, j - 1] + 1,
distance[i - 1, j - 1] + cost);
}
}
if (n < m)
{
int ret = distance[n, n];
for (int i = n + 1; i <= m; i++)
{
int j = distance[n, i] - (i - n);
if (j < ret)
ret = j;
}
return ret;
}
else
{
int ret = distance[m, m];
for (int i = m + 1; i <= n; i++)
{
int j = distance[i, m] - (i - m);
if (j < ret)
ret = j;
}
return ret;
}
return distance[n, m];
}
=================================================
public class Tokeniser
{
static Dictionary accent;
static Tokeniser()
{
accent = new Dictionary();
accent['à'] = 'a';
accent['â'] = 'a';
accent['ä'] = 'a';
accent['á'] = 'a';
accent['é'] = 'e';
accent['è'] = 'e';
accent['ê'] = 'e';
accent['ë'] = 'e';
accent['ì'] = 'i';
accent['î'] = 'i';
accent['ï'] = 'i';
accent['ò'] = 'o';
accent['ô'] = 'o';
accent['ö'] = 'o';
accent['ù'] = 'u';
accent['û'] = 'u';
accent['ü'] = 'u';
accent['ç'] = 'c';
accent['’'] = '\'';
accent['ñ'] = 'n';
}
public static List Tokenize(string str)
{
Set delimiters = new Set();
delimiters.AddRange(new char[] {
' ', '\t', '\r', '\n', ',',
'.', ',', ':', ';', '?', '!',
'(', ')', '{', '}', '[', ']' });
return Tokenize(str, delimiters);
}
public static List Tokenize(string str, Set delimiters)
{
List ret = new List();
StringBuilder sb = new StringBuilder();
foreach (char c in str)
{
char c2 = Simplify(c);
if (delimiters.Contains(c2))
{
if (sb.Length > 0)
ret.Add(sb.ToString());
sb.Length = 0;
}
else
sb.Append(c2);
}
if (sb.Length > 0)
ret.Add(sb.ToString());
return ret;
}
public static char Simplify(char c)
{
if (Char.IsSeparator(c))
return ' ';
c = Char.ToLower(c);
if (accent.ContainsKey(c))
c = accent[c];
return c;
}
public static string Simplify(string s)
{
StringBuilder sb = new StringBuilder();
foreach (char c in s)
sb.Append(Simplify(c));
return sb.ToString();
}
}
=================================================
-- modified at 7:20 Wednesday 28th September, 2005
|
|
|
|

|
I just saw and it looks very interesting, and I will test it soon .
I don't stress the complexity of matching algorithm, this is not an issue. Anyway, we can apply a simple fast and acceptable heuristic method instead of using a graph matching algorithm(I will present it in a new article Semantic similarity by 1-oct). In some particular situations, hopcroft-karp sounds good, but we need to make some changes for maximal total weight problem, because it was original designed for maximizing number of matching pairs(mariage theorem). I just personally made some greedy modifications and have yet tested carefully.
Thank you very much for the good work,
Thanh Dao
|
|
|
|

|
Please note that:
1. I submit my change in a hurry but there was a bug... I fixed it inline 50 minutes later...
Now GetMinDistance() work very nicely for functionality as the "look for" text box in the SDK documentation (even better dare I say)
2. for the Tokenizer I use a class Set found in the PowerCollection, free & OpenSource library available there:
http://www.wintellect.com/powercollections/[^]
|
|
|
|

|
I am sorry if it took so long to get back, i've tested your code yet, but your tokenizer seems only work only VS 2005.
May be a next few days I will update a new efficient tokenizer class which works better than the current, also a combination trigram, bigram, and soundex, affixes (prefix -sufix).
Thanh Dao
|
|
|
|

|
Regards,
unruledboy@hotmail.com
|
|
|
|

|
Thanh Dao
|
|
|
|

|
it becomes:
string beforeConversion = "嗬饴淠崃樯枞晔胨焯钗锵蛞粼鲋缜採";
string afterConversion = "aAaAaAaAeEeEeEeEiIiIiIoOoOoOuUuUuUcC'n";
System.Text.StringBuilder sb = new System.Text.StringBuilder(input);
for (int i = 0; i < beforeConversion.Length; i++)
{
char beforeChar = beforeConversion[i];
char afterChar = afterConversion[i];
sb.Replace(beforeChar, afterChar);
}
sb.Replace("?, "oe");
sb.Replace("?, "ae");
I suggest that you handle these characters but their char number, not the char itself.
Regards,
unruledboy@hotmail.com
|
|
|
|

|
Do you mean that "" are confused ? or the string beforConversion self-became that when module was running under a chinese OS ?
I think that 3 above chars look alike but they could not be treated as the same since they have the different order : 58058, 58248, 58343 , respectively. I just think that: Does the string class treat this problem automatically ?
if the code sb.Replace("œ", "oe"); self-becomes sb.Replace("?, "oe"); , this problem is a serious error.
Thanh Dao
|
|
|
|

|
Thanh
I just saw and though it is obvious that many of words in a string/sentence do not describle the string content. words like THE,IS ,AN , OF...are small, frequently occuring words and often ignored.
Many search engine/IR using automatic document indexing those non significant words are reomoved from the document vector, so document will only be represented by content bearing words.
|
|
|
|

|
Suggest using probalistic indexing ranks so the stop word are modeled by a poison distribution.
Martin
|
|
|
|

|
Hi Martin,
Thank you very much. That's great. I think that the number of words in a string(which in this case) is not high enought to use probalistic indexing ranks so it is difficult to implement term frequency in automatic indexing. Instead, I will use a common stop words list to remove the high frequency words.
Thanh Dao
|
|
|
|

|
Wildcards would be great! Example:
Doc* would be a perfect match for Doctor
Anyone thinks its worth adding?
|
|
|
|

|
This looks like a regular expression. Personally I was thinking of prefix/suffix matcher that mean if a word is prefix or postfix of another word then say that they're equivalent.
Prefix is efficient in case similar acronyms or cognate words.
These are syntactic similarity and it may not imply semantic relateness.
i.e: "net" and "network" but also the same with "hot" and "hotel".
Thanh Dao
|
|
|
|

|
Yeah.. maybe that would work.
Wouldn't be great if
abcd*ghi would match abcdefghi at 100%? I'm just throwing ideas around to improve this thing. Very useful code that you have here!
|
|
|
|

|
Hi Carl,
That's great but I need to think more about working with regular expression.
Thank you, have a nice day
Thanh Dao
|
|
|
|

|
Not something that I would ever need to do but I enjoyed reading about it.
|
|
|
|

|
Thank you ! I' glad to see that.
Thanh Dao
|
|
|
|
|

|
I' glad to know that's useful for you.
Here is the link to WordNet C# port:
http://opensvn.csie.org/WordNetDotNet/trunk/
Thanks
Thanh Dao
|
|
|
|

|
Unruled Boy wrote:
btw, where is latest source code of WordNet.NET? I did not find it at http://opensource.ebswift.com/WordNet.Net/[^]
Hello, I'm the one who released WordNet.Net to LGPL (based on the port of Malcom Crowe), great work Thanh!
The source is actually linked to the 'Subversion Repository' link on that page. The link is a version controlled source repository, and as you can see if you just go to that web page you would have to be downloading lots of different source files. The way to do it is to get yourself a subversion client and let it do the leg work for you. Once you download the source you can keep it up-to-date with nothing more than a right-click and a menu selection from Windows Explorer.
I have made a wiki page on how to use the repository with TortoiseSVN here: http://www.ebswift.com/Wiki/wikka.php?wakka=WordNetDotNetRepository[^].
~Troy Simpson
http://www.ebswift.com
|
|
|
|

|
Hi Troy,
Nice to see you here, Thank you very much!
Thanh Dao
|
|
|
|

|
This version now looks clear to understand and works better.
When would you release the semantic similairty measurement ?
Thanks,
Christ
|
|
|
|

|
the example will come with another article.
just a few day
|
|
|
|

|
Did you try with Resnik method ?
this is a link to his publications:
http://www.umiacs.umd.edu/users/resnik/pubs.html
Let me know something if i can help you. thanks again for this wonderful effort, cool guy
Christ
|
|
|
|

|
Yes, I've already read it. I am interested in what he said "the shorter connection path between nodes, the more similar they're...".
yeah, let code and help me ...i will mail you in private
thanks
Thanh Dao
|
|
|
|

|
Hi, it's great job to have this similarity code. However, I am not sure why the comparison of these two string causes infinite loop in the code. Could u help me to find out why?
here are the code that i put on the test.cs
public Test()
{
string s1="2 JALAN PEMBERITA U1/49 \nTEMASYA INDUSTRIAL PARK \nGLENMARIE 40150 SHAH ALAM SEL";
string s2="D-3-2 ARCADIA APARTMENT \nUSJ 11/1 SUBANG JAYA \n47600 PETALING JAYA";
MatchsMaker match=new MatchsMaker(s1, s2) ;
Console.WriteLine(match.Score) ;
Console.ReadLine() ;
}
Haris.
|
|
|
|

|
I tried to cut down the string by deleting few words to be like this. And it works perfectly.
string s1="2 JALAN PEMBERITA U1 49 TEMASYA INDUSTRIAL PARK GLENMARIE ";
string s2="D 3 2 ARCADIA APARTMENT USJ 11 1 SUBANG JAYA 47600 PETALING JAYA";
I guess it seems that the code can't get too long words or strings. Can anyone overcome this problem?
|
|
|
|
 |
|
|
General News Suggestion Question Bug Answer Joke Rant Admin
|
An improvement on capturing similarity between strings. The algorithm was developed in C#.
| Type | Article |
| Licence | |
| First Posted | 28 Jul 2005 |
| Views | 164,455 |
| Bookmarked | 126 times |
|
|