Click here to Skip to main content
Rate this: bad
good
Please Sign up or sign in to vote.
See more: C# SQL .NET string comparision , +
myTable row count is 150000 rows. I need to compare the values in NAME column. If I detect similarity <= similarity level and if the NUM column values are the same I should log it.
They way I loop through this table takes forever. What other solutions are there?
DataTable dt1 = new DataTable();
dt1.Load(DbInfo.DataRdr(Conn, "SELECT * FROM myTable"));
for (int i = 0; i < dt1.Rows.Count; i++)
{
 for (int j = 0; i + 1 < dt1.Rows.Count; j++)
 {
   if (dt1.Rows[i]["NUM"].ToString() == dt1.Rows[j]["NUM"].ToString())
   {
      if (dt1.Rows[i]["Name"].ToString().
LevenshteinDistance(dt1.Rows[j]["Name"].ToString()) <= 10)
      {
        Logging.Write(...);
       }
   }
  }
}
    public static int LevenshteinDistance(this string s, string t)
    {
        if (s == null)
            throw new ArgumentNullException("s");
        if (t == null)
            throw new ArgumentNullException("t");
        int n = s.Length;
        int m = t.Length;
        int[,] d = new int[n+1,m+1];
        if (n == 0 || m == 0)
            return Math.Max(m, n);
        for (int i = 0; i <= n; i++)
        {
            d[i, 0] = i;
        }
        for (int i = 0; i < m; i++)
        {
            d[0, i] = i;
        }
        for (int i = 0; i < n; i++)
        {
            for (int j = 0; j < m; j++)
            {
                int cost = (t[j] == s[i]) ? 0 : 1;
                d[i + 1, j + 1] = Math.Min(Math.Min(d[i, j + 1] + 1, d[i + 1, j] + 1), d[i, j] + cost);
            }
        }
        return d[n, m];
    }myTable row count is 150000 rows.
I need to compare the values in NAME column. If I detect similarity <= similarity level and if the NUM column values are the same I should log it. 
 
They way I loop through this table takes forever. What other solutions are there?
 
<pre lang="cs">DataTable dt1 = new DataTable();
dt1.Load(DbInfo.DataRdr(Conn, "SELECT * FROM myTable"));
 
   for (int i = 0; i < dt1.Rows.Count; i++)
   {
 
     for (int j = 0; j < dt1.Rows.Count; j++)
     {
       if (dt1.Rows[i]["Name"].ToString().LevenshteinDistance(dt1.Rows[j]            ["Name"].ToString()) <= 10)
       {
          if (dt1.Rows[i]["NUM"].ToString() == dt1.Rows[i]["NUM"].ToString())
          {
            Logging.Write(...);
           }
       }
      }
 
  }
Posted 8-Jul-11 10:27am
Edited 8-Jul-11 10:54am
v2
Rate this: bad
good
Please Sign up or sign in to vote.

Solution 1

You can start with
 
for (int j = i + 1; j < dt1.Rows.Count; j++)
 
since once two words have been compared, the inverse is always the same number.
  Permalink  
Comments
Sevak Abedi at 8-Jul-11 15:39pm
   
thx
Rate this: bad
good
Please Sign up or sign in to vote.

Solution 2

Assuming that you are only interested in similar words, then you can modify LevenshteinDistance so that you will stop comparing 2 strings as soon as the maximum similarity level has been reached.
 
public static int? LevenshteinDistance(this string s, string t, 
    int similarityLimit)
{
    // Some code...

    // In the inner loop, add:
    if (cost > similarityLevel)
    {
        return null;
    }
}
 
And in calling code:
int limit = 10;
int? similarity = LevenshteinDistance(row1, row2, limit);
bool isSimilar = similarity.HasValue && similarity.Value <= limit;
 
Assuming that a lot of pairs have a cost much higher than the limit, it will greatly reduce the execution time.
 
Then since you are only interested to know if NAME ar similar when NUM are equal, then you might be able to reverse those 2 tests since the second one (a string comparison) should be much faster. Thus you can eliminate most calls to LevenshteinDistance assuming that NUM are not all equals.
 
Thus your inner loop might look like this:
// First do the "fast" check
if (dt1.Rows[i]["NUM"].ToString() == dt1.Rows[i]["NUM"].ToString())
{
    // Then do the "slow" check...
    if (dt1.Rows[i]["Name"].ToString().LevenshteinDistance(
        dt1.Rows[j]["Name"].ToString()) <= 10)
    {
        Logging.Write(...);
    }
}
 
Then assuming that you have more than a few percents of NUM that are equals, you can sort items by their NUM. One way to do it would be to sort the data returned by the reader using a query like:
SELECT * FROM myTable ORDER BY NUM
If the database is indexed, it might be the faster option. Otherwise, it might be done by using something like Dictionary<string, List<int>> where each distinct NUM are added to the dictionary and the list is filled with the indexes in the table.
 
var numDictionary = new Dictionary<string, List<int>>;
for (int i = 0; i < dt1.Rows.Count; i++)
{
    var rowNum = dt1.Rows[i]["NUM"].ToString();
 
    List<int> listIndexes;
    if (!numDictionary.TryGetValue(rowNum, out listIndexes))
    {
        list = new List<int>;
        numDictionary.Add(rowNum, list);
    }
    list.Add(i);
}
 
Then you can compute the LevenshteinDistance for each list in the dictionary using nested loop as you have done in the original algorithm.
 
Note that the internal loop can stop at the index of the external loop. That is, stop just before j is equal to i.
 
If the order does not matters, then you can write items in the log directly. If not, fill some structures with the information you wil need to output and sort that structure before outputting the results.
  Permalink  
v2

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

  Print Answers RSS
0 George Jonsson 215
1 Kornfeld Eliyahu Peter 169
2 Zoltán Zörgő 139
3 PIEBALDconsult 130
4 OriginalGriff 120
0 OriginalGriff 6,165
1 DamithSL 4,658
2 Maciej Los 4,107
3 Kornfeld Eliyahu Peter 3,649
4 Sergey Alexandrovich Kryukov 3,342


Advertise | Privacy | Mobile
Web02 | 2.8.141220.1 | Last Updated 9 Jul 2011
Copyright © CodeProject, 1999-2014
All Rights Reserved. Terms of Service
Layout: fixed | fluid

CodeProject, 503-250 Ferrand Drive Toronto Ontario, M3C 3G8 Canada +1 416-849-8900 x 100