Click here to Skip to main content
15,749,352 members
Please Sign up or sign in to vote.
0.00/5 (No votes)
See more:
Hi,
I have the below scenario:
I have two huge lists;both are of huge dimensions;50k elements each of custom objects.
Now I need to compare each of them and find how many matches I get based on some property of the object.
Here is a sample of what I had been trying to achieve,even-though this is based on a simple string elements but the idea remains the same.
Here I am just creating 26 sublists-each of which starts with a,b and so on.
C#
List<string> Existing = new List<string>() { "abc","xyf","asder","dhfth","rert","nhfju","iulyuk","qefwef","deferwf","wfwef","fwer","zxsg","gewrg","trewt","fewr","gwer","rfewrfg","kyu","piol","qfgthy","rtgrh","rtgtg","rgtet","geger","mtmtey","hgrwgtbgt rtg","erg4twedc rtg","wt45t","wqfef","bdvf","zaqwe gt","lkujdr","brgtbhreth","wefwegf","qaddw","gerg","bdvsefgv"};
            List<string> New = new List<string>() { "dasfwe", "xyf", "asder", "bgh", "wafwef", "liuk", "xwdewd", "qefwef", "deferwf", "erfwe", "fwer", "zxsg", "gewrg", "trewt", "fewr", "gwer", "rfewrfg", "kyu", "piol", "qfgthy", "rtgrh", "rtgtg", "rgtet", "xcewf", "mtmtey", "wdwd rtg", "erg4twedc rtg"};
           Existing.Sort();//it will sort alphabetically
            New.Sort();//it will sort alphabetically
            List<string> Alphabets = new List<string>() { "a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r","s","t","u","v","w","x","y","z"};
            foreach (var character in Alphabets)
            {
                var existingList = Existing.Where(e => e.StartsWith(character)).ToList();
                var newList = New.Where(n => n.StartsWith(character)).ToList();
                //In the below 2 foreach loops we are only comparing items from the 2 lists where both the items starts with same character
                //instead of looping through entire list
                foreach (var newItem in newList)
                {
                    foreach (var existingItem in existingList)
                    {
                        if (newItem == existingItem)
                            Console.WriteLine("Match Found!");//Do our logic here!
                    }
                }
           
            }

The problem is:it's taking huge time!!
How can I do it more efficiently?
Posted
Updated 8-May-15 0:17am
v2

I think I can help you.. how do you like this as a teaser? :)
Original Method:   20000 matches, 4197 ms
Dictionary Method: 20000 matches,   27 ms


As long as you "only" want to compare some ten-thousand strings and not billions, you can just use a Dictionary. In a first loop, fill in the "existing" values, then in a second loop look up if a "new" value is found.

I modified your code a bit, but it's essentially the same. Take a look at it and in case you have questions, please leave a comment.

C#
static void Main()
{
    List<string> prejit = new List<string>() { "abc", "def" };

    OriginalMethod(prejit, prejit);
    DictionaryMethod(prejit, prejit);

    List<string> Existing = new List<string>();
    List<string> New = new List<string>();

    for (int i = 0; i < 20000; i++)
    {
        string s = GetRandomString();
        Existing.Add(s);
        New.Add(s);
    }

    for (int i = 0; i < 80000; i++)
        Existing.Add(GetRandomString());

    for (int i = 0; i < 80000; i++)
        New.Add(GetRandomString());

    OriginalMethod(Existing, New);
    DictionaryMethod(Existing, New);

    Console.ReadKey();
}

static string GetRandomString()
{
    string path = System.IO.Path.GetRandomFileName();
    while(!Char.IsLetter(path[0]))
        path = System.IO.Path.GetRandomFileName();
    return path;
}

static void OriginalMethod(List<string> Existing, List<string> New)
{
    var Alphabets = new List<string>() { "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z" };

    var sw = System.Diagnostics.Stopwatch.StartNew();

    int matchCount = 0;

    foreach (var character in Alphabets)
    {
        var existingList = Existing.Where(e => e.StartsWith(character)).ToList();
        var newList = New.Where(n => n.StartsWith(character)).ToList();
        foreach (var newItem in newList)
            foreach (var existingItem in existingList)
                if (newItem == existingItem)
                    matchCount++;
    }

    var elapsed = sw.ElapsedMilliseconds;

    Console.WriteLine("Original Method: {0} matches, {1} ms", matchCount, elapsed);
}

static void DictionaryMethod(List<string> Existing, List<string> New)
{
    var existingLookup = new Dictionary<string, List<string>>(Existing.Count);
    List<string> matches;
    int matchCount = 0;

    var sw = System.Diagnostics.Stopwatch.StartNew();

    foreach (string s in Existing)
    {
        if (!existingLookup.TryGetValue(s, out matches))
        {
            matches = new List<string>();
            existingLookup.Add(s, matches);
        }
        matches.Add(s);
    }

    foreach (string s in New)
        if (existingLookup.TryGetValue(s, out matches))
            matchCount += matches.Count;

    var elapsed = sw.ElapsedMilliseconds;

    Console.WriteLine("Dictionary Method: {0} matches, {1} ms", matchCount, elapsed);
}

If all you need is to know whether a matching "existing" value exists or not (and not which one / which ones) you can simplify and increase its "speed" even more by replacing the Dictionary<string, List<string>> by a HashSet<string>.
 
Share this answer
 
LINQ!!

replace it with this:
C#
List<string> Existing = new List<string>() { "abc", "xyf", "asder", "dhfth", "rert", "nhfju", "iulyuk", "qefwef", "deferwf", "wfwef", "fwer", "zxsg", "gewrg", "trewt", "fewr", "gwer", "rfewrfg", "kyu", "piol", "qfgthy", "rtgrh", "rtgtg", "rgtet", "geger", "mtmtey", "hgrwgtbgt rtg", "erg4twedc rtg", "wt45t", "wqfef", "bdvf", "zaqwe gt", "lkujdr", "brgtbhreth", "wefwegf", "qaddw", "gerg", "bdvsefgv" };
List<string> New = new List<string>() { "dasfwe", "xyf", "asder", "bgh", "wafwef", "liuk", "xwdewd", "qefwef", "deferwf", "erfwe", "fwer", "zxsg", "gewrg", "trewt", "fewr", "gwer", "rfewrfg", "kyu", "piol", "qfgthy", "rtgrh", "rtgtg", "rgtet", "xcewf", "mtmtey", "wdwd rtg", "erg4twedc rtg" };
Existing.Sort();//it will sort alphabetically
New.Sort();//it will sort alphabetically
List<string> Alphabets = new List<string>() { "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z" };
foreach (
    var existingList in
    from character in Alphabets
    let existingList = Existing.Where(e => e.StartsWith(character)).ToList()
    let newList = New.Where(n => n.StartsWith(character)).ToList()
    from newItem in
        (from newItem in newList from existingItem in existingList.Where(existingItem => newItem == existingItem)
         select newItem)
    select existingList)
{
    Console.WriteLine("Match Found!");//Do our logic here!
}


PS: the remaining loop can also be removed depending on what needs to happen instead of "Console.WriteLine("Match Found!")" but this should be MUCH faster already.

PPS: I cheated and used ReSharper(tm) to produce the code

PPPS: from comments:


C#
//StartsWith does some error checking on a 'per object' basis.  It has no idea what string you are throwing at it.  You can reduce the 'per object' checks if you are fully aware of what your strings may contain
Existing
    .Where(e=>e.length>0)
    .Select(e=>new {full = e, first = e.ToCharArray().First()})
    .Where(c=>c.first==character)
    .Select(c=>c.full).ToList();

//OR even
Existing
    .Where(e=>e.length>0)
    .Where(e=>e.ToCharArray().First()==character)
    .ToList();
 
Share this answer
 
v3
Comments
avishekrc 8-May-15 6:13am    
Existing.Where(e => e.StartsWith(character)).ToList()///this is where it is taking time.
Andy Lanng 8-May-15 6:37am    
That's already a pretty damn refined clause. I don't know if it would be any better but you could try a custom 'startswith'. It might be better:

Existing.Where(e=>e.length>0).select(e=>e.ToCharArray().First()).Where(c=>c==character).ToList();

I added some error checking by checking the string length first. If you know that e.length cannot be zero then you could exclude that. string.StartsWith does some simple error checking but most of it is probably useless to you in this case. This version doesn't require any 'per object' process so it should be 'quite' faster... maybe :S
Andy Lanng 8-May-15 6:39am    
I just notice that you don't actually get the string back at the end. I'll try again and add it to the solution

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)



CodeProject, 20 Bay Street, 11th Floor Toronto, Ontario, Canada M5J 2N8 +1 (416) 849-8900