A Performant Items in List A that are not in List B

Marc Clifton

3.91/5 (8 votes)

Jan 30, 2021

CPOL

3 min read

11652

Performant Items in List A that are not in List B

Download source - 1.6 KB

Introduction
The Test Setup
Running the Tests
The Except Method
Using Where
Creating an Extension Method
Providing a Comparitor
Adding Selectors
Using a HashSet
A Performant List A not in List B
Conclusion
History

Introduction

After reading some Stack Overflow posts on how to find items in List A that are not in List B, and the performance issues of some of the solutions (and encountering said performance issues in real life), I decided to look at how to create a performant "not in" method that meets the requirements that I have, namely that I'm not comparing simple value types, but rather a property (or properties) within a different classes.

Note that I've ignored the whole "override Equals and GetHashCode" concept because it doesn't add anything to the solution.

The Test Setup

My test setup creates 5000 integer entries, 0-4999, and another 5000 entries, 5000-9999, in a couple lists so I can test both direct value comparison and values contained as a property in a simple test class.

Given:

class A
{
    public int Value { get; set; }
}

class B
{
    public int Value { get; set; }
}

and:

static List<A> listOfA;
static List<B> listOfB;
static List<int> listA;
static List<int> listB;
static int samples = 5000;
static int repeat = 10;

The test cases are set up as:

<a>public static void CreateTestData(int samples)
{
    listA = new List<int>();
    listB = new List<int>();
    listOfA = new List<A>();
    listOfB = new List<B>();

    samples.ForEach(n => listA.Add(n));
    samples.ForEach(n => listB.Add(n + samples));

    samples.ForEach(n => listOfA.Add(new A() { Value = n }));
    samples.ForEach(n => listOfB.Add(new B() { Value = n + samples }));
}

Running the Tests

Several different implementations are tested for their performance:

static void Main(string[] args)
{
    CreateTestData(samples);
    Test("Except", ValueNotInUsingExcept);
    Test("Where", ValueNotInUsingWhere);
    Test("NotIn with Equals", ValueNotInUsingEquals);
    Test("NotIn with Comparitor", ValueNotInUsingComparitor);
    Test("NotIn with Selector and Comparitor", ValueNotInUsingSelectorAndComparitor);
    Test("HashedNotIn with Selector and Comparitor", 
          ValueNotInUsingHashedSelectorAndComparitor);
    Test("ExceptNotIn with Selector and Comparitor", ValueNotInUsingExceptSelector);

    Console.WriteLine("\r\nDone.\r\nPress ENTER to close window.");
    Console.ReadLine();
}

And the supporting code to run the tests, the idea being to run each test 10 times and average the time each test took to run:

public static void Test(string msg, Action action)
{
    var avg = repeat.Average(() => Timing.Time(action));

    Console.WriteLine($"{msg}: {avg}ms");
}

and:

public static decimal Average(this int n, Func<long> func)
{
    long count = 0;
    n.ForEach(() => count += func());
    decimal avg = count / (decimal)n;

    return avg;
}

and we use the built in Stopwatch class:

static class Timing
{
    public static long Time(Action action)
    {
        Stopwatch sw = new Stopwatch();
        sw.Start();
        action();
        sw.Stop();

        return sw.ElapsedMilliseconds;
    }
}

The Except Method

Linq has this amazing Except extension method which is very very fast. So, given a simple list of values, we can get the items in list A that are not in list B:

public static void ValueNotInUsingExcept()
{
    // force evaluation with ToList()
    var items = listA.Except(listB).ToList();
    Assertion.Assert(items.Count == samples, $"Expected {samples}.");
}

And it really is fast:

Except: 2.2ms

Using Where

One of the suggestions I encountered was to code the "not in" function using Linq's Where method:

public static void ValueNotInUsingWhere()
{
    // force evaluation with ToList()
    var items = listA.Where(a => !listB.Any(b => a == b)).ToList();
    Assertion.Assert(items.Count == samples, $"Expected {samples}.");
}

We can immediately start to see a performance degradation: Where: 290.2ms

Creating an Extension Method

And the performance degrades further when we create an extension method, which forces us to use the Equals method. Given:

public static void ValueNotInUsingEquals()
{
    // force evaluation with ToList()
    var items = listA.NotIn(listB).ToList();
    Assertion.Assert(items.Count == samples, $"Expected {samples}.");
}

implemented as:

public static IEnumerable<T> NotIn<T>
    (this IEnumerable<T> listA, IEnumerable<T> listB) where T : struct
{
    var items = listA.Where(a => !listB.Any(b => a.Equals(b)));

    return items;
}

The performance gets worse:

NotIn with Equals: 426.5ms

I'm going to continue using an extension method approach however, as we will get to the ultimate point in a bit.

Providing a Comparitor

Interestingly, providing a comparitor improves performance. Given:

public static void ValueNotInUsingComparitor()
{
    // force evaluation with ToList()
    var items = listA.NotIn(listB, (a, b) => a == b).ToList();
    Assertion.Assert(items.Count == samples, $"Expected {samples}.");
}

implemented as:

public static IEnumerable<T> NotIn<T>
    (this IEnumerable<T> listA, IEnumerable<T> listB, Func<T, T, bool> comparitor)
{
    var items = listA.Where(a => !listB.Any(b => comparitor(a, b)));

    return items;
}

We see an slight improvement:

NotIn with Comparitor: 348.4ms

Adding Selectors

Now let's add selectors to extract the values we want to compare from a property of a class. Given:

public static void ValueNotInUsingSelectorAndComparitor()
{
    // force evaluation with ToList()
    var items = listOfA.NotIn(listOfB, a => a.Value, b => b.Value, (a, b) => a == b).ToList();
    Assertion.Assert(items.Count == samples, $"Expected {samples}.");
}

implemented as:

public static IEnumerable<A> NotIn<A, B, Q>(this IEnumerable<A> source, 
    IEnumerable<B> target, Func<A, Q> sourceSelector, Func<B, Q> targetSelector, 
    Func<Q, Q, bool> comparitor)
{
    var items = source.Where(s => !target.Any
                (t => comparitor(sourceSelector(s), targetSelector(t))));

    return items;
}

The performance degrades considerably:

NotIn with Selector and Comparitor: 711.1ms

Using a HashSet

If we use a HashSet for the list "B" target:

public static void ValueNotInUsingHashedSelectorAndComparitor()
{
    // force evaluation with ToList()
    var items = listOfA.HashedNotIn(listOfB, a => a.Value, 
                                    b => b.Value, (a, b) => a == b).ToList();
    Assertion.Assert(items.Count == samples, $"Expected {samples}.");
}

implemented as:

public static IEnumerable<A> HashedNotIn<A, B, Q>(this IEnumerable<A> source, 
       IEnumerable<B> target, Func<A, Q> sourceSelector, Func<B, Q> targetSelector, 
       Func<Q, Q, bool> comparitor)
{
    var hashedTargets = new HashSet<Q>(target.Select(t => targetSelector(t)));
    var items = source.Where(s => !hashedTargets.Any(t => comparitor(sourceSelector(s), t)));

    return items;
}

We actually see a slight performance improvement:

HashedNotIn with Selector and Comparitor: 524.2ms

This still falls very short of the performance of the Except method.

A Performant List A not in List B

So in my final implementation, I take the ability to provide selectors and use the Except method, returning the instances of A that are not in B for the given selectors. With:

public static void ValueNotInUsingExceptSelector()
{
    // force evaluation with ToList()
    var items = listOfA.ExceptNotIn(listOfB, a => a.Value, b => b.Value).ToList();
    Assertion.Assert(items.Count == samples, $"Expected {samples}.");
}

implemented as:

/// <summary>
/// Will only return a single entry of A when A1 and A2 select the same value.
/// </summary>
public static IEnumerable<A> ExceptNotIn<A, B, Q>(this IEnumerable<A> source, 
    IEnumerable<B> target, Func<A, Q> sourceSelector, Func<B, Q> targetSelector)
{
    Dictionary<Q, A> dict = new Dictionary<Q, A>();
            
    var evaldSource = source.Select(s =>
    {
        var val = sourceSelector(s);
        dict[val] = s;
        return val;
    }).ToList();

    var targets = target.Select(t => targetSelector(t));
    var qitems = evaldSource.Except(targets);
    var items = qitems.Select(q => dict[q]);

    return items;
}

Suddenly, we achieve the same performance as the Except method using value types:

ExceptNotIn with Selector and Comparitor: 1.4ms

Which is rather impressive considering the work being done to create the dictionary and invoke the selectors.

The magic here is that we:

Create a dictionary of "selected source value" mapped to the source instance.
Preselect all the target values.
Use the Except method given the values.
Return the source instances based on the returned values.

As the comment indicates, if you have two instances of the source which have the same "select value", you'll only get one instance of the source returned. I can live with that as I usually know that my source instance values are distinct.

Conclusion

What can I say? It certainly behooves one to investigate the performance of the Linq in the context of how one is using Linq!

History

30^th January, 2021: Initial version

A Performant Items in List A that are not in List B

Contents