A Performant Items in List A that are not in List B






3.91/5 (8 votes)
Performant Items in List A that are not in List B
Contents
- Introduction
- The Test Setup
- Running the Tests
- The Except Method
- Using Where
- Creating an Extension Method
- Providing a Comparitor
- Adding Selectors
- Using a HashSet
- A Performant List A not in List B
- Conclusion
- History
Introduction
After reading some Stack Overflow posts on how to find items in List A that are not in List B, and the performance issues of some of the solutions (and encountering said performance issues in real life), I decided to look at how to create a performant "not in" method that meets the requirements that I have, namely that I'm not comparing simple value types, but rather a property (or properties) within a different classes.
Note that I've ignored the whole "override Equals
and GetHashCode
" concept because it doesn't add anything to the solution.
The Test Setup
My test setup creates 5000 integer entries, 0-4999, and another 5000 entries, 5000-9999, in a couple lists so I can test both direct value comparison and values contained as a property in a simple test class.
Given:
class A
{
public int Value { get; set; }
}
class B
{
public int Value { get; set; }
}
and:
static List<A> listOfA;
static List<B> listOfB;
static List<int> listA;
static List<int> listB;
static int samples = 5000;
static int repeat = 10;
The test cases are set up as:
<a>public static void CreateTestData(int samples)
{
listA = new List<int>();
listB = new List<int>();
listOfA = new List<A>();
listOfB = new List<B>();
samples.ForEach(n => listA.Add(n));
samples.ForEach(n => listB.Add(n + samples));
samples.ForEach(n => listOfA.Add(new A() { Value = n }));
samples.ForEach(n => listOfB.Add(new B() { Value = n + samples }));
}
Running the Tests
Several different implementations are tested for their performance:
static void Main(string[] args)
{
CreateTestData(samples);
Test("Except", ValueNotInUsingExcept);
Test("Where", ValueNotInUsingWhere);
Test("NotIn with Equals", ValueNotInUsingEquals);
Test("NotIn with Comparitor", ValueNotInUsingComparitor);
Test("NotIn with Selector and Comparitor", ValueNotInUsingSelectorAndComparitor);
Test("HashedNotIn with Selector and Comparitor",
ValueNotInUsingHashedSelectorAndComparitor);
Test("ExceptNotIn with Selector and Comparitor", ValueNotInUsingExceptSelector);
Console.WriteLine("\r\nDone.\r\nPress ENTER to close window.");
Console.ReadLine();
}
And the supporting code to run the tests, the idea being to run each test 10 times and average the time each test took to run:
public static void Test(string msg, Action action)
{
var avg = repeat.Average(() => Timing.Time(action));
Console.WriteLine($"{msg}: {avg}ms");
}
and:
public static decimal Average(this int n, Func<long> func)
{
long count = 0;
n.ForEach(() => count += func());
decimal avg = count / (decimal)n;
return avg;
}
and we use the built in Stopwatch
class:
static class Timing
{
public static long Time(Action action)
{
Stopwatch sw = new Stopwatch();
sw.Start();
action();
sw.Stop();
return sw.ElapsedMilliseconds;
}
}
The Except Method
Linq has this amazing Except
extension method which is very very fast. So, given a simple list of values, we can get the items in list A
that are not in list B
:
public static void ValueNotInUsingExcept()
{
// force evaluation with ToList()
var items = listA.Except(listB).ToList();
Assertion.Assert(items.Count == samples, $"Expected {samples}.");
}
And it really is fast:
Except: 2.2ms
Using Where
One of the suggestions I encountered was to code the "not in" function using Linq's Where
method:
public static void ValueNotInUsingWhere()
{
// force evaluation with ToList()
var items = listA.Where(a => !listB.Any(b => a == b)).ToList();
Assertion.Assert(items.Count == samples, $"Expected {samples}.");
}
We can immediately start to see a performance degradation: Where: 290.2ms
Creating an Extension Method
And the performance degrades further when we create an extension method, which forces us to use the Equals
method. Given:
public static void ValueNotInUsingEquals()
{
// force evaluation with ToList()
var items = listA.NotIn(listB).ToList();
Assertion.Assert(items.Count == samples, $"Expected {samples}.");
}
implemented as:
public static IEnumerable<T> NotIn<T>
(this IEnumerable<T> listA, IEnumerable<T> listB) where T : struct
{
var items = listA.Where(a => !listB.Any(b => a.Equals(b)));
return items;
}
The performance gets worse:
NotIn with Equals: 426.5ms
I'm going to continue using an extension method approach however, as we will get to the ultimate point in a bit.
Providing a Comparitor
Interestingly, providing a comparitor improves performance. Given:
public static void ValueNotInUsingComparitor()
{
// force evaluation with ToList()
var items = listA.NotIn(listB, (a, b) => a == b).ToList();
Assertion.Assert(items.Count == samples, $"Expected {samples}.");
}
implemented as:
public static IEnumerable<T> NotIn<T>
(this IEnumerable<T> listA, IEnumerable<T> listB, Func<T, T, bool> comparitor)
{
var items = listA.Where(a => !listB.Any(b => comparitor(a, b)));
return items;
}
We see an slight improvement:
NotIn with Comparitor: 348.4ms
Adding Selectors
Now let's add selectors to extract the values we want to compare from a property of a class. Given:
public static void ValueNotInUsingSelectorAndComparitor()
{
// force evaluation with ToList()
var items = listOfA.NotIn(listOfB, a => a.Value, b => b.Value, (a, b) => a == b).ToList();
Assertion.Assert(items.Count == samples, $"Expected {samples}.");
}
implemented as:
public static IEnumerable<A> NotIn<A, B, Q>(this IEnumerable<A> source,
IEnumerable<B> target, Func<A, Q> sourceSelector, Func<B, Q> targetSelector,
Func<Q, Q, bool> comparitor)
{
var items = source.Where(s => !target.Any
(t => comparitor(sourceSelector(s), targetSelector(t))));
return items;
}
The performance degrades considerably:
NotIn with Selector and Comparitor: 711.1ms
Using a HashSet
If we use a HashSet
for the list "B
" target:
public static void ValueNotInUsingHashedSelectorAndComparitor()
{
// force evaluation with ToList()
var items = listOfA.HashedNotIn(listOfB, a => a.Value,
b => b.Value, (a, b) => a == b).ToList();
Assertion.Assert(items.Count == samples, $"Expected {samples}.");
}
implemented as:
public static IEnumerable<A> HashedNotIn<A, B, Q>(this IEnumerable<A> source,
IEnumerable<B> target, Func<A, Q> sourceSelector, Func<B, Q> targetSelector,
Func<Q, Q, bool> comparitor)
{
var hashedTargets = new HashSet<Q>(target.Select(t => targetSelector(t)));
var items = source.Where(s => !hashedTargets.Any(t => comparitor(sourceSelector(s), t)));
return items;
}
We actually see a slight performance improvement:
HashedNotIn with Selector and Comparitor: 524.2ms
This still falls very short of the performance of the Except
method.
A Performant List A not in List B
So in my final implementation, I take the ability to provide selectors and use the Except
method, returning the instances of A
that are not in B
for the given selectors. With:
public static void ValueNotInUsingExceptSelector()
{
// force evaluation with ToList()
var items = listOfA.ExceptNotIn(listOfB, a => a.Value, b => b.Value).ToList();
Assertion.Assert(items.Count == samples, $"Expected {samples}.");
}
implemented as:
/// <summary>
/// Will only return a single entry of A when A1 and A2 select the same value.
/// </summary>
public static IEnumerable<A> ExceptNotIn<A, B, Q>(this IEnumerable<A> source,
IEnumerable<B> target, Func<A, Q> sourceSelector, Func<B, Q> targetSelector)
{
Dictionary<Q, A> dict = new Dictionary<Q, A>();
var evaldSource = source.Select(s =>
{
var val = sourceSelector(s);
dict[val] = s;
return val;
}).ToList();
var targets = target.Select(t => targetSelector(t));
var qitems = evaldSource.Except(targets);
var items = qitems.Select(q => dict[q]);
return items;
}
Suddenly, we achieve the same performance as the Except
method using value types:
ExceptNotIn with Selector and Comparitor: 1.4ms
Which is rather impressive considering the work being done to create the dictionary and invoke the selectors.
The magic here is that we:
- Create a dictionary of "selected source value" mapped to the source instance.
- Preselect all the target values.
- Use the
Except
method given the values. - Return the source instances based on the returned values.
As the comment indicates, if you have two instances of the source which have the same "select value", you'll only get one instance of the source returned. I can live with that as I usually know that my source instance values are distinct.
Conclusion
What can I say? It certainly behooves one to investigate the performance of the Linq in the context of how one is using Linq!
History
- 30th January, 2021: Initial version