Comparing two lists with value selectors, implemented as an extension method, that has the similar performance as the Except Linq method.
Contents
After reading some Stack Overflow posts on how to find items in List A that are not in List B, and the performance issues of some of the solutions (and encountering said performance issues in real life), I decided to look at how to create a performant "not in" method that meets the requirements that I have, namely that I'm not comparing simple value types, but rather a property (or properties) within a different classes.
Note that I've ignored the whole "override Equals
and GetHashCode
" concept because it doesn't add anything to the solution.
My test setup creates 5000 integer entries, 0-4999, and another 5000 entries, 5000-9999, in a couple lists so I can test both direct value comparison and values contained as a property in a simple test class.
Given:
class A
{
public int Value { get; set; }
}
class B
{
public int Value { get; set; }
}
and:
static List<A> listOfA;
static List<B> listOfB;
static List<int> listA;
static List<int> listB;
static int samples = 5000;
static int repeat = 10;
The test cases are set up as:
<a>public static void CreateTestData(int samples)
{
listA = new List<int>();
listB = new List<int>();
listOfA = new List<A>();
listOfB = new List<B>();
samples.ForEach(n => listA.Add(n));
samples.ForEach(n => listB.Add(n + samples));
samples.ForEach(n => listOfA.Add(new A() { Value = n }));
samples.ForEach(n => listOfB.Add(new B() { Value = n + samples }));
}
Several different implementations are tested for their performance:
static void Main(string[] args)
{
CreateTestData(samples);
Test("Except", ValueNotInUsingExcept);
Test("Where", ValueNotInUsingWhere);
Test("NotIn with Equals", ValueNotInUsingEquals);
Test("NotIn with Comparitor", ValueNotInUsingComparitor);
Test("NotIn with Selector and Comparitor", ValueNotInUsingSelectorAndComparitor);
Test("HashedNotIn with Selector and Comparitor",
ValueNotInUsingHashedSelectorAndComparitor);
Test("ExceptNotIn with Selector and Comparitor", ValueNotInUsingExceptSelector);
Console.WriteLine("\r\nDone.\r\nPress ENTER to close window.");
Console.ReadLine();
}
And the supporting code to run the tests, the idea being to run each test 10 times and average the time each test took to run:
public static void Test(string msg, Action action)
{
var avg = repeat.Average(() => Timing.Time(action));
Console.WriteLine($"{msg}: {avg}ms");
}
and:
public static decimal Average(this int n, Func<long> func)
{
long count = 0;
n.ForEach(() => count += func());
decimal avg = count / (decimal)n;
return avg;
}
and we use the built in Stopwatch
class:
static class Timing
{
public static long Time(Action action)
{
Stopwatch sw = new Stopwatch();
sw.Start();
action();
sw.Stop();
return sw.ElapsedMilliseconds;
}
}
Linq has this amazing Except
extension method which is very very fast. So, given a simple list of values, we can get the items in list A
that are not in list B
:
public static void ValueNotInUsingExcept()
{
var items = listA.Except(listB).ToList();
Assertion.Assert(items.Count == samples, $"Expected {samples}.");
}
And it really is fast:
Except: 2.2ms
One of the suggestions I encountered was to code the "not in" function using Linq's Where
method:
public static void ValueNotInUsingWhere()
{
var items = listA.Where(a => !listB.Any(b => a == b)).ToList();
Assertion.Assert(items.Count == samples, $"Expected {samples}.");
}
We can immediately start to see a performance degradation: Where: 290.2ms
And the performance degrades further when we create an extension method, which forces us to use the Equals
method. Given:
public static void ValueNotInUsingEquals()
{
var items = listA.NotIn(listB).ToList();
Assertion.Assert(items.Count == samples, $"Expected {samples}.");
}
implemented as:
public static IEnumerable<T> NotIn<T>
(this IEnumerable<T> listA, IEnumerable<T> listB) where T : struct
{
var items = listA.Where(a => !listB.Any(b => a.Equals(b)));
return items;
}
The performance gets worse:
NotIn with Equals: 426.5ms
I'm going to continue using an extension method approach however, as we will get to the ultimate point in a bit.
Interestingly, providing a comparitor improves performance. Given:
public static void ValueNotInUsingComparitor()
{
var items = listA.NotIn(listB, (a, b) => a == b).ToList();
Assertion.Assert(items.Count == samples, $"Expected {samples}.");
}
implemented as:
public static IEnumerable<T> NotIn<T>
(this IEnumerable<T> listA, IEnumerable<T> listB, Func<T, T, bool> comparitor)
{
var items = listA.Where(a => !listB.Any(b => comparitor(a, b)));
return items;
}
We see an slight improvement:
NotIn with Comparitor: 348.4ms
Now let's add selectors to extract the values we want to compare from a property of a class. Given:
public static void ValueNotInUsingSelectorAndComparitor()
{
var items = listOfA.NotIn(listOfB, a => a.Value, b => b.Value, (a, b) => a == b).ToList();
Assertion.Assert(items.Count == samples, $"Expected {samples}.");
}
implemented as:
public static IEnumerable<A> NotIn<A, B, Q>(this IEnumerable<A> source,
IEnumerable<B> target, Func<A, Q> sourceSelector, Func<B, Q> targetSelector,
Func<Q, Q, bool> comparitor)
{
var items = source.Where(s => !target.Any
(t => comparitor(sourceSelector(s), targetSelector(t))));
return items;
}
The performance degrades considerably:
NotIn with Selector and Comparitor: 711.1ms
If we use a HashSet
for the list "B
" target:
public static void ValueNotInUsingHashedSelectorAndComparitor()
{
var items = listOfA.HashedNotIn(listOfB, a => a.Value,
b => b.Value, (a, b) => a == b).ToList();
Assertion.Assert(items.Count == samples, $"Expected {samples}.");
}
implemented as:
public static IEnumerable<A> HashedNotIn<A, B, Q>(this IEnumerable<A> source,
IEnumerable<B> target, Func<A, Q> sourceSelector, Func<B, Q> targetSelector,
Func<Q, Q, bool> comparitor)
{
var hashedTargets = new HashSet<Q>(target.Select(t => targetSelector(t)));
var items = source.Where(s => !hashedTargets.Any(t => comparitor(sourceSelector(s), t)));
return items;
}
We actually see a slight performance improvement:
HashedNotIn with Selector and Comparitor: 524.2ms
This still falls very short of the performance of the Except
method.
So in my final implementation, I take the ability to provide selectors and use the Except
method, returning the instances of A
that are not in B
for the given selectors. With:
public static void ValueNotInUsingExceptSelector()
{
var items = listOfA.ExceptNotIn(listOfB, a => a.Value, b => b.Value).ToList();
Assertion.Assert(items.Count == samples, $"Expected {samples}.");
}
implemented as:
public static IEnumerable<A> ExceptNotIn<A, B, Q>(this IEnumerable<A> source,
IEnumerable<B> target, Func<A, Q> sourceSelector, Func<B, Q> targetSelector)
{
Dictionary<Q, A> dict = new Dictionary<Q, A>();
var evaldSource = source.Select(s =>
{
var val = sourceSelector(s);
dict[val] = s;
return val;
}).ToList();
var targets = target.Select(t => targetSelector(t));
var qitems = evaldSource.Except(targets);
var items = qitems.Select(q => dict[q]);
return items;
}
Suddenly, we achieve the same performance as the Except
method using value types:
ExceptNotIn with Selector and Comparitor: 1.4ms
Which is rather impressive considering the work being done to create the dictionary and invoke the selectors.
The magic here is that we:
- Create a dictionary of "selected source value" mapped to the source instance.
- Preselect all the target values.
- Use the
Except
method given the values. - Return the source instances based on the returned values.
As the comment indicates, if you have two instances of the source which have the same "select value", you'll only get one instance of the source returned. I can live with that as I usually know that my source instance values are distinct.
What can I say? It certainly behooves one to investigate the performance of the Linq in the context of how one is using Linq!
- 30th January, 2021: Initial version