Click here to Skip to main content
15,394,479 members
Please Sign up or sign in to vote.
4.00/5 (1 vote)
See more:
A string array with 8000 items has to be filtered into items starting with 'a' (=600 items).

Using LINQ is very slow
Using a simple loop is very fast.

Is there a away to improve the LINQ performance in this case?

What I have tried:

string[] items = { ... 8000 items from csv ...};
var foo = new List<string>();

// 3 seconds..
foo.AddRange(items.Where(item => item.StartsWith("a")))

// <1 second
foreach (var item in items)
    if (item.StartsWith("a")) foo.Add(item);
Posted
Updated 26-Sep-18 0:22am
Comments
Maciej Los 25-Sep-18 4:57am
   
Remove AddRange() method and try this:
List<string>() foo = items.Where(item => item.StartsWith("a")).ToList();
Sni.DelWoods 25-Sep-18 5:43am
   
AddRange() does not slow down the process, it's the LINQ expression.

Linq is not designed for performance, it is designed for ease of use. If you are dealing with a lot of data and performance is important then use the foreach loop.
   
Comments
Sni.DelWoods 25-Sep-18 5:46am
   
Seems that LINQ has absolutly no optimization on performance.
I would have exptected that LINQ is a bit slower on large, complex objects but not on a simple string array.

Nevermind, good to know so I'm building an extension method for that case.
Thanks.
Dave Kreskowiak 25-Sep-18 16:37pm
   
That's because LINQ has to use enumerable interfaces exposed by the collections it's working on. That is extra code that adds overhead to accessing items from a collection. A Where clause will go over every single item in a collection and calling your evaluation on them, looking for any item that satisfies the condition. That's a ton of overhead.

I suspect there's something else going on with your code, or with your computer. Three seconds to filter an array of 8000 items is unbelievably slow.

Using Benchmark.NET[^] to test the following code:
C#
public class LinqBenchmark
{
    private readonly string[] _data;

    public LinqBenchmark()
    {
        _data = new string[8000];

        var rnd = new Random();
        for (int i = 0; i < _data.Length; i++)
        {
            _data[i] = rnd.Next(0, 1000).ToString();
        }
    }

    [Benchmark]
    public List<string> For()
    {
        var result = new List<string>();
        for (int i = 0; i < _data.Length; i++)
        {
            if (_data[i].StartsWith("1"))
            {
                result.Add(_data[i]);
            }
        }

        return result;
    }

    [Benchmark]
    public List<string> ForEach()
    {
        var result = new List<string>();
        foreach (string item in _data)
        {
            if (item.StartsWith("1"))
            {
                result.Add(item);
            }
        }

        return result;
    }

    [Benchmark]
    public List<string> AddRange()
    {
        var result = new List<string>();
        result.AddRange(_data.Where(i => i.StartsWith("1")));
        return result;
    }

    [Benchmark]
    public List<string> ToList()
    {
        return _data.Where(i => i.StartsWith("1")).ToList();
    }
}
The results:
BenchmarkDotNet=v0.11.1, OS=Windows 10.0.17134.285 (1803/April2018Update/Redstone4)
Intel Core i7-4770K CPU 3.50GHz (Haswell), 1 CPU, 8 logical and 4 physical cores
Frequency=3417969 Hz, Resolution=292.5714 ns, Timer=TSC
  [Host]     : .NET Framework 4.7.2 (CLR 4.0.30319.42000), 32bit LegacyJIT-v4.7.3163.0
  DefaultJob : .NET Framework 4.7.2 (CLR 4.0.30319.42000), 32bit LegacyJIT-v4.7.3163.0


   Method |     Mean |     Error |    StdDev |
--------- |---------:|----------:|----------:|
      For | 772.1 us | 10.709 us | 10.017 us |
  ForEach | 766.5 us |  1.323 us |  1.238 us |
 AddRange | 796.3 us |  2.858 us |  2.534 us |
   ToList | 798.5 us | 12.835 us | 12.005 us |

// * Hints *
Outliers
  LinqBenchmark.AddRange: Default -> 1 outlier  was  removed

// * Legends *
  Mean   : Arithmetic mean of all measurements
  Error  : Half of 99.9% confidence interval
  StdDev : Standard deviation of all measurements
  1 us   : 1 Microsecond (0.000001 sec)

As you can see, none of the methods took more than one millisecond on average. The difference between them is less than 30 microseconds.

LINQ may not be the most performant option, but the difference isn't usually anywhere near as bad as the times you're quoting.
   
v2
Comments
Maciej Los 25-Sep-18 11:22am
   
Fantastic answer, Richard!
As to the AddRange() and ToList() method... As far as i remember, a ToList() method uses, in fact, AddRange(). That's the reason of very similar time of execution. ;)
Richard Deeming 25-Sep-18 11:26am
   
ToList[^] uses the List<T> constructor[^], which doesn't call the AddRange method[^] at all. :)
Maciej Los 25-Sep-18 11:27am
   
Have you tested with IlSpy?
Richard Deeming 25-Sep-18 11:29am
   
DotPeek shows:
[__DynamicallyInvokable]
public static List<TSource> ToList<TSource>(this IEnumerable<TSource> source)
{
    if (source == null)
        throw Error.ArgumentNull(nameof (source));
    return new List<TSource>(source);
}

And:
[__DynamicallyInvokable]
public List(IEnumerable<T> collection)
{
    if (collection == null)
        ThrowHelper.ThrowArgumentNullException(ExceptionArgument.collection);
    ICollection<T> objs = collection as ICollection<T>;
    if (objs != null)
    {
        int count = objs.Count;
        if (count == 0)
        {
            this._items = List<T>._emptyArray;
        }
        else
        {
            this._items = new T[count];
            objs.CopyTo(this._items, 0);
            this._size = count;
        }
    }
    else
    {
        this._size = 0;
        this._items = List<T>._emptyArray;
        foreach (T obj in collection)
            this.Add(obj);
    }
}
Richard Deeming 25-Sep-18 11:28am
   
I suspect a lot of the time will be spent resizing the internal storage of the List<T> class. If you could allocate enough room at the start, it would probably be significantly faster. :)
Really late to answer here, but I was curious what a slightly more rigorous exploration would yield. The method used to analyze/select your population, for such a small population (8K items), with such a simple filter (starts with 'a'), should be largely irrelevant. Even with a population of 100M items, the worst I could do was about 2 seconds.

Instead, its far more likely that your problem is with the generation/reading of your items.

I am guessing you must have an exceptionally slow CSV reader. Even for the comparatively small 8K item size, unless you have a huge character size per item, I would expect fast performance from even the worst CSV reader (of which I am aware).

Results and test harness follow...

Results:

StartsWithA: 00:00:31.3843456
GetMatchesIndexCount: 00:00:00.4149453
GetMatchesIndexListAdHoc: 00:00:00.4930803
GetMatchesIndexListPreAllocated: 00:00:00.4762712
GetMatchesIndexListTruncated: 00:00:00.4896025
GetMatchesForeachCount: 00:00:00.4298655
GetMatchesForeachListAdHoc: 00:00:00.4599720
GetMatchesForeachListPreAllocated: 00:00:00.4488830
GetMatchesForeachListTruncated: 00:00:02.0583127
GetMatchesLinqArray: 00:00:00.5453610
GetMatchesLinqList: 00:00:00.4848105

Program:

using System;
using System.Linq;
using System.Text;
using System.Collections.Generic;
using System.Diagnostics;

namespace PrefixBenchmark
{
  public class Program
  {
    public const int OddsOfA = 26; // 1:26 odds of starting with "a".
    public const int SampleCount = 100000000; // Number of strings to use while testing
    public const int MaximumLength = 5; // Maximum string length

    public static void Main(string[] args)
    {
      var stopwatch = new Stopwatch();

      stopwatch.Restart();
      string[] sequence = StartsWithA(SampleCount, new Random(), OddsOfA, MaximumLength);
      stopwatch.Stop();
      Console.WriteLine($"{nameof(StartsWithA)}: {stopwatch.Elapsed}");

      Time(nameof(GetMatchesIndexCount), stopwatch, sequence, GetMatchesIndexCount);
      Time(nameof(GetMatchesIndexListAdHoc), stopwatch, sequence, GetMatchesIndexListAdHoc);
      Time(nameof(GetMatchesIndexListPreAllocated), stopwatch, sequence, GetMatchesIndexListPreAllocated);
      Time(nameof(GetMatchesIndexListTruncated), stopwatch, sequence, GetMatchesIndexListTruncated);
      Time(nameof(GetMatchesForeachCount), stopwatch, sequence, GetMatchesForeachCount);
      Time(nameof(GetMatchesForeachListAdHoc), stopwatch, sequence, GetMatchesForeachListAdHoc);
      Time(nameof(GetMatchesForeachListPreAllocated), stopwatch, sequence, GetMatchesForeachListPreAllocated);
      Time(nameof(GetMatchesForeachListTruncated), stopwatch, sequence, GetMatchesForeachListTruncated);
      Time(nameof(GetMatchesLinqArray), stopwatch, sequence, GetMatchesLinqArray);
      Time(nameof(GetMatchesLinqList), stopwatch, sequence, GetMatchesLinqList);
    }


    private static T Time<T>(string name, Stopwatch stopwatch, string[] sequence,
      Func<string[], T> test)
    {
      stopwatch.Restart();
      T result = test(sequence);
      stopwatch.Stop();
      Console.WriteLine($"{name}: {stopwatch.Elapsed}");
      return result;
    }

    private static int GetMatchesIndexCount(string[] sequence)
    {
      int length = sequence.Length;
      int count = 0;
      for (int index = 0; index < length; index++)
        if (sequence[index].StartsWith('a'))
          count++;
      return count;
    }

    private static List<string> GetMatchesIndexListAdHoc(string[] sequence)
    {
      int length = sequence.Length;
      var list = new List<string>();
      for (int index = 0; index < length; index++)
      {
        string candidate = sequence[index];
        if (candidate.StartsWith('a'))
          list.Add(candidate);
      }
      return list;
    }

    private static List<string> GetMatchesIndexListPreAllocated(string[] sequence)
    {
      int length = sequence.Length;
      var list = new List<string>(length);
      for (int index = 0; index < length; index++)
      {
        string candidate = sequence[index];
        if (candidate.StartsWith('a'))
          list.Add(candidate);
      }
      return list;
    }

    private static List<string> GetMatchesIndexListTruncated(string[] sequence)
    {
      int length = sequence.Length;
      var list = new List<string>(length);
      for (int index = 0; index < length; index++)
      {
        string candidate = sequence[index];
        if (candidate.StartsWith('a'))
          list.Add(candidate);
      }
      list.TrimExcess();
      return list;
    }

    private static int GetMatchesForeachCount(string[] sequence)
    {
      int count = 0;
      foreach (string candidate in sequence)
        if (candidate.StartsWith('a'))
          count++;
      return count;
    }

    private static List<string> GetMatchesForeachListAdHoc(string[] sequence)
    {
      var list = new List<string>();
      foreach (string candidate in sequence)
      if (candidate.StartsWith('a'))
        list.Add(candidate);
      return list;
    }

    private static List<string> GetMatchesForeachListPreAllocated(string[] sequence)
    {
      var list = new List<string>(sequence.Length);
      foreach (string candidate in sequence)
        if (candidate.StartsWith('a'))
          list.Add(candidate);
      return list;
    }

    private static List<string> GetMatchesForeachListTruncated(string[] sequence)
    {
      var list = new List<string>(sequence.Length);
      foreach (string candidate in sequence)
        if (candidate.StartsWith('a'))
          list.Add(candidate);
      list.TrimExcess();
      return list;
    }

    private static string[] GetMatchesLinqArray(string[] sequence) =>
      sequence
        .Where(candidate => candidate.StartsWith('a'))
        .ToArray();

    private static List<string> GetMatchesLinqList(string[] sequence) =>
      sequence
        .Where(candidate => candidate.StartsWith('a'))
        .ToList();

    /// <summary>
    /// Creates a string with a 1:<paramref name="odds"/> likelihood of starting with the letter "a".
    /// </summary>
    /// <param name="sampleCount">The number of samples to generate.</param>
    /// <param name="random">The random number generator.</param>
    /// <param name="odds">The denominator of a 1:<paramref name="odds"/> likelihood of selection.</param>
    /// <param name="maximumLength">The maximum character length of the string.</param>
    /// <returns>A sequence of <paramref name="sampleCount"/> strings with the specified characteristics.</returns>
    protected static string[] StartsWithA(int sampleCount, Random random, int odds,
int maximumLength)
    {
      string[] samples = new string[sampleCount];
      for (int index = 0; index < sampleCount; index++)
        samples[index] = StartsWithA(random, odds, maximumLength);
      return samples;
    }

    /// <summary>
    /// Creates a string with a 1:<paramref name="odds"/> likelihood of starting with the letter "a".
    /// </summary>
    /// <param name="random">The random number generator.</param>
    /// <param name="odds">The denominator of a 1:<paramref name="odds"/> likelihood of selection.</param>
    /// <param name="maximumLength">The maximum character length of the string.</param>
    /// <returns>A string with the specified characteristics.</returns>
    protected static string StartsWithA(Random random, int odds, int maximumLength)
    {
      int length = random.Next(maximumLength) + 1;
      var builder = new StringBuilder(random.Next(maximumLength) + 1);

      builder.Append(IsSelected(random, odds) ? 'a' : 'b');
      for (int index = 0; index < length; index++)
        builder.Append('b');

      return builder.ToString();
    }

    /// <summary>
    /// Evaluates a 1:<paramref name="odds"/> likelihood of selection.
    /// </summary>
    /// <param name="random">The random number generator.</param>
    /// <param name="odds">The denominator of a 1:<paramref name="odds"/> likelihood of selection.</param>
    /// <returns>True, if a 1:<paramref name="odds"/> event occurs; otherwise, false.</returns>
    protected static bool IsSelected(Random random, int odds) =>
      random.Next(odds) == 0;
  }
}
   
Comments
Maciej Los 26-Sep-18 2:34am
   
5ed!
Thanks for your very good comments.

I recognized the slow exection is caused by the Immediate Window:
? System.DateTime.Now.ToLongTimeString() + " | n=" + arrPlz.Where(x => x.StartsWith("0")).Count() + " | " + System.DateTime.Now.ToLongTimeString()

"12:02:20 | n=624 | 12:02:24"
Using the same LINQ expression in the class code works perfect (even with a List<>):
string[8000] items = new[] {"01234", "123123", ...};

items = items.Where(item => item.StartsWith("0")).ToArray();
var lst = items.Where(item => item.StartsWith("0")).ToList();
items = lst.ToArray();
So the Immediate Window slows down that process in a high way.
   

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)



CodeProject, 20 Bay Street, 11th Floor Toronto, Ontario, Canada M5J 2N8 +1 (416) 849-8900