Using LINQ to Calculate Basic Statistics

Don Kackman, 3 Dec 2013
Extension methods for variance, standard deviation, range, median, and mode.

Up to date source on GitHub 

Now with a NuGet package 

Introduction 

While working on another project, I needed to calculate basic statistics on data sets of various underlying types. LINQ has Count, Min, Max, and Average, but no other statistical aggregates. As I always do in a case like this, I started with Google, figuring someone else must already have written some handy extension methods. There are plenty of statistical and numerical processing packages out there, but what I wanted was a simple, lightweight implementation of the basic stats: variance (sample and population), standard deviation (sample and population), covariance, Pearson's correlation coefficient, range, median, and mode.

Background

I've modeled the API on the various overloads of Enumerable.Average, so these methods work on the same types of collections that Average accepts. Hopefully, this makes them familiar and easy to use.

That means overloads for collections of the common numerical data types and their Nullable counterparts, as well as convenient selector overloads.

public static decimal? StandardDeviation(this IEnumerable<decimal?> source);
public static decimal StandardDeviation(this IEnumerable<decimal> source);
public static double? StandardDeviation(this IEnumerable<double?> source);
public static double StandardDeviation(this IEnumerable<double> source);
public static float? StandardDeviation(this IEnumerable<float?> source);
public static float StandardDeviation(this IEnumerable<float> source);
public static double? StandardDeviation(this IEnumerable<int?> source);
public static double StandardDeviation(this IEnumerable<int> source);
public static double? StandardDeviation(this IEnumerable<long?> source);
public static double StandardDeviation(this IEnumerable<long> source);
public static decimal? StandardDeviation<TSource>
    (this IEnumerable<TSource> source, Func<TSource, decimal?> selector);
public static decimal StandardDeviation<TSource>
    (this IEnumerable<TSource> source, Func<TSource, decimal> selector);
public static double? StandardDeviation<TSource>
    (this IEnumerable<TSource> source, Func<TSource, double?> selector);
public static double StandardDeviation<TSource>
    (this IEnumerable<TSource> source, Func<TSource, double> selector);
public static float? StandardDeviation<TSource>
    (this IEnumerable<TSource> source, Func<TSource, float?> selector);
public static float StandardDeviation<TSource>
    (this IEnumerable<TSource> source, Func<TSource, float> selector);
public static double? StandardDeviation<TSource>
    (this IEnumerable<TSource> source, Func<TSource, int?> selector);
public static double StandardDeviation<TSource>
    (this IEnumerable<TSource> source, Func<TSource, int> selector);
public static double? StandardDeviation<TSource>
    (this IEnumerable<TSource> source, Func<TSource, long?> selector);
public static double StandardDeviation<TSource>
    (this IEnumerable<TSource> source, Func<TSource, long> selector);
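The article does not show how the selector overloads are implemented. A minimal sketch, assuming each one simply projects the elements and hands the result to the matching sequence overload (the class name and the simplified two-pass StandardDeviation stand-in are illustrative, not the library's code):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SelectorOverloadSketch
{
    // Stand-in for the sequence overload: sample standard deviation (two-pass).
    public static double StandardDeviation(this IEnumerable<double> source)
    {
        List<double> values = source.ToList();
        double mean = values.Average();
        double sumOfSquares = values.Sum(x => (x - mean) * (x - mean));
        return Math.Sqrt(sumOfSquares / (values.Count - 1));
    }

    // Assumed delegation: project each element, then reuse the overload above.
    public static double StandardDeviation<TSource>(
        this IEnumerable<TSource> source, Func<TSource, double> selector)
    {
        return source.Select(selector).StandardDeviation();
    }
}
```

With this in place, a call such as readings.StandardDeviation(r => r.Value) reads the same way as the built-in readings.Average(r => r.Value).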

All of the overloads that take a collection of Nullable types only include actual values in the calculated result. For example:

public static double? StandardDeviation(this IEnumerable<double?> source)
{
    IEnumerable<double> values = source.Coalesce();
    if (values.Any())
        return values.StandardDeviation();

    return null;
}

where the Coalesce method is:

public static IEnumerable<T> Coalesce<T>(this IEnumerable<T?> source) where T : struct
{
    Debug.Assert(source != null);
    return source.Where(x => x.HasValue).Select(x => (T)x);
}
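The same pattern carries over to any other statistic. As an illustration only, here is a hypothetical nullable range helper (RangeOfValues is an invented name, not part of the library) that ignores nulls and returns null when no values remain:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class NullableOverloadSketch
{
    // Same idea as Coalesce above: drop the nulls and unwrap what is left.
    public static IEnumerable<T> Coalesce<T>(this IEnumerable<T?> source) where T : struct
    {
        return source.Where(x => x.HasValue).Select(x => x.Value);
    }

    // Hypothetical nullable overload: the range of the non-null values,
    // or null when the sequence holds no values at all.
    public static double? RangeOfValues(this IEnumerable<double?> source)
    {
        List<double> values = source.Coalesce().ToList();
        if (values.Count == 0)
            return null;

        return values.Max() - values.Min();
    }
}
```

For { 1.0, null, 4.0 } the null is skipped and the result is computed from 1.0 and 4.0 only; for a sequence containing nothing but nulls the result is null.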

A Note About Mode

Since a distribution of values may not have a mode, all of the Mode methods return a Nullable type. For instance, in the series { 1, 2, 3, 4 }, no single value appears more than once. In cases such as this, the return value will be null.

In the case where there are multiple modes, Mode returns the maximum mode (i.e., the value that appears the most times). If there is a tie for the maximum mode, it returns the smallest value in the set of maximum modes.

There are also two methods for calculating all modes in a series. These return an IEnumerable of all of the modes in descending order of modality.
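The article does not list those methods. One way such a method could be written, as a sketch (the name AllModes and the ascending-value tie-break are assumptions chosen to match the description of Mode above):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ModesSketch
{
    // Hypothetical helper: every value that occurs more than once,
    // most frequent first; ties are broken by ascending value.
    public static IEnumerable<T> AllModes<T>(this IEnumerable<T> source) where T : struct
    {
        return source
            .GroupBy(x => x)
            .Where(g => g.Count() > 1)
            .OrderByDescending(g => g.Count())
            .ThenBy(g => g.Key)
            .Select(g => g.Key);
    }
}
```

For the series { 1, 2, 5, 6, 6, 8, 9, 9, 9 } this yields 9 followed by 6, and for { 1, 2, 3, 4 } it yields an empty sequence.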

The Statistics Calculations

Links, descriptions, and mathematical images from Wikipedia.

Variance

Variance measures how far the values of a variable spread around their mean, taking every value into account rather than just the extremes that determine the range.

Population variance is typically denoted by the lowercase sigma squared, σ²; sample variance is usually written s².
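For reference, the sample and population forms are usually written as follows, where x̄ is the sample mean of n values and μ is the mean of a population of N values. This is standard textbook notation rather than anything specific to this library:

```latex
s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i-\bar{x}\right)^2,
\qquad
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}\left(x_i-\mu\right)^2
```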

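public static double Variance(this IEnumerable<double> source)
{
    // Welford's online algorithm: a single pass that stays numerically
    // stable by updating a running mean and a running sum of squared deviations.
    int n = 0;
    double mean = 0;
    double M2 = 0; // running sum of squared deviations from the mean

    foreach (double x in source)
    {
        n++;
        double delta = x - mean;
        mean += delta / n;
        M2 += delta * (x - mean);
    }

    // Sample variance (divides by n - 1); the result is NaN for a single value.
    return M2 / (n - 1);
}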
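The program in Using the Code also calls VarianceP, which the article does not list. A minimal sketch, assuming the population version reuses the same single-pass update and only changes the divisor from n - 1 to n (the class name is illustrative):

```csharp
using System;
using System.Collections.Generic;

public static class PopulationVarianceSketch
{
    // Assumed population counterpart of Variance: same running update,
    // but the sum of squared deviations is divided by n instead of n - 1.
    public static double VarianceP(this IEnumerable<double> source)
    {
        int n = 0;
        double mean = 0;
        double m2 = 0; // running sum of squared deviations from the mean

        foreach (double x in source)
        {
            n++;
            double delta = x - mean;
            mean += delta / n;
            m2 += delta * (x - mean);
        }

        // An empty sequence gives 0.0 / 0, which is NaN.
        return m2 / n;
    }
}
```

A StandardDeviationP built on top of it would then be Math.Sqrt(source.VarianceP()), mirroring how StandardDeviation wraps Variance.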

Standard Deviation

The Standard Deviation of a statistical population, a data set, or a probability distribution is the square root of its variance.

Population standard deviation is typically denoted by the lowercase sigma, σ; sample standard deviation is usually written s.

public static double StandardDeviation(this IEnumerable<double> source) 
{ 
    return Math.Sqrt(source.Variance());
}

Median

Median is the number separating the higher half of a sample, a population, or a probability distribution, from the lower half.

public static double Median(this IEnumerable<double> source)
{
    // Sort once and materialize the result; a deferred query would be
    // re-sorted by every Count() and ElementAt() call.
    List<double> sorted = source.OrderBy(number => number).ToList();

    int count = sorted.Count;
    int itemIndex = count / 2;
    if (count % 2 == 0) // Even number of items: average the two middle values.
        return (sorted[itemIndex] + sorted[itemIndex - 1]) / 2;

    // Odd number of items: return the middle value.
    return sorted[itemIndex];
}

Mode

Mode is the value that occurs the most frequently in a data set or a probability distribution.

public static T? Mode<T>(this IEnumerable<T> source) where T : struct
{
    var sortedList = from number in source
                     orderby number
                     select number;

    int count = 0;
    int max = 0;
    T current = default(T);
    T? mode = new T?();

    foreach (T next in sortedList)
    {
        if (current.Equals(next) == false)
        {
            current = next;
            count = 1;
        }
        else
        {
            count++;
        }

        if (count > max)
        {
            max = count;
            mode = current;
        }
    }

    if (max > 1)
        return mode;

    return null;
}

Range

Range is the length of the smallest interval which contains all the data.

public static double Range(this IEnumerable<double> source)
{
    return source.Max() - source.Min();
}

Covariance

Covariance is a measure of how much two variables change together.
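Written out, the method below computes the population form, dividing by n rather than n - 1, where x̄ and ȳ are the means of the two series:

```latex
\operatorname{cov}(X, Y) = \frac{1}{n}\sum_{i=1}^{n}\left(x_i-\bar{x}\right)\left(y_i-\bar{y}\right)
```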

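public static double Covariance(this IEnumerable<double> source, IEnumerable<double> other)
{
    // Materialize both sequences once; calling ElementAt() inside the loop
    // can re-enumerate a sequence on every iteration.
    // Values are paired by index, so both sequences should have the same length.
    double[] x = source.ToArray();
    double[] y = other.ToArray();
    int len = x.Length;

    double avgSource = x.Average();
    double avgOther = y.Average();
    double covariance = 0;

    for (int i = 0; i < len; i++)
        covariance += (x[i] - avgSource) * (y[i] - avgOther);

    // Population covariance: divide by n rather than n - 1.
    return covariance / len;
}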

Pearson's Correlation Coefficient

Pearson's correlation coefficient measures the strength and direction of the linear relationship between two variables.

It ranges from -1 (perfect negative correlation) through 0 (no linear correlation) to 1 (perfect positive correlation). It is not the same thing as Pearson's chi-squared test, which assesses goodness of fit and independence.
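In formula form, the Pearson method that follows returns the ratio below, which is the Pearson product-moment correlation coefficient. Here σ_X and σ_Y are the population standard deviations of the two series (StandardDeviationP in the code):

```latex
r_{XY} = \frac{\operatorname{cov}(X, Y)}{\sigma_X\,\sigma_Y}
```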

public static double Pearson(this IEnumerable<double> source, 
                             IEnumerable<double> other)
{
    return source.Covariance(other) / (source.StandardDeviationP() * 
                             other.StandardDeviationP());
}

Using the Code

The included unit tests provide plenty of usage examples, but at their simplest, these methods behave like other Enumerable extension methods. The following program...

static void Main(string[] args)
{
      IEnumerable<int> data = new int[] { 1, 2, 5, 6, 6, 8, 9, 9, 9 };

      Console.WriteLine("Count = {0}", data.Count());
      Console.WriteLine("Average = {0}", data.Average());
      Console.WriteLine("Median = {0}", data.Median());
      Console.WriteLine("Mode = {0}", data.Mode());
      Console.WriteLine("Sample Variance = {0}", data.Variance());
      Console.WriteLine("Sample Standard Deviation = {0}", data.StandardDeviation());
      Console.WriteLine("Population Variance = {0}", data.VarianceP());
      Console.WriteLine("Population Standard Deviation = {0}", 
                    data.StandardDeviationP());
      Console.WriteLine("Range = {0}", data.Range());
}

... produces:

Count = 9
Average = 6.11111111111111
Median = 6
Mode = 9
Sample Variance = 9.11111111111111
Sample Standard Deviation = 3.01846171271247
Population Variance = 8.09876543209877
Population Standard Deviation = 2.8458329944146
Range = 8
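The variance and standard deviation figures can be checked by hand from the nine sample values, using the identity that the sum of squared deviations equals the sum of squares minus the squared sum over n:

```latex
\begin{aligned}
n &= 9, \qquad \sum x_i = 55, \qquad \bar{x} = \tfrac{55}{9} \approx 6.1111 \\
\sum \left(x_i - \bar{x}\right)^2 &= \sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}
  = 409 - \frac{3025}{9} = \frac{656}{9} \approx 72.8889 \\
s^2 &= \frac{656/9}{8} = \frac{82}{9} \approx 9.1111, \qquad s \approx 3.0185 \\
\sigma^2 &= \frac{656/9}{9} = \frac{656}{81} \approx 8.0988, \qquad \sigma \approx 2.8458
\end{aligned}
```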

Points of Interest

I didn't spend much time optimizing the calculations, so be careful when evaluating extremely large data sets. If you come up with an optimization for any of the attached code, drop me a note and I'll update the source.

Hopefully, you'll find this code handy the next time you need some simple statistics calculations.

History

  • Version 1.0 - Initial upload, 9/19/2009.
  • Version 1.1 - Added Covariance and Pearson as well as a couple of fixes/optimizations, 10/26/2009.
  • Version 1.2 - Updated variance implementation and added GitHub and NuGet links, 12/3/2013.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Don Kackman
Team Leader, Starkey Laboratories
United States
The first computer program I ever wrote was in BASIC on a TRS-80 Model I and it looked something like:
10 PRINT "Don is cool"
20 GOTO 10
It only went downhill from there.
 
Hey look, I've got a blog

Article Copyright 2009 by Don Kackman