Posted 19 Sep 2009

# Using LINQ to Calculate Basic Statistics

Don Kackman, 17 Jan 2015
Extension methods for variance, standard deviation, range, median, mode and some other basic descriptive statistics.

## Introduction

While working on another project, I found myself needing to calculate basic statistics on sets of data with various underlying types. LINQ has Count, Min, Max, and Average, but no other statistical aggregates. As I always do in a case like this, I started with Google, figuring someone else must have written some handy extension methods for this already. There are plenty of statistical and numerical processing packages out there, but what I wanted was a simple and lightweight implementation of the basic stats: variance (sample and population), standard deviation (sample and population), covariance, Pearson's correlation coefficient, range, median, least squares, root mean square, histogram, and mode.

## Background

I've modeled the API on the various overloads of Enumerable.Average, so you can use these methods on the same types of collections that Average accepts. Hopefully, this makes the API familiar and easy to use.

That means overloads for collections of the common numerical data types and their Nullable counterparts, as well as convenient selector overloads.

public static decimal? StandardDeviation(this IEnumerable<decimal?> source);
public static decimal StandardDeviation(this IEnumerable<decimal> source);
public static double? StandardDeviation(this IEnumerable<double?> source);
public static double StandardDeviation(this IEnumerable<double> source);
public static float? StandardDeviation(this IEnumerable<float?> source);
public static float StandardDeviation(this IEnumerable<float> source);
public static double? StandardDeviation(this IEnumerable<int?> source);
public static double StandardDeviation(this IEnumerable<int> source);
public static double? StandardDeviation(this IEnumerable<long?> source);
public static double StandardDeviation(this IEnumerable<long> source);
public static decimal? StandardDeviation<TSource>
(this IEnumerable<TSource> source, Func<TSource, decimal?> selector);
public static decimal StandardDeviation<TSource>
(this IEnumerable<TSource> source, Func<TSource, decimal> selector);
public static double? StandardDeviation<TSource>
(this IEnumerable<TSource> source, Func<TSource, double?> selector);
public static double StandardDeviation<TSource>
(this IEnumerable<TSource> source, Func<TSource, double> selector);
public static float? StandardDeviation<TSource>
(this IEnumerable<TSource> source, Func<TSource, float?> selector);
public static float StandardDeviation<TSource>
(this IEnumerable<TSource> source, Func<TSource, float> selector);
public static double? StandardDeviation<TSource>
(this IEnumerable<TSource> source, Func<TSource, int?> selector);
public static double StandardDeviation<TSource>
(this IEnumerable<TSource> source, Func<TSource, int> selector);
public static double? StandardDeviation<TSource>
(this IEnumerable<TSource> source, Func<TSource, long?> selector);
public static double StandardDeviation<TSource>
(this IEnumerable<TSource> source, Func<TSource, long> selector);

All of the overloads that take a collection of Nullable types only include actual values in the calculated result. For example:

    public static double? StandardDeviation(this IEnumerable<double?> source)
    {
        IEnumerable<double> values = source.AllValues();
        if (values.Any())
            return values.StandardDeviation();

        return null;
    }

where the AllValues method is:

    public static IEnumerable<T> AllValues<T>(this IEnumerable<T?> source) where T : struct
    {
        Debug.Assert(source != null);
        return source.Where(x => x.HasValue).Select(x => (T)x);
    }
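To make the null handling concrete, here is a minimal self-contained sketch of the same pattern. The class name `NullableSketch` and the helper `RangeOrNull` are hypothetical names chosen for illustration; only the filtering idea comes from the article's `AllValues`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class NullableSketch
{
    // Same filter as AllValues above: drop the nulls, unwrap what is left.
    public static IEnumerable<T> AllValues<T>(this IEnumerable<T?> source) where T : struct
    {
        return source.Where(x => x.HasValue).Select(x => x.Value);
    }

    // Hypothetical helper following the article's pattern:
    // return null when no actual values remain after filtering.
    public static double? RangeOrNull(this IEnumerable<double?> source)
    {
        List<double> values = source.AllValues().ToList();
        if (values.Count == 0)
            return null;

        return values.Max() - values.Min();
    }
}
```

With this sketch, a sequence such as { 1.0, null, 4.0 } is measured as if the null were absent, and a sequence containing only nulls yields null rather than throwing.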

### A Note About Mode

Since a distribution of values may not have a mode, all of the Mode methods return a Nullable type. For instance, in the series { 1, 2, 3, 4 }, no single value appears more than once. In cases such as this, the return value will be null.

In the case where there are multiple modes, Mode returns the maximum mode (i.e., the value that appears the most times). If there is a tie for the maximum mode, it returns the smallest value in the set of maximum modes.

There are also two methods for calculating all modes in a series. These return an IEnumerable of all of the modes in descending order of modality.
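The article does not list the all-modes methods, so the following is only a sketch of how such a method could look. It assumes that a value counts as a mode when it appears more than once, and that ties in frequency are ordered by ascending value; the class name `ModesSketch` and the method name `Modes` are assumptions, not the library's confirmed API.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ModesSketch
{
    // Returns every value that occurs more than once,
    // most frequent first; equal frequencies are ordered by value.
    public static IEnumerable<T> Modes<T>(this IEnumerable<T> source) where T : struct
    {
        return source
            .GroupBy(x => x)
            .Where(g => g.Count() > 1)
            .OrderByDescending(g => g.Count())
            .ThenBy(g => g.Key)
            .Select(g => g.Key);
    }
}
```

For the series { 1, 2, 5, 6, 6, 8, 9, 9, 9 } this sketch yields 9 followed by 6, and for { 1, 2, 3, 4 } it yields an empty sequence.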

## The Statistics Calculations

Links, descriptions, and mathematical images from Wikipedia.

### Variance

Variance is the measure of the amount of variation of all the scores for a variable (not just the extremes which give the range).

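Population variance is typically denoted by the lower case sigma squared, σ², while sample variance is usually written s². The Variance method shown here computes the sample variance.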

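    public static double Variance(this IEnumerable<double> source)
    {
        // Welford's single-pass algorithm: update the running mean and the
        // running sum of squared deviations (M2) as each value arrives.
        int n = 0;
        double mean = 0;
        double M2 = 0;

        foreach (double x in source)
        {
            n = n + 1;
            double delta = x - mean;
            mean = mean + delta / n;
            M2 += delta * (x - mean);
        }

        // Sample variance divides by n - 1, so the result is only meaningful for n >= 2.
        return M2 / (n - 1);
    }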
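Written out, the loop above applies the following recurrences for each new value, starting from a mean of 0 and an M2 of 0. The symbols are chosen here for illustration and mirror the variable names in the code:

```latex
$$
\begin{aligned}
\delta_n &= x_n - \bar{x}_{n-1} \\
\bar{x}_n &= \bar{x}_{n-1} + \frac{\delta_n}{n} \\
M_{2,n} &= M_{2,n-1} + \delta_n \, (x_n - \bar{x}_n) \\
s^2 &= \frac{M_{2,n}}{n-1}, \qquad \sigma^2 = \frac{M_{2,n}}{n}
\end{aligned}
$$
```

The only difference between the sample variance s² and the population variance σ² is the final divisor.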

### Standard Deviation

The Standard Deviation of a statistical population, a data set, or a probability distribution is the square root of its variance.

Standard deviation is typically denoted by the lower case sigma: σ.

public static double StandardDeviation(this IEnumerable<double> source)
{
return Math.Sqrt(source.Variance());
}
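The population variants (VarianceP and StandardDeviationP) are used later in the article but are not listed. A minimal sketch, assuming they differ from the sample versions only in dividing by n instead of n - 1, might look like this; the class name `PopulationSketch` and the empty-sequence check are assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PopulationSketch
{
    public static double VarianceP(this IEnumerable<double> source)
    {
        int n = 0;
        double mean = 0;
        double m2 = 0;

        // Same single-pass update as the sample Variance method.
        foreach (double x in source)
        {
            n++;
            double delta = x - mean;
            mean += delta / n;
            m2 += delta * (x - mean);
        }

        if (n == 0)
            throw new InvalidOperationException("source sequence contains no elements");

        // Population variance divides by n rather than n - 1.
        return m2 / n;
    }

    public static double StandardDeviationP(this IEnumerable<double> source)
    {
        return Math.Sqrt(source.VarianceP());
    }
}
```

For the values 1, 2, 5, 6, 6, 8, 9, 9, 9 this sketch gives a population variance of 656/81, roughly 8.0988.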

### Median

Median is the number separating the higher half of a sample, a population, or a probability distribution, from the lower half.

    public static double Median(this IEnumerable<double> source)
    {
        // Sort once into a list; the deferred query would otherwise be
        // re-sorted by every Count() and ElementAt() call.
        var sortedList = (from number in source
                          orderby number
                          select number).ToList();

        int count = sortedList.Count;
        if (count == 0)
            throw new InvalidOperationException("source sequence contains no elements");

        int itemIndex = count / 2;
        if (count % 2 == 0) // Even number of items.
            return (sortedList[itemIndex] + sortedList[itemIndex - 1]) / 2;

        // Odd number of items.
        return sortedList[itemIndex];
    }

### Mode

Mode is the value that occurs the most frequently in a data set or a probability distribution.

    public static T? Mode<T>(this IEnumerable<T> source) where T : struct
    {
        var sortedList = from number in source
                         orderby number
                         select number;

        int count = 0;
        int max = 0;
        T current = default(T);
        T? mode = new T?();

        foreach (T next in sortedList)
        {
            if (current.Equals(next) == false)
            {
                current = next;
                count = 1;
            }
            else
            {
                count++;
            }

            if (count > max)
            {
                max = count;
                mode = current;
            }
        }

        if (max > 1)
            return mode;

        return null;
    }

### Histogram

A Histogram is a representation of a continuous distribution of data. Given a continuous data set, the histogram counts how many occurrences of its data points fall into a set of contiguous ranges of values (aka bins). There is no single approach for determining the number of bins, as this depends on the data and the analysis being performed. There are some standard mechanisms for calculating the number of bins based on the number of data points. Three of these are included in a set of BinCount extension methods. There are also different approaches to determining the range of each bin. These are indicated with the BinningMode enumeration. In all cases except one, bin ranges include the values >= the range minimum and < the range maximum: [min, max). When the BinningMode is MaxValueInclusive, the last bin will include the max value rather than exclude it: [min, max].
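The article does not say which three bin-count rules the BinCount methods implement. As an illustration only, here are three commonly used rules (square root, Sturges, and Rice); the class and method names are hypothetical and are not taken from the library.

```csharp
using System;

public static class BinCountSketch
{
    // Square-root rule: k = ceil(sqrt(n)).
    public static int SquareRootBinCount(int n)
    {
        if (n < 1) throw new ArgumentOutOfRangeException(nameof(n));
        return (int)Math.Ceiling(Math.Sqrt(n));
    }

    // Sturges' rule: k = ceil(log2(n)) + 1.
    public static int SturgesBinCount(int n)
    {
        if (n < 1) throw new ArgumentOutOfRangeException(nameof(n));
        return (int)Math.Ceiling(Math.Log(n, 2.0)) + 1;
    }

    // Rice rule: k = ceil(2 * n^(1/3)).
    public static int RiceBinCount(int n)
    {
        if (n < 1) throw new ArgumentOutOfRangeException(nameof(n));
        return (int)Math.Ceiling(2.0 * Math.Pow(n, 1.0 / 3.0));
    }
}
```

For 100 data points these rules suggest 10, 8, and 10 bins respectively, which shows how much the choice of rule can matter even for modest data sets.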

    /// <summary>
    /// Controls how the range of the bins are determined
    /// </summary>
    public enum BinningMode
    {
        /// <summary>
        /// The minimum will be equal to the sequence min and the maximum equal to infinity
        /// such that:
        /// [min, min + binSize), [min * i, min * i + binSize), ... , [min * n, positiveInfinity)
        /// </summary>
        Unbounded,

        /// <summary>
        /// The minimum will be the sequence min and the maximum equal to the sequence max
        /// The last bin will be max inclusive instead of exclusive
        /// [min, min + binSize), [min * i, min * i + binSize), ... , [min * n, max]
        /// </summary>
        MaxValueInclusive,

        /// <summary>
        /// The total range will be expanded such that the min is
        /// less than the sequence min and max is greater than the sequence max
        /// [min - (binSize / 2), min - (binSize / 2) + binSize), [min - (binSize / 2) * i, min - (binSize / 2) * i + binSize), ... , [min - (bin / 2) * n, min + (binSize / 2))
        /// </summary>
        ExpandRange
    }

Creating the histogram involves creating an array of Bins with the appropriate ranges and then determining how many data points fall into each range.

    public static IEnumerable<Bin> Histogram(this IEnumerable<double> source, int binCount, BinningMode mode = BinningMode.Unbounded)
    {
        if (source == null)
            throw new ArgumentNullException("source");

        if (!source.Any())
            throw new InvalidOperationException("source sequence contains no elements");

        var bins = BinFactory.CreateBins(source.Min(), source.Max(), binCount, mode);
        source.AssignBins(bins);

        return bins;
    }
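BinFactory and AssignBins are not shown in the article. To illustrate the counting step, here is a self-contained sketch that uses equal-width bins over [min, max] with the last bin max-inclusive, similar in spirit to MaxValueInclusive. It returns plain counts instead of the library's Bin type, and the names `HistogramSketch` and `CountBins` are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class HistogramSketch
{
    // Counts how many values fall into each of binCount equal-width bins.
    // Every bin is [lower, upper) except the last, which also includes max.
    public static int[] CountBins(IEnumerable<double> source, int binCount)
    {
        if (source == null)
            throw new ArgumentNullException(nameof(source));
        if (binCount < 1)
            throw new ArgumentOutOfRangeException(nameof(binCount));

        List<double> values = source.ToList();
        if (values.Count == 0)
            throw new InvalidOperationException("source sequence contains no elements");

        double min = values.Min();
        double max = values.Max();
        double binSize = (max - min) / binCount;
        int[] counts = new int[binCount];

        foreach (double v in values)
        {
            // When all values are equal, binSize is 0 and everything goes in bin 0.
            int index = binSize > 0 ? (int)((v - min) / binSize) : 0;

            // The maximum value would index one past the end; fold it into the last bin.
            if (index >= binCount)
                index = binCount - 1;

            counts[index]++;
        }

        return counts;
    }
}
```

For the values 0 through 10 and five bins, this sketch produces the counts 2, 2, 2, 2, 3, where the final bin holds 8, 9, and the inclusive maximum 10.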

### Range

Range is the length of the smallest interval which contains all the data.

public static double Range(this IEnumerable<double> source)
{
return source.Max() - source.Min();
}

### Covariance

Covariance is a measure of how much two variables change together.

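    public static double Covariance(this IEnumerable<double> source, IEnumerable<double> other)
    {
        // Materialize both sequences once; calling ElementAt inside the loop
        // would re-enumerate a non-indexed sequence on every iteration.
        IList<double> x = source.ToList();
        IList<double> y = other.ToList();
        int len = x.Count;

        double avgSource = x.Average();
        double avgOther = y.Average();
        double covariance = 0;

        for (int i = 0; i < len; i++)
            covariance += (x[i] - avgSource) * (y[i] - avgOther);

        // Dividing by len gives the population covariance.
        return covariance / len;
    }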
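In formula form, the method above computes the population covariance of two equally long series, where the bars denote the means of each series:

```latex
$$
\operatorname{cov}(X, Y) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})
$$
```

A positive result means the two series tend to move in the same direction, and a negative result means they tend to move in opposite directions.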

### Pearson's Correlation Coefficient

Pearson's correlation coefficient measures the strength and direction of the linear relationship between two variables.

It is the covariance of the two variables divided by the product of their standard deviations, so it always falls between -1 and 1. Values near 1 or -1 indicate a strong linear relationship, and values near 0 indicate little or none. It should not be confused with Pearson's chi-squared test, which assesses goodness of fit and independence.

public static double Pearson(this IEnumerable<double> source,
IEnumerable<double> other)
{
return source.Covariance(other) / (source.StandardDeviationP() *
other.StandardDeviationP());
}
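The Pearson method above divides the covariance by the product of the two population standard deviations, which corresponds to the Pearson correlation coefficient:

```latex
$$
r = \frac{\operatorname{cov}(X, Y)}{\sigma_X \, \sigma_Y}
$$
```

Because the covariance and both standard deviations use the same divisor n, the result is bounded between -1 and 1.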

### Linear Least Squares

Least Squares is an approach for determining the approximate solution for a distribution of data, used in regression analysis. Said another way: given a distribution of two-dimensional data, what is the equation that best predicts y as a function of x in the form y = mx + b, where m is the slope of the line and b is where it intercepts the y axis on a 2D graph?

For this calculation a struct is returned that indicates m and b.

    public static LeastSquares LeastSquares(this IEnumerable<Tuple<double, double>> source)
    {
        int numPoints = 0;
        double sumX = 0;
        double sumY = 0;
        double sumXX = 0;
        double sumXY = 0;

        foreach (var tuple in source)
        {
            numPoints++;
            sumX += tuple.Item1;
            sumY += tuple.Item2;
            sumXX += tuple.Item1 * tuple.Item1;
            sumXY += tuple.Item1 * tuple.Item2;
        }

        if (numPoints < 2)
            throw new InvalidOperationException("Source must have at least 2 elements");

        double b = (-sumX * sumXY + sumXX * sumY) / (numPoints * sumXX - sumX * sumX);
        double m = (-sumX * sumY + numPoints * sumXY) / (numPoints * sumXX - sumX * sumX);

        return new LeastSquares(m, b);
    }
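The two expressions for m and b in the code are the closed-form solution for a straight-line fit, written here with n for numPoints and sums over all points:

```latex
$$
m = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left(\sum x_i\right)^2},
\qquad
b = \frac{\sum x_i^2 \sum y_i - \sum x_i \sum x_i y_i}{n \sum x_i^2 - \left(\sum x_i\right)^2}
$$
```

Both share the same denominator, which is zero when every point has the same x value; in that case no unique line of the form y = mx + b exists.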

### Root Mean Square

Root Mean Square is the measure of the magnitude of a varying series. This is particularly useful for waveforms.

    public static double RootMeanSquare(this IEnumerable<double> source)
    {
        // Materialize once so the sequence is not enumerated three times.
        var values = source.ToList();
        if (values.Count < 2)
            throw new InvalidOperationException("Source must have at least 2 elements");

        // Sum of the squares of all values.
        double s = values.Aggregate(0.0, (sum, d) => sum + d * d);

        return Math.Sqrt(s / values.Count);
    }
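As a formula, the method above squares every value, averages the squares, and takes the square root:

```latex
$$
x_{\mathrm{rms}} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} x_i^2}
$$
```

For example, the series 3, 4 has a root mean square of the square root of 12.5, roughly 3.54.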

## Using the Code

The included unit tests should provide plenty of examples of how to use these methods, but at their simplest, they behave like other enumerable extension methods. The following program...

    static void Main(string[] args)
    {
        IEnumerable<int> data = new int[] { 1, 2, 5, 6, 6, 8, 9, 9, 9 };

        Console.WriteLine("Count = {0}", data.Count());
        Console.WriteLine("Average = {0}", data.Average());
        Console.WriteLine("Median = {0}", data.Median());
        Console.WriteLine("Mode = {0}", data.Mode());
        Console.WriteLine("Sample Variance = {0}", data.Variance());
        Console.WriteLine("Sample Standard Deviation = {0}", data.StandardDeviation());
        Console.WriteLine("Population Variance = {0}", data.VarianceP());
        Console.WriteLine("Population Standard Deviation = {0}",
            data.StandardDeviationP());
        Console.WriteLine("Range = {0}", data.Range());
    }

... produces:

Count = 9
Average = 6.11111111111111
Median = 6
Mode = 9
Sample Variance = 9.11111111111111
Sample Standard Deviation = 3.01846171271247
Population Variance = 8.09876543209877
Population Standard Deviation = 2.8458329944146
Range = 8

## Points of Interest

I didn't spend much time optimizing the calculations, so be careful if you are evaluating extremely large data sets. If you come up with an optimization in any of the attached code, drop me a note and I'll update the source.

Hopefully, you'll find this code handy the next time you need some simple statistics calculation.

### A Note about the T4 Templates

I've never found much use for code generation templates, but in developing this library they greatly simplified one use case: arithmetic operations cannot be directly expressed in C# generics. Because operators are implemented as static methods, and there is no mechanism to require a Type to have a particular static method, the compiler has no way of generically resolving "-" in this chunk of code:

public static T Range<T>(this IEnumerable<T> source)
{
// error CS0019: Operator '-' cannot be applied to operands of type 'T' and 'T'
return source.Max() - source.Min();
}

If you look at the pattern set by Average and Sum in the framework classes (which I've tried to emulate here), they operate on sequences of int, long, float, double and decimal. To avoid a lot of "copy, paste, modify operand types in code and comments", T4 templates came in very handy.

Basically, for each operation that can operate on a set of intrinsic types, the template does the following:

1. Declares a list of the supported types
2. Iterates over the list and generates the code comments, method signature, and body for all of the overloads supporting the given intrinsic type

public static partial class EnumerableStats
{
<# var types = new List<string>()
{
"int", "long", "float", "double", "decimal"
};

foreach(var type in types)
{#>
/// <summary>
/// Computes the Range of a sequence of nullable <#= type #> values.
/// </summary>
/// <param name="source">The sequence of elements.</param>
/// <returns>The Range.</returns>
public static <#= type #>? Range(this IEnumerable<<#= type #>?> source)
{
IEnumerable<<#= type #>> values = source.AllValues();
if (values.Any())
return values.Range();

return null;
}

/// <summary>
/// Computes the Range of a sequence of <#= type #> values.
/// </summary>
/// <param name="source">The sequence of elements.</param>
/// <returns>The Range.</returns>
public static <#= type #> Range(this IEnumerable<<#= type #>> source)
{
return source.Max() - source.Min();
}

...
etc etc
...
<# } #>
}

This is nice as it ensures that all types support the same set of overloads, and have identical implementations and code commenting.

## History

• Version 1.0 - Initial upload, 9/19/2009.
• Version 1.1 - Added Covariance and Pearson as well as a couple of fixes/optimizations, 10/26/2009.
• Version 1.2 - Updated variance implementation and added GitHub and NuGet links 12/3/2013
• Version 1.3 - Added description of Least Squares and Histogram 8/30/2014

## License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

## About the Author

 Team Leader Starkey Laboratories United States
The first computer program I ever wrote was in BASIC on a TRS-80 Model I and it looked something like:
10 PRINT "Don is cool"
20 GOTO 10

It only went downhill from there.


Article Copyright 2009 by Don Kackman