# A User-Friendly C# Descriptive Statistic Class

Jan Low, PhD, 28 Jun 2008
An article on the most commonly used descriptive statistics, including standard deviations, skewness, kurtosis, percentiles, quartiles, etc.

## Introduction

The 80-20 rule applies: even with the advances in statistics, most of our work requires only univariate descriptive statistics, that is, the calculation of mean, standard deviation, range, skewness, kurtosis, percentiles, quartiles, etc. This article describes a simple way to construct a set of classes that implement descriptive statistics in C#. The emphasis is on ease of use at the user's end.

## Requirements

To run the code, you need to have the following:

• .NET Framework 2.0 or above
• Microsoft Visual Studio 2005 if you want to open the project files included in the download project
• NUnit 2.4 if you want to run the unit tests included in the download project

The download also includes an NUnit test in case you want to make changes to the code and run your own unit tests.

## The Code

The goal of the code design is to simplify usage. We envisage that the user will write code like the following to get the desired results. This involves a simple three-step process:

1. Instantiate a `Descriptive` object
2. Invoke its `.Analyze()` method
3. Retrieve results from its `.Result` object

Here is a typical user’s code:

```
double[] x = {1, 2, 4, 7, 8, 9, 10, 12};
Descriptive desp = new Descriptive(x);
desp.Analyze(); // analyze the data
Console.WriteLine("Result is: " + desp.Result.FirstQuartile.ToString());
```

Two classes are implemented:

• `DescriptiveResult`
• `Descriptive`

`DescriptiveResult` is the class of the result object, which holds the analysis results. In our implementation, the class behind the `.Result` member is defined as follows:

```
/// <summary>
/// The result class that holds the analysis results
/// </summary>
public class DescriptiveResult
{
    // sortedData is used to calculate percentiles
    internal double[] sortedData;

    /// <summary>
    /// DescriptiveResult default constructor
    /// </summary>
    public DescriptiveResult() { }

    /// <summary>
    /// Count
    /// </summary>
    public uint Count;
    /// <summary>
    /// Sum
    /// </summary>
    public double Sum;
    /// <summary>
    /// Arithmetic mean
    /// </summary>
    public double Mean;
    /// <summary>
    /// Geometric mean
    /// </summary>
    public double GeometricMean;
    /// <summary>
    /// Harmonic mean
    /// </summary>
    public double HarmonicMean;
    /// <summary>
    /// Minimum value
    /// </summary>
    public double Min;
    /// <summary>
    /// Maximum value
    /// </summary>
    public double Max;
    /// <summary>
    /// The range of the values
    /// </summary>
    public double Range;
    /// <summary>
    /// Sample variance
    /// </summary>
    public double Variance;
    /// <summary>
    /// Sample standard deviation
    /// </summary>
    public double StdDev;
    /// <summary>
    /// Skewness of the data distribution
    /// </summary>
    public double Skewness;
    /// <summary>
    /// Kurtosis of the data distribution
    /// </summary>
    public double Kurtosis;
    /// <summary>
    /// Interquartile range
    /// </summary>
    public double IQR;
    /// <summary>
    /// Median, or second quartile, or 50th percentile
    /// </summary>
    public double Median;
    /// <summary>
    /// First quartile, at the 25th percentile
    /// </summary>
    public double FirstQuartile;
    /// <summary>
    /// Third quartile, at the 75th percentile
    /// </summary>
    public double ThirdQuartile;

    /// <summary>
    /// Sum of errors (absolute deviations from the mean)
    /// </summary>
    internal double SumOfError;

    /// <summary>
    /// The sum of the squares of errors
    /// </summary>
    internal double SumOfErrorSquare;

    /// <summary>
    /// Percentile
    /// </summary>
    /// <param name="percent">Percentile, between 0 and 100</param>
    /// <returns>Percentile</returns>
    public double Percentile(double percent)
    {
        // NOTE: the original listing is cut off after the doc comment above.
        // This body is an assumed reconstruction that delegates to the
        // internal percentile helper of the Descriptive class shown later.
        return Descriptive.percentile(sortedData, percent);
    }
}
```

For simplicity, most members are implemented as `public` fields. The only member function, `Percentile`, takes a percentage as its argument (e.g. 30 for 30%) and returns the corresponding percentile.

The following table lists the available results (assuming that the `Descriptive` object you use is named `desp`):

| Result | Result stored in variable |
| --- | --- |
| Number of data points | `desp.Result.Count` |
| Minimum value | `desp.Result.Min` |
| Maximum value | `desp.Result.Max` |
| Range of values | `desp.Result.Range` |
| Sum of values | `desp.Result.Sum` |
| Arithmetic mean | `desp.Result.Mean` |
| Geometric mean | `desp.Result.GeometricMean` |
| Harmonic mean | `desp.Result.HarmonicMean` |
| Sample variance | `desp.Result.Variance` |
| Sample standard deviation | `desp.Result.StdDev` |
| Skewness of the distribution | `desp.Result.Skewness` |
| Kurtosis of the distribution | `desp.Result.Kurtosis` |
| Interquartile range | `desp.Result.IQR` |
| Median (50th percentile) | `desp.Result.Median` |
| First quartile (25th percentile) | `desp.Result.FirstQuartile` |
| Third quartile (75th percentile) | `desp.Result.ThirdQuartile` |
| Percentile | `desp.Result.Percentile()`* |

* The argument of `Percentile()` is a value from 0 to 100 that indicates the desired percentile.
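
For reference, the variance, skewness, and kurtosis entries in the table can be written out as formulas. The block below is a summary read off the `Analyze()` code shown later in the article, not a statement from the author; the symbols are notation chosen here, with $x_i$ the data values, $\bar{x}$ the arithmetic mean, $s$ the sample standard deviation, and $n$ the number of data points.

```latex
s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2

\text{Skewness} = \frac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^3

\text{Kurtosis} = \frac{(n+1)\,n\,(n-1)}{(n-2)(n-3)}
    \cdot \frac{\sum_{i=1}^{n} (x_i - \bar{x})^4}{\left( \sum_{i=1}^{n} (x_i - \bar{x})^2 \right)^2}
    - \frac{3\,(n-1)^2}{(n-2)(n-3)}
```

Because of the denominators, the skewness is only defined for at least three data points and the kurtosis for at least four.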

## Descriptive Class

The `Descriptive` class does all the analysis, and it is implemented as follows:

```
/// <summary>
/// Descriptive class
/// </summary>
public class Descriptive
{
    private double[] data;
    private double[] sortedData;

    /// <summary>
    /// Descriptive results
    /// </summary>
    public DescriptiveResult Result = new DescriptiveResult();

    #region Constructors
    /// <summary>
    /// Descriptive analysis default constructor
    /// </summary>
    public Descriptive() { } // default empty constructor

    /// <summary>
    /// Descriptive analysis constructor
    /// </summary>
    /// <param name="dataVariable">Data array</param>
    public Descriptive(double[] dataVariable)
    {
        data = dataVariable;
    }
    #endregion //  Constructors
```

Note that we need a `sortedData` array to support the percentile- and quartile-related statistics. It stores a sorted copy of the user's data.

The constructor of the `Descriptive` class allows the user to assign the data array during object instantiation:

```
double[] x = {1, 2, 4, 7, 8, 9, 10, 12};
Descriptive desp = new Descriptive(x);
```

Once the `Descriptive` object is instantiated, the user only needs to call the `.Analyze()` method to perform the analysis. Subsequently, the user can retrieve the analysis results from the `.Result` object in the `Descriptive` object.

The `Analyze()` method is implemented as follows:

```
/// <summary>
/// Run the analysis to obtain descriptive information of the data
/// </summary>
public void Analyze()
{
    // guard added: the default constructor leaves data unassigned
    if (data == null || data.Length == 0)
        throw new InvalidOperationException("No data to analyze.");

    // initializations
    Result.Count = 0;
    Result.Min = Result.Max = Result.Range = Result.Mean =
        Result.Sum = Result.StdDev = Result.Variance = 0.0d;

    double sumOfSquare = 0.0d;
    double sumOfESquare = 0.0d; // must initialize

    double[] squares = new double[data.Length];
    double cumProduct = 1.0d; // to calculate geometric mean
    double cumReciprocal = 0.0d; // to calculate harmonic mean

    // First iteration: min, max, sum, product and sum of reciprocals
    for (int i = 0; i < data.Length; i++)
    {
        if (i == 0) // first data point
        {
            Result.Min = data[i];
            Result.Max = data[i];
            Result.Mean = data[i];
            Result.Range = 0.0d;
        }
        else
        { // not the first data point
            if (data[i] < Result.Min) Result.Min = data[i];
            if (data[i] > Result.Max) Result.Max = data[i];
        }
        Result.Sum += data[i];
        squares[i] = Math.Pow(data[i], 2); //TODO: may not be necessary
        sumOfSquare += squares[i];

        cumProduct *= data[i];
        cumReciprocal += 1.0d / data[i];
    }

    Result.Count = (uint)data.Length;
    double n = (double)Result.Count; // use a shorter variable in double type
    Result.Mean = Result.Sum / n;
    Result.GeometricMean = Math.Pow(cumProduct, 1.0 / n);
    // see http://mathworld.wolfram.com/HarmonicMean.html
    Result.HarmonicMean = 1.0d / (cumReciprocal / n);
    Result.Range = Result.Max - Result.Min;

    // second loop: accumulate powers of the deviations from the mean
    double m1 = 0.0d; // sum of absolute deviations
    double m2 = 0.0d; // sum of squared deviations
    double m3 = 0.0d; // sum of cubed deviations
    double m4 = 0.0d; // sum of fourth powers, for kurtosis calculation
    for (int i = 0; i < data.Length; i++)
    {
        double m = data[i] - Result.Mean;
        double mPow2 = m * m;
        double mPow3 = mPow2 * m;
        double mPow4 = mPow3 * m;

        m1 += Math.Abs(m);
        m2 += mPow2;
        m3 += mPow3; // Math.Pow((data[i] - mean), 3);
        m4 += mPow4; // Math.Pow((data[i] - mean), 4);
    }

    Result.SumOfError = m1;
    Result.SumOfErrorSquare = m2; // Added for Excel function DEVSQ
    sumOfESquare = m2;

    // sample variance and standard deviation (divide by n - 1)
    Result.Variance = sumOfESquare / ((double)Result.Count - 1);
    Result.StdDev = Math.Sqrt(Result.Variance);

    // skewness, using the Excel approach (needs at least 3 data points)
    double skewCum = 0.0d; // the cum part of SKEW formula
    for (int i = 0; i < data.Length; i++)
    {
        skewCum += Math.Pow((data[i] - Result.Mean) / Result.StdDev, 3);
    }
    Result.Skewness = n / (n - 1) / (n - 2) * skewCum;

    // kurtosis: see http://en.wikipedia.org/wiki/Kurtosis (heading: Sample Kurtosis)
    // (needs at least 4 data points)
    double m2_2 = Math.Pow(sumOfESquare, 2);
    Result.Kurtosis = ((n + 1) * n * (n - 1)) / ((n - 2) * (n - 3)) *
        (m4 / m2_2) -
        3 * Math.Pow(n - 1, 2) / ((n - 2) * (n - 3)); // second last formula for G2

    // calculate quartiles on a sorted copy, leaving the user's array untouched
    sortedData = new double[data.Length];
    data.CopyTo(sortedData, 0);
    Array.Sort(sortedData);

    // copy the sorted data to result object so that
    // user can calculate percentile easily
    Result.sortedData = new double[data.Length];
    sortedData.CopyTo(Result.sortedData, 0);

    Result.FirstQuartile = percentile(sortedData, 25);
    Result.ThirdQuartile = percentile(sortedData, 75);
    Result.Median = percentile(sortedData, 50);
    Result.IQR = percentile(sortedData, 75) - percentile(sortedData, 25);

} // end of method Analyze
```
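
One caveat with the first loop of `Analyze()` is that the running product `cumProduct` can overflow to infinity (or underflow to zero) for long or large-valued arrays, which would make the geometric mean unusable. A minimal sketch of a common alternative, averaging the logarithms instead of multiplying the values, is shown below. `MeanSketch` and `GeometricMean` are hypothetical names, not part of the article's download, and the sketch assumes all values are strictly positive.

```csharp
using System;

// Hypothetical helper class, not part of the article's download.
public static class MeanSketch
{
    // Geometric mean computed as exp(mean(log(x))).
    // Assumption: every value is strictly positive.
    public static double GeometricMean(double[] data)
    {
        if (data == null || data.Length == 0)
            throw new ArgumentException("Data must contain at least one value.");

        double logSum = 0.0;
        foreach (double v in data)
        {
            if (v <= 0.0)
                throw new ArgumentException("Geometric mean needs strictly positive values.");
            logSum += Math.Log(v);
        }
        return Math.Exp(logSum / data.Length);
    }

    public static void Main()
    {
        // 400 values of 1e10: a running product would reach 1e4000,
        // far beyond the range of double, while the log sum stays small.
        double[] big = new double[400];
        for (int i = 0; i < big.Length; i++) big[i] = 1e10;

        Console.WriteLine(GeometricMean(big)); // a value close to 1e10
    }
}
```

The trade-off is one `Math.Log` call per value instead of one multiplication, in exchange for a result that stays finite on inputs where the product does not.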

The calculations of the descriptive statistics are quite straightforward, except for the percentile function (and the quartile calculations that depend on it), which is a little tricky. Therefore, I have a separate function to handle it, as follows:

```
/// <summary>
/// Calculate percentile of a sorted data set
/// </summary>
/// <param name="sortedData">array of double values, sorted in ascending order</param>
/// <param name="p">percentile, value 0-100</param>
/// <returns>the percentile, linearly interpolated between neighboring values</returns>
internal static double percentile(double[] sortedData, double p)
{
    // algo derived from Aczel pg 15 bottom
    if (p >= 100.0d) return sortedData[sortedData.Length - 1];

    // guard added: with a single value there is no right neighbor to read
    if (sortedData.Length == 1) return sortedData[0];

    double position = (double)(sortedData.Length + 1) * p / 100.0;
    double leftNumber = 0.0d, rightNumber = 0.0d;

    // 1-based fractional rank of the requested percentile
    double n = p / 100.0d * (sortedData.Length - 1) + 1.0d;

    if (position >= 1)
    {
        leftNumber = sortedData[(int)System.Math.Floor(n) - 1];
        rightNumber = sortedData[(int)System.Math.Floor(n)];
    }
    else
    {
        leftNumber = sortedData[0]; // first data
        rightNumber = sortedData[1]; // second data
    }

    if (leftNumber == rightNumber)
        return leftNumber;
    else
    {
        // interpolate by the fractional part of the rank
        double part = n - System.Math.Floor(n);
        return leftNumber + part * (rightNumber - leftNumber);
    }
} // end of internal function percentile
```

The percentile algorithm is derived from Amir Aczel’s book "Complete Business Statistics".
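
To make the interpolation concrete, here is a small standalone sketch that applies the same rank expression as the code above, `p / 100 * (length - 1) + 1`, to the sample data used earlier in the article. `PercentileSketch` is a hypothetical name and this is a simplified restatement for illustration, not the author's implementation; the handling of `p <= 0`, of single-element arrays, and the clamping of the index are assumptions added here.

```csharp
using System;

// Hypothetical helper class, written for illustration only.
public static class PercentileSketch
{
    // Linear interpolation between the two values that surround the
    // fractional rank; the input must already be sorted in ascending order.
    public static double Percentile(double[] sorted, double p)
    {
        if (sorted == null || sorted.Length == 0)
            throw new ArgumentException("Data must contain at least one value.");
        if (sorted.Length == 1 || p <= 0.0) return sorted[0];
        if (p >= 100.0) return sorted[sorted.Length - 1];

        // 1-based fractional rank, same expression as in the article's code
        double rank = p / 100.0 * (sorted.Length - 1) + 1.0;

        // 1-based index of the left neighbor, clamped so that
        // sorted[lower] (the right neighbor) is always in range
        int lower = Math.Min((int)Math.Floor(rank), sorted.Length - 1);
        double fraction = rank - lower;

        double left = sorted[lower - 1];
        double right = sorted[lower];
        return left + fraction * (right - left);
    }

    public static void Main()
    {
        double[] x = { 1, 2, 4, 7, 8, 9, 10, 12 }; // already sorted

        // rank = 0.25 * 7 + 1 = 2.75, so 2 + 0.75 * (4 - 2)
        Console.WriteLine(Percentile(x, 25)); // first quartile: 3.5
        // rank = 0.50 * 7 + 1 = 4.5, so 7 + 0.5 * (8 - 7)
        Console.WriteLine(Percentile(x, 50)); // median: 7.5
        // rank = 0.75 * 7 + 1 = 6.25, so 9 + 0.25 * (10 - 9)
        Console.WriteLine(Percentile(x, 75)); // third quartile: 9.25
    }
}
```

With these values, the interquartile range for the sample data works out to 9.25 - 3.5 = 5.75.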

## Conclusion

The descriptive statistics program presented here provides a simple way to obtain commonly used descriptive statistics, including standard deviations, skewness, kurtosis, percentiles, quartiles, etc.

## History

• 28th June, 2008: Initial post

Jan Low, PhD, is a senior software architect at Foundasoft.com, Malaysia. He is also the author of various text analysis software, statistical libraries, image processing libraries, and security encryption components. He programs primarily in C#, C++ and VB.NET.
Occupation: Senior software architect
Location: Malaysia
