# A User-Friendly C# Descriptive Statistic Class

28 Jun 2008, CPOL

An article on the most commonly used descriptive statistics, including standard deviations, skewness, kurtosis, percentiles, quartiles, etc.

## Introduction

The 80-20 rule applies: even with the advances in statistics, most of our work requires only univariate descriptive statistics, which involve calculating the mean, standard deviation, range, skewness, kurtosis, percentiles, quartiles, etc. This article describes a simple way to construct a set of classes that implement descriptive statistics in C#. The emphasis is on ease of use at the user's end.

## Requirements

To run the code, you need to have the following:

• .NET Framework 2.0 and above
• Microsoft Visual Studio 2005 if you want to open the project files included in the download project
• NUnit 2.4 if you want to run the unit tests included in the download project

The download also includes an NUnit test in case you want to change the code and run your own unit tests.

## The Code

The goal of the code design is to simplify usage. We envisage that the user will write code like the following to get the desired results, in a simple three-step process:

1. Instantiate a `Descriptive` object
2. Invoke its `.Analyze()` method
3. Retrieve results from its `.Result` object

Here is a typical user’s code:

```csharp
double[] x = {1, 2, 4, 7, 8, 9, 10, 12};
Descriptive desp = new Descriptive(x);
desp.Analyze(); // analyze the data
Console.WriteLine("Result is: " + desp.Result.FirstQuartile.ToString());
```

Two classes are implemented:

• `DescriptiveResult`
• `Descriptive`

`DescriptiveResult` is the class of the result object, which holds the analysis results. In our implementation, the type of the `.Result` member is defined as follows:

```csharp
/// <summary>
/// The result class that holds the analysis results
/// </summary>
public class DescriptiveResult
{
    // sortedData is used to calculate percentiles
    internal double[] sortedData;

    /// <summary>
    /// DescriptiveResult default constructor
    /// </summary>
    public DescriptiveResult() { }

    /// <summary>
    /// Count
    /// </summary>
    public uint Count;
    /// <summary>
    /// Sum
    /// </summary>
    public double Sum;
    /// <summary>
    /// Arithmetic mean
    /// </summary>
    public double Mean;
    /// <summary>
    /// Geometric mean
    /// </summary>
    public double GeometricMean;
    /// <summary>
    /// Harmonic mean
    /// </summary>
    public double HarmonicMean;
    /// <summary>
    /// Minimum value
    /// </summary>
    public double Min;
    /// <summary>
    /// Maximum value
    /// </summary>
    public double Max;
    /// <summary>
    /// The range of the values
    /// </summary>
    public double Range;
    /// <summary>
    /// Sample variance
    /// </summary>
    public double Variance;
    /// <summary>
    /// Sample standard deviation
    /// </summary>
    public double StdDev;
    /// <summary>
    /// Skewness of the data distribution
    /// </summary>
    public double Skewness;
    /// <summary>
    /// Kurtosis of the data distribution
    /// </summary>
    public double Kurtosis;
    /// <summary>
    /// Interquartile range
    /// </summary>
    public double IQR;
    /// <summary>
    /// Median, or second quartile, or 50th percentile
    /// </summary>
    public double Median;
    /// <summary>
    /// First quartile, at the 25th percentile
    /// </summary>
    public double FirstQuartile;
    /// <summary>
    /// Third quartile, at the 75th percentile
    /// </summary>
    public double ThirdQuartile;

    /// <summary>
    /// Sum of Error
    /// </summary>
    internal double SumOfError;

    /// <summary>
    /// The sum of the squares of errors
    /// </summary>
    internal double SumOfErrorSquare;

    /// <summary>
    /// Percentile
    /// </summary>
    /// <param name="percent">Percentile, from 0 to 100</param>
    /// <returns>Percentile</returns>
    public double Percentile(double percent)
    {
        // ASSUMPTION: the listing is cut off after the doc comment above.
        // This signature and body are a guess that delegates to the
        // Descriptive.percentile helper shown later in the article.
        return Descriptive.percentile(sortedData, percent);
    }
}
```

For simplicity, most members are implemented as `public` fields. The only member function, `Percentile`, takes a percentage as its argument (e.g. 30 for 30%) and returns the corresponding percentile.

The following table lists the available results (assuming that the `Descriptive` object you use is named `desp`):

| Result | Result stored in variable |
| --- | --- |
| Number of data points | `desp.Result.Count` |
| Minimum value | `desp.Result.Min` |
| Maximum value | `desp.Result.Max` |
| Range of values | `desp.Result.Range` |
| Sum of values | `desp.Result.Sum` |
| Arithmetic mean | `desp.Result.Mean` |
| Geometric mean | `desp.Result.GeometricMean` |
| Harmonic mean | `desp.Result.HarmonicMean` |
| Sample variance | `desp.Result.Variance` |
| Sample standard deviation | `desp.Result.StdDev` |
| Skewness of the distribution | `desp.Result.Skewness` |
| Kurtosis of the distribution | `desp.Result.Kurtosis` |
| Interquartile range | `desp.Result.IQR` |
| Median (50th percentile) | `desp.Result.Median` |
| First quartile (25th percentile) | `desp.Result.FirstQuartile` |
| Third quartile (75th percentile) | `desp.Result.ThirdQuartile` |
| Percentile | `desp.Result.Percentile()`* |

* The argument of `Percentile()` is a value from 0 to 100 that indicates the desired percentile.

## Descriptive Class

The `Descriptive` class does all the analysis, and it is implemented as follows:

```csharp
/// <summary>
/// Descriptive class
/// </summary>
public class Descriptive
{
    private double[] data;
    private double[] sortedData;

    /// <summary>
    /// Descriptive results
    /// </summary>
    public DescriptiveResult Result = new DescriptiveResult();

    #region Constructors
    /// <summary>
    /// Descriptive analysis default constructor
    /// </summary>
    public Descriptive() { } // default empty constructor

    /// <summary>
    /// Descriptive analysis constructor
    /// </summary>
    /// <param name="dataVariable">Data array</param>
    public Descriptive(double[] dataVariable)
    {
        data = dataVariable;
    }
    #endregion //  Constructors
```

Note that we need a `sortedData` array to support the percentile and quartile statistics. It stores a sorted copy of the user's data.
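The idea of keeping a sorted copy next to the original data can be sketched in isolation. `SortedCopySketch` and `SortedCopy` are hypothetical names used only for this illustration; they are not part of the article's download. The sketch assumes the caller's array should be left in its original order.

```csharp
using System;

public static class SortedCopySketch
{
    // Hypothetical helper: returns a sorted copy and leaves the input untouched.
    public static double[] SortedCopy(double[] data)
    {
        double[] copy = new double[data.Length];
        data.CopyTo(copy, 0);   // Array.Sort works in place, so sort the copy
        Array.Sort(copy);
        return copy;
    }
}
```

Sorting a copy costs one extra array, but the user's data keeps its original order, which matters when the order carries meaning (for example, a time series).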

The constructor of the `Descriptive` class allows the user to assign the data array during object instantiation:

```csharp
double[] x = {1, 2, 4, 7, 8, 9, 10, 12};
Descriptive desp = new Descriptive(x);
```

Once the `Descriptive` object is instantiated, the user only needs to call the `.Analyze()` method to perform the analysis. Subsequently, the user can retrieve the analysis results from the `.Result` object in the `Descriptive` object.

The `Analyze()` method is implemented as follows:

```csharp
/// <summary>
/// Run the analysis to obtain descriptive information of the data
/// </summary>
public void Analyze()
{
    // initializations
    Result.Count = 0;
    Result.Min = Result.Max = Result.Range = Result.Mean =
        Result.Sum = Result.StdDev = Result.Variance = 0.0d;

    double sumOfSquare = 0.0d;
    double sumOfESquare = 0.0d; // must initialize

    double[] squares = new double[data.Length];
    double cumProduct = 1.0d; // to calculate geometric mean
    double cumReciprocal = 0.0d; // to calculate harmonic mean

    // First iteration
    for (int i = 0; i < data.Length; i++)
    {
        if (i == 0) // first data point
        {
            Result.Min = data[i];
            Result.Max = data[i];
            Result.Mean = data[i];
            Result.Range = 0.0d;
        }
        else
        { // not the first data point
            if (data[i] < Result.Min) Result.Min = data[i];
            if (data[i] > Result.Max) Result.Max = data[i];
        }
        Result.Sum += data[i];
        squares[i] = Math.Pow(data[i], 2); //TODO: may not be necessary
        sumOfSquare += squares[i];

        cumProduct *= data[i];
        cumReciprocal += 1.0d / data[i];
    }

    Result.Count = (uint)data.Length;
    double n = (double)Result.Count; // use a shorter variable in double type
    Result.Mean = Result.Sum / n;
    Result.GeometricMean = Math.Pow(cumProduct, 1.0 / n);
    // see http://mathworld.wolfram.com/HarmonicMean.html
    Result.HarmonicMean = 1.0d / (cumReciprocal / n);
    Result.Range = Result.Max - Result.Min;

    // second loop, calculate Stdev, sum of errors
    //double[] eSquares = new double[data.Length];
    double m1 = 0.0d; // sum of absolute deviations
    double m2 = 0.0d; // sum of squared deviations
    double m3 = 0.0d; // for skewness calculation
    double m4 = 0.0d; // for kurtosis calculation
    for (int i = 0; i < data.Length; i++)
    {
        double m = data[i] - Result.Mean; // deviation from the mean
        double mPow2 = m * m;
        double mPow3 = mPow2 * m;
        double mPow4 = mPow3 * m;

        m1 += Math.Abs(m);

        m2 += mPow2;

        // accumulate for skewness
        m3 += mPow3; // Math.Pow((data[i] - mean), 3);

        // accumulate for kurtosis
        m4 += mPow4; // Math.Pow((data[i] - mean), 4);
    }

    Result.SumOfError = m1;
    Result.SumOfErrorSquare = m2; // Added for Excel function DEVSQ
    sumOfESquare = m2;

    // var and standard deviation
    Result.Variance = sumOfESquare / ((double)Result.Count - 1);
    Result.StdDev = Math.Sqrt(Result.Variance);

    // using Excel approach
    double skewCum = 0.0d; // the cum part of SKEW formula
    for (int i = 0; i < data.Length; i++)
    {
        skewCum += Math.Pow((data[i] - Result.Mean) / Result.StdDev, 3);
    }
    Result.Skewness = n / (n - 1) / (n - 2) * skewCum;

    // kurtosis: see http://en.wikipedia.org/wiki/Kurtosis (heading: Sample Kurtosis)
    double m2_2 = Math.Pow(sumOfESquare, 2);
    Result.Kurtosis = ((n + 1) * n * (n - 1)) / ((n - 2) * (n - 3)) *
        (m4 / m2_2) -
        3 * Math.Pow(n - 1, 2) / ((n - 2) * (n - 3)); // second last formula for G2

    // calculate quartiles
    sortedData = new double[data.Length];
    data.CopyTo(sortedData, 0);
    Array.Sort(sortedData);

    // copy the sorted data to result object so that
    // user can calculate percentile easily
    Result.sortedData = new double[data.Length];
    sortedData.CopyTo(Result.sortedData, 0);

    Result.FirstQuartile = percentile(sortedData, 25);
    Result.ThirdQuartile = percentile(sortedData, 75);
    Result.Median = percentile(sortedData, 50);
    Result.IQR = percentile(sortedData, 75) - percentile(sortedData, 25);

} // end of method Analyze
```
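For reference, the statistics computed in `Analyze()` can be written out as formulas. The notation is introduced here for illustration and does not appear in the code: $x_i$ are the data points, $n$ is the count, $\bar{x}$ is the arithmetic mean, and $s$ is the sample standard deviation.

```latex
\begin{aligned}
\bar{x} &= \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
\text{geometric mean} = \Bigl(\prod_{i=1}^{n} x_i\Bigr)^{1/n}, \qquad
\text{harmonic mean} = \frac{n}{\sum_{i=1}^{n} 1/x_i} \\[4pt]
s^2 &= \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad s = \sqrt{s^2} \\[4pt]
\text{skewness} &= \frac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \Bigl(\frac{x_i - \bar{x}}{s}\Bigr)^{3} \\[4pt]
\text{kurtosis} &= \frac{(n+1)\,n\,(n-1)}{(n-2)(n-3)} \cdot
\frac{\sum_{i=1}^{n} (x_i - \bar{x})^4}{\bigl(\sum_{i=1}^{n} (x_i - \bar{x})^2\bigr)^{2}}
- \frac{3\,(n-1)^2}{(n-2)(n-3)}
\end{aligned}
```

The denominators show the minimum sample sizes: the variance needs at least two data points, the skewness at least three, and the kurtosis at least four. The geometric and harmonic means are only meaningful for strictly positive data.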

The calculations of the descriptive statistics are quite straightforward, except for the percentile function (and the quartile calculations that depend on it), which is a little tricky. Therefore, I have a separate function to handle it, as follows:

```csharp
/// <summary>
/// Calculate percentile of a sorted data set
/// </summary>
/// <param name="sortedData">array of double values</param>
/// <param name="p">percentile, value 0-100</param>
/// <returns>the interpolated percentile value</returns>
internal static double percentile(double[] sortedData, double p)
{
    // algo derived from Aczel pg 15 bottom
    if (p >= 100.0d) return sortedData[sortedData.Length - 1];
    // a single value has no right neighbor to interpolate with
    if (sortedData.Length == 1) return sortedData[0];

    double position = (double)(sortedData.Length + 1) * p / 100.0;
    double leftNumber = 0.0d, rightNumber = 0.0d;

    // 1-based fractional rank of the requested percentile
    double n = p / 100.0d * (sortedData.Length - 1) + 1.0d;

    if (position >= 1)
    {
        leftNumber = sortedData[(int)System.Math.Floor(n) - 1];
        rightNumber = sortedData[(int)System.Math.Floor(n)];
    }
    else
    {
        leftNumber = sortedData[0]; // first data
        rightNumber = sortedData[1]; // second data
    }

    if (leftNumber == rightNumber)
        return leftNumber;
    else
    {
        // interpolate linearly between the two neighbors
        double part = n - System.Math.Floor(n);
        return leftNumber + part * (rightNumber - leftNumber);
    }
} // end of internal function percentile
```

The percentile algorithm is derived from Amir Aczel’s book "Complete Business Statistics".
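To make the interpolation step concrete, here is a minimal standalone sketch of the same rank-and-interpolate idea. `PercentileSketch` is a hypothetical name used only for this illustration, and the argument checks are assumptions added here rather than behavior of the article's code. For the sample data `{1, 2, 4, 7, 8, 9, 10, 12}` and `p = 25`, the rank is 0.25 × 7 + 1 = 2.75, so the result lies three quarters of the way from the 2nd value (2) to the 3rd value (4), which is 3.5.

```csharp
using System;

public static class PercentileSketch
{
    // Hypothetical helper, not part of the article's download: linear
    // interpolation between the two sorted values that bracket the rank.
    public static double Percentile(double[] sorted, double p)
    {
        if (sorted == null || sorted.Length == 0)
            throw new ArgumentException("Data must not be empty.", "sorted");
        if (p < 0.0 || p > 100.0)
            throw new ArgumentOutOfRangeException("p", "Expected a value from 0 to 100.");

        // 1-based fractional rank, as in the article's percentile() function
        double rank = p / 100.0 * (sorted.Length - 1) + 1.0;
        int lower = (int)Math.Floor(rank);   // 1-based index of the left neighbor

        // At the top of the range there is no right neighbor to interpolate with
        if (lower >= sorted.Length) return sorted[sorted.Length - 1];

        double fraction = rank - lower;
        double left = sorted[lower - 1];
        double right = sorted[lower];
        return left + fraction * (right - left);
    }
}
```

With the same sample data, this sketch gives 7.5 for `p = 50` and 9.25 for `p = 75`, so the interquartile range is 9.25 − 3.5 = 5.75.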

## Conclusion

The descriptive statistics program presented here provides a simple way to obtain commonly used descriptive statistics, including standard deviations, skewness, kurtosis, percentiles, quartiles, etc.

## History

• 28th June, 2008: Initial post

Jan Low, PhD, is a senior software architect at Foundasoft.com, Malaysia. He is also the author of various text analysis software, statistical libraries, image processing libraries, and security encryption components. He programs primarily in C#, C++ and VB.NET.
Occupation: Senior software architect
Location: Malaysia
