A Generic Frequency Table in C# with Descriptive Statistics
Frequencies, descriptive statistics and normality testing
What's New?
GetData
: The new parameter Pristine
indicates if you want to get data in the same order as entered. That might be of importance in case of time-dependant data. The new Add
-method stores the position(s) of each value using a Dictionary<T, List<int>>
.
Introduction
The exploration of empirical data is a common task in various fields. Especially the computational analysis of such data is often cumbered by the nature (the type) of the data. So - while writing statistical routines by my own - I decided to develop a "Frequency Table Class" which accepts any possible type of data.
Requirements
My class has to satisfy requirements as follows:
- Accept any type of data (especially multiple precision types)
- Accept a given array
- Provide methods for adding and removing single values
- Automatically update the absolute frequency when a value is added/removed
- A simple way to fetch mode, highest frequency ...
- Provide a method to get the table as an array
- Returned arrays must be sortable by frequency and value
- Provide fields/properties to describe the table
Code
The backbone of this class is the FrequencyTableEntry<T>
structure:
// A generic structure storing the frequency information for each value
public struct FrequencyTableEntry<T> where T : IComparable<T>
{
// Constructor
// val: The value counted
// absFreq: The absolute frequency
// relFreq: The relative frequency
// percentage: The percentage
public FrequencyTableEntry(T val, int absFreq, double relFreq, double percentage)
{
Value = val;
AbsoluteFreq = absFreq;
RelativeFreq = relFreq;
Percentage = percentage;
}
public T Value;
public int AbsoluteFreq;
public double RelativeFreq;
public double Percentage;
}
The specified type T
has to implement the IComparable
-Interface (needed for the sorting routine). The class stores the data in a generic Dictionary: _entries = new Dictionary<T,int>()
: the _entries.Keys
-Collection contains the values to count, the _entries.Values
-Collection contains the absolute frequency for this particular value:
public FrequencyTable(int initialCapacity)
{
_entries = new Dictionary<T, int>(initialCapacity);
.
.
}
To provide easy access to table entries, the implemented enumerator returns the structure above:
public IEnumerator<FrequencyTableEntry<T>> GetEnumerator()
{
// the structure to return
FrequencyTableEntry<T> _output;
// the frequency
int int_f;
// the "double-typed" frequency
double dbl_f;
foreach (T key in _entries.Keys)
{
int_f = _entries[key];
dbl_f = (double)int_f;
// fill the structure
_output = new FrequencyTableEntry<T>
(key, int_f, dbl_f / _count, dbl_f / _count * 100.0);
// yielding - cool thing that
yield return _output;
}
}
The general Add(T value)
method looks like this:
public void Add(T value)
{
List<int> _tempPos;
// if the Dictionary already contains value, then
// we have to update frequency and _count
if (_entries.ContainsKey(value))
{
// update the frequency
_entries[value]++;
// update mode and highest frequency
if (_entries[value] > _high)
{
_high = _entries[value];
_mode = value;
}
// add 1 to sample size
_count++;
foreach (T key in _entries.Keys)
{
_relFrequencies[key] = (double)_entries[key] / _count;
}
UpdateSumAndMean(value);
// store the actual position of the entry in the dataset
_positions.TryGetValue(value, out _tempPos);
// the position is equal to _count
_tempPos.Add(_count);
// remove old entry
_positions.Remove(value);
// store new entry
_positions.Add(value, _tempPos);
}
else /* if the dictionary does not contain value add a new entry */
{
// if the highest frequency is still zero, set it to one
if (_high < 1)
{
_high = 1;
_mode = value;
}
// add a new entry - frequency is one
_entries.Add(value, 1);
// add 1 to table length
_length++;
// add 1 to sample size
_count++;
// update relative frequencies
_relFrequencies.Add(value, 0.0);
foreach (T key in _entries.Keys)
{
_relFrequencies[key] = (double)_entries[key] / _count;
}
UpdateSumAndMean(value);
// create a new entry and set position to _count
_tempPos = new List<int>();
_tempPos.Add(_count);
// store it
_positions.Add(value, _tempPos);
}
}
To simplify the analysis of a given text, I have implemented a special constructor:
// Constructor - the created instance analyzes the
// frequency of characters in a given string
// Text: String to analyze
public FrequencyTable(T Text, TextAnalyzeMode mode)
{
_positions = new Dictionary<T, List<int>>();
// if T is not string -> Exception
if (!(Text is string))
throw new ArgumentException();
// the table itself
_entries = new Dictionary<T, int>();
_relFrequencies = new Hashtable();
// number of entries in _entries
_length = 0;
// sample size
_count = 0;
// description of the table
_description = "";
// a user defined tag
_tag = 0;
// the highest frequency
_high = 0;
_dblSum = double.NaN;
_mean = double.NaN;
_alpha = double.NaN;
AnalyzeString(Text, mode);
_p = double.NaN;
}
The associated Add
method:
public void Add(T Text, TextAnalyzeMode mode)
{
if (!(Text is string))
throw new ArgumentException();
AnalyseString(Text, mode);
}
In my opinion, it is useful to provide different modes regarding literal analysis. These modes are provided by TextAnalyzeMode
:
public enum TextAnalyzeMode
{
AllCharacters,
NoNumerals,
NoSpecialCharacters,
LettersOnly,
NumeralsOnly,
SpecialCharactersOnly
}
The analysis itself is performed by AnalyzeString(T Text, TextAnalyzeMode mode)
:
private void AnalyzeString(T Text, TextAnalyzeMode mode)
{
// character strings
string str_specialChars = @"""!§$%&/()=?@€<>|µ,.;:-_#'*+~²³ ";
string str_Numbers = "0123456789";
// Adding the entries according to mode
switch (mode)
{
case TextAnalyzeMode.AllCharacters:
foreach (char v in Text.ToString())
this.Add((T)Convert.ChangeType((object)v, Text.Getype()));
break;
case TextAnalyzeMode.LettersOnly:
foreach (char v in Text.ToString())
{
if ((str_specialChars.IndexOf(v) == -1) &
(str_Numbers.IndexOf(v) == -1))
this.Add((T)Convert.ChangeType((object)v, Text.GetType()));
}
break;
case TextAnalyzeMode.NoNumerals:
foreach (char v in Text.ToString())
{
if (str_Numbers.IndexOf(v) == -1)
this.Add((T)Convert.ChangeType((object)v, Text.GetType()));
}
break;
case TextAnalyzeMode.NoSpecialCharacters:
foreach (char v in Text.ToString())
{
if (str_specialChars.IndexOf(v) == -1)
this.Add((T)Convert.ChangeType((object)v, Text.GetType()));
}
break;
case TextAnalyzeMode.NumeralsOnly:
foreach (char v in Text.ToString())
{
if (str_Numbers.IndexOf(v) != -1)
this.Add((T)Convert.ChangeType((object)v, Text.GetType()));
}
break;
case TextAnalyzeMode.SpecialCharactersOnly:
foreach (char v in Text.ToString())
{
if (str_specialChars.IndexOf(v) != -1)
this.Add((T)Convert.ChangeType((object)v, Text.GetType()));
}
break;
}
}
Test for Normality
The question if given data are "Gaussian-distributed" is often raised. There are some robust and valid tests to answer this question. I have implemented the "good old" Kolmogorov-Smirnov test (KS-Test). Alternatively one can use the D'Agostino-Pearson test. There are two new properties concerning normality testing:
IsGaussian
: Returnstrue
if data are numerical and the computed p-value is greater than Alpha (see below)Alpha
: Defines the "significance level" for the KS-Test
The KS-Test method is shown below. The method returns true
, if the test is applicable. In case of non-numerical data, the method returns false
. The out
-parameter p
contains the p
-value on exit. This value can be accessed by calling the P_Value
property.
private bool KS_Test(out double p)
{
// D-statistic
double D = double.NaN;
CumulativeFrequencyTableEntry<T>[] empCDF =
GetCumulativeFrequencyTable(CumulativeFrequencyTableFormat.EachDatapointOnce);
// store the test CDF
double testCDF;
// array to store datapoints
double[] data = new double[empCDF.Length];
FrequencyTableEntry<T>[] table = GetTableAsArray
(FrequencyTableSortOrder.Value_Ascending);
int i = 0;
// prevent exceptions if T is not numerical
try
{
foreach (FrequencyTableEntry<T> entry in table)
{
data[i] = (double)Convert.ChangeType(entry.Value, TypeCode.Double);
i++;
}
}
catch
{
p = double.NaN;
return false;
}
// estimate the parameters of the expected Gaussian distribution
// first: compute the mean
double mean = Mean;
// compute the bias-corrected variance
// as an estimator for the population variance (actually we need the
// square root)
double _sqrt_var = Math.Sqrt(VariancePop);
// now we have to determine the greatest difference between the
// sample cumulative distribution function (empCDF) and
// the distribution function to test (testCDF)
double _sqrt2 = Math.Sqrt(2.0);
double _erf;
double max1 = 0.0;
double max2 = 0.0;
double _temp;
for (i = 0; i < empCDF.Length; i++)
{
// compute the expected distribution using the error function
_erf = Erf(((data[i] - mean) / _sqrt_var) / _sqrt2);
testCDF = 0.5 * (1.0 + _erf);
_temp = Math.Abs(empCDF[i].CumulativeRelativeFrequency - testCDF);
if (_temp > max1)
max1 = _temp;
if (i > 0)
_temp = Math.Abs(empCDF[i - 1].CumulativeRelativeFrequency - testCDF);
else
_temp = testCDF;
if (_temp > max2)
max2 = _temp;
}
// the statistics to use is
// max{diff1,diff2}
D = max1 > max2 ? max1 : max2;
// now compute the p-value using a z-transformation
if (!Double.IsNaN(D))
{
double z = Math.Sqrt((double)SampleSize) * D;
p = KS_Prob_Smirnov(z);
}
else
p = double.NaN;
return true;
}
To compute the "test distribution" (the Gaussian CDF in this case) we need the so called error function. I have used the Erf-implementation written by Miroslav Stampar (see Special Function(s) for C#), which is a translation of the Cephes Math Library by Stephen L. Moshier.
Descriptive Statistics
I think it is useful to implement some fundamental statistical properties inside the class.
Cumulative Frequencies
First of all, it is needed to implement a method which returns the empirical distribution function (the cumulative density function) of the given data:
public CumulativeFrequencyTableEntry<T>[]
GetCumulativeFrequencyTable(CumulativeFrequencyTableFormat Format)
{
CumulativeFrequencyTableEntry<T>[] _output = null;
// get the frequency table as array for easier processing
FrequencyTableEntry<T>[] _freqTable =
GetTableAsArray(FrequencyTableSortOrder.Value_Ascending);
// temporary values
double tempCumRelFreq = 0.0;
int tempCumAbsFreq = 0;
int i, k;
switch (Format)
{
// each datapoint will returned
case CumulativeFrequencyTableFormat.EachDatapoint:
// initialize the result
_output = new CumulativeFrequencyTableEntry<T>[SampleSize];
for (i = 0; i < _freqTable.Length; i++)
{
// update the cumulative frequency - relative and absolute
tempCumAbsFreq += _freqTable[i].AbsoluteFreq;
tempCumRelFreq += _freqTable[i].RelativeFreq;
// fill the array
for (k = tempCumAbsFreq - _freqTable[i].AbsoluteFreq;k < tempCumAbsFreq; k++)
{
_output[k] = new CumulativeFrequencyTableEntry<T>
(_freqTable[i].Value, tempCumRelFreq, tempCumAbsFreq);
}
}
break;
// here each different entry will be returned once
case CumulativeFrequencyTableFormat.EachDatapointOnce:
// initialize the result
_output = new CumulativeFrequencyTableEntry<T>[Length];
for (i = 0; i < _freqTable.Length; i++)
{
// update the cumulative frequency - relative and absolute
tempCumAbsFreq += _freqTable[i].AbsoluteFreq;
tempCumRelFreq += _freqTable[i].RelativeFreq;
// fill the array
_output[i] = new CumulativeFrequencyTableEntry<T>
(_freqTable[i].Value, tempCumRelFreq, tempCumAbsFreq);
}
break;
}
// done
return _output;
}
(Sorry for that strange formatting - this edit tool...)
Where Are My Data??
Ok - you need an array of the added data? That is the way:
public T[] GetData(bool Pristine)
{
T[] result = new T[SampleSize];
// if the order is not important
if (!Pristine)
{
CumulativeFrequencyTableEntry<T>[] cf = GetCumulativeFrequencyTable
(CumulativeFrequencyTableFormat.EachDatapoint);
for (int i = 0; i < SampleSize; i++)
result[i] = cf[i].Value;
}
else /* return the data in same order as entered */
{
List<int> l;
foreach (T key in _positions.Keys)
{
_positions.TryGetValue(key, out l);
foreach (int k in l)
{
result[k - 1] = key;
}
}
}
return result;
}
What Else?
There are some public
properties concerning descriptive statistics:
Mean
Median
Mode
Minimum
Maximum
VarianceSample
VariancePop
(unbiased estimator)StandardDevSample
StandardDevPop
(unbiased estimator)StandardError
Sum
SampleSize
- the number of data (read only)HighestFrequency
- the highest frequency observedSmallestFrequency
- the smallest frequencyScarcestValue
- the scarcest valueKurtosis
KurtosisExcess
Skewness
If the data is not numerical, all properties above will return double.NaN
.
Miscellaneous
Here is a list of the remaining public
properties and methods.
Properties
Length
- The number of table entries (read only)Tag
- An object which can be set by the user (writable)Description
- The description of the table (writable)P_Value
(contains thep
value computed by the Kolmogorov-Smirnov Test)
Methods
Add(T Value)
andAdd(T Test, TextAnalyzeMode mode)
Remove(T Value)
GetTableAsArray()
andGetTableAsArray(FrequencyTableSortOrder order)
(sorting is done by using the Quicksort-Algorithm)GetEnumerator()
ContainsValue(T value)
GetCumulativeFrequencyTable(CumulativeFrequencyTableFormat Format)
GetData(bool Pristine)
- Returns the data as an array (sorted or in input order)GetRelativeFrequency(T value, out double relFreq)
The code is (I think so) well documented so you can use it to get a detailed insight into my solution. I am sure that this solution is not perfect, but it is a good starting point.
For a better overview, I have added a compiled help file (see download at the top of this page).
History
- Version 1.0 - 18 Jan '07
- Initial release
- Version 1.5 - 04 Feb '07
- Minor bug fixes (highest frequency was not set correctly)
- Added normality testing
- Added descriptive statistics
- 09 Feb '07
P_Value
added, release number not changed
- Version 2.0 26 Feb '07
GetData(bool Pristine)
added