## What's New?

`GetData`

: The new parameter `Pristine` indicates whether the data should be returned in the same order as it was entered. That might be important for time-dependent data. The new `Add` method stores the position(s) of each value using a `Dictionary<T, List<int>>`.

## Introduction

The exploration of empirical data is a common task in various fields. The computational analysis of such data, in particular, is often hampered by the nature (the type) of the data. So - while writing statistical routines of my own - I decided to develop a "Frequency Table Class" which accepts any possible type of data.

## Requirements

My class has to satisfy the following requirements:

- Accept any type of data (especially multiple precision types)
- Accept a given array
- Provide methods for adding and removing single values
- Automatically update the absolute frequency when a value is added/removed
- Provide a simple way to fetch the mode, the highest frequency, etc.
- Provide a method to get the table as an array
- Returned arrays must be sortable by frequency and value
- Provide fields/properties to describe the table

## Code

The backbone of this class is the `FrequencyTableEntry<T>` structure:

```csharp
public struct FrequencyTableEntry<T> where T : IComparable<T>
{
    public FrequencyTableEntry(T val, int absFreq, double relFreq, double percentage)
    {
        Value = val;
        AbsoluteFreq = absFreq;
        RelativeFreq = relFreq;
        Percentage = percentage;
    }

    public T Value;
    public int AbsoluteFreq;
    public double RelativeFreq;
    public double Percentage;
}
```

The specified type `T` has to implement the `IComparable<T>` interface (needed for the sorting routine). The class stores the data in a generic Dictionary: `_entries = new Dictionary<T, int>()`. The `_entries.Keys` collection contains the values to count, and the `_entries.Values` collection contains the absolute frequency of each value:

```csharp
public FrequencyTable(int initialCapacity)
{
    _entries = new Dictionary<T, int>(initialCapacity);
    // ...
}
```

To provide easy access to table entries, the implemented enumerator returns the structure above:

```csharp
public IEnumerator<FrequencyTableEntry<T>> GetEnumerator()
{
    FrequencyTableEntry<T> _output;
    int int_f;
    double dbl_f;
    foreach (T key in _entries.Keys)
    {
        int_f = _entries[key];
        dbl_f = (double)int_f;
        _output = new FrequencyTableEntry<T>
            (key, int_f, dbl_f / _count, dbl_f / _count * 100.0);
        yield return _output;
    }
}
```
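Because the class implements `GetEnumerator`, a table can be consumed directly with `foreach`. Here is a minimal sketch (the instantiation and the sample string are made up for illustration; the constructor taking an initial capacity is the one shown above):

```csharp
// Sketch: build a table of characters and list each distinct value
// together with its absolute frequency and percentage.
FrequencyTable<char> table = new FrequencyTable<char>(16);
foreach (char c in "statistics")
    table.Add(c);

foreach (FrequencyTableEntry<char> entry in table)
    Console.WriteLine("{0}: {1} ({2:F1}%)",
        entry.Value, entry.AbsoluteFreq, entry.Percentage);
```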

The general `Add(T value)` method looks like this:

```csharp
public void Add(T value)
{
    List<int> _tempPos;
    if (_entries.ContainsKey(value))
    {
        _entries[value]++;
        if (_entries[value] > _high)
        {
            _high = _entries[value];
            _mode = value;
        }
        _count++;
        foreach (T key in _entries.Keys)
        {
            _relFrequencies[key] = (double)_entries[key] / _count;
        }
        UpdateSumAndMean(value);
        _positions.TryGetValue(value, out _tempPos);
        _tempPos.Add(_count);
        _positions.Remove(value);
        _positions.Add(value, _tempPos);
    }
    else
    {
        if (_high < 1)
        {
            _high = 1;
            _mode = value;
        }
        _entries.Add(value, 1);
        _length++;
        _count++;
        _relFrequencies.Add(value, 0.0);
        foreach (T key in _entries.Keys)
        {
            _relFrequencies[key] = (double)_entries[key] / _count;
        }
        UpdateSumAndMean(value);
        _tempPos = new List<int>();
        _tempPos.Add(_count);
        _positions.Add(value, _tempPos);
    }
}
```

To simplify the analysis of a given text, I have implemented a special constructor:

```csharp
public FrequencyTable(T Text, TextAnalyzeMode mode)
{
    _positions = new Dictionary<T, List<int>>();
    if (!(Text is string))
        throw new ArgumentException();
    _entries = new Dictionary<T, int>();
    _relFrequencies = new Hashtable();
    _length = 0;
    _count = 0;
    _description = "";
    _tag = 0;
    _high = 0;
    _dblSum = double.NaN;
    _mean = double.NaN;
    _alpha = double.NaN;
    AnalyzeString(Text, mode);
    _p = double.NaN;
}
```

The associated `Add` method:

```csharp
public void Add(T Text, TextAnalyzeMode mode)
{
    if (!(Text is string))
        throw new ArgumentException();
    AnalyzeString(Text, mode);
}
```

In my opinion, it is useful to provide different modes of literal analysis. These modes are defined by the `TextAnalyzeMode` enumeration:

```csharp
public enum TextAnalyzeMode
{
    AllCharacters,
    NoNumerals,
    NoSpecialCharacters,
    LettersOnly,
    NumeralsOnly,
    SpecialCharactersOnly
}
```

The analysis itself is performed by `AnalyzeString(T Text, TextAnalyzeMode mode)`:

```csharp
private void AnalyzeString(T Text, TextAnalyzeMode mode)
{
    string str_specialChars = @"""!§$%&/()=?@€<>|µ,.;:-_#'*+~²³ ";
    string str_Numbers = "0123456789";
    switch (mode)
    {
        case TextAnalyzeMode.AllCharacters:
            foreach (char v in Text.ToString())
                this.Add((T)Convert.ChangeType((object)v, Text.GetType()));
            break;
        case TextAnalyzeMode.LettersOnly:
            foreach (char v in Text.ToString())
            {
                if ((str_specialChars.IndexOf(v) == -1) &
                    (str_Numbers.IndexOf(v) == -1))
                    this.Add((T)Convert.ChangeType((object)v, Text.GetType()));
            }
            break;
        case TextAnalyzeMode.NoNumerals:
            foreach (char v in Text.ToString())
            {
                if (str_Numbers.IndexOf(v) == -1)
                    this.Add((T)Convert.ChangeType((object)v, Text.GetType()));
            }
            break;
        case TextAnalyzeMode.NoSpecialCharacters:
            foreach (char v in Text.ToString())
            {
                if (str_specialChars.IndexOf(v) == -1)
                    this.Add((T)Convert.ChangeType((object)v, Text.GetType()));
            }
            break;
        case TextAnalyzeMode.NumeralsOnly:
            foreach (char v in Text.ToString())
            {
                if (str_Numbers.IndexOf(v) != -1)
                    this.Add((T)Convert.ChangeType((object)v, Text.GetType()));
            }
            break;
        case TextAnalyzeMode.SpecialCharactersOnly:
            foreach (char v in Text.ToString())
            {
                if (str_specialChars.IndexOf(v) != -1)
                    this.Add((T)Convert.ChangeType((object)v, Text.GetType()));
            }
            break;
    }
}
```
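For illustration, the text-analysis constructor might be used as follows. This is a sketch with a made-up sample string; note that `T` must be `string` here (the constructor throws an `ArgumentException` otherwise), so each table entry ends up as a single-character string:

```csharp
// Sketch: count only the letters of a sentence; numerals and the
// special characters listed in str_specialChars are skipped.
var letters = new FrequencyTable<string>("Hello, World! 42",
    TextAnalyzeMode.LettersOnly);

foreach (var entry in letters)
    Console.WriteLine("{0}: {1}", entry.Value, entry.AbsoluteFreq);
```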

## Test for Normality

The question of whether given data are "Gaussian-distributed" is often raised. There are some robust and valid tests to answer this question. I have implemented the "good old" Kolmogorov-Smirnov test (KS-Test). Alternatively, one can use the D'Agostino-Pearson test. There are two new properties concerning normality testing:

`IsGaussian`

: Returns `true` if the data are numerical and the computed p-value is greater than `Alpha` (see below)

`Alpha`

: Defines the "significance level" for the KS-Test
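A minimal sketch of how these two properties could work together, assuming `Alpha` is writable as its description suggests (the sample values are made up for illustration):

```csharp
// Sketch: test a small numeric sample for normality at the 5% level.
var ft = new FrequencyTable<double>(16);
foreach (double x in new[] { 0.7, 0.9, 1.0, 1.1, 1.2, 0.8, 1.3 })
    ft.Add(x);

ft.Alpha = 0.05; // significance level for the KS-Test
Console.WriteLine(ft.IsGaussian
    ? "p = " + ft.P_Value + " - normality not rejected"
    : "normality rejected (or data not numerical)");
```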

The KS-Test method is shown below. The method returns `true` if the test is applicable. In case of non-numerical data, the method returns `false`. The `out` parameter `p` contains the p-value on exit. This value can be accessed via the `P_Value` property.

```csharp
private bool KS_Test(out double p)
{
    double D = double.NaN;
    CumulativeFrequencyTableEntry<T>[] empCDF =
        GetCumulativeFrequencyTable(CumulativeFrequencyTableFormat.EachDatapointOnce);
    double testCDF;
    double[] data = new double[empCDF.Length];
    FrequencyTableEntry<T>[] table = GetTableAsArray
        (FrequencyTableSortOrder.Value_Ascending);
    int i = 0;
    try
    {
        foreach (FrequencyTableEntry<T> entry in table)
        {
            data[i] = (double)Convert.ChangeType(entry.Value, TypeCode.Double);
            i++;
        }
    }
    catch
    {
        p = double.NaN;
        return false;
    }
    double mean = Mean;
    double _sqrt_var = Math.Sqrt(VariancePop);
    double _sqrt2 = Math.Sqrt(2.0);
    double _erf;
    double max1 = 0.0;
    double max2 = 0.0;
    double _temp;
    for (i = 0; i < empCDF.Length; i++)
    {
        _erf = Erf(((data[i] - mean) / _sqrt_var) / _sqrt2);
        testCDF = 0.5 * (1.0 + _erf);
        _temp = Math.Abs(empCDF[i].CumulativeRelativeFrequency - testCDF);
        if (_temp > max1)
            max1 = _temp;
        if (i > 0)
            _temp = Math.Abs(empCDF[i - 1].CumulativeRelativeFrequency - testCDF);
        else
            _temp = testCDF;
        if (_temp > max2)
            max2 = _temp;
    }
    D = max1 > max2 ? max1 : max2;
    if (!Double.IsNaN(D))
    {
        double z = Math.Sqrt((double)SampleSize) * D;
        p = KS_Prob_Smirnov(z);
    }
    else
        p = double.NaN;
    return true;
}
```

To compute the "test distribution" (the Gaussian CDF in this case), we need the so-called error function. I have used the Erf implementation written by Miroslav Stampar (see Special Function(s) for C#), which is a translation of the Cephes Math Library by Stephen L. Moshier.

## Descriptive Statistics

I think it is useful to implement some fundamental statistical properties inside the class.

### Cumulative Frequencies

First of all, we need a method that returns the empirical distribution function (the cumulative distribution function) of the given data:

```csharp
public CumulativeFrequencyTableEntry<T>[]
    GetCumulativeFrequencyTable(CumulativeFrequencyTableFormat Format)
{
    CumulativeFrequencyTableEntry<T>[] _output = null;
    FrequencyTableEntry<T>[] _freqTable =
        GetTableAsArray(FrequencyTableSortOrder.Value_Ascending);
    double tempCumRelFreq = 0.0;
    int tempCumAbsFreq = 0;
    int i, k;
    switch (Format)
    {
        case CumulativeFrequencyTableFormat.EachDatapoint:
            _output = new CumulativeFrequencyTableEntry<T>[SampleSize];
            for (i = 0; i < _freqTable.Length; i++)
            {
                tempCumAbsFreq += _freqTable[i].AbsoluteFreq;
                tempCumRelFreq += _freqTable[i].RelativeFreq;
                for (k = tempCumAbsFreq - _freqTable[i].AbsoluteFreq; k < tempCumAbsFreq; k++)
                {
                    _output[k] = new CumulativeFrequencyTableEntry<T>
                        (_freqTable[i].Value, tempCumRelFreq, tempCumAbsFreq);
                }
            }
            break;
        case CumulativeFrequencyTableFormat.EachDatapointOnce:
            _output = new CumulativeFrequencyTableEntry<T>[Length];
            for (i = 0; i < _freqTable.Length; i++)
            {
                tempCumAbsFreq += _freqTable[i].AbsoluteFreq;
                tempCumRelFreq += _freqTable[i].RelativeFreq;
                _output[i] = new CumulativeFrequencyTableEntry<T>
                    (_freqTable[i].Value, tempCumRelFreq, tempCumAbsFreq);
            }
            break;
    }
    return _output;
}
```
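A short sketch of the difference between the two formats (hypothetical usage with made-up data):

```csharp
// Sketch: for the data { 1, 2, 2, 3 }, EachDatapointOnce yields one entry
// per distinct value, while EachDatapoint yields one entry per data point.
var ft = new FrequencyTable<int>(4);
foreach (int x in new[] { 1, 2, 2, 3 })
    ft.Add(x);

foreach (var e in ft.GetCumulativeFrequencyTable(
             CumulativeFrequencyTableFormat.EachDatapointOnce))
    Console.WriteLine("{0}: {1}", e.Value, e.CumulativeRelativeFrequency);
// cumulative relative frequencies: 0.25, 0.75, 1.0
```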


### Where Are My Data??

OK - you need an array of the added data? This is the way:

```csharp
public T[] GetData(bool Pristine)
{
    T[] result = new T[SampleSize];
    if (!Pristine)
    {
        CumulativeFrequencyTableEntry<T>[] cf = GetCumulativeFrequencyTable
            (CumulativeFrequencyTableFormat.EachDatapoint);
        for (int i = 0; i < SampleSize; i++)
            result[i] = cf[i].Value;
    }
    else
    {
        List<int> l;
        foreach (T key in _positions.Keys)
        {
            _positions.TryGetValue(key, out l);
            foreach (int k in l)
            {
                result[k - 1] = key;
            }
        }
    }
    return result;
}
```
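To illustrate the difference between the two modes, here is a sketch with made-up data; the expected contents follow from the position bookkeeping in `Add` and the value-ascending cumulative table:

```csharp
// Sketch: retrieve the stored data in input order vs. sorted order.
var ft = new FrequencyTable<int>(4);
foreach (int x in new[] { 3, 1, 2, 1 })
    ft.Add(x);

int[] pristine = ft.GetData(true);  // input order: 3, 1, 2, 1
int[] sorted   = ft.GetData(false); // ascending by value: 1, 1, 2, 3
```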

### What Else?

There are some `public` properties concerning descriptive statistics:

- `Mean`
- `Median`
- `Mode`
- `Minimum`
- `Maximum`
- `VarianceSample`
- `VariancePop` (unbiased estimator)
- `StandardDevSample`
- `StandardDevPop` (unbiased estimator)
- `StandardError`
- `Sum`
- `SampleSize` - the number of data points (read only)
- `HighestFrequency` - the highest frequency observed
- `SmallestFrequency` - the smallest frequency
- `ScarcestValue` - the scarcest value
- `Kurtosis`
- `KurtosisExcess`
- `Skewness`

If the data is not numerical, all of the properties above will return `double.NaN`.

### Miscellaneous

Here is a list of the remaining `public` properties and methods.

#### Properties

- `Length` - the number of table entries (read only)
- `Tag` - an object which can be set by the user (writable)
- `Description` - the description of the table (writable)
- `P_Value` - contains the p-value computed by the Kolmogorov-Smirnov test

#### Methods

- `Add(T Value)` and `Add(T Text, TextAnalyzeMode mode)`
- `Remove(T Value)`
- `GetTableAsArray()` and `GetTableAsArray(FrequencyTableSortOrder order)` (sorting is done using the Quicksort algorithm)
- `GetEnumerator()`
- `ContainsValue(T value)`
- `GetCumulativeFrequencyTable(CumulativeFrequencyTableFormat Format)`
- `GetData(bool Pristine)` - returns the data as an array (sorted or in input order)
- `GetRelativeFrequency(T value, out double relFreq)`

The code is (I think) well documented, so you can use it to get a detailed insight into my solution. I am sure that this solution is not perfect, but it is a good starting point.

For a better overview, I have added a compiled help file (see download at the top of this page).

## History

- Version 1.0 - 18 Jan '07
- Version 1.5 - 04 Feb '07
  - Minor bug fixes (highest frequency was not set correctly)
  - Added normality testing
  - Added descriptive statistics
- 09 Feb '07 - `P_Value` added, release number not changed
- Version 2.0 - 26 Feb '07 - `GetData(bool Pristine)` added