13,149,099 members (56,852 online)
alternative version

#### Stats

72.6K views
60 bookmarked
Posted 18 Jan 2007

# A Generic Frequency Table in C# with Descriptive Statistics

, 26 Feb 2007
 Rate this:
Frequencies, descriptive statistics and normality testing

## What's New?

`GetData`: The new parameter `Pristine` indicates if you want to get data in the same order as entered. That might be of importance in case of time-dependant data. The new `Add`-method stores the position(s) of each value using a `Dictionary<</span />T, List<</span />int</span />></span />></span />`.

## Introduction

The exploration of empirical data is a common task in various fields. Especially the computational analysis of such data is often cumbered by the nature (the type) of the data. So - while writing statistical routines by my own - I decided to develop a "Frequency Table Class" which accepts any possible type of data.

## Requirements

My class has to satisfy requirements as follows:

• Accept any type of data (especially multiple precision types)
• Accept a given array
• Provide methods for adding and removing single values
• Automatically update the absolute frequency when a value is added/removed
• A simple way to fetch mode, highest frequency ...
• Provide a method to get the table as an array
• Returned arrays must be sortable by frequency and value
• Provide fields/properties to describe the table

## Code

The backbone of this class is the `FrequencyTableEntry<</span />T></span />` structure:

```//</span /> A generic structure storing the frequency information for each value</span />
public</span /> struct</span /> FrequencyTableEntry<T> where T : IComparable<T>
{
//</span /> Constructor</span />
//</span /> val: The value counted</span />
//</span /> absFreq: The absolute frequency</span />
//</span /> relFreq: The relative frequency</span />
//</span /> percentage: The percentage</span />
public</span /> FrequencyTableEntry(T val, int</span /> absFreq, double</span /> relFreq, double</span /> percentage)
{
Value = val;
AbsoluteFreq = absFreq;
RelativeFreq = relFreq;
Percentage = percentage;
}
public</span /> T Value;
public</span /> int</span /> AbsoluteFreq;
public</span /> double</span /> RelativeFreq;
public</span /> double</span /> Percentage;
}```

The specified type `T `has to implement the `IComparable`-Interface (needed for the sorting routine). The class stores the data in a generic Dictionary: `_entries = new</span /> Dictionary<</span />T,int</span />></span />()`: the `_entries.Keys`-Collection contains the values to count, the `_entries.Values`-Collection contains the absolute frequency for this particular value:

```public</span /> FrequencyTable(int</span /> initialCapacity)
{
_entries = new</span /> Dictionary<T, int>(initialCapacity);
.
.
}```

To provide easy access to table entries, the implemented enumerator returns the structure above:

```public</span /> IEnumerator<FrequencyTableEntry<T>> GetEnumerator()
{
//</span /> the structure to return</span />
FrequencyTableEntry<T> _output;
//</span /> the frequency</span />
int</span /> int_f;
//</span /> the "double-typed" frequency</span />
double</span /> dbl_f;
foreach</span /> (T key in</span /> _entries.Keys)
{
int_f = _entries[key];
dbl_f = (double</span />)int_f;
//</span /> fill the structure</span />
_output = new</span /> FrequencyTableEntry<T>
(key, int_f, dbl_f / _count, dbl_f / _count * 100</span />.0</span />);
//</span /> yielding - cool thing that</span />
yield return</span /> _output;
}
}```

The general `Add(T value</span />)` method looks like this:

```public</span /> void</span /> Add(T value)
{
List<int> _tempPos;
//</span /> if the Dictionary already contains value, then</span />
//</span /> we have to update frequency and _count</span />
if</span /> (_entries.ContainsKey(value))
{
//</span /> update the frequency</span />
_entries[value]++;
//</span /> update mode and highest frequency</span />
if</span /> (_entries[value] > _high)
{
_high = _entries[value];
_mode = value;
}
//</span /> add 1 to sample size</span />
_count++;
foreach</span /> (T key in</span /> _entries.Keys)
{
_relFrequencies[key] = (double</span />)_entries[key] / _count;
}
//</span /> store the actual position of the entry in the dataset</span />
_positions.TryGetValue(value, out</span /> _tempPos);
//</span /> the position is equal to _count</span />
//</span /> remove old entry</span />
_positions.Remove(value);
//</span /> store new entry</span />
}
else</span /> /*</span /> if the dictionary does not contain value add a new entry */</span />
{
//</span /> if the highest frequency is still zero, set it to one</span />
if</span /> (_high < 1</span />)
{
_high = 1</span />;
_mode = value;
}
//</span /> add a new entry - frequency is one</span />
//</span /> add 1 to table length</span />
_length++;
//</span /> add 1 to sample size</span />
_count++;
//</span /> update relative frequencies</span />
_relFrequencies.Add(value, 0</span />.0</span />);
foreach</span /> (T key in</span /> _entries.Keys)
{
_relFrequencies[key] = (double</span />)_entries[key] / _count;
}
//</span /> create a new entry and set position to _count</span />
_tempPos = new</span /> List<int>();
//</span /> store it</span />
}
}```

To simplify the analysis of a given text, I have implemented a special constructor:

```//</span /> Constructor - the created instance analyzes the </span />
//</span /> frequency of characters in a given string</span />
//</span /> Text: String to analyze</span />
public</span /> FrequencyTable(T Text, TextAnalyzeMode mode)
{
_positions = new</span /> Dictionary<T, List<int>>();
//</span /> if T is not string -> Exception</span />
if</span /> (!(Text is</span /> string</span />))
throw</span /> new</span /> ArgumentException();
//</span /> the table itself</span />
_entries = new</span /> Dictionary<T, int>();
_relFrequencies = new</span /> Hashtable();
//</span /> number of entries in _entries</span />
_length = 0</span />;
//</span /> sample size</span />
_count = 0</span />;
//</span /> description of the table</span />
_description = "</span />"</span />;
//</span /> a user defined tag</span />
_tag = 0</span />;
//</span /> the highest frequency</span />
_high = 0</span />;
_dblSum = double</span />.NaN;
_mean = double</span />.NaN;
_alpha = double</span />.NaN;
AnalyzeString(Text, mode);
_p = double</span />.NaN;
}```

The associated `Add` method:

```public</span /> void</span /> Add(T Text, TextAnalyzeMode mode)
{
if</span /> (!(Text is</span /> string</span />))
throw</span /> new</span /> ArgumentException();
AnalyseString(Text, mode);
}```

In my opinion, it is useful to provide different modes regarding literal analysis. These modes are provided by `TextAnalyzeMode`:

```public</span /> enum</span /> TextAnalyzeMode
{
AllCharacters,
NoNumerals,
NoSpecialCharacters,
LettersOnly,
NumeralsOnly,
SpecialCharactersOnly
}```

The analysis itself is performed by `AnalyzeString(T Text, TextAnalyzeMode mode)`:

```private</span /> void</span /> AnalyzeString(T Text, TextAnalyzeMode mode)
{
//</span /> character strings</span />
string</span /> str_specialChars = @"</span />"</span />"</span />!§\$%&/()=?@€<>|µ,.;:-_#'*+~²³ "</span />;
string</span /> str_Numbers = "</span />0123456789"</span />;
//</span /> Adding the entries according to mode</span />
switch</span /> (mode)
{
case</span /> TextAnalyzeMode.AllCharacters:
foreach</span /> (char</span /> v in</span /> Text.ToString())
this</span />.Add((T)Convert.ChangeType((object</span />)v, Text.Getype()));
break</span />;
case</span /> TextAnalyzeMode.LettersOnly:
foreach</span /> (char</span /> v in</span /> Text.ToString())
{
if</span /> ((str_specialChars.IndexOf(v) == -1) &
(str_Numbers.IndexOf(v) == -1))
this</span />.Add((T)Convert.ChangeType((object</span />)v, Text.GetType()));
}
break</span />;
case</span /> TextAnalyzeMode.NoNumerals:
foreach</span /> (char</span /> v in</span /> Text.ToString())
{
if</span /> (str_Numbers.IndexOf(v) == -1)
this</span />.Add((T)Convert.ChangeType((object</span />)v, Text.GetType()));
}
break</span />;
case</span /> TextAnalyzeMode.NoSpecialCharacters:
foreach</span /> (char</span /> v in</span /> Text.ToString())
{
if</span /> (str_specialChars.IndexOf(v) == -1)
this</span />.Add((T)Convert.ChangeType((object</span />)v, Text.GetType()));
}
break</span />;
case</span /> TextAnalyzeMode.NumeralsOnly:
foreach</span /> (char</span /> v in</span /> Text.ToString())
{
if</span /> (str_Numbers.IndexOf(v) != -1)
this</span />.Add((T)Convert.ChangeType((object</span />)v, Text.GetType()));
}
break</span />;
case</span /> TextAnalyzeMode.SpecialCharactersOnly:
foreach</span /> (char</span /> v in</span /> Text.ToString())
{
if</span /> (str_specialChars.IndexOf(v) != -1)
this</span />.Add((T)Convert.ChangeType((object</span />)v, Text.GetType()));
}
break</span />;
}
}```

## Test for Normality

The question if given data are "Gaussian-distributed" is often raised. There are some robust and valid tests to answer this question. I have implemented the "good old" Kolmogorov-Smirnov test (KS-Test). Alternatively one can use the D'Agostino-Pearson test. There are two new properties concerning normality testing:

• `IsGaussian`: Returns `true</span /> `if data are numerical and the computed p-value is greater than Alpha (see below)
• `Alpha`: Defines the "significance level" for the KS-Test

The KS-Test method is shown below. The method returns `true</span />`, if the test is applicable. In case of non-numerical data, the method returns `false</span />`. The `out`-parameter `p `contains the `p`-value on exit. This value can be accessed by calling the `P_Value` property.

```private</span /> bool</span /> KS_Test(out</span /> double</span /> p)
{
//</span /> D-statistic</span />
double</span /> D = double</span />.NaN;
CumulativeFrequencyTableEntry<T>[] empCDF =
GetCumulativeFrequencyTable(CumulativeFrequencyTableFormat.EachDatapointOnce);
//</span /> store the test CDF</span />
double</span /> testCDF;
//</span /> array to store datapoints</span />
double</span />[] data = new</span /> double</span />[empCDF.Length];
FrequencyTableEntry<T>[] table = GetTableAsArray
(FrequencyTableSortOrder.Value_Ascending);
int</span /> i = 0</span />;
//</span /> prevent exceptions if T is not numerical</span />
try</span />
{
foreach</span /> (FrequencyTableEntry<T> entry in</span /> table)
{
data[i] = (double</span />)Convert.ChangeType(entry.Value, TypeCode.Double</span />);
i++;
}
}
catch</span />
{
p = double</span />.NaN;
return</span /> false</span />;
}
//</span /> estimate the parameters of the expected Gaussian distribution</span />
//</span /> first: compute the mean</span />
double</span /> mean = Mean;
//</span /> compute the bias-corrected variance</span />
//</span /> as an estimator for the population variance (actually we need the</span />
//</span /> square root)</span />
double</span /> _sqrt_var = Math.Sqrt(VariancePop);
//</span /> now we have to determine the greatest difference between the</span />
//</span /> sample cumulative distribution function (empCDF) and</span />
//</span /> the distribution function to test (testCDF)</span />
double</span /> _sqrt2 = Math.Sqrt(2</span />.0</span />);
double</span /> _erf;
double</span /> max1 = 0</span />.0</span />;
double</span /> max2 = 0</span />.0</span />;
double</span /> _temp;
for</span /> (i = 0</span />; i < empCDF.Length; i++)
{
//</span /> compute the expected distribution using the error function</span />
_erf = Erf(((data[i] - mean) / _sqrt_var) / _sqrt2);
testCDF = 0</span />.5</span /> * (1</span />.0</span /> + _erf);
_temp = Math.Abs(empCDF[i].CumulativeRelativeFrequency - testCDF);
if</span /> (_temp > max1)
max1 = _temp;
if</span /> (i > 0</span />)
_temp = Math.Abs(empCDF[i - 1</span />].CumulativeRelativeFrequency - testCDF);
else</span />
_temp = testCDF;
if</span /> (_temp > max2)
max2 = _temp;
}
//</span /> the statistics to use is</span />
//</span /> max{diff1,diff2}</span />
D = max1 > max2 ? max1 : max2;
//</span /> now compute the p-value using a z-transformation</span />
if</span /> (!Double.IsNaN(D))
{
double</span /> z = Math.Sqrt((double</span />)SampleSize) * D;
p = KS_Prob_Smirnov(z);
}
else</span />
p = double</span />.NaN;
return</span /> true</span />;
}```

To compute the "test distribution" (the Gaussian CDF in this case) we need the so called error function. I have used the Erf-implementation written by Miroslav Stampar (see Special Function(s) for C#), which is a translation of the Cephes Math Library by Stephen L. Moshier.

## Descriptive Statistics

I think it is useful to implement some fundamental statistical properties inside the class.

### Cumulative Frequencies

First of all, it is needed to implement a method which returns the empirical distribution function (the cumulative density function) of the given data:

```public</span /> CumulativeFrequencyTableEntry<T>[]
GetCumulativeFrequencyTable(CumulativeFrequencyTableFormat Format)
{
CumulativeFrequencyTableEntry<T>[] _output = null</span />;
//</span /> get the frequency table as array for easier processing</span />
FrequencyTableEntry<T>[] _freqTable =
GetTableAsArray(FrequencyTableSortOrder.Value_Ascending);
//</span /> temporary values</span />
double</span /> tempCumRelFreq = 0</span />.0</span />;
int</span /> tempCumAbsFreq = 0</span />;
int</span /> i, k;
switch</span /> (Format)
{
//</span /> each datapoint will returned</span />
case</span /> CumulativeFrequencyTableFormat.EachDatapoint:
//</span /> initialize the result</span />
_output = new</span /> CumulativeFrequencyTableEntry<T>[SampleSize];
for</span /> (i = 0</span />; i < _freqTable.Length; i++)
{
//</span /> update the cumulative frequency - relative and absolute</span />
tempCumAbsFreq += _freqTable[i].AbsoluteFreq;
tempCumRelFreq += _freqTable[i].RelativeFreq;
//</span /> fill the array</span />
for</span /> (k = tempCumAbsFreq - _freqTable[i].AbsoluteFreq;k < tempCumAbsFreq; k++)
{
_output[k] = new</span /> CumulativeFrequencyTableEntry<T>
(_freqTable[i].Value, tempCumRelFreq, tempCumAbsFreq);
}
}
break</span />;
//</span /> here each different entry will be returned once</span />
case</span /> CumulativeFrequencyTableFormat.EachDatapointOnce:
//</span /> initialize the result</span />
_output = new</span /> CumulativeFrequencyTableEntry<T>[Length];
for</span /> (i = 0</span />; i < _freqTable.Length; i++)
{
//</span /> update the cumulative frequency - relative and absolute</span />
tempCumAbsFreq += _freqTable[i].AbsoluteFreq;
tempCumRelFreq += _freqTable[i].RelativeFreq;
//</span /> fill the array</span />
_output[i] = new</span /> CumulativeFrequencyTableEntry<T>
(_freqTable[i].Value, tempCumRelFreq, tempCumAbsFreq);
}
break</span />;
}
//</span /> done</span />
return</span /> _output;
}```

(Sorry for that strange formatting - this edit tool...)

### Where Are My Data??

Ok - you need an array of the added data? That is the way:

```public</span /> T[] GetData(bool</span /> Pristine)
{
T[] result = new</span /> T[SampleSize];
//</span /> if the order is not important</span />
if</span /> (!Pristine)
{
CumulativeFrequencyTableEntry<T>[] cf = GetCumulativeFrequencyTable
(CumulativeFrequencyTableFormat.EachDatapoint);
for</span /> (int</span /> i = 0</span />; i < SampleSize; i++)
result[i] = cf[i].Value;
}
else</span /> /*</span /> return the data in same order as entered */</span />
{
List<int> l;
foreach</span /> (T key in</span /> _positions.Keys)
{
_positions.TryGetValue(key, out</span /> l);
foreach</span /> (int</span /> k in</span /> l)
{
result[k - 1</span />] = key;
}
}
}
return</span /> result;
}```

### What Else?

There are some `public</span /> `properties concerning descriptive statistics:

• `Mean`
• `Median`
• `Mode`
• `Minimum`
• `Maximum`
• `VarianceSample`
• `VariancePop` (unbiased estimator)
• `StandardDevSample`
• `StandardDevPop `(unbiased estimator)
• `StandardError`
• `Sum`
• `SampleSize `- the number of data (read only)
• `HighestFrequency `- the highest frequency observed
• `SmallestFrequency `- the smallest frequency
• `ScarcestValue `- the scarcest value
• `Kurtosis`
• `KurtosisExcess`
• `Skewness`

If the data is not numerical, all properties above will return `double</span />.NaN`.

### Miscellaneous

Here is a list of the remaining `public</span /> `properties and methods.

#### Properties

• `Length `- The number of table entries (read only)
• `Tag `- An object which can be set by the user (writable)
• `Description `- The description of the table (writable)
• `P_Value` (contains the `p `value computed by the Kolmogorov-Smirnov Test)

#### Methods

• `Add(T Value)` and `Add(T Test, TextAnalyzeMode mode)`
• `Remove(T Value)`
• `GetTableAsArray()` and `GetTableAsArray(FrequencyTableSortOrder order)` (sorting is done by using the Quicksort-Algorithm)
• `GetEnumerator()`
• `ContainsValue(T value</span />)`
• `GetCumulativeFrequencyTable(CumulativeFrequencyTableFormat Format)`
• `GetData(bool</span /> Pristine)` - Returns the data as an array (sorted or in input order)
• `GetRelativeFrequency(T value</span />, out double</span /> relFreq) `

The code is (I think so) well documented so you can use it to get a detailed insight into my solution. I am sure that this solution is not perfect, but it is a good starting point.

For a better overview, I have added a compiled help file (see download at the top of this page).

## History

• Version 1.0 - 18 Jan '07
• Initial release
• Version 1.5 - 04 Feb '07
• Minor bug fixes (highest frequency was not set correctly)
• Added normality testing
• Added descriptive statistics
• 09 Feb '07
• `P_Value `added, release number not changed
• Version 2.0 26 Feb '07
• `GetData(bool</span /> Pristine)` added

## About the Author

 Web Developer Germany
Anesthesiologist from Germany
- first contact: 1985 - ATARI 800 XE (there was a great assembler: ATMAS II)
- special interests: my son, number theory, statistics, linear algebra, medicine (of course)

## Comments and Discussions

 First Prev Next
 Time Delay of two Frequencies mangotj9-Jun-10 17:30 mangotj 9-Jun-10 17:30
 This is great and easy to use -- Thanks Michael Yourshaw25-Jul-08 17:48 Michael Yourshaw 25-Jul-08 17:48
 Dictionary and performance bronislav27-Apr-08 1:49 bronislav 27-Apr-08 1:49
 How to add to a frequency > 1 theperm30-Aug-07 7:51 theperm 30-Aug-07 7:51
 Do, ut des visalia12-Apr-07 5:51 visalia 12-Apr-07 5:51
 Re: Do, ut des V. Thieme16-Jan-09 23:38 V. Thieme 16-Jan-09 23:38
 Suggestion TemplMetaProg26-Feb-07 23:51 TemplMetaProg 26-Feb-07 23:51
 Re: Suggestion V. Thieme27-Feb-07 3:25 V. Thieme 27-Feb-07 3:25
 Excellent class with additional functions.... [modified] scslmd31-Jan-07 11:56 scslmd 31-Jan-07 11:56
 Re: Excellent class with additional functions.... Volker Thieme1-Feb-07 7:11 Volker Thieme 1-Feb-07 7:11
 This is a very nice post! Grav-Vt27-Jan-07 4:57 Grav-Vt 27-Jan-07 4:57
 Re: This is a very nice post! Volker Thieme27-Jan-07 6:35 Volker Thieme 27-Jan-07 6:35
 Outstanding AreJay25-Jan-07 6:37 AreJay 25-Jan-07 6:37
 Re: Outstanding Volker Thieme25-Jan-07 9:10 Volker Thieme 25-Jan-07 9:10
 Awesome Steve Hansen18-Jan-07 21:04 Steve Hansen 18-Jan-07 21:04
 Re: Awesome Volker Thieme18-Jan-07 21:16 Volker Thieme 18-Jan-07 21:16
 Last Visit: 31-Dec-99 18:00     Last Update: 25-Sep-17 2:14 Refresh 1

General    News    Suggestion    Question    Bug    Answer    Joke    Praise    Rant    Admin

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.