## What's New?

`GetData`: The new parameter `Pristine` indicates whether the data should be returned in the order in which they were entered. This can matter for time-dependent data. To support this, the new `Add` method stores the position(s) of each value in a `Dictionary<T, List<int>>`.
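The position-tracking idea can be sketched as follows. This is a minimal illustration, not the class's actual code: the type name `PositionTrackingTable<T>` and its internals are assumptions; only the `Dictionary<T, List<int>>` layout and the meaning of the `Pristine` flag come from the description above.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: shows only the position-tracking idea.
public class PositionTrackingTable<T> where T : IComparable
{
    // value -> zero-based input positions at which the value was added
    private readonly Dictionary<T, List<int>> _positions =
        new Dictionary<T, List<int>>();
    private int _count;

    public void Add(T value)
    {
        List<int> slots;
        if (!_positions.TryGetValue(value, out slots))
        {
            slots = new List<int>();
            _positions.Add(value, slots);
        }
        slots.Add(_count++); // remember where this occurrence was entered
    }

    // pristine == true: input order; pristine == false: sorted ascending.
    public T[] GetData(bool pristine)
    {
        T[] result = new T[_count];
        foreach (KeyValuePair<T, List<int>> pair in _positions)
            foreach (int index in pair.Value)
                result[index] = pair.Key; // put each value back in its slot
        if (!pristine)
            Array.Sort(result);
        return result;
    }
}
```

Storing positions per value keeps the frequency lookup cheap (the list length is the absolute frequency) while still allowing the original sequence to be rebuilt.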

## Introduction

Exploring empirical data is a common task in many fields. The computational analysis of such data, however, is often hampered by the nature (the type) of the data. So, while writing my own statistical routines, I decided to develop a "Frequency Table Class" that accepts any type of data.

## Requirements

The class has to satisfy the following requirements:

- Accept any type of data (especially multiple precision types)
- Accept a given array
- Provide methods for adding and removing single values
- Automatically update the absolute frequency when a value is added/removed
- Provide a simple way to fetch the mode, highest frequency ...
- Provide a method to get the table as an array
- Returned arrays must be sortable by frequency and value
- Provide fields/properties to describe the table

## Code

The backbone of this class is the `FrequencyTableEntry<T>` structure.
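A table entry of this kind could look roughly like the sketch below. The member names (`Value`, `AbsoluteFreq`, `RelativeFreq`, `Percentage`) and the constructor are assumptions made for illustration; the actual structure may carry different members.

```csharp
using System;

// Sketch only: member names are assumptions, not taken from the class.
public struct FrequencyTableEntry<T> where T : IComparable
{
    public readonly T Value;             // the counted value
    public readonly int AbsoluteFreq;    // how often it occurred
    public readonly double RelativeFreq; // AbsoluteFreq / sample size
    public readonly double Percentage;   // RelativeFreq * 100

    public FrequencyTableEntry(T value, int absoluteFreq, int sampleSize)
    {
        double relative = sampleSize > 0
            ? (double)absoluteFreq / sampleSize
            : 0.0;
        Value = value;
        AbsoluteFreq = absoluteFreq;
        RelativeFreq = relative;
        Percentage = relative * 100.0;
    }
}
```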

The specified type `T` has to implement the `IComparable` interface (needed for the sorting routine). The class stores the data in a generic dictionary, `_entries = new Dictionary<T,int>()`: the `_entries.Keys` collection contains the values to count, and the `_entries.Values` collection contains the absolute frequency of each value.

To provide easy access to table entries, the implemented enumerator returns `FrequencyTableEntry<T>` structures.
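How a `Dictionary<T,int>` and an entry-returning enumerator fit together can be sketched like this. `MiniFrequencyTable<T>` and the simplified `Entry<T>` stand-in are hypothetical names used only for this example; the real class has more members and may build its entries differently.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

// Simplified stand-in for FrequencyTableEntry<T> (assumed members).
public struct Entry<T>
{
    public readonly T Value;
    public readonly int AbsoluteFreq;
    public readonly double RelativeFreq;

    public Entry(T value, int absoluteFreq, double relativeFreq)
    {
        Value = value;
        AbsoluteFreq = absoluteFreq;
        RelativeFreq = relativeFreq;
    }
}

public class MiniFrequencyTable<T> : IEnumerable<Entry<T>> where T : IComparable
{
    // Keys: the values to count. Values: their absolute frequencies.
    private readonly Dictionary<T, int> _entries = new Dictionary<T, int>();
    private int _sampleSize;

    public void Add(T value)
    {
        int count;
        _entries.TryGetValue(value, out count); // count stays 0 for a new value
        _entries[value] = count + 1;
        _sampleSize++;
    }

    // Each dictionary pair is turned into an entry on the fly.
    public IEnumerator<Entry<T>> GetEnumerator()
    {
        foreach (KeyValuePair<T, int> pair in _entries)
            yield return new Entry<T>(
                pair.Key, pair.Value, (double)pair.Value / _sampleSize);
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}
```

With this shape, a caller can write `foreach (Entry<string> e in table)` and read the value and its frequencies directly, without touching the dictionary.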

The general `Add(T value)` method adds a single value and updates its absolute frequency.

To simplify the analysis of a given text, I have implemented a special constructor and an associated `Add` method.

In my opinion, it is useful to provide different modes for text analysis. These modes are provided by `TextAnalyzeMode`.

The analysis itself is performed by `AnalyzeString(T Text, TextAnalyzeMode mode)`.
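A mode-driven character count might be sketched as below. The members of `TextAnalyzeMode` are not listed in this article, so the example defines its own hypothetical enum `SketchTextMode`, and it works on a plain `string` rather than on the generic `T` used by `AnalyzeString`.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical modes; not the members of the real TextAnalyzeMode.
public enum SketchTextMode
{
    AllCharacters,     // count every character, including spaces
    LettersOnly,       // count letters, case-sensitive
    LettersIgnoreCase  // count letters, folding case first
}

public static class TextFrequency
{
    // Counts the characters of 'text' according to 'mode'.
    public static Dictionary<char, int> Analyze(string text, SketchTextMode mode)
    {
        Dictionary<char, int> counts = new Dictionary<char, int>();
        foreach (char raw in text)
        {
            char c = raw;
            if (mode != SketchTextMode.AllCharacters && !char.IsLetter(c))
                continue; // skip digits, whitespace and punctuation
            if (mode == SketchTextMode.LettersIgnoreCase)
                c = char.ToLowerInvariant(c);

            int n;
            counts.TryGetValue(c, out n); // n stays 0 for a new character
            counts[c] = n + 1;
        }
        return counts;
    }
}
```

For the input `"Aa b1"`, the case-insensitive mode counts `a` twice and `b` once, while the all-characters mode yields five distinct entries.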

## Test for Normality

The question of whether given data are "Gaussian-distributed" is often raised. There are several robust and valid tests to answer this question. I have implemented the "good old" Kolmogorov-Smirnov test (KS-Test). Alternatively, one can use the D'Agostino-Pearson test. There are two new properties concerning normality testing:

- `IsGaussian`: Returns `true` if the data are numerical and the computed p-value is greater than `Alpha` (see below)
- `Alpha`: Defines the "significance level" for the KS-Test

The KS-Test method returns `true` if the test is applicable. In case of non-numerical data, the method returns `false`. The `out` parameter `p` contains the p-value on exit. This value can also be accessed through the `P_Value` property.
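The general shape of such a test can be sketched as follows. This is an illustration under several assumptions, not the implementation used in the class: the mean and standard deviation are estimated from the data itself (which makes the standard KS p-value conservative), the error function is a simple Abramowitz-Stegun approximation rather than the Cephes translation, and the p-value comes from the asymptotic Kolmogorov series with a common small-sample correction. The names `KsSketch`, `Erf`, `NormalCdf` and `KsTest` are hypothetical.

```csharp
using System;

public static class KsSketch
{
    // Abramowitz-Stegun formula 7.1.26 (absolute error about 1.5e-7).
    public static double Erf(double x)
    {
        double sign = x < 0 ? -1.0 : 1.0;
        x = Math.Abs(x);
        double t = 1.0 / (1.0 + 0.3275911 * x);
        double poly = ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t
                        - 0.284496736) * t + 0.254829592) * t;
        return sign * (1.0 - poly * Math.Exp(-x * x));
    }

    public static double NormalCdf(double x, double mean, double sd)
    {
        return 0.5 * (1.0 + Erf((x - mean) / (sd * Math.Sqrt(2.0))));
    }

    // Returns false when the test is not applicable
    // (fewer than two values or no spread in the data).
    public static bool KsTest(double[] data, out double p)
    {
        p = double.NaN;
        int n = data.Length;
        if (n < 2) return false;

        double[] sorted = (double[])data.Clone();
        Array.Sort(sorted);

        double mean = 0.0;
        foreach (double v in sorted) mean += v;
        mean /= n;
        double ss = 0.0;
        foreach (double v in sorted) ss += (v - mean) * (v - mean);
        double sd = Math.Sqrt(ss / (n - 1));
        if (sd == 0.0) return false;

        // D: largest gap between the empirical CDF and the Gaussian CDF.
        double d = 0.0;
        for (int i = 0; i < n; i++)
        {
            double f = NormalCdf(sorted[i], mean, sd);
            d = Math.Max(d, Math.Max(f - (double)i / n, (double)(i + 1) / n - f));
        }

        // Asymptotic Kolmogorov distribution with a small-sample correction.
        double lambda = (Math.Sqrt(n) + 0.12 + 0.11 / Math.Sqrt(n)) * d;
        if (lambda < 0.2) { p = 1.0; return true; } // series is ~1 here

        double sum = 0.0;
        double factor = 2.0;
        for (int k = 1; k <= 100; k++)
        {
            sum += factor * Math.Exp(-2.0 * k * k * lambda * lambda);
            factor = -factor; // alternating series
        }
        p = Math.Min(1.0, Math.Max(0.0, sum));
        return true;
    }
}
```

A caller would then compare the returned `p` with the chosen significance level, in the same spirit as `IsGaussian` compares the p-value with `Alpha`.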

To compute the "test distribution" (the Gaussian CDF in this case), we need the so-called error function. I have used the Erf implementation written by Miroslav Stampar (see Special Function(s) for C#), which is a translation of the Cephes Math Library by Stephen L. Moshier.

## Descriptive Statistics

I think it is useful to implement some fundamental statistical properties inside the class.

### Cumulative Frequencies

First of all, a method is needed that returns the empirical distribution function (the cumulative distribution function) of the given data.
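The empirical distribution function at a value x is the share of observations less than or equal to x, so it can be built by sorting the distinct values and accumulating their absolute frequencies. The sketch below shows this on a plain value-to-count dictionary; `CumulativeSketch.Cumulative` is a hypothetical helper, not the signature of `GetCumulativeFrequencyTable`.

```csharp
using System;
using System.Collections.Generic;

public static class CumulativeSketch
{
    // Returns (value, cumulative relative frequency) pairs,
    // ordered by ascending value.
    public static KeyValuePair<T, double>[] Cumulative<T>(IDictionary<T, int> counts)
        where T : IComparable
    {
        T[] values = new T[counts.Count];
        counts.Keys.CopyTo(values, 0);
        Array.Sort(values);

        int total = 0;
        foreach (int c in counts.Values) total += c;

        KeyValuePair<T, double>[] result =
            new KeyValuePair<T, double>[values.Length];
        int running = 0;
        for (int i = 0; i < values.Length; i++)
        {
            running += counts[values[i]]; // observations <= values[i]
            result[i] = new KeyValuePair<T, double>(
                values[i], (double)running / total);
        }
        return result;
    }
}
```

For the counts 1 (twice), 2 (once) and 3 (once), the cumulative relative frequencies are 0.5, 0.75 and 1.0.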

### Where Are My Data??

OK, you need an array of the added data? `GetData(bool Pristine)` returns it, either sorted or in input order.

### What Else?

There are some `public` properties concerning descriptive statistics:

- `Mean`
- `Median`
- `Mode`
- `Minimum`
- `Maximum`
- `VarianceSample`
- `VariancePop` (unbiased estimator)
- `StandardDevSample`
- `StandardDevPop` (unbiased estimator)
- `StandardError`
- `Sum`
- `SampleSize`: the number of data (read only)
- `HighestFrequency`: the highest frequency observed
- `SmallestFrequency`: the smallest frequency
- `ScarcestValue`: the scarcest value
- `Kurtosis`
- `KurtosisExcess`
- `Skewness`

If the data are not numerical, all properties above will return `double.NaN`.
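Statistics like these can be computed directly from value/frequency pairs, without expanding the table back into raw data. The sketch below illustrates this for the mean, the variance and the mode; `DescriptiveSketch` and its method names are hypothetical, and the `useNMinusOne` flag is only this example's way of switching the divisor. It makes no claim about which divisor each of the properties above uses.

```csharp
using System;
using System.Collections.Generic;

public static class DescriptiveSketch
{
    // Frequency-weighted mean; NaN for an empty table.
    public static double Mean(IDictionary<double, int> counts)
    {
        double sum = 0.0;
        int n = 0;
        foreach (KeyValuePair<double, int> e in counts)
        {
            sum += e.Key * e.Value;
            n += e.Value;
        }
        return n > 0 ? sum / n : double.NaN;
    }

    // Divisor n - 1 gives the unbiased variance estimator; divisor n
    // gives the variance of the data treated as a whole population.
    public static double Variance(IDictionary<double, int> counts, bool useNMinusOne)
    {
        double mean = Mean(counts);
        double ss = 0.0;
        int n = 0;
        foreach (KeyValuePair<double, int> e in counts)
        {
            double d = e.Key - mean;
            ss += e.Value * d * d; // each value contributes freq times
            n += e.Value;
        }
        int divisor = useNMinusOne ? n - 1 : n;
        return divisor > 0 ? ss / divisor : double.NaN;
    }

    // Value with the highest absolute frequency; NaN for an empty table.
    public static double Mode(IDictionary<double, int> counts)
    {
        double mode = double.NaN;
        int best = 0;
        foreach (KeyValuePair<double, int> e in counts)
        {
            if (e.Value > best)
            {
                best = e.Value;
                mode = e.Key;
            }
        }
        return mode;
    }
}
```

For the data 1, 1, 2, 4 the mean is 2, the variance is 2 with divisor n - 1 and 1.5 with divisor n, and the mode is 1.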

### Miscellaneous

Here is a list of the remaining `public` properties and methods.

#### Properties

- `Length`: The number of table entries (read only)
- `Tag`: An object which can be set by the user (writable)
- `Description`: The description of the table (writable)
- `P_Value`: Contains the p-value computed by the Kolmogorov-Smirnov test

#### Methods

- `Add(T Value)` and `Add(T Test, TextAnalyzeMode mode)`
- `Remove(T Value)`
- `GetTableAsArray()` and `GetTableAsArray(FrequencyTableSortOrder order)` (sorting is done using the Quicksort algorithm)
- `GetEnumerator()`
- `ContainsValue(T value)`
- `GetCumulativeFrequencyTable(CumulativeFrequencyTableFormat Format)`
- `GetData(bool Pristine)`: Returns the data as an array (sorted or in input order)
- `GetRelativeFrequency(T value, out double relFreq)`
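To illustrate how a table array can be sorted either by value or by frequency with Quicksort, here is a minimal sketch. It sorts `KeyValuePair<T,int>` rows with a comparison delegate; `SortSketch` and the two comparison helpers are hypothetical, and the members of `FrequencyTableSortOrder` are not shown in this article, so the sketch does not use that type.

```csharp
using System;
using System.Collections.Generic;

public static class SortSketch
{
    // In-place quicksort with a middle-element pivot.
    public static void QuickSort<T>(T[] items, Comparison<T> compare)
    {
        Sort(items, 0, items.Length - 1, compare);
    }

    private static void Sort<T>(T[] a, int lo, int hi, Comparison<T> compare)
    {
        if (lo >= hi) return;
        T pivot = a[lo + (hi - lo) / 2];
        int i = lo, j = hi;
        while (i <= j)
        {
            while (compare(a[i], pivot) < 0) i++;
            while (compare(a[j], pivot) > 0) j--;
            if (i <= j)
            {
                T tmp = a[i]; a[i] = a[j]; a[j] = tmp;
                i++; j--;
            }
        }
        Sort(a, lo, j, compare);
        Sort(a, i, hi, compare);
    }

    // Orders rows by their value, smallest first.
    public static int ByValueAscending<T>(
        KeyValuePair<T, int> x, KeyValuePair<T, int> y) where T : IComparable
    {
        return x.Key.CompareTo(y.Key);
    }

    // Orders rows by their absolute frequency, largest first.
    public static int ByFrequencyDescending<T>(
        KeyValuePair<T, int> x, KeyValuePair<T, int> y) where T : IComparable
    {
        return y.Value.CompareTo(x.Value);
    }
}
```

Passing the ordering as a delegate keeps one sorting routine for all sort orders; this is also where the `IComparable` constraint on `T` is needed.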

The code is, I think, well documented, so you can use it to gain detailed insight into my solution. I am sure that this solution is not perfect, but it is a good starting point.

For a better overview, I have added a compiled help file (see download at the top of this page).

## History

- Version 1.0 - 18 Jan '07
- Version 1.5 - 04 Feb '07
  - Minor bug fixes (highest frequency was not set correctly)
  - Added normality testing
  - Added descriptive statistics
- 09 Feb '07
  - `P_Value` added, release number not changed
- Version 2.0 - 26 Feb '07
  - `GetData(bool Pristine)` added