A Generic Frequency Table in C# with Descriptive Statistics

4.72/5 (14 votes) · Jan 18, 2007 · CPOL · 4 min read · 94,267 views · 2,038 downloads

Frequencies, descriptive statistics and normality testing

What's New?

GetData: The new Pristine parameter indicates whether you want to get the data back in the same order as it was entered, which may matter for time-dependent data. The new Add method stores the position(s) of each value using a Dictionary<T, List<int>>.

Introduction

Exploring empirical data is a common task in many fields. The computational analysis of such data is often hampered by the nature (the type) of the data. So - while writing statistical routines on my own - I decided to develop a "Frequency Table" class which accepts any possible type of data.

Requirements

My class has to satisfy the following requirements:

  • Accept any type of data (especially multiple precision types)
  • Accept a given array
  • Provide methods for adding and removing single values
  • Automatically update the absolute frequency when a value is added/removed
  • Provide a simple way to fetch the mode, the highest frequency, etc.
  • Provide a method to get the table as an array
  • Returned arrays must be sortable by frequency and value
  • Provide fields/properties to describe the table

Code

The backbone of this class is the FrequencyTableEntry<T> structure:

// A generic structure storing the frequency information for each value
public struct FrequencyTableEntry<T> where T : IComparable<T>
{
    // Constructor
    // val: The value counted
    // absFreq: The absolute frequency
    // relFreq: The relative frequency
    // percentage: The percentage
    public FrequencyTableEntry(T val, int absFreq, double relFreq, double percentage)
    {
        Value = val;
        AbsoluteFreq = absFreq;
        RelativeFreq = relFreq;
        Percentage = percentage;
    }
    public T Value;
    public int AbsoluteFreq;
    public double RelativeFreq;
    public double Percentage;
}

The specified type T has to implement the IComparable<T> interface (needed for the sorting routine). The class stores the data in a generic Dictionary: _entries = new Dictionary<T, int>(). The _entries.Keys collection contains the values to count; the _entries.Values collection contains the absolute frequency of each particular value:

public FrequencyTable(int initialCapacity)
{
    _entries = new Dictionary<T, int>(initialCapacity);
    .
    .
}

To provide easy access to table entries, the implemented enumerator returns the structure above:

public IEnumerator<FrequencyTableEntry<T>> GetEnumerator()
{
    // the structure to return
    FrequencyTableEntry<T> _output;
    // the frequency
    int int_f;
    // the "double-typed" frequency
    double dbl_f;
    foreach (T key in _entries.Keys)
    {
        int_f = _entries[key];
        dbl_f = (double)int_f;
        // fill the structure
        _output = new FrequencyTableEntry<T>
        (key, int_f, dbl_f / _count, dbl_f / _count * 100.0);
        // yielding - cool thing that
        yield return _output;
    }
}
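A quick usage sketch of the enumerator (hypothetical values; it assumes the FrequencyTable<T> class from this article is referenced in the project):

```csharp
// Sketch only - assumes the article's FrequencyTable<T> is available.
var table = new FrequencyTable<int>(4);
foreach (int v in new[] { 1, 2, 2, 3, 3, 3 })
    table.Add(v);

// The enumerator yields one FrequencyTableEntry<int> per distinct value.
foreach (FrequencyTableEntry<int> entry in table)
{
    Console.WriteLine("{0}: abs={1} rel={2:F3} pct={3:F1}%",
        entry.Value, entry.AbsoluteFreq, entry.RelativeFreq, entry.Percentage);
}
// e.g. the value 3 occurs three times out of six data points:
// abs=3, rel=0.500, pct=50.0%
```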

The general Add(T value) method looks like this:

public void Add(T value)
{
    List<int> _tempPos;
    // if the Dictionary already contains value, then
    // we have to update frequency and _count
    if (_entries.ContainsKey(value))
    {
        // update the frequency
        _entries[value]++;
        // update mode and highest frequency
        if (_entries[value] > _high)
        {
            _high = _entries[value];
            _mode = value;
        }
        // add 1 to sample size
        _count++;
        foreach (T key in _entries.Keys)
        {
            _relFrequencies[key] = (double)_entries[key] / _count;
        }
        UpdateSumAndMean(value);
        // store the actual position of the entry in the dataset
        // store the position of this occurrence in the dataset
        // the position is equal to the updated _count
        _positions.TryGetValue(value, out _tempPos);
        // _tempPos references the stored list, so adding to it
        // updates the dictionary entry in place
        _tempPos.Add(_count);
    }
    else /* if the dictionary does not contain value add a new entry */
    {
        // if the highest frequency is still zero, set it to one
        if (_high < 1)
        {
            _high = 1;
            _mode = value;
        }
        // add a new entry - frequency is one
        _entries.Add(value, 1);
        // add 1 to table length
        _length++;
        // add 1 to sample size
        _count++;
        // update relative frequencies
        _relFrequencies.Add(value, 0.0);
        foreach (T key in _entries.Keys)
        {
            _relFrequencies[key] = (double)_entries[key] / _count;
        }
        UpdateSumAndMean(value);
        // create a new entry and set position to _count
        _tempPos = new List<int>();
        _tempPos.Add(_count);
        // store it
        _positions.Add(value, _tempPos);
    }
}
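The effect of the mode bookkeeping can be illustrated with a short sketch (hypothetical usage, again assuming the class from this article):

```csharp
// Sketch only - assumes the article's FrequencyTable<T> is available.
var t = new FrequencyTable<string>(8);
t.Add("a");
t.Add("b");
t.Add("b");   // "b" now has the highest frequency (2) and becomes the mode

// Mode and HighestFrequency were updated inside Add itself -
// no extra pass over the table is needed to query them.
Console.WriteLine(t.Mode);             // "b"
Console.WriteLine(t.HighestFrequency); // 2
```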

To simplify the analysis of a given text, I have implemented a special constructor:

// Constructor - the created instance analyzes the 
// frequency of characters in a given string
// Text: String to analyze
public FrequencyTable(T Text, TextAnalyzeMode mode)
{
    _positions = new Dictionary<T, List<int>>();
    // if T is not string -> Exception
    if (!(Text is string))
        throw new ArgumentException();
    // the table itself
    _entries = new Dictionary<T, int>();
    _relFrequencies = new Hashtable();
    // number of entries in _entries
    _length = 0;
    // sample size
    _count = 0;
    // description of the table
    _description = "";
    // a user defined tag
    _tag = 0;
    // the highest frequency
    _high = 0;
    _dblSum = double.NaN;
    _mean = double.NaN;
    _alpha = double.NaN;
    AnalyzeString(Text, mode);
    _p = double.NaN;
}

The associated Add method:

public void Add(T Text, TextAnalyzeMode mode)
{
    if (!(Text is string))
        throw new ArgumentException();
    AnalyzeString(Text, mode);
}

In my opinion, it is useful to provide different modes of text analysis. These modes are provided by TextAnalyzeMode:

public enum TextAnalyzeMode
{
    AllCharacters,
    NoNumerals,
    NoSpecialCharacters,
    LettersOnly,
    NumeralsOnly,
    SpecialCharactersOnly
}
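For example, a character frequency analysis of a short string might look like this (hypothetical usage; note that T must be string for the text constructor):

```csharp
// Sketch only - assumes the article's FrequencyTable<T> is available.
// LettersOnly skips numerals and special characters (including spaces).
var letters = new FrequencyTable<string>("Hello World",
                                         TextAnalyzeMode.LettersOnly);
foreach (var e in letters)
    Console.WriteLine("'{0}' x {1}", e.Value, e.AbsoluteFreq);
// 'l' occurs three times, 'o' twice, every other letter once;
// the space is filtered out by LettersOnly
```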

The analysis itself is performed by AnalyzeString(T Text, TextAnalyzeMode mode):

private void AnalyzeString(T Text, TextAnalyzeMode mode)
{
    // character strings
    string str_specialChars = @"""!§$%&/()=?@€<>|µ,.;:-_#'*+~²³ ";
    string str_Numbers = "0123456789";
    // Adding the entries according to mode
    switch (mode)
    {
        case TextAnalyzeMode.AllCharacters:
            foreach (char v in Text.ToString())
                this.Add((T)Convert.ChangeType((object)v, Text.GetType()));
            break;
        case TextAnalyzeMode.LettersOnly:
            foreach (char v in Text.ToString())
            {
                if ((str_specialChars.IndexOf(v) == -1) &&
                    (str_Numbers.IndexOf(v) == -1))
                    this.Add((T)Convert.ChangeType((object)v, Text.GetType()));
            }
            break;
        case TextAnalyzeMode.NoNumerals:
            foreach (char v in Text.ToString())
            {
                if (str_Numbers.IndexOf(v) == -1)
                        this.Add((T)Convert.ChangeType((object)v, Text.GetType()));
            }
            break;
        case TextAnalyzeMode.NoSpecialCharacters:
            foreach (char v in Text.ToString())
            {
                if (str_specialChars.IndexOf(v) == -1)
                    this.Add((T)Convert.ChangeType((object)v, Text.GetType()));
            }
            break;
        case TextAnalyzeMode.NumeralsOnly:
            foreach (char v in Text.ToString())
            {
                if (str_Numbers.IndexOf(v) != -1)
                    this.Add((T)Convert.ChangeType((object)v, Text.GetType()));
            }
            break;
        case TextAnalyzeMode.SpecialCharactersOnly:
            foreach (char v in Text.ToString())
            {
                if (str_specialChars.IndexOf(v) != -1)
                    this.Add((T)Convert.ChangeType((object)v, Text.GetType()));
            }
            break;
    }
}

Test for Normality

The question of whether given data are "Gaussian-distributed" is often raised. There are several robust and valid tests to answer it. I have implemented the "good old" Kolmogorov-Smirnov test (KS test). Alternatively, one can use the D'Agostino-Pearson test. There are two new properties concerning normality testing:

  • IsGaussian: Returns true if data are numerical and the computed p-value is greater than Alpha (see below)
  • Alpha: Defines the "significance level" for the KS-Test

The KS test method is shown below. It returns true if the test is applicable; in case of non-numerical data, it returns false. The out parameter p contains the p-value on exit. This value can also be accessed via the P_Value property.

private bool KS_Test(out double p)
{
   // D-statistic
   double D = double.NaN;
   CumulativeFrequencyTableEntry<T>[] empCDF = 
	GetCumulativeFrequencyTable(CumulativeFrequencyTableFormat.EachDatapointOnce);
   // store the test CDF
   double testCDF;
   // array to store datapoints
   double[] data = new double[empCDF.Length];
   FrequencyTableEntry<T>[] table = GetTableAsArray
			(FrequencyTableSortOrder.Value_Ascending);
   int i = 0;
   // prevent exceptions if T is not numerical
   try
   {
      foreach (FrequencyTableEntry<T> entry in table)
      {
         data[i] = (double)Convert.ChangeType(entry.Value, TypeCode.Double);
         i++;
      }
   }
   catch
   {
      p = double.NaN;
      return false;
   }
   // estimate the parameters of the expected Gaussian distribution
   // first: compute the mean
   double mean = Mean;
   // compute the bias-corrected variance
   // as an estimator for the population variance (actually we need the
   // square root)
   double _sqrt_var = Math.Sqrt(VariancePop);
   // now we have to determine the greatest difference between the
   // sample cumulative distribution function (empCDF) and
   // the distribution function to test (testCDF)
   double _sqrt2 = Math.Sqrt(2.0);
   double _erf;
   double max1 = 0.0;
   double max2 = 0.0;
   double _temp;
   for (i = 0; i < empCDF.Length; i++)
   {
      // compute the expected distribution using the error function
      _erf = Erf(((data[i] - mean) / _sqrt_var) / _sqrt2);
      testCDF = 0.5 * (1.0 + _erf);
      _temp = Math.Abs(empCDF[i].CumulativeRelativeFrequency - testCDF);
      if (_temp > max1)
        max1 = _temp;
      if (i > 0)
        _temp = Math.Abs(empCDF[i - 1].CumulativeRelativeFrequency - testCDF);
      else
        _temp = testCDF;
      if (_temp > max2)
        max2 = _temp;
   }
   // the statistics to use is
   // max{diff1,diff2}
   D = max1 > max2 ? max1 : max2;
   // now compute the p-value using a z-transformation
   if (!Double.IsNaN(D))
   {
      double z = Math.Sqrt((double)SampleSize) * D;
      p = KS_Prob_Smirnov(z);
   }
   else
      p = double.NaN;
   return true;
}
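The helper KS_Prob_Smirnov is not shown in the article. A minimal sketch, assuming the standard asymptotic Kolmogorov distribution Q(z) = 2 · Σ (−1)^(k−1) · exp(−2k²z²) is intended (which is the usual p-value for the scaled statistic z = sqrt(n) · D), could look like this:

```csharp
using System;

static class Kolmogorov
{
    // Asymptotic Kolmogorov distribution:
    //   Q(z) = 2 * sum_{k=1..inf} (-1)^(k-1) * exp(-2 * k^2 * z^2)
    // evaluated by summing the alternating series until it converges.
    public static double KS_Prob_Smirnov(double z)
    {
        if (z < 1e-8) return 1.0;     // Q(z) -> 1 as z -> 0
        double sum = 0.0;
        double sign = 1.0;
        for (int k = 1; k <= 200; k++)
        {
            double term = Math.Exp(-2.0 * k * k * z * z);
            sum += sign * term;
            if (term < 1e-12) break;  // series has converged
            sign = -sign;
        }
        // clamp numerical noise into the valid probability range
        return Math.Max(0.0, Math.Min(1.0, 2.0 * sum));
    }
}
```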

To compute the "test distribution" (the Gaussian CDF in this case), we need the so-called error function. I have used the Erf implementation written by Miroslav Stampar (see Special Function(s) for C#), which is a translation of the Cephes Math Library by Stephen L. Moshier.

Descriptive Statistics

I think it is useful to implement some fundamental statistical properties inside the class.

Cumulative Frequencies

First of all, we need a method which returns the empirical distribution function (the cumulative distribution function) of the given data:

public CumulativeFrequencyTableEntry<T>[] 
	GetCumulativeFrequencyTable(CumulativeFrequencyTableFormat Format)
{
   CumulativeFrequencyTableEntry<T>[] _output = null;
   // get the frequency table as array for easier processing
   FrequencyTableEntry<T>[] _freqTable = 
       GetTableAsArray(FrequencyTableSortOrder.Value_Ascending);
   // temporary values
   double tempCumRelFreq = 0.0;
   int tempCumAbsFreq = 0;
   int i, k;
   switch (Format)
   {
      // each datapoint will be returned
      case CumulativeFrequencyTableFormat.EachDatapoint:
        // initialize the result
        _output = new CumulativeFrequencyTableEntry<T>[SampleSize];
        for (i = 0; i < _freqTable.Length; i++)
        {
          // update the cumulative frequency - relative and absolute
          tempCumAbsFreq += _freqTable[i].AbsoluteFreq;
          tempCumRelFreq += _freqTable[i].RelativeFreq;
          // fill the array
          for (k = tempCumAbsFreq - _freqTable[i].AbsoluteFreq;k < tempCumAbsFreq; k++)
          {
            _output[k] = new CumulativeFrequencyTableEntry<T>
			(_freqTable[i].Value, tempCumRelFreq, tempCumAbsFreq);
          }
        }
      break;
      // here each different entry will be returned once
      case CumulativeFrequencyTableFormat.EachDatapointOnce:
        // initialize the result
        _output = new CumulativeFrequencyTableEntry<T>[Length];
        for (i = 0; i < _freqTable.Length; i++)
        {
          // update the cumulative frequency - relative and absolute
          tempCumAbsFreq += _freqTable[i].AbsoluteFreq;
          tempCumRelFreq += _freqTable[i].RelativeFreq;
          // fill the array
          _output[i] = new CumulativeFrequencyTableEntry<T>
			(_freqTable[i].Value, tempCumRelFreq, tempCumAbsFreq);
        }
       break;
   }
   // done
   return _output;
}
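The difference between the two formats can be illustrated like this (hypothetical usage, assuming the class from this article):

```csharp
// Sketch only - assumes the article's FrequencyTable<T> is available.
var t = new FrequencyTable<int>(4);
foreach (int v in new[] { 1, 2, 2 })
    t.Add(v);

// one entry per distinct value:  (1, cum=1), (2, cum=3)
var once = t.GetCumulativeFrequencyTable(
    CumulativeFrequencyTableFormat.EachDatapointOnce);

// one entry per data point:      (1, cum=1), (2, cum=3), (2, cum=3)
var each = t.GetCumulativeFrequencyTable(
    CumulativeFrequencyTableFormat.EachDatapoint);
```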


Where Are My Data?

OK - you need an array of the added data? This is the way:

public T[] GetData(bool Pristine)
{
    T[] result = new T[SampleSize];
    // if the order is not important
    if (!Pristine)
    {
        CumulativeFrequencyTableEntry<T>[] cf = GetCumulativeFrequencyTable
	        (CumulativeFrequencyTableFormat.EachDatapoint);
        for (int i = 0; i < SampleSize; i++)
            result[i] = cf[i].Value;
    }
    else /* return the data in same order as entered */
    {
        List<int> l;
        foreach (T key in _positions.Keys)
        {
            _positions.TryGetValue(key, out l);
            foreach (int k in l)
            {
                result[k - 1] = key;
            }
        }
    }
    return result;
}
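A small sketch of the two retrieval modes (hypothetical usage, assuming the class from this article):

```csharp
// Sketch only - assumes the article's FrequencyTable<T> is available.
var t = new FrequencyTable<int>(4);
foreach (int v in new[] { 3, 1, 3, 2 })
    t.Add(v);

int[] grouped  = t.GetData(false); // value-ascending: { 1, 2, 3, 3 }
int[] pristine = t.GetData(true);  // input order:     { 3, 1, 3, 2 }
```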

What Else?

There are some public properties concerning descriptive statistics:

  • Mean
  • Median
  • Mode
  • Minimum
  • Maximum
  • VarianceSample
  • VariancePop (unbiased estimator)
  • StandardDevSample
  • StandardDevPop (unbiased estimator)
  • StandardError
  • Sum
  • SampleSize - the number of data points (read-only)
  • HighestFrequency - the highest frequency observed
  • SmallestFrequency - the smallest frequency observed
  • ScarcestValue - the value with the smallest frequency
  • Kurtosis
  • KurtosisExcess
  • Skewness

If the data are not numerical, all of the properties above return double.NaN.
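For a numerical table, these properties can be read directly (hypothetical usage, assuming the class from this article):

```csharp
// Sketch only - assumes the article's FrequencyTable<T> is available.
var t = new FrequencyTable<double>(8);
foreach (double v in new[] { 1.0, 2.0, 2.0, 4.0 })
    t.Add(v);

Console.WriteLine(t.Mean);    // (1 + 2 + 2 + 4) / 4 = 2.25
Console.WriteLine(t.Mode);    // 2 (the only value occurring twice)
Console.WriteLine(t.Minimum); // 1
Console.WriteLine(t.Maximum); // 4
```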

Miscellaneous

Here is a list of the remaining public properties and methods.

Properties

  • Length - The number of table entries (read only)
  • Tag - An object which can be set by the user (writable)
  • Description - The description of the table (writable)
  • P_Value (contains the p value computed by the Kolmogorov-Smirnov Test)

Methods

  • Add(T Value) and Add(T Test, TextAnalyzeMode mode)
  • Remove(T Value)
  • GetTableAsArray() and GetTableAsArray(FrequencyTableSortOrder order) (sorting is done by using the Quicksort-Algorithm)
  • GetEnumerator()
  • ContainsValue(T value)
  • GetCumulativeFrequencyTable(CumulativeFrequencyTableFormat Format)
  • GetData(bool Pristine) - Returns the data as an array (sorted or in input order)
  • GetRelativeFrequency(T value, out double relFreq)

The code is (I think) well documented, so you can use it to get a detailed insight into my solution. I am sure this solution is not perfect, but it is a good starting point.

For a better overview, I have added a compiled help file (see download at the top of this page).

History

  • Version 1.0 - 18 Jan '07
    • Initial release
  • Version 1.5 - 04 Feb '07
    • Minor bug fixes (highest frequency was not set correctly)
    • Added normality testing
    • Added descriptive statistics
  • 09 Feb '07
    • P_Value added, release number not changed
  • Version 2.0 - 26 Feb '07
    • GetData(bool Pristine) added