Click here to Skip to main content
15,867,835 members
Articles / Programming Languages / C#

A Generic Frequency Table in C# with Descriptive Statistics

Rate me:
Please Sign up or sign in to vote.
4.73/5 (15 votes)
26 Feb 2007CPOL4 min read 93K   2K   60   17
Frequencies, descriptive statistics and normality testing

What's New?

GetData: The new parameter Pristine indicates if you want to get data in the same order as entered. That might be of importance in case of time-dependant data. The new Add-method stores the position(s) of each value using a Dictionary<</span />T, List<</span />int</span />></span />></span />.

Introduction

The exploration of empirical data is a common task in various fields. Especially the computational analysis of such data is often cumbered by the nature (the type) of the data. So - while writing statistical routines by my own - I decided to develop a "Frequency Table Class" which accepts any possible type of data.

Requirements

My class has to satisfy requirements as follows:

  • Accept any type of data (especially multiple precision types)
  • Accept a given array
  • Provide methods for adding and removing single values
  • Automatically update the absolute frequency when a value is added/removed
  • A simple way to fetch mode, highest frequency ...
  • Provide a method to get the table as an array
  • Returned arrays must be sortable by frequency and value
  • Provide fields/properties to describe the table

Code

The backbone of this class is the FrequencyTableEntry<</span />T></span /> structure:

C#
//</span /> A generic structure storing the frequency information for each value</span />
public</span /> struct</span /> FrequencyTableEntry<T> where T : IComparable<T>
{
    //</span /> Constructor</span />
    //</span /> val: The value counted</span />
    //</span /> absFreq: The absolute frequency</span />
    //</span /> relFreq: The relative frequency</span />
    //</span /> percentage: The percentage</span />
    public</span /> FrequencyTableEntry(T val, int</span /> absFreq, double</span /> relFreq, double</span /> percentage)
    {
        Value = val;
        AbsoluteFreq = absFreq;
        RelativeFreq = relFreq;
        Percentage = percentage;
    }
    public</span /> T Value;
    public</span /> int</span /> AbsoluteFreq;
    public</span /> double</span /> RelativeFreq;
    public</span /> double</span /> Percentage;
}

The specified type T has to implement the IComparable-Interface (needed for the sorting routine). The class stores the data in a generic Dictionary: _entries = new</span /> Dictionary<</span />T,int</span />></span />(): the _entries.Keys-Collection contains the values to count, the _entries.Values-Collection contains the absolute frequency for this particular value:

C#
public</span /> FrequencyTable(int</span /> initialCapacity)
{
    _entries = new</span /> Dictionary<T, int>(initialCapacity);
    .
    .
}

To provide easy access to table entries, the implemented enumerator returns the structure above:

C#
public</span /> IEnumerator<FrequencyTableEntry<T>> GetEnumerator()
{
    //</span /> the structure to return</span />
    FrequencyTableEntry<T> _output;
    //</span /> the frequency</span />
    int</span /> int_f;
    //</span /> the "double-typed" frequency</span />
    double</span /> dbl_f;
    foreach</span /> (T key in</span /> _entries.Keys)
    {
        int_f = _entries[key];
        dbl_f = (double</span />)int_f;
        //</span /> fill the structure</span />
        _output = new</span /> FrequencyTableEntry<T>
        (key, int_f, dbl_f / _count, dbl_f / _count * 100</span />.0</span />);
        //</span /> yielding - cool thing that</span />
        yield return</span /> _output;
    }
}

The general Add(T value</span />) method looks like this:

C#
public</span /> void</span /> Add(T value)
{
    List<int> _tempPos;
    //</span /> if the Dictionary already contains value, then</span />
    //</span /> we have to update frequency and _count</span />
    if</span /> (_entries.ContainsKey(value))
    {
        //</span /> update the frequency</span />
        _entries[value]++;
        //</span /> update mode and highest frequency</span />
        if</span /> (_entries[value] > _high)
        {
            _high = _entries[value];
            _mode = value;
        }
        //</span /> add 1 to sample size</span />
        _count++;
        foreach</span /> (T key in</span /> _entries.Keys)
        {
            _relFrequencies[key] = (double</span />)_entries[key] / _count;
        }
        UpdateSumAndMean(value);
        //</span /> store the actual position of the entry in the dataset</span />
        _positions.TryGetValue(value, out</span /> _tempPos);
        //</span /> the position is equal to _count</span />
        _tempPos.Add(_count);
        //</span /> remove old entry</span />
        _positions.Remove(value);
        //</span /> store new entry</span />
        _positions.Add(value, _tempPos);
    }
    else</span /> /*</span /> if the dictionary does not contain value add a new entry */</span />
    {
        //</span /> if the highest frequency is still zero, set it to one</span />
        if</span /> (_high < 1</span />)
        {
            _high = 1</span />;
            _mode = value;
        }
        //</span /> add a new entry - frequency is one</span />
        _entries.Add(value, 1</span />);
        //</span /> add 1 to table length</span />
        _length++;
        //</span /> add 1 to sample size</span />
        _count++;
        //</span /> update relative frequencies</span />
        _relFrequencies.Add(value, 0</span />.0</span />);
        foreach</span /> (T key in</span /> _entries.Keys)
        {
            _relFrequencies[key] = (double</span />)_entries[key] / _count;
        }
        UpdateSumAndMean(value);
        //</span /> create a new entry and set position to _count</span />
        _tempPos = new</span /> List<int>();
        _tempPos.Add(_count);
        //</span /> store it</span />
        _positions.Add(value, _tempPos);
    }
}

To simplify the analysis of a given text, I have implemented a special constructor:

C#
//</span /> Constructor - the created instance analyzes the </span />
//</span /> frequency of characters in a given string</span />
//</span /> Text: String to analyze</span />
public</span /> FrequencyTable(T Text, TextAnalyzeMode mode)
{
    _positions = new</span /> Dictionary<T, List<int>>();
    //</span /> if T is not string -> Exception</span />
    if</span /> (!(Text is</span /> string</span />))
        throw</span /> new</span /> ArgumentException();
    //</span /> the table itself</span />
    _entries = new</span /> Dictionary<T, int>();
    _relFrequencies = new</span /> Hashtable();
    //</span /> number of entries in _entries</span />
    _length = 0</span />;
    //</span /> sample size</span />
    _count = 0</span />;
    //</span /> description of the table</span />
    _description = "</span />"</span />;
    //</span /> a user defined tag</span />
    _tag = 0</span />;
    //</span /> the highest frequency</span />
    _high = 0</span />;
    _dblSum = double</span />.NaN;
    _mean = double</span />.NaN;
    _alpha = double</span />.NaN;
    AnalyzeString(Text, mode);
    _p = double</span />.NaN;
}

The associated Add method:

C#
public</span /> void</span /> Add(T Text, TextAnalyzeMode mode)
{
    if</span /> (!(Text is</span /> string</span />))
        throw</span /> new</span /> ArgumentException();
    AnalyseString(Text, mode);
}

In my opinion, it is useful to provide different modes regarding literal analysis. These modes are provided by TextAnalyzeMode:

C#
public</span /> enum</span /> TextAnalyzeMode
{
    AllCharacters,
    NoNumerals,
    NoSpecialCharacters,
    LettersOnly,
    NumeralsOnly,
    SpecialCharactersOnly
}

The analysis itself is performed by AnalyzeString(T Text, TextAnalyzeMode mode):

C#
private</span /> void</span /> AnalyzeString(T Text, TextAnalyzeMode mode)
{
    //</span /> character strings</span />
    string</span /> str_specialChars = @"</span />"</span />"</span />!§$%&/()=?@€<>|µ,.;:-_#'*+~²³ "</span />;
    string</span /> str_Numbers = "</span />0123456789"</span />;
    //</span /> Adding the entries according to mode</span />
    switch</span /> (mode)
    {
        case</span /> TextAnalyzeMode.AllCharacters:
            foreach</span /> (char</span /> v in</span /> Text.ToString())
                this</span />.Add((T)Convert.ChangeType((object</span />)v, Text.Getype()));
            break</span />;
        case</span /> TextAnalyzeMode.LettersOnly:
            foreach</span /> (char</span /> v in</span /> Text.ToString())
            {
                if</span /> ((str_specialChars.IndexOf(v) == -1) & 
            	(str_Numbers.IndexOf(v) == -1))
                    this</span />.Add((T)Convert.ChangeType((object</span />)v, Text.GetType()));
            }
            break</span />;
        case</span /> TextAnalyzeMode.NoNumerals:
            foreach</span /> (char</span /> v in</span /> Text.ToString())
            {
                if</span /> (str_Numbers.IndexOf(v) == -1)
                        this</span />.Add((T)Convert.ChangeType((object</span />)v, Text.GetType()));
            }
            break</span />;
        case</span /> TextAnalyzeMode.NoSpecialCharacters:
            foreach</span /> (char</span /> v in</span /> Text.ToString())
            {
                if</span /> (str_specialChars.IndexOf(v) == -1)
                    this</span />.Add((T)Convert.ChangeType((object</span />)v, Text.GetType()));
            }
            break</span />;
        case</span /> TextAnalyzeMode.NumeralsOnly:
            foreach</span /> (char</span /> v in</span /> Text.ToString())
            {
                if</span /> (str_Numbers.IndexOf(v) != -1)
                    this</span />.Add((T)Convert.ChangeType((object</span />)v, Text.GetType()));
            }
            break</span />;
        case</span /> TextAnalyzeMode.SpecialCharactersOnly:
            foreach</span /> (char</span /> v in</span /> Text.ToString())
            {
                if</span /> (str_specialChars.IndexOf(v) != -1)
                    this</span />.Add((T)Convert.ChangeType((object</span />)v, Text.GetType()));
            }
            break</span />;
    }
}

Test for Normality

The question if given data are "Gaussian-distributed" is often raised. There are some robust and valid tests to answer this question. I have implemented the "good old" Kolmogorov-Smirnov test (KS-Test). Alternatively one can use the D'Agostino-Pearson test. There are two new properties concerning normality testing:

  • IsGaussian: Returns true</span /> if data are numerical and the computed p-value is greater than Alpha (see below)
  • Alpha: Defines the "significance level" for the KS-Test

The KS-Test method is shown below. The method returns true</span />, if the test is applicable. In case of non-numerical data, the method returns false</span />. The out-parameter p contains the p-value on exit. This value can be accessed by calling the P_Value property.

C#
private</span /> bool</span /> KS_Test(out</span /> double</span /> p)
{
   //</span /> D-statistic</span />
   double</span /> D = double</span />.NaN;
   CumulativeFrequencyTableEntry<T>[] empCDF = 
	GetCumulativeFrequencyTable(CumulativeFrequencyTableFormat.EachDatapointOnce);
   //</span /> store the test CDF</span />
   double</span /> testCDF;
   //</span /> array to store datapoints</span />
   double</span />[] data = new</span /> double</span />[empCDF.Length];
   FrequencyTableEntry<T>[] table = GetTableAsArray
			(FrequencyTableSortOrder.Value_Ascending);
   int</span /> i = 0</span />;
   //</span /> prevent exceptions if T is not numerical</span />
   try</span />
   {
      foreach</span /> (FrequencyTableEntry<T> entry in</span /> table)
      {
         data[i] = (double</span />)Convert.ChangeType(entry.Value, TypeCode.Double</span />);
         i++;
      }
   }
   catch</span />
   {
      p = double</span />.NaN;
      return</span /> false</span />;
   }
   //</span /> estimate the parameters of the expected Gaussian distribution</span />
   //</span /> first: compute the mean</span />
   double</span /> mean = Mean;
   //</span /> compute the bias-corrected variance</span />
   //</span /> as an estimator for the population variance (actually we need the</span />
   //</span /> square root)</span />
   double</span /> _sqrt_var = Math.Sqrt(VariancePop);
   //</span /> now we have to determine the greatest difference between the</span />
   //</span /> sample cumulative distribution function (empCDF) and</span />
   //</span /> the distribution function to test (testCDF)</span />
   double</span /> _sqrt2 = Math.Sqrt(2</span />.0</span />);
   double</span /> _erf;
   double</span /> max1 = 0</span />.0</span />;
   double</span /> max2 = 0</span />.0</span />;
   double</span /> _temp;
   for</span /> (i = 0</span />; i < empCDF.Length; i++)
   {
      //</span /> compute the expected distribution using the error function</span />
      _erf = Erf(((data[i] - mean) / _sqrt_var) / _sqrt2);
      testCDF = 0</span />.5</span /> * (1</span />.0</span /> + _erf);
      _temp = Math.Abs(empCDF[i].CumulativeRelativeFrequency - testCDF);
      if</span /> (_temp > max1)
        max1 = _temp;
      if</span /> (i > 0</span />)
        _temp = Math.Abs(empCDF[i - 1</span />].CumulativeRelativeFrequency - testCDF);
      else</span />
        _temp = testCDF;
      if</span /> (_temp > max2)
        max2 = _temp;
   }
   //</span /> the statistics to use is</span />
   //</span /> max{diff1,diff2}</span />
   D = max1 > max2 ? max1 : max2;
   //</span /> now compute the p-value using a z-transformation</span />
   if</span /> (!Double.IsNaN(D))
   {
      double</span /> z = Math.Sqrt((double</span />)SampleSize) * D;
      p = KS_Prob_Smirnov(z);
   }
   else</span />
      p = double</span />.NaN;
   return</span /> true</span />;
}

To compute the "test distribution" (the Gaussian CDF in this case) we need the so called error function. I have used the Erf-implementation written by Miroslav Stampar (see Special Function(s) for C#), which is a translation of the Cephes Math Library by Stephen L. Moshier.

Descriptive Statistics

I think it is useful to implement some fundamental statistical properties inside the class.

Cumulative Frequencies

First of all, it is needed to implement a method which returns the empirical distribution function (the cumulative density function) of the given data:

C#
public</span /> CumulativeFrequencyTableEntry<T>[] 
	GetCumulativeFrequencyTable(CumulativeFrequencyTableFormat Format)
{
   CumulativeFrequencyTableEntry<T>[] _output = null</span />;
   //</span /> get the frequency table as array for easier processing</span />
   FrequencyTableEntry<T>[] _freqTable = 
       GetTableAsArray(FrequencyTableSortOrder.Value_Ascending);
   //</span /> temporary values</span />
   double</span /> tempCumRelFreq = 0</span />.0</span />;
   int</span /> tempCumAbsFreq = 0</span />;
   int</span /> i, k;
   switch</span /> (Format)
   {
      //</span /> each datapoint will returned</span />
      case</span /> CumulativeFrequencyTableFormat.EachDatapoint:
        //</span /> initialize the result</span />
        _output = new</span /> CumulativeFrequencyTableEntry<T>[SampleSize];
        for</span /> (i = 0</span />; i < _freqTable.Length; i++)
        {
          //</span /> update the cumulative frequency - relative and absolute</span />
          tempCumAbsFreq += _freqTable[i].AbsoluteFreq;
          tempCumRelFreq += _freqTable[i].RelativeFreq;
          //</span /> fill the array</span />
          for</span /> (k = tempCumAbsFreq - _freqTable[i].AbsoluteFreq;k < tempCumAbsFreq; k++)
          {
            _output[k] = new</span /> CumulativeFrequencyTableEntry<T>
			(_freqTable[i].Value, tempCumRelFreq, tempCumAbsFreq);
          }
        }
      break</span />;
      //</span /> here each different entry will be returned once</span />
      case</span /> CumulativeFrequencyTableFormat.EachDatapointOnce:
        //</span /> initialize the result</span />
        _output = new</span /> CumulativeFrequencyTableEntry<T>[Length];
        for</span /> (i = 0</span />; i < _freqTable.Length; i++)
        {
          //</span /> update the cumulative frequency - relative and absolute</span />
          tempCumAbsFreq += _freqTable[i].AbsoluteFreq;
          tempCumRelFreq += _freqTable[i].RelativeFreq;
          //</span /> fill the array</span />
          _output[i] = new</span /> CumulativeFrequencyTableEntry<T>
			(_freqTable[i].Value, tempCumRelFreq, tempCumAbsFreq);
        }
       break</span />;
   }
   //</span /> done</span />
   return</span /> _output;
}

(Sorry for that strange formatting - this edit tool...)

Where Are My Data??

Ok - you need an array of the added data? That is the way:

C#
public</span /> T[] GetData(bool</span /> Pristine)
{
    T[] result = new</span /> T[SampleSize];
    //</span /> if the order is not important</span />
    if</span /> (!Pristine)
    {
        CumulativeFrequencyTableEntry<T>[] cf = GetCumulativeFrequencyTable
	        (CumulativeFrequencyTableFormat.EachDatapoint);
        for</span /> (int</span /> i = 0</span />; i < SampleSize; i++)
            result[i] = cf[i].Value;
    }
    else</span /> /*</span /> return the data in same order as entered */</span />
    {
        List<int> l;
        foreach</span /> (T key in</span /> _positions.Keys)
        {
            _positions.TryGetValue(key, out</span /> l);
            foreach</span /> (int</span /> k in</span /> l)
            {
                result[k - 1</span />] = key;
            }
        }
    }
    return</span /> result;
}

What Else?

There are some public</span /> properties concerning descriptive statistics:

  • Mean
  • Median
  • Mode
  • Minimum
  • Maximum
  • VarianceSample
  • VariancePop (unbiased estimator)
  • StandardDevSample
  • StandardDevPop (unbiased estimator)
  • StandardError
  • Sum
  • SampleSize - the number of data (read only)
  • HighestFrequency - the highest frequency observed
  • SmallestFrequency - the smallest frequency
  • ScarcestValue - the scarcest value
  • Kurtosis
  • KurtosisExcess
  • Skewness

If the data is not numerical, all properties above will return double</span />.NaN.

Miscellaneous

Here is a list of the remaining public</span /> properties and methods.

Properties

  • Length - The number of table entries (read only)
  • Tag - An object which can be set by the user (writable)
  • Description - The description of the table (writable)
  • P_Value (contains the p value computed by the Kolmogorov-Smirnov Test)

Methods

  • Add(T Value) and Add(T Test, TextAnalyzeMode mode)
  • Remove(T Value)
  • GetTableAsArray() and GetTableAsArray(FrequencyTableSortOrder order) (sorting is done by using the Quicksort-Algorithm)
  • GetEnumerator()
  • ContainsValue(T value</span />)
  • GetCumulativeFrequencyTable(CumulativeFrequencyTableFormat Format)
  • GetData(bool</span /> Pristine) - Returns the data as an array (sorted or in input order)
  • GetRelativeFrequency(T value</span />, out double</span /> relFreq)

The code is (I think so) well documented so you can use it to get a detailed insight into my solution. I am sure that this solution is not perfect, but it is a good starting point.

For a better overview, I have added a compiled help file (see download at the top of this page).

History

  • Version 1.0 - 18 Jan '07
    • Initial release
  • Version 1.5 - 04 Feb '07
    • Minor bug fixes (highest frequency was not set correctly)
    • Added normality testing
    • Added descriptive statistics
  • 09 Feb '07
    • P_Value added, release number not changed
  • Version 2.0 26 Feb '07
    • GetData(bool</span /> Pristine) added

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Web Developer
Germany Germany
Anesthesiologist from Germany
- first contact: 1985 - ATARI 800 XE (there was a great assembler: ATMAS II)
- special interests: my son, number theory, statistics, linear algebra, medicine (of course)

Comments and Discussions

 
QuestionProblem downloading source Pin
CJM73MS192810-Jul-13 1:41
CJM73MS192810-Jul-13 1:41 
Hi, I can download and open everything but the source code. I get "cannot open file as archive" when using 7-zip. Can anyone help?
GeneralTime Delay of two Frequencies Pin
mangotj9-Jun-10 17:30
mangotj9-Jun-10 17:30 
GeneralThis is great and easy to use -- Thanks Pin
Michael Yourshaw25-Jul-08 17:48
Michael Yourshaw25-Jul-08 17:48 
GeneralDictionary and performance Pin
John65467227-Apr-08 1:49
John65467227-Apr-08 1:49 
QuestionHow to add to a frequency > 1 Pin
theperm30-Aug-07 7:51
theperm30-Aug-07 7:51 
GeneralDo, ut des Pin
visalia12-Apr-07 5:51
visalia12-Apr-07 5:51 
GeneralRe: Do, ut des Pin
V. Thieme16-Jan-09 23:38
V. Thieme16-Jan-09 23:38 
GeneralSuggestion Pin
TemplMetaProg26-Feb-07 23:51
TemplMetaProg26-Feb-07 23:51 
GeneralRe: Suggestion Pin
V. Thieme27-Feb-07 3:25
V. Thieme27-Feb-07 3:25 
GeneralExcellent class with additional functions.... [modified] Pin
scslmd31-Jan-07 11:56
scslmd31-Jan-07 11:56 
GeneralRe: Excellent class with additional functions.... Pin
V. Thieme1-Feb-07 7:11
V. Thieme1-Feb-07 7:11 
GeneralThis is a very nice post! Pin
Grav-Vt27-Jan-07 4:57
Grav-Vt27-Jan-07 4:57 
GeneralRe: This is a very nice post! Pin
V. Thieme27-Jan-07 6:35
V. Thieme27-Jan-07 6:35 
GeneralOutstanding Pin
Are Jay25-Jan-07 6:37
Are Jay25-Jan-07 6:37 
GeneralRe: Outstanding Pin
V. Thieme25-Jan-07 9:10
V. Thieme25-Jan-07 9:10 
GeneralAwesome Pin
Steve Hansen18-Jan-07 21:04
Steve Hansen18-Jan-07 21:04 
GeneralRe: Awesome Pin
V. Thieme18-Jan-07 21:16
V. Thieme18-Jan-07 21:16 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.