Click here to Skip to main content
15,887,683 members
Articles / Programming Languages / C#

A Generic Frequency Table in C# with Descriptive Statistics

Rate me:
Please Sign up or sign in to vote.
4.73/5 (15 votes)
26 Feb 2007CPOL4 min read 93.2K   2K   60   17
Frequencies, descriptive statistics and normality testing

What's New?

GetData: The new parameter Pristine indicates if you want to get data in the same order as entered. That might be of importance in case of time-dependant data. The new Add-method stores the position(s) of each value using a Dictionary<</span />T, List<</span />int</span />></span />></span />.

Introduction

The exploration of empirical data is a common task in various fields. Especially the computational analysis of such data is often cumbered by the nature (the type) of the data. So - while writing statistical routines by my own - I decided to develop a "Frequency Table Class" which accepts any possible type of data.

Requirements

My class has to satisfy requirements as follows:

  • Accept any type of data (especially multiple precision types)
  • Accept a given array
  • Provide methods for adding and removing single values
  • Automatically update the absolute frequency when a value is added/removed
  • A simple way to fetch mode, highest frequency ...
  • Provide a method to get the table as an array
  • Returned arrays must be sortable by frequency and value
  • Provide fields/properties to describe the table

Code

The backbone of this class is the FrequencyTableEntry<</span />T></span /> structure:

C#
//</span /> A generic structure storing the frequency information for each value</span />
public</span /> struct</span /> FrequencyTableEntry<T> where T : IComparable<T>
{
    //</span /> Constructor</span />
    //</span /> val: The value counted</span />
    //</span /> absFreq: The absolute frequency</span />
    //</span /> relFreq: The relative frequency</span />
    //</span /> percentage: The percentage</span />
    public</span /> FrequencyTableEntry(T val, int</span /> absFreq, double</span /> relFreq, double</span /> percentage)
    {
        Value = val;
        AbsoluteFreq = absFreq;
        RelativeFreq = relFreq;
        Percentage = percentage;
    }
    public</span /> T Value;
    public</span /> int</span /> AbsoluteFreq;
    public</span /> double</span /> RelativeFreq;
    public</span /> double</span /> Percentage;
}

The specified type T has to implement the IComparable-Interface (needed for the sorting routine). The class stores the data in a generic Dictionary: _entries = new</span /> Dictionary<</span />T,int</span />></span />(): the _entries.Keys-Collection contains the values to count, the _entries.Values-Collection contains the absolute frequency for this particular value:

C#
public</span /> FrequencyTable(int</span /> initialCapacity)
{
    _entries = new</span /> Dictionary<T, int>(initialCapacity);
    .
    .
}

To provide easy access to table entries, the implemented enumerator returns the structure above:

C#
public</span /> IEnumerator<FrequencyTableEntry<T>> GetEnumerator()
{
    //</span /> the structure to return</span />
    FrequencyTableEntry<T> _output;
    //</span /> the frequency</span />
    int</span /> int_f;
    //</span /> the "double-typed" frequency</span />
    double</span /> dbl_f;
    foreach</span /> (T key in</span /> _entries.Keys)
    {
        int_f = _entries[key];
        dbl_f = (double</span />)int_f;
        //</span /> fill the structure</span />
        _output = new</span /> FrequencyTableEntry<T>
        (key, int_f, dbl_f / _count, dbl_f / _count * 100</span />.0</span />);
        //</span /> yielding - cool thing that</span />
        yield return</span /> _output;
    }
}

The general Add(T value</span />) method looks like this:

C#
public</span /> void</span /> Add(T value)
{
    List<int> _tempPos;
    //</span /> if the Dictionary already contains value, then</span />
    //</span /> we have to update frequency and _count</span />
    if</span /> (_entries.ContainsKey(value))
    {
        //</span /> update the frequency</span />
        _entries[value]++;
        //</span /> update mode and highest frequency</span />
        if</span /> (_entries[value] > _high)
        {
            _high = _entries[value];
            _mode = value;
        }
        //</span /> add 1 to sample size</span />
        _count++;
        foreach</span /> (T key in</span /> _entries.Keys)
        {
            _relFrequencies[key] = (double</span />)_entries[key] / _count;
        }
        UpdateSumAndMean(value);
        //</span /> store the actual position of the entry in the dataset</span />
        _positions.TryGetValue(value, out</span /> _tempPos);
        //</span /> the position is equal to _count</span />
        _tempPos.Add(_count);
        //</span /> remove old entry</span />
        _positions.Remove(value);
        //</span /> store new entry</span />
        _positions.Add(value, _tempPos);
    }
    else</span /> /*</span /> if the dictionary does not contain value add a new entry */</span />
    {
        //</span /> if the highest frequency is still zero, set it to one</span />
        if</span /> (_high < 1</span />)
        {
            _high = 1</span />;
            _mode = value;
        }
        //</span /> add a new entry - frequency is one</span />
        _entries.Add(value, 1</span />);
        //</span /> add 1 to table length</span />
        _length++;
        //</span /> add 1 to sample size</span />
        _count++;
        //</span /> update relative frequencies</span />
        _relFrequencies.Add(value, 0</span />.0</span />);
        foreach</span /> (T key in</span /> _entries.Keys)
        {
            _relFrequencies[key] = (double</span />)_entries[key] / _count;
        }
        UpdateSumAndMean(value);
        //</span /> create a new entry and set position to _count</span />
        _tempPos = new</span /> List<int>();
        _tempPos.Add(_count);
        //</span /> store it</span />
        _positions.Add(value, _tempPos);
    }
}

To simplify the analysis of a given text, I have implemented a special constructor:

C#
//</span /> Constructor - the created instance analyzes the </span />
//</span /> frequency of characters in a given string</span />
//</span /> Text: String to analyze</span />
public</span /> FrequencyTable(T Text, TextAnalyzeMode mode)
{
    _positions = new</span /> Dictionary<T, List<int>>();
    //</span /> if T is not string -> Exception</span />
    if</span /> (!(Text is</span /> string</span />))
        throw</span /> new</span /> ArgumentException();
    //</span /> the table itself</span />
    _entries = new</span /> Dictionary<T, int>();
    _relFrequencies = new</span /> Hashtable();
    //</span /> number of entries in _entries</span />
    _length = 0</span />;
    //</span /> sample size</span />
    _count = 0</span />;
    //</span /> description of the table</span />
    _description = "</span />"</span />;
    //</span /> a user defined tag</span />
    _tag = 0</span />;
    //</span /> the highest frequency</span />
    _high = 0</span />;
    _dblSum = double</span />.NaN;
    _mean = double</span />.NaN;
    _alpha = double</span />.NaN;
    AnalyzeString(Text, mode);
    _p = double</span />.NaN;
}

The associated Add method:

C#
public</span /> void</span /> Add(T Text, TextAnalyzeMode mode)
{
    if</span /> (!(Text is</span /> string</span />))
        throw</span /> new</span /> ArgumentException();
    AnalyseString(Text, mode);
}

In my opinion, it is useful to provide different modes regarding literal analysis. These modes are provided by TextAnalyzeMode:

C#
public</span /> enum</span /> TextAnalyzeMode
{
    AllCharacters,
    NoNumerals,
    NoSpecialCharacters,
    LettersOnly,
    NumeralsOnly,
    SpecialCharactersOnly
}

The analysis itself is performed by AnalyzeString(T Text, TextAnalyzeMode mode):

C#
private</span /> void</span /> AnalyzeString(T Text, TextAnalyzeMode mode)
{
    //</span /> character strings</span />
    string</span /> str_specialChars = @"</span />"</span />"</span />!§$%&/()=?@€<>|µ,.;:-_#'*+~²³ "</span />;
    string</span /> str_Numbers = "</span />0123456789"</span />;
    //</span /> Adding the entries according to mode</span />
    switch</span /> (mode)
    {
        case</span /> TextAnalyzeMode.AllCharacters:
            foreach</span /> (char</span /> v in</span /> Text.ToString())
                this</span />.Add((T)Convert.ChangeType((object</span />)v, Text.Getype()));
            break</span />;
        case</span /> TextAnalyzeMode.LettersOnly:
            foreach</span /> (char</span /> v in</span /> Text.ToString())
            {
                if</span /> ((str_specialChars.IndexOf(v) == -1) & 
            	(str_Numbers.IndexOf(v) == -1))
                    this</span />.Add((T)Convert.ChangeType((object</span />)v, Text.GetType()));
            }
            break</span />;
        case</span /> TextAnalyzeMode.NoNumerals:
            foreach</span /> (char</span /> v in</span /> Text.ToString())
            {
                if</span /> (str_Numbers.IndexOf(v) == -1)
                        this</span />.Add((T)Convert.ChangeType((object</span />)v, Text.GetType()));
            }
            break</span />;
        case</span /> TextAnalyzeMode.NoSpecialCharacters:
            foreach</span /> (char</span /> v in</span /> Text.ToString())
            {
                if</span /> (str_specialChars.IndexOf(v) == -1)
                    this</span />.Add((T)Convert.ChangeType((object</span />)v, Text.GetType()));
            }
            break</span />;
        case</span /> TextAnalyzeMode.NumeralsOnly:
            foreach</span /> (char</span /> v in</span /> Text.ToString())
            {
                if</span /> (str_Numbers.IndexOf(v) != -1)
                    this</span />.Add((T)Convert.ChangeType((object</span />)v, Text.GetType()));
            }
            break</span />;
        case</span /> TextAnalyzeMode.SpecialCharactersOnly:
            foreach</span /> (char</span /> v in</span /> Text.ToString())
            {
                if</span /> (str_specialChars.IndexOf(v) != -1)
                    this</span />.Add((T)Convert.ChangeType((object</span />)v, Text.GetType()));
            }
            break</span />;
    }
}

Test for Normality

The question if given data are "Gaussian-distributed" is often raised. There are some robust and valid tests to answer this question. I have implemented the "good old" Kolmogorov-Smirnov test (KS-Test). Alternatively one can use the D'Agostino-Pearson test. There are two new properties concerning normality testing:

  • IsGaussian: Returns true</span /> if data are numerical and the computed p-value is greater than Alpha (see below)
  • Alpha: Defines the "significance level" for the KS-Test

The KS-Test method is shown below. The method returns true</span />, if the test is applicable. In case of non-numerical data, the method returns false</span />. The out-parameter p contains the p-value on exit. This value can be accessed by calling the P_Value property.

C#
private</span /> bool</span /> KS_Test(out</span /> double</span /> p)
{
   //</span /> D-statistic</span />
   double</span /> D = double</span />.NaN;
   CumulativeFrequencyTableEntry<T>[] empCDF = 
	GetCumulativeFrequencyTable(CumulativeFrequencyTableFormat.EachDatapointOnce);
   //</span /> store the test CDF</span />
   double</span /> testCDF;
   //</span /> array to store datapoints</span />
   double</span />[] data = new</span /> double</span />[empCDF.Length];
   FrequencyTableEntry<T>[] table = GetTableAsArray
			(FrequencyTableSortOrder.Value_Ascending);
   int</span /> i = 0</span />;
   //</span /> prevent exceptions if T is not numerical</span />
   try</span />
   {
      foreach</span /> (FrequencyTableEntry<T> entry in</span /> table)
      {
         data[i] = (double</span />)Convert.ChangeType(entry.Value, TypeCode.Double</span />);
         i++;
      }
   }
   catch</span />
   {
      p = double</span />.NaN;
      return</span /> false</span />;
   }
   //</span /> estimate the parameters of the expected Gaussian distribution</span />
   //</span /> first: compute the mean</span />
   double</span /> mean = Mean;
   //</span /> compute the bias-corrected variance</span />
   //</span /> as an estimator for the population variance (actually we need the</span />
   //</span /> square root)</span />
   double</span /> _sqrt_var = Math.Sqrt(VariancePop);
   //</span /> now we have to determine the greatest difference between the</span />
   //</span /> sample cumulative distribution function (empCDF) and</span />
   //</span /> the distribution function to test (testCDF)</span />
   double</span /> _sqrt2 = Math.Sqrt(2</span />.0</span />);
   double</span /> _erf;
   double</span /> max1 = 0</span />.0</span />;
   double</span /> max2 = 0</span />.0</span />;
   double</span /> _temp;
   for</span /> (i = 0</span />; i < empCDF.Length; i++)
   {
      //</span /> compute the expected distribution using the error function</span />
      _erf = Erf(((data[i] - mean) / _sqrt_var) / _sqrt2);
      testCDF = 0</span />.5</span /> * (1</span />.0</span /> + _erf);
      _temp = Math.Abs(empCDF[i].CumulativeRelativeFrequency - testCDF);
      if</span /> (_temp > max1)
        max1 = _temp;
      if</span /> (i > 0</span />)
        _temp = Math.Abs(empCDF[i - 1</span />].CumulativeRelativeFrequency - testCDF);
      else</span />
        _temp = testCDF;
      if</span /> (_temp > max2)
        max2 = _temp;
   }
   //</span /> the statistics to use is</span />
   //</span /> max{diff1,diff2}</span />
   D = max1 > max2 ? max1 : max2;
   //</span /> now compute the p-value using a z-transformation</span />
   if</span /> (!Double.IsNaN(D))
   {
      double</span /> z = Math.Sqrt((double</span />)SampleSize) * D;
      p = KS_Prob_Smirnov(z);
   }
   else</span />
      p = double</span />.NaN;
   return</span /> true</span />;
}

To compute the "test distribution" (the Gaussian CDF in this case) we need the so called error function. I have used the Erf-implementation written by Miroslav Stampar (see Special Function(s) for C#), which is a translation of the Cephes Math Library by Stephen L. Moshier.

Descriptive Statistics

I think it is useful to implement some fundamental statistical properties inside the class.

Cumulative Frequencies

First of all, it is needed to implement a method which returns the empirical distribution function (the cumulative density function) of the given data:

C#
public</span /> CumulativeFrequencyTableEntry<T>[] 
	GetCumulativeFrequencyTable(CumulativeFrequencyTableFormat Format)
{
   CumulativeFrequencyTableEntry<T>[] _output = null</span />;
   //</span /> get the frequency table as array for easier processing</span />
   FrequencyTableEntry<T>[] _freqTable = 
       GetTableAsArray(FrequencyTableSortOrder.Value_Ascending);
   //</span /> temporary values</span />
   double</span /> tempCumRelFreq = 0</span />.0</span />;
   int</span /> tempCumAbsFreq = 0</span />;
   int</span /> i, k;
   switch</span /> (Format)
   {
      //</span /> each datapoint will returned</span />
      case</span /> CumulativeFrequencyTableFormat.EachDatapoint:
        //</span /> initialize the result</span />
        _output = new</span /> CumulativeFrequencyTableEntry<T>[SampleSize];
        for</span /> (i = 0</span />; i < _freqTable.Length; i++)
        {
          //</span /> update the cumulative frequency - relative and absolute</span />
          tempCumAbsFreq += _freqTable[i].AbsoluteFreq;
          tempCumRelFreq += _freqTable[i].RelativeFreq;
          //</span /> fill the array</span />
          for</span /> (k = tempCumAbsFreq - _freqTable[i].AbsoluteFreq;k < tempCumAbsFreq; k++)
          {
            _output[k] = new</span /> CumulativeFrequencyTableEntry<T>
			(_freqTable[i].Value, tempCumRelFreq, tempCumAbsFreq);
          }
        }
      break</span />;
      //</span /> here each different entry will be returned once</span />
      case</span /> CumulativeFrequencyTableFormat.EachDatapointOnce:
        //</span /> initialize the result</span />
        _output = new</span /> CumulativeFrequencyTableEntry<T>[Length];
        for</span /> (i = 0</span />; i < _freqTable.Length; i++)
        {
          //</span /> update the cumulative frequency - relative and absolute</span />
          tempCumAbsFreq += _freqTable[i].AbsoluteFreq;
          tempCumRelFreq += _freqTable[i].RelativeFreq;
          //</span /> fill the array</span />
          _output[i] = new</span /> CumulativeFrequencyTableEntry<T>
			(_freqTable[i].Value, tempCumRelFreq, tempCumAbsFreq);
        }
       break</span />;
   }
   //</span /> done</span />
   return</span /> _output;
}

(Sorry for that strange formatting - this edit tool...)

Where Are My Data??

Ok - you need an array of the added data? That is the way:

C#
public</span /> T[] GetData(bool</span /> Pristine)
{
    T[] result = new</span /> T[SampleSize];
    //</span /> if the order is not important</span />
    if</span /> (!Pristine)
    {
        CumulativeFrequencyTableEntry<T>[] cf = GetCumulativeFrequencyTable
	        (CumulativeFrequencyTableFormat.EachDatapoint);
        for</span /> (int</span /> i = 0</span />; i < SampleSize; i++)
            result[i] = cf[i].Value;
    }
    else</span /> /*</span /> return the data in same order as entered */</span />
    {
        List<int> l;
        foreach</span /> (T key in</span /> _positions.Keys)
        {
            _positions.TryGetValue(key, out</span /> l);
            foreach</span /> (int</span /> k in</span /> l)
            {
                result[k - 1</span />] = key;
            }
        }
    }
    return</span /> result;
}

What Else?

There are some public</span /> properties concerning descriptive statistics:

  • Mean
  • Median
  • Mode
  • Minimum
  • Maximum
  • VarianceSample
  • VariancePop (unbiased estimator)
  • StandardDevSample
  • StandardDevPop (unbiased estimator)
  • StandardError
  • Sum
  • SampleSize - the number of data (read only)
  • HighestFrequency - the highest frequency observed
  • SmallestFrequency - the smallest frequency
  • ScarcestValue - the scarcest value
  • Kurtosis
  • KurtosisExcess
  • Skewness

If the data is not numerical, all properties above will return double</span />.NaN.

Miscellaneous

Here is a list of the remaining public</span /> properties and methods.

Properties

  • Length - The number of table entries (read only)
  • Tag - An object which can be set by the user (writable)
  • Description - The description of the table (writable)
  • P_Value (contains the p value computed by the Kolmogorov-Smirnov Test)

Methods

  • Add(T Value) and Add(T Test, TextAnalyzeMode mode)
  • Remove(T Value)
  • GetTableAsArray() and GetTableAsArray(FrequencyTableSortOrder order) (sorting is done by using the Quicksort-Algorithm)
  • GetEnumerator()
  • ContainsValue(T value</span />)
  • GetCumulativeFrequencyTable(CumulativeFrequencyTableFormat Format)
  • GetData(bool</span /> Pristine) - Returns the data as an array (sorted or in input order)
  • GetRelativeFrequency(T value</span />, out double</span /> relFreq)

The code is (I think so) well documented so you can use it to get a detailed insight into my solution. I am sure that this solution is not perfect, but it is a good starting point.

For a better overview, I have added a compiled help file (see download at the top of this page).

History

  • Version 1.0 - 18 Jan '07
    • Initial release
  • Version 1.5 - 04 Feb '07
    • Minor bug fixes (highest frequency was not set correctly)
    • Added normality testing
    • Added descriptive statistics
  • 09 Feb '07
    • P_Value added, release number not changed
  • Version 2.0 26 Feb '07
    • GetData(bool</span /> Pristine) added

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Web Developer
Germany Germany
Anesthesiologist from Germany
- first contact: 1985 - ATARI 800 XE (there was a great assembler: ATMAS II)
- special interests: my son, number theory, statistics, linear algebra, medicine (of course)

Comments and Discussions

 
QuestionProblem downloading source Pin
CJM73MS192810-Jul-13 1:41
CJM73MS192810-Jul-13 1:41 
GeneralTime Delay of two Frequencies Pin
mangotj9-Jun-10 17:30
mangotj9-Jun-10 17:30 
GeneralThis is great and easy to use -- Thanks Pin
Michael Yourshaw25-Jul-08 17:48
Michael Yourshaw25-Jul-08 17:48 
GeneralDictionary and performance Pin
John65467227-Apr-08 1:49
John65467227-Apr-08 1:49 
QuestionHow to add to a frequency > 1 Pin
theperm30-Aug-07 7:51
theperm30-Aug-07 7:51 
GeneralDo, ut des Pin
visalia12-Apr-07 5:51
visalia12-Apr-07 5:51 
GeneralRe: Do, ut des Pin
V. Thieme16-Jan-09 23:38
V. Thieme16-Jan-09 23:38 
GeneralSuggestion Pin
TemplMetaProg26-Feb-07 23:51
TemplMetaProg26-Feb-07 23:51 
GeneralRe: Suggestion Pin
V. Thieme27-Feb-07 3:25
V. Thieme27-Feb-07 3:25 
GeneralExcellent class with additional functions.... [modified] Pin
scslmd31-Jan-07 11:56
scslmd31-Jan-07 11:56 
I wrote something similar but your class is more complete (distribution, percentages, etc). I am trying to write something one step beyond but I'm stuck...

Currently your class returns the frequency distribution, but can you modify this class to also report the location of the individual nodes instead of just a sum of the nodes? That is, it can parse 5 distinct values and the distribution of the frequency. However, can you modify and list each node for the distinct values? Example your class can show this:

node # freq
1 1 0.05
2 5 0.25
3 7 0.35
4 4 0.20
5 3 0.15
-- ----
20 1.00

But can you modify it so it will list the location of each of the nodes?

node # location freq
1 1 15 0.05
2 5 2 5 9 10 12 0.25
3 7 1 4 8 11 16 18 19 0.35
4 4 3 6 17 20 0.20
5 3 7 13 14 0.15

I tried defining something like Dictionary<string, nodeStore> where nodeStore is a class containing count, location, etc. but I can't seem to make it work. Typing off the top of my head here:

<code>
class nodeStore
{
int count;
ArrayList node;
public nodeStore() { count = 0; node = new Arraylist() };
public void Add (int position) { node.Add(position); count++; }
public ArrayList Get () { return node };
}

static public List<string> getFrequency(List<string> inputList) {
Dictionary<string, nodeStore> Frequency = new Dictionary<string, nodeStore>();
nodeStore node;
for (int i = 0; i < inputlist[i]; i++) {
if (!Frequency.ContainsKey(inputList[i])) {
node = new nodeStore();
Frequency.Add(inputList[i], node);
} else {
???Frequency(inputList[i])++; // how do you call node.Add?
???Frequency.Add(inputList[i],node.Add(i)); //this doesn't work either...
}
.
.
.
}
</code>


-- modified at 18:08 Wednesday 31st January, 2007
GeneralRe: Excellent class with additional functions.... Pin
V. Thieme1-Feb-07 7:11
V. Thieme1-Feb-07 7:11 
GeneralThis is a very nice post! Pin
Grav-Vt27-Jan-07 4:57
Grav-Vt27-Jan-07 4:57 
GeneralRe: This is a very nice post! Pin
V. Thieme27-Jan-07 6:35
V. Thieme27-Jan-07 6:35 
GeneralOutstanding Pin
Are Jay25-Jan-07 6:37
Are Jay25-Jan-07 6:37 
GeneralRe: Outstanding Pin
V. Thieme25-Jan-07 9:10
V. Thieme25-Jan-07 9:10 
GeneralAwesome Pin
Steve Hansen18-Jan-07 21:04
Steve Hansen18-Jan-07 21:04 
GeneralRe: Awesome Pin
V. Thieme18-Jan-07 21:16
V. Thieme18-Jan-07 21:16 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.