
Linear correlation and statistical functions

saharkiz, 9 Jan 2010, CPOL
Linear correlation between variables.

Introduction

One of the most widely used measures of association between ordinal variables is the linear correlation coefficient. This article gives a brief introduction to it, with a sample application. Given business data on profit and sales, we compute summary descriptive statistics for each attribute and the correlation between the two.

Background

The analysis of data inevitably involves some trafficking with the field of statistics, that gray area which is not quite a branch of mathematics - and just as surely not quite a branch of science.

Data is a collection of objects and their attributes. An attribute is a property or characteristic of an object. So, a collection of attributes describes an object. There are four types of attributes:

  1. Nominal
  2. Ordinal
  3. Interval
  4. Ratio

Qualitative data includes the following:

  • Nominal (category): the existence of choice without order
  • Ordinal (order, index): has a sequence, or order of choice

Interval and Ratio are Quantitative data:

  • Interval: temperature in Fahrenheit or Celsius, calendar date
  • Ratio: Temp in Kelvin, length, time

Nominal/Ordinal variable example

Gender is a nominal variable with values such as 1=male and 2=female. The labels can be changed, or their order reversed, for example 5=female and 225=male. Both encodings are valid as long as they are used consistently. For ordinal variables, however, ordering matters: the labels follow the rank. In a priority ranking, 1 can mean best, so a higher value has lower importance. A satisfaction scale may run the other way, for example 1=very dissatisfied and 100=very satisfied.

In summary, the value of an ordinal attribute provides enough information to order objects. The operations that can be performed on ordinal variables include the median, percentile, rank, correlation, run tests, sign tests ...
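As an illustration of the rank operation listed above, here is a minimal sketch, not part of the original sample application. The class `OrdinalSketch` and the method `AverageRanks` are hypothetical names, and giving tied values the average of their ranks is one common convention that is assumed here.

```csharp
using System;
using System.Linq;

public static class OrdinalSketch
{
    // Hypothetical helper: returns the 1-based rank of each value,
    // giving tied values the average of the ranks they span (an assumed convention).
    public static double[] AverageRanks(double[] values)
    {
        int n = values.Length;
        // Indices of the values, ordered by ascending value.
        int[] order = Enumerable.Range(0, n).OrderBy(i => values[i]).ToArray();
        double[] ranks = new double[n];
        int start = 0;
        while (start < n)
        {
            int end = start;
            // Extend the run while the next sorted value ties with the current one.
            while (end + 1 < n && values[order[end + 1]] == values[order[start]])
                end++;
            // Sorted positions start..end (0-based) share one averaged 1-based rank.
            double rank = (start + end) / 2.0 + 1.0;
            for (int k = start; k <= end; k++)
                ranks[order[k]] = rank;
            start = end + 1;
        }
        return ranks;
    }

    public static void Main()
    {
        double[] satisfaction = { 40, 10, 40, 90 };
        Console.WriteLine(string.Join(", ", AverageRanks(satisfaction)));
    }
}
```

For the scores 40, 10, 40, 90 the ranks are 2.5, 1, 2.5 and 4: only the order of the scores matters, not the distance between them.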

A preliminary exploration of the data helps us understand its characteristics. In this article, we look at some statistical methods for doing so.

Covariance & Correlation

Wikipedia defines covariance as a measure of how much two variables change together. However, http://mathworld.wolfram.com/Covariance.html defines covariance as providing a measure of the strength of the correlation between two or more sets of random variates. In general, a correlation gives the strength of the relationship between variables.

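So, for pairs of quantities $(x_i, y_i)$, $i = 1, \dots, N$, the linear correlation coefficient $r$ (also called the product-moment correlation coefficient, or Pearson's r) is given by the formula:

$$ r = \frac{\sum_{i}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i}(x_i - \bar{x})^2}\,\sqrt{\sum_{i}(y_i - \bar{y})^2}} $$

where, as usual, $\bar{x}$ is the mean of the $x_i$ and $\bar{y}$ is the mean of the $y_i$.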

Wikipedia defines Pearson's Correlation Coefficient with the following formula:

$$ r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right) $$

If we have a series of $n$ measurements of X and Y written as $x_i$ and $y_i$ where $i = 1, 2, \dots, n$, then the sample correlation coefficient can be used to estimate the population Pearson correlation between X and Y, where $\bar{x}$ and $\bar{y}$ are the sample means of X and Y, and $s_x$ and $s_y$ are the sample standard deviations of X and Y.
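The sum-of-deviations form and the standard-deviation form of the coefficient describe the same quantity. The short derivation below is a sketch of why, assuming the sample standard deviations use the $n-1$ denominator:

```latex
s_x = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2},
\qquad
s_y = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(y_i-\bar{y})^2}

r = \frac{1}{n-1}\sum_{i=1}^{n}\frac{(x_i-\bar{x})}{s_x}\,\frac{(y_i-\bar{y})}{s_y}
  = \frac{\sum_{i}(x_i-\bar{x})(y_i-\bar{y})}{(n-1)\,s_x\,s_y}
  = \frac{\sum_{i}(x_i-\bar{x})(y_i-\bar{y})}
         {\sqrt{\sum_{i}(x_i-\bar{x})^2}\,\sqrt{\sum_{i}(y_i-\bar{y})^2}}
```

The last step uses $(n-1)\,s_x\,s_y = \sqrt{\sum_i (x_i-\bar{x})^2}\,\sqrt{\sum_i (y_i-\bar{y})^2}$, so the factor $n-1$ cancels.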

Source: Numerical Recipes:

The value of r lies between -1 and 1, inclusive. It takes on a value of 1, termed complete positive correlation, when the data points lie on a perfect straight line with positive slope, with x and y increasing together. The value 1 holds independent of the magnitude of the slope. If the data points lie on a perfect straight line with a negative slope, y decreasing as x increases, then r has the value -1; this is called complete negative correlation. A value of r near zero indicates that the variables x and y are uncorrelated.
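The claim that r is 1 or -1 on a perfect line, whatever the slope, can be checked with a small sketch. The class `PearsonSketch` and its `Pearson` method are hypothetical names for this illustration only, and the three data sets are made up for the purpose.

```csharp
using System;

public static class PearsonSketch
{
    // Pearson's r from sums of deviations about the means.
    // Returns NaN when either input has zero variance (division by zero).
    public static double Pearson(double[] x, double[] y)
    {
        if (x.Length != y.Length || x.Length < 2)
            throw new ArgumentException("x and y need the same length, at least 2.");
        int n = x.Length;
        double meanX = 0.0, meanY = 0.0;
        for (int i = 0; i < n; i++) { meanX += x[i]; meanY += y[i]; }
        meanX /= n;
        meanY /= n;
        double sxx = 0.0, syy = 0.0, sxy = 0.0;
        for (int i = 0; i < n; i++)
        {
            double dx = x[i] - meanX, dy = y[i] - meanY;
            sxx += dx * dx;
            syy += dy * dy;
            sxy += dx * dy;
        }
        return sxy / Math.Sqrt(sxx * syy);
    }

    public static void Main()
    {
        double[] x = { 1, 2, 3, 4, 5 };
        double[] steep = { 10, 20, 30, 40, 50 };      // y = 10x
        double[] shallow = { 0.5, 1.0, 1.5, 2.0, 2.5 }; // y = 0.5x
        double[] falling = { 9, 7, 5, 3, 1 };         // y = 11 - 2x
        Console.WriteLine(Pearson(x, steep));
        Console.WriteLine(Pearson(x, shallow));
        Console.WriteLine(Pearson(x, falling));
    }
}
```

The steep and shallow lines both give r = 1, and the falling line gives r = -1, so r reports how well the points fit a line, not how steep that line is.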

When a correlation is known to be significant, r is one conventional way of summarizing its strength. In fact, the value of r can be translated into a statement about what residuals (root mean square deviations) are to be expected if the data is fitted to a straight line by the least-squares method.

Unfortunately, r is a rather poor statistic for deciding whether an observed correlation is statistically significant, and/or whether one observed correlation is significantly stronger than another. The reason is that r is ignorant of the individual distributions of x and y, so there is no universal way to compute its distribution in the case of the null hypothesis.

About the only general statement that can be made is this: If the null hypothesis is that x and y are uncorrelated, and if the distributions for x and y each have enough convergent moments (tails die off sufficiently rapidly), and if N is large (typically > 500), then r is distributed approximately normally, with a mean of zero and a standard deviation of 1 divided by the square root of N. In that case, the (double-sided) significance of the correlation, that is, the probability that $|r|$ should be larger than its observed value in the null hypothesis, is:

$$ \operatorname{erfc}\left(\frac{|r|\sqrt{N}}{\sqrt{2}}\right) $$

where erfc(x) is the complementary error function.
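The large-N significance described above, erfc(|r| √N / √2), can be sketched as follows. The names `SignificanceSketch` and `LargeNSignificance` are hypothetical. `Erfcc` repeats the approximation coefficients of the `erfcc` routine listed later in this article, because `System.Math` does not provide a complementary error function.

```csharp
using System;

public static class SignificanceSketch
{
    // Complementary error function, using the same approximation
    // coefficients as the article's erfcc routine.
    public static double Erfcc(double x)
    {
        double z = Math.Abs(x);
        double t = 1.0 / (1.0 + 0.5 * z);
        double ans = t * Math.Exp(-z * z - 1.26551223 + t * (1.00002368 +
            t * (0.37409196 + t * (0.09678418 + t * (-0.18628806 +
            t * (0.27886807 + t * (-1.13520398 + t * (1.48851587 +
            t * (-0.82215223 + t * 0.17087277)))))))));
        return x >= 0.0 ? ans : 2.0 - ans;
    }

    // Two-sided significance of r under the null hypothesis of no correlation,
    // assuming N is large enough for r to be approximately normal
    // with standard deviation 1 / sqrt(N).
    public static double LargeNSignificance(double r, int n)
    {
        return Erfcc(Math.Abs(r) * Math.Sqrt(n) / Math.Sqrt(2.0));
    }

    public static void Main()
    {
        // Hypothetical values: a weak correlation observed on 1000 pairs.
        Console.WriteLine(LargeNSignificance(0.1, 1000));
        // The same r on 1000 pairs with r = 0.01 is far less convincing.
        Console.WriteLine(LargeNSignificance(0.01, 1000));
    }
}
```

A small value means the observed |r| would rarely arise by chance if x and y were uncorrelated. With 1000 pairs, r = 0.1 gives a probability below 1%, while r = 0.01 gives one above 50%.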

Implementation / Data

Month   01   02   03   04   05   06   07   08   09   10   11   12
Profit  525  300  450  300  400  500  550  125  300  400  500  550
Sales   250  225  275  350  325  375  450  400  500  550  600  525

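public static void GetPearson(double[] x,
                              double[] y,
                              ref double r, ref double prob)
{
    // TINY regularizes the unusual case of complete correlation
    const double TINY = 1.0e-20;
    int j, n = x.Length;
    double yt, xt, z;
    double syy = 0.0, sxy = 0.0, sxx = 0.0, ay = 0.0, ax = 0.0;
    for (j = 0; j < n; j++) {
        // accumulate the sums used for the means
        ax += x[j];
        ay += y[j];
    }
    ax /= n;
    ay /= n;
    for (j = 0; j < n; j++) {
        // accumulate squared deviations and cross products
        xt = x[j] - ax;
        yt = y[j] - ay;
        sxx += xt * xt;
        syy += yt * yt;
        sxy += xt * yt;
    }
    // correlation coefficient
    r = sxy / (Math.Sqrt(sxx * syy) + TINY);
    // Fisher's z transformation of r (assumed here, following the
    // Numerical Recipes pearsn routine that this function is based on)
    z = 0.5 * Math.Log((1.0 + r + TINY) / (1.0 - r + TINY));
    // significance, an approximation intended for large n
    prob = erfcc(Math.Abs(z * Math.Sqrt(n - 1.0)) / 1.4142136);
}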

public static Double erfcc(Double x)
{
    Double t, z, ans;
    z = Math.Abs(x);
    t = 1.0 / (1.0 + 0.5 * z); 
    ans = t * Math.Exp(-z * z - 1.26551223 + t * (1.00002368 + 
                           t * (0.37409196 + t * (0.09678418 +
                          t * (-0.18628806 + t * (0.27886807 + 
                          t * (-1.13520398 + t * (1.48851587 + 
                     t * (-0.82215223 + t * 0.17087277)))))))));
    return x >= 0.0 ? ans : 2.0 - ans;
}
To test the function:

// Profit and Sales for the 12 months in the table above
double[] x = new double[]{525, 300, 450, 300, 400, 500,
550, 125, 300, 400, 500, 550};

double[] y = new double[]{250, 225, 275, 350,
325, 375, 450, 400, 500, 550, 600, 525};

double r = 0, prob = 0;
GetPearson(x, y,ref r,ref prob);

The same output is computed via Excel's PEARSON and CORREL functions:

[Figure: Excel worksheet showing the PEARSON and CORREL results for the Profit and Sales data]
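As a cross-check on the sample data, the coefficient can also be worked out by hand. The figures below are a rounded hand calculation with x = Profit and y = Sales (N = 12), and should be compared against the program output before relying on them:

```latex
\bar{x} = \frac{4900}{12} \approx 408.33, \qquad \bar{y} = \frac{4825}{12} \approx 402.08

\sum_i (x_i-\bar{x})^2 = 2188750 - \frac{4900^2}{12} \approx 187916.67

\sum_i (y_i-\bar{y})^2 = 2108125 - \frac{4825^2}{12} \approx 168072.92

\sum_i (x_i-\bar{x})(y_i-\bar{y}) = 2001250 - \frac{4900 \cdot 4825}{12} \approx 31041.67

r \approx \frac{31041.67}{\sqrt{187916.67 \times 168072.92}} \approx 0.175
```

So profit and sales are only weakly positively correlated in this sample. Because N = 12 is far below the "typically > 500" condition quoted earlier, the probability returned alongside r is at best a rough indication.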

Other statistical functions that can be performed on the data include the following: Min, Max, Median, Mean, and Mode.

public static ArrayList CalculateMode(params int[] array)
{
    ArrayList mode = new ArrayList();
    int[] sortArray = new int[array.Length];

    int[,] frequency = new int[array.Length, 2];
    Array.Copy(array, sortArray, array.Length);
    Array.Sort(sortArray);
    int tmp = sortArray[0]; // beginning value
    int index = 0; // index into frequency array
    int i = 0;
    while (i < sortArray.Length)
    {
        frequency[index, 0] = tmp; // copy value

        while (tmp.Equals(sortArray[i]))
        {
            frequency[index, 1]++;  // count occurrences
            i++;

            if (i > sortArray.Length - 1) // don't overrun array!
                break;
        }

        if (i < sortArray.Length)
        {
            tmp = sortArray[i]; // move to next value
            index++;
        }
    }
    Array.Clear(sortArray, 0, sortArray.Length); // zero it out

    for (i = 0; i < sortArray.Length; i++)
        sortArray[i] = frequency[i, 1]; // copy the occurrence counts

    Array.Sort(sortArray); // ascending order

    index = sortArray.Length - 1;

    switch (sortArray[index])
    {
        case 0:
            mode.Add("none");
            break;

        case 1:
            mode.Add("none");
            break;

        default:
            for (i = 0; i < frequency.Length / frequency.Rank; i++)
                if (frequency[i, 1].Equals(sortArray[index]))
                    mode.Add(frequency[i, 0].ToString());
                    // convert mode integer to a string
            break;
    }
    return mode;
}

public static double CalculateMedian(params int[] pNumbers)
{
    int size = pNumbers.Length;
    if (size == 0)
        throw new System.ArgumentException("No data");

    // The median is defined on ordered data, so sort a copy
    // and leave the caller's array untouched.
    int[] sorted = (int[])pNumbers.Clone();
    Array.Sort(sorted);

    int mid = size / 2;
    double median = (size % 2 != 0) ? (double)sorted[mid] :
    ((double)sorted[mid] + (double)sorted[mid - 1]) / 2;
    return median;
}
public static double CalculateMean(params int[] data)
{
    int len = data.Length;
    if (len == 0)
        throw new System.ArgumentException("No data");

    double sum = 0;
    for (int i = 0; i < data.Length; i++)
        sum += data[i];
    return sum / len;
}
public static int CalculateMin(params int[] rest)
{
    int min = rest[0];
    foreach (int next in rest)
    {
        min = Math.Min(min, next);
    }
    return min;
}
public static int CalculateMax(params int[] rest)
{
    int max = rest[0];
    foreach (int next in rest)
    {
        max = Math.Max(max, next);
    }
    return max;
}
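For comparison, the same descriptive summary can be sketched with LINQ. This is an alternative illustration, not the article's implementation. `SummarySketch`, `Median` and `Modes` are hypothetical names, and returning an empty array when no value repeats is an assumption that mirrors the "none" case above.

```csharp
using System;
using System.Linq;

public static class SummarySketch
{
    // Median of the data, computed on a sorted copy.
    public static double Median(int[] data)
    {
        if (data.Length == 0)
            throw new ArgumentException("No data");
        int[] sorted = data.OrderBy(v => v).ToArray();
        int mid = sorted.Length / 2;
        return sorted.Length % 2 != 0
            ? sorted[mid]
            : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    // All values sharing the highest count, in ascending order;
    // empty when there is no data or no value occurs more than once.
    public static int[] Modes(int[] data)
    {
        var counts = data.GroupBy(v => v)
                         .Select(g => new { Value = g.Key, Count = g.Count() })
                         .ToList();
        if (counts.Count == 0)
            return new int[0];
        int best = counts.Max(c => c.Count);
        if (best < 2)
            return new int[0];
        return counts.Where(c => c.Count == best)
                     .Select(c => c.Value)
                     .OrderBy(v => v)
                     .ToArray();
    }

    public static void Main()
    {
        int[] profit = { 525, 300, 450, 300, 400, 500, 550, 125, 300, 400, 500, 550 };
        Console.WriteLine("Min:    " + profit.Min());
        Console.WriteLine("Max:    " + profit.Max());
        Console.WriteLine("Mean:   " + profit.Average());
        Console.WriteLine("Median: " + Median(profit));
        Console.WriteLine("Mode:   " + string.Join(", ", Modes(profit)));
    }
}
```

For the Profit column this gives a minimum of 125, a maximum of 550, a mean of about 408.33, a median of 425 and a mode of 300. The Sales column has no repeated value, so it has no mode.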

Covariance: The covariance is especially useful when looking at the variance of the sum of two random variates. It provides a measure of the strength of the correlation between two or more sets of random variates.

$$ \operatorname{cov}(X, Y) = \langle (X - \mu_X)(Y - \mu_Y) \rangle = \langle XY \rangle - \mu_X \mu_Y $$
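The remark about the variance of a sum, and the link between covariance and the correlation coefficient, can be written out as follows. These are standard identities, stated here as a sketch in terms of two variates X and Y with sample standard deviations $s_x$ and $s_y$:

```latex
\operatorname{var}(X + Y) = \operatorname{var}(X) + \operatorname{var}(Y) + 2\,\operatorname{cov}(X, Y)

r = \frac{\operatorname{cov}(X, Y)}{s_x\, s_y}
```

When X and Y are uncorrelated the covariance term vanishes and the variances simply add. Dividing the covariance by both standard deviations removes the units, which is why r always lies between -1 and 1.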

Note: The correctness of the implementation has not been tested.

// Sample covariance matrix of the columns of matrix:
// rows are observations, columns are variables.
public static double[,] GetCovar(double[,] matrix)
{
    int rows = matrix.GetLength(0); // observations
    int cols = matrix.GetLength(1); // variables

    // Subtract the mean of each column
    double[,] centered = new double[rows, cols];
    for (int j = 0; j < cols; j++)
    {
        double sum = 0d;
        for (int i = 0; i < rows; i++)
        {
            sum += matrix[i, j];
        }
        double mean = sum / rows;
        for (int i = 0; i < rows; i++)
        {
            centered[i, j] = matrix[i, j] - mean;
        }
    }

    // Multiply the transpose of the centered matrix by the centered
    // matrix, then normalize by (rows - 1)
    double[,] covar = new double[cols, cols];
    for (int a = 0; a < cols; a++)
        for (int b = 0; b < cols; b++)
        {
            double temp = 0d;
            for (int k = 0; k < rows; k++)
            {
                temp += centered[k, a] * centered[k, b];
            }
            covar[a, b] = temp / (rows - 1);
        }

    return covar;
}
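For just two variables, the covariance reduces to a single number. The sketch below is an illustration under stated assumptions, not the article's routine: `CovarianceSketch` and `SampleCovariance` are hypothetical names, and the N - 1 denominator (sample covariance) is assumed.

```csharp
using System;

public static class CovarianceSketch
{
    // Sample covariance of two equally long series, using the N - 1 denominator.
    public static double SampleCovariance(double[] x, double[] y)
    {
        if (x.Length != y.Length || x.Length < 2)
            throw new ArgumentException("x and y need the same length, at least 2.");
        int n = x.Length;
        double meanX = 0.0, meanY = 0.0;
        for (int i = 0; i < n; i++) { meanX += x[i]; meanY += y[i]; }
        meanX /= n;
        meanY /= n;
        double sxy = 0.0;
        for (int i = 0; i < n; i++)
            sxy += (x[i] - meanX) * (y[i] - meanY);
        return sxy / (n - 1);
    }

    public static void Main()
    {
        double[] profit = { 525, 300, 450, 300, 400, 500, 550, 125, 300, 400, 500, 550 };
        double[] sales = { 250, 225, 275, 350, 325, 375, 450, 400, 500, 550, 600, 525 };
        Console.WriteLine(SampleCovariance(profit, sales));
        // The covariance of a series with itself is its sample variance.
        Console.WriteLine(SampleCovariance(profit, profit));
    }
}
```

For the Profit and Sales data the sample covariance is roughly 2822. Unlike r, this value depends on the units of both variables, so it is hard to interpret on its own.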

References

  • Kardi Teknomo, PhD - Data attribute exploration
  • Numerical Recipes - William Press et al.
  • Wikipedia
  • Open Source community

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


About the Author

saharkiz
Web Developer
Philippines Philippines
My name : Aresh Saharkhiz.
Origin: Unknown

Education : BS Computer Science
MS Computer Science
Interests : Genetic Programming
Neural Networks
Game AI
Programming: (language == statementEnd(semicolon)


http://sites.google.com/site/docaresh

Skill:
Flash
Carrara 3D
PHP,ASP,ASP.NET
J2SE

Last Updated 10 Jan 2010
Article Copyright 2009 by saharkiz