An Algorithm for Weighted Linear Regression

Posted 17 Apr 2008, under the Code Project Open License (CPOL)

A C# implementation of a general weighted linear regression with complete statistics.

Introduction

Linear regression is a useful technique for representing observed data by a mathematical equation. In linear regression, the dependent variable is assumed to be a linear function of one or more independent variables. Normally, the regression problem is formulated as a least squares minimization problem. For the case of linear least squares, the analysis requires the solution of a set of simultaneous equations that can easily be solved using Gaussian Elimination. Linear least squares and Gaussian elimination are both well-known techniques.

Unfortunately, there are many times when one knows that the dependent variable is not a linear function, but that a transformed variable might be. In addition, in many cases it is necessary not only to compute the best formula to represent the data, but also to estimate the accuracy of the parameters. This article presents a C# implementation of a weighted linear regression: the weighting overcomes the problem of nonlinearity of the dependent variable, and an efficient symmetric matrix inversion algorithm computes the complete variance-covariance matrix, allowing estimation of confidence intervals for the estimated regression coefficients.

The files included with this article contain the source code for the linear regression class, as well as an example program.

Weighted Linear Regression

The standard linear regression problem can be stated mathematically as follows, where yj represents the jth measured or observed dependent variable value, xi,j represents the jth measured independent variable value for the ith variable, and Ci is the regression coefficient to be determined. M represents the number of data points, and N represents the number of linear terms in the regression equation.
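The equation referenced here appeared as an image in the original article and is not reproduced; the following is a reconstruction from the definitions in the text, minimizing the sum of squared residuals over the coefficients:

```latex
S \;=\; \sum_{j=1}^{M} \left( y_j \;-\; \sum_{i=1}^{N} C_i \, x_{i,j} \right)^{2} \;\longrightarrow\; \min_{C_1,\dots,C_N}
```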

It is important to note that this statement assumes that all errors have the same significance, but that is often not true. If there is a drift in the precision of the measurements, for example, some errors may be more or less important than others. Another important case arises if we need to transform the dependent variable, y, in order to get a linear representation. In many practical cases, the logarithm of y (or some other function) may be much better at representing the data. In the case of a logarithmic relationship, we can fit Log(y) = C1x1 + C2x2 + ... to represent the relationship y = e^(C1x1 + C2x2 + ...). When a transformation is needed, however, the errors in the transformed variable are no longer necessarily all of the same significance. As a simple example using the Log transformation, note that Log(1 ± 0.1) covers a much different range than Log(1000 ± 0.1). In such cases, we can approximately account for the variation in error significance by using a weighted regression, as shown in the following modified equation.
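The weighted form also appeared as an image in the original article; a reconstruction consistent with the unweighted sum above, with a weight factor w_j multiplying each squared residual:

```latex
S_w \;=\; \sum_{j=1}^{M} w_j \left( y_j \;-\; \sum_{i=1}^{N} C_i \, x_{i,j} \right)^{2} \;\longrightarrow\; \min_{C_1,\dots,C_N}
```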

In this formulation, the squared difference between each observed and predicted value is multiplied by a weighting factor, wj, to account for the variation in significance of the errors. If the differences are due to variations in measurement precision, the weight factors will need to be determined based on the precision drift. If the differences in significance are due to a variable transformation, however, we can often estimate them based on the functional form of the transformation.

In the case of a variable transformation, we can approximate the error in terms of the derivative of the function. Assuming that we are measuring y and transforming to f(y), the following relationship represents a first order approximation:
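The relationship appeared as an image in the original article; a reconstruction: if f(y) is the transformed variable, a first order (Taylor) expansion relates an error in y to the corresponding error in f(y), and inverting the squared scale factor gives the weight:

```latex
\Delta f(y) \;\approx\; \frac{df}{dy}\,\Delta y
\qquad\Longrightarrow\qquad
w_j \;=\; \left( \left. \frac{df}{dy} \right|_{y_j} \right)^{-2}
```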

Essentially, the weight factor acts as a first order correction, and if the regression yields small residual errors, the approximation is very close. As can be seen, for the case where a Log(y) transformation is used, the weight for each data point would be (dLog(y)/dy)^-2 = y^2.

It might seem more reasonable to define the weights as multiplying the difference in the computed and measured values, rather than the difference squared. The reason that they are defined as multiplying the difference squared is due to the relationship between the statistical variance and the least squares terms. In the case of a changing measurement precision, it makes sense to adjust the variance, which is related to the square of the difference, rather than the difference itself.

For more information on regression analysis, including weighted regressions, please refer to the book by Draper and Smith (1966) listed in the references. This book should be considered one of the classical texts on practical regression analysis.

Solving the Weighted Regression

Solving the linear regression equation is straightforward. Since the terms are linear and the objective is to compute the minimum with respect to all of the coefficients, the standard derivation is to take the derivative of the least squares sum with respect to each coefficient, Ci, and require that the derivatives are all exactly zero. This yields a set of simultaneous equations with the coefficients, Ci, as unknowns which can be solved using standard linear algebra. In the weighted least squares case, the equations are the same as the standard, unweighted case, except the weights are included in each of the sums. For reference, the equations are:
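The equations appeared as an image in the original article; a reconstruction, obtained by setting the derivative of the weighted sum of squares with respect to each coefficient to zero:

```latex
\sum_{i=1}^{N} \left( \sum_{k=1}^{M} w_k \, x_{i,k} \, x_{j,k} \right) C_i
\;=\;
\sum_{k=1}^{M} w_k \, x_{j,k} \, y_k,
\qquad j = 1, \dots, N
```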

Most simple least squares algorithms use Gaussian Elimination to solve the simultaneous equations, since it is fast and easy to program. In fact, if all you need is the best set of coefficients, it's probably best to use Gaussian elimination. If, however, you want to do some additional analyses, then Gaussian Elimination may not be the best option.

An alternate method for solving the equations is to represent the simultaneous equations as a matrix equation:
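The matrix equation appeared as an image in the original article; reconstructed in the article's notation (note that these are the same sums accumulated in the `V` and `B` arrays in the code below, where the symmetric coefficient matrix is called X in the text):

```latex
X\,C = B,
\qquad
X_{i,j} = \sum_{k=1}^{M} w_k \, x_{i,k} \, x_{j,k},
\qquad
B_i = \sum_{k=1}^{M} w_k \, x_{i,k} \, y_k,
\qquad
C = X^{-1} B
```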

Solving the matrix equation can be accomplished by inverting the X matrix, then multiplying by the B vector to determine the values of C. The reason that this is an option worth considering is twofold:

1. the inverted X matrix is directly proportional to the variance-covariance matrix, which contains almost all of the information about the accuracy of the coefficient estimates, and
2. X is a symmetric matrix, so it can be inverted efficiently.

The heart of the algorithm for the weighted linear regression is actually the method that inverts the symmetric matrix. The source code for this method is shown below. Note that I can't claim this algorithm as my own. I was given it while in graduate school, back in about 1972, by an office mate named Eric Griggs. I don't know where Eric got it, but it has been floating around in various forms for a long, long time. The original code that I obtained was written in Fortran 1, and it has since been translated into many other languages. Since it was published in at least one report done for the State of Texas, it is in the public domain. If anyone knows who originally came up with the algorithm, I'd be pleased to give them due credit.

The method `bool SymmetricMatrixInvert(double[,] V)` takes a square, symmetric matrix, `V`, as its argument, and inverts the matrix in place. In other words, when the method is called, `V` contains a symmetric matrix. If the inversion fails due to the matrix being singular, the method returns `false`; if the inversion succeeds, it returns `true`. After the call, `V` contains the inverted matrix.

```
public bool SymmetricMatrixInvert(double[,] V)
{
    int N = (int)Math.Sqrt(V.Length);
    double[] t = new double[N];
    double[] Q = new double[N];
    double[] R = new double[N];
    double AB;
    int K, L, M;

    // Invert a symmetric matrix in V
    for (M = 0; M < N; M++)
        R[M] = 1;
    K = 0;
    for (M = 0; M < N; M++)
    {
        double Big = 0;
        for (L = 0; L < N; L++)
        {
            AB = Math.Abs(V[L, L]);
            if ((AB > Big) && (R[L] != 0))
            {
                Big = AB;
                K = L;
            }
        }
        if (Big == 0)
        {
            return false;   // Matrix is singular
        }
        R[K] = 0;
        Q[K] = 1 / V[K, K];
        t[K] = 1;
        V[K, K] = 0;
        if (K != 0)
        {
            for (L = 0; L < K; L++)
            {
                t[L] = V[L, K];
                if (R[L] == 0)
                    Q[L] = V[L, K] * Q[K];
                else
                    Q[L] = -V[L, K] * Q[K];
                V[L, K] = 0;
            }
        }
        if ((K + 1) < N)
        {
            for (L = K + 1; L < N; L++)
            {
                if (R[L] != 0)
                    t[L] = V[K, L];
                else
                    t[L] = -V[K, L];
                Q[L] = -V[K, L] * Q[K];
                V[K, L] = 0;
            }
        }
        for (L = 0; L < N; L++)
            for (K = L; K < N; K++)
                V[L, K] = V[L, K] + t[L] * Q[K];
    }

    // Copy the upper triangle into the lower triangle
    M = N;
    L = N - 1;
    for (K = 1; K < N; K++)
    {
        M = M - 1;
        L = L - 1;
        for (int J = 0; J <= L; J++)
            V[M, J] = V[J, M];
    }
    return true;
}
```

Besides the matrix inversion, additional code is needed to set up the equations, compute statistics, etc. For that, I defined a class `LinearRegression` that also saves the results of various statistical measures, such as the multiple correlation coefficient, standard deviation of the residual errors, the Fisher F statistic for the regression, as well as the variance/covariance matrix, computed coefficients and their standard errors, calculated values, and residuals. Although I don't show all the details here, the source code for the class contains properties to retrieve all of these items once the regression has been computed.

```
public class LinearRegression
{
    double[,] V;            // Least squares and var/covar matrix
    public double[] C;      // Coefficients
    public double[] SEC;    // Std Error of coefficients
    double RYSQ;            // Multiple correlation coefficient
    double SDV;             // Standard deviation of errors
    double FReg;            // Fisher F statistic for regression
    double[] Ycalc;         // Calculated values of Y
    double[] DY;            // Residual values of Y

    public bool Regress(double[] Y, double[,] X, double[] W)
    {
        // Y[j]   = j-th observed data point
        // X[i,j] = j-th value of the i-th independent variable
        // W[j]   = j-th weight value

        int M = Y.Length;             // M = Number of data points
        int N = X.Length / M;         // N = Number of linear terms
        int NDF = M - N;              // Degrees of freedom
        Ycalc = new double[M];
        DY = new double[M];
        // If not enough data, don't attempt regression
        if (NDF < 1)
        {
            return false;
        }
        V = new double[N, N];
        C = new double[N];
        SEC = new double[N];
        double[] B = new double[N];   // Vector for LSQ

        // Form the least squares matrix and right-hand side
        for (int i = 0; i < N; i++)
        {
            for (int j = 0; j < N; j++)
            {
                V[i, j] = 0;
                for (int k = 0; k < M; k++)
                    V[i, j] = V[i, j] + W[k] * X[i, k] * X[j, k];
            }
            B[i] = 0;
            for (int k = 0; k < M; k++)
                B[i] = B[i] + W[k] * X[i, k] * Y[k];
        }
        // V now contains the raw least squares matrix
        if (!SymmetricMatrixInvert(V))
        {
            return false;
        }
        // V now contains the inverted least squares matrix
        // Matrix multiply to get coefficients C = VB
        for (int i = 0; i < N; i++)
        {
            C[i] = 0;
            for (int j = 0; j < N; j++)
                C[i] = C[i] + V[i, j] * B[j];
        }

        // Calculate statistics
        double TSS = 0;               // Total sum of squares
        double RSS = 0;               // Residual sum of squares
        double YBAR = 0;              // Weighted mean of Y
        double WSUM = 0;              // Sum of weights
        for (int k = 0; k < M; k++)
        {
            YBAR = YBAR + W[k] * Y[k];
            WSUM = WSUM + W[k];
        }
        YBAR = YBAR / WSUM;
        for (int k = 0; k < M; k++)
        {
            Ycalc[k] = 0;
            for (int i = 0; i < N; i++)
                Ycalc[k] = Ycalc[k] + C[i] * X[i, k];
            DY[k] = Ycalc[k] - Y[k];
            TSS = TSS + W[k] * (Y[k] - YBAR) * (Y[k] - YBAR);
            RSS = RSS + W[k] * DY[k] * DY[k];
        }
        double SSQ = RSS / NDF;
        RYSQ = 1 - RSS / TSS;
        FReg = 9999999;
        if (RYSQ < 0.9999999)
            FReg = RYSQ / (1 - RYSQ) * NDF / (N - 1);
        SDV = Math.Sqrt(SSQ);

        // Calculate var-covar matrix and std error of coefficients
        for (int i = 0; i < N; i++)
        {
            for (int j = 0; j < N; j++)
                V[i, j] = V[i, j] * SSQ;
            SEC[i] = Math.Sqrt(V[i, i]);
        }
        return true;
    }
}
```

Using the class to perform a weighted linear regression is straightforward: create an instance of the `LinearRegression` class, put the data and weights into suitable arrays, then call the `Regress` method, which returns `true` if the regression is calculated and `false` if it fails, usually due to not having enough points. The coefficients are then available from the `Coefficients[]` array (the `C` field in the code above), and the standard errors of the coefficients are available in the `CoefficientsStandardError[]` array (the `SEC` field). The following code snippet shows how to call the regression once the data arrays have been set up.

```
LinearRegression linReg = new LinearRegression();
if (linReg.Regress(y, x, w))
{
    // Do whatever you need to do with the results
}
```
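As a sketch of the data setup discussed above, the following shows how the arrays might be filled to fit Log(y) = C1 + C2x with weights w = y^2. The input arrays `yData` and `xData` are assumptions for illustration, not part of the class:

```csharp
// Sketch: fit Log(y) = C1 + C2*x using weights w = y^2
// (array layout follows Regress: X[i, k] = k-th value of the i-th term;
//  yData and xData are hypothetical measured-data arrays)
int M = yData.Length;
double[] Y = new double[M];
double[] W = new double[M];
double[,] X = new double[2, M];   // N = 2 linear terms: constant and x
for (int k = 0; k < M; k++)
{
    Y[k] = Math.Log(yData[k]);    // transformed dependent variable
    W[k] = yData[k] * yData[k];   // w = (dLog(y)/dy)^-2 = y^2
    X[0, k] = 1;                  // constant term
    X[1, k] = xData[k];           // linear term
}
LinearRegression lr = new LinearRegression();
bool ok = lr.Regress(Y, X, W);    // C1 = lr.C[0], C2 = lr.C[1] if ok
```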

The sample program is a simple illustration of the use of the `LinearRegression` class. By default, it generates 20 values with random errors based on the equation y = 4 + 3x, and allows fitting the data with a general polynomial equation. If you change the equation type from `y` to `Log(y)` using the checkbox, the weights are recomputed accordingly. Pressing the Fit button performs the regression and displays the coefficients, along with their standard errors. Note that if you use more than a 1st order polynomial, the 3rd and higher order coefficients will be meaningless, as indicated by their standard errors. To compute the actual confidence intervals, the standard errors should be multiplied by the appropriate Student's t statistic, which is not included in this article.

A screenshot of the sample program is shown above.

You can change the values in the table and perform your own regressions if you like. With some modifications, the sample program could be converted into a general regression package, although I have not done that here.

Comments

Linear regression is a powerful technique for representing observed data and trends with equations. If one variable is a simple function of another, graphical techniques may be good enough. In more complex cases, however, with many thousands of data points and perhaps dozens of variables, graphical techniques are not practical. Using a weighted regression, and having access to the variance/covariance matrix and the standard errors of the coefficients, is often enlightening: it may show which coefficients are meaningless and should be eliminated from the equation or replaced by terms of a different nature.

Unfortunately, I have observed many naive applications of regression techniques. In many published papers, the authors have blindly increased the number of terms in a regression to improve their results. In many cases the small improvements are statistically meaningless, but if the authors never look at the statistics, they don't recognize that. In other cases, I have seen function transformations used, such as fitting Log(y) rather than y, without regard to the variations in accuracy. Looking at errors in Log(y) can be very misleading unless the proper methods are used: the direct use of Log(y) is really only valid if the measurements have a precision proportional to their values, while most measurements have an absolute precision.

Part of the problem may be due to a lack of understanding of the underlying mathematical principles, but part of the problem is also due to the lack of general software that implements the correct procedures. It is my hope that perhaps someone will find these algorithms useful. I'd be very interested in any comments concerning statistical procedures or any observations to improve the algorithms even more.

References

• Draper, N. R. and H. Smith, Applied Regression Analysis, New York: Wiley (1966)

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Engineer, Comport Computing, United States

Walt has been playing with software since around 1967 and has generated more runtime errors than the average village idiot. He is a CEO, petroleum engineer, software consultant, janitor, and now a graduate student again. Rather than sleep, he plays with algorithms and systems for technical computing, develops software for engineering evaluations, and is an avid amateur radio operator.

Walt was admitted back to UT Austin and is actually attempting to complete a PhD in engineering, thereby proving that he is crazier than the average old fart.

And now UT has gone and admitted Walt to PhD candidacy. His dissertation has been submitted and reviewed, so he's preparing for his defense so he can wrap things up and graduate, proving that old guys can still ... what was he doing again?

Article Copyright 2008 by Walt Fair, Jr.