
Multi-Linear Regression in Java

Ata Amini, 23 Mar 2013
This article introduces multi-linear regression/classification with simple examples and provides the code in Java.

Introduction   

This article introduces a very popular subject in statistical modelling: multi-linear regression (MLR), more commonly called multiple linear regression and sometimes loosely called multi-variate regression, together with its use for classification. I will show the usage of MLR through simple examples. MLR has been used extensively in science (biological, pharmaceutical, financial, medical and more).

Background  

A few months ago I wrote an article about matrix operations in Java. I suggest reading that article first, since the code in this article depends heavily on those matrix operations.

To understand multi-linear regression (MLR), have a look at the following table: 


Diet score   Male   Age>20   BMI
4            0      1        27
7            1      1        29
6            1      0        23
2            0      0        20
3            0      1        21

The body mass index (BMI) of five people has been measured. For each person, the diet score, whether they are male (1) or female (0) and whether they are older than 20 (1) or not (0) have also been recorded in three columns. Do not ask me what a diet score is or how to measure it, because I do not know; this is just a toy example. The question is: what is the relationship between BMI and diet score, gender and age? If we have the diet score, gender and age of a new person, can we estimate his/her body mass index? MLR is here to answer these questions. We expect the relationship between BMI and the three variables to look something like this:

BMI = beta0 + beta1 * Diet_Score + beta2 * Male + beta3 * age

Based on this equation, to predict the BMI of a person whose diet score, gender and age are known, you need the values of all the beta coefficients. MLR finds the values of these unknown coefficients. We call beta0 the bias (also known as the intercept). In many real-life applications, a large bias can suggest that the predictors (i.e. the three variables) do not have enough predictive power, while a small bias is often a sign of a good predictive model. A large bias may also mean that there are descriptors able to explain the observations that we have not yet discovered.

Let's write the BMI column in the above table as a 5 x 1 column matrix named Y, the values of all independent variables as a 5 x 3 matrix named X, and the beta values, which will be found later, as a column matrix b. The unknown matrix b can be found as

b = (X'X)^-1 X'Y

where X' is the transpose of matrix X and the exponent -1 denotes the inverse of the matrix.

If you want the model to include a bias, you need to add a new column to matrix X. This new column should be the first one, and its value must be 1 in every row, so X becomes a 5 x 4 matrix in this example.
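As a minimal sketch of the bias column, assuming plain double[][] arrays rather than the Matrix class from the earlier article, the column of 1s could be prepended like this. The class and method names (BiasColumnSketch, withBiasColumn) are hypothetical and are not part of the article's code.

```java
import java.util.Arrays;

// Hypothetical helper, not the article's Matrix.insertColumnWithValue1().
final class BiasColumnSketch {
    /** Returns a copy of x with a leading column of 1.0 values (the bias column). */
    static double[][] withBiasColumn(double[][] x) {
        double[][] out = new double[x.length][];
        for (int i = 0; i < x.length; i++) {
            out[i] = new double[x[i].length + 1];
            out[i][0] = 1.0;                                    // bias entry for this row
            System.arraycopy(x[i], 0, out[i], 1, x[i].length);  // shift original values right
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] x = {{4, 0, 1}, {7, 1, 1}};                  // first two rows of the table
        System.out.println(Arrays.deepToString(withBiasColumn(x)));
        // prints [[1.0, 4.0, 0.0, 1.0], [1.0, 7.0, 1.0, 1.0]]
    }
}
```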

Limitation of MLR: MLR works only when the number of columns in the X matrix is less than or equal to the number of rows. In other words, the number of descriptors (including the bias column, if used) cannot exceed the number of observations. Another limitation concerns the inverse operation in the above equation. Not all matrices have an inverse; when X'X cannot be inverted (for example, when one column of X is a multiple of another), the calculation of the b matrix fails and therefore MLR fails. Other methods, such as Partial Least Squares or Support Vector Machines, can be used when MLR fails.
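To make the second limitation concrete, here is a small sketch under the assumption of a two-column X stored as a plain array. It builds the 2 x 2 matrix X'X and computes its determinant; a determinant of zero means X'X has no inverse. The class name SingularGramSketch and the sample data are made up for illustration.

```java
// Hypothetical illustration; only handles matrices with exactly two columns.
final class SingularGramSketch {
    /** Builds X'X for a two-column X and returns its determinant. */
    static double gramDeterminant(double[][] x) {
        double a = 0, b = 0, d = 0;          // X'X = [[a, b], [b, d]]
        for (double[] row : x) {
            a += row[0] * row[0];
            b += row[0] * row[1];
            d += row[1] * row[1];
        }
        return a * d - b * b;
    }

    public static void main(String[] args) {
        // Second column is exactly twice the first, so the columns are linearly dependent.
        double[][] dependent = {{1, 2}, {2, 4}, {3, 6}};
        // Neither column is a multiple of the other.
        double[][] independent = {{1, 0}, {2, 1}, {3, 1}};
        System.out.println(gramDeterminant(dependent));   // prints 0.0, so X'X has no inverse
        System.out.println(gramDeterminant(independent)); // prints 3.0, so X'X can be inverted
    }
}
```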

 Using the code   

On top of the matrix operation methods described in the earlier article, we only need to implement a single method to create the model and find the values of the b matrix.

public Matrix calculate() throws NoSquareException {
	if (bias)
		this.X = X.insertColumnWithValue1();
	checkDiemnsion();
	Matrix Xtr = MatrixMathematics.transpose(X); //X'
	Matrix XXtr = MatrixMathematics.multiply(Xtr,X); //X'X
	Matrix inverse_of_XXtr = MatrixMathematics.inverse(XXtr); //(X'X)^-1
	if (inverse_of_XXtr == null) {
		System.out.println("Matrix X'X does not have any inverse. So MLR failed to create the model for these data.");
		return null;
	}
	Matrix XtrY = MatrixMathematics.multiply(Xtr,Y); //X'Y
	return MatrixMathematics.multiply(inverse_of_XXtr,XtrY); //(X'X)^-1 X'Y
}

The above code follows these steps to get the b matrix:

  1. If you want a bias (i.e. beta0), add a new column of 1s to the X matrix
  2. Check that the input matrices are valid
  3. Find the transpose of X (i.e. X')
  4. Multiply X' by X
  5. Find the inverse of the matrix from step 4, i.e. (X'X)^-1
  6. Multiply X' by Y
  7. Finally, multiply the matrix from step 5 by the matrix from step 6

Now let's test the method on the above example:

Matrix X = new Matrix(new double[][]{{4,0,1},{7,1,1},{6,1,0},{2,0,0},{3,0,1}});
Matrix Y = new Matrix(new double[][]{{27},{29},{23},{20},{21}});
MultiLinear ml = new MultiLinear(X, Y);
Matrix beta = ml.calculate();

When we use the constructor with two arguments, bias defaults to true. Here are the results:

beta0  = 9.25,  beta1 = 4.75, beta2 = -13.5, beta3 = -1.25   

BMI = 9.25 + 4.75 * Diet_Score - 13.5 * Male - 1.25 * age
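These coefficients can be cross-checked without the Matrix class. The sketch below is one possible way to do it, not the article's implementation: it assumes plain arrays, takes X with the bias column already added, and solves (X'X) b = X'Y by Gauss-Jordan elimination instead of computing the inverse explicitly. The class name NormalEquationSketch is hypothetical.

```java
// Hypothetical stand-alone check; swaps the explicit inverse for Gauss-Jordan elimination.
final class NormalEquationSketch {
    /** Solves (X'X) b = X'Y, where x already contains the bias column. */
    static double[] fit(double[][] x, double[] y) {
        int n = x.length, p = x[0].length;
        double[][] a = new double[p][p + 1];            // augmented matrix [X'X | X'Y]
        for (int i = 0; i < p; i++) {
            for (int j = 0; j < p; j++)
                for (int r = 0; r < n; r++) a[i][j] += x[r][i] * x[r][j];
            for (int r = 0; r < n; r++) a[i][p] += x[r][i] * y[r];
        }
        for (int col = 0; col < p; col++) {
            int pivot = col;                            // partial pivoting for stability
            for (int r = col + 1; r < p; r++)
                if (Math.abs(a[r][col]) > Math.abs(a[pivot][col])) pivot = r;
            if (Math.abs(a[pivot][col]) < 1e-12)
                throw new ArithmeticException("X'X is singular, MLR cannot build a model");
            double[] tmp = a[col]; a[col] = a[pivot]; a[pivot] = tmp;
            for (int r = 0; r < p; r++) {               // clear this column in all other rows
                if (r == col) continue;
                double f = a[r][col] / a[col][col];
                for (int c = col; c <= p; c++) a[r][c] -= f * a[col][c];
            }
        }
        double[] b = new double[p];
        for (int i = 0; i < p; i++) b[i] = a[i][p] / a[i][i];
        return b;
    }

    public static void main(String[] args) {
        // Columns: bias, diet score, male, age>20 (the table from the Background section).
        double[][] x = {{1, 4, 0, 1}, {1, 7, 1, 1}, {1, 6, 1, 0}, {1, 2, 0, 0}, {1, 3, 0, 1}};
        double[] y = {27, 29, 23, 20, 21};
        // Values should be close to 9.25, 4.75, -13.5 and -1.25.
        for (double v : fit(x, y)) System.out.println(v);
    }
}
```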

This is a model that predicts BMI from the values of all independent variables (i.e. diet score, gender and age). The magnitude of each beta and its sign indicate its importance and the direction of its effect, although magnitudes are only directly comparable when the variables are on similar scales. In this illustrative example, diet score and gender contribute more to BMI than age, and the effects of gender and diet score are opposite: people with a higher diet score have a higher BMI, while males have a markedly lower BMI than females. It is interesting to see the insight that MLR gives into the BMI observations.
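Applying the model to a new person is a single weighted sum. The following sketch assumes the four coefficients reported above and a made-up person (diet score 5, male, older than 20); the class BmiPredictionSketch and its predict method are hypothetical names.

```java
// Hypothetical helper that evaluates the fitted equation for one person.
final class BmiPredictionSketch {
    /** beta = {beta0, beta1, beta2, beta3} in the order used by the equation above. */
    static double predict(double[] beta, double dietScore, boolean male, boolean olderThan20) {
        return beta[0]
             + beta[1] * dietScore
             + beta[2] * (male ? 1 : 0)
             + beta[3] * (olderThan20 ? 1 : 0);
    }

    public static void main(String[] args) {
        double[] beta = {9.25, 4.75, -13.5, -1.25};              // coefficients from the results
        System.out.println(predict(beta, 5, true, true));        // prints 18.25 (made-up person)
        System.out.println(predict(beta, 4, false, true));       // prints 27.0 (first table row)
    }
}
```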

One final question: Is this a good model? The minimum that we can do is to use the model (the above equation) and predict the BMI and then compare them with the observed values:  


BMI    Predicted
27     27
29     27.75
23     24.25
20     18.75
21     22.25

As you can see, the predicted values are not far from the observed ones. You can compute the error (i.e. predicted - observed) for each case and then the mean squared error (MSE), which indicates how accurate the model is. The lower the MSE, the better the model fits the data. There are plenty of more elaborate statistical tests for examining the suitability of a model, which I will not cover in this article. You can find a couple more tests in the code. One of the test examples is a classification analysis using MLR.
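For the five cases in the table, the MSE can be worked out in a few lines. This is a small sketch using plain arrays; the class name MseSketch is hypothetical.

```java
// Hypothetical helper computing the mean squared error of the predictions.
final class MseSketch {
    static double meanSquaredError(double[] observed, double[] predicted) {
        double sum = 0;
        for (int i = 0; i < observed.length; i++) {
            double error = predicted[i] - observed[i];   // error as defined in the text
            sum += error * error;
        }
        return sum / observed.length;
    }

    public static void main(String[] args) {
        double[] observed  = {27, 29, 23, 20, 21};
        double[] predicted = {27, 27.75, 24.25, 18.75, 22.25};
        // Errors are 0, -1.25, 1.25, -1.25, 1.25; four squared errors of 1.5625 sum to 6.25.
        System.out.println(meanSquaredError(observed, predicted)); // prints 1.25
    }
}
```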

Points of Interest

With a few lines of code I have tried to illustrate one of the most important statistical modelling algorithms (MLR). I have not tested the code on large matrices, and because of the recursive operations involved you may need to increase the thread's stack size (i.e. the -Xss flag). Please let me know if you have some interesting data on which we can test the code.
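Besides the -Xss flag, a larger stack can also be requested for a single worker thread through the four-argument Thread constructor. The sketch below assumes a toy recursive method standing in for the recursive matrix routines; LargeStackSketch, depth and runWithStack are hypothetical names. Note that the JVM treats the requested stack size as a hint, and some platforms may ignore it.

```java
// Hypothetical sketch: run deep recursion on a thread with a larger requested stack.
final class LargeStackSketch {
    /** Deliberately recursive; stands in for a recursive matrix operation. */
    static int depth(int n) {
        return n == 0 ? 0 : 1 + depth(n - 1);
    }

    /** Runs depth(n) on a new thread that asks for stackBytes of stack space. */
    static int runWithStack(long stackBytes, int n) {
        int[] result = new int[1];
        Thread worker = new Thread(null, () -> result[0] = depth(n), "mlr-worker", stackBytes);
        worker.start();
        try {
            worker.join();                       // wait for the recursive work to finish
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for worker", e);
        }
        return result[0];
    }

    public static void main(String[] args) {
        // 256 MB is an arbitrary example value, not a recommendation from the article.
        System.out.println(runWithStack(256L * 1024 * 1024, 100_000));
    }
}
```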

History

This is the first version (v1.0.0). 


License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


About the Author

Ata Amini
Software Developer (Senior) Investment Bank
United Kingdom
I have a PhD in computational chemistry from Newcastle University. I worked for Imperial College London as a research scientist for more than 6 years before joining HSBC, Standard Bank and, most recently, BlackRock as a software developer.

Article Copyright 2013 by Ata Amini