Step-by-Step Guide to Implement Machine Learning VIII - Linear Regression

Ryukkkk

5.00/5 (2 votes)

May 28, 2019

CPOL

3 min read

10499

Easy to implement machine learning

Introduction

There universally exists a relationship among variables. Indeed, the relationship can be divided into two categories, namely, certainty relation and uncertainty relation. The certainty relation can be expressed with a function. The certainty relation is also called correlation, which can be studied with regression analysis.

Generally, the linear regression model is:

$h_{\theta}\left(x\right)=\theta^{T}x$

The optimal $\theta$ can be determined by minimum the loss function:

$J\left(\theta\right)=\sum^{m}_{i=1}\left(h_{\theta}\left(x\right)^{(i)}-y^{(i)}\right)^{2}$

Regression Model

Linear regression consists of linear regression, local weighted linear regression, ridge regression, Lasso regression and stepwise linear regression.

Linear Regression

The parameter $\theta$ for linear regression can be calculated by gradient descent method or regular expression. Because gradient descent method has been introduced in Step-by-Step Guide to Implement Machine Learning IV - Logistic Regression, we introduce the solution with regular expression in this article.

First, calculate the derivative of loss function:

$\begin{align*} \nabla_{\theta}J\left(\theta\right)& =\nabla_{\theta}\frac{1}{2}\left(X\theta-Y\right)^{T}\left(X\theta-Y\right)\\ &= \frac{1}{2}\nabla_{\theta}\left(\theta^{T}X^{T}X\theta-\theta^{T}X^{T}Y-YX\theta+Y^{T}Y\right)\\ &=\frac{1}{2}\nabla_{\theta}tr\left(\theta^{T}X^{T}X\theta-\theta^{T}X^{T}Y-YX\theta+Y^{T}Y\right)\\ &=\frac{1}{2}\nabla_{\theta}\left(tr\theta^{T}X^{T}X\theta-2trYX\theta\right)\\ &=\frac{1}{2}\nabla_{\theta}\left(X^{T}X\theta+X^{T}X\theta-2X^{T}Y\right)\\ &= X^{T}X\theta-X^{T}Y \end{align*}$

Then, make the derivative equal to 0, we can obtain:

$X^{T}X\theta=X^{T}Y$

Finally, $\theta$ is:

$\theta=\left(X^{T}X\right)^{-1}X^{T}Y$

where X is the training data and Y is the corresponding label. The code of linear regression is shown below:

    def standardLinearRegression(self, x, y):
        if self.norm_type == "Standardization":
            x = preProcess.Standardization(x)
        else:
            x = preProcess.Normalization(x)

        xTx = np.dot(x.T, x)
        if np.linalg.det(xTx) == 0:   # calculate the Determinant of xTx
            print("Error: Singluar Matrix !")
            return
        w = np.dot(np.linalg.inv(xTx), np.dot(x.T, y))
        return w

Local Weighted Linear Regression

It is underfitting in linear regression for it using the unbiased estimation of minimum mean square error(MMSE). To solve the problem, we assign weights on the points around the point to be predicted. Then, we apply normal regression analysis on it. The loss function for local weighted linear regression is:

$J\left(\theta\right)=\sum^{m}_{i=1}w^{(i)}\left(h_{\theta}\left(x\right)^{(i)}-y^{(i)}\right)^{2}$

Like linear regression, we calculate the derivative of loss function and make it equal to 0. The optimal $\theta$ is

$\theta=\left(X^{T}WX\right)^{-1}X^{T}WY$

The weights in local weighted linear regression is like the kernel function in SVM, which is given by:

$w^{(i)}=exp\left(-\frac{\left(x^{(i)}-x\right)^{T}\left(x^{(i)}-x\right)}{2\tau^{2}}\right)$

The code of local weighted linear regression is shown below:

    def LWLinearRegression(self, x, y, sample):
        if self.norm_type == "Standardization":
            x = preProcess.Standardization(x)
        else:
            x = preProcess.Normalization(x)

        sample_num = len(x)
        weights = np.eye(sample_num)
        for i in range(sample_num):
            diff = sample - x[i, :]
            weights[i, i] = np.exp(np.dot(diff, diff.T)/(-2 * self.k ** 2))
        xTx = np.dot(x.T, np.dot(weights, x))
        if np.linalg.det(xTx) == 0:
            print("Error: Singluar Matrix !")
            return
        result = np.dot(np.linalg.inv(xTx), np.dot(x.T, np.dot(weights, y)))
        return np.dot(sample.T, result)

Ridge Regression

If the feature dimension is large, than the number of samples, the input matrix is not full rank, whose inverse matrix doesn't exist. To solve the problem, ridge regression add $\lambda I$ to make the matrix nonsingular. Actually, it is equal to add L2 regularization on the loss function for ridge regression, namely:

$J\left(\theta\right)=\sum^{m}_{i=1}\left(h_{\theta}\left(x\right)^{(i)}-y^{(i)}\right)^{2}+\lambda\left| |\theta| \right|_{2}$

Like linear regression, we calculate the derivative of loss function and make it equal to 0. The optimal $\theta$ is:

$\theta=\left(X^{T}X+\lambda^{2}I\right)^{-1}X^{T}Y$

The code of ridge regression is shown below:

    def ridgeRegression(self, x, y):
        if self.norm_type == "Standardization":
            x = preProcess.Standardization(x)
        else:
            x = preProcess.Normalization(x)

        feature_dim = len(x[0])
        xTx = np.dot(x.T, x)
        matrix = xTx + np.exp(feature_dim)*self.lamda
        if np.linalg.det(xTx) == 0:
            print("Error: Singluar Matrix !")
            return
        w = np.dot(np.linalg.inv(matrix), np.dot(x.T, y))
        return w

Lasso Regression

Like ridge regression, Lasso regression add L1 regularization on the loss function, namely:

$J\left(\theta\right)=\sum^{m}_{i=1}\left(h_{\theta}\left(x\right)^{(i)}-y^{(i)}\right)^{2}+\lambda\left| |\theta| \right|_{1}$

Because the L1 regularization contains absolute value expression, the loss function is not derivable anywhere. Thus, we apply coordinate descent method (CD). The CD gets a minimum at a direction each iteration, namely,

$\theta^{i+1}_{j}=arg\min\limits_{\theta_{i}} J\left(\theta_{1},\theta_{2}^{(i-1)},..., \theta_{n}^{(i-1)}\right)$

We can get a closed solution for CD, which is given by:

$\begin{equation} \left\{ \begin{array}{lr} \theta_{j}=\rho_{j}+\lambda &\ if\ \rho_{i}<-\lambda \\ \theta_{j}=0 &if\ -\lambda\leq\rho_{i}\leq\lambda \\ \theta_{j}=\rho_{j}-\lambda &\ if\ \rho_{i}>\lambda \\ \end{array} \right. \end{equation}$

where:

$\rho_{j}=\sum_{i=1}^{m}x_{j}^{(i)}\left(y^{(i)}-\sum_{k\ne j}^{n}\theta_{k}x_{k}^{(i)}\right)\\=\sum_{i=1}^{m}x_{j}^{(i)}\left(y^{(i)}-\hat y_{pred}^{(i)}+\theta_{j}x^{(i)}_{j}\right)$

The code of Lasso regression is shown below:

    def lassoRegression(self, x, y):
        if self.norm_type == "Standardization":
            x = preProcess.Standardization(x)
        else:
            x = preProcess.Normalization(x)

        y = np.expand_dims(y, axis=1)
        sample_num, feataure_dim = np.shape(x)
        w = np.ones([feataure_dim, 1])
        for i in range(self.iterations):
            for j in range(feataure_dim):
                h = np.dot(x[:, 0:j], w[0:j]) + np.dot(x[:, j+1:], w[j+1:])
                w[j] = np.dot(x[:, j], (y - h))
                if j == 0:
                    w[j] = 0
                else:
                    w[j] = self.softThreshold(w[j])
        return w

Stepwise Linear Regression

Stepwise linear regression is similar to Lasso, which applies greedy algorithm at each iteration to get a minimum rather than CD. Stepwise linear regression adds or cuts down a small part on the weights at each iteration. The code of stepwise linear regression is shown below:

    def forwardstepRegression(self, x, y):
        if self.norm_type == "Standardization":
            x = preProcess.Standardization(x)
        else:
            x = preProcess.Normalization(x)

        sample_num, feature_dim = np.shape(x)
        w = np.zeros([self.iterations, feature_dim])
        best_w = np.zeros([feature_dim, 1])
        for i in range(self.iterations):
            min_error = np.inf
            for j in range(feature_dim):
                for sign in [-1, 1]:
                    temp_w = best_w
                    temp_w[j] += sign * self.learning_rate
                    y_hat = np.dot(x, temp_w)
                    error = ((y - y_hat) ** 2).sum()           # MSE
                    if error < min_error:                   # save the best parameters
                        min_error = error
                        best_w = temp_w
            w = best_w
        return w

Conclusion and Analysis

There are many solutions to get the optimal parameter for linear regression. In this article, we only introduce some basic algorithms. Finally, let's compare our linear regression with the linear regression in Sklearn and the detection performance is displayed below:

Sklearn linear regression performance:

Our linear regression performance:

The performances look similar.

The related code and dataset in this article can be found in MachineLearning.

History

28^th May, 2019: Initial version