Click here to Skip to main content
13,903,802 members
Click here to Skip to main content
Add your own
alternative version


44 bookmarked
Posted 11 Jul 2017
Licenced CPOL

Machine Learning - Gradient Descent

, 11 Jul 2017
Rate this:
Please Sign up or sign in to vote.
Best Practice For Learning Basic of Machine Learning and Gradient Descent Based on Linear Regression. This article will explain complete and describe detail by detail and step by step computational matters.

What is Machine Learning

I decided to prepare and discuss about machine learning algorithms in different series which is valuable and can be unique throughout the internet. I claimm that there is rare resource which is SIMPLE and COMPLETE in machine learning.

I want to explain how algorithms I machine learning are working by going through low level explanation instead of just have a short glance on high level. I believe this way can widen our perspective and prepare us to apply them correctly. Because having high level and superficial theory about nice algorithm lead us nowhere, in return if we know what is happening under the layer of these algorithm can help us to use them more properly.

For have a better understanding about machine learning I prefer to talk briefly about the whole story off machine learning.

Machine learning is a branch of computer science which has been extended from pattern recognition and artificial intelligence. Its meaning comes from where computer can learn how to predict phenomena by the aid of training data which are sample data have been occurred in the past and machine tries to anticipate what is the result of an event according to having specific conditions. This learning process occurs in two different ways. One is supervised and another is unsupervised.

Supervised learning is to teach alphabetic to kid under the monitor of his teacher and unsupervised learning is such as kid who tries to walk alone after trial and errors he will understand how to walk correctly. In a better instance, in supervised we tend to classify objects but in unsupervised we actually do not know what we are looking for, we just study and watch some objects and we try to cluster them according their similarity.


Classification – Supervised: Assume a father show a dog to his son and then if son sees another race of dog can detect that it is dog but from another kind of dog. The goal was clear; son should see some dogs and his father told him that “it is dog” then he should detect dog –even from other color or race or size- in future.


Clustering – Unsupervised: But now story is different, a father brings his son to the zoo and show him different animals without telling him that what is their name. So his son just can categorize in his brain the various animals according to their color, size and etc. If I future he sees animal which is similar to one of those animal, he can say that “this animal –which does not know its name- is similar to those animals that I have seen in the zoo on that cage.” So the goal is here that son should cluster new animal to another animal that he has seen by now.

Supervised has f(X) = y and there is value or each features while in unsupervised there is just (X) without knowing its f(X).

Result: as a result, finally we encounter two different example of problem to solve:

Supervised – Classification: According past data which there are specific output value- target or dependent variable which is called “Y” for one or more than one features or independent variable which is called “X” or “X1, X2, …., Xn”. For example, there is data set of patient information and we want to find whether patient will get cancer or not; or what is price of some object in near futures according to their changing price in the past.

Linear Regression

In some problem we have no limitation for “Y” value such as price prediction, but in others such as when we want to know what is the probability of cancer in one patient, we have to give result between zero and one.

Y = aX + b

Y is target value, dependent, output

a is regression coefficient, slope

X is independent value, feature, input

b is constant value

I decided to make everything as KISS = Keep It Simple Stupid

Therefore I designed small data set by myself and we can study on it easily and then you can go to my github and study according to real tutorial data set from machine learning course at university of Stanford by Andrew NG. Click HERE for more info about this course.

This data set is based on relation between study hours of students and their achieved grade. We draw a graph according to above data and assign study hours to horizontal axis and grade to vertical axis. You see I student study more they can gain better grade. So after looking at this graph, we want to predict if one student study 8 hours how much would be his grade?

Gradient Descent

We need a line as blue line to determine progress of changing grade based on study hours. Y and X are known by data set value, we just need a, b value to draw a blue line or “Prediction line”. No matter how much we are precise, it is sometimes impossible or difficult to draw a prediction line without error. Error rate in machine learning is inevitable; but we should find best a, b and minimize error rate to have an accurate output.

The vertical distance between actual value which is black star and predicted value on blue line which is yellow star is called “Prediction Error” or “Cost”. In order to calculate prediction error; we have to minus Prediction Error = Yactual – Ypredicted Yprediction = aX + b

Because we have more than one record in data set, we need to generalize above formula to:

Sum of Squared Errors (SSE) = ½ Sum (YactualYpredicted)2

But we need to minimize SSE as far as possible, we learnt in high school mathematic that we should derivate formula to make it optimum. Because we have two variable so we need two differentiations. Because we are working on a, b, so we should make partial differentiations. In partial differentiation based on “a”, others variable such as “X”, “Y” and “b” in SSE should be kept as constant.

Before study deeply on above data set we need to standardize – normalize or scaling value in order to have equal range and/or variance. Because study hours are between (2,7) but grade is between (50,70), so there is variance in their scale makes some difficulty to compare them.

First step is that compute “cost function” based on Ө = [a, b], these “a” and “b” are values which we have selected randomly.

  1. Select Ө = [a, b], a is slope and b is intercept randomly.
  2. Compute “Cost Function”.
  3. Compute “Gradient Descent” for Ө = [a, b].
    1. anew = aold - r*∑ ∂SSE/∂a r is learning rate
      • ∂SSE/∂a = -(Y-Yp)X
    2. bnew = bold - r*∑ ∂SSE/∂b
      • ∂SSE/∂b = -(Y-Yp)
  4. Again Compute “Cost Function” Cost Function
  5. Compare if new Cost Function value is less than before; if “Yes” so you are in right direction, lets continue.
  6. Repeat and iterate step 3 to 5 until find best value.

** r is learning rate is the pace of learning, cost function should be decreased in every iteration and get convergence. If cost function in each repeat with different a, b is decreased, so we selected suitable r.

Using the code MATLAB

I used MATLAB R2012a (

1. Start to call Function"executegd.m";

Firstly there is "executegd.m", I called all of function from here. In first part I loaded data frrom data.txt which is very simple text. Then I standardize data with mentioned formula above. Then I draw points according new coordinates. Then I assume a=0, b=0 and computed "Cost Function", then I started to calculate best a, b by the aid of "Gradient Descent" with learninng rater r=0.01 and 1500 iteration.

clear ; close all; clc

%% ======================= Part 1: Load Data ==============================
% Load Data
fprintf('Plotting Data ...\n')
data = load('data.txt');

%% ======================= Part 2: Standardize Data =======================
% Standardize Data
data = standardizeData(data);
X = data(:, 1);
y = data(:, 2);
m = length(y); % number of training examples

%% ======================= Part 3: Plotting ===============================
% Plot Data
plotData(X, y);

fprintf('Program paused. Press enter to continue.\n');

%% =================== Part 4: Prediction Error - Copmute Cost ============
fprintf('Running Gradient Descent ...\n')

X = [ones(m, 1), data(:,1)]; % Add a column of ones to x
theta = zeros(2, 1); % initialize fitting parameters

% Some gradient descent settings
iterations = 1500;
alpha = 0.01;

% compute and display initial cost
predictionError(X, y, theta)

%% =================== Part 5: Gradient descent ===========================
% run gradient descent
theta = gradientDescent(X, y, theta, alpha, iterations);

% print theta to screen
fprintf('Theta found by gradient descent: ');
fprintf('%f %f \n', theta(1), theta(2));

% Plot the linear fit
hold on; % keep previous plot visible
plot(X(:,2), X*theta, '-')
legend('Training data', 'Linear regression')
hold off % don't overlay any more plots on this figure

2. Standardize "standardizeData.m"

function [stdata] = standardizeData(data)
%Standardize Data

X = data(:, 1);
Y = data(:, 2);
m = length(Y); % number of training examples
stdata = zeros(m, 2); % initialize fitting parameters

% StdDev = SQRT( sum[(X-mean)^2/(n-1)]  )
meanX = mean(X);
stdX = std(X);

for i = 1:m
    X(i) = ((X(i) - meanX)/stdX);

%Standardize(X) = X-mean /Std(X)

meanY = mean(Y);
stdY = std(Y);

for i = 1:m
    Y(i) = ((Y(i) - meanY)/stdY);  

 stdata(:, 1)= X(:);
 stdata(:, 2)=Y(:);

3. Plot "plotData.m"

function plotData(x, y)

figure; % open a new figure window


Then you should click on "Enter".

4. Compute "Cost Function" "predictionError.m"

function J = predictionError(X, y, theta)
%COMPUTECOST Compute cost for linear regression
%   J = COMPUTECOST(X, y, theta) computes the cost of using theta as the
%   parameter for linear regression to fit the data points in X and y

% Initialize some useful values

m = length(y); % number of training examples
%theta = zeros(2, 1); % initialize fitting parameters

% You need to return the following variables correctly

% ====================== YOUR CODE HERE ======================
% Instructions: Compute the cost of a particular choice of theta
%               You should set J to the cost.

J=1/(2*m)*sum((X*theta - y).^2);

% =========================================================================


5. Compute theta(a, b) which build best predictio line by Gradient Descent, "gradientDescent.m"

function [theta] = gradientDescent(X, y, theta, alpha, num_iters)
%GRADIENTDESCENT Performs gradient descent to learn theta
%   theta = GRADIENTDESENT(X, y, theta, alpha, num_iters) updates theta by
%   taking num_iters gradient steps with learning rate alpha

% Initialize some useful values
%m = length(y); % number of training examples
J_history = zeros(num_iters, 1);

m = length(y); % number of training examples

for iter = 1:num_iters

    if iter == 1
        J_history(iter) = predictionError(X, y, theta);
    elseif iter > 1

        DefSSEb = sum(B);
        DefSSEa = sum(C);


        theta(1,1) = (bold - (alpha*(DefSSEb/m)));
        theta(2,1) = (aold - (alpha*(DefSSEa/m)));

        J_history(iter) = predictionError(X, y, theta);

theta = theta(:);

Points of Interest

I found Machine Learning very exciting, I decided to work on it.

Gradient Descent is first and foremost step to learn machine learning. As summarize Machine learning is “getting data and work on data then give back result which is called its prediction”

Therefore, error will occur 100%, the goal is using machine for prediction –because of huge and big data, machine can do it faster- but by observing error and try to select better prediction by gradient descent.

Gradient Descent story will not finish here in linear regression (with just one X), you will encounter it in multi variable and neural network.

Finally I strongly recommend you to register machine learning in

& use my github as guidence for assignment:


Feel free to leave any feedback on this article; it is a pleasure to see your opinions and vote about this code. If you have any questions, please do not hesitate to ask me here.


This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


About the Author

Mahsa Hassankashi
Doctorandin Technische Universität Berlin
Iran (Islamic Republic of) Iran (Islamic Republic of)
I have been working with different technologies and data more than 10 years.
I`d like to challenge with complex problem, then make it easy for using everyone. This is the best joy.

ICT Master in Norway 2013
Doctorandin at Technische Universität Berlin in Data Scientist ( currently )
Diamond is nothing except the pieces of the coal which have continued their activities finally they have become Diamond.

*Article of The Community Spotlight in Microsoft ASP.NET, Wednesday, February 11, 2015,
*Article of The Day in Microsoft ASP.NET Tuesday, February 3, 2015,
*1 Jan 2015: CodeProject MVP 2015
*22 Mar 2014: Best Web Dev Article of February 2014 - Second Prize

You may also be interested in...

Comments and Discussions

PraiseMessage Closed Pin
19-Mar-19 0:45
memberMember 1418774319-Mar-19 0:45 
GeneralArticle Pin
Member 1405424414-Nov-18 0:22
memberMember 1405424414-Nov-18 0:22 
QuestionCode execution result Pin
helloQ11-Jul-18 20:29
memberhelloQ11-Jul-18 20:29 
GeneralAppriciation Pin
kiranmaibabu3-Jul-18 1:30
professionalkiranmaibabu3-Jul-18 1:30 
SuggestionSuggest article on Newtons method in lieu of Gradient Descent Pin
ccamatthews4-Jan-18 3:41
memberccamatthews4-Jan-18 3:41 
PraiseGreat Article Pin
Sarika@1214-Sep-17 22:29
memberSarika@1214-Sep-17 22:29 
GeneralRe: Great Article Pin
Mahsa Hassankashi16-Sep-17 0:16
memberMahsa Hassankashi16-Sep-17 0:16 
QuestionGreat article! Pin
eslipak26-Aug-17 13:17
professionaleslipak26-Aug-17 13:17 
AnswerRe: Great article! Pin
Mahsa Hassankashi11-Sep-17 5:18
memberMahsa Hassankashi11-Sep-17 5:18 
GeneralRe: Great article! Pin
eslipak11-Sep-17 11:11
professionaleslipak11-Sep-17 11:11 
GeneralRe: Great article! Pin
Mahsa Hassankashi12-Sep-17 2:02
memberMahsa Hassankashi12-Sep-17 2:02 
GeneralRe: Great article! Pin
eslipak12-Sep-17 12:32
professionaleslipak12-Sep-17 12:32 
GeneralMy vote of 5 Pin
eslipak26-Aug-17 10:22
professionaleslipak26-Aug-17 10:22 
GeneralRe: My vote of 5 Pin
Mahsa Hassankashi11-Sep-17 5:13
memberMahsa Hassankashi11-Sep-17 5:13 
PraiseVery Nice Explanation Pin
Member 1149024522-Aug-17 4:31
memberMember 1149024522-Aug-17 4:31 
GeneralRe: Very Nice Explanation Pin
Mahsa Hassankashi11-Sep-17 5:18
memberMahsa Hassankashi11-Sep-17 5:18 
GeneralMy vote of 5 Pin
Member 1149024522-Aug-17 3:40
memberMember 1149024522-Aug-17 3:40 
GeneralRe: My vote of 5 Pin
Mahsa Hassankashi11-Sep-17 5:19
memberMahsa Hassankashi11-Sep-17 5:19 
QuestionSuggestion: mention in your article when gradient search is needed Pin
wim4you20-Aug-17 23:50
memberwim4you20-Aug-17 23:50 
AnswerRe: Suggestion: mention in your article when gradient search is needed Pin
Mahsa Hassankashi11-Sep-17 5:21
memberMahsa Hassankashi11-Sep-17 5:21 
Praisevery nice Pin
BillW337-Aug-17 3:00
professionalBillW337-Aug-17 3:00 
GeneralRe: very nice Pin
Mahsa Hassankashi7-Aug-17 12:18
memberMahsa Hassankashi7-Aug-17 12:18 
GeneralGot my 5 Pin
Igor Ladnik7-Aug-17 0:44
mvaIgor Ladnik7-Aug-17 0:44 
GeneralRe: Got my 5 Pin
Mahsa Hassankashi7-Aug-17 12:17
memberMahsa Hassankashi7-Aug-17 12:17 
QuestionNice one! Pin
Chris Maunder4-Aug-17 7:27
adminChris Maunder4-Aug-17 7:27 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.

Permalink | Advertise | Privacy | Cookies | Terms of Use | Mobile
Web06 | 2.8.190306.1 | Last Updated 11 Jul 2017
Article Copyright 2017 by Mahsa Hassankashi
Everything else Copyright © CodeProject, 1999-2019
Layout: fixed | fluid