## Introduction

Neural Networks is one of the most trending solutions in machine learning methods. This method is very good for problems for which no exact solution exists. Recently, by growing the popularity of these methods, so many libraries have been developed in Matlab, Python, C++, and etc, which get training set as input and automatically build up an appropriate Neural Network for the assumed problem. Also, by developing high-speed CPUs and GPUs and even more NPUs which are optimized exactly for calculation of Neural Networks, Deep Neural Networks are growing every day. But by using these libraries, sometimes we don't exactly understand what happened exactly and how we reach the optimized network. Knowing the fundamental of a solution is very important in developing the past methods. So in this article, a very simple structure of Neural Network algorithm for approximating \(f(x))( = sin(x)\) is illustrated and also is implemented in C++ step by step.

## Background

One of the most successful and useful Neural Networks is Feed Forward Supervised Neural Networks or Multi-Layer Perceptron Neural Networks (MLP). This kind of Neural Network includes three parts as follows:

- Input Layer
- Hidden Layers
- Output Layer

Each layer has several nodes called Neurons which connect to other Neurons in the network. Every Neural Network has five important properties as follows:

- Number of input nodes
- Number of nodes per hidden layer
- Number of output nodes
- Number of hidden layers
- The learning rate

In this example, we use an MLP Neural Network with one hidden layer and 5 neurons in its hidden layer. We use \(sigmoid\\) function as each nodes activation function, which defines as follows:

$ sigmoid (x) = \frac{1}{1 + e^{-x}} $

The following figure shows used Neural Network structure, where \(x\) is input vector, \(h)( = )(sigmoid(c + Wx)\) is the output of the hidden layer and \(\hat{y})(~=~)(f_\theta(x))(~=~ b + Vh\) is the main output of the neural network which in this example will be an approximation of \(y)( = )(sin(x)\).

The loss function for this classical example could be the squared error as follows:

$ L(\hat{y}, y) = \|\hat{y} - y\|^2 $

So the training function will be:

$ C(\theta)~=~\frac{1}{n} \sum_t \|y^{(t)} - \hat{y}^{(t)}\|^2 ~=~ \frac{1}{n} \sum_t \|y^{(t)}~-~(b+V sigmoid(c+Wx^{(t)}))\|^2 $

Where \((x^{(t)}, y^{(t)})\) is \(t-th\) input of training. The main goal of Neural Network is minimizing this cost function by updating the \(W, V, b, c\) parameters. Finally, the classical training procedure in this example is gradient descent, which iteratively updates the parameters \(W, V, b, c\) according to:

$\begin{aligned} W~ & \leftarrow~W - \epsilon~\nabla_W L(f_\theta(x^{(t)}),~ y^{(t)}) \\ \\ c~ & \leftarrow~c - \epsilon~\nabla_c L(f_\theta(x^{(t)}),~ y^{(t)}) \\ \\ V~ & \leftarrow~V - \epsilon~\nabla_V L(f_\theta(x^{(t)}),~ y^{(t)}) \\ \\ b~ & \leftarrow~b - \epsilon~\nabla_b L(f_\theta(x^{(t)}),~ y^{(t)}) \end{aligned}$

## Using the Code

We need some data to train proposed Neural Network. The training set is very important, a general training set leads to better approximation. For example, here we divide the range \([0,~2\pi]\) into `Train_Set_Size`

parts and give it to Neural Network as a training set.

#define Train_Set_Size 20
#define PI 3.141592653589793238463
...
vector<pair<double, double>> trainSet;
trainSet.resize(Train_Set_Size);
for (int i = 0; i < Train_Set_Size; i++) {
trainSet[i] = make_pair(i * 2 * PI / Train_Set_Size, sin(i * 2 * PI / Train_Set_Size));
}
...

As \(sigmoid\) function is not a C++ standard function, we have to define a function to calculate\(sigmoid\).

$ sigmoid (x) = \frac{1}{1 + e^{-x}} $

...
double sigmoid(double x) {
return (1.0f / (1.0f + std::exp(-x)));
}
...

Also, we need a function to calculate \(\hat{y})(~=~)(f_\theta(x)\) in each iteration:

$ f_\theta(x)~=~b+V sigmoid(c+Wx^{(t)}) $

...
double f_theta(double x) {
double result = b;
for (int i = 0; i < N; i++) {
result += V[i] * sigmoid(c[i] + W[i] * x);
}
return result;
}
...

Where \(x\) is \(1 \times 1\) input train value, \(W\) is \(1 \times 1\) wieght vector, and \(c\) is scaler value for each Neuron's offset, but we store all nodes offset value in a \(1 \times 5\) array for simplicity and finally \(b\) is output nodes offset value. We define these variables at the beginning of our code as follows:

#define N 5
...
double c[N] = {};
double W[N] = {};
double V[N] = {};
double b = 0;
...

As mentioned before, in each iteration, we have to update \(W, V, b, c\) parameters. For parameter \(W\) by calculating gradient respect to \(W\), we have:

$\begin{aligned} W~ &\leftarrow~W - \epsilon~\nabla_W L(f_\theta(x^{(t)}),~ y^{(t)}) \\ \\ W~ &\leftarrow~W - \epsilon \times 2 \times (f_\theta(x^{(t)})- y^{(t)}) \times (V.x^{(t)}.sigmoid(c+W~x^{(t)})) \times (1-V.x^{(t)}.sigmoid(c+W~x^{(t)})) \end{aligned}$

...
for (int i = 0; i < N; i++) {
W[i] = W[i] - epsilon * 2 * (f_theta(x) - y) * V[i] * x *
(1 - sigmoid(c[i] + W[i] * x)) * sigmoid(c[i] + W[i] * x);
}
...

We can calculate the rest of the parameters as follows:

\(V\):

$\begin{aligned} V~ &\leftarrow~V - \epsilon~\nabla_V L(f_\theta(x^{(t)}),~ y^{(t)}) \\ \\ V~ &\leftarrow~V - \epsilon \times 2 \times (f_\theta(x^{(t)})- y^{(t)}) \times sigmoid(c+W~x^{(t)}) \end{aligned}$

...
for (int i = 0; i < N; i++) {
V[i] = V[i] - epsilon * (f_theta(x) - y) * sigmoid(c[i] + W[i] * x);
}
...

\(b\):

$\begin{aligned} b~ &\leftarrow~b - \epsilon~\nabla_b L(f_\theta(x^{(t)}),~ y^{(t)}) \\ \\ b~ &\leftarrow~b - \epsilon \times 2 \times (f_\theta(x^{(t)})- y^{(t)}) \end{aligned}$

...
b = b - epsilon * (f_theta(x) - y);
...

\(c\):

$\begin{aligned} c~ &\leftarrow~c - \epsilon~\nabla_c L(f_\theta(x^{(t)}),~ y^{(t)}) \\ \\ c~ &\leftarrow~c - \epsilon \times 2 \times (f_\theta(x^{(t)})- y^{(t)}) \times (V.sigmoid(c+W~x^{(t)})) \times (1-sigmoid(c+W~x^{(t)})) \end{aligned}$

...
for (int i = 0; i < N; i++) {
c[i] = c[i] - epsilon * (f_theta(x) - y) * V[i] *
(1 - sigmoid(c[i] + W[i] * x)) * sigmoid(c[i] + W[i] * x);
}
...

Now for each train data, we must update the Neural Network parameters, but most of the time, we must give all of the training pairs to the neural network several times to converge a good result. Each time we give all training data to the neural network called "`Epoch`

":

...
for (int j = 0; j > epoch; j++) {
for (int i = 0; i < Train_Set_Size; i++) {
train(trainSet[i].first, trainSet[i].second);
}
std::cout << j << "\r";
}
...

Finally, we use GNUplot for drawing the results.

## All Together

#include <iostream>
#include <vector>
#include <math.h>
#include <time.h>
using namespace std;
#define Train_Set_Size 20
#define PI 3.141592653589793238463
#define N 5
#define epsilon 0.05
#define epoch 50000
double c[N] = {};
double W[N] = {};
double V[N] = {};
double b = 0;
double sigmoid(double x) {
return (1.0f / (1.0f + std::exp(-x)));
}
double f_theta(double x) {
double result = b;
for (int i = 0; i < N; i++) {
result += V[i] * sigmoid(c[i] + W[i] * x);
}
return result;
}
void train(double x, double y) {
for (int i = 0; i < N; i++) {
W[i] = W[i] - epsilon * 2 * (f_theta(x) - y) * V[i] * x *
(1 - sigmoid(c[i] + W[i] * x)) * sigmoid(c[i] + W[i] * x);
}
for (int i = 0; i < N; i++) {
V[i] = V[i] - epsilon * 2 * (f_theta(x) - y) * sigmoid(c[i] + W[i] * x);
}
b = b - epsilon * 2 * (f_theta(x) - y);
for (int i = 0; i < N; i++) {
c[i] = c[i] - epsilon * 2 * (f_theta(x) - y) * V[i] *
(1 - sigmoid(c[i] + W[i] * x)) * sigmoid(c[i] + W[i] * x);
}
}
int main() {
srand(time(NULL));
for (int i = 0; i < N; i++) {
W[i] = 2 * rand() / RAND_MAX -1;
V[i] = 2 * rand() / RAND_MAX -1;
c[i] = 2 * rand() / RAND_MAX -1;
}
vector<pair<double, double>> trainSet;
trainSet.resize(Train_Set_Size);
for (int i = 0; i < Train_Set_Size; i++) {
trainSet[i] = make_pair(i * 2 * PI / Train_Set_Size, sin(i * 2 * PI / Train_Set_Size));
}
for (int j = 0; j < epoch; j++) {
for (int i = 0; i < Train_Set_Size; i++) {
train(trainSet[i].first, trainSet[i].second);
}
std::cout << j << "\r";
}
vector<float> x;
vector<float> y1, y2;
for (int i = 0; i < 1000; i++) {
x.push_back(i * 2 * PI / 1000);
y1.push_back(sin(i * 2 * PI / 1000));
y2.push_back(f_theta(i * 2 * PI / 1000));
}
FILE * gp = _popen("gnuplot", "w");
fprintf(gp, "set terminal wxt size 600,400 \n");
fprintf(gp, "set grid \n");
fprintf(gp, "set title '%s' \n", "f(x) = sin (x)");
fprintf(gp, "set style line 1 lt 3 pt 7 ps 0.1 lc rgb 'green' lw 1 \n");
fprintf(gp, "set style line 2 lt 3 pt 7 ps 0.1 lc rgb 'red' lw 1 \n");
fprintf(gp, "plot '-' w p ls 1, '-' w p ls 2 \n");
for (int k = 0; k < x.size(); k++) {
fprintf(gp, "%f %f \n", x[k], y1[k]);
}
fprintf(gp, "e\n");
for (int k = 0; k < x.size(); k++) {
fprintf(gp, "%f %f \n", x[k], y2[k]);
}
fprintf(gp, "e\n");
fflush(gp);
system("pause");
_pclose(gp);
return 0;
}

## Points of Interest

- Effect of Epoch. By increasing Epoch, we can minimize the error better.
- The effect of initial values of Neural Networks parameters. As it is experienced, by choosing zero as the initial value of parameters, the simple gradient descent method converges to a local minimum, so random values are chosen as initial values in the codes mentioned.
- The epsilon value is very important in convergence rate. Small epsilons would cause premature convergence and the large ones would cause the parameters diverge.
- The simple gradient descent method is not a powerful method for finding the global minimum. As you see from the above figure, the Neural Network doesn't meet a good approximation.

## References

- [1] Neural networks by Simon Haykin
- [2] Deep Learning by Yoshua Bengio-Ian, J. Goodfellow and Aaron Courville (March 30, 2015)