Introduction
The class CBackProp encapsulates a feed-forward neural network and a back-propagation algorithm to train it. This article is intended for those who already have some idea about neural networks and back-propagation algorithms. If you are not familiar with these, I suggest going through some material first.
Background
This is part of an academic project which I worked on during my final semester back in college, for which I needed to find the optimal number and size of hidden layers and learning parameters for different data sets. It wasn't easy finalizing the data structure for the neural net and getting the back-propagation algorithm to work. The motivation for this article is to save someone else the same effort.
Here's a little disclaimer... This article describes a simplistic implementation of the algorithm, and does not sufficiently elaborate on the algorithm. There is a lot of room to improve the included code (like adding exception handling :-), and for many steps, there is a lot more reasoning required than I have included, e.g., values that I have chosen for parameters, and the number of layers/the neurons in each layer are for demonstrating the usage and may not be optimal. To know more about these, I suggest going to:
Using the code
Typically, the usage involves the following steps:
- Create the net using
CBackProp::CBackProp(int nl,int *sz,double b,double a)
- Apply the back-propagation algorithm - train the net by passing the input and the desired output, to
void CBackProp::bpgt(double *in,double *tgt) in a loop, until the Mean square error, which is obtained by CBackProp::double mse(double *tgt), gets reduced to an acceptable value.
- Use the trained net to make predictions by feed-forwarding the input data using
void CBackProp::ffwd(double *in).
The following is a description of the sample program that I have included.
One step at a time...
Setting the objective:
We will try to teach our net to crack the binary A XOR B XOR C. XOR is an obvious choice, it is not linearly separable hence requires hidden layers and cannot be learned by a single perception.
A training data set consists of multiple records, where each record contains fields which are input to the net, followed by fields consisting of the desired output. In this example, it's three inputs + one desired output.
double data[][4]={
0, 0, 0, 0,
0, 0, 1, 1,
0, 1, 0, 1,
0, 1, 1, 0,
1, 0, 0, 1,
1, 0, 1, 0,
1, 1, 0, 0,
1, 1, 1, 1 };
Configuration:
Next, we need to specify a suitable structure for our neural network, i.e., the number of hidden layers it should have and the number of neurons in each layer. Then, we specify suitable values for other parameters: learning rate - beta, we may also want to specify momentum - alpha (this one is optional), and Threshold - thresh (target mean square error, training stops once it is achieved else continues for num_iter number of times).
Let's define a net with 4 layers having 3,3,3, and 1 neuron respectively. Since the first layer is the input layer, i.e., simply a placeholder for the input parameters, it has to be the same size as the number of input parameters, and the last layer being the output layer must be same size as the number of outputs - in our example, these are 3 and 1. Those other layers in between are called hidden layers.
int numLayers = 4, lSz[4] = {3,3,3,1};
double beta = 0.2, alpha = 0.1, thresh = 0.00001;
long num_iter = 500000;
Creating the net:
CBackProp *bp = new CBackProp(numLayers, lSz, beta, alpha);
Training:
for (long i=0; i < num_iter ; i++)
{
bp->bpgt(data[i%8], &data[i%8][3]);
if( bp->mse(&data[i%8][3]) < thresh)
break;
}
Let's test its wisdom:
We prepare test data, which here is the same as training data minus the desired output.
double testData[][3]={
0, 0, 0,
0, 0, 1,
0, 1, 0,
0, 1, 1,
1, 0, 0,
1, 0, 1,
1, 1, 0,
1, 1, 1};
Now, using the trained network to make predictions on our test data....
for ( i = 0 ; i < 8 ; i++ )
{
bp->ffwd(testData[i]);
cout << testData[i][0]<< " "
<< testData[i][1]<< " "
<< testData[i][2]<< " "
<< bp->Out(0) << endl;
}
Now a peek inside:
Storage for the neural net
I think the following code has ample comments and is self-explanatory...
class CBackProp{
double **out;
double **delta;
double ***weight;
int numl;
int *lsize;
double beta;
double alpha;
double ***prevDwt;
double sigmoid(double in);
public:
~CBackProp();
CBackProp(int nl,int *sz,double b,double a);
void bpgt(double *in,double *tgt);
void ffwd(double *in);
double mse(double *tgt);
double Out(int i) const;
};
Some alternative implementations define a separate class for layer / neuron / connection, and then put those together to form a neural network. Although it is definitely a cleaner approach, I decided to use double *** and double ** to store weights and output etc. by allocating the exact amount of memory required, due to:
- The ease it provides while implementing the learning algorithm, for instance, for weight at the connection between (i-1)th layer's jth Neuron and ith layer's kth neuron, I personally prefer
w[i][k][j] (than something like net.layer[i].neuron[k].getWeight(j)). The output of the ith neuron of the jth layer is out[i][j], and so on.
- Another advantage I felt is the flexibility of choosing any number and size of the layers.
CBackProp::CBackProp(int nl,int *sz,double b,double a):beta(b),alpha(a)
{
numl=nl;
lsize=new int[numl];
for(int i=0;i<numl;i++){
lsize[i]=sz[i];
}
out = new double*[numl];
for( i=0;i<numl;i++){
out[i]=new double[lsize[i]];
}
delta = new double*[numl];
for(i=1;i<numl;i++){
delta[i]=new double[lsize[i]];
}
weight = new double**[numl];
for(i=1;i<numl;i++){
weight[i]=new double*[lsize[i]];
}
for(i=1;i<numl;i++){
for(int j=0;j<lsize[i];j++){
weight[i][j]=new double[lsize[i-1]+1];
}
}
prevDwt = new double**[numl];
for(i=1;i<numl;i++){
prevDwt[i]=new double*[lsize[i]];
}
for(i=1;i<numl;i++){
for(int j=0;j<lsize[i];j++){
prevDwt[i][j]=new double[lsize[i-1]+1];
}
}
srand((unsigned)(time(NULL)));
for(i=1;i<numl;i++)
for(int j=0;j<lsize[i];j++)
for(int k=0;k<lsize[i-1]+1;k++)
weight[i][j][k]=(double)(rand())/(RAND_MAX/2) - 1;
for(i=1;i<numl;i++)
for(int j=0;j<lsize[i];j++)
for(int k=0;k<lsize[i-1]+1;k++)
prevDwt[i][j][k]=(double)0.0;
}
Feed-Forward
This function updates the output value for each neuron. Starting with the first hidden layer, it takes the input to each neuron and finds the output (o) by first calculating the weighted sum of inputs and then applying the Sigmoid function to it, and passes it forward to the next layer until the output layer is updated:

where:

void CBackProp::ffwd(double *in)
{
double sum;
for(int i=0;i < lsize[0];i++)
out[0][i]=in[i];
for(i=1;i < numl;i++){
for(int j=0;j < lsize[i];j++){
sum=0.0;
for(int k=0;k < lsize[i-1];k++){
sum+= out[i-1][k]*weight[i][j][k];
}
sum+=weight[i][j][lsize[i-1]];
out[i][j]=sigmoid(sum);
}
}
}
Back-propagating...
The algorithm is implemented in the function void CBackProp::bpgt(double *in,double *tgt). Following are the various steps involved in back-propagating the error in the output layer up till the first hidden layer.
void CBackProp::bpgt(double *in,double *tgt)
{
double sum;
First, we call void CBackProp::ffwd(double *in) to update the output values for each neuron. This function takes the input to the net and finds the output of each neuron:

where:

ffwd(in);
The next step is to find the delta for the output layer:

for(int i=0;i < lsize[numl-1];i++){
delta[numl-1][i]=out[numl-1][i]*
(1-out[numl-1][i])*(tgt[i]-out[numl-1][i]);
}
then find the delta for the hidden layers...

for(i=numl-2;i>0;i--){
for(int j=0;j < lsize[i];j++){
sum=0.0;
for(int k=0;k < lsize[i+1];k++){
sum+=delta[i+1][k]*weight[i+1][k][j];
}
delta[i][j]=out[i][j]*(1-out[i][j])*sum;
}
}
Apply momentum (does nothing if alpha=0):
for(i=1;i < numl;i++){
for(int j=0;j < lsize[i];j++){
for(int k=0;k < lsize[i-1];k++){
weight[i][j][k]+=alpha*prevDwt[i][j][k];
}
weight[i][j][lsize[i-1]]+=alpha*prevDwt[i][j][lsize[i-1]];
}
}
Finally, adjust the weights by finding the correction to the weight.

And then apply the correction:

for(i=1;i < numl;i++){
for(int j=0;j < lsize[i];j++){
for(int k=0;k < lsize[i-1];k++){
prevDwt[i][j][k]=beta*delta[i][j]*out[i-1][k];
weight[i][j][k]+=prevDwt[i][j][k];
}
prevDwt[i][j][lsize[i-1]]=beta*delta[i][j];
weight[i][j][lsize[i-1]]+=prevDwt[i][j][lsize[i-1]];
}
}
How learned is the net?
Mean square error is used as a measure of how well the neural net has learnt.

As shown in the sample XOR program, we apply the above steps until a satisfactorily low error level is achieved. CBackProp::double mse(double *tgt) returns just that.
History
- Created date: 25th Mar'06.
| You must Sign In to use this message board. |
|
|
 |
|
 |
Hi dude, your article is interesting and didactic, I'm focused on understanding your backpropagation algorithm, so I need you can tell me which books or sources you used to implement it into C++, please. I'm having serious problems to adopt/understand the equation tp update weigths, 'cause there are some books and papers using different representations for it, including you. For example, someone is using this formula: weight(new) = weight(old) + learning rate * output error * output(neurons i) * output(neurons i+1) * ( 1 - output(neurons i+1) ) And my doubt is, wtf he gets: output(neurons i) * output(neurons i+1) * ( 1 - output(neurons i+1) ) It doesnot appear in my books  Reference: Backpropagation Thanks in advance.
I'm in love with Delphi 
Billiardo
|
| Sign In·View Thread·PermaLink | |
|
|
|
 |
|
 |
dear Sir, I am a new user!I can find the demo file,but where is the source?Please help me.Thank you !
|
| Sign In·View Thread·PermaLink | 2.00/5 |
|
|
|
 |
|
 |
I was wondering where the bias weights are set to 1 or -1? I think they are initialized to random values in constructor.Am i right? Many thanks
|
| Sign In·View Thread·PermaLink | 2.00/5 |
|
|
|
 |
|
 |
Thank you for uploading your code for implementation of backpropagation neural networks.
I have a question about statement: ">> bp->Out(0) >> endl;"
Function "Out" has no reference anywhere in the code. Could you please clarify on this?
Tang Su
|
| Sign In·View Thread·PermaLink | 1.00/5 |
|
|
|
 |
|
|
 |
|
 |
The code computes MSE for just one training value. Check these lines in main():
for (i=0; i<num_iter ; i++) { bp->bpgt(data[i%8], &data[i%8][3]); if( bp->mse(&data[i%8][3]) < Thresh) {
The learning iterations are arranged such that only one training value is checked at each iteration steps. As a result, for a net with one output, like in the article example, MSE simply becomes the squared error between the net output and the training value, i.e. (Net_Out - Target)^2. This is wrong because the iterations stop when MSE is less than Threshold for just one training value. Instead, iterations should be arranged in epochs where one epoch contains the complete scan over all training values and MSE is calculated by summing squared errors between the net outputs and the corresponding training values for each epoch as in the example below (ntr - number of training samples):
double MSE; int ep; for (ep=0;ep<nep;ep++) { MSE=0.0; for(int i=0;i<ntr;i++) { bp->bpgt(data[i], &data[i][lSz[0]]); MSE+=bp->mse(&data[i][lSz[0]]); } MSE/=ntr; if(MSE<minErr) { cout << "Network trained in " << ep << " epochs. MSE: " << MSE << endl; break; } Otherwise, the code is fairly simple and easy to understand. Thanks.
modified on Friday, June 12, 2009 12:07 AM
|
| Sign In·View Thread·PermaLink | |
|
|
|
 |
|
 |
for(i=1;i < numl;i++){ for(int j=0;j < lsize[i];j++){ for(int k=0;k < lsize[i-1];k++){ prevDwt[i][j][k]=beta*delta[i][j]*out[i-1][k]; weight[i][j][k]+=prevDwt[i][j][k]; } prevDwt[i][j][lsize[i-1]]=beta*delta[i][j]; <----------------------here weight[i][j][lsize[i-1]]+=prevDwt[i][j][lsize[i-1]]; } }
i think should be..
for(i=1;i < numl;i++){ for(int j=0;j < lsize[i];j++){ for(int k=0;k <= lsize[i-1];k++) { <---------------------------here prevDwt[i][j][k]=beta*delta[i][j]*out[i-1][k]; weight[i][j][k]+=prevDwt[i][j][k]; } } }
I made similar modifications to the momentum code as well.
Another issue that i noticed is that iteration stops whenever the error for _one_ of the training data falls below the threshold.
Other than it works good. Thanks
|
| Sign In·View Thread·PermaLink | 1.00/5 |
|
|
|
 |
|
 |
great code. Now I am trying to convert your code to complex number domain. I mean the element of net has complex number (i.e a+ib)
Would you correct me if am I wrong;
in your code // Apply bias sum+=weight[i][j][lsize[i-1]]; // Apply sigmoid function ....
so, is it mean that bias is constant for any layer. I think each neuron should have its own bias.
regards Kadir
|
| Sign In·View Thread·PermaLink | |
|
|
|
 |
|
 |
Hello every one ! Does this algorithm works with values that are not between 0 and 1, or do I have to normalize them to implement them ? Thank you. PA
|
| Sign In·View Thread·PermaLink | 1.75/5 |
|
|
|
 |
|
 |
In MatLab,
1st. we use purelin function p(x)=x where x is the summation of total weight multiple with bias that pass thought the output neuron in feedforward, but what is the dpurelin equation/formula that used in backpropagation?
2nd. we use logsig function logsig(x)=1/(1+e^(-x)) where x is the summation of total weight multiple with bias that pass thought the hidden neuron in feedforward method, but wat is the dlogsig equation/formula that used in backpropagation?
3rd. how to write the traingd or trainda function code in C style?
|
| Sign In·View Thread·PermaLink | 2.00/5 |
|
|
|
 |
|
 |
This NN code is very fast. I'm converting this code to C, but having trouble while converting "***weight" variable to C. How to convert multiple pointer variable into C? Do you have a C equivalent code of this NN?
|
| Sign In·View Thread·PermaLink | 1.00/5 |
|
|
|
 |
|
|
 |
|
 |
Although CBackProp will handle any range of input ( unlike the outputs which must be in the range 0,1 - see my earlier post ) I have found that if I also scale the inputs to the range 0,1 the results are improved, sometimes greatly.
The example below shows the results of experimenting with a time series generated from the function y = 0.5 sine( 0.5 x )
Input Actual Pred Error % 0.000000 0.479462 0.366899 0.112563 23.476872 1.000000 0.488765 0.357821 0.130944 26.790794 2.000000 0.378401 0.338155 0.040246 10.635803 3.000000 0.175392 0.293070 -0.117679 67.094771 4.000000 -0.070560 0.182579 -0.253139 -358.756813 5.000000 -0.299236 -0.070652 -0.228584 -76.389083 6.000000 -0.454649 -0.376426 -0.078223 -17.205125 7.000000 -0.498747 -0.470585 -0.028163 -5.646709 8.000000 -0.420735 -0.230091 -0.190645 -45.312249 9.000000 -0.239713 0.271760 -0.511473 -213.368952 10.000000 0.000000 0.328039 -0.328039 1.#INF00 11.000000 0.239713 0.294158 -0.054445 22.712536 12.000000 0.420735 0.164060 0.256676 61.006471 13.000000 0.498747 0.069891 0.428856 85.986604 14.000000 0.454649 0.018435 0.436214 95.945287 15.000000 0.299236 -0.007738 0.306975 102.586070 16.000000 0.070560 -0.020798 0.091358 129.475865 17.000000 -0.175392 -0.027278 -0.148113 -84.447263 18.000000 -0.378401 -0.030488 -0.347914 -91.943055 19.000000 -0.488765 -0.032076 -0.456689 -93.437334 Maximum Absolute Error 0.511473
Input Actual Pred Error % 0.000000 0.479462 0.469578 0.009884 2.061550 0.052632 0.488765 0.447425 0.041340 8.458115 0.105263 0.378401 0.384571 -0.006170 1.630557 0.157895 0.175392 0.211022 -0.035630 20.314779 0.210526 -0.070560 -0.094019 0.023459 -33.246259 0.263158 -0.299236 -0.325074 0.025838 -8.634506 0.315789 -0.454649 -0.412676 -0.041973 -9.231941 0.368421 -0.498747 -0.430788 -0.067960 -13.626074 0.421053 -0.420735 -0.404232 -0.016503 -3.922495 0.473684 -0.239713 -0.289402 0.049689 -20.728649 0.526316 0.000000 -0.004555 0.004555 1.#INF00 0.578947 0.239713 0.286781 -0.047068 19.635338 0.631579 0.420735 0.405822 0.014913 3.544582 0.684211 0.498747 0.432326 0.066422 13.317734 0.736842 0.454649 0.412816 0.041832 9.201034 0.789474 0.299236 0.323439 -0.024203 8.088330 0.842105 0.070560 0.096002 -0.025442 36.057186 0.894737 -0.175392 -0.204064 0.028672 -16.347377 0.947368 -0.378401 -0.382679 0.004277 -1.130393 1.000000 -0.488765 -0.449920 -0.038845 -7.947668 Maximum Absolute Error 0.067960
|
| Sign In·View Thread·PermaLink | 5.00/5 |
|
|
|
 |
|
 |
The calculation of mean square error looks strange to me.
Should we not divide by the number of outputs ( instead of just 2 in every case ) ?
I think the code in double CBackProp::mse(double *tgt) should be:
double mse=0; for(int i=0;i1];i++){ mse+=(tgt[i]-out[numl-1][i])*(tgt[i]-out[numl-1][i]); } return mse/lsize[numl-1];
|
| Sign In·View Thread·PermaLink | |
|
|
|
 |
|
 |
I agree with you, "return mse/2;" should be replaced by "return mse/lsize[numl-1];". File: BackProp.cpp, line 130.
btw It is a really good, simple and well written article.
|
| Sign In·View Thread·PermaLink | 3.67/5 |
|
|
|
 |
|
 |
Thank you for double checking my observation. It is always good to get somebody else to look at something like this, since I might well have been missing something. This is especially the case here, where the original author seems no longer to be around.
Yes, the article, and code, are a good, simple start to working with neural networks. However, the class CBackProp is really too simple to be useful by itself in serious work.
1. The undocumented scaling issues are a problem
2. The interface, which requires data to be passed in through pointers to elements of two dimensional arrays, is quite unfriendly.
3. The learning procedure in the sample application is extremely limited, and will not work with different input data.
4. There is no support for my own particular interest, time series analysis.
I am developing a wrapper class for CBackProp which addresses the above issues.
James
|
| Sign In·View Thread·PermaLink | 1.00/5 |
|
|
|
 |
|
 |
Sorry, my previous statement is incorrect. MSE here is CORRECT.
I just realized, that this is not a standard MSE. It was modified in Neural Networks theory to make it more simple to derive MSE function. - simple example: (1/2 * (x)^2)' = 2 * 1/2 * x' = x'
|
| Sign In·View Thread·PermaLink | |
|
|
|
 |
|
 |
I do not understand.
How can mean square error be anything other than the sum of errors squared, divided by total count?
What does x' mean in your simple example.
James
|
| Sign In·View Thread·PermaLink | |
|
|
|
 |
|
 |
The problem lies in the name MSE... the algorithm's objective is to minimize the sum of the squared errors, which is labeled as mse in the program.
The 1/2 factor is for computational simplicity, as brutjbro points out. In his response, x' denotes the derivative of x. We use the derivative in the minimization steps, and the 1/2 factor eliminates the need to multiply by 2 in several places.
|
| Sign In·View Thread·PermaLink | 2.00/5 |
|
|
|
 |
|
 |
Thanks for great work!
Is it true for the hidden layers that are not immediate neighbours of the output layer with the calculation of "then find the delta for the hidden layers..." in your article?
|
| Sign In·View Thread·PermaLink | 1.00/5 |
|
|
|
 |
|
 |
The training stops too soon. If just one set of predictions match the training output, the training will stop. This often means that other data points still have a long way to go before they converge. Since the initial values are random, training will sometimes stop immediately, because the initial random values happen to match one of the training sets.
I suggest modifying this code
for (long i=0; i < num_iter ; i++) { bp->bpgt(data[i%8], &data[i%8][3]);
if( bp->mse(&data[i%8][3]) < thresh) break; }
to something like
int countOK = 0; for (long i=0; i < num_iter ; i++) { bp->bpgt(data[i%8], &data[i%8][3]);
if( bp->mse(&data[i%8][3]) < thresh) { if( ++countOK > 8 ) { break; } } else { countOK = 0; } }
|
| Sign In·View Thread·PermaLink | 5.00/5 |
|
|
|
 |
|
 |
I can confirm this. I did something like the above to fix. I also made Thresh (a lot) bigger otherwise it takes a long time to converge. I think thresh was mistakenly set small because of this bug ?
PJ.
|
| Sign In·View Thread·PermaLink | |
|
|
|
 |
|
 |
It seems to me that the range of outputs is limited to between 0 and 1. Is this correct?
Input Actual Predicted 0 0 0.00447204 1 10 0.999999 2 20 1 3 30 1 4 40 1
Input Actual Predicted 0 0 0.00446577 1 -0.1 0.00257842 2 -0.2 0.00178946 3 -0.3 0.00142328 4 -0.4 0.00123623
|
| Sign In·View Thread·PermaLink | 1.50/5 |
|
|
|
 |
|
 |
I tried to build the multiplication table with the code, but it's not working.
I'm using 16 neurons for the inputs, 8 for each number, and 8 neurons for the output.
I tried the multiplication table from 1 to 5, meaning :
1 * 1 = 1 1 * 2 = 2 1 * 3 = 3 ... 5 * 5 = 25
Eran Aharonovich (eran.aharonovich@gmail.com ) Noviway
|
| Sign In·View Thread·PermaLink | |
|
|
|
 |
|
 |
Eran,
CBackProp does a reasonable job on this problem.
You have to keep a couple of things in mind:
1. Choose the correct neuron numbers. I used a three layer configuration, with 3,3,1 neurons in each layer.
2. CBackProp can only handle outputs in the range 0 to 1. The data has to be scaled to fit this range.
Here are my results:
Input1 Input2 Result Predicted Abs Error % Error 1 1 1 1.31615 -0.316154 31.6154 1 2 2 1.93613 0.0638714 3.19357 1 3 3 2.95859 0.04141 1.38033 1 4 4 4.10813 -0.10813 2.70324 1 5 5 5.13011 -0.130111 2.60221 2 1 2 1.92798 0.0720164 3.60082 2 2 4 3.54203 0.457974 11.4494 2 3 6 5.83225 0.167753 2.79588 2 4 8 8.03822 -0.0382202 0.477753 2 5 10 9.87954 0.120461 1.20461 3 1 3 2.95723 0.0427697 1.42566 3 2 6 5.85805 0.141954 2.36589 3 3 9 9.29728 -0.297278 3.30309 3 4 12 12.2072 -0.207176 1.72647 3 5 15 14.9321 0.0678993 0.452662 4 1 4 4.15739 -0.157394 3.93484 4 2 8 8.13779 -0.137789 1.72236 4 3 12 12.2599 -0.259928 2.16607 4 4 16 15.8474 0.152555 0.95347 4 5 20 20.2755 -0.275514 1.37757 5 1 5 5.27889 -0.278891 5.57781 5 2 10 10.0632 -0.0631701 0.631701 5 3 15 14.9618 0.0381991 0.254661 5 4 20 20.1791 -0.179124 0.89562 5 5 25 24.4476 0.552407 2.20963
|
| Sign In·View Thread·PermaLink | |
|
|
|
 |
|
|