- Download the Neural Network demo project - 203 Kb (includes a release-build executable that you can run without the need to compile)
- Download a sample neuron weight file - 2,785 Kb (achieves the 99.26% accuracy mentioned above)
- Download the MNIST database - 11,594 Kb total for all four files (external link to four files which are all required for this project)

## Contents

- Introduction
- Some Neural Network Theory
- Structure of the Convolutional Neural Network
- MNIST Database of Handwritten Digits
- Overall Architecture of the Test/Demo Program
- Training the Neural Network
- Tricks That Make Training Faster
- Experiences in Training the Neural Network
- Results
- Bibliography
- License and Version Information

## Introduction

This article chronicles the development of an artificial neural network designed to recognize handwritten digits. Although some theory of neural networks is given here, it would be better if you already understood some neural network concepts, like neurons, layers, weights, and backpropagation.

The neural network described here is *not* a general-purpose neural network, and it's not some kind of neural network workbench. Rather, we will focus on one very specific neural network (a five-layer convolutional neural network) built for one very specific purpose (to recognize handwritten digits).

The idea of using neural networks for the purpose of recognizing handwritten digits is not a new one. The inspiration for the architecture described here comes from articles written by two separate authors. The first is Dr. Yann LeCun, who was an independent discoverer of the basic backpropagation algorithm. Dr. LeCun hosts an excellent site on his research into neural networks. In particular, you should view his "Learning and Visual Perception" section, which uses animated GIFs to show results of his research. The MNIST database (which provides the database of handwritten digits) was developed by him. I used two of his publications as primary source materials for much of my work, and I highly recommend reading his other publications too (they're posted at his site). Unlike many other publications on neural networks, Dr. LeCun's publications are not inordinately theoretical and math-intensive; rather, they are extremely readable, and provide practical insights and explanations. His articles and publications can be found here. Here are the two publications that I relied on:

- Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-Based Learning Applied to Document Recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, Nov. 1998. [46 pages]
- Y. LeCun, L. Bottou, G. Orr, and K. Muller, "Efficient BackProp," in Neural Networks: Tricks of the trade, (G. Orr and Muller K., eds.), 1998. [44 pages]

The second author is Dr. Patrice Simard, a former collaborator with Dr. LeCun when they both worked at AT&T Laboratories. Dr. Simard is now a researcher at Microsoft's "Document Processing and Understanding" group. His articles and publications can be found here, and the publication that I relied on is:

- Patrice Y. Simard, Dave Steinkraus, John Platt, "Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis," International Conference on Document Analysis and Recognition (ICDAR), IEEE Computer Society, Los Alamitos, pp. 958-962, 2003.

One of my goals here was to reproduce the accuracy achieved by Dr. LeCun, who was able to train his neural network to achieve 99.18% accuracy (i.e., an error rate of only 0.82%). This error rate served as a type of "benchmark", guiding my work.

As a final introductory note, I'm not overly proud of the source code, which is most definitely an engineering work-in-progress. I started out with good intentions, to make source code that was flexible and easy to understand and to change. As things progressed, the code started to turn ugly. I began to write code simply to get the job done, sometimes at the expense of clean code and comprehensibility. To add to the mix, I was also experimenting with different ideas, some of which worked and some of which did not. As I removed the failed ideas, I did not always back out all the changes and there are therefore some dangling stubs and dead ends. I contemplated the possibility of not releasing the code. But that was one of my criticisms of the articles I read: none of them included code. So, with trepidation and the recognition that the code is easy to criticize and could really use a re-write, here it is.

## Some Neural Network Theory

This is not a neural network tutorial, but to understand the code and the names of the variables used in it, it helps to know some neural network basics.

The following discussion is not completely general. It considers only feed-forward neural networks, that is, neural networks composed of multiple layers, in which each layer of neurons feeds only the very next layer of neurons, and receives input only from the immediately preceding layer of neurons. In other words, the neurons don't skip layers.

Consider a neural network that is composed of multiple layers, with multiple neurons in each layer. Focus on one neuron in layer *n*, namely the *i-th* neuron. This neuron gets its inputs from the outputs of neurons in the previous layer, plus a bias whose value is one ("1"). I use the variable "*x*" to refer to outputs of neurons. The *i-th* neuron applies a weight to each of its inputs, and then adds the weighted inputs together so as to obtain something called the "activation value". I use the variable "*y*" to refer to activation values. The *i-th* neuron then calculates its output value "*x*" by applying an "activation function" to the activation value. I use "*F()*" to refer to the activation function. The activation function is sometimes referred to as a "Sigmoid" function, a "Squashing" function, or other names, since its primary purpose is to limit the output of the neuron to some reasonable range like -1 to +1, and thereby inject some degree of non-linearity into the network. Here's a diagram of a small part of the neural network; remember to focus on the *i-th* neuron in level *n*:

This is what each variable means:

- $x_i^n$ is the output of the i-th neuron in layer $n$
- $x_j^{n-1}$ is the output of the j-th neuron in layer $n-1$
- $x_k^{n-1}$ is the output of the k-th neuron in layer $n-1$
- $w_{ij}^n$ is the weight that the i-th neuron in layer $n$ applies to the output of the j-th neuron from layer $n-1$ (i.e., the previous layer). In other words, it's the weight from the output of the j-th neuron in the previous layer to the i-th neuron in the current (n-th) layer.
- $w_{ik}^n$ is the weight that the i-th neuron in layer $n$ applies to the output of the k-th neuron in layer $n-1$
- $x_i^n = F(y_i^n)$, where $y_i^n = \sum_j w_{ij}^n \, x_j^{n-1}$, is the general feed-forward equation, where $F()$ is the activation function. We will discuss the activation function in more detail in a moment.
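As a concrete illustration of the feed-forward equation, here is a minimal standalone sketch (the function name is hypothetical and is not taken from the article's classes) that computes one neuron's activation value and output, using `tanh` as the activation function $F()$:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical helper, not part of the article's classes: computes the
// output x_i = F(y_i) of one neuron, where the activation value y_i is
// the weighted sum of the previous layer's outputs. The bias is treated
// as just another input whose value is fixed at 1.0.
double NeuronOutput( const std::vector<double>& prevOutputs,
                     const std::vector<double>& weights )
{
    assert( prevOutputs.size() == weights.size() );

    double y = 0.0;  // activation value y_i
    for ( size_t j = 0; j < weights.size(); ++j )
    {
        y += weights[ j ] * prevOutputs[ j ];
    }

    return std::tanh( y );  // F(): squashes the output into (-1, +1)
}
```

For example, with inputs {1.0, 0.5} (the first being the bias input of 1.0) and weights {0.2, 0.4}, the activation value is 0.2 + 0.2 = 0.4, and the output is tanh(0.4).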

How does this translate into code and C++ classes? The way I saw it, the above diagram suggested that a neural network is composed of objects of four different classes: layers, neurons in the layers, connections from neurons in one layer to those in another layer, and weights that are applied to connections. Those four classes are reflected in the code, together with a fifth class -- the neural network itself -- which acts as a container for all other objects and which serves as the main interface with the outside world. Here's a simplified view of the classes. Note that the code makes heavy use of `std::vector`, particularly `std::vector< double >`:

```cpp
// simplified view: some members have been omitted,
// and some signatures have been altered

// helpful typedef's
typedef std::vector< NNLayer* >     VectorLayers;
typedef std::vector< NNWeight* >    VectorWeights;
typedef std::vector< NNNeuron* >    VectorNeurons;
typedef std::vector< NNConnection > VectorConnections;

// Neural Network class
class NeuralNetwork
{
public:
    NeuralNetwork();
    virtual ~NeuralNetwork();

    void Calculate( double* inputVector, UINT iCount,
        double* outputVector = NULL, UINT oCount = 0 );
    void Backpropagate( double *actualOutput,
        double *desiredOutput, UINT count );

    VectorLayers m_Layers;
};

// Layer class
class NNLayer
{
public:
    NNLayer( LPCTSTR str, NNLayer* pPrev = NULL );
    virtual ~NNLayer();

    void Calculate();
    void Backpropagate( std::vector< double >& dErr_wrt_dXn /* in */,
        std::vector< double >& dErr_wrt_dXnm1 /* out */,
        double etaLearningRate );

    NNLayer* m_pPrevLayer;
    VectorNeurons m_Neurons;
    VectorWeights m_Weights;
};

// Neuron class
class NNNeuron
{
public:
    NNNeuron( LPCTSTR str );
    virtual ~NNNeuron();

    void AddConnection( UINT iNeuron, UINT iWeight );
    void AddConnection( NNConnection const & conn );

    double output;
    VectorConnections m_Connections;
};

// Connection class
class NNConnection
{
public:
    NNConnection( UINT neuron = ULONG_MAX, UINT weight = ULONG_MAX );
    virtual ~NNConnection();

    UINT NeuronIndex;
    UINT WeightIndex;
};

// Weight class
class NNWeight
{
public:
    NNWeight( LPCTSTR str, double val = 0.0 );
    virtual ~NNWeight();

    double value;
};
```

As you can see from the above, class `NeuralNetwork` stores a vector of pointers to the layers in the neural network, which are represented by class `NNLayer`. There is no special function to add a layer (there probably should be one); simply use the `std::vector::push_back()` function. The `NeuralNetwork` class also provides the two primary interfaces with the outside world, namely, a function to forward propagate the neural network (the `Calculate()` function) and a function to `Backpropagate()` the neural network so as to train it.

Each `NNLayer` stores a pointer to the previous layer, so that it knows where to look for its input values. In addition, it stores a vector of pointers to the neurons in the layer, represented by class `NNNeuron`, and a vector of pointers to weights, represented by class `NNWeight`. Similar to the `NeuralNetwork` class, the pointers to the neurons and to the weights are added using the `std::vector::push_back()` function. Finally, the `NNLayer` class includes functions to `Calculate()` the output values of neurons in the layer, and to `Backpropagate()` them; in fact, the corresponding functions in the `NeuralNetwork` class simply iterate through all the layers in the network and call these functions.

Each `NNNeuron` stores a vector of connections that tell the neuron where to get its inputs. Connections are added using the `NNNeuron::AddConnection()` function, which takes an index to a neuron and an index to a weight, constructs a `NNConnection` object, and `push_back()`'s the new connection onto the vector of connections. Each neuron also stores its own output value, even though it's the `NNLayer` class that is responsible for calculating the actual value of the output and storing it there. The `NNConnection` and `NNWeight` classes respectively store obviously-labeled information.

One legitimate question about the class structure is, why are there separate classes for the weights and the connections? According to the diagram above, each connection has a weight, so why not put them in the same class? The answer lies in the fact that weights are often shared between connections. In fact, the convolutional neural network of this program specifically shares weights amongst its connections. So, for example, even though there might be several hundred neurons in a layer, there might only be a few dozen weights due to sharing. By making the `NNWeight` class separate from the `NNConnection` class, this sharing is more readily accomplished.
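To see concretely why indexing weights separately from connections enables sharing, consider this standalone sketch (the names here are hypothetical, simplified analogues of the article's classes, not its actual code). Two connections refer to the same entry in a weight vector, so updating that one entry changes the contribution of both connections:

```cpp
#include <vector>

// Hypothetical, simplified analogue of the article's NNConnection class:
// a connection stores *indices*, not the weight value itself, so many
// connections can point at one shared weight.
struct Connection
{
    unsigned neuronIndex;   // which neuron in the previous layer
    unsigned weightIndex;   // which entry in the shared weight vector
};

// Weighted sum of a neuron's inputs, looking each weight up by index.
double WeightedSum( const std::vector<Connection>& conns,
                    const std::vector<double>& weights,
                    const std::vector<double>& prevOutputs )
{
    double sum = 0.0;
    for ( const Connection& c : conns )
    {
        sum += weights[ c.weightIndex ] * prevOutputs[ c.neuronIndex ];
    }
    return sum;
}
```

If two connections both have `weightIndex == 0`, then assigning a new value to `weights[0]` changes both terms of the sum at once, which is exactly the behavior a convolutional layer needs when one kernel weight applies across many neuron positions.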

### Forward Propagation

Forward propagation is the process whereby each neuron calculates its output value, based on the inputs provided by the output values of the neurons that feed it.

In the code, the process is initiated by calling `NeuralNetwork::Calculate()`. `NeuralNetwork::Calculate()` directly sets the values of neurons in the input layer, and then iterates through the remaining layers, calling each layer's `NNLayer::Calculate()` function. This results in a forward propagation that's completely sequential, starting from neurons in the input layer and progressing through to the neurons in the output layer. A sequential calculation is not the only way to forward propagate, but it's the most straightforward. Here's simplified code, which takes a pointer to a C-style array of `double`s representing the input to the neural network, and stores the output of the neural network to another C-style array of `double`s:

```cpp
// simplified code
void NeuralNetwork::Calculate(double* inputVector, UINT iCount,
    double* outputVector /* =NULL */, UINT oCount /* =0 */)
{
    VectorLayers::iterator lit = m_Layers.begin();
    VectorNeurons::iterator nit;

    // first layer is input layer: directly
    // set outputs of all of its neurons
    // to the given input vector
    if ( lit < m_Layers.end() )
    {
        nit = (*lit)->m_Neurons.begin();
        int count = 0;

        // there should be exactly one neuron per input
        ASSERT( iCount == (*lit)->m_Neurons.size() );

        while( ( nit < (*lit)->m_Neurons.end() ) && ( count < iCount ) )
        {
            (*nit)->output = inputVector[ count ];
            nit++;
            count++;
        }
    }

    // iterate through remaining layers,
    // calling their Calculate() functions
    for( lit++; lit<m_Layers.end(); lit++ )
    {
        (*lit)->Calculate();
    }

    // load up output vector with results
    if ( outputVector != NULL )
    {
        lit = m_Layers.end();
        lit--;

        nit = (*lit)->m_Neurons.begin();

        for ( int ii=0; ii<oCount; ++ii )
        {
            outputVector[ ii ] = (*nit)->output;
            nit++;
        }
    }
}
```

Inside the layer's `Calculate()` function, the layer iterates through all neurons in the layer, and for each neuron the output is calculated according to the feed-forward formula given above, namely

$$x_i^n = F(y_i^n) = F\left( \sum_j w_{ij}^n \, x_j^{n-1} \right)$$

This formula is applied by iterating through all connections for the neuron, and for each connection, obtaining the corresponding weight and the corresponding output value from a neuron in the previous layer:

```cpp
// simplified code
void NNLayer::Calculate()
{
    ASSERT( m_pPrevLayer != NULL );

    VectorNeurons::iterator nit;
    VectorConnections::iterator cit;

    double dSum;

    for( nit=m_Neurons.begin(); nit<m_Neurons.end(); nit++ )
    {
        NNNeuron& n = *(*nit);  // to ease the terminology

        cit = n.m_Connections.begin();

        ASSERT( (*cit).WeightIndex < m_Weights.size() );
```
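The loop above is cut off here, but its remaining job is straightforward: accumulate the weighted sum over the neuron's connections, then apply the activation function. As a hedged, standalone sketch (hypothetical names, not the article's `NNLayer` code), that accumulation looks like this, with the bias treated, per the article's convention, as an ordinary input whose value is fixed at one:

```cpp
#include <cmath>
#include <vector>

// Hypothetical standalone sketch of one neuron's forward step: each
// connection is a pair of indices (into the previous layer's outputs
// and into the shared weight vector). Accumulate dSum, then squash it
// with tanh() as the activation function F().
double NeuronForward( const std::vector<unsigned>& neuronIndex,
                      const std::vector<unsigned>& weightIndex,
                      const std::vector<double>& weights,
                      const std::vector<double>& prevOutputs )
{
    double dSum = 0.0;
    for ( size_t c = 0; c < neuronIndex.size(); ++c )
    {
        dSum += weights[ weightIndex[ c ] ] * prevOutputs[ neuronIndex[ c ] ];
    }
    return std::tanh( dSum );  // F(): the squashing function
}
```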