## Introduction

This article continues the topic of artificial neural networks and their implementation in the ANNT library. The first article started with the basics and described feed forward fully connected neural networks and their training using the Stochastic Gradient Descent and Error Back Propagation algorithms. It then demonstrated the application of this architecture to a number of tasks. One of those was classification of handwritten digits from the MNIST database. Although it was a simple example, it achieved about 96.5% accuracy on the test dataset. In this article we'll have a look at a different architecture of artificial neural networks known as Convolutional Neural Networks (CNN). This type of network is specifically designed for computer vision tasks and outperforms classical fully connected neural networks on tasks like image recognition. As another sample application will demonstrate, we'll get to about 99% accuracy on the handwritten digit classification.

Originally the convolutional neural network architecture was introduced by Yann LeCun in work published back in 1998. However, it went largely unnoticed in those days. It took 14 years for convolutional networks to get wide attention, when the ImageNet competition was won by a team using this architecture. CNNs became very popular after that and were applied to many computer vision problems, resulting in the development of a variety of neural networks based on this architecture. These days state-of-the-art convolutional neural networks achieve accuracies that outperform humans on many image recognition tasks.

## Theoretical background

As with feed forward fully connected artificial neural networks, the idea of convolutional networks was inspired by studying nature, namely the brain of mammals. Work by Hubel and Wiesel in the 1950s and 1960s showed that the visual cortexes of cats and monkeys contain neurons that individually respond to small regions of the visual field. Provided the eyes are not moving, the region of visual space within which visual stimuli affect the firing of a single neuron is known as its receptive field. Neighbouring cells have similar and overlapping receptive fields. Receptive field size and location vary systematically across the cortex to form a complete map of visual space.

In their work, they described two basic types of visual neuron cells in the brain that each act in a different way: simple cells and complex cells. The simple cells activate, for example, when they identify basic shapes such as lines in a fixed area and at a specific angle. The complex cells have larger receptive fields and their output is not sensitive to the specific position in the field. These cells continue to respond to a certain stimulus, even though its absolute position on the retina changes.

In 1980, Kunihiko Fukushima proposed a hierarchical neural network model, which was named neocognitron. This model was inspired by the concepts of the simple and complex cells. The neocognitron was able to recognise patterns by learning about the shapes of objects.

Later, in 1998, convolutional neural networks were introduced by Yann LeCun and his colleagues. Their first CNN was called LeNet-5 and was able to classify hand-written digits.

### Architecture of convolutional network

Before getting into the details of building a convolutional neural network, let's have a look at some of the building blocks, which are either specific to this type of network or were popularized by it. As seen in the previous article, many concepts of artificial neural networks can be implemented as separate entities, which perform calculations for both the inference and training phases. Since the core structure was already laid out in the previous article, here we'll just add building blocks on top and then stitch them together.

#### Convolutional layer

A convolutional layer is the core building block of a convolutional neural network. It assumes its input has a 3-dimensional shape of some width, height and depth. For the first convolutional layer it is usually an image, which most commonly has a depth of 1 (grayscale image) or 3 (color image with 3 RGB channels). For subsequent convolutional layers the input is represented by a set of feature maps produced by previous layers (here depth is the number of input feature maps). For now, let's assume we deal with inputs having a depth of 1, which turns them into 2-dimensional structures.

So, what the convolutional layer does is essentially an image convolution with some kernel. It is a very common image processing operation, which is used to achieve variety of results. For example, it can be used to make images blurry or make them sharper. But this is not what convolutional networks are interested in. Depending on the kernel in use, image convolution can be used to find certain features in images – vertical or horizontal edges, corners, angles or more complex features like circles, etc. Recall the idea of simple cells in the visual cortex?

Let's see how convolution is calculated. Suppose we have *n* (height) by *m* (width) matrices *K* (kernel) and *I* (image). Their convolution can then be written as a dot product of those matrices (the sum of element-wise products), where the kernel matrix is flipped horizontally and vertically.

For example, if we have 3 by 3 matrices *K* and *I*, the convolution of those can be calculated this way:
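Written out with 1-based row and column indices (the index convention is an assumption made here), flipping the kernel both ways pairs element (*n*-*i*+1, *m*-*j*+1) of *K* with element (*i*, *j*) of *I*:

$$
K * I = \sum_{i=1}^{n} \sum_{j=1}^{m} K_{n-i+1,\,m-j+1} \; I_{i,j}
$$

For the 3 by 3 case this expands to:

$$
\begin{aligned}
K * I ={} & K_{3,3} I_{1,1} + K_{3,2} I_{1,2} + K_{3,1} I_{1,3} \\
 +{} & K_{2,3} I_{2,1} + K_{2,2} I_{2,2} + K_{2,1} I_{2,3} \\
 +{} & K_{1,3} I_{3,1} + K_{1,2} I_{3,2} + K_{1,1} I_{3,3}
\end{aligned}
$$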

The above is how convolution is defined in signal processing, where the kernel is flipped vertically and horizontally. A more straightforward calculation would be just a normal dot product of the *K* and *I* matrices, without any flipping. This operation is called *cross-correlation* and is defined this way:
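Using the same 1-based indices (again an assumed convention), cross-correlation pairs each kernel element with the image element at the same position:

$$
K \star I = \sum_{i=1}^{n} \sum_{j=1}^{m} K_{i,j} \; I_{i,j}
$$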

When it comes to signal processing, convolution and cross-correlation have different properties and are used for different purpose. However, when it comes to image processing and neural networks the difference becomes subtle and cross-correlation is often used instead. For neural networks it is really not important at all. As we'll see later, those "convolution" kernels are actually the weights, which neural network needs to learn. So, it is up to the network to decide which kernel to learn - flipped or not. With this in mind, we'll keep it simple and use cross-correlation then. **Note**: further in the article anywhere "convolution" is mentioned, we'll assume normal dot product of two matrices, i.e. cross-correlation.

OK, we now know how to calculate convolution for a kernel and an image of the same size. However, in image processing this is rarely the case. The kernel is usually a small square matrix of size 3 by 3, 5 by 5, 7 by 7, etc., while the image can be of any size. So how is image convolution calculated then? To calculate image convolution, the kernel is moved across the entire image and the weighted sum is calculated at every possible location of the kernel. In image processing this concept is known as a sliding window. Calculations start at the top left corner of the image, where convolution is calculated between the kernel and the corresponding image area of the same size. Then the kernel is shifted right by one pixel and another convolution is calculated. This is repeated until convolution is calculated at every position of the row. Once that is done, the kernel is moved to the start of the next row of pixels and the process continues. When the entire image is processed, we get a feature map: the values of individual convolutions at every possible location of the image.

The picture below illustrates the process of calculating image convolution. For an image of 8x8 in size and kernel of 3x3, we get a feature map of 6x6 in size - convolution is calculated only at those locations, where kernel fits entirely into the image. The picture below highlights few regions of the source image and their corresponding convolution values in the resulting feature map.

The above 3x3 kernel is designed to look for object's left edges (or presence of a straight vertical line on the right from the center of the sliding window). High positive values in the resulting feature map indicate presence of the feature we are looking for. Zeros mean absence of the feature. And for this particular example, negative values indicate presence of the "inverse" feature – object's right edges.
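The sliding window procedure can be sketched in a few lines of plain Python. This is only an illustration: the function name `cross_correlate`, the 8x8 test image and the 3x3 kernel are made up here and are not the ones from the original picture. The kernel responds with positive values where the image goes from dark to bright when moving right.

```python
def cross_correlate(image, kernel):
    """Slide `kernel` over `image` (both lists of rows), valid positions only."""
    h, w = len(image), len(image[0])
    n, m = len(kernel), len(kernel[0])
    feature_map = []
    for y in range(h - n + 1):          # top row of the window
        row = []
        for x in range(w - m + 1):      # left column of the window
            # Weighted sum of the window under the kernel (no kernel flipping)
            row.append(sum(kernel[i][j] * image[y + i][x + j]
                           for i in range(n) for j in range(m)))
        feature_map.append(row)
    return feature_map


# Hypothetical 8x8 image: dark (0) left half, bright (1) right half
image = [[0, 0, 0, 0, 1, 1, 1, 1] for _ in range(8)]

# Hypothetical 3x3 kernel: negative weights on the left, positive on the right
kernel = [[-1, 0, 1],
          [-1, 0, 1],
          [-1, 0, 1]]

feature_map = cross_correlate(image, kernel)
for row in feature_map:
    print(row)
```

The result is a 6x6 feature map, with non-zero values only in the columns where the window straddles the dark-to-bright boundary.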

As shown above, the output feature map is smaller than the source image when convolution is calculated, and the bigger the kernel, the smaller the feature map. For a kernel of *n*x*m* in size, the output loses *n*-1 rows and *m*-1 columns compared to the input. So, if we had a 5x5 kernel in the above example, the resulting feature map would shrink to 4x4 in size. In many cases, however, it is preferable to get an output feature map of the same size as the input. To obtain this, the source image needs to be padded (usually with zeros). For example, if the source image is 8x8 in size and our kernel is 5x5 in size, then we need to pad the input so it gets to 12x12 in size, i.e. 4 extra rows and 4 extra columns are added. This is usually done by adding 2 rows/columns on each side of the input image.
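A minimal sketch of zero padding, assuming the image is stored as a list of rows (the helper name `zero_pad` is hypothetical). It reproduces the arithmetic of the example: an 8x8 input padded by 2 on each side becomes 12x12, and a 5x5 kernel then yields an 8x8 output.

```python
def zero_pad(image, pad):
    """Return a copy of `image` (a list of rows) with `pad` zeros on every side."""
    width = len(image[0]) + 2 * pad
    padded = [[0] * width for _ in range(pad)]            # rows added on top
    for row in image:
        padded.append([0] * pad + list(row) + [0] * pad)  # columns on both sides
    padded.extend([0] * width for _ in range(pad))        # rows added at the bottom
    return padded


image = [[1] * 8 for _ in range(8)]
padded = zero_pad(image, 2)

kernel_size = 5
output_size = len(padded) - kernel_size + 1   # valid positions along one dimension
print(len(padded), len(padded[0]), output_size)
```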

So far we've discussed how to compute convolution mathematically and how to compute image convolution when it comes to image processing. However, we are doing artificial neural networks, so we need to see how all the above is related to convolutional layers. To keep it simple for now, let's use the example from the above – 8x8 input image convolved with 3x3 kernel, which gives us 6x6 feature map (output). In this case, our input layer has 64 nodes and our convolutional layer has 36 neurons. However, unlike with fully connected layer, where each neuron of the layer is connected to all neurons of the previous layer, neurons of convolutional layer are connected only to a small group of the previous layer's neurons. Each neuron in convolutional layer has as many connections as the number of weights in the convolution kernel it implements, which is 9 connections in the above example (kernel size 3x3). Since convolutional layer assumes the input has 2D shape (3D in general, but keeping it simple for this example), those connections are done to a rectangular group of previous neurons, which is of the same shape as the kernel in use. The group of connected previous neurons is different for each neuron of the convolutional layer, however it does overlap for the neighbouring neurons. These connections are made in the same way, as pixels of the source image are chosen, when calculating image convolution using sliding window approach. For example, looking at the above image demonstrating image convolution, we can see which of the highlighted outputs on the feature map get connected to which inputs (highlighted with the same color).

Ignoring the fact that neurons of fully connected layers and convolutional layers have different number of connections to the previous layer and that these connections have certain structure, both layers essentially do the same – calculating weighted sum of inputs to produce outputs. There is one more difference though. Unlike with fully connected layers, where each neuron has its own weights, neurons of convolutional layers share them. So, if a layer does one single 3x3 convolution (in practice it does more than one, but keep it for later), it just has one set of weights, i.e. 9, which are shared between each neuron for calculating weighted sum. And, although it was not yet mentioned before, convolutional layers also add bias value to the weighted sum, which is also shared. The table below summarizes the difference between fully connected and convolutional layers and provides some numbers for the above example.

| **Fully connected layer** | **Convolutional layer** |
|---|---|
| No assumptions about input structure | Input is assumed to have 2D shape (3D in general) |
| Each neuron is connected to all neurons of the previous layer: **64 connections each** | Each neuron is connected to a small rectangular group of neurons in the previous layer; number of connections equal to number of weights in convolution kernel: **9 connections each** |
| Each neuron has its own weights and bias value: **2304 weights and 36 bias values** | Weights and bias value are shared: **9 weights and 1 bias value** |

So far, we've kept things simple and assumed that both the input and output of a convolutional layer have 2D shape. However, this is not the case in general. Instead, both input and output have 3D shape. First, let's start with the output. In practice, each convolutional layer computes more than a single convolution. The number of convolutions it does is a configurable parameter, which is set when designing the artificial neural network. Each convolution uses its own set of weights (kernel) and bias value and so produces a different feature map. As mentioned before, different kernels can be used to look for different features: lines at different angles, curves, corners, etc. And so, it is often desirable to get a number of feature maps, which highlight the presence of different features. Calculation of those maps is simple: the process of calculating convolution for the given input is repeated multiple times with a different kernel's weights/bias every time. Translating it to the artificial neurons' world, we are simply adding additional groups of neurons into the convolutional layer, which are connected to inputs in the same way as in the case of a single kernel. Although they have the same connection pattern, each of these groups shares its own weights and bias value. Coming back to the example described before, suppose we configure our convolutional layer to do 5 convolutions, 3x3 each. In this case the number of outputs (number of neurons) is 36*5=180: 5 groups of neurons organized into 2D shape and repeating the same connection pattern. Each group of neurons shares its own set of weights/bias, which gives us 45 weights and 5 bias values in total for the layer.

Now let's discuss 3D nature of inputs. If we speak about the very first convolutional layer, then its input will be some sort of image, most likely. Most of the time it will be either grayscale image (2D data) or color RGB image (3D data). If we speak of subsequent convolutional layers, then input's depth will be equal to the number of feature maps (number of convolutions) calculated by the previous layer. When input gets higher depth, the number of neurons in convolutional layer is not growing. Instead, number of connections with the previous layer is growing. In fact, convolution kernels get 3D shape as well and have *n*x*m*x*d* size, where *d* is the depth of input. Translating it to neurons' world again, we can think of it as if each neuron gets additional connections to every feature map input contains. In the case of 2D input, each neuron was connected to *n*x*m* (kernel size) rectangular area of the input. In the case of 3D input, however, each neuron is connected to number (*d*) of such areas, which are coming from the same location, but from different input feature maps.

Since we've generalized convolutional layers to 3D inputs/outputs and also mentioned bias values, we can update our convolution formula, which is computed at every possible location (*x*, *y*) of the kernel within the input features.
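One way to write this formula, assuming 1-based indices, an input *I* indexed as (row, column, depth), and *k* = 1..*z* selecting the kernel (all of these conventions are assumptions made here, using the *n*, *m*, *d*, *z* sizes defined in this section):

$$
o_{k}(x, y) = b_{k} + \sum_{c=1}^{d} \sum_{i=1}^{n} \sum_{j=1}^{m} K^{(k)}_{i,j,c} \; I_{y+i-1,\;x+j-1,\;c}
$$

Here $K^{(k)}$ and $b_{k}$ are the weights and bias value of the *k*-th kernel, and $o_{k}(x, y)$ is the value of the *k*-th output feature map at column *x* and row *y*.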

To wrap up convolutional layers for now, let's summarize the parameters used to configure them. When creating a fully connected layer, we use only two parameters: the number of inputs and the number of outputs (neurons in the layer). When creating convolutional layers though, we don't need to specify the number of outputs. Instead we describe the shape of inputs, *h*x*w*x*d*, and the shape and number of kernels, *n*x*m*@*z*. So, we have 6 numbers: *w* – width of input feature maps (image), *h* – height of input feature maps, *d* – depth of input (number of feature maps), *m* – width of kernels, *n* – height of kernels, *z* – number of kernels (number of output feature maps). The actual size of kernels depends on the input specification and so we get *z* kernels of *n*x*m*x*d* in size. And the size of output then becomes (*h*-*n*+1)x(*w*-*m*+1)x*z* (here we assume input is not padded and kernel is applied only at valid locations).
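The size arithmetic above is easy to check with a small helper. The function name `conv_layer_summary` and the dictionary keys are hypothetical; the sketch assumes no padding and a kernel step of one, as in the text.

```python
def conv_layer_summary(h, w, d, n, m, z):
    """Sizes for a convolutional layer with h x w x d input and z kernels of n x m.

    Assumes no padding and a step of 1, so kernels are applied at valid
    locations only.
    """
    out_h, out_w = h - n + 1, w - m + 1
    return {
        "kernel_shape": (n, m, d),          # each kernel spans the full input depth
        "output_shape": (out_h, out_w, z),
        "neurons": out_h * out_w * z,
        "weights": n * m * d * z,
        "biases": z,                        # one shared bias value per kernel
    }


# The running example: 8x8 grayscale input, 5 kernels of 3x3
summary = conv_layer_summary(h=8, w=8, d=1, n=3, m=3, z=5)
print(summary)
```

For the running example this gives a 6x6x5 output, 180 neurons, 45 weights and 5 bias values, matching the numbers derived earlier.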

We'll get back to convolutional layers again when it comes to training them. The above, however, should give an idea of how output is calculated on the inference phase (computing output of a trained network).

#### ReLU activation function

The next building block to describe is ReLU activation function. It is not something new or specific to convolutional neural networks. However, it was popularized a lot with the rise of deeper neural networks. And this is where convolutional networks usually fit.

One of the problems deep neural networks experience is known as the vanishing gradient problem. When training an artificial neural network using gradient-based learning algorithms and backpropagation, each of the neural network's weights receives an update proportional to the partial derivative of the error function with respect to that weight. The problem is that in some cases the gradient value can be so small that it effectively prevents the weight from changing its value. One of the causes of this problem is the use of traditional activation functions such as sigmoid and hyperbolic tangent. The derivatives of these functions never exceed 1 (the sigmoid's derivative peaks at 0.25) and are close to zero on the majority of the function's domain. And since the error's partial derivatives are calculated using the chain rule, for an *n*-layer network there will be *n* multiplications of these small numbers, meaning the gradient decreases exponentially with *n*. As a result, the "front" layers of a deep network train very slowly, if at all.

The ReLU function is defined as *f*(*x*)=**max**(0, *x*). Its biggest advantage is that it has constant derivative equal to 1 for values of *x* greater than zero. As the result, it allows better gradient propagation, which speeds up training of deeper artificial neural networks. Also, it is more computationally efficient, making it faster to compute in comparison with sigmoid or hyperbolic tangent.
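A rough numeric illustration of the difference, under a simplifying assumption: only the activation derivatives along a chain of layers are multiplied, and the weights (which also enter the chain rule) are ignored. The function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)


def relu_grad(x):
    return 1.0 if x > 0 else 0.0


layers = 10

# Best case for sigmoid: every neuron sits at x = 0, where the derivative peaks
sigmoid_chain = sigmoid_grad(0.0) ** layers

# ReLU with positive inputs passes the gradient through unchanged
relu_chain = relu_grad(1.0) ** layers

print(sigmoid_chain, relu_chain)
```

Even in the sigmoid's best case the product shrinks by a factor of 4 per layer, while the ReLU product stays at 1 for active neurons.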

*Figure: plots of the ReLU function and the sigmoid function.*

Although ReLU function does have some potential problems as well, so far it looks like the most successful and widely-used activation function when it comes to deep neural networks.

#### Pooling layer

It is a common practice to follow a convolutional layer with a pooling layer. The objective of this layer is to down-sample the input feature maps produced by the previous convolutions. By reducing the spatial size of inputs, we also reduce the number of parameters and the amount of computation in the neural network. This also helps in controlling overfitting: fewer parameters means less chance to overfit.

The most common pooling technique is **MAX** pooling with a 2x2 filter and stride 2. For an *n*x*m* input feature map, it produces an *n*/2x*m*/2 map by replacing every 2x2 region in the input with a single value, the maximum of the 4 values in that region. These regions don't overlap, but are adjacent to each other, since the filter is moved horizontally and vertically with a step size (stride) equal to its size. Below is an example of applying **MAX** pooling to a 6x6 input (colored cells highlight source values of the MAX operator and the corresponding result).

**MAX** pooling is not the only pooling technique. Another common one is **Average** pooling, which calculates average values of the source regions instead of taking their maximum value.

Pooling layers also can be configured with different size of the filter and stride value. For example, some applications use 3x3 filter with stride 2. Such configuration creates an overlapping pattern of pooling regions, since the filter's step size is smaller than its size. Making stride value greater than filter size is uncommon however, since some features may get lost completely.

One important thing to mention about pooling layers is that they operate with 2D feature maps and don't affect depth of the input. So, if input contains 10 feature maps produced by previous convolutional layer, for example, the pooling is applied individually to each map. As the result, it produces same number of feature maps, but smaller in size.
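A minimal sketch of MAX pooling on a single 2D feature map stored as a list of rows. The function name `max_pool` and the 4x4 sample values are made up for illustration; for an input with several feature maps, the same function would simply be applied to each map separately.

```python
def max_pool(feature_map, size=2, stride=2):
    """Down-sample one 2D feature map by taking the maximum of each window."""
    h, w = len(feature_map), len(feature_map[0])
    pooled = []
    for y in range(0, h - size + 1, stride):
        row = []
        for x in range(0, w - size + 1, stride):
            row.append(max(feature_map[y + i][x + j]
                           for i in range(size) for j in range(size)))
        pooled.append(row)
    return pooled


# Hypothetical 4x4 feature map
feature_map = [[1, 3, 2, 0],
               [4, 2, 1, 5],
               [0, 1, 7, 2],
               [3, 2, 1, 8]]

pooled = max_pool(feature_map)
print(pooled)
```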

#### Building convolutional neural network

As we now have the most common building blocks, we can put them together into a convolutional neural network. Although there are some network architectures which are based entirely on convolutional layers, this is a rare case. Most of the time convolutional networks only start with convolutional layers, which perform initial feature extraction, and are then followed by fully connected layers, which perform the final classification.

As an example, below is the architecture of LeNet-5 convolutional neural network, which was first described by Yann LeCun and applied to classification of hand-written digits. It takes a 32x32 grayscale image as its input and produces a vector of 10 values – probabilities of belonging to certain class (digits from 0 to 9). The table below summarizes the architecture of the network, dimensions of layers’ outputs and number of trainable parameters (weights + biases).

| **Layer type** | **Trainable parameters** | **Output size** |
|---|---|---|
| Input image | | 32x32x1 |
| Convolution layer 1, 6 kernels of 5x5 in size, ReLU activation | 156 | 28x28x6 |
| MAX pooling 1 | | 14x14x6 |
| Convolution layer 2, 16 kernels of 5x5 in size, ReLU activation | 2416 | 10x10x16 |
| MAX pooling 2 | | 5x5x16 |
| Convolution layer 3, 120 kernels of 5x5 in size | 48120 | 1x1x120 |
| Fully Connected layer 1, 120 inputs, 84 outputs, Sigmoid activation | 10164 | 84 |
| Fully Connected layer 2, 84 inputs, 10 outputs, SoftMax activation | 850 | 10 |

With only 61706 trainable parameters, the structure of the above convolutional neural network is very simple. Note that each kernel spans the full depth of its input, so convolution layer 2, for example, has 16 kernels of 5x5x6 weights plus 16 bias values, which is 2416 parameters. These days much more complicated deep networks are being developed, which include many millions of parameters to train.

### Training convolutional network

So far we've discussed only the inference part of convolutional neural network, which is calculating its output for a given input. However, the network needs to be trained first to get something meaningful out of it. When it comes to convolution operator in image processing, the kernels there are usually handcrafted and serve specific purpose. Some kernels are used to find objects' edges, some for making pictures sharper or blurry, etc. Very often it is a time-consuming process to design right kernel to perform the task needed. With convolutional neural networks it is all different, however. When designing such network, we think about number of layers, number and size of convolutions done, etc. But we don't set those convolution kernels. Instead, the network will learn those during the training phase, since essentially those kernels are nothing more but weights – same as we have them in fully connected layers.

Training of convolutional artificial networks is done using exactly the same algorithms as used for training of fully connected networks: stochastic gradient descent and backpropagation. As demonstrated in the previous article, to calculate partial derivatives of the neural network's error with respect to its weights we can use the chain rule. It allows us to define complete equations for the weight updates of any trainable layer. However, this time we'll concentrate more on the error back propagation side of things, and instead of providing one big equation containing all parts of the chain rule, we'll provide smaller equations, which are specific to each building block of a neural network: fully connected and convolutional layers, activation functions, cost functions, etc.

If we revisit the chain rule from the previous article, we'll notice that every building block of a neural network calculates its error gradient as the partial derivative of its outputs with respect to its inputs and multiplies it by the error gradient coming from the block following it. Remember we are moving backward, so calculations start at the last block and flow toward the first block. The last block in the training phase is always a cost function, so it computes the error gradient as the derivative of the cost (its output) with respect to the neural network's output (the input of the cost function). This can be defined as follows:
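Writing *E* for the value of the cost function and $y_i$ for the *i*-th output of the network (symbols assumed here, which may differ from the notation of the previous article), the gradient the cost block passes backward is the vector of partial derivatives:

$$
\nabla_{y} E = \left( \frac{\partial E}{\partial y_1},\; \frac{\partial E}{\partial y_2},\; \ldots,\; \frac{\partial E}{\partial y_n} \right)
$$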

All other building blocks take the error gradient from the next block and multiply it with partial derivatives of their own outputs with respect to inputs.
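For a block with inputs $x_j$ and outputs $y_i$ (again, assumed symbols, with *E* denoting the cost), this is the chain rule in its general form:

$$
\frac{\partial E}{\partial x_j} = \sum_{i} \frac{\partial E}{\partial y_i} \cdot \frac{\partial y_i}{\partial x_j}
$$

The left-hand side is what the block hands to the block before it; the first factor on the right is what it received from the block after it.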

Before describing derivatives of the new building blocks, which we are going to use for convolutional networks, let's revisit derivatives of the building blocks we've used for fully connected networks, but written in the new notation. First, we start with the error gradient of the MSE cost function with respect to outputs of the network (*y*_{i} – outputs produced by the network, *t*_{i} – target outputs):
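Assuming the cost is defined with a factor of one half (a common convention; a different scaling, such as dividing by the number of outputs, only changes the gradient by a constant factor):

$$
E = \frac{1}{2} \sum_{i} (y_i - t_i)^2, \qquad \frac{\partial E}{\partial y_i} = y_i - t_i
$$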

Now, when the error gradient passes backward through a sigmoid activation function, it is recalculated as the gradient from the next block (whatever it is: a cost function or another layer in a multi-layer network) multiplied by the sigmoid's derivative (*o*_{i} here is the output of the sigmoid):
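With $x_i$ denoting the input of the sigmoid and *E* the cost (assumed symbols), and using the fact that the sigmoid's derivative can be expressed through its own output:

$$
\frac{\partial E}{\partial x_i} = \frac{\partial E}{\partial o_i} \cdot o_i \, (1 - o_i)
$$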

Alternatively, if hyperbolic tangent is used as the activation function, its derivative is used instead:
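With $o_i$ now being the output of the hyperbolic tangent and $x_i$ its input (assumed symbols, *E* being the cost):

$$
\frac{\partial E}{\partial x_i} = \frac{\partial E}{\partial o_i} \cdot \left( 1 - o_i^{2} \right)
$$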

Now we need to propagate the error gradient backward through a fully connected layer. Since every input is connected to every output, we get a sum of partial derivatives (*n* is the number of neurons in the fully connected layer, *j* is the input's index, *i* is the output's/neuron's index):

Since a fully connected layer is a trainable layer, it needs not only to pass the error's gradient backward to the previous building block/layer, but also to update its weights. Using the naming convention defined above, the update rule for weights and biases can be written as below (classical SGD):
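Assuming the layer computes $y_i = \sum_j w_{i,j} x_j + b_i$, where $w_{i,j}$ is the weight connecting input *j* to neuron *i* (the symbols *w*, *b*, *x* and *E* for the cost are assumptions made here), the derivative of output *i* with respect to input *j* is just that weight:

$$
\frac{\partial E}{\partial x_j} = \sum_{i=1}^{n} \frac{\partial E}{\partial y_i} \cdot w_{i,j}
$$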
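Assuming the layer computes $y_i = \sum_j w_{i,j} x_j + b_i$ and writing $\lambda$ for the learning rate (assumed symbols, with *E* denoting the cost), each parameter moves against its own partial derivative:

$$
w_{i,j} \leftarrow w_{i,j} - \lambda \, \frac{\partial E}{\partial y_i} \, x_j, \qquad
b_i \leftarrow b_i - \lambda \, \frac{\partial E}{\partial y_i}
$$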

All of the equations above are a quick repetition of the back propagation from the previous article. Why was it important? Well, first, to remind us of the basics. Second, to rewrite it in a different way, where each building block defines its own equation for back propagation of the error's gradient, which is independent of the other blocks. The way the weights' update equation was given in the previous article helps to understand the basics and how the chain rule works. But being one single equation makes it not generic at all. What if we need a different cost function instead of MSE? What if we need hyperbolic tangent or ReLU activation instead of sigmoid? The way it is presented in this article is more flexible: it allows mixing building blocks of artificial neural networks in various ways and training them without assumptions about which layer is followed by which activation and which cost function is in use (well, more or less). Plus, this presentation is more in sync with the actual C++ implementation, where different building blocks are implemented as separate classes, taking care of their own calculations for the forward pass and backward pass during training.
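The block-by-block idea can be sketched in plain Python. This is a toy illustration, not the ANNT library's C++ API: the class names `FullyConnected` and `Sigmoid`, the helper functions, the starting weights and the learning rate are all made up here, and the MSE cost is assumed to carry a factor of one half.

```python
import math


class Sigmoid:
    def forward(self, x):
        self.out = [1.0 / (1.0 + math.exp(-v)) for v in x]
        return self.out

    def backward(self, grad_out):
        # Gradient from the next block times the sigmoid's derivative o * (1 - o)
        return [g * o * (1.0 - o) for g, o in zip(grad_out, self.out)]


class FullyConnected:
    def __init__(self, weights, biases):
        self.w = weights   # w[i][j]: weight from input j to neuron i
        self.b = biases

    def forward(self, x):
        self.x = x
        return [sum(wij * xj for wij, xj in zip(wi, x)) + bi
                for wi, bi in zip(self.w, self.b)]

    def backward(self, grad_out, rate):
        # Gradient for the previous block, computed before the weights change
        grad_in = [sum(g * wi[j] for g, wi in zip(grad_out, self.w))
                   for j in range(len(self.x))]
        # Classical SGD update of this block's own parameters
        for i, g in enumerate(grad_out):
            self.w[i] = [wij - rate * g * xj for wij, xj in zip(self.w[i], self.x)]
            self.b[i] -= rate * g
        return grad_in


def mse(y, t):
    return 0.5 * sum((yi - ti) ** 2 for yi, ti in zip(y, t))


def mse_gradient(y, t):
    return [yi - ti for yi, ti in zip(y, t)]


layer = FullyConnected([[0.5, -0.5], [0.25, 0.75]], [0.0, 0.0])
activation = Sigmoid()
x, t = [1.0, 2.0], [1.0, 0.0]


def train_step():
    y = activation.forward(layer.forward(x))
    cost = mse(y, t)
    grad = mse_gradient(y, t)         # the cost function starts the backward pass
    grad = activation.backward(grad)  # each block only needs the next block's gradient
    layer.backward(grad, rate=0.5)
    return cost


costs = [train_step() for _ in range(20)]
print(costs[0], costs[-1])
```

Swapping `Sigmoid` for another activation, or `mse_gradient` for another cost gradient, would not require touching `FullyConnected`, which is the point of defining the backward pass per block.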

**Note**: If all the above is not clear, however, it is recommended to go through the previous article.

#### Cross-entropy cost function

One of the most common uses of convolutional neural networks is image classification. Given an image, a network needs to classify it into one of the mutually exclusive classes. For example, it can be hand written digits classification, where we have 10 possible classes corresponding to digits from 0 to 9. Or a network can be trained to recognize objects like car, truck, ship, airplane, etc., and so we'll have as many classes as we have types of objects. The main point in this type of classification is that each input image must belong to one class only, i.e. we cannot have objects which are classified as both car and airplane.

When dealing with multi class classification problems, the designed artificial neural network has as many outputs, as the number of classes we have. On the training phase, target outputs are one-hot encoded, i.e. represented with vector of zeros with only one element set to value '1' at the index corresponding to the class. For example, for a task of 4-class classification, our target outputs may look something like this: {0, 1, 0, 0} – class 2, {0, 0, 0, 1} – class 4, etc. None of the target outputs are allowed to have multiple elements set to '1' or another non-zero value. This can be viewed as target probabilities, i.e. the {0, 1, 0, 0} output means that the presented input belongs to class 2 with 100% probability and to other classes with probability of 0%.

During training, however, the actual outputs of the neural network will look different. It may provide an output like {0.3, 0.35, 0.25, 0.1}, for example. Such output may have different meanings. For a trained network, it may mean the network was presented with a tricky example, which is not very clear but looks more like class 2, with the highest probability of 35%. Or, if we just started training, it may mean little at all, other than "keep going".

And so, we need a cost function, which tells us the amount of difference between the target and the real output and directs the update of the neural network's parameters. When it comes to probabilistic models over mutually exclusive classes, we deal with predicted and ground-truth probabilities. In such cases, the common choice is the cross-entropy cost function, which has its roots in information theory. By minimizing cross-entropy, we minimize the amount of extra data (bits) required for encoding some events appearing with probability distribution *t*_{i} (target or real distribution) using some estimated probabilities *y*_{i} (which might be close, but not exactly the same). And to minimize the cross-entropy, we need to make our estimated probabilities the same as the real probabilities, which is what we are looking for.

The cross-entropy cost function, the value we need to minimize, is defined as below (same as before – *t*_{i} are target outputs, while *y*_{i} is the output provided by neural network):
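In its standard form, with *E* denoting the cost (an assumed symbol) and the natural logarithm used:

$$
E = - \sum_{i} t_i \, \ln(y_i)
$$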

Getting its derivative, the gradient of the cost function with respect to neural network's output is then calculated as:
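Assuming the standard definition $E = -\sum_i t_i \ln(y_i)$, with *E* denoting the cost (an assumed symbol), differentiating with respect to a single output gives:

$$
\frac{\partial E}{\partial y_i} = - \frac{t_i}{y_i}
$$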

Now we have the cross-entropy cost function instead of MSE and so we can move to other building blocks and see how error gradient propagates backward.

#### SoftMax activation function

For the last layer's activation function of the neural network used for classification problem we could use the sigmoid function, which we've already seen in the previous article and quickly repeated above. Its output is in the (0, 1) range and so can be interpreted as probabilities between 0% and 100%. When neural network is trained with sigmoid in the output layer, it really may provide probabilities close to the ground truth. However, since we deal with mutually exclusive classes, it may not always make perfect sense. For example, provided a challenging example, a network may provide an output vector like this: {0.6, 0.55, 0.1, 0.1}. Yes, looks like class 1 with probability of 60%! But probability of the class 2 is not too far away. And another problem is that if we sum the four probabilities we've got, we get 1.35, which is 135%.

There are two problems we want to address. First, we definitely want to have sum of probabilities equal to 100%. Not more, not less. Also, if we get a tricky example, which looks like class 1, but also seems close to class 2, can we really have a high certainty of 60% that the classification is right?

To resolve the two issues above, we can use a different activation function, which is SoftMax. Same as sigmoid, it provides output in the (0, 1) range. But unlike sigmoid, it does not operate on single values of the input vector, but on the entire vector, and so makes sure the sum of the output vector equals to 1. The SoftMax function is defined the next way:
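In its standard form, with $x_i$ denoting the inputs of the function and $y_i$ its outputs (assumed symbols):

$$
y_i = \frac{e^{x_i}}{\sum_{k} e^{x_k}}
$$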

If we use the SoftMax function instead of sigmoid for the above example (the inverse sigmoid recovers the source input values, roughly {0.405, 0.201, -2.197, -2.197}), the output vector looks different and makes more sense: approximately {0.509, 0.415, 0.038, 0.038}. As we can see, the sum of all values equals 1, which is 100%. And even though the 1^{st} class still wins, its probability is lower than the 60% suggested by sigmoid, only about 51%, with the 2^{nd} class close behind at about 41.5%.
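The example can be checked with a minimal sketch of SoftMax over a plain vector. The names `Logit` and `SoftMax` are hypothetical helpers written for this illustration, not the ANNT library's implementation; subtracting the maximum input before exponentiation is a common numerical-stability trick and does not change the result.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Inverse of the sigmoid: recovers the input x from an output p = 1 / (1 + exp(-x)).
// Hypothetical helper used only to rebuild the inputs of the example.
inline float Logit( float p )
{
    assert( ( p > 0.0f ) && ( p < 1.0f ) );
    return std::log( p / ( 1.0f - p ) );
}

// SoftMax over the entire input vector: y[i] = exp(x[i]) / sum_j exp(x[j]).
// The maximum input is subtracted first to keep exp() from overflowing.
inline std::vector<float> SoftMax( const std::vector<float>& x )
{
    assert( !x.empty( ) );

    const float maxValue = *std::max_element( x.begin( ), x.end( ) );
    std::vector<float> y( x.size( ) );
    float sum = 0.0f;

    for ( std::size_t i = 0; i < x.size( ); i++ )
    {
        y[i] = std::exp( x[i] - maxValue );
        sum += y[i];
    }
    for ( float& value : y )
    {
        value /= sum;
    }

    return y;
}
```

Feeding the inverse-sigmoid values of {0.6, 0.55, 0.1, 0.1} through this function gives roughly {0.509, 0.415, 0.038, 0.038}. Applying it to the vector {0.6, 0.55, 0.1, 0.1} itself gives roughly {0.316, 0.301, 0.192, 0.192}. In both cases the outputs sum to 1.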

As for any other activation function, we need to define gradient back propagation equation for the SoftMax function. Here it is:
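One common way to write this gradient, using the chain rule over all outputs, is *δ*_{i}^{(k)} = *y*_{i} · (*δ*_{i}^{(k+1)} − Σ_{j} *y*_{j} *δ*_{j}^{(k+1)}). The sketch below implements that form. The function name `SoftMaxBackward` is hypothetical, and the library may organize the computation differently.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// y     - SoftMax outputs calculated on the forward pass
// delta - error gradient with respect to those outputs (coming from the next block)
// Returns the error gradient with respect to the SoftMax inputs.
inline std::vector<float> SoftMaxBackward( const std::vector<float>& y,
                                           const std::vector<float>& delta )
{
    assert( y.size( ) == delta.size( ) );

    // sum_j y[j] * delta[j] is shared by all inputs, so it is calculated once
    float weightedSum = 0.0f;
    for ( std::size_t j = 0; j < y.size( ); j++ )
    {
        weightedSum += y[j] * delta[j];
    }

    std::vector<float> inputGrad( y.size( ) );
    for ( std::size_t i = 0; i < y.size( ); i++ )
    {
        inputGrad[i] = y[i] * ( delta[i] - weightedSum );
    }

    return inputGrad;
}
```

For a one-hot target and a cross-entropy gradient of the form −*t*_{i}/*y*_{i}, this expression collapses to *y*_{i} − *t*_{i}, which is why the two steps are often fused in practice.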

Going now further backward through the LeNet-5 neural network's architecture, we see fully connected layers and sigmoid activation function. Equations for both were already defined above. So now it is time to address the other building blocks introduced in this article.

#### ReLU activation function

As was already mentioned above, the ReLU activation function became a very popular choice for deeper neural networks, as it allows much better propagation of the error gradient through the network. This is due to its constant gradient equal to 1 for input values greater than zero. To complete ReLU activation, we also need to define its equation for gradient back propagation.
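In words, the gradient passes through unchanged where the input was positive and is blocked everywhere else. A minimal sketch of that rule follows. The name `ReLuBackward` is hypothetical, and using a zero gradient at exactly *x* = 0 is a convention assumed here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// input - values the ReLU received on the forward pass
// delta - error gradient with respect to ReLU outputs (coming from the next block)
// Returns the error gradient with respect to ReLU inputs.
inline std::vector<float> ReLuBackward( const std::vector<float>& input,
                                        const std::vector<float>& delta )
{
    assert( input.size( ) == delta.size( ) );

    std::vector<float> inputGrad( input.size( ) );
    for ( std::size_t i = 0; i < input.size( ); i++ )
    {
        // derivative of ReLU is 1 for positive inputs and 0 otherwise
        inputGrad[i] = ( input[i] > 0.0f ) ? delta[i] : 0.0f;
    }

    return inputGrad;
}
```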

#### Pooling layer

Now it is time to propagate the error gradient backward through the pooling layer. To keep it simple, let's suppose we use a 2x2 kernel with stride 2 and we don't use input padding (we apply pooling to valid locations only). This means every value of the output feature map is calculated from 4 values of the input feature map.

Although pooling layers assume that input vectors represent 2D data, the math below works with inputs/outputs as 1D vectors. To make it all work, we'll define an *i2j*() function, which for the given index *i* of the input vector returns the corresponding index *j* of the output vector. Since each output is calculated from 4 input values, there are 4 input indexes for which *i2j*() returns the same output index.

Let's start with **Max Pooling**. To define the equation for error gradient back propagation, we'll need one extra thing. On the forward pass, when the neural network's output is calculated, the pooling layer also fills in the *maxIndexes* vector, which has the same length as the output vector. While the output vector holds the maximum of the corresponding input values, the *maxIndexes* vector holds the index of the input that provided that maximum. With all the above, we can define the gradient back propagation equation for the Max Pooling layer:

As for **Average Pooling**, it is even simpler: the error gradient coming from the next block is simply divided by the size of the pooling kernel, which is 4 in our case:
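Both rules can be sketched on flat vectors for the 2x2, stride 2 case. In this sketch indexes are 0-based, the input width and height are assumed to be even, and the names `I2J`, `MaxPoolForward`, `MaxPoolBackward` and `AvgPoolBackward` are hypothetical, not the library's classes. Max pooling routes the gradient only to the input that held the maximum, while average pooling gives every input a quarter of it.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Maps a flat input index to the flat output index for 2x2 pooling with stride 2
inline std::size_t I2J( std::size_t i, std::size_t inputWidth )
{
    const std::size_t row = i / inputWidth;
    const std::size_t col = i % inputWidth;
    return ( row / 2 ) * ( inputWidth / 2 ) + ( col / 2 );
}

// Forward pass of max pooling, which also records the index of each maximum
inline std::vector<float> MaxPoolForward( const std::vector<float>& input, std::size_t inputWidth,
                                          std::vector<std::size_t>& maxIndexes )
{
    assert( ( inputWidth % 2 == 0 ) && ( input.size( ) % ( 2 * inputWidth ) == 0 ) );

    std::vector<float> output( input.size( ) / 4, std::numeric_limits<float>::lowest( ) );
    maxIndexes.assign( output.size( ), 0 );

    for ( std::size_t i = 0; i < input.size( ); i++ )
    {
        const std::size_t j = I2J( i, inputWidth );
        if ( input[i] > output[j] )
        {
            output[j]     = input[i];
            maxIndexes[j] = i;
        }
    }
    return output;
}

// Max pooling: the gradient goes only to the input that provided the maximum
inline std::vector<float> MaxPoolBackward( const std::vector<float>& delta,
                                           const std::vector<std::size_t>& maxIndexes,
                                           std::size_t inputWidth )
{
    assert( delta.size( ) == maxIndexes.size( ) );

    std::vector<float> inputGrad( delta.size( ) * 4, 0.0f );
    for ( std::size_t i = 0; i < inputGrad.size( ); i++ )
    {
        const std::size_t j = I2J( i, inputWidth );
        if ( maxIndexes[j] == i )
        {
            inputGrad[i] = delta[j];
        }
    }
    return inputGrad;
}

// Average pooling: each of the 4 inputs receives a quarter of the gradient
inline std::vector<float> AvgPoolBackward( const std::vector<float>& delta, std::size_t inputWidth )
{
    std::vector<float> inputGrad( delta.size( ) * 4 );
    for ( std::size_t i = 0; i < inputGrad.size( ); i++ )
    {
        inputGrad[i] = delta[I2J( i, inputWidth )] / 4.0f;
    }
    return inputGrad;
}
```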

#### Convolutional layer

Finally, it is time to define back propagation pass for convolutional layer. It is not much different from fully connected layer as long as the fact of shared weights is kept in mind.

Let's then start with weights update of the convolutional layer. With fully connected layers it was simple – partial derivative of error with respect to weight *w*_{i,j} equals to error gradient coming from the next block multiplied by corresponding input value – *δ*_{i}^{(k+1)}x_{j}. The reason for this is that each input/output connection is assigned its own weight in fully connected layer, which is not shared. However, it is not the case in convolutional layer. The picture below demonstrates that every weight of convolution kernel is used for many input/output connections. In the example below, the highlighted kernel's weights are used 9 times each – the kernel is applied in 9 different positions within the input image. And so, the partial derivative of error with respect to weight will need to have 9 terms as well – the number of times the weight is used.

Same as with pooling layers, we'll ignore here the fact that convolutional layers deal with 2D/3D data. Instead we'll assume that inputs/outputs/kernels are plain vectors/arrays for now (this is what they end up as in C++ anyway). And so, for the example above, the 1^{st} kernel's weight (highlighted in red) is applied to inputs {1, 2, 3, 5, 6, 7, 9, 10, 11}, while the 4^{th} weight is applied to inputs {6, 7, 8, 10, 11, 12, 14, 15, 16}. Suppose that for every weight we have such a vector of input indexes, which we'll name *weightInputs*_{i}, the inputs of the *i*^{th} weight. Also, we'll define a function of two arguments *i2o(i,j)*, which provides the index of the output value for the *i*^{th} weight and *j*^{th} input. Here are a few examples for the picture above: i2o(1,1)=1, i2o(4,6)=1, i2o(1,11)=9 and i2o(4,16)=9. With the above naming convention, the weights' update rule for a convolutional layer can then be defined the following way:

Does the above make sense? Well, the more you think about it, the more it will. All we do is take error gradients for all the outputs (since each kernel's weight is used to calculate all outputs) and multiply them by the corresponding inputs. Yes, we have multiple kernels. But they are all applied in the same pattern, so even though we'll need to update weights of different kernels, the *weightInputs* vectors stay the same. However, the *i2o(i,j)* is specific to each kernel. Or it can be extended with an extra parameter, the kernel index.

Updating bias value is much simpler. Since each kernel/bias is used to calculate every output value, we'll just sum all error gradients for the feature map produced by that kernel.

**Note**: both equations above are done per feature map/kernel, i.e. weights and bias value are not parameterized there with kernel index.
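The two rules can be sketched for a single kernel applied to a single input feature map, with stride 1 and no padding. Instead of storing *weightInputs* and *i2o()* explicitly, the sketch below derives the same index pairs from 2D loop counters. The names `ConvGradients` and `ConvWeightGradients` are hypothetical, indexes are 0-based and the kernel is assumed to be square.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Gradients of the error with respect to one kernel's weights and its bias
struct ConvGradients
{
    std::vector<float> weights;
    float              bias;
};

// input - flat input feature map (inputWidth x inputHeight)
// delta - flat error gradient of the output feature map (coming from the next block)
inline ConvGradients ConvWeightGradients( const std::vector<float>& input,
                                          std::size_t inputWidth, std::size_t inputHeight,
                                          std::size_t kernelSize,
                                          const std::vector<float>& delta )
{
    assert( ( kernelSize >= 1 ) && ( kernelSize <= inputWidth ) && ( kernelSize <= inputHeight ) );

    const std::size_t outputWidth  = inputWidth  - kernelSize + 1;
    const std::size_t outputHeight = inputHeight - kernelSize + 1;

    assert( input.size( ) == inputWidth * inputHeight );
    assert( delta.size( ) == outputWidth * outputHeight );

    ConvGradients grad{ std::vector<float>( kernelSize * kernelSize, 0.0f ), 0.0f };

    for ( std::size_t oy = 0; oy < outputHeight; oy++ )
    {
        for ( std::size_t ox = 0; ox < outputWidth; ox++ )
        {
            const float d = delta[oy * outputWidth + ox];

            // the bias takes part in every output, so it sums all gradients
            grad.bias += d;

            // each weight collects gradient times the input it was applied to
            for ( std::size_t ky = 0; ky < kernelSize; ky++ )
            {
                for ( std::size_t kx = 0; kx < kernelSize; kx++ )
                {
                    grad.weights[ky * kernelSize + kx] +=
                        d * input[( oy + ky ) * inputWidth + ( ox + kx )];
                }
            }
        }
    }

    return grad;
}
```

For a 4x4 input and a 2x2 kernel, each weight accumulates 9 terms, one for every position the kernel is applied in.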

Now it is time to get the final equation for the convolutional layer, which is for propagating the error gradient backward through the network. This means calculating partial derivatives of error with respect to inputs of the layer. Each input element can be used multiple times to produce output values of a feature map. It can be used as many times as the number of elements in the convolution kernel (number of weights). Some inputs are used for only one output, though. For example, those are the inputs in the corners of the input 2D feature map. But then we also need to keep in mind that every input feature map can be processed multiple times with different kernels, which generate more output maps. Again, let's pretend it is all flat for now, no 2D/3D indexing. Then, let's assume we have another set of helper vectors named *inputOutputs*_{i}, keeping indexes of the outputs which the *i*^{th} input contributes to. Finally, we'll need the *i2w(i, j)* function, which provides the index of the weight used to connect the *i*^{th} input with the *j*^{th} output. Here are a few examples again for the above picture: i2w(1, 1)=1, i2w(6, 1)=4, i2w(16, 9)=4. With all this, we can define the equation for propagating the error gradient backward through the convolutional layer.
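A sketch of this rule for a single kernel and a single input feature map (stride 1, no padding) is given below. Rather than building *inputOutputs* and *i2w()* explicitly, it walks over every output and scatters its gradient back to the inputs it was calculated from, which accumulates the same sum. The name `ConvInputGradients` is hypothetical and indexes are 0-based.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// kernel - flat weights of a square kernel (kernelSize x kernelSize)
// delta  - flat error gradient of the output feature map (coming from the next block)
// Returns the error gradient with respect to the inputs of the layer.
inline std::vector<float> ConvInputGradients( const std::vector<float>& kernel, std::size_t kernelSize,
                                              std::size_t inputWidth, std::size_t inputHeight,
                                              const std::vector<float>& delta )
{
    assert( ( kernelSize >= 1 ) && ( kernelSize <= inputWidth ) && ( kernelSize <= inputHeight ) );

    const std::size_t outputWidth  = inputWidth  - kernelSize + 1;
    const std::size_t outputHeight = inputHeight - kernelSize + 1;

    assert( kernel.size( ) == kernelSize * kernelSize );
    assert( delta.size( ) == outputWidth * outputHeight );

    std::vector<float> inputGrad( inputWidth * inputHeight, 0.0f );

    for ( std::size_t oy = 0; oy < outputHeight; oy++ )
    {
        for ( std::size_t ox = 0; ox < outputWidth; ox++ )
        {
            const float d = delta[oy * outputWidth + ox];

            // every input covered by the kernel at this position receives
            // the output's gradient multiplied by the connecting weight
            for ( std::size_t ky = 0; ky < kernelSize; ky++ )
            {
                for ( std::size_t kx = 0; kx < kernelSize; kx++ )
                {
                    inputGrad[( oy + ky ) * inputWidth + ( ox + kx )] +=
                        d * kernel[ky * kernelSize + kx];
                }
            }
        }
    }

    return inputGrad;
}
```

With several kernels reading the same input feature map, the contributions computed this way for each kernel would be added together.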

Now it looks like the math is complete: we have everything we need to calculate both the forward pass through a convolutional network and the backward pass. If it still puzzles, confuses or leaves some uncertainty, go through it all again and think about it. Or dive into the code to see the relation between the math and the implementation.

## The ANNT library

Implementation of the convolutional artificial neural network in the ANNT library is heavily based on the design set by implementation of fully connected networks described in the previous article. All the core classes are left as they were, only new building blocks were implemented, which allow building them into convolutional neural networks. The new class diagram of the library is shown below – not much of a difference.

Similar to the way it was set before, new building blocks take care of calculating their output on the forward pass and propagating error gradient on the backward pass (as well as calculating initial weights' updates in the case of trainable layers). As the result, all the code for neural network training is left unchanged.

And, as in the case with the rest of the code, the new building blocks utilize SIMD instructions wherever possible to vectorize computations, as well as OpenMP to parallelize them.

### Building the code

The code comes with MSVC (2015 version) solution files and GCC make files. Using MSVC solutions is very easy – every example's solution file includes projects of the example itself and the library. So MSVC option is as easy as opening solution file of required example and hitting build button. If using GCC, the library needs to be built first and then the required sample application by running **make**.

## Usage examples

After the long discussion about the theory and math of convolutional neural networks, it is time to get to practice and actually build some networks for image classification tasks: handwritten digits and different objects like cars, trucks, ships, airplanes, etc. **Note**: none of these examples claim that the demonstrated neural network's architecture is the best for its task. In fact, none of these examples even say that artificial neural networks are the way to go. Instead, their only purpose is to demonstrate the use of the library.

**Note**: the code snippets below are only small parts of the example applications. To see the complete code of the examples, refer to the source code package provided with the article (which also includes examples for fully connected neural networks described in the previous article).

### MNIST handwritten digits classification

The first example to have a look at is classification of hand-written digits from the MNIST database. The database contains 60000 examples for neural network training and additional 10000 examples for testing of the trained network. The picture below demonstrates some of the examples of different digits to classify.

The convolutional neural network used in this example has a structure very similar to the LeNet-5 network mentioned above. The difference is that we'll use a slightly smaller network (well, actually a lot smaller, if we look at the number of weights to train), which has only one fully connected layer. Here is the structure of the network we'll use:

```
Conv(32x32x1, 5x5x6)   -> ReLU -> AvgPool(2x2)
Conv(14x14x6, 5x5x16)  -> ReLU -> AvgPool(2x2)
Conv(5x5x16,  5x5x120) -> ReLU
FC(120, 10)            -> SoftMax
```

The configuration above gives the size of input for each convolutional layer and the size and number of convolutions it performs. For the fully connected layer it gives the number of inputs and outputs. Let's create the convolutional neural network of the above structure then.

```cpp
// Each row describes one of the 16 kernels of the second convolutional layer,
// each column one of the 6 feature maps produced by the first layer
vector<bool> connectionTable( {
    true,  true,  true,  false, false, false,
    false, true,  true,  true,  false, false,
    false, false, true,  true,  true,  false,
    false, false, false, true,  true,  true,
    true,  false, false, false, true,  true,
    true,  true,  false, false, false, true,
    true,  true,  true,  true,  false, false,
    false, true,  true,  true,  true,  false,
    false, false, true,  true,  true,  true,
    true,  false, false, true,  true,  true,
    true,  true,  false, false, true,  true,
    true,  true,  true,  false, false, true,
    true,  true,  false, true,  true,  false,
    false, true,  true,  false, true,  true,
    true,  false, true,  true,  false, true,
    true,  true,  true,  true,  true,  true
} );

shared_ptr<XNeuralNetwork> net = make_shared<XNeuralNetwork>( );

net->AddLayer( make_shared<XConvolutionLayer>( 32, 32, 1, 5, 5, 6 ) );
net->AddLayer( make_shared<XReLuActivation>( ) );
net->AddLayer( make_shared<XAveragePooling>( 28, 28, 6, 2 ) );

net->AddLayer( make_shared<XConvolutionLayer>( 14, 14, 6, 5, 5, 16, connectionTable ) );
net->AddLayer( make_shared<XReLuActivation>( ) );
net->AddLayer( make_shared<XAveragePooling>( 10, 10, 16, 2 ) );

net->AddLayer( make_shared<XConvolutionLayer>( 5, 5, 16, 5, 5, 120 ) );
net->AddLayer( make_shared<XReLuActivation>( ) );

net->AddLayer( make_shared<XFullyConnectedLayer>( 120, 10 ) );
net->AddLayer( make_shared<XLogSoftMaxActivation>( ) );
```

Looking at the code above, it is quite clear how the neural network's configuration stated above is translated into the code. Except for one question: "What is the connection table we've got between the first and the second convolutional layers?" Yes, it was not mentioned in the theory part, but it is pretty easy to grasp. As we can see from the network's structure and the code, the first layer does 6 convolutions and so produces 6 feature maps, while the second layer does 16 convolutions. In some cases, it is desired to configure a layer's convolutions in such a way that they operate only on a subset of input feature maps. As the code above suggests, the first 6 convolutions of the second layer use different patterns of 3 feature maps produced by the first layer. Then the next 9 convolutions use different patterns of 4 feature maps. Finally, the last convolution uses all 6 feature maps of the first layer. This is done to reduce the number of parameters to train and also to make sure that different feature maps of the second layer are not all based on the same input feature maps.
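The effect on the number of trainable parameters can be estimated with a small sketch. It assumes that each enabled input/kernel connection owns its own kernelWidth x kernelHeight set of weights, that disabled connections carry no weights, and that there is one bias per kernel. The function name `ConvLayerParameters` is hypothetical and is not part of the library.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Counts trainable parameters of a convolutional layer. An empty connection
// table means every kernel is connected to every input feature map.
inline std::size_t ConvLayerParameters( std::size_t inputDepth,
                                        std::size_t kernelWidth, std::size_t kernelHeight,
                                        std::size_t kernelsCount,
                                        const std::vector<bool>& connectionTable = { } )
{
    std::size_t connections = inputDepth * kernelsCount;

    if ( !connectionTable.empty( ) )
    {
        assert( connectionTable.size( ) == inputDepth * kernelsCount );

        // only enabled connections have weights to train
        connections = static_cast<std::size_t>(
            std::count( connectionTable.begin( ), connectionTable.end( ), true ) );
    }

    // weights of all connections plus one bias value per kernel
    return connections * kernelWidth * kernelHeight + kernelsCount;
}
```

The table above enables 6·3 + 9·4 + 6 = 60 of the 96 possible connections. Under these assumptions the second layer has 60·25 + 16 = 1516 parameters instead of 96·25 + 16 = 2416 for a fully connected variant.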

When the convolutional network is created, we can do the same as we did with fully connected network - create a training context, specifying cost function and weights' optimizer, and then pass it all to a helper class, which runs training/validation loop and completes it with testing.

```cpp
shared_ptr<XNetworkTraining> netTraining = make_shared<XNetworkTraining>( net,
                                           make_shared<XAdamOptimizer>( 0.002f ),
                                           make_shared<XNegativeLogLikelihoodCost>( ) );

XClassificationTrainingHelper trainingHelper( netTraining, argc, argv );
trainingHelper.SetValidationSamples( validationImages, encodedValidationLabels, validationLabels );
trainingHelper.SetTestSamples( testImages, encodedTestLabels, testLabels );

trainingHelper.RunTraining( 20, 50, trainImages, encodedTrainLabels, trainLabels );
```

Below is the sample output of the application, which shows training progress and the final result - classification accuracy on the test data set. We've got 99.01% accuracy, which seems to be a good improvement over fully connected neural network from the previous article, which demonstrated 96.55% accuracy.

```
MNIST handwritten digits classification example with Convolution ANN
Loaded 60000 training data samples
Loaded 10000 test data samples
Samples usage: training = 50000, validation = 10000, test = 10000
Learning rate: 0.0020, Epochs: 20, Batch Size: 50
Before training: accuracy = 5.00% (2500/50000), cost = 2.3175, 34.324s
Epoch 1 : [==================================================] 123.060s
Training accuracy = 97.07% (48536/50000), cost = 0.0878, 32.930s
Validation accuracy = 97.49% (9749/10000), cost = 0.0799, 6.825s
Epoch 2 : [==================================================] 145.140s
Training accuracy = 97.87% (48935/50000), cost = 0.0657, 36.821s
Validation accuracy = 97.94% (9794/10000), cost = 0.0669, 5.939s
...
Epoch 19 : [==================================================] 101.305s
Training accuracy = 99.75% (49877/50000), cost = 0.0077, 26.094s
Validation accuracy = 98.96% (9896/10000), cost = 0.0684, 6.345s
Epoch 20 : [==================================================] 104.519s
Training accuracy = 99.73% (49865/50000), cost = 0.0107, 28.545s
Validation accuracy = 99.02% (9902/10000), cost = 0.0718, 7.885s
Test accuracy = 99.01% (9901/10000), cost = 0.0542, 5.910s
Total time taken : 3187s (53.12min)
```

### CIFAR10 images classification

The second example performs classification of color 32x32 images from the CIFAR-10 dataset. It contains 60000 images, of which 50000 are used for training and the other 10000 for testing. The images are divided between the following 10 classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship and truck. A few examples of those can be seen below.

As the above picture suggests, the CIFAR-10 dataset is much more complex than the MNIST hand-written digits. First, the images are color. And second, they are much less obvious. Up to the point that if I was not told it is a dog, I would not say it myself. As the result, the network's structure gets a bit bigger. Not that it becomes much deeper, but the number of performed convolutions and trained weights is growing. Below is the structure of the network:

```
Conv(32x32x3,  5x5x32, BorderMode::Same) -> ReLU -> MaxPool -> BatchNorm
Conv(16x16x32, 5x5x32, BorderMode::Same) -> ReLU -> MaxPool -> BatchNorm
Conv(8x8x32,   5x5x64, BorderMode::Same) -> ReLU -> MaxPool -> BatchNorm
FC(1024, 64) -> ReLU -> BatchNorm
FC(64, 10)   -> SoftMax
```

Translating the above neural network's structure into code gives the result below. **Note**: since ReLU(MaxPool) produces the same result as MaxPool(ReLU), we use the first, as it reduces ReLU computation by 75% (although the saving is negligible compared to the rest of the network).

```cpp
shared_ptr<XNeuralNetwork> net = make_shared<XNeuralNetwork>( );

net->AddLayer( make_shared<XConvolutionLayer>( 32, 32, 3, 5, 5, 32, BorderMode::Same ) );
net->AddLayer( make_shared<XMaxPooling>( 32, 32, 32, 2 ) );
net->AddLayer( make_shared<XReLuActivation>( ) );
net->AddLayer( make_shared<XBatchNormalization>( 16, 16, 32 ) );

net->AddLayer( make_shared<XConvolutionLayer>( 16, 16, 32, 5, 5, 32, BorderMode::Same ) );
net->AddLayer( make_shared<XMaxPooling>( 16, 16, 32, 2 ) );
net->AddLayer( make_shared<XReLuActivation>( ) );
net->AddLayer( make_shared<XBatchNormalization>( 8, 8, 32 ) );

net->AddLayer( make_shared<XConvolutionLayer>( 8, 8, 32, 5, 5, 64, BorderMode::Same ) );
net->AddLayer( make_shared<XMaxPooling>( 8, 8, 64, 2 ) );
net->AddLayer( make_shared<XReLuActivation>( ) );
net->AddLayer( make_shared<XBatchNormalization>( 4, 4, 64 ) );

net->AddLayer( make_shared<XFullyConnectedLayer>( 4 * 4 * 64, 64 ) );
net->AddLayer( make_shared<XReLuActivation>( ) );
net->AddLayer( make_shared<XBatchNormalization>( 64, 1, 1 ) );

net->AddLayer( make_shared<XFullyConnectedLayer>( 64, 10 ) );
net->AddLayer( make_shared<XLogSoftMaxActivation>( ) );
```
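The note about swapping ReLU and max pooling holds because ReLU never reverses the order of two values, so taking the maximum before or after it selects the same element. A minimal sketch on a single pooling window illustrates this. The helper names `ReLu`, `MaxPoolOfReLu` and `ReLuOfMaxPool` are hypothetical and exist only for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

inline float ReLu( float x )
{
    return std::max( x, 0.0f );
}

// ReLU is applied to every value of the window, then the maximum is taken
// (one ReLU evaluation per input value)
inline float MaxPoolOfReLu( const std::vector<float>& window )
{
    assert( !window.empty( ) );

    float result = ReLu( window[0] );
    for ( float value : window )
    {
        result = std::max( result, ReLu( value ) );
    }
    return result;
}

// The maximum is taken first, then ReLU is applied once to the pooled value
inline float ReLuOfMaxPool( const std::vector<float>& window )
{
    assert( !window.empty( ) );
    return ReLu( *std::max_element( window.begin( ), window.end( ) ) );
}
```

For a 2x2 window the second variant evaluates ReLU once instead of four times, which is where the 75% figure comes from.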

The rest of the example application follows the same pattern as set by the other classification examples - training context is created with required cost function and weights' optimizer and passed to helper class to run the training loop. Below is the example of its output.

```
CIFAR-10 dataset classification example with Convolutional ANN
Loaded 50000 training data samples
Loaded 10000 test data samples
Samples usage: training = 43750, validation = 6250, test = 10000
Learning rate: 0.0010, Epochs: 20, Batch Size: 50
Before training: accuracy = 9.91% (4336/43750), cost = 2.3293, 844.825s
Epoch 1 : [==================================================] 1725.516s
Training accuracy = 48.25% (21110/43750), cost = 1.9622, 543.087s
Validation accuracy = 47.46% (2966/6250), cost = 2.0036, 77.284s
Epoch 2 : [==================================================] 1742.268s
Training accuracy = 54.38% (23793/43750), cost = 1.3972, 568.358s
Validation accuracy = 52.93% (3308/6250), cost = 1.4675, 76.287s
...
Epoch 19 : [==================================================] 1642.750s
Training accuracy = 90.34% (39522/43750), cost = 0.2750, 599.431s
Validation accuracy = 69.07% (4317/6250), cost = 1.2472, 81.053s
Epoch 20 : [==================================================] 1708.940s
Training accuracy = 91.27% (39931/43750), cost = 0.2484, 578.551s
Validation accuracy = 69.15% (4322/6250), cost = 1.2735, 81.037s
Test accuracy = 68.34% (6834/10000), cost = 1.3218, 122.455s
Total time taken : 48304s (805.07min)
```

As mentioned above, the CIFAR-10 dataset is definitely more complex. While we managed to get up to 99% test accuracy on the MNIST dataset, here we don't get even close to it: about 91% accuracy on the training set and 68-69% on test/validation. The gap between the two also suggests the network overfits the training data. Plus, it took more than 13 hours to run the 20 epochs. Just using CPU is definitely not enough for convolutional networks.

## Conclusion

In this article we've covered the new extensions to the ANNT library, which allow building convolutional neural networks. At this point it allows building only relatively simple networks, where layers follow each other sequentially. Building more advanced popular architectures, which look more like a computational graph, is not yet supported. However, before getting there, other features need to be implemented first. As the CIFAR-10 example demonstrates, once a neural network gets bigger, it requires more computational power for training. And here, using just CPU is not enough. These days GPU support is a must when it comes to deep learning. And so, this feature gets higher priority than supporting complex networks.

As fully connected and convolutional neural networks are covered now, the following step will be to go through some common architectures of recurrent networks, which is the topic for the next article. In the meantime, all the latest code can be found on GitHub, which will get updates as the library evolves further.
