In this series of articles, we’ll show you how to use a Deep Neural Network (DNN) to estimate a person’s age from an image.

In this article – the third in the series – we’ll guide you through one of the most difficult steps in the DL pipeline: the CNN design.

The accuracy of a CNN's predictions depends directly on the CNN structure – its layer stack and parameters.

## CNN Layers

Layers are the CNN building blocks. There are many types of CNN layers; the most commonly used are: convolutional (CONV), activation (ACT), fully-connected (FC), pooling (POOL), normalization (NORM), and dropout (DROP).

CONV layers are the core building blocks after which convolutional networks are named. A convolutional layer contains a set of convolution kernels: small square matrices whose coefficients are "learnable" – that is, they are assigned during the CNN training process to optimize predictions. Training the kernels of the convolutional layers allows the network to extract local features and patterns from the images and then decide which of these are valuable for classification. A CNN consists of several sequential convolutional layers, with each subsequent layer extracting increasingly abstract features.
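To make the idea of local feature extraction concrete, here is a minimal, dependency-free sketch of a single 2D convolution (no padding, stride 1). The patch and the vertical-edge kernel are illustrative values chosen for this example; in a real CNN, the kernel coefficients would be learned during training rather than fixed.

```python
def conv2d(image, kernel):
    # Slide the kernel over the image and compute a weighted sum at each position.
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = [[0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            out[i][j] = sum(
                image[i + m][j + n] * kernel[m][n]
                for m in range(kh) for n in range(kw)
            )
    return out

# A patch with a vertical edge: dark left half, bright right half.
patch = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# A vertical-edge kernel (fixed here for illustration; a CNN learns these values).
kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]
response = conv2d(patch, kernel)  # -> [[3, 3], [3, 3]]
```

The strong, uniform response shows the kernel firing wherever the edge falls inside its window – exactly the kind of local pattern detection a CONV layer performs.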

An ACT layer applies a nonlinear function to the data. Nonlinearity is an important tool for solving classification problems because most real-world classification problems are not linearly separable. Activation layers are commonly used after convolutional and fully-connected layers. There are many activation functions: step (STEP), sigmoid (SIGM), hyperbolic tangent (TANH), rectified linear unit (ReLU), exponential linear unit (ELU), and so on. ReLU is probably the most frequently used in modern CNNs.
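The three most common of these functions are simple enough to sketch directly. The snippet below shows element-wise sigmoid, tanh, and ReLU on a few sample values:

```python
import math

def sigmoid(x):
    # Squashes any real value into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    # Squashes any real value into the range (-1, 1).
    return math.tanh(x)

def relu(x):
    # Passes positive values through unchanged; zeroes out negatives.
    return max(0.0, x)

values = [-2.0, 0.0, 2.0]
print([relu(v) for v in values])              # negative inputs are zeroed
print([round(sigmoid(v), 3) for v in values])
print([round(tanh(v), 3) for v in values])
```

ReLU's popularity comes largely from this simplicity: it is cheap to compute and does not saturate for positive inputs, which helps gradients flow during training.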

FC layers always appear at the end of a CNN. This final stack of FC layers – essentially a perceptron network – is another core component of the CNN. It receives the features extracted by the convolutional layers as its input data. During CNN training, this sub-network optimizes the weights of its neurons to maximize the predictive power of the CNN.
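One fully-connected layer reduces to a weighted sum plus a bias per output neuron, usually followed by an activation. Here is a minimal sketch with ReLU; the feature and weight values are made up purely for illustration:

```python
def dense_relu(inputs, weights, biases):
    # Each output neuron: weighted sum of ALL inputs, plus a bias, then ReLU.
    return [
        max(0.0, sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

features = [0.5, -1.0, 2.0]       # e.g. flattened features from the CONV stack
weights = [[0.1, 0.2, 0.3],       # one row of weights per output neuron
           [-0.4, 0.5, -0.6]]
biases = [0.0, 0.1]
print(dense_relu(features, weights, biases))
```

In training, it is exactly these weight and bias values that the optimizer adjusts; "fully connected" refers to each output neuron receiving every input feature.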

A POOL layer is often used between consecutive convolutional layers to decrease the spatial size of the data. NORM layers commonly follow CONV or FC layers to normalize their output to the unit range. DROP layers are typically used between FC layers to decrease the network connectivity.
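Pooling is easy to see on a small feature map. This sketch shows MAX pooling with a 2 x 2 kernel and a stride of 2 – the variant used later in our architecture – applied to a hypothetical 4 x 4 feature map:

```python
def max_pool2x2(x):
    # 2x2 MAX pooling, stride 2: keep the largest value in each 2x2 block,
    # halving each spatial dimension.
    return [
        [max(x[i][j], x[i][j + 1], x[i + 1][j], x[i + 1][j + 1])
         for j in range(0, len(x[0]), 2)]
        for i in range(0, len(x), 2)
    ]

fmap = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 8],
]
pooled = max_pool2x2(fmap)  # 4x4 -> 2x2: [[4, 2], [2, 8]]
```

Besides shrinking the data, MAX pooling keeps only the strongest response in each neighborhood, which makes the extracted features somewhat tolerant to small shifts in the input.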

## Layer Stacks

There are some common layer-stack patterns we can use to successfully solve various image classification problems. When a CNN wins an image classification challenge, its structure becomes the pattern – the standard network architecture – until another CNN outperforms the former winner. Some of the pattern-defining CNNs are LeNet, VGGNet, ResNet, GoogLeNet, and Xception. Many of these standard CNN architectures are implemented in the Keras framework as out-of-the-box classes, and we could use one of them to solve the age estimation problem. However, since this series aims to show the complete DL pipeline, we are going to design our own CNN architecture and then create the network using the Keras library.

## Our CNN Structure

Here is the CNN structure we suggest...

As you can see in the diagram, the input to the CNN is a 128 x 128 pixel grayscale image. We decided that the input image should be grayscale rather than color. Why? Because we think the features important for age estimation are not determined by color; rather, they are geometric in nature. The input size – 128 pixels square – is fairly typical for CNNs, where image sizes commonly range from 24 to 256 pixels.

The first, convolutional part of the network consists of four stacked subnets. Each subnet is a sequence of convolutional, activation, and normalization layers. The first two convolutional layers use 5 x 5 kernels; the last ones use smaller 3 x 3 kernels. The number of convolutional kernels, K, is the same for all layers. It is a parameter of the CNN, assigned at network initialization, that lets the network scale to problems of different complexity. The convolutional layers use padding to preserve the spatial dimensions of the input data. The activation layers in the convolutional subnets use the ReLU activation function.

The second and the fourth convolutional subnets include pooling layers. These layers use the MAX pooling function, 2 x 2 kernels, and 2 x 2 strides, so each pooling layer halves every spatial dimension. As the input size for the first pooling layer is 128 x 128 pixels, its output is 64 x 64 pixels. The second pooling layer has an input size of 64 x 64 pixels and an output of 32 x 32 pixels. Keep in mind that these are spatial sizes only. The third dimension – depth – is K for all input/output data in all layers.
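The size arithmetic above can be traced with a short sketch. It assumes "same" padding for the CONV layers (spatial size preserved, as stated earlier) and 2 x 2 / stride-2 pooling (size halved); the layer list is a simplified stand-in for our four subnets:

```python
def trace_sizes(input_size, stack):
    # Follow one spatial dimension through a stack of layer types.
    size = input_size
    sizes = [size]
    for layer in stack:
        if layer == "POOL":
            size //= 2       # 2x2 kernel, 2x2 stride halves each dimension
        sizes.append(size)   # CONV with "same" padding keeps the size
    return sizes

# Simplified stack: four conv subnets, the 2nd and 4th ending with pooling.
stack = ["CONV", "CONV", "POOL", "CONV", "CONV", "POOL"]
print(trace_sizes(128, stack))  # [128, 128, 128, 64, 64, 64, 32]
```

So the convolutional part hands a 32 x 32 x K volume to the fully-connected head, which is why only the pooling layers (not the padded convolutions) change the spatial size.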

The last part of the CNN is a "multilayer perceptron." It consists of two "hidden" fully-connected layers with ReLU activation functions and one output fully-connected layer. The two hidden FC layers have N neurons – another CNN parameter – and the last FC layer has X neurons, where X is the number of classes (age groups). The output of the hidden layers is normalized by the normalization layers. There is a dropout layer between the two hidden layers, which reduces their connectivity. The dropout probability is 0.5, meaning each neuron's output is dropped with a 50% probability during training – on average, half of the connections. The output FC layer uses the SOFTMAX activation. This activation is the common choice for the output layer of classification networks because it produces a probability for every class.
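A minimal softmax sketch shows how raw output scores become class probabilities; the score values below are hypothetical, standing in for the X outputs of the final FC layer:

```python
import math

def softmax(scores):
    # Subtracting the max score is a standard numerical-stability trick;
    # it leaves the resulting probabilities unchanged.
    shifted = [s - max(scores) for s in scores]
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])  # one raw score per age group
print(probs)                      # non-negative values that sum to 1
```

The largest raw score gets the largest probability, and the full output can be read as the network's confidence distribution over the age groups.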

## Next Step

In this article, we briefly discussed the main CNN layer types, their parameters, and their applications. We then designed a CNN structure for solving our age estimation problem.

The next step is to build the CNN we’ve designed using Keras.