Keras* Implementation of Siamese-like Networks

Intel

Jul 24, 2018

CPOL

This guide will help you to write complex neural networks such as Siamese networks in Keras. It also explains the procedure to write your own custom layers in Keras.

Abstract

Deep learning has revolutionized the field of machine learning. Convolutional Neural Networks (CNNs) have become very popular for solving problems related to image recognition, image reconstruction, and various other computer vision problems. Libraries such as TensorFlow* and Keras* make the programmer’s job easier. However, these libraries do not directly provide support for complex networks and uncommonly used layers. This guide will help you to write complex neural networks such as Siamese networks in Keras. It also explains the procedure to write your own custom layers in Keras.

Introduction

Person re-identification is defined as identifying if the same person exists in a given pair of images. Some of the challenges faced while tackling this problem are caused by pictures being taken from various viewpoints and variations in the light intensity that result in pictures of different people looking similar, thus creating a false positive. The Normalized X-Corr model¹ is used to solve the problem of person re-identification. This guide demonstrates a step-by-step implementation of a Normalized X-Corr model using Keras, which is a modification of a Siamese network².

Figure 1. Architectural overview of a Normalized X-Corr model.

Overview of the Normalized X-Corr Model

Arulkumar Subramaniam and his colleagues¹ propose a deep neural network that treats person re-identification as a binary classification problem. Figure 1 gives an overview of the Normalized X-Corr (normxcorr) model. First, the two images are passed through conv-pool-conv-pool layers to extract features. Because the same kind of features must be extracted from both images, the weights of the conv layers are shared (i.e., both images are passed through the same layers). After extracting the features, it is necessary to establish a similarity between them. This is done by the normalized correlation layer, a custom layer that is discussed later in this guide. This layer takes a small 5×5 patch from one feature map, slides it over a region of the other feature map, and calculates the normalized correlation given by:

normxcorr model
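For reference, a common way to write the normalized correlation between two patches is sketched below. It is reconstructed from the description in this guide and from the mean and standard deviation normalization used in the code later on, so treat it as an assumed form rather than a copy of the original figure. The symbols E and F (the two patches, each flattened to N values), μ (mean), and σ (standard deviation) are notation introduced here.

```latex
\mathrm{normxcorr}(E, F) =
  \frac{\sum_{i=1}^{N} (E_i - \mu_E)\,(F_i - \mu_F)}{(N - 1)\,\sigma_E\,\sigma_F}
```

The constant in the denominator depends on the standard deviation convention (N − 1 for the sample standard deviation, N for the population one). The layer code in this guide takes the plain dot product of the two normalized patches, which differs from this expression only by a constant factor.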

We denote the feature maps of the two images as X and Y. Considering the sizes in Figure 1, we take a 5×5 patch from X centered at (x, y) at a given depth, and normxcorr is calculated with the patches of Y centered at (a, b), where 1 <= a <= 12 and y - 2 <= b <= y + 2. Thus, for every X(x, y), 5×12 = 60 values are generated and stored along the depth of the output feature map. This is done at all 25 depths; therefore, the output dimensions are 12×37×1500 (the depth being 60×25 = 1500).
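The sizes quoted above can be checked with a few lines of arithmetic. The sketch below assumes the defaults used by the Keras layers later in this guide (stride-1 convolutions without padding and non-overlapping 2×2 max pooling); the helper names are hypothetical and only serve this illustration.

```python
def conv_valid(size, kernel):
    """Output length of a stride-1 convolution without padding."""
    return size - kernel + 1

def pool(size, window=2):
    """Output length of non-overlapping max pooling (floor division)."""
    return size // window

def feature_map_size(rows, cols, kernel=5):
    """Apply conv-pool-conv-pool to a (rows, cols) image."""
    for _ in range(2):
        rows, cols = conv_valid(rows, kernel), conv_valid(cols, kernel)
        rows, cols = pool(rows), pool(cols)
    return rows, cols

rows, cols = feature_map_size(160, 60)   # input images are 160x60
depth = 25                               # filters in the second conv layer
values_per_patch = 5 * cols              # 5 row offsets x full width
out_depth = values_per_patch * depth

print(rows, cols, values_per_patch, out_depth)  # 37 12 60 1500
```

In Keras terms (rows first), the feature maps are therefore 37×12×25 and the normalized correlation output is 37×12×1500.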

In Figure 2, the size of the image is assumed to be 8×8 for the purpose of demonstration. If we consider the 5×5 patch centered at the block marked by the red square in image 1, we calculate the normalized correlation of this patch with the patches marked by the green squares in image 2 (i.e., across the entire width of the image and with height within [3 - 2, 3 + 2], which is [1, 5]). Thus, the total number of values generated by a single patch in image 1 is the allowed width × the allowed height (i.e., 8×5 = 40). These values are stored along the depth of the output feature map. Thus, for one patch, we generate an output of 1×1×40. Considering the entire image, we would have a feature map of size 8×8×40. If the input has more than one channel, the calculated feature maps are stacked one behind the other. As a result, the height and width of the output feature map remain the same, but the depth is multiplied by the depth of the input images. Hence, an input image of 8×8×5 would generate an output feature map of 8×8×(40×5) (i.e., 8×8×200). For the patch centered at the block marked in blue, we see that we need to add padding to satisfy the criteria. In such cases, the image is padded with zeros.
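The counting in the 8×8 example can be sketched in plain Python. The function names are hypothetical, and the edge position used to show the need for padding (center row 0) is an assumption, since the exact location of the blue block is only visible in the figure.

```python
def search_rows(center_row, half_height=2):
    """Rows of image 2 compared against a patch centered at center_row."""
    return list(range(center_row - half_height, center_row + half_height + 1))

def values_per_patch(width, half_height=2):
    """Correlation values for one patch: full width x allowed rows."""
    return width * (2 * half_height + 1)

def needs_padding(center_row, height, half_height=2):
    """True if the search window leaves the image at the top or bottom."""
    window = search_rows(center_row, half_height)
    return window[0] < 0 or window[-1] > height - 1

print(search_rows(3))                 # rows 1 to 5 for the red patch
print(values_per_patch(8))            # values per patch for one channel
print(values_per_patch(8) * 5)        # output depth for an 8x8x5 input
print(needs_padding(0, height=8))     # a patch at the top edge needs zeros
```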

After the Normalized-X-Corr layer, two conv layers and pooling have been added to concisely incorporate greater context information. On top of it, two fully connected layers are added and a softmax activation function is applied.

More information about the architecture is available in the paper “Deep Neural Networks with Inexact Matching for Person Re-Identification.”

Figure 2. Demonstrating the operation of the normalized correlation layer.

Diving into the Code

The code below was tested on Intel® AI DevCloud. The following libraries and frameworks were also used: Python* 3 (February 2018 version), Keras* (version 2.1.2), Intel® Optimization for TensorFlow* (version 1.3.0), NumPy (version 1.14.0).

import keras 
import sys 
from keras import backend as K 
from keras.layers import Conv2D, MaxPooling2D, Dense, Input, Flatten
from keras.models import Model, Sequential 
from keras.engine import InputSpec, Layer 
from keras import regularizers 
from keras.optimizers import SGD, Adam 
from keras.utils.conv_utils import conv_output_length 
from keras import activations 
import numpy as np

These are the imports from Keras and other libraries that we need to implement this model.

a = Input((160,60,3)) 
b = Input((160,60,3))

These create placeholders for the input images.

model = Sequential() 
model.add(Conv2D(kernel_size = (5,5), filters = 20,input_shape = (160,60,3), activation = 'relu')) 
model.add(MaxPooling2D((2,2))) 
model.add(Conv2D(kernel_size = (5,5), filters = 25, activation = 'relu')) 
model.add(MaxPooling2D((2,2)))

These are the layers that need to be shared between the images. Therefore, we create a model of these layers.

feat_map1 = model(b) 
feat_map2 = model(a)

model(a) passes its input through the model and returns the output tensor. This is done for both inputs so that they share the same model (and therefore the same weights) and produce two feature maps, feat_map1 and feat_map2.

normalized_layer = Normalized_Correlation_Layer(stride = (1,1), patch_size = (5, 5))([feat_map1, feat_map2])

This is the custom layer that establishes a similarity between the feature maps extracted from the images. We pass the feature maps as a list input. Its implementation is mentioned later in this guide.

final_layer = Conv2D(kernel_size=(1,1), filters=25, activation='relu')(normalized_layer)
final_layer = Conv2D(kernel_size=(3,3), filters=25, activation=None)(final_layer)
final_layer = MaxPooling2D((2,2))(final_layer)
# Flatten the 4-D feature map so that the Dense layers output one vector per image pair.
final_layer = Flatten()(final_layer)
final_layer = Dense(500)(final_layer)
final_layer = Dense(2, activation="softmax")(final_layer)

These are layers that are added on top of the normalized correlation layer.

x_corr_mod = Model(inputs=[a,b], outputs = final_layer)

Finally, a new model is created with inputs as the images to be passed as a list, which gives a binary output.

The visualizations of layers of this model are available in the paper “Supplementary Material for the Paper: Deep Neural Networks with Inexact Matching for Person Re-Identification.”

Normalized Correlation Layer

This is not a layer provided by Keras, so we have to write our own layer with the support provided by the Keras backend.

class Normalized_Correlation_Layer(Layer):

Create a class that inherits from keras.engine.Layer.

def __init__(self, patch_size=(5,5),
             dim_ordering='tf',
             border_mode='same',
             stride=(1, 1),
             activation=None,
             **kwargs):

    if border_mode != 'same':
        raise ValueError('Invalid border mode for Correlation Layer '
                         '(only "same" is supported as of now):', border_mode)
    self.kernel_size = patch_size
    self.subsample = stride
    self.dim_ordering = dim_ordering
    self.border_mode = border_mode
    self.activation = activations.get(activation)
    super(Normalized_Correlation_Layer, self).__init__(**kwargs)

This constructor stores the values passed as parameters as instance attributes and initializes the parent class by calling its constructor.

def compute_output_shape(self, input_shape):
    return (input_shape[0][0], input_shape[0][1], input_shape[0][2],
            self.kernel_size[0] * input_shape[0][2] * input_shape[0][-1])

This returns the shape of the feature map output by this layer as a tuple. The first element is the number of images, the second is the number of rows, the third is the number of columns, and the last one is the depth, which is (allowance to move in height) × (allowance to move in width) × (input depth). In our case, it is 5×12×25 = 1500.
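The shape rule can be tried out without Keras. The standalone function below is a hypothetical stand-in that mirrors the formula in compute_output_shape; it is a sketch for checking the arithmetic, not part of the layer.

```python
def normxcorr_output_shape(input_shapes, patch_size=(5, 5)):
    """Mirror of the layer's shape rule for a list of two input shapes."""
    batch, rows, cols, depth = input_shapes[0]
    # depth = allowed row offsets x full width x input depth
    return (batch, rows, cols, patch_size[0] * cols * depth)

# Feature maps in this guide: 37 rows, 12 columns, 25 channels.
print(normxcorr_output_shape([(None, 37, 12, 25), (None, 37, 12, 25)]))

# The 8x8x5 demonstration example from Figure 2.
print(normxcorr_output_shape([(None, 8, 8, 5), (None, 8, 8, 5)]))
```

The first call gives a depth of 1500 and the second a depth of 200, matching the numbers derived earlier in this guide.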

		
def get_config(self):
    config = {'patch_size': self.kernel_size,
              'activation': self.activation.__name__,
              'border_mode': self.border_mode,
              'stride': self.subsample,
              'dim_ordering': self.dim_ordering}
    # The class name passed to super() must match this class.
    base_config = super(Normalized_Correlation_Layer, self).get_config()
    return dict(list(base_config.items()) + list(config.items()))

This collects the configuration passed as arguments to the constructor, appends it to that of the parent class, and returns it. Keras calls this function to get the layer's configuration.

def call(self, x, mask=None):

Keras calls this function when the layer is applied to its inputs, and it defines the layer's forward computation. It receives the two feature maps as a list, as passed in the model definition.

input_1, input_2 = x
stride_row, stride_col = self.subsample
inp_shape = input_1._keras_shape

Separate the inputs from the list and load some values into local variables to make them easier to refer to later on.

output_shape = self.compute_output_shape([inp_shape, inp_shape])

This uses the function written earlier to get the desired output shape and store it in the variable.

padding_row = (int(self.kernel_size[0] / 2), int(self.kernel_size[0] / 2))
padding_col = (int(self.kernel_size[1] / 2), int(self.kernel_size[1] / 2))
input_1 = K.spatial_2d_padding(input_1, padding=(padding_row, padding_col))
input_2 = K.spatial_2d_padding(input_2, padding=((padding_row[0]*2, padding_row[1]*2), padding_col))

This block of code adds padding to the feature maps. This is required because we take patches centered at (0,0) and at the other edges, too. Therefore, we need to add a padding of 2 in our case. For the feature map of the second input, however, we also need to take patches offset by up to 2 from the center of the patch of the first feature map. Thus, for the patch at (0,0), we need to consider patches centered at (0,0), (0,1), (0,2), (0,-1), and (0,-2) of the second feature map, with the same value of x. Thus, we need to add a row padding of 4 to the second feature map.

output_row = output_shape[1]
output_col = output_shape[2]

Store the number of output rows and columns in local variables.
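To see what the two paddings do to the row counts, the following plain-Python sketch zero-pads a 37×12 grid the same way. The pad2d helper is hypothetical and stands in for K.spatial_2d_padding on a single channel.

```python
def pad2d(grid, pad_rows, pad_cols):
    """Zero-pad a 2-D list; pad_rows and pad_cols are (before, after) pairs."""
    width = len(grid[0]) + pad_cols[0] + pad_cols[1]
    body = [[0] * pad_cols[0] + list(row) + [0] * pad_cols[1] for row in grid]
    top = [[0] * width for _ in range(pad_rows[0])]
    bottom = [[0] * width for _ in range(pad_rows[1])]
    return top + body + bottom

rows, cols, k = 37, 12, 5
half = k // 2
fmap = [[1] * cols for _ in range(rows)]

padded_1 = pad2d(fmap, (half, half), (half, half))          # pad rows by 2
padded_2 = pad2d(fmap, (2 * half, 2 * half), (half, half))  # pad rows by 4

starts_1 = len(padded_1) - k + 1   # vertical patch positions in map 1
starts_2 = len(padded_2) - k + 1   # vertical patch positions in map 2
print(len(padded_1), len(padded_2), starts_1, starts_2)  # 41 45 37 41
```

The first map yields one patch per original row (37), while the second yields four extra rows of patches (41), which are the positions above and below the map that the ±2 row offsets can reach.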

output = [] 
for k in range(inp_shape[-1]):

Loop over all the depth channels.

xc_1 = [] 
xc_2 = [] 
for i in range(padding_row[0]): 
   for j in range(output_col): 
      xc_2.append(K.reshape(input_2[:, i:i+self.kernel_size[0], j:j+self.kernel_size[1], k], 
                  (-1, 1,self.kernel_size[0]*self.kernel_size[1])))

This is done for the patches of feature map 2 where we have added the extra padding (i.e., the patches that are not centered on the feature map and which are at the first rows).

for i in range(output_row): 
       slice_row = slice(i, i + self.kernel_size[0]) 
       slice_row2 = slice(i + padding_row[0], i +self.kernel_size[0] + padding_row[0]) 
       for j in range(output_col): 
          slice_col = slice(j, j + self.kernel_size[1]) 
          xc_2.append(K.reshape(input_2[:, slice_row2, slice_col, k], 
                      (-1, 1,self.kernel_size[0]*self.kernel_size[1]))) 
          xc_1.append(K.reshape(input_1[:, slice_row, slice_col, k], 
                        (-1, 1,self.kernel_size[0]*self.kernel_size[1])))

Extract patches of size 5×5 from both feature maps and store them in xc_1 and xc_2, respectively. In this case, these patches are flattened and reshaped in form (-1,1,25).

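for i in range(output_row, output_row + padding_row[1]):
    # Offset by padding_row[0] so that these patches continue below the rows
    # collected by the previous loop instead of repeating its last rows.
    row_start = i + padding_row[0]
    for j in range(output_col):
        xc_2.append(K.reshape(input_2[:, row_start:row_start + self.kernel_size[0], j:j + self.kernel_size[1], k],
                    (-1, 1, self.kernel_size[0] * self.kernel_size[1])))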

This extracts the patches of feature map 2 that are centered below the bottom edge of the feature map.

xc_1_aggregate = K.concatenate(xc_1, axis=1)

These patches are joined along axis=1, so for any given depth the result has the shape (-1, 444, 25) in our case (37×12 = 444 patches of 25 values each).

xc_1_mean = K.mean(xc_1_aggregate, axis=-1, keepdims=True) 
xc_1_std = K.std(xc_1_aggregate, axis=-1, keepdims=True) 
xc_1_aggregate = (xc_1_aggregate - xc_1_mean) / xc_1_std

This is just the implementation of normalization of the features of the first feature map.

xc_2_aggregate = K.concatenate(xc_2, axis=1)
xc_2_mean = K.mean(xc_2_aggregate, axis=-1, keepdims=True) 
xc_2_std = K.std(xc_2_aggregate, axis=-1, keepdims=True) 
xc_2_aggregate = (xc_2_aggregate - xc_2_mean) / xc_2_std

Similarly, for the feature maps of image 2.
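The effect of this normalization followed by a dot product can be illustrated on two small flattened patches in plain Python. This is a minimal sketch, not the layer itself: the function names are hypothetical, the small eps guard against a zero standard deviation is an assumption added here (the layer code above divides by the standard deviation directly), and the result is divided by the patch length so that it falls in [-1, 1].

```python
import math

def normalize(patch, eps=1e-7):
    """Subtract the mean and divide by the population standard deviation."""
    n = len(patch)
    mean = sum(patch) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in patch) / n)
    return [(v - mean) / (std + eps) for v in patch]  # eps is an assumption

def normxcorr(patch_a, patch_b):
    """Dot product of two normalized patches, scaled by the patch length."""
    a, b = normalize(patch_a), normalize(patch_b)
    return sum(x * y for x, y in zip(a, b)) / len(a)

p = [1.0, 2.0, 3.0, 4.0]
print(round(normxcorr(p, p), 4))                  # identical patches
print(round(normxcorr(p, [-v for v in p]), 4))    # inverted patch
print(round(normxcorr(p, [5.0, 5.0, 5.0, 5.0]), 4))  # constant patch
```

Identical patches score close to 1, inverted patches close to -1, and a constant patch (for example, one taken entirely from zero padding) scores 0 instead of producing a division by zero.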

xc_1_aggregate = K.permute_dimensions(xc_1_aggregate, (0, 2, 1))
block = []
len_xc_1 = len(xc_1)
# For each patch of feature map 1, compute its dot product with the
# patches of feature map 2 that it is supposed to be compared with.
for i in range(len_xc_1):
    # Select the patches of feature map 2 to be considered for the
    # i-th patch of feature map 1.
    row_start = (i // inp_shape[2]) * inp_shape[2]
    sl1 = slice(row_start, row_start + inp_shape[2] * self.kernel_size[0])
    block.append(K.reshape(K.batch_dot(xc_2_aggregate[:, sl1, :],
                                       xc_1_aggregate[:, :, i]),
                           (-1, 1, 1, inp_shape[2] * self.kernel_size[0])))

Calculate the dot product (i.e., the normalized correlation) and store it in "block".
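The slice arithmetic is easier to follow with concrete numbers. The sketch below reproduces only the index calculation for the sizes used in this guide (12 columns, 5 allowed row offsets); xc2_slice is a hypothetical helper, not part of the layer.

```python
def xc2_slice(i, cols=12, patch_rows=5):
    """Indices in xc_2 that are compared with patch i of xc_1."""
    row = i // cols          # row of the patch in feature map 1
    start = row * cols       # first xc_2 patch two rows above that row
    return slice(start, start + cols * patch_rows)

first = xc2_slice(0)             # top-left patch of feature map 1
same_row = xc2_slice(11)         # last patch in the same row
last = xc2_slice(37 * 12 - 1)    # bottom-right patch of feature map 1

print(first.start, first.stop)
print(same_row.start, same_row.stop)
print(last.start, last.stop)
```

Every patch in a given row of feature map 1 is compared with the same 60 patches of feature map 2, and the slice for the last patch ends at 492, which is exactly the number of patches collected in xc_2 ((37 + 4) rows × 12 columns).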

block = K.concatenate(block, axis=1)
block = K.reshape(block, (-1, output_row, output_col, inp_shape[2] * self.kernel_size[0]))
output.append(block)

Join the calculated normalized correlation values, reshape them (they are calculated sequentially so that reshaping is easier), and append them to “output.”

output = K.concatenate(output, axis=-1)

Join the output feature map calculated at each depth, along the depth of “output.”

output = self.activation(output) 
return output

Apply the activation, if one was passed as an argument, and return the generated output.

Applications

Such a network can have various applications, such as matching a person’s identity in crime scenes. It can also be generalized to find the similarity between two images (e.g., to find whether the same fruit exists in both images).

Further Scope

The code runs sequentially and is devoid of parallelism. The matrix multiplication of the patches could be parallelized across multiple cores using libraries such as multiprocessing, which would help to reduce the training time. The accuracy of the model might also be increased by finding a more suitable similarity measure between the image patches.

Acknowledgement

I would like to thank the Intel® Student Ambassador Program for AI, which provided me with the necessary training resources on the Intel® AI DevCloud and the technical support that helped me to use DevCloud.

References

1. A. Subramaniam, M. Chatterjee, and A. Mittal. “Deep Neural Networks with Inexact Matching for Person Re-Identification.” In NIPS, 2016.
2. D. Yi, Z. Lei, S. Liao, and S. Z. Li. “Deep Metric Learning for Person Re-Identification.” In ICPR, 2014.
3. Code on GitHub*
