What is Deep Learning?

Deep learning is a branch of machine learning. Machine learning covers several types of algorithms that take a few thousand data points and learn from them in order to predict future events. Deep learning, in contrast, applies neural networks in extended or variant forms, and it can handle millions of data points.

Perhaps the most fundamental strength of deep learning is its ability to pick the best features. Deep learning summarizes the data and computes the result from that compressed representation. This is exactly what artificial intelligence needs, especially when we have a huge database that demands heavy computation.

Deep learning has sequential layers, inspired by neural networks. Each layer applies a nonlinear function whose duty is feature selection, and the output of each layer is used as the input of the next layer. Applications of deep learning include computer vision (such as face or object recognition), speech recognition, natural language processing (NLP), and cyber threat detection.
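The idea of sequential layers, each feeding its output to the next one, can be sketched in a few lines. This is only a minimal illustration: the function names (`dense`, `relu`, `forward`) and the toy weights are hypothetical and are not taken from any library used in this article.

```python
def dense(inputs, weights, biases):
    """One fully connected layer: a weighted sum of the inputs plus a bias per neuron."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    """Nonlinear function applied to the output of a layer."""
    return [max(0.0, v) for v in values]


def forward(inputs, layers):
    """Pass the data through the layers in order; each output becomes the next input."""
    activation = inputs
    for weights, biases in layers:
        activation = relu(dense(activation, weights, biases))
    return activation


# Hypothetical toy weights, chosen only to make the arithmetic easy to follow
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, -1.0]),  # layer 1: 2 inputs -> 2 neurons
    ([[1.0, 2.0]], [0.5]),                     # layer 2: 2 inputs -> 1 neuron
]

print(forward([2.0, 1.0], layers))  # [2.5]
```

Layer 1 produces [1.0, 0.5], and layer 2 turns that into 1.0 * 1.0 + 2.0 * 0.5 + 0.5 = 2.5.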

Deep Learning vs Machine Learning

The major difference between machine learning and deep learning is that in ML a human must manually select and extract the features, while in DL this is done by knowledge embedded in the architecture itself. This difference has a dramatic influence on performance, in both precision and speed. Because manual feature detection is always subject to human error, DL can be the best option for computation on gigantic data sets.

What DL and ML have in common is that both work in supervised and unsupervised settings. DL is based on neural networks, which change shape and operation in variants such as CNN and RNN, whereas ML has different algorithms based on statistics and mathematics. This does not mean that DL relies on neural networks alone: DL can also use various ML algorithms to increase performance by building hybrid functions. For instance, DL can apply a Support Vector Machine (SVM) in place of softmax as its final function. [1]
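One way to read the SVM example is that the network keeps its layers but scores the final outputs with an SVM-style margin (hinge) loss instead of softmax with cross-entropy. The sketch below compares the two on the same raw scores. It is an assumption about what such a hybrid looks like, not the method of reference [1], and all function names are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy_loss(scores, true_class):
    """Loss normally paired with a softmax output."""
    return -math.log(softmax(scores)[true_class])


def multiclass_hinge_loss(scores, true_class, margin=1.0):
    """SVM-style loss: penalize every wrong class that comes within `margin` of the true class."""
    return sum(max(0.0, s - scores[true_class] + margin)
               for i, s in enumerate(scores) if i != true_class)


scores = [2.0, 1.0, -1.0]  # hypothetical raw outputs of the last layer for 3 classes

print(multiclass_hinge_loss(scores, true_class=0))  # 0.0, class 0 already wins by the margin
print(multiclass_hinge_loss(scores, true_class=1))  # 2.0, class 0 outscores the true class
print(cross_entropy_loss(scores, true_class=0))
```

The hinge loss is exactly zero once the true class wins by the margin, while the cross-entropy loss only approaches zero.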

Feature Engineering Importance

In artificial intelligence, we try to make the machine an independent tool that can think with less intervention from the programmer. The most important characteristic of an automated machine is the way it thinks: the machine whose way of thinking is most similar to the human brain will win the race for the best machine. So let’s see what the pillar attribute is in making an accurate decision. Remember our childhood, when we saw objects but had no idea about their properties such as name, exact size, weight and so on. We could still categorize them quickly by noticing one important thing. For example, looking at an animal, we knew it was a "Dog" as soon as we heard it "barking", or a "Cat" when we heard it "meowing". Here, the sound of the animal has more influence than its size, because experience tells us that when two animals have a similar size, our brain pays attention to the most distinguishing feature, which is the sound. On the other hand, when we see the tallest animal at the zoo, we ignore all other features and say “Yes, it is a giraffe”.

This is a miracle of the brain: it can infer the situation and, under different conditions of the same problem such as “animal detection”, pick one feature as the final key for the decision. The result of this approach is both accurate and quick. Another story that clarifies the importance of feature engineering is the “Twenty Questions” game. If you have not played it before, please look here.

The player wins if he can ask the proper questions, building and improving each new question on the recent answers. The questions are sequential, and the next question depends 100% on the previous answer. The previous answers filter and clarify the possibilities so the player can reach the goal. Each question acts like a hidden layer in a neural network: it is connected to the next layers, and its output is used as their input. Our first question is always “Is it alive?”, and with it we remove half of the possibilities. This omitting and dropping leads us to ask better questions in a new category; obviously, we cannot ask the next question without the previous answer, which made the clarification and filtration in our brain. Something similar happens in a deep learning convolutional neural network.

Deep Learning and Human Brain

Deep learning imitates the human brain and comes close to it in precision and speed. Convolutional Neural Networks (CNN) are inspired by the visual cortex of the brain. As you see in the picture below, the cells of the visual cortex together cover the entire visual field. These sensitive cells play the role of the kernel or filter matrix, which we will pay attention to later in this article. God created these cells to extract the important data coming from the eyes.

Assume students have an exam and are preparing for it. They start to read the book, picking out its important parts and writing them in notes or highlighting them. Either way, they reduce the volume of the book, summarizing 100 pages into two pages that are easy to use as a reference and to review. A similar scenario happens in a DL CNN; this time, we need a smaller matrix to filter and remove data.

Requirement

I strongly recommend that you carefully read the first and second articles listed below, because their concepts will be needed; I have assumed that you are familiar with linear regression and neural networks.

How Does a Deep Learning Convolutional Neural Network Work?

Deep learning is a neural network which has more than two hidden layers. If you are new to neural networks, please study this link. More layers mean more parameters, which can cause overfitting. Overfitting happens when the model fits the training data set so completely that it memorizes one answer for every training example and then performs poorly on unseen data. A good model should generalize rather than match the training data exactly.

Even if we could build a model that fits the training data completely, it would be wrong to do so. Let’s see what happens when we want to assign a “Y” inside our model. We must avoid being too idealistic when building the model and aim for a general model rather than a specific one. To reach this goal, we can apply cross validation, which is a model evaluation method. A good choice is K-fold cross validation, which divides the training set into k parts; in each iteration, one part serves as the test set and the remaining k-1 parts as the training set, so the chance of fitting one particular split is decreased. Convolutional neural networks also have specific techniques to avoid overfitting, such as dropout and regularization.
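The K-fold split described above can be sketched without any library. The helper name `k_fold_indices` is hypothetical; in practice, a ready-made utility from a machine learning library would normally be used, and the data would usually be shuffled first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) once for each of the k folds."""
    indices = list(range(n_samples))
    # Spread the samples as evenly as possible over the k folds
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]                    # one fold is held out
        train = indices[:start] + indices[start + size:]      # the other k-1 folds train
        yield train, test
        start += size


for train, test in k_fold_indices(6, 3):
    print("train:", train, "test:", test)
# train: [2, 3, 4, 5] test: [0, 1]
# train: [0, 1, 4, 5] test: [2, 3]
# train: [0, 1, 2, 3] test: [4, 5]
```

Every sample is used for testing exactly once and for training k-1 times.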

Fully connected in DL means that each neuron in one hidden layer is connected to all of the neurons in the next layer. When dropout is applied, some of the neurons are turned off during training; once training is finished, all neurons are turned on again at prediction time. In this way, DL tries to remove redundant data and obscure its role while enhancing the role of the important features. In the picture below, the image on the left has high resolution, but as it passes through the DL CNN, the network keeps the important pixels and makes the image smaller.
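The behavior of dropout at training time and at prediction time can be sketched as follows. This uses the common "inverted dropout" variant, in which the surviving activations are scaled up during training so that nothing needs to change at prediction time. That variant, the function name, and the parameter names are assumptions for illustration, not a description of a specific library.

```python
import random


def dropout(activations, keep_prob, training, rng=random):
    """Turn neurons off at random while training; keep all of them at prediction time."""
    if not training:
        return list(activations)  # prediction: every neuron stays on
    # Training: each neuron survives with probability keep_prob and is scaled by 1/keep_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]


rng = random.Random(0)       # fixed seed so the sketch is repeatable
layer_output = [1.0] * 10    # hypothetical activations of one hidden layer

train_out = dropout(layer_output, keep_prob=0.8, training=True, rng=rng)
test_out = dropout(layer_output, keep_prob=0.8, training=False)

print(train_out)  # a mix of 0.0 (dropped) and 1.25 (kept and rescaled)
print(test_out)   # identical to layer_output
```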

We can transform the data into smaller data, which is easier to rely on for making decisions, with the aid of a smaller matrix that slides over the original (primitive) matrix. We perform some mathematical calculation at each position as the filter matrix moves across the primitive matrix. For example, in the picture below, 12 data points are reduced to just 3 data points by applying one matrix 3 times across the primitive matrix. The computation can take either the maximum or the average of the data.
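The reduction from 12 data points to 3 can be sketched as pooling with a window of 4 values that moves 4 positions at a time. The sample values, the window size, and the function name are assumptions chosen to match the 12-to-3 example; they are not read from the picture.

```python
def pool_1d(values, window, mode="max"):
    """Reduce each non-overlapping window of values to its maximum or its average."""
    pooled = []
    for start in range(0, len(values) - window + 1, window):
        chunk = values[start:start + window]
        pooled.append(max(chunk) if mode == "max" else sum(chunk) / window)
    return pooled


data = [1, 3, 2, 4, 5, 7, 6, 8, 9, 11, 10, 12]  # 12 hypothetical data points

print(pool_1d(data, window=4, mode="max"))      # [4, 8, 12]
print(pool_1d(data, window=4, mode="average"))  # [2.5, 6.5, 10.5]
```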

One-Dimensional CNN

Real-world data for a CNN is rarely one dimensional, but because it is the easiest way to present the idea, I prefer to start with a 1D matrix. I want to perform dimensionality reduction on the blue matrix with the aid of the red matrix: the blue matrix is the real data set and the red one is the filter matrix. I want to transform the blue matrix with 5 elements into 3 elements. I push the red matrix from left to right, one element per step. Wherever elements coincide, I multiply the two related elements, and when more than one pair matches, I sum the products. Note that the red matrix was [2 -1 1], and after flipping, it (the kernel) becomes [1 -1 2].

To reduce the matrix, I look for the valid results, which occur when all of the red (filter) elements are covered by blue ones. I pick up just [3 5].

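import numpy as np

x = np.array([0, 1, 2, 3])   # data (blue matrix)
w = np.array([2, -1, 1])     # filter (red matrix); np.convolve flips it internally

result = np.convolve(x, w)                  # full convolution
result_valid = np.convolve(x, w, "valid")   # only positions where the filter fully overlaps x
print(result)        # [ 0  2  3  5 -1  3]
print(result_valid)  # [3 5]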
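To see what happens inside the library call, the flip, slide, multiply, and sum steps can be written out by hand. This is a plain Python sketch with a hypothetical function name; it reproduces the numbers of the example above and is not meant as an efficient implementation.

```python
def convolve_1d(signal, kernel, mode="full"):
    """1D convolution: flip the kernel, slide it over the signal, multiply and sum."""
    flipped = kernel[::-1]                       # [2, -1, 1] becomes [1, -1, 2]
    k = len(kernel)
    # Pad with zeros so the kernel can also hang over both edges (full mode)
    padded = [0] * (k - 1) + list(signal) + [0] * (k - 1)
    full = [sum(f * p for f, p in zip(flipped, padded[i:i + k]))
            for i in range(len(signal) + k - 1)]
    if mode == "valid":
        # Keep only the positions where the kernel lies completely inside the signal
        return full[k - 1:len(signal)]
    return full


print(convolve_1d([0, 1, 2, 3], [2, -1, 1]))           # [0, 2, 3, 5, -1, 3]
print(convolve_1d([0, 1, 2, 3], [2, -1, 1], "valid"))  # [3, 5]
```

With 4 data elements and a 3-element filter, only 4 - 3 + 1 = 2 positions are valid, which is why the valid result is [3, 5].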

Two-Dimensional CNN

The story is similar for two-dimensional matrices. After flipping, the kernel matrix [[-1, 0], [2, 1]] becomes [[1, 2], [0, -1]]. Because the filter matrix stays inside the original matrix in every step of the pictures below, all of the computed elements are valid.

from scipy import signal as sg

image = [[2, 1, 3],
         [5, -2, 1],
         [0, 2, -4]]
kernel = [[-1, 0],
          [2, 1]]

print(sg.convolve(image, kernel))           # full convolution (4x4 result)
print(sg.convolve(image, kernel, "valid"))  # filter fully inside the image (2x2 result)
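The valid 2D case can also be written out by hand to show the flipping and sliding explicitly. The function name is hypothetical, and the sketch assumes rectangular lists of lists.

```python
def convolve_2d_valid(image, kernel):
    """Valid 2D convolution: flip the kernel on both axes, then slide it inside the image."""
    flipped = [row[::-1] for row in kernel[::-1]]   # [[-1, 0], [2, 1]] becomes [[1, 2], [0, -1]]
    kh, kw = len(flipped), len(flipped[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(flipped[a][b] * image[i + a][j + b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]


image = [[2, 1, 3],
         [5, -2, 1],
         [0, 2, -4]]
kernel = [[-1, 0],
          [2, 1]]

print(convolve_2d_valid(image, kernel))  # [[6, 6], [-1, 4]]
```

For the top-left position, the flipped kernel covers [[2, 1], [5, -2]], giving 1*2 + 2*1 + 0*5 + (-1)*(-2) = 6.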

Deep Learning Code Sample by Digit Recognition

I want to introduce you to Kaggle, the best competition community, which is famous among data scientists. It hosts many competitions that are worth entering to practice your abilities in machine learning and deep learning, and there are awards for those who solve the recent challenges. There are also kernels written by other authors, to which you can contribute; they are good sources for learning artificial intelligence in R and Python. Moreover, you can use Kaggle data sets as a reference and test your code with prepared data.

To practice convolutional neural networks, please click here.

Download Training and Test Data Set

Please go to this link to get the training and test data sets. You must sign up on the Kaggle site and then join this competition.

# -*- coding: utf-8 -*-
"""
Created on Sun Nov 19 05:59:50 2017

@author: Mahsa
"""
import numpy as np
from numpy.random import permutation
import pandas as pd
import tflearn
from tflearn.layers.core import input_data,dropout,fully_connected,flatten
from tflearn.layers.conv import conv_2d,max_pool_2d
from tflearn.layers.normalization import local_response_normalization
from tflearn.layers.estimator import regression
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20


train_Path = r'D:\digit\train.csv'
test_Path = r'D:\digit\test.csv'
  
#Split arrays or matrices into random train and test subsets
#http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

def split_matrices_into_random_train_test_subsets(train_Path):
    train = pd.read_csv(train_Path)
    train = np.array(train)
    train = permutation(train)
    X = train[:,1:785].astype(np.float32) #feature
    y = train[:,0].astype(np.float32) #label
    return train_test_split(X, y, test_size=0.33, random_state=42)

def reshape_data(Data,Labels):
    Data = Data.reshape(-1,28,28,1).astype(np.float32)
    Labels = (np.arange(10) == Labels[:,None]).astype(np.float32)
    return Data,Labels

X_train, X_test, y_train, y_test = split_matrices_into_random_train_test_subsets(train_Path)

X_train,y_train = reshape_data(X_train,y_train)
X_test,y_test = reshape_data(X_test,y_test)

test_x = np.array(pd.read_csv(test_Path))
test_x = test_x.reshape(-1,28,28,1)
  
def Convolutional_neural_network():
    network  = input_data(shape=[None,28,28,1],name='input_layer')
    network  = conv_2d(network, nb_filter=6,  filter_size=6, 
               strides=1, activation='relu', regularizer='L2')  
    network  = local_response_normalization(network)
    network  = conv_2d(network, nb_filter=12, filter_size=5, 
               strides=2, activation='relu', regularizer='L2') 
    network  = local_response_normalization(network)
    network  = conv_2d(network, nb_filter=24, filter_size=4, 
               strides=2, activation='relu', regularizer='L2')
    network  = local_response_normalization(network)    
 
    network = fully_connected(network, 128, activation='tanh')
    network = dropout(network, 0.8)
    network = fully_connected(network, 256, activation='tanh')
    network = dropout(network, 0.8) 
    network = fully_connected(network, 10, activation='softmax') 
    
    sgd   = tflearn.SGD(learning_rate=0.1,lr_decay=0.096,decay_step=100)
    # Top-k mean accuracy: number of top elements to look at for computing precision
    top_k = tflearn.metrics.top_k(3)
    
    network = regression(network, optimizer=sgd, 
              metric=top_k, loss='categorical_crossentropy')
    return tflearn.DNN(network, tensorboard_dir='tf_CNN_board', tensorboard_verbose=3)
    
model = Convolutional_neural_network()
model.fit(X_train, y_train, batch_size=128, 
          validation_set=(X_test,y_test), n_epoch=1, show_metric=True)

P = model.predict(test_x)

index = list(range(1, len(P) + 1))  # ImageId starts at 1
# Pick the class with the highest probability for each image (np.int no longer exists in recent NumPy)
result = [int(np.argmax(p)) for p in P]

res = pd.DataFrame({'ImageId':index,'Label':result})
res.to_csv("sample_submission.csv", index=False)
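The one-hot encoding trick used in `reshape_data` above can be hard to read at first. The following sketch isolates it on three hypothetical labels and shows how `argmax` turns the encoding back into digits, which is what the prediction step does with the model output.

```python
import numpy as np

labels = np.array([3, 0, 9], dtype=np.float32)  # hypothetical digit labels

# labels[:, None] has shape (3, 1) and np.arange(10) has shape (10,), so the comparison
# broadcasts to shape (3, 10): each row is True only at the column equal to its label.
one_hot = (np.arange(10) == labels[:, None]).astype(np.float32)

print(one_hot.shape)  # (3, 10)
print(one_hot[0])     # 1.0 at index 3, 0.0 everywhere else

# argmax recovers the original digit from each one-hot row
recovered = one_hot.argmax(axis=1)
print(recovered.tolist())  # [3, 0, 9]
```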

Increase Deep Learning Performance With Hardware by GPU

One important thing that game developers, graphic designers, and data scientists have in common is matrices. Every data point, whether in images, video, or other complex data, is a value in a matrix element. Whatever we do involves mathematical operations that transform matrices.

For ordinary processing, the Central Processing Unit (CPU) is a good answer, but for advanced mathematical and statistical operations on huge data, the CPU cannot cope and we have to use a Graphics Processing Unit (GPU), which was designed for difficult mathematical functions. The parts of deep learning that need complex computation, such as convolutional neural networks, activation functions (sigmoid, softmax), and the Fourier transform, are processed on the GPU, and the remaining 95%, which is mostly I/O procedures, runs on the CPU.

GPU Activation

Open the Start menu and launch the Windows command prompt (cmd).
Type "dxdiag".
In the window that opens, look at the "Display" tab.
If the name shows "NVIDIA" or another vendor (NVIDIA GPU, AMD GPU, Intel Xeon Phi), there is a GPU card on the board.
Create or edit the configuration file .theanorc at C:\users\"yourname"\.theanorc
Set device = gpu (or cuda0) and floatX = float32 in the [global] section, and preallocate = 1 in the [gpuarray] section.
If you want to know more about it, please look here.
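The configuration steps above can be turned into file content with the Python standard library. This sketch only builds the text of a .theanorc file from the settings named in the steps. The function name is hypothetical, and writing the text to the file in your user folder is left to the reader.

```python
import configparser
import io


def build_theanorc(device="cuda0", float_x="float32", preallocate="1"):
    """Return the text of a .theanorc file with the settings listed in the steps above."""
    config = configparser.ConfigParser()
    config.optionxform = str  # keep the capital X in floatX (keys are lowercased by default)
    config["global"] = {"device": device, "floatX": float_x}
    config["gpuarray"] = {"preallocate": preallocate}
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


print(build_theanorc())
# [global]
# device = cuda0
# floatX = float32
#
# [gpuarray]
# preallocate = 1
```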

GPU Test Code

import os
import shutil

destfile = "/home/ubuntu/.theanorc"
open(destfile, 'a').close()
shutil.copyfile("/mnt/.theanorc", destfile) # make .theanorc file in the project directory  

from theano import function, config, shared, sandbox
import theano.tensor as T
import numpy
import time
 
vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
iters = 1000

rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))

f = function([], T.exp(x))
print(f.maker.fgraph.toposort())
t0 = time.time()
for i in range(iters):
    r = f()
t1 = time.time()

print("Looping %d times took %f seconds" % (iters, t1 - t0))
print("Result is %s" % (r))


if numpy.any([isinstance(x.op, T.Elemwise) and ('Gpu' not in type(x.op).__name__) for x in f.maker.fgraph.toposort()]):
    print('Used the cpu')
else:
    print('Used the gpu')

Increase Deep Learning Performance With Software Libraries

To enhance CNN performance, and because it is not possible to load gigantic data (more than a terabyte) onto the CPU or even the GPU at once, we must use some strategy to break the data down into chunks for processing. I have used Dask to prevent out-of-memory crashes. It is responsible for scheduling the work on those chunks.

import numpy as np
import dask.array as da

# Wrap each array in a Dask array split into blocks of at most 1000 elements per axis.
# A single integer applies to every axis, so it works for the 4-D images and the 2-D labels.
X = da.from_array(np.asarray(X), chunks=1000)
Y = da.from_array(np.asarray(Y), chunks=1000)
X_test = da.from_array(np.asarray(X_test), chunks=1000)
Y_test = da.from_array(np.asarray(Y_test), chunks=1000)
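The idea behind chunking can be shown without Dask: instead of touching the whole data set at once, we walk over it in fixed-size pieces and combine the partial results. The helper names are hypothetical, and this sketch only illustrates the principle; it does not show how Dask is implemented.

```python
def iter_chunks(data, chunk_size):
    """Yield consecutive pieces of the data, each with at most chunk_size items."""
    for start in range(0, len(data), chunk_size):
        yield data[start:start + chunk_size]


def chunked_mean(data, chunk_size):
    """Compute the mean by combining per-chunk sums, one chunk at a time."""
    total, count = 0.0, 0
    for chunk in iter_chunks(data, chunk_size):
        total += sum(chunk)
        count += len(chunk)
    return total / count


values = list(range(10))  # stand-in for a data set too large to process in one go

print(list(iter_chunks(values, 4)))  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
print(chunked_mean(values, 4))       # 4.5
```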

References

Feedback

Feel free to leave any feedback on this article; it is a pleasure to see your opinions and vote about this code. If you have any questions, please do not hesitate to ask me here.

History

3rd April, 2019: Initial version