We will write an application that lets us search images by keyword. I hate library dependencies and "black boxes", so we will not use any third-party API or library. Everything will be in pure, simple C#. (As of CeNiN v0.2, it is more than 10 times faster when Intel MKL support is available.)

Introduction

Deep convolutional neural networks are one of the hot topics in the image processing community. There are different implementations in various languages. But if you are trying to understand the logic behind the ideas, large implementations are not always helpful. So I have implemented the feed-forward phase of a convolutional neural network in its minimal form as a .NET library: CeNiN.dll.

We will use CeNiN to classify images and tag them with keywords so that we can search for an object or scene in a set of images. For instance, we will be able to find images that contain cats, cars, or whatever we want, in a folder that we choose.

CeNiN doesn't contain an implementation of back-propagation, which is required to train a neural network model, so we will use a pretrained model. The original model (imagenet-matconvnet-vgg-f) and the same model converted to a CeNiN-compatible format can be found here and here, respectively. The model contains 19+2 (input and output) layers and 60,824,256 weights, and has been trained on 1000 classes of images...

Preparing the Model

First, we load the model using the constructor. Since loading millions of parameters from the model file may take a while, we call the constructor in a separate thread so as not to block the UI:

Thread t = new Thread(() =>
{
    try
    {
        cnn = new CNN("imagenet-matconvnet-vgg-f.cenin");
        ddLabel.Invoke((MethodInvoker)delegate ()
        {
            cbClasses.Items.AddRange(cnn.outputLayer.classes);
            dropToStart();
        });
    }
    catch (Exception exp)
    {
        ddLabel.Invoke((MethodInvoker)delegate ()
        {
            ddLabel.Text = "Missing model file!";
            if (MessageBox.Show(this, "Couldn't find model file. " +
                "Do you want to be redirected to the download page?", "Missing Model File",
                MessageBoxButtons.YesNo, MessageBoxIcon.Error) == DialogResult.Yes)
                Process.Start("http://huseyinatasoy.com/y.php?bid=71");
        });
    }
});
t.Start();

Classifying Images

We need a structure to keep the results:

private struct Match
{
    public int ImageIndex { set; get; }
    public string Keywords { set; get; }
    public float Probability { set; get; }
    public string ImageName { set; get; }

    public Match(int imageIndex, string keywords, float probability, string imageName)
    {
        ImageIndex = imageIndex;
        Keywords = keywords;
        Probability = probability;
        ImageName = imageName;
    }
}

CeNiN loads layers into memory as a layer chain. The chain is a linked list whose first and last nodes are the Input and Output layers. To classify an image, the image is set as the input and the layers are iterated, calling feedNext() at each step to feed the next layer. When the data arrives at the Output layer, it is in the form of a probability vector. Calling getDecision() sorts the probabilities from highest to lowest, and then we can treat each probability as a Match. It is important to make those calls inside a separate thread again, so as not to block the UI. Also, since a worker thread cannot modify UI elements directly, code that modifies UI elements (adding new rows to lv_KeywordList, updating ddLabel.Text) must be invoked on the GUI thread.

Thread t = new Thread(() =>
{
    int imCount = imageFullPaths.Length;
    for (int j = 0; j < imCount; j++)
    {
        Bitmap b = (Bitmap)Image.FromFile(imageFullPaths[j]);
        ddLabel.Invoke((Action<int,int>)delegate (int y, int n)
        {
            ddLabel.Text = "Processing [" + (y + 1) + "/" + n + "]...\n\n" + 
                            getImageName(imageFullPaths[y]);
        }, j, imCount);
        Application.DoEvents();

        cnn.inputLayer.setInput(b, Input.ResizingMethod.ZeroPad);
        b.Dispose();

        Layer currentLayer = cnn.inputLayer;
        while (currentLayer.nextLayer != null)
        {
            currentLayer.feedNext();
            currentLayer = currentLayer.nextLayer;
        }
        Output outputLayer = (Output)currentLayer;
        outputLayer.getDecision();

        lv_KeywordList.Invoke((MethodInvoker)delegate ()
        {
            int k = 0;
            while (outputLayer.probabilities[k] > 0.05)
            {
                Match m = new Match(
                    j,
                    outputLayer.sortedClasses[k],
                    (float)Math.Round(outputLayer.probabilities[k], 3),
                    getImageName(imageFullPaths[j])
                );
                matches.Add(m);
                k++;
            }
        });
    }

    lv_KeywordList.Invoke((MethodInvoker)delegate ()
    {
        groupBox2.Enabled = true;
        btnFilter.PerformClick();

        int k;
        for (k = 0; k < lv_KeywordList.Columns.Count - 1; k++)
            if(k!=1)
              lv_KeywordList.Columns[k].Width = -2;
        lv_KeywordList.Columns[k].Width = -1;

        dropToStart();
    });
});
t.Start();
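
The layer chain and the ranking step described above can be illustrated with a toy example. This is a minimal sketch, not CeNiN's actual code: `ToyLayer`, `PassThroughLayer`, `ScaleLayer`, `SoftmaxOutputLayer` and `ToyNet.Run` are hypothetical names invented for illustration, and only the linked-list traversal mirrors the loop shown above.

```csharp
using System;
using System.Linq;

// Hypothetical stand-ins for CeNiN's layers; only the chaining idea comes from the article.
public abstract class ToyLayer
{
    public ToyLayer NextLayer;   // next node in the chain, null for the last layer
    public float[] Data;         // values currently held by this layer

    // Transform this layer's data and hand the result to the next layer.
    public void FeedNext()
    {
        NextLayer.Data = Forward(Data);
    }

    protected abstract float[] Forward(float[] input);
}

// Passes its input through unchanged (plays the role of the input layer).
public sealed class PassThroughLayer : ToyLayer
{
    protected override float[] Forward(float[] input) => input;
}

// Multiplies every value by a constant (plays the role of a hidden layer).
public sealed class ScaleLayer : ToyLayer
{
    private readonly float factor;
    public ScaleLayer(float factor) { this.factor = factor; }
    protected override float[] Forward(float[] input) => input.Select(v => v * factor).ToArray();
}

// Last layer: converts its data to probabilities and ranks the classes.
public sealed class SoftmaxOutputLayer : ToyLayer
{
    public string[] Classes;
    public float[] Probabilities;
    public string[] SortedClasses;

    protected override float[] Forward(float[] input) => input; // unused: nothing follows this layer

    public void GetDecision()
    {
        // Numerically stable softmax: subtract the maximum before exponentiating.
        float max = Data.Max();
        double[] exps = Data.Select(v => Math.Exp(v - max)).ToArray();
        double sum = exps.Sum();
        Probabilities = exps.Select(e => (float)(e / sum)).ToArray();

        // Sort probabilities ascending while reordering the class names with them,
        // then reverse both arrays to get highest-to-lowest order.
        SortedClasses = (string[])Classes.Clone();
        Array.Sort(Probabilities, SortedClasses);
        Array.Reverse(Probabilities);
        Array.Reverse(SortedClasses);
    }
}

public static class ToyNet
{
    // Builds a three-layer chain, feeds the input through it and returns the ranked output.
    public static SoftmaxOutputLayer Run(float[] input, string[] classes)
    {
        var first = new PassThroughLayer { Data = input };
        var hidden = new ScaleLayer(2f);
        var output = new SoftmaxOutputLayer { Classes = classes };
        first.NextLayer = hidden;
        hidden.NextLayer = output;

        // Same traversal pattern as in the article: feed until the last node is reached.
        ToyLayer current = first;
        while (current.NextLayer != null)
        {
            current.FeedNext();
            current = current.NextLayer;
        }

        var last = (SoftmaxOutputLayer)current;
        last.GetDecision();
        return last;
    }
}
```

Calling `ToyNet.Run(new float[] { 1f, 3f, 2f }, new[] { "cat", "car", "dog" })` returns an output layer whose `SortedClasses` starts with the class that received the largest input value, with `Probabilities` ordered to match.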

Now all the images are tagged with keywords, which are actually the class descriptions of the model we are using. Finally, we iterate over the matches to find each Match whose keywords contain the text entered by the user.

float probThresh = (float)numericUpDown1.Value;
string str = cbClasses.Text.ToLower();
lv_KeywordList.Items.Clear();
pictureBox1.Image = null;

List<int> imagesToShow = new List<int>();

int j = 0;

bool stringFilter = (str != "");

for (int i = 0; i < matches.Count; i++)
{
    bool cond = (matches[i].Probability >= probThresh);
    if (stringFilter)
        cond = cond && matches[i].Keywords.ToLower().Contains(str); // str is lowercase, so compare in lowercase
    if (cond)
    {
        addMatchToList(j, matches[i]);
        int ind = matches[i].ImageIndex;
        if (!imagesToShow.Contains(ind))
            imagesToShow.Add(ind);
        j++;
    }
}
if (lv_KeywordList.Items.Count > 0)
    lv_KeywordList.Items[0].Selected = true;

It is that simple!

Training Your Own Models for ImageTagger

You can train your own neural network using a tool like MatConvNet and convert it to the CeNiN format to use it in ImageTagger. Here is a MATLAB script that converts VGG nets to a format compatible with CeNiN:

function vgg2cenin(vggMatFile) % vgg2cenin('imagenet-matconvnet-vgg-f.mat')
  fprintf('Loading mat file...\n');
  net=load(vggMatFile);
  lc=size(net.layers,2);

  vggMatFile(find(vggMatFile=='.',1,'last'):end)=[]; % remove extension
  
  f=fopen(strcat(vggMatFile,'.cenin'),'w');   % Open an empty file with the same name
  fprintf(f,'CeNiN NEURAL NETWORK FILE');   % Header
  fwrite(f,lc,'int');             % Layer count
  if(isfield(net.meta,'inputSize'))
    s=net.meta.inputSize;
  else
    s=net.meta.inputs.size(1:3);
  end
  for i=1:length(s)
    fwrite(f,s(i),'int'); % Input dimensions (height, width and number of channels (depth))
  end
  for i=1:3
    fwrite(f,net.meta.normalization.averageImage(i),'single');
  end
  for i=1:lc % For each layer
    l=net.layers{i};
    t=l.type;
    s=length(t);
    fwrite(f,s,'int8'); % String length
    fprintf(f,'%s',t);  % Layer type (string)

    fprintf('Writing layer %d (%s)...\n',i,l.type);

    if strcmp(t,'conv') % Convolution layers     
      st=l.stride;
      p=l.pad;
      
      % We need 4 padding values for CeNiN (top, bottom, left, right).
      % In vgg format, if there is one value, all padding values are
      % the same, and if there are two values, these are for top-bottom
      % and left-right paddings.
      if size(st,2)<2 , st(2)=st(1); end
      if size(p,2)<2 , p(2)=p(1); end
      if size(p,2)<3 , p(3:4)=[p(1) p(2)]; end

      % Four padding values
      fwrite(f,p(1),'int8');
      fwrite(f,p(2),'int8');
      fwrite(f,p(3),'int8');
      fwrite(f,p(4),'int8');

      % Dimensions (height, width, number of channels (depth), number of filters)
      s=size(l.weights{1});
      for j=1:length(s)
        fwrite(f,s(j),'int');
      end

      % Vertical and horizontal stride values (StrideY and StrideX)
      fwrite(f,st(1),'int8');
      fwrite(f,st(2),'int8');
      
      % Weight values
      % Writing each value one by one takes long time because there are many of them.
      %   for j=1:numel(l.weights{1})
      %     fwrite(f,l.weights{1}(j),'single');
      %   end
      % This is faster:
      fwrite(f,l.weights{1}(:),'single');
      
      % And biases
      %   for j=1:numel(l.weights{2})
      %     fwrite(f,l.weights{2}(j),'single');
      %   end
      fwrite(f,l.weights{2}(:),'single');

    elseif strcmp(t,'relu') % ReLu layers
      % Layer type ('relu') has been written above. There are no extra
      % parameters to be written for this layer.

    elseif strcmp(t,'pool') % Pooling layers
      st=l.stride;
      p=l.pad;
      po=l.pool;
      if size(st,2)<2 , st(2)=st(1); end
      if size(p,2)<2 , p(2)=p(1); end
      if size(p,2)<3 , p(3:4)=[p(1) p(2)]; end
      if size(po,2)<2 , po(2)=po(1); end

      % Four padding values (top, bottom, left, right)
      fwrite(f,p(1),'int8');
      fwrite(f,p(2),'int8');
      fwrite(f,p(3),'int8');
      fwrite(f,p(4),'int8');

      % Vertical and horizontal pooling values (PoolY and PoolX)
      fwrite(f,po(1),'int8');
      fwrite(f,po(2),'int8');

      % Vertical and horizontal stride values (StrideY and StrideX)
      fwrite(f,st(1),'int8');
      fwrite(f,st(2),'int8');

    elseif strcmp(t,'softmax') % SoftMax layer (this is the last layer)
      s=size(net.meta.classes.description,2);
      fwrite(f,s,'int'); % Number of classes
      for j=1:size(net.meta.classes.description,2) % For each class description
        s=size(net.meta.classes.description{j},2);
        fwrite(f,s,'int8'); % String length
        fprintf(f,'%s',net.meta.classes.description{j}); % Class description (string)
      end
    end

  end

  fwrite(f,3,'int8'); % Length of "EOF" as if it is a layer type.
  fprintf(f,'EOF');   % And the "EOF" string itself...
  fclose(f);

end
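
On the C# side, the beginning of a file written by the script above could be read back with a `BinaryReader`. This is only a sketch derived from the write order in the script, not CeNiN's actual loader: `CeninHeader` and `CeninHeaderReader` are hypothetical names, and the code assumes exactly three input dimensions, 32-bit `int` values and little-endian byte order.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical container for the values at the start of a .cenin file.
public sealed class CeninHeader
{
    public int LayerCount;
    public int[] InputSize;        // height, width, number of channels (assumed to be 3 values)
    public float[] AverageValues;  // the three average-image values written by the script
}

public static class CeninHeaderReader
{
    public const string Magic = "CeNiN NEURAL NETWORK FILE";

    // Reads the header fields in the same order the conversion script writes them.
    public static CeninHeader Read(BinaryReader reader)
    {
        string magic = Encoding.ASCII.GetString(reader.ReadBytes(Magic.Length));
        if (magic != Magic)
            throw new InvalidDataException("Not a CeNiN file.");

        var header = new CeninHeader();
        header.LayerCount = reader.ReadInt32();          // fwrite(f,lc,'int')

        header.InputSize = new int[3];
        for (int i = 0; i < 3; i++)
            header.InputSize[i] = reader.ReadInt32();    // fwrite(f,s(i),'int')

        header.AverageValues = new float[3];
        for (int i = 0; i < 3; i++)
            header.AverageValues[i] = reader.ReadSingle(); // fwrite(...,'single')

        return header;
    }

    // Reads a layer type: one length byte followed by that many ASCII characters.
    // The same encoding is used for the final "EOF" marker.
    public static string ReadLayerType(BinaryReader reader)
    {
        int length = reader.ReadByte();
        return Encoding.ASCII.GetString(reader.ReadBytes(length));
    }
}
```

After `CeninHeaderReader.Read` returns, repeated calls to `ReadLayerType` (each followed by reading that layer's parameters) would walk the rest of the file until the type string is "EOF".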

Useful Links

cuDNN: Efficient Primitives for Deep Learning
Pretrained models (they are not directly compatible with CeNiN)

History

3rd April, 2019: Initial version