I am working on an image classification model in PyTorch. The setup is the following:
My training instances are bags of images (one training instance = one bag), where each bag contains a varying number of images. Each bag has a single label (0 or 1) indicating whether at least one of its images contains a certain property (e.g. a tumor, in my case). The objective is to learn a classifier that labels new bags as accurately as possible. However, the following problem occurs when I try to train my model: when I feed the bags one by one into the CNN architecture, the predicted probability that a bag has label 1 instantly saturates to either 0 or 1. A sketch of how the bags are fed in follows below.
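This is roughly how one bag is fed in per optimizer step; it is a minimal sketch, where `train_loader`, the BCE loss, and the Adam settings are simplifications rather than my exact code:

Python
import torch
import torch.nn as nn

model = Attention()  # the model defined further below
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # illustrative settings
criterion = nn.BCELoss()

for bag, label in train_loader:  # bag: 1 x N x C x H x W, label: 0/1 (train_loader is assumed)
    optimizer.zero_grad()
    Y_prob, Y_hat, A = model(bag)
    loss = criterion(Y_prob.view(-1), label.view(-1).float())
    loss.backward()
    optimizer.step()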

Now, before I delve into the code itself, which is quite long: I have a near-identical setup where I use digits instead of tissue images. Each bag contains a number of MNIST-like images (just pictures of digits), and a bag gets a positive label (i.e. 1) if one of its images shows the digit 9. Strangely enough, this digit task works very well (you can see that learning takes place, and good classification performance is obtained in the end), even though the setup is nearly identical to the tissue one.
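Concretely, a digit bag is formed roughly like this (a sketch; the bag-size distribution and sampling here are illustrative, not my exact code):

Python
import torch
from torchvision import datasets, transforms

mnist = datasets.MNIST('.', train=True, download=True,
                       transform=transforms.ToTensor())

def make_bag(mean_size=10):
    # draw a random bag size, then sample that many MNIST images (illustrative)
    n = max(1, int(torch.normal(torch.tensor(float(mean_size)), torch.tensor(2.0)).item()))
    idx = torch.randint(len(mnist), (n,))
    images = torch.stack([mnist[int(i)][0] for i in idx])   # N x 1 x 28 x 28
    digits = torch.tensor([mnist[int(i)][1] for i in idx])
    bag_label = (digits == 9).any().float()  # positive iff the bag contains a 9
    return images.unsqueeze(0), bag_label    # 1 x N x 1 x 28 x 28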

Below I post the code sections that differ between the two tasks. The differences are essentially the following: the input shapes of the bags differ, since the digit images are much smaller (28x28) and have only one channel, while the tissue images are 224x224 with three channels. Therefore the convolutional layers also vary a little in their specification.
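Where the flattened sizes 8 * 54 * 54 and 20 * 4 * 4 in the code come from: with no padding, each convolution shrinks the spatial size by kernel_size - 1, and each 2x2 max-pool halves it (integer division). A quick check of the arithmetic:

Python
def conv_out(size, kernel):  # stride 1, no padding
    return size - kernel + 1

def pool_out(size):          # MaxPool2d(2, stride=2)
    return size // 2

# tissue: 224 -> 221 (conv k=4) -> 110 (pool) -> 108 (conv k=3) -> 54 (pool)
print(pool_out(conv_out(pool_out(conv_out(224, 4)), 3)))  # 54

# digits: 28 -> 24 (conv k=5) -> 12 (pool) -> 8 (conv k=5) -> 4 (pool)
print(pool_out(conv_out(pool_out(conv_out(28, 5)), 5)))   # 4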

The first code section is for the tissue images; this is the model that somehow won't learn:

Python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Attention(nn.Module):
    def __init__(self):
        super(Attention, self).__init__()
        self.L = 500
        self.D = 128
        self.K = 1


        self.feature_extractor_part1 = nn.Sequential(
            nn.Conv2d(3, 4, kernel_size=4), # 3 input channels (RGB); each kernel has size 3x4x4
            nn.ReLU(),
            nn.MaxPool2d(2, stride=2),
            nn.Conv2d(4, 8, kernel_size=3), # 4 -> 8 feature maps
            nn.ReLU(),
            nn.MaxPool2d(2, stride=2)
        )

        self.feature_extractor_part2 = nn.Sequential(
            nn.Linear(8 * 54 * 54, self.L),    # y = Ax + b
            nn.ReLU(),
            #add dropout
        )

        self.attention = nn.Sequential(
            nn.Linear(self.L, self.D),
            nn.Tanh(),
            nn.Linear(self.D, self.K)
        )

        self.classifier = nn.Sequential(
            nn.Linear(self.L*self.K, 1),
            nn.Sigmoid()
        )

    # x is the input and holds a single bag of images
    def forward(self, x):
        x = x.squeeze(0)  # drop the leading batch dimension; x becomes N x 3 x 224 x 224

        # feature extraction part
        H = self.feature_extractor_part1(x)   # N x 8 x 54 x 54
        H = H.view(-1, 8 * 54 * 54)           # flatten each instance
        H = self.feature_extractor_part2(H)   # NxL
        

        # aggregation part
        A = self.attention(H)  # NxK
        A = torch.transpose(A, 1, 0)  # KxN
        A = F.softmax(A, dim=1)  # softmax over N

        # attention-weighted average of the instance features
        M = torch.mm(A, H)  # KxL, the feature vector of the whole bag
        print(M.shape)  # torch.Size([1, 500])
      

        # final transformation part
        Y_prob = self.classifier(M)  # map the KxL bag feature to a single bag-label probability
        Y_hat = torch.ge(Y_prob, 0.5).float()
 

        return Y_prob, Y_hat, A


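For reference, a quick shape check of the tissue model with a random dummy bag (the bag size 7 here is arbitrary):

Python
import torch

model = Attention()
bag = torch.randn(1, 7, 3, 224, 224)  # one bag of 7 RGB 224x224 images
Y_prob, Y_hat, A = model(bag)
print(Y_prob.shape, Y_hat.shape, A.shape)
# torch.Size([1, 1]) torch.Size([1, 1]) torch.Size([1, 7])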

The second code section is for the digit images; this one works perfectly:


Python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Attention(nn.Module):
    def __init__(self):
        super(Attention, self).__init__()
        self.L = 500
        self.D = 128
        self.K = 1

        self.feature_extractor_part1 = nn.Sequential(
            nn.Conv2d(1, 10, kernel_size=5), # 1 input channel (grayscale), 10 output feature maps
            nn.ReLU(),
            nn.MaxPool2d(2, stride=2),
            nn.Conv2d(10, 20, kernel_size=5), # 10 -> 20 feature maps
            nn.ReLU(),
            nn.MaxPool2d(2, stride=2)
        )

        self.feature_extractor_part2 = nn.Sequential(
            nn.Linear(20 * 4 * 4, self.L),    # y = Ax + b; 20 feature maps of size 4x4
            nn.ReLU(),
        )

        self.attention = nn.Sequential(
            nn.Linear(self.L, self.D),
            nn.Tanh(),
            nn.Linear(self.D, self.K)
        )

        self.classifier = nn.Sequential(
            nn.Linear(self.L*self.K, 1),
            nn.Sigmoid()
        )

    # x is the input and holds a single bag of images
    def forward(self, x):

        x = x.squeeze(0)  # drop the leading batch dimension; x becomes N x 1 x 28 x 28

        # feature extraction part
        H = self.feature_extractor_part1(x)   # N x 20 x 4 x 4
        H = H.view(-1, 20 * 4 * 4)            # flatten each instance
        H = self.feature_extractor_part2(H)   # NxL

        # aggregation part
        A = self.attention(H)  # NxK
        A = torch.transpose(A, 1, 0)  # KxN
        A = F.softmax(A, dim=1)  # softmax over N

        # attention-weighted average of the instance features
        M = torch.mm(A, H)  # KxL, the feature vector of the whole bag
        
        # final transformation part
        Y_prob = self.classifier(M)  # map the KxL bag feature to a single bag-label probability
        Y_hat = torch.ge(Y_prob, 0.5).float()

        return Y_prob, Y_hat, A


What I have tried:

I have tried changing the dimensions of the convolutional layers and the learning rate, for example as sketched below.
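A sketch of the kind of changes I mean (the optimizer choice and exact values are illustrative; I tried several):

Python
import torch
import torch.nn as nn

model = Attention()
# a lower learning rate than my starting point (values varied)
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)
# I also varied the conv widths, e.g. nn.Conv2d(3, 16, kernel_size=4)
# as the first layer (adjusting the downstream layer sizes to match)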