In this video, we're going to talk about how deep learning and
convolutional neural networks can be adapted to
solve semantic segmentation tasks in computer vision.
Semantic segmentation with convolutional neural networks
effectively means classifying each pixel in the image.
Thus, the idea is to create a map of fully detected object areas in the image.
Basically, what we want is an output like the image on
the slide, where every pixel has a label associated with it.
In this chapter, we're going to learn how
convolutional neural networks can do that job for us.
The naive approach is to reduce the segmentation task to a classification one.
The idea is based on the observation that
the activation maps induced by the hidden layers when passing an image
through a CNN can give us useful information
about which pixels activate most strongly for which class.
Our plan is to convert a normal CNN used for
classification to a fully convolutional neural network used for segmentation.
First, we take a convolutional neural network pre-trained for
classification, for example on ImageNet;
you can choose your own favorite model, like AlexNet or VGG or ResNet,
and then we convert the last fully connected layer into
a convolutional layer with a one-by-one receptive field.
When we do this, we gain some form of
localization if we look at where we have more activation.
An optional step is to fine-tune the fully
convolutional network on the segmentation task itself.
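As a concrete illustration, here is a minimal sketch of this conversion, assuming PyTorch and torchvision; the choice of ResNet-18 and the weights tag are illustrative, not prescribed by the lecture.

```python
import torch
import torch.nn as nn
from torchvision import models

# Pre-trained classifier (any of AlexNet, VGG, ResNet would do).
backbone = models.resnet18(weights="IMAGENET1K_V1")

# Copy the fully connected layer's weights (512 features -> 1000 classes)
# into an equivalent convolution with a one-by-one receptive field.
fc = backbone.fc
conv1x1 = nn.Conv2d(fc.in_features, fc.out_features, kernel_size=1)
conv1x1.weight.data = fc.weight.data.view(fc.out_features, fc.in_features, 1, 1)
conv1x1.bias.data = fc.bias.data

# Keep the convolutional stages, drop global pooling and the FC layer,
# and append the one-by-one convolution: the network is now fully convolutional.
fcn = nn.Sequential(*list(backbone.children())[:-2], conv1x1)

x = torch.randn(1, 3, 224, 224)
print(fcn(x).shape)  # torch.Size([1, 1000, 7, 7]) -- a coarse per-class activation map
```

The 7-by-7 output is exactly the coarse localization map mentioned above: each spatial position holds class scores for a patch of the input.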
An important point to note here is that the loss function we use in
this image segmentation scenario is
actually still the usual loss function we use for classification,
multi-class cross-entropy, and not something like the L2 loss
that we would normally use when the output is an image.
This is because, despite what you might think,
we're actually just assigning a class to each of our output pixels,
so this is a classification problem.
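A minimal sketch of this per-pixel cross-entropy, assuming PyTorch; the shapes and the 21-class count are illustrative:

```python
import torch
import torch.nn as nn

num_classes = 21                                       # illustrative class count
logits = torch.randn(4, num_classes, 64, 64)           # network output: (N, C, H, W)
targets = torch.randint(0, num_classes, (4, 64, 64))   # one class index per pixel: (N, H, W)

# nn.CrossEntropyLoss supports spatial targets directly, averaging over all pixels.
loss = nn.CrossEntropyLoss()(logits, targets)
print(loss.item())
```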
The problem with this approach is that we lose resolution,
because the activations are downsampled over many layers.
An example with a cyclist is shown on the slide.
A different approach to solving semantic segmentation with
deep learning is based on a downsampling-upsampling architecture,
where the left and right parts have
the same size in terms of the number of trainable parameters.
This approach is also called the encoder-decoder architecture.
The main idea is to take an input image of size n times m,
compress it with a sequence of convolutions,
and then decompress it to get an output of the original size,
n times m. How can we do that?
To preserve the information,
we can use skip connections, or reverse all convolution and pooling layers
by applying unpooling and transposed convolution operations in the decoder part,
at the same places where max pooling and convolution are applied
in the convolutional, or encoder, part of the network.
A working example of such an architecture is the SegNet model,
featuring an encoder, or downsampling part, identical to VGG,
and a corresponding decoder, or upsampling part.
While possessing many learnable parameters,
the model performed well for road scene segmentation on the CamVid dataset,
while slightly underperforming on the segmentation of medical images.
Let's look at the details of the transposed convolution employed in the SegNet model.
Basically, the idea is to undo the downscaling effect of all the previous layers.
In fact, the forward propagation of an upsampling, or
transposed, convolution is the backpropagation of a convolution,
and the backpropagation of an upsampling convolution is the forward propagation of a convolution.
The easiest way to obtain the result of
a transposed convolution is to apply an equivalent direct convolution.
The kernel and stride sizes remain the same,
but now we should use zero padding of the appropriate size.
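A minimal sketch, assuming PyTorch, of a transposed convolution that doubles the spatial size; the kernel and channel sizes are illustrative:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 16, 16)  # a downsampled feature map

# With kernel size 2 and stride 2, the transposed convolution exactly
# reverses the downscaling of a 2x2, stride-2 operation.
up = nn.ConvTranspose2d(in_channels=64, out_channels=32, kernel_size=2, stride=2)
print(up(x).shape)  # torch.Size([1, 32, 32, 32]) -- height and width doubled
```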
For a better understanding of the downsampling-upsampling architecture,
we need to study the mechanism of unpooling.
The max pooling operation is not invertible.
Therefore, one may consider different approximations to the inverse of max pooling.
The easiest way is to use resampling and interpolation.
This means taking an input image,
rescaling it to the desired size,
and then calculating the pixel values at each point using an interpolation method,
such as bilinear interpolation.
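A minimal sketch of this resampling-and-interpolation approach, assuming PyTorch:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 64, 16, 16)  # a pooled feature map

# Rescale to the desired size, computing each output pixel by
# bilinear interpolation of its neighbors.
y = F.interpolate(x, size=(32, 32), mode="bilinear", align_corners=False)
print(y.shape)  # torch.Size([1, 64, 32, 32])
```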
Another idea to reverse max pooling is the "bed of nails", where we either
duplicate the entry value across the whole block, or place it in
the top-left corner and fill the rest with zeros.
Yet another, effective mechanism is the following.
We record the positions, called max location
switches, where we located the biggest values during normal max pooling,
and then use those positions to reconstruct the data from the layer above.
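A minimal sketch of unpooling with max location switches, assuming PyTorch, where MaxPool2d records the switches and MaxUnpool2d uses them:

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)

x = torch.randn(1, 1, 4, 4)
pooled, switches = pool(x)           # pooled values plus max location switches
restored = unpool(pooled, switches)  # maxima return to their positions, zeros elsewhere
print(restored.shape)                # torch.Size([1, 1, 4, 4])
```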
U-Net, yet another model,
is a downsampling-upsampling architecture illustrated on the slide.
The downsampling part follows the typical architecture of a convolutional network.
It consists of the repeated application of two three-by-three unpadded convolutions,
each followed by a rectified linear unit, and
a two-by-two max pooling operation with stride two for downsampling.
At each downsampling step,
we double the number of feature channels.
Every step in the upsampling part consists of a two-by-two transposed
convolution that upsamples the feature map and halves the number of
feature channels, and a concatenation with
the correspondingly cropped feature map from the downsampling part,
which is implemented via a skip connection.
The result is then convolved by
two three-by-three convolutional layers, each followed by a rectified linear unit.
The cropping is necessary due to the loss of border pixels in every convolution.
At the final layer, a one-by-one convolution is used to
map each 64-component feature vector to the desired number of classes.
In total, the network has 23 convolutional layers.
U-Net performs well on medical image segmentation tasks.
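To tie the pieces together, here is a minimal sketch of one U-Net upsampling step, assuming PyTorch; the UpStep module, the center_crop helper, and the channel counts are illustrative, not taken from the original paper's code:

```python
import torch
import torch.nn as nn

def center_crop(enc_feat, target):
    # Crop the encoder feature map to the decoder's spatial size.
    _, _, h, w = target.shape
    _, _, H, W = enc_feat.shape
    top, left = (H - h) // 2, (W - w) // 2
    return enc_feat[:, :, top:top + h, left:left + w]

class UpStep(nn.Module):
    def __init__(self, in_ch):  # e.g. in_ch = 128 -> out_ch = 64
        super().__init__()
        # 2x2 transposed convolution: doubles spatial size, halves channels.
        self.up = nn.ConvTranspose2d(in_ch, in_ch // 2, kernel_size=2, stride=2)
        # Two 3x3 unpadded convolutions, each followed by a ReLU.
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, in_ch // 2, kernel_size=3), nn.ReLU(),
            nn.Conv2d(in_ch // 2, in_ch // 2, kernel_size=3), nn.ReLU(),
        )

    def forward(self, x, skip):
        x = self.up(x)                                    # upsample the decoder features
        x = torch.cat([center_crop(skip, x), x], dim=1)   # skip connection from the encoder
        return self.conv(x)

step = UpStep(128)
x = torch.randn(1, 128, 28, 28)     # decoder input
skip = torch.randn(1, 64, 64, 64)   # encoder feature map
print(step(x, skip).shape)          # torch.Size([1, 64, 52, 52])
```

Center cropping the encoder features mirrors the handling of the border pixels lost in every unpadded convolution.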
To summarize, you can view semantic segmentation as pixel-wise classification.
You could just directly apply a pre-trained convolutional neural network;
however, encoder-decoder style architectures seem to be more effective for these tasks.
A decoder network that has to upsample the internal representation of
the data uses specialized layers, such as transposed convolutions and
unpooling, to increase the spatial resolution of the produced representation,
ending up with the same dimensionality as the input image.
Also, what people use a lot is skip connections, which
help propagate gradients back and forth along the network.