As the name suggests, semantic segmentation involves segmenting an input image according to its semantic (meaningful) content: each pixel is predicted and categorized into a semantic category, producing groups with specific labels (often visualized as multi-colored masks).
Applications:
Semantic segmentation is used in the following fields:
Robotics:
Machines use it to segregate objects, for example grouping items on a recycling line, and to locate objects and their boundaries.
Self-driving cars:
It provides information about free space, lane markings, traffic signals, and oncoming vehicles.
Semantic segmentation in Self-Driving cars
Biomedical imaging and diagnostics:
It's used during imaging to detect anomalies such as tumors and many other abnormalities.
Semantic segmentation in medical field
Facial recognition:
Used in computer vision applications to distinguish an individual's ethnicity, age, gender, and expressions.
Used in Aerial imagery and GeoSensing.
Precision agriculture.
Steps in Semantic Segmentation:
Classification:
Classifying objects in the input image.
Classifying the objects in the image using labels
Bounding:
Bounding the classified objects using bounding boxes.
Segmenting and Masking:
Categorizing the pixels in the localized images by creating a segmentation mask.
Masking labels on the localized image
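The segmentation mask produced in the last step is simply a 2-D array with one class index per pixel, which is then mapped to colors for display. A minimal sketch (the classes and palette here are hypothetical, chosen only to illustrate the idea):

```python
import numpy as np

# A segmentation mask: one class index per pixel.
# Hypothetical 4x4 mask with three classes: 0 = background, 1 = car, 2 = road
mask = np.array([
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [2, 2, 2, 2],
    [2, 2, 2, 2],
])

# To visualize it, each class index is mapped to a color (the "multi-color labels")
palette = np.array([[0, 0, 0],        # background -> black
                    [255, 0, 0],      # car        -> red
                    [0, 255, 0]])     # road       -> green
colored = palette[mask]               # fancy indexing yields a (4, 4, 3) RGB image
print(colored.shape)                  # (4, 4, 3)
```

Every pixel that shares a class index gets the same color, which is exactly what the segmentation overlays in the figures above show.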
State-of-the-Art Deep Learning Methods:
Some of the Deep learning methods for semantic segmentation are:
DeepLab by Google.
U-Net (used mainly in the biomedical field).
Fully Convolutional Network (FCN).
Semantic Segmentation using DeepLabV3+:
DeepLab is one of the most widely used state-of-the-art methods for semantic segmentation. Google open-sourced it in 2016, and it has since seen multiple improvements and integrations, released as successive versions:
DeepLabV2
DeepLabV3
DeepLabV3+
Architecture:
DeepLabV3+ Architecture
The architecture of DeepLabV3+ consists of:
Encoder
Decoder
Encoder:
ResNet101:
Part of the encoder is ResNet101 (Residual Network), a type of artificial neural network (ANN) that stacks residual blocks on top of each other to form the network. It is 101 layers deep, was trained on millions of images across many categories from the ImageNet database, and thus contains millions of parameters useful for image recognition.
ASPP:
Atrous convolution or dilated convolution:
It's one of the key innovations in the DeepLabV3+ encoder. Unlike standard convolution, atrous (dilated) convolution inserts gaps between the kernel elements, giving the network a larger effective receptive field: it can take a wider context into account when making predictions, without adding parameters or reducing resolution.
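The effect on the receptive field is easy to compute: a kernel of size k at dilation rate d spans d*(k-1)+1 input positions. A small sketch of this arithmetic (the helper function is just for illustration):

```python
def effective_kernel_size(kernel_size: int, dilation_rate: int) -> int:
    """Size of the input window an atrous (dilated) convolution spans.
    A dilation rate d inserts d-1 gaps between kernel taps, so a k-tap
    kernel covers d*(k-1)+1 input positions with no extra parameters."""
    return dilation_rate * (kernel_size - 1) + 1

# A standard 3x3 convolution sees a 3x3 window...
print(effective_kernel_size(3, 1))       # 3
# ...while the same 3x3 kernel at the dilation rates commonly used in
# DeepLab's ASPP spans much larger windows at identical parameter cost:
for rate in (6, 12, 18):
    print(effective_kernel_size(3, rate))  # 13, 25, 37
```

This is why stacking atrous convolutions lets the network consider large context while keeping the feature map dense.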
Atrous Spatial Pyramid Pooling (ASPP):
ASPP works by applying multiple atrous convolutions in parallel at different dilation rates to the feature map produced by the backbone, which has already been heavily downsampled.
ASPP builds on the spatial pyramid pooling (SPP) idea: it combines the outputs of atrous convolutions at different dilation rates, capturing information at multiple scales, which is important for semantic segmentation.
Pooling different convolution layers at different rates
The branch outputs of ASPP are concatenated, passed through a final 1x1 convolution, and then fed into the decoder.
Let's look into the code block for building an encoder module (ASPP):
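A minimal Keras sketch of such an ASPP block is below. The dilation rates (1, 6, 12, 18), the image-level pooling branch, and the 256-filter projection follow the commonly used DeepLabV3+ configuration; this is an illustrative implementation, not Google's official code:

```python
import tensorflow as tf
from tensorflow.keras.layers import (AveragePooling2D, Conv2D, BatchNormalization,
                                     Activation, UpSampling2D, Concatenate)

def ASPP(inputs):
    """Atrous Spatial Pyramid Pooling: parallel atrous convolutions at
    different dilation rates plus an image-level pooling branch,
    concatenated and projected to 256 channels."""
    shape = inputs.shape

    # Image-level (global average pooling) branch
    y_pool = AveragePooling2D(pool_size=(shape[1], shape[2]))(inputs)
    y_pool = Conv2D(256, 1, padding="same", use_bias=False)(y_pool)
    y_pool = BatchNormalization()(y_pool)
    y_pool = Activation("relu")(y_pool)
    y_pool = UpSampling2D((shape[1], shape[2]), interpolation="bilinear")(y_pool)

    def branch(rate, kernel):
        # One atrous convolution branch at the given dilation rate
        y = Conv2D(256, kernel, dilation_rate=rate, padding="same",
                   use_bias=False)(inputs)
        y = BatchNormalization()(y)
        return Activation("relu")(y)

    y_1 = branch(1, 1)     # 1x1 convolution
    y_6 = branch(6, 3)     # 3x3 convolution, dilation 6
    y_12 = branch(12, 3)   # 3x3 convolution, dilation 12
    y_18 = branch(18, 3)   # 3x3 convolution, dilation 18

    # Concatenate all branches and project with a final 1x1 convolution
    y = Concatenate()([y_pool, y_1, y_6, y_12, y_18])
    y = Conv2D(256, 1, padding="same", use_bias=False)(y)
    y = BatchNormalization()(y)
    y = Activation("relu")(y)
    return y
```

Each branch sees the same feature map at a different effective receptive field, and the 1x1 projection fuses the multi-scale evidence before the decoder.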
The decoder uses a combination of upsampling and convolution layers to combine the features extracted from the encoder and produce a dense segmentation map.
The decoder typically starts with an upsampling layer, which increases the spatial resolution of the feature map; DeepLabV3+ uses bilinear interpolation for this (rather than transposed convolution, which inserts zeros between values and convolves over the result). Upsampling is repeated until the spatial resolution of the feature map matches the input image.
Skip connections are used in the decoder, which allows the decoder to incorporate information from different scales and improve the accuracy of segmentation.
Skip connections, also known as shortcut connections, are a type of connection in neural networks that bypass one or more layers and directly connect the input of the network to a deeper layer.
Let's look at the code block for the model:
from tensorflow.keras.applications import ResNet101
from tensorflow.keras.layers import (Input, Conv2D, BatchNormalization,
                                     Activation, UpSampling2D, Concatenate)
from tensorflow.keras.models import Model

# Assumes ASPP and a SqueezeAndExcite helper block are defined beforehand
def deeplabv3_plus(shape):
    """ Input """
    inputs = Input(shape)

    """ Encoder """
    encoder = ResNet101(weights="imagenet", include_top=False, input_tensor=inputs)
    image_features = encoder.get_layer("conv4_block6_out").output
    x_a = ASPP(image_features)

    """ Decoder """
    x_a = UpSampling2D((4, 4), interpolation="bilinear")(x_a)
    # Skip connection: low-level features from an earlier encoder stage
    x_b = encoder.get_layer("conv2_block2_out").output
    x_b = Conv2D(filters=48, kernel_size=1, padding="same", use_bias=False)(x_b)
    x_b = BatchNormalization()(x_b)
    x_b = Activation("relu")(x_b)

    x = Concatenate()([x_a, x_b])
    x = SqueezeAndExcite(x)
    x = Conv2D(filters=256, kernel_size=3, padding="same", use_bias=False)(x)
    x = BatchNormalization()(x)
    x = Activation("relu")(x)
    x = Conv2D(filters=256, kernel_size=3, padding="same", use_bias=False)(x)
    x = BatchNormalization()(x)
    x = Activation("relu")(x)
    x = SqueezeAndExcite(x)

    """ Final Masking Stage """
    x = UpSampling2D((4, 4), interpolation="bilinear")(x)
    x = Conv2D(1, 1)(x)
    x = Activation("sigmoid")(x)

    model = Model(inputs, x)
    return model
Hopefully the working of the semantic segmentation architecture is now clear. Next, let's run a simple semantic segmentation with DeepLabV3 using PyTorch.
With the above code, we can see semantic segmentation in action. Here I used a pretrained DeepLabV3 model with a ResNet-101 backbone from PyTorch's torchvision.