
Object Detection using Deep Learning

Created 2 years ago
BalakumaranR (@BalakumaranR)

Introduction to Deep Learning!

  • Before starting, observe the image below for just a couple of seconds.

  • It's a 5, right?

  • Have you ever wondered how you were able to distinguish between the two images, even though the specific pixel values in the two images are very different from each other?

  • A part of the visual cortex in our brain is responsible for that! It lets us distinguish between different objects and text.

  • Now, what if I tell you to write a program to identify a digit from a given 28 x 28 pixel image? It suddenly becomes complicated, correct?

  • It seems impossible with conventional Python programming, even with traditional machine learning algorithms like regression and classification.

  • In traditional machine learning, we deal with structured data, but here we have to deal with unstructured data.

  • That's where Neural Networks come into play. Let's take a peek inside the black box!

  • Neural Networks contain nodes/neurons. Each neuron in the input and output layers holds a value in the range 0 - 1.

  • W1 and W2 are called the weights.

  • Each neuron inside the hidden layer applies an activation function, such as the Softplus, the sigmoid curve, or the ReLU bent line.

  • The ReLU activation function is used extensively in convolutional neural networks; we will come to that later!
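For reference, the three activation functions mentioned above can be written in a few lines of NumPy (a minimal sketch, not a framework implementation):

```python
import numpy as np

def softplus(x):
    # Smooth "softened ReLU": log(1 + e^x)
    return np.log1p(np.exp(x))

def sigmoid(x):
    # S-shaped curve that squashes any input into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # "Bent line": 0 for negative inputs, the input itself otherwise
    return np.maximum(0.0, x)

x = np.array([-2.0, 0.0, 3.0])
print(relu(x))       # [0. 0. 3.]
print(sigmoid(0.0))  # 0.5
```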

Activation functions
  • Each connection between the neurons has a specific weight and bias associated with it, which are calculated using the backpropagation technique. Backpropagation works by taking the gradient of a loss function, such as the sum of squared residuals (SSR) or cross-entropy applied to SoftMax outputs, and applying gradient descent.

  • Gradient descent is another way to find the minimum value of a given function. A minimum is usually found by setting the derivative equal to 0 and solving, but there are cases where we can't solve that equation analytically.

  • To handle that, we define step sizes so that the search converges as it approaches the actual minimum. In other words, we take large steps when our solution is far from the actual value, and smaller steps (baby steps) when it is near the actual value.
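A minimal sketch of gradient descent on a toy function f(x) = (x - 3)^2 illustrates this: the update is learning_rate * gradient, so the steps shrink automatically as the gradient approaches zero near the minimum (the learning rate and step count here are arbitrary choices):

```python
def gradient_descent(grad, x0, learning_rate=0.1, steps=100):
    # Repeatedly move against the gradient; the effective step size
    # (learning_rate * gradient) shrinks on its own as the gradient
    # approaches zero near the minimum -- the "baby steps".
    x = x0
    for _ in range(steps):
        x -= learning_rate * grad(x)
    return x

# f(x) = (x - 3)^2 has its minimum at x = 3, and f'(x) = 2 * (x - 3)
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(round(x_min, 4))  # 3.0
```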

  • So the weights/biases are calculated in such a way that the sum of squared residuals is at its minimum.

  • These weights and biases in the connections slice, flip, and stretch the activation functions into new shapes, which are then added together to get a new curve; that curve is then shifted to fit the data.

Source - Google images
Basic structure of a Neural Network

Image classification using Convolutional Neural Networks (CNN)

  • Consider the image below. We need to classify it as an O/X using our basic Neural Network; the image is 3 * 8 pixels.

  • We will have 24 input nodes and a hidden layer of at least 2 nodes, so there will be 48 different connections, and therefore 48 different weights and biases!

  • This is just for a 3 * 8 pixel image; a real-world image has 1920 * 1080 pixels (times 3 color channels).

  • First-layer neurons ~ 6 million

  • Hidden-layer neurons ~ 4 million (approx.)

  • Weights and biases ~ 6 million * 4 million ~ 24 trillion!

  • That is a lot of computation!
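The numbers above can be checked with a quick back-of-the-envelope calculation (the layer sizes are the rough estimates from the text, not exact figures):

```python
# Back-of-the-envelope parameter count for a fully connected network
# on a 1920 x 1080 RGB image (layer sizes are rough estimates).
input_neurons = 1920 * 1080 * 3   # one input per channel value: ~6.2 million
hidden_neurons = 4_000_000        # ~4 million, as approximated above
weights = input_neurons * hidden_neurons
print(f"{weights:.1e}")           # 2.5e+13 -- tens of trillions of weights
```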

  • By using Convolutional Neural Networks (CNNs), we reduce the number of input nodes.

  • A CNN first applies a filter to the input image. The intensities in the filter are calculated by the backpropagation technique using gradient descent.

  • We overlap the filter with the input image to get the dot product, then shift the filter by one pixel, and continue this process across the whole input to get the feature map.

I * K is called the feature map, where K is the filter and I is the input image (example)
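As a rough sketch of this sliding dot product (stride 1, no padding; the 3 x 3 image and diagonal 2 x 2 filter are made up for illustration):

```python
import numpy as np

def feature_map(image, kernel):
    # Slide the kernel over the image one pixel at a time, taking the
    # dot product at each position (stride 1, no padding).
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

image = np.array([[1, 0, 1],
                  [0, 1, 0],
                  [1, 0, 1]])
kernel = np.array([[1, 0],
                   [0, 1]])  # responds strongly to diagonal patterns
print(feature_map(image, kernel))
# [[2. 0.]
#  [0. 2.]]
```

The largest values appear where the diagonal pattern of the filter lines up with the image, which is exactly what the feature map is meant to capture.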
  • Then we run the feature map through the ReLU activation function.

  • Look at the feature map after the ReLU function below.

  • Observe the locations of the two 1's: they imply that our filter matched the input image best at those locations. As you can see, our filter has extracted features at the left corner and right corner of the input image.

  • Further, the dimensionality of the feature map is reduced by applying max pooling. The intensities in the max-pooled map are then passed as the input nodes to the Neural Network, which classifies the image as an X or an O.
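A minimal sketch of the ReLU-then-max-pooling step, using a made-up 4 x 4 feature map (not the one from the figures):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def max_pool(fmap, size=2):
    # Keep only the largest value in each size x size window, shrinking
    # the feature map while preserving the strongest filter matches.
    h, w = fmap.shape
    pooled = np.zeros((h // size, w // size))
    for r in range(pooled.shape[0]):
        for c in range(pooled.shape[1]):
            pooled[r, c] = fmap[r*size:(r+1)*size, c*size:(c+1)*size].max()
    return pooled

fmap = np.array([[ 1.0, -0.5,  0.2,  0.9],
                 [-0.3,  0.4, -0.1,  0.6],
                 [ 0.7,  0.1,  1.0, -0.2],
                 [-0.8,  0.3,  0.5,  0.0]])
print(max_pool(relu(fmap)))  # a 2 x 2 array: [[1.0, 0.9], [0.7, 1.0]]
```

Note how the 4 x 4 map shrinks to 2 x 2, so only 4 values instead of 16 go on to the classifier.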

  • Refer to the flow-chart diagram of a typical Convolutional Neural Network (CNN) below.


Object classification and Localization

  • The traditional method of object detection is the sliding-window approach: a window scans across the image, and each frame is passed to a classifier CNN to predict a label for that region, e.g., whether it is a dog or a person.
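A sketch of how many crops the sliding-window approach generates, assuming a single window size and stride (both numbers are made up for illustration):

```python
def sliding_windows(img_w, img_h, win=64, stride=32):
    # Yield the top-left corner of every window position; each crop
    # would be passed to the classifier CNN separately.
    for y in range(0, img_h - win + 1, stride):
        for x in range(0, img_w - win + 1, stride):
            yield x, y

# Even a small 256 x 256 image at a single window size and scale
# produces dozens of crops, each needing a full CNN forward pass:
print(len(list(sliding_windows(256, 256))))  # 49
```

Multiple window sizes and finer strides multiply this count further, which is what makes the sliding-window approach slow.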


Object detection by using YOLO

  • Instead of making predictions on many regions of an image, YOLO passes the entire image through the CNN at once, making the process much faster.

  • The CNN predicts the labels, bounding boxes, and confidence probabilities for the objects in the image.

  1. Each image is divided into an S x S grid.

  2. Each cell predicts B bounding boxes.

  3. The bounding boxes above the confidence threshold are returned.

    Refer to the images below.

For each cell, the CNN will predict a column vector:

  • Pc - probability that the bounding box contains an object.

  • bx, by - coordinates of the bounding box center.

  • bh, bw - height and width of the bounding box.

  • C - the class of the object (car, person).

  • Now we have the vector, which serves as the target for the output nodes of the CNN; for input, we provide the input images and let our model train.

  • After training with sufficient input images, we can use it for object detection and classification.
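As an illustration of how one cell's prediction vector might be decoded, assuming the layout [Pc, bx, by, bh, bw, class probabilities] described above (the function name, threshold, and class names are hypothetical):

```python
def decode_cell(pred, class_names, threshold=0.5):
    # pred = [Pc, bx, by, bh, bw, p(class_1), p(class_2), ...]
    pc, bx, by, bh, bw = pred[:5]
    if pc < threshold:
        return None  # no confident object in this cell
    class_probs = pred[5:]
    best = max(range(len(class_probs)), key=lambda i: class_probs[i])
    return {"label": class_names[best], "center": (bx, by),
            "size": (bw, bh), "confidence": pc}

# Hypothetical prediction for one grid cell (classes: car, person):
pred = [0.92, 0.5, 0.4, 0.3, 0.2, 0.1, 0.9]
print(decode_cell(pred, ["car", "person"]))
# {'label': 'person', 'center': (0.5, 0.4), 'size': (0.2, 0.3), 'confidence': 0.92}
```

Running this for every cell and keeping only the boxes above the threshold gives the final detections, as in step 3 above.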

Let's Get practical!

Source link: teachablemachine.withgoogle.com

Image classification problem

Step 1: Let's first define three classes: Dog, Cat, and Tiger, and provide our model with input data.

Step 2: Train the model, with default parameters

Step 3: Let's test our model with new images.

Let's confuse our model now!

Here we got contradictory results: because the new image contained two objects, the model couldn't predict it correctly.

Conclusion

  • We have seen the basic concepts of Neural Networks and the basic components involved in them. We have covered why we can't use basic Neural Networks for image processing and classification, and what the advantage of CNNs over ANNs is.

  • We have covered why CNNs are used extensively for image classification and how they reduce the number of input nodes, thus reducing computation.

  • We have seen the differences between image classification and image localization, and how classification and localization are achieved in YOLO by using bounding-box regression.

  • At last, we have seen a practical implementation of image classification using Teachable Machine.
