Intro
YOLO (You Only Look Once) is an object detection algorithm that locates and classifies objects in an image or video in real time. It divides the input image into a grid of cells and makes predictions for each cell: bounding boxes that enclose the objects in the image, and class probabilities that indicate how likely each object is to belong to a given class.

YOLO is known for its efficiency and speed. It was developed to combine classification and bounding-box prediction in a single neural network, which lets it run faster than object detection methods that use separate networks for these tasks and makes it suitable for real-time applications. Because YOLO looks at the entire image at once, it captures the context around detected objects, which reduces the number of false-positive detections compared with methods that examine parts of the image separately. It also learns generalizable representations of objects, making it more applicable to different environments.
Process
YOLO works by dividing an input image into a grid of cells and making predictions for each cell.
Each cell predicts a fixed number of bounding boxes, which enclose objects in the image, and class probabilities, which give the likelihood that an object belongs to each class. The model also predicts a confidence score for each bounding box, indicating how certain it is that the cell contains an object and that the box is accurate.
Each bounding box is made up of 5 numbers: the x and y coordinates of its center, its width and height, and a confidence score. During training, the target confidence is the Intersection Over Union (IOU) between the predicted bounding box and the ground truth box.
In YOLO (You Only Look Once), the Intersection Over Union (IOU) is a measure used to evaluate the accuracy of bounding box predictions. It is calculated as the area of the intersection of the predicted bounding box and the ground truth box, divided by the area of the union of the same predicted and ground truth boxes.
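As a quick illustration, IOU can be computed in a few lines of plain Python. This is a minimal sketch; the `iou` function name and the corner-coordinate box format are assumptions for this example, not part of YOLO itself:

```python
def iou(box_a, box_b):
    """IOU of two boxes given as (x_min, y_min, x_max, y_max) tuples."""
    # Corners of the intersection rectangle.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])

    # Width and height clamp to zero when the boxes do not overlap.
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

An IOU of 1 means the predicted box matches the ground truth exactly; 0 means the boxes do not overlap at all.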
Each cell in the grid of the YOLO model predicts a fixed number of bounding boxes and a set of class probabilities, and the overall prediction of the model is a tensor of shape S x S x (C + B x 5), where S is the number of grid cells along each side (so there are S x S cells in total), C is the number of classes, and B is the number of bounding boxes predicted per cell. The ground-truth class for each cell is represented by a one-hot vector of length C.
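Plugging in the standard YOLOv1 hyperparameters from the paper (a 7 x 7 grid, 20 Pascal VOC classes, and 2 boxes per cell) shows how the output shape works out:

```python
# Standard YOLOv1 hyperparameters: 7x7 grid, 2 boxes per cell, 20 classes.
S, B, C = 7, 2, 20

# Each cell stores C class probabilities plus (x, y, w, h, confidence)
# for each of its B boxes.
per_cell = C + B * 5
output_shape = (S, S, per_cell)
total_values = S * S * per_cell

print(output_shape, total_values)  # (7, 7, 30) 1470
```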
However, it is important to note that each cell can only predict one class, even if it predicts multiple bounding boxes. If there are multiple objects of different classes in one grid cell, the algorithm may fail to classify both correctly.
The bounding box itself is made up of 4 elements: the x and y coordinates of the center, the width, and the height. The center coordinates are predicted as offsets within the grid cell, while the width and height are predicted relative to the size of the entire image; in both cases the model predicts the center, width, and height rather than the top-left and bottom-right corner positions. The ground-truth classification of the object is represented by a one-hot vector, with each class corresponding to a position in the vector: the true class has a value of 1 and every other position has a value of 0. The model's predicted class is the position with the highest predicted probability.
The model also predicts a confidence score for the bounding box, which indicates its certainty that there is an object in the cell and that the bounding box is accurate.
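To make the encoding concrete, here is a sketch of how one cell's relative prediction could be decoded back into pixel coordinates. `decode_box` is a hypothetical helper written for this example, not part of any YOLO library, and it assumes the cell-relative center / image-relative size convention described above:

```python
def decode_box(row, col, x, y, w, h, S, img_w, img_h):
    """Convert one cell's relative prediction into absolute pixel corners.

    (x, y) are offsets of the box center inside grid cell (row, col);
    (w, h) are fractions of the full image width and height.
    """
    center_x = (col + x) / S * img_w
    center_y = (row + y) / S * img_h
    box_w = w * img_w
    box_h = h * img_h
    # Return (x_min, y_min, x_max, y_max) for drawing.
    return (center_x - box_w / 2, center_y - box_h / 2,
            center_x + box_w / 2, center_y + box_h / 2)

# A box centered in cell (3, 3) of a 7x7 grid over a 448x448 image,
# covering a quarter of the image in each dimension.
print(decode_box(3, 3, 0.5, 0.5, 0.25, 0.25, 7, 448, 448))
# (168.0, 168.0, 280.0, 280.0)
```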
YOLO Architecture
The YOLO architecture is made up of three main components: the backbone, the neck, and the head.
The backbone is responsible for detecting key features in the image and processing them using convolutional layers.
The neck uses the features from the backbone and combines them with fully connected layers to make predictions on probabilities and bounding box coordinates.
The head is the final output layer of the model. It takes the predictions from the neck and produces the final output tensor of shape S x S x (C + B x 5), where S is the grid size, C is the number of classes, and B is the number of bounding boxes predicted per cell.
The head can be interchanged with other layers that have the same input shape for transfer learning.
Together, these three components of the YOLO model work to extract visual features from the image, classify them, and generate bounding boxes around the objects.
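The data flow through the three components can be sketched schematically. This is shape bookkeeping only, not a real network: the stage functions and feature-map sizes are illustrative assumptions loosely based on YOLOv1, where a 448 x 448 input is reduced to a 7 x 7 feature map and a 4096-unit hidden vector:

```python
S, B, C = 7, 2, 20  # grid size, boxes per cell, classes

def backbone(image_shape):
    # Convolutional layers downsample the image into a feature map,
    # e.g. 448x448x3 -> 7x7x1024 (illustrative sizes).
    h, w, _ = image_shape
    return (h // 64, w // 64, 1024)

def neck(feature_shape):
    # Fully connected layers flatten the features into one hidden vector
    # (4096 units in the original paper).
    return (4096,)

def head(hidden_shape):
    # The final layer reshapes the vector into the S x S x (C + B*5) grid.
    return (S, S, C + B * 5)

out = head(neck(backbone((448, 448, 3))))
print(out)  # (7, 7, 30)
```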
YOLO Training
The YOLO model is pre-trained on an image classification dataset, in this case the ImageNet 1000-class competition dataset, and then fine-tuned for object detection by adding additional convolutional and fully connected layers to the model.
The resolution of the input images is then increased from 224 x 224 (used for pre-training) to 448 x 448, so that finer details can be captured for detection.
The model is trained on the Pascal VOC 2007 and 2012 datasets for about 135 epochs with a batch size of 64, using data augmentation and dropout to prevent overfitting.
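The training setup described above can be summarized as a small configuration sketch. The epoch count and dropout rate are the values reported in the YOLOv1 paper; treat this as a summary of hyperparameters, not a runnable trainer:

```python
# Summary of the YOLOv1 training setup (values from the paper where stated).
train_config = {
    "pretrain_dataset": "ImageNet (1000 classes)",
    "detection_datasets": ["Pascal VOC 2007", "Pascal VOC 2012"],
    "input_resolution": (448, 448),   # raised from 224x224 after pre-training
    "epochs": 135,
    "batch_size": 64,
    "regularization": ["data augmentation", "dropout (rate 0.5)"],
}

print(train_config["epochs"], train_config["batch_size"])  # 135 64
```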
The YOLO model uses a loss function that measures the difference between the predicted bounding boxes and the ground truth bounding boxes in an image. The loss takes into account the midpoints, the widths and heights, and the class probabilities of the predicted and ground truth boxes. The width and height terms are square-rooted so that errors on small and large bounding boxes are weighted more equally.
The coordinate and confidence loss is calculated only for the predicted bounding box with the highest intersection over union (IOU) with the ground truth box in each grid cell. The loss is also split into two parts for cells with and without objects, with the confidence loss for cells without objects multiplied by a small coefficient to reduce the penalty for predicting confidence where there is no object.
The final part of the loss function calculates the difference between the predicted and actual class probabilities for each grid cell containing an object. The loss is then minimized through optimization during training to improve the performance of the model.
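The pieces of the loss described above can be sketched for a single cell. This is a simplified illustration: `cell_loss` is a hypothetical helper, the weighting constants are the values from the paper, and the full loss additionally sums over all cells and assigns the coordinate and confidence terms only to the highest-IOU box in each cell:

```python
import math

# Loss weights from the YOLOv1 paper.
LAMBDA_COORD, LAMBDA_NOOBJ = 5.0, 0.5

def cell_loss(pred, truth, has_object):
    """Simplified YOLO loss for one cell.

    pred/truth are dicts with keys x, y, w, h, conf, and a
    "classes" list of per-class probabilities.
    """
    if not has_object:
        # Empty cells only penalize predicted confidence, scaled down
        # so the many empty cells do not dominate training.
        return LAMBDA_NOOBJ * (pred["conf"] - 0.0) ** 2

    # Localization: squared error on the center, and on the square roots
    # of width/height so small and large boxes are weighted more equally.
    loc = (pred["x"] - truth["x"]) ** 2 + (pred["y"] - truth["y"]) ** 2
    loc += (math.sqrt(pred["w"]) - math.sqrt(truth["w"])) ** 2
    loc += (math.sqrt(pred["h"]) - math.sqrt(truth["h"])) ** 2

    # Confidence and class-probability errors.
    conf = (pred["conf"] - truth["conf"]) ** 2
    cls = sum((p - t) ** 2 for p, t in zip(pred["classes"], truth["classes"]))

    return LAMBDA_COORD * loc + conf + cls
```

For a cell whose box is perfectly localized and classified but whose confidence is 0.8 instead of 1.0, the only remaining loss is the confidence term (0.8 - 1.0)^2 = 0.04.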
NOTE: All the pictures were taken from the original research paper (Link)