11  Object Detection Algorithms

11.1 SSD: Single Shot MultiBox Detector

The Single Shot MultiBox Detector (SSD) is an algorithm presented by W. Liu (2016) that takes a 300 x 300 input image and generates bounding boxes around objects of interest within it. The paper is linked here.

11.2 A Comprehensive Review of YOLO (v1 to v8+)

J. Terven (2023) presents a review and analysis of the evolution of the YOLO algorithm, with a focus on the innovations and contributions made by each iteration, as well as the major changes in network architecture (and training tricks) which have been implemented over time.

A timeline of YOLO development

11.2.1 Applications of YOLO

YOLO has proven invaluable for a number of different applications:

  • autonomous vehicles
    • enables quick identification and tracking of objects like vehicles, pedestrians, bicycles and other obstacles
  • action recognition
  • video surveillance
  • sports analysis
  • human-computer interaction
  • crop, disease, pest detection and classification
  • face detection - biometrics, security, facial recognition
  • cancer detection
  • skin segmentation
  • pill identification
  • remote sensing
    • satellite and aerial imagery object detection / classification
    • land use mapping
    • urban planning
  • security systems
  • smart transportation systems
  • robotics and drones

11.2.2 Evaluation Metrics

Average Precision (AP) and Mean Average Precision (mAP) are the most common metrics used in the object detection task. mAP measures average precision across all categories, providing a single value to compare different models.

11.2.2.1 How AP works

  • mAP is the mean of the per-class AP values, summarizing prediction accuracy across all object classes in the dataset
    • individual AP values are determined for each category separately.
  • IOU (intersection over union)
    • measures the overlap between the predicted and true bounding boxes: the area of their intersection divided by the area of their union

Intersection over union in practice
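
As a minimal sketch of the IOU definition, the function below computes the ratio for two axis-aligned boxes. The `(x1, y1, x2, y2)` corner format and the name `iou` are assumptions made for illustration, not something the review prescribes.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle (if any)
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])

    # Clamp to zero when the boxes do not overlap
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

Two 2 x 2 boxes offset by one unit in each direction share an area of 1 out of a union of 7, so their IOU is 1/7. A detection is usually counted as correct only when its IOU with a ground truth box exceeds a chosen threshold (0.5 is a common choice).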

Different methods are used to compute AP when evaluating object detection methods on the COCO and VOC datasets (PASCAL-VOC)
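
To make the difference concrete, here is a hedged sketch of two common ways to turn a precision-recall curve into an AP value: 11-point interpolation (as used by VOC 2007) and all-point interpolation (as used by later VOC editions). COCO goes further by sampling 101 recall points and averaging over IoU thresholds from 0.50 to 0.95, which is not reproduced here. The function name and argument layout are assumptions for illustration.

```python
import numpy as np

def average_precision(recall, precision, eleven_point=False):
    """AP from a precision-recall curve (recall sorted in ascending order)."""
    recall = np.asarray(recall, dtype=float)
    precision = np.asarray(precision, dtype=float)

    if eleven_point:
        # Average the best precision at recall >= t for t = 0.0, 0.1, ..., 1.0
        ap = 0.0
        for t in np.linspace(0.0, 1.0, 11):
            mask = recall >= t
            ap += precision[mask].max() if mask.any() else 0.0
        return ap / 11.0

    # All-point interpolation: area under the monotone precision envelope
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]   # make precision non-increasing
    idx = np.where(r[1:] != r[:-1])[0]         # points where recall changes
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))
```

mAP is then the mean of these per-class AP values.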

11.2.3 Non-Maximum Suppression

Non-maximum suppression (NMS) is a post-processing technique that reduces the number of overlapping boxes and improves detection quality. Object detectors typically generate multiple bounding boxes around the same object; NMS keeps the most confident ones and discards the rest.

The algorithm for this is defined below:

Non-Max Suppression Alg
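
A minimal sketch of greedy NMS in plain Python follows. The box format `(x1, y1, x2, y2)`, the default threshold of 0.5, and the function names are assumptions for illustration; in practice NMS is typically run separately per class and with a vectorized implementation.

```python
def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes kept by greedy non-maximum suppression."""

    def iou(a, b):
        # Intersection over union of two (x1, y1, x2, y2) boxes
        iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
        ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
        inter = iw * ih
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union > 0 else 0.0

    # Visit boxes from most to least confident
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Drop every remaining box that overlaps the kept box too much
        order = [i for i in order
                 if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep
```

For example, two heavily overlapping boxes and one distant box reduce to the higher-scoring box of the pair plus the distant box.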

A useful visualization is also provided:

Non-Max Suppression Vis

11.2.4 YOLO

The original authors named YOLO (You Only Look Once) for the fact that it requires only a single pass over the image to accomplish the detection task. This is in contrast to the approaches used by Fast R-CNN and sliding window methods.

The output coordinates of the bounding box were predicted using more straightforward regression techniques.

11.2.4.1 YOLOv1

PASCAL-VOC AP: 63.4%

YOLOv1 predicted all bounding boxes simultaneously by the following process:

  1. divide image into S \times S grid
  2. for each grid element, predict B bounding boxes, each with a confidence score, together with C class probabilities shared by all boxes in that element
  3. each bounding box had five values:
    1. Pc - confidence score for the bounding box - reflects how likely the box is to contain an object and how accurate the box is
    2. bx and by - coordinates of center of box relative to grid cell.
    3. bh and bw - height and width of box relative to the full image
  4. output an S \times S \times (B \times 5 + C) tensor
  5. (optional) NMS used to remove redundant bounding boxes
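
The bookkeeping in the steps above can be sketched as follows. The defaults S = 7, B = 2, C = 20 are the values used in the YOLOv1 paper for PASCAL VOC; the function names and the 448-pixel image size in the usage note are assumptions for illustration.

```python
def yolo_v1_output_shape(S=7, B=2, C=20):
    """Shape of the prediction tensor: S x S x (B * 5 + C)."""
    return (S, S, B * 5 + C)

def cell_box_to_image(row, col, bx, by, bw, bh, S, img_w, img_h):
    """Convert one predicted box to pixel units.

    bx, by are offsets of the box center within grid cell (row, col);
    bw, bh are fractions of the full image width and height.
    Returns (center_x, center_y, width, height) in pixels.
    """
    center_x = (col + bx) / S * img_w
    center_y = (row + by) / S * img_h
    return center_x, center_y, bw * img_w, bh * img_h
```

With the defaults the output is a 7 x 7 x 30 tensor. A box centered in cell (3, 3) of a 448 x 448 image with `bx = by = 0.5` lands at pixel (224, 224).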

Here is an example of that output:

yolo output prediction
11.2.4.1.1 v1 Architecture

Normal Architecture

  • 24 conv layers
    • 1 \times 1 conv layers are used - reduce number of feature maps and keep parameters lower
    • leaky rectified linear unit activations
  • 2 fc layers
    • predict bounding box coordinates / probs
    • linear activation function for final layer

FastYOLO

  • Used 9 conv layers instead of 24 for greater speed (at the cost of reduced accuracy)

yolo v1 architecture
11.2.4.1.2 v1 Training

Basic training process:

  1. pretrain first 20 layers at resolution 224 \times 224 with ImageNet dataset
  2. add last four layers with randomly initialized weights - fine tune model with PASCAL VOC 2007 and PASCAL VOC 2012 at resolution 448 \times 448

Loss functions:

  • scaling factors
    • \lambda_{coord} = 5 - gives more weight to the localization error of boxes that contain objects
    • \lambda_{noobj} = 0.5 - reduces the weight of the confidence error for boxes with no object
  • localization loss:
    • first two terms
    • computes error in predicted bounding box locations (x,y) and (w,h)
    • only penalizes boxes with objects in them
  • confidence loss:
    • confidence error when object is detected (third term)
    • confidence error when no object is in box (fourth term)
  • classification loss:
    • squared error of class conditional probabilities for each class if an object appears in the cell

yolo v1 loss function
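
For reference, the loss can be written out in LaTeX with the terms in the order described above (two localization terms, two confidence terms, one classification term). This is a transcription of the form given in the YOLOv1 paper and should be checked against the original figure. Here x, y, w, h correspond to bx, by, bw, bh; C_i is the confidence (Pc); p_i(c) is the class probability; hats denote predictions; and the indicator \mathbb{1}_{ij}^{obj} is 1 when box j of cell i is responsible for an object.

```latex
\begin{aligned}
\mathcal{L} ={}& \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}
    \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}
    \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left( C_i - \hat{C}_i \right)^2 \\
&+ \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj} \left( C_i - \hat{C}_i \right)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj} \sum_{c \in classes} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
```

The square roots on width and height reduce the penalty for a given absolute error on large boxes relative to small ones.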

11.2.4.2 YOLOv2 (YOLO 9000)

PASCAL-VOC AP: 78.6%

Improvements / Changes

  1. Batch normalization - included on all convolutional layers
  2. Higher resolution classifier - model pretrained at 224 x 224 and then fine-tuned with ImageNet at a higher resolution (448 x 448) for ten epochs
  3. fully convolutional - remove dense layers and use fully conv architecture
  4. use anchor boxes to predict bounding boxes
    1. anchor box - box with predefined shapes for prototypical objects
    2. defined for each grid cell
    3. system predicts coordinates and class for every anchor box

yolo v2 anchor boxes
  5. Dimension clusters - pick good anchor boxes using k-means clustering on the training bounding boxes - improves accuracy of bounding boxes
  6. Direct Location Prediction - box centers are predicted as offsets relative to the grid cell, which constrains each center to its cell
  7. Finer-grained features
    1. removed a pooling layer - get a 13 x 13 feature map for 416 x 416 images
    2. passthrough layer - 26 x 26 x 512 feature map -> stack adjacent features into different channels
  8. Multi-scale training - train on different input sizes to make the model robust to different input sizes
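
As a hedged illustration of direct location prediction with anchor boxes, the sketch below decodes raw network outputs into a box in grid units, following the parameterization in the YOLOv2 paper: a sigmoid keeps the center inside its cell, and the anchor dimensions are scaled exponentially. The names `decode_box`, `cx`, `cy`, `pw`, `ph` are chosen for illustration.

```python
import math

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Decode raw outputs (tx, ty, tw, th) into a box in grid-cell units.

    (cx, cy) is the top-left corner of the grid cell and
    (pw, ph) are the width and height of the anchor box.
    """
    def sigmoid(v):
        return 1.0 / (1.0 + math.exp(-v))

    bx = sigmoid(tx) + cx      # center x stays within the cell
    by = sigmoid(ty) + cy      # center y stays within the cell
    bw = pw * math.exp(tw)     # anchor width rescaled
    bh = ph * math.exp(th)     # anchor height rescaled
    return bx, by, bw, bh
```

With all raw outputs at zero, the decoded box sits at the center of its cell and has exactly the anchor's dimensions.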
11.2.4.2.1 v2 Architecture
  • backbone architecture -> Darknet-19
    • 19 conv layers
    • 5 max pool layers
      • non-linear operation - uses OT to perform efficiently
    • use 1 \times 1 conv between 3 \times 3 to reduce parameters
    • batch normalization to help convergence
    • object classification head (replaces last 4 conv layers of YOLOv1)
      • 1 conv layer (1000 filters)
      • GAP layer
      • Softmax classifier

yolo v2 architecture

11.2.4.3 YOLOv3

MS COCO AP: 36.2% AP(50): 60.6%

The code used to run YOLOv3 in Torch is provided at this repository

11.2.4.3.1 YOLOv3 Architecture

YOLOv3 makes use of a larger network architecture (backbone) called Darknet-53.

  • replaces all max-pooling layers with strided convolutions and adds residual connections - see Chapter 10 for more information on this primitive.

yolo v3 architecture

The darknet architecture is presented here as well (visually):

darknet 53 backbone
11.2.4.3.2 Multi-Scale predictions
  • enables multi-scale predictions (predictions at multiple grid sizes)
  • this helps to obtain finer detailed boxes (improves prediction of smaller boxes)
  • YOLOv3 generates three separate outputs:
    • y1: predicted on the 13 \times 13 grid
    • y2: predicted on the 26 \times 26 grid - the 13 \times 13 features after (Res \times 4) are upsampled and concatenated with the 26 \times 26 output of (Res \times 8), since the feature maps are of different sizes
    • y3: predicted on the 52 \times 52 grid - the y2 features are upsampled to match the 52 \times 52 feature maps
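
The three grid sizes follow from the downsampling factor at each output. A small sketch, assuming a 416 x 416 input, strides of 32, 16 and 8, and three anchor boxes per scale (values from the YOLOv3 paper, not stated in this section):

```python
def yolo_v3_grid_sizes(input_size=416, strides=(32, 16, 8)):
    """Grid size at each output scale for a square input."""
    return [input_size // s for s in strides]

def total_predicted_boxes(input_size=416, strides=(32, 16, 8), anchors_per_scale=3):
    """Number of boxes predicted before NMS, summed over all scales."""
    grids = yolo_v3_grid_sizes(input_size, strides)
    return sum(g * g * anchors_per_scale for g in grids)
```

Under these assumptions the grids are 13, 26 and 52 cells wide, giving 10647 candidate boxes per image before non-maximum suppression.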
11.2.4.3.3 Backbone, Neck, and Head

After release of YOLOv3, object detectors began to be described in terms of the backbone, neck, and head

modern object detection architecture

Backbone

  • Extracts useful features from the input image.
  • A convolutional neural network pretrained on a large-scale image classification task (ImageNet)
  • captures hierarchical features at different scales
    • low-level features - earlier layers
    • high-level features - deeper layers

Neck

  • aggregates / refines features extracted by backbone
    • enhance spatial / semantic information across different scales
    • includes conv layers
    • includes feature pyramid networks

Head

  • makes predictions based on features provided by backbone and neck
  • consists of task-specific subnetworks to perform classification, localization, instance segmentation and pose estimation
  • non-maximum suppression filters out overlapping predictions (retains only most confident detections)

11.2.4.4 YOLOv4

MS COCO AP: 43.5% AP(50): 65.7%

YOLOv4 groups model optimizations into two categories: bag of freebies and bag of specials

Bag of Freebies:

  • increase training time / cost
  • do not affect inference time
  • examples include data augmentation

Bag of Specials:

  • slightly increase inference time / cost
  • improve accuracy of the model (mAP)
  • examples include
    • enlarging receptive field
    • combining features
    • post-processing
11.2.4.4.1 Model Improvements
  • Enhanced network architecture with Bag-of-Specials
    • backbone: Darknet-53 + Cross-Stage Partial connections (CSPNet) + Mish activation function
      • tested several backbone architectures to choose best option
      • CSP reduces computation while maintaining accuracy
    • neck: Spatial Pyramid Pooling (SPP) + Multi-scale predictions + modified path aggregation network (PANet) + modified spatial attention module (SAM)
      • SPP increases receptive field without affecting inference speed
    • detection head: Same anchors as YOLOv3
  • Advanced training with Bag-of-Freebies
    • regular augmentation
      • random brightness, contrast, scaling, cropping, flipping, rotation
    • special
      • mosaic augmentation:
        • combines four images into a single one
        • reduces need for large mini-batches for batch normalization
      • DropBlock regularization (instead of Dropout)
      • CIoU loss and Cross mini-batch normalization (CmBN) for collecting statistics from the entire batch instead of just mini-batches
        • these changes improve the detector
  • Self-adversarial Training
    • improves model robustness to perturbations
  • Hyperparameter optimization with Genetic Algorithms
    • genetic algorithms used on the first 10% of training periods
    • cosine annealing scheduler to alter learning rate during training
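
To illustrate the core idea of mosaic augmentation (combining four images into one), here is a deliberately simplified sketch. It tiles four images into fixed quadrants using nearest-neighbour resizing. The real augmentation also picks a random split point and remaps the bounding box labels, which are omitted here. The function name and the 640-pixel output size are assumptions.

```python
import numpy as np

def simple_mosaic(images, out_size=640):
    """Tile four (H, W, 3) images into one out_size x out_size image."""
    if len(images) != 4:
        raise ValueError("mosaic needs exactly four images")

    half = out_size // 2
    canvas = np.zeros((out_size, out_size, 3), dtype=images[0].dtype)
    corners = [(0, 0), (0, half), (half, 0), (half, half)]  # (y, x) offsets

    for img, (y, x) in zip(images, corners):
        # Nearest-neighbour resize to half x half by sampling source indices
        rows = np.arange(half) * img.shape[0] // half
        cols = np.arange(half) * img.shape[1] // half
        canvas[y:y + half, x:x + half] = img[rows][:, cols]
    return canvas
```

Because each training sample now contains content from four images, a small mini-batch still exposes batch normalization to a varied set of scenes.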
11.2.4.4.2 YOLOv4 Architecture

yolo v4 architecture

11.2.4.5 YOLOv5

YOLOv5x MS COCO AP: 50.7% AP(50): X%

No paper exists for YOLOv5 but there are several key advantages to using it

  • It is developed in PyTorch instead of Darknet
  • is open source and actively maintained by Ultralytics
  • is easy to use, train, and deploy
  • many integrations for labeling, training, and deployment (mobile)

There are several scaled versions of this model:

  • YOLOv5n (nano)
  • YOLOv5s (small)
  • YOLOv5m (medium)
  • YOLOv5l (large)
  • YOLOv5x (extra large)

This is useful for V2X privacy preserving applications because we will want to make use of smaller networks than might otherwise be used and compare their performance (efficiency) against larger networks.

This helps to answer the question of whether the accuracy tradeoff is worth the speedup.

11.2.4.5.1 Results

YOLOv5x achieves the following results

  • AP 50.7% on COCO [batch size = 32, 640 pixels]
  • AP 55.8% on COCO [1536 pixels]

11.2.4.6 Scaled-YOLOv4

YOLOv4-large MS COCO AP: 55.6%

Scaled-YOLOv4 applies model scaling techniques: scaling up increases accuracy at the expense of speed, while scaling down increases speed at the expense of accuracy. Scaled-down models require less compute power and can run on embedded systems.

Like YOLOv5, it was developed in PyTorch instead of Darknet.

11.2.4.6.1 Results
  • YOLOv4-large - MS COCO AP: 56%
  • YOLOv4-tiny - MS COCO AP: 22%

11.2.4.7 YOLOR

MS COCO AP: 55.4% AP(50): 73.3%

YOLOR stands for “you only learn one representation” and is novel for introducing a multi-task learning approach which creates a single model for multiple tasks (classification, detection, pose estimation) by learning a general representation and using sub-networks to create task-specific representations.

11.2.4.7.1 Results
  • YOLOR - MS COCO AP: 55.4% AP(50): 73.3%

11.2.4.8 YOLOX

MS COCO AP: 50.1%

YOLOX was built on the Ultralytics implementation of YOLOv3 (see Section 11.2.4.3) in PyTorch.

11.2.4.8.1 Model Changes and Improvements
  • Anchor Free - simplified training and decoding process
  • Multi Positives - center sampling (assigned center 3 \times 3 area as positives) to account for imbalances produced by lack of anchors
  • Decoupled Head - classification confidence and localization accuracy separated into two heads (connecting them leads to some misalignment)
    • sped up model convergence and improved accuracy
  • Advanced Label Assignment - assigning ground truth labels can be ambiguous when boxes overlap - this is addressed by formulating the assignment procedure as an Optimal Transport problem. The authors develop a simplified version called simOTA
  • Strong augmentations - MixUp and Mosaic augmentations were used
11.2.4.8.2 Architecture

yolox architecture

11.2.4.9 YOLOv6

MS COCO AP: 52.5% AP(50): 70%

Adopted an anchor-free detector and provided a series of models at different scales for nuanced applications.

11.2.4.9.1 Model Changes and Improvements
  • New backbone based on RepVGG with higher parallelism, a neck based on RepBlocks or CSPStackRep Blocks, and an efficient decoupled head
  • Label Assignment with TOOD
  • New classification and regression losses using VariFocal loss and SIoU / GIoU regression loss
  • self-distillation for regression and classification tasks
  • quantization scheme for detection using RepOptimizer and channel-wise distillation for a faster detector

11.2.4.10 YOLOv7

MS COCO AP: 55.9% AP(50): 73.5% - input 1280 pixels

Compared to YOLOv4 and YOLOR, YOLOv7 achieves a significant reduction in parameters while also meaningfully improving average precision.

11.2.4.10.1 Model Changes and Improvements

Architectural Changes

  • Extended efficient layer aggregation network - ELAN allows for more efficient training and convergence
  • Model Scaling for concatenation-based models - maintain optimal structure of the model by scaling the depth and width of the block with the same factor

Bag-of-Freebies

  • Planned re-parameterized convolution - Architecture is inspired by RepConv. The identity connection in RepConv was found to destroy the residual in ResNet (Chapter 10) and the concatenation in DenseNet, so the identity connection is removed
  • Coarse label assignment for auxiliary head (training) and fine label assignment for the lead head (final out)
  • Batch normalization in conv-bn-activation - integrates mean and variance of batch normalization into weight and bias of convolutional layer at inference stage (batch norm folding)
  • Implicit knowledge inspired by YOLOR (see Section 11.2.4.7)
  • Exponential moving average as final inference model
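
The batch norm folding mentioned above can be sketched with NumPy. This is a generic illustration of the technique under assumed tensor layouts (`weight` as `(out_ch, in_ch, kh, kw)`, per-channel statistics of shape `(out_ch,)`), not the YOLOv7 authors' code.

```python
import numpy as np

def fold_batchnorm(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold batch norm statistics into the preceding conv layer.

    BN(conv(x)) = gamma * (W x + b - mean) / sqrt(var + eps) + beta
                = (scale * W) x + (b - mean) * scale + beta
    """
    scale = gamma / np.sqrt(var + eps)                    # one factor per output channel
    folded_weight = weight * scale[:, None, None, None]   # rescale each output filter
    folded_bias = (bias - mean) * scale + beta
    return folded_weight, folded_bias
```

At inference the folded layer produces the same output as the convolution followed by batch normalization, but with one operation instead of two.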

11.2.4.11 DAMO-YOLO

MS COCO AP: 50.0%

DAMO-YOLO was inspired by a series of technologies relevant at the time and provides tiny/small/medium scaled model variants:

  • Neural architecture search (MAE-NAS) - this was also used in Delphi
  • a large neck
  • a small head
  • aligned OTA label assignment - uses focal loss for classification cost and IoU of prediction / ground truth box as soft label. Enables selection of aligned samples for targets
  • knowledge distillation
    1. teacher guiding student (stage 1)
    2. student fine-tuning (stage 2)
    • enhancements in distillation approach
      • align module - adapts student features to same resolution as teacher’s
      • channel-wise dynamic temperature - normalizes teacher and student features

11.2.4.12 YOLOv8

YOLOv8x MS COCO AP: 53.9% (640 pixels)

A version of YOLO released by Ultralytics which is anchor free and uses mosaic augmentation during training, except for the last ten epochs. It provides five scaled versions:

  1. YOLOv8n (nano)
  2. YOLOv8s (small)
  3. YOLOv8m (medium)
  4. YOLOv8l (large)
  5. YOLOv8x (extra large)

11.2.5 PP-YOLO Models

The PP-YOLO models were developed in parallel with the standard YOLO variants. These models were based on YOLOv3 but were developed in the PaddlePaddle deep learning platform instead. The goal of their work was to show how an object detector should be constructed step-by-step, not to provide any novel functionality or a new approach.

11.2.5.1 PP-YOLO

MS COCO AP: 45.9% AP(50): 65.2%

11.2.5.1.1 Divergence from YOLOv3
  • ResNet50-vd backbone + deformable convolutions replace DarkNet-53 (Section 11.2.4.3) - achieves higher classification accuracy on ImageNet
  • Larger batch size - improves training stability
  • Maintained moving averages for trained parameters
  • DropBlock applied on FPN
  • IoU loss added with L1-Loss for bounding box regression
  • IoU prediction branch for measuring localization accuracy (and optimization)
  • Grid Sensitive Approach like YOLOv4 (Section 11.2.4.4) - improves bounding box center prediction at grid boundary
  • Matrix NMS (parallelized NMS for faster computation)
  • CoordConv - replaces the 1\times 1 convolutions of the FPN. Allows the network to learn translational invariance
  • Spatial Pyramid Pooling increases receptive field of the backbone
11.2.5.1.2 PP-YOLO Augmentations and Preprocessing
  • Mixup Training - weights \sim Beta(\alpha=1.5, \beta=1.5)
  • Random color distortion
  • Random expand
  • Random crop and random flip with probability 0.5
  • RGB channel z-score normalization with \mu = [0.485, 0.456, 0.406] and \sigma = [0.229, 0.224, 0.225]
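
Two of the steps above, mixup and channel normalization, can be sketched as follows. This is a minimal illustration assuming float RGB images scaled to [0, 1] with shape (H, W, 3). The function names are hypothetical, and label handling for mixup (both images' boxes are normally kept, weighted by the mixing factor) is omitted.

```python
import numpy as np

# Per-channel statistics listed above (RGB order)
RGB_MEAN = np.array([0.485, 0.456, 0.406])
RGB_STD = np.array([0.229, 0.224, 0.225])

def mixup(img_a, img_b, alpha=1.5, beta=1.5, rng=None):
    """Blend two same-sized images with a weight drawn from Beta(alpha, beta)."""
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, beta)
    return lam * img_a + (1.0 - lam) * img_b, lam

def normalize(img):
    """Z-score normalize each RGB channel of an (H, W, 3) image in [0, 1]."""
    return (img - RGB_MEAN) / RGB_STD
```

With alpha = beta = 1.5 the mixing weight is concentrated around 0.5, so the blended image usually contains a visible contribution from both inputs.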

11.2.5.2 PP-YOLOv2

MS COCO AP: 49.5%

11.2.5.2.1 Changes and Improvements
  • Backbone changed from ResNet50 to ResNet101
  • Path aggregation network instead of FPN
  • Mish Activation function applied to the detection neck
  • Larger input sizes - increases performance on small objects
  • Modified IoU aware branch - soft label format instead of soft weight format

11.2.5.3 PP-YOLOE

MS COCO AP: 51.4% (78.1 FPS NVIDIA V100)

11.2.5.3.1 Changes and Improvements
  • Anchor Free (following trends of other yolo models)
  • New backbone and neck - modified neck with RepResBlocks (combine dense and residual connections)
  • task alignment learning (see Section 11.2.4.8)
  • efficient task-aligned head (ET-head) - single head based on TOOD instead of splitting the classification / detection heads like with YOLOX (Section 11.2.4.8)
  • varifocal (VFL) and distribution focal loss (DFL)
    • VFL weights positive samples using a target score which places a greater importance on high-quality samples during training
    • DFL extends Focal Loss from discrete to continuous labels - better for representations which combine quality estimation and class prediction - this allows for better depiction of flexible distributions in real data (eliminates risk of inconsistency)

The authors provide several scaled models:

  • PP-YOLOE-s (small)
  • PP-YOLOE-m (medium)
  • PP-YOLOE-l (large)
  • PP-YOLOE-x (extra large)

11.2.6 Summary of YOLO

summary of yolo architectures and results