
YOLOv10: Real-Time End-to-End Object Detection

Ao Wang (School of Software), Hui Chen (BNRist), Lihao Liu (School of Software), Kai Chen (School of Software), Zijia Lin (School of Software), Jungong Han (Department of Automation), Guiguang Ding (School of Software), Tsinghua University, 2024. Contact: [email protected], [email protected], [email protected], [email protected]

Paper Information
arXiv ID
2405.14458
Venue
Neural Information Processing Systems
Domain
computer vision
SOTA Claim
Yes
Code
https://github.com/THU-MIG/yolov10
Reproducibility
7/10

Abstract

For example, our YOLOv10-S is 1.8× faster than RT-DETR-R18 at similar AP on COCO, while using 2.8× fewer parameters and FLOPs. Compared with YOLOv9-C, YOLOv10-B has 46% less latency and 25% fewer parameters at the same performance. Code and models are available at https://github.com/THU-MIG/yolov10.

Summary

The paper introduces YOLOv10, a new generation of real-time end-to-end object detectors that improves on previous YOLO versions by addressing two bottlenecks: the reliance on non-maximum suppression (NMS) for post-processing and inefficiencies in the model architecture. It proposes a dual assignment strategy for NMS-free training, in which a one-to-many head provides rich supervision during training while a one-to-one head enables NMS-free inference, improving performance while reducing inference latency. It further applies a holistic efficiency-accuracy driven design that lightens the classification head, decouples spatial and channel downsampling, and introduces rank-guided block design, large-kernel convolutions, and partial self-attention. Experiments show significant improvements in accuracy and latency over prior models such as YOLOv9 and RT-DETR across various model scales, validating the proposed methods.
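At the core of NMS-free training is a consistent matching metric shared by both heads, m(α, β) = s · p^α · IoU(b̂, b)^β, where s is the spatial prior, p the classification score, and b̂, b the predicted and ground-truth boxes. The sketch below is a minimal PyTorch illustration of this metric and of the one-to-one selection that makes NMS unnecessary at inference; the function name, random inputs, and top-k value are illustrative assumptions, while the α, β defaults follow the paper's one-to-many settings.

```python
import torch

def matching_metric(p, iou, s, alpha=0.5, beta=6.0):
    """Consistent matching metric m = s * p**alpha * IoU**beta.

    p:   (N,) classification scores for the target class
    iou: (N,) IoU between each predicted box and the ground-truth box
    s:   (N,) spatial prior (1.0 if the anchor point lies inside the
         instance, else 0.0)

    alpha/beta follow the paper's one-to-many defaults; reusing the same
    metric for both heads keeps their supervision consistent.
    """
    return s * p.pow(alpha) * iou.pow(beta)

# Hypothetical predictions for a single ground-truth object.
p = torch.rand(100)
iou = torch.rand(100)
s = (torch.rand(100) > 0.5).float()

m = matching_metric(p, iou, s)
topk = m.topk(10).indices   # one-to-many head: top-k positives give rich supervision
best = m.argmax()           # one-to-one head: one positive, so inference needs no NMS
```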

Methods

This paper employs the following methods:

  • NMS-free training
  • dual assignments
  • holistic efficiency-accuracy driven design
  • lightweight classification head
  • spatial-channel decoupled downsampling (sketched after this list)
  • rank-guided block design
  • large-kernel convolution
  • partial self-attention
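
As a concrete illustration of one item above, the following is a minimal PyTorch sketch of spatial-channel decoupled downsampling: a pointwise (1×1) convolution first adjusts the channel count, then a stride-2 depthwise convolution halves the spatial resolution, replacing a single dense 3×3 stride-2 convolution that would do both at once. The module name and the BatchNorm/SiLU choices are assumptions for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class DecoupledDownsample(nn.Module):
    """Spatial-channel decoupled downsampling (illustrative sketch)."""

    def __init__(self, c_in, c_out):
        super().__init__()
        # Pointwise conv handles the channel transformation only.
        self.pw = nn.Conv2d(c_in, c_out, kernel_size=1, bias=False)
        # Depthwise stride-2 conv handles the spatial downsampling only.
        self.dw = nn.Conv2d(c_out, c_out, kernel_size=3, stride=2,
                            padding=1, groups=c_out, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.dw(self.pw(x))))

x = torch.randn(1, 64, 80, 80)
print(DecoupledDownsample(64, 128)(x).shape)  # torch.Size([1, 128, 40, 40])
```

Separating the two roles cuts the parameter count and FLOPs relative to a dense 3×3 stride-2 convolution while retaining information during downsampling.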

Models Used

  • YOLOv10-N
  • YOLOv10-S
  • YOLOv10-M
  • YOLOv10-B
  • YOLOv10-L
  • YOLOv10-X
  • YOLOv9-C
  • RT-DETR-R18
  • RT-DETR-R101

Datasets

The following datasets were used in this research:

  • COCO

Evaluation Metrics

  • AP (COCO-style average precision, averaged over IoU thresholds 0.50:0.95; see the evaluation sketch below)
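
A common way to compute this metric from a detector's JSON predictions is pycocotools, as in the sketch below; the file paths are placeholders.

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Standard COCO AP protocol: AP is averaged over IoU thresholds 0.50:0.95.
coco_gt = COCO("annotations/instances_val2017.json")        # ground truth
coco_dt = coco_gt.loadRes("yolov10_predictions.json")       # detections in COCO format

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()  # prints AP, AP50, AP75, and size-specific APs
```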

Results

  • YOLOv10-S is 1.8× faster than RT-DETR-R18 under similar AP on COCO (a latency-timing sketch follows this list)
  • YOLOv10-B has 46% less latency and 25% fewer parameters than YOLOv9-C for the same performance
  • YOLOv10-S / X are 1.8× / 1.3× faster than RT-DETR-R18 / R101 under similar performance
  • YOLOv10 exhibits highly efficient parameter utilization
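
For context, the paper reports end-to-end latency measured with TensorRT on a T4 GPU. The plain-PyTorch timing loop below is only a rough, illustrative way to compare relative speeds; absolute numbers will differ, and the tiny convolution in the usage line merely stands in for a loaded YOLOv10 checkpoint.

```python
import time
import torch

@torch.no_grad()
def measure_latency(model, img_size=640, warmup=50, iters=200, device="cuda"):
    """Rough per-image latency in milliseconds (requires a CUDA device)."""
    model = model.to(device).eval()
    x = torch.randn(1, 3, img_size, img_size, device=device)
    for _ in range(warmup):          # warm up kernels and the allocator
        model(x)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()         # wait for all queued GPU work to finish
    return (time.perf_counter() - start) / iters * 1e3

# Usage with any detector; a tiny conv stands in so the sketch runs
# without downloading weights.
print(f"{measure_latency(torch.nn.Conv2d(3, 16, 3)):.2f} ms")
```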

Limitations

The authors identified the following limitations:

  • On small models, NMS-free training still shows an accuracy gap relative to the original one-to-many training with NMS
  • Further exploration is needed to close this gap in future versions

Technical Requirements

  • Number of GPUs: 8
  • GPU Type: NVIDIA GeForce RTX 3090

Keywords

YOLOv10, real-time object detection, end-to-end detection, NMS-free training, model efficiency

External Resources

  • Code and models: https://github.com/THU-MIG/yolov10