YOLO-World: Real-Time Open-Vocabulary Object Detection

Tianheng Cheng (School of EIC, Huazhong University of Science & Technology), Lin Song (Tencent AI Lab and ARC Lab, Tencent PCG), Yixiao Ge (Tencent AI Lab and ARC Lab, Tencent PCG), Wenyu Liu (School of EIC, Huazhong University of Science & Technology), Xinggang Wang (School of EIC, Huazhong University of Science & Technology), Ying Shan (Tencent AI Lab and ARC Lab, Tencent PCG) (2024)

Paper Information

  • arXiv ID: 2401.17270
  • Venue: Computer Vision and Pattern Recognition
  • Domain: computer vision, artificial intelligence
  • SOTA Claim: Yes
  • Reproducibility: 7/10

Abstract

The You Only Look Once (YOLO) series of detectors have established themselves as efficient and practical tools. However, their reliance on predefined and trained object categories limits their applicability in open scenarios. Addressing this limitation, we introduce YOLO-World, an innovative approach that enhances YOLO with open-vocabulary detection capabilities through vision-language modeling and pre-training on large-scale datasets. Specifically, we propose a new Re-parameterizable Vision-Language Path Aggregation Network (RepVL-PAN) and region-text contrastive loss to facilitate the interaction between visual and linguistic information. Our method excels in detecting a wide range of objects in a zero-shot manner with high efficiency. On the challenging LVIS dataset, YOLO-World achieves 35.4 AP with 52.0 FPS on V100, which outperforms many state-of-the-art methods in terms of both accuracy and speed. Furthermore, the fine-tuned YOLO-World achieves remarkable performance on several downstream tasks, including object detection and open-vocabulary instance segmentation.
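
To make the RepVL-PAN idea concrete, the block below is a minimal, hedged sketch of the kind of text-guided feature re-weighting such a network can perform: each multi-scale image feature is modulated by its strongest match against the text (vocabulary) embeddings. The function name, tensor shapes, and the PyTorch style are illustrative assumptions, not the paper's actual implementation.

```python
import torch

def max_sigmoid_text_attention(image_feat: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    """Re-weight image features by their best-matching text embedding.

    image_feat: (B, C, H, W) multi-scale features from the detector's neck.
    text_emb:   (B, N, C)    embeddings of the N vocabulary prompts.
    """
    # similarity between every spatial location and every text embedding: (B, N, H, W)
    sim = torch.einsum('bnc,bchw->bnhw', text_emb, image_feat)
    # keep the strongest text match per location and squash it into (0, 1)
    attn = sim.max(dim=1).values.sigmoid().unsqueeze(1)  # (B, 1, H, W)
    return image_feat * attn
```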

Summary

This paper introduces YOLO-World, an object detection model that extends the traditional YOLO framework with open-vocabulary capabilities, allowing it to detect a wide range of objects without being restricted to predefined categories. The proposed system employs a novel Re-parameterizable Vision-Language Path Aggregation Network (RepVL-PAN) and combines vision-language modeling with large-scale pre-training on diverse datasets. YOLO-World demonstrates strong speed and accuracy, achieving 35.4 AP on the LVIS dataset while running at 52.0 FPS on a V100 GPU. The paper contrasts previous fixed-vocabulary models with a prompt-then-detect paradigm, emphasizing efficiency and practical deployment on edge devices. Multiple experiments show YOLO-World's strong performance across object detection and instance segmentation tasks, reinforcing its suitability for real-world open-vocabulary detection applications.
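
The prompt-then-detect paradigm mentioned above can be summarized as: encode the user's vocabulary once, offline, and hand the cached text embeddings to the detector so no text encoder has to run per image. Below is a hedged sketch of that workflow; the Hugging Face CLIP checkpoint name and the `detector(...)` call are illustrative placeholders, not the paper's released API.

```python
import torch
import torch.nn.functional as F
from transformers import CLIPTokenizer, CLIPTextModelWithProjection

# 1) Offline: encode the custom vocabulary with a CLIP text encoder.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")

prompts = ["person", "bicycle", "fire hydrant", "golden retriever"]
with torch.no_grad():
    tokens = tokenizer(prompts, padding=True, return_tensors="pt")
    vocab_embeds = F.normalize(text_encoder(**tokens).text_embeds, dim=-1)  # (num_prompts, dim)

# 2) Online: the cached embeddings act as the detector's classification weights,
#    so inference needs only the image and this small embedding matrix.
# detections = detector(image, vocab_embeds)   # hypothetical detector interface
```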

Methods

This paper employs the following methods:

  • YOLO
  • RepVL-PAN (Re-parameterizable Vision-Language Path Aggregation Network)
  • Region-text contrastive loss (see the sketch below)
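
The region-text contrastive loss named in the abstract aligns embeddings of predicted regions with embeddings of their matching noun phrases. A minimal sketch of one common formulation is shown below; the temperature value, tensor shapes, and function name are assumptions for illustration rather than the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def region_text_contrastive_loss(region_emb: torch.Tensor,
                                 text_emb: torch.Tensor,
                                 labels: torch.Tensor,
                                 tau: float = 0.07) -> torch.Tensor:
    """
    region_emb: (M, C) embeddings of predicted object regions.
    text_emb:   (N, C) embeddings of the vocabulary's noun phrases.
    labels:     (M,)   index of the matching phrase for each region.
    """
    # cosine similarity between every region and every phrase, scaled by a temperature
    logits = F.normalize(region_emb, dim=-1) @ F.normalize(text_emb, dim=-1).t() / tau
    # cross-entropy pulls each region toward its phrase and away from the other phrases
    return F.cross_entropy(logits, labels)
```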

Models Used

  • YOLO-World
  • BERT
  • CLIP

Datasets

The following datasets were used in this research:

  • LVIS
  • COCO
  • Objects365
  • GQA
  • Flickr30k
  • CC3M

Evaluation Metrics

  • AP
  • FPS (inference speed)

Results

  • Achieves 35.4 AP at 52.0 FPS (NVIDIA V100) on LVIS in a zero-shot setting
  • Outperforms many state-of-the-art open-vocabulary detectors in both accuracy and speed
  • After fine-tuning, performs strongly on downstream object detection and open-vocabulary instance segmentation

Limitations

The authors identified the following limitations:

  • Not specified

Technical Requirements

  • Number of GPUs: 32
  • GPU Type: NVIDIA V100

Keywords

YOLO-World, open-vocabulary detection, vision-language modeling, Re-parameterizable Vision-Language Path Aggregation Network (RepVL-PAN), contrastive learning
