YOLO-World: Real-Time Open-Vocabulary Object Detection

Tianheng Cheng (School of EIC, Huazhong University of Science & Technology), Lin Song (Tencent AI Lab and ARC Lab, Tencent PCG), Yixiao Ge (Tencent AI Lab and ARC Lab, Tencent PCG), Wenyu Liu (School of EIC, Huazhong University of Science & Technology), Xinggang Wang (School of EIC, Huazhong University of Science & Technology), Ying Shan (Tencent AI Lab and ARC Lab, Tencent PCG) (2024)

Paper Information

  • arXiv ID: 2401.17270
  • Venue: Computer Vision and Pattern Recognition
  • Domain: computer vision, artificial intelligence
  • SOTA Claim: Yes
  • Reproducibility: 7/10

Abstract

The You Only Look Once (YOLO) series of detectors have established themselves as efficient and practical tools. However, their reliance on predefined and trained object categories limits their applicability in open scenarios. Addressing this limitation, we introduce YOLO-World, an innovative approach that enhances YOLO with open-vocabulary detection capabilities through vision-language modeling and pre-training on large-scale datasets. Specifically, we propose a new Re-parameterizable Vision-Language Path Aggregation Network (RepVL-PAN) and region-text contrastive loss to facilitate the interaction between visual and linguistic information. Our method excels in detecting a wide range of objects in a zero-shot manner with high efficiency. On the challenging LVIS dataset, YOLO-World achieves 35.4 AP with 52.0 FPS on V100, which outperforms many state-of-the-art methods in terms of both accuracy and speed. Furthermore, the fine-tuned YOLO-World achieves remarkable performance on several downstream tasks, including object detection and open-vocabulary instance segmentation.
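
To make the RepVL-PAN idea concrete, the block below is a minimal, hedged sketch of the kind of text-guided feature re-weighting such a network can perform: each multi-scale image feature is modulated by its strongest match against the text (vocabulary) embeddings. The function name, tensor shapes, and the PyTorch style are illustrative assumptions, not the paper's actual implementation.

```python
import torch

def max_sigmoid_text_attention(image_feat: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    """Re-weight image features by their best-matching text embedding.

    image_feat: (B, C, H, W) multi-scale features from the detector's neck.
    text_emb:   (B, N, C)    embeddings of the N vocabulary prompts.
    """
    # similarity between every spatial location and every text embedding: (B, N, H, W)
    sim = torch.einsum('bnc,bchw->bnhw', text_emb, image_feat)
    # keep the strongest text match per location and squash it into (0, 1)
    attn = sim.max(dim=1).values.sigmoid().unsqueeze(1)  # (B, 1, H, W)
    return image_feat * attn
```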

Summary

This paper introduces YOLO-World, an object detection model that extends the traditional YOLO framework with open-vocabulary capabilities, allowing it to detect a wide range of objects without being restricted to predefined categories. The proposed system employs a novel Re-parameterizable Vision-Language Path Aggregation Network (RepVL-PAN) and combines vision-language modeling with large-scale pre-training on diverse datasets. YOLO-World demonstrates strong speed and accuracy, achieving 35.4 AP on the LVIS dataset while running at 52.0 FPS on a V100 GPU. The paper contrasts previous fixed-vocabulary models with a prompt-then-detect paradigm, emphasizing efficiency and practical deployment on edge devices. Multiple experiments show YOLO-World's strong performance across object detection and instance segmentation tasks, reinforcing its suitability for real-world open-vocabulary detection applications.
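
The prompt-then-detect paradigm mentioned above can be summarized as: encode the user's vocabulary once, offline, and hand the cached text embeddings to the detector so no text encoder has to run per image. Below is a hedged sketch of that workflow; the Hugging Face CLIP checkpoint name and the `detector(...)` call are illustrative placeholders, not the paper's released API.

```python
import torch
import torch.nn.functional as F
from transformers import CLIPTokenizer, CLIPTextModelWithProjection

# 1) Offline: encode the custom vocabulary with a CLIP text encoder.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")

prompts = ["person", "bicycle", "fire hydrant", "golden retriever"]
with torch.no_grad():
    tokens = tokenizer(prompts, padding=True, return_tensors="pt")
    vocab_embeds = F.normalize(text_encoder(**tokens).text_embeds, dim=-1)  # (num_prompts, dim)

# 2) Online: the cached embeddings act as the detector's classification weights,
#    so inference needs only the image and this small embedding matrix.
# detections = detector(image, vocab_embeds)   # hypothetical detector interface
```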

Methods

This paper employs the following methods:

  • YOLO
  • RepVL-PAN (Re-parameterizable Vision-Language Path Aggregation Network)
  • Region-text contrastive loss (see the sketch below)
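
The region-text contrastive loss named in the abstract aligns embeddings of predicted regions with embeddings of their matching noun phrases. A minimal sketch of one common formulation is shown below; the temperature value, tensor shapes, and function name are assumptions for illustration rather than the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def region_text_contrastive_loss(region_emb: torch.Tensor,
                                 text_emb: torch.Tensor,
                                 labels: torch.Tensor,
                                 tau: float = 0.07) -> torch.Tensor:
    """
    region_emb: (M, C) embeddings of predicted object regions.
    text_emb:   (N, C) embeddings of the vocabulary's noun phrases.
    labels:     (M,)   index of the matching phrase for each region.
    """
    # cosine similarity between every region and every phrase, scaled by a temperature
    logits = F.normalize(region_emb, dim=-1) @ F.normalize(text_emb, dim=-1).t() / tau
    # cross-entropy pulls each region toward its phrase and away from the other phrases
    return F.cross_entropy(logits, labels)
```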

Models Used

  • YOLO-World
  • BERT
  • CLIP

Datasets

The following datasets were used in this research:

  • LVIS
  • COCO
  • Objects365
  • GQA
  • Flickr30k
  • CC3M

Evaluation Metrics

  • AP
  • FPS (inference speed)

Results

  • Achieves 35.4 AP at 52.0 FPS (NVIDIA V100) on LVIS in a zero-shot setting
  • Outperforms many state-of-the-art open-vocabulary detectors in both accuracy and speed
  • After fine-tuning, performs strongly on downstream object detection and open-vocabulary instance segmentation

Limitations

The authors identified the following limitations:

  • Not specified

Technical Requirements

  • Number of GPUs: 32
  • GPU Type: NVIDIA V100

Keywords

YOLO-World, open-vocabulary detection, vision-language modeling, Re-parameterizable Vision-Language Path Aggregation Network (RepVL-PAN), contrastive learning
