Tianheng Cheng (School of EIC, Huazhong University of Science & Technology), Lin Song (Tencent AI Lab; ARC Lab, Tencent PCG), Yixiao Ge (Tencent AI Lab; ARC Lab, Tencent PCG), Wenyu Liu (School of EIC, Huazhong University of Science & Technology), Xinggang Wang (School of EIC, Huazhong University of Science & Technology), Ying Shan (Tencent AI Lab; ARC Lab, Tencent PCG) (2024)
This paper introduces YOLO-World, an object detection model that extends the traditional YOLO framework with open-vocabulary capabilities, enabling detection of a wide range of objects without a fixed set of predefined categories. The approach combines a novel Re-parameterizable Vision-Language Path Aggregation Network (RepVL-PAN) with vision-language modeling and large-scale pre-training on diverse datasets. YOLO-World achieves strong speed and accuracy, reaching 35.4 AP at 52.0 FPS on the LVIS dataset. The paper contrasts previous fixed-vocabulary models with a prompt-then-detect paradigm, in which user prompts are encoded into an offline vocabulary so that inference runs without an online text encoder, emphasizing efficiency and ease of deployment on edge devices. Experiments show strong performance on both open-vocabulary object detection and instance segmentation, supporting its suitability for real-world applications.
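The prompt-then-detect workflow described above can be sketched as follows. This is a minimal illustrative mock, not the authors' implementation: the function names (`encode_prompts`, `detect`), the fake hash-based embeddings, and the dummy detector are all assumptions standing in for a real text encoder (e.g. CLIP) and the YOLO-World detector with its RepVL-PAN.

```python
# Hedged sketch of the prompt-then-detect paradigm: prompts are encoded
# ONCE, offline, into vocabulary embeddings; detection then runs without
# a text encoder in the loop. All names here are illustrative.
from dataclasses import dataclass

@dataclass
class Detection:
    label: str
    score: float
    box: tuple  # (x1, y1, x2, y2)

def encode_prompts(prompts):
    """Stand-in for a text encoder: map each prompt string to a small
    embedding vector. Fake values derived from hash(), for illustration."""
    return {p: [float((hash(p) >> s) & 0xFF) / 255.0 for s in (0, 8, 16, 24)]
            for p in prompts}

def detect(image, vocab):
    """Stand-in detector: a real system would fuse the offline vocabulary
    embeddings with image features (YOLO-World does this in RepVL-PAN).
    Here we simply emit one dummy detection per vocabulary entry."""
    return [Detection(label=p, score=0.9, box=(0, 0, 10, 10)) for p in vocab]

# Offline step: encode the user's open-vocabulary prompts once.
vocab = encode_prompts(["person", "red backpack", "traffic cone"])

# Online step: the detector runs against the cached vocabulary only.
dets = detect(image=None, vocab=vocab)
print([d.label for d in dets])  # one detection per prompt
```

The key efficiency point this sketch illustrates is that the expensive text encoding is moved out of the inference loop, which is what makes the paradigm attractive for edge deployment.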
This paper employs the following methods: a Re-parameterizable Vision-Language Path Aggregation Network (RepVL-PAN), vision-language modeling, large-scale pre-training on diverse datasets, and a prompt-then-detect inference paradigm.
The following datasets were used in this research: LVIS (reported for open-vocabulary detection evaluation).
The authors identified the following limitations: