
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, Jingren Zhou, Alibaba Group (2023)

Paper Information

  • arXiv ID: 2308.12966
  • Domain: computer vision, natural language processing
  • SOTA Claim: Yes
  • Code: Available
  • Reproducibility: 8/10

Abstract

In this work, we introduce the Qwen-VL series, a set of large-scale vision-language models (LVLMs) designed to perceive and understand both texts and images. Starting from the Qwen-LM as a foundation, we endow it with visual capacity by the meticulously designed (i) visual receptor, (ii) input-output interface, (iii) 3-stage training pipeline, and (iv) multilingual multimodal cleaned corpus. Beyond the conventional image description and question-answering, we implement the grounding and text-reading ability of Qwen-VLs by aligning image-caption-box tuples. The resulting models, including Qwen-VL and Qwen-VL-Chat, set new records for generalist models under similar model scales on a broad range of visual-centric benchmarks (e.g., image captioning, question answering, visual grounding) and different settings (e.g., zero-shot, few-shot). Moreover, on real-world dialog benchmarks, our instruction-tuned Qwen-VL-Chat also demonstrates superiority compared to existing vision-language chatbots. All models are public to facilitate future research.

Summary

The paper introduces the Qwen-VL series, large-scale vision-language models (LVLMs) that extend the Qwen-7B language model with visual perception. The models, including Qwen-VL and Qwen-VL-Chat, handle image captioning, visual question answering, text reading (OCR), and visual grounding, and achieve leading performance among generalist models of similar scale on a broad range of benchmarks. Training follows a three-stage pipeline: pre-training on large-scale image-text pairs, multi-task pre-training on high-quality annotated data (including image-caption-box tuples for grounding), and instruction fine-tuning for dialogue. The instruction-tuned Qwen-VL-Chat shows strong real-world dialogue ability, outperforming existing vision-language chatbots on instruction-following benchmarks. The paper emphasizes the models' versatility, multilingual support, and fine-grained visual understanding, validated through extensive evaluation on established benchmarks.
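
Since the checkpoints are public (see the "Code: Available" entry above), the models can be tried directly. The sketch below is a minimal inference example, assuming the released Hugging Face checkpoint Qwen/Qwen-VL-Chat and the custom helpers it loads via trust_remote_code (tokenizer.from_list_format, model.chat); the image path is a placeholder.

    # Minimal Qwen-VL-Chat inference sketch. Assumes the public HF checkpoint
    # "Qwen/Qwen-VL-Chat" and its remote-code helpers; the image path is a placeholder.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        "Qwen/Qwen-VL-Chat", device_map="auto", trust_remote_code=True
    ).eval()

    # Build an interleaved image + text query.
    query = tokenizer.from_list_format([
        {"image": "demo.jpeg"},  # placeholder image path or URL
        {"text": "Describe the image and read any text in it."},
    ])

    # Single-turn chat; `history` carries dialogue state across turns.
    response, history = model.chat(tokenizer, query=query, history=None)
    print(response)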

Methods

This paper employs the following methods:

  • Visual Receptor (see the sketch after this list)
  • 3-stage Training Pipeline
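
As a rough illustration of the visual receptor: per the paper, a ViT image encoder produces patch features, and a single-layer cross-attention adapter with a fixed set of learnable queries (256 in Qwen-VL) compresses them into a short sequence of visual tokens for the language model. The PyTorch sketch below is conceptual, not the authors' code; the dimensions (1664 for the ViT, 4096 for the LM) are illustrative and the adapter's position-aware encodings are omitted.

    # Conceptual sketch of Qwen-VL's visual receptor adapter: learnable queries
    # cross-attend over ViT patch features and compress them into a fixed-length
    # sequence of visual tokens projected into the LM embedding space.
    import torch
    import torch.nn as nn

    class VisionLanguageAdapter(nn.Module):
        def __init__(self, vit_dim: int = 1664, lm_dim: int = 4096,
                     num_queries: int = 256, num_heads: int = 16):
            super().__init__()
            # Learnable query embeddings that attend over the patch features.
            self.queries = nn.Parameter(torch.randn(num_queries, vit_dim) * 0.02)
            self.cross_attn = nn.MultiheadAttention(vit_dim, num_heads, batch_first=True)
            self.proj = nn.Linear(vit_dim, lm_dim)  # project into the LM embedding space

        def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
            # patch_feats: (batch, num_patches, vit_dim) from the ViT encoder.
            b = patch_feats.size(0)
            q = self.queries.unsqueeze(0).expand(b, -1, -1)
            compressed, _ = self.cross_attn(q, patch_feats, patch_feats)
            return self.proj(compressed)  # (batch, num_queries, lm_dim)

    # Example: 1024 patch features -> 256 visual tokens for the LM.
    adapter = VisionLanguageAdapter()
    visual_tokens = adapter(torch.randn(2, 1024, 1664))
    print(visual_tokens.shape)  # torch.Size([2, 256, 4096])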

Models Used

  • Qwen-VL
  • Qwen-VL-Chat
  • Qwen-7B

Datasets

The following datasets were used in this research:

  • LAION-en
  • LAION-zh
  • LAION-COCO
  • DataComp
  • Coyo
  • GQA
  • VGQA
  • VQAv2
  • DVQA
  • OCR-VQA
  • DocVQA
  • TextVQA
  • ChartQA
  • AI2Diagram
  • GRIT
  • RefCOCO
  • RefCOCO+
  • RefCOCOg
  • Visual Genome
  • COCO

Evaluation Metrics

  • CIDEr
  • Accuracy (see the VQA accuracy sketch after this list)
  • Top-1 Accuracy
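
On the VQA-style benchmarks, "Accuracy" usually means the standard VQA accuracy, which in its commonly used simplified form credits a prediction with min(number of matching human answers / 3, 1), averaged over questions. A minimal sketch under that assumption; the names are illustrative, and real evaluation also normalizes answers (lowercasing, stripping articles and punctuation) before matching.

    # Simplified VQA accuracy (VQAv2-style): an answer scores
    # min(#matching human answers / 3, 1), averaged over all questions.
    from typing import List

    def vqa_accuracy(prediction: str, human_answers: List[str]) -> float:
        matches = sum(1 for a in human_answers
                      if a.strip().lower() == prediction.strip().lower())
        return min(matches / 3.0, 1.0)

    # Example: 10 annotator answers, 4 of which match the prediction -> 1.0
    answers = ["cat"] * 4 + ["kitten"] * 3 + ["animal"] * 3
    print(vqa_accuracy("cat", answers))  # 1.0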

Results

  • Qwen-VL achieves state-of-the-art performance on the Flickr30K zero-shot image captioning task (85.8 CIDEr score)
  • Qwen-VL outperforms prior LVLMs by large margins on multiple VQA benchmarks (79.5, 58.6, and 59.3 accuracy on VQAv2, OKVQA, and GQA respectively)
  • Qwen-VL-Chat outperforms existing vision-language chatbots on TouchStone, SEED-Bench, and MME benchmarks.
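
Several of the datasets above (RefCOCO, RefCOCO+, RefCOCOg, GRIT) evaluate visual grounding; per the paper, Qwen-VL expresses regions in text as <ref>...</ref><box>(x1,y1),(x2,y2)</box> with coordinates normalized to a 0-999 grid. A minimal parsing sketch assuming that output convention; the regex and function name are illustrative.

    # Parse Qwen-VL-style grounding output such as
    #   "<ref>the dog</ref><box>(100,200),(500,800)</box>"
    # into pixel-space boxes. Coordinates are normalized to [0, 1000) per the paper.
    import re
    from typing import List, Tuple

    BOX_RE = re.compile(r"<ref>(.*?)</ref><box>\((\d+),(\d+)\),\((\d+),(\d+)\)</box>")

    def parse_boxes(text: str, img_w: int, img_h: int) -> List[Tuple[str, Tuple[int, int, int, int]]]:
        results = []
        for phrase, x1, y1, x2, y2 in BOX_RE.findall(text):
            # Rescale from the normalized 0-999 grid to pixel coordinates.
            box = (int(int(x1) / 1000 * img_w), int(int(y1) / 1000 * img_h),
                   int(int(x2) / 1000 * img_w), int(int(y2) / 1000 * img_h))
            results.append((phrase, box))
        return results

    print(parse_boxes("<ref>the dog</ref><box>(100,200),(500,800)</box>", 640, 480))
    # [('the dog', (64, 96, 320, 384))]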

Technical Requirements

  • Number of GPUs: None specified
  • GPU Type: None specified

Keywords

vision-language models, multilingual multimodal understanding, fine-grained perception, grounding, OCR, dialogue

External Resources