
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, Jingren Zhou, Alibaba Group (2023)

Paper Information

  • arXiv ID: 2308.12966
  • Domain: computer vision, natural language processing
  • SOTA Claim: Yes
  • Code: Available
  • Reproducibility: 8/10

Abstract

In this work, we introduce the Qwen-VL series, a set of large-scale vision-language models (LVLMs) designed to perceive and understand both texts and images. Starting from the Qwen-LM as a foundation, we endow it with visual capacity by the meticulously designed (i) visual receptor, (ii) input-output interface, (iii) 3-stage training pipeline, and (iv) multilingual multimodal cleaned corpus. Beyond the conventional image description and question-answering, we implement the grounding and text-reading ability of Qwen-VLs by aligning image-caption-box tuples. The resulting models, including Qwen-VL and Qwen-VL-Chat, set new records for generalist models under similar model scales on a broad range of visual-centric benchmarks (e.g., image captioning, question answering, visual grounding) and different settings (e.g., zero-shot, few-shot). Moreover, on real-world dialog benchmarks, our instruction-tuned Qwen-VL-Chat also demonstrates superiority compared to existing vision-language chatbots. All models are public to facilitate future research.

Summary

The paper introduces the Qwen-VL series, large-scale vision-language models (LVLMs) that extend the Qwen-7B language model with visual perception. The models, including Qwen-VL and Qwen-VL-Chat, handle image captioning, visual question answering, text reading (OCR), and visual grounding, and achieve leading performance among generalist models of similar scale on a broad range of benchmarks. Training follows a three-stage pipeline: pre-training on large-scale image-text pairs, multi-task pre-training on high-quality annotated data (including image-caption-box tuples for grounding), and instruction fine-tuning for dialogue. The instruction-tuned Qwen-VL-Chat shows strong real-world dialogue ability, outperforming existing vision-language chatbots on instruction-following benchmarks. The paper emphasizes the models' versatility, multilingual support, and fine-grained visual understanding, validated through extensive evaluation on established benchmarks.
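
Since the checkpoints are public (see the "Code: Available" entry above), the models can be tried directly. The sketch below is a minimal inference example, assuming the released Hugging Face checkpoint Qwen/Qwen-VL-Chat and the custom helpers it loads via trust_remote_code (tokenizer.from_list_format, model.chat); the image path is a placeholder.

    # Minimal Qwen-VL-Chat inference sketch. Assumes the public HF checkpoint
    # "Qwen/Qwen-VL-Chat" and its remote-code helpers; the image path is a placeholder.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        "Qwen/Qwen-VL-Chat", device_map="auto", trust_remote_code=True
    ).eval()

    # Build an interleaved image + text query.
    query = tokenizer.from_list_format([
        {"image": "demo.jpeg"},  # placeholder image path or URL
        {"text": "Describe the image and read any text in it."},
    ])

    # Single-turn chat; `history` carries dialogue state across turns.
    response, history = model.chat(tokenizer, query=query, history=None)
    print(response)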

Methods

This paper employs the following methods:

  • Visual Receptor (see the sketch after this list)
  • 3-stage Training Pipeline
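
As a rough illustration of the visual receptor: per the paper, a ViT image encoder produces patch features, and a single-layer cross-attention adapter with a fixed set of learnable queries (256 in Qwen-VL) compresses them into a short sequence of visual tokens for the language model. The PyTorch sketch below is conceptual, not the authors' code; the dimensions (1664 for the ViT, 4096 for the LM) are illustrative and the adapter's position-aware encodings are omitted.

    # Conceptual sketch of Qwen-VL's visual receptor adapter: learnable queries
    # cross-attend over ViT patch features and compress them into a fixed-length
    # sequence of visual tokens projected into the LM embedding space.
    import torch
    import torch.nn as nn

    class VisionLanguageAdapter(nn.Module):
        def __init__(self, vit_dim: int = 1664, lm_dim: int = 4096,
                     num_queries: int = 256, num_heads: int = 16):
            super().__init__()
            # Learnable query embeddings that attend over the patch features.
            self.queries = nn.Parameter(torch.randn(num_queries, vit_dim) * 0.02)
            self.cross_attn = nn.MultiheadAttention(vit_dim, num_heads, batch_first=True)
            self.proj = nn.Linear(vit_dim, lm_dim)  # project into the LM embedding space

        def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
            # patch_feats: (batch, num_patches, vit_dim) from the ViT encoder.
            b = patch_feats.size(0)
            q = self.queries.unsqueeze(0).expand(b, -1, -1)
            compressed, _ = self.cross_attn(q, patch_feats, patch_feats)
            return self.proj(compressed)  # (batch, num_queries, lm_dim)

    # Example: 1024 patch features -> 256 visual tokens for the LM.
    adapter = VisionLanguageAdapter()
    visual_tokens = adapter(torch.randn(2, 1024, 1664))
    print(visual_tokens.shape)  # torch.Size([2, 256, 4096])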

Models Used

  • Qwen-VL
  • Qwen-VL-Chat
  • Qwen-7B

Datasets

The following datasets were used in this research:

  • LAION-en
  • LAION-zh
  • LAION-COCO
  • DataComp
  • Coyo
  • GQA
  • VGQA
  • VQAv2
  • DVQA
  • OCR-VQA
  • DocVQA
  • TextVQA
  • ChartQA
  • AI2Diagram
  • GRIT
  • RefCOCO
  • RefCOCO+
  • RefCOCOg
  • Visual Genome
  • COCO

Evaluation Metrics

  • CIDEr
  • Accuracy (see the VQA accuracy sketch after this list)
  • Top-1 Accuracy
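
On the VQA-style benchmarks, "Accuracy" usually means the standard VQA accuracy, which in its commonly used simplified form credits a prediction with min(number of matching human answers / 3, 1), averaged over questions. A minimal sketch under that assumption; the names are illustrative, and real evaluation also normalizes answers (lowercasing, stripping articles and punctuation) before matching.

    # Simplified VQA accuracy (VQAv2-style): an answer scores
    # min(#matching human answers / 3, 1), averaged over all questions.
    from typing import List

    def vqa_accuracy(prediction: str, human_answers: List[str]) -> float:
        matches = sum(1 for a in human_answers
                      if a.strip().lower() == prediction.strip().lower())
        return min(matches / 3.0, 1.0)

    # Example: 10 annotator answers, 4 of which match the prediction -> 1.0
    answers = ["cat"] * 4 + ["kitten"] * 3 + ["animal"] * 3
    print(vqa_accuracy("cat", answers))  # 1.0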

Results

  • Qwen-VL achieves state-of-the-art performance on the Flickr30K zero-shot image captioning task (85.8 CIDEr score)
  • Qwen-VL outperforms prior LVLMs by large margins on multiple VQA benchmarks (79.5, 58.6, and 59.3 accuracy on VQAv2, OKVQA, and GQA respectively)
  • Qwen-VL-Chat outperforms existing vision-language chatbots on TouchStone, SEED-Bench, and MME benchmarks.
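
Several of the datasets above (RefCOCO, RefCOCO+, RefCOCOg, GRIT) evaluate visual grounding; per the paper, Qwen-VL expresses regions in text as <ref>...</ref><box>(x1,y1),(x2,y2)</box> with coordinates normalized to a 0-999 grid. A minimal parsing sketch assuming that output convention; the regex and function name are illustrative.

    # Parse Qwen-VL-style grounding output such as
    #   "<ref>the dog</ref><box>(100,200),(500,800)</box>"
    # into pixel-space boxes. Coordinates are normalized to [0, 1000) per the paper.
    import re
    from typing import List, Tuple

    BOX_RE = re.compile(r"<ref>(.*?)</ref><box>\((\d+),(\d+)\),\((\d+),(\d+)\)</box>")

    def parse_boxes(text: str, img_w: int, img_h: int) -> List[Tuple[str, Tuple[int, int, int, int]]]:
        results = []
        for phrase, x1, y1, x2, y2 in BOX_RE.findall(text):
            # Rescale from the normalized 0-999 grid to pixel coordinates.
            box = (int(int(x1) / 1000 * img_w), int(int(y1) / 1000 * img_h),
                   int(int(x2) / 1000 * img_w), int(int(y2) / 1000 * img_h))
            results.append((phrase, box))
        return results

    print(parse_boxes("<ref>the dog</ref><box>(100,200),(500,800)</box>", 640, 480))
    # [('the dog', (64, 96, 320, 384))]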

Technical Requirements

  • Number of GPUs: None specified
  • GPU Type: None specified

Keywords

vision-language models, multilingual multimodal understanding, fine-grained perception, grounding, OCR, dialogue

External Resources