Domain
computer vision, natural language processing
In this work, we introduce the Qwen-VL series, a set of large-scale vision-language models (LVLMs) designed to perceive and understand both text and images. Starting from the Qwen-LM as a foundation, we endow it with visual capacity through a meticulously designed (i) visual receptor, (ii) input-output interface, (iii) 3-stage training pipeline, and (iv) multilingual multimodal cleaned corpus. Beyond conventional image description and question answering, we implement the grounding and text-reading abilities of the Qwen-VLs by aligning image-caption-box tuples. The resulting models, including Qwen-VL and Qwen-VL-Chat, set new records for generalist models at similar model scales on a broad range of visual-centric benchmarks (e.g., image captioning, question answering, visual grounding) and in different settings (e.g., zero-shot, few-shot). Moreover, on real-world dialogue benchmarks, our instruction-tuned Qwen-VL-Chat also demonstrates superiority over existing vision-language chatbots. All models are publicly released to facilitate future research.
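The grounding and text-reading alignment mentioned in the abstract relies on serializing region annotations into the text stream with special tokens. The Python sketch below illustrates one plausible serialization, assuming the `<img>`/`<ref>`/`<box>` token scheme and [0, 1000) coordinate normalization described for Qwen-VL; the helpers `normalize_box` and `format_grounded_caption` are hypothetical names introduced here for illustration, not part of the released code.

```python
# Minimal sketch (not the released implementation) of how an image-caption-box
# tuple might be serialized into a single training string for grounding:
# images wrapped in <img>...</img>, referred phrases in <ref>...</ref>, and
# boxes as <box>(x1,y1),(x2,y2)</box> on a normalized [0, 1000) grid.

def normalize_box(box, width, height, bins=1000):
    """Map pixel coordinates (x1, y1, x2, y2) onto the [0, bins) integer grid."""
    x1, y1, x2, y2 = box
    return (
        round(x1 / width * (bins - 1)),
        round(y1 / height * (bins - 1)),
        round(x2 / width * (bins - 1)),
        round(y2 / height * (bins - 1)),
    )

def format_grounded_caption(image_ref, phrase, box, width, height):
    """Serialize one image-caption-box tuple as a grounded text sequence."""
    x1, y1, x2, y2 = normalize_box(box, width, height)
    return (
        f"<img>{image_ref}</img>"
        f"<ref>{phrase}</ref>"
        f"<box>({x1},{y1}),({x2},{y2})</box>"
    )

# Example: the phrase "a brown dog" grounded to a region of a 640x480 image.
print(format_grounded_caption("dog.jpg", "a brown dog", (64, 120, 320, 410), 640, 480))
```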
The paper introduces the Qwen-VL series, advanced large-scale vision-language models (LVLMs) that enhance the capabilities of traditional large language models by incorporating visual processing. The models, including Qwen-VL and Qwen-VL-Chat, are designed to perform various tasks such as image captioning, visual question answering, text reading, and visual grounding. They achieve leading performance on multiple benchmarks, benefiting from a sophisticated training pipeline that processes a multilingual multimodal corpus. The Qwen-VL-Chat model demonstrates robust real-world dialogue capabilities, outperforming existing vision-language chatbots on instruction-following benchmarks. The systematic methodology includes a three-stage training process consisting of pre-training on large-scale image-text pairs, multi-task pre-training on high-quality annotation data, and instruction fine-tuning for enhanced interaction. The paper emphasizes the model's versatility, multilingual support, and fine-grained visual understanding, which are validated through extensive evaluations across established benchmarks.
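The three-stage pipeline summarized above can be sketched as plain data. The trained/frozen split per stage is paraphrased from the paper (visual receptor and adapter trained against a frozen LLM in stage 1, all parameters unfrozen in stage 2, visual receptor frozen during instruction tuning); the field names are illustrative and the snippet only prints a summary rather than running any training.

```python
# Illustrative summary of the 3-stage training pipeline as plain Python data.
# Details are paraphrased from the paper and simplified; this is a sketch,
# not a training script.

TRAINING_STAGES = [
    {
        "stage": 1,
        "name": "pre-training",
        "data": "large-scale weakly labeled image-text pairs (e.g., LAION, DataComp, Coyo)",
        "trained": ["visual receptor (ViT)", "vision-language adapter"],
        "frozen": ["LLM (Qwen-7B)"],
    },
    {
        "stage": 2,
        "name": "multi-task pre-training",
        "data": "high-quality annotated data (captioning, VQA, grounding, OCR)",
        "trained": ["visual receptor (ViT)", "vision-language adapter", "LLM"],
        "frozen": [],
    },
    {
        "stage": 3,
        "name": "supervised fine-tuning (instruction tuning)",
        "data": "multimodal instruction and dialogue data",
        "trained": ["vision-language adapter", "LLM"],
        "frozen": ["visual receptor (ViT)"],
    },
]

for s in TRAINING_STAGES:
    print(f"Stage {s['stage']} ({s['name']}): trains {', '.join(s['trained'])}")
```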
This paper employs the following methods:
- Visual Receptor
- 3-stage Training Pipeline
- Qwen-VL
- Qwen-VL-Chat
- Qwen-7B
The following datasets were used in this research:
- LAION-en
- LAION-zh
- LAION-COCO
- DataComp
- Coyo
- GQA
- VGQA
- VQAv2
- DVQA
- OCR-VQA
- DocVQA
- TextVQA
- ChartQA
- AI2Diagram
- GRIT
- RefCOCO
- RefCOCO+
- RefCOCOg
- Visual Genome
- COCO
The following metrics were used for evaluation:
- CIDEr (a minimal scoring sketch follows this list)
- Accuracy
- Top-1 Accuracy
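CIDEr is the captioning metric behind the Flickr30K result reported below; the other two are standard classification-style accuracies. The sketch below shows one way to compute CIDEr with the pycocoevalcap package; the image ids and captions are invented for illustration, real evaluation uses a benchmark's full reference set, and reported benchmark numbers are typically the raw score scaled by 100.

```python
# Minimal CIDEr scoring sketch using the pycocoevalcap package
# (pip install pycocoevalcap). With only two invented images the corpus-level
# IDF statistics are not meaningful, so this only demonstrates the API shape.

from pycocoevalcap.cider.cider import Cider

# Ground-truth references: image id -> list of reference captions.
gts = {
    "img_0": ["a brown dog runs across the grass", "a dog running on a lawn"],
    "img_1": ["two people ride bicycles down a street", "cyclists on a city road"],
}

# Model outputs: image id -> single-element list with the generated caption.
res = {
    "img_0": ["a dog is running on the grass"],
    "img_1": ["two cyclists ride down a street"],
}

scorer = Cider()
corpus_score, per_image_scores = scorer.compute_score(gts, res)
print(f"CIDEr: {corpus_score:.3f}")
```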
The paper reports the following results:
- Qwen-VL achieves state-of-the-art performance on the Flickr30K zero-shot image captioning task (85.8 CIDEr score)
- Qwen-VL outperforms prior LVLMs by large margins on multiple VQA benchmarks (79.5, 58.6, and 59.3 accuracy on VQAv2, OKVQA, and GQA respectively)
- Qwen-VL-Chat outperforms existing vision-language chatbots on TouchStone, SEED-Bench, and MME benchmarks.
Compute resources:
- Number of GPUs: None specified
- GPU Type: None specified
Keywords
vision-language models, multilingual, multimodal understanding, fine-grained perception, grounding, OCR, dialogue