
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

Wenliang Dai (Salesforce Research; Hong Kong University of Science and Technology), Junnan Li (Salesforce Research) [email protected], Dongxu Li (Salesforce Research), Anthony Meng Huat Tiong (Salesforce Research; Nanyang Technological University, Singapore), Junqi Zhao (Salesforce Research; Nanyang Technological University, Singapore), Weisheng Wang (Nanyang Technological University, Singapore), Boyang Li (Nanyang Technological University, Singapore), Pascale Fung (Hong Kong University of Science and Technology), Steven Hoi (Salesforce Research) [email protected] (2023)

Paper Information
arXiv ID
2305.06500
Venue
Neural Information Processing Systems
Domain
computer vision, natural language processing
SOTA Claim
Yes
Code
Available
Reproducibility
8/10

Abstract

Large-scale pre-training and instruction tuning have been successful at creating general-purpose language models with broad competence. However, building general-purpose vision-language models is challenging due to the rich input distributions and task diversity resulting from the additional visual input. Although vision-language pretraining has been widely studied, vision-language instruction tuning remains under-explored. In this paper, we conduct a systematic and comprehensive study on vision-language instruction tuning based on the pretrained BLIP-2 models. We gather 26 publicly available datasets, covering a wide variety of tasks and capabilities, and transform them into instruction tuning format. Additionally, we introduce an instruction-aware Query Transformer, which extracts informative features tailored to the given instruction. Trained on 13 held-in datasets, InstructBLIP attains state-of-the-art zero-shot performance across all 13 held-out datasets, substantially outperforming BLIP-2 and larger Flamingo models. Our models also lead to state-of-the-art performance when finetuned on individual downstream tasks (e.g., 90.7% accuracy on ScienceQA questions with image contexts). Furthermore, we qualitatively demonstrate the advantages of InstructBLIP over concurrent multimodal models. All InstructBLIP models are open-sourced.

Summary

This paper introduces InstructBLIP, a framework for vision-language models that employs instruction tuning to build general-purpose capabilities. It outlines the challenges of constructing effective vision-language models, in particular the complexity introduced by diverse tasks and rich input distributions. The paper presents a systematic study of vision-language instruction tuning, combining a novel instruction-aware Query Transformer architecture with 26 publicly available datasets that are systematically reformatted for instruction tuning. InstructBLIP achieves state-of-the-art zero-shot performance across 13 held-out datasets, surpassing previous models such as BLIP-2 and Flamingo. Qualitative assessments illustrate InstructBLIP's strength in visual reasoning and its ability to handle diverse visual and textual instructions. The work demonstrates robust generalization to unseen tasks and discusses in detail how the proposed methodological changes improve instruction-tuning performance. It also validates InstructBLIP as a strong starting point for downstream finetuning, with improvements observed across a range of tasks and datasets.
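The instruction-aware Query Transformer is the paper's main architectural change to BLIP-2: instruction tokens are fed into the Q-Former together with the learnable query embeddings, so the extracted visual features depend on the given instruction. The PyTorch-style sketch below is a deliberately simplified stand-in for the BERT-based Q-Former; module names, dimensions, and the overall interface are hypothetical, not the authors' released code.

```python
import torch
import torch.nn as nn


class InstructionAwareQFormer(nn.Module):
    """Minimal sketch of instruction-aware query feature extraction.

    Key idea: learnable queries and instruction tokens are concatenated so
    that self-attention lets the instruction steer which visual information
    the queries extract via cross-attention to frozen image features.
    """

    def __init__(self, num_queries=32, hidden=768, llm_dim=4096, layers=2):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, hidden) * 0.02)
        # Stand-in block: self-attention over [queries; instruction tokens]
        # plus cross-attention to image features.
        block = nn.TransformerDecoderLayer(d_model=hidden, nhead=12, batch_first=True)
        self.blocks = nn.TransformerDecoder(block, num_layers=layers)
        self.to_llm = nn.Linear(hidden, llm_dim)  # project into the frozen LLM's space

    def forward(self, image_feats, instruction_embeds):
        # image_feats: (B, N_patches, hidden) from a frozen image encoder
        # instruction_embeds: (B, L_instr, hidden) embedded instruction tokens
        b = image_feats.size(0)
        queries = self.queries.expand(b, -1, -1)
        x = torch.cat([queries, instruction_embeds], dim=1)
        x = self.blocks(tgt=x, memory=image_feats)
        # Only the query positions are projected and passed to the frozen LLM
        # as soft visual prompts; the instruction text reaches the LLM separately.
        return self.to_llm(x[:, : queries.shape[1]])


if __name__ == "__main__":
    model = InstructionAwareQFormer()
    img = torch.randn(2, 257, 768)   # e.g., ViT patch features
    instr = torch.randn(2, 16, 768)  # embedded instruction tokens
    print(model(img, instr).shape)   # torch.Size([2, 32, 4096])
```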

Methods

This paper employs the following methods:

  • Instruction Tuning of a pretrained BLIP-2 model (a format-conversion sketch follows this list)
  • Instruction-aware Query Transformer (Q-Former)
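The paper converts 26 public datasets into instruction-tuning format using manually written instruction templates for each task. The snippet below sketches what such a conversion could look like; the template strings and field names are illustrative, not the authors' exact templates.

```python
import random

# Illustrative instruction templates (not the authors' exact wording).
VQA_TEMPLATES = [
    "{question} A short answer to the question is",
    "Given the image, answer the following question with no more than three words. {question}",
]

CAPTION_TEMPLATES = [
    "A short image caption:",
    "Briefly describe the content of the image.",
]


def to_instruction_format(sample, task):
    """Convert one raw dataset sample into an (image, instruction, target) triple."""
    if task == "vqa":
        instruction = random.choice(VQA_TEMPLATES).format(question=sample["question"])
        target = sample["answer"]
    elif task == "caption":
        instruction = random.choice(CAPTION_TEMPLATES)
        target = sample["caption"]
    else:
        raise ValueError(f"unknown task: {task}")
    return {"image": sample["image"], "text_input": instruction, "text_output": target}


# Example usage with a made-up VQA-style sample.
example = {"image": "coco/000123.jpg", "question": "What is the cat doing?", "answer": "sleeping"}
print(to_instruction_format(example, "vqa"))
```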

Models Used

  • BLIP-2
  • Flamingo
  • FlanT5
  • Vicuna

Datasets

The following datasets were used in this research:

  • COCO Caption
  • Web CapFilt
  • NoCaps
  • Flickr30K
  • TextCaps
  • VQAv2 (image question answering)
  • OCR-VQA
  • VQG
  • OKVQA
  • A-OKVQA
  • ScienceQA
  • MSRVTT-QA
  • GQA
  • Visual Dialog

Evaluation Metrics

  • Accuracy
  • Mean Reciprocal Rank (MRR), reported for Visual Dialog (see the sketch after this list)
  • Top-1 Accuracy
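Accuracy and top-1 accuracy score a single predicted answer against the reference, while Mean Reciprocal Rank scores a ranked candidate list, as used for Visual Dialog. A minimal sketch of both computations follows; the input formats are assumptions for illustration, not the paper's evaluation code.

```python
def top1_accuracy(predictions, references):
    """Fraction of examples where the predicted answer matches the reference."""
    correct = sum(p.strip().lower() == r.strip().lower() for p, r in zip(predictions, references))
    return correct / len(references)


def mean_reciprocal_rank(ranked_candidate_lists, gold_answers):
    """MRR over ranked candidate lists: average of 1 / rank of the gold answer."""
    total = 0.0
    for candidates, gold in zip(ranked_candidate_lists, gold_answers):
        rank = candidates.index(gold) + 1  # 1-based rank of the correct candidate
        total += 1.0 / rank
    return total / len(gold_answers)


# Toy examples.
print(top1_accuracy(["sleeping", "red"], ["sleeping", "blue"]))         # 0.5
print(mean_reciprocal_rank([["a", "b", "c"], ["x", "y"]], ["b", "x"]))  # (1/2 + 1/1) / 2 = 0.75
```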

Results

  • State-of-the-art zero-shot performance across 13 held-out datasets
  • Achieved 90.7% accuracy on ScienceQA questions with image contexts

Limitations

The authors identified the following limitations:

  • Inherits shortcomings from original large language models such as hallucinating ungrounded text and generating biased outputs.

Technical Requirements

  • Number of GPUs: 16
  • GPU Type: Nvidia A100 40G

Keywords

vision-language models, instruction tuning, multimodal models, zero-shot learning, general-purpose AI
