
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

Wenliang Dai (Salesforce Research; Hong Kong University of Science and Technology), Junnan Li (Salesforce Research) [email protected], Dongxu Li (Salesforce Research), Anthony Meng Huat Tiong (Salesforce Research; Nanyang Technological University, Singapore), Junqi Zhao (Salesforce Research; Nanyang Technological University, Singapore), Weisheng Wang (Nanyang Technological University, Singapore), Boyang Li (Nanyang Technological University, Singapore), Pascale Fung (Hong Kong University of Science and Technology), Steven Hoi (Salesforce Research) [email protected] (2023)

Paper Information
arXiv ID
2305.06500
Venue
Neural Information Processing Systems
Domain
computer vision, natural language processing
SOTA Claim
Yes
Code
Available
Reproducibility
8/10

Abstract

Large-scale pre-training and instruction tuning have been successful at creating general-purpose language models with broad competence. However, building general-purpose vision-language models is challenging due to the rich input distributions and task diversity resulting from the additional visual input. Although vision-language pretraining has been widely studied, vision-language instruction tuning remains under-explored. In this paper, we conduct a systematic and comprehensive study on vision-language instruction tuning based on the pretrained BLIP-2 models. We gather 26 publicly available datasets, covering a wide variety of tasks and capabilities, and transform them into instruction tuning format. Additionally, we introduce an instruction-aware Query Transformer, which extracts informative features tailored to the given instruction. Trained on 13 held-in datasets, InstructBLIP attains state-of-the-art zero-shot performance across all 13 held-out datasets, substantially outperforming BLIP-2 and larger Flamingo models. Our models also lead to state-of-the-art performance when finetuned on individual downstream tasks (e.g., 90.7% accuracy on ScienceQA questions with image contexts). Furthermore, we qualitatively demonstrate the advantages of InstructBLIP over concurrent multimodal models. All InstructBLIP models are open-sourced.

Summary

This paper introduces InstructBLIP, a framework for vision-language models that employs instruction tuning to build general-purpose capabilities. It outlines the challenges of constructing effective vision-language models, in particular the complexity introduced by diverse tasks and rich input distributions. The paper presents a systematic study of vision-language instruction tuning, combining a novel instruction-aware Query Transformer architecture with 26 publicly available datasets that are systematically reformatted for instruction tuning. InstructBLIP achieves state-of-the-art zero-shot performance across 13 held-out datasets, surpassing previous models such as BLIP-2 and Flamingo. Qualitative assessments illustrate InstructBLIP's strength in visual reasoning and its ability to handle diverse visual and textual instructions. The work demonstrates robust generalization to unseen tasks and discusses in detail how the proposed methodological changes improve instruction-tuning performance. It also validates InstructBLIP as a strong starting point for downstream finetuning, with improvements observed across a range of tasks and datasets.
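The instruction-aware Query Transformer is the paper's main architectural change to BLIP-2: instruction tokens are fed into the Q-Former together with the learnable query embeddings, so the extracted visual features depend on the given instruction. The PyTorch-style sketch below is a deliberately simplified stand-in for the BERT-based Q-Former; module names, dimensions, and the overall interface are hypothetical, not the authors' released code.

```python
import torch
import torch.nn as nn


class InstructionAwareQFormer(nn.Module):
    """Minimal sketch of instruction-aware query feature extraction.

    Key idea: learnable queries and instruction tokens are concatenated so
    that self-attention lets the instruction steer which visual information
    the queries extract via cross-attention to frozen image features.
    """

    def __init__(self, num_queries=32, hidden=768, llm_dim=4096, layers=2):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, hidden) * 0.02)
        # Stand-in block: self-attention over [queries; instruction tokens]
        # plus cross-attention to image features.
        block = nn.TransformerDecoderLayer(d_model=hidden, nhead=12, batch_first=True)
        self.blocks = nn.TransformerDecoder(block, num_layers=layers)
        self.to_llm = nn.Linear(hidden, llm_dim)  # project into the frozen LLM's space

    def forward(self, image_feats, instruction_embeds):
        # image_feats: (B, N_patches, hidden) from a frozen image encoder
        # instruction_embeds: (B, L_instr, hidden) embedded instruction tokens
        b = image_feats.size(0)
        queries = self.queries.expand(b, -1, -1)
        x = torch.cat([queries, instruction_embeds], dim=1)
        x = self.blocks(tgt=x, memory=image_feats)
        # Only the query positions are projected and passed to the frozen LLM
        # as soft visual prompts; the instruction text reaches the LLM separately.
        return self.to_llm(x[:, : queries.shape[1]])


if __name__ == "__main__":
    model = InstructionAwareQFormer()
    img = torch.randn(2, 257, 768)   # e.g., ViT patch features
    instr = torch.randn(2, 16, 768)  # embedded instruction tokens
    print(model(img, instr).shape)   # torch.Size([2, 32, 4096])
```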

Methods

This paper employs the following methods:

  • Instruction Tuning of a pretrained BLIP-2 model (a format-conversion sketch follows this list)
  • Instruction-aware Query Transformer (Q-Former)
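The paper converts 26 public datasets into instruction-tuning format using manually written instruction templates for each task. The snippet below sketches what such a conversion could look like; the template strings and field names are illustrative, not the authors' exact templates.

```python
import random

# Illustrative instruction templates (not the authors' exact wording).
VQA_TEMPLATES = [
    "{question} A short answer to the question is",
    "Given the image, answer the following question with no more than three words. {question}",
]

CAPTION_TEMPLATES = [
    "A short image caption:",
    "Briefly describe the content of the image.",
]


def to_instruction_format(sample, task):
    """Convert one raw dataset sample into an (image, instruction, target) triple."""
    if task == "vqa":
        instruction = random.choice(VQA_TEMPLATES).format(question=sample["question"])
        target = sample["answer"]
    elif task == "caption":
        instruction = random.choice(CAPTION_TEMPLATES)
        target = sample["caption"]
    else:
        raise ValueError(f"unknown task: {task}")
    return {"image": sample["image"], "text_input": instruction, "text_output": target}


# Example usage with a made-up VQA-style sample.
example = {"image": "coco/000123.jpg", "question": "What is the cat doing?", "answer": "sleeping"}
print(to_instruction_format(example, "vqa"))
```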

Models Used

  • BLIP-2
  • Flamingo
  • FlanT5
  • Vicuna

Datasets

The following datasets were used in this research:

  • COCO Caption
  • Web CapFilt
  • NoCaps
  • Flickr30K
  • TextCaps
  • VQAv2 (image question answering)
  • OCR-VQA
  • VQG
  • OKVQA
  • A-OKVQA
  • ScienceQA
  • MSRVTT-QA
  • GQA
  • Visual Dialog

Evaluation Metrics

  • Accuracy
  • Mean Reciprocal Rank (MRR), reported for Visual Dialog (see the sketch after this list)
  • Top-1 Accuracy
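Accuracy and top-1 accuracy score a single predicted answer against the reference, while Mean Reciprocal Rank scores a ranked candidate list, as used for Visual Dialog. A minimal sketch of both computations follows; the input formats are assumptions for illustration, not the paper's evaluation code.

```python
def top1_accuracy(predictions, references):
    """Fraction of examples where the predicted answer matches the reference."""
    correct = sum(p.strip().lower() == r.strip().lower() for p, r in zip(predictions, references))
    return correct / len(references)


def mean_reciprocal_rank(ranked_candidate_lists, gold_answers):
    """MRR over ranked candidate lists: average of 1 / rank of the gold answer."""
    total = 0.0
    for candidates, gold in zip(ranked_candidate_lists, gold_answers):
        rank = candidates.index(gold) + 1  # 1-based rank of the correct candidate
        total += 1.0 / rank
    return total / len(gold_answers)


# Toy examples.
print(top1_accuracy(["sleeping", "red"], ["sleeping", "blue"]))         # 0.5
print(mean_reciprocal_rank([["a", "b", "c"], ["x", "y"]], ["b", "x"]))  # (1/2 + 1/1) / 2 = 0.75
```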

Results

  • State-of-the-art zero-shot performance across 13 held-out datasets
  • Achieved 90.7% accuracy on ScienceQA questions with image contexts

Limitations

The authors identified the following limitations:

  • Inherits shortcomings from original large language models such as hallucinating ungrounded text and generating biased outputs.

Technical Requirements

  • Number of GPUs: 16
  • GPU Type: Nvidia A100 40G

Keywords

vision-language models, instruction tuning, multimodal models, zero-shot learning, general-purpose AI
