
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi, Salesforce Research (2023)

Paper Information

  • arXiv ID: 2301.12597
  • Venue: International Conference on Machine Learning (ICML)
  • Domain: Not specified
  • SOTA Claim: Yes
  • Reproducibility: 8/10

Abstract

The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models. This paper proposes BLIP-2, a generic and efficient pre-training strategy that bootstraps vision-language pre-training from off-the-shelf frozen pre-trained image encoders and frozen large language models. BLIP-2 bridges the modality gap with a lightweight Querying Transformer, which is pre-trained in two stages. The first stage bootstraps vision-language representation learning from a frozen image encoder. The second stage bootstraps vision-to-language generative learning from a frozen language model. BLIP-2 achieves state-of-the-art performance on various vision-language tasks, despite having significantly fewer trainable parameters than existing methods. For example, our model outperforms Flamingo80B by 8.7% on zero-shot VQAv2 with 54x fewer trainable parameters. We also demonstrate the model's emerging capabilities of zero-shot image-to-text generation that can follow natural language instructions.

Summary

The paper proposes BLIP-2, an efficient pre-training approach for vision-language models that builds on frozen pre-trained image encoders and frozen large language models (LLMs). It introduces a Querying Transformer (Q-Former) that bridges the frozen image encoder and the frozen LLM, and is pre-trained in two stages: vision-language representation learning first, then vision-to-language generative learning. BLIP-2 achieves state-of-the-art performance on a range of vision-language tasks while being markedly more computationally efficient, outperforming models such as Flamingo80B with far fewer trainable parameters. The study also highlights BLIP-2's zero-shot, instruction-following image-to-text generation and the broader effectiveness of leveraging pre-trained unimodal models for vision-language understanding.
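As a rough illustration of that bridge, the sketch below shows the second-stage (generative) data flow in PyTorch. It is not the authors' implementation: `FrozenViT`-style encoder, Q-Former, and LLM are passed in as hypothetical modules and the dimensions are placeholders; the point is that only the Q-Former and a small projection layer receive gradients.

```python
# Minimal sketch of the BLIP-2 second-stage (generative) data flow.
# Module names and dimensions are illustrative, not the authors' code.
import torch
import torch.nn as nn

class Blip2Sketch(nn.Module):
    def __init__(self, image_encoder, qformer, llm, q_dim=768, llm_dim=2560):
        super().__init__()
        self.image_encoder = image_encoder        # frozen image encoder (e.g. ViT-g)
        self.qformer = qformer                    # trainable Q-Former
        self.llm = llm                            # frozen LLM (e.g. OPT or FlanT5)
        self.proj = nn.Linear(q_dim, llm_dim)     # trainable projection into the LLM's embedding space

        # Freeze the unimodal models; gradients flow only through Q-Former + proj.
        for p in self.image_encoder.parameters():
            p.requires_grad = False
        for p in self.llm.parameters():
            p.requires_grad = False

    def forward(self, images, text_embeds, labels):
        with torch.no_grad():
            patch_feats = self.image_encoder(images)   # (B, N_patches, D_img), frozen
        query_out = self.qformer(patch_feats)          # (B, 32, q_dim) query outputs
        soft_prompt = self.proj(query_out)             # (B, 32, llm_dim)
        # Prepend the projected queries to the text embeddings as a soft visual prompt,
        # then train against the frozen LLM's language-modeling loss.
        llm_inputs = torch.cat([soft_prompt, text_embeds], dim=1)
        return self.llm(inputs_embeds=llm_inputs, labels=labels)
```

Freezing both unimodal models is what keeps the trainable parameter count far below that of end-to-end models such as Flamingo.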

Methods

This paper employs the following methods:

  • Querying Transformer (Q-Former)
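The Q-Former owns a small set of learnable query embeddings (32 in the paper) that extract visual features through cross-attention to the frozen image encoder's output; in the first stage it is additionally trained with text on contrastive, matching, and image-grounded generation objectives. Below is a minimal, single-layer sketch of the query/image interaction only; the real Q-Former is a BERT-sized transformer with self-attention and text inputs, and the dimensions here are illustrative.

```python
import torch
import torch.nn as nn

class QFormerSketch(nn.Module):
    """Illustrative single-layer version of the Q-Former's query/image cross-attention."""
    def __init__(self, num_queries=32, dim=768, img_dim=1408, num_heads=12):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))   # learnable query tokens
        self.img_proj = nn.Linear(img_dim, dim)                      # map image features to query width
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, image_feats):                       # image_feats: (B, N_patches, img_dim)
        B = image_feats.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)   # (B, 32, dim)
        kv = self.img_proj(image_feats)                   # (B, N_patches, dim)
        attn_out, _ = self.cross_attn(q, kv, kv)          # queries attend to image patches
        h = q + attn_out                                  # residual over cross-attention
        return h + self.ffn(h)                            # (B, 32, dim) visual query outputs
```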

Models Used

  • BLIP-2
  • OPT
  • FlanT5
  • ViT-L
  • ViT-g
  • Flamingo80B
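For reference, the released BLIP-2 checkpoints (part of the public code release rather than the paper itself) can be run through the LAVIS library or the Hugging Face `transformers` integration. A typical zero-shot VQA-style call looks roughly like this; the checkpoint name, image path, and prompt are example values.

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Example checkpoint: BLIP-2 with a ViT-g encoder and a frozen OPT-2.7B decoder.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
).to("cuda")

image = Image.open("example.jpg").convert("RGB")            # any RGB image
prompt = "Question: what is shown in the image? Answer:"    # instruction-style prompt

inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda", torch.float16)
generated = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(generated[0], skip_special_tokens=True))
```

The OPT- and FlanT5-based variants listed above differ only in which frozen LLM sits behind the Q-Former.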

Datasets

The following datasets were used in this research:

  • COCO
  • Visual Genome
  • CC3M
  • CC12M
  • SBU
  • LAION400M
  • VQAv2
  • GQA
  • OK-VQA
  • NoCaps
  • Flickr30K

Evaluation Metrics

  • Accuracy
  • BLEU
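Assuming the "Accuracy" entry refers to the standard VQAv2 protocol (under which BLIP-2's zero-shot VQA results are reported), a predicted answer is scored softly against the ten human annotations:

```latex
% Standard VQAv2 soft accuracy for a predicted answer a
\mathrm{Acc}(a) = \min\!\left(\frac{\#\{\text{annotators who gave answer } a\}}{3},\; 1\right)
```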

Results

  • BLIP-2 outperforms Flamingo80B by 8.7% on zero-shot VQAv2 with 54x fewer trainable parameters
  • Achieves state-of-the-art performance on various vision-language tasks

Limitations

The authors identified the following limitations:

  • Not specified

Technical Requirements

  • Number of GPUs: 16
  • GPU Type: NVIDIA A100 40GB
