
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi, Salesforce Research (2023)

Paper Information

  • arXiv ID: 2301.12597
  • Venue: International Conference on Machine Learning (ICML)
  • Domain: Not specified
  • SOTA Claim: Yes
  • Reproducibility: 8/10

Abstract

The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models. This paper proposes BLIP-2, a generic and efficient pre-training strategy that bootstraps vision-language pre-training from off-the-shelf frozen pre-trained image encoders and frozen large language models. BLIP-2 bridges the modality gap with a lightweight Querying Transformer, which is pre-trained in two stages. The first stage bootstraps vision-language representation learning from a frozen image encoder. The second stage bootstraps vision-to-language generative learning from a frozen language model. BLIP-2 achieves state-of-the-art performance on various vision-language tasks, despite having significantly fewer trainable parameters than existing methods. For example, our model outperforms Flamingo80B by 8.7% on zero-shot VQAv2 with 54x fewer trainable parameters. We also demonstrate the model's emerging capabilities of zero-shot image-to-text generation that can follow natural language instructions.

Summary

The paper proposes BLIP-2, an efficient pre-training approach for vision-language models that builds on frozen pre-trained image encoders and frozen large language models (LLMs). It introduces a Querying Transformer (Q-Former) that bridges the frozen image encoder and the frozen LLM, and is pre-trained in two stages: vision-language representation learning first, then vision-to-language generative learning. BLIP-2 achieves state-of-the-art performance on a range of vision-language tasks while being markedly more computationally efficient, outperforming models such as Flamingo80B with far fewer trainable parameters. The study also highlights BLIP-2's zero-shot, instruction-following image-to-text generation and the broader effectiveness of leveraging pre-trained unimodal models for vision-language understanding.
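As a rough illustration of that bridge, the sketch below shows the second-stage (generative) data flow in PyTorch. It is not the authors' implementation: `FrozenViT`-style encoder, Q-Former, and LLM are passed in as hypothetical modules and the dimensions are placeholders; the point is that only the Q-Former and a small projection layer receive gradients.

```python
# Minimal sketch of the BLIP-2 second-stage (generative) data flow.
# Module names and dimensions are illustrative, not the authors' code.
import torch
import torch.nn as nn

class Blip2Sketch(nn.Module):
    def __init__(self, image_encoder, qformer, llm, q_dim=768, llm_dim=2560):
        super().__init__()
        self.image_encoder = image_encoder        # frozen image encoder (e.g. ViT-g)
        self.qformer = qformer                    # trainable Q-Former
        self.llm = llm                            # frozen LLM (e.g. OPT or FlanT5)
        self.proj = nn.Linear(q_dim, llm_dim)     # trainable projection into the LLM's embedding space

        # Freeze the unimodal models; gradients flow only through Q-Former + proj.
        for p in self.image_encoder.parameters():
            p.requires_grad = False
        for p in self.llm.parameters():
            p.requires_grad = False

    def forward(self, images, text_embeds, labels):
        with torch.no_grad():
            patch_feats = self.image_encoder(images)   # (B, N_patches, D_img), frozen
        query_out = self.qformer(patch_feats)          # (B, 32, q_dim) query outputs
        soft_prompt = self.proj(query_out)             # (B, 32, llm_dim)
        # Prepend the projected queries to the text embeddings as a soft visual prompt,
        # then train against the frozen LLM's language-modeling loss.
        llm_inputs = torch.cat([soft_prompt, text_embeds], dim=1)
        return self.llm(inputs_embeds=llm_inputs, labels=labels)
```

Freezing both unimodal models is what keeps the trainable parameter count far below that of end-to-end models such as Flamingo.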

Methods

This paper employs the following methods:

  • Querying Transformer (Q-Former)
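The Q-Former owns a small set of learnable query embeddings (32 in the paper) that extract visual features through cross-attention to the frozen image encoder's output; in the first stage it is additionally trained with text on contrastive, matching, and image-grounded generation objectives. Below is a minimal, single-layer sketch of the query/image interaction only; the real Q-Former is a BERT-sized transformer with self-attention and text inputs, and the dimensions here are illustrative.

```python
import torch
import torch.nn as nn

class QFormerSketch(nn.Module):
    """Illustrative single-layer version of the Q-Former's query/image cross-attention."""
    def __init__(self, num_queries=32, dim=768, img_dim=1408, num_heads=12):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))   # learnable query tokens
        self.img_proj = nn.Linear(img_dim, dim)                      # map image features to query width
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, image_feats):                       # image_feats: (B, N_patches, img_dim)
        B = image_feats.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)   # (B, 32, dim)
        kv = self.img_proj(image_feats)                   # (B, N_patches, dim)
        attn_out, _ = self.cross_attn(q, kv, kv)          # queries attend to image patches
        h = q + attn_out                                  # residual over cross-attention
        return h + self.ffn(h)                            # (B, 32, dim) visual query outputs
```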

Models Used

  • BLIP-2
  • OPT
  • FlanT5
  • ViT-L
  • ViT-g
  • Flamingo80B
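For reference, the released BLIP-2 checkpoints (part of the public code release rather than the paper itself) can be run through the LAVIS library or the Hugging Face `transformers` integration. A typical zero-shot VQA-style call looks roughly like this; the checkpoint name, image path, and prompt are example values.

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Example checkpoint: BLIP-2 with a ViT-g encoder and a frozen OPT-2.7B decoder.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
).to("cuda")

image = Image.open("example.jpg").convert("RGB")            # any RGB image
prompt = "Question: what is shown in the image? Answer:"    # instruction-style prompt

inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda", torch.float16)
generated = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(generated[0], skip_special_tokens=True))
```

The OPT- and FlanT5-based variants listed above differ only in which frozen LLM sits behind the Q-Former.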

Datasets

The following datasets were used in this research:

  • COCO
  • Visual Genome
  • CC3M
  • CC12M
  • SBU
  • LAION400M
  • VQAv2
  • GQA
  • OK-VQA
  • NoCaps
  • Flickr30K

Evaluation Metrics

  • Accuracy
  • BLEU
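Assuming the "Accuracy" entry refers to the standard VQAv2 protocol (under which BLIP-2's zero-shot VQA results are reported), a predicted answer is scored softly against the ten human annotations:

```latex
% Standard VQAv2 soft accuracy for a predicted answer a
\mathrm{Acc}(a) = \min\!\left(\frac{\#\{\text{annotators who gave answer } a\}}{3},\; 1\right)
```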

Results

  • BLIP-2 outperforms Flamingo80B by 8.7% on zero-shot VQAv2 with 54x fewer trainable parameters
  • Achieves state-of-the-art performance on various vision-language tasks

Limitations

The authors identified the following limitations:

  • Not specified

Technical Requirements

  • Number of GPUs: 16
  • GPU Type: NVIDIA A100 40GB
