Venue
International Conference on Machine Learning
The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models. This paper proposes BLIP-2, a generic and efficient pre-training strategy that bootstraps vision-language pre-training from off-the-shelf frozen pre-trained image encoders and frozen large language models. BLIP-2 bridges the modality gap with a lightweight Querying Transformer, which is pre-trained in two stages. The first stage bootstraps vision-language representation learning from a frozen image encoder. The second stage bootstraps vision-to-language generative learning from a frozen language model. BLIP-2 achieves state-of-the-art performance on various vision-language tasks, despite having significantly fewer trainable parameters than existing methods. For example, our model outperforms Flamingo80B by 8.7% on zero-shot VQAv2 with 54x fewer trainable parameters. We also demonstrate the model's emerging capabilities of zero-shot image-to-text generation that can follow natural language instructions.
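To make the zero-shot, instruction-following image-to-text generation concrete, the snippet below is a minimal inference sketch assuming the HuggingFace `transformers` port of BLIP-2 rather than the paper's own release; the checkpoint id `Salesforce/blip2-opt-2.7b`, the sample image URL, and the prompt are illustrative assumptions.

```python
# Minimal zero-shot image-to-text sketch, assuming the HuggingFace
# `transformers` port of BLIP-2 (not the paper's own codebase).
# The checkpoint id, image URL, and prompt are illustrative.
import requests
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b").to(device)

# Any RGB image works; a COCO validation image is used here as a placeholder.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# A natural-language instruction phrased as a question (zero-shot VQA style).
prompt = "Question: how many cats are in the picture? Answer:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(device)

generated_ids = model.generate(**inputs, max_new_tokens=20)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip())
```

The 2.7B-parameter checkpoint is large; loading with `torch_dtype=torch.float16` on a GPU or with `device_map="auto"` is a common way to reduce memory, though defaults are kept here for simplicity.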
The paper proposes BLIP-2, an efficient pre-training approach for vision-language models that builds on frozen pre-trained image encoders and frozen large language models (LLMs). It introduces a Querying Transformer (Q-Former) that bridges the frozen image encoder and the frozen LLM, trained in two stages: vision-language representation learning followed by vision-to-language generative learning. BLIP-2 achieves strong results on a range of vision-language tasks while remaining computationally efficient, outperforming models such as Flamingo80B with significantly fewer trainable parameters. The study highlights BLIP-2's zero-shot, instruction-following image-to-text generation and the effectiveness of leveraging frozen unimodal models for vision-language understanding.
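The sketch below is a simplified, unofficial PyTorch stand-in for the Q-Former bridge described above: a fixed set of learned query embeddings self-attend, cross-attend to the frozen image encoder's output, and are projected into the frozen LLM's input embedding space. The 32 queries and 768-dimensional hidden size follow the paper; the image-feature width (1408, ViT-g), LLM width (2560, OPT-2.7B), two-layer depth, and the omission of layer norms, BERT initialization, and the Q-Former's text branch are simplifying assumptions.

```python
# Simplified, unofficial stand-in for the Q-Former bridge (PyTorch).
# Learned queries cross-attend to frozen image features and are projected
# into the frozen LLM's embedding space; dimensions below are assumptions
# matching ViT-g (1408) and OPT-2.7B (2560), with 32 queries of width 768.
import torch
import torch.nn as nn

class QFormerBridge(nn.Module):
    def __init__(self, num_queries=32, q_dim=768, img_dim=1408, llm_dim=2560, num_layers=2):
        super().__init__()
        # Learned query embeddings, shared across all images.
        self.queries = nn.Parameter(torch.randn(1, num_queries, q_dim) * 0.02)
        self.layers = nn.ModuleList([
            nn.ModuleDict({
                "self_attn": nn.MultiheadAttention(q_dim, 12, batch_first=True),
                "cross_attn": nn.MultiheadAttention(q_dim, 12, kdim=img_dim, vdim=img_dim, batch_first=True),
                "ffn": nn.Sequential(nn.Linear(q_dim, 4 * q_dim), nn.GELU(), nn.Linear(4 * q_dim, q_dim)),
            })
            for _ in range(num_layers)
        ])
        # Fully-connected projection into the frozen LLM's input embedding space.
        self.proj = nn.Linear(q_dim, llm_dim)

    def forward(self, image_feats):
        # image_feats: (B, N_patches, img_dim) from a frozen image encoder.
        x = self.queries.expand(image_feats.size(0), -1, -1)
        for layer in self.layers:
            x = x + layer["self_attn"](x, x, x)[0]                       # queries interact
            x = x + layer["cross_attn"](x, image_feats, image_feats)[0]  # extract visual info
            x = x + layer["ffn"](x)
        return self.proj(x)  # (B, num_queries, llm_dim)

# Usage: frozen ViT-g features in, 32 soft visual prompts out,
# to be prepended to the frozen LLM's text token embeddings.
image_feats = torch.randn(2, 257, 1408)        # placeholder for frozen ViT-g output
visual_prompts = QFormerBridge()(image_feats)  # shape: (2, 32, 2560)
```

In the real model only the Q-Former and the projection are trained; the image encoder and LLM stay frozen, which is what keeps the trainable parameter count small.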
This paper employs the following methods:
- Querying Transformer (Q-Former)
- BLIP-2
- OPT
- FlanT5
- ViT-L
- ViT-g
- Flamingo80B (comparison baseline, not a component of BLIP-2)
The following datasets were used in this research:
- COCO
- Visual Genome
- CC3M
- CC12M
- SBU
- LAION400M
- VQAv2
- GQA
- OK-VQA
- NoCaps
- Flickr30K
The paper reports the following key results:
- BLIP-2 outperforms Flamingo80B by 8.7% on zero-shot VQAv2 with 54x fewer trainable parameters
- Achieves state-of-the-art performance on various vision-language tasks
The following computational resources were used:
- Number of GPUs: 16
- GPU Type: NVIDIA A100 40GB