
Improved Baselines with Visual Instruction Tuning

Haotian Liu (University of Wisconsin-Madison), Chunyuan Li (Microsoft Research), Yuheng Li (University of Wisconsin-Madison), Yong Jae Lee (University of Wisconsin-Madison) (2023)

Paper Information

  • arXiv ID: 2310.03744
  • Venue: Computer Vision and Pattern Recognition
  • Domain: multimodal AI, computer vision, natural language processing
  • SOTA Claim: Yes
  • Code:
  • Reproducibility: 8/10

Abstract

Large multimodal models (LMM) have recently shown encouraging progress with visual instruction tuning. In this paper, we present the first systematic study to investigate the design choices of LMMs in a controlled setting under the LLaVA framework. We show that the fully-connected vision-language connector in LLaVA is surprisingly powerful and data-efficient. With simple modifications to LLaVA, namely, using CLIP-ViT-L-336px with an MLP projection and adding academic-task-oriented VQA data with response formatting prompts, we establish stronger baselines that achieve state-of-the-art across 11 benchmarks. Our final 13B checkpoint uses merely 1.2M publicly available data, and finishes full training in ∼1 day on a single 8-A100 node. Furthermore, we present some early exploration of open problems in LMMs, including scaling to higher resolution inputs, compositional capabilities, and model hallucination, etc. We hope this makes state-of-the-art LMM research more accessible. Code and model will be publicly available.

Summary

This paper presents the LLaVA-1.5 model, an improved large multimodal model (LMM) that optimizes visual instruction tuning. The authors find that a fully-connected vision-language connector is effective and enhances data efficiency. With modifications such as integrating academic-task-oriented VQA datasets and using an MLP projection layer, LLaVA-1.5 establishes stronger baselines, achieving state-of-the-art results across 11 benchmarks while training on just 1.2M publicly available data. The paper explores challenges in LMMs, including scaling to higher resolutions, compositional capabilities, and hallucination issues, aiming to make advancements in LMM research more accessible through open-source contributions. Overall, the study balances multitask learning with effective scaling, introducing a robust and easily reproducible framework for future LMM research.
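
The "response formatting prompts" mentioned above are short instructions appended to academic, short-answer VQA questions so the model learns when to reply tersely rather than conversationally. The sketch below illustrates the idea on a generic VQA-style record; the field names and the wrapper function are illustrative assumptions, and the prompt string follows the paper's description rather than the authors' released preprocessing code.

```python
# Minimal sketch: wrapping an academic VQA record into an instruction-tuning
# conversation with a response formatting prompt. Field names are assumptions.

FORMAT_PROMPT = "Answer the question using a single word or phrase."

def to_conversation(sample: dict) -> dict:
    """Format a short-answer VQA sample so the model learns concise responses."""
    return {
        "image": sample["image"],
        "conversations": [
            {"from": "human",
             "value": f"<image>\n{sample['question']} {FORMAT_PROMPT}"},
            {"from": "gpt", "value": sample["answer"]},
        ],
    }

# Example usage with a hypothetical VQA-v2-style record.
record = {"image": "COCO_val2014_000000000042.jpg",
          "question": "What color is the bus?",
          "answer": "blue"}
print(to_conversation(record))
```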

Methods

This paper employs the following methods:

  • Visual Instruction Tuning
  • MLP Projection (see the sketch after this list)
  • Data Compression
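
For the MLP Projection method listed above, LLaVA-1.5 replaces the original linear vision-language connector with a small multi-layer perceptron. Below is a minimal PyTorch sketch, not the authors' code: the two-layer depth and GELU activation match the paper's described change, while the exact dimensions assume CLIP-ViT-L/14-336 features and a Vicuna-13B embedding size.

```python
import torch
import torch.nn as nn

class VisionLanguageProjector(nn.Module):
    """Two-layer MLP mapping CLIP patch features into the LLM embedding space.

    Illustrative sketch: LLaVA-1.5 swaps the single linear projection for an
    MLP of this shape; dims assume CLIP-ViT-L/14-336 (1024-d) features and a
    Vicuna-13B hidden size (5120-d).
    """

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 5120):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim)
        return self.proj(patch_features)

# 336px input with 14px patches -> 24 x 24 = 576 visual tokens per image.
tokens = VisionLanguageProjector()(torch.randn(1, 576, 1024))
print(tokens.shape)  # torch.Size([1, 576, 5120])
```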

Models Used

  • LLaVA-1.5
  • CLIP-ViT-L-336px
  • Vicuna
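
A hedged sketch of how the listed components connect: the frozen CLIP-ViT-L-336px vision tower encodes the image, and its patch features are what the projector above maps into Vicuna's embedding space. The Hugging Face checkpoint id is an assumption (the commonly hosted version of this model), and the feature-selection detail is illustrative rather than the authors' exact pipeline.

```python
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

# Assumed Hugging Face id for the CLIP-ViT-L-336px vision tower.
MODEL_ID = "openai/clip-vit-large-patch14-336"

processor = CLIPImageProcessor.from_pretrained(MODEL_ID)
vision_tower = CLIPVisionModel.from_pretrained(MODEL_ID).eval()

image = Image.new("RGB", (640, 480), color="white")  # placeholder image
pixel_values = processor(images=image, return_tensors="pt").pixel_values

with torch.no_grad():
    outputs = vision_tower(pixel_values, output_hidden_states=True)

# LLaVA-style models typically take the patch tokens (dropping the CLS token)
# from a late hidden layer before projecting them into the LLM.
patch_features = outputs.hidden_states[-2][:, 1:, :]
print(patch_features.shape)  # (1, 576, 1024)
```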

Datasets

The following datasets were used in this research:

  • VQA-v2
  • GQA
  • MM-Vet
  • OKVQA
  • OCRVQA
  • A-OKVQA
  • TextCaps
  • Visual Genome
  • RefCOCO
  • ShareGPT
  • COCO

Evaluation Metrics

  • F1 score
  • Zero-shot generalization
  • Accuracy
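
The "Accuracy" entry is benchmark-specific; for VQA-v2-style benchmarks it usually means the consensus metric over ten human answers. The sketch below implements the common simplified form of that standard rule as a point of reference; it is not code from this paper, and it omits the official evaluator's answer normalization (lowercasing, article stripping, number-word handling).

```python
def vqa_accuracy(prediction: str, human_answers: list[str]) -> float:
    """Simplified VQA-v2 consensus accuracy: an answer counts as fully
    correct if at least 3 of the 10 annotators gave it."""
    matches = sum(ans == prediction for ans in human_answers)
    return min(matches / 3.0, 1.0)

answers = ["blue"] * 7 + ["dark blue"] * 3
print(vqa_accuracy("blue", answers))       # 1.0
print(vqa_accuracy("dark blue", answers))  # 1.0
```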

Results

  • LLaVA-1.5 achieves state-of-the-art results on 11 benchmarks
  • LLaVA-1.5 finishes training in ∼1 day on a single 8-A100 node
  • LLaVA-1.5 demonstrates improved performance when scaled to higher resolution images
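
On the higher-resolution result above: the paper's exploration divides a large image into encoder-sized crops that are encoded independently and combined with a downsampled global view. The sketch below only illustrates that tiling idea under simplified assumptions; the exact grid configurations and padding strategy in the paper differ.

```python
from PIL import Image

def tile_for_high_res(image: Image.Image, patch_size: int = 336):
    """Split an image into encoder-sized tiles plus a global thumbnail.

    Illustrative sketch of the grid idea for scaling to higher resolution:
    each tile and the resized full image would be passed through the frozen
    vision encoder, and their features concatenated for the LLM.
    """
    cols = -(-image.width // patch_size)   # ceil division
    rows = -(-image.height // patch_size)
    canvas = image.resize((cols * patch_size, rows * patch_size))

    tiles = [
        canvas.crop((c * patch_size, r * patch_size,
                     (c + 1) * patch_size, (r + 1) * patch_size))
        for r in range(rows) for c in range(cols)
    ]
    global_view = image.resize((patch_size, patch_size))  # coarse context
    return tiles, global_view

tiles, global_view = tile_for_high_res(Image.new("RGB", (1000, 600)))
print(len(tiles), tiles[0].size, global_view.size)  # 6 (336, 336) (336, 336)
```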

Limitations

The authors identified the following limitations:

  • Prolonged training for high-resolution images
  • Lack of multiple-image understanding
  • Limited problem-solving capabilities in certain fields
  • Still prone to hallucinations
  • Requires cautious use in critical applications

Technical Requirements

  • Number of GPUs: 8
  • GPU Type: NVIDIA A100

Keywords

large multimodal models, visual instruction tuning, LLaVA framework, multimodal benchmarks, vision-language models

Papers Using Similar Methods

External Resources