Venue
Computer Vision and Pattern Recognition
Domain
multimodal AI, computer vision, natural language processing
Large multimodal models (LMMs) have recently shown encouraging progress with visual instruction tuning. In this paper, we present the first systematic study to investigate the design choices of LMMs in a controlled setting under the LLaVA framework. We show that the fully-connected vision-language connector in LLaVA is surprisingly powerful and data-efficient. With simple modifications to LLaVA, namely, using CLIP-ViT-L-336px with an MLP projection and adding academic-task-oriented VQA data with response formatting prompts, we establish stronger baselines that achieve state-of-the-art across 11 benchmarks. Our final 13B checkpoint uses merely 1.2M publicly available data, and finishes full training in ∼1 day on a single 8-A100 node. Furthermore, we present some early exploration of open problems in LMMs, including scaling to higher resolution inputs, compositional capabilities, and model hallucination. We hope this makes state-of-the-art LMM research more accessible. Code and model will be publicly available.
This paper presents the LLaVA-1.5 model, an improved large multimodal model (LMM) that optimizes visual instruction tuning. The authors find that a fully-connected vision-language connector is effective and enhances data efficiency. With modifications such as integrating academic-task-oriented VQA datasets and using an MLP projection layer, LLaVA-1.5 establishes stronger baselines, achieving state-of-the-art results across 11 benchmarks while training on just 1.2M publicly available data. The paper explores challenges in LMMs, including scaling to higher resolutions, compositional capabilities, and hallucination issues, aiming to make advancements in LMM research more accessible through open-source contributions. Overall, the study balances multitask learning with effective scaling, introducing a robust and easily reproducible framework for future LMM research.
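The MLP projection that replaces LLaVA's original linear vision-language connector can be sketched as a small two-layer MLP with a GELU activation, mapping CLIP patch features into the LLM's token-embedding space. This is a minimal NumPy sketch, not the authors' implementation; the 1024/5120 dimensions and the 576-patch count (24×24 patches from a 336px CLIP-ViT-L) are assumptions based on CLIP-ViT-L and Vicuna-13B, not stated in this summary.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

class MLPProjector:
    """Two-layer MLP vision-language connector (illustrative sketch)."""

    def __init__(self, vision_dim=1024, hidden_dim=5120, llm_dim=5120, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.standard_normal((vision_dim, hidden_dim)) * 0.02
        self.b1 = np.zeros(hidden_dim)
        self.w2 = rng.standard_normal((hidden_dim, llm_dim)) * 0.02
        self.b2 = np.zeros(llm_dim)

    def __call__(self, visual_tokens):
        # visual_tokens: (num_patches, vision_dim) CLIP patch features
        h = gelu(visual_tokens @ self.w1 + self.b1)
        return h @ self.w2 + self.b2  # (num_patches, llm_dim)

proj = MLPProjector()
tokens = np.zeros((576, 1024))  # 576 patches assumed for a 336px input
out = proj(tokens)
print(out.shape)  # (576, 5120)
```

The projected tokens would then be concatenated with text embeddings before being fed to the language model.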
This paper employs the following methods:
- Visual Instruction Tuning
- MLP Projection
- Data Compression
- LLaVA-1.5
- CLIP-ViT-L-336px
- Vicuna
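The "response formatting prompts" added to academic VQA data can be illustrated as an instruction suffix that tells the model when a terse answer is expected, so that short-answer training data does not degrade conversational ability. The exact prompt wording below is an assumption for illustration, not quoted from the paper.

```python
def format_vqa_prompt(question, short_answer=True):
    """Append a response-format instruction to a VQA question (sketch).

    Hypothetical helper: the suffix text is an assumed example of a
    response formatting prompt, not the authors' exact wording.
    """
    suffix = "\nAnswer the question using a single word or phrase."
    return question + (suffix if short_answer else "")

print(format_vqa_prompt("What color is the bus?"))
```

Without such a cue, models tuned on short-answer VQA pairs tend to reply tersely even in open-ended conversation.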
The following datasets were used in this research:
- VQA-v2
- GQA
- MM-Vet
- OKVQA
- OCRVQA
- A-OKVQA
- TextCaps
- Visual Genome
- RefCOCO
- ShareGPT
- COCO
The following evaluation metrics were used:
- F1 score
- Zero-shot generalization
- Accuracy
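Accuracy and F1 on short-answer predictions can be computed as in this minimal sketch. It is illustrative only: benchmark-specific scoring (e.g. VQA-v2's soft accuracy over multiple annotator answers) differs from plain exact-match accuracy.

```python
def accuracy(preds, golds):
    # fraction of exact-match predictions
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def f1_binary(preds, golds, positive="yes"):
    # F1 for one positive class, e.g. yes/no VQA questions
    tp = sum(p == positive and g == positive for p, g in zip(preds, golds))
    fp = sum(p == positive and g != positive for p, g in zip(preds, golds))
    fn = sum(p != positive and g == positive for p, g in zip(preds, golds))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

preds = ["yes", "no", "yes", "yes"]
golds = ["yes", "yes", "yes", "no"]
print(accuracy(preds, golds))                 # 0.5
print(round(f1_binary(preds, golds), 3))      # 0.667
```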
The paper reports the following results:
- LLaVA-1.5 achieves state-of-the-art results on 11 benchmarks
- LLaVA-1.5 finishes training in ∼1 day on a single 8-A100 node
- LLaVA-1.5 demonstrates improved performance when scaled to higher resolution images
The authors identified the following limitations:
- Prolonged training for high-resolution images
- Lack of multiple-image understanding
- Limited problem-solving capabilities in certain fields
- Still prone to hallucinations
- Requires cautious use in critical applications
The following compute resources were used:
- Number of GPUs: 8
- GPU Type: NVIDIA A100
Keywords: large multimodal models, visual instruction tuning, LLaVA framework, multimodal benchmarks, vision-language models