Venue
Computer Vision and Pattern Recognition
Domain
multimodal AI, computer vision, natural language processing
Large multimodal models (LMMs) have recently shown encouraging progress with visual instruction tuning. In this paper, we present the first systematic study to investigate the design choices of LMMs in a controlled setting under the LLaVA framework. We show that the fully-connected vision-language connector in LLaVA is surprisingly powerful and data-efficient. With simple modifications to LLaVA, namely, using CLIP-ViT-L-336px with an MLP projection and adding academic-task-oriented VQA data with response formatting prompts, we establish stronger baselines that achieve state-of-the-art across 11 benchmarks. Our final 13B checkpoint uses merely 1.2M publicly available data, and finishes full training in ∼1 day on a single 8-A100 node. Furthermore, we present some early exploration of open problems in LMMs, including scaling to higher resolution inputs, compositional capabilities, and model hallucination. We hope this makes state-of-the-art LMM research more accessible. Code and model will be publicly available.
This paper presents the LLaVA-1.5 model, an improved large multimodal model (LMM) that optimizes visual instruction tuning. The authors find that a fully-connected vision-language connector is effective and enhances data efficiency. With modifications such as integrating academic-task-oriented VQA datasets and using an MLP projection layer, LLaVA-1.5 establishes stronger baselines, achieving state-of-the-art results across 11 benchmarks while training on just 1.2M publicly available data. The paper explores challenges in LMMs, including scaling to higher resolutions, compositional capabilities, and hallucination issues, aiming to make advancements in LMM research more accessible through open-source contributions. Overall, the study balances multitask learning with effective scaling, introducing a robust and easily reproducible framework for future LMM research.
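The MLP projection that replaces LLaVA's original linear vision-language connector can be sketched as a small two-layer MLP with a GELU activation, mapping CLIP patch features into the LLM's token-embedding space. This is a minimal NumPy sketch, not the authors' implementation; the 1024/5120 dimensions and the 576-patch count (24×24 patches from a 336px CLIP-ViT-L) are assumptions based on CLIP-ViT-L and Vicuna-13B, not stated in this summary.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

class MLPProjector:
    """Two-layer MLP vision-language connector (illustrative sketch)."""

    def __init__(self, vision_dim=1024, hidden_dim=5120, llm_dim=5120, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.standard_normal((vision_dim, hidden_dim)) * 0.02
        self.b1 = np.zeros(hidden_dim)
        self.w2 = rng.standard_normal((hidden_dim, llm_dim)) * 0.02
        self.b2 = np.zeros(llm_dim)

    def __call__(self, visual_tokens):
        # visual_tokens: (num_patches, vision_dim) CLIP patch features
        h = gelu(visual_tokens @ self.w1 + self.b1)
        return h @ self.w2 + self.b2  # (num_patches, llm_dim)

proj = MLPProjector()
tokens = np.zeros((576, 1024))  # 576 patches assumed for a 336px input
out = proj(tokens)
print(out.shape)  # (576, 5120)
```

The projected tokens would then be concatenated with text embeddings before being fed to the language model.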
This paper employs the following methods:
- Visual Instruction Tuning
- MLP Projection
- Data Compression
- LLaVA-1.5
- CLIP-ViT-L-336px
- Vicuna
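The "response formatting prompts" added to academic VQA data can be illustrated as an instruction suffix that tells the model when a terse answer is expected, so that short-answer training data does not degrade conversational ability. The exact prompt wording below is an assumption for illustration, not quoted from the paper.

```python
def format_vqa_prompt(question, short_answer=True):
    """Append a response-format instruction to a VQA question (sketch).

    Hypothetical helper: the suffix text is an assumed example of a
    response formatting prompt, not the authors' exact wording.
    """
    suffix = "\nAnswer the question using a single word or phrase."
    return question + (suffix if short_answer else "")

print(format_vqa_prompt("What color is the bus?"))
```

Without such a cue, models tuned on short-answer VQA pairs tend to reply tersely even in open-ended conversation.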
The following datasets were used in this research:
- VQA-v2
- GQA
- MM-Vet
- OKVQA
- OCRVQA
- A-OKVQA
- TextCaps
- Visual Genome
- RefCOCO
- ShareGPT
- COCO
The following evaluation metrics were used:
- F1 score
- Zero-shot generalization
- Accuracy
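Accuracy and F1 on short-answer predictions can be computed as in this minimal sketch. It is illustrative only: benchmark-specific scoring (e.g. VQA-v2's soft accuracy over multiple annotator answers) differs from plain exact-match accuracy.

```python
def accuracy(preds, golds):
    # fraction of exact-match predictions
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def f1_binary(preds, golds, positive="yes"):
    # F1 for one positive class, e.g. yes/no VQA questions
    tp = sum(p == positive and g == positive for p, g in zip(preds, golds))
    fp = sum(p == positive and g != positive for p, g in zip(preds, golds))
    fn = sum(p != positive and g == positive for p, g in zip(preds, golds))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

preds = ["yes", "no", "yes", "yes"]
golds = ["yes", "yes", "yes", "no"]
print(accuracy(preds, golds))                 # 0.5
print(round(f1_binary(preds, golds), 3))      # 0.667
```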
The paper reports the following results:
- LLaVA-1.5 achieves state-of-the-art results on 11 benchmarks
- LLaVA-1.5 finishes training in ∼1 day on a single 8-A100 node
- LLaVA-1.5 demonstrates improved performance when scaled to higher resolution images
The authors identified the following limitations:
- Prolonged training for high-resolution images
- Lack of multiple-image understanding
- Limited problem-solving capabilities in certain fields
- Still prone to hallucinations
- Requires cautious use in critical applications
The following compute resources were used:
- Number of GPUs: 8
- GPU Type: NVIDIA A100
Keywords: large multimodal models, visual instruction tuning, LLaVA framework, multimodal benchmarks, vision-language models