
LLaVA-OneVision: Easy Visual Task Transfer

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, Chunyuan Li (2024)

Paper Information
arXiv ID: 2408.03326
Venue: arXiv.org
Domain: artificial intelligence; computer vision; natural language processing
SOTA Claim: Yes
Code: Available
Reproducibility: 8/10

Abstract

We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed by consolidating our insights into data, models, and visual representations in the LLaVA-NeXT blog series. Our experimental results demonstrate that LLaVA-OneVision is the first single model that can simultaneously push the performance boundaries of open LMMs in three important computer vision scenarios: single-image, multi-image, and video scenarios. Importantly, the design of LLaVA-OneVision allows strong transfer learning across different modalities/scenarios, yielding new emerging capabilities. In particular, strong video understanding and cross-scenario capabilities are demonstrated through task transfer from images to videos.

Summary

LLaVA-OneVision is an open family of large multimodal models designed to perform well across three computer vision scenarios: single-image, multi-image, and video understanding. The model consolidates insights from the earlier LLaVA-NeXT series and relies on strong transfer learning across modalities, which yields improved video understanding and cross-scenario capabilities. Its architecture pairs Qwen-2 as the language backbone with SigLIP as the visual encoder, and the paper emphasizes the importance of high-quality training data. Key contributions include a flexible visual representation framework and strong performance across diverse benchmarks relative to state-of-the-art models. The paper also showcases emerging capabilities obtained through task transfer and highlights the model's open-source release for community accessibility.
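As a rough illustration of the architecture described above, the sketch below wires a SigLIP-style vision encoder into a Qwen-2-style language model through an MLP projector. This is a minimal sketch assuming a standard LLaVA-style design; the class names, the two-layer projector, and the hidden sizes (1152 for a SigLIP-SO400M encoder, 3584 for Qwen2-7B) are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of a LLaVA-OneVision-style architecture: a SigLIP vision
# encoder feeds image features through an MLP projector into the Qwen-2
# token embedding space. Names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn


class VisionLanguageConnector(nn.Module):
    """Two-layer MLP projector from vision features to the LLM embedding space."""

    def __init__(self, vision_dim: int = 1152, llm_dim: int = 3584):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, visual_features: torch.Tensor) -> torch.Tensor:
        # visual_features: (batch, num_visual_tokens, vision_dim)
        return self.proj(visual_features)


class LlavaOneVisionSketch(nn.Module):
    """Toy composition: vision encoder -> projector -> language model."""

    def __init__(self, vision_encoder: nn.Module, language_model: nn.Module):
        super().__init__()
        self.vision_encoder = vision_encoder      # e.g. a SigLIP ViT
        self.connector = VisionLanguageConnector()
        self.language_model = language_model      # e.g. a Qwen-2 causal LM

    def forward(self, images: torch.Tensor, text_embeds: torch.Tensor):
        visual_tokens = self.connector(self.vision_encoder(images))
        # Concatenate visual tokens with text embeddings so the LLM
        # attends over the joint multimodal sequence.
        inputs = torch.cat([visual_tokens, text_embeds], dim=1)
        return self.language_model(inputs_embeds=inputs)
```

The same projector path can serve single-image, multi-image, and video inputs; only the number of visual tokens per sample changes across scenarios.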

Methods

This paper employs the following methods (a staged-training configuration sketch follows the list):

  • Transfer Learning
  • Language-Image Alignment
  • High-Quality Knowledge Learning
  • Visual Instruction Tuning
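The three training-related methods above form a staged curriculum: language-image alignment, then high-quality knowledge learning, then visual instruction tuning. Below is a minimal configuration sketch of such a schedule; the trainable-module choices, the data mixes, and the run_curriculum/trainer_factory helpers are hypothetical placeholders for illustration, not the paper's exact recipe.

```python
# Hypothetical staged-training configuration mirroring the curriculum named
# above (alignment -> knowledge learning -> instruction tuning). The modules
# unfrozen in each stage and the data sources are assumptions for illustration.
from dataclasses import dataclass


@dataclass
class TrainingStage:
    name: str
    trainable_modules: list[str]
    data_sources: list[str]


CURRICULUM = [
    TrainingStage(
        name="language-image alignment",
        trainable_modules=["projector"],  # vision encoder and LLM kept frozen
        data_sources=["BLIP558K"],
    ),
    TrainingStage(
        name="high-quality knowledge learning",
        trainable_modules=["projector", "vision_encoder", "llm"],
        data_sources=["document/OCR data", "recaptioned image-text pairs"],
    ),
    TrainingStage(
        name="visual instruction tuning",
        trainable_modules=["projector", "vision_encoder", "llm"],
        data_sources=["single-image instructions", "multi-image and video instructions"],
    ),
]


def run_curriculum(model, trainer_factory):
    """Run each stage in order, training only the modules listed for that stage."""
    for stage in CURRICULUM:
        trainer = trainer_factory(model, stage)  # hypothetical trainer constructor
        trainer.train()
```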

Models Used

  • LLaVA-OneVision
  • Qwen-2
  • SigLIP

Datasets

The following datasets were used in this research:

  • COCO118K
  • BLIP558K
  • CC3M
  • UReader
  • Evo-Instruct

Evaluation Metrics

  • Accuracy

Results

  • State-of-the-art performance in single-image, multi-image, and video benchmarks
  • Emerging capabilities through task transfer
  • Competitive performance against proprietary models like GPT-4V and GPT-4o

Limitations

The authors identified the following limitations:

  • Performance gap in complex visual chat scenarios
  • Need for larger training data and better preference learning for certain tasks

Technical Requirements

  • Number of GPUs: None specified
  • GPU Type: None specified

Keywords

large multimodal models, visual transfer learning, multi-image understanding, video comprehension
