
LLaVA-OneVision: Easy Visual Task Transfer

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, Chunyuan Li (2024)

Paper Information
arXiv ID: 2408.03326
Venue: arXiv.org
Domain: artificial intelligence; computer vision; natural language processing
SOTA Claim: Yes
Code: Available
Reproducibility: 8/10

Abstract

We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed by consolidating our insights into data, models, and visual representations in the LLaVA-NeXT blog series. Our experimental results demonstrate that LLaVA-OneVision is the first single model that can simultaneously push the performance boundaries of open LMMs in three important computer vision scenarios: single-image, multi-image, and video scenarios. Importantly, the design of LLaVA-OneVision allows strong transfer learning across different modalities/scenarios, yielding new emerging capabilities. In particular, strong video understanding and cross-scenario capabilities are demonstrated through task transfer from images to videos.

Summary

LLaVA-OneVision is an open family of large multimodal models designed to perform well across three computer vision scenarios: single-image, multi-image, and video understanding. The model consolidates insights from the earlier LLaVA-NeXT series and relies on strong transfer learning across modalities, which yields improved video understanding and cross-scenario capabilities. Its architecture pairs Qwen-2 as the language backbone with SigLIP as the visual encoder, and the paper emphasizes the importance of high-quality training data. Key contributions include a flexible visual representation framework and strong performance across diverse benchmarks relative to state-of-the-art models. The paper also showcases emerging capabilities obtained through task transfer and highlights the model's open-source release for community accessibility.
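As a rough illustration of the architecture described above, the sketch below wires a SigLIP-style vision encoder into a Qwen-2-style language model through an MLP projector. This is a minimal sketch assuming a standard LLaVA-style design; the class names, the two-layer projector, and the hidden sizes (1152 for a SigLIP-SO400M encoder, 3584 for Qwen2-7B) are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of a LLaVA-OneVision-style architecture: a SigLIP vision
# encoder feeds image features through an MLP projector into the Qwen-2
# token embedding space. Names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn


class VisionLanguageConnector(nn.Module):
    """Two-layer MLP projector from vision features to the LLM embedding space."""

    def __init__(self, vision_dim: int = 1152, llm_dim: int = 3584):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, visual_features: torch.Tensor) -> torch.Tensor:
        # visual_features: (batch, num_visual_tokens, vision_dim)
        return self.proj(visual_features)


class LlavaOneVisionSketch(nn.Module):
    """Toy composition: vision encoder -> projector -> language model."""

    def __init__(self, vision_encoder: nn.Module, language_model: nn.Module):
        super().__init__()
        self.vision_encoder = vision_encoder      # e.g. a SigLIP ViT
        self.connector = VisionLanguageConnector()
        self.language_model = language_model      # e.g. a Qwen-2 causal LM

    def forward(self, images: torch.Tensor, text_embeds: torch.Tensor):
        visual_tokens = self.connector(self.vision_encoder(images))
        # Concatenate visual tokens with text embeddings so the LLM
        # attends over the joint multimodal sequence.
        inputs = torch.cat([visual_tokens, text_embeds], dim=1)
        return self.language_model(inputs_embeds=inputs)
```

The same projector path can serve single-image, multi-image, and video inputs; only the number of visual tokens per sample changes across scenarios.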

Methods

This paper employs the following methods (a staged-training configuration sketch follows the list):

  • Transfer Learning
  • Language-Image Alignment
  • High-Quality Knowledge Learning
  • Visual Instruction Tuning
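The three training-related methods above form a staged curriculum: language-image alignment, then high-quality knowledge learning, then visual instruction tuning. Below is a minimal configuration sketch of such a schedule; the trainable-module choices, the data mixes, and the run_curriculum/trainer_factory helpers are hypothetical placeholders for illustration, not the paper's exact recipe.

```python
# Hypothetical staged-training configuration mirroring the curriculum named
# above (alignment -> knowledge learning -> instruction tuning). The modules
# unfrozen in each stage and the data sources are assumptions for illustration.
from dataclasses import dataclass


@dataclass
class TrainingStage:
    name: str
    trainable_modules: list[str]
    data_sources: list[str]


CURRICULUM = [
    TrainingStage(
        name="language-image alignment",
        trainable_modules=["projector"],  # vision encoder and LLM kept frozen
        data_sources=["BLIP558K"],
    ),
    TrainingStage(
        name="high-quality knowledge learning",
        trainable_modules=["projector", "vision_encoder", "llm"],
        data_sources=["document/OCR data", "recaptioned image-text pairs"],
    ),
    TrainingStage(
        name="visual instruction tuning",
        trainable_modules=["projector", "vision_encoder", "llm"],
        data_sources=["single-image instructions", "multi-image and video instructions"],
    ),
]


def run_curriculum(model, trainer_factory):
    """Run each stage in order, training only the modules listed for that stage."""
    for stage in CURRICULUM:
        trainer = trainer_factory(model, stage)  # hypothetical trainer constructor
        trainer.train()
```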

Models Used

  • LLaVA-OneVision
  • Qwen-2
  • SigLIP

Datasets

The following datasets were used in this research:

  • COCO118K
  • BLIP558K
  • CC3M
  • UReader
  • Evo-Instruct

Evaluation Metrics

  • Accuracy

Results

  • State-of-the-art performance in single-image, multi-image, and video benchmarks
  • Emerging capabilities through task transfer
  • Competitive performance against proprietary models like GPT-4V and GPT-4o

Limitations

The authors identified the following limitations:

  • Performance gap in complex visual chat scenarios
  • Need for larger training data and better preference learning for certain tasks

Technical Requirements

  • Number of GPUs: None specified
  • GPU Type: None specified

Keywords

large multimodal models, visual transfer learning, multi-image understanding, video comprehension
