
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

Feng Li, Renrui Zhang, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma, Chunyuan Li (2024)

Paper Information
arXiv ID: 2407.07895
Venue: arXiv.org
Domain: Not specified
SOTA Claim: Yes

Abstract

Figure 1. Performance comparison in three interleaved scenarios, including multi-image, multi-frame (video), and multi-view (3D). Our LLaVA-NeXT-Interleave model achieves SoTA performance across a variety of evaluation benchmarks.

Summary

The paper presents LLaVA-NeXT-Interleave, a large multimodal model (LMM) designed to handle diverse real-world scenarios such as multi-image, video, and 3D data by using an image-text interleaved format as a universal data template. This format unifies previously fragmented training methodologies, enabling more efficient and scalable training across multimodal tasks. The authors introduce the M4-Instruct dataset, comprising 1,177,600 samples across 14 tasks and 41 datasets, and an evaluation benchmark named LLaVA-Interleave Bench to assess performance in multi-image contexts. The model achieves state-of-the-art results across various benchmarks, preserves single-image task performance, and demonstrates emerging capabilities through cross-task transfer.
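
To make the interleaved data template concrete, below is a minimal sketch assuming a LLaVA-style conversation schema in which every visual input becomes an ordered list of images and each image is referenced in the text by an `<image>` placeholder. The field names, file names, questions, and answers are illustrative assumptions, not taken from the released M4-Instruct data.

```python
# Minimal sketch (illustrative, not the authors' exact schema) showing how
# multi-image, video, and 3D multi-view samples can share one interleaved
# image-text template: visual inputs are an ordered image list, and the text
# references each image through an <image> placeholder token.

IMAGE_TOKEN = "<image>"  # expanded into vision features during training

multi_image_sample = {
    "images": ["chart_2022.png", "chart_2023.png"],  # hypothetical files
    "conversations": [
        {"from": "human",
         "value": f"{IMAGE_TOKEN}\n{IMAGE_TOKEN}\nHow did revenue change between the two charts?"},
        {"from": "gpt", "value": "Revenue increased in the second chart."},
    ],
}

video_sample = {
    # a video is represented by its sampled frames, reusing the same template
    "images": [f"clip_frame_{i:03d}.jpg" for i in range(8)],
    "conversations": [
        {"from": "human",
         "value": "\n".join([IMAGE_TOKEN] * 8) + "\nDescribe what happens over time."},
        {"from": "gpt", "value": "A person picks up a cup, drinks, and sets it down."},
    ],
}

multi_view_3d_sample = {
    # a 3D scene is represented by multiple views of the same scene
    "images": [f"scene_view_{i}.png" for i in range(4)],
    "conversations": [
        {"from": "human",
         "value": "\n".join([IMAGE_TOKEN] * 4) + "\nHow many chairs are in the room?"},
        {"from": "gpt", "value": "There are three chairs around the table."},
    ],
}
```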

Methods

This paper employs the following methods:

  • LLaVA-NeXT-Interleave
  • interleaved data format (see the sketch after this list)
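
The interleaved data format can be illustrated with a minimal, assumed sketch (not the released implementation): a prompt is split on `<image>` placeholders and per-image visual features are spliced into those positions, so multi-image, video-frame, and multi-view inputs all reduce to the same mechanism. The function name and dummy feature values below are hypothetical.

```python
# Sketch of flattening one interleaved sample into a single sequence for the
# language model (assumed mechanism, not the paper's exact code).

from typing import List

IMAGE_TOKEN = "<image>"

def build_interleaved_sequence(text: str, image_features: List[List[float]]) -> List:
    """Interleave text chunks with visual features in prompt order."""
    chunks = text.split(IMAGE_TOKEN)
    if len(chunks) - 1 != len(image_features):
        raise ValueError("number of <image> tokens must match number of images")

    sequence: List = []
    for i, chunk in enumerate(chunks):
        if chunk:
            sequence.append(("text", chunk))               # later tokenized into text tokens
        if i < len(image_features):
            sequence.append(("image", image_features[i]))  # later projected vision embeddings
    return sequence

# Example: two images interleaved with a question about both.
seq = build_interleaved_sequence(
    f"{IMAGE_TOKEN} and {IMAGE_TOKEN} differ how?",
    image_features=[[0.1, 0.2], [0.3, 0.4]],  # dummy stand-ins for encoder outputs
)
```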

Models Used

  • LLaVA-NeXT-Interleave
  • LLaVA-NeXT-Image

Datasets

The following datasets were used in this research:

  • M4-Instruct
  • LLaVA-Interleave Bench

Evaluation Metrics

  • Accuracy
  • GPT score (see the sketch after this list)
  • Correctness of Information
  • Detail Orientation
  • Context Understanding
  • Temporal Understanding
  • Consistency
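
The GPT-score dimensions above are typically computed with an external judge model that rates each prediction from 1 to 5 per dimension, with per-dimension averages reported. The sketch below follows this common GPT-as-judge practice (e.g., a Video-ChatGPT-style protocol) rather than the paper's exact evaluation script; the prompt wording and scores are illustrative.

```python
# Minimal sketch (assumed, following common GPT-as-judge practice, not the
# paper's exact script) of aggregating per-dimension 1-5 judge scores.

from statistics import mean

DIMENSIONS = [
    "Correctness of Information",
    "Detail Orientation",
    "Context Understanding",
    "Temporal Understanding",
    "Consistency",
]

def judge_prompt(dimension: str, question: str, reference: str, prediction: str) -> str:
    """Build the text sent to the judge model for one dimension (illustrative wording)."""
    return (
        f"Rate the assistant's answer for {dimension} on a scale of 1-5.\n"
        f"Question: {question}\nReference answer: {reference}\n"
        f"Assistant answer: {prediction}\nReply with a single integer."
    )

def aggregate_scores(per_sample_scores: list) -> dict:
    """Average the 1-5 judge scores over all evaluation samples, per dimension."""
    return {d: mean(s[d] for s in per_sample_scores) for d in DIMENSIONS}

# Example with two already-judged samples (scores here are made up).
print(aggregate_scores([
    {d: 4 for d in DIMENSIONS},
    {d: 3 for d in DIMENSIONS},
]))
```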

Results

  • Achieves state-of-the-art (SoTA) performance across various multi-image tasks
  • Maintains single-image performance

Technical Requirements

  • Number of GPUs: None specified
  • GPU Type: None specified

External Resources