
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

Feng Li, Renrui Zhang, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma, Chunyuan Li (2024)

Paper Information
arXiv ID: 2407.07895
Venue: arXiv.org
Domain: Not specified
SOTA Claim: Yes

Abstract

Figure 1. Performance comparison in three interleaved scenarios, including multi-image, multi-frame (video), and multi-view (3D). Our LLaVA-NeXT-Interleave model achieves SoTA performance across a variety of evaluation benchmarks.

Summary

The paper presents LLaVA-NeXT-Interleave, a large multimodal model (LMM) designed to handle diverse real-world scenarios such as multi-image, video, and 3D data by using an image-text interleaved format as a universal data template. This format unifies previously fragmented training methodologies, enabling more efficient and scalable training across multimodal tasks. The authors introduce the M4-Instruct dataset, comprising 1,177,600 samples across 14 tasks and 41 datasets, and an evaluation benchmark named LLaVA-Interleave Bench to assess performance in multi-image contexts. The model achieves state-of-the-art results across various benchmarks, preserves single-image task performance, and demonstrates emerging capabilities through cross-task transfer.
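
To make the interleaved data template concrete, below is a minimal sketch assuming a LLaVA-style conversation schema in which every visual input becomes an ordered list of images and each image is referenced in the text by an `<image>` placeholder. The field names, file names, questions, and answers are illustrative assumptions, not taken from the released M4-Instruct data.

```python
# Minimal sketch (illustrative, not the authors' exact schema) showing how
# multi-image, video, and 3D multi-view samples can share one interleaved
# image-text template: visual inputs are an ordered image list, and the text
# references each image through an <image> placeholder token.

IMAGE_TOKEN = "<image>"  # expanded into vision features during training

multi_image_sample = {
    "images": ["chart_2022.png", "chart_2023.png"],  # hypothetical files
    "conversations": [
        {"from": "human",
         "value": f"{IMAGE_TOKEN}\n{IMAGE_TOKEN}\nHow did revenue change between the two charts?"},
        {"from": "gpt", "value": "Revenue increased in the second chart."},
    ],
}

video_sample = {
    # a video is represented by its sampled frames, reusing the same template
    "images": [f"clip_frame_{i:03d}.jpg" for i in range(8)],
    "conversations": [
        {"from": "human",
         "value": "\n".join([IMAGE_TOKEN] * 8) + "\nDescribe what happens over time."},
        {"from": "gpt", "value": "A person picks up a cup, drinks, and sets it down."},
    ],
}

multi_view_3d_sample = {
    # a 3D scene is represented by multiple views of the same scene
    "images": [f"scene_view_{i}.png" for i in range(4)],
    "conversations": [
        {"from": "human",
         "value": "\n".join([IMAGE_TOKEN] * 4) + "\nHow many chairs are in the room?"},
        {"from": "gpt", "value": "There are three chairs around the table."},
    ],
}
```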

Methods

This paper employs the following methods:

  • LLaVA-NeXT-Interleave
  • interleaved data format (see the sketch after this list)
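
The interleaved data format can be illustrated with a minimal, assumed sketch (not the released implementation): a prompt is split on `<image>` placeholders and per-image visual features are spliced into those positions, so multi-image, video-frame, and multi-view inputs all reduce to the same mechanism. The function name and dummy feature values below are hypothetical.

```python
# Sketch of flattening one interleaved sample into a single sequence for the
# language model (assumed mechanism, not the paper's exact code).

from typing import List

IMAGE_TOKEN = "<image>"

def build_interleaved_sequence(text: str, image_features: List[List[float]]) -> List:
    """Interleave text chunks with visual features in prompt order."""
    chunks = text.split(IMAGE_TOKEN)
    if len(chunks) - 1 != len(image_features):
        raise ValueError("number of <image> tokens must match number of images")

    sequence: List = []
    for i, chunk in enumerate(chunks):
        if chunk:
            sequence.append(("text", chunk))               # later tokenized into text tokens
        if i < len(image_features):
            sequence.append(("image", image_features[i]))  # later projected vision embeddings
    return sequence

# Example: two images interleaved with a question about both.
seq = build_interleaved_sequence(
    f"{IMAGE_TOKEN} and {IMAGE_TOKEN} differ how?",
    image_features=[[0.1, 0.2], [0.3, 0.4]],  # dummy stand-ins for encoder outputs
)
```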

Models Used

  • LLaVA-NeXT-Interleave
  • LLaVA-NeXT-Image

Datasets

The following datasets were used in this research:

  • M4-Instruct
  • LLaVA-Interleave Bench

Evaluation Metrics

  • Accuracy
  • GPT score (see the sketch after this list)
  • Correctness of Information
  • Detail Orientation
  • Context Understanding
  • Temporal Understanding
  • Consistency
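
The GPT-score dimensions above are typically computed with an external judge model that rates each prediction from 1 to 5 per dimension, with per-dimension averages reported. The sketch below follows this common GPT-as-judge practice (e.g., a Video-ChatGPT-style protocol) rather than the paper's exact evaluation script; the prompt wording and scores are illustrative.

```python
# Minimal sketch (assumed, following common GPT-as-judge practice, not the
# paper's exact script) of aggregating per-dimension 1-5 judge scores.

from statistics import mean

DIMENSIONS = [
    "Correctness of Information",
    "Detail Orientation",
    "Context Understanding",
    "Temporal Understanding",
    "Consistency",
]

def judge_prompt(dimension: str, question: str, reference: str, prediction: str) -> str:
    """Build the text sent to the judge model for one dimension (illustrative wording)."""
    return (
        f"Rate the assistant's answer for {dimension} on a scale of 1-5.\n"
        f"Question: {question}\nReference answer: {reference}\n"
        f"Assistant answer: {prediction}\nReply with a single integer."
    )

def aggregate_scores(per_sample_scores: list) -> dict:
    """Average the 1-5 judge scores over all evaluation samples, per dimension."""
    return {d: mean(s[d] for s in per_sample_scores) for d in DIMENSIONS}

# Example with two already-judged samples (scores here are made up).
print(aggregate_scores([
    {d: 4 for d in DIMENSIONS},
    {d: 3 for d in DIMENSIONS},
]))
```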

Results

  • Achieves state-of-the-art (SoTA) performance across various multi-image tasks
  • Maintains single-image performance

Technical Requirements

  • Number of GPUs: None specified
  • GPU Type: None specified

External Resources