Feng Li, Renrui Zhang, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma, Chunyuan Li (2024)
The paper presents LLaVA-NeXT-Interleave, a large multimodal model (LMM) designed to handle diverse real-world scenarios such as multi-image, video, and 3D data by using an image-text interleaved format as a universal data template. This unified template replaces previously fragmented, per-scenario training pipelines, enabling more efficient and scalable training across multimodal tasks. The authors introduce the M4-Instruct dataset, comprising 1,177,600 samples spanning 14 tasks and 41 source datasets, along with an evaluation benchmark, LLaVA-Interleave Bench, for assessing performance in multi-image settings. The model achieves state-of-the-art results across various benchmarks, preserves single-image task performance, and exhibits emerging capabilities through cross-task transfer.
This paper employs the following methods: image-text interleaved data formatting as a universal template, unified instruction tuning across multi-image, video, and 3D inputs, and evaluation of cross-task transfer.
The following datasets were used in this research: M4-Instruct (1,177,600 training samples drawn from 41 source datasets across 14 tasks) and LLaVA-Interleave Bench (evaluation).
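To make the interleaved data template described in the summary concrete, below is a minimal sketch of how single-image, multi-image, and video (frames-as-images) samples can all be expressed in one image-text interleaved conversation format. The schema is an assumption modeled on LLaVA-style JSON records; the field names ("images", "conversations", the "<image>" placeholder) are illustrative and not necessarily the paper's exact format.

```python
# Sketch of an image-text interleaved training sample (assumed LLaVA-style schema).
def make_interleaved_sample(image_paths, question, answer):
    """Build one sample: each <image> token in the prompt is matched
    positionally to an entry in `image_paths`."""
    placeholders = " ".join("<image>" for _ in image_paths)
    return {
        "images": list(image_paths),
        "conversations": [
            {"from": "human", "value": f"{placeholders}\n{question}"},
            {"from": "gpt", "value": answer},
        ],
    }

# The same template covers different scenarios by varying the image list.
single_image = make_interleaved_sample(
    ["chart.png"],
    "What trend does the chart show?",
    "Sales rise after Q2.",
)
multi_image = make_interleaved_sample(
    ["before.png", "after.png"],
    "What changed between the two photos?",
    "The second photo adds a bridge across the river.",
)
video_as_frames = make_interleaved_sample(
    [f"clip/frame_{i:03d}.jpg" for i in range(0, 32, 8)],
    "Describe the action in the video.",
    "A person kicks a ball into a goal.",
)

if __name__ == "__main__":
    print(multi_image["conversations"][0]["value"])
```

Because every scenario reduces to "a list of images plus interleaved text", a single training pipeline can mix single-image, multi-image, and video data, which is the unification the paper attributes its cross-task transfer to.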