
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Zhe Chen (Shanghai AI Laboratory, Nanjing University), Weiyun Wang (Shanghai AI Laboratory, Fudan University), Yue Cao (Shanghai AI Laboratory, Nanjing University), Yangzhou Liu (Shanghai AI Laboratory, Nanjing University), Zhangwei Gao (Shanghai AI Laboratory, Shanghai Jiao Tong University), Erfei Cui (Shanghai AI Laboratory, Shanghai Jiao Tong University), Jinguo Zhu (Shanghai AI Laboratory), Shenglong Ye (Shanghai AI Laboratory), Hao Tian (SenseTime Research), Zhaoyang Liu (Shanghai AI Laboratory), Lixin Gu (Shanghai AI Laboratory), Xuehui Wang (Shanghai AI Laboratory), Qingyun Li, Yimin Ren (Shanghai AI Laboratory), Zixuan Chen (Shanghai AI Laboratory, SenseTime Research), Jiapeng Luo (SenseTime Research), Jiahao Wang (SenseTime Research), Tan Jiang (SenseTime Research), Bo Wang, Conghui He (Shanghai AI Laboratory, SenseTime Research), Botian Shi (Shanghai AI Laboratory), Xingcheng Zhang (Shanghai AI Laboratory), Han Lv (Shanghai AI Laboratory), Yi Wang (Shanghai AI Laboratory), Wenqi Shao (Shanghai AI Laboratory), Pei Chu (Shanghai AI Laboratory), Zhongying Tu (Shanghai AI Laboratory), Tong He, Zhiyong Wu (Shanghai AI Laboratory), Huipeng Deng (Shanghai AI Laboratory), Jiaye Ge (Shanghai AI Laboratory), Kai Chen, Kaipeng Zhang (Shanghai AI Laboratory), Limin Wang (Shanghai AI Laboratory, Nanjing University), Min Dou (Shanghai AI Laboratory), Lewei Lu (Shanghai AI Laboratory, SenseTime Research), Xizhou Zhu (Shanghai AI Laboratory, Tsinghua University), Tong Lu (Shanghai AI Laboratory, Nanjing University), Dahua Lin (Shanghai AI Laboratory, The Chinese University of Hong Kong), Yu Qiao (Shanghai AI Laboratory), Jifeng Dai [email protected] (Shanghai AI Laboratory, Tsinghua University), Wenhai Wang [email protected] (Shanghai AI Laboratory, The Chinese University of Hong Kong) (2024)

Paper Information
arXiv ID
2412.05271
Venue
arXiv.org
Domain
artificial intelligence, machine learning, natural language processing, computer vision
SOTA Claim
Yes
Code
Reproducibility
8/10

Abstract

We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series that builds upon InternVL 2.0, maintaining its core model architecture while introducing significant enhancements in training and testing strategies as well as data quality. In this work, we delve into the relationship between model scaling and performance, systematically exploring the performance trends in vision encoders, language models, dataset sizes, and test-time configurations. Through extensive evaluations on a wide range of benchmarks, including multi-discipline reasoning, document understanding, multi-image/video understanding, real-world comprehension, multimodal hallucination detection, visual grounding, multilingual capabilities, and pure language processing, InternVL 2.5 exhibits competitive performance, rivaling leading commercial models such as GPT-4o and Claude-3.5-Sonnet. Notably, our model is the first open-source MLLM to surpass 70% on the MMMU benchmark, achieving a 3.7-point improvement through Chain-of-Thought (CoT) reasoning and showcasing strong potential for test-time scaling. HuggingFace demo: https://huggingface.co/spaces/OpenGVLab/InternVL

Summary

The paper presents InternVL 2.5, an open-source multimodal large language model (MLLM) series that builds on the InternVL 2.0 architecture while improving training and test-time strategies and data quality. The authors systematically investigate how performance scales with the vision encoder, the language model, dataset size, and test-time configuration. InternVL 2.5 achieves competitive results across diverse tasks, including multi-discipline reasoning, document understanding, video understanding, and multilingual capabilities, and is notably the first open-source model to surpass 70% on the MMMU benchmark. The contributions include analyzing the effects of scaling vision encoders and language models, improving data quality, and developing training strategies. InternVL 2.5 narrows the performance gap with leading commercial models such as GPT-4o and Claude-3.5-Sonnet, providing a robust open tool for research and application in multimodal AI.

Methods

This paper employs the following methods:

  • Chain-of-Thought (CoT) reasoning
  • Dynamic High-Resolution training (see the tiling sketch after this list)
  • Progressive Scaling Strategy
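
To illustrate the dynamic high-resolution idea, the sketch below splits an image into 448×448 tiles on the grid whose aspect ratio best matches the input, plus a global thumbnail. This is a minimal sketch, not the released preprocessing code: the 448 tile size, the default budget of 12 tiles, and all function names are assumptions based on how InternVL-series models describe dynamic resolution.

```python
from PIL import Image

TILE = 448  # assumed ViT input size for one tile

def candidate_grids(max_tiles=12):
    """All (cols, rows) grids whose tile count fits the budget."""
    return [(c, r)
            for c in range(1, max_tiles + 1)
            for r in range(1, max_tiles + 1)
            if c * r <= max_tiles]

def dynamic_tiles(img: Image.Image, max_tiles=12, add_thumbnail=True):
    """Split `img` into TILE x TILE tiles on the grid whose aspect
    ratio best matches the input; append a global thumbnail view."""
    w, h = img.size
    target = w / h
    # Closest aspect ratio wins; break ties with fewer tiles.
    cols, rows = min(candidate_grids(max_tiles),
                     key=lambda g: (abs(g[0] / g[1] - target), g[0] * g[1]))
    resized = img.resize((cols * TILE, rows * TILE))
    tiles = [resized.crop((c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE))
             for r in range(rows) for c in range(cols)]
    if add_thumbnail and len(tiles) > 1:
        tiles.append(img.resize((TILE, TILE)))  # coarse global context
    return tiles
```

Each tile is encoded by the vision encoder independently, so the visual token budget grows with image resolution only up to the `max_tiles` cap.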

Models Used

  • InternVL 2.5
  • InternVL 2.0
  • GPT-4o
  • Claude-3.5-Sonnet

Datasets

The following datasets were used in this research:

  • MMMU
  • OlympiadBench
  • MathVista
  • MATH-Vision
  • MMBench
  • VQA-RAD
  • DocVQA
  • TextVQA
  • MVBench
  • R-Bench

Evaluation Metrics

  • Accuracy
  • mIoU
  • F1-score
  • aAcc
  • fAcc
  • qAcc

Results

  • InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark, with CoT reasoning contributing a 3.7-point improvement (a test-time scaling sketch follows this list)
  • Improved performance in multi-image understanding and OCR tasks compared to previous versions and commercial models
  • Demonstrated competitive results in real-world comprehension tasks
  • Strong performance in multilingual capabilities
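
The paper attributes the MMMU gain to Chain-of-Thought prompting and highlights further headroom from test-time scaling. The sketch below shows one standard recipe for that, self-consistency voting: sample several CoT rationales and majority-vote the final answer. The `generate` stub, the prompt wording, and the default of 8 samples are illustrative assumptions, not the paper's exact procedure.

```python
from collections import Counter

def generate(prompt: str, temperature: float = 0.7) -> str:
    """Placeholder for a sampled MLLM call that returns a
    chain-of-thought ending in a line like 'Answer: B'."""
    raise NotImplementedError("wire this up to your model")

def extract_answer(response: str) -> str:
    """Pull the final answer out of a CoT response."""
    return response.rsplit("Answer:", 1)[-1].strip()

def cot_majority_vote(question: str, n_samples: int = 8) -> str:
    """Sample several rationales and return the most common answer."""
    prompt = question + "\nLet's think step by step."
    answers = [extract_answer(generate(prompt)) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```

More samples buy accuracy at the cost of extra inference compute, which is the trade-off "test-time scaling" refers to.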

Limitations

The authors identified the following limitations:

  • Model training remains resource-intensive
  • Some gaps in performance compared to closed-source models
  • Challenges in generating longer responses that match human expectations

Technical Requirements

  • Number of GPUs: None specified
  • GPU Type: None specified

Keywords

multimodal large language models, model scaling, data quality, test-time scaling, benchmarking
