
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Zhe Chen (Shanghai AI Laboratory, Nanjing University), Weiyun Wang (Shanghai AI Laboratory, Fudan University), Yue Cao (Shanghai AI Laboratory, Nanjing University), Yangzhou Liu (Shanghai AI Laboratory, Nanjing University), Zhangwei Gao (Shanghai AI Laboratory, Shanghai Jiao Tong University), Erfei Cui (Shanghai AI Laboratory, Shanghai Jiao Tong University), Jinguo Zhu (Shanghai AI Laboratory), Shenglong Ye (Shanghai AI Laboratory), Hao Tian (SenseTime Research), Zhaoyang Liu (Shanghai AI Laboratory), Lixin Gu (Shanghai AI Laboratory), Xuehui Wang (Shanghai AI Laboratory), Qingyun Li, Yimin Ren (Shanghai AI Laboratory), Zixuan Chen (Shanghai AI Laboratory, SenseTime Research), Jiapeng Luo (SenseTime Research), Jiahao Wang (SenseTime Research), Tan Jiang (SenseTime Research), Bo Wang, Conghui He (Shanghai AI Laboratory, SenseTime Research), Botian Shi (Shanghai AI Laboratory), Xingcheng Zhang (Shanghai AI Laboratory), Han Lv (Shanghai AI Laboratory), Yi Wang (Shanghai AI Laboratory), Wenqi Shao (Shanghai AI Laboratory), Pei Chu (Shanghai AI Laboratory), Zhongying Tu (Shanghai AI Laboratory), Tong He, Zhiyong Wu (Shanghai AI Laboratory), Huipeng Deng (Shanghai AI Laboratory), Jiaye Ge (Shanghai AI Laboratory), Kai Chen, Kaipeng Zhang (Shanghai AI Laboratory), Limin Wang (Shanghai AI Laboratory, Nanjing University), Min Dou (Shanghai AI Laboratory), Lewei Lu (Shanghai AI Laboratory, SenseTime Research), Xizhou Zhu (Shanghai AI Laboratory, Tsinghua University), Tong Lu (Shanghai AI Laboratory, Nanjing University), Dahua Lin (Shanghai AI Laboratory, The Chinese University of Hong Kong), Yu Qiao (Shanghai AI Laboratory), Jifeng Dai [email protected] (Shanghai AI Laboratory, Tsinghua University), Wenhai Wang [email protected] (Shanghai AI Laboratory, The Chinese University of Hong Kong) (2024)

Paper Information
arXiv ID
2412.05271
Venue
arXiv.org
Domain
artificial intelligence, machine learning, natural language processing, computer vision
SOTA Claim
Yes
Code
Reproducibility
8/10

Abstract

We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series that builds upon InternVL 2.0, maintaining its core model architecture while introducing significant enhancements in training and testing strategies as well as data quality. In this work, we delve into the relationship between model scaling and performance, systematically exploring the performance trends in vision encoders, language models, dataset sizes, and test-time configurations. Through extensive evaluations on a wide range of benchmarks, including multi-discipline reasoning, document understanding, multi-image/video understanding, real-world comprehension, multimodal hallucination detection, visual grounding, multilingual capabilities, and pure language processing, InternVL 2.5 exhibits competitive performance, rivaling leading commercial models such as GPT-4o and Claude-3.5-Sonnet. Notably, our model is the first open-source MLLM to surpass 70% on the MMMU benchmark, achieving a 3.7-point improvement through Chain-of-Thought (CoT) reasoning and showcasing strong potential for test-time scaling. HuggingFace demo: https://huggingface.co/spaces/OpenGVLab/InternVL

Summary

The paper presents InternVL 2.5, an open-source multimodal large language model (MLLM) series that builds on the InternVL 2.0 architecture while improving training and test-time strategies and data quality. The authors systematically investigate how performance scales with the vision encoder, the language model, dataset size, and test-time configuration. InternVL 2.5 achieves competitive results across diverse tasks, including multi-discipline reasoning, document understanding, video understanding, and multilingual capabilities, and is notably the first open-source model to surpass 70% on the MMMU benchmark. The contributions include analyzing the effects of scaling vision encoders and language models, improving data quality, and developing training strategies. InternVL 2.5 narrows the performance gap with leading commercial models such as GPT-4o and Claude-3.5-Sonnet, providing a robust open tool for research and application in multimodal AI.

Methods

This paper employs the following methods:

  • Chain-of-Thought (CoT) reasoning
  • Dynamic High-Resolution training (see the tiling sketch after this list)
  • Progressive Scaling Strategy
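
To illustrate the dynamic high-resolution idea, the sketch below splits an image into 448×448 tiles on the grid whose aspect ratio best matches the input, plus a global thumbnail. This is a minimal sketch, not the released preprocessing code: the 448 tile size, the default budget of 12 tiles, and all function names are assumptions based on how InternVL-series models describe dynamic resolution.

```python
from PIL import Image

TILE = 448  # assumed ViT input size for one tile

def candidate_grids(max_tiles=12):
    """All (cols, rows) grids whose tile count fits the budget."""
    return [(c, r)
            for c in range(1, max_tiles + 1)
            for r in range(1, max_tiles + 1)
            if c * r <= max_tiles]

def dynamic_tiles(img: Image.Image, max_tiles=12, add_thumbnail=True):
    """Split `img` into TILE x TILE tiles on the grid whose aspect
    ratio best matches the input; append a global thumbnail view."""
    w, h = img.size
    target = w / h
    # Closest aspect ratio wins; break ties with fewer tiles.
    cols, rows = min(candidate_grids(max_tiles),
                     key=lambda g: (abs(g[0] / g[1] - target), g[0] * g[1]))
    resized = img.resize((cols * TILE, rows * TILE))
    tiles = [resized.crop((c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE))
             for r in range(rows) for c in range(cols)]
    if add_thumbnail and len(tiles) > 1:
        tiles.append(img.resize((TILE, TILE)))  # coarse global context
    return tiles
```

Each tile is encoded by the vision encoder independently, so the visual token budget grows with image resolution only up to the `max_tiles` cap.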

Models Used

  • InternVL 2.5
  • InternVL 2.0
  • GPT-4o
  • Claude-3.5-Sonnet

Datasets

The following datasets were used in this research:

  • MMMU
  • OlympiadBench
  • MathVista
  • MATH-Vision
  • MMBench
  • VQA-RAD
  • DocVQA
  • TextVQA
  • MVBench
  • R-Bench

Evaluation Metrics

  • Accuracy
  • mIoU
  • F1-score
  • aAcc
  • fAcc
  • qAcc

Results

  • InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark, with CoT reasoning contributing a 3.7-point improvement (a test-time scaling sketch follows this list)
  • Improved performance in multi-image understanding and OCR tasks compared to previous versions and commercial models
  • Demonstrated competitive results in real-world comprehension tasks
  • Strong performance in multilingual capabilities
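
The paper attributes the MMMU gain to Chain-of-Thought prompting and highlights further headroom from test-time scaling. The sketch below shows one standard recipe for that, self-consistency voting: sample several CoT rationales and majority-vote the final answer. The `generate` stub, the prompt wording, and the default of 8 samples are illustrative assumptions, not the paper's exact procedure.

```python
from collections import Counter

def generate(prompt: str, temperature: float = 0.7) -> str:
    """Placeholder for a sampled MLLM call that returns a
    chain-of-thought ending in a line like 'Answer: B'."""
    raise NotImplementedError("wire this up to your model")

def extract_answer(response: str) -> str:
    """Pull the final answer out of a CoT response."""
    return response.rsplit("Answer:", 1)[-1].strip()

def cot_majority_vote(question: str, n_samples: int = 8) -> str:
    """Sample several rationales and return the most common answer."""
    prompt = question + "\nLet's think step by step."
    answers = [extract_answer(generate(prompt)) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```

More samples buy accuracy at the cost of extra inference compute, which is the trade-off "test-time scaling" refers to.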

Limitations

The authors identified the following limitations:

  • Model training remains resource-intensive
  • Some gaps in performance compared to closed-source models
  • Challenges in generating longer responses that match human expectations

Technical Requirements

  • Number of GPUs: None specified
  • GPU Type: None specified

Keywords

multimodal large language models, model scaling, data quality, test-time scaling, benchmarking
