Zhe Chen (Shanghai AI Laboratory; Nanjing University), Weiyun Wang (Shanghai AI Laboratory; Fudan University), Yue Cao (Shanghai AI Laboratory; Nanjing University), Yangzhou Liu (Shanghai AI Laboratory; Nanjing University), Zhangwei Gao (Shanghai AI Laboratory; Shanghai Jiao Tong University), Erfei Cui (Shanghai AI Laboratory; Shanghai Jiao Tong University), Jinguo Zhu (Shanghai AI Laboratory), Shenglong Ye (Shanghai AI Laboratory), Hao Tian (SenseTime Research), Zhaoyang Liu (Shanghai AI Laboratory), Lixin Gu (Shanghai AI Laboratory), Xuehui Wang (Shanghai AI Laboratory), Qingyun Li, Yimin Ren (Shanghai AI Laboratory), Zixuan Chen (Shanghai AI Laboratory; SenseTime Research), Jiapeng Luo (SenseTime Research), Jiahao Wang (SenseTime Research), Tan Jiang (SenseTime Research), Bo Wang, Conghui He (Shanghai AI Laboratory; SenseTime Research), Botian Shi (Shanghai AI Laboratory), Xingcheng Zhang (Shanghai AI Laboratory), Han Lv (Shanghai AI Laboratory), Yi Wang (Shanghai AI Laboratory), Wenqi Shao (Shanghai AI Laboratory), Pei Chu (Shanghai AI Laboratory), Zhongying Tu (Shanghai AI Laboratory), Tong He, Zhiyong Wu (Shanghai AI Laboratory), Huipeng Deng (Shanghai AI Laboratory), Jiaye Ge (Shanghai AI Laboratory), Kai Chen, Kaipeng Zhang (Shanghai AI Laboratory), Limin Wang (Shanghai AI Laboratory; Nanjing University), Min Dou (Shanghai AI Laboratory), Lewei Lu (Shanghai AI Laboratory; SenseTime Research), Xizhou Zhu (Shanghai AI Laboratory; Tsinghua University), Tong Lu (Shanghai AI Laboratory; Nanjing University), Dahua Lin (Shanghai AI Laboratory; The Chinese University of Hong Kong), Yu Qiao (Shanghai AI Laboratory), Jifeng Dai ([email protected]; Shanghai AI Laboratory; Tsinghua University), Wenhai Wang ([email protected]; Shanghai AI Laboratory; The Chinese University of Hong Kong) (2024)
The paper presents InternVL 2.5, an open-source multimodal large language model (MLLM) series that builds on the architecture of InternVL 2.0 while enhancing training and testing strategies and improving data quality. The authors systematically investigate the relationship between model scaling and performance across different configurations. InternVL 2.5 achieves competitive results on diverse tasks, including multi-discipline reasoning, document understanding, video understanding, and multilingual capabilities, and notably surpasses 70% on the MMMU benchmark, a significant advance for open-source models. The contributions include exploring the effects of scaling vision encoders and language models, improving data quality, and developing training and test-time strategies. By narrowing the performance gap with leading commercial models such as GPT-4o and Claude-3.5-Sonnet, InternVL 2.5 offers a robust open-source tool for research and applications in multimodal AI.