Zhe Chen (Shanghai AI Laboratory, Nanjing University); Weiyun Wang (Shanghai AI Laboratory, Fudan University); Hao Tian (SenseTime Research); Shenglong Ye (Shanghai AI Laboratory); Zhangwei Gao (Shanghai AI Laboratory); Erfei Cui (Shanghai AI Laboratory); Wenwen Tong (SenseTime Research); Kongzhi Hu (SenseTime Research); Jiapeng Luo (SenseTime Research); Zheng Ma (SenseTime Research); Ji Ma (SenseTime Research); Jiaqi Wang (Shanghai AI Laboratory); Xiaoyi Dong (Shanghai AI Laboratory, The Chinese University of Hong Kong); Hang Yan (Shanghai AI Laboratory); Hewei Guo (SenseTime Research); Conghui He (Shanghai AI Laboratory); Botian Shi (Shanghai AI Laboratory); Zhenjiang Jin (Shanghai AI Laboratory); Chao Xu (Shanghai AI Laboratory); Bin Wang (Shanghai AI Laboratory); Xingjian Wei (Shanghai AI Laboratory); Wei Li (Shanghai AI Laboratory); Wenjian Zhang (Shanghai AI Laboratory); Bo Zhang (Shanghai AI Laboratory); Pinlong Cai (Shanghai AI Laboratory); Licheng Wen (Shanghai AI Laboratory); Xiangchao Yan (Shanghai AI Laboratory); Min Dou (Shanghai AI Laboratory); Lewei Lu (Shanghai AI Laboratory, SenseTime Research); Xizhou Zhu (Shanghai AI Laboratory, SenseTime Research, Tsinghua University); Tong Lu (Shanghai AI Laboratory, Nanjing University); Dahua Lin (Shanghai AI Laboratory, The Chinese University of Hong Kong); Yu Qiao (Shanghai AI Laboratory); Jifeng Dai (Shanghai AI Laboratory, Tsinghua University); Wenhai Wang (Shanghai AI Laboratory, Fudan University, The Chinese University of Hong Kong); OpenGVLab, Shanghai AI Laboratory (2024)
This paper introduces InternVL 1.5, an open-source multimodal large language model (MLLM) designed to narrow the capability gap between commercial and open-source models in multimodal understanding. It brings three key improvements: a strong vision encoder (InternViT-6B, refined through continuous learning), a dynamic high-resolution strategy that divides input images into 448×448 tiles to support resolutions up to 4K, and a high-quality bilingual dataset that boosts performance on OCR and Chinese-language tasks. Evaluated on 18 multimodal benchmarks, the model achieves state-of-the-art results on 8 of them, including OCR and document understanding tasks. Overall, InternVL 1.5 aims to give the multimodal understanding community an accessible, competitive alternative to proprietary models.
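The dynamic high-resolution step can be illustrated with a short sketch. The snippet below is a minimal, illustrative implementation of the tiling idea (pick the tile grid whose aspect ratio best matches the input image, crop 448×448 tiles, and append a global thumbnail); the function name, the 12-tile cap, and the use of Pillow are assumptions for illustration, not the authors' released code.

```python
# Minimal sketch of the dynamic high-resolution tiling idea described above.
# Assumptions: 448x448 tiles, an illustrative cap of 12 tiles, Pillow for image ops.
from PIL import Image


def dynamic_tiles(image: Image.Image, tile_size: int = 448, max_tiles: int = 12):
    """Split an image into a grid of tile_size x tile_size crops whose grid
    shape best matches the image's aspect ratio, plus a global thumbnail."""
    # Enumerate candidate grids (cols x rows) with at most max_tiles tiles.
    candidates = [
        (cols, rows)
        for cols in range(1, max_tiles + 1)
        for rows in range(1, max_tiles + 1)
        if cols * rows <= max_tiles
    ]
    # Pick the grid whose aspect ratio is closest to the input image's.
    img_ratio = image.width / image.height
    cols, rows = min(candidates, key=lambda g: abs(g[0] / g[1] - img_ratio))

    # Resize so the image exactly covers the chosen grid, then crop the tiles.
    resized = image.resize((cols * tile_size, rows * tile_size))
    tiles = [
        resized.crop((c * tile_size, r * tile_size,
                      (c + 1) * tile_size, (r + 1) * tile_size))
        for r in range(rows) for c in range(cols)
    ]
    # A downscaled global view is appended so the model keeps overall context.
    tiles.append(image.resize((tile_size, tile_size)))
    return tiles
```

In this scheme, each tile (plus the thumbnail) is encoded separately by the vision encoder and the resulting visual tokens are concatenated before being passed to the language model, which is how high-resolution inputs are handled without retraining the encoder at a larger native resolution.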
This paper employs the following methods: continuous learning of the InternViT-6B vision encoder, dynamic high-resolution image tiling (448×448 tiles, supporting up to 4K input), and training on a high-quality bilingual (English–Chinese) dataset.
The following datasets were used in this research: a carefully curated bilingual (English–Chinese) training corpus covering natural scenes and document images, together with 18 public multimodal benchmarks used for evaluation.
The authors identified the following limitations: