
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

Zhe Chen Shanghai AI Laboratory Nanjing University, Weiyun Wang Shanghai AI Laboratory Fudan University, Hao Tian SenseTime Research, Shenglong Ye Shanghai AI Laboratory, Zhangwei Gao Shanghai AI Laboratory, Erfei Cui Shanghai AI Laboratory, Wenwen Tong SenseTime Research, Kongzhi Hu SenseTime Research, Jiapeng Luo SenseTime Research, Zheng Ma SenseTime Research, Ji Ma SenseTime Research, Jiaqi Wang Shanghai AI Laboratory, Xiaoyi Dong Shanghai AI Laboratory The Chinese University of Hong Kong, Hang Yan Shanghai AI Laboratory, Hewei Guo SenseTime Research, Conghui He Shanghai AI Laboratory, Botian Shi Shanghai AI Laboratory, Zhenjiang Jin Shanghai AI Laboratory, Chao Xu Shanghai AI Laboratory, Bin Wang Shanghai AI Laboratory, Xingjian Wei Shanghai AI Laboratory, Wei Li Shanghai AI Laboratory, Wenjian Zhang Shanghai AI Laboratory, Bo Zhang Shanghai AI Laboratory, Pinlong Cai Shanghai AI Laboratory, Licheng Wen, Xiangchao Yan Shanghai AI Laboratory, Min Dou Shanghai AI Laboratory, Lewei Lu Shanghai AI Laboratory SenseTime Research, Xizhou Zhu Shanghai AI Laboratory SenseTime Research Tsinghua University, Tong Lu Shanghai AI Laboratory Nanjing University, Dahua Lin Shanghai AI Laboratory The Chinese University of Hong Kong, Yu Qiao Shanghai AI Laboratory, Jifeng Dai Shanghai AI Laboratory Tsinghua University, Wenhai Wang Shanghai AI Laboratory Fudan University The Chinese University of Hong Kong, OpenGVLab Shanghai AI Laboratory (2024)

Paper Information

  • arXiv ID: 2404.16821
  • Venue: Science China Information Sciences
  • Domain: Artificial Intelligence
  • SOTA Claim: Yes
  • Code
  • Reproducibility: 8/10

Abstract

In this report, we introduce InternVL 1.5, an open-source multimodal large language model (MLLM), to bridge the capability gap between open-source and proprietary commercial models in multimodal understanding. We introduce three simple improvements: (1) Strong Vision Encoder: we explored a continuous learning strategy for the large-scale vision foundation model InternViT-6B, boosting its visual understanding capabilities and making it transferable and reusable across different LLMs. (2) Dynamic High-Resolution: we divide images into tiles ranging from 1 to 40 of 448×448 pixels according to the aspect ratio and resolution of the input images, which supports up to 4K resolution input. (3) High-Quality Bilingual Dataset: we carefully collected a high-quality bilingual dataset that covers common scenes and document images, annotated with English and Chinese question-answer pairs, significantly enhancing performance in OCR- and Chinese-related tasks. We evaluate InternVL 1.5 through a series of benchmarks and comparative studies. Compared to both open-source and proprietary commercial models, InternVL 1.5 shows competitive performance, achieving state-of-the-art results in 8 of 18 multimodal benchmarks.
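
The dynamic high-resolution input is essentially a tiling rule: the image is matched to a grid of 448×448 tiles (between 1 and 40 of them) whose aspect ratio is close to that of the original image. Below is a minimal sketch of such a tiling step, assuming a simple closest-aspect-ratio grid search; the exact matching rule, any thumbnail handling, and the function names are assumptions, since the abstract only specifies the tile size and the 1-40 tile budget.

```python
from PIL import Image

TILE = 448       # tile side length stated in the report
MAX_TILES = 40   # upper bound on tiles per image stated in the report

def best_grid(width, height, max_tiles=MAX_TILES):
    """Pick a (cols, rows) grid whose aspect ratio best matches the image.

    Assumed matching rule: minimize the gap between the grid's aspect ratio
    and the image's aspect ratio, subject to cols * rows <= max_tiles.
    """
    target = width / height
    candidates = [(c, r)
                  for c in range(1, max_tiles + 1)
                  for r in range(1, max_tiles + 1)
                  if c * r <= max_tiles]
    return min(candidates, key=lambda grid: abs(grid[0] / grid[1] - target))

def dynamic_tiles(image: Image.Image):
    """Resize the image to the chosen grid and cut it into 448x448 tiles."""
    cols, rows = best_grid(*image.size)
    resized = image.resize((cols * TILE, rows * TILE))
    return [resized.crop((c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE))
            for r in range(rows) for c in range(cols)]
```

Each tile can then be encoded independently by the vision encoder, so larger or more extreme-aspect-ratio images simply yield more tiles, up to the 40-tile budget that corresponds to roughly 4K-resolution input.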

Summary

This paper introduces InternVL 1.5, an open-source multimodal large language model (MLLM) designed to bridge the capability gap between commercial and open-source models in multimodal understanding. Key improvements include a strong vision encoder (InternViT-6B), dynamic high-resolution processing of images, and a high-quality bilingual dataset that enhances performance in OCR and Chinese tasks. The model is evaluated across 18 multimodal benchmarks, achieving state-of-the-art results in 8 benchmarks, including OCR and document understanding tasks. Overall, InternVL 1.5 aims to contribute to the multimodal understanding community by providing accessible and competitive solutions.

Methods

This paper employs the following methods:

  • Continuous Learning
  • Dynamic High-Resolution
  • Bilingual Dataset Collection

Models Used

  • InternVL 1.5
  • InternViT-6B
  • InternLM2-20B
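
For reference, a minimal loading sketch is given below, assuming the released InternVL 1.5 checkpoint (InternViT-6B paired with InternLM2-20B) is published on the Hugging Face Hub; the repository id, dtype, and device placement are assumptions, since this page does not link the code.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumed repository id; the wiki entry above does not list the code/checkpoint link.
MODEL_ID = "OpenGVLab/InternVL-Chat-V1-5"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# trust_remote_code loads the model-specific classes shipped with the checkpoint
# (vision encoder, tile preprocessing, chat interface) instead of a built-in class.
model = AutoModel.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,  # ~26B parameters in total; bf16 keeps memory manageable
    trust_remote_code=True,
).eval().cuda()
```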

Datasets

The following datasets were used in this research:

  • Laion-EN
  • Laion-ZH
  • COYO
  • TextCaps
  • Objects365
  • ChartQA
  • DocVQA
  • Wukong-OCR
  • LaionCOCO-OCR

Evaluation Metrics

  • Accuracy
  • OCRBench
  • MMBench-CN
  • CCBench
  • HallusionBench
  • MathVista

Results

  • InternVL 1.5 achieves state-of-the-art results in 8 of 18 multimodal benchmarks.
  • Surpasses GPT-4V in specific OCR-related benchmarks.

Limitations

The authors identified the following limitations:

  • The performance is still behind proprietary models in multi-turn conversations and does not reach the same level as GPT-4V overall.

Technical Requirements

  • Number of GPUs: None specified
  • GPU Type: None specified

Keywords

open-source multimodal models, vision-language understanding, high-resolution input, bilingual datasets
