
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

Zhe Chen Shanghai AI Laboratory Nanjing University, Weiyun Wang Shanghai AI Laboratory Fudan University, Hao Tian SenseTime Research, Shenglong Ye Shanghai AI Laboratory, Zhangwei Gao Shanghai AI Laboratory, Erfei Cui Shanghai AI Laboratory, Wenwen Tong SenseTime Research, Kongzhi Hu SenseTime Research, Jiapeng Luo SenseTime Research, Zheng Ma SenseTime Research, Ji Ma SenseTime Research, Jiaqi Wang Shanghai AI Laboratory, Xiaoyi Dong Shanghai AI Laboratory The Chinese University of Hong Kong, Hang Yan Shanghai AI Laboratory, Hewei Guo SenseTime Research, Conghui He Shanghai AI Laboratory, Botian Shi Shanghai AI Laboratory, Zhenjiang Jin Shanghai AI Laboratory, Chao Xu Shanghai AI Laboratory, Bin Wang Shanghai AI Laboratory, Xingjian Wei Shanghai AI Laboratory, Wei Li Shanghai AI Laboratory, Wenjian Zhang Shanghai AI Laboratory, Bo Zhang Shanghai AI Laboratory, Pinlong Cai Shanghai AI Laboratory, Licheng Wen, Xiangchao Yan Shanghai AI Laboratory, Min Dou Shanghai AI Laboratory, Lewei Lu Shanghai AI Laboratory SenseTime Research, Xizhou Zhu Shanghai AI Laboratory SenseTime Research Tsinghua University, Tong Lu Shanghai AI Laboratory Nanjing University, Dahua Lin Shanghai AI Laboratory The Chinese University of Hong Kong, Yu Qiao Shanghai AI Laboratory, Jifeng Dai Shanghai AI Laboratory Tsinghua University, Wenhai Wang Shanghai AI Laboratory Fudan University The Chinese University of Hong Kong, OpenGVLab Shanghai AI Laboratory (2024)

Paper Information

  • arXiv ID: 2404.16821
  • Venue: Science China Information Sciences
  • Domain: Artificial Intelligence
  • SOTA Claim: Yes
  • Code
  • Reproducibility: 8/10

Abstract

In this report, we introduce InternVL 1.5, an open-source multimodal large language model (MLLM), to bridge the capability gap between open-source and proprietary commercial models in multimodal understanding. We introduce three simple improvements: (1) Strong Vision Encoder: we explored a continuous learning strategy for the large-scale vision foundation model InternViT-6B, boosting its visual understanding capabilities and making it transferable and reusable across different LLMs. (2) Dynamic High-Resolution: we divide images into tiles ranging from 1 to 40 of 448×448 pixels according to the aspect ratio and resolution of the input images, which supports up to 4K resolution input. (3) High-Quality Bilingual Dataset: we carefully collected a high-quality bilingual dataset that covers common scenes and document images, annotated with English and Chinese question-answer pairs, significantly enhancing performance in OCR- and Chinese-related tasks. We evaluate InternVL 1.5 through a series of benchmarks and comparative studies. Compared to both open-source and proprietary commercial models, InternVL 1.5 shows competitive performance, achieving state-of-the-art results in 8 of 18 multimodal benchmarks.
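
The dynamic high-resolution input is essentially a tiling rule: the image is matched to a grid of 448×448 tiles (between 1 and 40 of them) whose aspect ratio is close to that of the original image. Below is a minimal sketch of such a tiling step, assuming a simple closest-aspect-ratio grid search; the exact matching rule, any thumbnail handling, and the function names are assumptions, since the abstract only specifies the tile size and the 1-40 tile budget.

```python
from PIL import Image

TILE = 448       # tile side length stated in the report
MAX_TILES = 40   # upper bound on tiles per image stated in the report

def best_grid(width, height, max_tiles=MAX_TILES):
    """Pick a (cols, rows) grid whose aspect ratio best matches the image.

    Assumed matching rule: minimize the gap between the grid's aspect ratio
    and the image's aspect ratio, subject to cols * rows <= max_tiles.
    """
    target = width / height
    candidates = [(c, r)
                  for c in range(1, max_tiles + 1)
                  for r in range(1, max_tiles + 1)
                  if c * r <= max_tiles]
    return min(candidates, key=lambda grid: abs(grid[0] / grid[1] - target))

def dynamic_tiles(image: Image.Image):
    """Resize the image to the chosen grid and cut it into 448x448 tiles."""
    cols, rows = best_grid(*image.size)
    resized = image.resize((cols * TILE, rows * TILE))
    return [resized.crop((c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE))
            for r in range(rows) for c in range(cols)]
```

Each tile can then be encoded independently by the vision encoder, so larger or more extreme-aspect-ratio images simply yield more tiles, up to the 40-tile budget that corresponds to roughly 4K-resolution input.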

Summary

This paper introduces InternVL 1.5, an open-source multimodal large language model (MLLM) designed to bridge the capability gap between commercial and open-source models in multimodal understanding. Key improvements include a strong vision encoder (InternViT-6B), dynamic high-resolution processing of images, and a high-quality bilingual dataset that enhances performance in OCR and Chinese tasks. The model is evaluated across 18 multimodal benchmarks, achieving state-of-the-art results in 8 benchmarks, including OCR and document understanding tasks. Overall, InternVL 1.5 aims to contribute to the multimodal understanding community by providing accessible and competitive solutions.

Methods

This paper employs the following methods:

  • Continuous Learning
  • Dynamic High-Resolution
  • Bilingual Dataset Collection

Models Used

  • InternVL 1.5
  • InternViT-6B
  • InternLM2-20B
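
For reference, a minimal loading sketch is given below, assuming the released InternVL 1.5 checkpoint (InternViT-6B paired with InternLM2-20B) is published on the Hugging Face Hub; the repository id, dtype, and device placement are assumptions, since this page does not link the code.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumed repository id; the wiki entry above does not list the code/checkpoint link.
MODEL_ID = "OpenGVLab/InternVL-Chat-V1-5"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# trust_remote_code loads the model-specific classes shipped with the checkpoint
# (vision encoder, tile preprocessing, chat interface) instead of a built-in class.
model = AutoModel.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,  # ~26B parameters in total; bf16 keeps memory manageable
    trust_remote_code=True,
).eval().cuda()
```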

Datasets

The following datasets were used in this research:

  • Laion-EN
  • Laion-ZH
  • COYO
  • TextCaps
  • Objects365
  • ChartQA
  • DocVQA
  • Wukong-OCR
  • LaionCOCO-OCR

Evaluation Metrics

  • Accuracy
  • OCRBench
  • MMBench-CN
  • CCBench
  • HallusionBench
  • MathVista

Results

  • InternVL 1.5 achieves state-of-the-art results in 8 of 18 multimodal benchmarks.
  • Surpasses GPT-4V in specific OCR-related benchmarks.

Limitations

The authors identified the following limitations:

  • The performance is still behind proprietary models in multi-turn conversations and does not reach the same level as GPT-4V overall.

Technical Requirements

  • Number of GPUs: None specified
  • GPU Type: None specified

Keywords

open-source multimodal models, vision-language understanding, high-resolution input, bilingual datasets
