Venue: Neural Information Processing Systems
Domain: artificial intelligence and machine learning
Large vision-language models (LVLMs) have recently achieved rapid progress, sparking numerous studies to evaluate their multi-modal capabilities. However, we dig into current evaluation works and identify two primary issues: 1) Visual content is unnecessary for many samples. The answers can be directly inferred from the questions and options, or from the world knowledge embedded in LLMs. This phenomenon is prevalent across current benchmarks. For instance, GeminiPro achieves 42.9% on the MMMU benchmark without any visual input, and outperforms the random-choice baseline across six benchmarks by over 24% on average. 2) Unintentional data leakage exists in LLM and LVLM training. LLMs and LVLMs can still answer some visual-necessary questions without visual content, indicating memorization of these samples within large-scale training data. For example, Sphinx-X-MoE gets 43.6% on MMMU without accessing images, surpassing its LLM backbone by 17.9%. Both problems lead to misjudgments of actual multi-modal gains and potentially misguide the study of LVLMs. To this end, we present MMStar, an elite vision-indispensable multi-modal benchmark comprising 1,500 samples meticulously selected by humans. MMStar benchmarks 6 core capabilities and 18 detailed axes, aiming to evaluate LVLMs' multi-modal capacities with carefully balanced and purified samples. These samples are first roughly selected from current benchmarks with an automated pipeline; human review is then involved to ensure each curated sample exhibits visual dependency, minimal data leakage, and requires advanced multi-modal capabilities. Moreover, two metrics are developed to measure data leakage and actual performance gain in multi-modal training. We evaluate 16 leading LVLMs on MMStar to assess their multi-modal capabilities, and on 7 benchmarks with the proposed metrics to investigate their data leakage and actual multi-modal gain.
This paper investigates the evaluation processes of Large Vision-Language Models (LVLMs) and identifies two major issues: (1) Many evaluation samples do not require visual content for correct answers, thus undermining the true assessment of multi-modal capabilities; (2) There is unintentional data leakage during training, where models can answer visual-dependent questions without visual inputs, indicating memorization of training data. To address these issues, the authors propose the MMStar benchmark, consisting of 1,500 carefully curated samples designed to ensure visual dependency and minimize data leakage. The benchmark evaluates LVLMs on six core capabilities across 18 detailed axes. Additionally, two new metrics are introduced to assess multi-modal gain and leakage. The performance of 16 leading LVLMs is evaluated on MMStar, revealing that even the best performing model scores below 60% on average, highlighting the ongoing challenges in LVLM evaluation.
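The coarse, text-only screening idea behind the curation pipeline (flag samples that LLMs can answer correctly without ever seeing the image) can be illustrated with a minimal sketch. The `screen_vision_dependency` function, its `text_only_llms` callables, and the `max_correct` threshold below are hypothetical stand-ins for illustration, not the authors' released pipeline code.

```python
from typing import Callable, Dict, List


def screen_vision_dependency(
    samples: List[Dict],                         # each: {"question", "options", "answer"}
    text_only_llms: List[Callable[[str], str]],  # callables mapping a prompt to a predicted option
    max_correct: int = 2,                        # tolerance before a sample is discarded
) -> List[Dict]:
    """Keep only samples that most text-only LLMs fail to answer,
    i.e. samples that plausibly require the visual content."""
    kept = []
    for sample in samples:
        prompt = sample["question"] + "\nOptions: " + "; ".join(sample["options"])
        n_correct = sum(
            1 for llm in text_only_llms if llm(prompt).strip() == sample["answer"]
        )
        # If many LLMs answer correctly without the image, the sample is either
        # vision-unnecessary or leaked into training data; drop it from the pool.
        if n_correct <= max_correct:
            kept.append(sample)
    return kept
```

In the paper this automated pass only produces a rough candidate pool; human review then verifies visual dependency, minimal leakage, and the need for advanced multi-modal reasoning.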
This paper employs the following methods and models:
- Vision-language model benchmarking
- GPT-4V
- GeminiPro
- Sphinx-X-MoE
- LLaMA-70B
- InternLM2-20B
- Yi-VL-34B
- Mixtral-8x7B
- Deepseek-67B
- LLaVA series
- Qwen-7B
The following datasets were used in this research:
- MMMU
- ScienceQA
- AI2D
- SEED
- MMBench
- MathVista
The following metrics are used (a computational sketch follows this list):
- Accuracy
- Multi-modal gain (MG)
- Multi-modal leakage (ML)
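A minimal sketch of how the two proposed metrics could be computed, assuming MG is the score difference between evaluating an LVLM with and without images, and ML is the image-free score in excess of the LLM backbone's score, clipped at zero; the exact formulation in the paper may differ in detail.

```python
def multi_modal_gain(score_with_images: float, score_without_images: float) -> float:
    """MG: how much an LVLM improves when visual inputs are actually provided."""
    return score_with_images - score_without_images


def multi_modal_leakage(score_without_images: float, llm_backbone_score: float) -> float:
    """ML: image-free LVLM performance beyond its raw LLM backbone, attributed to
    samples memorized during multi-modal training (clipped at zero)."""
    return max(0.0, score_without_images - llm_backbone_score)


# Illustration using the Sphinx-X-MoE figures quoted in the abstract:
# 43.6% on MMMU without images, 17.9 points above its LLM backbone.
print(multi_modal_leakage(43.6, 43.6 - 17.9))  # -> 17.9
```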
The paper reports the following key results:
- The MMStar benchmark demonstrates the inadequacy of existing evaluations in assessing LVLM capabilities.
- GPT-4V ranks first on the MMStar benchmark with 57.1% accuracy.
The authors identified the following limitations:
- The study indicates that issues persist in current benchmarking methods, potentially affecting the accuracy of model evaluations.
The following compute resources were reported:
- Number of GPUs: None specified
- GPU Type: NVIDIA A100
Keywords: vision-language models, benchmark evaluation, data leakage, multi-modal capabilities