Domain
artificial intelligence
Multimodal Large Language Models (MLLMs) rely on a powerful LLM to perform multimodal tasks, showing amazing emergent abilities in recent studies, such as writing poems based on an image. However, such case studies cannot fully reflect the performance of MLLMs, and a comprehensive evaluation has been lacking. In this paper, we fill this gap by presenting MME, the first comprehensive MLLM evaluation benchmark. It measures both perception and cognition abilities across a total of 14 subtasks. To avoid the data leakage that may arise from directly using public datasets for evaluation, all instruction-answer pairs are manually designed. The concise instruction design allows us to compare MLLMs fairly, instead of struggling with prompt engineering, and also makes it easy to carry out quantitative statistics. A total of 30 advanced MLLMs are comprehensively evaluated on MME, which not only shows that existing MLLMs still have large room for improvement, but also reveals potential directions for subsequent model optimization. The data application procedure and online leaderboards are released at https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation.
This paper presents the MME benchmark for evaluating Multimodal Large Language Models (MLLMs), designed to comprehensively assess both perception and cognition abilities across 14 subtasks. The study addresses the inadequacies of existing evaluation methods by providing manually constructed instruction-answer pairs to prevent data leakage from traditional multimodal datasets. The benchmark evaluates 30 advanced MLLMs, revealing significant performance gaps and offering insights for future model optimizations. Key issues include models failing to follow instructions, perceptual inaccuracies, reasoning deficiencies, and instances of object hallucination in responses. Overall, MME aims to facilitate improved evaluation standards in the rapidly evolving field of MLLMs.
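Because every MME instruction asks for a plain yes/no answer, scoring reduces to answer matching and per-subtask accuracy. The sketch below is a minimal illustration of that idea, not the paper's exact metric; the record layout (`subtask`, `answer`, `response`) and the answer-parsing heuristic are assumptions made for the example.

```python
from collections import defaultdict

def parse_yes_no(response: str):
    """Map a model's free-form reply onto 'yes'/'no'; return None if neither is found."""
    text = response.strip().lower()
    if text.startswith("yes"):
        return "yes"
    if text.startswith("no"):
        return "no"
    return None  # treated as incorrect: the model did not follow the yes/no instruction

def score_subtasks(records):
    """records: iterable of dicts with keys 'subtask', 'answer' ('yes'/'no'), 'response'."""
    correct, total = defaultdict(int), defaultdict(int)
    for r in records:
        total[r["subtask"]] += 1
        if parse_yes_no(r["response"]) == r["answer"]:
            correct[r["subtask"]] += 1
    # Per-subtask accuracy in percent; perception and cognition totals could sum these.
    return {task: 100.0 * correct[task] / total[task] for task in total}
```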
This paper employs the following methods:
- MLLM Evaluation
- Manual Instruction Design
- Zero-shot Evaluation (see the harness sketch after the model list below)
The following 30 MLLMs were evaluated:
- BLIP-2
- Instruct-BLIP
- MiniGPT-4
- PandaGPT
- Multimodal-GPT
- VisualGLM-6B
- ImageBind-LLM
- VPGTrans
- LaVIN
- mPLUG-Owl
- Octopus
- Muffin
- Otter
- LRV-Instruction
- Cheetor
- LLaMA-Adapter-v2
- GIT2
- BLIVA
- Lynx
- MMICL
- GPT-4V
- Skywork-MM
- mPLUG-Owl2
- Qwen-VL-Chat
- XComposer-VL
- LLaVA
- Lion
- SPHINX
- InfMLLM
- WeMM
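The models above are compared under the same zero-shot protocol: each one answers the manually designed instructions directly, with no fine-tuning or per-model prompt tuning. Below is a hedged sketch of such a harness; the `model.generate(image, prompt)` interface and the exact instruction suffix are assumptions, since each MLLM ships its own API.

```python
def evaluate_zero_shot(model, samples, instruction_suffix=" Please answer yes or no."):
    """Query the model once per sample with the manually designed instruction,
    without task-specific fine-tuning or per-model prompt engineering."""
    predictions = []
    for sample in samples:  # each sample: {"image": ..., "question": ..., "answer": "yes"/"no", "subtask": ...}
        prompt = sample["question"] + instruction_suffix
        reply = model.generate(sample["image"], prompt)  # hypothetical unified interface
        predictions.append({
            "subtask": sample["subtask"],
            "answer": sample["answer"],
            "response": reply,
        })
    return predictions
```

The resulting predictions can then be passed to the per-subtask scoring sketch given earlier.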
The following datasets were used in this research:
- MME: a manually constructed benchmark of instruction-answer pairs covering 14 perception and cognition subtasks
Key findings include:
- 30 advanced MLLMs evaluated, all with substantial room for improvement
- Identification of common performance issues in MLLMs
The authors identified the following limitations:
- Common problems across the evaluated models: failure to follow instructions, perceptual inaccuracies, reasoning deficiencies, and object hallucination
Compute resources:
- Number of GPUs: None specified
- GPU Type: None specified
Keywords
multimodal large language models
evaluation benchmark
perception
cognition
instruction design