MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, Yunsheng Wu, Rongrong Ji
Tencent Youtu Lab and Xiamen University (2023)

Paper Information

  • arXiv ID: 2306.13394
  • Venue: arXiv.org
  • Domain: artificial intelligence
  • Reproducibility: 3/10

Abstract

Multimodal Large Language Model (MLLM) relies on the powerful LLM to perform multimodal tasks, showing amazing emergent abilities in recent studies, such as writing poems based on an image. However, it is difficult for these case studies to fully reflect the performance of MLLM, lacking a comprehensive evaluation. In this paper, we fill in this blank, presenting the first comprehensive MLLM Evaluation benchmark MME. It measures both perception and cognition abilities on a total of 14 subtasks. In order to avoid data leakage that may arise from direct use of public datasets for evaluation, the annotations of instruction-answer pairs are all manually designed. The concise instruction design allows us to fairly compare MLLMs, instead of struggling in prompt engineering. Besides, with such an instruction, we can also easily carry out quantitative statistics. A total of 30 advanced MLLMs are comprehensively evaluated on our MME, which not only suggests that existing MLLMs still have a large room for improvement, but also reveals the potential directions for the subsequent model optimization. The data application manner and online leaderboards are released at https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation.

Summary

This paper presents the MME benchmark for evaluating Multimodal Large Language Models (MLLMs), designed to comprehensively assess both perception and cognition abilities across 14 subtasks. The study addresses the inadequacies of existing evaluation methods by providing manually constructed instruction-answer pairs to prevent data leakage from traditional multimodal datasets. The benchmark evaluates 30 advanced MLLMs, revealing significant performance gaps and offering insights for future model optimizations. Key issues include models failing to follow instructions, perceptual inaccuracies, reasoning deficiencies, and instances of object hallucination in responses. Overall, MME aims to facilitate improved evaluation standards in the rapidly evolving field of MLLMs.
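The manually designed instructions follow a deliberately concise template: a question about the image followed by the fixed suffix "Please answer yes or no.", so that free-form model replies can be scored automatically without per-model prompt engineering. Below is a minimal sketch of that querying and parsing pattern; `query_mllm` is a hypothetical stand-in for whichever inference API a given MLLM exposes, not code from the paper.

```python
# Sketch of MME-style zero-shot querying with the concise yes/no instruction.
# `query_mllm(image_path, prompt)` is hypothetical; swap in the target model's API.

def build_instruction(question: str) -> str:
    """Append MME's concise suffix so answers can be scored automatically."""
    return f"{question} Please answer yes or no."

def parse_answer(reply: str):
    """Map a free-form reply to 'yes'/'no'; anything else is treated as
    a failure to follow the instruction."""
    reply = reply.strip().lower()
    if reply.startswith("yes"):
        return "yes"
    if reply.startswith("no"):
        return "no"
    return None

# Example usage (hypothetical model call):
# reply = query_mllm("example.jpg", build_instruction("Is there a dog in this image?"))
# pred = parse_answer(reply)
```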

Methods

This paper employs the following methods:

  • MLLM Evaluation
  • Manual Instruction Design
  • Zero-shot Evaluation

Models Used

  • BLIP-2
  • InstructBLIP
  • MiniGPT-4
  • PandaGPT
  • Multimodal-GPT
  • VisualGLM-6B
  • ImageBind-LLM
  • VPGTrans
  • LaVIN
  • mPLUG-Owl
  • Octopus
  • Muffin
  • Otter
  • LRV-Instruction
  • Cheetor
  • LLaMA-Adapter-v2
  • GIT2
  • BLIVA
  • Lynx
  • MMICL
  • GPT-4V
  • Skywork-MM
  • mPLUG-Owl2
  • Qwen-VL-Chat
  • XComposer-VL
  • LLaVA
  • Lion
  • SPHINX
  • InfMLLM
  • WeMM

Datasets

The following datasets were used in this research:

  • None specified

Evaluation Metrics

  • Accuracy: per-question accuracy over the yes/no answers
  • Accuracy+: per-image accuracy that counts an image as correct only when both of its questions are answered correctly (see the sketch below)
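In MME, each test image is paired with two yes/no questions. Accuracy is computed over individual questions, accuracy+ over images (both questions must be correct), and the score of a subtask is the sum of the two, with a maximum of 200. The following is a minimal sketch of that computation, assuming predictions are already grouped per image; the function name and input format are illustrative, not the authors' code.

```python
# Sketch of MME's two metrics, assuming results are grouped per image
# with exactly two yes/no questions each.

def mme_scores(results):
    """results: list of (correct_q1: bool, correct_q2: bool) tuples, one per image."""
    n_images = len(results)
    n_questions = 2 * n_images
    n_correct = sum(int(c1) + int(c2) for c1, c2 in results)
    n_both = sum(1 for c1, c2 in results if c1 and c2)
    accuracy = 100.0 * n_correct / n_questions   # per-question accuracy
    accuracy_plus = 100.0 * n_both / n_images    # both questions of an image correct
    return accuracy, accuracy_plus, accuracy + accuracy_plus  # subtask score (max 200)

# Example: mme_scores([(True, True), (True, False)]) -> (75.0, 50.0, 125.0)
```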

Results

  • 30 advanced MLLMs evaluated
  • Identification of four common failure modes: not following basic instructions, insufficient perception, weak reasoning, and object hallucination

Limitations

The authors identified the following limitations:

  • Not specified

Technical Requirements

  • Number of GPUs: None specified
  • GPU Type: None specified

Keywords

multimodal large language models, evaluation benchmark, perception, cognition, instruction design
