Domain
artificial intelligence
Multimodal Large Language Models (MLLMs) rely on a powerful LLM to perform multimodal tasks, showing amazing emergent abilities in recent studies, such as writing poems based on an image. However, such case studies cannot fully reflect the performance of MLLMs, and a comprehensive evaluation has been lacking. In this paper, we fill this gap by presenting MME, the first comprehensive MLLM evaluation benchmark. It measures both perception and cognition abilities across a total of 14 subtasks. To avoid the data leakage that may arise from directly using public datasets for evaluation, all instruction-answer pairs are manually designed. The concise instruction design allows us to compare MLLMs fairly, instead of struggling with prompt engineering, and also makes it easy to carry out quantitative statistics. A total of 30 advanced MLLMs are comprehensively evaluated on MME, which not only shows that existing MLLMs still have large room for improvement, but also reveals potential directions for subsequent model optimization. The data application procedure and online leaderboards are released at https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation.
This paper presents the MME benchmark for evaluating Multimodal Large Language Models (MLLMs), designed to comprehensively assess both perception and cognition abilities across 14 subtasks. The study addresses the inadequacies of existing evaluation methods by providing manually constructed instruction-answer pairs to prevent data leakage from traditional multimodal datasets. The benchmark evaluates 30 advanced MLLMs, revealing significant performance gaps and offering insights for future model optimizations. Key issues include models failing to follow instructions, perceptual inaccuracies, reasoning deficiencies, and instances of object hallucination in responses. Overall, MME aims to facilitate improved evaluation standards in the rapidly evolving field of MLLMs.
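Because every MME instruction asks for a plain yes/no answer, scoring reduces to answer matching and per-subtask accuracy. The sketch below is a minimal illustration of that idea, not the paper's exact metric; the record layout (`subtask`, `answer`, `response`) and the answer-parsing heuristic are assumptions made for the example.

```python
from collections import defaultdict

def parse_yes_no(response: str):
    """Map a model's free-form reply onto 'yes'/'no'; return None if neither is found."""
    text = response.strip().lower()
    if text.startswith("yes"):
        return "yes"
    if text.startswith("no"):
        return "no"
    return None  # treated as incorrect: the model did not follow the yes/no instruction

def score_subtasks(records):
    """records: iterable of dicts with keys 'subtask', 'answer' ('yes'/'no'), 'response'."""
    correct, total = defaultdict(int), defaultdict(int)
    for r in records:
        total[r["subtask"]] += 1
        if parse_yes_no(r["response"]) == r["answer"]:
            correct[r["subtask"]] += 1
    # Per-subtask accuracy in percent; perception and cognition totals could sum these.
    return {task: 100.0 * correct[task] / total[task] for task in total}
```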
This paper employs the following methods:
- MLLM Evaluation
- Manual Instruction Design
- Zero-shot Evaluation (see the harness sketch after the model list below)
The following 30 MLLMs were evaluated:
- BLIP-2
- Instruct-BLIP
- MiniGPT-4
- PandaGPT
- Multimodal-GPT
- VisualGLM-6B
- ImageBind-LLM
- VPGTrans
- LaVIN
- mPLUG-Owl
- Octopus
- Muffin
- Otter
- LRV-Instruction
- Cheetor
- LLaMA-Adapter-v2
- GIT2
- BLIVA
- Lynx
- MMICL
- GPT-4V
- Skywork-MM
- mPLUG-Owl2
- Qwen-VL-Chat
- XComposer-VL
- LLaVA
- Lion
- SPHINX
- InfMLLM
- WeMM
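The models above are compared under the same zero-shot protocol: each one answers the manually designed instructions directly, with no fine-tuning or per-model prompt tuning. Below is a hedged sketch of such a harness; the `model.generate(image, prompt)` interface and the exact instruction suffix are assumptions, since each MLLM ships its own API.

```python
def evaluate_zero_shot(model, samples, instruction_suffix=" Please answer yes or no."):
    """Query the model once per sample with the manually designed instruction,
    without task-specific fine-tuning or per-model prompt engineering."""
    predictions = []
    for sample in samples:  # each sample: {"image": ..., "question": ..., "answer": "yes"/"no", "subtask": ...}
        prompt = sample["question"] + instruction_suffix
        reply = model.generate(sample["image"], prompt)  # hypothetical unified interface
        predictions.append({
            "subtask": sample["subtask"],
            "answer": sample["answer"],
            "response": reply,
        })
    return predictions
```

The resulting predictions can then be passed to the per-subtask scoring sketch given earlier.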
The following datasets were used in this research:
- MME: a manually constructed benchmark of instruction-answer pairs covering 14 perception and cognition subtasks
Key findings include:
- 30 advanced MLLMs evaluated, all with substantial room for improvement
- Identification of common performance issues in MLLMs
The authors identified the following limitations:
- Common problems across the evaluated models: failure to follow instructions, perceptual inaccuracies, reasoning deficiencies, and object hallucination
Compute resources:
- Number of GPUs: None specified
- GPU Type: None specified
Keywords
multimodal large language models
evaluation benchmark
perception
cognition
instruction design