
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, Wenhu Chen (University of Waterloo, The Ohio State University, IN.AI Research, and independent researchers; 2023)

Paper Information
  • arXiv ID: 2311.16502
  • Venue: Computer Vision and Pattern Recognition
  • Domain: Artificial Intelligence / Multimodal AI
  • Reproducibility: 6/10

Abstract

Figure 1 (panels: Knowledge, Reasoning, Perception). Overview of the MMMU dataset. MMMU presents four challenges: 1) comprehensiveness: 11.5K college-level problems across six broad disciplines and 30 college subjects; 2) highly heterogeneous image types; 3) interleaved text and images; 4) expert-level perception and reasoning rooted in deep subject knowledge.

Summary

The paper introduces the Massive Multi-discipline Multimodal Understanding and Reasoning (MMMU) benchmark, designed to evaluate large multimodal models (LMMs) on expert-level understanding and reasoning across diverse disciplines. The benchmark comprises 11.5K college-level questions spanning six disciplines (Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering), with interleaved text and images that test varied visual and textual comprehension skills. It poses four key challenges: comprehensiveness (breadth of subjects), highly heterogeneous image types, interleaved text and images, and expert-level reasoning grounded in deep subject knowledge. The paper evaluates current models, including GPT-4V and a range of open-source LMMs, on MMMU and finds a significant gap between human expert performance and model performance, underscoring the need for stronger multimodal understanding and reasoning. The authors also conduct extensive error analysis of model predictions, identifying perceptual, knowledge, and reasoning errors as focal areas for future research.

Methods

This paper employs the following methods:

  • Evaluation of multimodal models
  • Error analysis
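
A minimal sketch of the multiple-choice evaluation loop implied above, assuming a hypothetical `query_lmm` callable that wraps whichever model is under test (GPT-4V, LLaVA-1.5, etc.); the prompt format and field names are illustrative and not the authors' exact pipeline:

```python
# Sketch of a zero-shot multiple-choice evaluation loop over MMMU-style samples.
# `query_lmm` is a hypothetical stand-in for a model-specific API call.

def build_prompt(question: str, options: list[str]) -> str:
    """Format a question and its options into a single multiple-choice prompt."""
    letters = "ABCDEFGHIJ"
    lines = [question, ""]
    lines += [f"({letters[i]}) {opt}" for i, opt in enumerate(options)]
    lines.append("\nAnswer with the option letter only.")
    return "\n".join(lines)

def evaluate(samples, query_lmm) -> float:
    """Return accuracy of an LMM over samples with 'question', 'options',
    'answer' (gold option letter), and 'images' (interleaved images)."""
    correct = 0
    for sample in samples:
        prompt = build_prompt(sample["question"], sample["options"])
        prediction = query_lmm(prompt, sample["images"])  # model-specific call
        if prediction.strip().upper().startswith(sample["answer"]):
            correct += 1
    return correct / len(samples)
```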

Models Used

  • GPT-4V(ision)
  • LLaVA-1.5
  • BLIP-2 FLAN-T5-XXL
  • Kosmos-2
  • CogVLM

Datasets

The following datasets were used in this research:

  • MMMU
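
A minimal sketch of loading one MMMU subject with the Hugging Face `datasets` library; the hub path `MMMU/MMMU`, the subject config name, and the field names are assumptions and should be checked against the official dataset card:

```python
# Load a single MMMU subject from the Hugging Face Hub.
# Assumed identifiers: repo "MMMU/MMMU", config "Accounting", split "validation".
from datasets import load_dataset

subject = "Accounting"  # one of the 30 college-level subjects
mmmu = load_dataset("MMMU/MMMU", subject, split="validation")

example = mmmu[0]
print(example["question"])  # question text with interleaved image placeholders
print(example["options"])   # candidate answers for multiple-choice questions
print(example["answer"])    # gold option letter
```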

Evaluation Metrics

  • Accuracy
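
A hedged sketch of how accuracy can be computed for MMMU-style multiple-choice questions, including simple rule-based extraction of the option letter from a free-form model response; the regex and helper names are illustrative assumptions rather than the authors' exact implementation:

```python
# Rule-based option-letter extraction plus accuracy over parsed predictions.
import re

def extract_choice(response: str, num_options: int) -> str | None:
    """Pull the first standalone option letter (A, B, C, ...) out of a model response."""
    valid = "ABCDEFGHIJ"[:num_options]
    match = re.search(rf"\b([{valid}])\b", response.upper())
    return match.group(1) if match else None

def accuracy(predictions: list[str], gold: list[str], num_options: int = 4) -> float:
    """Fraction of responses whose extracted letter matches the gold answer."""
    hits = sum(
        extract_choice(pred, num_options) == ans
        for pred, ans in zip(predictions, gold)
    )
    return hits / len(gold)

# Example: two of the three parsed answers match the gold letters.
print(accuracy(["The answer is (B).", "A", "I think C"], ["B", "A", "D"]))  # ~0.667
```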

Results

  • GPT-4V achieves 55.7% accuracy on MMMU; human experts achieve 88.6% accuracy.
  • There is a significant gap between closed-source and open-source models; the strongest open-source models reach approximately 34% accuracy.

Limitations

The authors identified the following limitations:

  • Not specified

Technical Requirements

  • Number of GPUs: Not specified
  • GPU Type: NVIDIA A100

Keywords

Expert AGI, multimodal benchmark, vision-language models, deep reasoning

External Resources