
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, Wenhu Chen (University of Waterloo, The Ohio State University, IN.AI Research, and independent researchers; 2023)

Paper Information
  • arXiv ID: 2311.16502
  • Venue: Computer Vision and Pattern Recognition
  • Domain: Artificial Intelligence / Multimodal AI
  • Reproducibility: 6/10

Abstract

Figure 1 (panels: Knowledge, Reasoning, Perception). Overview of the MMMU dataset. MMMU presents four challenges: 1) comprehensiveness: 11.5K college-level problems across six broad disciplines and 30 college subjects; 2) highly heterogeneous image types; 3) interleaved text and images; 4) expert-level perception and reasoning rooted in deep subject knowledge.

Summary

The paper introduces the Massive Multi-discipline Multimodal Understanding and Reasoning (MMMU) benchmark, designed to evaluate large multimodal models (LMMs) on expert-level understanding and reasoning across diverse disciplines. The benchmark comprises 11.5K college-level questions spanning six disciplines (Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering), with interleaved text and images that test varied visual and textual comprehension skills. It poses four key challenges: comprehensiveness (breadth of subjects), highly heterogeneous image types, interleaved text and images, and expert-level reasoning grounded in deep subject knowledge. The paper evaluates current models, including GPT-4V and a range of open-source LMMs, on MMMU and finds a significant gap between human expert performance and model performance, underscoring the need for stronger multimodal understanding and reasoning. The authors also conduct extensive error analysis of model predictions, identifying perceptual, knowledge, and reasoning errors as focal areas for future research.

Methods

This paper employs the following methods:

  • Evaluation of multimodal models
  • Error analysis
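
A minimal sketch of the multiple-choice evaluation loop implied above, assuming a hypothetical `query_lmm` callable that wraps whichever model is under test (GPT-4V, LLaVA-1.5, etc.); the prompt format and field names are illustrative and not the authors' exact pipeline:

```python
# Sketch of a zero-shot multiple-choice evaluation loop over MMMU-style samples.
# `query_lmm` is a hypothetical stand-in for a model-specific API call.

def build_prompt(question: str, options: list[str]) -> str:
    """Format a question and its options into a single multiple-choice prompt."""
    letters = "ABCDEFGHIJ"
    lines = [question, ""]
    lines += [f"({letters[i]}) {opt}" for i, opt in enumerate(options)]
    lines.append("\nAnswer with the option letter only.")
    return "\n".join(lines)

def evaluate(samples, query_lmm) -> float:
    """Return accuracy of an LMM over samples with 'question', 'options',
    'answer' (gold option letter), and 'images' (interleaved images)."""
    correct = 0
    for sample in samples:
        prompt = build_prompt(sample["question"], sample["options"])
        prediction = query_lmm(prompt, sample["images"])  # model-specific call
        if prediction.strip().upper().startswith(sample["answer"]):
            correct += 1
    return correct / len(samples)
```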

Models Used

  • GPT-4V(ision)
  • LLaVA-1.5
  • BLIP-2 FLAN-T5-XXL
  • Kosmos-2
  • CogVLM

Datasets

The following datasets were used in this research:

  • MMMU
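
A minimal sketch of loading one MMMU subject with the Hugging Face `datasets` library; the hub path `MMMU/MMMU`, the subject config name, and the field names are assumptions and should be checked against the official dataset card:

```python
# Load a single MMMU subject from the Hugging Face Hub.
# Assumed identifiers: repo "MMMU/MMMU", config "Accounting", split "validation".
from datasets import load_dataset

subject = "Accounting"  # one of the 30 college-level subjects
mmmu = load_dataset("MMMU/MMMU", subject, split="validation")

example = mmmu[0]
print(example["question"])  # question text with interleaved image placeholders
print(example["options"])   # candidate answers for multiple-choice questions
print(example["answer"])    # gold option letter
```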

Evaluation Metrics

  • Accuracy
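
A hedged sketch of how accuracy can be computed for MMMU-style multiple-choice questions, including simple rule-based extraction of the option letter from a free-form model response; the regex and helper names are illustrative assumptions rather than the authors' exact implementation:

```python
# Rule-based option-letter extraction plus accuracy over parsed predictions.
import re

def extract_choice(response: str, num_options: int) -> str | None:
    """Pull the first standalone option letter (A, B, C, ...) out of a model response."""
    valid = "ABCDEFGHIJ"[:num_options]
    match = re.search(rf"\b([{valid}])\b", response.upper())
    return match.group(1) if match else None

def accuracy(predictions: list[str], gold: list[str], num_options: int = 4) -> float:
    """Fraction of responses whose extracted letter matches the gold answer."""
    hits = sum(
        extract_choice(pred, num_options) == ans
        for pred, ans in zip(predictions, gold)
    )
    return hits / len(gold)

# Example: two of the three parsed answers match the gold letters.
print(accuracy(["The answer is (B).", "A", "I think C"], ["B", "A", "D"]))  # ~0.667
```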

Results

  • GPT-4V achieves 55.7% accuracy on MMMU; human experts achieve 88.6% accuracy.
  • There is a significant gap between closed-source and open-source models; the strongest open-source models reach approximately 34% accuracy.

Limitations

The authors identified the following limitations:

  • Not specified

Technical Requirements

  • Number of GPUs: Not specified
  • GPU Type: NVIDIA A100

Keywords

Expert AGI, multimodal benchmark, vision-language models, deep reasoning

External Resources