Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, Wenhu Chen. IN.AI Research, University of Waterloo, The Ohio State University, Independent (2023)
The paper introduces the Massive Multi-discipline Multimodal Understanding and Reasoning (MMMU) benchmark, which evaluates how well large multimodal models (LMMs) handle expert-level understanding and reasoning across diverse disciplines. The benchmark comprises 11.5K college-level questions spanning six disciplines (Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering), 30 subjects, and 183 subfields, interleaving text and images to test a wide range of visual and textual comprehension skills. It poses four key challenges: breadth of knowledge across many subjects, highly heterogeneous image types (charts, diagrams, maps, chemical structures, and more), interleaved text and images, and expert-level perception and reasoning requirements. The paper evaluates current models, including GPT-4V and a range of open-source LMMs, on MMMU and finds a significant gap between model and human expert performance: even GPT-4V reaches only about 56% accuracy, underscoring the need for stronger multimodal understanding and reasoning. The authors also conduct an extensive error analysis of model predictions, identifying perceptual, knowledge, and reasoning errors as the main failure modes and as focal areas for future research.
This paper employs the following methods:
- Benchmark construction: 11.5K multimodal questions manually collected from college exams, quizzes, and textbooks across six disciplines
- Zero-shot evaluation of GPT-4V, open-source LMMs, and text-only LLM baselines on multiple-choice and open-ended questions (a minimal scoring sketch follows this list)
- Expert error analysis of model predictions, categorizing failures into perceptual, knowledge, and reasoning errors
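For concreteness, here is a minimal sketch of the multiple-choice scoring loop such a zero-shot evaluation implies. It is not the paper's released harness: `query_model` is a hypothetical stand-in for any LMM API, and the `question`/`options`/`images`/`answer` field names are assumptions to be checked against the actual data format.

```python
def score_multiple_choice(examples, query_model):
    """Return accuracy of predicted option letters (A, B, C, ...) vs. gold.

    `query_model` is a hypothetical callable wrapping an LMM API; the
    example field names are assumptions, not the official MMMU schema.
    """
    correct = 0
    for ex in examples:
        # Lay out the candidate answers as lettered options.
        options = "\n".join(
            f"({chr(ord('A') + i)}) {opt}" for i, opt in enumerate(ex["options"])
        )
        prompt = f"{ex['question']}\n{options}\nAnswer with the option letter only."
        pred = query_model(prompt, images=ex["images"]).strip()
        # MMMU-style gold answers are single letters, e.g. "C".
        if pred and pred[0].upper() == ex["answer"]:
            correct += 1
    return correct / len(examples)
```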
The following datasets were used in this research:
- MMMU: the newly constructed benchmark of 11.5K college-level multimodal questions spanning six disciplines, 30 subjects, and 183 subfields, with roughly 30 heterogeneous image types
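MMMU is publicly released; the snippet below is a minimal loading sketch, assuming the Hugging Face hub id `MMMU/MMMU` with per-subject configs such as `Art_Theory` and a `validation` split (verify these identifiers against the official release).

```python
from datasets import load_dataset

# Load one subject's validation split; config names follow the subject list.
val = load_dataset("MMMU/MMMU", "Art_Theory", split="validation")
print(len(val), val[0]["question"])  # inspect one college-level question
```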
The authors identified the following limitations:
- The manual curation process, despite quality-control efforts, may still carry human biases
- The benchmark's focus on college-level subjects may not be a sufficient test of expert-level artificial general intelligence