MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

Yubo Wang (University of Waterloo), Xueguang Ma (University of Waterloo), Ge Zhang (University of Waterloo), Yuansheng Ni (University of Waterloo), Abhranil Chandra (University of Waterloo), Shiguang Guo (University of Waterloo), Weiming Ren (University of Waterloo), Aaran Arulraj (University of Waterloo), Xuan He, Ziyan Jiang (University of Waterloo), Tianle Li (University of Waterloo), Max Ku (University of Waterloo, University of Toronto), Kai Wang (University of Waterloo), Alex Zhuang (University of Waterloo), Rongqi Fan (University of Waterloo, Carnegie Mellon University), Xiang Yue (University of Waterloo), Wenhu Chen (2024)

Paper Information

arXiv ID

2406.01574

Venue

Neural Information Processing Systems

Domain

Computer Science, Artificial Intelligence, Machine Learning

Contents

Abstract
Methods
Datasets
Results
Limitations
Related Work
External Resources

Abstract

In the age of large-scale language models, benchmarks like the Massive Multitask Language Understanding (MMLU) have been pivotal in pushing the boundaries of what AI can achieve in language comprehension and reasoning across diverse domains. However, as models continue to improve, their performance on these benchmarks has begun to plateau, making it increasingly difficult to discern differences in model capabilities. This paper introduces MMLU-Pro, an enhanced dataset designed to extend the mostly knowledge-driven MMLU benchmark by integrating more challenging, reasoning-focused questions and expanding the choice set from four to ten options. Additionally, MMLU-Pro eliminates the trivial and noisy questions in MMLU. Our experimental results show that MMLU-Pro not only raises the challenge, causing a significant drop in accuracy by 16% to 33% compared to MMLU, but also demonstrates greater stability under varying prompts. With 24 different prompt styles tested, the sensitivity of model scores to prompt variations decreased from 4-5% in MMLU to just 2% in MMLU-Pro. Additionally, we found that models utilizing Chain of Thought (CoT) reasoning achieved better performance on MMLU-Pro compared to direct answering, which is in stark contrast to the findings on the original MMLU, indicating that MMLU-Pro includes more complex reasoning questions. Our assessments confirm that MMLU-Pro is a more discriminative benchmark to better track progress in the field.

Summary

This paper introduces MMLU-Pro, an enhanced benchmark for evaluating large language models (LLMs) on multi-task language understanding. MMLU-Pro extends the earlier MMLU benchmark with more complex, reasoning-intensive questions and increases the number of answer options from four to ten, which makes it both harder and more robust. Experimental results show that accuracy on MMLU-Pro is 16% to 33% lower than on MMLU and that scores are more stable under prompt variation, with sensitivity falling from 4-5% on MMLU to about 2% on MMLU-Pro. The paper motivates the benchmark by the performance saturation observed on existing benchmarks and describes how MMLU-Pro removes trivial and noisy questions. It also finds that chain-of-thought (CoT) reasoning outperforms direct answering on MMLU-Pro, unlike on the original MMLU. With evaluations of over 50 LLMs, the paper provides insight into the performance gaps between models and the challenges that remain for future models in natural language understanding.
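
One consequence of moving from four to ten options is a lower random-guess baseline. The following is a minimal sketch of that arithmetic, assuming every question has exactly the stated number of options and that guessing is uniform; the helper name `chance_accuracy` is hypothetical and not taken from the paper.

```python
from fractions import Fraction


def chance_accuracy(num_options: int) -> Fraction:
    """Expected accuracy of uniform random guessing over num_options choices."""
    if num_options < 1:
        raise ValueError("num_options must be at least 1")
    return Fraction(1, num_options)


# Assumption: a fixed option count per question (4 for MMLU, 10 for MMLU-Pro).
mmlu_chance = chance_accuracy(4)
mmlu_pro_chance = chance_accuracy(10)

print(f"MMLU chance level:     {float(mmlu_chance):.2f}")
print(f"MMLU-Pro chance level: {float(mmlu_pro_chance):.2f}")
```

Under these assumptions a model that guesses blindly scores 0.25 on a four-option question and 0.10 on a ten-option question, so lucky guesses contribute less to the final score.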

Methods

This paper employs the following methods:

Data enhancement
Reasoning-focused questioning
Expert review process
Chain of Thought (CoT) reasoning

Models Used

GPT-4
Gemini-1.5-Pro
Claude-3-Opus
GPT-4-Turbo
Llama-3-70B-Instruct
Phi-3-medium-4k-instruct
DeepSeek-V2-Chat
Yi-large

Datasets

The following datasets were used in this research:

MMLU
TheoremQA
SciBench
STEM Website

Evaluation Metrics

Accuracy
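
As a rough illustration of how accuracy could be computed over free-form CoT responses, the sketch below extracts the final option letter (A to J, matching ten options) and compares it with the gold label. The regular expression, the function names, and the rule that unparseable responses count as wrong are all assumptions made for this example, not the authors' confirmed evaluation procedure.

```python
import re
from typing import List, Optional

# Assumption: responses end with a phrase such as "The answer is (C)".
ANSWER_PATTERN = re.compile(r"answer is \(?([A-J])\b")


def extract_choice(response: str) -> Optional[str]:
    """Return the last option letter stated in the response, or None."""
    matches = ANSWER_PATTERN.findall(response)
    return matches[-1] if matches else None


def accuracy(responses: List[str], gold: List[str]) -> float:
    """Fraction of responses whose extracted letter equals the gold letter.

    Responses with no extractable letter are counted as incorrect.
    """
    if len(responses) != len(gold):
        raise ValueError("responses and gold must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(extract_choice(r) == g for r, g in zip(responses, gold))
    return correct / len(gold)


# Hypothetical responses, for illustration only.
example_responses = [
    "Let's think step by step. 2 + 2 = 4, so the answer is (C).",
    "The answer is J",
    "I am not sure.",
]
example_gold = ["C", "J", "A"]
print(accuracy(example_responses, example_gold))
```

Taking the last match rather than the first is a design choice for CoT output, where a model may mention candidate answers before committing to a final one.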

Results

Accuracy on MMLU-Pro is 16% to 33% lower than on MMLU
Scores are more stable under varying prompts: across 24 prompt styles, sensitivity to prompt variation decreased from 4-5% on MMLU to about 2% on MMLU-Pro
Models using CoT reasoning performed better on MMLU-Pro than with direct answering, in contrast to the findings on the original MMLU
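
Prompt sensitivity can be summarized in several ways. The sketch below uses two common choices, the range (maximum minus minimum) and the population standard deviation of accuracy across prompt styles. Whether either matches the exact definition used in the paper is an assumption, and the scores are hypothetical values chosen only to make the example run.

```python
from statistics import pstdev
from typing import Dict, Sequence


def prompt_sensitivity(scores: Sequence[float]) -> Dict[str, float]:
    """Summarize how much accuracy varies across prompt styles.

    scores holds one accuracy value per prompt style for a single model.
    """
    if not scores:
        raise ValueError("scores must contain at least one value")
    return {
        "range": max(scores) - min(scores),  # worst-case spread
        "std": pstdev(scores),               # typical spread
    }


# Hypothetical accuracies of one model under four prompt styles.
hypothetical_scores = [0.612, 0.598, 0.605, 0.620]
summary = prompt_sensitivity(hypothetical_scores)
print(f"range = {summary['range']:.3f}, std = {summary['std']:.3f}")
```

A smaller range or standard deviation means a model's ranking is less likely to change when the prompt wording changes, which is the property the stability result describes.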

Limitations

The authors identified the following limitations:

The benchmark may not capture the depth of comprehension as effectively as open-ended responses.
MMLU-Pro focuses on language models and does not assess multi-modal models.

Technical Requirements

Number of GPUs: None specified
GPU Type: NVIDIA A100

Keywords

MMLU-Pro
benchmark
large language models
reasoning
multi-task understanding
model evaluation
robustness
prompt sensitivity

External Resources

Funding: Not specified
References: 51
Influential Citations: 31
