MMLU-Pro

Name: MMLU-Pro
Published: 2024-06-03
License: Unknown

Contents

Overview
Associated Benchmarks
Recent Benchmark Submissions
Research Papers

Overview

MMLU-Pro is an enhanced version of the Massive Multitask Language Understanding (MMLU) benchmark. It is designed to be more robust and more challenging than the original, so that it can rigorously benchmark large language models' language comprehension and reasoning across diverse domains. Its key features are:

Increased Complexity: It includes more reasoning-focused questions and expands the choice set from four to ten options, which lowers the chance of a correct uniform random guess from 25% to 10% and makes the evaluation harder¹.
Elimination of Trivial Questions: MMLU-Pro removes trivial and noisy questions found in the original MMLU, making it a more discriminative benchmark².
Stability Under Varying Prompts: Model scores on MMLU-Pro are less sensitive to prompt variations than scores on the original MMLU².
Better Performance with Reasoning: Models using Chain of Thought (CoT) reasoning scored higher on MMLU-Pro than the same models answering directly².
Size and Scope: The dataset contains 12K complex questions across various disciplines¹⁴.
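
To make the effect of the larger choice set concrete, the sketch below renders a ten-option question as a lettered prompt and compares the expected accuracy of uniform random guessing with four and with ten options. The sample question, the function names, and the prompt layout are illustrative assumptions, not the official MMLU-Pro evaluation harness.

```python
import string


def chance_accuracy(num_options: int) -> float:
    """Expected accuracy of uniform random guessing over num_options choices."""
    if num_options < 1:
        raise ValueError("num_options must be positive")
    return 1.0 / num_options


def format_question(question: str, options: list[str]) -> str:
    """Render a multiple-choice question with lettered options (A, B, C, ...)."""
    if len(options) > len(string.ascii_uppercase):
        raise ValueError("too many options to letter")
    lines = [f"Question: {question}"]
    for letter, option in zip(string.ascii_uppercase, options):
        lines.append(f"{letter}. {option}")
    lines.append("Answer:")
    return "\n".join(lines)


# Hypothetical item, invented for illustration only (not taken from the dataset).
sample_options = [str(n) for n in range(10, 20)]  # ten options, lettered A through J
prompt = format_question("What is 7 + 8?", sample_options)

if __name__ == "__main__":
    print(prompt)
    print(f"4 options:  {chance_accuracy(4):.0%} by chance")
    print(f"10 options: {chance_accuracy(10):.0%} by chance")
```

With ten options, a model that guesses uniformly at random is expected to score 10% rather than 25%, so scores near the floor are easier to tell apart from real ability.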

(1) TIGER-Lab/MMLU-Pro · Datasets at Hugging Face. https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro.
(2) MMLU-Pro: A More Robust and Challenging Multi-Task Language .... https://arxiv.org/abs/2406.01574.
(3) MMLU-Pro: An Upgraded Version of the MMLU Dataset | LLM Explorer Blog. https://llm.extractum.io/static/blog/?id=mmlu-pro-benchmark.
(4) TIGER-Lab Introduces MMLU-Pro Dataset for Comprehensive Benchmarking of .... https://www.marktechpost.com/2024/05/16/tiger-lab-introduces-mmlu-pro-dataset-for-comprehensive-benchmarking-of-large-language-models-capabilities-and-performance/.
(5) DOI record for arXiv:2406.01574, the same paper as (2). https://doi.org/10.48550/arXiv.2406.01574.

Variants: MMLU-Pro

Associated Benchmarks

This dataset is used in 1 benchmark:

MMLU - Metrics: 0-shot MRR
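
The page does not define MRR; it presumably stands for mean reciprocal rank, which is an assumption here. Under that reading, a minimal sketch of the metric looks like the following, where each model response is a ranked list of option letters. The function names and the sample rankings are hypothetical and do not come from any benchmark submission.

```python
def reciprocal_rank(ranked: list[str], correct: str) -> float:
    """Return 1/rank of the correct answer, or 0.0 if it is not in the ranking."""
    for position, candidate in enumerate(ranked, start=1):
        if candidate == correct:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(rankings: list[list[str]], answers: list[str]) -> float:
    """Average the reciprocal rank over all questions."""
    if len(rankings) != len(answers):
        raise ValueError("rankings and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    total = sum(reciprocal_rank(r, a) for r, a in zip(rankings, answers))
    return total / len(answers)


# Hypothetical rankings for three questions (illustration only).
rankings = [["B", "A", "C"], ["D", "J", "A"], ["F", "G", "H"]]
answers = ["B", "J", "A"]

# Reciprocal ranks are 1, 1/2, and 0, so the mean is 0.5.
mrr = mean_reciprocal_rank(rankings, answers)

if __name__ == "__main__":
    print(f"MRR = {mrr:.3f}")
```

When a model returns only a single answer per question, each reciprocal rank is either 1 or 0, so this quantity reduces to plain accuracy.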

Recent Benchmark Submissions

Task	Model	Paper	Date
MMLU	Orange-mini	MyGO Multiplex CoT: A Method …	2025-01-20

Research Papers

Recent papers with results on this dataset:

MyGO Multiplex CoT: A Method for Self-Reflection in Large Language Models via Double Chain of Thought Thinking (2025)
