MMLU-Pro

Dataset Information
Introduced: 2024
License: Unknown

Overview

MMLU-Pro is an enhanced version of the Massive Multitask Language Understanding (MMLU) benchmark. It is designed to be more robust and more challenging than the original, and to rigorously test large language models' language comprehension and reasoning across diverse domains. Its key features are:

  • Increased Complexity: It includes more reasoning-focused questions and expands the choice set from four to ten options, which lowers the expected accuracy of uniform random guessing from 25% to 10% and makes the evaluation harder¹.
  • Elimination of Trivial Questions: MMLU-Pro removes trivial and noisy questions found in the original MMLU, making it a more discriminative benchmark².
  • Stability Under Varying Prompts: Model scores on MMLU-Pro are less sensitive to prompt variations than scores on the original MMLU².
  • Better Performance with Reasoning: Models that use Chain of Thought (CoT) reasoning score higher on MMLU-Pro than models that answer directly².
  • Size and Scope: The dataset contains 12K complex questions across various disciplines¹,⁴.
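
The effect of the larger choice set, and the difference between direct and CoT prompting, can be illustrated with a minimal sketch. This is not the benchmark's official evaluation code: the function names, the prompt wording, and the "answer is (X)" extraction pattern are all assumptions chosen for illustration.

```python
import re
import string
from typing import List, Optional


def random_guess_accuracy(num_options: int) -> float:
    """Expected accuracy of uniform random guessing over num_options choices."""
    return 1.0 / num_options


def format_question(question: str, options: List[str], use_cot: bool) -> str:
    """Render a multiple-choice prompt with lettered options (A, B, C, ...)."""
    letters = string.ascii_uppercase[: len(options)]
    lines = [f"Question: {question}", "Options:"]
    lines += [f"{letter}. {text}" for letter, text in zip(letters, options)]
    # Hypothetical instruction wording, not taken from the benchmark's prompts.
    if use_cot:
        lines.append("Think step by step, then end with 'The answer is (X)'.")
    else:
        lines.append("Reply with 'The answer is (X)' and nothing else.")
    return "\n".join(lines)


def extract_answer(response: str, num_options: int = 10) -> Optional[str]:
    """Return the last option letter stated as 'answer is (X)', or None."""
    letters = string.ascii_uppercase[:num_options]
    # Assumed answer format; parentheses around the letter are optional.
    matches = re.findall(rf"answer is \(?([{letters}])\)?", response)
    return matches[-1] if matches else None


if __name__ == "__main__":
    print(random_guess_accuracy(4))   # four options, as in MMLU
    print(random_guess_accuracy(10))  # ten options, as in MMLU-Pro
    options = [f"placeholder option {i}" for i in range(10)]
    print(format_question("Placeholder question?", options, use_cot=True))
    print(extract_answer("Step 1 ... Step 2 ... The answer is (E)."))
```

Taking the last match rather than the first is a design choice: a CoT response may mention candidate answers while reasoning, and the final statement is the one meant as the answer.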

(1) TIGER-Lab/MMLU-Pro · Datasets at Hugging Face. https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro
(2) MMLU-Pro: A More Robust and Challenging Multi-Task Language ... https://arxiv.org/abs/2406.01574 (DOI: https://doi.org/10.48550/arXiv.2406.01574)
(3) MMLU-Pro: An Upgraded Version of the MMLU Dataset | LLM Explorer Blog. https://llm.extractum.io/static/blog/?id=mmlu-pro-benchmark
(4) TIGER-Lab Introduces MMLU-Pro Dataset for Comprehensive Benchmarking of ... https://www.marktechpost.com/2024/05/16/tiger-lab-introduces-mmlu-pro-dataset-for-comprehensive-benchmarking-of-large-language-models-capabilities-and-performance/

Variants: MMLU-Pro

Associated Benchmarks

This dataset is used in 1 benchmark:

Recent Benchmark Submissions

Task  Model        Paper                            Date
MMLU  Orange-mini  MyGO Multiplex CoT: A Method …   2025-01-20

Research Papers

Recent papers with results on this dataset: