GSM8K

Dataset Information
Modalities
Texts
Languages
English
Introduced
2021
License
Unknown

Overview

GSM8K is a dataset of 8.5K high-quality, linguistically diverse grade school math word problems created by human problem writers. The dataset is split into 7.5K training problems and 1K test problems. Each problem takes between 2 and 8 steps to solve, and solutions primarily involve a sequence of elementary calculations using basic arithmetic operations (+, −, ×, ÷) to reach the final answer. A bright middle school student should be able to solve every problem. It can be used to train and evaluate models on multi-step mathematical reasoning.
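To make the "2 to 8 steps of basic arithmetic" description concrete, here is a minimal Python sketch that models a solution as a chain of elementary operations. The problem text, the `solve` helper, and the step encoding are all hypothetical illustrations, not taken from the dataset or from any official tooling.

```python
from fractions import Fraction
import operator

# The four basic operations named in the overview.
OPS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}

def solve(start, steps):
    """Apply a sequence of (operator, operand) steps to a starting value.

    Assumption for illustration: a solution is a linear chain of steps,
    and the step count must fall in the 2-8 range the overview describes.
    """
    if not 2 <= len(steps) <= 8:
        raise ValueError("expected between 2 and 8 steps")
    value = Fraction(start)  # exact arithmetic, no float rounding
    for symbol, operand in steps:
        value = OPS[symbol](value, Fraction(operand))
    return value

# Hypothetical problem (not from the dataset):
# "A baker makes 12 trays of 8 rolls, sells 70 rolls, and packs the
#  rest into bags of 2. How many bags are there?"
steps = [("*", 8), ("-", 70), ("/", 2)]
answer = solve(12, steps)
print(answer)  # 12 * 8 = 96, 96 - 70 = 26, 26 / 2 = 13
```

Real GSM8K solutions are written in natural language and need not form a strictly linear chain; the sketch only shows the kind of arithmetic involved.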

Image source: https://arxiv.org/pdf/2110.14168v1.pdf

Variants: GSM8K, GSM8k (5-shot), GSM8k TR v0.2, GSM8k TR, gsm8k (5-shots)

Associated Benchmarks

This dataset is used in 3 benchmarks:

Recent Benchmark Submissions

Task | Model | Paper | Date
GSM8K | Xolver | Xolver: Multi-Agent Reasoning with Holistic … | 2025-06-17
Arithmetic Reasoning | Qwen2.5-32B + CAPO | CAPO: Cost-Aware Prompt Optimization | 2025-04-22
Arithmetic Reasoning | Mistral-Small-24B + CAPO | CAPO: Cost-Aware Prompt Optimization | 2025-04-22
Arithmetic Reasoning | Llama-3.3-70B + CAPO | CAPO: Cost-Aware Prompt Optimization | 2025-04-22
GSM8K | Orange-mini | MyGO Multiplex CoT: A Method … | 2025-01-20
Arithmetic Reasoning | GPT-4 (Teaching-Inspired) | Teaching-Inspired Integrated Prompting Framework: A … | 2024-10-10
Arithmetic Reasoning | OpenMath2-Llama3.1-8B (majority@256) | OpenMathInstruct-2: Accelerating AI for Math … | 2024-10-02
Arithmetic Reasoning | OpenMath2-Llama3.1-8B | OpenMathInstruct-2: Accelerating AI for Math … | 2024-10-02
Arithmetic Reasoning | OpenMath2-Llama3.1-70B | OpenMathInstruct-2: Accelerating AI for Math … | 2024-10-02
Arithmetic Reasoning | OpenMath2-Llama3.1-70B (majority@256) | OpenMathInstruct-2: Accelerating AI for Math … | 2024-10-02
Arithmetic Reasoning | Qwen2-Math-72B-Instruct (greedy) | Qwen2 Technical Report | 2024-07-15
Arithmetic Reasoning | Qwen2-72B-Instruct-Step-DPO (0-shot CoT) | Step-DPO: Step-wise Preference Optimization for … | 2024-06-26
Arithmetic Reasoning | DART-Math-Llama3-70B-Uniform (0-shot CoT, w/o code) | DART-Math: Difficulty-Aware Rejection Tuning for … | 2024-06-18
Arithmetic Reasoning | DART-Math-Mistral-7B-Uniform (0-shot CoT, w/o code) | DART-Math: Difficulty-Aware Rejection Tuning for … | 2024-06-18
Arithmetic Reasoning | GaC(Qwen2-72B-Instruct + Llama-3-70B-Instruct) | Breaking the Ceiling of the … | 2024-06-18
Arithmetic Reasoning | Claude 3.5 Sonnet (HPT) | Hierarchical Prompting Taxonomy: A Universal … | 2024-06-18
Arithmetic Reasoning | DART-Math-Llama3-8B-Uniform (0-shot CoT, w/o code) | DART-Math: Difficulty-Aware Rejection Tuning for … | 2024-06-18
Arithmetic Reasoning | DART-Math-Mistral-7B-Prop2Diff (0-shot CoT, w/o code) | DART-Math: Difficulty-Aware Rejection Tuning for … | 2024-06-18
Arithmetic Reasoning | DART-Math-DSMath-7B-Uniform (0-shot CoT, w/o code) | DART-Math: Difficulty-Aware Rejection Tuning for … | 2024-06-18
Arithmetic Reasoning | DART-Math-DSMath-7B-Prop2Diff (0-shot CoT, w/o code) | DART-Math: Difficulty-Aware Rejection Tuning for … | 2024-06-18

Research Papers

Recent papers with results on this dataset: