GSM8K

Name: GSM8K
Published: 2021-10-27
License: Unknown

Dataset Information

Modalities

Texts

Languages

English

Introduced

2021

License

Unknown

Homepage

Official Website

Overview

GSM8K is a dataset of 8.5K high-quality, linguistically diverse grade school math word problems created by human problem writers. The dataset is split into 7.5K training problems and 1K test problems. Each problem takes between 2 and 8 steps to solve, and solutions primarily involve a sequence of elementary calculations using basic arithmetic operations (+, −, ×, ÷) to reach the final answer. A bright middle school student should be able to solve every problem. The dataset is intended for training and evaluating multi-step mathematical reasoning.
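To make the "sequence of elementary calculations" concrete, here is a minimal sketch that pulls the intermediate steps and the final answer out of a solution string. It assumes the annotation style of the publicly released GSM8K files, where each calculation is written as <<expression=result>> and the final answer follows a "####" marker. The sample text is made up for illustration and is not a dataset entry, and the function names are hypothetical.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical solution written in the GSM8K style (not a real dataset entry).
SAMPLE_SOLUTION = (
    "Each box holds 12 pencils, so 3 boxes hold 3*12 = <<3*12=36>>36 pencils.\n"
    "After giving away 10, there are 36-10 = <<36-10=26>>26 pencils left.\n"
    "#### 26"
)

# Matches calculator annotations such as <<3*12=36>>.
CALC_PATTERN = re.compile(r"<<([^=<>]+)=([^<>]+)>>")
# Matches the number that follows the '####' final-answer marker.
FINAL_PATTERN = re.compile(r"####\s*(-?[\d,]*\.?\d+)")


def extract_steps(solution: str) -> List[Tuple[str, str]]:
    """Return (expression, result) pairs from the <<expr=result>> annotations."""
    return CALC_PATTERN.findall(solution)


def extract_final_answer(solution: str) -> Optional[str]:
    """Return the number after '####' with thousands separators removed."""
    match = FINAL_PATTERN.search(solution)
    if match is None:
        return None
    return match.group(1).replace(",", "")


if __name__ == "__main__":
    for expression, result in extract_steps(SAMPLE_SOLUTION):
        print(f"{expression} -> {result}")
    print("final:", extract_final_answer(SAMPLE_SOLUTION))
```

The made-up sample has two steps, the low end of the 2 to 8 range described above.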

Image source: https://arxiv.org/pdf/2110.14168v1.pdf

Variants: GSM8K, GSM8k (5-shot), GSM8k TR v0.2, GSM8k TR, gsm8k (5-shots)

Associated Benchmarks

This dataset is used in 3 benchmarks:

Mathematical Reasoning - Metrics: Accuracy
Arithmetic Reasoning - Metrics: Accuracy, Parameters (Billion)
GSM8K - Metrics: Accuracy, 0-shot MRR
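The Accuracy metric listed for these benchmarks is commonly computed as the fraction of problems whose predicted final answer matches the reference. The sketch below assumes answers are compared numerically after removing thousands separators and a leading dollar sign. Individual leaderboard entries may normalize differently, and the function names are hypothetical.

```python
from decimal import Decimal, InvalidOperation
from typing import Optional, Sequence


def normalize(answer: str) -> Optional[Decimal]:
    """Parse an answer string into a Decimal, or return None if it is not numeric."""
    cleaned = answer.strip().replace(",", "").replace("$", "")
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


def accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions whose numeric value equals the reference value."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = 0
    for predicted, reference in zip(predictions, references):
        predicted_value = normalize(predicted)
        reference_value = normalize(reference)
        # An unparseable prediction counts as wrong.
        if predicted_value is not None and predicted_value == reference_value:
            correct += 1
    return correct / len(references)


if __name__ == "__main__":
    print(accuracy(["72", "1,000", "abc", "5.0"], ["72", "1000", "3", "5"]))
```

Comparing Decimal values rather than raw strings lets "5.0" match "5" without floating-point rounding.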

Recent Benchmark Submissions

Task	Model	Paper	Date
GSM8K	Xolver	Xolver: Multi-Agent Reasoning with Holistic …	2025-06-17
Arithmetic Reasoning	Qwen2.5-32B + CAPO	CAPO: Cost-Aware Prompt Optimization	2025-04-22
Arithmetic Reasoning	Mistral-Small-24B + CAPO	CAPO: Cost-Aware Prompt Optimization	2025-04-22
Arithmetic Reasoning	Llama-3.3-70B + CAPO	CAPO: Cost-Aware Prompt Optimization	2025-04-22
GSM8K	Orange-mini	MyGO Multiplex CoT: A Method …	2025-01-20
Arithmetic Reasoning	GPT-4 (Teaching-Inspired)	Teaching-Inspired Integrated Prompting Framework: A …	2024-10-10
Arithmetic Reasoning	OpenMath2-Llama3.1-8B (majority@256)	OpenMathInstruct-2: Accelerating AI for Math …	2024-10-02
Arithmetic Reasoning	OpenMath2-Llama3.1-8B	OpenMathInstruct-2: Accelerating AI for Math …	2024-10-02
Arithmetic Reasoning	OpenMath2-Llama3.1-70B	OpenMathInstruct-2: Accelerating AI for Math …	2024-10-02
Arithmetic Reasoning	OpenMath2-Llama3.1-70B (majority@256)	OpenMathInstruct-2: Accelerating AI for Math …	2024-10-02
Arithmetic Reasoning	Qwen2-Math-72B-Instruct (greedy)	Qwen2 Technical Report	2024-07-15
Arithmetic Reasoning	Qwen2-72B-Instruct-Step-DPO (0-shot CoT)	Step-DPO: Step-wise Preference Optimization for …	2024-06-26
Arithmetic Reasoning	DART-Math-Llama3-70B-Uniform (0-shot CoT, w/o code)	DART-Math: Difficulty-Aware Rejection Tuning for …	2024-06-18
Arithmetic Reasoning	DART-Math-Mistral-7B-Uniform (0-shot CoT, w/o code)	DART-Math: Difficulty-Aware Rejection Tuning for …	2024-06-18
Arithmetic Reasoning	GaC(Qwen2-72B-Instruct + Llama-3-70B-Instruct)	Breaking the Ceiling of the …	2024-06-18
Arithmetic Reasoning	Claude 3.5 Sonnet (HPT)	Hierarchical Prompting Taxonomy: A Universal …	2024-06-18
Arithmetic Reasoning	DART-Math-Llama3-8B-Uniform (0-shot CoT, w/o code)	DART-Math: Difficulty-Aware Rejection Tuning for …	2024-06-18
Arithmetic Reasoning	DART-Math-Mistral-7B-Prop2Diff (0-shot CoT, w/o code)	DART-Math: Difficulty-Aware Rejection Tuning for …	2024-06-18
Arithmetic Reasoning	DART-Math-DSMath-7B-Uniform (0-shot CoT, w/o code)	DART-Math: Difficulty-Aware Rejection Tuning for …	2024-06-18
Arithmetic Reasoning	DART-Math-DSMath-7B-Prop2Diff (0-shot CoT, w/o code)	DART-Math: Difficulty-Aware Rejection Tuning for …	2024-06-18
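Some entries in the table are labeled majority@256. This usually means sampling many candidate solutions per problem and reporting the most frequent final answer. The sketch below assumes that reading and is not taken from any of the listed papers. The helper name is hypothetical, and ties go to the answer seen first.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(answers: Sequence[Optional[str]]) -> Optional[str]:
    """Return the most frequent non-None answer, or None if there is none."""
    valid = [answer for answer in answers if answer is not None]
    if not valid:
        return None
    # most_common orders equal counts by first occurrence, so ties go to the earliest answer.
    return Counter(valid).most_common(1)[0][0]


if __name__ == "__main__":
    sampled = ["72", "70", "72", None, "72", "70"]
    print(majority_vote(sampled))
```

Samples whose final answer could not be extracted are passed as None and ignored in the vote.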

Research Papers

Recent papers with results on this dataset:

External Links:

GSM8K
