
GSM8K

Arithmetic Reasoning Benchmark

Performance Over Time

144 results | Metric: Accuracy
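The accuracy metric is exact match on the final numeric answer: in the official GSM8K dataset, every reference solution ends with a `#### <number>` line. A minimal scoring sketch (the helper names are illustrative, not from any particular evaluation harness):

```python
import re
from typing import Optional


def _normalize(num: str) -> str:
    # Strip thousands separators and a trailing period, e.g. "1,200." -> "1200".
    return num.replace(",", "").rstrip(".")


def extract_answer(text: str) -> Optional[str]:
    """Pull the final numeric answer from a GSM8K-style solution.

    Reference solutions end with '#### <number>'; for free-form model
    output we fall back to the last number appearing in the text.
    """
    marked = re.search(r"####\s*(-?[\d,.]+)", text)
    if marked:
        return _normalize(marked.group(1))
    numbers = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    return _normalize(numbers[-1]) if numbers else None


def accuracy(predictions, references) -> float:
    """Exact-match accuracy over extracted final answers."""
    correct = sum(
        extract_answer(p) == extract_answer(r)
        for p, r in zip(predictions, references)
    )
    return correct / len(references)
```

Published numbers can still differ slightly between papers, since each harness makes its own choices about answer normalization and parsing failures.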

Top Performing Models

| Rank | Model | Paper | Accuracy | Date | Code |
|------|-------|-------|----------|------|------|
| 1 | Claude 3.5 Sonnet (HPT) | Hierarchical Prompting Taxonomy: A Universal Evaluation Framework for Large Language Models Aligned with Human Cognitive Principles | 97.72 | 2024-06-18 | devichand579/HPT |
| 2 | DUP prompt upon GPT-4 | Achieving >97% on GSM8K: Deeply Understanding the Problems Makes LLMs Better Solvers for Math Word Problems | 97.10 | 2024-04-23 | whu-zqh/dup |
| 3 | Qwen2-Math-72B-Instruct (greedy) | Qwen2 Technical Report | 96.70 | 2024-07-15 | qwenlm/qwen1.5, qwenlm/qwen2, vicentvankor/sun-shine |
| 4 | OpenMath2-Llama3.1-70B (majority@256) | OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data | 96.00 | 2024-10-02 | NVIDIA/NeMo-Skills |
| 5 | OpenMath2-Llama3.1-70B | OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data | 94.90 | 2024-10-02 | NVIDIA/NeMo-Skills |
| 6 | GPT-4 (Teaching-Inspired) | Teaching-Inspired Integrated Prompting Framework: A Novel Approach for Enhancing Reasoning in Large Language Models | 94.80 | 2024-10-10 | sallytan13/teaching-inspired-prompting |
| 7 | OpenMath2-Llama3.1-8B (majority@256) | OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data | 94.10 | 2024-10-02 | NVIDIA/NeMo-Skills |
| 8 | Qwen2-72B-Instruct-Step-DPO (0-shot CoT) | Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs | 94.00 | 2024-06-26 | dvlab-research/step-dpo |
| 9 | AlphaLLM (with MCTS) | Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing | 92.00 | 2024-04-18 | yetianjhu/alphallm |
| 10 | OpenMath2-Llama3.1-8B | OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data | 91.70 | 2024-10-02 | NVIDIA/NeMo-Skills |

All Papers (144)

- Solving math word problems with process- and outcome-based feedback (2022) | DeepMind 70B Model (SFT+ORM-RL, ORM reranking)
- Solving math word problems with process- and outcome-based feedback (2022) | DeepMind 70B Model (SFT+PRM-RL, PRM reranking)
- The ART of LLM Refinement: Ask, Refine, and Trust (2023) | ChatGPT (Ask, Refine, Trust)
- Large Language Models Can Self-Improve (2022) | PaLM 540B (Self-Improvement, Self-Consistency)
- Large Language Models Can Self-Improve (2022) | PaLM 540B (Self-Consistency)
- Large Language Models Can Self-Improve (2022) | PaLM 540B (Self-Improvement, CoT Prompting)
- KwaiYiiMath: Technical Report (2023) | KwaiYiiMath 13B
- The Unreasonable Effectiveness of Eccentric Automatic Prompts (2024) | Llama-2 70B (first 100 questions, 4-shot, auto-optimized prompting)
- Large Language Models Can Self-Improve (2022) | PaLM 540B (CoT Prompting)
- The Unreasonable Effectiveness of Eccentric Automatic Prompts (2024) | Llama-2 13B (first 100 questions, 4-shot, auto-optimized prompting)
- The Unreasonable Effectiveness of Eccentric Automatic Prompts (2024) | Mistral 7B (first 100 questions, 4-shot, auto-optimized prompting)
- Large Language Models Can Self-Improve (2022) | PaLM 540B (Self-Improvement, Standard Prompting)
- Composing Ensembles of Pre-trained Models via Iterative Consensus (2022) | GPT-2-Medium 355M + question-solution classifier (BS=5)
- Large Language Models Can Self-Improve (2022) | PaLM 540B (Standard Prompting)
- Composing Ensembles of Pre-trained Models via Iterative Consensus (2022) | GPT-2-Medium 355M + question-solution classifier (BS=1)