MATH

Math Word Problem Solving Benchmark

Performance Over Time

132 results | Metric: Accuracy (%)

Top Performing Models

Rank	Model	Paper	Accuracy	Date	Code
1	Qwen2.5-Math-72B-Instruct(TIR,Greedy) 📚	Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement	88.10	2024-09-18	-
2	GPT-4 Turbo (MACM, w/code, voting)	MACM: Utilizing a Multi-Agent System for Condition Mining in Solving Complex Mathematical Problems	87.92	2024-04-06	📦 bin123apple/macm
3	Qwen2.5-Math-72B-Instruct(COT,Greedy) 📚	Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement	85.90	2024-09-18	-
4	Qwen2.5-Math-7B-Instruct(TIR,Greedy) 📚	Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement	85.20	2024-09-18	-
5	GPT-4-code model (CSV, w/ code, SC, k=16)	Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification	84.30	2023-08-15	📦 kipok/nemo-skills
6	Qwen2-Math-72B-Instruct(greedy) 📚	Qwen2 Technical Report	84.00	2024-07-15	📦 qwenlm/qwen1.5 📦 qwenlm/qwen2 📦 vicentvankor/sun-shine
7	Qwen2.5-Math-7B-Instruct(COT,Greedy) 📚	Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement	83.60	2024-09-18	-
8	Qwen2.5-Math-1.5B-Instruct(TIR,Greedy) 📚	Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement	79.90	2024-09-18	-
9	OpenMath2-Llama3.1-70B (majority@256) 📚	OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data	79.60	2024-10-02	📦 NVIDIA/NeMo-Skills
10	OpenMath2-Llama3.1-8B (majority@256) 📚	OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data	76.10	2024-10-02	📦 NVIDIA/NeMo-Skills
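
The entries above mix decoding settings: some report greedy accuracy, others report accuracy after majority voting over k sampled answers (for example "majority@256" or "SC, k=16"), so the numbers are not directly comparable. The following minimal sketch illustrates the difference on hypothetical toy data. The helper names `majority_vote` and `accuracy` are illustrative only, and plain string equality stands in for the answer-equivalence checks that real MATH evaluation scripts typically use.

```python
from collections import Counter

def majority_vote(samples):
    """Return the most frequent answer among the sampled answers."""
    return Counter(samples).most_common(1)[0][0]

def accuracy(predictions, references):
    """Percentage of predictions that exactly match the reference answers."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)

# Hypothetical toy data: 3 problems, 5 sampled answers per problem.
references = ["4", "10", "7"]
sampled = [
    ["4", "4", "5", "4", "3"],       # majority "4"  (correct)
    ["12", "10", "10", "12", "10"],  # majority "10" (correct)
    ["6", "8", "7", "6", "6"],       # majority "6"  (wrong)
]
# Hypothetical single greedy answer per problem.
greedy = ["4", "12", "6"]

greedy_acc = accuracy(greedy, references)
maj_acc = accuracy([majority_vote(s) for s in sampled], references)
print(f"greedy: {greedy_acc:.2f}  majority@5: {maj_acc:.2f}")
```

On this toy data the voted answers solve two of three problems while the greedy answers solve one, which is why voting-based rows are best compared only against other voting-based rows.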

All Papers (132)

Model	Paper	Year	Code
Qwen2.5-Math-72B-Instruct(TIR,Greedy)	Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement	2024	-
GPT-4 Turbo (MACM, w/code, voting)	MACM: Utilizing a Multi-Agent System for Condition Mining in Solving Complex Mathematical Problems	2024	bin123apple/macm
Qwen2.5-Math-72B-Instruct(COT,Greedy)	Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement	2024	-
Qwen2.5-Math-7B-Instruct(TIR,Greedy)	Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement	2024	-
GPT-4-code model (CSV, w/ code, SC, k=16)	Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification	2023	kipok/nemo-skills
Qwen2-Math-72B-Instruct(greedy)	Qwen2 Technical Report	2024	qwenlm/qwen1.5 qwenlm/qwen2
Qwen2.5-Math-7B-Instruct(COT,Greedy)	Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement	2024	-
Qwen2.5-Math-1.5B-Instruct(TIR,Greedy)	Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement	2024	-
OpenMath2-Llama3.1-70B (majority@256)	OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data	2024	NVIDIA/NeMo-Skills
OpenMath2-Llama3.1-8B (majority@256)	OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data	2024	NVIDIA/NeMo-Skills
Qwen2.5-Math-1.5B-Instruct(COT,Greedy)	Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement	2024	-
GPT-4-code model (CSV, w/ code)	Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification	2023	kipok/nemo-skills
CR (GPT-4-turbo model, w/ code)	Cumulative Reasoning with Large Language Models	2023	iiis-ai/cumulative-reasoning
OpenMath2-Llama3.1-70B	OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data	2024	NVIDIA/NeMo-Skills
LogicNet (with code interpreter)	Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification	2023	kipok/nemo-skills
Qwen2-72B-Instruct-Step-DPO (0-shot CoT, w/o code)	Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs	2024	dvlab-research/step-dpo
GPT-4-code model (w/ code)	Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification	2023	kipok/nemo-skills
OpenMath2-Llama3.1-8B	OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data	2024	NVIDIA/NeMo-Skills
AlphaMath-7B-SBS@3	AlphaMath Almost Zero: Process Supervision without Process	2024	MARIO-Math-Reasoning/Super_MARIO
Minerva 62B (maj5@256)	Solving Quantitative Reasoning Problems with Language Models	2022	gair-nlp/abel
MMOS-DeepSeekMath-7B(0-shot,k=50)	An Empirical Study of Data Ability Boundary in LLMs' Math Reasoning	2024	cyzhh/MMOS
GPT-4-code model (w/o code)	Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification	2023	kipok/nemo-skills
OpenMath-CodeLlama-70B (w/ code, SC, k=50)	OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset	2024	kipok/nemo-skills
OpenMath-CodeLlama-34B (w/ code, SC, k=50)	OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset	2024	kipok/nemo-skills
ToRA-Code 34B model (w/ code, SC, k=50)	ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving	2023	microsoft/tora
DeepSeekMATH-RL-7B (w/ code, greedy decoding)	DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models	2024	shibing624/medicalgpt deepseek-ai/deepseek-math
OpenMath-Llama2-70B (w/ code, SC, k=50)	OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset	2024	kipok/nemo-skills
CR (GPT-4 model, w/o code)	Cumulative Reasoning with Large Language Models	2023	iiis-ai/cumulative-reasoning
OpenMath-CodeLlama-13B (w/ code, SC, k=50)	OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset	2024	kipok/nemo-skills
OpenMath-Mistral-7B (w/ code, SC, k=50)	OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset	2024	kipok/nemo-skills
ToRA 70B (w/ code, SC, k=50)	ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving	2023	microsoft/tora
SKiC (GPT-4 model)	Skills-in-Context Prompting: Unlocking Compositionality in Large Language Models	2023	-
DART-Math-Llama3-70B-Prop2Diff (0-shot CoT, w/o code)	DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving	2024	hkust-nlp/dart-math
OpenMath-CodeLlama-7B (w/ code, SC, k=50)	OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset	2024	kipok/nemo-skills
MMOS-DeepSeekMath-7B(0-shot)	An Empirical Study of Data Ability Boundary in LLMs' Math Reasoning	2024	cyzhh/MMOS
DART-Math-Llama3-70B-Uniform (0-shot CoT, w/o code)	DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving	2024	hkust-nlp/dart-math
PHP (GPT-4 model)	Progressive-Hint Prompting Improves Reasoning in Large Language Models	2023	chuanyang-Zheng/Progressive-Hint
DART-Math-DSMath-7B-Prop2Diff (0-shot CoT, w/o code)	DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving	2024	hkust-nlp/dart-math
Gemini Ultra (4-shot)	Gemini: A Family of Highly Capable Multimodal Models	2023	valdecy/pybibx
DART-Math-DSMath-7B-Uniform (0-shot CoT, w/o code)	DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving	2024	hkust-nlp/dart-math
GPT-4 model (w/ code, PAL)	PAL: Program-aided Language Models	2022	srush/minichain RUCAIBox/LLMBox allanj/dynamic-pal
DeepSeekMATH-RL-7B (greedy decoding)	DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models	2024	shibing624/medicalgpt deepseek-ai/deepseek-math
AlphaLLM (with MCTS)	Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing	2024	yetianjhu/alphallm
ToRA-Code 34B (w/ code)	ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving	2023	microsoft/tora
OpenMath-CodeLlama-70B (w/ code)	OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset	2024	kipok/nemo-skills
Minerva 540B (maj1@k, k=64)	Solving Quantitative Reasoning Problems with Language Models	2022	gair-nlp/abel
ToRA 70B (w/ code)	ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving	2023	microsoft/tora
MMOS-CODE-34B(0-shot)	An Empirical Study of Data Ability Boundary in LLMs' Math Reasoning	2024	cyzhh/MMOS
DeepSeekMath-7B-KPMath-Plus	Key-Point-Driven Data Synthesis with its Enhancement on Mathematical Reasoning	2024	-
PaLM 2 (few-shot, k=4, SC)	PaLM 2 Technical Report	2023	eternityyw/tram-benchmark
Llemma-34B-KPMath-Plus	Key-Point-Driven Data Synthesis with its Enhancement on Mathematical Reasoning	2024	-
OpenMath-CodeLlama-34B (w/ code)	OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset	2024	kipok/nemo-skills
Shepherd + DeepSeek-67B (SFT on MetaMATH + PRM rerank, k=256)	Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations	2023	hkust-nlp/b-star chang-github-00/llm-predictive-decoding peiyi9979/Math-Shepherd
ToRA-Code 13B (w/ code)	ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving	2023	microsoft/tora
Minerva 8B (maj5@256)	Solving Quantitative Reasoning Problems with Language Models	2022	gair-nlp/abel
Mistral-7B-KPMath-Plus	Key-Point-Driven Data Synthesis with its Enhancement on Mathematical Reasoning	2024	-
DART-Math-Llama3-8B-Prop2Diff (0-shot CoT, w/o code)	DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving	2024	hkust-nlp/dart-math
OpenMath-Llama2-70B (w/ code)	OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset	2024	kipok/nemo-skills
OpenMath-CodeLlama-13B (w/ code)	OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset	2024	kipok/nemo-skills
DART-Math-Mistral-7B-Prop2Diff (0-shot CoT, w/o code)	DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving	2024	hkust-nlp/dart-math
DART-Math-Llama3-8B-Uniform (0-shot CoT, w/o code)	DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving	2024	hkust-nlp/dart-math
MathCoder-CL-34B	MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning	2023	mathllm/mathcoder
MathCoder-L-34B	MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning	2023	mathllm/mathcoder
MMIQC-72B	Augmenting Math Word Problems via Iterative Question Composing	2024	iiis-ai/iterativequestioncomposing
ToRA-Code 7B (w/ code)	ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving	2023	microsoft/tora
OpenMath-Mistral-7B (w/ code)	OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset	2024	kipok/nemo-skills
MMOS-CODE-7B(0-shot)	An Empirical Study of Data Ability Boundary in LLMs' Math Reasoning	2024	cyzhh/MMOS
OpenMath-CodeLlama-7B (w/ code)	OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset	2024	kipok/nemo-skills
Shepherd+Mistral-7B (SFT on MetaMATH + PRM RL+ PRM rerank, k=256)	Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations	2023	hkust-nlp/b-star chang-github-00/llm-predictive-decoding peiyi9979/Math-Shepherd
DART-Math-Mistral-7B-Uniform (0-shot CoT, w/o code)	DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving	2024	hkust-nlp/dart-math
Minerva 62B (maj1@k, k=64)	Solving Quantitative Reasoning Problems with Language Models	2022	gair-nlp/abel
ToRA 13B (w/ code)	ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving	2023	microsoft/tora
GPT-4	Sparks of Artificial General Intelligence: Early experiments with GPT-4	2023	microsoft/guidance gammatauai/leetcode-hard-gym emrgnt-cmplxty/zero-shot-replication
Llama2-13B-KPMath-Plus	Key-Point-Driven Data Synthesis with its Enhancement on Mathematical Reasoning	2024	-
ToRA 7B (w/ code)	ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving	2023	microsoft/tora
MathCoder-CL-13B	MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning	2023	mathllm/mathcoder
MuggleMATH-70B	MuggleMath: Assessing the Impact of Query and Response Augmentation on Math Reasoning	2023	ofa-sys/gsm8k-screl
PaLM 2 (few-shot, k=4, CoT)	PaLM 2 Technical Report	2023	eternityyw/tram-benchmark
Minerva 540B	Solving Quantitative Reasoning Problems with Language Models	2022	gair-nlp/abel
Minerva 540B (5-shot) mCoT	Galactica: A Large Language Model for Science	2022	paperswithcode/galai
Shepherd + Mistral-7B (SFT on MetaMATH + PRM RL)	Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations	2023	hkust-nlp/b-star chang-github-00/llm-predictive-decoding peiyi9979/Math-Shepherd
WizardMath-7B-V1.1	WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct	2023	nlpxucan/wizardlm
Gemini Pro (4-shot)	Gemini: A Family of Highly Capable Multimodal Models	2023	valdecy/pybibx
MuggleMATH-13B	MuggleMath: Assessing the Impact of Query and Response Augmentation on Math Reasoning	2023	ofa-sys/gsm8k-screl
MathCoder-CL-7B	MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning	2023	mathllm/mathcoder
MathCoder-L-13B	MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning	2023	mathllm/mathcoder
Qwen2idae-16x14B (4-shot)	Parameter-Efficient Sparsity Crafting from Dense to Mixture-of-Experts for Instruction Tuning on General Tasks	2024	wuhy68/parameter-efficient-moe ShayekhBinIslam/openrag
OpenChat-3.5-1210 7B	OpenChat: Advancing Open-source Language Models with Mixed-Quality Data	2023	imoneoi/openchat
OpenChat-3.5 7B	OpenChat: Advancing Open-source Language Models with Mixed-Quality Data	2023	imoneoi/openchat
Mixtral 8x7B (maj@4)	Mixtral of Experts	2024	jingyaogong/minimind hit-scir/chinese-mixtral-8x7b
Minerva 62B (4-shot)	Solving Quantitative Reasoning Problems with Language Models	2022	gair-nlp/abel
MetaMath 70B	MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models	2023	meta-math/MetaMath
MuggleMATH 7B	MuggleMath: Assessing the Impact of Query and Response Augmentation on Math Reasoning	2023	ofa-sys/gsm8k-screl
Minerva 8B (maj1@k, k=64)	Solving Quantitative Reasoning Problems with Language Models	2022	gair-nlp/abel
MathCoder-L-7B	MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning	2023	mathllm/mathcoder
WizardMath-70B-V1.0	WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct	2023	nlpxucan/wizardlm
Camelidae-8×34B (4-shot)	Parameter-Efficient Sparsity Crafting from Dense to Mixture-of-Experts for Instruction Tuning on General Tasks	2024	wuhy68/parameter-efficient-moe ShayekhBinIslam/openrag
MetaMath 13B	MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models	2023	meta-math/MetaMath
LLaMA 65B (maj1@k)	LLaMA: Open and Efficient Foundation Language Models	2023	huggingface/transformers ggml-org/llama.cpp
GAL 120B (5-shot) mCoT	Galactica: A Large Language Model for Science	2022	paperswithcode/galai
MetaMath 7B	MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models	2023	meta-math/MetaMath
davinci-002 175B	Solving Quantitative Reasoning Problems with Language Models	2022	gair-nlp/abel
Branch-Train-MiX 4x7B (sampling top-2 experts)	Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM	2024	Leeroo-AI/mergoo
GAL 120B <work>	Galactica: A Large Language Model for Science	2022	paperswithcode/galai
LLaMA 33B-maj1@k	LLaMA: Open and Efficient Foundation Language Models	2023	huggingface/transformers ggml-org/llama.cpp
Minerva 8B	Solving Quantitative Reasoning Problems with Language Models	2022	gair-nlp/abel
WizardMath-13B-V1.0	WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct	2023	nlpxucan/wizardlm
Mistral 7B (maj@4)	Mistral 7B	2023	mistralai/mistral-src facebookresearch/fairseq2
GAL 30B (5-shot) mCoT	Galactica: A Large Language Model for Science	2022	paperswithcode/galai
Mistral 7B (maj@4)	Mixtral of Experts	2024	jingyaogong/minimind hit-scir/chinese-mixtral-8x7b
GAL 30B <work>	Galactica: A Large Language Model for Science	2022	paperswithcode/galai
WizardMath-7B-V1.0	WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct	2023	nlpxucan/wizardlm
LLaMA 65B	LLaMA: Open and Efficient Foundation Language Models	2023	huggingface/transformers ggml-org/llama.cpp
PaLM 540B	Solving Quantitative Reasoning Problems with Language Models	2022	gair-nlp/abel
PaLM 540B (5-shot) mCoT	Galactica: A Large Language Model for Science	2022	paperswithcode/galai
LLaMA 13B-maj1@k	LLaMA: Open and Efficient Foundation Language Models	2023	huggingface/transformers ggml-org/llama.cpp
LLaMA 33B	LLaMA: Open and Efficient Foundation Language Models	2023	huggingface/transformers ggml-org/llama.cpp
LLaMA 7B-maj1@k	LLaMA: Open and Efficient Foundation Language Models	2023	huggingface/transformers ggml-org/llama.cpp
GPT-2 (1.5B)	Measuring Mathematical Problem Solving With the MATH Dataset	2021	hendrycks/math openai/minif2f
GPT-2 (0.7B)	Measuring Mathematical Problem Solving With the MATH Dataset	2021	hendrycks/math openai/minif2f
GPT-2 (0.3B)	Measuring Mathematical Problem Solving With the MATH Dataset	2021	hendrycks/math openai/minif2f
GPT-3 13B	Measuring Mathematical Problem Solving With the MATH Dataset	2021	hendrycks/math openai/minif2f
PaLM 8B (fine-tuned)	Solving Quantitative Reasoning Problems with Language Models	2022	gair-nlp/abel
GPT-2 (0.1B)	Measuring Mathematical Problem Solving With the MATH Dataset	2021	hendrycks/math openai/minif2f
GPT-3-175B (few-shot)	Measuring Mathematical Problem Solving With the MATH Dataset	2021	hendrycks/math openai/minif2f
GPT-3 175B (8-shot)	Galactica: A Large Language Model for Science	2022	paperswithcode/galai
PaLM 62B	Solving Quantitative Reasoning Problems with Language Models	2022	gair-nlp/abel
LLaMA 13B	LLaMA: Open and Efficient Foundation Language Models	2023	huggingface/transformers ggml-org/llama.cpp
GPT-3-13B (few-shot)	Measuring Mathematical Problem Solving With the MATH Dataset	2021	hendrycks/math openai/minif2f
LLaMA 7B	LLaMA: Open and Efficient Foundation Language Models	2023	huggingface/transformers ggml-org/llama.cpp
GPT-3 2.7B	Measuring Mathematical Problem Solving With the MATH Dataset	2021	hendrycks/math openai/minif2f
PaLM 8B	Solving Quantitative Reasoning Problems with Language Models	2022	gair-nlp/abel

Model	Paper	Accuracy	Date
Qwen2.5-Math-72B-Instruct(TIR,Greedy)	Qwen2.5-Math Technical Report: Toward Mathematica…	88.10	2024-09-18
GPT-4 Turbo (MACM, w/code, voting)	MACM: Utilizing a Multi-Agent System for Conditio…	87.92	2024-04-06
Qwen2.5-Math-72B-Instruct(COT,Greedy)	Qwen2.5-Math Technical Report: Toward Mathematica…	85.90	2024-09-18
Qwen2.5-Math-7B-Instruct(TIR,Greedy)	Qwen2.5-Math Technical Report: Toward Mathematica…	85.20	2024-09-18
GPT-4-code model (CSV, w/ code, SC, k=16)	Solving Challenging Math Word Problems Using GPT-…	84.30	2023-08-15
Qwen2-Math-72B-Instruct(greedy)	Qwen2 Technical Report	84.00	2024-07-15
Qwen2.5-Math-7B-Instruct(COT,Greedy)	Qwen2.5-Math Technical Report: Toward Mathematica…	83.60	2024-09-18
Qwen2.5-Math-1.5B-Instruct(TIR,Greedy)	Qwen2.5-Math Technical Report: Toward Mathematica…	79.90	2024-09-18
OpenMath2-Llama3.1-70B (majority@256)	OpenMathInstruct-2: Accelerating AI for Math with…	79.60	2024-10-02
OpenMath2-Llama3.1-8B (majority@256)	OpenMathInstruct-2: Accelerating AI for Math with…	76.10	2024-10-02
Qwen2.5-Math-1.5B-Instruct(COT,Greedy)	Qwen2.5-Math Technical Report: Toward Mathematica…	75.80	2024-09-18
GPT-4-code model (CSV, w/ code)	Solving Challenging Math Word Problems Using GPT-…	73.50	2023-08-15
CR (GPT-4-turbo model, w/ code)	Cumulative Reasoning with Large Language Models	72.20	2023-08-08
OpenMath2-Llama3.1-70B	OpenMathInstruct-2: Accelerating AI for Math with…	71.90	2024-10-02
LogicNet (with code interpreter)	Solving Challenging Math Word Problems Using GPT-…	71.20	2023-08-15
Qwen2-72B-Instruct-Step-DPO (0-shot CoT, w/o code)	Step-DPO: Step-wise Preference Optimization for L…	70.80	2024-06-26
GPT-4-code model (w/ code)	Solving Challenging Math Word Problems Using GPT-…	69.70	2023-08-15
OpenMath2-Llama3.1-8B	OpenMathInstruct-2: Accelerating AI for Math with…	67.80	2024-10-02
AlphaMath-7B-SBS@3	AlphaMath Almost Zero: Process Supervision withou…	66.30	2024-05-06
Minerva 62B (maj5@256)	Solving Quantitative Reasoning Problems with Lang…	64.90	2022-06-29
MMOS-DeepSeekMath-7B(0-shot,k=50)	An Empirical Study of Data Ability Boundary in LL…	63.70	2024-02-23
GPT-4-code model (w/o code)	Solving Challenging Math Word Problems Using GPT-…	60.80	2023-08-15
OpenMath-CodeLlama-70B (w/ code, SC, k=50)	OpenMathInstruct-1: A 1.8 Million Math Instructio…	60.40	2024-02-15
OpenMath-CodeLlama-34B (w/ code, SC, k=50)	OpenMathInstruct-1: A 1.8 Million Math Instructio…	60.20	2024-02-15
ToRA-Code 34B model (w/ code, SC, k=50)	ToRA: A Tool-Integrated Reasoning Agent for Mathe…	60.00	2023-09-29
DeepSeekMATH-RL-7B (w/ code, greedy decoding)	DeepSeekMath: Pushing the Limits of Mathematical …	58.80	2024-02-05
OpenMath-Llama2-70B (w/ code, SC, k=50)	OpenMathInstruct-1: A 1.8 Million Math Instructio…	58.30	2024-02-15
CR (GPT-4 model, w/o code)	Cumulative Reasoning with Large Language Models	58.00	2023-08-08
OpenMath-CodeLlama-13B (w/ code, SC, k=50)	OpenMathInstruct-1: A 1.8 Million Math Instructio…	57.60	2024-02-15
OpenMath-Mistral-7B (w/ code, SC, k=50)	OpenMathInstruct-1: A 1.8 Million Math Instructio…	57.20	2024-02-15
ToRA 70B (w/ code, SC, k=50)	ToRA: A Tool-Integrated Reasoning Agent for Mathe…	56.90	2023-09-29
SKiC (GPT-4 model)	Skills-in-Context Prompting: Unlocking Compositio…	56.40	2023-08-01
DART-Math-Llama3-70B-Prop2Diff (0-shot CoT, w/o code)	DART-Math: Difficulty-Aware Rejection Tuning for …	56.10	2024-06-18
OpenMath-CodeLlama-7B (w/ code, SC, k=50)	OpenMathInstruct-1: A 1.8 Million Math Instructio…	55.60	2024-02-15
MMOS-DeepSeekMath-7B(0-shot)	An Empirical Study of Data Ability Boundary in LL…	55.00	2024-02-23
DART-Math-Llama3-70B-Uniform (0-shot CoT, w/o code)	DART-Math: Difficulty-Aware Rejection Tuning for …	54.90	2024-06-18
PHP (GPT-4 model)	Progressive-Hint Prompting Improves Reasoning in …	53.90	2023-04-19
DART-Math-DSMath-7B-Prop2Diff (0-shot CoT, w/o code)	DART-Math: Difficulty-Aware Rejection Tuning for …	53.60	2024-06-18
Gemini Ultra (4-shot)	Gemini: A Family of Highly Capable Multimodal Mod…	53.20	2023-12-19
DART-Math-DSMath-7B-Uniform (0-shot CoT, w/o code)	DART-Math: Difficulty-Aware Rejection Tuning for …	52.90	2024-06-18
GPT-4 model (w/ code, PAL)	PAL: Program-aided Language Models	51.80	2022-11-18
DeepSeekMATH-RL-7B (greedy decoding)	DeepSeekMath: Pushing the Limits of Mathematical …	51.70	2024-02-05
AlphaLLM (with MCTS)	Toward Self-Improvement of LLMs via Imagination, …	51.00	2024-04-18
ToRA-Code 34B (w/ code)	ToRA: A Tool-Integrated Reasoning Agent for Mathe…	50.80	2023-09-29
OpenMath-CodeLlama-70B (w/ code)	OpenMathInstruct-1: A 1.8 Million Math Instructio…	50.70	2024-02-15
Minerva 540B (maj1@k, k=64)	Solving Quantitative Reasoning Problems with Lang…	50.30	2022-06-29
ToRA 70B (w/ code)	ToRA: A Tool-Integrated Reasoning Agent for Mathe…	49.70	2023-09-29
MMOS-CODE-34B(0-shot)	An Empirical Study of Data Ability Boundary in LL…	49.50	2024-02-23
DeepSeekMath-7B-KPMath-Plus	Key-Point-Driven Data Synthesis with its Enhancem…	48.80	2024-03-04
PaLM 2 (few-shot, k=4, SC)	PaLM 2 Technical Report	48.80	2023-05-17
Llemma-34B-KPMath-Plus	Key-Point-Driven Data Synthesis with its Enhancem…	48.60	2024-03-04
OpenMath-CodeLlama-34B (w/ code)	OpenMathInstruct-1: A 1.8 Million Math Instructio…	48.30	2024-02-15
Shepherd + DeepSeek-67B (SFT on MetaMATH + PRM rerank, k=256)	Math-Shepherd: Verify and Reinforce LLMs Step-by-…	48.10	2023-12-14
ToRA-Code 13B (w/ code)	ToRA: A Tool-Integrated Reasoning Agent for Mathe…	48.10	2023-09-29
Minerva 8B (maj5@256)	Solving Quantitative Reasoning Problems with Lang…	47.60	2022-06-29
Mistral-7B-KPMath-Plus	Key-Point-Driven Data Synthesis with its Enhancem…	46.80	2024-03-04
DART-Math-Llama3-8B-Prop2Diff (0-shot CoT, w/o code)	DART-Math: Difficulty-Aware Rejection Tuning for …	46.60	2024-06-18
OpenMath-Llama2-70B (w/ code)	OpenMathInstruct-1: A 1.8 Million Math Instructio…	46.30	2024-02-15
OpenMath-CodeLlama-13B (w/ code)	OpenMathInstruct-1: A 1.8 Million Math Instructio…	45.50	2024-02-15
DART-Math-Mistral-7B-Prop2Diff (0-shot CoT, w/o code)	DART-Math: Difficulty-Aware Rejection Tuning for …	45.50	2024-06-18
DART-Math-Llama3-8B-Uniform (0-shot CoT, w/o code)	DART-Math: Difficulty-Aware Rejection Tuning for …	45.30	2024-06-18
MathCoder-CL-34B	MathCoder: Seamless Code Integration in LLMs for …	45.20	2023-10-05
MathCoder-L-34B	MathCoder: Seamless Code Integration in LLMs for …	45.10	2023-10-05
MMIQC-72B	Augmenting Math Word Problems via Iterative Quest…	45.00	2024-01-17
ToRA-Code 7B (w/ code)	ToRA: A Tool-Integrated Reasoning Agent for Mathe…	44.60	2023-09-29
OpenMath-Mistral-7B (w/ code)	OpenMathInstruct-1: A 1.8 Million Math Instructio…	44.50	2024-02-15
MMOS-CODE-7B(0-shot)	An Empirical Study of Data Ability Boundary in LL…	44.30	2024-02-23
OpenMath-CodeLlama-7B (w/ code)	OpenMathInstruct-1: A 1.8 Million Math Instructio…	43.60	2024-02-15
Shepherd+Mistral-7B (SFT on MetaMATH + PRM RL+ PRM rerank, k=256)	Math-Shepherd: Verify and Reinforce LLMs Step-by-…	43.50	2023-12-14
DART-Math-Mistral-7B-Uniform (0-shot CoT, w/o code)	DART-Math: Difficulty-Aware Rejection Tuning for …	43.50	2024-06-18
Minerva 62B (maj1@k, k=64)	Solving Quantitative Reasoning Problems with Lang…	43.40	2022-06-29
ToRA 13B (w/ code)	ToRA: A Tool-Integrated Reasoning Agent for Mathe…	43.00	2023-09-29
GPT-4	Sparks of Artificial General Intelligence: Early …	42.50	2023-03-22
Llama2-13B-KPMath-Plus	Key-Point-Driven Data Synthesis with its Enhancem…	41.00	2024-03-04
ToRA 7B (w/ code)	ToRA: A Tool-Integrated Reasoning Agent for Mathe…	40.10	2023-09-29
MathCoder-CL-13B	MathCoder: Seamless Code Integration in LLMs for …	35.90	2023-10-05
MuggleMATH-70B	MuggleMath: Assessing the Impact of Query and Res…	35.60	2023-10-09
PaLM 2 (few-shot, k=4, CoT)	PaLM 2 Technical Report	34.30	2023-05-17
Minerva 540B	Solving Quantitative Reasoning Problems with Lang…	33.60	2022-06-29
Minerva 540B (5-shot) mCoT	Galactica: A Large Language Model for Science	33.60	2022-11-16
Shepherd + Mistral-7B (SFT on MetaMATH + PRM RL)	Math-Shepherd: Verify and Reinforce LLMs Step-by-…	33.00	2023-12-14
WizardMath-7B-V1.1	WizardMath: Empowering Mathematical Reasoning for…	33.00	2023-08-18
Gemini Pro (4-shot)	Gemini: A Family of Highly Capable Multimodal Mod…	32.60	2023-12-19
MuggleMATH-13B	MuggleMath: Assessing the Impact of Query and Res…	30.70	2023-10-09
MathCoder-CL-7B	MathCoder: Seamless Code Integration in LLMs for …	30.20	2023-10-05
MathCoder-L-13B	MathCoder: Seamless Code Integration in LLMs for …	29.90	2023-10-05
Qwen2idae-16x14B (4-shot)	Parameter-Efficient Sparsity Crafting from Dense …	29.90	2024-01-05
OpenChat-3.5-1210 7B	OpenChat: Advancing Open-source Language Models w…	28.90	2023-09-20
OpenChat-3.5 7B	OpenChat: Advancing Open-source Language Models w…	28.60	2023-09-20
Mixtral 8x7B (maj@4)	Mixtral of Experts	28.40	2024-01-08
Minerva 62B (4-shot)	Solving Quantitative Reasoning Problems with Lang…	27.60	2022-06-29
MetaMath 70B	MetaMath: Bootstrap Your Own Mathematical Questio…	26.00	2023-09-21
MuggleMATH 7B	MuggleMath: Assessing the Impact of Query and Res…	25.80	2023-10-09
Minerva 8B (maj1@k, k=64)	Solving Quantitative Reasoning Problems with Lang…	25.40	2022-06-29
MathCoder-L-7B	MathCoder: Seamless Code Integration in LLMs for …	23.30	2023-10-05
WizardMath-70B-V1.0	WizardMath: Empowering Mathematical Reasoning for…	22.70	2023-08-18
Camelidae-8×34B (4-shot)	Parameter-Efficient Sparsity Crafting from Dense …	22.60	2024-01-05
MetaMath 13B	MetaMath: Bootstrap Your Own Mathematical Questio…	22.50	2023-09-21
LLaMA 65B (maj1@k)	LLaMA: Open and Efficient Foundation Language Mod…	20.50	2023-02-27
GAL 120B (5-shot) mCoT	Galactica: A Large Language Model for Science	20.40	2022-11-16
MetaMath 7B	MetaMath: Bootstrap Your Own Mathematical Questio…	19.40	2023-09-21
davinci-002 175B	Solving Quantitative Reasoning Problems with Lang…	19.10	2022-06-29
Branch-Train-MiX 4x7B (sampling top-2 experts)	Branch-Train-MiX: Mixing Expert LLMs into a Mixtu…	17.80	2024-03-12
GAL 120B <work>	Galactica: A Large Language Model for Science	16.60	2022-11-16
LLaMA 33B-maj1@k	LLaMA: Open and Efficient Foundation Language Mod…	15.20	2023-02-27
Minerva 8B	Solving Quantitative Reasoning Problems with Lang…	14.10	2022-06-29
WizardMath-13B-V1.0	WizardMath: Empowering Mathematical Reasoning for…	14.00	2023-08-18
Mistral 7B (maj@4)	Mistral 7B	13.10	2023-10-10
GAL 30B (5-shot) mCoT	Galactica: A Large Language Model for Science	12.70	2022-11-16
Mistral 7B (maj@4)	Mixtral of Experts	12.70	2024-01-08
GAL 30B <work>	Galactica: A Large Language Model for Science	11.40	2022-11-16
WizardMath-7B-V1.0	WizardMath: Empowering Mathematical Reasoning for…	10.70	2023-08-18
LLaMA 65B	LLaMA: Open and Efficient Foundation Language Mod…	10.60	2023-02-27
PaLM 540B	Solving Quantitative Reasoning Problems with Lang…	8.80	2022-06-29
PaLM 540B (5-shot) mCoT	Galactica: A Large Language Model for Science	8.80	2022-11-16
LLaMA 13B-maj1@k	LLaMA: Open and Efficient Foundation Language Mod…	8.80	2023-02-27
LLaMA 33B	LLaMA: Open and Efficient Foundation Language Mod…	7.10	2023-02-27
LLaMA 7B-maj1@k	LLaMA: Open and Efficient Foundation Language Mod…	6.90	2023-02-27
GPT-2 (1.5B)	Measuring Mathematical Problem Solving With the M…	6.90	2021-03-05
GPT-2 (0.7B)	Measuring Mathematical Problem Solving With the M…	6.40	2021-03-05
GPT-2 (0.3B)	Measuring Mathematical Problem Solving With the M…	6.20	2021-03-05
GPT-3 13B	Measuring Mathematical Problem Solving With the M…	5.60	2021-03-05
PaLM 8B (fine-tuned)	Solving Quantitative Reasoning Problems with Lang…	5.60	2022-06-29
GPT-2 (0.1B)	Measuring Mathematical Problem Solving With the M…	5.40	2021-03-05
GPT-3-175B (few-shot)	Measuring Mathematical Problem Solving With the M…	5.20	2021-03-05
GPT-3 175B (8-shot)	Galactica: A Large Language Model for Science	5.20	2022-11-16
PaLM 62B	Solving Quantitative Reasoning Problems with Lang…	4.40	2022-06-29
LLaMA 13B	LLaMA: Open and Efficient Foundation Language Mod…	3.90	2023-02-27
GPT-3-13B (few-shot)	Measuring Mathematical Problem Solving With the M…	3.00	2021-03-05
LLaMA 7B	LLaMA: Open and Efficient Foundation Language Mod…	2.90	2023-02-27
GPT-3 2.7B	Measuring Mathematical Problem Solving With the M…	2.90	2021-03-05
PaLM 8B	Solving Quantitative Reasoning Problems with Lang…	1.50	2022-06-29
