| Claude 3.5 Sonnet (HPT) | Hierarchical Prompting Taxonomy: A Universal Eval… | 97.72 | 2024-06-18 |
| DUP prompt upon GPT-4 | Achieving >97% on GSM8K: Deeply Understanding the… | 97.10 | 2024-04-23 |
| Qwen2-Math-72B-Instruct (greedy) | Qwen2 Technical Report | 96.70 | 2024-07-15 |
| OpenMath2-Llama3.1-70B (majority@256) | OpenMathInstruct-2: Accelerating AI for Math with… | 96.00 | 2024-10-02 |
| OpenMath2-Llama3.1-70B | OpenMathInstruct-2: Accelerating AI for Math with… | 94.90 | 2024-10-02 |
| GPT-4 (Teaching-Inspired) | Teaching-Inspired Integrated Prompting Framework:… | 94.80 | 2024-10-10 |
| OpenMath2-Llama3.1-8B (majority@256) | OpenMathInstruct-2: Accelerating AI for Math with… | 94.10 | 2024-10-02 |
| Qwen2-72B-Instruct-Step-DPO (0-shot CoT) | Step-DPO: Step-wise Preference Optimization for L… | 94.00 | 2024-06-26 |
| AlphaLLM (with MCTS) | Toward Self-Improvement of LLMs via Imagination, … | 92.00 | 2024-04-18 |
| OpenMath2-Llama3.1-8B | OpenMathInstruct-2: Accelerating AI for Math with… | 91.70 | 2024-10-02 |
| PaLM 2 (few-shot, k=8, SC) | PaLM 2 Technical Report | 91.00 | 2023-05-17 |
| GaC (Qwen2-72B-Instruct + Llama-3-70B-Instruct) | Breaking the Ceiling of the LLM Community by Trea… | 90.91 | 2024-06-18 |
| OpenMath-CodeLlama-70B (w/ code, SC, k=50) | OpenMathInstruct-1: A 1.8 Million Math Instructio… | 90.80 | 2024-02-15 |
| DART-Math-Llama3-70B-Uniform (0-shot CoT, w/o code) | DART-Math: Difficulty-Aware Rejection Tuning for … | 90.40 | 2024-06-18 |
| OpenMath-Llama2-70B (w/ code, SC, k=50) | OpenMathInstruct-1: A 1.8 Million Math Instructio… | 90.10 | 2024-02-15 |
| DART-Math-Llama3-70B-Prop2Diff (0-shot CoT, w/o code) | DART-Math: Difficulty-Aware Rejection Tuning for … | 89.60 | 2024-06-18 |
| Shepherd+Mistral-7B (SFT on MetaMATH + PRM RL + PRM rerank, k=256) | Math-Shepherd: Verify and Reinforce LLMs Step-by-… | 89.10 | 2023-12-14 |
| Minerva 62B (maj5@100) | Solving Quantitative Reasoning Problems with Lang… | 89.00 | 2022-06-29 |
| ToRA-70B (SC, k=50) | ToRA: A Tool-Integrated Reasoning Agent for Mathe… | 88.30 | 2023-09-29 |
| DeepSeekMATH-RL-7B | DeepSeekMath: Pushing the Limits of Mathematical … | 88.20 | 2024-02-05 |
| DART-Math-DSMath-7B-Uniform (0-shot CoT, w/o code) | DART-Math: Difficulty-Aware Rejection Tuning for … | 88.20 | 2024-06-18 |
| OpenMath-CodeLlama-34B (w/ code, SC, k=50) | OpenMathInstruct-1: A 1.8 Million Math Instructio… | 88.00 | 2024-02-15 |
| DeepMind 70B Model (SFT+ORM-RL, ORM reranking) | Solving math word problems with process- and outc… | 87.30 | 2022-11-25 |
| MMOS-DeepSeekMath-7B (0-shot, k=50) | An Empirical Study of Data Ability Boundary in LL… | 87.20 | 2024-02-23 |
| DeepMind 70B Model (SFT+PRM-RL, PRM reranking) | Solving math word problems with process- and outc… | 87.10 | 2022-11-25 |
| GPT-4 | Sparks of Artificial General Intelligence: Early … | 87.10 | 2023-03-22 |
| OpenMath-Mistral-7B (w/ code, SC, k=50) | OpenMathInstruct-1: A 1.8 Million Math Instructio… | 86.90 | 2024-02-15 |
| Orca-Math 7B (fine-tuned) | Orca-Math: Unlocking the potential of SLMs in Gra… | 86.80 | 2024-02-16 |
| DART-Math-DSMath-7B-Prop2Diff (0-shot CoT, w/o code) | DART-Math: Difficulty-Aware Rejection Tuning for … | 86.80 | 2024-06-18 |
| OpenMath-CodeLlama-13B (w/ code, SC, k=50) | OpenMathInstruct-1: A 1.8 Million Math Instructio… | 86.80 | 2024-02-15 |
| Gemini Pro (maj1@32) | Gemini: A Family of Highly Capable Multimodal Mod… | 86.50 | 2023-12-19 |
| ToRA-Code-34B (SC, k=50) | ToRA: A Tool-Integrated Reasoning Agent for Mathe… | 85.10 | 2023-09-29 |
| OpenMath-CodeLlama-7B (w/ code, SC, k=50) | OpenMathInstruct-1: A 1.8 Million Math Instructio… | 84.80 | 2024-02-15 |
| OVM-Mistral-7B (verify100@1) | OVM, Outcome-supervised Value Models for Planning… | 84.70 | 2023-11-16 |
| OpenMath-Llama2-70B (w/ code) | OpenMathInstruct-1: A 1.8 Million Math Instructio… | 84.70 | 2024-02-15 |
| OpenMath-CodeLlama-70B (w/ code) | OpenMathInstruct-1: A 1.8 Million Math Instructio… | 84.60 | 2024-02-15 |
| code-davinci-002 175B (LEVER, 8-shot) | LEVER: Learning to Verify Language-to-Code Genera… | 84.50 | 2023-02-16 |
| ToRA 70B | ToRA: A Tool-Integrated Reasoning Agent for Mathe… | 84.30 | 2023-09-29 |
| Shepherd + Mistral-7B (SFT on MetaMATH + PRM RL) | Math-Shepherd: Verify and Reinforce LLMs Step-by-… | 84.10 | 2023-12-14 |
| MathCoder-L-70B | MathCoder: Seamless Code Integration in LLMs for … | 83.90 | 2023-10-05 |
| WizardMath-7B-V1.1 | WizardMath: Empowering Mathematical Reasoning for… | 83.20 | 2023-08-18 |
| DIVERSE 175B (8-shot) | Making Large Language Models Better Reasoners wit… | 83.20 | 2022-06-06 |
| OVM-Mistral-7B (verify20@1) | OVM, Outcome-supervised Value Models for Planning… | 82.60 | 2023-11-16 |
| DART-Math-Mistral-7B-Uniform (0-shot CoT, w/o code) | DART-Math: Difficulty-Aware Rejection Tuning for … | 82.60 | 2024-06-18 |
| ChatGPT (Ask, Refine, Trust) | The ART of LLM Refinement: Ask, Refine, and Trust | 82.60 | 2023-11-14 |
| DART-Math-Llama3-8B-Uniform (0-shot CoT, w/o code) | DART-Math: Difficulty-Aware Rejection Tuning for … | 82.50 | 2024-06-18 |
| MetaMath 70B | MetaMath: Bootstrap Your Own Mathematical Questio… | 82.30 | 2023-09-21 |
| MuggleMATH 70B | MuggleMath: Assessing the Impact of Query and Res… | 82.30 | 2023-10-09 |
| PaLM 540B (Self Improvement, Self Consistency) | Large Language Models Can Self-Improve | 82.10 | 2022-10-20 |
| MathCoder-CL-34B | MathCoder: Seamless Code Integration in LLMs for … | 81.70 | 2023-10-05 |
| WizardMath-70B-V1.0 | WizardMath: Empowering Mathematical Reasoning for… | 81.60 | 2023-08-18 |
| Phi-GSM+V 1.3B+1.3B (verify48@1) | TinyGSM: achieving >80% on GSM8k with small langu… | 81.50 | 2023-12-14 |
| DART-Math-Mistral-7B-Prop2Diff (0-shot CoT, w/o code) | DART-Math: Difficulty-Aware Rejection Tuning for … | 81.10 | 2024-06-18 |
| DART-Math-Llama3-8B-Prop2Diff (0-shot CoT, w/o code) | DART-Math: Difficulty-Aware Rejection Tuning for … | 81.10 | 2024-06-18 |
| ToRA-Code 34B | ToRA: A Tool-Integrated Reasoning Agent for Mathe… | 80.70 | 2023-09-29 |
| OpenMath-CodeLlama-34B (w/ code) | OpenMathInstruct-1: A 1.8 Million Math Instructio… | 80.70 | 2024-02-15 |
| PaLM 2 (few-shot, k=8, CoT) | PaLM 2 Technical Report | 80.70 | 2023-05-17 |
| MMOS-DeepSeekMath-7B (0-shot) | An Empirical Study of Data Ability Boundary in LL… | 80.50 | 2024-02-23 |
| MMOS-CODE-34B (0-shot) | An Empirical Study of Data Ability Boundary in LL… | 80.40 | 2024-02-23 |
| OpenMath-Mistral-7B (w/ code) | OpenMathInstruct-1: A 1.8 Million Math Instructio… | 80.20 | 2024-02-15 |
| OpenMath-CodeLlama-13B (w/ code) | OpenMathInstruct-1: A 1.8 Million Math Instructio… | 78.80 | 2024-02-15 |
| Minerva 540B (CoT) | Solving Quantitative Reasoning Problems with Lang… | 78.50 | 2022-06-29 |
| Camelidae-8×34B (5-shot) | Parameter-Efficient Sparsity Crafting from Dense … | 78.30 | 2024-01-05 |
| Qwen2idae-16x14B (5-shot) | Parameter-Efficient Sparsity Crafting from Dense … | 77.80 | 2024-01-05 |
| MetaMath-Mistral-7B | MetaMath: Bootstrap Your Own Mathematical Questio… | 77.70 | 2023-09-21 |
| OpenChat-3.5 7B | OpenChat: Advancing Open-source Language Models w… | 77.30 | 2023-09-20 |
| DeepMind 70B Model (STaR, maj1@96) | Solving math word problems with process- and outc… | 76.50 | 2022-11-25 |
| OpenMath-CodeLlama-7B (w/ code) | OpenMathInstruct-1: A 1.8 Million Math Instructio… | 75.90 | 2024-02-15 |
| ToRA-Code 13B | ToRA: A Tool-Integrated Reasoning Agent for Mathe… | 75.80 | 2023-09-29 |
| PaLM 540B maj1@40 (8-shot) | Self-Consistency Improves Chain of Thought Reason… | 74.40 | 2022-03-21 |
| PaLM 540B (Self Consistency) | Large Language Models Can Self-Improve | 74.40 | 2022-10-20 |
| Phi-GSM 2.7B (fine-tuned) | TinyGSM: achieving >80% on GSM8k with small langu… | 74.30 | 2023-12-14 |
| MathCoder-CL-13B | MathCoder: Seamless Code Integration in LLMs for … | 74.10 | 2023-10-05 |
| MuggleMATH 13B | MuggleMath: Assessing the Impact of Query and Res… | 74.00 | 2023-10-09 |
| MMOS-CODE-7B (0-shot) | An Empirical Study of Data Ability Boundary in LL… | 73.90 | 2024-02-23 |
| CodeT5+ | CodeT5+: Open Code Large Language Models for Code… | 73.80 | 2023-05-13 |
| Llama-3.3-70B + CAPO | CAPO: Cost-Aware Prompt Optimization | 73.73 | 2025-04-22 |
| OVM-Llama2-7B (verify100@1) | OVM, Outcome-supervised Value Models for Planning… | 73.70 | 2023-11-16 |
| PaLM 540B (Self Improvement, CoT Prompting) | Large Language Models Can Self-Improve | 73.50 | 2022-10-20 |
| KwaiYiiMath 13B | KwaiYiiMath: Technical Report | 73.30 | 2023-10-11 |
| ToRA-Code 7B | ToRA: A Tool-Integrated Reasoning Agent for Mathe… | 72.60 | 2023-09-29 |
| MathCoder-L-13B | MathCoder: Seamless Code Integration in LLMs for … | 72.60 | 2023-10-05 |
| MetaMath 13B | MetaMath: Bootstrap Your Own Mathematical Questio… | 71.00 | 2023-09-21 |
| MuggleMATH 7B | MuggleMath: Assessing the Impact of Query and Res… | 69.80 | 2023-10-09 |
| LLaMA 65B-maj1@k | LLaMA: Open and Efficient Foundation Language Mod… | 69.70 | 2023-02-27 |
| Minerva 62B (maj1@100) | Solving Quantitative Reasoning Problems with Lang… | 68.50 | 2022-06-29 |
| code-davinci-002 (Least-to-Most Prompting) | Least-to-Most Prompting Enables Complex Reasoning… | 68.01 | 2022-05-21 |
| MathCoder-CL-7B | MathCoder: Seamless Code Integration in LLMs for … | 67.80 | 2023-10-05 |
| MetaMath 7B | MetaMath: Bootstrap Your Own Mathematical Questio… | 66.40 | 2023-09-21 |
| Mistral-Small-24B + CAPO | CAPO: Cost-Aware Prompt Optimization | 65.07 | 2025-04-22 |
| RFT 70B | Scaling Relationship on Learning Mathematical Rea… | 64.80 | 2023-08-03 |
| MathCoder-L-7B | MathCoder: Seamless Code Integration in LLMs for … | 64.20 | 2023-10-05 |
| WizardMath-13B-V1.0 | WizardMath: Empowering Mathematical Reasoning for… | 63.90 | 2023-08-18 |
| GPT-J (CoRe) | Solving Math Word Problems via Cooperative Reason… | 63.20 | 2022-10-28 |
| Llama-2 70B (on 100 first questions, 4-shot, auto-optimized prompting) | The Unreasonable Effectiveness of Eccentric Autom… | 61.00 | 2024-02-09 |
| Qwen2.5-32B + CAPO | CAPO: Cost-Aware Prompt Optimization | 60.20 | 2025-04-22 |
| LLaMA 2 70B (CoT-Influx) | Fewer is More: Boosting LLM Reasoning with Reinfo… | 59.59 | 2023-12-14 |
| Orca 2 13B | Orca 2: Teaching Small Language Models How to Rea… | 59.14 | 2023-11-18 |
| U-PaLM | Transcending Scaling Laws with 0.1% Extra Compute | 58.50 | 2022-10-20 |
| PaLM-540B (few-Shot-cot) | Large Language Models are Zero-Shot Reasoners | 58.10 | 2022-05-24 |
| GPT-3.5 (few-shot, k=5) | GPT-4 Technical Report | 57.10 | 2023-03-15 |
| Minerva 8B (maj5@100) | Solving Quantitative Reasoning Problems with Lang… | 56.80 | 2022-06-29 |
| LLaMA 2 70B (on-shot) | Llama 2: Open Foundation and Fine-Tuned Chat Mode… | 56.80 | 2023-07-18 |
| PaLM 540B (8-shot) | Solving Quantitative Reasoning Problems with Lang… | 56.50 | 2022-06-29 |
| PaLM 540B (CoT Prompting) | Large Language Models Can Self-Improve | 56.50 | 2022-10-20 |
| RFT 13B | Scaling Relationship on Learning Mathematical Rea… | 55.30 | 2023-08-03 |
| Finetuned GPT-3 175B + verifier | Large Language Models are Zero-Shot Reasoners | 55.00 | 2022-05-24 |
| WizardMath-7B-V1.0 | WizardMath: Empowering Mathematical Reasoning for… | 54.90 | 2023-08-18 |
| LLaMA 33B-maj1@k | LLaMA: Open and Efficient Foundation Language Mod… | 53.10 | 2023-02-27 |
| Minerva 62B (8-shot) | Solving Quantitative Reasoning Problems with Lang… | 52.40 | 2022-06-29 |
| Mistral 7B (maj@8) | Mistral 7B | 52.20 | 2023-10-10 |
| Llemma 34B | Llemma: An Open Language Model For Mathematics | 51.50 | 2023-10-16 |
| Text-davinci-002-175B (zero-plus-few-Shot-cot (8 samples)) | Large Language Models are Zero-Shot Reasoners | 51.50 | 2022-05-24 |
| RFT 7B | Scaling Relationship on Learning Mathematical Rea… | 51.20 | 2023-08-03 |
| LLaMA 65B | LLaMA: Open and Efficient Foundation Language Mod… | 50.90 | 2023-02-27 |
| Orca 2 7B | Orca 2: Teaching Small Language Models How to Rea… | 47.23 | 2023-11-18 |
| Llama-2 13B (on 100 first questions, 4-shot, auto-optimized prompting) | The Unreasonable Effectiveness of Eccentric Autom… | 43.00 | 2024-02-09 |
| text-davinci-002 175B (2-shot, CoT) | Large Language Models are Zero-Shot Reasoners | 41.30 | 2022-05-24 |
| Mistral 7B (on 100 first questions, 4-shot, auto-optimized prompting) | The Unreasonable Effectiveness of Eccentric Autom… | 41.00 | 2024-02-09 |
| text-davinci-002 175B (0-shot, CoT) | Large Language Models are Zero-Shot Reasoners | 40.70 | 2022-05-24 |
| Branch-Train-MiX 4x7B (sampling top-2 experts) | Branch-Train-MiX: Mixing Expert LLMs into a Mixtu… | 37.10 | 2024-03-12 |
| Llemma 7B | Llemma: An Open Language Model For Mathematics | 36.40 | 2023-10-16 |
| LLaMA 33B | LLaMA: Open and Efficient Foundation Language Mod… | 35.60 | 2023-02-27 |
| Vicuna (SYRELM) | Frugal LMs Trained to Invoke Symbolic Solvers Ach… | 35.20 | 2023-12-09 |
| PaLM 62B (8-shot) | Solving Quantitative Reasoning Problems with Lang… | 33.00 | 2022-06-29 |
| PaLM 540B (Self Improvement, Standard-Prompting) | Large Language Models Can Self-Improve | 32.20 | 2022-10-20 |
| LLaMA 13B-maj1@k | LLaMA: Open and Efficient Foundation Language Mod… | 29.30 | 2023-02-27 |
| Minerva 8B-maj1@k (8-shot) | Solving Quantitative Reasoning Problems with Lang… | 28.40 | 2022-06-29 |
| GPT-2-Medium 355M + question-solution classifier (BS=5) | Composing Ensembles of Pre-trained Models via Ite… | 20.80 | 2022-10-20 |
| GPT-Neo-2.7B + Self-Sampling | Learning Math Reasoning from Self-Sampled Correct… | 19.50 | 2022-05-28 |
| GPT-2-Medium 355M (fine-tuned, BS=5) | Composing Ensembles of Pre-trained Models via Ite… | 18.30 | 2022-10-20 |
| LLaMA 7B (maj1@k) | LLaMA: Open and Efficient Foundation Language Mod… | 18.10 | 2023-02-27 |
| PaLM 540B (few-shot) | Large Language Models are Zero-Shot Reasoners | 17.90 | 2022-05-24 |
| PaLM 540B (Standard-Prompting) | Large Language Models Can Self-Improve | 17.90 | 2022-10-20 |
| LLaMA 13B | LLaMA: Open and Efficient Foundation Language Mod… | 17.80 | 2023-02-27 |
| GPT-2-Medium 355M + question-solution classifier (BS=1) | Composing Ensembles of Pre-trained Models via Ite… | 16.80 | 2022-10-20 |
| Minerva 8B (8-shot) | Solving Quantitative Reasoning Problems with Lang… | 16.20 | 2022-06-29 |
| GPT-2-Medium 355M (BS=5) | Composing Ensembles of Pre-trained Models via Ite… | 12.20 | 2022-10-20 |
| LLaMA 7B | LLaMA: Open and Efficient Foundation Language Mod… | 11.00 | 2023-02-27 |
| Text-davinci-002-175B (0-shot) | Large Language Models are Zero-Shot Reasoners | 10.40 | 2022-05-24 |
| GPT-Neo 125M + Self-Sampling | Learning Math Reasoning from Self-Sampled Correct… | 7.50 | 2022-05-28 |
| UL2 20B (chain-of-thought) | UL2: Unifying Language Learning Paradigms | 4.40 | 2022-05-10 |
| PaLM 8B (8-shot) | Solving Quantitative Reasoning Problems with Lang… | 4.10 | 2022-06-29 |
| UL2 20B (0-shot) | UL2: Unifying Language Learning Paradigms | 4.10 | 2022-05-10 |