| Model | Paper | Score | Date |
|---|---|---|---|
| Qwen2.5-Math-72B-Instruct(TIR,Greedy) | Qwen2.5-Math Technical Report: Toward Mathematica… | 88.10 | 2024-09-18 |
| GPT-4 Turbo (MACM, w/code, voting) | MACM: Utilizing a Multi-Agent System for Conditio… | 87.92 | 2024-04-06 |
| Qwen2.5-Math-72B-Instruct(COT,Greedy) | Qwen2.5-Math Technical Report: Toward Mathematica… | 85.90 | 2024-09-18 |
| Qwen2.5-Math-7B-Instruct(TIR,Greedy) | Qwen2.5-Math Technical Report: Toward Mathematica… | 85.20 | 2024-09-18 |
| GPT-4-code model (CSV, w/ code, SC, k=16) | Solving Challenging Math Word Problems Using GPT-… | 84.30 | 2023-08-15 |
| Qwen2-Math-72B-Instruct(greedy) | Qwen2 Technical Report | 84.00 | 2024-07-15 |
| Qwen2.5-Math-7B-Instruct(COT,Greedy) | Qwen2.5-Math Technical Report: Toward Mathematica… | 83.60 | 2024-09-18 |
| Qwen2.5-Math-1.5B-Instruct(TIR,Greedy) | Qwen2.5-Math Technical Report: Toward Mathematica… | 79.90 | 2024-09-18 |
| OpenMath2-Llama3.1-70B (majority@256) | OpenMathInstruct-2: Accelerating AI for Math with… | 79.60 | 2024-10-02 |
| OpenMath2-Llama3.1-8B (majority@256) | OpenMathInstruct-2: Accelerating AI for Math with… | 76.10 | 2024-10-02 |
| Qwen2.5-Math-1.5B-Instruct(COT,Greedy) | Qwen2.5-Math Technical Report: Toward Mathematica… | 75.80 | 2024-09-18 |
| GPT-4-code model (CSV, w/ code) | Solving Challenging Math Word Problems Using GPT-… | 73.50 | 2023-08-15 |
| CR (GPT-4-turbo model, w/ code) | Cumulative Reasoning with Large Language Models | 72.20 | 2023-08-08 |
| OpenMath2-Llama3.1-70B | OpenMathInstruct-2: Accelerating AI for Math with… | 71.90 | 2024-10-02 |
| LogicNet (with code interpreter) | Solving Challenging Math Word Problems Using GPT-… | 71.20 | 2023-08-15 |
| Qwen2-72B-Instruct-Step-DPO (0-shot CoT, w/o code) | Step-DPO: Step-wise Preference Optimization for L… | 70.80 | 2024-06-26 |
| GPT-4-code model (w/ code) | Solving Challenging Math Word Problems Using GPT-… | 69.70 | 2023-08-15 |
| OpenMath2-Llama3.1-8B | OpenMathInstruct-2: Accelerating AI for Math with… | 67.80 | 2024-10-02 |
| AlphaMath-7B-SBS@3 | AlphaMath Almost Zero: Process Supervision withou… | 66.30 | 2024-05-06 |
| Minerva 62B (maj5@256) | Solving Quantitative Reasoning Problems with Lang… | 64.90 | 2022-06-29 |
| MMOS-DeepSeekMath-7B(0-shot,k=50) | An Empirical Study of Data Ability Boundary in LL… | 63.70 | 2024-02-23 |
| GPT-4-code model (w/o code) | Solving Challenging Math Word Problems Using GPT-… | 60.80 | 2023-08-15 |
| OpenMath-CodeLlama-70B (w/ code, SC, k=50) | OpenMathInstruct-1: A 1.8 Million Math Instructio… | 60.40 | 2024-02-15 |
| OpenMath-CodeLlama-34B (w/ code, SC, k=50) | OpenMathInstruct-1: A 1.8 Million Math Instructio… | 60.20 | 2024-02-15 |
| ToRA-Code 34B model (w/ code, SC, k=50) | ToRA: A Tool-Integrated Reasoning Agent for Mathe… | 60.00 | 2023-09-29 |
| DeepSeekMATH-RL-7B (w/ code, greedy decoding) | DeepSeekMath: Pushing the Limits of Mathematical … | 58.80 | 2024-02-05 |
| OpenMath-Llama2-70B (w/ code, SC, k=50) | OpenMathInstruct-1: A 1.8 Million Math Instructio… | 58.30 | 2024-02-15 |
| CR (GPT-4 model, w/o code) | Cumulative Reasoning with Large Language Models | 58.00 | 2023-08-08 |
| OpenMath-CodeLlama-13B (w/ code, SC, k=50) | OpenMathInstruct-1: A 1.8 Million Math Instructio… | 57.60 | 2024-02-15 |
| OpenMath-Mistral-7B (w/ code, SC, k=50) | OpenMathInstruct-1: A 1.8 Million Math Instructio… | 57.20 | 2024-02-15 |
| ToRA 70B (w/ code, SC, k=50) | ToRA: A Tool-Integrated Reasoning Agent for Mathe… | 56.90 | 2023-09-29 |
| SKiC (GPT-4 model) | Skills-in-Context Prompting: Unlocking Compositio… | 56.40 | 2023-08-01 |
| DART-Math-Llama3-70B-Prop2Diff (0-shot CoT, w/o code) | DART-Math: Difficulty-Aware Rejection Tuning for … | 56.10 | 2024-06-18 |
| OpenMath-CodeLlama-7B (w/ code, SC, k=50) | OpenMathInstruct-1: A 1.8 Million Math Instructio… | 55.60 | 2024-02-15 |
| MMOS-DeepSeekMath-7B(0-shot) | An Empirical Study of Data Ability Boundary in LL… | 55.00 | 2024-02-23 |
| DART-Math-Llama3-70B-Uniform (0-shot CoT, w/o code) | DART-Math: Difficulty-Aware Rejection Tuning for … | 54.90 | 2024-06-18 |
| PHP (GPT-4 model) | Progressive-Hint Prompting Improves Reasoning in … | 53.90 | 2023-04-19 |
| DART-Math-DSMath-7B-Prop2Diff (0-shot CoT, w/o code) | DART-Math: Difficulty-Aware Rejection Tuning for … | 53.60 | 2024-06-18 |
| Gemini Ultra (4-shot) | Gemini: A Family of Highly Capable Multimodal Mod… | 53.20 | 2023-12-19 |
| DART-Math-DSMath-7B-Uniform (0-shot CoT, w/o code) | DART-Math: Difficulty-Aware Rejection Tuning for … | 52.90 | 2024-06-18 |
| GPT-4 model (w/ code, PAL) | PAL: Program-aided Language Models | 51.80 | 2022-11-18 |
| DeepSeekMATH-RL-7B (greedy decoding) | DeepSeekMath: Pushing the Limits of Mathematical … | 51.70 | 2024-02-05 |
| AlphaLLM (with MCTS) | Toward Self-Improvement of LLMs via Imagination, … | 51.00 | 2024-04-18 |
| ToRA-Code 34B (w/ code) | ToRA: A Tool-Integrated Reasoning Agent for Mathe… | 50.80 | 2023-09-29 |
| OpenMath-CodeLlama-70B (w/ code) | OpenMathInstruct-1: A 1.8 Million Math Instructio… | 50.70 | 2024-02-15 |
| Minerva 540B (maj1@k, k=64) | Solving Quantitative Reasoning Problems with Lang… | 50.30 | 2022-06-29 |
| ToRA 70B (w/ code) | ToRA: A Tool-Integrated Reasoning Agent for Mathe… | 49.70 | 2023-09-29 |
| MMOS-CODE-34B(0-shot) | An Empirical Study of Data Ability Boundary in LL… | 49.50 | 2024-02-23 |
| DeepSeekMath-7B-KPMath-Plus | Key-Point-Driven Data Synthesis with its Enhancem… | 48.80 | 2024-03-04 |
| PaLM 2 (few-shot, k=4, SC) | PaLM 2 Technical Report | 48.80 | 2023-05-17 |
| Llemma-34B-KPMath-Plus | Key-Point-Driven Data Synthesis with its Enhancem… | 48.60 | 2024-03-04 |
| OpenMath-CodeLlama-34B (w/ code) | OpenMathInstruct-1: A 1.8 Million Math Instructio… | 48.30 | 2024-02-15 |
| Shepherd + DeepSeek-67B (SFT on MetaMATH + PRM rerank, k=256) | Math-Shepherd: Verify and Reinforce LLMs Step-by-… | 48.10 | 2023-12-14 |
| ToRA-Code 13B (w/ code) | ToRA: A Tool-Integrated Reasoning Agent for Mathe… | 48.10 | 2023-09-29 |
| Minerva 8B (maj5@256) | Solving Quantitative Reasoning Problems with Lang… | 47.60 | 2022-06-29 |
| Mistral-7B-KPMath-Plus | Key-Point-Driven Data Synthesis with its Enhancem… | 46.80 | 2024-03-04 |
| DART-Math-Llama3-8B-Prop2Diff (0-shot CoT, w/o code) | DART-Math: Difficulty-Aware Rejection Tuning for … | 46.60 | 2024-06-18 |
| OpenMath-Llama2-70B (w/ code) | OpenMathInstruct-1: A 1.8 Million Math Instructio… | 46.30 | 2024-02-15 |
| OpenMath-CodeLlama-13B (w/ code) | OpenMathInstruct-1: A 1.8 Million Math Instructio… | 45.50 | 2024-02-15 |
| DART-Math-Mistral-7B-Prop2Diff (0-shot CoT, w/o code) | DART-Math: Difficulty-Aware Rejection Tuning for … | 45.50 | 2024-06-18 |
| DART-Math-Llama3-8B-Uniform (0-shot CoT, w/o code) | DART-Math: Difficulty-Aware Rejection Tuning for … | 45.30 | 2024-06-18 |
| MathCoder-CL-34B | MathCoder: Seamless Code Integration in LLMs for … | 45.20 | 2023-10-05 |
| MathCoder-L-34B | MathCoder: Seamless Code Integration in LLMs for … | 45.10 | 2023-10-05 |
| MMIQC-72B | Augmenting Math Word Problems via Iterative Quest… | 45.00 | 2024-01-17 |
| ToRA-Code 7B (w/ code) | ToRA: A Tool-Integrated Reasoning Agent for Mathe… | 44.60 | 2023-09-29 |
| OpenMath-Mistral-7B (w/ code) | OpenMathInstruct-1: A 1.8 Million Math Instructio… | 44.50 | 2024-02-15 |
| MMOS-CODE-7B(0-shot) | An Empirical Study of Data Ability Boundary in LL… | 44.30 | 2024-02-23 |
| OpenMath-CodeLlama-7B (w/ code) | OpenMathInstruct-1: A 1.8 Million Math Instructio… | 43.60 | 2024-02-15 |
| Shepherd+Mistral-7B (SFT on MetaMATH + PRM RL+ PRM rerank, k=256) | Math-Shepherd: Verify and Reinforce LLMs Step-by-… | 43.50 | 2023-12-14 |
| DART-Math-Mistral-7B-Uniform (0-shot CoT, w/o code) | DART-Math: Difficulty-Aware Rejection Tuning for … | 43.50 | 2024-06-18 |
| Minerva 62B (maj1@k, k=64) | Solving Quantitative Reasoning Problems with Lang… | 43.40 | 2022-06-29 |
| ToRA 13B (w/ code) | ToRA: A Tool-Integrated Reasoning Agent for Mathe… | 43.00 | 2023-09-29 |
| GPT-4 | Sparks of Artificial General Intelligence: Early … | 42.50 | 2023-03-22 |
| Llama2-13B-KPMath-Plus | Key-Point-Driven Data Synthesis with its Enhancem… | 41.00 | 2024-03-04 |
| ToRA 7B (w/ code) | ToRA: A Tool-Integrated Reasoning Agent for Mathe… | 40.10 | 2023-09-29 |
| MathCoder-CL-13B | MathCoder: Seamless Code Integration in LLMs for … | 35.90 | 2023-10-05 |
| MuggleMATH-70B | MuggleMath: Assessing the Impact of Query and Res… | 35.60 | 2023-10-09 |
| PaLM 2 (few-shot, k=4, CoT) | PaLM 2 Technical Report | 34.30 | 2023-05-17 |
| Minerva 540B | Solving Quantitative Reasoning Problems with Lang… | 33.60 | 2022-06-29 |
| Minerva 540B (5-shot) mCoT | Galactica: A Large Language Model for Science | 33.60 | 2022-11-16 |
| Shepherd + Mistral-7B (SFT on MetaMATH + PRM RL) | Math-Shepherd: Verify and Reinforce LLMs Step-by-… | 33.00 | 2023-12-14 |
| WizardMath-7B-V1.1 | WizardMath: Empowering Mathematical Reasoning for… | 33.00 | 2023-08-18 |
| Gemini Pro (4-shot) | Gemini: A Family of Highly Capable Multimodal Mod… | 32.60 | 2023-12-19 |
| MuggleMATH-13B | MuggleMath: Assessing the Impact of Query and Res… | 30.70 | 2023-10-09 |
| MathCoder-CL-7B | MathCoder: Seamless Code Integration in LLMs for … | 30.20 | 2023-10-05 |
| MathCoder-L-13B | MathCoder: Seamless Code Integration in LLMs for … | 29.90 | 2023-10-05 |
| Qwen2idae-16x14B (4-shot) | Parameter-Efficient Sparsity Crafting from Dense … | 29.90 | 2024-01-05 |
| OpenChat-3.5-1210 7B | OpenChat: Advancing Open-source Language Models w… | 28.90 | 2023-09-20 |
| OpenChat-3.5 7B | OpenChat: Advancing Open-source Language Models w… | 28.60 | 2023-09-20 |
| Mixtral 8x7B (maj@4) | Mixtral of Experts | 28.40 | 2024-01-08 |
| Minerva 62B (4-shot) | Solving Quantitative Reasoning Problems with Lang… | 27.60 | 2022-06-29 |
| MetaMath 70B | MetaMath: Bootstrap Your Own Mathematical Questio… | 26.00 | 2023-09-21 |
| MuggleMATH 7B | MuggleMath: Assessing the Impact of Query and Res… | 25.80 | 2023-10-09 |
| Minerva 8B (maj1@k, k=64) | Solving Quantitative Reasoning Problems with Lang… | 25.40 | 2022-06-29 |
| MathCoder-L-7B | MathCoder: Seamless Code Integration in LLMs for … | 23.30 | 2023-10-05 |
| WizardMath-70B-V1.0 | WizardMath: Empowering Mathematical Reasoning for… | 22.70 | 2023-08-18 |
| Camelidae-8×34B (4-shot) | Parameter-Efficient Sparsity Crafting from Dense … | 22.60 | 2024-01-05 |
| MetaMath 13B | MetaMath: Bootstrap Your Own Mathematical Questio… | 22.50 | 2023-09-21 |
| LLaMA 65B (maj1@k) | LLaMA: Open and Efficient Foundation Language Mod… | 20.50 | 2023-02-27 |
| GAL 120B (5-shot) mCoT | Galactica: A Large Language Model for Science | 20.40 | 2022-11-16 |
| MetaMath 7B | MetaMath: Bootstrap Your Own Mathematical Questio… | 19.40 | 2023-09-21 |
| davinci-002 175B | Solving Quantitative Reasoning Problems with Lang… | 19.10 | 2022-06-29 |
| Branch-Train-MiX 4x7B (sampling top-2 experts) | Branch-Train-MiX: Mixing Expert LLMs into a Mixtu… | 17.80 | 2024-03-12 |
| GAL 120B <work> | Galactica: A Large Language Model for Science | 16.60 | 2022-11-16 |
| LLaMA 33B-maj1@k | LLaMA: Open and Efficient Foundation Language Mod… | 15.20 | 2023-02-27 |
| Minerva 8B | Solving Quantitative Reasoning Problems with Lang… | 14.10 | 2022-06-29 |
| WizardMath-13B-V1.0 | WizardMath: Empowering Mathematical Reasoning for… | 14.00 | 2023-08-18 |
| Mistral 7B (maj@4) | Mistral 7B | 13.10 | 2023-10-10 |
| GAL 30B (5-shot) mCoT | Galactica: A Large Language Model for Science | 12.70 | 2022-11-16 |
| Mistral 7B (maj@4) | Mixtral of Experts | 12.70 | 2024-01-08 |
| GAL 30B <work> | Galactica: A Large Language Model for Science | 11.40 | 2022-11-16 |
| WizardMath-7B-V1.0 | WizardMath: Empowering Mathematical Reasoning for… | 10.70 | 2023-08-18 |
| LLaMA 65B | LLaMA: Open and Efficient Foundation Language Mod… | 10.60 | 2023-02-27 |
| PaLM 540B | Solving Quantitative Reasoning Problems with Lang… | 8.80 | 2022-06-29 |
| PaLM 540B (5-shot) mCoT | Galactica: A Large Language Model for Science | 8.80 | 2022-11-16 |
| LLaMA 13B-maj1@k | LLaMA: Open and Efficient Foundation Language Mod… | 8.80 | 2023-02-27 |
| LLaMA 33B | LLaMA: Open and Efficient Foundation Language Mod… | 7.10 | 2023-02-27 |
| LLaMA 7B-maj1@k | LLaMA: Open and Efficient Foundation Language Mod… | 6.90 | 2023-02-27 |
| GPT-2 (1.5B) | Measuring Mathematical Problem Solving With the M… | 6.90 | 2021-03-05 |
| GPT-2 (0.7B) | Measuring Mathematical Problem Solving With the M… | 6.40 | 2021-03-05 |
| GPT-2 (0.3B) | Measuring Mathematical Problem Solving With the M… | 6.20 | 2021-03-05 |
| GPT-3 13B | Measuring Mathematical Problem Solving With the M… | 5.60 | 2021-03-05 |
| PaLM 8B (fine-tuned) | Solving Quantitative Reasoning Problems with Lang… | 5.60 | 2022-06-29 |
| GPT-2 (0.1B) | Measuring Mathematical Problem Solving With the M… | 5.40 | 2021-03-05 |
| GPT-3-175B (few-shot) | Measuring Mathematical Problem Solving With the M… | 5.20 | 2021-03-05 |
| GPT-3 175B (8-shot) | Galactica: A Large Language Model for Science | 5.20 | 2022-11-16 |
| PaLM 62B | Solving Quantitative Reasoning Problems with Lang… | 4.40 | 2022-06-29 |
| LLaMA 13B | LLaMA: Open and Efficient Foundation Language Mod… | 3.90 | 2023-02-27 |
| GPT-3-13B (few-shot) | Measuring Mathematical Problem Solving With the M… | 3.00 | 2021-03-05 |
| LLaMA 7B | LLaMA: Open and Efficient Foundation Language Mod… | 2.90 | 2023-02-27 |
| GPT-3 2.7B | Measuring Mathematical Problem Solving With the M… | 2.90 | 2021-03-05 |
| PaLM 8B | Solving Quantitative Reasoning Problems with Lang… | 1.50 | 2022-06-29 |