
GSM8K

Arithmetic Reasoning Benchmark

Performance Over Time

144 results | Metric: Accuracy
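The accuracy metric is exact match on the final numeric answer: in the official GSM8K dataset, every reference solution ends with a `#### <number>` line. A minimal scoring sketch (the helper names are illustrative, not from any particular evaluation harness):

```python
import re
from typing import Optional


def _normalize(num: str) -> str:
    # Strip thousands separators and a trailing period, e.g. "1,200." -> "1200".
    return num.replace(",", "").rstrip(".")


def extract_answer(text: str) -> Optional[str]:
    """Pull the final numeric answer from a GSM8K-style solution.

    Reference solutions end with '#### <number>'; for free-form model
    output we fall back to the last number appearing in the text.
    """
    marked = re.search(r"####\s*(-?[\d,.]+)", text)
    if marked:
        return _normalize(marked.group(1))
    numbers = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    return _normalize(numbers[-1]) if numbers else None


def accuracy(predictions, references) -> float:
    """Exact-match accuracy over extracted final answers."""
    correct = sum(
        extract_answer(p) == extract_answer(r)
        for p, r in zip(predictions, references)
    )
    return correct / len(references)
```

Published numbers can still differ slightly between papers, since each harness makes its own choices about answer normalization and parsing failures.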

Top Performing Models

| Rank | Model | Paper | Accuracy | Date | Code |
|------|-------|-------|----------|------|------|
| 1 | Claude 3.5 Sonnet (HPT) | Hierarchical Prompting Taxonomy: A Universal Evaluation Framework for Large Language Models Aligned with Human Cognitive Principles | 97.72 | 2024-06-18 | devichand579/HPT |
| 2 | DUP prompt upon GPT-4 | Achieving >97% on GSM8K: Deeply Understanding the Problems Makes LLMs Better Solvers for Math Word Problems | 97.10 | 2024-04-23 | whu-zqh/dup |
| 3 | Qwen2-Math-72B-Instruct (greedy) | Qwen2 Technical Report | 96.70 | 2024-07-15 | qwenlm/qwen1.5, qwenlm/qwen2, vicentvankor/sun-shine |
| 4 | OpenMath2-Llama3.1-70B (majority@256) | OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data | 96.00 | 2024-10-02 | NVIDIA/NeMo-Skills |
| 5 | OpenMath2-Llama3.1-70B | OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data | 94.90 | 2024-10-02 | NVIDIA/NeMo-Skills |
| 6 | GPT-4 (Teaching-Inspired) | Teaching-Inspired Integrated Prompting Framework: A Novel Approach for Enhancing Reasoning in Large Language Models | 94.80 | 2024-10-10 | sallytan13/teaching-inspired-prompting |
| 7 | OpenMath2-Llama3.1-8B (majority@256) | OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data | 94.10 | 2024-10-02 | NVIDIA/NeMo-Skills |
| 8 | Qwen2-72B-Instruct-Step-DPO (0-shot CoT) | Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs | 94.00 | 2024-06-26 | dvlab-research/step-dpo |
| 9 | AlphaLLM (with MCTS) | Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing | 92.00 | 2024-04-18 | yetianjhu/alphallm |
| 10 | OpenMath2-Llama3.1-8B | OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data | 91.70 | 2024-10-02 | NVIDIA/NeMo-Skills |

All Papers (144)

- Solving math word problems with process- and outcome-based feedback (2022) | DeepMind 70B Model (SFT+ORM-RL, ORM reranking)
- Solving math word problems with process- and outcome-based feedback (2022) | DeepMind 70B Model (SFT+PRM-RL, PRM reranking)
- The ART of LLM Refinement: Ask, Refine, and Trust (2023) | ChatGPT (Ask, Refine, Trust)
- Large Language Models Can Self-Improve (2022) | PaLM 540B (Self-Improvement, Self-Consistency)
- Large Language Models Can Self-Improve (2022) | PaLM 540B (Self-Consistency)
- Large Language Models Can Self-Improve (2022) | PaLM 540B (Self-Improvement, CoT Prompting)
- KwaiYiiMath: Technical Report (2023) | KwaiYiiMath 13B
- The Unreasonable Effectiveness of Eccentric Automatic Prompts (2024) | Llama-2 70B (first 100 questions, 4-shot, auto-optimized prompting)
- Large Language Models Can Self-Improve (2022) | PaLM 540B (CoT Prompting)
- The Unreasonable Effectiveness of Eccentric Automatic Prompts (2024) | Llama-2 13B (first 100 questions, 4-shot, auto-optimized prompting)
- The Unreasonable Effectiveness of Eccentric Automatic Prompts (2024) | Mistral 7B (first 100 questions, 4-shot, auto-optimized prompting)
- Large Language Models Can Self-Improve (2022) | PaLM 540B (Self-Improvement, Standard Prompting)
- Composing Ensembles of Pre-trained Models via Iterative Consensus (2022) | GPT-2-Medium 355M + question-solution classifier (BS=5)
- Large Language Models Can Self-Improve (2022) | PaLM 540B (Standard Prompting)
- Composing Ensembles of Pre-trained Models via Iterative Consensus (2022) | GPT-2-Medium 355M + question-solution classifier (BS=1)