
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao (DeepSeek-AI; Tsinghua University), Peiyi Wang (DeepSeek-AI; Peking University), Qihao Zhu (DeepSeek-AI; Peking University), Runxin Xu (DeepSeek-AI), Junxiao Song (DeepSeek-AI), Xiao Bi (DeepSeek-AI), Haowei Zhang (DeepSeek-AI), Mingchuan Zhang (DeepSeek-AI), Y. K. Li (DeepSeek-AI), Y. Wu (DeepSeek-AI), Daya Guo (DeepSeek-AI) (2024)

Paper Information
arXiv ID
2402.03300
Venue
arXiv.org
Domain
natural language processing, artificial intelligence
SOTA Claim
Yes
Reproducibility
8/10

Abstract

Mathematical reasoning poses a significant challenge for language models due to its complex and structured nature. In this paper, we introduce DeepSeekMath 7B, which continues pretraining DeepSeek-Coder-Base-v1.5 7B with 120B math-related tokens sourced from Common Crawl, together with natural language and code data. DeepSeekMath 7B has achieved an impressive score of 51.7% on the competition-level MATH benchmark without relying on external toolkits and voting techniques, approaching the performance level of Gemini-Ultra and GPT-4. Self-consistency over 64 samples from DeepSeekMath 7B achieves 60.9% on MATH. The mathematical reasoning capability of DeepSeekMath is attributed to two key factors: first, we harness the significant potential of publicly available web data through a meticulously engineered data selection pipeline; second, we introduce Group Relative Policy Optimization (GRPO), a variant of Proximal Policy Optimization (PPO) that enhances mathematical reasoning abilities while concurrently optimizing the memory usage of PPO.

Summary

This paper introduces DeepSeekMath, a domain-specific language model focused on mathematical reasoning that significantly outperforms existing open-source models and approaches the performance of proprietary models such as GPT-4. The authors present DeepSeekMath 7B, trained on 120 billion math-related tokens sourced from Common Crawl, combined with natural language and code data. The model achieves 51.7% on the MATH benchmark and 60.9% with self-consistency over 64 samples. Key innovations include a carefully crafted data selection pipeline for high-quality training data and a new reinforcement learning algorithm, Group Relative Policy Optimization (GRPO), which improves memory efficiency and mathematical reasoning capability. Evaluations show that DeepSeekMath 7B excels across benchmarks, outperforming much larger models such as Minerva (540B), while also improving on multilingual tasks. Despite these successes, the authors note that the model remains relatively weak on geometry problems and does not handle some mathematical tasks, such as theorem proving, effectively.
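The data selection pipeline works by training a fastText classifier on seed math pages and iteratively using it to mine math-related pages from Common Crawl. Below is a minimal sketch of that idea; the file name, labels, and score threshold are illustrative assumptions, not values from the paper.

```python
# Minimal sketch of a fastText-based page classifier for mining math content
# from a web crawl, in the spirit of the paper's data-selection pipeline.
# File names, labels, and the threshold are illustrative assumptions.
import fasttext

# train.txt: one page per line, prefixed with __label__math or __label__other
model = fasttext.train_supervised(input="train.txt", lr=0.1, epoch=3, wordNgrams=2)

def math_score(page_text: str) -> float:
    """Classifier probability that a page is math-related."""
    labels, probs = model.predict(page_text.replace("\n", " "), k=2)
    for label, prob in zip(labels, probs):
        if label == "__label__math":
            return float(prob)
    return 0.0

def select_pages(pages, threshold=0.5):
    """Keep only pages scored above the chosen threshold."""
    return [p for p in pages if math_score(p) >= threshold]
```

In the paper's pipeline this selection step is run iteratively: newly mined pages enrich the seed set, and the classifier is retrained to improve recall on the next pass.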

Methods

This paper employs the following methods:

  • Group Relative Policy Optimization (GRPO), a memory-efficient PPO variant (see the sketch after this list)
  • Proximal Policy Optimization (PPO)
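GRPO drops PPO's learned value function (critic) and instead normalizes each sampled completion's reward against the group of completions drawn for the same question. A minimal sketch of the group-relative advantage computation follows; variable names and the toy rewards are illustrative assumptions.

```python
# Minimal sketch of GRPO's group-relative advantage: for each question,
# sample a group of completions, score them, and normalize rewards within
# the group instead of using a learned critic (as PPO would).
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """rewards: (num_questions, group_size) scalar reward per completion."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)  # the baseline comes from the group itself

# Toy example: 2 questions, 4 sampled completions each (1.0 = correct answer).
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.0, 0.0, 1.0, 0.0]])
advantages = group_relative_advantages(rewards)
# Each advantage then feeds a PPO-style clipped objective, with a KL penalty
# toward a reference model added directly to the loss.
```

Because no critic network is trained or stored, this is the source of the memory savings over standard PPO that the abstract highlights.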

Models Used

  • DeepSeekMath 7B
  • DeepSeekMath-Base 7B
  • DeepSeekMath-Instruct 7B
  • Minerva 540B
  • GPT-4
  • Gemini-Ultra

Datasets

The following datasets were used in this research:

  • DeepSeek-Math Corpus
  • Common Crawl
  • GSM8K
  • MATH
  • CMATH
  • MGSM-zh
  • Gaokao-MathCloze
  • Gaokao-MathQA

Evaluation Metrics

  • Accuracy (top-1) on benchmarks such as GSM8K and MATH
  • maj@64 (self-consistency accuracy over 64 samples)

Results

  • DeepSeekMath 7B achieved 51.7% on the competition-level MATH benchmark without external toolkits or voting techniques
  • Self-consistency over 64 samples from DeepSeekMath 7B achieves 60.9% on MATH (see the sketch after this list)
  • DeepSeekMath-Base 7B achieved 64.2% on GSM8K
  • DeepSeekMath-Instruct 7B outperforms all open-source 7B counterparts and is comparable with 70B open-source instruction-tuned models
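The 60.9% maj@64 result comes from self-consistency: sample many reasoning paths for each question, extract each final answer, and return the most common one. A minimal sketch is below; `sample_solution` and `extract_final_answer` are hypothetical stand-ins for the model's sampling and answer-parsing steps, not functions from the paper.

```python
# Minimal sketch of self-consistency (majority voting over sampled answers).
# `sample_solution` and `extract_final_answer` are hypothetical placeholders.
from collections import Counter

def self_consistency(question: str, sample_solution, extract_final_answer,
                     n: int = 64) -> str:
    answers = []
    for _ in range(n):
        solution = sample_solution(question)          # one sampled reasoning path
        answers.append(extract_final_answer(solution))
    # Majority vote over the final answers (maj@n).
    return Counter(answers).most_common(1)[0][0]
```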

Limitations

The authors identified the following limitations:

  • Relatively weak performance on geometry problems
  • Some mathematical tasks, such as theorem proving, are not handled effectively

Technical Requirements

  • Number of GPUs: None specified
  • GPU Type: None specified

Keywords

mathematical reasoning, large language models, reinforcement learning, instruction tuning, web data, corpus construction
