
DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence

Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y. Wu, Y. K. Li, Fuli Luo, Yingfei Xiong, Wenfeng Liang (Key Lab of HCST (PKU), MOE; SCS, Peking University), 2024

Paper Information
  • arXiv ID: 2401.14196
  • Venue: arXiv.org
  • Domain: Not specified
  • SOTA Claim: Yes
  • Reproducibility: 4/10

Abstract

The rapid development of large language models has revolutionized code intelligence in software development. However, the predominance of closed-source models has restricted extensive research and development. To address this, we introduce the DeepSeek-Coder series, a range of open-source code models with sizes from 1.3B to 33B, trained from scratch on 2 trillion tokens. These models are pre-trained on a high-quality project-level code corpus and employ a fill-in-the-blank task with a 16K window to enhance code generation and infilling. Our extensive evaluations demonstrate that DeepSeek-Coder not only achieves state-of-the-art performance among open-source code models across multiple benchmarks but also surpasses existing closed-source models like Codex and GPT-3.5. Furthermore, DeepSeek-Coder models are under a permissive license that allows for both research and unrestricted commercial use.

Figure 1 | The Performance of DeepSeek-Coder

Summary

The paper introduces the DeepSeek-Coder series of open-source code models, trained from scratch on 2 trillion tokens drawn predominantly from source code. The models range from 1.3B to 33B parameters and aim to strengthen code generation and infilling through repository-level data construction and training strategies such as Fill-In-the-Middle (FIM). Empirical results indicate that DeepSeek-Coder outperforms existing open-source code models across multiple benchmarks, and that the instruction-tuned variant, obtained by fine-tuning on instructional data, surpasses GPT-3.5 Turbo on code-related tasks while handling contexts of up to 16K tokens. The findings underscore the importance of high-quality training data and careful training-strategy and architecture choices in building effective code-focused large language models.
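The repository-level data construction mentioned above arranges a repository's files so that dependencies precede the files that use them before everything is concatenated into a single training sample. Below is a minimal sketch of that idea, assuming Python-style imports and hypothetical helper names; the paper's actual pipeline parses dependencies across multiple languages.

```python
# Illustrative sketch only: order a repository's files so that dependencies come
# first, then concatenate them with file-path comments into one training sample.
# The import regex and helper names are assumptions, not the paper's pipeline.
import re
from graphlib import TopologicalSorter  # Python 3.9+


def order_repo_files(files: dict[str, str]) -> list[str]:
    """files maps path -> source code; returns paths in dependency-first order."""
    deps = {path: set() for path in files}
    for path, source in files.items():
        for module in re.findall(r"^\s*(?:from|import)\s+([\w.]+)", source, re.M):
            candidate = module.replace(".", "/") + ".py"
            if candidate in files and candidate != path:
                deps[path].add(candidate)  # `path` depends on `candidate`
    # static_order() yields each file only after all of its dependencies
    # (real repositories may contain import cycles, which this sketch ignores).
    return list(TopologicalSorter(deps).static_order())


def build_repo_sample(files: dict[str, str]) -> str:
    return "\n".join(f"# {path}\n{files[path]}" for path in order_repo_files(files))
```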

Methods

This paper employs the following methods:

  • Fill-In-the-Middle (FIM); a sample-construction sketch follows this list
  • Next Token Prediction
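A minimal sketch of how a FIM training sample can be built in PSM (prefix-suffix-middle) order, assuming a configurable FIM rate (the paper studies this as a hyperparameter). The sentinel strings are placeholders rather than DeepSeek-Coder's actual special tokens.

```python
# Sketch of Fill-In-the-Middle (FIM) sample construction in PSM
# (prefix-suffix-middle) order. The sentinel strings are placeholders; the
# released tokenizer defines its own special tokens for these positions.
import random

FIM_BEGIN, FIM_HOLE, FIM_END = "<fim_begin>", "<fim_hole>", "<fim_end>"


def make_fim_sample(document: str, fim_rate: float = 0.5, rng=random) -> str:
    """With probability fim_rate, rewrite the document in PSM order;
    otherwise leave it untouched for plain next-token prediction."""
    if len(document) < 2 or rng.random() >= fim_rate:
        return document
    i, j = sorted(rng.sample(range(len(document) + 1), 2))
    prefix, middle, suffix = document[:i], document[i:j], document[j:]
    return f"{FIM_BEGIN}{prefix}{FIM_HOLE}{suffix}{FIM_END}{middle}"
```

The transformed string is then tokenized and trained with the ordinary next-token-prediction loss, so infilling and left-to-right generation share one objective.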

Models Used

  • DeepSeek-Coder-Base
  • DeepSeek-Coder-Instruct
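Both model families are released as causal language models, so they can be loaded through the Hugging Face transformers API. The sketch below is illustrative; the repository id is an assumption about the published naming scheme rather than something stated in this summary.

```python
# Hedged usage sketch: load a DeepSeek-Coder checkpoint and complete a prompt.
# The model id is an assumed Hugging Face repo name; substitute the checkpoint
# (base or instruct, 1.3B-33B) that you actually want to use.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/deepseek-coder-6.7b-base"  # assumed repo name
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

prompt = "def quick_sort(arr):\n"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```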

Datasets

The following benchmark datasets were used for evaluation in this research (a sketch of how a single record is scored follows the list):

  • HumanEval
  • MBPP
  • DS-1000
  • GSM8K
  • MATH
  • CrossCodeEval
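For HumanEval-style benchmarks, each record carries a prompt, hidden unit tests, and an entry point, and a model completion passes if the assembled program runs the tests without error. A rough sketch under those assumptions, using the field names of the public HumanEval JSONL release; the exec-based check is for illustration only, and real harnesses sandbox this step.

```python
# Sketch of scoring one HumanEval-style record: append the model completion to
# the prompt, attach the task's unit tests, and run them. Field names follow
# the public HumanEval JSONL release; exec() here is for illustration only and
# must be sandboxed in any real evaluation harness.
import json


def completion_passes(problem: dict, completion: str) -> bool:
    program = (
        problem["prompt"] + completion + "\n"
        + problem["test"] + "\n"
        + f"check({problem['entry_point']})\n"
    )
    try:
        exec(program, {"__name__": "__main__"})  # UNSAFE outside a sandbox
        return True
    except Exception:
        return False


# problems = [json.loads(line) for line in open("HumanEval.jsonl", encoding="utf-8")]
```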

Evaluation Metrics

  • Pass@1 (an estimator sketch follows this list)
  • Accuracy
  • F1-score
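Pass@1 is computed from sampled or greedy completions; the standard unbiased pass@k estimator (Chen et al., 2021) applies when n samples are drawn per task, and under greedy decoding it reduces to the fraction of tasks solved. A short sketch:

```python
# Unbiased pass@k estimator: with n sampled completions per task, of which c
# pass, pass@k = 1 - C(n-c, k) / C(n, k); the benchmark score is the mean over
# tasks. For k=1 this is simply the expected fraction of passing samples.
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:          # every size-k subset contains at least one pass
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 200 samples per task, 47 of them pass -> estimated pass@1 = 0.235
print(pass_at_k(n=200, c=47, k=1))
```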

Results

  • DeepSeek-Coder outperforms existing open-source models in multiple benchmarks
  • DeepSeek-Coder-Instruct surpasses OpenAI GPT-3.5 Turbo in code-related tasks

Limitations

The authors identified the following limitations:

  • The closed-source nature of other models limits research access
  • Performance evaluation might be affected by data contamination risks

Technical Requirements

  • Number of GPUs: None specified
  • GPU Type: NVIDIA A100, H800
