
DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence

Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y. Wu, Y. K. Li, Fuli Luo, Yingfei Xiong, Wenfeng Liang (Key Lab of HCST (PKU), MOE; SCS, Peking University), 2024

Paper Information
  • arXiv ID: 2401.14196
  • Venue: arXiv.org
  • Domain: Not specified
  • SOTA Claim: Yes
  • Reproducibility: 4/10

Abstract

The rapid development of large language models has revolutionized code intelligence in software development. However, the predominance of closed-source models has restricted extensive research and development. To address this, we introduce the DeepSeek-Coder series, a range of open-source code models with sizes from 1.3B to 33B, trained from scratch on 2 trillion tokens. These models are pre-trained on a high-quality project-level code corpus and employ a fill-in-the-blank task with a 16K window to enhance code generation and infilling. Our extensive evaluations demonstrate that DeepSeek-Coder not only achieves state-of-the-art performance among open-source code models across multiple benchmarks but also surpasses existing closed-source models like Codex and GPT-3.5. Furthermore, DeepSeek-Coder models are under a permissive license that allows for both research and unrestricted commercial use.

Figure 1 | The Performance of DeepSeek-Coder

Summary

The paper introduces the DeepSeek-Coder series of open-source code models, trained from scratch on 2 trillion tokens drawn predominantly from source code. The models range from 1.3B to 33B parameters and aim to strengthen code generation and infilling through repository-level data construction and training strategies such as Fill-In-the-Middle (FIM). Empirical results indicate that DeepSeek-Coder outperforms existing open-source code models across multiple benchmarks, and that the instruction-tuned variant, obtained by fine-tuning on instructional data, surpasses GPT-3.5 Turbo on code-related tasks while handling contexts of up to 16K tokens. The findings underscore the importance of high-quality training data and careful training-strategy and architecture choices in building effective code-focused large language models.
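The repository-level data construction mentioned above arranges a repository's files so that dependencies precede the files that use them before everything is concatenated into a single training sample. Below is a minimal sketch of that idea, assuming Python-style imports and hypothetical helper names; the paper's actual pipeline parses dependencies across multiple languages.

```python
# Illustrative sketch only: order a repository's files so that dependencies come
# first, then concatenate them with file-path comments into one training sample.
# The import regex and helper names are assumptions, not the paper's pipeline.
import re
from graphlib import TopologicalSorter  # Python 3.9+


def order_repo_files(files: dict[str, str]) -> list[str]:
    """files maps path -> source code; returns paths in dependency-first order."""
    deps = {path: set() for path in files}
    for path, source in files.items():
        for module in re.findall(r"^\s*(?:from|import)\s+([\w.]+)", source, re.M):
            candidate = module.replace(".", "/") + ".py"
            if candidate in files and candidate != path:
                deps[path].add(candidate)  # `path` depends on `candidate`
    # static_order() yields each file only after all of its dependencies
    # (real repositories may contain import cycles, which this sketch ignores).
    return list(TopologicalSorter(deps).static_order())


def build_repo_sample(files: dict[str, str]) -> str:
    return "\n".join(f"# {path}\n{files[path]}" for path in order_repo_files(files))
```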

Methods

This paper employs the following methods:

  • Fill-In-the-Middle (FIM); a sample-construction sketch follows this list
  • Next Token Prediction
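A minimal sketch of how a FIM training sample can be built in PSM (prefix-suffix-middle) order, assuming a configurable FIM rate (the paper studies this as a hyperparameter). The sentinel strings are placeholders rather than DeepSeek-Coder's actual special tokens.

```python
# Sketch of Fill-In-the-Middle (FIM) sample construction in PSM
# (prefix-suffix-middle) order. The sentinel strings are placeholders; the
# released tokenizer defines its own special tokens for these positions.
import random

FIM_BEGIN, FIM_HOLE, FIM_END = "<fim_begin>", "<fim_hole>", "<fim_end>"


def make_fim_sample(document: str, fim_rate: float = 0.5, rng=random) -> str:
    """With probability fim_rate, rewrite the document in PSM order;
    otherwise leave it untouched for plain next-token prediction."""
    if len(document) < 2 or rng.random() >= fim_rate:
        return document
    i, j = sorted(rng.sample(range(len(document) + 1), 2))
    prefix, middle, suffix = document[:i], document[i:j], document[j:]
    return f"{FIM_BEGIN}{prefix}{FIM_HOLE}{suffix}{FIM_END}{middle}"
```

The transformed string is then tokenized and trained with the ordinary next-token-prediction loss, so infilling and left-to-right generation share one objective.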

Models Used

  • DeepSeek-Coder-Base
  • DeepSeek-Coder-Instruct
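Both model families are released as causal language models, so they can be loaded through the Hugging Face transformers API. The sketch below is illustrative; the repository id is an assumption about the published naming scheme rather than something stated in this summary.

```python
# Hedged usage sketch: load a DeepSeek-Coder checkpoint and complete a prompt.
# The model id is an assumed Hugging Face repo name; substitute the checkpoint
# (base or instruct, 1.3B-33B) that you actually want to use.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/deepseek-coder-6.7b-base"  # assumed repo name
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

prompt = "def quick_sort(arr):\n"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```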

Datasets

The following benchmark datasets were used for evaluation in this research (a sketch of how a single record is scored follows the list):

  • HumanEval
  • MBPP
  • DS-1000
  • GSM8K
  • MATH
  • CrossCodeEval
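For HumanEval-style benchmarks, each record carries a prompt, hidden unit tests, and an entry point, and a model completion passes if the assembled program runs the tests without error. A rough sketch under those assumptions, using the field names of the public HumanEval JSONL release; the exec-based check is for illustration only, and real harnesses sandbox this step.

```python
# Sketch of scoring one HumanEval-style record: append the model completion to
# the prompt, attach the task's unit tests, and run them. Field names follow
# the public HumanEval JSONL release; exec() here is for illustration only and
# must be sandboxed in any real evaluation harness.
import json


def completion_passes(problem: dict, completion: str) -> bool:
    program = (
        problem["prompt"] + completion + "\n"
        + problem["test"] + "\n"
        + f"check({problem['entry_point']})\n"
    )
    try:
        exec(program, {"__name__": "__main__"})  # UNSAFE outside a sandbox
        return True
    except Exception:
        return False


# problems = [json.loads(line) for line in open("HumanEval.jsonl", encoding="utf-8")]
```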

Evaluation Metrics

  • Pass@1 (an estimator sketch follows this list)
  • Accuracy
  • F1-score
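Pass@1 is computed from sampled or greedy completions; the standard unbiased pass@k estimator (Chen et al., 2021) applies when n samples are drawn per task, and under greedy decoding it reduces to the fraction of tasks solved. A short sketch:

```python
# Unbiased pass@k estimator: with n sampled completions per task, of which c
# pass, pass@k = 1 - C(n-c, k) / C(n, k); the benchmark score is the mean over
# tasks. For k=1 this is simply the expected fraction of passing samples.
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:          # every size-k subset contains at least one pass
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 200 samples per task, 47 of them pass -> estimated pass@1 = 0.235
print(pass_at_k(n=200, c=47, k=1))
```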

Results

  • DeepSeek-Coder outperforms existing open-source models in multiple benchmarks
  • DeepSeek-Coder-Instruct surpasses OpenAI GPT-3.5 Turbo in code-related tasks

Limitations

The authors identified the following limitations:

  • The closed-source nature of other models limits research access
  • Performance evaluation might be affected by data contamination risks

Technical Requirements

  • Number of GPUs: None specified
  • GPU Type: NVIDIA A100, H800
