Daya Guo ([email protected]), Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y. Wu, Y. K. Li, Fuli Luo, Yingfei Xiong, Wenfeng Liang. Key Lab of HCST (PKU), MOE; SCS, Peking University. (2024)
The paper introduces the DeepSeek-Coder series of open-source code models, trained from scratch on 2 trillion tokens of source code. The models range from 1.3B to 33B parameters and aim to strengthen code generation and infilling through repository-level pre-training data construction and the Fill-In-the-Middle (FIM) training strategy, evaluated across a wide range of code benchmarks. Empirical results indicate that the DeepSeek-Coder base models outperform existing open-source code models, while the instruction-tuned variants, fine-tuned on instructional data, surpass GPT-3.5 Turbo on code generation tasks. The models also support long contexts of up to 16K tokens, enabling repository-scale completion. The findings highlight the importance of carefully constructed training data and training strategies in developing effective code-focused large language models.
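To make the Fill-In-the-Middle strategy mentioned above concrete, the following is a minimal sketch of how a prefix-suffix-middle (PSM) FIM training example can be constructed. The sentinel strings and the way cut points are sampled are illustrative assumptions, not the exact special tokens or FIM rate used to train DeepSeek-Coder.

```python
import random

# Placeholder sentinel strings; the model's actual FIM special tokens are
# defined by its tokenizer and are not reproduced here.
FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE = "<fim_prefix>", "<fim_suffix>", "<fim_middle>"

def make_fim_example(document: str, rng: random.Random) -> str:
    """Reorder a document into a prefix-suffix-middle (PSM) FIM training example.

    Two random cut points split the text into prefix / middle / suffix, and the
    pieces are rearranged so the model learns to generate the middle span
    conditioned on the code both before and after it.
    """
    a, b = sorted(rng.sample(range(len(document) + 1), 2))
    prefix, middle, suffix = document[:a], document[a:b], document[b:]
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"

rng = random.Random(0)
print(make_fim_example("def add(a, b):\n    return a + b\n", rng))
```

Because the suffix precedes the middle in PSM ordering, an editor can supply the code before and after the cursor at inference time and ask the model to fill the gap, which is the infilling use case the summary refers to.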
This paper employs the following methods:

- Pre-training code models (1.3B to 33B parameters) from scratch on 2 trillion tokens of source code
- Repository-level data construction, grouping files from the same repository so the models learn cross-file context (see the sketch after this list)
- Fill-In-the-Middle (FIM) pre-training for code infilling
- Context-window extension to 16K tokens
- Instruction fine-tuning to produce the Instruct variants
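As a rough illustration of the repository-level data construction listed above, the sketch below parses import statements and topologically orders files so that dependencies precede the files that use them before concatenation into a training sample. The regular expression, the Python-only handling, and the helper names are simplifying assumptions for illustration, not the authors' actual pipeline.

```python
import re
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Simplified dependency pattern: Python-style "import x" / "from x import y".
# A real pipeline would cover many languages and edge cases.
IMPORT_RE = re.compile(r"^\s*(?:from|import)\s+([\w\.]+)", re.MULTILINE)

def order_repo_files(files: dict[str, str]) -> list[str]:
    """Order repository files so dependencies appear before the files that use them.

    `files` maps a path such as "pkg/utils.py" to its source text. Import
    statements are matched back to files in the same repository, and a
    topological sort yields the concatenation order for one training sample.
    """
    module_to_path = {path[:-3].replace("/", "."): path for path in files}
    deps: dict[str, set[str]] = {path: set() for path in files}
    for path, source in files.items():
        for module in IMPORT_RE.findall(source):
            target = module_to_path.get(module)
            if target and target != path:
                deps[path].add(target)  # `path` depends on `target`
    return list(TopologicalSorter(deps).static_order())

repo = {
    "pkg/main.py": "from pkg.utils import helper\nprint(helper())\n",
    "pkg/utils.py": "def helper():\n    return 1\n",
}
print(order_repo_files(repo))  # ['pkg/utils.py', 'pkg/main.py']
```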
The following datasets were used in this research:

- A pre-training corpus of 2 trillion tokens of source code, organized at the repository level
- Instruction-tuning data used to produce the DeepSeek-Coder-Instruct variants
The authors identified the following limitations: