
DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Zhe Fu, Huazuo Gao, Kaige Gao, Wenjun Gao, Ruiqi Ge, Kang Guan, Daya Guo, Jianzhong Guo, Guangbo Hao, Zhewen Hao, Ying He, Wenjie Hu, Panpan Huang, Erhang Li, Guowei Li, Jiashi Li, Yao K Li, Wenfeng Liang, Fangyun Lin, A X Liu, Bo Liu, Wen Liu, Xiaodong Liu, Xin Liu, Yiyuan Liu, Haoyu Lu, Shanghao Lu, Fuli Luo, Shirong Ma, Xiaotao Nie, Tian Pei, Yishi Piao, Junjie Qiu, Hui Qu, Y Wu, Xin Wu, Zhenda Xie, Ziwei Xie, Yiliang Xie, Hanwei Xiong, R X Xu, Yanhong Xu, Dejian Xu, Yuxiang Yang, Shuiping You, Xingkai Yu, B Yu, Haowei Zhang, Lecong Zhang, Liyue Zhang, Mingchuan Zhang, Minghua Zhang, Wentao Zhang, Yichao Zhang, Chenggang Zhang, Yao Zhao, Shangyan Zhao, Shunfeng Zhou, Qihao Zhou, Yuheng Zhu, Zou, DeepSeek-AI, Tongzheng Ren, Zehui Ren, Chong Ruan, Zhihong Shao, Jingxiang Sun, Bingxuan Wang, Yaohui Wang, Zhangli Sha, Junxiao Song, Xuecheng Su, Yaofeng Sun, Minghui Tang, Peiyi Wang, Shiyu Wang, Yongji Wang, Tong (2024)

Paper Information

  • arXiv ID: 2401.02954
  • Venue: arXiv.org
  • Domain: natural language processing, machine learning
  • Reproducibility: 6/10

Abstract

The rapid development of open-source large language models (LLMs) has been truly remarkable. However, the scaling laws described in previous literature present varying conclusions, which casts a dark cloud over scaling LLMs. We delve into the study of scaling laws and present our distinctive findings that facilitate the scaling of large-scale models in two commonly used open-source configurations, 7B and 67B. Guided by the scaling laws, we introduce DeepSeek LLM, a project dedicated to advancing open-source language models with a long-term perspective. To support the pre-training phase, we have developed a dataset that currently consists of 2 trillion tokens and is continuously expanding. We further conduct supervised fine-tuning (SFT) and direct preference optimization (DPO) on the DeepSeek LLM Base models, resulting in the creation of DeepSeek Chat models. Our evaluation results demonstrate that DeepSeek LLM 67B surpasses LLaMA-2 70B across a range of benchmarks, especially in the domains of code, mathematics, and reasoning. Furthermore, open-ended evaluations reveal that our DeepSeek LLM 67B Chat exhibits superior performance compared to GPT-3.5.

Summary

The paper introduces DeepSeek LLM, a project focused on scaling open-source large language models (LLMs) with a long-term perspective. It discusses the significance of scaling laws for model performance and reports dataset construction and scaling experiments that underline the influence of data quality. The authors present a pre-training dataset of 2 trillion tokens, collected primarily in Chinese and English, and evaluate the DeepSeek LLM 67B model, showing competitive performance against LLaMA-2 70B and GPT-3.5 across various benchmarks, including mathematics, coding, and reasoning tasks. The paper emphasizes hyperparameter optimization, scaling behavior, and fine-tuning techniques such as supervised fine-tuning (SFT) and direct preference optimization (DPO) to improve conversational capabilities and model alignment. Evaluations also include safety assessments to ensure responses align with human values and to mitigate harmful outputs. The study concludes with a discussion of limitations and prospects for future development of the DeepSeek LLM project.
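
The scaling-law analysis mentioned above follows the usual power-law framing from this literature. As a hedged illustration only (the paper fits its own representation of model scale and its own exponents, which are not reproduced here), compute-optimal allocation of a training budget C is commonly written as:

```latex
% General compute-optimal scaling relations (illustrative form only; the
% paper's exact parameterization and fitted exponents are not shown here).
C \approx 6\,N\,D, \qquad
N_{\mathrm{opt}}(C) \propto C^{a}, \qquad
D_{\mathrm{opt}}(C) \propto C^{b}, \qquad a + b \approx 1
```

Here N is the (non-embedding) parameter count, D the number of training tokens, and a, b are exponents fitted from smaller-scale runs; the paper reports its own findings on how these quantities behave for its data and model configurations.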

Methods

This paper employs the following methods:

  • Supervised Fine-Tuning (SFT)
  • Direct Preference Optimization (DPO) (see the sketch after this list)
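
As a rough illustration of the two methods listed above, the following is a minimal, self-contained sketch of the standard SFT cross-entropy objective and the DPO pairwise objective in PyTorch. It is not the authors' implementation; the function names, tensor shapes, and the beta default are assumptions made for clarity.

```python
# Minimal sketch of SFT and DPO losses (illustrative only, not the paper's code).
import torch
import torch.nn.functional as F


def sft_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Supervised fine-tuning: next-token cross-entropy on response tokens.

    logits: (batch, seq_len, vocab); labels: (batch, seq_len), with -100
    marking prompt/padding positions excluded from the loss.
    """
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),  # predictions for step t+1
        labels[:, 1:].reshape(-1),                    # targets shifted by one
        ignore_index=-100,
    )


def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log pi_theta(y_w | x), shape (batch,)
    policy_rejected_logps: torch.Tensor,  # log pi_theta(y_l | x)
    ref_chosen_logps: torch.Tensor,       # log pi_ref(y_w | x), frozen reference
    ref_rejected_logps: torch.Tensor,     # log pi_ref(y_l | x)
    beta: float = 0.1,                    # strength of the implicit KL penalty (assumed value)
) -> torch.Tensor:
    """Direct Preference Optimization: prefer the chosen over the rejected response
    relative to a frozen reference model, with no explicit reward model or RL loop.
    """
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```

In practice, the per-sequence log-probabilities passed to dpo_loss are obtained by summing the token log-probabilities of each response under the current policy and under the frozen reference (typically the SFT) model.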

Models Used

  • DeepSeek LLM 67B
  • LLaMA-2 70B
  • GPT-3.5

Datasets

The following datasets were used in this research:

  • No standard public datasets are named; pre-training uses the authors' own 2 trillion token corpus, primarily in Chinese and English

Evaluation Metrics

  • None specified

Results

  • DeepSeek LLM 67B outperforms LLaMA-2 70B on various benchmarks including mathematics, coding, and reasoning tasks
  • DeepSeek LLM 67B Chat shows superior performance in open-ended evaluations compared to GPT-3.5
  • DeepSeek models perform well in safety evaluations, producing largely harmless responses

Limitations

The authors identified the following limitations:

  • The DeepSeek LLM lacks ongoing knowledge updates after pre-training
  • Possibility of generating non-factual information and hallucinations
  • The initial version of the Chinese data is not exhaustive, which limits performance on certain topics

Technical Requirements

  • Number of GPUs: None specified
  • GPU Type: None specified

Keywords

large language models, scaling laws, open-source models, dataset, fine-tuning, evaluation, safety
