
MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies

Shengding Hu¹, Yuge Tu², Xu Han¹, Chaoqun He¹, Ganqu Cui¹, Xiang Long², Zhi Zheng², Yewei Fang², Yuxiang Huang¹, Weilin Zhao¹, Xinrong Zhang¹, Zheng Leng Thai¹, Kaihuo Zhang², Chongyi Wang², Yuan Yao¹, Chenyang Zhao¹, Jie Zhou², Jie Cai², Zhongwu Zhai², Ning Ding¹, Chao Jia², Guoyang Zeng², Dahai Li², Zhiyuan Liu¹, Maosong Sun¹ (2024)
¹ Department of Computer Science and Technology, Tsinghua University; ² Modelbest Inc.

Paper Information

  • arXiv ID: 2404.06395
  • Venue: arXiv.org
  • Domain: natural language processing
  • Code: Available
  • Reproducibility: 8/10

Abstract

The burgeoning interest in developing Large Language Models (LLMs) with up to a trillion parameters has been met with concerns regarding resource efficiency and practical expense, particularly given the immense cost of experimentation. This scenario underscores the importance of exploring the potential of Small Language Models (SLMs) as a resource-efficient alternative. In this context, we introduce MiniCPM, specifically the 1.2B and 2.4B non-embedding parameter variants, which not only excel in their respective categories but also demonstrate capabilities on par with 7B-13B LLMs. While focusing on SLMs, our approach exhibits scalability in both model and data dimensions for future LLM research. Regarding model scaling, we employ extensive model wind tunnel experiments for stable and optimal scaling. For data scaling, we introduce a Warmup-Stable-Decay (WSD) learning rate scheduler (LRS), conducive to continuous training and domain adaptation. We present an in-depth analysis of the intriguing training dynamics that occur in the WSD LRS. With WSD LRS, we are now able to efficiently study the data-model scaling law without extensive retraining experiments on both the model and data axes, from which we derive a much higher compute-optimal data-model ratio than Chinchilla Optimal. Additionally, we introduce the MiniCPM family, including MiniCPM-DPO, MiniCPM-MoE and MiniCPM-128K, whose excellent performance further cements MiniCPM's foundation in diverse SLM applications. MiniCPM models are available publicly.

Summary

This paper presents MiniCPM, a family of Small Language Models (SLMs) designed as resource-efficient alternatives to Large Language Models (LLMs). The two main variants, MiniCPM-1.2B and MiniCPM-2.4B, lead their size class and are competitive with 7B-13B LLMs on many benchmarks. The work emphasizes scalable training strategies, including a Warmup-Stable-Decay (WSD) learning rate scheduler that improves training stability and data efficiency and supports continuous training. Experiments span a range of application areas, apply adaptive training methods, and analyze scaling laws, providing insights into optimal batch-size scaling and learning-rate stability across both model and data scaling.
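
The entry does not restate the fitted scaling-law coefficients, but the general recipe the summary alludes to can be sketched: fit a parametric loss surface to small-scale runs, then scan a fixed compute budget for the loss-minimizing data/model split. The Chinchilla-style form L(N, D) = E + A*N^(-alpha) + B*D^(-beta), the synthetic measurements, and every constant below are illustrative assumptions, not the paper's results.

```python
# Hedged sketch: fit a Chinchilla-style loss surface to (params N, tokens D, loss)
# measurements, then find the compute-optimal tokens-per-parameter ratio for a
# fixed budget C ~ 6*N*D. All numbers are placeholders, not the paper's data.
import numpy as np
from scipy.optimize import curve_fit

def loss_surface(ND, E, A, B, alpha, beta):
    N, D = ND
    return E + A * N ** (-alpha) + B * D ** (-beta)

# Placeholder "wind tunnel" measurements, synthesized from an assumed surface
# purely so the example runs end to end; real runs would supply observed losses.
rng = np.random.default_rng(0)
N = np.repeat(np.array([4e7, 1e8, 3e8, 5e8, 1e9]), 4)   # non-embedding parameters
D = np.tile(np.array([1e9, 4e9, 1.6e10, 6.4e10]), 5)    # training tokens
true_params = (1.7, 500.0, 1200.0, 0.34, 0.29)          # illustrative only
L = loss_surface((N, D), *true_params) + rng.normal(0.0, 0.01, N.size)

popt, _ = curve_fit(loss_surface, np.vstack([N, D]), L,
                    p0=[1.5, 100.0, 100.0, 0.3, 0.3], maxfev=50000)

# For a fixed compute budget C ~ 6*N*D, scan model sizes and pick the split
# that minimizes the fitted loss.
C = 1e21                                                 # placeholder FLOP budget
N_grid = np.logspace(8, 11, 400)
D_grid = C / (6.0 * N_grid)
L_grid = loss_surface((N_grid, D_grid), *popt)
best = int(np.argmin(L_grid))
print(f"compute-optimal tokens per parameter ~ {D_grid[best] / N_grid[best]:.1f}")
```

According to the abstract, the ratio obtained from such an analysis is much higher than the Chinchilla-optimal value (roughly 20 tokens per parameter).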

Methods

This paper employs the following methods:

  • WSD (Warmup-Stable-Decay) learning rate scheduler: warmup to a peak learning rate, a long stable phase held at that rate, and a short final decay phase, which supports continuous training and domain adaptation (a minimal sketch follows below)
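
Below is a minimal, hypothetical sketch of such a schedule in Python. The linear warmup and linear decay shapes, the phase lengths, and all parameter names are assumptions for illustration; the paper studies several decay forms and its exact configuration may differ.

```python
# Minimal sketch of a Warmup-Stable-Decay (WSD) learning rate schedule.
def wsd_lr(step: int,
           warmup_steps: int,
           stable_steps: int,
           decay_steps: int,
           peak_lr: float,
           min_lr: float = 0.0) -> float:
    if step < warmup_steps:
        # Warmup stage: ramp linearly from 0 toward peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    if step < warmup_steps + stable_steps:
        # Stable stage: hold the peak learning rate constant; checkpoints from
        # this stage can be branched off for continued training or adaptation.
        return peak_lr
    # Decay stage: anneal from peak_lr down to min_lr over decay_steps.
    progress = min(1.0, (step - warmup_steps - stable_steps) / decay_steps)
    return min_lr + (peak_lr - min_lr) * (1.0 - progress)
```

In PyTorch, a function like this can be wrapped with torch.optim.lr_scheduler.LambdaLR by returning wsd_lr(step, ...) / peak_lr as the per-step multiplier of the base learning rate.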

Models Used

  • MiniCPM-1.2B
  • MiniCPM-2.4B
  • MiniCPM-DPO
  • MiniCPM-MoE
  • MiniCPM-128K

Datasets

The following datasets and evaluation benchmarks were used in this research:

  • C4
  • MMLU
  • CMMLU
  • GSM8K
  • MATH
  • HumanEval
  • MBPP
  • HellaSwag
  • ARC-e
  • ARC-c
  • BBH
  • UltraChat
  • SlimOrca
  • OssInstruct
  • EvolInstruct

Evaluation Metrics

  • PPL (Perplexity): the exponential of the mean per-token cross-entropy loss (see the relation below)
  • Loss: language-modeling cross-entropy loss
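
For reference, the two metrics are directly related: perplexity is the exponential of the mean per-token cross-entropy. A one-line illustration (not code from the paper):

```python
import math

def perplexity(mean_token_loss_nats: float) -> float:
    """Perplexity = exp(mean per-token cross-entropy), with the loss in nats."""
    return math.exp(mean_token_loss_nats)

print(perplexity(2.3))  # a mean loss of 2.3 nats/token corresponds to a perplexity of ~10
```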

Results

  • MiniCPM-1.2B and MiniCPM-2.4B outperform other SLMs of comparable size and are competitive with 7B-13B models on benchmarks covering knowledge (MMLU, CMMLU), mathematics (GSM8K, MATH), coding (HumanEval, MBPP), and commonsense reasoning (HellaSwag, ARC-e/c, BBH).

Limitations

The authors identified the following limitations:

  • The paper does not train a large LLM to validate the derived scaling law, and the application of the WSD LRS to larger LLMs remains unexplored.

Technical Requirements

  • Number of GPUs: None specified
  • GPU Type: None specified

Keywords

small language models, scalable training, learning rate scheduler, scaling law, model wind tunnel experiments

External Resources