Shengding Hu¹ ([email protected]), Yuge Tu², Xu Han¹, Chaoqun He¹, Ganqu Cui¹, Xiang Long², Zhi Zheng², Yewei Fang², Yuxiang Huang¹, Weilin Zhao¹, Xinrong Zhang¹, Zheng Leng Thai¹, Kaihuo Zhang², Chongyi Wang², Yuan Yao¹, Chenyang Zhao¹, Jie Zhou², Jie Cai², Zhongwu Zhai², Ning Ding¹, Chao Jia², Guoyang Zeng², Dahai Li², Zhiyuan Liu¹, Maosong Sun¹ (2024). ¹Department of Computer Science and Technology, Tsinghua University; ²Modelbest Inc.
This paper presents MiniCPM, a family of Small Language Models (SLMs) designed as resource-efficient alternatives to Large Language Models (LLMs). The two main variants, MiniCPM-1.2B and MiniCPM-2.4B, match or outperform substantially larger models on a range of tasks. The work emphasizes scalable training strategies, in particular a Warmup-Stable-Decay (WSD) learning rate scheduler that improves training stability and data efficiency: the learning rate is warmed up, held constant for most of training, and then rapidly decayed in a short final phase. The paper further analyzes scaling laws, offers guidance on adjusting the optimal batch size, and examines how a stable learning rate can be maintained across both model and data scaling.
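To make the three-phase WSD schedule concrete, here is a minimal sketch in Python. The peak learning rate and phase lengths are hypothetical placeholder values, and the exponential decay is just one plausible choice of decreasing function for the final phase; the paper's exact decay form may differ.

```python
def wsd_lr(step, peak_lr=1e-2, warmup=2_000, stable=100_000, decay_half_life=5_000):
    """Warmup-Stable-Decay (WSD) learning rate schedule (sketch).

    Three phases: linear warmup to the peak rate, a long constant
    plateau, then a rapid decay. Exponential decay is used here as
    one example of a decreasing function for the decay phase.
    All hyperparameter values are illustrative, not the paper's.
    """
    if step < warmup:
        return peak_lr * step / warmup             # linear warmup
    if step < warmup + stable:
        return peak_lr                             # stable plateau
    t = step - warmup - stable
    return peak_lr * 0.5 ** (t / decay_half_life)  # exponential decay

# Example: wsd_lr(1_000) == 5e-3 (mid-warmup), wsd_lr(50_000) == 1e-2
# (plateau), wsd_lr(107_000) == 5e-3 (one half-life into the decay).
```

One attraction of this shape, as the paper discusses, is that the long constant plateau lets a decayed checkpoint be branched off at many points along a single training run, which makes studying data scaling cheaper than retraining from scratch for each data budget.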