Peiyuan Zhang, Guangtao Zeng, Tianduo Wang, Wei Lu. StatNLP Research Group, Singapore University of Technology and Design. 2024.
This paper introduces TinyLlama, an open-source compact language model with 1.1 billion parameters, pretrained on around 1 trillion tokens for approximately 3 epochs. Building on the architecture and tokenizer of Llama 2, it leverages advances such as FlashAttention and Lit-GPT for computational efficiency and significantly outperforms other open-source models of comparable size. The authors discuss recent findings on the effectiveness of training smaller models on larger datasets and explore TinyLlama's capabilities on various downstream tasks. The pre-training data combines natural language from the SlimPajama dataset with code from the StarCoder dataset, totaling roughly 950 billion tokens. The authors evaluate TinyLlama on commonsense reasoning and problem-solving tasks, where it outperforms comparably sized baseline models. The paper concludes by emphasizing the model's value for language model research, the release of open-source resources, and directions for future work.
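For readers who want to try the released checkpoints, a minimal inference sketch using Hugging Face Transformers is shown below; the Hub identifier TinyLlama/TinyLlama-1.1B-Chat-v1.0 and the prompt are assumptions for illustration, not details prescribed by the paper.

```python
# Minimal sketch: load a TinyLlama checkpoint and generate text with Hugging Face
# Transformers. The model identifier below is an assumed Hub name; swap in the
# checkpoint you intend to use (e.g., a base or intermediate-step release).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # assumed checkpoint name

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "The TinyLlama project aims to"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```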
This paper employs the following methods: FlashAttention, Lit-GPT.
The following datasets were used in this research: SlimPajama, StarCoder.
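As a rough illustration of how these corpora can be inspected, the sketch below streams a few records with the Hugging Face datasets library; the Hub identifiers (cerebras/SlimPajama-627B, bigcode/starcoderdata), the data_dir value, and the record field names are assumptions about the public mirrors rather than details taken from the paper.

```python
# Minimal sketch: stream samples from public mirrors of the two pre-training corpora
# without downloading them in full. Dataset identifiers, the data_dir value, and the
# record field names ("text", "content") are assumptions about the Hub copies.
from datasets import load_dataset

slimpajama = load_dataset("cerebras/SlimPajama-627B", split="train", streaming=True)
starcoder = load_dataset("bigcode/starcoderdata", data_dir="python", split="train", streaming=True)

print(next(iter(slimpajama))["text"][:200])    # natural language sample
print(next(iter(starcoder))["content"][:200])  # code sample
```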