
TinyLlama: An Open-Source Small Language Model

Peiyuan Zhang, Guangtao Zeng, Tianduo Wang, Wei Lu. StatNLP Research Group, Singapore University of Technology and Design (2024)

Paper Information

  • arXiv ID: 2401.02385
  • Venue: arXiv.org
  • Domain: natural language processing
  • Code: https://github.com/jzhang38/TinyLlama
  • Reproducibility: 6/10

Abstract

We present TinyLlama, a compact 1.1B language model pretrained on around 1 trillion tokens for up to 3 epochs.¹ Building on the architecture and tokenizer of Llama 2 (Touvron et al., 2023b), TinyLlama leverages various advances contributed by the open-source community, e.g., FlashAttention (Dao, 2023) and Lit-GPT (Lightning-AI, 2023), achieving better computational efficiency. Despite its relatively small size, TinyLlama demonstrates remarkable performance in a series of downstream tasks. It significantly outperforms existing open-source language models with comparable sizes. Our model checkpoints and code are publicly available on GitHub at https://github.com/jzhang38/TinyLlama.

* The first two authors contributed equally.
¹ Our latest model, TinyLlama v1.1, is only trained for 2 trillion tokens. More details about this latest version are elaborated in a later section.

Summary

This paper introduces TinyLlama, an open-source compact language model with 1.1 billion parameters, pretrained on approximately 1 trillion tokens for up to 3 epochs. It leverages advances such as FlashAttention and Lit-GPT for computational efficiency, and it significantly outperforms other open-source models of a similar size. The authors discuss recent findings on the effectiveness of training smaller models on larger datasets and explore TinyLlama's capabilities on various downstream tasks. The pre-training data combines natural language and code from the SlimPajama and StarCoder datasets, totaling roughly 950 billion tokens. The authors evaluate TinyLlama across commonsense reasoning and problem-solving tasks, demonstrating superior performance compared to baseline models of comparable size. The paper concludes by emphasizing the model's accessibility for language model research, the release of open-source resources, and future research directions.
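
For readers who want to try the released checkpoints directly, the following is a minimal sketch using the Hugging Face transformers library; the repository id used here is an assumption and should be checked against the GitHub page linked above.

```python
# Minimal sketch: load a TinyLlama checkpoint and sample a continuation.
# The repository id is an assumption; consult the project's GitHub page for
# the exact names of the released checkpoints.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"  # assumed name

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

inputs = tokenizer("TinyLlama is a compact language model that", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```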

Methods

This paper employs the following methods (a configuration sketch follows the list):

  • Transformer
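
For orientation, here is a minimal configuration sketch of the Llama-2-style transformer used by TinyLlama. The hyperparameter values follow those reported in the paper (2048 hidden size, 22 layers, grouped-query attention), but they are illustrative and not a substitute for the released configuration files.

```python
# Illustrative Llama-2-style configuration at the ~1.1B-parameter scale.
# Values follow the hyperparameters reported for TinyLlama; treat them as a
# sketch, not a drop-in replacement for the released config files.
from dataclasses import dataclass

@dataclass
class TinyLlamaConfig:
    vocab_size: int = 32_000             # Llama 2 tokenizer
    hidden_size: int = 2_048
    intermediate_size: int = 5_632       # SwiGLU feed-forward width
    num_layers: int = 22
    num_attention_heads: int = 32
    num_kv_heads: int = 4                # grouped-query attention
    max_seq_len: int = 2_048
    positional_encoding: str = "rope"    # rotary positional embeddings
    normalization: str = "rmsnorm"       # pre-norm with RMSNorm

print(TinyLlamaConfig())
```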

Models Used

  • TinyLlama
  • TinyLlama v1.1
  • OPT-1.3B
  • Pythia-1.0B
  • Pythia-1.4B
  • TinyLlama v1.1 Math&Code
  • TinyLlama v1.1 Chinese

Datasets

The following datasets were used in this research (a sketch of how they are mixed during pre-training follows the list):

  • SlimPajama
  • StarCoder
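
During pre-training the two corpora are sampled together, with natural language making up roughly 70% of the stream and code the remaining 30%, per the paper's description. The sketch below illustrates that kind of weighted interleaving; the iterator names are hypothetical stand-ins for the real streaming loaders.

```python
# Sketch of weighted interleaving of natural-language and code data.
# The ~7:3 SlimPajama-to-StarCoder ratio follows the paper's description;
# iter_slimpajama() and iter_starcoder() are hypothetical placeholders for
# the actual streaming data loaders.
import random

def iter_slimpajama():
    while True:
        yield "<a SlimPajama document>"

def iter_starcoder():
    while True:
        yield "<a StarCoder source file>"

def mixed_stream(nl_weight=0.7, code_weight=0.3, seed=0):
    rng = random.Random(seed)
    sources = (iter_slimpajama(), iter_starcoder())
    while True:
        chosen = rng.choices(sources, weights=(nl_weight, code_weight))[0]
        yield next(chosen)

stream = mixed_stream()
print([next(stream) for _ in range(5)])
```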

Evaluation Metrics

  • Accuracy (reported in zero-shot, 3-shot, and 5-shot settings; see the scoring sketch below)
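
These accuracies come from standard zero- and few-shot evaluation, in which a multiple-choice question is scored by comparing the model's log-likelihood of each candidate completion; few-shot settings simply prepend solved exemplars to the prompt. The sketch below illustrates that scoring loop; the checkpoint id and the toy question are assumptions for illustration.

```python
# Sketch of zero-shot multiple-choice scoring via log-likelihood comparison.
# The checkpoint id is an assumption; the question and choices are toy examples.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"  # assumed name
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id).eval()

def sequence_logprob(text: str) -> float:
    """Sum of next-token log-probabilities of `text` under the model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = ids[:, 1:]
    return logprobs.gather(-1, targets.unsqueeze(-1)).sum().item()

prompt = "Question: What do plants need for photosynthesis?\nAnswer:"
choices = [" sunlight", " darkness"]
scores = [sequence_logprob(prompt + c) for c in choices]
print("predicted answer:", choices[scores.index(max(scores))])
```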

Results

  • Significant improvements over similarly sized baselines on commonsense reasoning tasks
  • Outperforms existing open-source models of comparable size on problem-solving tasks
  • Remains competitive with larger models despite its lower parameter count

Technical Requirements

  • Number of GPUs: 16
  • GPU Type: NVIDIA A100 40GB
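
The 16 A100s listed above are used for multi-GPU, multi-node pre-training (the released training code builds on Lit-GPT with fully sharded data parallelism). The snippet below is only a generic initialization sketch of how such a job, e.g. two 8-GPU nodes, is typically wired up with PyTorch; it is not the project's actual launch script.

```python
# Generic PyTorch distributed initialization sketch (not the project's actual
# launch code). With 16 A100 GPUs this would typically be started via torchrun
# across two 8-GPU nodes; torchrun populates RANK/LOCAL_RANK/WORLD_SIZE.
import os
import torch
import torch.distributed as dist

def setup_distributed() -> int:
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)
    return local_rank

if __name__ == "__main__":
    local_rank = setup_distributed()
    print(f"rank {dist.get_rank()} of {dist.get_world_size()} on local GPU {local_rank}")
    dist.destroy_process_group()
```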

Keywords

TinyLlama, small language model, open source, transformer, pre-training, efficiency
