
TinyLlama: An Open-Source Small Language Model

Peiyuan Zhang, Guangtao Zeng, Tianduo Wang, Wei Lu. StatNLP Research Group, Singapore University of Technology and Design (2024)

Paper Information

  • arXiv ID: 2401.02385
  • Venue: arXiv.org
  • Domain: natural language processing
  • Code: https://github.com/jzhang38/TinyLlama
  • Reproducibility: 6/10

Abstract

We present TinyLlama, a compact 1.1B language model pretrained on around 1 trillion tokens for up to 3 epochs.¹ Building on the architecture and tokenizer of Llama 2 (Touvron et al., 2023b), TinyLlama leverages various advances contributed by the open-source community, e.g., FlashAttention (Dao, 2023) and Lit-GPT (Lightning-AI, 2023), achieving better computational efficiency. Despite its relatively small size, TinyLlama demonstrates remarkable performance in a series of downstream tasks. It significantly outperforms existing open-source language models with comparable sizes. Our model checkpoints and code are publicly available on GitHub at https://github.com/jzhang38/TinyLlama.

* The first two authors contributed equally.
¹ Our latest model, TinyLlama v1.1, is only trained for 2 trillion tokens. More details about this latest version are elaborated in a later section.

Summary

This paper introduces TinyLlama, an open-source compact language model with 1.1 billion parameters, pretrained on approximately 1 trillion tokens for up to 3 epochs. It leverages advances such as FlashAttention and Lit-GPT for computational efficiency, and it significantly outperforms other open-source models of a similar size. The authors discuss recent findings on the effectiveness of training smaller models on larger datasets and explore TinyLlama's capabilities on various downstream tasks. The pre-training data combines natural language and code from the SlimPajama and StarCoder datasets, totaling roughly 950 billion tokens. The authors evaluate TinyLlama across commonsense reasoning and problem-solving tasks, demonstrating superior performance compared to baseline models of comparable size. The paper concludes by emphasizing the model's accessibility for language model research, the release of open-source resources, and future research directions.
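
For readers who want to try the released checkpoints directly, the following is a minimal sketch using the Hugging Face transformers library; the repository id used here is an assumption and should be checked against the GitHub page linked above.

```python
# Minimal sketch: load a TinyLlama checkpoint and sample a continuation.
# The repository id is an assumption; consult the project's GitHub page for
# the exact names of the released checkpoints.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"  # assumed name

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

inputs = tokenizer("TinyLlama is a compact language model that", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```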

Methods

This paper employs the following methods (a configuration sketch follows the list):

  • Transformer
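
For orientation, here is a minimal configuration sketch of the Llama-2-style transformer used by TinyLlama. The hyperparameter values follow those reported in the paper (2048 hidden size, 22 layers, grouped-query attention), but they are illustrative and not a substitute for the released configuration files.

```python
# Illustrative Llama-2-style configuration at the ~1.1B-parameter scale.
# Values follow the hyperparameters reported for TinyLlama; treat them as a
# sketch, not a drop-in replacement for the released config files.
from dataclasses import dataclass

@dataclass
class TinyLlamaConfig:
    vocab_size: int = 32_000             # Llama 2 tokenizer
    hidden_size: int = 2_048
    intermediate_size: int = 5_632       # SwiGLU feed-forward width
    num_layers: int = 22
    num_attention_heads: int = 32
    num_kv_heads: int = 4                # grouped-query attention
    max_seq_len: int = 2_048
    positional_encoding: str = "rope"    # rotary positional embeddings
    normalization: str = "rmsnorm"       # pre-norm with RMSNorm

print(TinyLlamaConfig())
```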

Models Used

  • TinyLlama
  • TinyLlama v1.1
  • OPT-1.3B
  • Pythia-1.0B
  • Pythia-1.4B
  • TinyLlama v1.1 Math&Code
  • TinyLlama v1.1 Chinese

Datasets

The following datasets were used in this research (a sketch of how they are mixed during pre-training follows the list):

  • SlimPajama
  • StarCoder
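
During pre-training the two corpora are sampled together, with natural language making up roughly 70% of the stream and code the remaining 30%, per the paper's description. The sketch below illustrates that kind of weighted interleaving; the iterator names are hypothetical stand-ins for the real streaming loaders.

```python
# Sketch of weighted interleaving of natural-language and code data.
# The ~7:3 SlimPajama-to-StarCoder ratio follows the paper's description;
# iter_slimpajama() and iter_starcoder() are hypothetical placeholders for
# the actual streaming data loaders.
import random

def iter_slimpajama():
    while True:
        yield "<a SlimPajama document>"

def iter_starcoder():
    while True:
        yield "<a StarCoder source file>"

def mixed_stream(nl_weight=0.7, code_weight=0.3, seed=0):
    rng = random.Random(seed)
    sources = (iter_slimpajama(), iter_starcoder())
    while True:
        chosen = rng.choices(sources, weights=(nl_weight, code_weight))[0]
        yield next(chosen)

stream = mixed_stream()
print([next(stream) for _ in range(5)])
```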

Evaluation Metrics

  • Accuracy (reported in zero-shot, 3-shot, and 5-shot settings; see the scoring sketch below)
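
These accuracies come from standard zero- and few-shot evaluation, in which a multiple-choice question is scored by comparing the model's log-likelihood of each candidate completion; few-shot settings simply prepend solved exemplars to the prompt. The sketch below illustrates that scoring loop; the checkpoint id and the toy question are assumptions for illustration.

```python
# Sketch of zero-shot multiple-choice scoring via log-likelihood comparison.
# The checkpoint id is an assumption; the question and choices are toy examples.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"  # assumed name
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id).eval()

def sequence_logprob(text: str) -> float:
    """Sum of next-token log-probabilities of `text` under the model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = ids[:, 1:]
    return logprobs.gather(-1, targets.unsqueeze(-1)).sum().item()

prompt = "Question: What do plants need for photosynthesis?\nAnswer:"
choices = [" sunlight", " darkness"]
scores = [sequence_logprob(prompt + c) for c in choices]
print("predicted answer:", choices[scores.index(max(scores))])
```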

Results

  • Significant improvements over similarly sized baselines on commonsense reasoning tasks
  • Outperforms existing open-source models of comparable size on problem-solving tasks
  • Remains competitive with larger models despite its lower parameter count

Technical Requirements

  • Number of GPUs: 16
  • GPU Type: NVIDIA A100 40GB
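
The 16 A100s listed above are used for multi-GPU, multi-node pre-training (the released training code builds on Lit-GPT with fully sharded data parallelism). The snippet below is only a generic initialization sketch of how such a job, e.g. two 8-GPU nodes, is typically wired up with PyTorch; it is not the project's actual launch script.

```python
# Generic PyTorch distributed initialization sketch (not the project's actual
# launch code). With 16 A100 GPUs this would typically be started via torchrun
# across two 8-GPU nodes; torchrun populates RANK/LOCAL_RANK/WORLD_SIZE.
import os
import torch
import torch.distributed as dist

def setup_distributed() -> int:
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)
    return local_rank

if __name__ == "__main__":
    local_rank = setup_distributed()
    print(f"rank {dist.get_rank()} of {dist.get_world_size()} on local GPU {local_rank}")
    dist.destroy_process_group()
```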

Keywords

TinyLlama, small language model, open source, transformer, pre-training, efficiency
