
The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits

Shuming Ma, Hongyu Wang, Lingxiao Ma, Lei Wang, Wenhui Wang, Shaohan Huang, Li Dong, Ruiping Wang, Jilong Xue, Furu Wei (2024). Project page: https://aka.ms/GeneralAI

Paper Information
  • arXiv ID: 2402.17764
  • Venue: arXiv.org
  • Domain: artificial intelligence, natural language processing
  • SOTA Claim: Yes
  • Reproducibility: 6/10

Abstract

Recent research, such as BitNet [WMD+23], is paving the way for a new era of 1-bit Large Language Models (LLMs). In this work, we introduce a 1-bit LLM variant, namely BitNet b1.58, in which every single parameter (or weight) of the LLM is ternary {-1, 0, 1}. It matches the full-precision (i.e., FP16 or BF16) Transformer LLM with the same model size and training tokens in terms of both perplexity and end-task performance, while being significantly more cost-effective in terms of latency, memory, throughput, and energy consumption. More profoundly, the 1.58-bit LLM defines a new scaling law and recipe for training new generations of LLMs that are both high-performance and cost-effective. Furthermore, it enables a new computation paradigm and opens the door for designing specific hardware optimized for 1-bit LLMs.
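
To make the ternary weight format concrete, the following is a minimal NumPy sketch of the absmean quantization described in the paper (scale by the mean absolute weight, then round and clip to {-1, 0, 1}); the function and variable names are illustrative rather than the authors' implementation.

```python
import numpy as np

def absmean_ternary_quantize(W, eps=1e-6):
    """Quantize a weight matrix to {-1, 0, +1} by scaling with the mean
    absolute value (absmean), then rounding and clipping."""
    gamma = np.mean(np.abs(W)) + eps        # absmean scale of the full matrix
    W_ternary = np.clip(np.round(W / gamma), -1, 1)
    return W_ternary, gamma                 # gamma is kept to rescale outputs

# Example: small weights collapse to 0, larger ones to +/-1
W = np.random.randn(4, 8) * 0.02
W_q, gamma = absmean_ternary_quantize(W)
print(np.unique(W_q))                       # subset of {-1., 0., 1.}
```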

Summary

The paper introduces BitNet b1.58, a 1.58-bit (ternary) variant of Large Language Models (LLMs) in which every weight takes a value in {-1, 0, 1}. It matches full-precision (FP16/BF16) Transformer baselines of the same model size and training tokens in perplexity and end-task performance, while being more cost-effective in latency, memory, throughput, and energy consumption. The architecture largely eliminates multiplications from matrix operations, which reduces computation and energy use. By retaining performance on a range of natural language tasks, BitNet b1.58 enables deployment of LLMs on resource-constrained devices, opens new avenues for hardware optimized for 1-bit models, and suggests a new scaling law and training recipe. Overall, it indicates that 1.58-bit LLMs can match state-of-the-art full-precision performance with much better efficiency, particularly at larger model sizes.
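
To illustrate why ternary weights remove multiplications from matrix operations, here is a small sketch (not the authors' kernel): with weights restricted to {-1, 0, +1}, each output element reduces to a sum and a difference of activations.

```python
import numpy as np

def ternary_matvec(W_ternary, x):
    """Matrix-vector product when weights are in {-1, 0, +1}: every output
    element is computed with additions and subtractions only."""
    y = np.empty(W_ternary.shape[0], dtype=x.dtype)
    for i, row in enumerate(W_ternary):
        y[i] = x[row == 1].sum() - x[row == -1].sum()   # add, subtract, or skip
    return y

W = np.array([[1, 0, -1],
              [0, 1, 1]])
x = np.array([0.5, -2.0, 3.0])
print(ternary_matvec(W, x))   # [-2.5, 1.0]
```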

Methods

This paper employs the following methods:

  • 1.58-bit (ternary) quantization
  • ternary weight encoding in {-1, 0, 1} with 8-bit activations (see the sketch after this list)
  • LLaMA-alike architecture components (RMSNorm, SwiGLU, rotary embeddings)
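
A rough sketch of how these pieces fit together in a BitLinear-style layer, assuming absmean weight scaling and per-token absmax 8-bit activation quantization as in the BitNet line of work; the names, clipping range, and exact rescaling are illustrative rather than the authors' implementation.

```python
import numpy as np

def absmax_int8(x, eps=1e-6):
    """Per-token absmax quantization of activations to the int8 range."""
    scale = 127.0 / (np.max(np.abs(x), axis=-1, keepdims=True) + eps)
    return np.clip(np.round(x * scale), -128, 127), scale

def bitlinear_forward(x, W_ternary, gamma):
    """Forward pass with ternary weights and int8 activations: the matmul
    only accumulates signed values (weights are -1/0/+1); floating point
    re-enters when the output is rescaled by the weight scale gamma and
    the activation scale."""
    x_q, x_scale = absmax_int8(x)
    y = x_q @ W_ternary.T          # multiplication-free in principle
    return y * gamma / x_scale     # de-quantize the output

# Usage with stand-in ternary weights and a made-up weight scale
rng = np.random.default_rng(0)
W = np.clip(np.round(rng.standard_normal((16, 32))), -1, 1)
x = rng.standard_normal((2, 32))
print(bitlinear_forward(x, W, gamma=0.02).shape)   # (2, 16)
```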

Models Used

  • BitNet b1.58
  • LLaMA LLM

Datasets

The following datasets were used in this research:

  • RedPajama

Evaluation Metrics

  • Perplexity (see the sketch after this list)
  • Zero-shot accuracy
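
For reference, perplexity is the exponential of the mean per-token negative log-likelihood; a minimal sketch follows (the token losses below are made-up numbers, not results from the paper).

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token), natural log."""
    return math.exp(sum(token_nlls) / len(token_nlls))

print(perplexity([2.1, 1.8, 2.4]))   # exp(2.1) ~= 8.17
```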

Results

  • BitNet b1.58 matches the FP16 LLaMA baseline in perplexity and end-task performance starting from the 3B model size
  • BitNet b1.58 is significantly faster and more memory-efficient than the FP16 LLaMA LLM
  • Arithmetic-operation energy for matrix multiplication is 71.4× lower for BitNet b1.58 than for the LLaMA LLM

Limitations

The authors identified the following limitations:

  • Not specified

Technical Requirements

  • Number of GPUs: None specified
  • GPU Type: NVIDIA A100 80GB

Keywords

Large Language Models, quantization, 1-bit models, efficiency, hardware

Papers Using Similar Methods

External Resources