
The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits

Shuming Ma, Hongyu Wang, Lingxiao Ma, Lei Wang, Wenhui Wang, Shaohan Huang, Li Dong, Ruiping Wang, Jilong Xue, Furu Wei (2024). Project page: https://aka.ms/GeneralAI

Paper Information
  • arXiv ID: 2402.17764
  • Venue: arXiv.org
  • Domain: artificial intelligence, natural language processing
  • SOTA Claim: Yes
  • Reproducibility: 6/10

Abstract

Recent research, such as BitNet [WMD+23], is paving the way for a new era of 1-bit Large Language Models (LLMs). In this work, we introduce a 1-bit LLM variant, namely BitNet b1.58, in which every single parameter (or weight) of the LLM is ternary {-1, 0, 1}. It matches the full-precision (i.e., FP16 or BF16) Transformer LLM with the same model size and training tokens in terms of both perplexity and end-task performance, while being significantly more cost-effective in terms of latency, memory, throughput, and energy consumption. More profoundly, the 1.58-bit LLM defines a new scaling law and recipe for training new generations of LLMs that are both high-performance and cost-effective. Furthermore, it enables a new computation paradigm and opens the door for designing specific hardware optimized for 1-bit LLMs.
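
To make the ternary weight format concrete, the following is a minimal NumPy sketch of the absmean quantization described in the paper (scale by the mean absolute weight, then round and clip to {-1, 0, 1}); the function and variable names are illustrative rather than the authors' implementation.

```python
import numpy as np

def absmean_ternary_quantize(W, eps=1e-6):
    """Quantize a weight matrix to {-1, 0, +1} by scaling with the mean
    absolute value (absmean), then rounding and clipping."""
    gamma = np.mean(np.abs(W)) + eps        # absmean scale of the full matrix
    W_ternary = np.clip(np.round(W / gamma), -1, 1)
    return W_ternary, gamma                 # gamma is kept to rescale outputs

# Example: small weights collapse to 0, larger ones to +/-1
W = np.random.randn(4, 8) * 0.02
W_q, gamma = absmean_ternary_quantize(W)
print(np.unique(W_q))                       # subset of {-1., 0., 1.}
```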

Summary

The paper introduces BitNet b1.58, a 1.58-bit (ternary) variant of Large Language Models (LLMs) in which every weight takes a value in {-1, 0, 1}. It matches full-precision (FP16/BF16) Transformer baselines of the same model size and training tokens in perplexity and end-task performance, while being more cost-effective in latency, memory, throughput, and energy consumption. The architecture largely eliminates multiplications from matrix operations, which reduces computation and energy use. By retaining performance on a range of natural language tasks, BitNet b1.58 enables deployment of LLMs on resource-constrained devices, opens new avenues for hardware optimized for 1-bit models, and suggests a new scaling law and training recipe. Overall, it indicates that 1.58-bit LLMs can match state-of-the-art full-precision performance with much better efficiency, particularly at larger model sizes.
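
To illustrate why ternary weights remove multiplications from matrix operations, here is a small sketch (not the authors' kernel): with weights restricted to {-1, 0, +1}, each output element reduces to a sum and a difference of activations.

```python
import numpy as np

def ternary_matvec(W_ternary, x):
    """Matrix-vector product when weights are in {-1, 0, +1}: every output
    element is computed with additions and subtractions only."""
    y = np.empty(W_ternary.shape[0], dtype=x.dtype)
    for i, row in enumerate(W_ternary):
        y[i] = x[row == 1].sum() - x[row == -1].sum()   # add, subtract, or skip
    return y

W = np.array([[1, 0, -1],
              [0, 1, 1]])
x = np.array([0.5, -2.0, 3.0])
print(ternary_matvec(W, x))   # [-2.5, 1.0]
```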

Methods

This paper employs the following methods:

  • 1.58-bit (ternary) quantization
  • ternary weight encoding in {-1, 0, 1} with 8-bit activations (see the sketch after this list)
  • LLaMA-alike architecture components (RMSNorm, SwiGLU, rotary embeddings)
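
A rough sketch of how these pieces fit together in a BitLinear-style layer, assuming absmean weight scaling and per-token absmax 8-bit activation quantization as in the BitNet line of work; the names, clipping range, and exact rescaling are illustrative rather than the authors' implementation.

```python
import numpy as np

def absmax_int8(x, eps=1e-6):
    """Per-token absmax quantization of activations to the int8 range."""
    scale = 127.0 / (np.max(np.abs(x), axis=-1, keepdims=True) + eps)
    return np.clip(np.round(x * scale), -128, 127), scale

def bitlinear_forward(x, W_ternary, gamma):
    """Forward pass with ternary weights and int8 activations: the matmul
    only accumulates signed values (weights are -1/0/+1); floating point
    re-enters when the output is rescaled by the weight scale gamma and
    the activation scale."""
    x_q, x_scale = absmax_int8(x)
    y = x_q @ W_ternary.T          # multiplication-free in principle
    return y * gamma / x_scale     # de-quantize the output

# Usage with stand-in ternary weights and a made-up weight scale
rng = np.random.default_rng(0)
W = np.clip(np.round(rng.standard_normal((16, 32))), -1, 1)
x = rng.standard_normal((2, 32))
print(bitlinear_forward(x, W, gamma=0.02).shape)   # (2, 16)
```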

Models Used

  • BitNet b1.58
  • LLaMA LLM

Datasets

The following datasets were used in this research:

  • RedPajama

Evaluation Metrics

  • Perplexity (see the sketch after this list)
  • Zero-shot accuracy
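
For reference, perplexity is the exponential of the mean per-token negative log-likelihood; a minimal sketch follows (the token losses below are made-up numbers, not results from the paper).

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token), natural log."""
    return math.exp(sum(token_nlls) / len(token_nlls))

print(perplexity([2.1, 1.8, 2.4]))   # exp(2.1) ~= 8.17
```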

Results

  • BitNet b1.58 matches the FP16 LLaMA baseline in perplexity and end-task performance starting from the 3B model size
  • BitNet b1.58 is significantly faster and more memory-efficient than the FP16 LLaMA LLM
  • Arithmetic-operation energy for matrix multiplication is 71.4× lower for BitNet b1.58 than for the LLaMA LLM

Limitations

The authors identified the following limitations:

  • Not specified

Technical Requirements

  • Number of GPUs: None specified
  • GPU Type: NVIDIA A100 80GB

Keywords

Large Language Models, quantization, 1-bit models, efficiency, hardware

Papers Using Similar Methods

External Resources