
SparseGPT: Massive Language Models Can be Accurately Pruned in One-Shot

Elias Frantar, Dan Alistarh (2023)

Paper Information

  • arXiv ID: 2301.00774
  • Venue: International Conference on Machine Learning
  • Domain: Natural language processing
  • SOTA Claim: Yes
  • Code: Available
  • Reproducibility: 8/10

Abstract

We show for the first time that large-scale generative pretrained transformer (GPT) family models can be pruned to at least 50% sparsity in one-shot, without any retraining, at minimal loss of accuracy. This is achieved via a new pruning method called SparseGPT, specifically designed to work efficiently and accurately on massive GPT-family models. We can execute SparseGPT on the largest available open-source models, OPT-175B and BLOOM-176B, in under 4.5 hours, and can reach 60% unstructured sparsity with negligible increase in perplexity: remarkably, more than 100 billion weights from these models can be ignored at inference time. SparseGPT generalizes to semi-structured (2:4 and 4:8) patterns, and is compatible with weight quantization approaches. The code is available at: https://github.com/IST-DASLab/sparsegpt.
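The abstract mentions semi-structured 2:4 and 4:8 sparsity patterns, which hardware such as NVIDIA's sparse tensor cores can accelerate. The snippet below is a minimal PyTorch sketch of what an N:M pattern looks like; the magnitude-based selection rule and the function name `apply_n_m_sparsity` are illustrative assumptions, not the paper's method, since SparseGPT selects and reconstructs weights using second-order information.

```python
import torch

def apply_n_m_sparsity(weight: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
    """Zero out all but the n largest-magnitude weights in every group of m
    consecutive weights along the input dimension (the N:M pattern).

    Plain magnitude is used as the selection criterion purely for illustration;
    SparseGPT chooses which weights to drop (and updates the survivors) using
    Hessian-based information.
    """
    out_features, in_features = weight.shape
    assert in_features % m == 0, "input dimension must be divisible by m"

    groups = weight.reshape(out_features, in_features // m, m)
    # Indices of the (m - n) smallest-magnitude weights in each group.
    _, drop_idx = torch.topk(groups.abs(), k=m - n, dim=-1, largest=False)
    mask = torch.ones_like(groups)
    mask.scatter_(-1, drop_idx, 0.0)
    return (groups * mask).reshape(out_features, in_features)

# Example: prune a random linear weight matrix to the 2:4 pattern.
w = torch.randn(8, 16)
w_sparse = apply_n_m_sparsity(w, n=2, m=4)
assert (w_sparse.reshape(8, -1, 4) != 0).sum(-1).max() <= 2
```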

Summary

This paper introduces SparseGPT, a method for one-shot pruning of large generative pretrained transformer (GPT) family models to at least 50% sparsity without retraining and with minimal loss of accuracy. SparseGPT runs efficiently on massive models such as OPT-175B and BLOOM-176B, removing a large fraction of their weights while preserving performance, as measured by perplexity and zero-shot accuracy. The authors also study how model size affects prunability and find that larger models tolerate higher sparsity with less degradation. In addition, SparseGPT can be combined with weight quantization for further compression. Experiments confirm its advantage over traditional pruning baselines such as magnitude pruning, particularly at the 10-100 billion-parameter scale.
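For a concrete picture of the layer-wise, post-training workflow described above, here is a simplified PyTorch sketch: gather a small calibration batch, estimate a per-layer Hessian proxy from the layer's inputs, score weights, and zero the lowest-scoring ones in one shot. It deliberately omits SparseGPT's core contributions (the efficient sequential reconstruction of surviving weights and the block-wise Hessian updates that make 100B+ models tractable), so treat it as an assumption-laden illustration of the calibrate-score-mask workflow rather than the authors' algorithm.

```python
import torch

@torch.no_grad()
def prune_linear_one_shot(layer: torch.nn.Linear,
                          calib_inputs: torch.Tensor,
                          sparsity: float = 0.5,
                          damp: float = 0.01) -> None:
    """Simplified one-shot, post-training pruning of a single linear layer.

    Scores each weight with the OBS-style saliency w_ij^2 / [H^-1]_jj, where
    H = X^T X is estimated from a small calibration batch, then zeroes the
    lowest-scoring fraction in place. No retraining is performed.
    """
    W = layer.weight.data                      # (out_features, in_features)
    X = calib_inputs.reshape(-1, W.shape[1])   # (num_tokens, in_features)

    H = X.T @ X                                # input-covariance "Hessian" proxy
    H += damp * torch.mean(torch.diag(H)) * torch.eye(W.shape[1])
    H_inv_diag = torch.linalg.inv(H).diag()

    scores = W.pow(2) / H_inv_diag             # broadcast over output rows
    k = int(sparsity * W.numel())
    threshold = scores.flatten().kthvalue(k).values
    W[scores <= threshold] = 0.0

# Usage: prune one layer to 50% unstructured sparsity with stand-in calibration data.
lin = torch.nn.Linear(512, 512)
calib = torch.randn(128, 512)
prune_linear_one_shot(lin, calib, sparsity=0.5)
```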

Methods

This paper employs the following methods:

  • SparseGPT

Models Used

  • OPT-175B
  • BLOOM-176B

Datasets

The following datasets were used in this research:

  • C4
  • WikiText2
  • PTB

Evaluation Metrics

  • Perplexity
  • Zero-shot accuracy
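Both metrics follow standard recipes. As a reference point for the perplexity numbers, the sketch below shows the common WikiText2 evaluation loop with Hugging Face `transformers` and `datasets`; the small `facebook/opt-125m` checkpoint, the 2048-token window, and the non-overlapping stride are assumptions chosen to keep the example runnable and do not reproduce the paper's exact evaluation code.

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

@torch.no_grad()
def wikitext2_perplexity(model_name: str = "facebook/opt-125m",
                         seq_len: int = 2048) -> float:
    """Perplexity on the WikiText2 test split: concatenate the corpus, split it
    into non-overlapping windows, and average the per-token negative
    log-likelihood before exponentiating."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()

    text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1",
                                    split="test")["text"])
    ids = tok(text, return_tensors="pt").input_ids

    nlls = []
    for start in range(0, ids.shape[1] - seq_len, seq_len):
        window = ids[:, start:start + seq_len]
        out = model(window, labels=window)     # out.loss is the mean NLL per token
        nlls.append(out.loss)
    return torch.exp(torch.stack(nlls).mean()).item()

print(f"WikiText2 perplexity: {wikitext2_perplexity():.2f}")
```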

Results

  • Models pruned with SparseGPT reach 50-60% unstructured sparsity with a negligible increase in perplexity, and larger models show a smaller perplexity increase at the same sparsity level.

Limitations

The authors identified the following limitations:

  • Not specified

Technical Requirements

  • Number of GPUs: 5
  • GPU Type: NVIDIA A100 80GB

Keywords

sparse pruning, large language models, transformer models, post-training pruning, model compression

Papers Using Similar Methods

External Resources