
KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization

Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W. Mahoney, Yakun Sophia Shao, Kurt Keutzer, Amir Gholami (University of California, Berkeley), 2024

Paper Information

  • arXiv ID: 2401.18079
  • Venue: Neural Information Processing Systems (NeurIPS 2024)
  • Domain: Natural Language Processing
  • Code: https://github.com/SqueezeAILab/KVQuant
  • Reproducibility: 8/10

Abstract

LLMs are seeing growing use for applications which require large context windows, and with these large context windows KV cache activations surface as the dominant contributor to memory consumption during inference. Quantization is a promising approach for compressing KV cache activations; however, existing solutions fail to represent activations accurately in sub-4-bit precision. Our work, KVQuant, facilitates low precision KV cache quantization by incorporating several novel methods: (i) Per-Channel Key Quantization, where we adjust the dimension along which we quantize the Key activations to better match the distribution; (ii) Pre-RoPE Key Quantization, where we quantize Key activations before the rotary positional embedding to mitigate its impact on quantization; (iii) Non-Uniform KV Cache Quantization, where we derive per-layer sensitivity-weighted non-uniform datatypes that better represent the distributions; and (iv) Per-Vector Dense-and-Sparse Quantization, where we isolate outliers separately for each vector to minimize skews in quantization ranges. By applying our method to the LLaMA, Llama-2, Llama-3, and Mistral models, we achieve < 0.1 perplexity degradation with 3-bit quantization on both Wikitext-2 and C4, outperforming existing approaches. Our method enables serving LLaMA-7B with a context length of up to 1 million on a single A100-80GB GPU and up to 10 million on an 8-GPU system. We develop custom CUDA kernels for KVQuant, showing that we can achieve up to ∼1.7× speedups, compared to baseline fp16 matrix-vector multiplications, for the LLaMA-7B model. Code is available at https://github.com/SqueezeAILab/KVQuant.
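
To make the per-channel idea (i) concrete, here is a minimal NumPy sketch contrasting per-token and per-channel asymmetric quantization of a toy Key matrix with outlier channels. It is an illustration under assumed shapes and bit widths, not the paper's CUDA implementation:

```python
import numpy as np

def asymmetric_quantize(x, bits, axis):
    """Asymmetric uniform quantization with one scale/offset per slice.

    Min/max are reduced along `axis`, so reducing over the channel axis
    gives per-token scales, and reducing over the token axis gives
    per-channel scales.
    """
    qmax = 2**bits - 1
    lo = x.min(axis=axis, keepdims=True)
    hi = x.max(axis=axis, keepdims=True)
    scale = np.where(hi > lo, (hi - lo) / qmax, 1.0)
    q = np.clip(np.round((x - lo) / scale), 0, qmax)
    return q * scale + lo                      # dequantized approximation

# Toy Key matrix [num_tokens, head_dim] with a few large-magnitude channels,
# mimicking the channel-wise outlier structure of Key activations.
rng = np.random.default_rng(0)
K = rng.normal(size=(128, 64)).astype(np.float32)
K[:, :4] *= 20.0                               # assumed outlier channels

per_token   = asymmetric_quantize(K, bits=3, axis=1)  # one scale per token
per_channel = asymmetric_quantize(K, bits=3, axis=0)  # one scale per channel

print("per-token   MSE:", float(np.mean((K - per_token) ** 2)))
print("per-channel MSE:", float(np.mean((K - per_channel) ** 2)))
```

Because large Key values concentrate in a few channels, grouping the quantization ranges per channel keeps those outliers from inflating the step size for every other channel, which is the motivation the abstract gives for changing the quantization dimension.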

Summary

This paper presents KVQuant, a method for improving the inference of large language models (LLMs) by efficiently quantizing the Key-Value (KV) cache activations, which are identified as the primary contributor to memory usage during inference with large context lengths. The authors introduce several techniques for low-precision KV cache quantization: 1) Per-Channel Key Quantization for a better match to the Key activation distribution, 2) Pre-RoPE Key Quantization to mitigate the impact of rotary positional embeddings on quantization, 3) Non-Uniform KV Cache Quantization using per-layer sensitivity-weighted datatypes, and 4) Per-Vector Dense-and-Sparse Quantization to handle outliers effectively. Experiments on LLaMA, Llama-2, Llama-3, and Mistral models show less than 0.1 perplexity degradation with 3-bit quantization while reducing the cached activation memory footprint by up to 4.8×. The method enables serving LLaMA-7B with context lengths of up to 1 million tokens on a single A100-80GB GPU and up to 10 million tokens on an 8-GPU system, and custom CUDA kernels improve inference speed. The paper concludes that KVQuant enables accurate long-context LLM inference with reduced memory requirements and faster computation.
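
For intuition on the dense-and-sparse component (item 4), the sketch below keeps an assumed ~1% of the largest-magnitude entries of each vector exactly and uniformly quantizes the rest. The paper pairs this with fp16 sparse storage and its sensitivity-weighted non-uniform datatype rather than the plain uniform quantizer used here:

```python
import numpy as np

def dense_and_sparse_quantize(v, bits=3, outlier_frac=0.01):
    """Illustrative per-vector dense-and-sparse quantization.

    The largest-magnitude entries of `v` are kept exactly as a sparse
    (index, value) mapping; the remaining values are quantized uniformly
    over the narrower range left once the outliers are removed.
    """
    v = np.asarray(v, dtype=np.float32)
    k = max(1, int(round(outlier_frac * v.size)))
    outlier_idx = np.argsort(np.abs(v))[-k:]
    sparse = {int(i): float(v[i]) for i in outlier_idx}

    dense = v.copy()
    dense[outlier_idx] = 0.0                          # exclude outliers from the range
    qmax = 2**bits - 1
    lo, hi = float(dense.min()), float(dense.max())
    scale = (hi - lo) / qmax if hi > lo else 1.0
    q = np.clip(np.round((dense - lo) / scale), 0, qmax).astype(np.uint8)
    return q, scale, lo, sparse

def dequantize(q, scale, lo, sparse):
    v = q.astype(np.float32) * scale + lo
    for i, val in sparse.items():                     # restore exact outlier values
        v[i] = val
    return v

rng = np.random.default_rng(0)
v = rng.normal(size=4096).astype(np.float32)
v[rng.integers(0, 4096, size=8)] *= 50.0              # inject a few outliers

q, scale, lo, sparse = dense_and_sparse_quantize(v, bits=3, outlier_frac=0.01)
print("reconstruction MSE:", float(np.mean((v - dequantize(q, scale, lo, sparse)) ** 2)))
```

Isolating the outliers per vector keeps the dense quantization range narrow, so the low-bit levels are spent on the bulk of the distribution rather than on a handful of extreme values.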

Methods

This paper employs the following methods:

  • Per-Channel Key Quantization
  • Pre-RoPE Key Quantization (see the sketch after this list)
  • Non-Uniform KV Cache Quantization
  • Per-Vector Dense-and-Sparse Quantization
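
A minimal sketch of the Pre-RoPE item referenced above: Keys are quantized per channel before the rotary embedding, and RoPE is applied on the fly after dequantization at attention time. The RoPE helper, shapes, and bit width here are simplified assumptions, not the paper's fused kernel:

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Apply a rotary positional embedding to x of shape [tokens, head_dim]."""
    d = x.shape[-1]
    inv_freq = 1.0 / (base ** (np.arange(0, d, 2) / d))
    angles = np.outer(positions, inv_freq)            # [tokens, head_dim/2]
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def quantize_per_channel(x, bits=3):
    """Asymmetric quantization with one scale/offset per channel (column)."""
    qmax = 2**bits - 1
    lo, hi = x.min(axis=0, keepdims=True), x.max(axis=0, keepdims=True)
    scale = np.where(hi > lo, (hi - lo) / qmax, 1.0)
    q = np.clip(np.round((x - lo) / scale), 0, qmax).astype(np.uint8)
    return q, scale, lo

rng = np.random.default_rng(0)
K = rng.normal(size=(16, 64)).astype(np.float32)      # Keys before RoPE
positions = np.arange(16)

# Cache Keys quantized *before* RoPE so the channel-wise structure that
# per-channel quantization relies on is not mixed across channel pairs.
q, scale, lo = quantize_per_channel(K, bits=3)
K_hat = rope(q.astype(np.float32) * scale + lo, positions)  # RoPE applied on the fly
K_ref = rope(K, positions)
print("post-RoPE reconstruction MSE:", float(np.mean((K_ref - K_hat) ** 2)))
```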

Models Used

  • LLaMA
  • Llama-2
  • Llama-3
  • Mistral

Datasets

The following datasets were used in this research:

  • Wikitext-2
  • C4

Evaluation Metrics

  • Perplexity

Results

  • Less than 0.1 perplexity degradation on Wikitext-2 and C4 with 3-bit quantization
  • 4.8× reduction in cached activation memory footprint (see the back-of-the-envelope sketch below)
  • Up to ∼1.7× kernel speedup over baseline fp16 matrix-vector multiplications for LLaMA-7B
  • Enables LLaMA-7B with a context length of up to 1 million on a single A100-80GB GPU and up to 10 million with multi-GPU (8-GPU) serving
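
As a rough back-of-the-envelope check on the 4.8× figure, the sketch below uses standard LLaMA-7B architecture constants; the effective bits-per-value is inferred from the reported compression ratio rather than taken from the paper's tables:

```python
# KV cache bytes per token for LLaMA-7B: 32 layers, KV hidden size 4096,
# Keys and Values both cached, fp16 baseline.
n_layers, kv_dim, fp16_bits = 32, 4096, 16
effective_bits = fp16_bits / 4.8          # ~3.33 bits/value: 3-bit payload plus
                                          # scales, zero-points, and sparse
                                          # outlier overhead (assumed split)

bytes_per_token_fp16 = 2 * n_layers * kv_dim * fp16_bits / 8
bytes_per_token_kvq  = 2 * n_layers * kv_dim * effective_bits / 8

print(f"fp16 KV cache : {bytes_per_token_fp16 / 1024:.0f} KiB per token")
print(f"3-bit KVQuant : {bytes_per_token_kvq / 1024:.0f} KiB per token")
print(f"compression   : {bytes_per_token_fp16 / bytes_per_token_kvq:.1f}x")
```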

Limitations

The authors identified the following limitations:

  • Significant work is still needed to train LLMs with context lengths greater than 100K
  • Memory allocation is inefficient when updating the sparse outlier matrices

Technical Requirements

  • Number of GPUs: 8
  • GPU Type: A100-80GB

Keywords

KV cache, quantization, long context length, low-bit precision, CUDA kernels

External Resources

  • Code: https://github.com/SqueezeAILab/KVQuant