
EFFICIENT STREAMING LANGUAGE MODELS WITH ATTENTION SINKS

Guangxuan Xiao (Massachusetts Institute of Technology), Yuandong Tian (Meta AI), Beidi Chen (Carnegie Mellon University), Song Han (Massachusetts Institute of Technology, NVIDIA), Mike Lewis (Meta AI) (2023)

Paper Information
  • arXiv ID: 2309.17453
  • Venue: International Conference on Learning Representations
  • Domain: natural language processing
  • Code: Available
  • Reproducibility: 8/10

Abstract

Deploying Large Language Models (LLMs) in streaming applications such as multi-round dialogue, where long interactions are expected, is urgently needed but poses two major challenges. Firstly, during the decoding stage, caching previous tokens' Key and Value states (KV) consumes extensive memory. Secondly, popular LLMs cannot generalize to longer texts than the training sequence length. Window attention, where only the most recent KVs are cached, is a natural approach, but we show that it fails when the text length surpasses the cache size. We observe an interesting phenomenon, namely attention sink: keeping the KV of initial tokens largely recovers the performance of window attention. In this paper, we first demonstrate that the attention sink emerges because of the strong attention scores towards initial tokens, which act as a "sink" even if they are not semantically important. Based on this analysis, we introduce StreamingLLM, an efficient framework that enables LLMs trained with a finite-length attention window to generalize to infinite sequence length without any fine-tuning. We show that StreamingLLM can enable Llama-2, MPT, Falcon, and Pythia to perform stable and efficient language modeling with up to 4 million tokens and more. In addition, we discover that adding a placeholder token as a dedicated attention sink during pre-training can further improve streaming deployment. In streaming settings, StreamingLLM outperforms the sliding window re-computation baseline by up to 22.2× speedup. Code and datasets are provided in the link.
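
To make the contrast in the abstract concrete, the following is a small illustrative sketch (not from the paper or its codebase; parameter names such as `window` and `num_sinks` are assumptions) of the attention masks implied by dense causal attention, window attention, and window attention augmented with a few initial "sink" tokens, as used by StreamingLLM.

```python
# Illustrative masks for a query at position t over a context of length T.
# True marks the key positions the query may attend to.
import numpy as np

def dense_causal_mask(t: int, T: int) -> np.ndarray:
    return np.arange(T) <= t                      # attend to all previous tokens

def window_mask(t: int, T: int, window: int) -> np.ndarray:
    pos = np.arange(T)
    return (pos <= t) & (pos > t - window)        # only the most recent `window` tokens

def sink_window_mask(t: int, T: int, window: int, num_sinks: int) -> np.ndarray:
    pos = np.arange(T)
    recent = (pos <= t) & (pos > t - window)
    sinks = pos < num_sinks                       # always include the first few tokens
    return recent | sinks                         # StreamingLLM-style: sinks + recent window

# Example: T=10, t=9, window=4, num_sinks=2 keeps positions {0, 1} and {6, 7, 8, 9}.
print(sink_window_mask(9, 10, 4, 2).astype(int))  # [1 1 0 0 0 0 1 1 1 1]
```

Window attention corresponds to `window_mask` alone and loses the initial positions once the text outgrows the cache; according to the paper, retaining the handful of sink positions is what restores its performance.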

Summary

This paper introduces StreamingLLM, an efficient framework for deploying large language models (LLMs) in streaming applications, addressing the memory cost of KV caching and the poor generalization of LLMs to texts longer than the training length. It highlights the concept of 'attention sinks': initial tokens receive disproportionately high attention scores even when they are not semantically important, so evicting them, as plain window attention does, collapses performance once the text exceeds the cache size. By keeping the Key and Value states of these initial tokens alongside a rolling window of recent tokens during inference, StreamingLLM lets models such as Llama-2, MPT, Falcon, and Pythia handle effectively unbounded sequence lengths without any fine-tuning. The authors demonstrate substantial speedups over the sliding-window re-computation baseline and stable performance on extended text streams. They also show that adding a dedicated placeholder sink token during pre-training further improves streaming deployment while preserving quality on standard language modeling tasks.
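
The mechanism summarized above can be pictured as an eviction rule on the KV cache: always retain the first few tokens, keep a rolling window of recent tokens, and drop everything in between. Below is a minimal, hypothetical PyTorch sketch (names such as `SinkKVCache`, `num_sinks`, and `window` are illustrative and not taken from the released code); the paper reports that keeping roughly four initial tokens is sufficient in practice.

```python
# Hypothetical sketch of an attention-sink KV cache with a rolling window.
# Tensors follow the common (batch, heads, seq_len, head_dim) layout.
import torch

class SinkKVCache:
    def __init__(self, num_sinks: int = 4, window: int = 1020):
        self.num_sinks = num_sinks  # initial "sink" tokens that are never evicted
        self.window = window        # number of most recent tokens to keep
        self.k = None               # cached keys
        self.v = None               # cached values

    def update(self, new_k: torch.Tensor, new_v: torch.Tensor):
        # Append the KV states computed for the current decoding step(s).
        if self.k is None:
            self.k, self.v = new_k, new_v
        else:
            self.k = torch.cat([self.k, new_k], dim=2)
            self.v = torch.cat([self.v, new_v], dim=2)
        # Evict the middle of the sequence once the cache exceeds sinks + window.
        if self.k.size(2) > self.num_sinks + self.window:
            self.k = torch.cat([self.k[:, :, : self.num_sinks],
                                self.k[:, :, -self.window:]], dim=2)
            self.v = torch.cat([self.v[:, :, : self.num_sinks],
                                self.v[:, :, -self.window:]], dim=2)
        return self.k, self.v
```

Note that the full method also assigns positional information relative to positions inside the cache rather than in the original text; that detail is omitted from this sketch.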

Methods

This paper employs the following methods:

  • Attention Sink
  • Window Attention
  • StreamingLLM
  • Sliding window with re-computation (baseline)

Models Used

  • Llama-2
  • MPT
  • Falcon
  • Pythia

Datasets

The following datasets were used in this research:

  • PG19
  • StreamEval

Evaluation Metrics

  • Perplexity
  • Accuracy

Results

  • Stable and efficient language modeling on streams of up to 4 million tokens and more (Llama-2, MPT, Falcon, Pythia)
  • Up to 22.2× speedup over the sliding window re-computation baseline in streaming settings

Limitations

The authors identified the following limitations:

  • Not specified

Technical Requirements

  • Number of GPUs: 4
  • GPU Type: NVIDIA

Keywords

large language models, attention sinks, streaming inference, KV cache, window attention

Papers Using Similar Methods

External Resources