Woosuk Kwon (UC Berkeley), Zhuohan Li (UC Berkeley), Siyuan Zhuang (UC Berkeley), Ying Sheng (UC Berkeley and Stanford University), Lianmin Zheng (UC Berkeley), Cody Hao Yu (Independent Researcher), Joseph E. Gonzalez (UC Berkeley), Hao Zhang (UC San Diego), Ion Stoica (UC Berkeley) (2023)
This paper introduces PagedAttention and presents vLLM, a high-throughput serving system for large language models (LLMs). The authors identify inefficient memory management, particularly of the key-value (KV) cache, as a central bottleneck in existing LLM serving systems. They propose PagedAttention, an attention algorithm that stores the KV cache in fixed-size blocks that need not be contiguous in memory, which reduces internal and external fragmentation and enables KV cache sharing among requests. The vLLM system built on PagedAttention achieves 2-4x higher throughput than state-of-the-art systems such as FasterTransformer and Orca at similar levels of latency, with larger gains for longer sequences, larger models, and more complex decoding algorithms. The paper also describes vLLM's architecture and implementation details, explaining how it addresses memory challenges and optimizes performance for high-demand LLM serving.
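To make the core idea concrete, the following is a minimal, hypothetical Python sketch of block-table-based KV cache allocation in the spirit of PagedAttention. The class names (`BlockAllocator`, `Sequence`), the `BLOCK_SIZE` value, and the reference-counted sharing scheme are assumptions made for illustration; they do not reflect vLLM's actual implementation or API.

```python
# Hypothetical sketch of paged KV cache management: tokens are stored in
# fixed-size blocks, and each sequence keeps a block table mapping logical
# block indices to physical blocks, so memory need not be contiguous.
from typing import Dict, List

BLOCK_SIZE = 16  # tokens per KV cache block (illustrative value)


class BlockAllocator:
    """Hands out fixed-size KV cache blocks from a shared free pool."""

    def __init__(self, num_blocks: int):
        self.free_blocks: List[int] = list(range(num_blocks))
        self.ref_counts: Dict[int, int] = {}

    def allocate(self) -> int:
        block = self.free_blocks.pop()
        self.ref_counts[block] = 1
        return block

    def fork(self, block: int) -> int:
        # Share an existing block (e.g. a common prompt prefix) across
        # requests instead of copying it; copying happens only on write.
        self.ref_counts[block] += 1
        return block

    def free(self, block: int) -> None:
        self.ref_counts[block] -= 1
        if self.ref_counts[block] == 0:
            self.free_blocks.append(block)


class Sequence:
    """Maps a request's logical token positions to physical KV cache blocks."""

    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: List[int] = []  # logical block index -> physical block
        self.num_tokens = 0

    def append_token(self) -> None:
        # A new physical block is needed only when the last one is full, so
        # at most BLOCK_SIZE - 1 slots per sequence are ever left unused.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1


if __name__ == "__main__":
    allocator = BlockAllocator(num_blocks=1024)
    seq = Sequence(allocator)
    for _ in range(40):       # decode 40 tokens
        seq.append_token()
    print(seq.block_table)    # three physical blocks, not necessarily contiguous
```

The sketch highlights the design choice the summary describes: because blocks are allocated on demand and tracked through a per-sequence block table, memory is not reserved for a request's maximum possible length up front, and identical blocks can be shared across requests until one of them diverges.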
This paper employs the following methods: PagedAttention, block-level (paged) KV cache management, and copy-on-write KV cache sharing across requests.
The following datasets were used in this research: ShareGPT and Alpaca, used to synthesize the serving workloads in the evaluation.
The authors identified the following limitations: