[TOC]
vLLM documentation: https://vllm.readthedocs.io/en/latest/serving/distributed_serving.html
vLLM utilizes PagedAttention, a new attention algorithm that effectively manages attention keys and values.
- Result: vLLM achieves up to 24x higher throughput compared to HuggingFace Transformers (HF) and up to 3.5x higher throughput than HuggingFace Text Generation Inference (TGI).
PagedAttention
The following explanation is worth studying carefully:
We identify that the performance of LLM serving is bottlenecked by memory. In the autoregressive decoding process, all the input tokens to the LLM produce their attention key and value tensors, and these tensors are kept in GPU memory to generate the next tokens. These cached key and value tensors are often referred to as the KV cache. The KV cache is:
- Large: takes up to 1.7 GB for a single sequence in LLaMA-13B.
- Dynamic: its size depends on the sequence length, which is highly variable and unpredictable.

As a result, efficiently managing the KV cache presents a significant challenge. We find that existing systems waste 60%–80% of memory due to fragmentation and over-reservation.
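The 1.7 GB figure can be sanity-checked from the model shape. A rough sketch, assuming LLaMA-13B's published configuration (40 layers, hidden size 5120), fp16 KV entries, and a 2048-token context:

```python
# Back-of-the-envelope KV-cache size for one LLaMA-13B sequence (assumed shape:
# 40 layers, hidden size 5120, fp16 entries, 2048-token context).
num_layers = 40
hidden_size = 5120        # = num_attention_heads * head_dim
bytes_per_elem = 2        # fp16
seq_len = 2048

# Each token stores one key vector and one value vector per layer.
bytes_per_token = 2 * num_layers * hidden_size * bytes_per_elem
total_bytes = bytes_per_token * seq_len

print(f"{bytes_per_token / 1e3:.0f} KB per token")   # ~800 KB
print(f"{total_bytes / 1e9:.2f} GB per sequence")    # ~1.68 GB, i.e. the "up to 1.7 GB" claim
```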
Key idea: PagedAttention allows storing continuous keys and values in non-contiguous memory space.
Specifically, PagedAttention partitions the KV cache of each sequence into blocks, each block containing the keys and values for a fixed number of tokens. During the attention computation, the PagedAttention kernel identifies and fetches these blocks efficiently.
Block Table
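A minimal sketch of the block-table idea, assuming a fixed block size of 16 tokens; the class and method names are illustrative, not vLLM's actual internals:

```python
BLOCK_SIZE = 16  # tokens per KV-cache block (vLLM's default block size)

class BlockTable:
    """Maps a sequence's logical block index to a physical block number.

    Logical blocks are contiguous from the sequence's point of view, but the
    physical blocks they point to can live anywhere in GPU memory.
    """
    def __init__(self, free_blocks):
        self.free_blocks = free_blocks      # pool of free physical block ids
        self.logical_to_physical = []       # index = logical block number

    def append_token(self, num_tokens_so_far):
        # Allocate a new physical block only when the previous one is full.
        if num_tokens_so_far % BLOCK_SIZE == 0:
            self.logical_to_physical.append(self.free_blocks.pop())

    def lookup(self, token_pos):
        # Where do this token's key/value vectors actually live?
        physical_block = self.logical_to_physical[token_pos // BLOCK_SIZE]
        return physical_block, token_pos % BLOCK_SIZE

# A 40-token sequence needs 3 blocks; they end up scattered, not contiguous.
table = BlockTable(free_blocks=[7, 3, 42, 11])
for i in range(40):
    table.append_token(i)
print(table.lookup(35))   # (3, 3): logical block 2 happens to map to physical block 3
```

Because every access goes through this indirection, physical blocks can be allocated on demand and placed anywhere, which is what removes the fragmentation and over-reservation described above.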
Sampling Optimization
Combined with beam search, this memory sharing lets vLLM produce better generations at a relatively small additional memory cost.
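A minimal sketch of how that sharing could work, with per-block reference counting and copy-on-write (illustrative only, not vLLM's actual implementation):

```python
import itertools

class Block:
    """A physical KV-cache block, shared across sequences via reference counting."""
    _ids = itertools.count()
    def __init__(self):
        self.block_id = next(Block._ids)
        self.ref_count = 1

def fork(parent_blocks):
    """Fork a sample/beam candidate: share every block, just bump its ref count."""
    for block in parent_blocks:
        block.ref_count += 1
    return list(parent_blocks)        # cost is O(number of blocks), not O(tokens)

def write_last_block(blocks):
    """Copy-on-write: duplicate the tail block only if another sequence still uses it."""
    last = blocks[-1]
    if last.ref_count > 1:
        last.ref_count -= 1
        blocks[-1] = Block()          # physically copy a single block, not the whole cache
    return blocks

# Two beams share the prompt's KV blocks; each beam pays for roughly one extra
# block when it diverges, instead of a full copy of the prompt's KV cache.
prompt_blocks = [Block() for _ in range(8)]   # e.g. a 128-token prompt at 16 tokens/block
beam_a = fork(prompt_blocks)
beam_b = fork(prompt_blocks)
beam_a = write_last_block(beam_a)             # only now does beam_a get its own tail block
```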
Experiment
Baselines
HuggingFace Transformers (HF) and HuggingFace Text Generation Inference (TGI)
Setups
LLaMA-7B on an NVIDIA A10G GPU
LLaMA-13B on an NVIDIA A100 GPU (40GB)
We sample the requests' input/output lengths from the ShareGPT dataset.
Result
Rethink
NVIDIA's discussion of memory-bound performance: https://docs.nvidia.com/deeplearning/performance/dl-performance-gpu-background/index.html#understand-perf
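The guide's criterion boils down to comparing a kernel's arithmetic intensity (FLOPs per byte of memory traffic) with the GPU's ops:byte ratio. A rough sketch for decode-time attention over a cached sequence, using assumed A100 numbers for illustration:

```python
# Following the NVIDIA guide: a kernel is memory-bound when its arithmetic
# intensity (FLOPs per byte moved) is far below the GPU's ops:byte ratio.

# Assumed hardware numbers (A100 40GB, fp16 tensor cores) -- illustration only.
peak_flops = 312e12                            # FLOP/s
mem_bandwidth = 1.555e12                       # bytes/s
hw_ops_per_byte = peak_flops / mem_bandwidth   # ~200 FLOP/byte

# Decoding one new token: attention reads the whole KV cache once and performs
# roughly one multiply-add (2 FLOPs) per cached fp16 element (QK^T plus attn*V).
seq_len, num_layers, hidden = 2048, 40, 5120
kv_elements = 2 * num_layers * hidden * seq_len   # K and V entries in the cache
kv_bytes = kv_elements * 2                        # fp16: 2 bytes per entry
kv_flops = kv_elements * 2                        # one multiply-add per entry

intensity = kv_flops / kv_bytes                   # ~1 FLOP/byte
print(f"hardware ops:byte ≈ {hw_ops_per_byte:.0f}, kernel intensity ≈ {intensity:.1f}")
# 1 << 200, so decode-time attention is limited by memory bandwidth, not by math.
```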