LLM Inference Optimization Paper Review (2024-08)

Posted by Kylin on August 30, 2024

[TOC]

RelayAttention [1]

Scenario: long system prompts.

Problem: for batched requests, KV caches are transferred from off-chip DRAM to on-chip SRAM multiple times; in other words, each request's transfer is handled independently, so the shared system prompt's cache is re-read for every request.

Solution: RelayAttention allows these cached hidden states (the system prompt's KV cache) to be read from DRAM exactly once for a whole batch of input tokens.
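
Conceptually, attention for each request is split into a shared system-prompt part and a request-specific part, and the two partial results are merged by renormalizing with their softmax denominators (the log-sum-exp trick). Below is a minimal single-head PyTorch sketch of that decomposition; the function names, shapes, and sanity check are my own illustration under those assumptions, not the paper's code:

```python
import torch

def attention_with_lse(q, k, v):
    """Vanilla attention that also returns the log-sum-exp of the scores,
    so its output can later be merged with another partial attention."""
    scale = q.shape[-1] ** -0.5
    scores = (q @ k.transpose(-2, -1)) * scale   # (..., q_len, kv_len)
    lse = torch.logsumexp(scores, dim=-1)        # (..., q_len)
    out = torch.softmax(scores, dim=-1) @ v      # (..., q_len, d)
    return out, lse

def merge_partial_attention(o1, lse1, o2, lse2):
    """Combine attention outputs computed over two disjoint key sets
    via the standard online-softmax (log-sum-exp) renormalization."""
    lse = torch.logaddexp(lse1, lse2)
    w1 = torch.exp(lse1 - lse).unsqueeze(-1)
    w2 = torch.exp(lse2 - lse).unsqueeze(-1)
    return w1 * o1 + w2 * o2

# Hypothetical single-head decode step: shapes and names are illustrative only.
B, d = 8, 128             # batch size, head dimension
n_sys, n_ctx = 512, 64    # shared system-prompt length, per-request context length

q = torch.randn(B, 1, d)            # one new query token per request
k_sys = torch.randn(1, n_sys, d)    # ONE copy of the system-prompt KV cache
v_sys = torch.randn(1, n_sys, d)
k_ctx = torch.randn(B, n_ctx, d)    # request-specific KV caches
v_ctx = torch.randn(B, n_ctx, d)

# 1) System attention: stack the B queries and attend to the single shared KV copy.
#    This is one GEMM, so k_sys/v_sys are read from DRAM once for the whole batch.
o_sys, lse_sys = attention_with_lse(q.transpose(0, 1), k_sys, v_sys)
o_sys, lse_sys = o_sys.transpose(0, 1), lse_sys.transpose(0, 1)

# 2) Context attention: each request attends to its own KV cache as usual.
o_ctx, lse_ctx = attention_with_lse(q, k_ctx, v_ctx)

# 3) Merge the two partial results; this equals attention over [system; context].
o = merge_partial_attention(o_sys, lse_sys, o_ctx, lse_ctx)

# Sanity check against attention over the concatenated KV cache.
k_full = torch.cat([k_sys.expand(B, -1, -1), k_ctx], dim=1)
v_full = torch.cat([v_sys.expand(B, -1, -1), v_ctx], dim=1)
o_ref, _ = attention_with_lse(q, k_full, v_full)
assert torch.allclose(o, o_ref, atol=1e-4)
```

Because step (1) uses a single KV copy for the whole batch, the system-prompt keys and values are streamed from DRAM once per decoding step rather than once per request, which is the saving described above.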

Challenges


  • The computation of attention is memory-bound: at decode time it is essentially GEMV, which cannot saturate the compute units the way GEMM does.
  • There are redundant memory accesses in the typical scenario where a shared system prompt is prepended to request-specific contexts: a dig at vLLM, whose prefix sharing stores the shared prefix's KV cache only once but still reads it from DRAM repeatedly, once per request in the batch.
Methodology

Memory-bound is defined in Section 3.1 of the paper:

Given a processor that takes $t_m$ per byte of memory access and $t_c$ per floating-point operation on average, the ratio $r$ of the total computation time over the total memory access time for an operator is:

$$ r = \frac{n_{\mathrm{flops}} \, t_c}{n_{\mathrm{bytes}} \, t_m} = I \cdot \frac{t_c}{t_m} $$

where $I$ is the arithmetic intensity of the operator, i.e. its floating-point operation count over the number of bytes it moves:

$$ I = \frac{n_{\mathrm{flops}}}{n_{\mathrm{bytes}}} $$

When $I < t_m/t_c$, $r$ is less than 1 and the operator is memory-bound. The arithmetic intensity required to escape the memory-bound regime therefore differs from one piece of hardware to another.
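
To make this concrete, here is a quick back-of-the-envelope check; the hardware numbers are rough public A100-80GB figures and the shapes are made up for illustration, not values taken from the paper:

```python
# Back-of-the-envelope check of the memory-bound condition I < t_m / t_c.
# Hardware numbers are rough public A100-80GB figures (an assumption, not from the paper).
peak_flops = 312e12   # ~312 TFLOP/s, fp16 tensor cores
peak_bw = 2.0e12      # ~2 TB/s HBM bandwidth

ridge = peak_flops / peak_bw  # = t_m / t_c: the intensity needed to leave the memory-bound regime
print(f"t_m/t_c ~= {ridge:.0f} FLOP/byte")  # ~156

# Decode-time attention scores q @ K^T for ONE request (a GEMV).
# K has shape (n, d) in fp16 (2 bytes per element); q is a single token.
n, d = 4096, 128
flops = 2 * n * d     # one multiply-add per element of K
nbytes = 2 * n * d    # K must be streamed from DRAM
print(f"GEMV intensity I ~= {flops / nbytes:.1f} FLOP/byte")  # ~1 -> deeply memory-bound

# The same K shared by a batch of B queries (a GEMM over the shared system prompt).
B = 64
flops = 2 * B * n * d
nbytes = 2 * n * d + 2 * B * d  # K read once for the whole batch, plus B query vectors
print(f"GEMM intensity I ~= {flops / nbytes:.1f} FLOP/byte")  # ~B -> approaches the ridge point
```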


Hydragen [2]

(figure placeholder: screenshot 2024-09-01 23.48.46)

References

  1. RelayAttention for Efficient Large Language Model Serving with Long System Prompts 

  2. Hydragen: High-Throughput LLM Inference with Shared Prefixes