[TOC]
RelayAttention1:
Scenario: a long, shared system prompt.
Problem: for batched requests, the system prompt's KV caches are transferred from off-chip DRAM to on-chip SRAM multiple times, once per request, because each request is processed independently.
Solution: RelayAttention allows reading these hidden states from DRAM exactly once for a batch of input tokens.
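The "read once per batch" trick relies on the fact that attention over a concatenated sequence can be split into two partial attentions and recombined exactly via their softmax denominators. Below is a minimal NumPy sketch of this fusion for a single query; the function names (`partial_attn`, `relay_fuse`) are my own, not the paper's API.

```python
import numpy as np

def partial_attn(q, K, V):
    """Attention of query q over (K, V); also return the log-sum-exp
    of the scores, which is all we need to 'relay' this partial result."""
    scores = K @ q / np.sqrt(q.shape[-1])
    m = scores.max()
    e = np.exp(scores - m)
    o = (e @ V) / e.sum()          # locally normalized output
    lse = m + np.log(e.sum())      # log of the local softmax denominator
    return o, lse

def relay_fuse(o_sys, lse_sys, o_ctx, lse_ctx):
    """Exactly recombine two partial attentions by reweighting each
    with its share of the global softmax denominator."""
    w = np.exp(lse_sys - np.logaddexp(lse_sys, lse_ctx))
    return w * o_sys + (1.0 - w) * o_ctx

rng = np.random.default_rng(0)
d, n_sys, n_ctx = 8, 16, 5
q = rng.normal(size=d)
K_sys, V_sys = rng.normal(size=(n_sys, d)), rng.normal(size=(n_sys, d))
K_ctx, V_ctx = rng.normal(size=(n_ctx, d)), rng.normal(size=(n_ctx, d))

# fused result: system-prompt attention (shared across the batch) + context attention
o_fused = relay_fuse(*partial_attn(q, K_sys, V_sys),
                     *partial_attn(q, K_ctx, V_ctx))
# reference: attention over the concatenated sequence
o_ref, _ = partial_attn(q, np.concatenate([K_sys, K_ctx]),
                        np.concatenate([V_sys, V_ctx]))
assert np.allclose(o_fused, o_ref)
```

Because the system-prompt half is identical for every request, it can be computed for the whole batch with one pass over the prompt's KV cache (a matrix-matrix product) instead of once per request.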
Challenges
- The computation of attention is memory-bound: decoding is essentially GEMV, which cannot saturate the compute units the way GEMM does.
- There are redundant memory accesses in the typical scenario where a shared system prompt is prepended to request-specific contexts: a dig at vLLM, whose prefix sharing stores the prefix KV cache once but still reads it repeatedly, once per request.
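To see how much the redundant prefix reads cost, here is a back-of-envelope traffic estimate; the model shape and sizes below are illustrative assumptions, not figures from the paper.

```python
# Back-of-envelope KV-cache traffic for a shared system prompt.
# All model numbers below are assumptions for illustration only.
n_layers, n_heads, head_dim = 32, 32, 128   # a 7B-class model (assumed)
bytes_per_elem = 2                          # fp16
prefix_len, batch = 512, 32

# KV cache of the prefix: one K and one V tensor per layer
prefix_bytes = 2 * n_layers * n_heads * head_dim * prefix_len * bytes_per_elem

naive_traffic = batch * prefix_bytes   # each request re-reads the prefix KV
relay_traffic = prefix_bytes           # read exactly once per batch step
print(f"prefix KV cache: {prefix_bytes / 2**20:.0f} MiB")
print(f"prefix traffic saved per decoding step: {naive_traffic / relay_traffic:.0f}x")
```

Under these assumptions the prefix KV cache alone is 256 MiB, and the per-request reads multiply that by the batch size at every decoding step.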
Methodology
Memory-bound is defined in Section 3.1:
Given a processor that takes $t_m$ seconds per byte of memory access and $t_c$ seconds per floating-point operation on average, the ratio $r$ of the total computation time to the total memory access time for an operator is

$$r = \frac{N_c\, t_c}{N_m\, t_m} = I \cdot \frac{t_c}{t_m},$$

where $I = N_c / N_m$ is the arithmetic intensity of the operator: the number of floating-point operations $N_c$ performed per byte of memory accessed $N_m$. When $I < t_m/t_c$, $r$ is less than 1 and the operator is memory-bound. The arithmetic intensity needed to escape the memory-bound regime therefore differs from one piece of hardware to another.
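Plugging in rough numbers makes the condition concrete. The figures below are approximate fp16 peaks for an A100-class GPU (my assumption, not the paper's measurement), and the GEMV intensity of roughly 1 FLOP/byte follows from 2mn FLOPs over about 2mn bytes of fp16 operands.

```python
# Illustrative check of the memory-bound condition I < t_m / t_c.
# Hardware numbers are rough fp16 figures for an A100-class GPU (assumed).
peak_flops = 312e12      # FLOP/s
peak_bw    = 2.0e12      # bytes/s
t_c = 1.0 / peak_flops   # seconds per floating-point operation
t_m = 1.0 / peak_bw      # seconds per byte accessed
threshold = t_m / t_c    # intensity needed to break even (~156 FLOP/byte)

I_gemv = 1.0             # decoding attention / GEMV: ~1 FLOP per byte
r = I_gemv * t_c / t_m   # ratio of compute time to memory time
print(f"break-even intensity: {threshold:.0f} FLOP/byte, GEMV r = {r:.4f}")
assert r < 1  # far into the memory-bound regime on this hardware
```

With a break-even intensity around 156 FLOP/byte, a GEMV at ~1 FLOP/byte spends almost all of its time waiting on memory, which is exactly why cutting DRAM reads of the shared prefix helps.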