LLM Inference Optimization Paper Review (2024-08)

Posted by Kylin on August 30, 2024

[TOC]

RelayAttention [1]

Scenario: long system prompts.

Problem: for batched requests, KV caches are transferred from off-chip DRAM to on-chip SRAM multiple times; in other words, each request's transfer is handled independently, so the shared system prompt's cache is re-read for every request.

Solution: RelayAttention allows these cached hidden states (the system prompt's KV cache) to be read from DRAM exactly once for a whole batch of input tokens.
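
Conceptually, attention for each request is split into a shared system-prompt part and a request-specific part, and the two partial results are merged by renormalizing with their softmax denominators (the log-sum-exp trick). Below is a minimal single-head PyTorch sketch of that decomposition; the function names, shapes, and sanity check are my own illustration under those assumptions, not the paper's code:

```python
import torch

def attention_with_lse(q, k, v):
    """Vanilla attention that also returns the log-sum-exp of the scores,
    so its output can later be merged with another partial attention."""
    scale = q.shape[-1] ** -0.5
    scores = (q @ k.transpose(-2, -1)) * scale   # (..., q_len, kv_len)
    lse = torch.logsumexp(scores, dim=-1)        # (..., q_len)
    out = torch.softmax(scores, dim=-1) @ v      # (..., q_len, d)
    return out, lse

def merge_partial_attention(o1, lse1, o2, lse2):
    """Combine attention outputs computed over two disjoint key sets
    via the standard online-softmax (log-sum-exp) renormalization."""
    lse = torch.logaddexp(lse1, lse2)
    w1 = torch.exp(lse1 - lse).unsqueeze(-1)
    w2 = torch.exp(lse2 - lse).unsqueeze(-1)
    return w1 * o1 + w2 * o2

# Hypothetical single-head decode step: shapes and names are illustrative only.
B, d = 8, 128             # batch size, head dimension
n_sys, n_ctx = 512, 64    # shared system-prompt length, per-request context length

q = torch.randn(B, 1, d)            # one new query token per request
k_sys = torch.randn(1, n_sys, d)    # ONE copy of the system-prompt KV cache
v_sys = torch.randn(1, n_sys, d)
k_ctx = torch.randn(B, n_ctx, d)    # request-specific KV caches
v_ctx = torch.randn(B, n_ctx, d)

# 1) System attention: stack the B queries and attend to the single shared KV copy.
#    This is one GEMM, so k_sys/v_sys are read from DRAM once for the whole batch.
o_sys, lse_sys = attention_with_lse(q.transpose(0, 1), k_sys, v_sys)
o_sys, lse_sys = o_sys.transpose(0, 1), lse_sys.transpose(0, 1)

# 2) Context attention: each request attends to its own KV cache as usual.
o_ctx, lse_ctx = attention_with_lse(q, k_ctx, v_ctx)

# 3) Merge the two partial results; this equals attention over [system; context].
o = merge_partial_attention(o_sys, lse_sys, o_ctx, lse_ctx)

# Sanity check against attention over the concatenated KV cache.
k_full = torch.cat([k_sys.expand(B, -1, -1), k_ctx], dim=1)
v_full = torch.cat([v_sys.expand(B, -1, -1), v_ctx], dim=1)
o_ref, _ = attention_with_lse(q, k_full, v_full)
assert torch.allclose(o, o_ref, atol=1e-4)
```

Because step (1) uses a single KV copy for the whole batch, the system-prompt keys and values are streamed from DRAM once per decoding step rather than once per request, which is the saving described above.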

Challenges


  • The computation of attention is memory-bound: at decode time it is essentially GEMV, which cannot saturate the compute units the way GEMM does.
  • There are redundant memory accesses in the typical scenario where a shared system prompt is prepended to request-specific contexts: a dig at vLLM, whose prefix sharing stores the shared prefix's KV cache only once but still reads it from DRAM repeatedly, once per request in the batch.
Methodology

Memory-bound is defined in Section 3.1 of the paper:

Given a processor that takes $t_m$ per byte of memory access and $t_c$ per floating-point operation on average, the ratio $r$ of the total computation time over the total memory access time for an operator is:

$$ r = \frac{n_{\mathrm{flops}} \, t_c}{n_{\mathrm{bytes}} \, t_m} = I \cdot \frac{t_c}{t_m} $$

where $I$ is the arithmetic intensity of the operator, i.e. its floating-point operation count over the number of bytes it moves:

$$ I = \frac{n_{\mathrm{flops}}}{n_{\mathrm{bytes}}} $$

When $I < t_m/t_c$, $r$ is less than 1 and the operator is memory-bound. The arithmetic intensity required to escape the memory-bound regime therefore differs from one piece of hardware to another.
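
To make this concrete, here is a quick back-of-the-envelope check; the hardware numbers are rough public A100-80GB figures and the shapes are made up for illustration, not values taken from the paper:

```python
# Back-of-the-envelope check of the memory-bound condition I < t_m / t_c.
# Hardware numbers are rough public A100-80GB figures (an assumption, not from the paper).
peak_flops = 312e12   # ~312 TFLOP/s, fp16 tensor cores
peak_bw = 2.0e12      # ~2 TB/s HBM bandwidth

ridge = peak_flops / peak_bw  # = t_m / t_c: the intensity needed to leave the memory-bound regime
print(f"t_m/t_c ~= {ridge:.0f} FLOP/byte")  # ~156

# Decode-time attention scores q @ K^T for ONE request (a GEMV).
# K has shape (n, d) in fp16 (2 bytes per element); q is a single token.
n, d = 4096, 128
flops = 2 * n * d     # one multiply-add per element of K
nbytes = 2 * n * d    # K must be streamed from DRAM
print(f"GEMV intensity I ~= {flops / nbytes:.1f} FLOP/byte")  # ~1 -> deeply memory-bound

# The same K shared by a batch of B queries (a GEMM over the shared system prompt).
B = 64
flops = 2 * B * n * d
nbytes = 2 * n * d + 2 * B * d  # K read once for the whole batch, plus B query vectors
print(f"GEMM intensity I ~= {flops / nbytes:.1f} FLOP/byte")  # ~B -> approaches the ridge point
```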


Hydragen [2]

(figure placeholder: screenshot 2024-09-01 23.48.46)

References

  1. RelayAttention for Efficient Large Language Model Serving with Long System Prompts 

  2. Hydragen: High-Throughput LLM Inference with Shared Prefixes