SARATHI Piggybacking Decodes

Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills

Posted by Kylin on October 8, 2023

[TOC]

Abs

Challenge: the prefill and decode phases have very different compute density. Prefill saturates the GPU even at small batch sizes, whereas decode at the same batch sizes leaves the GPU underutilized.

Resulting problem: micro-batches with unequal compute create bubbles under pipeline parallelism.

Proposal: chunked-prefills, i.e., split each prefill into chunks so that different micro-batches carry balanced amounts of inference work.

Decode requests are piggybacked onto these prefill chunks rather than being scheduled as the primary workload.

Because the resulting units of compute are small and similar in size, micro-batches stay balanced and pipeline bubbles shrink.

Intro

The prefill and decode phases have different compute requirements.

Two problems to solve: pipeline bubbles and micro-batch imbalance.

The fix: make the compute requirements of requests uniform, essentially by reducing the granularity of the work units.
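
A minimal sketch of that idea (illustrative only; the function name and the 256-token chunk size are my own assumptions, not the paper's): splitting every prompt's prefill into fixed-size chunks turns requests of very different lengths into near-uniform units of work that can be spread evenly across micro-batches.

```python
# Minimal sketch of reducing granularity: split each prompt's prefill into
# fixed-size chunks so micro-batches carry near-uniform amounts of work.
# (Illustrative only; the 256-token chunk size is an assumption.)
def chunk_prompt(prompt_tokens, chunk_size=256):
    """Split one prompt's tokens into equal-sized prefill chunks."""
    return [prompt_tokens[i:i + chunk_size]
            for i in range(0, len(prompt_tokens), chunk_size)]

long_prompt = list(range(1000))   # a 1000-token prompt
short_prompt = list(range(300))   # a 300-token prompt
work_units = chunk_prompt(long_prompt) + chunk_prompt(short_prompt)
print([len(c) for c in work_units])  # [256, 256, 256, 232, 256, 44]
```

The trailing chunks come out smaller than the rest; the slack they leave in a batch is exactly the kind of room SARATHI fills with piggybacked decode tokens.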


SARATHI

Chunked-prefills


Analyzing the compute pattern of chunked prefill shows it is not uniform: because each chunk must attend over all previously processed tokens, the compute cost of the k-th chunk (with chunk size \(c\)) is \(kc^2\times hidden\_size + kc\times hidden\_size^2\).
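
A back-of-the-envelope illustration of that imbalance (my own simplified cost model, not the paper's exact formula): every chunk of \(c\) tokens runs the same amount of linear-layer work, but its attention must cover all \(\sim kc\) tokens processed so far, so later chunks are strictly more expensive.

```python
# Rough per-chunk FLOP estimate (my own simplification, not the paper's exact
# cost model): with chunk size c and hidden size H, chunk k attends over the
# ~k*c tokens seen so far, so its attention cost grows with k while the
# linear-layer cost per chunk stays constant.
def chunk_cost(k, c=256, hidden=4096):
    attention = k * c * c * hidden   # c queries x (k*c) keys x hidden dim
    linear = c * hidden * hidden     # c tokens through H x H projections/MLPs
    return attention, linear

for k in (1, 4, 16):
    attn, lin = chunk_cost(k)
    print(f"chunk {k:2d}: attention ~ {attn:.1e}, linear ~ {lin:.1e}")
```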

Piggybacking decodes with prefills

Compute a maximum decode batch size to determine how many decodes can at most be piggybacked onto each prefill chunk.
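
A minimal sketch of the piggybacking step (my own illustration, not SARATHI's actual scheduler; the 256-token budget is an assumption): a hybrid batch carries one prefill chunk plus as many single-token decode requests as fit into the remaining per-batch token budget.

```python
# Minimal sketch of piggybacking decodes onto a prefill chunk (my own
# illustration, not SARATHI's actual scheduler). Each hybrid batch has a
# fixed token budget: the prefill chunk takes part of it, and pending decode
# requests (one token each) ride along in the slots left over.
def build_hybrid_batch(prefill_chunk_len, decode_queue, token_budget=256):
    free_slots = max(token_budget - prefill_chunk_len, 0)
    piggybacked = decode_queue[:free_slots]  # decodes that ride along now
    waiting = decode_queue[free_slots:]      # decodes deferred to later chunks
    return piggybacked, waiting

# Example: a 192-token prefill chunk leaves room for 64 piggybacked decodes.
decodes = [f"req-{i}" for i in range(100)]
riding, waiting = build_hybrid_batch(192, decodes)
print(len(riding), len(waiting))  # 64 36
```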


Identifying the ideal chunk size

The paper defines a P:D ratio to pick the ideal chunk size, but this is questionable: if you could know in advance how many tokens a request will decode, most of the scheduling problem would already be gone, and sizing by the maximum output length is clearly wasteful.

The “P:D ratio” is computed as the ratio of the number of prefill tokens to the number of decode tokens in a given batch.
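
One plausible back-of-the-envelope reading of how this ratio could set the chunk size (my own derivation, not necessarily the paper's exact rule): in steady state, every decode iteration of every in-flight request needs a prefill chunk to ride on, so the stream of prefill chunks has to keep pace with the stream of decode iterations.

```python
# Back-of-the-envelope chunk-size choice from the P:D ratio (my own reading,
# not necessarily the paper's exact rule). Per request, P/C prefill chunks
# must "host" D decode iterations, and each hybrid batch hosts one chunk plus
# decode_batch_size decodes; balancing P/C = D/B gives C = P*B/D.
def ideal_chunk_size(prefill_tokens, decode_tokens, decode_batch_size):
    return prefill_tokens * decode_batch_size // decode_tokens

# Example: 1024-token prompts, 256 generated tokens, 18 decodes per batch.
print(ideal_chunk_size(1024, 256, 18))  # 72
```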

Exp

Question

  • Throughput vs. end-to-end inference
  • Impact of varying sequence lengths, batch sizes, and P:D ratios
  • Throughput vs. an Orca-like baseline
  • Overheads of the system itself

Disc

  • optimizing for latency, queuing delays, and fairness from more angles (SARATHI optimizes only for throughput)
  • how to pick an optimal chunk size, as it depends on several factors
  • how to process a varying number of prefill and decode tokens (SARATHI assumes fixed lengths)
  • supporting very long sequences