SARATHI Piggybacking Decodes

Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills

Posted by Kylin on October 8, 2023

[TOC]

Abs

Challenge: the prefill and decode phases have very different compute density. Prefill saturates the GPU even at small batch sizes, whereas decode at the same batch sizes leaves the GPU underutilized.

Resulting problem: micro-batches with unequal compute create bubbles under pipeline parallelism.

Proposal: chunked-prefills, i.e., split each prefill into chunks so that different micro-batches carry balanced amounts of inference work.

Decode requests are piggybacked onto these prefill chunks rather than being scheduled as the primary workload.

Because the resulting units of compute are small and similar in size, micro-batches stay balanced and pipeline bubbles shrink.

Intro

The prefill and decode phases have different compute requirements.

Two problems to solve: pipeline bubbles and micro-batch imbalance.

The fix: make the compute requirements of requests uniform, essentially by reducing the granularity of the work units.
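
A minimal sketch of that idea (illustrative only; the function name and the 256-token chunk size are my own assumptions, not the paper's): splitting every prompt's prefill into fixed-size chunks turns requests of very different lengths into near-uniform units of work that can be spread evenly across micro-batches.

```python
# Minimal sketch of reducing granularity: split each prompt's prefill into
# fixed-size chunks so micro-batches carry near-uniform amounts of work.
# (Illustrative only; the 256-token chunk size is an assumption.)
def chunk_prompt(prompt_tokens, chunk_size=256):
    """Split one prompt's tokens into equal-sized prefill chunks."""
    return [prompt_tokens[i:i + chunk_size]
            for i in range(0, len(prompt_tokens), chunk_size)]

long_prompt = list(range(1000))   # a 1000-token prompt
short_prompt = list(range(300))   # a 300-token prompt
work_units = chunk_prompt(long_prompt) + chunk_prompt(short_prompt)
print([len(c) for c in work_units])  # [256, 256, 256, 232, 256, 44]
```

The trailing chunks come out smaller than the rest; the slack they leave in a batch is exactly the kind of room SARATHI fills with piggybacked decode tokens.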


SARATHI

Chunked-prefills


Analyzing the compute pattern of chunked prefill shows it is not uniform: because each chunk must attend over all previously processed tokens, the compute cost of the k-th chunk (with chunk size \(c\)) is \(kc^2\times hidden\_size + kc\times hidden\_size^2\).
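
A back-of-the-envelope illustration of that imbalance (my own simplified cost model, not the paper's exact formula): every chunk of \(c\) tokens runs the same amount of linear-layer work, but its attention must cover all \(\sim kc\) tokens processed so far, so later chunks are strictly more expensive.

```python
# Rough per-chunk FLOP estimate (my own simplification, not the paper's exact
# cost model): with chunk size c and hidden size H, chunk k attends over the
# ~k*c tokens seen so far, so its attention cost grows with k while the
# linear-layer cost per chunk stays constant.
def chunk_cost(k, c=256, hidden=4096):
    attention = k * c * c * hidden   # c queries x (k*c) keys x hidden dim
    linear = c * hidden * hidden     # c tokens through H x H projections/MLPs
    return attention, linear

for k in (1, 4, 16):
    attn, lin = chunk_cost(k)
    print(f"chunk {k:2d}: attention ~ {attn:.1e}, linear ~ {lin:.1e}")
```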

Piggybacking decodes with prefills

Compute a maximum decode batch size to determine how many decodes can at most be piggybacked onto each prefill chunk.
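
A minimal sketch of the piggybacking step (my own illustration, not SARATHI's actual scheduler; the 256-token budget is an assumption): a hybrid batch carries one prefill chunk plus as many single-token decode requests as fit into the remaining per-batch token budget.

```python
# Minimal sketch of piggybacking decodes onto a prefill chunk (my own
# illustration, not SARATHI's actual scheduler). Each hybrid batch has a
# fixed token budget: the prefill chunk takes part of it, and pending decode
# requests (one token each) ride along in the slots left over.
def build_hybrid_batch(prefill_chunk_len, decode_queue, token_budget=256):
    free_slots = max(token_budget - prefill_chunk_len, 0)
    piggybacked = decode_queue[:free_slots]  # decodes that ride along now
    waiting = decode_queue[free_slots:]      # decodes deferred to later chunks
    return piggybacked, waiting

# Example: a 192-token prefill chunk leaves room for 64 piggybacked decodes.
decodes = [f"req-{i}" for i in range(100)]
riding, waiting = build_hybrid_batch(192, decodes)
print(len(riding), len(waiting))  # 64 36
```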


Identifying the ideal chunk size

The paper defines a P:D ratio to pick the ideal chunk size, but this is questionable: if you could know in advance how many tokens a request will decode, most of the scheduling problem would already be gone, and sizing by the maximum output length is clearly wasteful.

The “P:D ratio” is computed as the ratio of the number of prefill tokens to the number of decode tokens in a given batch.
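
One plausible back-of-the-envelope reading of how this ratio could set the chunk size (my own derivation, not necessarily the paper's exact rule): in steady state, every decode iteration of every in-flight request needs a prefill chunk to ride on, so the stream of prefill chunks has to keep pace with the stream of decode iterations.

```python
# Back-of-the-envelope chunk-size choice from the P:D ratio (my own reading,
# not necessarily the paper's exact rule). Per request, P/C prefill chunks
# must "host" D decode iterations, and each hybrid batch hosts one chunk plus
# decode_batch_size decodes; balancing P/C = D/B gives C = P*B/D.
def ideal_chunk_size(prefill_tokens, decode_tokens, decode_batch_size):
    return prefill_tokens * decode_batch_size // decode_tokens

# Example: 1024-token prompts, 256 generated tokens, 18 decodes per batch.
print(ideal_chunk_size(1024, 256, 18))  # 72
```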

Exp

Question

  • Throughput vs. end-to-end inference
  • Impact of varying sequence lengths, batch sizes, and P:D ratios
  • Throughput vs. an Orca-like baseline
  • Overheads of the system itself

Disc

  • optimizing for latency, queuing delays, and fairness from more angles (SARATHI optimizes only for throughput)
  • how to pick an optimal chunk size, as it depends on several factors
  • how to process a varying number of prefill and decode tokens (SARATHI assumes fixed lengths)
  • supporting very long sequences