Accelerating LLM Serving with Tree-Based Speculative Decoding and Verification

Accelerating Large Language Model Serving with Tree-based Speculative Inference and Verification

Posted by Kylin on April 24, 2024


Abstract

The paper does two things on top of speculative decoding:

1) It organizes the predictions into a token tree, in which every node represents a candidate token sequence.

2) The tree-based predictions can be verified in parallel.
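To make the "every node represents a candidate token sequence" idea concrete, here is a minimal sketch of a token tree. The `TokenNode` class and its method names are my own illustration, not SpecInfer's data structure; the root is assumed to hold the last verified token.

```python
class TokenNode:
    """One node of a token tree (hypothetical structure for illustration)."""

    def __init__(self, token, parent=None):
        self.token = token
        self.parent = parent
        self.children = {}  # token -> TokenNode

    def add_child(self, token):
        # Reuse an existing child so that shared prefixes are stored once.
        if token not in self.children:
            self.children[token] = TokenNode(token, parent=self)
        return self.children[token]

    def sequence(self):
        # The candidate sequence of a node is the path from the root to it.
        seq, node = [], self
        while node is not None:
            seq.append(node.token)
            node = node.parent
        return seq[::-1]


root = TokenNode("the")          # last verified token
cat = root.add_child("cat")
dog = root.add_child("dog")
sat = cat.add_child("sat")
print(sat.sequence())
```

Candidates that share a prefix share nodes, which is what lets the verifier process all of them together.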

Introduction

Insight: speculate in parallel, organize the candidates as a tree, and verify in parallel.


There are two challenges, however:

1) The token search space is huge, while how closely an SSM (small speculative model) can align with the LLM is bounded by the SSM's capacity. SpecInfer organizes multiple candidates as a tree and introduces an expansion- and a merge-based mechanism for constructing token trees by exploiting diversity within a single SSM and across multiple SSMs, respectively.

2) The speculated tokens must be verified in a way that keeps the output probability distribution identical to the LLM's. To minimize the verification cost, SpecInfer introduces a tree-based parallel decoding mechanism, simultaneously verifying all tokens of a token tree against the LLM’s output in a single LLM decoding step.
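As a rough sketch of what verification means for the greedy-decoding case only: walk down the tree, keep following the child that matches the LLM's own next token, and stop at the first mismatch. The nested-dict tree, `verify_greedy`, and `toy_llm_next_token` are assumptions made for illustration. The toy calls the "LLM" once per step, whereas SpecInfer obtains all of these outputs from a single tree-parallel pass, and the stochastic case that must preserve the sampling distribution is not covered here.

```python
def verify_greedy(tree, prefix, llm_next_token):
    """Return the verified tokens plus one extra token produced by the LLM.

    tree: nested dict {token: subtree} of speculated continuations of `prefix`.
    llm_next_token: hypothetical callable mapping a token list to the LLM's
    greedy next token.
    """
    accepted = []
    node = tree
    while True:
        target = llm_next_token(prefix + accepted)
        accepted.append(target)      # the LLM's own token is always kept
        if target not in node:       # speculation missed: stop here
            break
        node = node[target]          # speculation hit: descend and continue
    return accepted


# Toy stand-in for the LLM: it always continues a fixed reference sentence.
REFERENCE = ["the", "cat", "sat", "on", "the", "mat"]

def toy_llm_next_token(seq):
    return REFERENCE[len(seq)]

speculated = {"cat": {"sat": {"down": {}}, "ran": {}}, "dog": {"sat": {}}}
result = verify_greedy(speculated, ["the"], toy_llm_next_token)
print(result)
```

In this toy run two speculated tokens are accepted and the LLM contributes a third, so one verification step yields three tokens.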

SpecInfer

Overview

Expansion-based: a branching algorithm over the outputs of a single model.

Merge-based: how the outputs of multiple SSMs are merged into one tree.


  • Reduced memory accesses to LLM parameters

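Because verification is parallel, the LLM weights are read once per verification step instead of once per token, so the amortized weight accesses per generated token are lower than with token-by-token decoding. This also reduces energy consumption.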

  • Reduced end-to-end inference latency

The LLM acts as the verifier and checks the candidates in parallel, which greatly increases the degree of parallelism.
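A back-of-the-envelope sketch of the amortization argument in the two points above. The value of 2.5 tokens produced per verification step is a made-up illustrative number, not a result from the paper, and `llm_passes` is a hypothetical helper.

```python
import math

def llm_passes(num_tokens, tokens_per_pass):
    """Number of LLM forward passes (full weight reads) to emit num_tokens."""
    return math.ceil(num_tokens / tokens_per_pass)

incremental = llm_passes(128, 1)     # token-by-token decoding
speculative = llm_passes(128, 2.5)   # assumed average tokens per verification step
print(incremental, speculative)
```

Each pass reads the LLM weights once, so fewer passes means fewer weight reads per generated token and fewer sequential steps on the latency-critical path.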

Learning-based Speculator

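Goal: speculate more tokens per step, which both raises the fraction of tokens on which the SSM and the LLM agree and can potentially improve verification efficiency.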

Expansion-based token tree construction

If speculation considers the SSM's top-k tokens instead of only its top-1 token, the verification success rate rises sharply.

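The problem is that taking the top-k at every step makes the tree, and with it the computational cost, grow exponentially with the number of steps. The authors therefore use a fixed expansion configuration \(\left\langle k_1, k_2, \ldots, k_m\right\rangle\), meaning m decoding steps with k_i candidates selected at step i.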
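A minimal sketch of expansion under such a configuration, assuming a nested-dict tree and a hypothetical `ssm_probs` callable that returns a token-to-probability dict for a given sequence. The toy SSM ignores its context, which a real one of course would not.

```python
import heapq

def expand_token_tree(prefix, ssm_probs, config):
    """Build a token tree where step i branches into the top-k_i tokens."""
    tree = {}
    frontier = [(tuple(prefix), tree)]          # (sequence so far, its subtree)
    for k in config:
        next_frontier = []
        for seq, node in frontier:
            probs = ssm_probs(seq)
            for token in heapq.nlargest(k, probs, key=probs.get):
                node[token] = {}
                next_frontier.append((seq + (token,), node[token]))
        frontier = next_frontier
    return tree

def count_nodes(tree):
    return sum(1 + count_nodes(child) for child in tree.values())

def toy_ssm_probs(seq):
    # Hypothetical SSM: a fixed distribution regardless of context.
    return {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}

tree = expand_token_tree(["<s>"], toy_ssm_probs, config=(2, 2, 1))
print(count_nodes(tree))
```

With the configuration (2, 2, 1) the levels hold 2, 4, and 4 nodes. Setting most k_i to 1 is what keeps the tree from blowing up.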

How to expand the tree dynamically is still future work: We acknowledge that dynamically expanding a token tree from an SSM is an opening research problem beyond the scope of this paper, which we leave as future work.

Merge-based token tree construction

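This merges the outputs of multiple SSMs. The SSMs first go through a distillation procedure:

1) Start from a pool of SSMs in random order. Take the first SSM and distill it on a corpus; once that is done, evaluate it on the corpus and mark the samples on which it still disagrees with the LLM.

2) Take the next SSM and distill it only on the samples that were still inconsistent after the previous round. Repeat until no samples remain.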
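The two steps above can be sketched as the loop below. `boost_tune`, `fine_tune`, and `agrees_with_llm` are hypothetical names; the toy "SSM" is just a dict that memorizes a limited number of samples to mimic limited capacity. The pool is assumed to be shuffled already, and the loop also stops when the pool runs out.

```python
def boost_tune(ssm_pool, corpus, fine_tune, agrees_with_llm):
    """Distill each SSM only on the samples earlier SSMs still get wrong."""
    remaining = list(corpus)
    tuned = []
    for ssm in ssm_pool:
        if not remaining:
            break
        ssm = fine_tune(ssm, remaining)
        tuned.append(ssm)
        # Keep only the samples on which this SSM disagrees with the LLM.
        remaining = [s for s in remaining if not agrees_with_llm(ssm, s)]
    return tuned, remaining


# Toy stand-ins: an SSM "learns" at most `capacity` samples per round.
def toy_fine_tune(ssm, samples):
    return {"capacity": ssm["capacity"],
            "learned": set(samples[: ssm["capacity"]])}

def toy_agrees(ssm, sample):
    return sample in ssm["learned"]

pool = [{"capacity": 3, "learned": set()},
        {"capacity": 2, "learned": set()},
        {"capacity": 2, "learned": set()}]
corpus = ["p0", "p1", "p2", "p3", "p4", "p5"]
tuned, remaining = boost_tune(pool, corpus, toy_fine_tune, toy_agrees)
print(len(tuned), remaining)
```

Each SSM ends up specialized on a different slice of the corpus, which is where the diversity across SSMs comes from.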

The token trees produced by the SSMs are then merged; no special algorithm is involved.
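A sketch of what such a merge could look like, assuming nested-dict trees: the merged tree is the union of all root-to-node paths, with shared prefixes collapsed. `merge_trees` is my own illustration, not SpecInfer's code.

```python
def merge_trees(*trees):
    """Union of token trees given as nested dicts {token: subtree}."""
    merged = {}
    for tree in trees:
        for token, subtree in tree.items():
            # Same token under the same parent: merge the subtrees recursively.
            merged[token] = merge_trees(merged.get(token, {}), subtree)
    return merged


tree_ssm1 = {"cat": {"sat": {}}}
tree_ssm2 = {"cat": {"ran": {}}, "dog": {}}
merged = merge_trees(tree_ssm1, tree_ssm2)
print(merged)
```

The two SSMs agree on "cat", so that node appears once and carries both continuations.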

How best to ensemble the SSMs is still future work: Note that, in addition to boosting, there are several other ensemble learning methods (e.g., voting, bagging, and stacking) that can be used to combine the outputs from multiple SSMs, and we leave the exploration as future work.

Token Tree Verifier

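Two things prevent a token tree from being verified in parallel out of the box: KV cache reuse and the attention computation. The fix is simple: change the attention mask so that it follows the tree structure (this does require rewriting the attention kernel). Each tree attention pass then needs only a single kernel launch.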
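A minimal sketch of such a mask, assuming the tree is flattened into a list in which every node appears after its parent and `parents[i]` gives the parent index (-1 for the root). Node i may attend to node j only if j is i itself or one of its ancestors. The function name and layout are assumptions; a real kernel would apply this mask on top of the cached prefix.

```python
def tree_attention_mask(parents):
    """mask[i][j] is True iff node j is node i or an ancestor of node i."""
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:          # walk up to the root
            mask[i][j] = True
            j = parents[j]
    return mask


# Tree: 0 is the root; 1 and 2 are its children; 3 is under 1; 4 is under 2.
parents = [-1, 0, 0, 1, 2]
mask = tree_attention_mask(parents)
for row in mask:
    print(["x" if allowed else "." for allowed in row])
```

Nodes 3 and 4 sit on different branches and cannot see each other, so every branch is scored as if it had been decoded on its own while all branches share one pass.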


System Implementation

Distributed LLM

This part covers SpecInfer's attempt at distributed LLM inference and acknowledges that transferring activations takes up a large share of the communication time: Distributed LLM inference is largely limited by the latency to transfer intermediate activations between GPUs for each LLM decoding step. While SpecInfer’s approach does not directly reduce the amount of inter-GPU communications, its verification mechanism can increase the communication granularity and reduce the number of decoding steps.

Experiments

LLMs

LLaMA-7B, OPT-13B, OPT-30B, and LLaMA-65B as the LLMs, and LLaMA-68M and OPT-125M as the SSMs.

Datasets

Only the prompts are used.

We evaluate SpecInfer on five datasets: Chatbot Instruction Prompts (CIP) [34], ChatGPT Prompts (CP) [30], WebQA [1], Alpaca [36, 45], and PIQA [2]. We only use the prompts/questions from these datasets to form our input prompts to simulate real-world conversation traces.

Reference