LLM Inference Optimization 2403 Review

Advances in LLM Optimization Techniques

Posted by Kylin on March 4, 2024

[TOC]

Cascade Inference [1]

A memory-bandwidth-efficient self-attention operator for batches of requests that share a common prompt prefix.
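The key observation is that a query's attention over the full KV cache can be split into a shared-prefix part and a per-request suffix part, each carrying a log-sum-exp (LSE) statistic, and the two partial results can then be merged exactly. Below is a minimal NumPy sketch of that merge rule under my own naming (`attention_with_lse`, `merge_states`); it illustrates the math only, not FlashInfer's actual multi-level kernels.

```python
import numpy as np

def attention_with_lse(q, K, V):
    """Single-query attention over (K, V): returns the output and the
    log-sum-exp (LSE) of the attention scores, which is needed for merging."""
    scores = K @ q / np.sqrt(q.shape[-1])   # (n,) scaled dot-product scores
    lse = np.log(np.exp(scores).sum())      # scalar log-sum-exp of the scores
    out = np.exp(scores - lse) @ V          # softmax-weighted sum of values
    return out, lse

def merge_states(o1, lse1, o2, lse2):
    """Exactly merge two partial attention states (e.g. shared prefix + suffix)."""
    w1, w2 = np.exp(lse1), np.exp(lse2)
    return (w1 * o1 + w2 * o2) / (w1 + w2)

# Toy check: prefix state merged with suffix state == attention over the full KV.
rng = np.random.default_rng(0)
d, n_prefix, n_suffix = 8, 16, 4
q = rng.normal(size=d)
K = rng.normal(size=(n_prefix + n_suffix, d))
V = rng.normal(size=(n_prefix + n_suffix, d))

o_p, lse_p = attention_with_lse(q, K[:n_prefix], V[:n_prefix])   # shared prefix
o_s, lse_s = attention_with_lse(q, K[n_prefix:], V[n_prefix:])   # per-request suffix
o_full, _ = attention_with_lse(q, K, V)
assert np.allclose(merge_states(o_p, lse_p, o_s, lse_s), o_full)
```

Because the prefix state is computed once per batch and only the small suffix attention is computed per request, the shared prefix's KV cache is read from memory a single time instead of once per request, which is where the bandwidth saving comes from.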

SGLang [2]

Proposes RadixAttention, in which the KV cache is organized as a radix (prefix) tree; attention over such a tree-structured cache can be accelerated with multi-level Cascade Inference.
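As a rough illustration of the data structure (not SGLang's actual implementation, which compresses edges into token spans and evicts nodes with an LRU policy), here is a toy token-granular prefix tree in which requests sharing a prompt prefix reuse the same KV-cache slots; `RadixCache` and `kv_slot` are hypothetical names for this sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class RadixNode:
    # Children keyed by the next token id; each node "owns" one KV-cache slot.
    children: Dict[int, "RadixNode"] = field(default_factory=dict)
    kv_slot: Optional[int] = None  # index into a hypothetical KV-cache pool

class RadixCache:
    """Toy token-level prefix tree: requests with a shared prompt prefix
    reuse the KV slots already allocated for that prefix."""
    def __init__(self) -> None:
        self.root = RadixNode()
        self.next_slot = 0

    def match_prefix(self, tokens: List[int]) -> List[int]:
        """Return the KV slots of the longest cached prefix of `tokens`."""
        node, slots = self.root, []
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            slots.append(node.kv_slot)
        return slots

    def insert(self, tokens: List[int]) -> None:
        """Insert `tokens`, allocating new KV slots only for the uncached suffix."""
        node = self.root
        for t in tokens:
            if t not in node.children:
                node.children[t] = RadixNode(kv_slot=self.next_slot)
                self.next_slot += 1
            node = node.children[t]

cache = RadixCache()
cache.insert([1, 2, 3, 10, 11])            # request A allocates slots 0..4
cache.insert([1, 2, 3, 20])                # request B reuses slots 0..2 for [1, 2, 3]
print(cache.match_prefix([1, 2, 3, 20]))   # -> [0, 1, 2, 5]
```

Each tree level groups requests by a shared prefix, which is exactly the structure that multi-level Cascade Inference can exploit: one attention pass per shared node, merged down the tree toward the per-request leaves.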

LMaaS [3]

A survey of effective invocation methods for massive LLM services (Language-Model-as-a-Service).

References

  1. Cascade Inference: Memory Bandwidth Efficient Shared Prefix Batch Decoding. https://flashinfer.ai/2024/02/02/cascade-inference.html 

  2. Efficiently Programming Large Language Models using SGLang. https://arxiv.org/abs/2312.07104 

  3. A Survey on Effective Invocation Methods of Massive LLM Services