[TOC]

## Cascade Inference[^1]

A memory-bandwidth-efficient self-attention operator for shared-prefix batches: the KV-Cache of the common prefix is loaded from memory once for the whole batch instead of once per request.

## SGLang[^2]

SGLang proposes RadixAttention, in which the KV-Cache is organized as a prefix tree so that requests with a common prompt reuse the cached prefix. Attention over such a tree-structured cache can be accelerated with multi-level Cascade Inference; both ideas are sketched after the references.

## LMaaS[^3]

XXX

## Reference

[^1]: Cascade Inference: Memory Bandwidth Efficient Shared Prefix Batch Decoding. https://flashinfer.ai/2024/02/02/cascade-inference.html
[^2]: Efficiently Programming Large Language Models using SGLang. https://arxiv.org/abs/2312.07104
[^3]: A Survey on Effective Invocation Methods of Massive LLM Services.
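The primitive that makes cascade inference work is that softmax attention over a concatenated KV sequence can be computed segment by segment, provided each partial result also keeps its log-sum-exp (LSE); the partial outputs can then be merged exactly. Below is a minimal NumPy sketch of the two-level decomposition; `attention_state` and `merge_states` are illustrative names, not FlashInfer's actual API, and real kernels operate on batched multi-head tensors.

```python
import numpy as np

def attention_state(q, k, v):
    """Attention over one KV segment; returns the output together with the
    log-sum-exp of the scores so partial results can be merged later."""
    d = q.shape[-1]
    s = q @ k.T / np.sqrt(d)              # [num_q, seg_len] scores
    m = s.max(axis=-1, keepdims=True)     # stabilize softmax
    p = np.exp(s - m)
    z = p.sum(axis=-1, keepdims=True)
    o = (p / z) @ v                       # [num_q, d] partial output
    lse = (m + np.log(z)).squeeze(-1)     # [num_q] log-sum-exp
    return o, lse

def merge_states(o1, lse1, o2, lse2):
    """Combine two partial attention states into the exact softmax result."""
    m = np.maximum(lse1, lse2)
    w1 = np.exp(lse1 - m)[:, None]
    w2 = np.exp(lse2 - m)[:, None]
    return (o1 * w1 + o2 * w2) / (w1 + w2)

# Toy batch: every request shares k_prefix/v_prefix; suffixes differ.
rng = np.random.default_rng(0)
d, prefix_len, suffix_len = 64, 128, 16
k_prefix = rng.standard_normal((prefix_len, d))
v_prefix = rng.standard_normal((prefix_len, d))

q = rng.standard_normal((4, d))  # one decode query per request
for i in range(4):
    k_suf = rng.standard_normal((suffix_len, d))
    v_suf = rng.standard_normal((suffix_len, d))
    # Level 1: shared prefix (in practice computed once for the whole batch).
    o_p, lse_p = attention_state(q[i:i+1], k_prefix, v_prefix)
    # Level 2: per-request unique suffix, then merge the two states.
    o_s, lse_s = attention_state(q[i:i+1], k_suf, v_suf)
    merged = merge_states(o_p, lse_p, o_s, lse_s)
    # Matches attention over the full concatenated KV sequence.
    o_ref, _ = attention_state(q[i:i+1],
                               np.vstack([k_prefix, k_suf]),
                               np.vstack([v_prefix, v_suf]))
    assert np.allclose(merged, o_ref)
```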
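For the prefix-tree bookkeeping behind RadixAttention, here is a toy sketch under simplifying assumptions (one token per edge, integer slot ids standing in for real KV pages); it is not SGLang's implementation, which stores token runs on radix-tree edges and evicts entries with an LRU policy.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One node of a simplified prefix tree. Each edge is keyed by a token
    id; each node owns the KV slot for the token that leads to it."""
    children: dict = field(default_factory=dict)  # token id -> Node
    kv_slot: int = -1

class PrefixTree:
    def __init__(self):
        self.root = Node()
        self.next_slot = 0

    def match_prefix(self, tokens):
        """Return KV slots of the longest cached prefix of `tokens`."""
        node, slots = self.root, []
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            slots.append(node.kv_slot)
        return slots

    def insert(self, tokens):
        """Insert `tokens`, allocating KV slots only past the shared prefix."""
        node = self.root
        for t in tokens:
            if t not in node.children:
                node.children[t] = Node(kv_slot=self.next_slot)
                self.next_slot += 1
            node = node.children[t]

tree = PrefixTree()
tree.insert([1, 2, 3, 4])               # first request fills the cache
print(tree.match_prefix([1, 2, 3, 9]))  # -> [0, 1, 2]: three tokens reused
```

Requests that hit the same tree path reuse the cached prefix KV, which is exactly the shared-prefix structure that multi-level cascade inference exploits.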