[TOC]
https://arxiv.org/pdf/2309.17453.pdf
Abs
Two gaps:
- during the decoding stage, caching previous tokens’ Key and Value states (KV) consumes extensive memory
- popular LLMs cannot generalize to texts longer than the training sequence length (in practice this is really a limitation of the positional encoding)
One insight, related to "lost in the middle" (attention mostly concentrates on the tokens at the head and tail of the context): the attention sink, i.e. keeping the KV of the initial tokens largely recovers the performance of window attention (see the sketch after the references below).
The same phenomenon is also discussed in: Lost in the Middle: How Language Models Use Long Contexts: https://browse.arxiv.org/pdf/2307.03172.pdf
Attention Is Off By One: https://www.evanmiller.org/attention-is-off-by-one.html?continueFlag=5d0e431f4edf1d8cccea47871e82fbc4
LM-Infinite: https://arxiv.org/pdf/2308.16137.pdf (Λ-shaped attention mask, which is essentially the same sink idea; the two works appeared on arXiv only three or four days apart)
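A minimal sketch of the sink-plus-window idea (my own illustration, not code from either paper): a Λ-shaped causal mask in which every query may attend to the first `n_sink` tokens plus the most recent `window` tokens. `sink_window_mask`, `n_sink`, and `window` are hypothetical names.

```python
import torch

def sink_window_mask(seq_len: int, n_sink: int = 4, window: int = 1020) -> torch.Tensor:
    """Boolean (seq_len, seq_len) mask; True means the query may attend to the key."""
    i = torch.arange(seq_len).unsqueeze(1)   # query positions, column vector
    j = torch.arange(seq_len).unsqueeze(0)   # key positions, row vector
    causal = j <= i                          # no attending to future tokens
    recent = (i - j) < window                # sliding-window suffix
    sink = j < n_sink                        # fixed prefix of attention-sink tokens
    return causal & (recent | sink)

# With seq_len=8, n_sink=2, window=3, the last query (position 7)
# attends to keys {0, 1} (sinks) and {5, 6, 7} (recent window).
print(sink_window_mask(8, n_sink=2, window=3).int())
```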
Intro
Several existing attention (KV cache) schemes, compared in the paper's Figure 1:
- Dense attention: cache every token's KV; O(T²) per-token cost, unbounded cache growth, and quality collapses beyond the pretraining length.
- Window attention: cache only the most recent L tokens' KV; O(TL) and efficient, but perplexity blows up once the initial tokens are evicted.
- Sliding window with re-computation: rebuild the KV states from the latest L tokens for each new token; good quality but O(TL²) per-token cost, too slow for streaming.
- StreamingLLM (this paper): keep the attention-sink (initial) tokens together with the recent window; O(TL) with stable perplexity on effectively unbounded text.
Meth
What does it do? It splits the sliding window of length L into a fixed prefix (the attention-sink tokens) plus a sliding suffix (the most recent tokens); a sketch of the cache policy follows.
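A minimal sketch of that rolling-cache policy, assuming a per-layer KV cache stored as tensors of shape (batch, heads, seq, head_dim). The names `evict`, `n_sink`, and `window` are made up for illustration, not the authors' API.

```python
import torch

def evict(k: torch.Tensor, v: torch.Tensor, n_sink: int = 4, window: int = 1024):
    """Shrink a (batch, heads, seq, head_dim) KV cache to n_sink sink tokens
    plus the most recent window - n_sink tokens, dropping the middle."""
    seq = k.size(2)
    if seq <= window:
        return k, v
    keep_recent = window - n_sink
    k = torch.cat([k[:, :, :n_sink], k[:, :, -keep_recent:]], dim=2)
    v = torch.cat([v[:, :, :n_sink], v[:, :, -keep_recent:]], dim=2)
    return k, v

# Toy check: a cache of 10 positions with n_sink=2, window=6 keeps
# positions {0, 1} and {6, 7, 8, 9}.
k = torch.arange(10).float().view(1, 1, 10, 1)
v = k.clone()
k2, _ = evict(k, v, n_sink=2, window=6)
print(k2.flatten())  # tensor([0., 1., 6., 7., 8., 9.])
```

Note that StreamingLLM assigns positional information relative to positions inside the rolling cache rather than positions in the original text, so the model never sees a position index beyond its pretraining window.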