H2O filtering KV cache

Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models

Posted by Kylin on January 8, 2024

[TOC]

Abs

challenge: considering LLM inference, KV cache scaling linearly with the sequence length and batch size.

insight: a small portion of tokens contributes most of the value when computing attention scores.

solution: Heavy Hitter Oracle (H2O), a KV cache eviction policy that dynamically retains a balance of recent and Heavy Hitters (H2) tokens.

Introduction

ideal KV cache 应该有三个特征:

  • a small cache size to reduce memory footprint
  • a low miss rate to maintain the performance and long-content generation ability of LLMs
  • a low-cost eviction policy to reduce the wall-clock time during generation

但是有3个challenge:

  • it is not immediately clear whether the size of the KV cache can be restricted—each decoding step might, in principle, require access to all previous attention keys and values.
  • identifying an optimal eviction policy that maintains generation accuracy is a combinatorial problem【Belady’s Algorithm is optimal for standard cache, but not necessarily for KV cache】.
  • even if an optimal policy can be brute-forced, it is infeasible for deployment on real-world applications

论文的preliminary实验的三个observations:

  • Sparsity for small cache size: only 5% of the KV cache is sufficient for decoding the same output token at each generation step

  • Heavy-Hitters for low miss rate: We discover that the accumulated attention scores of all tokens in attention blocks adhere to a power-law distribution.

    H2O provides an opportunity to step away from the combinatorial search problem and identify an eviction policy that maintains accuracy

  • Greedy algorithm for low-cost policy: greedy 保留最后几个tokens是有意义的

without缩写成w.o.

所以螺纹的思路就是:

1)证明H2现象的存在

2)H2可以被greedy的办法模拟(最后感觉可以做成一个类似体系结构cache的概念)

Problem Formulation

image-20240108234314391

image-20240108234300776

Insights

  • Sparsity for Small Cache Size

image-20240108234438223

  • Heavy-Hitters for Low Miss Rate

其实就是讲幂率分布,少部分token决定大部分attention

Heavy-Hitter Oracle

image-20240109000353183

Experiment

Baselines

  • FlexGen(H2O就是在FlexGen基础上实现的)
  • DeepSpeed Zero-Inference
  • Hugging Face Accelerate

maintaining generation quality across a broad spectrum of domains and tasks.

Reference