FlashAttention

[TOC]

Goal: reduce the number of memory reads/writes between GPU high-bandwidth memory (HBM) and GPU on-chip SRAM.

[Figure: screenshot 2023-12-07 10.08.43]

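Why does IO-awareness matter for Transformers?

Profiling a PyTorch implementation of attention shows that matmul, the operation that makes the best use of GPU parallelism, actually accounts for only a small share of the total runtime.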

[Figure: screenshot 2023-12-07 10.30.20]

The other operations, by contrast, are IO-bound:

[Figure: screenshot 2023-12-07 10.28.46]
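
A rough way to see why such operations end up IO-bound is to compare arithmetic intensity, i.e. FLOPs per byte of HBM traffic. The sketch below is a back-of-the-envelope estimate, not a measurement: the function names, the figure of 5 FLOPs per softmax element, fp16 storage (2 bytes per element), and the sizes `n = 4096`, `d = 64` are all assumptions chosen for illustration.

```python
def matmul_intensity(n: int, d: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte of HBM traffic for S = Q @ K^T, with Q and K of shape (n, d)."""
    flops = 2 * n * n * d                            # one multiply and one add per term
    traffic = (2 * n * d + n * n) * bytes_per_elem   # read Q and K, write S
    return flops / traffic


def softmax_intensity(n: int, flops_per_elem: int = 5, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte for a row-wise softmax over an (n, n) matrix stored in HBM."""
    flops = flops_per_elem * n * n                   # assumed: max, subtract, exp, sum, divide
    traffic = 2 * n * n * bytes_per_elem             # read S once, write P once
    return flops / traffic


if __name__ == "__main__":
    n, d = 4096, 64  # hypothetical sequence length and head dimension
    print(f"matmul : {matmul_intensity(n, d):.1f} FLOPs/byte")
    print(f"softmax: {softmax_intensity(n):.2f} FLOPs/byte")
```

Under these assumptions softmax does roughly one FLOP per byte moved, while the matmul does dozens, so the elementwise steps are limited by memory traffic rather than by arithmetic.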

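Main approach: instead of loading the whole attention computation at once, load the inputs block by block and write the results back block by block: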

[Figure: 640-1]
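
As a minimal sketch of the "load a block, compute, store a block" pattern, here is a tiled matrix multiply in plain Python. It is only an illustration of tiling for an op whose decomposition is standard, not FlashAttention itself; `blocked_matmul`, the `block` parameter, and the use of a small local accumulator to stand in for on-chip SRAM are assumptions made for the example.

```python
def blocked_matmul(a, b, block=2):
    """Compute a @ b for list-of-lists matrices, one (block x block) tile at a time."""
    n, k, m = len(a), len(b), len(b[0])
    out = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, block):          # tile of output rows
        for j0 in range(0, m, block):      # tile of output columns
            rows = min(block, n - i0)
            cols = min(block, m - j0)
            # Accumulator for this output tile: stands in for on-chip SRAM.
            acc = [[0.0] * cols for _ in range(rows)]
            for p0 in range(0, k, block):  # walk the shared dimension tile by tile
                for i in range(rows):
                    for j in range(cols):
                        for p in range(p0, min(p0 + block, k)):
                            acc[i][j] += a[i0 + i][p] * b[p][j0 + j]
            # Write the finished tile back to the output exactly once.
            for i in range(rows):
                out[i0 + i][j0:j0 + cols] = acc[i]
    return out
```

Each output tile is accumulated locally and written out once, so partial sums never travel back and forth to slow memory. Matmul tiles this way because its partial results simply add up.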

The key difficulty, however, is a blockwise algorithm for softmax, since the decomposition of the other ops is already well established.
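
The difficulty is that softmax normalizes by a sum over the whole row, which is not available while only one block is loaded. A common way around this is to keep a running maximum and a running sum and rescale them as each block arrives. The following is a minimal sketch of that idea on a 1-D vector in pure Python; the function names and the `block` size are illustrative assumptions, and the final normalization here revisits the inputs, whereas a fused kernel would rescale its output accumulator instead.

```python
import math


def softmax_reference(xs):
    """Ordinary softmax that needs the whole vector at once."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_blockwise(xs, block=4):
    """Softmax whose normalizer is built from one block at a time."""
    running_max = -math.inf
    running_sum = 0.0
    for start in range(0, len(xs), block):
        chunk = xs[start:start + block]
        new_max = max(running_max, max(chunk))
        # Rescale the old sum to the new maximum, then add this block's terms.
        running_sum = running_sum * math.exp(running_max - new_max) + sum(
            math.exp(x - new_max) for x in chunk
        )
        running_max = new_max
    # Second pass: normalize with the global statistics gathered above.
    return [math.exp(x - running_max) / running_sum for x in xs]
```

Both functions should agree up to floating-point rounding, regardless of the block size.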

Proof of the FlashAttention blockwise softmax algorithm¹
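
As a sketch of the identity such a proof rests on (the notation $m$, $f$, $\ell$ is chosen here for illustration), split a vector as $x = [x^{(1)}, x^{(2)}]$ and define, for any vector $v$, $m(v) = \max_i v_i$, $f(v) = [e^{v_1 - m(v)}, \dots, e^{v_B - m(v)}]$, and $\ell(v) = \sum_i f(v)_i$. Then the statistics of the full vector follow from the per-block statistics:

$$
\begin{aligned}
m(x) &= \max\bigl(m(x^{(1)}),\, m(x^{(2)})\bigr), \\
f(x) &= \bigl[\, e^{m(x^{(1)}) - m(x)} f(x^{(1)}),\;\; e^{m(x^{(2)}) - m(x)} f(x^{(2)}) \,\bigr], \\
\ell(x) &= e^{m(x^{(1)}) - m(x)}\, \ell(x^{(1)}) + e^{m(x^{(2)}) - m(x)}\, \ell(x^{(2)}), \\
\operatorname{softmax}(x) &= \frac{f(x)}{\ell(x)}.
\end{aligned}
$$

Each line follows from $e^{x_i - m(x)} = e^{m(x^{(k)}) - m(x)} \, e^{x_i - m(x^{(k)})}$ for $x_i$ in block $k$, so only $m$ and $\ell$ need to be carried from one block to the next.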

¹ "From FlashAttention to PagedAttention: How to Further Optimize Attention Performance" (从 FlashAttention 到 PagedAttention, 如何进一步优化 Attention 性能), https://zhuanlan.zhihu.com/p/638468472