FLAT Attention

An Optimized Dataflow for Mitigating Attention Bottlenecks

Posted by Kylin on November 9, 2023

[TOC]

Abstract

The problem with attention: large memory requirements and computational complexity. This limitation stems from inherently limited data-reuse opportunities and the quadratic growth of the memory footprint.

The proposed solution: a tailored dataflow optimization that turns the quadratic growth of the memory footprint into a merely linear one.

The method both mitigates the off-chip bandwidth bottleneck and reduces the on-chip memory requirement.

So both the problem being solved (off-chip data transfer) and the way it is solved (tiling & scheduling) are the same as in FlashAttention.

Roofline Model[^3]

[Figure: the roofline model]

  • Operational Intensity (OI): the amount of compute per unit of memory traffic for a kernel, OI := FLOPs / bytes accessed from memory.
  • For a given piece of hardware, the ridge-point intensity $I_{max}$ (peak FLOP/s divided by peak memory bandwidth) is the smallest OI at which peak compute can be reached (see the formula below).
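
For reference, the standard roofline bound ties these quantities together (this is the generic model from [^3], nothing specific to FLAT): a kernel with operational intensity $I$ on a machine with peak compute $P_{\text{peak}}$ and peak memory bandwidth $B_{\text{mem}}$ can attain at most

$$
P_{\text{attainable}} = \min\bigl(P_{\text{peak}},\; B_{\text{mem}} \cdot I\bigr),
\qquad
I_{\max} = \frac{P_{\text{peak}}}{B_{\text{mem}}},
$$

so kernels with $I < I_{\max}$ are bandwidth-bound, and kernels with $I \ge I_{\max}$ can reach peak compute.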

DSQ[^4] ran an analysis similar to that of Ivanov et al. on the OPT-1.3B model:

  • 47.93% of latency: attention layers, memory-bound (the matrix multiplications take only around 66% of the latency within a single attention module).
  • 32.20% of latency: FC layers, compute-bound.
  • 19.87% of latency: activation & norm layers, memory-bound.

Intro & Background

Whereas FlashAttention's story centers on the number of SRAM-HBM memory accesses, FLAT-Attention focuses on operational intensity (the two views are, of course, essentially equivalent).

Two points deserve attention here, neither of which the FlashAttention paper spells out clearly in its motivation:

  • Both FlashAttention and FLAT optimize the attention block itself; at that granularity the performance bottleneck is instead the two matmuls inside self-attention, which have low operational intensity.
  • FlashAttention, and especially FLAT, do not even consider autoregressive decoding; the intensity analysis is done for the prefill phase (a rough illustration follows below).
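
To make the second point concrete, here is a rough back-of-the-envelope comparison (my own illustration, not numbers from either paper) of the attend matmul in the prefill phase (matrix-matrix) versus a single decode step (matrix-vector), counting elements as a proxy for memory traffic:

```python
def oi_attend(L_q, L_kv, dh):
    """OI of the attend matmul A[L_q, L_kv] @ V[L_kv, dh] for one head."""
    flops = 2 * L_q * L_kv * dh
    traffic = L_q * L_kv + L_kv * dh + L_q * dh  # A + V + output, in elements
    return flops / traffic

print(round(oi_attend(L_q=512, L_kv=512, dh=64), 1))  # prefill: matrix-matrix, OI ~ 102.4
print(round(oi_attend(L_q=1,   L_kv=512, dh=64), 1))  # decode: matrix-vector, OI ~ 2.0
# In the decode phase the operator degenerates to a matrix-vector product,
# so its OI collapses to roughly 2 and it is even more firmly memory-bound.
```

So an analysis based on prefill shapes understates how memory-bound the attention operators are during autoregressive generation.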


FLAT divides the operators inside attention into two classes: Activation-Weight operators (the Q/K/V/O projections) and Activation-Activation operators (L, the logit matmul $QK^\top$, and A, the attend matmul with $V$). The observation is then that L/A cap the overall operational intensity, and increasing the batch size does not help (a numeric sketch follows the figure below).

[Figure: operational intensity of Activation-Weight vs. Activation-Activation operators]
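
To see why batching does not help the Activation-Activation operators, here is a rough back-of-the-envelope sketch (again my own illustration with made-up tensor shapes, not numbers from the paper[^2]); elements are counted as a proxy for memory traffic:

```python
def oi_activation_weight(B, L, d):
    """OI of the Q projection X[B*L, d] @ W[d, d]: the weight is reused across the batch."""
    flops = 2 * B * L * d * d
    traffic = B * L * d + d * d + B * L * d  # input + weight + output, in elements
    return flops / traffic

def oi_activation_activation(B, H, L, dh):
    """OI of the logit matmul Q[B, H, L, dh] @ K^T: no operand is shared across the batch."""
    flops = 2 * B * H * L * L * dh
    traffic = 2 * B * H * L * dh + B * H * L * L  # Q + K + logits, in elements
    return flops / traffic

for B in (1, 8, 64):
    print(B,
          round(oi_activation_weight(B, L=512, d=1024), 1),
          round(oi_activation_activation(B, H=16, L=512, dh=64), 1))
# B = 1/8/64 -> A-W OI: 512.0, 910.2, 1008.2 (grows with batch, thanks to weight reuse)
#               A-A OI: 102.4 for every B (the batch size cancels out of the ratio)
```

The A-A operators therefore stay memory-bound no matter how large the batch is, which is the motivation for fusing the logit, softmax, and attend operators and tiling across the sequence dimension.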

Method[^1]

Tiling Strategy Comparisons

HyirTjABh

The tiling strategies of FLAT-Attention and FlashAttention differ: FlashAttention uses block tiling with a weight-stationary dataflow, while FLAT-Attention uses row tiling (row granularity) with an output-stationary dataflow.
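
As a functional illustration of what row-granularity, output-stationary tiling means, here is a minimal NumPy sketch (a model of the arithmetic only, not FLAT's actual on-chip dataflow; the function name and tile size are mine):

```python
import numpy as np

def attention_row_tiled(Q, K, V, rows_per_tile=64):
    """Process `rows_per_tile` query rows at a time, so only a
    (rows_per_tile x L) slice of the logits is ever materialized."""
    L, d = Q.shape
    O = np.empty_like(Q)
    for r in range(0, L, rows_per_tile):
        q = Q[r:r + rows_per_tile]                       # (t, d) tile of query rows
        logits = q @ K.T / np.sqrt(d)                    # (t, L): linear in L, not quadratic
        logits -= logits.max(axis=1, keepdims=True)      # numerically stable softmax
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        O[r:r + rows_per_tile] = probs @ V               # output stationary: each row written once
    return O

# Sanity check against the untiled reference.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((256, 64)) for _ in range(3))
ref_logits = Q @ K.T / np.sqrt(64)
ref = np.exp(ref_logits - ref_logits.max(axis=1, keepdims=True))
ref /= ref.sum(axis=1, keepdims=True)
assert np.allclose(attention_row_tiled(Q, K, V), ref @ V)
```

Each tile of query rows is finished and written out exactly once (output stationary), and only a `rows_per_tile × L` slice of the logits ever exists, so the working set grows linearly in sequence length rather than quadratically.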

Scheduling Strategy (Dataflow) Comparisons

Flash Attention

[Figures: FlashAttention scheduling dataflow]

FLAT Attention

[Figures: FLAT-Attention scheduling dataflow]

Qualitative Comparison[^1] (FLAT vs. Flash)

[Figure: qualitative comparison of FLAT-Attention and FlashAttention]

Reference

[^2]: Kao, Sheng-Chun, et al. “FLAT: An Optimized Dataflow for Mitigating Attention Bottlenecks.” Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2. 2023.

[^3]: Williams, Samuel, Andrew Waterman, and David Patterson. “Roofline: An Insightful Visual Performance Model for Multicore Architectures.” Communications of the ACM 52.4 (2009): 65-76.

[^4]: Yang, Guo, et al. “Dynamic Stashing Quantization for Efficient Transformer Training.” arXiv preprint arXiv:2303.05295 (2023).

[^1]: FLAT-Attention vs. FlashAttention. https://hackmd.io/@felixkao/HkZaooPD3