Overview
In this PR we implement FlashAttention forward + backward kernels for FP32 training.
All results were measured on a V100 with B = 4, T = 1024, C = 768, NH = 12.
Requirements
The kernels require >= 64 KB of shared memory, so they should work on all GPUs with compute capability >= 7.0 (sm_70).
FP32 end-to-end training
Training was done with B = 4, T = 1024, C = 768, NH = 12.
We use:
For reasons we have not yet pinned down, training with the FlashAttention kernels results in a slightly higher loss.
Long context benchmark
We also tested long-context performance by fixing B = 4, C = 768, NH = 12 and scaling T.
Improvements
We can improve the kernels further by permuting the shared memory layout to reduce bank conflicts.