Add fused scale mask bias softmax #9867
Conversation
oneflow/core/autograd/gradient_funcs/fused_scale_mask_bias_softmax.cpp
Suggestion: add some documentation here describing what this op does.
-> Maybe<void> {
  const float scale = ctx->Attr<float>("scale");
  CHECK_LE_OR_RETURN(scale, 1.);
For the shape-checking logic below, could you add some comments briefly explaining which shapes of x, mask, and bias are supported?
OK 👌
Speed stats:
View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/9867/
This PR fuses the multi-step computation
softmax(x * scale + mask + bias)
into a single kernel, reducing the number of memory accesses to improve efficiency. It targets the general attention scenario: x is the result of the query-key matmul, typically of shape [batch_size, num_heads, seq_len_q, seq_len_kv]; scale = 1/sqrt(head_size); mask and bias have shapes [batch_size, 1, 1, seq_len_kv] and [1, num_heads, seq_len_q, seq_len_kv], respectively. In some scenarios (e.g. AlphaFold) the shape of the input mask may differ:
This PR also adds dedicated handling for those special cases.
For half-precision types (fp16 and bf16), flash attention can be used instead, but flash attention does not yet support tf32 or fp32; support there should continue to improve over time.
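For clarity, here is a minimal unfused reference sketch of the computation being fused, written with NumPy rather than the new kernel; the function name reference_scale_mask_bias_softmax and the concrete sizes are illustrative only, not part of this PR.

```python
import numpy as np

def reference_scale_mask_bias_softmax(x, mask, bias, scale):
    # Unfused reference: every step materializes a full tensor, which is
    # exactly the extra memory traffic the fused kernel avoids.
    y = x * scale + mask + bias              # mask/bias broadcast over x
    y = y - y.max(axis=-1, keepdims=True)    # numerically stable softmax
    e = np.exp(y)
    return e / e.sum(axis=-1, keepdims=True)

# General attention case with the shapes described above (sizes are arbitrary).
batch_size, num_heads, seq_len_q, seq_len_kv, head_size = 2, 4, 8, 8, 16
x = np.random.randn(batch_size, num_heads, seq_len_q, seq_len_kv).astype(np.float32)
mask = np.random.randn(batch_size, 1, 1, seq_len_kv).astype(np.float32)
bias = np.random.randn(1, num_heads, seq_len_q, seq_len_kv).astype(np.float32)
# scale <= 1, consistent with the CHECK_LE_OR_RETURN(scale, 1.) check above.
out = reference_scale_mask_bias_softmax(x, mask, bias, scale=1.0 / np.sqrt(head_size))
print(out.shape)  # (2, 4, 8, 8)
```

The fused op is expected to produce the same result in a single pass over the data instead of one pass per elementwise step plus the softmax.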