add fused_codegeex_qkv_reshape #9927

BBuf · 2023-03-02T04:03:54Z

In the attention part of codegeex, reshaping q, k, and v requires three to-contiguous operations on every iteration: https://github.com/Oneflow-Inc/one-codegeex/blob/main/codegeex/oneflow/codegeex_model.py#L112-L138

This PR is expected to remove those three to-contiguous operations and reduce eager scheduling overhead. nsys profiles to follow.

Taking the self-attention block at the same point in the timeline, nsys for the original codegeex:

nsys for this PR:

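As the profiles show, fused codegeex qkv reshape avoids the scheduling overhead of the three to-contiguous calls (memcpy d2d here) as well as the scheduling overhead of the separate view ops. CUDA kernel time also drops from 48us to 39us.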
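To make the idea concrete, here is a minimal CPU-side sketch in plain Python, not the kernel from this PR. It assumes a packed layout in which each row stores `[q | k | v]` back to back (`head_dim` values each); the function names and the flat-list representation are hypothetical and only for illustration. The unfused version walks the packed buffer three times, once per output, standing in for the three separate to-contiguous copies. The fused version walks it once and fills all three contiguous outputs together.

```python
def split_qkv_unfused(packed, rows, head_dim):
    """Three independent strided passes, one per output.

    Stand-in for three separate to-contiguous copies (assumed layout:
    each row is [q | k | v], head_dim values per part).
    """
    width = 3 * head_dim
    outs = []
    for part in range(3):  # 0 = q, 1 = k, 2 = v
        out = []
        for r in range(rows):
            start = r * width + part * head_dim
            out.extend(packed[start:start + head_dim])
        outs.append(out)
    return tuple(outs)


def split_qkv_fused(packed, rows, head_dim):
    """Single pass over the packed buffer that fills q, k and v together."""
    width = 3 * head_dim
    q, k, v = [], [], []
    for r in range(rows):
        base = r * width
        q.extend(packed[base:base + head_dim])
        k.extend(packed[base + head_dim:base + 2 * head_dim])
        v.extend(packed[base + 2 * head_dim:base + width])
    return q, k, v


# Hypothetical toy input: 2 rows, head_dim = 2, values 0..11.
rows, head_dim = 2, 2
packed = list(range(rows * 3 * head_dim))
q, k, v = split_qkv_fused(packed, rows, head_dim)
```

Both functions produce the same three outputs; the difference is one traversal instead of three, which on the GPU corresponds to one kernel launch instead of several copies and views.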

github-actions · 2023-03-03T02:03:34Z

Code got formatted by CI. Please request CI again if you still want to have this PR merged. If the PR is from a forked repo, please download the patch files from the GitHub Actions web page and apply them locally.

github-actions · 2023-03-03T02:49:52Z

Code got formatted by CI. Please request CI again if you still want to have this PR merged. If the PR is from a forked repo, please download the patch files from the GitHub Actions web page and apply them locally.

github-actions · 2023-03-03T03:51:04Z

Code got formatted by CI. Please request CI again if you still want to have this PR merged. If the PR is from a forked repo, please download the patch files from the GitHub Actions web page and apply them locally.

github-actions · 2023-03-03T07:08:53Z

Speed stats:

GPU Name: GeForce GTX 1080 

❌ OneFlow resnet50 time: 141.3ms (= 14132.3ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 144.7ms (= 14465.4ms / 100, input_shape=[16, 3, 224, 224])
❌ Relative speed: 1.02 (= 144.7ms / 141.3ms)

OneFlow resnet50 time: 83.0ms (= 8296.5ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 87.0ms (= 8695.9ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.05 (= 87.0ms / 83.0ms)

OneFlow resnet50 time: 51.1ms (= 10215.3ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 60.2ms (= 12047.2ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.18 (= 60.2ms / 51.1ms)

OneFlow resnet50 time: 34.1ms (= 6812.6ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 42.7ms (= 8535.3ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.25 (= 42.7ms / 34.1ms)

OneFlow resnet50 time: 26.4ms (= 5285.4ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 36.5ms (= 7303.3ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 1.38 (= 36.5ms / 26.4ms)

OneFlow swin dataloader time: 0.245s (= 48.903s / 200, num_workers=1)
PyTorch swin dataloader time: 0.150s (= 29.967s / 200, num_workers=1)
Relative speed: 0.613 (= 0.150s / 0.245s)

OneFlow swin dataloader time: 0.069s (= 13.884s / 200, num_workers=4)
PyTorch swin dataloader time: 0.041s (= 8.147s / 200, num_workers=4)
Relative speed: 0.587 (= 0.041s / 0.069s)

OneFlow swin dataloader time: 0.046s (= 9.140s / 200, num_workers=8)
PyTorch swin dataloader time: 0.022s (= 4.428s / 200, num_workers=8)
Relative speed: 0.485 (= 0.022s / 0.046s)

❌ OneFlow resnet50 time: 153.6ms (= 15356.7ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 165.9ms (= 16594.5ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
❌ Relative speed: 1.08 (= 165.9ms / 153.6ms)

OneFlow resnet50 time: 93.5ms (= 9355.0ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 105.8ms (= 10577.1ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.13 (= 105.8ms / 93.5ms)

OneFlow resnet50 time: 60.8ms (= 12163.2ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 88.9ms (= 17774.4ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.46 (= 88.9ms / 60.8ms)

OneFlow resnet50 time: 43.4ms (= 8686.6ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 75.3ms (= 15057.0ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.73 (= 75.3ms / 43.4ms)

OneFlow resnet50 time: 37.7ms (= 7540.4ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 67.0ms (= 13392.6ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.78 (= 67.0ms / 37.7ms)

github-actions · 2023-03-03T07:09:12Z

CI failed when running job: cpu-module. PR label automerge has been removed

github-actions · 2023-03-03T07:28:48Z

Speed stats:

GPU Name: GeForce GTX 1080 

❌ OneFlow resnet50 time: 141.4ms (= 14142.1ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 144.5ms (= 14450.4ms / 100, input_shape=[16, 3, 224, 224])
❌ Relative speed: 1.02 (= 144.5ms / 141.4ms)

OneFlow resnet50 time: 83.8ms (= 8384.7ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 86.5ms (= 8646.2ms / 100, input_shape=[8, 3, 224, 224])
❌ Relative speed: 1.03 (= 86.5ms / 83.8ms)

OneFlow resnet50 time: 51.3ms (= 10256.9ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 57.8ms (= 11558.7ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.13 (= 57.8ms / 51.3ms)

OneFlow resnet50 time: 33.7ms (= 6730.6ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 47.8ms (= 9568.4ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.42 (= 47.8ms / 33.7ms)

OneFlow resnet50 time: 25.5ms (= 5108.0ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 36.7ms (= 7349.6ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 1.44 (= 36.7ms / 25.5ms)

OneFlow swin dataloader time: 0.237s (= 47.495s / 200, num_workers=1)
PyTorch swin dataloader time: 0.154s (= 30.862s / 200, num_workers=1)
Relative speed: 0.650 (= 0.154s / 0.237s)

OneFlow swin dataloader time: 0.067s (= 13.495s / 200, num_workers=4)
PyTorch swin dataloader time: 0.042s (= 8.459s / 200, num_workers=4)
Relative speed: 0.627 (= 0.042s / 0.067s)

OneFlow swin dataloader time: 0.044s (= 8.863s / 200, num_workers=8)
PyTorch swin dataloader time: 0.022s (= 4.363s / 200, num_workers=8)
Relative speed: 0.492 (= 0.022s / 0.044s)

❌ OneFlow resnet50 time: 153.6ms (= 15361.9ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 166.0ms (= 16596.0ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
❌ Relative speed: 1.08 (= 166.0ms / 153.6ms)

OneFlow resnet50 time: 94.0ms (= 9399.0ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 103.9ms (= 10389.6ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.11 (= 103.9ms / 94.0ms)

OneFlow resnet50 time: 61.4ms (= 12284.8ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 79.9ms (= 15987.2ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.30 (= 79.9ms / 61.4ms)

OneFlow resnet50 time: 43.9ms (= 8786.2ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 72.0ms (= 14394.7ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.64 (= 72.0ms / 43.9ms)

OneFlow resnet50 time: 37.1ms (= 7429.7ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 67.1ms (= 13427.4ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.81 (= 67.1ms / 37.1ms)

github-actions · 2023-03-03T07:33:52Z

View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/9927/

github-actions · 2023-03-03T15:41:27Z

Speed stats:

GPU Name: GeForce GTX 1080 

❌ OneFlow resnet50 time: 141.2ms (= 14120.1ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 143.4ms (= 14337.2ms / 100, input_shape=[16, 3, 224, 224])
❌ Relative speed: 1.02 (= 143.4ms / 141.2ms)

OneFlow resnet50 time: 82.8ms (= 8283.3ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 89.4ms (= 8940.4ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.08 (= 89.4ms / 82.8ms)

OneFlow resnet50 time: 50.9ms (= 10181.7ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 62.9ms (= 12575.8ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.24 (= 62.9ms / 50.9ms)

OneFlow resnet50 time: 33.8ms (= 6766.9ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 44.4ms (= 8870.7ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.31 (= 44.4ms / 33.8ms)

OneFlow resnet50 time: 25.6ms (= 5119.5ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 39.9ms (= 7971.5ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 1.56 (= 39.9ms / 25.6ms)

OneFlow swin dataloader time: 0.237s (= 47.424s / 200, num_workers=1)
PyTorch swin dataloader time: 0.147s (= 29.433s / 200, num_workers=1)
Relative speed: 0.621 (= 0.147s / 0.237s)

OneFlow swin dataloader time: 0.069s (= 13.897s / 200, num_workers=4)
PyTorch swin dataloader time: 0.043s (= 8.623s / 200, num_workers=4)
Relative speed: 0.620 (= 0.043s / 0.069s)

OneFlow swin dataloader time: 0.040s (= 7.988s / 200, num_workers=8)
PyTorch swin dataloader time: 0.022s (= 4.391s / 200, num_workers=8)
Relative speed: 0.550 (= 0.022s / 0.040s)

❌ OneFlow resnet50 time: 153.2ms (= 15320.8ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 164.8ms (= 16475.5ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
❌ Relative speed: 1.08 (= 164.8ms / 153.2ms)

OneFlow resnet50 time: 94.2ms (= 9415.4ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 103.2ms (= 10319.6ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.10 (= 103.2ms / 94.2ms)

OneFlow resnet50 time: 60.9ms (= 12183.8ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 78.9ms (= 15787.2ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.30 (= 78.9ms / 60.9ms)

OneFlow resnet50 time: 43.2ms (= 8636.7ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 69.4ms (= 13880.9ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.61 (= 69.4ms / 43.2ms)

OneFlow resnet50 time: 36.6ms (= 7323.8ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 65.9ms (= 13174.6ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.80 (= 65.9ms / 36.6ms)

github-actions · 2023-03-03T15:48:38Z

View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/9927/

add fused_codegeex_qkv_transpose op

0f021d3

BBuf requested review from hjchen2, jackalcooper, daquexian and liujuncheng as code owners March 2, 2023 04:03

rename kernel

829d36c

BBuf changed the title ~~add fused_codegeex_qkv_transpose~~ add fused_codegeex_qkv_reshape Mar 2, 2023

fix kernel regist bug

5fcd708

BBuf mentioned this pull request Mar 2, 2023

Add fused codegeex qkv reshape optimize Oneflow-Inc/one-codegeex#14

Merged

BBuf added 2 commits March 2, 2023 11:37

add pack read and write

90aca37

revert useless change

dd3f63a

marigoold approved these changes Mar 2, 2023

refine grid_size

c012802

BBuf requested a review from oneflow-ci-bot March 3, 2023 02:00

BBuf added enhancement automerge eager api labels Mar 3, 2023

liujuncheng approved these changes Mar 3, 2023

Merge branch 'master' into add_fused_codegeex_qkv_transpose_op

b5cc624

BBuf requested review from oneflow-ci-bot and removed request for oneflow-ci-bot March 3, 2023 02:01

auto format by CI

7827867

BBuf requested review from oneflow-ci-bot and removed request for oneflow-ci-bot March 3, 2023 02:48

BBuf and others added 2 commits March 3, 2023 03:48

fix license error

8a621d4

auto format by CI

5800428

BBuf requested review from oneflow-ci-bot and removed request for oneflow-ci-bot March 3, 2023 04:07

Merge branch 'master' into add_fused_codegeex_qkv_transpose_op

a9b904c

BBuf requested review from oneflow-ci-bot and removed request for oneflow-ci-bot March 3, 2023 05:56

github-actions bot removed the automerge label Mar 3, 2023

BBuf requested review from oneflow-ci-bot and removed request for oneflow-ci-bot March 3, 2023 07:11

BBuf added the automerge label Mar 3, 2023

BBuf requested review from oneflow-ci-bot and removed request for oneflow-ci-bot March 3, 2023 08:28

BBuf and others added 2 commits March 3, 2023 18:09

Merge branch 'master' into add_fused_codegeex_qkv_transpose_op

9a625a0

Merge branch 'master' into add_fused_codegeex_qkv_transpose_op

db58fe6

mergify bot merged commit 300bd67 into master Mar 3, 2023

mergify bot deleted the add_fused_codegeex_qkv_transpose_op branch March 3, 2023 16:22

BBuf mentioned this pull request Mar 10, 2023

add fused_codegeex_qkv_reshape Oneflow-Inc/one-codegeex#21

Merged
