add fused_codegeex_qkv_reshape #9927

Merged 13 commits on Mar 3, 2023

Conversation

@BBuf BBuf commented Mar 2, 2023

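In the attention part of codegeex, reshaping q, k, and v requires 3 to-contiguous calls on every iteration: https://github.com/Oneflow-Inc/one-codegeex/blob/main/codegeex/oneflow/codegeex_model.py#L112-L138

This PR is expected to remove these 3 to-contiguous operations and reduce eager scheduling overhead. nsys profiling screenshots to be provided.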

Taking the self-attention block at the same point in time, the nsys trace of the original codegeex:

(image)

(image)

nsys trace with this PR:

(image)

(image)

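As the traces show, fused_codegeex_qkv_reshape avoids the scheduling overhead of the three to-contiguous calls (memcpy D2D here) as well as the scheduling overhead of the separate view calls. CUDA kernel time also drops from 48 us to 39 us, roughly 19% lower.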
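For readers without the traces at hand, here is a minimal shape-only sketch of the difference described above. It is a toy model, not the OneFlow implementation: it assumes a `[seq, batch, hidden]` input layout, and `split_heads`, `unfused_reshape`, `fused_reshape`, and the op labels are hypothetical names made up for illustration. The real op signature is not shown in this thread.

```python
from typing import List, Tuple

Shape = Tuple[int, ...]


def split_heads(shape: Shape, num_heads: int) -> Shape:
    """Shape-only reshape: [seq, batch, hidden] -> [seq, batch, heads, head_dim]."""
    seq, batch, hidden = shape
    if hidden % num_heads != 0:
        raise ValueError("hidden size must be divisible by num_heads")
    return (seq, batch, num_heads, hidden // num_heads)


def unfused_reshape(
    q: Shape, k: Shape, v: Shape, num_heads: int
) -> Tuple[List[Shape], List[str]]:
    """Toy model of the eager path: every tensor gets its own view and to-contiguous op."""
    dispatched: List[str] = []
    outputs: List[Shape] = []
    for name, shape in (("q", q), ("k", k), ("v", v)):
        dispatched.append(f"view({name})")
        dispatched.append(f"to_contiguous({name})")
        outputs.append(split_heads(shape, num_heads))
    return outputs, dispatched


def fused_reshape(
    q: Shape, k: Shape, v: Shape, num_heads: int
) -> Tuple[List[Shape], List[str]]:
    """Toy model of the fused path: a single op reshapes q, k and v together."""
    outputs = [split_heads(shape, num_heads) for shape in (q, k, v)]
    # Hypothetical label only; this is not the real op signature.
    return outputs, ["fused_qkv_reshape(q, k, v)"]


if __name__ == "__main__":
    qkv_shape = (128, 2, 1024)  # hypothetical [seq, batch, hidden]
    unfused_out, unfused_ops = unfused_reshape(qkv_shape, qkv_shape, qkv_shape, 16)
    fused_out, fused_ops = fused_reshape(qkv_shape, qkv_shape, qkv_shape, 16)
    print(unfused_out[0], len(unfused_ops), len(fused_ops))
```

Both paths produce the same output shapes. The only thing the model varies is how many ops are dispatched per attention block, which is the overhead this PR targets.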

@BBuf BBuf changed the title add fused_codegeex_qkv_transpose add fused_codegeex_qkv_reshape Mar 2, 2023
@BBuf BBuf requested review from oneflow-ci-bot and removed request for oneflow-ci-bot March 3, 2023 02:01

github-actions bot commented Mar 3, 2023

Code got formatted by CI. Please request CI again if you still want to have this PR merged. If the PR is from a forked repo, please download the patch files from the GitHub Actions web page and apply them locally.

@BBuf BBuf requested review from oneflow-ci-bot and removed request for oneflow-ci-bot March 3, 2023 02:48
@BBuf BBuf requested review from oneflow-ci-bot and removed request for oneflow-ci-bot March 3, 2023 04:07
@BBuf BBuf requested review from oneflow-ci-bot and removed request for oneflow-ci-bot March 3, 2023 05:56

github-actions bot commented Mar 3, 2023

Speed stats:
GPU Name: GeForce GTX 1080 

❌ OneFlow resnet50 time: 141.3ms (= 14132.3ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 144.7ms (= 14465.4ms / 100, input_shape=[16, 3, 224, 224])
❌ Relative speed: 1.02 (= 144.7ms / 141.3ms)

OneFlow resnet50 time: 83.0ms (= 8296.5ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 87.0ms (= 8695.9ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.05 (= 87.0ms / 83.0ms)

OneFlow resnet50 time: 51.1ms (= 10215.3ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 60.2ms (= 12047.2ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.18 (= 60.2ms / 51.1ms)

OneFlow resnet50 time: 34.1ms (= 6812.6ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 42.7ms (= 8535.3ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.25 (= 42.7ms / 34.1ms)

OneFlow resnet50 time: 26.4ms (= 5285.4ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 36.5ms (= 7303.3ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 1.38 (= 36.5ms / 26.4ms)

OneFlow swin dataloader time: 0.245s (= 48.903s / 200, num_workers=1)
PyTorch swin dataloader time: 0.150s (= 29.967s / 200, num_workers=1)
Relative speed: 0.613 (= 0.150s / 0.245s)

OneFlow swin dataloader time: 0.069s (= 13.884s / 200, num_workers=4)
PyTorch swin dataloader time: 0.041s (= 8.147s / 200, num_workers=4)
Relative speed: 0.587 (= 0.041s / 0.069s)

OneFlow swin dataloader time: 0.046s (= 9.140s / 200, num_workers=8)
PyTorch swin dataloader time: 0.022s (= 4.428s / 200, num_workers=8)
Relative speed: 0.485 (= 0.022s / 0.046s)

❌ OneFlow resnet50 time: 153.6ms (= 15356.7ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 165.9ms (= 16594.5ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
❌ Relative speed: 1.08 (= 165.9ms / 153.6ms)

OneFlow resnet50 time: 93.5ms (= 9355.0ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 105.8ms (= 10577.1ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.13 (= 105.8ms / 93.5ms)

OneFlow resnet50 time: 60.8ms (= 12163.2ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 88.9ms (= 17774.4ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.46 (= 88.9ms / 60.8ms)

OneFlow resnet50 time: 43.4ms (= 8686.6ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 75.3ms (= 15057.0ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.73 (= 75.3ms / 43.4ms)

OneFlow resnet50 time: 37.7ms (= 7540.4ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 67.0ms (= 13392.6ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.78 (= 67.0ms / 37.7ms)
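As a reading aid for the figures above, the relative speed the bot prints is the PyTorch time divided by the OneFlow time, so values above 1 mean OneFlow was faster in that configuration. A minimal sketch follows. `relative_speed` is a hypothetical helper name, not part of the CI scripts, and the rounding precision is inferred from the printed numbers.

```python
def relative_speed(pytorch_total: float, oneflow_total: float) -> float:
    """PyTorch time over OneFlow time; equal iteration counts cancel out."""
    if oneflow_total <= 0:
        raise ValueError("oneflow_total must be positive")
    return pytorch_total / oneflow_total


if __name__ == "__main__":
    # Totals taken from the first resnet50 entry and the first swin dataloader entry above.
    print(round(relative_speed(14465.4, 14132.3), 2))
    print(round(relative_speed(29.967, 48.903), 3))
```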


github-actions bot commented Mar 3, 2023

CI failed when running job: cpu-module. PR label automerge has been removed

@github-actions github-actions bot removed the automerge label Mar 3, 2023
@BBuf BBuf requested review from oneflow-ci-bot and removed request for oneflow-ci-bot March 3, 2023 07:11
@BBuf BBuf added the automerge label Mar 3, 2023

github-actions bot commented Mar 3, 2023

Speed stats:
GPU Name: GeForce GTX 1080 

❌ OneFlow resnet50 time: 141.4ms (= 14142.1ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 144.5ms (= 14450.4ms / 100, input_shape=[16, 3, 224, 224])
❌ Relative speed: 1.02 (= 144.5ms / 141.4ms)

OneFlow resnet50 time: 83.8ms (= 8384.7ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 86.5ms (= 8646.2ms / 100, input_shape=[8, 3, 224, 224])
❌ Relative speed: 1.03 (= 86.5ms / 83.8ms)

OneFlow resnet50 time: 51.3ms (= 10256.9ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 57.8ms (= 11558.7ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.13 (= 57.8ms / 51.3ms)

OneFlow resnet50 time: 33.7ms (= 6730.6ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 47.8ms (= 9568.4ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.42 (= 47.8ms / 33.7ms)

OneFlow resnet50 time: 25.5ms (= 5108.0ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 36.7ms (= 7349.6ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 1.44 (= 36.7ms / 25.5ms)

OneFlow swin dataloader time: 0.237s (= 47.495s / 200, num_workers=1)
PyTorch swin dataloader time: 0.154s (= 30.862s / 200, num_workers=1)
Relative speed: 0.650 (= 0.154s / 0.237s)

OneFlow swin dataloader time: 0.067s (= 13.495s / 200, num_workers=4)
PyTorch swin dataloader time: 0.042s (= 8.459s / 200, num_workers=4)
Relative speed: 0.627 (= 0.042s / 0.067s)

OneFlow swin dataloader time: 0.044s (= 8.863s / 200, num_workers=8)
PyTorch swin dataloader time: 0.022s (= 4.363s / 200, num_workers=8)
Relative speed: 0.492 (= 0.022s / 0.044s)

❌ OneFlow resnet50 time: 153.6ms (= 15361.9ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 166.0ms (= 16596.0ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
❌ Relative speed: 1.08 (= 166.0ms / 153.6ms)

OneFlow resnet50 time: 94.0ms (= 9399.0ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 103.9ms (= 10389.6ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.11 (= 103.9ms / 94.0ms)

OneFlow resnet50 time: 61.4ms (= 12284.8ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 79.9ms (= 15987.2ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.30 (= 79.9ms / 61.4ms)

OneFlow resnet50 time: 43.9ms (= 8786.2ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 72.0ms (= 14394.7ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.64 (= 72.0ms / 43.9ms)

OneFlow resnet50 time: 37.1ms (= 7429.7ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 67.1ms (= 13427.4ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.81 (= 67.1ms / 37.1ms)


github-actions bot commented Mar 3, 2023

View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/9927/

@BBuf BBuf requested review from oneflow-ci-bot and removed request for oneflow-ci-bot March 3, 2023 08:28

github-actions bot commented Mar 3, 2023

Speed stats:
GPU Name: GeForce GTX 1080 

❌ OneFlow resnet50 time: 141.2ms (= 14120.1ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 143.4ms (= 14337.2ms / 100, input_shape=[16, 3, 224, 224])
❌ Relative speed: 1.02 (= 143.4ms / 141.2ms)

OneFlow resnet50 time: 82.8ms (= 8283.3ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 89.4ms (= 8940.4ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.08 (= 89.4ms / 82.8ms)

OneFlow resnet50 time: 50.9ms (= 10181.7ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 62.9ms (= 12575.8ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.24 (= 62.9ms / 50.9ms)

OneFlow resnet50 time: 33.8ms (= 6766.9ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 44.4ms (= 8870.7ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.31 (= 44.4ms / 33.8ms)

OneFlow resnet50 time: 25.6ms (= 5119.5ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 39.9ms (= 7971.5ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 1.56 (= 39.9ms / 25.6ms)

OneFlow swin dataloader time: 0.237s (= 47.424s / 200, num_workers=1)
PyTorch swin dataloader time: 0.147s (= 29.433s / 200, num_workers=1)
Relative speed: 0.621 (= 0.147s / 0.237s)

OneFlow swin dataloader time: 0.069s (= 13.897s / 200, num_workers=4)
PyTorch swin dataloader time: 0.043s (= 8.623s / 200, num_workers=4)
Relative speed: 0.620 (= 0.043s / 0.069s)

OneFlow swin dataloader time: 0.040s (= 7.988s / 200, num_workers=8)
PyTorch swin dataloader time: 0.022s (= 4.391s / 200, num_workers=8)
Relative speed: 0.550 (= 0.022s / 0.040s)

❌ OneFlow resnet50 time: 153.2ms (= 15320.8ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 164.8ms (= 16475.5ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
❌ Relative speed: 1.08 (= 164.8ms / 153.2ms)

OneFlow resnet50 time: 94.2ms (= 9415.4ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 103.2ms (= 10319.6ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.10 (= 103.2ms / 94.2ms)

OneFlow resnet50 time: 60.9ms (= 12183.8ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 78.9ms (= 15787.2ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.30 (= 78.9ms / 60.9ms)

OneFlow resnet50 time: 43.2ms (= 8636.7ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 69.4ms (= 13880.9ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.61 (= 69.4ms / 43.2ms)

OneFlow resnet50 time: 36.6ms (= 7323.8ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 65.9ms (= 13174.6ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.80 (= 65.9ms / 36.6ms)

@mergify mergify bot merged commit 300bd67 into master Mar 3, 2023
@mergify mergify bot deleted the add_fused_codegeex_qkv_transpose_op branch March 3, 2023 16:22