
Add fused elemwise gelu and optimize performance #33480

Merged

Conversation

@wangxicoding wangxicoding (Contributor) commented Jun 9, 2021

PR types

New features

PR changes

OPs

Describe

  1. Add gelu(elemwise_add(x, y)) fusion support to the fused_elemwise_act op.
  2. Optimize the performance of the fused_elemwise_act op when y is broadcast (the broad_y case).
    The main optimizations to the fused_elemwise_act op in this PR are:
  • The forward kernel used to traverse by column; it now traverses row-first.
  • The backward kernel also used to traverse by column; because the backward pass contains a ReduceSum, it now works on (32*32) tiles, with one warp handling one row and the tiles walked in column order (see the sketch after this list).
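For illustration only, the following CUDA sketch shows the two access patterns described above; it is not the kernel code added in this PR, and the names (row_major_fwd_kernel, tile_reduce_dy_kernel), the use of plain float, and tanhf as a stand-in activation are all hypothetical simplifications.

    // Hedged sketch, not the PR's code. Shapes: x is [h, w], y is [w]
    // (broadcast over rows), out/dout are [h, w], dy is [w].

    // Forward, row-major: a plain grid-stride loop, so consecutive threads
    // read consecutive columns of the same row (coalesced access).
    __global__ void row_major_fwd_kernel(const float* x, const float* y,
                                         float* out, int h, int w) {
      int n = h * w;
      for (int idx = blockIdx.x * blockDim.x + threadIdx.x; idx < n;
           idx += gridDim.x * blockDim.x) {
        out[idx] = tanhf(x[idx] + y[idx % w]);  // tanh stands in for the fused activation
      }
    }

    // Backward ReduceSum for dy, launched with dim3 block(32, 32): each warp
    // (a fixed threadIdx.y) reads one row of a 32x32 tile per step, the
    // per-column partial sums are staged in shared memory, and the first warp
    // folds the 32 row-partials of each column before flushing with atomicAdd.
    __global__ void tile_reduce_dy_kernel(const float* dout, float* dy,
                                          int h, int w) {
      __shared__ float tile[32][33];  // padded column avoids bank conflicts
      int col = blockIdx.x * 32 + threadIdx.x;
      float sum = 0.f;
      for (int row = blockIdx.y * 32 + threadIdx.y; row < h;
           row += gridDim.y * 32) {
        if (col < w) sum += dout[row * w + col];
      }
      tile[threadIdx.y][threadIdx.x] = sum;
      __syncthreads();
      if (threadIdx.y == 0 && col < w) {
        float acc = 0.f;
        for (int k = 0; k < 32; ++k) acc += tile[k][threadIdx.x];
        atomicAdd(&dy[col], acc);
      }
    }

This only covers the dy ReduceSum; the dx part of the backward is elementwise (dout times the activation derivative) and is omitted here, and dy would have to be zero-initialized before launch because of the atomicAdd.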

Performance was measured with tanh(elemwise_add(x, y)); the test results are as follows:

No. | Problem size           | dtype | tanh(add) fwd / bwd / total (us) | fused(dev) fwd / bwd / total (us) | fused(PR) fwd / bwd / total (us) | Speedup from fusing | Speedup of PR vs. dev
----|------------------------|-------|----------------------------------|-----------------------------------|----------------------------------|---------------------|----------------------
0   | [10240, 5120] + [5120] | fp32  | 1027.77 / 1544.79 / 2572.56      | 6329.1 / 16348 / 22677.1          | 568.73 / 857.27 / 1426           | 80.40%              | 1490.26%
    |                        | fp16  | 539.9 / 913.09 / 1452.99         | 3107.6 / 6770.2 / 9877.8          | 441.28 / 682.17 / 1123.45        | 29.33%              | 779.24%
1   | [10240, 512] + [512]   | fp32  | 107.742 / 210.05 / 317.792       | 403.36 / 672.22 / 1075.58         | 61.568 / 288.29 / 349.858        | -9.17%              | 207.43%
    |                        | fp16  | 61.058 / 100.26 / 161.318        | 369.53 / 668 / 1037.53            | 52.191 / 259.74 / 311.931        | -48.28%             | 232.62%
2   | [4096, 512] + [512]    | fp32  | 46.847 / 94.69 / 141.537         | 138.82 / 270.4 / 409.22           | 27.488 / 118.27 / 145.758        | -2.90%              | 180.75%
    |                        | fp16  | 26.879 / 42.27 / 69.149          | 138.59 / 270.11 / 408.7           | 23.935 / 102.3 / 126.235         | -45.22%             | 223.76%
3   | [4096, 128] + [128]    | fp32  | 13.28 / 25.79 / 39.07            | 43.008 / 80.447 / 123.455         | 11.008 / 68.448 / 79.456         | -50.83%             | 55.38%
    |                        | fp16  | 11.159 / 17.21 / 28.369          | 41.856 / 80.063 / 121.919         | 10.624 / 54.496 / 65.12          | -56.44%             | 87.22%
4   | [1024, 128] + [128]    | fp32  | 8.832 / 16.1 / 24.932            | 13.183 / 23.264 / 36.447          | 5.44 / 16.96 / 22.4              | 11.30%              | 62.71%
    |                        | fp16  | 8.512 / 12.64 / 21.152           | 13.248 / 23.519 / 36.767          | 5.44 / 17.024 / 22.464           | -5.84%              | 63.67%
5   | [1024, 32] + [32]      | fp32  | 7.776 / 14.82 / 22.596           | 6.112 / 9.728 / 15.84             | 5.152 / 16.96 / 22.112           | 2.19%               | -28.36%
    |                        | fp16  | 7.807 / 11.36 / 19.167           | 5.248 / 7.744 / 12.992            | 5.184 / 16.64 / 21.824           | -12.17%             | -40.47%
6   | [64, 32] + [32]        | fp32  | 7.584 / 11.3 / 18.884            | 3.68 / 4.096 / 7.776              | 3.648 / 5.44 / 9.088             | 107.79%             | -14.44%
    |                        | fp16  | 7.616 / 10.94 / 18.556           | 3.712 / 4 / 7.712                 | 3.552 / 4.992 / 8.544            | 117.18%             | -9.74%
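For reference, both speedup columns appear to be derived from the total times; this is my reading of the numbers rather than something stated in the original table. Worked example for row 0, fp32:

\[
\text{speedup from fusing} = \frac{2572.56}{1426} - 1 \approx 80.40\%,
\qquad
\text{speedup of PR vs. dev} = \frac{22677.1}{1426} - 1 \approx 1490.26\%.
\]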

Performance summary:

  • Compared with the fuse in develop, this PR is more than ten times faster at large sizes, 0.5x-2x faster at typical sizes, and slightly slower at small sizes.
  • Compared with not fusing at all, performance improves at large sizes and drops slightly at medium sizes; for fp16 the fused path is actually slower because it has not yet been optimized, and this still needs work.

Follow-up optimization TODOs for the fused_elemwise_act op:

  • Add SIMD (vectorized memory access) and fp16 optimization to the forward kernel.
  • Add SIMD (vectorized memory access) and fp16 optimization to the backward kernel.
  • Let the backward kernel use temporary memory and split it into two kernels: the first computes dx row by row, the second computes dy with a ReduceSum; check whether this is faster.
  • When dx and dy are computed with the same formula in the backward kernel, reuse the result to avoid redundant computation.
  • In the backward of gelu(elemwise_add()), neither the intermediate result nor out is needed (see the gradient sketch after this list); the fused_elemwise_act backward could add a mode that does not require out, or a dedicated gelu+add fuse op could be created, which would be easier to optimize.
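To make the last TODO item concrete: with the tanh approximation of GELU (the same 0.5 * (1 + tanh_out) form visible in the diff excerpt later in this thread), the backward pass depends only on s = x + y, which can be recomputed from the inputs, so in principle neither the saved intermediate result nor out is required. A sketch of the gradients, where s and dout are hypothetical symbols rather than names from the PR:

\[
s = x + y, \qquad
\mathrm{gelu}(s) = \tfrac{1}{2}\, s \left(1 + \tanh\!\left(\sqrt{\tfrac{2}{\pi}}\left(s + 0.044715\, s^{3}\right)\right)\right),
\]
\[
dx_{ij} = dout_{ij} \cdot \mathrm{gelu}'(s_{ij}), \qquad
dy_{j} = \sum_{i} dx_{ij} \quad\text{(the ReduceSum over the broadcast dimension).}
\]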

@paddle-bot-old

paddle-bot-old bot commented Jun 9, 2021

Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

@paddle-bot-old

Sorry to inform you that d48d60b's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.

@paddle-bot-old

Sorry to inform you that 4ce62de's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.

Diff excerpt under review:

    static_cast<MT>(0.5) * (static_cast<MT>(1) + tanh_out);
    return static_cast<T>(ans);
    }
    inline HOSTDEVICE T UseOut(T x) {
A Contributor left a review comment on the excerpt above:

What does UseOut stand for? Add a comment?

The Contributor Author replied:

It should mean using Out to compute the gradient; this is currently not used.

@gongweibao gongweibao (Contributor) left a comment

LGTM

@Xreki Xreki (Contributor) left a comment

LGTM for op benchmark ci

@wangxicoding wangxicoding changed the title Add fused elemwise gelu Add fused elemwise gelu and optimize performance Jul 5, 2021
@wangxicoding wangxicoding merged commit eae3185 into PaddlePaddle:develop Jul 5, 2021
@wangxicoding wangxicoding deleted the add_fused_elemwise_gelu branch July 5, 2021 10:13