[PaddlePaddle Hackathon 3 No.33] Optimize the GPU computation performance of the erfinv op for Paddle #45057
Your PR was submitted successfully. Thank you for contributing to the open-source project!
namespace phi {

template <typename T>
struct ErfinvCUDAFunctor {
Just name it ErfinvFunctor.
OK, changed.
template <typename T>
struct ErfinvCUDAFunctor {
  HOSTDEVICE inline ErfinvCUDAFunctor() {}
If the default constructor is empty, it can be omitted.
Thanks for the suggestion; removed.
@ZzSean Could you please take another look?
LGTM
PR types
Performance optimization
PR changes
OPs
Describe
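The current GPU implementation of the erfinv operator in Paddle is composed from Eigen expressions and has no dedicated GPU kernel, so its performance is relatively poor. Building it on Paddle's existing kps API can yield a substantial performance improvement.
Design document: PaddlePaddle/community#199
1. (Option A) Following Eigen, first implement the ndtri function in the CUDA operator, then implement erfinv on top of it.
2. (Option B) Develop directly on top of the built-in API functions provided by CUDA.
Forward inference performance after the optimization, compared with Paddle before the optimization:
Forward inference performance after the optimization, compared with PyTorch:
Option A is more complex to implement and actually performs worse, so this PR adopts Option B.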
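For readers unfamiliar with Option B, the following is a minimal sketch of what an elementwise functor built on CUDA's built-in inverse error function could look like. It is not the code merged in this PR. `erfinvf` and `erfinv` are the CUDA math API functions; `HOSTDEVICE` is redefined locally to mirror the macro in the diff, and `DeviceErfinv` and `ErfinvHostReference` are hypothetical helpers introduced here. The host path is a plain Newton iteration on `std::erf`, added only so that the sketch can be compiled and checked without a GPU.

```cpp
#include <cmath>
#include <limits>

// Local stand-in for the HOSTDEVICE macro seen in the diff (assumption).
#if defined(__CUDACC__)
#define HOSTDEVICE __host__ __device__
#else
#define HOSTDEVICE
#endif

#if defined(__CUDACC__)
// Option B: forward directly to the CUDA built-ins (hypothetical wrappers).
__device__ inline float DeviceErfinv(float x) { return erfinvf(x); }
__device__ inline double DeviceErfinv(double x) { return erfinv(x); }
#endif

// Hypothetical host-side reference, not part of the PR:
// solve erf(y) - x = 0 with Newton's method.
template <typename T>
inline T ErfinvHostReference(T x) {
  if (std::isnan(x) || x < T(-1) || x > T(1)) {
    return std::numeric_limits<T>::quiet_NaN();
  }
  if (x == T(1)) return std::numeric_limits<T>::infinity();
  if (x == T(-1)) return -std::numeric_limits<T>::infinity();
  const T two_over_sqrt_pi = T(1.1283791670955126);  // d/dy erf(y) at y = 0
  T y = T(0);
  for (int i = 0; i < 100; ++i) {
    // Newton step: f(y) / f'(y), with f'(y) = 2/sqrt(pi) * exp(-y^2).
    const T step = (std::erf(y) - x) / (two_over_sqrt_pi * std::exp(-y * y));
    y -= step;
    if (std::fabs(step) <= std::numeric_limits<T>::epsilon() * std::fabs(y)) {
      break;
    }
  }
  return y;
}

// Functor in the shape the review settled on: named ErfinvFunctor,
// with no explicit (empty) default constructor.
template <typename T>
struct ErfinvFunctor {
  HOSTDEVICE inline T operator()(const T x) const {
#if defined(__CUDA_ARCH__)
    return DeviceErfinv(x);          // device pass: CUDA built-in
#else
    return ErfinvHostReference(x);   // host pass: reference fallback
#endif
  }
};
```

On the host, `ErfinvFunctor<double>{}(0.5)` can be checked by confirming that `std::erf` of the result returns 0.5. In a real kernel the functor would be handed to whatever elementwise launcher the framework provides; that launcher is outside the scope of this sketch.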