[PaddlePaddle Hackathon 4 No.35] Optimize the computation performance of the prelu op on GPU for Paddle #51131
Conversation
Your PR was submitted successfully. Thank you for your contribution to this open-source project!
```diff
@@ -43,7 +43,7 @@ __global__ void VectorizedIndexKernel(T *out,
       out + data_offset, &result[0], BLOCK_NUM_X * VecSize);
   }
   size_t num = numel - data_offset;
-  if (num > 0) {
+  if (static_cast<int>(num) > 0) {
```
It doesn't seem necessary to do the static_cast conversion here.
When I ran the benchmark earlier, it took a long time to track down that this line kept erroring, which is why I changed it. @JamesLim-sy
I see, in that case keep it as is; no change needed.
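For context, a minimal sketch of why the plain unsigned comparison can silently pass (the values here are hypothetical, not taken from the PR): `num` is a `size_t`, so a subtraction that logically goes negative wraps around to a huge positive value and `num > 0` stays true; casting to `int` recovers the negative sign on common two's-complement platforms.

```cpp
#include <cstddef>
#include <cstdio>

int main() {
  // Hypothetical case: data_offset has overshot numel by 4.
  size_t numel = 100;
  size_t data_offset = 104;

  // Unsigned subtraction wraps instead of going negative:
  // num becomes 0xFFFFFFFFFFFFFFFC rather than -4.
  size_t num = numel - data_offset;

  printf("num > 0: %d\n", num > 0 ? 1 : 0);  // prints 1 -- the guard passes
  printf("static_cast<int>(num) > 0: %d\n",
         static_cast<int>(num) > 0 ? 1 : 0);  // prints 0 -- the guard blocks
  return 0;
}
```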
paddle/phi/kernels/gpu/prelu_funcs.h (outdated diff)

```cpp
size_t channel_num_;
size_t plane_size_;
int numel_;
const T zero = static_cast<T>(0);
```
This doesn't need to be a member variable. Inside the HOSTDEVICE inline PReluChannelFirstWiseCUDAFunctor implementation, the following line is sufficient:

```cpp
constexpr T zero = static_cast<T>(0);
```
@JamesLim-sy Done, updated.
@JamesLim-sy Using constexpr or const causes a compile-time error, so I've removed it for now.
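One plausible explanation for the constexpr failure (an assumption, not confirmed in this thread): when T is instantiated with a 16-bit float wrapper type whose converting constructor is not constexpr, the initializer is not a constant expression, so the constexpr declaration is ill-formed. A minimal sketch with a hypothetical `Half` type standing in for such a wrapper:

```cpp
#include <cstdio>

// Hypothetical stand-in for a 16-bit float wrapper type whose
// converting constructor is not constexpr.
struct Half {
  unsigned short bits;
  Half(float f) : bits(0) { (void)f; }  // runtime conversion only
};

template <typename T>
void FunctorBody() {
  // constexpr T zero = static_cast<T>(0);  // fails to compile when T = Half:
  // the initializer calls a non-constexpr constructor, so it is not a
  // constant expression.
  T zero = static_cast<T>(0);  // a plain local compiles for any T
  (void)zero;
}

int main() {
  FunctorBody<float>();  // fine either way
  FunctorBody<Half>();   // only the non-constexpr version compiles
  return 0;
}
```

The const case reported above may fail for a different reason in device code; this sketch only covers the constexpr path.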
LGTM
PR types
Performance optimization
PR changes
OPs
Describe
Currently, the prelu operator in Paddle is still implemented with an internal loop and does not apply common performance optimization techniques, so there is room for improvement.
Design doc: PaddlePaddle/community#370
The computation now goes through Paddle's internal kps Elementwise Kernel and IndexKernel, and the prelu operator is optimized with vectorized reads and vectorized writes, as sketched below.
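The PR itself routes through the kps abstractions, so the following is only an illustrative standalone sketch of the vectorized read/write idea (hypothetical kernel name; assumes float inputs, 16-byte-aligned pointers, and a vector width of 4):

```cuda
#include <cuda_runtime.h>

// Each thread handles 4 contiguous floats: one float4 load, the prelu
// formula out = x > 0 ? x : alpha * x applied per lane, one float4 store.
// Assumes x and out are 16-byte aligned for the float4 accesses.
__global__ void VectorizedPrelu(const float *x, float *out, float alpha,
                                size_t numel) {
  size_t idx =
      (static_cast<size_t>(blockIdx.x) * blockDim.x + threadIdx.x) * 4;
  if (idx + 4 <= numel) {
    float4 v = *reinterpret_cast<const float4 *>(x + idx);  // vectorized read
    v.x = v.x > 0.f ? v.x : alpha * v.x;
    v.y = v.y > 0.f ? v.y : alpha * v.y;
    v.z = v.z > 0.f ? v.z : alpha * v.z;
    v.w = v.w > 0.f ? v.w : alpha * v.w;
    *reinterpret_cast<float4 *>(out + idx) = v;  // vectorized write
  } else {
    // Scalar tail for the final partial vector.
    for (size_t i = idx; i < numel; ++i) {
      out[i] = x[i] > 0.f ? x[i] : alpha * x[i];
    }
  }
}
```

A launch sized over vector groups, e.g. `VectorizedPrelu<<<((numel + 3) / 4 + 255) / 256, 256>>>(x, out, alpha, numel);`, gives exactly one thread the partial tail. Vectorizing cuts the number of memory transactions per element, which is where the speedup comes from for this memory-bound op.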
Performance comparison between Paddle after the optimization and Paddle before the optimization:
Performance comparison between the optimized Paddle and PyTorch:
Across four different cases, the optimized kernel shows varying degrees of speedup.