[PaddlePaddle Hackathon 4 No.33] Optimize the GPU performance of the Histogram op for Paddle #53112
Conversation
Your PR has been submitted successfully. Thank you for contributing to this open-source project!
__syncthreads();
CUDA_KERNEL_LOOP(index, total_elements) {
  const auto input_value = input[index];
  phi::CudaAtomicMin(&min_data, input_value);
The CUDA atomics here are too brute-force. I suggest switching to chained block-level and warp-level reductions, which should lift performance further. It really is too heavy-handed; when I reviewed the RFC and saw the atomics, I assumed they were meant for the histogram binning computation itself.
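For reference, a minimal sketch of the chained warp-level plus block-level reduction suggested here, issuing one atomic per block instead of one per element. It assumes a float input, a block size that is a multiple of warpSize, and that `phi::CudaAtomicMin`/`phi::CudaAtomicMax` live in the header shown (path assumed); the kernel name and signature are illustrative, not the PR's actual code.

```cuda
#include <cfloat>

#include "paddle/phi/backends/gpu/gpu_primitives.h"  // phi::CudaAtomicMin/Max (assumed path)

__global__ void KernelMinMax(const float* input,
                             int64_t total_elements,
                             float* min_out,
                             float* max_out) {
  float local_min = FLT_MAX;
  float local_max = -FLT_MAX;

  // 1) Grid-stride loop: each thread keeps a private running min/max.
  for (int64_t i = static_cast<int64_t>(blockIdx.x) * blockDim.x + threadIdx.x;
       i < total_elements;
       i += static_cast<int64_t>(gridDim.x) * blockDim.x) {
    const float v = input[i];
    local_min = fminf(local_min, v);
    local_max = fmaxf(local_max, v);
  }

  // 2) Warp-level reduction via register shuffles (no shared-memory traffic).
  for (int offset = warpSize / 2; offset > 0; offset /= 2) {
    local_min = fminf(local_min, __shfl_down_sync(0xffffffff, local_min, offset));
    local_max = fmaxf(local_max, __shfl_down_sync(0xffffffff, local_max, offset));
  }

  // 3) Block-level reduction: lane 0 of each warp publishes its partial result
  //    to shared memory, then the first warp reduces the per-warp partials.
  __shared__ float smem_min[32];
  __shared__ float smem_max[32];
  const int lane = threadIdx.x % warpSize;
  const int warp_id = threadIdx.x / warpSize;
  if (lane == 0) {
    smem_min[warp_id] = local_min;
    smem_max[warp_id] = local_max;
  }
  __syncthreads();

  const int num_warps = blockDim.x / warpSize;
  if (warp_id == 0) {
    local_min = (lane < num_warps) ? smem_min[lane] : FLT_MAX;
    local_max = (lane < num_warps) ? smem_max[lane] : -FLT_MAX;
    for (int offset = warpSize / 2; offset > 0; offset /= 2) {
      local_min = fminf(local_min, __shfl_down_sync(0xffffffff, local_min, offset));
      local_max = fmaxf(local_max, __shfl_down_sync(0xffffffff, local_max, offset));
    }
    // 4) A single atomic per block combines the per-block results globally.
    if (lane == 0) {
      phi::CudaAtomicMin(min_out, local_min);
      phi::CudaAtomicMax(max_out, local_max);
    }
  }
}
```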
Updated. Performance does improve, and the original device-to-host (DtoH) copy overhead is eliminated as well. The performance tables have been updated; please take another look and let me know if anything else needs changing.
LGTM
PR types
Performance optimization
PR changes
OPs
Description
Paddle currently uses a hand-written CUDA kernel for the core Histogram computation, but it determines the histogram boundaries with Eigen, which leaves room for optimization.
Design doc: https://github.com/PaddlePaddle/community/blob/master/rfcs/OPs-Perf/20230328_histogram_op_optimization.md
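For context, a rough sketch of the launch flow this change implies, under the assumption (taken from the review thread) that the input range is computed by a reduction kernel such as the `KernelMinMax` sketched above and consumed by the histogram kernel directly through device memory, so the boundaries never round-trip through the host. Kernel names, signatures, and the raw CUDA runtime calls below are illustrative, not the actual Paddle implementation.

```cuda
#include <cfloat>
#include <cuda_runtime.h>

// Range-reduction kernel, e.g. the KernelMinMax sketched in the review thread above.
__global__ void KernelMinMax(const float* input, int64_t n,
                             float* min_out, float* max_out);

__global__ void KernelHistogram(const float* input, int64_t n, int64_t nbins,
                                const float* min_ptr, const float* max_ptr,
                                int64_t* out) {
  // Read the boundaries directly from device memory (no host sync needed).
  const float min_v = *min_ptr;
  const float max_v = *max_ptr;
  // Simplified: a real kernel must special-case min_v == max_v.
  const float inv_width = nbins / (max_v - min_v);
  for (int64_t i = static_cast<int64_t>(blockIdx.x) * blockDim.x + threadIdx.x;
       i < n; i += static_cast<int64_t>(gridDim.x) * blockDim.x) {
    const float v = input[i];
    if (v < min_v || v > max_v) continue;
    int64_t bin = static_cast<int64_t>((v - min_v) * inv_width);
    if (bin == nbins) bin -= 1;  // fold the right edge into the last bin
    atomicAdd(reinterpret_cast<unsigned long long*>(&out[bin]), 1ULL);
  }
}

void HistogramGPU(const float* d_input, int64_t n, int64_t nbins, int64_t* d_out) {
  float* d_range = nullptr;  // d_range[0] = min, d_range[1] = max
  cudaMalloc(&d_range, 2 * sizeof(float));
  const float init[2] = {FLT_MAX, -FLT_MAX};
  cudaMemcpy(d_range, init, sizeof(init), cudaMemcpyHostToDevice);

  const int threads = 256;
  const int blocks = static_cast<int>((n + threads - 1) / threads);
  KernelMinMax<<<blocks, threads>>>(d_input, n, d_range, d_range + 1);
  // No cudaMemcpy back to the host here: the boundaries stay on the device.
  KernelHistogram<<<blocks, threads>>>(d_input, n, nbins, d_range, d_range + 1, d_out);
  cudaFree(d_range);
}
```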
Development environment
Optimization method
Performance comparison between the optimized Paddle and Paddle before this optimization:
Performance comparison between the optimized Paddle and PyTorch:
Across the three test cases, the optimized kernel shows performance improvements of varying degrees.