
【PaddlePaddle Hackathon 4 No.33】Optimize the GPU performance of the Histogram op for Paddle #53112

Merged 4 commits into PaddlePaddle:develop on Apr 25, 2023

Conversation

zeroRains
Contributor

@zeroRains zeroRains commented Apr 20, 2023

PR types

Performance optimization

PR changes

OPs

Description

Paddle currently uses a hand-written CUDA kernel for the core Histogram computation, but it determines the histogram boundaries with Eigen, which leaves room for optimization.
设计文档:https://github.com/PaddlePaddle/community/blob/master/rfcs/OPs-Perf/20230328_histogram_op_optimization.md

  • Development environment

    1. Device: Tesla V100
    2. Environment: CUDA 11.2, cuDNN 8
  • Optimization method

    • The key change is implementing KernelMinMax as a __global__ kernel, accelerating the part of Histogram that determines the histogram boundaries and thereby improving the op's GPU performance.
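As a rough illustration only, a kernel in this style can be sketched as below. This is a minimal sketch under assumptions, not the PR's actual KernelMinMax: the kernel name, signature, and grid-stride loop are assumptions here; only the phi::CudaAtomicMin/Max helpers are taken from the diff quoted in the review thread.

```cuda
// Sketch (assumed names): each thread folds its grid-stride slice of the
// input into a private min/max, then combines results with one atomic per
// thread. min_out/max_out are assumed pre-initialized to input[0].
template <typename T>
__global__ void KernelMinMaxSketch(const T* input,
                                   int64_t total_elements,
                                   T* min_out,
                                   T* max_out) {
  T local_min = input[0];
  T local_max = input[0];
  // Grid-stride loop over all elements.
  for (int64_t i = blockIdx.x * blockDim.x + threadIdx.x; i < total_elements;
       i += static_cast<int64_t>(gridDim.x) * blockDim.x) {
    const T v = input[i];
    local_min = v < local_min ? v : local_min;
    local_max = v > local_max ? v : local_max;
  }
  // Combine per-thread results; the review thread suggests replacing these
  // per-thread global atomics with block/warp-level reductions.
  phi::CudaAtomicMin(min_out, local_min);
  phi::CudaAtomicMax(max_out, local_max);
}
```

Because the computed min/max stay in device memory, the bounds no longer need the device-to-host copy that the Eigen path incurred.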

Performance of the optimized Paddle versus the previous Paddle:

| Case No. | device | input_shape | input_type | bins | min | max | Paddle Perf(ms) | old Paddle Perf(ms) | diff |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Tesla V100 | [16, 64] | int32 | 100 | 0 | 0 | 0.01176 | 0.09403 | faster than 699.57% |
| 2 | Tesla V100 | [16, 64] | int64 | 100 | 0 | 0 | 0.01179 | 0.13624 | faster than 1055.56% |
| 3 | Tesla V100 | [16, 64] | float32 | 100 | 0 | 0 | 0.01117 | 0.01889 | faster than 69.11% |

Performance of the optimized Paddle versus PyTorch:

| Case No. | device | input_shape | input_type | bins | min | max | Paddle Perf(ms) | Pytorch Perf(ms) | diff |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Tesla V100 | [16, 64] | int32 | 100 | 0 | 0 | 0.01176 | 0.02255 | faster than 91.75% |
| 2 | Tesla V100 | [16, 64] | int64 | 100 | 0 | 0 | 0.01179 | 0.03424 | faster than 190.42% |
| 3 | Tesla V100 | [16, 64] | float32 | 100 | 0 | 0 | 0.01117 | 0.02250 | faster than 101.43% |

The optimization improves performance to a different degree in each of the three cases.

@paddle-bot

paddle-bot bot commented Apr 20, 2023

Your PR has been submitted. Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

@paddle-bot paddle-bot bot added contributor External developers status: proposed labels Apr 20, 2023
@zeroRains zeroRains changed the title create KernelMinMax to optimize the performance of histogram op in GPU 【PaddlePaddle Hackathon 4 No.33】为 Paddle 优化 Histogram op 在 GPU 上的计算性能 Apr 20, 2023
    __syncthreads();
    CUDA_KERNEL_LOOP(index, total_elements) {
      const auto input_value = input[index];
      phi::CudaAtomicMin(&min_data, input_value);
Contributor

Using CUDA atomics here is too brute-force; I suggest chaining block-level and warp-level reductions instead, which can lift performance further. It really is too brute-force — when I reviewed the RFC, I assumed the atomics I saw were meant for the histogram-binning part of the computation.

Contributor Author

@zeroRains commented Apr 21, 2023
Done. The change does improve performance, and it also eliminates the original device-to-host (DtoH) transfer overhead. The performance tables have been updated; please take a look and let me know if anything else needs changing.
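For reference, the block/warp-level reduction style suggested above is commonly written with `__shfl_down_sync` plus a shared-memory combine across warps. The following is a minimal sketch under assumptions (float-only, blockDim.x a multiple of warpSize, at most 1024 threads per block), not the PR's final code:

```cuda
// Warp-level min reduction via shuffle; max is symmetric.
__inline__ __device__ float WarpReduceMin(float val) {
  for (int offset = warpSize / 2; offset > 0; offset /= 2)
    val = min(val, __shfl_down_sync(0xffffffffu, val, offset));
  return val;
}

// Block-level min reduction: each warp reduces privately, lane 0 of each
// warp publishes to shared memory, then the first warp reduces those.
__inline__ __device__ float BlockReduceMin(float val) {
  __shared__ float shared[32];            // one slot per warp (<= 32 warps)
  const int lane = threadIdx.x % warpSize;
  const int wid = threadIdx.x / warpSize;
  val = WarpReduceMin(val);               // reduce within each warp
  if (lane == 0) shared[wid] = val;       // publish per-warp result
  __syncthreads();
  // First warp reduces the per-warp results; FLT_MAX (from <cfloat>) pads
  // the lanes beyond the number of active warps.
  val = (threadIdx.x < blockDim.x / warpSize) ? shared[lane] : FLT_MAX;
  if (wid == 0) val = WarpReduceMin(val);
  return val;
}
```

With this shape, only thread 0 of each block issues a single global atomic for the block's result, cutting global atomic traffic from one per element to one per block.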

Contributor

@JamesLim-sy left a comment
LGTM

@JamesLim-sy JamesLim-sy merged commit c1a61fc into PaddlePaddle:develop Apr 25, 2023
@zeroRains zeroRains deleted the histogram branch April 26, 2023 01:13