
[PaddlePaddle Hackathon 3 No.31] Optimize the GPU performance of the dist op for Paddle #44946

Merged — 16 commits into PaddlePaddle:develop on Sep 2, 2022

Conversation

@thunder95 (Contributor) commented Aug 6, 2022

PR types

Performance optimization

PR changes

OPs

Describe

The Dist operator in Paddle has already dropped its Eigen implementation; the PNormKernel built on the Kernel Primitive API achieves quite good performance.
Design doc: PaddlePaddle/community#187

  • Development environment:
    1. Device: RTX 2070s
    2. Environment: CUDA 10.2, cuDNN 7
  • Optimization approach (a minimal reduction sketch follows this list):
    1. Implement more efficient reductions based on block reduce and warp reduce
    2. Merge some simple computation steps relative to the existing implementation
    3. For inputs with identical shapes, use a more efficient path that avoids the cost of broadcast computation
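To illustrate item 1, here is a minimal, self-contained sketch of the warp/block reduction pattern (a hypothetical example written for this description; the PR later switched to the existing helpers in math_cuda_utils.h):

```cuda
#define FULL_MASK 0xffffffff

// Reduce a value across a 32-thread warp using shuffle instructions.
template <typename T>
__device__ T WarpReduceSum(T val) {
  for (int offset = 16; offset > 0; offset >>= 1) {
    val += __shfl_down_sync(FULL_MASK, val, offset);
  }
  return val;
}

// Reduce a value across the whole thread block: each warp reduces
// internally, warp leaders publish partials to shared memory, and the
// first warp reduces the per-warp partials.
template <typename T>
__device__ T BlockReduceSum(T val) {
  __shared__ T shared[32];            // one slot per warp
  const int lane = threadIdx.x & 31;  // lane index within the warp
  const int wid = threadIdx.x >> 5;   // warp index within the block

  val = WarpReduceSum(val);
  if (lane == 0) shared[wid] = val;
  __syncthreads();

  const int num_warps = (blockDim.x + 31) >> 5;
  val = (threadIdx.x < num_warps) ? shared[lane] : static_cast<T>(0);
  if (wid == 0) val = WarpReduceSum(val);
  return val;  // final sum is valid in thread 0
}
```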

After the optimization, performance of Paddle compared with the previous Paddle implementation:

| Case No. | input_shape | p | Paddle perf (ms) | old_paddle perf (ms) | diff |
|---|---|---|---|---|---|
| 0 | [1000, 1000] | 2.0 | 0.1485 | 0.2492 | 40.4% faster |
| 1 | [1000, 1000] | inf | 0.1088 | 0.2134 | 49.02% faster |
| 2 | [1000, 1000] | 0 | 0.0918 | 0.1586 | 42.12% faster |

After the optimization, performance of Paddle compared with PyTorch:

| Case No. | input_shape | p | Paddle perf (ms) | PyTorch perf (ms) | diff |
|---|---|---|---|---|---|
| 0 | [1000, 1000] | 2.0 | 0.1485 | 0.2338 | 36.38% faster |
| 1 | [1000, 1000] | inf | 0.1088 | 0.1625 | 33.05% faster |
| 2 | [1000, 1000] | 0 | 0.0918 | 0.1196 | 23.24% faster |

Other shapes were also tested; for inputs with different shapes there was no clear improvement over old_paddle, so this PR optimizes only the same-shape case.

@paddle-bot bot commented Aug 6, 2022

Your PR has been submitted. Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

@JamesLim-sy (Contributor) left a comment

The code logic is still under review; in the meantime, here are two places that clearly should be changed.

val = (threadIdx.x < block_span) ? shared[lane] : 1e10f;
if (wid == 0) val = warpReduceMax<T>(val);
return val;
}
Contributor:

The basic functions blockReduceMin, blockReduceMax, blockReduceSum, warpReduceMin, warpReduceMax, and warpReduceSum are already provided in math_cuda_utils.h.
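For illustration, a first-stage kernel rewritten on top of those helpers might look like this (a hedged sketch consistent with the snippets quoted later in this thread, not necessarily the exact PR code):

```cuda
#include "paddle/phi/kernels/funcs/math_cuda_utils.h"

#define FULL_MASK 0xffffffff

// Each block computes the max of |x[i] - y[i]| over its grid-stride
// range and writes one partial result per block.
template <typename T>
__global__ void ReduceMaxWithSubtract(const T* x, const T* y, T* out,
                                      int64_t N) {
  T max_val = -1e10f;
  for (int64_t i = blockIdx.x * blockDim.x + threadIdx.x; i < N;
       i += blockDim.x * gridDim.x) {
    max_val = max(max_val, abs(x[i] - y[i]));
  }
  __syncthreads();
  max_val = phi::funcs::blockReduceMax<T>(max_val, FULL_MASK);
  if (threadIdx.x == 0) {
    out[blockIdx.x] = max_val;
  }
}
```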

@thunder95 (Author):

Thanks for the reminder; fixed.

@thunder95 (Author):

Tests and benchmarks have been run: no errors, and performance is essentially unchanged.

auto n = x.numel();
auto config = phi::backends::gpu::GetGpuLaunchConfig1D(dev_ctx, n);
intermediate.Resize(phi::make_ddim({config.block_per_grid.x}));
T* i_ptr = dev_ctx.template Alloc<T>(&intermediate);
Contributor:

Lines 258-261 are repeated at lines 300-303; this code could be hoisted outside the if statement.

@thunder95 (Author):

Although this code is functionally repeated, the inputs differ: the first copy takes the original input x and its numel, while the second takes the tensor produced by the subtraction, whose numel may change after broadcasting. The if branch exists mainly to optimize the case where the subtraction needs no broadcast. Hoisting this code outside the if would force the subtraction to be computed every time, adding unnecessary overhead. Please correct me if I've misunderstood. @JamesLim-sy

@ZzSean (Contributor) commented Aug 16, 2022

The two tables in the PR description are in the wrong order.

@@ -0,0 +1,259 @@
// Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
Contributor:

Copyright year: 2021 -> 2022.

@thunder95 (Author):

Fixed.


#define FULL_MASK 0xffffffff

__device__ __forceinline__ float inline_abs(float x) { return abs(x); }
@ZzSean (Contributor) commented Aug 16, 2022:

These built-in functions already support both single and double precision; there is no need to wrap them.

@thunder95 (Author):

Thanks for the suggestion; removed.

@ZzSean (Contributor) commented Aug 16, 2022

Have you benchmarked cases where x and y have different shapes?

}

template <typename T>
__global__ void deviceReduceSumZero(const T* x, T* out, int64_t N) {
Contributor:

Function names should use UpperCamelCase.

@thunder95 (Author):

Fixed.

T sum_val = 0;
for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < N;
     i += blockDim.x * gridDim.x)
  sum_val += static_cast<T>(static_cast<double>(x[i]) != 0);
Contributor:

The for-loop body needs braces.

@thunder95 (Author):

Fixed.

deviceReduceMaxWithSubstract<T>
<<<config.block_per_grid.x, config.thread_per_block.x, 0, stream>>>(
x_ptr, y_ptr, i_ptr, n);
cudaStreamSynchronize(stream);
Contributor:

This synchronization can be removed.
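Sketch of why the sync is unnecessary (illustrative kernel names, not from the PR): kernels enqueued on the same CUDA stream execute in issue order, so a host-side cudaStreamSynchronize between two dependent launches only stalls the CPU without changing the result.

```cuda
// stage1 and stage2 are illustrative two-pass reduction kernels.
// stage2 reads stage1's output; because both are enqueued on the same
// stream, the dependency is already ordered on the device.
stage1<<<grid, block, 0, stream>>>(x_ptr, y_ptr, partials, n);
// No cudaStreamSynchronize(stream) needed here.
stage2<<<1, block, 0, stream>>>(partials, out_ptr, grid);
```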

@thunder95 (Author):

Removed.

}

template <typename T>
__global__ void deviceReduceSumOrder(const T* x, T* out, T p_order, int64_t N) {
Contributor:

The function name is a bit confusing; it would be better to make it clearer.

@thunder95 (Author):

Would renaming order to porder be any better? Do you have a suggestion?

@@ -34,7 +34,3 @@ void DistKernel(const Context& dev_ctx,
} // namespace phi

PD_REGISTER_KERNEL(dist, CPU, ALL_LAYOUT, phi::DistKernel, float, double) {}

#if defined(PADDLE_WITH_CUDA) || defined(PADDLE_WITH_HIP)
PD_REGISTER_KERNEL(dist, GPU, ALL_LAYOUT, phi::DistKernel, float, double) {}
Contributor:

Looking at the whole change, the optimized version differs from the original mainly in two ways (see the dispatch sketch after this list):

  1. When x and y have the same shape, the subtraction is effectively fused with the subsequent reduce. In theory, fusion is a clear performance win, so I think this part is worth keeping on its own.
  2. When x and y have different shapes, the logic is basically the same as the original pnorm, but it still goes through the rewritten reduce kernels; whether this improves performance needs to be verified by testing. If it does, the optimization could be applied directly to pnorm — pnorm is also part of this round of operator optimization, so that would avoid some duplicated work. If the gain is not convincing, then to keep the code simple I would keep the previous approach of calling pnorm directly.
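What the reviewer proposes would give a dispatch roughly like the following (a hedged sketch of the final structure; the PNormKernel argument list here is an assumption, not verbatim from the merged code):

```cuda
if (x.dims() == y.dims()) {
  // Same shape: run the fused subtract+reduce kernels shown in the
  // snippets above, so no intermediate |x - y| tensor is materialized.
} else {
  // Different shapes: keep the original path — broadcast subtract,
  // then delegate to the existing PNormKernel.
  DenseTensor t = phi::Subtract<T, Context>(dev_ctx, x, y);
  phi::PNormKernel<T, Context>(dev_ctx, t, p, -1, 1e-12, false, true, out);
}
```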

@thunder95 (Author):

@ZzSean Done as suggested: pnorm is now used when the shapes differ, and testing confirms pnorm is comparable or slightly better there. For identical shapes, the new approach does give a measurable improvement.

@thunder95 (Author):

> The two tables in the PR description are in the wrong order.

Fixed.

@thunder95 (Author):

> Have you benchmarked cases where x and y have different shapes?

With different shapes there is indeed no performance gain, and even a slight regression.

DeviceReduceSumZeroWithSubtract<T>
<<<config.block_per_grid.x, config.thread_per_block.x, 0, stream>>>(
x_ptr, y_ptr, i_ptr, n);
DeviceReduceSumZeroFinal<T><<<1, config.thread_per_block.x, 0, stream>>>(
Contributor:

Couldn't the second reduce use Paddle's existing reduce kernel directly?

@thunder95 (Author):

@ZzSean Done; Paddle's reduce kernel is now used. The test cases show a slight performance drop, but the impact is negligible.

#define FULL_MASK 0xffffffff

template <typename T>
__global__ void DeviceReduceSumZeroWithSubtract(const T* x,
Contributor:

Naming suggestions (a sketch of the result follows below):
DeviceReduceSumZeroWithSubtract -> ReduceSumWithSubtract
DeviceReduceMaxWithSubtract -> ReduceMaxWithSubtract
DeviceReduceMinZeroWithSubtract -> ReduceMinWithSubtract
DeviceReduceSumPOrderWithSubtract can be merged with DeviceReduceSumZeroWithSubtract into a single function by passing the pre-sum computation in as a functor template parameter; that leaves only three new functions in total.
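A sketch of the suggested merge, with FULL_MASK as defined in the snippet above (consistent with the code quoted in this thread; the exact merged code may differ in detail):

```cuda
// The pre-sum computation is supplied as a functor, so the zero-order
// and p-order sums share one kernel.
template <typename T>
struct OtherOrderFunctor {
  explicit OtherOrderFunctor(const T& p_order) : p_order_(p_order) {}
  __device__ T operator()(const T& x, const T& y) const {
    return static_cast<T>(pow(abs(x - y), p_order_));
  }

 private:
  T p_order_;
};

template <typename T, typename Functor>
__global__ void ReduceSumWithSubtract(
    const T* x, const T* y, T* out, int64_t N, Functor func) {
  T sum_val = 0;
  for (int64_t i = blockIdx.x * blockDim.x + threadIdx.x; i < N;
       i += blockDim.x * gridDim.x) {
    sum_val += func(x[i], y[i]);
  }
  __syncthreads();
  sum_val = phi::funcs::blockReduceSum<T>(sum_val, FULL_MASK);
  if (threadIdx.x == 0) {
    out[blockIdx.x] = sum_val;
  }
}
```

A ZeroOrderFunctor with the same operator() signature then covers the p = 0 case through the same kernel (see the functor discussion further down).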

@thunder95 (Author):

@ZzSean Changed as you suggested.


__syncthreads();
sum_val = phi::funcs::blockReduceSum<T>(sum_val, FULL_MASK);
if (threadIdx.x == 0) out[blockIdx.x] = sum_val;
Contributor:

Add braces to the if statement as well.

@thunder95 (Author):

Fixed.

@thunder95 (Author):

@Ligoml The PR-CE-Framework CI keeps failing; could you help look into why?

@luotao1 (Contributor) commented Aug 26, 2022

@thunder95 Please refer to https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/dev_guides/git_guides/paddle_ci_manual_cn.html#pr-ce-framework

Go into the PaddleTest repo, enter the framework/api/paddlebase directory, then run pytest xxx.py.

@thunder95 (Author):

> @thunder95 Please refer to https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/dev_guides/git_guides/paddle_ci_manual_cn.html#pr-ce-framework
> Go into the PaddleTest repo, enter the framework/api/paddlebase directory, then run pytest xxx.py.

@luotao1 Thanks, that approach let me locate the problem. @ZzSean Could you please review again?

#include "paddle/phi/core/kernel_registry.h"
#include "paddle/phi/kernels/elementwise_subtract_kernel.h"
#include "paddle/phi/kernels/funcs/math_cuda_utils.h"
#include "paddle/phi/kernels/funcs/reduce_function.h"
Contributor:

Remove the duplicated header.

@thunder95 (Author):

Removed.


#include "paddle/phi/kernels/dist_kernel.h"

#include "paddle/fluid/operators/elementwise/elementwise_op_impl.cu.h"
Contributor:

Remove the unused header to resolve the CI-APPROVAL failure.

@thunder95 (Author):

Deleted.

struct ZeroOrderFunctor {
 public:
  __device__ T operator()(const T& x, const T& y) const {
    return abs(static_cast<T>(static_cast<double>(x - y) != 0));
Contributor:

Is static_cast<double> really needed here? And probably not abs either — a bool can only be 0 or 1.
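Applying that suggestion, the functor reduces to something like this (a sketch):

```cuda
template <typename T>
struct ZeroOrderFunctor {
 public:
  __device__ T operator()(const T& x, const T& y) const {
    // (x - y) != 0 already yields 0 or 1, so no abs or double cast.
    return static_cast<T>((x - y) != 0);
  }
};
```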

@thunder95 (Author):

Changed per the review comments. @ZzSean

@thunder95 requested a review from ZzSean — September 2, 2022 02:48
@ZzSean (Contributor) left a comment

LGTM

@ZzSean merged commit ad70471 into PaddlePaddle:develop on Sep 2, 2022