
[PaddlePaddle Hackathon 3 No.31] Optimize the GPU performance of the dist op for Paddle #44946

Merged — 16 commits into PaddlePaddle:develop on Sep 2, 2022

Conversation

@thunder95 (Contributor) commented Aug 6, 2022

PR types

Performance optimization

PR changes

OPs

Describe

The Dist operator in Paddle has already dropped its Eigen implementation; the PNormKernel built on the Kernel Primitive API achieves quite good performance.
Design doc: PaddlePaddle/community#187

  • Development environment:
    1. Device: RTX 2070s
    2. Environment: CUDA 10.2, cuDNN 7
  • Optimization approach (a minimal reduction sketch follows this list):
    1. Implement more efficient reductions based on block reduce and warp reduce
    2. Merge some simple computation steps relative to the existing implementation
    3. For inputs with identical shapes, use a more efficient path that avoids the cost of broadcast computation
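To illustrate item 1, here is a minimal, self-contained sketch of the warp/block reduction pattern (a hypothetical example written for this description; the PR later switched to the existing helpers in math_cuda_utils.h):

```cuda
#define FULL_MASK 0xffffffff

// Reduce a value across a 32-thread warp using shuffle instructions.
template <typename T>
__device__ T WarpReduceSum(T val) {
  for (int offset = 16; offset > 0; offset >>= 1) {
    val += __shfl_down_sync(FULL_MASK, val, offset);
  }
  return val;
}

// Reduce a value across the whole thread block: each warp reduces
// internally, warp leaders publish partials to shared memory, and the
// first warp reduces the per-warp partials.
template <typename T>
__device__ T BlockReduceSum(T val) {
  __shared__ T shared[32];            // one slot per warp
  const int lane = threadIdx.x & 31;  // lane index within the warp
  const int wid = threadIdx.x >> 5;   // warp index within the block

  val = WarpReduceSum(val);
  if (lane == 0) shared[wid] = val;
  __syncthreads();

  const int num_warps = (blockDim.x + 31) >> 5;
  val = (threadIdx.x < num_warps) ? shared[lane] : static_cast<T>(0);
  if (wid == 0) val = WarpReduceSum(val);
  return val;  // final sum is valid in thread 0
}
```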

After the optimization, performance of Paddle compared with the previous Paddle implementation:

| Case No. | input_shape | p | Paddle perf (ms) | old_paddle perf (ms) | diff |
|---|---|---|---|---|---|
| 0 | [1000, 1000] | 2.0 | 0.1485 | 0.2492 | 40.4% faster |
| 1 | [1000, 1000] | inf | 0.1088 | 0.2134 | 49.02% faster |
| 2 | [1000, 1000] | 0 | 0.0918 | 0.1586 | 42.12% faster |

After the optimization, performance of Paddle compared with PyTorch:

| Case No. | input_shape | p | Paddle perf (ms) | PyTorch perf (ms) | diff |
|---|---|---|---|---|---|
| 0 | [1000, 1000] | 2.0 | 0.1485 | 0.2338 | 36.38% faster |
| 1 | [1000, 1000] | inf | 0.1088 | 0.1625 | 33.05% faster |
| 2 | [1000, 1000] | 0 | 0.0918 | 0.1196 | 23.24% faster |

Other shapes were also tested; for inputs with different shapes there was no clear improvement over old_paddle, so this PR optimizes only the same-shape case.

@paddle-bot bot commented Aug 6, 2022

Your PR has been submitted. Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

@JamesLim-sy (Contributor) left a comment

The code logic is still under review; in the meantime, here are two places that clearly should be changed.

val = (threadIdx.x < block_span) ? shared[lane] : 1e10f;
if (wid == 0) val = warpReduceMax<T>(val);
return val;
}
Contributor:

The basic functions blockReduceMin, blockReduceMax, blockReduceSum, warpReduceMin, warpReduceMax, and warpReduceSum are already provided in math_cuda_utils.h.
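For illustration, a first-stage kernel rewritten on top of those helpers might look like this (a hedged sketch consistent with the snippets quoted later in this thread, not necessarily the exact PR code):

```cuda
#include "paddle/phi/kernels/funcs/math_cuda_utils.h"

#define FULL_MASK 0xffffffff

// Each block computes the max of |x[i] - y[i]| over its grid-stride
// range and writes one partial result per block.
template <typename T>
__global__ void ReduceMaxWithSubtract(const T* x, const T* y, T* out,
                                      int64_t N) {
  T max_val = -1e10f;
  for (int64_t i = blockIdx.x * blockDim.x + threadIdx.x; i < N;
       i += blockDim.x * gridDim.x) {
    max_val = max(max_val, abs(x[i] - y[i]));
  }
  __syncthreads();
  max_val = phi::funcs::blockReduceMax<T>(max_val, FULL_MASK);
  if (threadIdx.x == 0) {
    out[blockIdx.x] = max_val;
  }
}
```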

@thunder95 (Author):

Thanks for the reminder; fixed.

@thunder95 (Author):

Tests and benchmarks have been run: no errors, and performance is essentially unchanged.

auto n = x.numel();
auto config = phi::backends::gpu::GetGpuLaunchConfig1D(dev_ctx, n);
intermediate.Resize(phi::make_ddim({config.block_per_grid.x}));
T* i_ptr = dev_ctx.template Alloc<T>(&intermediate);
Contributor:

Lines 258-261 are repeated at lines 300-303; this code could be hoisted outside the if statement.

@thunder95 (Author):

Although this code is functionally repeated, the inputs differ: the first copy takes the original input x and its numel, while the second takes the tensor produced by the subtraction, whose numel may change after broadcasting. The if branch exists mainly to optimize the case where the subtraction needs no broadcast. Hoisting this code outside the if would force the subtraction to be computed every time, adding unnecessary overhead. Please correct me if I've misunderstood. @JamesLim-sy

@ZzSean (Contributor) commented Aug 16, 2022

The two tables in the PR description are in the wrong order.

@@ -0,0 +1,259 @@
// Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
Contributor:

Copyright year: 2021 -> 2022.

@thunder95 (Author):

Fixed.


#define FULL_MASK 0xffffffff

__device__ __forceinline__ float inline_abs(float x) { return abs(x); }
@ZzSean (Contributor) commented Aug 16, 2022:

These built-in functions already support both single and double precision; there is no need to wrap them.

@thunder95 (Author):

Thanks for the suggestion; removed.

@ZzSean (Contributor) commented Aug 16, 2022

Have you benchmarked cases where x and y have different shapes?

}

template <typename T>
__global__ void deviceReduceSumZero(const T* x, T* out, int64_t N) {
Contributor:

Function names should use UpperCamelCase.

@thunder95 (Author):

Fixed.

T sum_val = 0;
for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < N;
     i += blockDim.x * gridDim.x)
  sum_val += static_cast<T>(static_cast<double>(x[i]) != 0);
Contributor:

The for-loop body needs braces.

@thunder95 (Author):

Fixed.

deviceReduceMaxWithSubstract<T>
<<<config.block_per_grid.x, config.thread_per_block.x, 0, stream>>>(
x_ptr, y_ptr, i_ptr, n);
cudaStreamSynchronize(stream);
Contributor:

This synchronization can be removed.
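Sketch of why the sync is unnecessary (illustrative kernel names, not from the PR): kernels enqueued on the same CUDA stream execute in issue order, so a host-side cudaStreamSynchronize between two dependent launches only stalls the CPU without changing the result.

```cuda
// stage1 and stage2 are illustrative two-pass reduction kernels.
// stage2 reads stage1's output; because both are enqueued on the same
// stream, the dependency is already ordered on the device.
stage1<<<grid, block, 0, stream>>>(x_ptr, y_ptr, partials, n);
// No cudaStreamSynchronize(stream) needed here.
stage2<<<1, block, 0, stream>>>(partials, out_ptr, grid);
```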

@thunder95 (Author):

Removed.

}

template <typename T>
__global__ void deviceReduceSumOrder(const T* x, T* out, T p_order, int64_t N) {
Contributor:

The function name is a bit confusing; it would be better to make it clearer.

@thunder95 (Author):

Would renaming order to porder be any better? Do you have a suggestion?

@@ -34,7 +34,3 @@ void DistKernel(const Context& dev_ctx,
} // namespace phi

PD_REGISTER_KERNEL(dist, CPU, ALL_LAYOUT, phi::DistKernel, float, double) {}

#if defined(PADDLE_WITH_CUDA) || defined(PADDLE_WITH_HIP)
PD_REGISTER_KERNEL(dist, GPU, ALL_LAYOUT, phi::DistKernel, float, double) {}
Contributor:

Looking at the whole change, the optimized version differs from the original mainly in two ways (see the dispatch sketch after this list):

  1. When x and y have the same shape, the subtraction is effectively fused with the subsequent reduce. In theory, fusion is a clear performance win, so I think this part is worth keeping on its own.
  2. When x and y have different shapes, the logic is basically the same as the original pnorm, but it still goes through the rewritten reduce kernels; whether this improves performance needs to be verified by testing. If it does, the optimization could be applied directly to pnorm — pnorm is also part of this round of operator optimization, so that would avoid some duplicated work. If the gain is not convincing, then to keep the code simple I would keep the previous approach of calling pnorm directly.
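What the reviewer proposes would give a dispatch roughly like the following (a hedged sketch of the final structure; the PNormKernel argument list here is an assumption, not verbatim from the merged code):

```cuda
if (x.dims() == y.dims()) {
  // Same shape: run the fused subtract+reduce kernels shown in the
  // snippets above, so no intermediate |x - y| tensor is materialized.
} else {
  // Different shapes: keep the original path — broadcast subtract,
  // then delegate to the existing PNormKernel.
  DenseTensor t = phi::Subtract<T, Context>(dev_ctx, x, y);
  phi::PNormKernel<T, Context>(dev_ctx, t, p, -1, 1e-12, false, true, out);
}
```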

@thunder95 (Author):

@ZzSean Done as suggested: pnorm is now used when the shapes differ, and testing confirms pnorm is comparable or slightly better there. For identical shapes, the new approach does give a measurable improvement.

@thunder95 (Author):

> The two tables in the PR description are in the wrong order.

Fixed.

@thunder95 (Author):

> Have you benchmarked cases where x and y have different shapes?

With different shapes there is indeed no performance gain, and even a slight regression.

DeviceReduceSumZeroWithSubtract<T>
<<<config.block_per_grid.x, config.thread_per_block.x, 0, stream>>>(
x_ptr, y_ptr, i_ptr, n);
DeviceReduceSumZeroFinal<T><<<1, config.thread_per_block.x, 0, stream>>>(
Contributor:

Couldn't the second reduce use Paddle's existing reduce kernel directly?

@thunder95 (Author):

@ZzSean Done; Paddle's reduce kernel is now used. The test cases show a slight performance drop, but the impact is negligible.

#define FULL_MASK 0xffffffff

template <typename T>
__global__ void DeviceReduceSumZeroWithSubtract(const T* x,
Contributor:

Naming suggestions (a sketch of the result follows below):
DeviceReduceSumZeroWithSubtract -> ReduceSumWithSubtract
DeviceReduceMaxWithSubtract -> ReduceMaxWithSubtract
DeviceReduceMinZeroWithSubtract -> ReduceMinWithSubtract
DeviceReduceSumPOrderWithSubtract can be merged with DeviceReduceSumZeroWithSubtract into a single function by passing the pre-sum computation in as a functor template parameter; that leaves only three new functions in total.
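A sketch of the suggested merge, with FULL_MASK as defined in the snippet above (consistent with the code quoted in this thread; the exact merged code may differ in detail):

```cuda
// The pre-sum computation is supplied as a functor, so the zero-order
// and p-order sums share one kernel.
template <typename T>
struct OtherOrderFunctor {
  explicit OtherOrderFunctor(const T& p_order) : p_order_(p_order) {}
  __device__ T operator()(const T& x, const T& y) const {
    return static_cast<T>(pow(abs(x - y), p_order_));
  }

 private:
  T p_order_;
};

template <typename T, typename Functor>
__global__ void ReduceSumWithSubtract(
    const T* x, const T* y, T* out, int64_t N, Functor func) {
  T sum_val = 0;
  for (int64_t i = blockIdx.x * blockDim.x + threadIdx.x; i < N;
       i += blockDim.x * gridDim.x) {
    sum_val += func(x[i], y[i]);
  }
  __syncthreads();
  sum_val = phi::funcs::blockReduceSum<T>(sum_val, FULL_MASK);
  if (threadIdx.x == 0) {
    out[blockIdx.x] = sum_val;
  }
}
```

A ZeroOrderFunctor with the same operator() signature then covers the p = 0 case through the same kernel (see the functor discussion further down).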

@thunder95 (Author):

@ZzSean Changed as you suggested.


__syncthreads();
sum_val = phi::funcs::blockReduceSum<T>(sum_val, FULL_MASK);
if (threadIdx.x == 0) out[blockIdx.x] = sum_val;
Contributor:

Add braces to the if statement as well.

@thunder95 (Author):

Fixed.

@thunder95 (Author):

@Ligoml The PR-CE-Framework CI keeps failing; could you help look into why?

@luotao1 (Contributor) commented Aug 26, 2022

@thunder95 Please refer to https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/dev_guides/git_guides/paddle_ci_manual_cn.html#pr-ce-framework

Go into the PaddleTest repo, enter the framework/api/paddlebase directory, then run pytest xxx.py.

@thunder95 (Author):

> @thunder95 Please refer to https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/dev_guides/git_guides/paddle_ci_manual_cn.html#pr-ce-framework
> Go into the PaddleTest repo, enter the framework/api/paddlebase directory, then run pytest xxx.py.

@luotao1 Thanks, that approach let me locate the problem. @ZzSean Could you please review again?

#include "paddle/phi/core/kernel_registry.h"
#include "paddle/phi/kernels/elementwise_subtract_kernel.h"
#include "paddle/phi/kernels/funcs/math_cuda_utils.h"
#include "paddle/phi/kernels/funcs/reduce_function.h"
Contributor:

Remove the duplicated header.

@thunder95 (Author):

Removed.


#include "paddle/phi/kernels/dist_kernel.h"

#include "paddle/fluid/operators/elementwise/elementwise_op_impl.cu.h"
Contributor:

Remove the unused header to resolve the CI-APPROVAL failure.

@thunder95 (Author):

Deleted.

struct ZeroOrderFunctor {
 public:
  __device__ T operator()(const T& x, const T& y) const {
    return abs(static_cast<T>(static_cast<double>(x - y) != 0));
Contributor:

Is static_cast<double> really needed here? And probably not abs either — a bool can only be 0 or 1.
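Applying that suggestion, the functor reduces to something like this (a sketch):

```cuda
template <typename T>
struct ZeroOrderFunctor {
 public:
  __device__ T operator()(const T& x, const T& y) const {
    // (x - y) != 0 already yields 0 or 1, so no abs or double cast.
    return static_cast<T>((x - y) != 0);
  }
};
```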

@thunder95 (Author):

Changed per the review comments. @ZzSean

@thunder95 requested a review from ZzSean — September 2, 2022 02:48
@ZzSean (Contributor) left a comment

LGTM

@ZzSean merged commit ad70471 into PaddlePaddle:develop on Sep 2, 2022