[PaddlePaddle Hackathon 3 No.31] Optimize the computation performance of the dist op on GPU for Paddle #44946
Conversation
Your PR has been submitted successfully. Thank you for your contribution to the open-source project!
The code logic is still under review; for now I am submitting comments on two fairly obvious places that can be changed.
  val = (threadIdx.x < block_span) ? shared[lane] : 1e10f;
  if (wid == 0) val = warpReduceMax<T>(val);
  return val;
}
blockReduceMin, blockReduceMax, blockReduceSum, warpReduceMin, warpReduceMax, and warpReduceSum: these basic functions are already wrapped in math_cuda_utils.h.
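For illustration, a minimal sketch of a first-stage max-reduction kernel reusing that helper instead of a hand-rolled shared-memory reduction; it assumes phi::funcs::blockReduceMax mirrors the (value, mask) signature of the blockReduceSum call quoted further down in this thread, and the kernel name is illustrative:

#include "paddle/phi/kernels/funcs/math_cuda_utils.h"

#define FULL_MASK 0xffffffff

// Hypothetical example: each thread accumulates a partial max over a
// grid-stride loop, then the existing block-level helper combines the
// partial results within the block.
template <typename T>
__global__ void BlockMaxExample(const T* x, T* out, int64_t n) {
  T max_val = -1e10f;
  for (int64_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
       i += blockDim.x * gridDim.x) {
    max_val = (x[i] > max_val) ? x[i] : max_val;
  }
  max_val = phi::funcs::blockReduceMax<T>(max_val, FULL_MASK);
  if (threadIdx.x == 0) {
    out[blockIdx.x] = max_val;  // one partial max per block
  }
}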
Thanks for the reminder, fixed.
Ran the tests and the benchmark: no errors, and performance is essentially unchanged.
  auto n = x.numel();
  auto config = phi::backends::gpu::GetGpuLaunchConfig1D(dev_ctx, n);
  intermediate.Resize(phi::make_ddim({config.block_per_grid.x}));
  T* i_ptr = dev_ctx.template Alloc<T>(&intermediate);
This code at lines 258-261 is repeated at lines 300-303; it could be hoisted outside the if statement.
Although this code is functionally reused, the inputs are different. The first use takes the original input x and gets its numel; the second use operates on the tensor produced by the subtract, whose numel may differ after broadcasting. The if branch here is mainly to optimize performance when the subtract does not need broadcasting: if the code were hoisted outside the if, the subtract would have to be computed every time, adding unnecessary overhead. If I have misunderstood, please correct me. @JamesLim-sy
The two tables in the PR description are in the wrong order.
@@ -0,0 +1,259 @@
// Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
Copyright year: 2021 -> 2022
Fixed.
#define FULL_MASK 0xffffffff

__device__ __forceinline__ float inline_abs(float x) { return abs(x); }
These built-in functions already support both single and double precision; there is no need to wrap them again.
Thanks for the suggestion; removed.
Have you tested the performance of cases where x and y have different shapes?
}

template <typename T>
__global__ void deviceReduceSumZero(const T* x, T* out, int64_t N) {
Function names should follow UpperCamelCase.
Fixed.
  T sum_val = 0;
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < N;
       i += blockDim.x * gridDim.x)
    sum_val += static_cast<T>(static_cast<double>(x[i]) != 0);
The for-loop body needs braces.
Fixed.
  deviceReduceMaxWithSubstract<T>
      <<<config.block_per_grid.x, config.thread_per_block.x, 0, stream>>>(
          x_ptr, y_ptr, i_ptr, n);
  cudaStreamSynchronize(stream);
This synchronization can be removed (kernels launched on the same stream already execute in order).
Removed.
}

template <typename T>
__global__ void deviceReduceSumOrder(const T* x, T* out, T p_order, int64_t N) {
The function name is a bit confusing; it would be better to make it clearer.
Would renaming order to porder be any better? Do you have a suggestion?
@@ -34,7 +34,3 @@ void DistKernel(const Context& dev_ctx,
}  // namespace phi

PD_REGISTER_KERNEL(dist, CPU, ALL_LAYOUT, phi::DistKernel, float, double) {}

#if defined(PADDLE_WITH_CUDA) || defined(PADDLE_WITH_HIP)
PD_REGISTER_KERNEL(dist, GPU, ALL_LAYOUT, phi::DistKernel, float, double) {}
Looking at it overall, the optimized version differs from the original in the following ways:
- When x and y have the same shape, the subtraction is effectively fused with the subsequent reduce. In theory fusion should bring a performance gain, so I think this part is worth keeping on its own (a sketch of why the fusion helps follows this list).
- When x and y have different shapes, the logic is basically the same as the original pnorm, but it still uses the rewritten reduce kernels, and whether this part improves performance needs to be verified by testing. If it does, the optimization could be applied directly to pnorm, which also happens to be part of this round of operator optimizations, avoiding duplicated work; if the gain is not significant, then to keep the code simple I would keep the previous approach of directly calling pnorm.
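For intuition on the fusion point above: the unfused path materializes the elementwise difference in global memory and reads it back for the reduction, while the fused kernel keeps it in registers. A minimal, self-contained sketch with illustrative names (not the PR code), reusing the blockReduceSum helper quoted elsewhere in this thread:

#include "paddle/phi/kernels/funcs/math_cuda_utils.h"

#define FULL_MASK 0xffffffff

// Unfused: write |x - y| to a temporary tensor, reduce it in a second pass.
template <typename T>
__global__ void AbsDiff(const T* x, const T* y, T* diff, int64_t n) {
  for (int64_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
       i += blockDim.x * gridDim.x) {
    diff[i] = abs(x[i] - y[i]);
  }
}

// Fused: compute |x - y| on the fly inside the reduction, so the
// intermediate tensor is never written to or read from global memory.
template <typename T>
__global__ void ReduceSumAbsDiff(const T* x, const T* y, T* out, int64_t n) {
  T sum_val = 0;
  for (int64_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
       i += blockDim.x * gridDim.x) {
    sum_val += abs(x[i] - y[i]);
  }
  sum_val = phi::funcs::blockReduceSum<T>(sum_val, FULL_MASK);
  if (threadIdx.x == 0) {
    out[blockIdx.x] = sum_val;  // one partial sum per block
  }
}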
@ZzSean As suggested, pnorm is now used when the shapes differ; testing confirms pnorm is about the same or slightly better there. For identical shapes, the new approach does give a measurable improvement.
Fixed.
When the shapes differ there is indeed no performance gain, and even a slight regression.
  DeviceReduceSumZeroWithSubtract<T>
      <<<config.block_per_grid.x, config.thread_per_block.x, 0, stream>>>(
          x_ptr, y_ptr, i_ptr, n);
  DeviceReduceSumZeroFinal<T><<<1, config.thread_per_block.x, 0, stream>>>(
Could the second reduce just use the existing reduce kernel in Paddle?
@ZzSean Fixed; with Paddle's reduce kernel the several test cases all show a slight performance drop, but the impact is negligible.
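For reference, a sketch of what the second stage might look like when delegated to Paddle's generic reduce; the exact phi::funcs::ReduceKernel template signature and the kps functors are assumptions based on reduce_function.h rather than verified against the final PR:

// Hypothetical call: reduce the per-block partial sums stored in
// `intermediate` (shape {config.block_per_grid.x}) down to the scalar `out`,
// summing over axis 0 with an identity transform.
std::vector<int> reduce_axis = {0};
phi::funcs::ReduceKernel<T, T, kps::AddFunctor, kps::IdentityFunctor<T>>(
    dev_ctx, intermediate, out, kps::IdentityFunctor<T>(), reduce_axis);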
#define FULL_MASK 0xffffffff

template <typename T>
__global__ void DeviceReduceSumZeroWithSubtract(const T* x,
Naming suggestions for these functions:
- DeviceReduceSumZeroWithSubtract -> ReduceSumWithSubtract
- DeviceReduceMaxWithSubtract -> ReduceMaxWithSubtract
- DeviceReduceMinZeroWithSubtract -> ReduceMinWithSubtract
DeviceReduceSumPOrderWithSubtract can be merged with DeviceReduceSumZeroWithSubtract into a single function by passing the pre-sum computation in as a functor template parameter, so only three new functions are needed in total (see the sketch below).
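A minimal sketch of that functor-based merge, assuming the grid-stride pattern and the phi::funcs::blockReduceSum helper already used in this PR; the functor names here are illustrative, not the final PR code:

#define FULL_MASK 0xffffffff

// Example functor: counts non-zero differences (the p = 0 case).
template <typename T>
struct ZeroOrderFunctor {
  __device__ T operator()(const T& x, const T& y) const {
    return static_cast<T>((x - y) != 0);
  }
};

// Example functor: |x - y|^p for a general p-order norm.
template <typename T>
struct POrderFunctor {
  explicit POrderFunctor(const T& p_order) : p_order_(p_order) {}
  __device__ T operator()(const T& x, const T& y) const {
    return static_cast<T>(pow(abs(x - y), p_order_));
  }
  T p_order_;
};

// One kernel covers both cases: the per-element computation that happens
// before the sum is supplied as a functor template parameter.
template <typename T, typename Functor>
__global__ void ReduceSumWithSubtract(
    const T* x, const T* y, T* out, int64_t N, Functor functor) {
  T sum_val = 0;
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < N;
       i += blockDim.x * gridDim.x) {
    sum_val += functor(x[i], y[i]);
  }
  __syncthreads();
  sum_val = phi::funcs::blockReduceSum<T>(sum_val, FULL_MASK);
  if (threadIdx.x == 0) {
    out[blockIdx.x] = sum_val;
  }
}

It would then be launched like the other kernels in this file, e.g. ReduceSumWithSubtract<T><<<config.block_per_grid.x, config.thread_per_block.x, 0, stream>>>(x_ptr, y_ptr, i_ptr, n, ZeroOrderFunctor<T>());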
@ZzSean The changes have been made following your suggestion.
  __syncthreads();
  sum_val = phi::funcs::blockReduceSum<T>(sum_val, FULL_MASK);
  if (threadIdx.x == 0) out[blockIdx.x] = sum_val;
Add braces to the if statement as well.
Fixed.
@Ligoml The PR-CE-Framework CI keeps failing; could you help look into the reason?
@thunder95 Please refer to https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/dev_guides/git_guides/paddle_ci_manual_cn.html#pr-ce-framework — go into the PaddleTest repo, under the framework/api/paddlebase directory, and just run pytest xxx.py.
#include "paddle/phi/core/kernel_registry.h"
#include "paddle/phi/kernels/elementwise_subtract_kernel.h"
#include "paddle/phi/kernels/funcs/math_cuda_utils.h"
#include "paddle/phi/kernels/funcs/reduce_function.h"
Remove the duplicate header include.
Removed.
#include "paddle/phi/kernels/dist_kernel.h"

#include "paddle/fluid/operators/elementwise/elementwise_op_impl.cu.h"
Remove the unused header to resolve the CI-APPROVAL issue.
Removed.
struct ZeroOrderFunctor {
 public:
  __device__ T operator()(const T& x, const T& y) const {
    return abs(static_cast<T>(static_cast<double>(x - y) != 0));
The static_cast<double> doesn't seem necessary, and neither does abs: a bool can only be 0 or 1.
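A minimal sketch of the simplified functor along those lines (an illustration of the suggestion, not necessarily the final PR code):

template <typename T>
struct ZeroOrderFunctor {
 public:
  // (x - y) != 0 is already a bool, so casting it to T directly yields
  // 0 or 1 without abs or an intermediate double cast.
  __device__ T operator()(const T& x, const T& y) const {
    return static_cast<T>((x - y) != 0);
  }
};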
Changes made as suggested. @ZzSean
LGTM
PR types
Performance optimization
PR changes
OPs
Describe
The Dist operator in Paddle has already dropped its Eigen implementation, and the PNormKernel implemented on the Kernel Primitive API achieves quite good performance.
Design doc: PaddlePaddle/community#187
1. Implement more efficient reduction operations based on block reduce and warp reduce (a generic sketch of this pattern is shown below).
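As a rough illustration of the warp/block reduce pattern referred to here (a generic sketch of the standard technique, not the exact kernels in this PR):

#define FULL_MASK 0xffffffff

// Warp-level sum: each step folds in the value held by the lane `offset`
// positions higher, halving the active width until lane 0 holds the total.
template <typename T>
__device__ T WarpReduceSum(T val) {
  for (int offset = 16; offset > 0; offset >>= 1) {
    val += __shfl_down_sync(FULL_MASK, val, offset);
  }
  return val;
}

// Block-level sum: reduce within each warp, stage one partial result per
// warp in shared memory, then let the first warp reduce those partials.
template <typename T>
__device__ T BlockReduceSum(T val) {
  __shared__ T shared[32];            // one slot per warp (<= 1024 threads)
  const int lane = threadIdx.x & 31;  // lane index within the warp
  const int wid = threadIdx.x >> 5;   // warp index within the block
  val = WarpReduceSum(val);
  if (lane == 0) shared[wid] = val;
  __syncthreads();
  const int num_warps = (blockDim.x + 31) >> 5;
  val = (threadIdx.x < num_warps) ? shared[lane] : static_cast<T>(0);
  if (wid == 0) val = WarpReduceSum(val);
  return val;
}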
Performance comparison between Paddle after the optimization and Paddle before the optimization:
Performance comparison between Paddle after the optimization and PyTorch:
Different shapes were also tested; there was no clear improvement over the old_paddle baseline, so this PR only optimizes the case where the two inputs have the same shape.