
improve the performance of divide_double_grad #62533

Merged

Conversation

@YibinLiu666 (Contributor) commented Mar 7, 2024

PR Category

Performance Optimization

PR Types

Performance

Description

Optimize the large-operator implementation of divide_double_grad.

Before optimization: (benchmark screenshot D678736A3AE5184B387F9F618F2F45E5)
After optimization: (benchmark screenshot E5D2465441F0DDAE3EB4215032DF2EE7)

paddle-bot bot commented Mar 7, 2024

Your PR has been submitted. Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

@paddle-bot added the contributor (External developers) label Mar 7, 2024
dy->Resize(y.dims());
dev_ctx.template Alloc<T>(dy);
if (ddx_tensor == nullptr && ddy_tensor == nullptr) {
  dy = nullptr;
@HydrogenSulfate (Contributor) commented Mar 8, 2024

1. If a None on the right-hand side of the derived formula makes the left-hand variable impossible to compute, then following Paddle's current convention it should be filled with zeros: FullLikeKernel<T, Context>(dev_ctx, y, Scalar(0.0), y.dtype(), dy);. Otherwise a large amount of code would need to change.
2. Setting the pointer itself to null should have no effect.

The same issue applies in the other places.

@YibinLiu666 (Contributor, Author)

Done

@@ -312,17 +312,6 @@ void DivideDoubleGradKernel(const Context& dev_ctx,
if (ddy_tensor == nullptr) {
  dout = nullptr;
@HydrogenSulfate (Contributor) commented Mar 11, 2024

1. Assigning nullptr to a pointer that was passed by value has no effect, so the if-else here can be simplified accordingly.
2. dy should instead be assigned an all-zero tensor with the same shape as y.

@YibinLiu666 (Contributor, Author)

Done

@HydrogenSulfate (Contributor) left a comment

Please compare the performance before and after reducing the number of kernel calls.

Comment on lines 180 to 190

DenseTensor dz_div_y;
dz_div_y.Resize(out.dims());
if (!dx_tensor || dx_tensor->dims() != out.dims()) {
  dev_ctx.template Alloc<T>(&dz_div_y);
  funcs::DefaultElementwiseOperator<Context,
                                    T,
                                    funcs::DivideFunctor<T>,
                                    funcs::InverseDivideFunctor<T>>(
      dev_ctx, grad_out, y, &dz_div_y, axis);
  dx_tensor = &dz_div_y;
}
Contributor

Since dx (dz_div_y) is only used when computing dy and dout, I think you can:

1. Change the if condition on line 182 to: if ((dy || dout) && (!dx_tensor || dx_tensor->dims() != out.dims()))
2. Keep the definition of dz_div_y where it is, but move dz_div_y.Resize(out.dims()); inside the if, since the intermediate dz_div_y is only needed when dx_tensor has to be computed.

This avoids unnecessary computation; see the sketch below.
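
A minimal sketch of the restructured guard (names taken from the diff above; the surrounding kernel context is assumed):

// Sketch: dz_div_y is materialized only when dy or dout actually needs
// the dx term, per the review suggestion above.
DenseTensor dz_div_y;
if ((dy || dout) && (!dx_tensor || dx_tensor->dims() != out.dims())) {
  dz_div_y.Resize(out.dims());  // resize only when the intermediate is needed
  dev_ctx.template Alloc<T>(&dz_div_y);
  funcs::DefaultElementwiseOperator<Context,
                                    T,
                                    funcs::DivideFunctor<T>,
                                    funcs::InverseDivideFunctor<T>>(
      dev_ctx, grad_out, y, &dz_div_y, axis);
  dx_tensor = &dz_div_y;
}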

@YibinLiu666 (Contributor, Author)

Done

dy->Resize(y.dims());
dev_ctx.template Alloc<T>(dy);
if (!ddx_tensor && !ddy_tensor) {
  FullLikeKernel<T, Context>(dev_ctx, y, Scalar(0.0), y.dtype(), dy);
Contributor

Could the third argument, Scalar(0.0), be changed to Scalar(static_cast<T>(0.0))? Otherwise there may be problems when T is a complex type.
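
A sketch of the suggested change (assuming phi::Scalar can be constructed from a complex T, which the suggestion implies):

// Construct the zero from T so that complex specializations get a properly
// typed zero rather than a plain double.
FullLikeKernel<T, Context>(dev_ctx, y, Scalar(static_cast<T>(0.0)), y.dtype(), dy);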

@YibinLiu666 (Contributor, Author)

Done

if (!ddx_tensor && !ddy_tensor) {
  FullLikeKernel<T, Context>(dev_ctx, y, Scalar(0.0), y.dtype(), dy);
} else {
  DenseTensor tmp_dy = tmp;
Contributor

Line 209 can probably be deleted; tmp_dy is semantically unclear anyway, so just use tmp directly and rename every tmp_dy below to tmp.

@YibinLiu666 (Contributor, Author)

Done

Comment on lines 211 to 259

funcs::DefaultElementwiseOperator<Context,
                                  T,
                                  funcs::DivideFunctor<T>,
                                  funcs::InverseDivideFunctor<T>>(
    dev_ctx, *dx_tensor, y, &tmp_dy, axis);
if (ddx_tensor && !ddy_tensor) {
  // dy = -dX * ddX / Y
  funcs::DefaultElementwiseOperator<Context,
                                    T,
                                    funcs::MultiplyFunctor<T>,
                                    funcs::InverseMultiplyFunctor<T>>(
      dev_ctx, *ddx_tensor, tmp_dy, dy, axis);
  auto& place = *dev_ctx.eigen_device();
  auto dy_result = phi::EigenVector<T>::Flatten(*dy);
  dy_result.device(place) = static_cast<T>(-1) * dy_result;
} else if (!ddx_tensor && ddy_tensor) {
  // dY = Out * dX * ddY / Y
  funcs::DefaultElementwiseOperator<Context,
                                    T,
                                    funcs::MultiplyFunctor<T>,
                                    funcs::InverseMultiplyFunctor<T>>(
      dev_ctx, *ddy_tensor, tmp_dy, &tmp_dy, axis);
  funcs::DefaultElementwiseOperator<Context,
                                    T,
                                    funcs::MultiplyFunctor<T>,
                                    funcs::InverseMultiplyFunctor<T>>(
      dev_ctx, out, tmp_dy, dy, axis);
} else {
  // dY = Out * dX * ddY / Y - dX * ddX / Y
  // NOTE(dengkaipeng): in the following ElemwiseGradCompute, for the
  // first output tensor is nullptr, the branch to calculate first
  // output tensor will not be activated, DivGradDx function will not
  // be called and can be ignored, the first branch has little effect
  // on running speed.
  phi::funcs::
      ElemwiseGradCompute<Context, T, DivGradDX<T>, DivDoubleDY<T>>(
          dev_ctx,
          *ddx_tensor,
          *ddy_tensor,
          out,
          tmp_dy,
          axis,
          nullptr,
          dy,
          DivGradDX<T>(),
          DivDoubleDY<T>());
}
}
Contributor

The logic here first computes the common term dx / y up front and then takes different branches depending on the conditions; some branches call DefaultElementwiseOperator more than once. Please check whether all of this can be unified under ElemwiseGradCompute, so that each if-else branch finishes in a single call. You would only need to write a different dy_op for each condition (e.g. DivDoubleDY_Only_DDX, DivDoubleDY_Only_DDY) and pass it to ElemwiseGradCompute. Each of these dy_ops actually reads a different subset of the arguments; for the unused parameters, any DenseTensor of the same shape that can be safely read will do as a placeholder. A sketch follows below.
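
A hedged sketch of one such single-call branch (argument layout inferred from the ElemwiseGradCompute call in the diff above; the placeholder choices are an assumption, not the merged code):

// Sketch: the ddX-only branch as a single ElemwiseGradCompute call.
// DivDoubleDY_Only_DDX reads only x (= ddX) and dout (= dX / Y); the y and
// out slots just need a same-shaped, readable DenseTensor as placeholders.
template <typename T>
struct DivDoubleDY_Only_DDX {
  HOSTDEVICE T operator()(const T& x, const T& y, const T& out,
                          const T& dout) const {
    return -x * dout;  // dY = -ddX * (dX / Y)
  }
};

// Inside the kernel, for the (ddx_tensor && !ddy_tensor) branch:
phi::funcs::
    ElemwiseGradCompute<Context, T, DivGradDX<T>, DivDoubleDY_Only_DDX<T>>(
        dev_ctx,
        *ddx_tensor,  // x: ddX
        tmp,          // y slot: placeholder, never read by this dy_op
        out,          // out slot: placeholder for this dy_op
        tmp,          // dout slot: the pre-computed dX / Y
        axis,
        nullptr,      // dx output not requested, so DivGradDX is skipped
        dy,
        DivGradDX<T>(),
        DivDoubleDY_Only_DDX<T>());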

@YibinLiu666 (Contributor, Author)

Done

Comment on lines 268 to 306

} else if (ddx_tensor != nullptr && ddy_tensor == nullptr) {
  // ddOut = ddX / Y
  funcs::DefaultElementwiseOperator<Context,
                                    T,
                                    funcs::DivideFunctor<T>,
                                    funcs::InverseDivideFunctor<T>>(
      dev_ctx, *ddx_tensor, y, ddout, axis);
} else if (!ddx_tensor && ddy_tensor) {
  // ddOut = - Out * ddY / Y
  funcs::DefaultElementwiseOperator<Context,
                                    T,
                                    funcs::MultiplyFunctor<T>,
                                    funcs::InverseMultiplyFunctor<T>>(
      dev_ctx, out, *ddy_tensor, &tmp, axis);
  funcs::DefaultElementwiseOperator<Context,
                                    T,
                                    funcs::DivideFunctor<T>,
                                    funcs::InverseDivideFunctor<T>>(
      dev_ctx, tmp, y, ddout, axis);
  auto& place = *dev_ctx.eigen_device();
  auto ddout_result = phi::EigenVector<T>::Flatten(*ddout);
  ddout_result.device(place) = static_cast<T>(-1) * ddout_result;
} else {
  funcs::DefaultElementwiseOperator<Context,
                                    T,
                                    funcs::MultiplyFunctor<T>,
                                    funcs::InverseMultiplyFunctor<T>>(
      dev_ctx, out, *ddy_tensor, &tmp, axis);
  funcs::DefaultElementwiseOperator<Context,
                                    T,
                                    funcs::SubtractFunctor<T>,
                                    funcs::InverseSubtractFunctor<T>>(
      dev_ctx, *ddx_tensor, tmp, &tmp, axis);
  funcs::DefaultElementwiseOperator<Context,
                                    T,
                                    funcs::DivideFunctor<T>,
                                    funcs::InverseDivideFunctor<T>>(
      dev_ctx, tmp, y, ddout, axis);
}
Contributor

Likewise, can these multiple DefaultElementwiseOperator calls be fused into a single call? (See the sketch below.)
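
For illustration, a hedged sketch of a fused functor for the branch where both ddX and ddY are present (slot assignment follows the commented-out DivDoubleDDOut draft further down, where dout carries Y; a sketch, not the merged code):

// Sketch: ddOut = (ddX - Out * ddY) / Y in one elementwise pass instead of
// three separate multiply / subtract / divide kernels.
template <typename T>
struct DivDoubleDDOut {
  HOSTDEVICE T operator()(const T& x, const T& y, const T& out,
                          const T& dout) const {
    // x = ddX, y = ddY, out = Out, dout = Y (the divisor)
    return (x - out * y) / dout;
  }
};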

  FullLikeKernel<T, Context>(dev_ctx, y, Scalar(0.0), y.dtype(), dy);
} else {
  DenseTensor tmp_dy = tmp;
  // dX / Y
@HydrogenSulfate (Contributor) commented Mar 13, 2024

// dX / Y ==> // pre-compute 'dX / Y' into 'tmp' for 'ddout' and/or 'dy'

@YibinLiu666 (Contributor, Author)

Done

auto& place = *dev_ctx.eigen_device();
auto dout_result = phi::EigenVector<T>::Flatten(*dout);
dout_result.device(place) = static_cast<T>(-1) * dout_result;
}
}
}
Contributor

Add a blank line between lines 326 and 327.

@@ -166,33 +166,28 @@ template <typename T, typename Context>
void DivideDoubleGradKernel(const Context& dev_ctx,
Contributor

Inside DivDoubleDY, dout can be factored out to reduce the number of multiplications (see the sketch below).
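
A sketch of the factoring, assuming DivDoubleDY computes dY = Out * ddY * (dX / Y) - ddX * (dX / Y) with dout carrying dX / Y (one multiplication saved per element):

// Before: y * out * dout - x * dout   (dout multiplied in twice)
// After:  (y * out - x) * dout        (dout multiplied in once)
template <typename T>
struct DivDoubleDY {
  HOSTDEVICE T operator()(const T& x, const T& y, const T& out,
                          const T& dout) const {
    return (y * out - x) * dout;
  }
};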

Comment on lines 165 to 175

template <typename T>
struct DivDoubleDY_Only_DDX {
  HOSTDEVICE T operator()(T x, T y, T out, T dout) const { return -x * dout; }
};

template <typename T>
struct DivDoubleDY_Only_DDY {
  HOSTDEVICE T operator()(T x, T y, T out, T dout) const {
    return y * out * dout;
  }
};
Contributor

The parameter types can be changed to const T&, as illustrated below.
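
For instance, the same functor as above with const-reference parameters:

template <typename T>
struct DivDoubleDY_Only_DDY {
  HOSTDEVICE T operator()(const T& x, const T& y, const T& out,
                          const T& dout) const {
    return y * out * dout;
  }
};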

@HydrogenSulfate (Contributor)

The code-coverage check failed; you may need to add some unit tests to cover the lines marked in red: https://xly.bce.baidu.com/paddlepaddle/paddle/newipipe/detail/10325058/job/25667520

Comment on lines +336 to +350

funcs::DefaultElementwiseOperator<Context,
                                  T,
                                  funcs::MultiplyFunctor<T>,
                                  funcs::InverseMultiplyFunctor<T>>(
    dev_ctx, out, *ddy_tensor, &tmp, axis);
funcs::DefaultElementwiseOperator<Context,
                                  T,
                                  funcs::SubtractFunctor<T>,
                                  funcs::InverseSubtractFunctor<T>>(
    dev_ctx, *ddx_tensor, tmp, &tmp, axis);
funcs::DefaultElementwiseOperator<Context,
                                  T,
                                  funcs::DivideFunctor<T>,
                                  funcs::InverseDivideFunctor<T>>(
    dev_ctx, tmp, y, ddout, axis);
Contributor

These three calls can be merged into one.

@YibinLiu666 (Contributor, Author)

Done

Comment on lines 320 to 334

// ddOut = - Out * ddY / Y
funcs::DefaultElementwiseOperator<Context,
                                  T,
                                  funcs::MultiplyFunctor<T>,
                                  funcs::InverseMultiplyFunctor<T>>(
    dev_ctx, out, *ddy_tensor, &tmp, axis);
// VLOG(4) << "5";
funcs::DefaultElementwiseOperator<Context,
                                  T,
                                  funcs::DivideFunctor<T>,
                                  funcs::InverseDivideFunctor<T>>(
    dev_ctx, tmp, y, ddout, axis);
auto& place = *dev_ctx.eigen_device();
auto ddout_result = phi::EigenVector<T>::Flatten(*ddout);
ddout_result.device(place) = static_cast<T>(-1) * ddout_result;
Contributor

Merge the two calls into one.

@YibinLiu666 (Contributor, Author)

Done

Comment on lines 177 to 194

// template <typename T>
// struct DivDoubleDDOut {
//   HOSTDEVICE T operator()(T x, T y, T out, T dout) const {
//     return (x - out * y) / dout;
//   }
// };

// template <typename T>
// struct DivDoubleDDOut_Only_DDX {
//   HOSTDEVICE T operator()(T x, T y, T out, T dout) const { return x / dout; }
// };

// template <typename T>
// struct DivDoubleDDOut_Only_DDY {
//   HOSTDEVICE T operator()(T x, T y, T out, T dout) const {
//     return -out * y / dout;
//   }
// };
Contributor

Change the parameter types to const T&.

@YibinLiu666 (Contributor, Author)

Done

                                  funcs::MultiplyFunctor<T>,
                                  funcs::InverseMultiplyFunctor<T>>(
    dev_ctx, out, *ddy_tensor, &tmp, axis);
// VLOG(4) << "5";
Contributor

The debugging VLOG can be removed.

@YibinLiu666 (Contributor, Author)

Done

@Xreki (Contributor) left a comment

LGTM for op benchmark ci

@HydrogenSulfate HydrogenSulfate merged commit 52984e3 into PaddlePaddle:develop Apr 1, 2024
30 checks passed
Labels: contributor (External developers)

4 participants