Optimization of pool2d grad #35389
Conversation
Thanks for your contribution!
Force-pushed from c3aaa3e to 2294254
…/Paddle into Optimize_pool2d_grad
auto channel_divmod = divmods.channel.Divmod(input_height_divmod.val[0]);
w_offset = input_width_divmod.val[1] + padding_width;
h_offset = input_height_divmod.val[1] + padding_height;
offsetC = channel_divmod.val[1];
Reviewer: This name is inconsistent with the other variables.
Author: This will be renamed to channel_offset.
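The divmods used in the snippet above come from a fast integer division helper that replaces runtime `/` and `%` with a precomputed multiply and shift. Below is a minimal host-side sketch of the idea only; it is not Paddle's actual FastDivMod implementation (the struct name and the 31-bit dividend restriction are ours):

```cpp
#include <cassert>
#include <cstdint>

// Host-side sketch of the multiply-shift "fast divmod" trick for a fixed
// positive divisor. Illustrates what divmods.channel.Divmod(...) does;
// NOT Paddle's implementation. Valid for dividends n < 2^31, divisors d >= 1.
struct SimpleDivMod {
  uint32_t divisor;
  uint64_t multiplier;
  uint32_t shift;

  explicit SimpleDivMod(uint32_t d) : divisor(d) {
    uint32_t log2_d = 0;
    while ((1u << log2_d) < d) ++log2_d;  // ceil(log2(d))
    shift = 31 + log2_d;
    // Round the scaled reciprocal up so that (n * multiplier) >> shift == n / d.
    multiplier = ((1ull << shift) + d - 1) / d;
  }

  // Mirrors divmod.val[0] (quotient) and divmod.val[1] (remainder).
  void Divmod(uint32_t n, uint32_t* q, uint32_t* r) const {
    *q = static_cast<uint32_t>(
        (static_cast<uint64_t>(n) * multiplier) >> shift);
    *r = n - *q * divisor;
  }
};
```

On GPU the quotient is usually computed with `__umulhi` and a 32-bit shift instead of a full 64-bit multiply; the sketch trades that micro-optimization for readability.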
namespace paddle {
namespace operators {
namespace math {

struct FastDivModOfPool {
Reviewer: Generally, only concepts like thread pools get named XxxPool; this name is not a good fit.
Author: Renamed to FastDivModForPoolingGrad as suggested.
#include "paddle/fluid/platform/gpu_launch_config.h"

#ifdef __HIPCC__
#define BLOCK_SIZE 256
Reviewer: Add a qualifier to the macro name, something like POOL_BLOCK_SIZE.
Author: Fixed as suggested.
inline DEVICE void ParameterUpdate(int tid, int output_stride) {
  input = input_data[tid];
  output_data += output_stride;
Reviewer: Isn't output_data defined as const? How can += still be applied to it? This approach also feels unsafe.
Author: The += here is meant to advance the pointer address that output_data represents.
Reviewer: This type feels like a forced wrapper made just to avoid the memory accesses on input_data and output_data; by itself it has neither complete semantics nor good explainability.
};

template <typename T, typename PoolProcess, typename Enable = void>
struct PoolingFunctor {
Reviewer: This functor looks like it is used for the backward computation, but the name does not convey that. Also, what is each member function for?
Author: Renamed to PoolingGradProcess as suggested, and comments were added to the member methods.
}

inline HOSTDEVICE void operator()(const T* __restrict__ output_grad,
                                  T* __restrict__ gradient, int pool_size,
Reviewer: Whose gradient is `gradient`? Also, this function does not read like a conventional operator(); a concrete function name would fit better.
Author: `gradient` will be renamed to input_grad_data, and operator() to Compute.
template <typename T, typename PoolProcess>
struct PoolingFunctor<T, PoolProcess,
                      typename std::enable_if<std::is_same<
                          PoolProcess, math::AvgPoolGrad<T>>::value>::type> {
Reviewer: PoolProcess will no longer actually be used inside the kernel, right? It is only used to distinguish the Avg and Max pool definitions and plays no part in the computation, so it seems unnecessary.
Author: Right, the PoolProcess inside the kernel can be eliminated.
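For context, the PoolProcess types being discussed are math::MaxPoolGrad and math::AvgPoolGrad, quoted in full later in the conversation. Their contract can be exercised on the host with the DEVICE macro defined away: MaxPoolGrad routes the upstream gradient dy only to the element that equaled the pooled maximum, while AvgPoolGrad gives every window element an equal share scale * dy. A self-contained sketch (the `#define DEVICE` stub is ours, for host-only compilation):

```cpp
#include <cassert>

#define DEVICE  // host-only sketch; in Paddle this expands to __device__

template <class T>
class MaxPoolGrad {
 public:
  DEVICE inline void compute(const T& x, const T& y, const T& dy, T /*scale*/,
                             T* dx) {
    // Only the input element that produced the max (x == y) gets the gradient.
    *dx += dy * static_cast<T>(x == y);
  }
};

template <class T>
class AvgPoolGrad {
 public:
  DEVICE inline void compute(const T& /*x*/, const T& /*y*/, const T& dy,
                             T scale, T* dx) {
    // Every element in the window gets an equal share: scale = 1 / pool_size.
    *dx += (scale * dy);
  }
};
```

Note that AvgPoolGrad ignores x and y entirely, which is exactly what motivates the pointer-argument change debated further down.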
PoolProcess pool_process, bool exclusive, bool adaptive, T* input_grad,
bool channel_last = false) {
const int nthreads, const T* __restrict__ output_grad,
const int output_height, const int output_width, const int input_width,
Reviewer: Unify the order of height and width in the parameter list.
Author: Fixed as suggested.
int w_offset, h_offset, c_offset;
int phstart, phend, pwstart, pwend;
int output_stride;

if (!channel_last) { /* NCHW */
Reviewer: Index calculations for NHWC or NCHW are probably fairly common? Consider wrapping them up, e.g. define an IndexCalculator4d and provide some basic computation functions for NHWC and NCHW.
Author: Will encapsulate it as suggested.
auto pool_divmod =
    FastDivModOfPool(input_channels, input_width, input_height, ksize_width,
                     ksize_height, stride_width, stride_height);
auto pool_functor = PoolingFunctor<T, PoolProcess>(input_data, output_data);
Reviewer: Is this functor introduced to reduce IO? It feels like the original PoolProcess could be adapted instead.
Author: I tried modifying the original PoolProcess, but the implementation would have required adding too many members to the class, so I switched to an implementation specialized for the CUDA computation. The original PoolProcess is the class shown here:
Paddle/paddle/fluid/operators/math/pooling.h, lines 68 to 84 in 8342403
template <class T>
class MaxPoolGrad {
 public:
  DEVICE inline void compute(const T& x, const T& y, const T& dy, T scale,
                             T* dx) {
    *dx += dy * static_cast<T>(x == y);
  }
};

template <class T>
class AvgPoolGrad {
 public:
  DEVICE inline void compute(const T& x, const T& y, const T& dy, T scale,
                             T* dx) {
    *dx += (scale * dy);
  }
};
The class has to support both CPU and CUDA computation, and the two code paths differ quite a bit. In the CPU path, input_data is read first and the offset of the output_data pointer comes afterwards; moreover, on the CPU the output_data pointer has to keep advancing as the loops iterate, as follows:
Paddle/paddle/fluid/operators/math/pooling.cc, lines 1047 to 1067 in 8342403
float scale = 1.0 / pool_size;
for (int d = dstart; d < dend; ++d) {
  for (int h = hstart; h < hend; ++h) {
    for (int w = wstart; w < wend; ++w) {
      int input_idx = (d * input_height + h) * input_width + w;
      int output_idx =
          (pd * output_height + ph) * output_width + pw;
      pool_grad_process.compute(
          input_data[input_idx], output_data[output_idx],
          output_grad_data[output_idx], static_cast<T>(scale),
          input_grad_data + input_idx);
    }
  }
}
}
}
}
input_data += input_stride;
output_data += output_stride;
input_grad_data += input_stride;
output_grad_data += output_stride;
The CUDA path is different: there the data read and the pointer offset each happen only once, as follows:
Paddle/paddle/fluid/operators/math/pooling.cu, lines 142 to 169 in 8342403
output_data += output_stride;
output_grad += output_stride;
for (int ph = phstart; ph < phend; ++ph) {
  for (int pw = pwstart; pw < pwend; ++pw) {
    int pool_size;
    if (adaptive) {
      pool_size = static_cast<int>(ceil(static_cast<double>(input_height) /
                                        ksize_height)) *
                  static_cast<int>(
                      ceil(static_cast<double>(input_width) / ksize_width));
    } else {
      int hstart = ph * stride_height - padding_height;
      int wstart = pw * stride_width - padding_width;
      int hend = min(hstart + ksize_height, input_height);
      int wend = min(wstart + ksize_width, input_width);
      hstart = max(hstart, 0);
      wstart = max(wstart, 0);
      pool_size = exclusive ? (hend - hstart) * (wend - wstart)
                            : ksize_height * ksize_width;
    }
    int output_sub_idx = channel_last
                             ? (ph * output_width + pw) * channels + offsetC
                             : ph * output_width + pw;
    pool_process.compute(input, output_data[output_sub_idx],
                         output_grad[output_sub_idx],
                         static_cast<T>(1.0 / pool_size), &gradient);
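The non-adaptive pool_size branch in the CUDA excerpt above determines the averaging denominator. Extracted to the host as a free function (a sketch using the kernel's own parameter names), the difference between exclusive and inclusive averaging at padded borders is easy to check:

```cpp
#include <algorithm>
#include <cassert>

// Host-side extraction of the non-adaptive pool_size logic from the CUDA
// kernel excerpt above. `exclusive` counts only in-bounds elements; the
// inclusive variant always divides by the full window size.
int PoolSize(int ph, int pw, int input_height, int input_width,
             int ksize_height, int ksize_width, int stride_height,
             int stride_width, int padding_height, int padding_width,
             bool exclusive) {
  int hstart = ph * stride_height - padding_height;
  int wstart = pw * stride_width - padding_width;
  int hend = std::min(hstart + ksize_height, input_height);
  int wend = std::min(wstart + ksize_width, input_width);
  hstart = std::max(hstart, 0);  // clip the window to the input
  wstart = std::max(wstart, 0);
  return exclusive ? (hend - hstart) * (wend - wstart)
                   : ksize_height * ksize_width;
}
```

For a 4x4 input with a 3x3 kernel, stride 1 and padding 1, the corner output (0, 0) sees only a 2x2 in-bounds window, so exclusive averaging divides by 4 while inclusive averaging divides by 9.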
batch_idx = index / channels / output_width / output_height;
}
int hstart, hend, wstart, wend;
int pw, ph, c, input_stride;
Reviewer: input_stride → input_offset. Also, what are pw and ph abbreviations of?
Author: pw and ph stand for w_offset and h_offset respectively; the next commit will fix these nonstandard names.
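The offsets under discussion come from decomposing a flat thread index into NCHW coordinates, as in the plain-division line `batch_idx = index / channels / output_width / output_height;` above. A host-side sketch of that decomposition (the names n, c, h, w and the struct are ours, for illustration):

```cpp
#include <cassert>

// Illustrative decomposition of a flat index into NCHW coordinates, in the
// plain / and % form used before the FastDivMod rewrite.
struct Nchw {
  int n, c, h, w;
};

Nchw DecomposeNchw(int index, int channels, int height, int width) {
  Nchw out;
  out.w = index % width;                           // fastest-varying axis
  out.h = (index / width) % height;
  out.c = (index / (width * height)) % channels;
  out.n = index / (width * height * channels);     // batch index
  return out;
}
```

Each `/` and `%` pair here is exactly what the FastDivMod helpers replace with a multiply-shift in the optimized kernel.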
T input_grad_data = static_cast<T>(0);
int phstart, phend, pwstart, pwend;
int w_offset, h_offset, c_offset, output_stride;
ParamPreparationByDatalayout<>(index, channel_last, divmods, padding_width,
Reviewer: Functions should be named after what they do; isn't this function's purpose to compute 4D coordinates?
                           T* dx) {
  *dx += dy * static_cast<T>(x == y);
static constexpr bool use_x = true;
DEVICE inline void compute(const T& x, const T* y, const T* dy, int out_idx,
Reviewer: Why change y and dy to pointer types?
Author:
- The main change is y: if it were passed by value as before, AvgPool would also have to read output_data[output_index] from global memory just so it could be passed into compute, even though AvgPool never needs output_data[output_index]; passing a pointer avoids that cost.
- Since dx is a pointer type, dy was changed to a pointer type to match.
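This argument can be made concrete with a small sketch: when y and dy are passed as pointers plus an index, the average-pool specialization never dereferences y, so the read of the pooled output is skipped entirely. The struct names below are ours and the signatures are paraphrased from the diff, not the final merged code:

```cpp
#include <cassert>

#define DEVICE  // host-only sketch; in Paddle this expands to __device__

// Paraphrase of the pointer-based compute() discussed above. MaxGradSketch
// must read y[out_idx] to compare x against the pooled max; AvgGradSketch
// never touches y, so passing a pointer avoids a needless global-memory load.
template <class T>
struct MaxGradSketch {
  static constexpr bool use_x = true;
  DEVICE inline void compute(const T& x, const T* y, const T* dy, int out_idx,
                             T /*scale*/, T* dx) {
    *dx += dy[out_idx] * static_cast<T>(x == y[out_idx]);
  }
};

template <class T>
struct AvgGradSketch {
  static constexpr bool use_x = false;
  DEVICE inline void compute(const T& /*x*/, const T* /*y*/, const T* dy,
                             int out_idx, T scale, T* dx) {
    *dx += scale * dy[out_idx];  // y is never dereferenced here
  }
};
```

Passing a null y to the average variant compiles and runs fine precisely because the pointer is never dereferenced, which is the saving the author describes.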
LGTM
* Optimization of pool2d grad, first commit.
* remove useless print codes
* refine codes
* refine codes
* seal more operation into template specialization
* fix template struct error in MaxPool2dGrad.
* Fix header including error
* refine code with comment
* Seal the param-preparation codes into function for common use.
* Seal the param-preparation codes into function for common use.
* Seal the param-preparation into funciton and make it common for other kernels
* polish code and erase useless template speicalization
* Rerun triger
* rerun trigger
PR types
Performance optimization

PR changes
OPs

Describe
Feature:
Performance (take ResNet50 for example):