Implement CUDA kernel for index_sample. #30380
Conversation
Thanks for your contribution!
…timization desperately.
using LoDTensor = framework::LoDTensor;

template <typename T, typename IndexT = int>
__global__ void index_kernel(const IndexT* p_index, const T* p_input,
The code should follow the Google C++ style guide: functions are named in CamelCase (AxxBxx) form.
The coding style will indeed be revised; this issue will be handled in the next PR.
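For illustration, a Google-style (AxxBxx) rename of the kernel above might look like this sketch (parameter names unchanged; not the PR's final code):

template <typename T, typename IndexT = int>
__global__ void IndexKernel(const IndexT* p_index, const T* p_input,
                            T* p_output, size_t stride_index,
                            size_t stride_input, size_t height);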
template <typename T, typename IndexT = int>
__global__ void index_kernel(const IndexT* p_index, const T* p_input,
                             T* p_output, size_t stride_index,
                             size_t stride_input, size_t height) {
- p_index -> index, p_input -> input, p_output -> output; there is no need for the names to specifically emphasize that these are pointers.
- stride_index, stride_input, height: I cannot quite map these parameters to their meanings; could the variable names be more intuitive?
Prefixing pointers with "p_" is a long-standing habit of mine; I will rename them later to match Paddle's naming conventions.
template <typename T, typename IndexT = int>
__global__ void index_kernel_grad(const IndexT* p_index, T* p_input,
                                  const T* p_output, size_t stride_index,
                                  size_t stride_input, size_t height) {
In terms of what they actually mean: p_index -> index, p_input -> in_grad, p_output -> out_grad.
Changed as suggested.
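As an illustrative sketch (not the merged code), the suggested renaming would give the backward kernel a signature like:

template <typename T, typename IndexT = int>
__global__ void IndexKernelGrad(const IndexT* index, T* in_grad,
                                const T* out_grad, size_t stride_index,
                                size_t stride_input, size_t height);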
dim3 block_dim(block_width, block_height);
dim3 grid_dim((index_length + block_dim.x - 1) / block_dim.x,
              (batch_size + block_dim.y - 1) / block_dim.y);
- Please explain in the PR description how the CUDA parallelization scheme is designed.
- The op benchmark may need more configurations: currently there is only one configuration with index_dim=1; it would be best to add configurations with index_dim>1.
- Also check whether the unit tests include configurations with index_dim>1. (A sketch of the thread mapping follows this list.)
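For reference, a minimal sketch of the 2-D block/grid mapping implied by the grid computation above, assuming each thread copies one output element (the kernel and parameter names here are illustrative, not the PR's code):

template <typename T, typename IndexT = int>
__global__ void IndexSampleSketch(const IndexT* index, const T* in, T* out,
                                  size_t index_length, size_t input_length,
                                  size_t batch_size) {
  // x indexes a column of the index/out matrices, y indexes a batch row.
  int ix = blockDim.x * blockIdx.x + threadIdx.x;
  int iy = blockDim.y * blockIdx.y + threadIdx.y;
  if (ix < index_length && iy < batch_size) {
    IndexT col = index[iy * index_length + ix];  // sampled input column
    out[iy * index_length + ix] = in[iy * input_length + col];
  }
}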
}

template <typename DeviceContext, typename T>
class IndexSampleCUDAKernel : public framework::OpKernel<T> {
I suggest changing this into a specialization of the IndexSampleKernel class declared in index_sample.h.
Changed as suggested.
This hasn't been changed? I suggested rewriting it in the following form (Paddle/paddle/fluid/operators/sum_op.cu, lines 230 to 233 in f89da4a):

template <typename T>
class SumKernel<platform::CUDADeviceContext, T>
    : public framework::OpKernel<T> {
 public:

That way the check at L92 can be removed.
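Applied to this PR, the suggested form would look roughly like the sketch below (assuming the primary template IndexSampleKernel is declared in index_sample.h; the body is elided):

template <typename T>
class IndexSampleKernel<platform::CUDADeviceContext, T>
    : public framework::OpKernel<T> {
 public:
  void Compute(const framework::ExecutionContext& ctx) const override {
    // CUDA-only implementation; since this specialization is registered
    // only for CUDA, an explicit is_gpu_place check becomes redundant.
  }
};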
Changed as requested.
    return 16;
  else
    return 8;
};
Paddle/paddle/fluid/platform/cuda_device_function.h, lines 36 to 50 in 7e9f336:

inline static int RoundToPowerOfTwo(int dim) {
  if (dim > 512) {
    return 1024;
  } else if (dim > 256) {
    return 512;
  } else if (dim > 128) {
    return 256;
  } else if (dim > 64) {
    return 128;
  } else if (dim > 32) {
    return 64;
  } else {
    return 32;
  }
}

Could this function be used instead?
Yes, this code is quite inelegant; it will definitely be replaced.
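A hedged sketch of what the replacement could look like (the variable names are illustrative, not the merged code):

int block_width = platform::RoundToPowerOfTwo(index_length);
// e.g. index_length == 100 rounds up to block_width == 128 (capped at 1024)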
…d inevitably increase thread-safety once calculating the backward step of index_sample OP, and one special CUDA kernel considering the condition that each line of index array only contains 1 element. Besides, thread-deployment in block was 2-dimensions.

force-pushed from d18b4bf to 98a9af7
int tid = iy * index_length + ix;
int tid_x = iy * input_length + ix;

if (ix < index_length && iy < batch_size) {
The minimum blockDim.x is 32. When index_length < 32, will some of the 32 consecutive threads in a block sit idle? Later, please look into whether there is a better parallelization scheme.
namespace paddle {
namespace operators {

using platform::PADDLE_CUDA_NUM_THREADS;
Is this actually unused?
int ix = blockDim.x * blockIdx.x + threadIdx.x;
int iy = blockDim.y * blockIdx.y + threadIdx.y;
int tid = iy * index_length + ix;
int tid_x = iy * input_length + ix;
The variable names could be more intuitive. For example, ix and iy here are the x and y subscripts into the index array, so they could be renamed index_i and index_j. tid is the position in the index array, which is also the position in the out array, so it could be renamed index_idx or out_idx. tid_x is the position in the in array, so it could be renamed in_idx. (See the sketch after this comment.)
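As an illustrative sketch, the suggested renaming applied to the quoted lines (behavior unchanged):

int index_i = blockDim.x * blockIdx.x + threadIdx.x;
int index_j = blockDim.y * blockIdx.y + threadIdx.y;
int index_idx = index_j * index_length + index_i;  // position in index/out
int in_idx = index_j * input_length + index_i;     // position in in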
int ix = blockDim.x * blockIdx.x + threadIdx.x;
int iy = blockDim.y * blockIdx.y + threadIdx.y;
int tid = iy * index_length + ix;
int tid_y = iy * input_length + ix;
Same naming suggestions as above. Also, why is this variable called tid_x in the forward kernel but tid_y in this kernel?
    (batch_size + block_dim.y - 1) / block_dim.y);

platform::GpuMemsetAsync(input_grad_data, 0,
                         sizeof(T) * input_length * batch_size, stream);
Change this to call the following function instead (Paddle/paddle/fluid/operators/trace_op.h, lines 219 to 221 in f89da4a):

math::SetConstant<DeviceContext, T> set_zero;
auto& dev_ctx = context.template device_context<DeviceContext>();
set_zero(dev_ctx, d_x, static_cast<T>(0.0));
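Adapted to this kernel, the call might look like the sketch below (input_grad here is a stand-in for the Tensor whose buffer is input_grad_data; not the merged code):

math::SetConstant<platform::CUDADeviceContext, T> set_zero;
auto& dev_ctx = ctx.template device_context<platform::CUDADeviceContext>();
set_zero(dev_ctx, input_grad, static_cast<T>(0));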
Fixed.
Sorry to inform you that fec47c5's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.
    index_data, in_data, out_data, index_length, input_length,
    batch_size);
}
PADDLE_ENFORCE_CUDA_SUCCESS(cudaStreamSynchronize(stream));
There is no need to synchronize inside an op implementation.
The synchronization has been removed.
        framework::proto::VarType::INT64)));
PADDLE_ENFORCE_EQ(
    platform::is_gpu_place(ctx.GetPlace()), true,
    platform::errors::InvalidArgument("It must use CUDAPlace."));
This check can now be removed.
The check has been removed.
    index_data, input_grad_data, output_grad_data, index_length,
    input_length, batch_size, same_data_in_index_row);
}
PADDLE_ENFORCE_CUDA_SUCCESS(cudaStreamSynchronize(stream));
There is no need to synchronize inside an op implementation.
Removed in a later commit.
Updated the variable naming.
LGTM and great work~
PR types
Performance optimization

PR changes
OPs

Describe
- The backward computation of the IndexSample OP uses the atomicAdd interface to guarantee thread safety during the computation.
- Both the forward Kernel and the backward Kernel of the IndexSample OP use a 2-D block and a 2-D grid, in order to reduce the overhead of the index-computation part.
- Compared the OP's performance against pytorch. (A sketch of the atomicAdd-based backward step follows.)
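For illustration, a minimal sketch of the atomicAdd-based backward step described above, mirroring the forward mapping sketched earlier (the names are illustrative, not the merged code):

template <typename T, typename IndexT = int>
__global__ void IndexSampleGradSketch(const IndexT* index, T* in_grad,
                                      const T* out_grad, size_t index_length,
                                      size_t input_length, size_t batch_size) {
  int ix = blockDim.x * blockIdx.x + threadIdx.x;
  int iy = blockDim.y * blockIdx.y + threadIdx.y;
  if (ix < index_length && iy < batch_size) {
    IndexT col = index[iy * index_length + ix];
    // Several columns of one row may sample the same input column, so the
    // gradient accumulation must be atomic to stay thread-safe. Plain
    // atomicAdd needs a supported T (e.g. float, or double on sm_60+).
    atomicAdd(&in_grad[iy * input_length + col],
              out_grad[iy * index_length + ix]);
  }
}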