
Implement cuda kernel for index_sample. #30380

Merged
merged 10 commits into from
Feb 3, 2021

Conversation

JamesLim-sy (Contributor) commented Jan 13, 2021

PR types

Performance optimization

PR changes

OPs

Describe

  • Development environment:
  1. Device: V100-16G
  2. Environment: CUDA 10.1, cuDNN 7
  • Optimization approach:
  1. The backward pass of the IndexSample OP uses the atomicAdd API to guarantee thread safety during computation.
  2. Both the forward and backward kernels of the IndexSample OP use 2-D blocks and a 2-D grid, in order to reduce the overhead of the index computation.
  • Results:
| No. | index_shape | input_shape | Paddle perf (ms) | PyTorch perf (ms) | diff |
|-----|-------------|-------------|------------------|-------------------|------|
| 1 | [5100, 1] | [5100, 38506] | 0.7052 | 1.7032 | 58.597% faster |
| 2 | [100, 64] | [100, 128] | 0.0055 | 0.0083 | 33.874% faster |
| 3 | [5100, 96] | [5100, 128] | 0.0323 | 0.0377 | 14.131% faster |
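The forward gather described above (a 2-D thread layout, one thread per (row, index-column) pair) can be sketched on the CPU as follows. `IndexSampleForward` and its loop variables are hypothetical names that only mirror the kernel's index arithmetic; this is not Paddle's actual implementation.

```cpp
#include <cstddef>
#include <vector>

// CPU sketch of the forward gather: out[iy][ix] = in[iy][ index[iy][ix] ].
// The two loops stand in for the 2-D grid/block dimensions of the kernel.
std::vector<float> IndexSampleForward(const std::vector<int>& index,
                                      const std::vector<float>& in,
                                      std::size_t index_length,
                                      std::size_t input_length,
                                      std::size_t batch_size) {
  std::vector<float> out(batch_size * index_length);
  for (std::size_t iy = 0; iy < batch_size; ++iy) {      // maps to grid/block y
    for (std::size_t ix = 0; ix < index_length; ++ix) {  // maps to grid/block x
      std::size_t index_idx = iy * index_length + ix;    // position in index/out
      std::size_t in_idx = iy * input_length + index[index_idx];
      out[index_idx] = in[in_idx];
    }
  }
  return out;
}
```

The backward pass scatters out_grad back through the same mapping, which is why the GPU kernel needs atomicAdd: several index entries in one row may point at the same input column.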

@CLAassistant

CLAassistant commented Jan 13, 2021

CLA assistant check
All committers have signed the CLA.

@paddle-bot-old

Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

using LoDTensor = framework::LoDTensor;

template <typename T, typename IndexT = int>
__global__ void index_kernel(const IndexT* p_index, const T* p_input,
Contributor

The code must follow the Google C++ style guide; functions should be named in the form AxxBxx.

Contributor Author

The coding style will indeed be fixed; this issue will be addressed in the next PR.

template <typename T, typename IndexT = int>
__global__ void index_kernel(const IndexT* p_index, const T* p_input,
T* p_output, size_t stride_index,
size_t stride_input, size_t height) {
Contributor

  • p_index -> index, p_input -> input, p_output -> output. There seems to be no need for the names to specifically emphasize that these are pointers.
  • stride_index, stride_input, height: I cannot quite map these parameters to their meanings; could the variable names be more intuitive?

Contributor Author

Prefixing pointers with "p_" is a long-standing habit; I will switch to naming that matches Paddle's conventions.

template <typename T, typename IndexT = int>
__global__ void index_kernel_grad(const IndexT* p_index, T* p_input,
const T* p_output, size_t stride_index,
size_t stride_input, size_t height) {
Contributor

In terms of actual meaning: p_index -> index, p_input -> in_grad, p_output -> out_grad.

Contributor Author

Revised as suggested.

paddle/fluid/operators/index_sample_op.cu (outdated; resolved)

dim3 block_dim(block_width, block_height);
dim3 grid_dim((index_length + block_dim.x - 1) / block_dim.x,
(batch_size + block_dim.y - 1) / block_dim.y);
Contributor

  • Please add a description of the CUDA parallelization scheme to the PR description.
  • The op benchmark configs may need extending: currently there is only one config, with index_dim=1; it would be best to add configs with index_dim > 1.
  • Also check whether the unit tests include configs with index_dim > 1.
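For reference, the grid sizing quoted above is plain ceil division. A minimal sketch, where `CeilDiv` is a hypothetical helper name:

```cpp
// Round up extent / block_size so that every element is covered by a block.
constexpr int CeilDiv(int extent, int block_size) {
  return (extent + block_size - 1) / block_size;
}

// The launch in the diff is then equivalent to:
// dim3 grid_dim(CeilDiv(index_length, block_dim.x),
//               CeilDiv(batch_size, block_dim.y));
```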

}

template <typename DeviceContext, typename T>
class IndexSampleCUDAKernel : public framework::OpKernel<T> {
Contributor

I suggest changing this into a specialization of the IndexSampleKernel class from index_sample.h.

Contributor Author

Revised as suggested.

Contributor

This was not changed? What I suggested is the following form:

template <typename T>
class SumKernel<platform::CUDADeviceContext, T>
    : public framework::OpKernel<T> {
 public:

so that the check at L92 can be removed.

Contributor Author

Changed as requested.

return 16;
else
return 8;
};
Contributor

inline static int RoundToPowerOfTwo(int dim) {
  if (dim > 512) {
    return 1024;
  } else if (dim > 256) {
    return 512;
  } else if (dim > 128) {
    return 256;
  } else if (dim > 64) {
    return 128;
  } else if (dim > 32) {
    return 64;
  } else {
    return 32;
  }
}

Could this function be used instead?

Contributor Author

Yes, the current code here is quite ugly and will definitely be replaced.

…d inevitably increase thread-safety once calculating the backward step of index_sample OP, and one special CUDA kernel considering the condition that each line of the index array only contains 1 element. Besides, thread deployment within a block is 2-dimensional.
int tid = iy * index_length + ix;
int tid_x = iy * input_length + ix;

if (ix < index_length && iy < batch_size) {
Contributor

The minimum value of blockDim.x is 32. When index_length < 32, won't some of the 32 consecutive threads in a block sit idle? It is worth looking for a better parallelization scheme later.
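A small sketch of the idle-lane concern raised above; `XLaneUtilization` is a hypothetical illustration, not part of the PR. Rows shorter than a warp leave x-lanes unused.

```cpp
// Fraction of useful x-lanes when blockDim.x is a multiple of the
// 32-thread warp: lanes beyond index_length fail the bounds check
// and do no work.
double XLaneUtilization(int index_length) {
  int lanes = ((index_length + 31) / 32) * 32;  // x-extent rounded up to warps
  return static_cast<double>(index_length) / lanes;
}
```

For example, a row of 8 index entries keeps only a quarter of the first warp busy, while any multiple of 32 keeps every lane active.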

namespace paddle {
namespace operators {

using platform::PADDLE_CUDA_NUM_THREADS;
Contributor

Is this actually never used?

int ix = blockDim.x * blockIdx.x + threadIdx.x;
int iy = blockDim.y * blockIdx.y + threadIdx.y;
int tid = iy * index_length + ix;
int tid_x = iy * input_length + ix;
Contributor

The variable names could be more intuitive. For example, ix and iy here are the x and y subscripts into the index array, so they could be renamed index_i and index_j. tid is the position in the index array, which is also the position in the out array, so it could be index_idx or out_idx. tid_x is the position in the in array, so it could be in_idx.

int ix = blockDim.x * blockIdx.x + threadIdx.x;
int iy = blockDim.y * blockIdx.y + threadIdx.y;
int tid = iy * index_length + ix;
int tid_y = iy * input_length + ix;
Contributor

Same naming suggestions as above. Also, why is this variable called tid_x in the forward kernel but tid_y in this kernel?

paddle/fluid/operators/index_sample_op.cu (resolved)
}

template <typename DeviceContext, typename T>
class IndexSampleCUDAKernel : public framework::OpKernel<T> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这个没有改?我是建议改成如下方式:

template <typename T>
class SumKernel<platform::CUDADeviceContext, T>
: public framework::OpKernel<T> {
public:

这样L92的检查就可以去掉了。

(batch_size + block_dim.y - 1) / block_dim.y);

platform::GpuMemsetAsync(input_grad_data, 0,
sizeof(T) * input_length * batch_size, stream);
Contributor

Change this to call the following instead:

math::SetConstant<DeviceContext, T> set_zero;
auto& dev_ctx = context.template device_context<DeviceContext>();
set_zero(dev_ctx, d_x, static_cast<T>(0.0));

Contributor Author

Fixed.

@paddle-bot-old

Sorry to inform you that fec47c5's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.

index_data, in_data, out_data, index_length, input_length,
batch_size);
}
PADDLE_ENFORCE_CUDA_SUCCESS(cudaStreamSynchronize(stream));
Contributor

There is no need to synchronize inside the op implementation.

Contributor Author

The synchronization has been removed.

framework::proto::VarType::INT64)));
PADDLE_ENFORCE_EQ(
platform::is_gpu_place(ctx.GetPlace()), true,
platform::errors::InvalidArgument("It must use CUDAPlace."));
Contributor

This check can be removed now.

Contributor Author

The check has been removed.

index_data, input_grad_data, output_grad_data, index_length,
input_length, batch_size, same_data_in_index_row);
}
PADDLE_ENFORCE_CUDA_SUCCESS(cudaStreamSynchronize(stream));
Contributor

There is no need to synchronize inside the op implementation.

Contributor Author

Removed in a later commit.

Contributor Author

@JamesLim-sy left a comment

Revised the variable naming.


@JamesLim-sy JamesLim-sy requested a review from Xreki February 3, 2021 03:38
Contributor

@Xreki left a comment

LGTM and great work~
