Refine accuracy_op CUDA kernel #4097
Conversation
paddle/operators/accuracy_op.cu
Outdated
break;
}
}
}
*accuracy = static_cast<float>(correct) / static_cast<float>(N);
atomicAdd(&total, count);
LOL. It seems too complicated to me. It looks correct, but I really don't have enough experience to review it. Please ask @hedaoyuan to review it, or rewrite it based on a high-level library such as Thrust.
If you are interested in Thrust, I will give a demo later.
paddle/operators/accuracy_op.cu
Outdated
*accuracy = static_cast<float>(correct) / static_cast<float>(N);
atomicAdd(&total, count);
__syncthreads();
if (threadIdx.x == 0) {
It seems that when num_samples is greater than 4096 there will be multiple blocks, and the result may be incorrect. I think we can consider using only one block for the calculation. Try to avoid using atomicAdd; the __shared__ int total; can be replaced by __shared__ int total[block_size]; with a reduction added at the end (see the sketch below).
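A minimal sketch of the suggested single-block, shared-memory reduction (the kernel name, the BLOCK_SIZE value, and the launch shape below are illustrative, not the PR's final code):

#define BLOCK_SIZE 512

// One block: each thread accumulates a partial count, then a tree
// reduction over shared memory produces the total without atomicAdd.
__global__ void AccuracySingleBlock(const int N, const int D, const int* Xdata,
                                    const int* labeldata, float* accuracy) {
  __shared__ int total[BLOCK_SIZE];
  int count = 0;
  for (int row = threadIdx.x; row < N; row += BLOCK_SIZE) {
    const int label = labeldata[row];
    for (int col = 0; col < D; ++col) {
      if (Xdata[row * D + col] == label) {
        ++count;
        break;
      }
    }
  }
  total[threadIdx.x] = count;
  __syncthreads();
  // Tree reduction over the per-thread counts (BLOCK_SIZE must be a power of two).
  for (int stride = BLOCK_SIZE / 2; stride > 0; stride >>= 1) {
    if (threadIdx.x < stride) {
      total[threadIdx.x] += total[threadIdx.x + stride];
    }
    __syncthreads();
  }
  if (threadIdx.x == 0) {
    *accuracy = static_cast<float>(total[0]) / static_cast<float>(N);
  }
}

// Launched with a single block: AccuracySingleBlock<<<1, BLOCK_SIZE>>>(...).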
Thanks very much for the advice, I'll try to fix it.
Done.
paddle/operators/accuracy_op.cu
Outdated
const int pred = Xdata[row * D + col];
if (pred == label) {
  ++correct;
__global__ void AccuracyCudaKernel(const int N, const int D, const int* Xdata,
Using template <int BlockSize> is better than using the PADDLE_CUDA_NUM_THREADS macro.
I would prefer PADDLE_CUDA_NUM_THREADS; it's a constexpr. If we use template <int BlockSize>, we have to pass BlockSize twice when calling the kernel, like SomeKernel<BlockSize><<<1, BlockSize>>>(), and it seems PADDLE_CUDA_NUM_THREADS can always be the same const value.
Well, I think we can open an issue to discuss and then write some CUDA development docs.
Using the PADDLE_CUDA_NUM_THREADS macro is also equivalent to using it twice (both styles are sketched below).
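To make the trade-off concrete, a small sketch of the two styles (SomeKernelTpl, SomeKernelConst, kCudaNumThreads, and the value 512 are illustrative; kCudaNumThreads only stands in for PADDLE_CUDA_NUM_THREADS):

// Style 1: block size as a template parameter; it must be repeated at the launch site.
template <int BlockSize>
__global__ void SomeKernelTpl(const int n, const int* in, int* out) {
  __shared__ int partial[BlockSize];
  int sum = 0;
  for (int i = threadIdx.x; i < n; i += BlockSize) sum += in[i];
  partial[threadIdx.x] = sum;
  // ... reduce partial[] and have thread 0 write out[0] ...
}

// Style 2: one shared compile-time constant used in both the kernel body and the launch.
constexpr int kCudaNumThreads = 512;
__global__ void SomeKernelConst(const int n, const int* in, int* out) {
  __shared__ int partial[kCudaNumThreads];
  int sum = 0;
  for (int i = threadIdx.x; i < n; i += kCudaNumThreads) sum += in[i];
  partial[threadIdx.x] = sum;
  // ... reduce partial[] and have thread 0 write out[0] ...
}

// Launch sites:
//   SomeKernelTpl<512><<<1, 512>>>(n, in, out);            // size appears twice
//   SomeKernelConst<<<1, kCudaNumThreads>>>(n, in, out);   // single named constant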
paddle/operators/accuracy_op.cu
Outdated
__shared__ int total[PADDLE_CUDA_NUM_THREADS];

// support only 1 block
for (int i = threadIdx.x; i < (N); i += blockDim.x * gridDim.x) {
i += blockDim.x * gridDim.x -> i += BlockSize
Using BlockSize is better than blockDim.x (BlockSize is known at compile time).
I tried to write some demo code to show how Thrust could be used in this case. I found that it's also quite painful to write.

#include <thrust/tuple.h>
#include <thrust/device_vector.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/transform.h>
#include <thrust/reduce.h>

// Returns 1 when the prediction in the tuple matches the label, 0 otherwise.
struct IsCorrect {
  __host__ __device__ int operator()(const thrust::tuple<int, int>& t) const {
    return thrust::get<0>(t) == thrust::get<1>(t) ? 1 : 0;
  }
};

void accuracy(const int* data, const int* label, int length, float* acc) {
  *acc = 0.f;
  // Copy predictions and labels to the device.
  thrust::device_vector<int> data_device(data, data + length);
  thrust::device_vector<int> label_device(label, label + length);
  // Zip them so each element is a (prediction, label) tuple.
  auto first = thrust::make_zip_iterator(
      thrust::make_tuple(data_device.begin(), label_device.begin()));
  auto last = thrust::make_zip_iterator(
      thrust::make_tuple(data_device.end(), label_device.end()));
  // Mark each sample as correct (1) or not (0), then sum on the device.
  thrust::device_vector<int> correct(length);
  thrust::transform(first, last, correct.begin(), IsCorrect());
  int result = thrust::reduce(correct.begin(), correct.end());
  if (result != 0) *acc = static_cast<float>(result) / length;
}

Related document: zip_iterator
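For comparison, a shorter sketch of the same element-wise idea using thrust::inner_product, which fuses the comparison and the sum and avoids the intermediate correct vector (an illustrative alternative, not part of this PR; it assumes predictions and labels are int vectors of equal length):

#include <thrust/device_vector.h>
#include <thrust/inner_product.h>
#include <thrust/functional.h>

// Count matching (prediction, label) pairs in one fused transform-reduce pass.
float AccuracyInnerProduct(const thrust::device_vector<int>& pred,
                           const thrust::device_vector<int>& label) {
  int correct = thrust::inner_product(pred.begin(), pred.end(), label.begin(),
                                      0, thrust::plus<int>(),
                                      thrust::equal_to<int>());
  return pred.empty() ? 0.f : static_cast<float>(correct) / pred.size();
}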
@dzhwinter Thanks for the very useful code! Both ways are OK, with or without Thrust; I also prefer using
LGTM
Fix #4096