Add depthwise conv op gpu #7885

Merged: 9 commits merged into PaddlePaddle:develop on Feb 2, 2018

Conversation

NHZlX (Contributor) commented on Jan 25, 2018:

fix #7772

Benchmark of one forward-backward pass:

batch    ours     cuDNN v7   cuDNN v5   cuBLAS
1        0.076s   0.97s      0.91s      0.44s
20       0.62s    1.60s      1.61s      7.63s
40       1.21s    1.82s      1.89s      15s
60       2.06s    2.41s      2.66s      23s
80       2.69s    3.25s      3.91s      30s

KernelDepthwiseConvInputGrad<T><<<grid, threads, 0, context.stream()>>>(
nthreads, output_grad_data, filter_data, batch_size, output_channels,
output_height, output_width, input_channels, input_height, input_width,
output_channels / input_channels, ksize_height, ksize_width,
Contributor:

output_channels / input_channels

This implies that input_channels should be less than output_channels and that output_channels must be divisible by input_channels, right?

Contributor (Author):

That's right. I think I should add this check on the Python side.

Contributor:

The implementation of DepthwiseConvKernel should not depend on Python code, because the FLUID interface does not only support Python.

const int offset =
((batch * input_channels + c_in) * input_height + h_in) *
input_width +
w_in;
Contributor:

const int h_in = -padding_height + h_out * stride_height + kh;
const int w_in = -padding_width + w_out * stride_width + kw;
const int offset = ((batch * input_channels + c_in) * input_height + h_in) * input_width +  w_in;

This code can be written more efficiently:

      const int h_in_s = -padding_height + h_out * stride_height;
      const int w_in_s = -padding_width + w_out * stride_width;
      const int in_offset = ((batch * input_channels + c_in) * input_height) *  input_width;
      for (int kh = 0; kh < filter_height; ++kh) {
        for (int kw = 0; kw < filter_width; ++kw) {
          const int h_in = h_in_s + kh;
          const int w_in = w_in_s + kw;
          const int offset = in_offset + h_in * input_width +  w_in;
          value += (*weight) * input_data[offset];
          ++weight;
        }
      }

Contributor (Author):

Yeah, I will fix it ASAP.


std::vector<int> strides = context.Attr<std::vector<int>>("strides");
std::vector<int> paddings = context.Attr<std::vector<int>>("paddings");
std::vector<int> dilations = context.Attr<std::vector<int>>("dilations");
Contributor:

Doesn't DepthwiseConv support groups?

Contributor (Author):

The groups value equals the number of input channels.

Contributor (Author):

I added the choice to the Python code: it executes the depthwise conv op when the groups size equals the number of input channels. But I didn't add any check on the C++ side, because I directly reuse the conv op instead of creating a new one.

Contributor:

You should add PADDLE_ENFORCE_EQ(...) to check this and add comments about it. The implementation of DepthwiseConvKernel should not depend on Python code, because the FLUID interface does not only support Python.
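
For example, a minimal sketch of such a check (the variable names groups, input_channels, and output_channels are illustrative here, not taken from this PR; the check would go wherever the op reads the tensor shapes):

      // Hypothetical enforcement of the depthwise constraints on the C++ side,
      // independent of any checks done by the Python frontend.
      // Assumes groups, input_channels, and output_channels were already read
      // from the op attributes and the input/filter tensor dimensions.
      PADDLE_ENFORCE_EQ(groups, input_channels,
                        "depthwise conv expects groups to equal the input channel number.");
      PADDLE_ENFORCE_EQ(output_channels % input_channels, 0,
                        "depthwise conv expects output channels to be a multiple of input channels.");

This way the constraint is enforced no matter which language frontend constructed the op.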

op_library(conv_op SRCS conv_op.cc conv_op.cu.cc conv_cudnn_op.cu.cc DEPS
vol2col depthwise_conv)

# op_library(conv_op SRCS conv_op.cc conv_op.cu.cc conv_cudnn_op.cu.cc DEPS vol2col)
Contributor:

This line can be removed.

Contributor (Author):

ok


REGISTER_OP_CUDA_KERNEL(
depthwise_conv_grad,
ops::DepthwiseConvGradKernel<paddle::platform::CUDADeviceContext, float>,
ops::DepthwiseConvGradKernel<paddle::platform::CUDADeviceContext, double>);
Contributor:

Does depthwise_conv need a cudnn kernel?

Contributor (Author):

There is no depthwise cudnn kernel. If we specify the cudnn mode, we use the conv cudnn kernel.

class DepthwiseConvFunctor {
public:
void operator()(const DeviceContext& context, const framework::Tensor& input,
const framework::Tensor& filter, std::vector<int>& strides,
Contributor:

std::vector<int>& strides ==> const std::vector<int>& strides

Contributor (Author):

Done.

}
++weight;
}
}
Contributor:

It seems that lines 53-77 can be written more concisely and more efficiently, improving performance by reducing the number of conditional checks.

e.g.

      h_in_end = ...
      w_in_end = ...
      for (int kh = h_in_start; kh < h_in_end; ++kh) {
        for (int kw = w_in_start; kw < w_in_end; ++kw) {
           ...
           ...
       }
     }

Contributor (Author):

If so, we would have to add extra, more complex handling of the filter indexing. That may outweigh the benefits.

Contributor:

I think the following code can work, and it is not complex.

      h_end = h_in_start + filter_height > input_height ? input_height : h_in_start + filter_height;
      w_end = w_in_start + filter_width > input_width ? input_width : w_in_start + filter_width;
      h_start = h_in_start > 0 ? h_in_start : 0;
      w_start = w_in_start > 0 ? w_in_start : 0;
      input_data += in_offset;
      for (int h_in = h_start; h_in < h_end; ++h_in) {
        for (int w_in = w_start; w_in < w_end; ++w_in) {
          const int offset = h_in * input_width + w_in;
          // Index the filter relative to the unclipped window origin (h_in_start, w_in_start).
          value += weight[(h_in - h_in_start) * filter_width + (w_in - w_in_start)] * input_data[offset];
        }
      }

@@ -0,0 +1,340 @@
/* Copyright (c) 2016 paddlepaddle Authors. All Rights Reserve.
Contributor:

2016 ==> 2018
Reserve ==> Reserved

Contributor (Author):

The other files seem to be the same as mine.

Contributor:

If the file is created in 2018, the copyright year should be 2018.

chengduoZH (Contributor) left a comment:

LGTM

NHZlX merged commit d059951 into PaddlePaddle:develop on Feb 2, 2018.

Successfully merging this pull request may close these issues: Implement depthwise separable convolutions (#7772).