Optimize 1x1 convolution for Network-in-Network style operation #1118

Merged
2 commits merged into BVLC:dev from the 1x1-conv branch on Sep 20, 2014

Conversation

shelhamer
Member

1x1 convolution with stride 1 and no padding is a special case of Caffe matrix multiplication convolution for which im2col / col2im transformations are actually the identity. For this special case the memory and transformation are skipped.

This optimizes the execution of 1x1 convolutions, i.e. NIN / CCCP convolutions.
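For reference, a minimal sketch of why the identity holds (a naive loop over the same column layout, assuming zero padding; not Caffe's actual im2col_cpu signature): with a 1x1 kernel and stride 1, the column index reduces to the image index, so the col buffer is a verbatim copy of the bottom data.

#include <vector>

void naive_im2col(const std::vector<float>& im, int channels, int height,
                  int width, int kernel_h, int kernel_w, int stride_h,
                  int stride_w, std::vector<float>* col) {
  const int height_out = (height - kernel_h) / stride_h + 1;
  const int width_out = (width - kernel_w) / stride_w + 1;
  col->assign(channels * kernel_h * kernel_w * height_out * width_out, 0.f);
  for (int c = 0; c < channels; ++c)
    for (int kh = 0; kh < kernel_h; ++kh)
      for (int kw = 0; kw < kernel_w; ++kw)
        for (int h = 0; h < height_out; ++h)
          for (int w = 0; w < width_out; ++w)
            // With kernel 1x1 and stride 1 both indices reduce to
            // (c * height + h) * width + w, so this copy is the identity.
            (*col)[(((c * kernel_h + kh) * kernel_w + kw) * height_out + h)
                       * width_out + w] =
                im[(c * height + h * stride_h + kh) * width
                   + w * stride_w + kw];
}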

@mavenlin

// Special case: im2col is the identity for 1x1 convolution w/ stride 1,
// so flag for skipping the buffer and transformation.
is_1x1_ = kernel_w_ == 1 && kernel_h_ == 1
    && stride_h_ == 1 && stride_w_ == 1;
Contributor

We also need to check that there is zero padding, yes?

@shelhamer
Member Author

@longjon right on both counts. I've fixed both points.

@sguada can you explain the case for pad = 1? This is a 1x1 conv so the padding is meaningless and the following layer can always configure its own padding.

1x1 convolution with stride 1 is a special case of Caffe matrix
multiplication convolution for which im2col / col2im transformations are
actually the identity. For this special case the memory and
transformation are skipped.
@sguada
Contributor

sguada commented Sep 20, 2014

Sorry, I was thinking of the 3x3 case. No need for padding.


Dtype* col_diff = NULL;
if (!is_1x1_) {
  col_data = col_buffer_.mutable_cpu_data();
  col_diff = col_buffer_.mutable_cpu_diff();
Contributor

By the way... could we save memory in the usual case by changing this line to col_buffer_.mutable_cpu_data() (i.e., by reusing the same buffer for both data and diff)? Perhaps I have missed something, but I don't see any reason in the code below why we need two separate buffers...

Member Author

Good catch -- there's no need for the two at once, since col_data is only for the gradient w.r.t. the weight while col_diff is only for the gradient w.r.t. the bottom. Should we parallelize these in the future, separate buffers will be needed, but that can be adjusted when we cross that bridge. Check out the follow-up commit.
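For illustration, a rough sketch of what the consolidation amounts to (variable names are assumed; this is not the exact follow-up commit):

Dtype* col_buff = NULL;
if (!is_1x1_) {
  // One buffer serves both passes: the weight gradient only reads the
  // im2col'd bottom, and the bottom gradient only writes the col form of
  // the diff before col2im -- never both at the same time.
  col_buff = col_buffer_.mutable_cpu_data();
}
// Gradient w.r.t. weight: im2col(bottom) -> col_buff, then gemm with top_diff.
// Gradient w.r.t. bottom: gemm(weight^T, top_diff) -> col_buff, then col2im.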

@longjon
Contributor

longjon commented Sep 20, 2014

Looks pretty good. It might be worth a comment near the col_buffer_ reshape to explain that memory will go lazily unused in the 1x1 case.
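For example, the suggested comment could read something like this (the Reshape call shown is an assumption about the layer's existing setup code, not a line from this PR):

// The im2col result is stored in col_buffer_. In the 1x1 case the buffer is
// never written, and since blob memory is only allocated lazily on first
// access, the reshape below costs nothing when is_1x1_ is true.
col_buffer_.Reshape(
    1, channels_ * kernel_h_ * kernel_w_, height_out_, width_out_);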

@sguada
Contributor

sguada commented Sep 20, 2014

This reminds me that we should recover the shared_col_buffer across
convolutions.

Any suggestions as to which class should be responsible for providing them?


@shelhamer
Member Author

@sguada it seems to me that Net should broker shared blobs as requested; then each layer can reshape them on-the-fly as needed. The memory is shared across layers but still owned by the Net and will be freed along with the Net. @longjon's PR lets the blobs grow to the largest size needed. It could be worth a try for fully-convolutional models in the regime where Caffe's matrix multiplication is faster than cuDNN (at present).
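A hypothetical sketch of what such brokering could look like (not an existing Caffe interface; the class and method names are made up for illustration):

#include <map>
#include <string>
#include <boost/shared_ptr.hpp>
#include "caffe/blob.hpp"

// Hypothetical: the Net owns a registry of shared buffers keyed by name.
// Layers request one during setup and Reshape() it on the fly, so a single
// allocation grows to the largest size any layer asks for and is freed
// together with the Net.
template <typename Dtype>
class SharedColBufferBroker {
 public:
  boost::shared_ptr<caffe::Blob<Dtype> > Request(const std::string& name) {
    if (!buffers_[name]) {
      buffers_[name].reset(new caffe::Blob<Dtype>());
    }
    return buffers_[name];
  }

 private:
  std::map<std::string, boost::shared_ptr<caffe::Blob<Dtype> > > buffers_;
};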

conv forward / backward only need one of the im2col data and diff
at-a-time so consolidating the two saves a lazy allocation.
longjon added a commit that referenced this pull request Sep 20, 2014
Optimize 1x1 convolution for Network-in-Network style operation
longjon merged commit de90c60 into BVLC:dev Sep 20, 2014
@longjon
Contributor

longjon commented Sep 20, 2014

Awesome, this looks perfect to me. Thanks @shelhamer for writing this nice tight optimization (and being super responsive!)

shelhamer deleted the 1x1-conv branch September 20, 2014 07:08
@sguada
Contributor

sguada commented Sep 25, 2014

@shelhamer I think we could do the same trick in the case where the filters have the same size as the bottoms and there is no padding, so no stride is needed and only one matrix multiplication is required. Useful for replacing fully connected layers with convolutions.
Would you like to add that case?

@shelhamer
Member Author

@sguada yes, the fully-connected case (bottom dimensions = filter dimensions)
can be done as a gemm special case. It's worth adding in my opinion.
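For illustration, a sketch of what that special case reduces to (names and the use of plain CBLAS are assumptions, not code from this PR): with kernel_h == height, kernel_w == width, and no padding, the col buffer has a single spatial column, so the forward pass per image is one matrix-vector product of the weights against the flattened input, exactly as in an inner-product layer.

#include <cblas.h>

// weights: num_output x (channels * height * width), row-major
// bottom:  channels * height * width, one image flattened
// top:     num_output (bias term omitted for brevity)
void conv_as_fc_forward(const float* weights, const float* bottom, float* top,
                        int num_output, int dim) {
  // top = 1.0 * weights * bottom + 0.0 * top
  cblas_sgemv(CblasRowMajor, CblasNoTrans, num_output, dim, 1.0f, weights, dim,
              bottom, 1, 0.0f, top, 1);
}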

The other optimization is to allow batched im2col / col2im, when memory
allows, to gemm multiple inputs at once. It should be simple to add in our
implementation actually -- we just need to reshape and index into col_buff.


@longjon
Contributor

longjon commented Sep 25, 2014

@sguada @shelhamer
Re: the buffer-skipping trick: at this point we may as well ask the question: exactly when is the col buffer identical to the input? I believe the answer is...

(pad_h == 0 && pad_w == 0)
  && ((stride_w == kernel_w && width % kernel_w == 0 && kernel_h == 1)
      || (width == kernel_w && ((stride_h == kernel_h && height % kernel_h == 0)
                                || height == kernel_h)))

which is rather more general than both the special cases discussed so far.

Re: batched buffers: you would also get this for free in the above case. I wonder how much of a difference it makes though?

@sguada
Contributor

sguada commented Sep 27, 2014

@longjon I think there are some symmetric cases missing in that formula, e.g.:

(stride_h == kernel_h && height % kernel_h == 0 && kernel_w == 1)
(height == kernel_h && (stride_w == kernel_w && width % kernel_w == 0))

How about this formula:

(pad_h == 0 && pad_w == 0) &&
  ((width == kernel_w && height == kernel_h) ||
   (stride_w == kernel_w && width % kernel_w == 0 && (height == kernel_h || kernel_h == 1)) ||
   (stride_h == kernel_h && height % kernel_h == 0 && (width == kernel_w || kernel_w == 1)))

@longjon
Contributor

longjon commented Sep 27, 2014

@sguada No, it's trickier than that, and not symmetric in the way you are thinking, because row-major order goes left-to-right, up-to-down. You might think that you could get the same optimization for the column-major contiguous cases by changing the transposition parameters, but I think that cannot be done because of the channel dimension.

@sguada
Contributor

sguada commented Sep 27, 2014

@longjon yeah, you're right, I forgot to consider the asymmetry introduced by the row-major order.
Then for clarity let's rewrite your formula (plus the 1x1 case) as 4 different cases:

(pad_h == 0 && pad_w == 0) && 
  ((kernel_w_ == 1 && kernel_h_ == 1 && stride_h_ == 1 && stride_w_ == 1) ||
  (width == kernel_w && height == kernel_h) ||
  (stride_w == kernel_w && width % kernel_w == 0 && kernel_h == 1) ||
  (stride_h == kernel_h && height % kernel_h == 0 && kernel_w == width))

@longjon
Contributor

longjon commented Sep 28, 2014

@sguada I believe your expression is correct, but I think it's less clear to include redundant cases. One way or another, I think the expression should be explained by comments (although one really has to have the right picture to understand it). E.g., I would suggest writing:

// For the column buffer to be identical to the input, we must have...
// zero padding, plus...
(pad_h == 0 && pad_w == 0) &&
   // the kernel must tile the input horizontally, and have height one...
  ((stride_w == kernel_w && width % kernel_w == 0 && kernel_h == 1)
   // unless it takes the whole width of the input, in which case it must
   // tile the input vertically, or take the whole height of the input!
   || (width == kernel_w && ((stride_h == kernel_h && height % kernel_h == 0)
                              || height == kernel_h)))
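If such a generalization were adopted, it would presumably replace the is_1x1_ flag computed in LayerSetUp with a more general flag; a sketch only, with member names mirroring the existing ones (skip_im2col_ is a hypothetical name, and the expression itself is the one proposed above, not verified here):

skip_im2col_ = (pad_h_ == 0 && pad_w_ == 0)
    && ((stride_w_ == kernel_w_ && width_ % kernel_w_ == 0 && kernel_h_ == 1)
        || (width_ == kernel_w_
            && ((stride_h_ == kernel_h_ && height_ % kernel_h_ == 0)
                || height_ == kernel_h_)));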

mitmul pushed a commit to mitmul/caffe that referenced this pull request Sep 30, 2014
Optimize 1x1 convolution for Network-in-Network style operation
RazvanRanca pushed a commit to RazvanRanca/caffe that referenced this pull request Nov 4, 2014
Optimize 1x1 convolution for Network-in-Network style operation