Fix transposed convolution in CPU w/o MKLDNN. #14031
Conversation
@zhreshold @thomelane Please help to review. Thanks!
Can you verify the result in unittest?
@zhreshold unit test added.
@mxnet-label-bot add [pr-awaiting-review]
src/operator/nn/deconvolution-inl.h
Outdated
@@ -226,22 +227,24 @@ class DeconvolutionOp {
  CHECK_EQ(in_data.size(), expected);
  CHECK_EQ(out_data.size(), 1U);
  Stream<xpu> *s = ctx.get_stream<xpu>();
#if defined(__CUDACC__)
  CHECK_EQ(s->blas_handle_ownership_, Stream<xpu>::OwnHandle)
      << "Must init CuBLAS handle in stream";
"cuBLAS" is the official abbreviation :)
@@ -503,6 +503,40 @@ def test_deconv():
# layer = nn.Conv3DTranspose(16, (3, 3, 3), layout='NDHWC', in_channels=4)
# # check_layer_forward(layer, (1, 10, 10, 10, 4))

@with_seed()
def test_deconv_dilation():
Since deconv is a really important op, I suggest revisiting the original deconv test cases and adding dilation > 1 cases alongside the old tests. This ensures better coverage than this single test case.
Feel free to keep this unittest as well, which LGTM.
Good catch! Looks good.
@@ -485,7 +455,6 @@ class DeconvolutionOp {
  DeconvolutionParam param_;
  mshadow::Shape<2> shape_colunit_;
  mshadow::Shape<3> shape_dstunit_;
  index_t nstep_;
Can you please tell me why this was removed?
The `col2im` method does not support such a step.
const index_t nbatch = data.size(0);
Tensor<xpu, 1, DType> workspace =
    ctx.requested[deconv::kTempSpace].get_space_typed<xpu, 1, DType>(
        Shape1(this->InitTemp(out.shape_, data.shape_)), s);
for (index_t i = 0; i < nbatch; i += nstep_) {
Do you know what was "nstep_" doing earlier? It would help understand the problem with the earlier code.
The `col2im` method does not support such a step.
const index_t nbatch = data.size(0);
Tensor<xpu, 1, DType> workspace =
    ctx.requested[deconv::kTempSpace].get_space_typed<xpu, 1, DType>(
        Shape1(this->InitTemp(grad.shape_, data.shape_)), s);
for (index_t i = 0; i < nbatch; i += nstep_) {
  const index_t step = std::min(nstep_, nbatch - i);
Again can you tell what was the purpose of "step" in the previous code?
I think it was used to convert multiple batches of image data into columns at once in the previous library. However, that is not supported by the `col2im` method.
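To illustrate the per-sample constraint being discussed (this is a hedged NumPy sketch, not the actual mshadow/MXNet code; the function name and layout are illustrative): an `im2col`-style routine lays its column buffer out for a single image, so the batch loop has to advance one sample at a time rather than by an `nstep_`-sized chunk.

```python
import numpy as np

def im2col(img, k, stride, dilation):
    # img: (C, H, W) -- one image; a multi-sample "nstep" chunk would not
    # fit this column layout, which is why the new loop is per-sample.
    C, H, W = img.shape
    keff = dilation * (k - 1) + 1          # effective (dilated) kernel extent
    out_h = (H - keff) // stride + 1
    out_w = (W - keff) // stride + 1
    cols = np.empty((C * k * k, out_h * out_w), dtype=img.dtype)
    idx = 0
    for c in range(C):
        for ki in range(k):
            for kj in range(k):
                # gather the dilated patch positions for this kernel tap
                patch = img[c,
                            ki * dilation : ki * dilation + stride * out_h : stride,
                            kj * dilation : kj * dilation + stride * out_w : stride]
                cols[idx] = patch.ravel()
                idx += 1
    return cols
```

With dilation > 1 the row/column offsets of each kernel tap are scaled by the dilation factor, which is the part the old `unpack_patch2col` handled incorrectly.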
The code changes are consistent with the optimized way to perform the deconv operation on CPU only, but I have some questions that will help me understand what was happening earlier and why it was that way. Other than that, your code is correct and precise. Good work!
@apeforest can you please rebase and resolve the merge conflicts?
@apeforest Could you please have a look at the CI failures?
@apeforest Gentle ping...
@apeforest Can you take a look at the failing CI build?
@apeforest Could you please provide an update on this PR about your progress and thoughts so that the other community members can benefit from it. Thanks!
@karan6181 The new function im2col has a different signature and calling sequence from the old col_unpack(). The changes failed a few unit tests and I ended up re-implementing the operator itself. Given that MKLDNN is currently the default on CPU and it has no issue with the Conv2DTranspose operator, I would like to treat this issue as lower priority and get it completed in a few weeks.
@mxnet-label-bot Update[pr-work-in-progress] @apeforest Can you look into the CI failures?
@apeforest Hi! Any update on this PR? The PR is important :)
Description
The transposed convolution operator on CPU w/o MKLDNN does not work properly when dilation is set. This is because the mshadow library functions `unpack_patch2col` and `pack_col2patch` generate incorrect results with the dilation parameter. This PR replaces these two functions with the MXNet native functions `im2col` and `col2im`.
This PR fixes issue #11203
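For readers unfamiliar with the approach, here is a minimal NumPy sketch of the idea behind the fix: the forward pass of a dilated transposed convolution computed as a matrix product followed by a `col2im`-style scatter-add. This is not the MXNet source; the function names and the `(Cin, Cout, k, k)` weight layout are assumptions made for illustration.

```python
import numpy as np

def col2im(cols, C, H, W, k, stride, dilation):
    # Inverse of im2col with overlap-add: scatter each column row back
    # into the (C, H, W) image at its dilated kernel-tap offsets.
    img = np.zeros((C, H, W), dtype=cols.dtype)
    keff = dilation * (k - 1) + 1
    out_h = (H - keff) // stride + 1
    out_w = (W - keff) // stride + 1
    idx = 0
    for c in range(C):
        for ki in range(k):
            for kj in range(k):
                img[c,
                    ki * dilation : ki * dilation + stride * out_h : stride,
                    kj * dilation : kj * dilation + stride * out_w : stride] += \
                    cols[idx].reshape(out_h, out_w)
                idx += 1
    return img

def deconv2d(x, w, stride=1, dilation=1):
    # x: (Cin, Hin, Win); w: (Cin, Cout, k, k)  -- hypothetical layout.
    Cin, Hin, Win = x.shape
    _, Cout, k, _ = w.shape
    # Output spatial size of a (padding-free) dilated transposed convolution.
    Hout = stride * (Hin - 1) + dilation * (k - 1) + 1
    Wout = stride * (Win - 1) + dilation * (k - 1) + 1
    # Columns = W^T @ x, then scatter-add the columns into the output image.
    cols = w.reshape(Cin, -1).T @ x.reshape(Cin, -1)   # (Cout*k*k, Hin*Win)
    return col2im(cols, Cout, Hout, Wout, k, stride, dilation)
```

The key point is that the dilation factor scales the scatter offsets inside `col2im`; an implementation that ignores it (as the old `pack_col2patch` path effectively did) writes contributions to the wrong output locations.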
Passed the local test in the issue:
Checklist
Essentials
Please feel free to remove inapplicable items for your PR.