Conversation
Is this PR to fix the problem in #10089?
Does it fix this as well?
@zheng-da test_bce_loss passes with the fix: OK
@jinhuang415 I think we can enable all activations in this PR and close the previous #10089.
@pengzhao-intel Added a change to enable all MKLDNN activations.
Agreed. I have closed #10089.
@@ -82,7 +82,7 @@ void ActivationGradComputeExCPU(const nnvm::NodeAttrs& attrs,
   const ActivationParam& param = nnvm::get<ActivationParam>(attrs.parsed);
   if (SupportMKLDNN(inputs[0])) {
     MKLDNN_OPCHECK_INIT(true, outputs.size(), inputs, outputs);
-    MKLDNNActivationBackward(attrs, ctx, inputs[0], inputs[1], req[0],
+    MKLDNNActivationBackward(attrs, ctx, inputs[0], inputs[2], req[0],
                              outputs[0]);
inputs[2] doesn't exist. You need to modify ActivationGrad to pass the input data to backward.
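For context, here is a minimal sketch of the kind of change being requested, assuming the ActivationGrad functor / MakeGradNode pattern used for registering the activation gradient node in MXNet (the guard macro, index layout, and exact helper names are assumptions based on that pattern, not a verbatim excerpt of this PR; it relies on MXNet/nnvm headers rather than being standalone):

```cpp
// Sketch: append the forward input (activation::kData) to the gradient
// node's inputs so the backward pass can see the original data.
// Assumes mxnet/nnvm headers providing nnvm::NodeEntry, nnvm::NodePtr,
// MakeGradNode, and the activation enums.
struct ActivationGrad {
  const char *op_name;
  std::vector<nnvm::NodeEntry> operator()(const nnvm::NodePtr& n,
                                          const std::vector<nnvm::NodeEntry>& ograds) const {
    // heads[0] = output gradient, heads[1] = forward output (kOut)
    std::vector<nnvm::NodeEntry> heads(ograds.begin(), ograds.end());
    heads.emplace_back(nnvm::NodeEntry{n, activation::kOut, 0});
#if MXNET_USE_MKLDNN == 1
    // Also pass the forward input (kData) so the MKLDNN backward path can
    // read it as inputs[2]; the exact index is an assumption here.
    heads.push_back(n->inputs[activation::kData]);
#endif
    return MakeGradNode(op_name, n, heads, n->attrs.dict);
  }
};
```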
Thanks for the comments. I have updated the diff to pass the input data and updated a few other functions as well.
Looks good to me now.
src/operator/nn/mkldnn/mkldnn_act.cc (Outdated)
  // problems.
  return param.act_type == activation::kReLU;
#if 0
  return param.act_type == activation::kReLU
      || param.act_type == activation::kSigmoid
      || param.act_type == activation::kSoftReLU;
Why is tanh not enabled?
@TaoLv I added support for tanh and it passes the UT.
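For reference, a sketch of what the check presumably looks like with all four types enabled; the function name SupportMKLDNNAct and the activation::kTanh enum value are assumptions based on the snippet above and this discussion, not a verbatim excerpt of the final diff:

```cpp
// Presumed final form of the MKLDNN support check once tanh is enabled too.
bool SupportMKLDNNAct(const ActivationParam& param) {
  return param.act_type == activation::kReLU
      || param.act_type == activation::kSigmoid
      || param.act_type == activation::kSoftReLU
      || param.act_type == activation::kTanh;
}
```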
Tests for triggering previous bugs?
There is no previous bug. MKLDNN was just disabled for these cases.
OK. Feel free to dismiss my previous review in this PR if the change is already covered by tests.
@szha @piiswrong The related test case is tests/python/gpu/test_operator_gpu.py:test_activation_with_type. Previously it would fail if we enabled sigmoid/softrelu/tanh for MKLDNN; after the fix the test passes, so we can enable sigmoid/softrelu/tanh for MKLDNN now.
@piiswrong MKLDNN activation backward uses the input data (activation::kData) to compute in_grad, but the original code provided the output data (activation::kOut). That's why it only worked for relu (and I'm not sure why even that always works). This PR fixes this bug, and now we can use MKLDNN for the other activation types.
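To make the "works only for relu" point concrete, here is a small standalone C++ illustration (not MXNet or MKLDNN code, just the math) of a backward step that recomputes the forward pass from whatever it is given as the source data; the specific numbers are only for demonstration:

```cpp
// Feeding the forward *output* to a backward routine that expects the
// forward *input* is only harmless for ReLU, because the sign of the
// output matches the sign of the input; for sigmoid the gradient changes.
#include <cmath>
#include <cstdio>

static double sigmoid(double x) { return 1.0 / (1.0 + std::exp(-x)); }

// MKLDNN-style backward: recompute the forward from "src", then derive the gradient.
static double sigmoid_grad_from_src(double src) {
  double y = sigmoid(src);
  return y * (1.0 - y);
}
static double relu_grad_from_src(double src) { return src > 0.0 ? 1.0 : 0.0; }

int main() {
  double x = 2.0;
  double y_sig = sigmoid(x);          // forward output of sigmoid
  double y_relu = x > 0.0 ? x : 0.0;  // forward output of relu

  // Correct: pass the input. Buggy: pass the output where the input is expected.
  std::printf("sigmoid grad (src=x): %f  (src=y, buggy): %f\n",
              sigmoid_grad_from_src(x), sigmoid_grad_from_src(y_sig));
  // For ReLU both choices yield the same gradient mask.
  std::printf("relu grad    (src=x): %f  (src=y, buggy): %f\n",
              relu_grad_from_src(x), relu_grad_from_src(y_relu));
  return 0;
}
```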
-            assert_almost_equal(out1.asnumpy(), out2.asnumpy())
-            assert_almost_equal(out1.asnumpy(), out3.asnumpy())
+            assert_almost_equal(out1.asnumpy(), out2.asnumpy(), rtol=1e-3)
+            assert_almost_equal(out1.asnumpy(), out3.asnumpy(), rtol=1e-3)
Maybe we should set a smaller tolerance. Changing from 1e-5 to 1e-3 seems like a big jump.
It would still fail intermittently if set to 1e-4. Checking another similar function, check_consistency(), its threshold is set to 1e-3 for FP32; since the data type in this test_lambda() case is FP32 (mx.nd.random.uniform() outputs FP32 by default), I set rtol to 1e-3 as well.
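As a rough illustration of what the rtol change means, here is a toy C++ check using a generic relative-tolerance formula; the exact formula in MXNet's assert_almost_equal/check_consistency may differ, and the sample values are hypothetical:

```cpp
// Toy relative-tolerance comparison: the same ~2e-5 discrepancy fails at
// rtol=1e-5 but passes at rtol=1e-3. Values below are made up for FP32 scale.
#include <cmath>
#include <cstdio>

static bool close(float a, float b, float rtol, float atol = 1e-8f) {
  return std::fabs(a - b) <= atol + rtol * std::fabs(b);
}

int main() {
  float ref = 0.761594f;  // e.g. tanh(1.0) in FP32
  float got = 0.761614f;  // hypothetical MKLDNN result, off by ~2e-5
  std::printf("rtol=1e-5: %s\n", close(got, ref, 1e-5f) ? "pass" : "fail");
  std::printf("rtol=1e-3: %s\n", close(got, ref, 1e-3f) ? "pass" : "fail");
  return 0;
}
```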
* Fix MKLDNN sigmoid/softrelu issue
* Enable Sigmoid and SoftRelu for MKLDNN
* Add activation kData for backward calculation for MKLDNN
* Add tanh support for MKLDNN activation
* Adjust rtol to pass tanh tests for MKLDNN
Description
This PR fixes the sigmoid and softrelu test failures for MKLDNN (related test case: tests/python/gpu/test_operator_gpu.py:test_activation_with_type). The MKLDNN eltwise backward primitive requires the activation's input data rather than its output data (it re-runs the activation forward on the input data and computes the gradient from that result), so the code needs to be changed to pass the input data. Tests pass for sigmoid/relu/softrelu after the fix.
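To show where that requirement comes from at the API level, here is a rough sketch in the style of the old MKL-DNN 0.x C++ interface that MXNet used at the time: the backward descriptor is built around the data (input) memory descriptor and the backward primitive consumes the forward src plus diff_dst, not the forward output. Constructor signatures, enum names, and shapes here are written from memory and should be treated as assumptions rather than an excerpt of mkldnn_act.cc:

```cpp
// Hedged sketch of MKL-DNN 0.x eltwise forward/backward descriptor setup.
#include <mkldnn.hpp>

int main() {
  using namespace mkldnn;
  engine eng(engine::cpu, 0);

  // Example shape only; the real code derives this from the NDArray.
  memory::desc data_md({32, 16, 8, 8}, memory::data_type::f32,
                       memory::format::nchw);

  // Forward descriptor doubles as the "hint" for the backward primitive_desc.
  eltwise_forward::desc fwd_desc(prop_kind::forward_training,
                                 algorithm::eltwise_logistic, data_md, 0.f, 0.f);
  eltwise_forward::primitive_desc fwd_pd(fwd_desc, eng);

  // Backward is defined over the *data* (input) descriptor; the resulting
  // primitive takes src (forward input) and diff_dst, and writes diff_src.
  eltwise_backward::desc bwd_desc(algorithm::eltwise_logistic, data_md,
                                  data_md, 0.f, 0.f);
  eltwise_backward::primitive_desc bwd_pd(bwd_desc, eng, fwd_pd);
  (void)bwd_pd;
  return 0;
}
```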
Checklist
Essentials
Please feel free to remove inapplicable items for your PR.