This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Support SyncBatchNorm5D #14542

Merged — 27 commits merged into apache:master on Apr 2, 2019

Conversation

wkcn
Member

@wkcn wkcn commented Mar 27, 2019

Description

Hi there!
Currently, SyncBatchNorm does not support inputs with five or more dimensions.
This PR fixes that.
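The fix amounts to computing the normalization statistics over the batch and all spatial axes regardless of input rank. A minimal NumPy sketch of that idea (`batch_norm_nd` is a hypothetical illustration, not the PR's actual code):

```python
import numpy as np

def batch_norm_nd(x, gamma, beta, eps=1e-5):
    """Per-channel batch norm for input of any rank >= 2.

    x has layout (N, C, *spatial); statistics are computed over the batch
    and all spatial axes, which is how a 5D (N, C, D, H, W) input reduces
    to the ordinary 4D/2D case.
    """
    n, c = x.shape[:2]
    # Collapse all spatial dimensions so every rank looks like (N, C, S).
    xr = x.reshape(n, c, -1)
    mean = xr.mean(axis=(0, 2), keepdims=True)
    var = xr.var(axis=(0, 2), keepdims=True)
    out = (xr - mean) / np.sqrt(var + eps)
    out = gamma.reshape(1, c, 1) * out + beta.reshape(1, c, 1)
    return out.reshape(x.shape)
```

With `gamma = 1` and `beta = 0`, each channel of the output has (approximately) zero mean and unit variance, whatever the number of spatial dimensions.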

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • Tiny change — the PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (not required for PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change or have been fixed to be compatible with it

Changes

  • Modify SyncBatchNorm to support 2+D inputs
  • Remove the default-value key for SyncBatchNorm
  • Use cuDNN to compute BatchNorm on 5+D inputs
  • Modify the test case test_sync_batchnorm
  • Add an operator test for BatchNorm and SyncBatchNorm
  • Reformat tests/python/gpu/test_gluon_gpu.py

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@wkcn wkcn added the pr-work-in-progress PR is still work in progress label Mar 27, 2019
@wkcn wkcn changed the title support SyncBatchNorm5D [WIP] Support SyncBatchNorm5D Mar 27, 2019
@wkcn wkcn changed the title [WIP] Support SyncBatchNorm5D Support SyncBatchNorm5D Mar 27, 2019
@wkcn wkcn added pr-awaiting-review PR is waiting for code review and removed pr-work-in-progress PR is still work in progress labels Mar 27, 2019
@wkcn wkcn requested review from zhreshold and szha March 27, 2019 15:50
Member

@zhreshold zhreshold left a comment


So it will support only 3D to 5D? What's the limitation here?

src/operator/contrib/sync_batch_norm-inl.h
@zhreshold
Member

looks good

@wkcn
Member Author

wkcn commented Mar 28, 2019

There is a bug when the input is 2-dimensional.

======================================================================
FAIL: test_gluon_gpu.test_sync_batchnorm
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/work/mxnet/tests/python/gpu/../unittest/common.py", line 177, in test_new
    orig_test(*args, **kwargs)
  File "/work/mxnet/tests/python/gpu/test_gluon_gpu.py", line 373, in test_sync_batchnorm
    num_devices=ndev, cuda=cuda)
  File "/work/mxnet/tests/python/gpu/test_gluon_gpu.py", line 352, in _check_batchnorm_result
    atol=1e-3, rtol=1e-3)
  File "/work/mxnet/python/mxnet/test_utils.py", line 495, in assert_almost_equal
    raise AssertionError(msg)
AssertionError:
Items are not equal:
Error 1.597678 exceeds tolerance rtol=0.001000, atol=0.001000.  Location of maximum error: (0,), a=0.912201, b=0.909150
 a: array([0.9122006, 0.906133 ], dtype=float32)
 b: array([0.9091504, 0.9045997], dtype=float32)
-------------------- >> begin captured logging << --------------------
common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=868530648 to reproduce.

@wkcn wkcn changed the title Support SyncBatchNorm5D [WIP, Don't merge] Support SyncBatchNorm5D Mar 28, 2019
@wkcn
Member Author

wkcn commented Mar 29, 2019

There seems to be a bug in SyncBatchNorm when the spatial shape is 1x1, 1xn, or nx1. I am checking it.

@wkcn wkcn added pr-work-in-progress PR is still work in progress and removed pr-awaiting-review PR is waiting for code review labels Mar 30, 2019
@wkcn
Member Author

wkcn commented Mar 30, 2019

The bug seems to be addressed, although I do not yet know its root cause.

I will add a test for multi-output.

@wkcn wkcn changed the title [WIP, Don't merge] Support SyncBatchNorm5D Support SyncBatchNorm5D Mar 30, 2019
@wkcn wkcn added pr-awaiting-review PR is waiting for code review and removed pr-work-in-progress PR is still work in progress labels Mar 30, 2019
@wkcn
Member Author

wkcn commented Mar 31, 2019

@zhreshold @szha Hi! I have updated the PR and added adequate unit tests.
Would you mind reviewing it? Thank you!
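The essential property such tests check (sketched below in NumPy, under the assumption that the real test compares multi-device SyncBatchNorm against single-device BatchNorm) is that statistics aggregated from per-device shards match the full-batch statistics:

```python
import numpy as np

def synced_stats(shards):
    """Hypothetical sketch: aggregate per-device partial sums the way a
    synchronized batch norm would, and return the global mean/variance."""
    n = sum(s.size for s in shards)                      # total element count
    mean = sum(s.sum() for s in shards) / n              # global mean
    var = sum((s ** 2).sum() for s in shards) / n - mean ** 2  # E[x^2] - mean^2
    return mean, var
```

Splitting a batch across two "devices" and aggregating must reproduce the moments of the full batch, which is what makes multi-device output comparable to single-device output.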

@zhreshold
Member

If @szha has no complaints, I can merge it in 24 hours.

@zhreshold zhreshold merged commit e2f5b47 into apache:master Apr 2, 2019
@zhreshold
Member

thanks @wkcn , this is merged!

ZhennanQin pushed a commit to ZhennanQin/incubator-mxnet that referenced this pull request Apr 3, 2019
* support SyncBatchNorm5D

* fix

* update testcase and reformat code

* retrigger CI

* update test case

* test

* Retrigger CI

* disable cudnn for batchnorm

* fix BatchNorm(cudnn)

* fix build

* Remove a testcase

* Update sync_batch_norm-inl.h

* update unittest

* update unittest

* update test

* fix test

* change atol and rtol

* BN(cudnn) 5d

* update test

* test

* Testing

* Update batch_norm.cu

* test cudnnoff

* Update test_operator.py

* update BN! : )
nswamy pushed a commit that referenced this pull request Apr 5, 2019
input2grad.asnumpy(), atol=atol, rtol=rtol)

cfgs = [(1, False)]
num_gpus = mx.context.num_gpus()
Member


This line requires a GPU to be present when CUDA is installed; otherwise it throws this error:

======================================================================
ERROR: test_gluon.test_sync_batchnorm
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/home/travis/build/dmlc/mxnet-distro/mxnet-build/tests/python/unittest/common.py", line 177, in test_new
    orig_test(*args, **kwargs)
  File "/home/travis/build/dmlc/mxnet-distro/mxnet-build/tests/python/unittest/test_gluon.py", line 693, in test_sync_batchnorm
    num_gpus = mx.context.num_gpus()
  File "/home/travis/build/dmlc/mxnet-distro/mxnet/context.py", line 258, in num_gpus
    check_call(_LIB.MXGetGPUCount(ctypes.byref(count)))
  File "/home/travis/build/dmlc/mxnet-distro/mxnet/base.py", line 254, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
MXNetError: [11:47:54] include/mxnet/base.h:427: Check failed: e == cudaSuccess (30 vs. 0) :  CUDA: unknown error
Stack trace:
  [bt] (0) /home/travis/build/dmlc/mxnet-distro/mxnet/libmxnet.so(+0x4b60fb) [0x7f8d608830fb]
  [bt] (1) /home/travis/build/dmlc/mxnet-distro/mxnet/libmxnet.so(+0x2440eec) [0x7f8d6280deec]
  [bt] (2) /home/travis/build/dmlc/mxnet-distro/mxnet/libmxnet.so(MXGetGPUCount+0x19) [0x7f8d6280df79]
  [bt] (3) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7f8d9a2e1c7c]
  [bt] (4) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x1fc) [0x7f8d9a2e15ac]
  [bt] (5) /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(_ctypes_callproc+0x48e) [0x7f8d9a4f85fe]
  [bt] (6) /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(+0x15f9e) [0x7f8d9a4f9f9e]
  [bt] (7) /usr/bin/python(PyEval_EvalFrameEx+0x965) [0x4c84a5]
  [bt] (8) /usr/bin/python(PyEval_EvalCodeEx+0x2ac) [0x4cfedc]
-------------------- >> begin captured logging << --------------------
common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1179889124 to reproduce.
--------------------- >> end captured logging << ---------------------
----------------------------------------------------------------------

Can you please move this test to tests/python/gpu/test_gluon_contrib_gpu.py? @wkcn @zhreshold
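Besides relocating the test, the collection-time crash could be avoided by treating a failing GPU-count call as "no GPUs". A hedged sketch (`safe_num_gpus` is a hypothetical helper; in MXNet the counter would be `mx.context.num_gpus`):

```python
def safe_num_gpus(counter):
    """Hypothetical guard: if the GPU-count call raises (e.g. MXNetError
    'CUDA: unknown error' on a CUDA-installed but GPU-less host), report
    zero GPUs so the test is skipped instead of erroring out."""
    try:
        return counter()
    except Exception:
        return 0
```

The test would then call `safe_num_gpus(mx.context.num_gpus)` and skip the multi-GPU configurations when the result is zero.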

Member

I was testing it on a platform without a GPU but with CUDA installed. In any case, the test seems misplaced.

haohuanw pushed a commit to haohuanw/incubator-mxnet that referenced this pull request Jun 23, 2019
Labels
pr-awaiting-review PR is waiting for code review