Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[ONNX export] Fixing spatial export for batchnorm #17711

Merged
merged 20 commits into from
Apr 7, 2020

Conversation

vinitra-zz
Copy link
Contributor

@vinitra-zz vinitra-zz commented Feb 27, 2020

Description

In the ONNX model zoo, we noticed that models like ArcFace and DUC that have been exported from mxnet with batchnorm operators are not treating spatial mode correctly. Without this PR, this model export pipeline is broken.

onnx/models#156
onnx/models#91 (comment)

cc: @ankkhedia @hariharans29 @snnn

Quoting from MxNet BatchNorm documentation: "One of the most popular normalization techniques is Batch Normalization, usually called BatchNorm for short. We normalize the activations across all samples in a batch for each of the channels independently."

The comment in the exporter refers to mean and variance per feature, instead of per channel. Fixing this will mean that spatial mode should be 1, instead of 0 in the ONNX export.

Changing spatial to 1 fixed these models, in accordance with the issue referenced above.

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • Changes are complete (i.e. I finished coding on this PR)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain the what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
  • Check the API doc at https://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Fixing onnx export for batchnorm spatial mode

@snnn
Copy link

snnn commented Feb 27, 2020

LGTM.

Copy link

@hariharans29 hariharans29 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM too

@vinitra-zz
Copy link
Contributor Author

@szha, can you take a look?

@RuRo
Copy link
Contributor

RuRo commented Mar 6, 2020

As far as I understand, spatial=1 is the default in all versions of the BatchNormalization operator. The newer versions don't have this attribute, but still do a computation as if spatial=1, so I think we are better off removing the explicit attribute and relying on the default behaviour, since specifying it explicitly will make it harder to convert to newer ONNX opset versions.

What do you think?

Edit:

The comment says that the default for ONNX is to compute across all spatial features, but the docs here say spatial : int (default is 1) for all BatchNormalization ops that have this attribute. Admittedly, the ONNX documentation wording is really confusing.

@vinitra-zz
Copy link
Contributor Author

As far as I understand, spatial=1 is the default in all versions of the BatchNormalization operator. The newer versions don't have this attribute, but still do a computation as if spatial=1, so I think we are better off removing the explicit attribute and relying on the default behaviour, since specifying it explicitly will make it harder to convert to newer ONNX opset versions.

What do you think?

Edit:

The comment says that the default for ONNX is to compute across all spatial features, but the docs here say spatial : int (default is 1) for all BatchNormalization ops that have this attribute. Admittedly, the ONNX documentation wording is really confusing.

This makes sense! I've deprecated the attribute in my recent PR update, and provided more clarity in the comments.

@vinitra-zz
Copy link
Contributor Author

@haojin2 @hzfan -- this PR has been waiting for a review for a while! Can you take a look / refer us to the right folks? I noticed you have been active approvers of recent PRs.

@sandeep-krishnamurthy
Copy link
Contributor

@vandanavk - Can you please help review this changes? Thanks!

Copy link
Contributor

@sandeep-krishnamurthy sandeep-krishnamurthy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

  1. Waiting for Vandana's review.
  2. Restarted flaky windows build.

@sandeep-krishnamurthy
Copy link
Contributor

@vinitra - Thanks for your contributions. can you please sync to master?

@sandeep-krishnamurthy
Copy link
Contributor

@ChaiBapchya - windows-gpu build is failing consistently here. Any suggestions please? Thanks.

@snnn
Copy link

snnn commented Mar 26, 2020

I don't know why it just started failing, but I know how to fix it.

The root cause is: you're using 32 bits compiler for 64 bits target. A 32 bits program can only have 2GB address space, so you may run out it quickly.

[2020-03-25T21:40:57.615Z] -- Check for working C compiler: C:/Program Files (x86)/Microsoft Visual Studio 14.0/VC/bin/x86_amd64/cl.exe

Please add "-A x64 -T host=x64" to your cmake command line args.

@ChaiBapchya
Copy link
Contributor

@snnn Thanks for the suggestion. Actually @leezu in his #17912 tried that toolchain fix but that still failed (I think coz of removing JOM)..

@vinitra-zz
Copy link
Contributor Author

@mxnet-bot run ci [unix-gpu, windows-gpu]

@mxnet-bot
Copy link

Jenkins CI successfully triggered : [unix-gpu, windows-gpu]

@vinitra-zz
Copy link
Contributor Author

@mxnet-bot run ci [all]

@mxnet-bot
Copy link

Jenkins CI successfully triggered : [unix-gpu, centos-cpu, windows-cpu, sanity, centos-gpu, edge, website, unix-cpu, clang, windows-gpu, miscellaneous]

@vinitra-zz
Copy link
Contributor Author

Waiting on #17962 for CI fixes.

@sandeep-krishnamurthy
Copy link
Contributor

@vinitra - Sorry for the inconvenience caused. A fix for windows-gpu is now submitted.
Can I please request to rebase this PR so the change is reflected here?
Thanks again for the contribution and apologies as your PR took too long to be merged.

@vinitra-zz
Copy link
Contributor Author

@vinitra - Sorry for the inconvenience caused. A fix for windows-gpu is now submitted.
Can I please request to rebase this PR so the change is reflected here?
Thanks again for the contribution and apologies as your PR took too long to be merged.

No worries! Thanks for the help.

@sandeep-krishnamurthy
Copy link
Contributor

@mxnet-bot run ci [unix-gpu]

@mxnet-bot
Copy link

Jenkins CI successfully triggered : [unix-gpu]

@mxnet-bot
Copy link

None of the jobs entered are supported.
Jobs entered by user: [macosx-x86_64]
CI supported Jobs: [sanity, miscellaneous, centos-gpu, edge, unix-cpu, unix-gpu, website, centos-cpu, windows-cpu, windows-gpu, clang]

@ChaiBapchya
Copy link
Contributor

@vinitra haven't configured mxnet-bot to trigger the macosx yet. [It's not mandatory for merge either]. But if needed we can increase the scope to incorporate that trigger.

@sandeep-krishnamurthy
Copy link
Contributor

@mxnet-bot run ci [unix-cpu]

@mxnet-bot
Copy link

Jenkins CI successfully triggered : [unix-cpu]

@sandeep-krishnamurthy sandeep-krishnamurthy merged commit 79c576b into apache:master Apr 7, 2020
mk-61 pushed a commit to mk-61/incubator-mxnet that referenced this pull request Apr 7, 2020
* fixing spatial export for batchnorm

* retrigger CI

* fixing broken pylint

* retrigger build

* deprecating spatial attribute in exporter so default behavior of spatial=1 is conveyed
@vinitra-zz vinitra-zz deleted the fix-batchnorm-onnx branch April 7, 2020 22:09
@ChaiBapchya
Copy link
Contributor

@sandeep-krishnamurthy This fix got merged in master. I checked it's neither in v1.7.x nor in previous releases [e.g. 1.6.0]
So customers trying to export MX models [which include batchnorm op] to ONNX would fail for all release binaries[such as 1.5.1, 1.6.0, 1.7.0]
If we cherry-pick this into the 1.x & v1.7.x branch [atleast nightly binaries won't have this issue]

ChaiBapchya pushed a commit to ChaiBapchya/mxnet that referenced this pull request Aug 3, 2020
* fixing spatial export for batchnorm

* retrigger CI

* fixing broken pylint

* retrigger build

* deprecating spatial attribute in exporter so default behavior of spatial=1 is conveyed
ChaiBapchya pushed a commit to ChaiBapchya/mxnet that referenced this pull request Aug 3, 2020
* fixing spatial export for batchnorm

* retrigger CI

* fixing broken pylint

* retrigger build

* deprecating spatial attribute in exporter so default behavior of spatial=1 is conveyed
szha pushed a commit that referenced this pull request Aug 3, 2020
* fixing spatial export for batchnorm

* retrigger CI

* fixing broken pylint

* retrigger build

* deprecating spatial attribute in exporter so default behavior of spatial=1 is conveyed

Co-authored-by: Vinitra Swamy <vinitras@gmail.com>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants