[v1.6] Fix the monitor_callback invalid issue during calibration with variable input shapes #18632

ciyongch · 2020-06-28T03:24:52Z

Description

When doing calibration with variable input shapes, a new executor is created whenever the current input has a different shape from the previous one. However, the callback function is bound only to the very first executor and is not passed down to the succeeding executors, which share the same symbol. As a result, calibration is skipped for those later executors.
This PR passes the callback function down to the succeeding executors to address the calibration skipping issue.
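
The behaviour described above can be illustrated with a small toy model. This is only a sketch under stated assumptions: `ToyExecutor`, its `reshape` method, the `pass_down_callback` flag and the `collect` helper are hypothetical names invented for illustration and are not the MXNet executor API or the actual patch.

```python
class ToyExecutor:
    """Hypothetical stand-in for an executor; NOT the MXNet class."""

    def __init__(self, shape):
        self.shape = shape
        self.monitor_callback = None

    def set_monitor_callback(self, callback):
        self.monitor_callback = callback

    def forward(self):
        # Report the current input shape to the monitor, if one is installed.
        if self.monitor_callback is not None:
            self.monitor_callback("output", self.shape)

    def reshape(self, new_shape, pass_down_callback):
        # A new executor is created for the new input shape.
        new_exec = ToyExecutor(new_shape)
        # The idea of the fix: hand the existing callback to the new executor.
        if pass_down_callback and self.monitor_callback is not None:
            new_exec.set_monitor_callback(self.monitor_callback)
        return new_exec


def collect(shapes, pass_down_callback):
    """Run one forward pass per input shape and record what the monitor saw."""
    seen = []
    exe = ToyExecutor(shapes[0])
    exe.set_monitor_callback(lambda name, shape: seen.append(shape))
    exe.forward()
    for shape in shapes[1:]:
        if shape != exe.shape:
            exe = exe.reshape(shape, pass_down_callback)
        exe.forward()
    return seen


shapes = [(1, 3), (2, 3), (2, 3)]
without_fix = collect(shapes, pass_down_callback=False)
with_fix = collect(shapes, pass_down_callback=True)
```

In this toy model, the monitor stops receiving data after the first shape change unless the callback is passed down, which mirrors the calibration skipping described in this PR.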

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
Changes are complete (i.e. I finished coding on this PR)
All changes have test coverage:
Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
Code is well-documented:
For user-facing API changes, API doc string has been updated.
For new C++ functions in header files, their functionalities and arguments are documented.
For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
Check the API doc at https://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

Feature1, tests, (and when applicable, API doc)
Feature2, tests, (and when applicable, API doc)

Comments

If this change is a backward incompatible change, why must this change be made.
Interesting edge cases to note here

@pengzhao-intel @TaoLv @ChaiBapchya @szha

mxnet-bot · 2020-06-28T03:24:57Z

Hey @ciyongch , Thanks for submitting the PR
All tests are already queued to run once. If tests fail, you can trigger one or more tests again with the following commands:

To trigger all jobs: @mxnet-bot run ci [all]
To trigger specific jobs: @mxnet-bot run ci [job1, job2]

CI supported jobs: [windows-gpu, website, centos-cpu, windows-cpu, unix-gpu, sanity, edge, miscellaneous, clang, centos-gpu, unix-cpu]

Note:
Only the following 3 categories can trigger CI: PR Author, MXNet Committer, Jenkins Admin.
All CI tests must pass before the PR can be merged.

pengzhao-intel · 2020-06-28T03:28:36Z

Great job root-causing this bug :)

ciyongch · 2020-06-28T06:36:14Z

@ChaiBapchya @leezu it looks like there are CI issues in the current v1.6.x branch, which already existed in the previous commit #18586. Do you know if anyone is working on this? Thanks!

edge

[2020-06-28T03:57:59.801Z] + python setup.py bdist_wheel --universal
[2020-06-28T03:57:59.801Z] Traceback (most recent call last):
[2020-06-28T03:57:59.801Z]   File "setup.py", line 23, in <module>
[2020-06-28T03:57:59.801Z]     from setuptools import find_packages # This must precede distutils
[2020-06-28T03:57:59.801Z] ImportError: No module named setuptools

unix-gpu

[2020-06-28T04:11:22.856Z] /work/runtime_functions.sh: line 2083: build_ubuntu_gpu_cuda101_cudnn7_mkldnn_cpp_test: command not found

pengzhao-intel · 2020-06-28T11:19:30Z

@sandeep-krishnamurthy @ChaiBapchya for help :)

ChaiBapchya · 2020-06-29T16:49:47Z

@ciyongch @PatricZhao
The latest commit in the 1.6 branch was erroneously merged before it passed CI. That commit tried to fix the edge/centos-cpu/gpu pipelines.

Going forward, I created another PR on the 1.6.x branch: #18597
That PR will:

revert the erroneously merged commit
fix the centos link issue
fix the edge pipeline

However, it fails on setuptools as you pointed out. I'll try to get that fixed so that we can get the CI fixed for 1.6.x.
Once it is merged, we can rebase this PR.

ChaiBapchya · 2020-07-01T06:31:57Z

It passed all 11 jobs, so why did we have to retrigger? Is codecov blocking the merge?
Also, we should try to use mxnet-bot for re-triggering specific pipelines, if any.

ciyongch · 2020-07-01T06:46:07Z

Hi @ChaiBapchya, I saw that the codecov test cases failed, and mxnet-bot doesn't support re-triggering them. I wasn't sure whether they are a merge blocker, so I just re-triggered the cases.

ChaiBapchya · 2020-07-01T06:48:19Z

I don't think that's the case.
@szha @leezu can confirm if code-cov is a blocker.
If it's not a blocker, let's get this PR merged.
Also, I'm guessing that since this fix is made to the executor,

this is part of other branches as well?
also this change doesn't have any test, is that already tested somewhere? how can we confirm?
@ciyongch

ciyongch · 2020-07-01T10:52:47Z

@mxnet-bot run ci [unix-cpu]

mxnet-bot · 2020-07-01T10:52:56Z

Jenkins CI successfully triggered : [unix-cpu]

ciyongch · 2020-07-01T10:55:49Z

this is part of other branches as well?

Yes, this is a common issue across all the current branches; I will backport the fix to the other branches as well.

also this change doesn't have any test, is that already tested somewhere? how can we confirm?

Currently, we've only verified this via a customized case, which is kind of complicated. I will try to add some tests later to cover it.

ciyongch · 2020-07-01T14:18:03Z

The Codecov failures are still there, but I don't think they should be a blocker.

sandeep-krishnamurthy · 2020-07-01T15:08:16Z

Codecov is not a blocker.

sandeep-krishnamurthy · 2020-07-01T15:10:19Z

@pengzhao-intel @TaoLv this will be good to go after your review and approval

ChaiBapchya · 2020-07-01T18:48:06Z

I will try to add some tests later to cover it.

Can we add a basic test to verify this? I guess reviewers would feel confident to approve this once they know there is a proper test to verify it and that it passes. @sandeep-krishnamurthy wdyt?

leezu · 2020-07-01T23:42:12Z

To "fix" the codecov showing up on the 1.x branches, you can include the 3 lines from https://github.com/apache/incubator-mxnet/pull/18497/files
cc @sandeep-krishnamurthy @ciyongch @ChaiBapchya

ciyongch · 2020-07-02T00:13:14Z

@ChaiBapchya We've verified the fix via an offline customized case. Anyway, it's quite reasonable to add a UT to cover this case; I will try to add it today.
@leezu thanks for pointing it out.

pengzhao-intel

LGTM

ciyongch · 2020-07-02T04:44:05Z

Hi @ChaiBapchya @leezu @pengzhao-intel @TaoLv, all the CI jobs have now passed and the UT has been added as well. Please help to merge, thanks.

ChaiBapchya

Thanks for adding the UT. LGTM!

… variable input shapes (apache#18632) * Fix the monitor_callback invalid issue during calibration with variable input shapes * retrigger CI * Add UT for monitor check and disable codecov

… variable input shapes (#18632) (#18703) * Fix the monitor_callback invalid issue during calibration with variable input shapes * retrigger CI * Add UT for monitor check and disable codecov Co-authored-by: Tao Lv <tao.a.lv@intel.com>

* * Fix einsum gradient (#18482) * [v1.7.x] Backport PRs of numpy features (#18653) * add zero grad for npi_unique (#18080) * fix np.clip scalar input case (#17788) * fix true_divide (#18393) Co-authored-by: Hao Jin <hjjn.amzn@gmail.com> Co-authored-by: Xi Wang <xidulu@gmail.com> * [v1.7.x] backport mixed type binary ops to v1.7.x (#18649) * Fix Windows GPU CI (#17962) Update Windows CI to use VS 2019 and enable x64 bit toolchain. Previously we are using an older 32 bit toolchain causing OOM errors during linking. Switching to x64 bit toolchain on the older VS version previously used by the CI was attempted in #17912 and did not work. Update to Cuda 10.2 as it is required by VS 2019. Switch to ninja-build on Windows to speed up build as ninja-build is now preinstalled. Remove logic to install cmake 3.16 on every PR as cmake 3.17 is now preinstalled. Add build retrials due to cuda thrust + VS2019 flakyness. Co-authored-by: vexilligera <vexilligera@gmail.com> * backport mixed type Co-authored-by: Leonard Lausen <lausen@amazon.com> Co-authored-by: vexilligera <vexilligera@gmail.com> * revise activations (#18700) * [v1.6] Fix the monitor_callback invalid issue during calibration with variable input shapes (#18632) (#18703) * Fix the monitor_callback invalid issue during calibration with variable input shapes * retrigger CI * Add UT for monitor check and disable codecov Co-authored-by: Tao Lv <tao.a.lv@intel.com> * Fail build_windows.py if all retries failed (#18177) * Update to thrust 1.9.8 on Windows (#18218) * Update to thrust 1.9.8 on Windows * Remove debug logic * Re-enable build retries on MSVC (#18230) Updating thrust alone did not help. Similar issues (though less often) still occur with updated thrust, and also with nvidia cub. 
Tracked upstream at NVIDIA/thrust#1090 Co-authored-by: Ke Han <38852697+hanke580@users.noreply.github.com> Co-authored-by: Xingjian Shi <xshiab@connect.ust.hk> Co-authored-by: Hao Jin <hjjn.amzn@gmail.com> Co-authored-by: Xi Wang <xidulu@gmail.com> Co-authored-by: Yijun Chen <chenyijun0902@gmail.com> Co-authored-by: vexilligera <vexilligera@gmail.com> Co-authored-by: ciyong <ciyong.chen@intel.com> Co-authored-by: Tao Lv <tao.a.lv@intel.com>

* * Fix einsum gradient (#18482) * [v1.7.x] Backport PRs of numpy features (#18653) * add zero grad for npi_unique (#18080) * fix np.clip scalar input case (#17788) * fix true_divide (#18393) Co-authored-by: Hao Jin <hjjn.amzn@gmail.com> Co-authored-by: Xi Wang <xidulu@gmail.com> * [v1.7.x] backport mixed type binary ops to v1.7.x (#18649) * Fix Windows GPU CI (#17962) Update Windows CI to use VS 2019 and enable x64 bit toolchain. Previously we are using an older 32 bit toolchain causing OOM errors during linking. Switching to x64 bit toolchain on the older VS version previously used by the CI was attempted in #17912 and did not work. Update to Cuda 10.2 as it is required by VS 2019. Switch to ninja-build on Windows to speed up build as ninja-build is now preinstalled. Remove logic to install cmake 3.16 on every PR as cmake 3.17 is now preinstalled. Add build retrials due to cuda thrust + VS2019 flakyness. Co-authored-by: vexilligera <vexilligera@gmail.com> * backport mixed type Co-authored-by: Leonard Lausen <lausen@amazon.com> Co-authored-by: vexilligera <vexilligera@gmail.com> * revise activations (#18700) * [v1.6] Fix the monitor_callback invalid issue during calibration with variable input shapes (#18632) (#18703) * Fix the monitor_callback invalid issue during calibration with variable input shapes * retrigger CI * Add UT for monitor check and disable codecov Co-authored-by: Tao Lv <tao.a.lv@intel.com> * Fail build_windows.py if all retries failed (#18177) * Update to thrust 1.9.8 on Windows (#18218) * Update to thrust 1.9.8 on Windows * Remove debug logic * Re-enable build retries on MSVC (#18230) Updating thrust alone did not help. Similar issues (though less often) still occur with updated thrust, and also with nvidia cub. 
Tracked upstream at NVIDIA/thrust#1090 Co-authored-by: Ke Han <38852697+hanke580@users.noreply.github.com> Co-authored-by: Xingjian Shi <xshiab@connect.ust.hk> Co-authored-by: Hao Jin <hjjn.amzn@gmail.com> Co-authored-by: Xi Wang <xidulu@gmail.com> Co-authored-by: Yijun Chen <chenyijun0902@gmail.com> Co-authored-by: vexilligera <vexilligera@gmail.com> Co-authored-by: ciyong <ciyong.chen@intel.com> Co-authored-by: Tao Lv <tao.a.lv@intel.com> Co-authored-by: Leonard Lausen <lausen@amazon.com> Co-authored-by: Ke Han <38852697+hanke580@users.noreply.github.com> Co-authored-by: Xingjian Shi <xshiab@connect.ust.hk> Co-authored-by: Hao Jin <hjjn.amzn@gmail.com> Co-authored-by: Xi Wang <xidulu@gmail.com> Co-authored-by: Yijun Chen <chenyijun0902@gmail.com> Co-authored-by: vexilligera <vexilligera@gmail.com> Co-authored-by: ciyong <ciyong.chen@intel.com> Co-authored-by: Tao Lv <tao.a.lv@intel.com>

ciyongch requested a review from szha as a code owner June 28, 2020 03:24

ChaiBapchya mentioned this pull request Jun 30, 2020

[CI][v1.6.x] Fix failing CI pipelines #18597

Merged

Fix the monitor_callback invalid issue during calibration with variable input shapes

72ba804

ciyongch force-pushed the fix_calibration_v1.6 branch from 7d554b8 to 72ba804 Compare July 1, 2020 01:43

retrigger CI

23ced60

Add UT for monitor check and disable codecov

b88371e

pengzhao-intel approved these changes Jul 2, 2020

ChaiBapchya approved these changes Jul 2, 2020

TaoLv approved these changes Jul 2, 2020

TaoLv merged commit e503704 into apache:v1.6.x Jul 2, 2020

ciyongch mentioned this pull request Jul 14, 2020

[v1.7] Fix the monitor_callback invalid issue during calibration with variable input shapes #18703

Merged

ciyongch mentioned this pull request Jul 14, 2020

[v1.x]Fix the monitor_callback invalid issue during calibration with variable input shapes #18705

Merged
