This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[v1.x] Backport Unittest tolerance handling improvements (#18694). Also test seeding (#18762). #19148

Merged 12 commits into apache:v1.x on Sep 17, 2020

Conversation

DickJC123
Contributor

@DickJC123 DickJC123 commented Sep 15, 2020

Description

This backport prepares MXNet 1.8 to be built against CUDA 11 and cuDNN 8 and run on A100 GPUs, which employ TensorFloat-32 (TF32) by default. See PR #18694 for full details.
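For readers wondering why TF32 affects test tolerances: TF32 keeps float32's 8-bit exponent but only a 10-bit mantissa, so a float32 GEMM or convolution run in TF32 is roughly as accurate as float16. Below is a minimal sketch of default tolerances chosen as a function of dtype and context. The names `sketch_effective_dtype`, `sketch_default_rtol`, and `sketch_assert_almost_equal`, the `on_gpu` / `tf32_enabled` flags, and every tolerance value are hypothetical illustrations, not the actual test_utils.py code.

```python
import numpy as np

# Illustrative relative tolerances per dtype (assumed values, not MXNet's).
_SKETCH_RTOL = {
    np.dtype(np.float16): 1e-2,
    np.dtype(np.float32): 1e-4,
    np.dtype(np.float64): 1e-5,
}


def sketch_effective_dtype(dtype, on_gpu=False, tf32_enabled=False):
    """Return the dtype whose precision a result can be expected to match."""
    dtype = np.dtype(dtype)
    # TF32 has a 10-bit mantissa, the same as float16, so float32 math
    # executed in TF32 on a GPU is treated as float16-accurate here.
    if dtype == np.float32 and on_gpu and tf32_enabled:
        return np.dtype(np.float16)
    return dtype


def sketch_default_rtol(dtype, on_gpu=False, tf32_enabled=False):
    """Look up a default tolerance; non-float dtypes get an exact comparison."""
    eff = sketch_effective_dtype(dtype, on_gpu, tf32_enabled)
    return _SKETCH_RTOL.get(eff, 0.0)


def sketch_assert_almost_equal(a, b, on_gpu=False, tf32_enabled=False):
    """Compare two arrays using the looser of their default tolerances."""
    a, b = np.asarray(a), np.asarray(b)
    tol = max(sketch_default_rtol(a.dtype, on_gpu, tf32_enabled),
              sketch_default_rtol(b.dtype, on_gpu, tf32_enabled))
    # For simplicity the sketch reuses the same value for rtol and atol.
    np.testing.assert_allclose(a, b, rtol=tol, atol=tol)
```

With this shape, a test that compares float32 outputs needs no hand-tuned tolerance: the same call is strict on CPU and automatically looser when the data was produced under TF32.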

During the development of the original PR, I fixed numerous other CI issues that had kept CI from passing. When that PR was accepted, I was still working on a couple of additional fixes, which I put into follow-up PR #18762, "Improve test seeding and robustness in test_numpy_interoperability.py". To help get a passing CI, this PR backports that as well.
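To illustrate the seeding concern, here is a minimal sketch of a per-test seeding decorator that reports the seed on failure and restores the global RNG state afterwards, so that one test cannot fix the seed for everything that runs after it. The decorator name `sketch_with_seed` and the precedence between the `MXNET_TEST_SEED` environment variable and the explicit argument are assumptions made for this example; it is not the `with_seed()` implementation in the test suite.

```python
import functools
import logging
import os
import random

import numpy as np


def sketch_with_seed(seed=None):
    """Hypothetical decorator: seed one test, then restore prior RNG state."""
    def decorator(test_fn):
        @functools.wraps(test_fn)
        def wrapper(*args, **kwargs):
            env_seed = os.environ.get("MXNET_TEST_SEED")
            if env_seed is not None:
                this_seed = int(env_seed)      # assumed: env var wins
            elif seed is not None:
                this_seed = seed
            else:
                # SystemRandom does not depend on the global RNG state.
                this_seed = random.SystemRandom().randint(0, 2**31 - 1)

            # Save global state so later tests are not affected by this seed.
            saved_np = np.random.get_state()
            saved_py = random.getstate()
            np.random.seed(this_seed)
            random.seed(this_seed)
            try:
                return test_fn(*args, **kwargs)
            except Exception:
                # WARNING level so the seed shows up in default CI logs.
                logging.warning("Test failed; reproduce with MXNET_TEST_SEED=%d",
                                this_seed)
                raise
            finally:
                np.random.set_state(saved_np)
                random.setstate(saved_py)
        return wrapper
    return decorator
```

A test decorated with `@sketch_with_seed()` gets a fresh random seed on each run, while `@sketch_with_seed(seed=123)` pins the seed for that test only.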

@samskalicky @anirudh2290 @ChaiBapchya @ptrendx

Checklist

Essentials

  • [X] PR's title starts with a category (e.g. [BUGFIX], [MODEL], [TUTORIAL], [FEATURE], [DOC], etc.)
  • [X] Changes are complete (i.e. I finished coding on this PR)
  • [X] All changes have test coverage
  • [X] Code is well-documented

Changes

  • Feature1, tests, (and when applicable, API doc)
  • Feature2, tests, (and when applicable, API doc)

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

DickJC123 and others added 10 commits September 14, 2020 19:26
* Add sm arch 80 to Makefile

* Add TF32 to cuBLAS GEMMs

Signed-off-by: Serge Panev <spanev@nvidia.com>

* Add CUDA version guards

Signed-off-by: Serge Panev <spanev@nvidia.com>

* Remove useless TF32 for double and old CUDA version

Signed-off-by: Serge Panev <spanev@nvidia.com>

* Factorize VERSION_ADJUSTED_TF32_MATH

Signed-off-by: Serge Panev <spanev@nvidia.com>

* Add TF32 considerations to test_util.py:check_consistency()

* Bypass test_gluon_gpu.py:test_large_models if gmem >32GB

* Default tols in assert_almost_equal() now a function of dtype and ctx

* Expand types listed by default_tols()

* Fix pylint

* All with_seed() tests to waitall in teardown

* Elevate MXNET_TEST_SEED logging to WARNING

* Revert test_gluon_gpu.py:test_rnn_layer to default tols

* Fix test_gluon_model_zoo_gpu.py::test_inference and test_operator_gpu.py::test_np_linalg_{solve,tensorinv}

* test_numpy_interoperability.py to not fix seed for rest of CI

* Further fix to test_np_linalg_tensorinv

* Fix test_gluon_data.py:test_dataloader_context when run on 1-GPU system.

* Fix test_operator_gpu.py::test_embedding_with_type

* Fix test_operator_gpu.py::{test_*convolution_large_c,test_np_linalg_tensorsolve}

* Remove unneeded print() from test_numpy_interoperability.py

* Unify tol handling of check_consistency() and assert_almost_equal().  Test tweaks.

* Add tol handling of assert_almost_equal() with number args

* Add tol handling of bool comparisons

* Fix test_numpy_op.py::test_np_random_rayleigh

* Fix test_operator_gpu.py::test_batchnorm_with_type

* Fix test_gluon.py::test_sync_batchnorm in cpu selftest

* Improve unittest failure reporting

* Add to robustness of test_operator_gpu.py::test_embedding_with_type

* Check_consistency() to use equal backward gradients for increased test robustness

* Fix test_operator_gpu.py::test_{fully_connected,gemm}.  Add default_numeric_eps().

* test_utils.py fix for numeric gradient calc

* Reinstate rtol=1e-2 for test_operator.py::test_order

* Remove auto-cast of check_consistency() input data to least precise dtype (not needed)

* Fix test_operator.py::test_{reciprocal,cbrt,rcbrt}_op

* Expand default float64 numeric_eps for test_operator_gpu.py::test_softmin

* Fix segfault-on-error of @retry decorator. Add test isolation.

* assert_almost_equal() to handle a,b scalars

* Fix test_operator_gpu.py::test_gluon_{mvn,mvn_v1} race

* Fix test_operator_gpu.py::test_flatten_slice_after_conv via scale

* Remove test_utils.py:almost_equal_ignore_nan()

* Fix sample vs. pop variance issue with test_numpy_op.py::test_npx_batch_norm

* Expose test_utils.py:effective_dtype() and use to fix test_operator_gpu.py::test_np_linalg_svd

* Fix true_divide int_array / int_scalar -> float_array to honor np_default_dtype

* Try test_elemwise_binary_ops serial to avoid pytest worker crash

* Fix (log_)softmax backward on empty ndarray

* Temporarily log all CI seeds to troubleshoot seed non-determinism

* Revert "Temporarily log all CI seeds to troubleshoot seed non-determinism"

This reverts commit f60eff2.

* Temp log all CI seeds to troubleshoot unwanted seed determinism

* Revert "Add sm arch 80 to Makefile"

This reverts commit f9306ce.

* Same fix of sample vs. pop variance issue, now with test_operator_gpu.py::test_batchnorm

* Revert "Temp log all CI seeds to troubleshoot unwanted seed determinism"

This reverts commit ff328ef.

* Marking test_sparse_dot_grad with garbage_expected after teardown error

* Fix flakiness of test_gluon_probability{_v1,_v2}.py::test_gluon_kl{_v1,}

* Temp skip of test_aggregate_duplication on gpu

* Add seeding to test_{numpy,}_contrib_gluon_data_vision.py.  Make created files unique.

* Add ndarray module isolation to help debug test_bbox_augmenters worker crash

* Marking test_sparse_square_sum serial after pytest worker crash

* Fix flakiness of test_gluon_probability{_v1,_v2}.py::test_half_cauchy{_v1,}

Co-authored-by: Serge Panev <spanev@nvidia.com>
Co-authored-by: Bart Gawrych <gawrych.bartlomiej@intel.com>
@mxnet-bot

Hey @DickJC123, thanks for submitting the PR.
All tests are already queued to run once. If tests fail, you can trigger one or more tests again with the following commands:

  • To trigger all jobs: @mxnet-bot run ci [all]
  • To trigger specific jobs: @mxnet-bot run ci [job1, job2]

CI supported jobs: [website, unix-gpu, sanity, centos-gpu, windows-gpu, edge, miscellaneous, windows-cpu, clang, unix-cpu, centos-cpu]


Note:
Only the following 3 categories can trigger CI: PR Author, MXNet Committer, Jenkins Admin.
All CI tests must pass before the PR can be merged.

@samskalicky
Contributor

[2020-09-15T04:31:26.709Z] [ 99%] Linking CXX shared library mxnet_52.dll
[2020-09-15T04:37:33.250Z] LINK: command "C:\PROGRA~2\MICROS~1.0\VC\bin\X86_AM~1\link.exe /nologo @CMakeFiles\mxnet_52.dir\objects1.rsp /out:mxnet_52.dll /implib:mxnet_52.lib /pdb:C:\jenkins_slave\workspace\build-gpu\build\mxnet_52.pdb /dll /version:0.0 /machine:x64 /INCREMENTAL:NO /OPT:REF /OPT:ICF -LIBPATH:C:\PROGRA~1\NVIDIA~2\CUDA\v10.2\lib\x64 3rdparty\mkldnn\src\dnnl.lib C:\Program Files\OpenBLAS-windows-v0_2_19\lib\libopenblas.dll.a C:\Program Files\opencv\x64\vc14\lib\opencv_world412.lib C:\Program Files\opencv\x64\vc14\lib\opencv_world412.lib C:\Program Files\opencv\x64\vc14\lib\opencv_world412.lib C:\Program Files\opencv\x64\vc14\lib\opencv_world412.lib C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2\lib\x64\cudnn.lib cuda.lib 3rdparty\dmlc-core\dmlc.lib C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2\lib\x64\cudart.lib C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2\lib\x64\cufft.lib C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2\lib\x64\cublas.lib C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2\lib\x64\cusolver.lib C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2\lib\x64\cusparse.lib C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2\lib\x64\curand.lib C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2\lib\x64\nvrtc.lib C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2\lib\x64\cuda.lib cudadevrt.lib cudart_static.lib kernel32.lib user32.lib gdi32.lib winspool.lib shell32.lib ole32.lib oleaut32.lib uuid.lib comdlg32.lib advapi32.lib /MANIFEST /MANIFESTFILE:mxnet_52.dll.manifest" failed (exit code 1102) with the following output:
[2020-09-15T04:37:33.250Z]    Creating library mxnet_52.lib and object mxnet_52.exp
[2020-09-15T04:37:33.250Z] LINK : fatal error LNK1102: out of memory

Do we need to enable compression?

@szha szha merged commit ce0a518 into apache:v1.x Sep 17, 2020
4 participants