
[CI][1.x] Cherrypick: Upgrade unix gpu toolchain (#18186) #18785

Merged
merged 10 commits into apache:v1.x from ChaiBapchya:g3_to_g4 on Aug 18, 2020

Conversation

ChaiBapchya
Contributor

Leverage G4 instances for unix-gpu instead of G3

  • update nvidiadocker command & remove cuda compat

  • replace cu101 with cuda since compat is no longer to be used

  • skip flaky tests

  • get rid of ubuntu_build_cuda and point ubuntu_cu101 to base gpu instead of cuda compat

  • Revert "skip flaky tests"

This reverts commit 1c720fa.

  • revert removal of ubuntu_build_cuda

  • add linux gpu g4 node to all steps using g3 in unix-gpu pipeline

Refer: #18186

@mxnet-bot

Hey @ChaiBapchya, thanks for submitting the PR.
All tests are already queued to run once. If tests fail, you can trigger one or more tests again with the following commands:

  • To trigger all jobs: @mxnet-bot run ci [all]
  • To trigger specific jobs: @mxnet-bot run ci [job1, job2]

CI supported jobs: [centos-cpu, clang, edge, centos-gpu, unix-gpu, website, windows-cpu, miscellaneous, sanity, unix-cpu, windows-gpu]


Note:
Only the following 3 categories can trigger CI: PR Author, MXNet Committer, Jenkins Admin.
All CI tests must pass before the PR can be merged.

@ChaiBapchya ChaiBapchya changed the title [CI][1.x] Upgrade unix gpu toolchain (#18186) [CI][1.x] Cherrypick: Upgrade unix gpu toolchain (#18186) Jul 24, 2020
@ChaiBapchya
Contributor Author

@mxnet-bot run ci [unix-gpu]

@ChaiBapchya
Contributor Author

I ran this locally to try to reproduce the CI error, but it passes and doesn't throw the nvidia-docker error.

ci/build.py --docker-registry mxnetci --nvidiadocker --platform ubuntu_gpu_cu101 --docker-build-retries 3 --shm-size 500m /work/runtime_functions.sh unittest_ubuntu_cpugpu_perl

@leezu @josephevans any idea?

I can confirm it translates into the equivalent command:

docker run \
        --gpus all \
        --cap-add SYS_PTRACE \
        --rm \
        --shm-size=500m \
        -v /home/ubuntu/chai-mxnet:/work/mxnet \
        -v /home/ubuntu/chai-mxnet/build:/work/build \
        -v /home/ubuntu/.ccache:/work/ccache \
        -u 1000:1000 \
        -e CCACHE_MAXSIZE=500G \
        -e CCACHE_TEMPDIR=/tmp/ccache \
        -e CCACHE_DIR=/work/ccache \
        -e CCACHE_LOGFILE=/tmp/ccache.log \
        -ti \
        mxnetci/build.ubuntu_gpu_cu101 \
        /work/runtime_functions.sh unittest_ubuntu_cpugpu_perl

@leezu
Contributor

leezu commented Jul 27, 2020

[2020-07-26T08:47:43.081Z] FileNotFoundError: [Errno 2] No such file or directory: 'nvidia-docker': 'nvidia-docker'

That's related to the AMI. You could also update the build.py script to run docker run --gpus=all instead of nvidia-docker
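Something along these lines (just a sketch; the helper name and simplified signatures are illustrative, not the actual build.py API):

# Hypothetical sketch only: replace the nvidia-docker wrapper with plain
# `docker run --gpus all`. Helper names here are illustrative, not build.py's API.
import subprocess
from typing import List

def docker_run_cmd(use_gpu: bool) -> List[str]:
    """Build the `docker run` prefix; `--gpus all` needs Docker >= 19.03."""
    cmd = ["docker", "run"]
    if use_gpu:
        cmd += ["--gpus", "all"]   # replaces the legacy nvidia-docker wrapper
    return cmd

def run_container(image: str, script: str, use_gpu: bool = True) -> int:
    """Run `script` inside `image`, e.g. /work/runtime_functions.sh <function>."""
    return subprocess.call(docker_run_cmd(use_gpu) + ["--rm", image, script])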

@ChaiBapchya
Contributor Author

Yes, but I've updated Jenkinsfile_unix_gpu to use the G4 instance, which has the updated docker version [our master pipeline is already using a G4 instance with the updated docker version].
Moreover, build.py is updated in this PR too.
Hence, when I run it locally it properly translates to the docker run --gpus all command.
[Screenshot attached: 2020-07-27, 10:15 AM]

@ChaiBapchya
Contributor Author

Never mind. @josephevans helped me identify that before calling run_container, build.py was first building the docker container, and that build step was still picking up nvidia-docker via get_docker_binary, which needed to be removed as well. Dropped it.
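Roughly, the pattern (illustrative sketch only, not the real build.py code): the image build step needs no GPU, so it should call plain docker directly instead of going through get_docker_binary.

import subprocess

def build_docker_image(platform: str, registry: str) -> None:
    # Building the image needs no GPU, so call `docker` directly instead of
    # get_docker_binary(), which could return `nvidia-docker` -- a wrapper
    # that no longer exists on the G4 AMI. Tag and context path are illustrative.
    tag = f"{registry}/build.{platform}"   # e.g. mxnetci/build.ubuntu_gpu_cu101
    subprocess.check_call(["docker", "build", "-t", tag, "ci/docker"])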

@szha
Member

szha commented Jul 30, 2020

I think when enabling the branch protection, we accidentally turned on "Require branches to be up to date before merging". I'm requesting to disable it in https://issues.apache.org/jira/browse/INFRA-20616. Don't worry about updating the branch in this PR for now.

@ChaiBapchya
Contributor Author

@mxnet-bot run ci [unix-gpu]
Now that the Apache Infra team has resolved https://issues.apache.org/jira/browse/INFRA-20616.

@mxnet-bot

Jenkins CI successfully triggered : [unix-gpu]

@sandeep-krishnamurthy
Contributor

@mxnet-bot run ci [unix-gpu]

@mxnet-bot

Jenkins CI successfully triggered : [unix-gpu]

ChaiBapchya and others added 3 commits August 14, 2020 18:28
* Remove mention of nightly in pypi (apache#18635)

* update bert dev.tsv link

Co-authored-by: Sheng Zha <szha@users.noreply.github.com>
@ChaiBapchya
Contributor Author

ubuntu_gpu_cu101 on the v1.x branch relies on libcuda compat. However, with the upgrade from G3 to G4 instances we no longer rely on libcuda compat; keeping it gives a cuda driver/display driver error.

Upon removing the LD_LIBRARY_PATH kludge for libcuda compat, 4 builds in the unix-gpu pipeline failed because TVM=ON relies on libcuda compat.
PR #18204 disabled TVM on the master branch due to a known issue.
Hence I'm doing the same for the v1.x branch.

Note: I haven't cherry-picked that PR because the master branch CI differs from v1.x [e.g. most unix-gpu builds on master use cmake instead of make].

@ChaiBapchya
Contributor Author

@mxnet-bot run ci [unix-gpu]
Re-triggering for a flaky issue.

@mxnet-bot

Jenkins CI successfully triggered : [unix-gpu]

@ChaiBapchya
Contributor Author

@jinboci I saw one of your PRs for fixing TVM Op errors. Any idea why this test fails when using TVM=ON?
It's failing for 3 tests: Python3 GPU, Python3 MKLDNN GPU, Python3 MKLDNN-NoCUDNN GPU.

Common Stack Trace

test_operator_gpu.test_kernel_error_checking ... terminate called after throwing an instance of 'dmlc::Error'

[2020-08-17T05:59:15.843Z]   what():  [05:59:13] /work/mxnet/3rdparty/tvm/src/runtime/workspace_pool.cc:115: Check failed: allocated_.size() == 1 (3 vs. 1) : 

In the CI Jenkins_steps.groovy, for Python3 GPU we're packing

compile_unix_full_gpu()
utils.pack_lib('gpu', mx_lib_cpp_examples)

where

mx_lib_cpp_examples = 'lib/libmxnet.so, lib/libmxnet.a, lib/libtvm_runtime.so, lib/libtvmop.so, lib/tvmop.conf, build/libcustomop_lib.so, build/libcustomop_gpu_lib.so, build/libsubgraph_lib.so, 3rdparty/dmlc-core/libdmlc.a, 3rdparty/tvm/nnvm/lib/libnnvm.a, 3rdparty/ps-lite/build/libps.a, deps/lib/libprotobuf-lite.a, deps/lib/libzmq.a, build/cpp-package/example/*, python/mxnet/_cy3/*.so, python/mxnet/_ffi/_cy3/*.so'

While unpacking

test_unix_python3_gpu()
utils.unpack_and_init('gpu', mx_lib_cython)

where mx_lib_cython is a subset of mx_lib_cpp_examples

mx_lib_cython = 'lib/libmxnet.so, lib/libmxnet.a, lib/libtvm_runtime.so, lib/libtvmop.so, lib/tvmop.conf, build/libcustomop_lib.so, build/libcustomop_gpu_lib.so, build/libsubgraph_lib.so, 3rdparty/dmlc-core/libdmlc.a, 3rdparty/tvm/nnvm/lib/libnnvm.a, python/mxnet/_cy3/*.so, python/mxnet/_ffi/_cy3/*.so'
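As a quick sanity check (just splitting the two strings above in Python), mx_lib_cython is indeed contained in mx_lib_cpp_examples, so every artifact the test stage unpacks is produced by the pack step:

# Verify the subset relationship between the two artifact lists quoted above.
mx_lib_cpp_examples = 'lib/libmxnet.so, lib/libmxnet.a, lib/libtvm_runtime.so, lib/libtvmop.so, lib/tvmop.conf, build/libcustomop_lib.so, build/libcustomop_gpu_lib.so, build/libsubgraph_lib.so, 3rdparty/dmlc-core/libdmlc.a, 3rdparty/tvm/nnvm/lib/libnnvm.a, 3rdparty/ps-lite/build/libps.a, deps/lib/libprotobuf-lite.a, deps/lib/libzmq.a, build/cpp-package/example/*, python/mxnet/_cy3/*.so, python/mxnet/_ffi/_cy3/*.so'
mx_lib_cython = 'lib/libmxnet.so, lib/libmxnet.a, lib/libtvm_runtime.so, lib/libtvmop.so, lib/tvmop.conf, build/libcustomop_lib.so, build/libcustomop_gpu_lib.so, build/libsubgraph_lib.so, 3rdparty/dmlc-core/libdmlc.a, 3rdparty/tvm/nnvm/lib/libnnvm.a, python/mxnet/_cy3/*.so, python/mxnet/_ffi/_cy3/*.so'
assert set(mx_lib_cython.split(', ')) <= set(mx_lib_cpp_examples.split(', '))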

Based on the stack trace, it's throwing a TVM runtime check failure for the allocated size.
@DickJC123 I see you had submitted this test. Any idea why this is troubling TVM?

@leezu
Contributor

leezu commented Aug 17, 2020

@ChaiBapchya on master, -DUSE_TVM_OP=ON is disabled for all GPU builds due to known issues. You can disable it on the v1.x branch as well.

@jinboci
Contributor

jinboci commented Aug 18, 2020

@ChaiBapchya It seems the unix-gpu tests have passed. Most of my work on TVMOp is written up in issue #18716. However, I don't think we were encountering the same problem.

@ChaiBapchya
Contributor Author

Yeah, I've dropped TVMOp support from the unix-gpu pipeline, and that made the pipeline pass.

@ChaiBapchya
Contributor Author

@mxnet-label-bot add [pr-awaiting-review]

@ChaiBapchya
Contributor Author

@mxnet-bot run ci [windows-gpu]
Re-triggering as windows-gpu timed out.

@mxnet-bot

Jenkins CI successfully triggered : [windows-gpu]

@szha szha merged commit 9981e84 into apache:v1.x Aug 18, 2020
@ChaiBapchya ChaiBapchya deleted the g3_to_g4 branch September 9, 2020 05:29