Conversation
Hey @ChaiBapchya, thanks for submitting the PR.
CI supported jobs: [unix-gpu, clang, website, sanity, edge, centos-gpu, windows-cpu, unix-cpu, windows-gpu, centos-cpu, miscellaneous]
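For example, commenting `@mxnet-bot run ci [unix-gpu]` on the PR re-triggers just that pipeline; the same syntax is used further down in this thread.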
Force-pushed from f8331ec to 1c720fa.
Rebased to fix the windows-gpu issue: fixed in #18177.
…ad of cuda compat
Force-pushed from 022e135 to ec5330d.
@mxnet-bot run ci [windows-gpu]
Assertion failed for test_np_mixed_precision_binary_funcs: likely flaky.
Jenkins CI successfully triggered: [windows-gpu]
Infra-related changes: apache/mxnet-ci#20
Specific commit: apache/mxnet-ci@1a537af
Manually created a launch template for the G4 node pointing to the AMI [created in dev, accessible to the prod account] [followed the steps mentioned here: https://cwiki.apache.org/confluence/display/MXNET/Setup#Setup-Slave]
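For context, here is a rough sketch of what the launch-template step could look like with the AWS CLI. The template name, AMI ID, instance type, and security group below are placeholders, not the values used in the CI accounts; the actual change lives in apache/mxnet-ci#20.

```bash
# Hypothetical values throughout; the real AMI and security group are defined
# in the mxnet-ci accounts (see apache/mxnet-ci@1a537af for the actual change).
aws ec2 create-launch-template \
  --launch-template-name mxnet-ci-g4-slave \
  --launch-template-data '{
    "ImageId": "ami-0123456789abcdef0",
    "InstanceType": "g4dn.xlarge",
    "SecurityGroupIds": ["sg-0123456789abcdef0"]
  }'
```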
Thank you!
Now fails on unix-gpu. Related issue: #16848
* Update unix gpu toolchain (#18186)
  * update nvidia-docker command & remove cuda compat
  * replace cu101 with cuda since compat is no longer to be used
  * skip flaky tests
  * get rid of ubuntu_build_cuda and point ubuntu_cu101 to base gpu instead of cuda compat
  * Revert "skip flaky tests" (reverts commit 1c720fa)
  * revert removal of ubuntu_build_cuda
  * add linux gpu g4 node to all steps using g3 in unix-gpu pipeline
  * remove docker compose files
  * add back the caffe test since caffe is deprecated for mx2.0 and not 1.x
  * drop nvidia-docker requirement since Docker 19.03 supports GPUs by default
  * remove compat from dockerfile
* Cherry-pick #18635 to v1.7.x (#18935)
  * Remove mention of nightly in pypi (#18635)
  * update bert dev.tsv link (Co-authored-by: Sheng Zha <szha@users.noreply.github.com>)
* disable tvm in CI functions that rely on libcuda compat
* tvm off for ubuntu_gpu_cmake build
* drop tvm from all unix-gpu builds

Co-authored-by: Carin Meier <cmeier@gigasquidsoftware.com>
Co-authored-by: Sheng Zha <szha@users.noreply.github.com>
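One of the follow-up commits above disables TVM-generated operators in the GPU builds so the resulting libraries no longer depend on libcuda compat. A minimal sketch of such a build, assuming the standard MXNet CMake switches (USE_CUDA, USE_CUDNN, USE_TVM_OP); the flags actually passed by the CI runtime functions may differ:

```bash
# Configure a CUDA build with TVM-generated operators turned off, so the build
# machine does not need /usr/local/cuda/compat; other flags are illustrative.
cmake -DUSE_CUDA=ON -DUSE_CUDNN=ON -DUSE_TVM_OP=OFF ..
cmake --build . --parallel
```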
Description
Currently, the Unix GPU and CentOS GPU tests run on P3 and G3 AWS EC2 instances.
To improve cost and efficiency, a switch to G4 EC2 instances has been proposed.
This switch involves broadly upgrading the GPU toolchain.
Code Changes
* Latest Docker [19.03] has built-in CUDA support [hence replace nvidia-docker with docker --gpus all; see the sketch after this list].
* Given that the host machine has updated drivers, TVM Op shouldn't need cuda compat [/usr/local/cuda/compat], so ubuntu_gpu_cu101 is replaced with ubuntu_build_cuda.
* Docker Compose follows a multi-stage build [https://docs.docker.com/develop/develop-images/multistage-build/] and defines multiple targets:
  * ubuntu_build_cuda target is gpuwithcudaruntimelibs
  * ubuntu_gpu_cu101 target is gpuwithcompatenv [which has been commented out now]
* After testing this on the CI Dev account (http://jenkins.mxnet-ci-dev.amazon-ml.com/blue/organizations/jenkins/mxnet-validation-bapac%2Funix-gpu/detail/update_gpu_toolchain/8/pipeline), the TVMOpError related to binary ops was encountered: "TVMOp doesn't work well with GPU builds" #17840.
* To unblock the migration from G3 to G4, these flaky tests have been skipped.
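As referenced in the list above, a minimal sketch of the container-side change. The image names and tags are illustrative (the exact images built by the CI scripts may differ); the gpuwithcudaruntimelibs target is the one described above:

```bash
# Previously: GPU containers were started through the separate nvidia-docker wrapper
nvidia-docker run --rm mxnetci/build.ubuntu_gpu_cu101 nvidia-smi

# Docker 19.03+: GPU access is built into the docker CLI via --gpus
docker run --gpus all --rm mxnetci/build.ubuntu_gpu_cu101 nvidia-smi

# With a multi-stage Dockerfile, a specific stage can be built via --target,
# e.g. the runtime-libs stage instead of the libcuda-compat stage
docker build --target gpuwithcudaruntimelibs -t mxnetci/build.ubuntu_build_cuda .
```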
Checklist
Essentials
Please feel free to remove inapplicable items for your PR.
Comments
Thanks to @ptrendx for the help identifying libcuda compat as the root cause.
Helped me close NVIDIA/nvidia-docker#1256.
Thanks to @leezu and @josephevans for their help throughout this migration effort, and to @sandeep-krishnamurthy and @szha for the guidance.