[v1.x] CI fixes to make more stable and upgradable #19895

josephevans · 2021-02-13T03:08:59Z

Description

This PR makes a number of changes to make CI more stable:

Remove the SafeDocker client, which used the Python docker package to run containers, and instead invoke the "docker run" command directly via subprocess.call(). The python-docker client does not support the gpus parameter that newer Docker versions use, and calling the docker command directly avoids the timeout issues seen with the client. This will allow us to update our AMIs to newer Docker versions.
To support both Docker variants simultaneously, we first pass the --gpus all parameter to docker run; if that fails with exit code 125 (which means the docker run command itself failed), we retry with the old --runtime nvidia parameter.
Change the GPU pipelines to use G4 instances instead of P3 (for the 3 remaining pipelines that still use P3).
Remove the extra custom codecov calls (which usually fail and have to be retried multiple times, even though the initial codecov command works).
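For reviewers who want the fallback spelled out, here is a minimal sketch of the retry logic described above. It is not the code from this PR: the names `build_docker_cmd` and `run_gpu_container`, the `--rm` flag, and the injectable `run` parameter are assumptions made for illustration. Only the `--gpus all` / `--runtime nvidia` flags and the exit code 125 check come from the description.

```python
import subprocess
import sys
from typing import Callable, List, Sequence

# Exit status reported when the `docker run` command itself fails
# (as opposed to the command inside the container failing).
DOCKER_RUN_FAILED = 125


def build_docker_cmd(image: str, command: Sequence[str],
                     gpu_args: Sequence[str]) -> List[str]:
    """Build an argument list for `docker run`; no shell string is involved."""
    return ["docker", "run", "--rm", *gpu_args, image, *command]


def run_gpu_container(image: str, command: Sequence[str],
                      run: Callable[[List[str]], int] = subprocess.call) -> int:
    """Try the newer `--gpus all` flag first, then fall back to `--runtime nvidia`.

    `run` is a hypothetical hook (defaulting to subprocess.call) so the
    logic can be exercised without a docker daemon.
    """
    ret = run(build_docker_cmd(image, command, ["--gpus", "all"]))
    if ret == DOCKER_RUN_FAILED:
        # docker run itself failed, so assume an older nvidia-docker setup.
        print("docker run with --gpus failed (125); retrying with --runtime nvidia",
              file=sys.stderr)
        ret = run(build_docker_cmd(image, command, ["--runtime", "nvidia"]))
    return ret
```

Any exit code other than 125 is returned unchanged, so a failing test inside the container is not retried with the other flag.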


mxnet-bot · 2021-02-13T03:09:02Z

Hey @josephevans, thanks for submitting the PR.
All tests are already queued to run once. If tests fail, you can trigger one or more tests again with the following commands:

To trigger all jobs: @mxnet-bot run ci [all]
To trigger specific jobs: @mxnet-bot run ci [job1, job2]

CI supported jobs: [website, edge, centos-gpu, windows-cpu, clang, windows-gpu, unix-gpu, miscellaneous, sanity, unix-cpu, centos-cpu]

Note:
Only the following 3 categories can trigger CI: PR Author, MXNet Committer, Jenkins Admin.
All CI tests must pass before the PR can be merged.


josephevans · 2021-02-15T20:24:35Z

Could you guys please review? @leezu @szha

Zha0q1

LGTM

waytrue17

LGTM, thanks!

access2rohit

LGTM

szha · 2021-02-16T02:24:44Z

LGTM

* Test moving pipelines from p3 to g4.
* Remove fallback codecov command - the existing (first) command works and the second always fails a few times before finally succeeding (and also doesn't support the -P parameter, which causes an error.)
* Stop using docker python client, since it still doesn't support latest nvidia 'gpus' attribute. Switch to using subprocess calls using list parameter (to avoid shell injections). See docker/docker-py#2395
* Remove old files.
* Fix comment
* Set default environment variables
* Fix GPU syntax.
* Use subprocess.run and redirect output to stdout, don't run docker in interactive mode.
* Check if codecov works without providing parameters now.
* Send docker stderr to sys.stderr
* Support both nvidia-docker configurations, first try '--gpus all', and if that fails, then try '--runtime nvidia'.

Co-authored-by: Joe Evans <joeev@amazon.com>
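The "list parameter" and output-forwarding points in the commit message above can be illustrated with a small sketch. This is not the PR's code: `run_forwarding_output` and the `untrusted` value are hypothetical, and a Python child process stands in for docker so the example runs anywhere. It captures the child's output and writes it to `sys.stdout` and `sys.stderr`, which is one possible way to forward output and may differ from what the PR does.

```python
import subprocess
import sys
from typing import List, Tuple


def run_forwarding_output(cmd: List[str]) -> Tuple[int, str]:
    """Run cmd as an argument list (no shell) and forward its output."""
    proc = subprocess.run(cmd, stdout=subprocess.PIPE,
                          stderr=subprocess.PIPE, text=True)
    sys.stdout.write(proc.stdout)   # child's stdout goes to our stdout
    sys.stderr.write(proc.stderr)   # child's stderr goes to our stderr
    return proc.returncode, proc.stdout


# Hypothetical untrusted value containing shell metacharacters.
untrusted = "ubuntu; echo injected"

# Because the command is a list and no shell is used, the value reaches the
# child as a single argv entry; "; echo injected" is never run as a command.
code, out = run_forwarding_output(
    [sys.executable, "-c", "import sys; print(sys.argv[1])", untrusted]
)
```

Had the same value been interpolated into a string and run with a shell, the text after the semicolon would have been executed as a second command.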

* [v1.x] Migrate to use ECR as docker cache instead of dockerhub (#19654)
* [v1.x] Update CI build scripts to install python 3.6 from deadsnakes repo (#19788)
* Install python3.6 from deadsnakes repo, since 3.5 is EOL'd and get-pip.py no longer works with 3.5.
* Set symlink for python3 to point to newly installed 3.6 version.
* Setting symlink or using update-alternatives causes add-apt-repository to fail, so instead just set alias in environment to call the correct python version.
* Setup symlinks in /usr/local/bin, since it comes first in the path.
* Don't use absolute path for python3 executable, just use python3 from path.
Co-authored-by: Joe Evans <joeev@amazon.com>
* Disable unix-gpu-cu110 pipeline for v1.x build since we now build with cuda 11.0 in windows pipelines. (#19828)
Co-authored-by: Joe Evans <joeev@amazon.com>
* [v1.x] For ECR, ensure we sanitize region input from environment variable (#19882)
* Set default for cache_intermediate.
* Make sure we sanitize region extracted from registry, since we pass it to os.system.
Co-authored-by: Joe Evans <joeev@amazon.com>
* [v1.x] Address CI failures with docker timeouts (v2) (#19890)
* Add random sleep only, since retry attempts are already implemented.
* Reduce random sleep to 2-10 sec.
Co-authored-by: Joe Evans <joeev@amazon.com>
* [v1.x] CI fixes to make more stable and upgradable (#19895)
* Test moving pipelines from p3 to g4.
* Remove fallback codecov command - the existing (first) command works and the second always fails a few times before finally succeeding (and also doesn't support the -P parameter, which causes an error.)
* Stop using docker python client, since it still doesn't support latest nvidia 'gpus' attribute. Switch to using subprocess calls using list parameter (to avoid shell injections). See docker/docker-py#2395
* Remove old files.
* Fix comment
* Set default environment variables
* Fix GPU syntax.
* Use subprocess.run and redirect output to stdout, don't run docker in interactive mode.
* Check if codecov works without providing parameters now.
* Send docker stderr to sys.stderr
* Support both nvidia-docker configurations, first try '--gpus all', and if that fails, then try '--runtime nvidia'.
Co-authored-by: Joe Evans <joeev@amazon.com>
* fix cd
* fix cudnn version for cu10.2 buiuld
* WAR the dataloader issue with forked processes holding stale references (#19924)
* skip some tests
* fix ski[
* [v.1x] Attempt to fix v1.x cd by installing new cuda compt package (#19959)
* update cude compt for cd
* Update Dockerfile.build.ubuntu_gpu_cu102
* Update Dockerfile.build.ubuntu_gpu_cu102
* Update Dockerfile.build.ubuntu_gpu_cu110
* Update runtime_functions.sh
* Update Dockerfile.build.ubuntu_gpu_cu110
* Update Dockerfile.build.ubuntu_gpu_cu102
* update command

Co-authored-by: Joe Evans <joseph.evans@gmail.com>
Co-authored-by: Joe Evans <joeev@amazon.com>
Co-authored-by: Joe Evans <github@250hacks.net>
Co-authored-by: Przemyslaw Tredak <ptredak@nvidia.com>


Joe Evans added 5 commits February 13, 2021 02:02

Test moving pipelines from p3 to g4.

7c64a60

Remove fallback codecov command - the existing (first) command works and the second always fails a few times before finally succeeding (and also doesn't support the -P parameter, which causes an error.)

c682008

Stop using docker python client, since it still doesn't support latest nvidia 'gpus' attribute. Switch to using subprocess calls using list parameter (to avoid shell injections). See docker/docker-py#2395

34ff9fc

Remove old files.

af669a8

Fix comment

eb68b60

josephevans requested review from aaronmarkham and marcoabreu as code owners February 13, 2021 03:08

lanking520 added pr-awaiting-testing PR is reviewed and waiting CI build and test pr-work-in-progress PR is still work in progress and removed pr-awaiting-testing PR is reviewed and waiting CI build and test labels Feb 13, 2021

Set default environment variables

544ce51

lanking520 added pr-awaiting-testing PR is reviewed and waiting CI build and test pr-work-in-progress PR is still work in progress and removed pr-work-in-progress PR is still work in progress pr-awaiting-testing PR is reviewed and waiting CI build and test labels Feb 13, 2021

Fix GPU syntax.

35b91e9

lanking520 added pr-awaiting-testing PR is reviewed and waiting CI build and test pr-work-in-progress PR is still work in progress and removed pr-work-in-progress PR is still work in progress pr-awaiting-testing PR is reviewed and waiting CI build and test labels Feb 13, 2021

Use subprocess.run and redirect output to stdout, don't run docker in interactive mode.

a2a667d

lanking520 added pr-awaiting-testing PR is reviewed and waiting CI build and test and removed pr-work-in-progress PR is still work in progress labels Feb 13, 2021

Joe Evans added 2 commits February 13, 2021 05:21

Check if codecov works without providing parameters now.

c5352c4

Send docker stderr to sys.stderr

5412c8b

lanking520 added pr-work-in-progress PR is still work in progress and removed pr-awaiting-testing PR is reviewed and waiting CI build and test labels Feb 13, 2021

Support both nvidia-docker configurations, first try '--gpus all', and if that fails, then try '--runtime nvidia'.

115fd3e

lanking520 removed the pr-work-in-progress PR is still work in progress label Feb 13, 2021

josephevans changed the title ~~[v1.x] TEST: Test out changing p3 to g4 instances and pave the way for docker/nvidia updates~~ [v1.x] CI fixes to make more stable and upgradable Feb 15, 2021

lanking520 added pr-awaiting-testing PR is reviewed and waiting CI build and test and removed pr-awaiting-review PR is waiting for code review labels Feb 15, 2021

josephevans mentioned this pull request Feb 15, 2021

[v1.x] CI: Retry container creation up to 3 times when getting ReadTimeout. #19892

Closed

lanking520 added pr-awaiting-review PR is waiting for code review pr-awaiting-testing PR is reviewed and waiting CI build and test and removed pr-awaiting-testing PR is reviewed and waiting CI build and test pr-awaiting-review PR is waiting for code review labels Feb 15, 2021

lanking520 added pr-awaiting-review PR is waiting for code review and removed pr-awaiting-testing PR is reviewed and waiting CI build and test labels Feb 15, 2021

Zha0q1 approved these changes Feb 15, 2021


waytrue17 approved these changes Feb 15, 2021


access2rohit approved these changes Feb 15, 2021


szha requested a review from leezu February 16, 2021 02:24

leezu merged commit f6f4a5f into apache:v1.x Feb 16, 2021

josephevans deleted the ci_test_p3_to_g4 branch February 16, 2021 17:29

josephevans mentioned this pull request Feb 16, 2021

Forward port #19895 #19903

Merged

josephevans mentioned this pull request Feb 24, 2021

[v1.8.x] Backport PRs from v1.x branch #19946

Closed
