This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[v1.x] Address CI failures with docker timeouts (v2) #19890

Merged: 2 commits into apache:v1.x from ci_docker_timeouts_v1.x_v2 on Feb 12, 2021

Contributor

@josephevans josephevans commented Feb 12, 2021

Description

Add a random sleep (2 to 10 seconds) to give Docker time to flush pulled images to disk and to reduce the chance of a race condition between Jenkins slave slots on the same machine, which causes docker run to time out.

@mxnet-bot

Hey @josephevans, thanks for submitting the PR.
All tests are already queued to run once. If tests fail, you can trigger one or more tests again with the following commands:

  • To trigger all jobs: @mxnet-bot run ci [all]
  • To trigger specific jobs: @mxnet-bot run ci [job1, job2]

CI supported jobs: [unix-cpu, centos-gpu, clang, miscellaneous, windows-gpu, unix-gpu, sanity, windows-cpu, centos-cpu, website, edge]


Note:
Only the following 3 categories can trigger CI: PR Author, MXNet Committer, Jenkins Admin.
All CI tests must pass before the PR can be merged.

@lanking520 lanking520 added the pr-awaiting-testing PR is reviewed and waiting CI build and test label Feb 12, 2021
@lanking520 lanking520 added pr-work-in-progress PR is still work in progress pr-awaiting-testing PR is reviewed and waiting CI build and test and removed pr-awaiting-testing PR is reviewed and waiting CI build and test pr-work-in-progress PR is still work in progress labels Feb 12, 2021
Contributor

@waytrue17 waytrue17 left a comment

LGTM, thanks!

@lanking520 lanking520 added pr-awaiting-review PR is waiting for code review and removed pr-awaiting-testing PR is reviewed and waiting CI build and test labels Feb 12, 2021
@josephevans
Contributor Author

These 3 PRs show symptoms of the docker run command failing:

#19868
#19851
#19888

@@ -117,6 +119,9 @@ def run(self, *args, **kwargs) -> int:
        ret = 0
        try:
            # Race condition:
            # add a random sleep to (a) give docker time to flush disk buffer after pulling image
            # and (b) minimize race conditions between jenkins runs on same host
            time.sleep(random.randint(2, 10))
Contributor

How does random help, compared with, say, a fixed wait of 5 seconds?

Contributor Author

Each Jenkins slave (Linux CPU nodes, at least) has 2 "slots" it can run in parallel. When 2 jobs using the same Docker image start at the exact same time on these 2 slots, both attempt to pull the image from ECR and start a container. With a fixed wait, both jobs would still reach docker run at the same moment. If we randomize the delay, the idea is that the two containers won't be requested to start at the exact same time.
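
To illustrate the reasoning above, here is a minimal sketch of the jitter idea in isolation. It is not the code from this PR: the helper names (jittered_sleep, collision_probability) and the injectable rng and sleep parameters are assumptions added for illustration and testing. Only the 2 to 10 second range and the use of random.randint come from the diff.

```python
import random
import time
from typing import Callable, Optional


def jittered_sleep(min_seconds: int = 2,
                   max_seconds: int = 10,
                   rng: Optional[random.Random] = None,
                   sleep: Callable[[float], None] = time.sleep) -> int:
    """Sleep for a random whole number of seconds in [min_seconds, max_seconds].

    rng and sleep are hypothetical injection points so the helper can be
    exercised without actually waiting.
    """
    rng = rng or random.Random()
    delay = rng.randint(min_seconds, max_seconds)  # inclusive on both ends
    sleep(delay)
    return delay


def collision_probability(min_seconds: int = 2, max_seconds: int = 10) -> float:
    """Chance that two independent jobs draw the same whole-second delay."""
    choices = max_seconds - min_seconds + 1
    return 1.0 / choices


if __name__ == "__main__":
    print("slept for", jittered_sleep(), "seconds")
    print("same-delay probability:", collision_probability())
```

With whole-second delays between 2 and 10 there are 9 possible values, so two jobs that start together still pick the same delay about 1 time in 9. The jitter lowers the chance of a simultaneous start rather than removing it.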

Contributor

@mseth10 mseth10 left a comment

LGTM, thanks for the fix!

@mseth10 mseth10 merged commit b5b6743 into apache:v1.x Feb 12, 2021
@josephevans josephevans deleted the ci_docker_timeouts_v1.x_v2 branch February 12, 2021 19:36
josephevans added a commit to josephevans/mxnet that referenced this pull request Feb 24, 2021
* Add random sleep only, since retry attempts are already implemented.

* Reduce random sleep to 2-10 sec.

Co-authored-by: Joe Evans <joeev@amazon.com>
Zha0q1 added a commit that referenced this pull request Mar 2, 2021
* [v1.x] Migrate to use ECR as docker cache instead of dockerhub (#19654)

* [v1.x] Update CI build scripts to install python 3.6 from deadsnakes repo (#19788)

* Install python3.6 from deadsnakes repo, since 3.5 is EOL'd and get-pip.py no longer works with 3.5.

* Set symlink for python3 to point to newly installed 3.6 version.

* Setting symlink or using update-alternatives causes add-apt-repository to fail, so instead just set alias in environment to call the correct python version.

* Setup symlinks in /usr/local/bin, since it comes first in the path.

* Don't use absolute path for python3 executable, just use python3 from path.

Co-authored-by: Joe Evans <joeev@amazon.com>

* Disable unix-gpu-cu110 pipeline for v1.x build since we now build with cuda 11.0 in windows pipelines. (#19828)

Co-authored-by: Joe Evans <joeev@amazon.com>

* [v1.x] For ECR, ensure we sanitize region input from environment variable (#19882)

* Set default for cache_intermediate.

* Make sure we sanitize region extracted from registry, since we pass it to os.system.

Co-authored-by: Joe Evans <joeev@amazon.com>

* [v1.x] Address CI failures with docker timeouts (v2) (#19890)

* Add random sleep only, since retry attempts are already implemented.

* Reduce random sleep to 2-10 sec.

Co-authored-by: Joe Evans <joeev@amazon.com>

* [v1.x] CI fixes to make more stable and upgradable (#19895)

* Test moving pipelines from p3 to g4.

* Remove fallback codecov command - the existing (first) command works and the second always fails a few times before finally succeeding (and also doesn't support the -P parameter, which causes an error.)

* Stop using docker python client, since it still doesn't support latest nvidia 'gpus' attribute. Switch to using subprocess calls using list parameter (to avoid shell injections).

See docker/docker-py#2395

* Remove old files.

* Fix comment

* Set default environment variables

* Fix GPU syntax.

* Use subprocess.run and redirect output to stdout, don't run docker in interactive mode.

* Check if codecov works without providing parameters now.

* Send docker stderr to sys.stderr

* Support both nvidia-docker configurations, first try '--gpus all', and if that fails, then try '--runtime nvidia'.

Co-authored-by: Joe Evans <joeev@amazon.com>

* fix cd

* fix cudnn version for cu10.2 build

* WAR the dataloader issue with forked processes holding stale references (#19924)

* skip some tests

* fix skip

* [v.1x] Attempt to fix v1.x cd by installing new cuda compt package (#19959)

* update cuda compt for cd

* Update Dockerfile.build.ubuntu_gpu_cu102

* Update Dockerfile.build.ubuntu_gpu_cu102

* Update Dockerfile.build.ubuntu_gpu_cu110

* Update runtime_functions.sh

* Update Dockerfile.build.ubuntu_gpu_cu110

* Update Dockerfile.build.ubuntu_gpu_cu102

* update command

Co-authored-by: Joe Evans <joseph.evans@gmail.com>
Co-authored-by: Joe Evans <joeev@amazon.com>
Co-authored-by: Joe Evans <github@250hacks.net>
Co-authored-by: Przemyslaw Tredak <ptredak@nvidia.com>
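
The commit message above mentions two related changes from #19895: calling docker through subprocess with a list of arguments (to avoid shell injection), and trying '--gpus all' first, then '--runtime nvidia'. The following is a rough sketch of how those two ideas can fit together. It is not the MXNet CI code: build_docker_cmd, run_with_gpu_fallback, the runner parameter, and the '--rm' flag are all hypothetical choices made for this illustration.

```python
import subprocess
import sys
from typing import Callable, List, Optional, Sequence


def build_docker_cmd(image: str,
                     command: Sequence[str],
                     gpu_flags: Sequence[str] = ()) -> List[str]:
    """Build an argument list; no shell is involved, so arguments are not re-parsed."""
    return ["docker", "run", "--rm", *gpu_flags, image, *command]


def run_with_gpu_fallback(image: str,
                          command: Sequence[str],
                          runner: Optional[Callable[[List[str]], int]] = None) -> int:
    """Try '--gpus all' first, then '--runtime nvidia' if the first attempt fails."""
    if runner is None:
        # Default runner (an assumption of this sketch): forward docker output
        # to this process's stdout and stderr and return the exit code.
        def runner(cmd: List[str]) -> int:
            return subprocess.run(cmd, stdout=sys.stdout, stderr=sys.stderr).returncode

    returncode = 1
    for gpu_flags in (["--gpus", "all"], ["--runtime", "nvidia"]):
        returncode = runner(build_docker_cmd(image, command, gpu_flags))
        if returncode == 0:
            break
    return returncode
```

One limitation of this sketch: it falls back on any non-zero exit code, so a job that fails for its own reasons would be run a second time. A real script would need a way to tell an unsupported flag apart from a failing build.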
mseth10 pushed a commit to mseth10/incubator-mxnet that referenced this pull request Mar 15, 2021
Labels
pr-awaiting-review PR is waiting for code review
5 participants