-
Notifications
You must be signed in to change notification settings - Fork 6.8k
[v1.x] Address CI failures with docker timeouts (v2) #19890
Conversation
Hey @josephevans , Thanks for submitting the PR
CI supported jobs: [unix-cpu, centos-gpu, clang, miscellaneous, windows-gpu, unix-gpu, sanity, windows-cpu, centos-cpu, website, edge] Note: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks!
@@ -117,6 +119,9 @@ def run(self, *args, **kwargs) -> int: | |||
ret = 0 | |||
try: | |||
# Race condition: | |||
# add a random sleep to (a) give docker time to flush disk buffer after pulling image | |||
# and (b) minimize race conditions between jenkins runs on same host | |||
time.sleep(random.randint(2,10)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how does random help? vs let's say a fixed wait of 5 seconds?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Each jenkins slave (linux cpu nodes, at least) have 2 "slots" they can run in parallel, and when 2 jobs using the same docker images start at the exact same time on these 2 slots, they both will attempt to pull down the image from ECR and start a container. If we randomize the delay, the idea is that both containers won't be requested to start at the exact same time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks for the fix!
* Add random sleep only, since retry attempts are already implemented. * Reduce random sleep to 2-10 sec. Co-authored-by: Joe Evans <joeev@amazon.com>
* [v1.x] Migrate to use ECR as docker cache instead of dockerhub (#19654) * [v1.x] Update CI build scripts to install python 3.6 from deadsnakes repo (#19788) * Install python3.6 from deadsnakes repo, since 3.5 is EOL'd and get-pip.py no longer works with 3.5. * Set symlink for python3 to point to newly installed 3.6 version. * Setting symlink or using update-alternatives causes add-apt-repository to fail, so instead just set alias in environment to call the correct python version. * Setup symlinks in /usr/local/bin, since it comes first in the path. * Don't use absolute path for python3 executable, just use python3 from path. Co-authored-by: Joe Evans <joeev@amazon.com> * Disable unix-gpu-cu110 pipeline for v1.x build since we now build with cuda 11.0 in windows pipelines. (#19828) Co-authored-by: Joe Evans <joeev@amazon.com> * [v1.x] For ECR, ensure we sanitize region input from environment variable (#19882) * Set default for cache_intermediate. * Make sure we sanitize region extracted from registry, since we pass it to os.system. Co-authored-by: Joe Evans <joeev@amazon.com> * [v1.x] Address CI failures with docker timeouts (v2) (#19890) * Add random sleep only, since retry attempts are already implemented. * Reduce random sleep to 2-10 sec. Co-authored-by: Joe Evans <joeev@amazon.com> * [v1.x] CI fixes to make more stable and upgradable (#19895) * Test moving pipelines from p3 to g4. * Remove fallback codecov command - the existing (first) command works and the second always fails a few times before finally succeeding (and also doesn't support the -P parameter, which causes an error.) * Stop using docker python client, since it still doesn't support latest nvidia 'gpus' attribute. Switch to using subprocess calls using list parameter (to avoid shell injections). See docker/docker-py#2395 * Remove old files. * Fix comment * Set default environment variables * Fix GPU syntax. * Use subprocess.run and redirect output to stdout, don't run docker in interactive mode. * Check if codecov works without providing parameters now. * Send docker stderr to sys.stderr * Support both nvidia-docker configurations, first try '--gpus all', and if that fails, then try '--runtime nvidia'. Co-authored-by: Joe Evans <joeev@amazon.com> * fix cd * fix cudnn version for cu10.2 buiuld * WAR the dataloader issue with forked processes holding stale references (#19924) * skip some tests * fix ski[ * [v.1x] Attempt to fix v1.x cd by installing new cuda compt package (#19959) * update cude compt for cd * Update Dockerfile.build.ubuntu_gpu_cu102 * Update Dockerfile.build.ubuntu_gpu_cu102 * Update Dockerfile.build.ubuntu_gpu_cu110 * Update runtime_functions.sh * Update Dockerfile.build.ubuntu_gpu_cu110 * Update Dockerfile.build.ubuntu_gpu_cu102 * update command Co-authored-by: Joe Evans <joseph.evans@gmail.com> Co-authored-by: Joe Evans <joeev@amazon.com> Co-authored-by: Joe Evans <github@250hacks.net> Co-authored-by: Przemyslaw Tredak <ptredak@nvidia.com>
* [v1.x] Migrate to use ECR as docker cache instead of dockerhub (apache#19654) * [v1.x] Update CI build scripts to install python 3.6 from deadsnakes repo (apache#19788) * Install python3.6 from deadsnakes repo, since 3.5 is EOL'd and get-pip.py no longer works with 3.5. * Set symlink for python3 to point to newly installed 3.6 version. * Setting symlink or using update-alternatives causes add-apt-repository to fail, so instead just set alias in environment to call the correct python version. * Setup symlinks in /usr/local/bin, since it comes first in the path. * Don't use absolute path for python3 executable, just use python3 from path. Co-authored-by: Joe Evans <joeev@amazon.com> * Disable unix-gpu-cu110 pipeline for v1.x build since we now build with cuda 11.0 in windows pipelines. (apache#19828) Co-authored-by: Joe Evans <joeev@amazon.com> * [v1.x] For ECR, ensure we sanitize region input from environment variable (apache#19882) * Set default for cache_intermediate. * Make sure we sanitize region extracted from registry, since we pass it to os.system. Co-authored-by: Joe Evans <joeev@amazon.com> * [v1.x] Address CI failures with docker timeouts (v2) (apache#19890) * Add random sleep only, since retry attempts are already implemented. * Reduce random sleep to 2-10 sec. Co-authored-by: Joe Evans <joeev@amazon.com> * [v1.x] CI fixes to make more stable and upgradable (apache#19895) * Test moving pipelines from p3 to g4. * Remove fallback codecov command - the existing (first) command works and the second always fails a few times before finally succeeding (and also doesn't support the -P parameter, which causes an error.) * Stop using docker python client, since it still doesn't support latest nvidia 'gpus' attribute. Switch to using subprocess calls using list parameter (to avoid shell injections). See docker/docker-py#2395 * Remove old files. * Fix comment * Set default environment variables * Fix GPU syntax. * Use subprocess.run and redirect output to stdout, don't run docker in interactive mode. * Check if codecov works without providing parameters now. * Send docker stderr to sys.stderr * Support both nvidia-docker configurations, first try '--gpus all', and if that fails, then try '--runtime nvidia'. Co-authored-by: Joe Evans <joeev@amazon.com> * fix cd * fix cudnn version for cu10.2 buiuld * WAR the dataloader issue with forked processes holding stale references (apache#19924) * skip some tests * fix ski[ * [v.1x] Attempt to fix v1.x cd by installing new cuda compt package (apache#19959) * update cude compt for cd * Update Dockerfile.build.ubuntu_gpu_cu102 * Update Dockerfile.build.ubuntu_gpu_cu102 * Update Dockerfile.build.ubuntu_gpu_cu110 * Update runtime_functions.sh * Update Dockerfile.build.ubuntu_gpu_cu110 * Update Dockerfile.build.ubuntu_gpu_cu102 * update command Co-authored-by: Joe Evans <joseph.evans@gmail.com> Co-authored-by: Joe Evans <joeev@amazon.com> Co-authored-by: Joe Evans <github@250hacks.net> Co-authored-by: Przemyslaw Tredak <ptredak@nvidia.com>
Description
Add random sleep (between 2-10 sec) to give docker time to flush pulled images to disk and minimize chance of race condition between jenkins slave slots (on same machine) which causes docker run timeout.