
[CI] Use separate Docker cache for each CUDA version #6305

Merged 1 commit into dmlc:master from fix_docker_caching on Oct 28, 2020

Conversation

@hcho3 (Collaborator) commented Oct 28, 2020

Closes #6301. This will greatly reduce the number of times the Docker container images are re-built.

#6202 introduced a subtle bug that pushed all GPU containers to the same registry xgb-ci.gpu, regardless of the CUDA version used. With this patch, GPU containers will be pushed to registries labeled with their CUDA versions: xgb-ci.gpu10.0, xgb-ci.gpu10.2, and xgb-ci.gpu11.0.
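A minimal sketch of the before/after naming scheme (the variable names are hypothetical and do not match ci_build.sh verbatim):

```bash
#!/bin/bash
# Illustrative sketch of the registry naming fix; variable names are
# hypothetical, not copied from ci_build.sh.
CONTAINER_TYPE=gpu
CUDA_VERSION=10.2

# Before (buggy): every CUDA version was pushed under the same registry
# name, so builds for different versions kept evicting each other's cache.
IMAGE_NAME="xgb-ci.${CONTAINER_TYPE}"                  # always xgb-ci.gpu

# After (fixed): the CUDA version is appended to the registry name,
# giving each version its own cached image.
IMAGE_NAME="xgb-ci.${CONTAINER_TYPE}${CUDA_VERSION}"   # xgb-ci.gpu10.2
echo "${IMAGE_NAME}"
```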

@hcho3 requested a review from trivialfis on October 28, 2020 10:12
@hcho3 (Collaborator, Author) commented Oct 28, 2020

One small typo ended up costing a lot of test time, as it turns out. The ci_build.sh needs to be cleaned up for legibility and clarity. I will create a follow-up issue for this.

@trivialfis (Member) left a comment

I confess I don't understand what's happening. Approving as long as it fixes the issue.

> The ci_build.sh needs to be cleaned up for legibility and clarity. I will create a follow-up issue for this.

While you are on this, can we use a scripting language like Python instead of sh?

@hcho3 (Collaborator, Author) commented Oct 28, 2020

> I confess I don't understand what's happening

It has to do with how we currently "cache" Docker containers, since re-building containers in every job would be very expensive. The way we cache them is to push previously built container images to a private registry (Elastic Container Registry from AWS) and then re-use the pre-built images in later testing jobs. (The pull-then-build pattern is sketched right after the list below.)

  • "Cache hit" occurs when there has been no change in the current Dockerfile or the base image (nvidia/cuda), so the pre-built image matches perfectly with the Dockerfile. The docker build will proceed very quickly, and it will emit log saying from the cache.

  • "Cache miss" occurs when either Dockerfile or the base image (nvidia/cuda) was recently updated, so the pre-built image is no longer with sync with Dockerfile. In this case, the CI will re-built the Docker container from scratch.

CircleCI has a good article about Docker caching and how it speeds up CI: https://circleci.com/docs/2.0/docker-layer-caching/

So what went wrong? The caching only works if the previous build used the same build ARG as the current build. In our use case, the ARG is the CUDA version. Due to the typo in ci_build.sh, the CI ended up pushing the CUDA 10.2 and 11.0 images to the same registry name xgb-ci.gpu (note the missing CUDA version number), leading to lots of cache misses. This PR fixes the typo and restores the separate registries xgb-ci.gpu10.2 and xgb-ci.gpu11.0 to enable proper caching per CUDA version.
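To make the failure mode concrete, a hedged sketch (the ARG name CUDA_VERSION_ARG is assumed here for illustration):

```bash
# Why sharing one registry name defeated the cache: the cached image embeds
# the build ARG it was built with, so alternating CUDA versions under a
# single name can never both hit the cache. CUDA_VERSION_ARG is an assumed
# ARG name for illustration.
docker build --cache-from xgb-ci.gpu --build-arg CUDA_VERSION_ARG=10.2 -t xgb-ci.gpu .
docker build --cache-from xgb-ci.gpu --build-arg CUDA_VERSION_ARG=11.0 -t xgb-ci.gpu .
# The second build's ARG value differs from the cached image's, so every
# layer that depends on the ARG is rebuilt from scratch (a cache miss).
```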

@hcho3 (Collaborator, Author) commented Oct 28, 2020

> While you are on this, can we use a scripting language like Python instead of sh?

That's an excellent idea. We could avoid lots of potential bugs from string manipulation.
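For instance (an illustrative sketch, not the actual typo in ci_build.sh), sh silently expands a misspelled variable to an empty string:

```bash
#!/bin/bash
# Illustrative sh pitfall, not the actual bug: a misspelled variable
# silently expands to the empty string instead of raising an error.
CUDA_VERSION=10.2
IMAGE_NAME="xgb-ci.gpu${CUDA_VERSON}"   # typo: CUDA_VERSON
echo "${IMAGE_NAME}"                    # prints "xgb-ci.gpu", no error

# `set -u` would at least turn the typo into a hard failure:
set -u
echo "xgb-ci.gpu${CUDA_VERSON}"         # error: CUDA_VERSON: unbound variable
```

In Python, the same typo would raise a NameError immediately.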

@hcho3 (Collaborator, Author) commented Oct 28, 2020

Note that the CI is now building new containers because we changed the registry names 🤦 . We are already running close to the budget limit, so I may temporarily lift the budget limit for today.

@hcho3 (Collaborator, Author) commented Oct 28, 2020

Temporarily raised the budget limit to 100 USD for today. hcho3/xgboost-devops@2541b74

@hcho3 merged commit f6169c0 into dmlc:master on Oct 28, 2020
@hcho3 deleted the fix_docker_caching branch on October 28, 2020 18:07
Linked issue: #6301 [CI] Reduce the number of times Docker images are built