[CI] Reduce the number of times Docker images are built #6301

Closed
hcho3 opened this issue Oct 28, 2020 · 4 comments · Fixed by #6305

hcho3 commented Oct 28, 2020

Currently, the CI server heavily uses Docker containers to manage suitable testing environments for the Linux platform. The original intent was to avoid the need for installing packages over and over again, by caching Docker images that were previously built. So the cost of building Docker containers would be amortized over multiple test jobs.

Unfortunately, in recent weeks, the Docker containers have been rebuilt from scratch very frequently. The reason is that, whenever the base container (nvidia/cuda) gets updated, all the cached Docker layers get invalidated, forcing our CI server to rebuild the container from scratch. Edit. This is not the correct diagnosis. See the latest comment below.

I recognize the importance of security patches, but having to rebuild entire Docker containers from scratch is tremendously wasteful. (It takes 20-30 mins to complete the build.)

Proposed fix. We should have a cron job (or something equivalent) that builds new Docker containers from the Dockerfiles at a preset frequency (say, every 2 weeks). All CI jobs should then always pull pre-built containers, instead of attempting to build containers from a Dockerfile. In short, our containers should be updated less frequently than the base cuda container is updated. The disadvantage is that updating a Dockerfile won't cause the container to be rebuilt immediately. Edit. See the latest comment below.
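
To make the schedule-based idea concrete, here is a minimal Python sketch of the two roles it separates: a periodic job that decides whether a rebuild is due, and CI jobs that only ever pull. The names `rebuild_due`, `ci_job_command` and the image name are hypothetical and are not part of the actual CI scripts; the 2-week period is the example value from the proposal.

```python
from datetime import datetime, timedelta
from typing import List

# "say, every 2 weeks" from the proposal; an illustrative value, not a setting in the real CI.
REBUILD_PERIOD = timedelta(weeks=2)


def rebuild_due(last_built: datetime, now: datetime,
                period: timedelta = REBUILD_PERIOD) -> bool:
    """Return True when the scheduled job should rebuild the image."""
    return now - last_built >= period


def ci_job_command(image: str) -> List[str]:
    """Command a CI job would run: pull the pre-built image, never build it."""
    return ["docker", "pull", image]


# Hypothetical image name, used only for illustration.
print(ci_job_command("registry.example.com/ci-image:latest"))
print(rebuild_due(datetime(2020, 10, 1), datetime(2020, 10, 28)))
```

With this split, a Dockerfile change only takes effect at the next scheduled rebuild, which is the disadvantage noted above.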

@trivialfis

@hcho3 hcho3 self-assigned this Oct 28, 2020
@hcho3 hcho3 changed the title Reduce the number of times Docker images are built [CI] Reduce the number of times Docker images are built Oct 28, 2020

hcho3 commented Oct 28, 2020

Example: the Jenkins CI server of the TVM project (https://github.com/apache/incubator-tvm) pulls pre-built images from Docker Hub. There is a time gap between a change to the Dockerfile and the update of the corresponding Docker container image. See https://tvm.apache.org/docs/contribute/pull_request.html#ci-environment. Also see the discussion in https://discuss.tvm.apache.org/t/ci-docker-how-to-match-docker-version-tags-in-source-control-with-the-ones-in-docker-hub/5152 for the rationale for avoiding Docker builds on the fly.

hcho3 commented Oct 28, 2020

My diagnosis was wrong. The actual reason is that Docker containers with different CUDA versions (10.2, 11.0) get pushed to the same Docker registry (492475357299.dkr.ecr.us-west-2.amazonaws.com/xgb-ci.gpu). As a result, a container with CUDA 10.2 will cause a cache miss for a container with CUDA 11.0.

Fix. Create separate registries to cache containers with different CUDA versions, e.g. xgb-ci.gpu11.0, xgb-ci.gpu10.2 and so forth.
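
A toy Python model of why this helps, under a simplifying assumption that each cache location remembers only the image most recently pushed to it. The function names and the job sequence are hypothetical; only the registry host and the `xgb-ci.gpu` naming come from the comment above. This is a sketch of the caching behavior, not the actual CI code.

```python
from typing import Callable, Dict, List

REGISTRY = "492475357299.dkr.ecr.us-west-2.amazonaws.com"  # from the comment above


def shared_repo(cuda_version: str) -> str:
    """Current scheme: every CUDA version maps to the same cache location."""
    return f"{REGISTRY}/xgb-ci.gpu"


def per_version_repo(cuda_version: str) -> str:
    """Proposed scheme: one cache location per CUDA version."""
    return f"{REGISTRY}/xgb-ci.gpu{cuda_version}"


def count_cache_hits(jobs: List[str], repo_for: Callable[[str], str]) -> int:
    """Count jobs that find an image built for their own CUDA version.

    Assumption: a cache location holds only the last image pushed to it.
    """
    last_pushed: Dict[str, str] = {}
    hits = 0
    for cuda_version in jobs:
        repo = repo_for(cuda_version)
        if last_pushed.get(repo) == cuda_version:
            hits += 1
        # After building (or reusing) the image, the job pushes it back.
        last_pushed[repo] = cuda_version
    return hits


# Hypothetical sequence of CI jobs alternating between the two CUDA versions.
jobs = ["10.2", "11.0", "10.2", "11.0"]
print(count_cache_hits(jobs, shared_repo))       # 0
print(count_cache_hits(jobs, per_version_repo))  # 2
```

In the shared scheme, alternating jobs keep overwriting each other's cached image, so every job misses; with one location per CUDA version, only the first job for each version misses.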
