feat: Merge the CPU and GPU Dockerfiles into shared definitions #129

Closed
wants to merge 9 commits

Conversation

ca-scribner (Contributor) commented Sep 23, 2020

feat: Merge the CPU and GPU Dockerfiles into shared definitions

The CPU and GPU Dockerfile streams are virtually identical except for their upstream images. This duplication increases maintenance effort and the chance of misalignment between CPU and GPU images (minor accidental differences in versions/tools exist in the released images). This PR merges the two streams into a single Dockerfile to address these issues. The requirement that CPU and GPU images build from different upstream images is met by accepting the base image as a `docker build` `--build-arg` (e.g., `docker build --build-arg BASE_CONTAINER=<some-docker-stacks-image>` for CPU, `docker build --build-arg BASE_CONTAINER=<some-gpu-jupyter-image>` for GPU).
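For illustration, the shared Dockerfile pattern looks roughly like this (the default image name is a placeholder, not necessarily what this repo uses):

```dockerfile
# BASE_CONTAINER is supplied at build time; CPU and GPU builds differ only here
ARG BASE_CONTAINER=jupyter/base-notebook
FROM $BASE_CONTAINER

# ...shared CPU/GPU build steps follow...
```

A CPU build then passes a docker-stacks image (`docker build --build-arg BASE_CONTAINER=<some-docker-stacks-image> .`) while a GPU build passes a gpu-jupyter-derived image, and everything after the `FROM` line is shared.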

Resolves #107
Resolves #122

Todo before merge:

  • remove the CI trigger for pull requests (set it back to triggering only on push to master)
  • squash

Combine cpu/gpu images

  • Edit Dockerfiles to accept a build-arg `BASE_CONTAINER` for the upstream `FROM` image so that CPU and GPU builds can share a Dockerfile
  • Add `upstream-equivalent-notebook-gpu` to create the GPU version of the upstream image for `base-notebook`. This pulls `gpu-jupyter` to generate the GPU Dockerfile, then builds that Dockerfile
  • Resolve any unintended differences between CPU/GPU images
  • Remove the CPU/GPU subdirs

Update CI

  • Add version pinning during CI by passing `upstream-image:this_sha` rather than `upstream-image:latest` (e.g., to build `minimal-notebook` we pass `base-notebook:this_sha` as the upstream image). This ensures that no cross-talk can occur between two concurrent builds from different tasks
  • Remove `docker system prune -f -a` from each CI image build step. This caused each build step to re-pull the previous step's layers. `build_push.sh` can optionally prune only recent layers if space savings are required
  • Add layer caching to the CI to speed up deployment/testing. CI now runs `docker pull this_image:master` prior to each `docker build` to fetch the most recently built image, so that `docker build` can reuse its layers. On a perfect cache hit, each image builds in <5 min instead of 15-35 min. This also prevents rebuilding images unnecessarily (e.g., for a CPU-only change, the GPU images pull from cache rather than rebuild). In the worst case (editing `base-notebook`) this adds ~3-5 min to the build
  • Encapsulate the typical build logic for each image (pull for cache, build image, tag latest/sha, push, optionally clean) into `scripts/build_push.sh`
  • Add `build_settings.env` to document upstream images for CI and local dev
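The per-image build logic encapsulated in `scripts/build_push.sh` can be sketched as below. Function, registry, and tag names are illustrative rather than the exact script, and `DRY_RUN` is added here so the example prints the docker commands instead of running them:

```shell
#!/usr/bin/env bash
# Sketch of the pull-for-cache / build / tag / push flow (illustrative names).
set -euo pipefail

build_push() {
  local image="$1" sha="$2" repo="$3"
  # DRY_RUN=1 echoes the docker commands instead of executing them
  local run="docker"
  if [ "${DRY_RUN:-0}" = "1" ]; then run="echo docker"; fi

  # Pull the most recent master build so `docker build` can reuse its layers;
  # a cache miss is not fatal
  $run pull "$repo/$image:master" || true
  # Build, seeding the layer cache from the pulled image
  $run build --cache-from "$repo/$image:master" -t "$repo/$image:$sha" .
  # Tag and push both the immutable sha tag and latest
  $run tag "$repo/$image:$sha" "$repo/$image:latest"
  $run push "$repo/$image:$sha"
  $run push "$repo/$image:latest"
}

DRY_RUN=1 build_push base-notebook abc1234 myregistry
```

Pinning the upstream to a sha tag (here `abc1234`) is what prevents cross-talk between concurrent CI runs: each build consumes exactly the image its own pipeline produced.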

Add documentation

  • expand the top-level readme.md
  • add flowcharts showing inheritance
  • add a readme.md in each image subdir

Add doc/convenience scripts for developers

  • `build*.sh` scripts added to the root and image subdirs to help automate development

Misc fixes:

  • add `fix-permissions $CONDA_DIR && fix-permissions /home/$NB_USER` to a few `pip install` and `conda install` steps that were missing it (the missing calls inflated image size). Add a note to readme.md about this
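The pattern being applied is roughly the following (the package name is an example only):

```dockerfile
# Run fix-permissions in the same RUN step as the install, so that files the
# install left root-owned are not duplicated into an extra layer later
RUN pip install --no-cache-dir some-package && \
    fix-permissions $CONDA_DIR && \
    fix-permissions /home/$NB_USER
```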

Future work:

  • remove debug code from CI
  • the pytorch install includes CUDA drivers. Does this break CPU-only pytorch? If so, add a build-arg for a CPU/GPU toggle?

ca-scribner (Contributor, Author) commented:
resolves #48

ca-scribner force-pushed the combine-cpu-gpu-dockerfiles-rebase branch 2 times, most recently from 5b26f94 to 1509673 on September 28, 2020 14:24
Note: Cannot figure out why trivy takes so long to scan. An image takes 8 min to scan here, but if I run it again in a separate workflow it takes seconds (cached?). Sometimes it takes ~1-2 min, which was typical of other pushes to master.
ca-scribner force-pushed the combine-cpu-gpu-dockerfiles-rebase branch from 1509673 to f55eab6 on September 28, 2020 14:28
Successfully merging this pull request may close these issues:

  • Tensorflow missing cuDNN drivers
  • Refactor notebook images to reduce duplication