feat: Merge the CPU and GPU Dockerfiles into shared definitions #129

Closed
wants to merge 9 commits

Conversation

ca-scribner (Contributor) commented Sep 23, 2020

feat: Merge the CPU and GPU Dockerfiles into shared definitions

The CPU and GPU Dockerfile streams are virtually identical except for their upstream images. This duplication increases maintenance effort and the chance of misalignment between CPU and GPU images (minor accidental differences in versions/tools exist in the released images). This PR merges the two streams into a single Dockerfile to address these issues. The requirement that CPU and GPU images build from different upstream images is met by accepting the base image as a `docker build` `--build-arg` (e.g., `docker build --build-arg BASE_CONTAINER=<some-docker-stacks-image>` for CPU, `docker build --build-arg BASE_CONTAINER=<some-gpu-jupyter-image>` for GPU).
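For illustration, the shared Dockerfile pattern looks roughly like this (the default image name is a placeholder, not necessarily what this repo uses):

```dockerfile
# BASE_CONTAINER is supplied at build time; CPU and GPU builds differ only here
ARG BASE_CONTAINER=jupyter/base-notebook
FROM $BASE_CONTAINER

# ...shared CPU/GPU build steps follow...
```

A CPU build then passes a docker-stacks image (`docker build --build-arg BASE_CONTAINER=<some-docker-stacks-image> .`) while a GPU build passes a gpu-jupyter-derived image, and everything after the `FROM` line is shared.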

Resolves #107
Resolves #122

Todo before merge:

  • remove the CI trigger for pull requests (set it back to triggering only on push to master)
  • squash

Combine cpu/gpu images

  • Edit Dockerfiles to accept a build-arg `BASE_CONTAINER` for the upstream `FROM` image so that CPU and GPU builds can share a Dockerfile
  • Add `upstream-equivalent-notebook-gpu` to create the GPU version of the upstream image for `base-notebook`. This pulls `gpu-jupyter` to generate the GPU Dockerfile, then builds that Dockerfile
  • Resolve any unintended differences between CPU/GPU images
  • Remove the CPU/GPU subdirs

Update CI

  • Add version pinning during CI by passing `upstream-image:this_sha` rather than `upstream-image:latest` (e.g., to build `minimal-notebook` we pass `base-notebook:this_sha` as the upstream image). This ensures that no cross-talk can occur between two concurrent builds from different tasks
  • Remove `docker system prune -f -a` from each CI image build step. This caused each build step to re-pull the previous step's layers. `build_push.sh` can optionally prune only recent layers if space savings are required
  • Add layer caching to the CI to speed up deployment/testing. CI now runs `docker pull this_image:master` prior to each `docker build` to fetch the most recently built image, so that `docker build` can reuse its layers. On a perfect cache hit, each image builds in <5 min instead of 15-35 min. This also prevents rebuilding images unnecessarily (e.g., for a CPU-only change, the GPU images pull from cache rather than rebuild). In the worst case (editing `base-notebook`) this adds ~3-5 min to the build
  • Encapsulate the typical build logic for each image (pull for cache, build image, tag latest/sha, push, optionally clean) into `scripts/build_push.sh`
  • Add `build_settings.env` to document upstream images for CI and local dev
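The per-image build logic encapsulated in `scripts/build_push.sh` can be sketched as below. Function, registry, and tag names are illustrative rather than the exact script, and `DRY_RUN` is added here so the example prints the docker commands instead of running them:

```shell
#!/usr/bin/env bash
# Sketch of the pull-for-cache / build / tag / push flow (illustrative names).
set -euo pipefail

build_push() {
  local image="$1" sha="$2" repo="$3"
  # DRY_RUN=1 echoes the docker commands instead of executing them
  local run="docker"
  if [ "${DRY_RUN:-0}" = "1" ]; then run="echo docker"; fi

  # Pull the most recent master build so `docker build` can reuse its layers;
  # a cache miss is not fatal
  $run pull "$repo/$image:master" || true
  # Build, seeding the layer cache from the pulled image
  $run build --cache-from "$repo/$image:master" -t "$repo/$image:$sha" .
  # Tag and push both the immutable sha tag and latest
  $run tag "$repo/$image:$sha" "$repo/$image:latest"
  $run push "$repo/$image:$sha"
  $run push "$repo/$image:latest"
}

DRY_RUN=1 build_push base-notebook abc1234 myregistry
```

Pinning the upstream to a sha tag (here `abc1234`) is what prevents cross-talk between concurrent CI runs: each build consumes exactly the image its own pipeline produced.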

Add documentation

  • expand the top-level readme.md
  • add flowcharts showing inheritance
  • add a readme.md in each image subdir

Add doc/convenience scripts for developers

  • `build*.sh` scripts added to the root and image subdirs to help automate development

Misc fixes:

  • add `fix-permissions $CONDA_DIR && fix-permissions /home/$NB_USER` to a few `pip install` and `conda install` steps that were missing it (the missing calls inflated image size). Add a note to readme.md about this
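The pattern being applied is roughly the following (the package name is an example only):

```dockerfile
# Run fix-permissions in the same RUN step as the install, so that files the
# install left root-owned are not duplicated into an extra layer later
RUN pip install --no-cache-dir some-package && \
    fix-permissions $CONDA_DIR && \
    fix-permissions /home/$NB_USER
```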

Future work:

  • remove debug code from CI
  • the pytorch install includes CUDA drivers. Does this break CPU-only pytorch? If so, add a build-arg for a CPU/GPU toggle?

ca-scribner (Contributor, Author) commented:
resolves #48

ca-scribner force-pushed the combine-cpu-gpu-dockerfiles-rebase branch 2 times, most recently from 5b26f94 to 1509673 on September 28, 2020 14:24
Note: Cannot figure out why trivy takes so long to scan. An image takes 8 min to scan here, but if I run it again in a separate workflow it takes seconds (cached?). Sometimes it takes ~1-2 min, which was typical of other pushes to master.
ca-scribner force-pushed the combine-cpu-gpu-dockerfiles-rebase branch from 1509673 to f55eab6 on September 28, 2020 14:28
Successfully merging this pull request may close these issues:

  • Tensorflow missing cuDNN drivers
  • Refactor notebook images to reduce duplication