Using GPU-optimized NGC images as base for ML (Pytorch/Tensorflow) docker images #457

weiji14 · 2023-05-14T23:56:11Z

Consolidating some of the discussion @ngam had around using NVIDIA GPU Cloud (NGC) containers as the base image for pytorch-notebook and ml-notebook, and potentially cupy (#322)

Is your feature request related to a problem? Please describe.

For machine learning and data analytics work that relies on NVIDIA Graphics Processing Units (GPUs), there are several driver- and hardware-related optimizations that can speed up processing workflows. Currently, the pytorch-notebook and ml-notebook docker images rely on CUDA libraries from conda-forge, which are less optimized than those available on NGC.

Describe the solution you'd like

Refactor the pytorch-notebook and ml-notebook to be based on NGC containers instead of the current base-image. This might involve flipping the current installation pipeline from Pangeo-first/ML-second (base-notebook -> pangeo-notebook -> ml-notebook) to ML-first/Pangeo-second (ngc -> ml-notebook -> pangeo-notebook). Something that can help with this is a pangeo-notebook metapackage #359
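To make the proposed build order concrete, here is a minimal multi-stage sketch of the ML-first/Pangeo-second chain. It is only an illustration under stated assumptions, not the repository's actual layout: the NGC tag, the stage names, and the pip-installed packages standing in for the Pangeo stack are all placeholders, and a real refactor would more likely use the pangeo-notebook metapackage from #359.

```dockerfile
# Hypothetical sketch of an ML-first/Pangeo-second build chain.
# The tag, stage names and package list below are illustrative assumptions.

# NGC release tag to build from (assumed; pick whichever release is current).
ARG NGC_TAG=23.04-py3

# Stage 1: ML-first. Start from the GPU-optimized NGC PyTorch container.
FROM nvcr.io/nvidia/pytorch:${NGC_TAG} AS ml-notebook
# ML-specific additions (extra libraries, JupyterHub glue) would go here.

# Stage 2: Pangeo-second. Layer geoscience packages on top of the ML stage.
FROM ml-notebook AS pangeo-notebook
# Stand-in for the Pangeo stack; the metapackage in #359 would replace this.
RUN pip install --no-cache-dir xarray dask zarr
```

With a layout like this, `docker build --target ml-notebook .` would produce only the slimmer ML image, while a plain `docker build .` would build through to the final pangeo-notebook stage.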

Describe alternatives you've considered

Spin things off into a different repository (pangeo-gpu-docker-images?), or have a separate build chain (ngc-pytorch-notebook, ngc-ml-notebook) from the current CI/CD infrastructure.

Additional context

One benefit of changing the build order to ML-first/Pangeo-second is that ML folks who don't need all of the heavy Climate/Ocean packages in pangeo-notebook can get a slimmer ml-notebook. For example, if they're deploying a model to some server API, they can base their docker image on ngc-ml-notebook, instead of the current heavy ml-notebook.

The disadvantage is that the refactoring will require some effort, and we need to be careful to ensure it doesn't affect existing JupyterHub deployments.


ngam · 2023-05-15T00:15:12Z

Just to note:

I think the tensorflow image was as good as (if not better than) the NGC one
I recently came across this effort which may help in the infrastructure quite a bit https://github.com/rapidsai/mambaforge-cuda
I am slightly out of sync due to urgent "standard climate modeling" needs (i.e., no ML), so I could be missing out on some updates. I think the general truth remains that when it comes to TensorFlow and PyTorch, our efforts in conda-forge are more delayed than we'd like due to all sorts of issues (inability to build on public CI, the licensing around NVIDIA products, etc.). There are positive updates on all fronts, but it simply takes time...

weiji14 · 2023-05-15T02:19:12Z

I recently came across this effort which may help in the infrastructure quite a bit https://github.com/rapidsai/mambaforge-cuda

Yes, I noticed that too just yesterday :D That looks to be built on top of nvidia/cuda and comes with mamba pre-installed, which is pretty much what we have here:

pangeo-docker-images/base-image/Dockerfile

Lines 65 to 77 in 7d4e51e

    
    # Install latest mambaforge in ${CONDA_DIR}
    RUN echo "Installing Mambaforge..." \
        && URL="https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-Linux-x86_64.sh" \
        && wget --quiet ${URL} -O installer.sh \
        && /bin/bash installer.sh -u -b -p ${CONDA_DIR} \
        && rm installer.sh \
        && mamba install conda-lock -y \
        && mamba clean -afy \
        # After installing the packages, we cleanup some unnecessary files
        # to try reduce image size - see https://jcristharif.com/conda-docker-tips.html
        # Although we explicitly do *not* delete .pyc files, as that seems to slow down startup
        # quite a bit unfortunately - see https://github.com/2i2c-org/infrastructure/issues/2047
        && find ${CONDA_DIR} -follow -type f -name '*.a' -delete

That rapidsai/mambaforge-cuda image will also be super helpful if we decide to have an image for the cupy/RAPIDSAI stack :D

weiji14 mentioned this issue Sep 26, 2023

ML image update #188

Closed
