Update instructions to build with nvidia cuda runtime image for ONNX #2435

Merged: 19 commits into master on Jul 29, 2023


agunapal
Collaborator

@agunapal agunapal commented Jun 28, 2023

Description

TorchServe's GPU Docker image uses the NVIDIA CUDA base image.

Third-party libraries such as ONNX require the NVIDIA CUDA runtime base image to work.

  • Introduce a new arg -bi to the docker build script to specify the base image
./build_image.sh -bi nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu20.04 -g -cv cu117 -t pytorch/ts_run:latest-gpu
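
As an illustration of how a build script could accept the new option, here is a minimal sketch of the flag parsing. It is not the actual docker/build_image.sh: the function name parse_args, the variable names, and the ubuntu:20.04 default are assumptions made for this example. Only the short flags -bi, -g, -cv and -t come from the command above.

```shell
# Hypothetical sketch, NOT the real docker/build_image.sh.
# Variable names and defaults are assumptions for illustration only.
BASE_IMAGE="ubuntu:20.04"   # assumed default when -bi is not given
USE_GPU=false
CUDA_VERSION=""
TAG=""

parse_args() {
  while [ "$#" -gt 0 ]; do
    case "$1" in
      -bi|-cv|-t)
        # These options take a value; fail clearly if it is missing.
        if [ "$#" -lt 2 ]; then
          echo "Missing value for $1" >&2
          return 1
        fi
        case "$1" in
          -bi) BASE_IMAGE="$2" ;;    # base image override
          -cv) CUDA_VERSION="$2" ;;  # e.g. cu117
          -t)  TAG="$2" ;;           # image tag
        esac
        shift 2
        ;;
      -g)
        USE_GPU=true
        shift
        ;;
      *)
        echo "Unknown option: $1" >&2
        return 1
        ;;
    esac
  done
}
```

Calling parse_args "$@" at the top of a script would leave the chosen base image in BASE_IMAGE, which could then be forwarded to docker build as a build argument.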

Fixes #(issue)

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

Feature/Issue validation/testing

Please describe the Unit or Integration tests that you ran to verify your changes and relevant result summary. Provide instructions so it can be reproduced.
Please also list any relevant details for your test configuration.

1_docker-regression (ubuntu-20.04).txt
2_docker-regression (self-hosted, regression-test-gpu).txt

Error message on -bi and -g

(torchserve) ubuntu@ip-172-31-60-100:~/serve/docker$ ./build_image.sh -bi nvidia/cuda:11.7.1-base-ubuntu20.04  -g -t py:test1
Incompatible options: -bi doesn't work with -g option
(torchserve) ubuntu@ip-172-31-60-100:~/serve/docker$ ./build_image.sh -bi nvidia/cuda:11.7.1-base-ubuntu20.04   -t py:test1
[+] Building 172.8s (24/24) FINISHED                                                                                                                                                      
 => [internal] load build definition from Dockerfile                                                                                                                                 0.0s
 => => transferring dockerfile: 38B                                                                                                                                                  0.0s
 => [internal] load .dockerignore                                                                                                                                                    0.0s
 => => transferring context: 2B                                                                                                                                                      0.0s
 => resolve image config for docker.io/docker/dockerfile:experimental                                                                                                                0.5s
 => CACHED docker-image://docker.io/docker/dockerfile:experimental@sha256:600e5c62eedff338b3f7a0850beb7c05866e0ef27b2d2e8c02aa468e78496ff5                                           0.0s
 => [internal] load metadata for docker.io/nvidia/cuda:11.7.1-base-ubuntu20.04                                                                                                       0.0s
 => CACHED [compile-image 1/9] FROM docker.io/nvidia/cuda:11.7.1-base-ubuntu20.04                                                                                                    0.0s
 => [internal] load build context                                                                                                                                                    0.0s
 => => transferring context: 80B                                                                                                                                                     0.0s
 => [compile-image 2/9] RUN --mount=type=cache,id=apt-dev,target=/var/cache/apt     apt-get update &&     apt-get upgrade -y &&     apt-get install software-properties-common -y  111.0s
 => [compile-image 3/9] RUN python3.9 -m venv /home/venv                                                                                                                             3.0s 
 => [compile-image 4/9] RUN python -m pip install -U pip setuptools                                                                                                                  3.0s 
 => [compile-image 5/9] RUN export USE_CUDA=1                                                                                                                                        0.4s 
 => [compile-image 6/9] RUN git clone --depth 1 https://github.com/pytorch/serve.git                                                                                                 2.8s 
 => [compile-image 7/9] WORKDIR serve                                                                                                                                                0.0s 
 => [compile-image 8/9] RUN     if echo "nvidia/cuda:11.7.1-base-ubuntu20.04" | grep -q "cuda:"; then         if [ "" ]; then             python ./ts_scripts/install_dependencies  35.5s 
 => [compile-image 9/9] RUN     if echo "false" | grep -q "false"; then         python -m pip install --no-cache-dir torchserve torch-model-archiver torch-workflow-archiver;    el  2.0s 
 => CACHED [production-image 2/9] RUN --mount=type=cache,target=/var/cache/apt     apt-get update &&     apt-get upgrade -y &&     apt-get install software-properties-common -y &&  0.0s 
 => CACHED [production-image 3/9] RUN useradd -m model-server     && mkdir -p /home/model-server/tmp                                                                                 0.0s 
 => [production-image 4/9] COPY --chown=model-server --from=compile-image /home/venv /home/venv                                                                                      5.0s 
 => [production-image 5/9] COPY dockerd-entrypoint.sh /usr/local/bin/dockerd-entrypoint.sh                                                                                           0.0s 
 => [production-image 6/9] RUN chmod +x /usr/local/bin/dockerd-entrypoint.sh     && chown -R model-server /home/model-server                                                         0.3s 
 => [production-image 7/9] COPY config.properties /home/model-server/config.properties                                                                                               0.0s 
 => [production-image 8/9] RUN mkdir /home/model-server/model-store && chown -R model-server /home/model-server/model-store                                                          0.4s
 => [production-image 9/9] WORKDIR /home/model-server                                                                                                                                0.0s
 => exporting to image                                                                                                                                                               5.2s
 => => exporting layers                                                                                                                                                              5.2s
 => => writing image sha256:97e78b58035f1c1519c66a2fc85f4d17e2c303c84d854d8f22f184a8515d83ab                                                                                         0.0s
 => => naming to docker.io/library/py:test1         

Nvidia-runtime

(torchserve) ubuntu@ip-172-31-2-198:~/serve/docker$ ./build_image.sh -bi nvidia/cuda:11.7.1-runtime-ubuntu20.04 -bt ci -t pytorch/torchserve:ci
[+] Building 215.2s (22/22) FINISHED                                                                                                                                                      
 => [internal] load build definition from Dockerfile                                                                                                                                 0.0s
 => => transferring dockerfile: 6.52kB                                                                                                                                               0.0s
 => [internal] load .dockerignore                                                                                                                                                    0.0s
 => => transferring context: 2B                                                                                                                                                      0.0s
 => resolve image config for docker.io/docker/dockerfile:experimental                                                                                                                0.5s
 => CACHED docker-image://docker.io/docker/dockerfile:experimental@sha256:600e5c62eedff338b3f7a0850beb7c05866e0ef27b2d2e8c02aa468e78496ff5                                           0.0s
 => [internal] load build definition from Dockerfile                                                                                                                                 0.0s
 => [internal] load .dockerignore                                                                                                                                                    0.0s
 => [internal] load metadata for docker.io/nvidia/cuda:11.7.1-runtime-ubuntu20.04                                                                                                    0.8s
 => [compile-image 1/9] FROM docker.io/nvidia/cuda:11.7.1-runtime-ubuntu20.04@sha256:4ff41b20a64e267e9bd9466061711b09adf4807f3d4c656d07009788a56f8178                               19.9s
 => => resolve docker.io/nvidia/cuda:11.7.1-runtime-ubuntu20.04@sha256:4ff41b20a64e267e9bd9466061711b09adf4807f3d4c656d07009788a56f8178                                              0.0s
 => => sha256:4ff41b20a64e267e9bd9466061711b09adf4807f3d4c656d07009788a56f8178 743B / 743B                                                                                           0.0s
 => => sha256:cca775f086be7b61abaf8428ac4aa71fba4a7a1d4718a5aee6cb09d7163ae604 13.17kB / 13.17kB                                                                                     0.0s
 => => sha256:79284eb3dfdfdfbd489bdfbcc675f51c192510f6d4ea5a5971876a0002f5bce1 2.21kB / 2.21kB                                                                                       0.0s
 => => sha256:ccfccf24900f621770df167b6811a582af777df108eee444c85ab551d39e9b77 7.94MB / 7.94MB                                                                                       0.4s
 => => sha256:6eeb573e3fe082c86b74ee5c6f0cf7c091d322a1584c89d8f843d69af7c099d3 47.88MB / 47.88MB                                                                                     0.6s
 => => sha256:ba47ed5447555d2c58ee4005528f925fc538e093f9fe584e7f0b1e5198ea6b94 183B / 183B                                                                                           0.1s
 => => sha256:e5c76c058a4460c199c66189c15efca36f4a512a35fe57273a979ccca58a0176 6.88kB / 6.88kB                                                                                       0.3s
 => => sha256:06d6ff943437bee1c3ad6bb60a3b7727408e450ed129dddd3adf3a46eac22f28 1.09GB / 1.09GB                                                                                       9.2s
 => => extracting sha256:ccfccf24900f621770df167b6811a582af777df108eee444c85ab551d39e9b77                                                                                            0.2s
 => => sha256:5ba16bd606c9f26e6b2bc4c850efa5ff293e1f3bacc3d341b25dc07001712780 62.30kB / 62.30kB                                                                                     0.6s
 => => sha256:566e1b27f99d4f73a48d51e3db82d3124417ae96bf622597b796e78d6c33e700 1.52kB / 1.52kB                                                                                       0.7s
 => => sha256:c20acb837b2297cff47e8f703eb6d6e035d74701f5492b55ac37694440cd26d9 1.68kB / 1.68kB                                                                                       0.7s
 => => extracting sha256:6eeb573e3fe082c86b74ee5c6f0cf7c091d322a1584c89d8f843d69af7c099d3                                                                                            0.7s
 => => extracting sha256:ba47ed5447555d2c58ee4005528f925fc538e093f9fe584e7f0b1e5198ea6b94                                                                                            0.0s
 => => extracting sha256:e5c76c058a4460c199c66189c15efca36f4a512a35fe57273a979ccca58a0176                                                                                            0.0s
 => => extracting sha256:06d6ff943437bee1c3ad6bb60a3b7727408e450ed129dddd3adf3a46eac22f28                                                                                           10.4s
 => => extracting sha256:5ba16bd606c9f26e6b2bc4c850efa5ff293e1f3bacc3d341b25dc07001712780                                                                                            0.0s
 => => extracting sha256:c20acb837b2297cff47e8f703eb6d6e035d74701f5492b55ac37694440cd26d9                                                                                            0.0s
 => => extracting sha256:566e1b27f99d4f73a48d51e3db82d3124417ae96bf622597b796e78d6c33e700                                                                                            0.0s
 => [compile-image 2/9] RUN --mount=type=cache,id=apt-dev,target=/var/cache/apt     apt-get update &&     apt-get upgrade -y &&     apt-get install software-properties-common -y  102.9s
 => [ci-image 2/6] RUN --mount=type=cache,target=/var/cache/apt     apt-get update &&     apt-get upgrade -y &&     apt-get install software-properties-common -y &&     add-apt-  127.3s
 => [compile-image 3/9] RUN python3.9 -m venv /home/venv                                                                                                                             3.0s
 => [compile-image 4/9] RUN python -m pip install -U pip setuptools                                                                                                                  3.2s
 => [compile-image 5/9] RUN export USE_CUDA=1                                                                                                                                        0.3s
 => [compile-image 6/9] RUN git clone --depth 1 https://github.com/pytorch/serve.git                                                                                                 2.2s
 => [compile-image 7/9] WORKDIR serve                                                                                                                                                0.0s
 => [compile-image 8/9] RUN     if echo "nvidia/cuda:11.7.1-runtime-ubuntu20.04" | grep -q "cuda:"; then         if [ "" ]; then             python ./ts_scripts/install_dependenc  37.7s
 => [compile-image 9/9] RUN python -m pip install --no-cache-dir torchserve torch-model-archiver torch-workflow-archiver                                                             1.9s
 => [ci-image 3/6] COPY --from=compile-image /home/venv /home/venv                                                                                                                   5.4s 
 => [ci-image 4/6] RUN python -m pip install --no-cache-dir -r https://raw.githubusercontent.com/pytorch/serve/master/requirements/developer.txt                                    26.8s 
 => [ci-image 5/6] RUN mkdir /home/serve                                                                                                                                             0.3s 
 => [ci-image 6/6] WORKDIR /home/serve                                                                                                                                               0.0s
 => exporting to image                                                                                                                                                               6.3s
 => => exporting layers                                                                                                                                                              6.3s
 => => writing image sha256:369e1a2f338827066d710963b0f4416f2057a02125f6ef19758458019f7ae23a                                                                                         0.0s
 => => naming to docker.io/pytorch/torchserve:ci                                                                                                                                     0.0s
(torchserve) ubuntu@ip-172-31-2-198:~/serve/docker$ docker run -it --gpus all -v $PWD:/home/serve pytorch/torchserve:ci

==========
== CUDA ==
==========

CUDA Version 11.7.1
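
The banner above only shows that the container started from a CUDA image. One hedged way to also confirm that the GPU is visible inside the container is to run nvidia-smi in it, which the NVIDIA Container Toolkit typically makes available when --gpus is passed. The helper below is hypothetical (the name gpu_check_cmd is not part of TorchServe), assumes the image's entrypoint passes the command through, and only prints the command so that it can be inspected before it is run.

```shell
# Hypothetical helper, not part of TorchServe: prints a docker command that
# runs nvidia-smi inside the given image with all GPUs exposed.
# Assumption: the image's entrypoint executes the command it is given.
gpu_check_cmd() {
  image="${1:-pytorch/torchserve:ci}"   # default: the tag built in the log above
  printf 'docker run --rm --gpus all %s nvidia-smi\n' "$image"
}
```

For example, gpu_check_cmd pytorch/torchserve:ci prints the command for the CI image built above; piping that output to sh would execute it on a host with Docker and the NVIDIA runtime installed.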

(torchserve) ubuntu@ip-172-31-7-107:~/serve/docker$ ./build_image.sh -bi nvidia/cuda:11.7.0-cudnn8-runtime-ubuntu20.04 -g -cv cu117 -t pytorch/ts_run:latest-gpu
[+] Building 0.8s (26/26) FINISHED                                                                                                                                                        
 => [internal] load build definition from Dockerfile                                                                                                                                 0.0s
 => => transferring dockerfile: 4.96kB                                                                                                                                               0.0s
 => [internal] load .dockerignore                                                                                                                                                    0.0s
 => => transferring context: 2B                                                                                                                                                      0.0s
 => resolve image config for docker.io/docker/dockerfile:experimental                                                                                                                0.3s
 => CACHED docker-image://docker.io/docker/dockerfile:experimental@sha256:600e5c62eedff338b3f7a0850beb7c05866e0ef27b2d2e8c02aa468e78496ff5                                           0.0s
 => [internal] load .dockerignore                                                                                                                                                    0.0s
 => [internal] load build definition from Dockerfile                                                                                                                                 0.0s
 => [internal] load metadata for docker.io/nvidia/cuda:11.7.0-cudnn8-runtime-ubuntu20.04                                                                                             0.3s
 => [compile-image 1/9] FROM docker.io/nvidia/cuda:11.7.0-cudnn8-runtime-ubuntu20.04@sha256:4a4398ca4dbe0d0dbcda3bb153333f1c4d66edb0b5d4fd48eefe765ab7d83d25                         0.0s
 => [internal] load build context                                                                                                                                                    0.0s
 => => transferring context: 80B                                                                                                                                                     0.0s
 => CACHED [runtime-image 2/9] RUN --mount=type=cache,target=/var/cache/apt     apt-get update &&     apt-get upgrade -y &&     apt-get install software-properties-common -y &&     0.0s
 => CACHED [runtime-image 3/9] RUN useradd -m model-server     && mkdir -p /home/model-server/tmp                                                                                    0.0s
 => CACHED [compile-image 2/9] RUN --mount=type=cache,id=apt-dev,target=/var/cache/apt     apt-get update &&     apt-get upgrade -y &&     apt-get install software-properties-comm  0.0s
 => CACHED [compile-image 3/9] RUN python3.9 -m venv /home/venv                                                                                                                      0.0s
 => CACHED [compile-image 4/9] RUN python -m pip install -U pip setuptools                                                                                                           0.0s
 => CACHED [compile-image 5/9] RUN export USE_CUDA=1                                                                                                                                 0.0s
 => CACHED [compile-image 6/9] RUN git clone --depth 1 https://github.com/pytorch/serve.git                                                                                          0.0s
 => CACHED [compile-image 7/9] WORKDIR serve                                                                                                                                         0.0s
 => CACHED [compile-image 8/9] RUN     if echo "nvidia/cuda:11.7.0-cudnn8-runtime-ubuntu20.04" | grep -q "cuda:"; then         if [ "cu117" ]; then             python ./ts_scripts  0.0s
 => CACHED [compile-image 9/9] RUN python -m pip install --no-cache-dir torchserve torch-model-archiver torch-workflow-archiver                                                      0.0s
 => CACHED [runtime-image 4/9] COPY --chown=model-server --from=compile-image /home/venv /home/venv                                                                                  0.0s
 => CACHED [runtime-image 5/9] COPY dockerd-entrypoint.sh /usr/local/bin/dockerd-entrypoint.sh                                                                                       0.0s
 => CACHED [runtime-image 6/9] RUN chmod +x /usr/local/bin/dockerd-entrypoint.sh     && chown -R model-server /home/model-server                                                     0.0s
 => CACHED [runtime-image 7/9] COPY config.properties /home/model-server/config.properties                                                                                           0.0s
 => CACHED [runtime-image 8/9] RUN mkdir /home/model-server/model-store && chown -R model-server /home/model-server/model-store                                                      0.0s
 => CACHED [runtime-image 9/9] WORKDIR /home/model-server                                                                                                                            0.0s
 => exporting to image                                                                                                                                                               0.0s
 => => exporting layers                                                                                                                                                              0.0s
 => => writing image sha256:988e7b1f9eba0f4b6f5b0fe3396195c80ac6877a2f251500b864a30ed04ba253                                                                                         0.0s
 => => naming to docker.io/pytorch/ts_run:latest-gpu                                                                                   
REPOSITORY                              TAG          IMAGE ID       CREATED          SIZE
pytorch/ts_run                          latest-gpu   988e7b1f9eba   15 minutes ago   8.29GB
pytorch/ts_base                         latest-gpu   c9ba68f62f4f   21 minutes ago   5.12GB
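
The compile-image step 8/9 in the logs above appears to branch on two things: whether the base image name contains "cuda:", and whether a CUDA version string was supplied (empty in the builds without -cv, "cu117" in the last build). The sketch below reproduces only that decision as a shell function. The function name and the returned labels are hypothetical, the non-CUDA branch is an assumption, and the actual commands run in each branch are truncated in the logs and are not reproduced here.

```shell
# Hypothetical sketch of the branch visible in compile-image step 8/9.
# It only reports which branch would be taken; the labels are made up.
select_install_mode() {
  base_image="$1"
  cuda_version="$2"
  if echo "$base_image" | grep -q "cuda:"; then
    if [ "$cuda_version" ]; then
      # CUDA base image and an explicit CUDA version such as cu117
      echo "cuda-pinned:$cuda_version"
    else
      # CUDA base image, no CUDA version given
      echo "cuda-default"
    fi
  else
    # Assumed fallback for a base image without "cuda:" in its name
    echo "no-cuda"
  fi
}
```

With the inputs from the logs, select_install_mode nvidia/cuda:11.7.1-base-ubuntu20.04 "" takes the inner else branch, while select_install_mode nvidia/cuda:11.7.0-cudnn8-runtime-ubuntu20.04 cu117 takes the pinned branch.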

Checklist:

  • Did you have fun?
  • Have you added tests that prove your fix is effective or that this feature works?
  • Has code been commented, particularly in hard-to-understand areas?
  • Have you made corresponding changes to the documentation?

@agunapal agunapal requested review from msaroufim and lxning June 28, 2023 19:42
@codecov

codecov bot commented Jun 28, 2023

Codecov Report

Merging #2435 (9becc65) into master (e2cd91b) will not change coverage.
The diff coverage is n/a.

❗ Current head 9becc65 differs from pull request most recent head 344f1d0. Consider uploading reports for the commit 344f1d0 to get more accurate results

@@           Coverage Diff           @@
##           master    #2435   +/-   ##
=======================================
  Coverage   72.66%   72.66%           
=======================================
  Files          78       78           
  Lines        3669     3669           
  Branches       58       58           
=======================================
  Hits         2666     2666           
  Misses        999      999           
  Partials        4        4           


@agunapal agunapal changed the title from "Update instructions to build with nvidia cuda runtime image for ONNX, TensorRT, DeepSpeed" to "Update instructions to build with nvidia cuda runtime image for ONNX" Jun 28, 2023
@agunapal agunapal requested a review from lxning June 28, 2023 21:47
@msaroufim
Member

msaroufim commented Jun 28, 2023

I'd really like us to merge the change that runs the regression tests inside a freshly built Docker container, so we can make sure this works instead of relying on logs.

@agunapal
Collaborator Author

@msaroufim I agree. We have to wait till #2403 is resolved and merged.

@msaroufim
Member

@agunapal LGTM, please just fix lint before merging.

@agunapal agunapal requested a review from namannandan July 19, 2023 18:12
Contributor

@chauhang chauhang left a comment

Thanks @agunapal for the PR. A few items:

  1. Can we switch to using CUDA 11.8 as the default?
  2. Please attach test results for verification of the two cases: an image with the NVIDIA runtime, and an image with the dev build needed for DeepSpeed

@agunapal
Collaborator Author

@chauhang I attached the NVIDIA Runtime logs

The upgrade to CUDA 11.8 is happening in #2489.

Currently, users can't test DeepSpeed with Docker because of bug #2492.

I can remove the comment added in the DeepSpeed README and add it back later, once the bug is fixed and verified. Is that fine?

@agunapal agunapal requested a review from chauhang July 21, 2023 23:53
@agunapal agunapal requested a review from lxning July 24, 2023 22:51
@msaroufim msaroufim dismissed chauhang’s stale review July 29, 2023 16:06

feedback seems addressed

@msaroufim msaroufim merged commit 35ef00f into master Jul 29, 2023