Build cuda:12.4.0-cudnn8-devel-ubuntu22.04 docker image and host it in pytorch AWS #1811

Open
atalman opened this issue May 6, 2024 · 3 comments

Comments

@atalman
Contributor

atalman commented May 6, 2024

Build Nvidia docker image: cuda:12.4.0-cudnn8-devel-ubuntu22.04

See reference issue here:
https://gitlab.com/nvidia/container-images/cuda/-/issues/225

Upload to pytorch aws so this workflow can be fixed:
pytorch/pytorch/actions/runs/8974959068/job/24648540236?pr=125617

@atalman changed the title from "Build cuda:12.4.0-cudnn8-devel-ubuntu22.04 docker image and host it in AWS ghcr.io" to "Build cuda:12.4.0-cudnn8-devel-ubuntu22.04 docker image and host it in pytorch AWS" on May 6, 2024
@polarathene

Alternatively you could just use the existing DockerHub image with cudnn9? Or is that not valid to build/support?

I wasn't aware of existing issues when I saw the CI failure for a PR I'm involved in, but looked into it here: pytorch/pytorch#125632 (comment)

A quick fix is to just have the matrix for docker generate a versionless cudnn portion of the tag. Presumably nvidia may be taking that approach going forward, so if the version of cudnn does not strictly need to be 8, you could relax the major version pin with the docker images? (there is no cudnn9 tag with cuda 12.4 images, only previous minor tag versions).

Otherwise, won't you need to build (or republish) all the nvidia images being used from DockerHub? The CI is failing specifically because it's trying to pull an invalid tag for nvidia/cuda that you request:

--build-arg BASE_IMAGE=nvidia/cuda:12.4.0-cudnn8-devel-ubuntu22.04

So you need to avoid building that in the docker matrix, and separately build/publish your AWS image, or as I've suggested just add the logic to select the appropriate nvidia/cuda:12.4 image: nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04
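The tag selection suggested above could be sketched roughly as follows. This is only an illustration in Python, not the actual CI matrix code: the helper name `cuda_base_image` and its parameters are hypothetical, and the only tags it is checked against are the two mentioned in this thread.

```python
from typing import Optional


def cuda_base_image(cuda_version: str,
                    cudnn_major: Optional[str] = "8",
                    os_tag: str = "ubuntu22.04") -> str:
    """Build an nvidia/cuda devel image reference (hypothetical helper).

    Pass cudnn_major=None to emit the versionless "cudnn" tag component.
    """
    cudnn = "cudnn" if cudnn_major is None else f"cudnn{cudnn_major}"
    return f"nvidia/cuda:{cuda_version}-{cudnn}-devel-{os_tag}"


# The tag the CI requested, which is reported in this thread as not found.
pinned = cuda_base_image("12.4.0", cudnn_major="8")

# The versionless tag suggested as the replacement for CUDA 12.4.
relaxed = cuda_base_image("12.4.1", cudnn_major=None)

print(pinned)
print(relaxed)
```

Whether a given CUDA version should use the pinned or the versionless form would still have to be decided per matrix entry; the sketch does not check that a tag actually exists on DockerHub.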

pytorchmergebot pushed a commit to pytorch/pytorch that referenced this issue May 7, 2024
Fixes #125094

Please note: the Docker CUDA 12.4 failure is an existing issue, related to the docker image not being available on gitlab:
```
docker.io/nvidia/cuda:12.4.0-cudnn8-devel-ubuntu22.04: docker.io/nvidia/cuda:12.4.0-cudnn8-devel-ubuntu22.04: not found
```
 https://github.com/pytorch/pytorch/actions/runs/8974959068/job/24648540236?pr=125617

Here is the reference issue: https://gitlab.com/nvidia/container-images/cuda/-/issues/225

Tracked on our side: pytorch/builder#1811
Pull Request resolved: #125617
Approved by: https://github.com/huydhn, https://github.com/malfet
@atalman
Contributor Author

atalman commented May 7, 2024

pytorchmergebot pushed a commit to pytorch/pytorch that referenced this issue May 8, 2024
Fixes #125526 [#1811](pytorch/builder#1811)

Adopt syntax=docker/dockerfile:1, which has been stable since 2018 and is still best practice to declare in 2024.
- Syntax features dependent upon the [syntax directive version are documented here](https://hub.docker.com/r/docker/dockerfile).
- While you can set a fixed minor version, [Docker officially advises to only pin the major version](https://docs.docker.com/build/dockerfile/frontend/#stable-channel):

```
We recommend using docker/dockerfile:1, which always points to the latest stable release of the version 1 syntax, and receives both "minor" and "patch" updates for the version 1 release cycle.
BuildKit automatically checks for updates of the syntax when performing a build, making sure you are using the most current version.
```
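As a rough illustration of the pinning advice, a check along these lines could flag a Dockerfile whose syntax directive pins more than the major version. It is a simplified sketch: the helper names `syntax_pin` and `pins_major_only` are hypothetical, only the first line of the file is inspected, and image references containing a registry port are not handled.

```python
import re

# Matches a parser directive such as "# syntax=docker/dockerfile:1".
_SYNTAX_RE = re.compile(r"^#\s*syntax\s*=\s*(\S+)\s*$", re.IGNORECASE)


def syntax_pin(dockerfile_text: str):
    """Return (image, tag) from a syntax directive on the first line, or None."""
    lines = dockerfile_text.splitlines()
    if not lines:
        return None
    match = _SYNTAX_RE.match(lines[0])
    if match is None:
        return None
    # Simplification: everything after the first ":" is treated as the tag.
    image, _, tag = match.group(1).partition(":")
    return image, tag


def pins_major_only(tag: str) -> bool:
    """True for a tag like "1"; False for "1.7", "1.7.0" or "experimental"."""
    return tag.isdigit()


example = "# syntax=docker/dockerfile:1\nFROM ubuntu:22.04\n"
image, tag = syntax_pin(example)
print(image, tag, pins_major_only(tag))
```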

**Support for building with Docker prior to v23 (released in Feb 2023)**
NOTE: 18.06 may not be the accurate minimum version for using docker/dockerfile:1. According to the [DockerHub tag history](https://hub.docker.com/layers/docker/dockerfile/1.0/images/sha256-92f5351b2fca8f7e2f452aa9aec1c34213cdd2702ca92414eee6466fab21814a?context=explore), 1.0 of the syntax seems to be from Dec 2018, which is probably why docker/dockerfile:experimental was paired with it in this file.

Personally, I'd favor only supporting builds with Docker v23. This is only relevant for someone building this Dockerfile locally. The user could still extend the already built and published image from a registry on older versions of Docker without any concern for this directive, which only applies to building this Dockerfile, not to images that extend it.

However, if you're reluctant, you may want to refer others to [this Docker docs page](https://docs.docker.com/build/buildkit/#getting-started), where they should only need the ENV DOCKER_BUILDKIT=1. Presumably the requirement for experimental was dropped with syntax=docker/dockerfile:1 in releases of Docker since Dec 2018. Affected users can often quite easily install a newer version of Docker on their OS, as per Docker's official guidance (usually by adding an additional repo to the package manager).
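The version cut-off discussed above can be expressed as a small helper. This is a sketch in Python under one stated assumption: BuildKit is the default builder from Docker Engine 23.0 onward, so only older releases need the explicit opt-in. The function name `build_env_for` is hypothetical.

```python
def build_env_for(docker_version: str) -> dict:
    """Extra environment variables for `docker build` (hypothetical helper)."""
    major = int(docker_version.split(".")[0])
    # Assumption: Docker Engine 23.0+ uses BuildKit by default, so the
    # DOCKER_BUILDKIT=1 opt-in is only needed on older releases.
    return {"DOCKER_BUILDKIT": "1"} if major < 23 else {}


print(build_env_for("20.10.24"))
print(build_env_for("23.0.0"))
```

A wrapper script could merge the returned mapping into the environment it passes to `docker build`, leaving newer installations untouched.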

**Reference links**
Since one of these was already included in the inline note (now a broken link), I've included relevant links mentioned above. You could alternatively rely on git blame with a commit message referencing the links or this PR for more information.

Feel free to remove any of the reference links. They're mostly only relevant for maintainers to be aware of (which this PR itself has detailed adequately above).

Pull Request resolved: #125632
Approved by: https://github.com/malfet
pytorchbot pushed a commit to pytorch/pytorch that referenced this issue May 13, 2024
(cherry picked from commit b29d77b)
huydhn pushed a commit to pytorch/pytorch that referenced this issue May 13, 2024
Separate arm64 and amd64 docker builds (#125617)

(cherry picked from commit b29d77b)

Co-authored-by: atalman <atalman@fb.com>