Add NVIDIA NGC image aliases #384

Merged
merged 3 commits into from
Feb 22, 2022

Conversation

0x2b3bfa0
Member

@0x2b3bfa0 0x2b3bfa0 commented Feb 7, 2022

NVIDIA NGC offers machine and container images with GPU drivers and Docker with GPU support out of the box for aws, az, gcp, k8s and other platforms.

We are “maintaining” custom AWS machine images (#206) that offer the same packages, plus Node.js for running the CML source distribution. After cml#825 we don't need Node.js anymore, and it makes more sense to use NVIDIA images instead of publishing our own.

Related

Example

resource "iterative_task" "example" {
  name  = "example"
  cloud = "aws"

  machine   = "m+v100"
  disk_size = "32"
  image     = "nvidia"

  script = <<-END
    #!/bin/sh
    nvidia-smi
    docker -v
  END
}

📖 Note that the default disk size (30 GB) is not enough for most of these images; it should be explicitly increased.

Versions

aws

$ aws ec2 describe-images --region=us-east-1 --owners=aws-marketplace --filter=Name=name,Values="NVIDIA*"
{
    "Images": [
        {
            "Architecture": "x86_64",
            "OwnerId": "679593333241",
            "Name": "NVIDIA Deep Learning  AMI v21.02.2-46a68101-e56b-41cd-8e32-631ac6e5d02b",
            "ProductCodes": [
                {
                    "ProductCodeId": "46kqvw7d9nbo0v2bfgi8z26gb",
                    "ProductCodeType": "marketplace"
                }
            ],
            ...
        },
        ...
    ]
}
image = "ubuntu@679593333241:x86_64:NVIDIA Deep Learning  AMI v21.02.2-46a68101-e56b-41cd-8e32-631ac6e5d02b"

📖 Visit https://aws.amazon.com/marketplace/pp?sku=46kqvw7d9nbo0v2bfgi8z26gb to accept terms and subscribe.
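Judging from the example above, the AWS `image` value appears to join the login user with the `OwnerId`, `Architecture` and `Name` fields reported by `describe-images`. A minimal sketch of that mapping, assuming this reading is correct; `aws_image_string` is a hypothetical helper for illustration and not part of the provider:

```shell
#!/bin/sh
# Hypothetical helper (an assumption, not part of the provider): joins the
# login user and the OwnerId, Architecture and Name fields from
# `aws ec2 describe-images` into the user@owner:architecture:name form.
aws_image_string() {
  user="$1"; owner="$2"; arch="$3"; name="$4"
  printf '%s@%s:%s:%s\n' "$user" "$owner" "$arch" "$name"
}

# Values copied from the describe-images output above; the image name
# contains two consecutive spaces, which must be preserved.
aws_image_string ubuntu 679593333241 x86_64 \
  "NVIDIA Deep Learning  AMI v21.02.2-46a68101-e56b-41cd-8e32-631ac6e5d02b"
```

Quoting the name matters here, because an unquoted argument would be split on its spaces and the resulting string would no longer match the marketplace image.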

az

$ az vm image list --publisher=nvidia --all
[
  {
    "offer": "ngc_base_image_version_b",
    "publisher": "nvidia",
    "sku": "gen2_21-11-0",
    "urn": "nvidia:ngc_base_image_version_b:gen2_21-11-0:21.11.0",
    "version": "21.11.0"
  },
  ...
]
image = "ubuntu@nvidia:ngc_base_image_version_b:gen2_21-11-0:21.11.0#plan"

📖 Run the following az command to accept the image terms:

az vm image terms accept --urn nvidia:ngc_base_image_version_b:gen2_21-11-0:21.11.0
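
The Azure `image` value in the example looks like the login user, the `urn` field from `az vm image list`, and a `#plan` suffix. A minimal sketch of that composition, under the assumption that the pattern generalizes; `az_image_string` is a hypothetical helper, and the `#plan` suffix is copied from the example without its semantics being spelled out in this description:

```shell
#!/bin/sh
# Hypothetical helper (an assumption, not part of the provider): builds the
# user@urn#plan form from a login user and the "urn" field reported by
# `az vm image list`.
az_image_string() {
  user="$1"; urn="$2"
  printf '%s@%s#plan\n' "$user" "$urn"
}

# URN copied from the `az vm image list` output above.
az_image_string ubuntu "nvidia:ngc_base_image_version_b:gen2_21-11-0:21.11.0"
```

The same URN is the one to pass to `az vm image terms accept --urn`, so that the accepted terms correspond to the image actually being requested.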

gcp

$ gcloud compute images list --project=nvidia-ngc-public --filter=name~nvidia-gpu-cloud
NAME                                               PROJECT            FAMILY  DEPRECATED  STATUS
nvidia-gpu-cloud-image-20211105                    nvidia-ngc-public                      READY
...
image = "ubuntu@nvidia-ngc-public/nvidia-gpu-cloud-image-20211105"

⚠️ Drivers don't work out of the box because they require a manual reboot to finish the installation.

Gory details: when logging in with an interactive shell, the following message is displayed:
NVIDIA GPU Cloud (NGC) is an optimized software environment that requires the
latest NVIDIA drivers to operate. If you do not download the NVIDIA drivers at
this time, your instance will shut down. Would you like to download the latest
NVIDIA drivers so NGC can finish installing? (Y/n)

See https://gist.github.com/eliask/cd14372aabcb871aaaa661a87e2a2b2d for some workarounds.

k8s

Example for the nvidia/cuda:11.6.0-runtime-ubuntu20.04 tag:

image = "nvidia/cuda:11.6.0-runtime-ubuntu20.04"

See https://hub.docker.com/r/nvidia/cuda for a comprehensive list of tags.

@DavidGOrtega
Contributor

After iterative/cml#825 we don't need Node.js anymore,

💯

Should we extend this PR to machine and runner?

@0x2b3bfa0
Member Author

Should we extend this pull request to machine and runner?

If we did, users would have to perform an extra step (accepting the image terms) before using cml runner.

@0x2b3bfa0
Member Author

Maybe we can default to just-in-time provisioning of vanilla Ubuntu 20.04 images, and then recommend switching to the NVIDIA ones as an optimization?

@0x2b3bfa0
Member Author

This is even more relevant after iterative/cml#895 (comment).

@0x2b3bfa0 0x2b3bfa0 linked an issue Feb 22, 2022 that may be closed by this pull request
7 tasks
@0x2b3bfa0 0x2b3bfa0 marked this pull request as ready for review February 22, 2022 16:51
@0x2b3bfa0 0x2b3bfa0 mentioned this pull request Feb 22, 2022
7 tasks
@0x2b3bfa0 0x2b3bfa0 requested a review from a team February 22, 2022 17:36
@0x2b3bfa0 0x2b3bfa0 self-assigned this Feb 22, 2022
@0x2b3bfa0 0x2b3bfa0 added cloud-common Applies to every cloud vendor machine-image labels Feb 22, 2022
@0x2b3bfa0 0x2b3bfa0 merged commit 35ce161 into master Feb 22, 2022
@0x2b3bfa0 0x2b3bfa0 deleted the nvidia-image branch February 22, 2022 17:51
@0x2b3bfa0 0x2b3bfa0 mentioned this pull request Mar 14, 2022
@0x2b3bfa0 0x2b3bfa0 mentioned this pull request May 6, 2022
3 tasks
Labels
cloud-common Applies to every cloud vendor machine-image
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Machine images
3 participants