Add GPU driver installation support in launcher #497

Merged: 2 commits merged into google:cs_gpu from the gpu-support branch on Oct 11, 2024

Conversation

meetrajvala

This PR adds the following changes to support GPU driver installation in the launcher component:

  • Installation of open-source GPU drivers using the cos-gpu-installer.
  • Changes in container_runner to ensure that the GPU device is accessible to the workload container.
  • Utility functions for listing the GPU device files and remounting the installation directory.
  • Image tests with required setup and validation scripts.
  • Unit tests

@meetrajvala meetrajvala marked this pull request as draft September 25, 2024 11:05
@meetrajvala
Author

/gcbrun

1 similar comment
@jkl73
Contributor

jkl73 commented Sep 25, 2024

/gcbrun

@meetrajvala meetrajvala force-pushed the gpu-support branch 2 times, most recently from 103b6fd to 63548fb Compare September 27, 2024 17:36
@meetrajvala meetrajvala marked this pull request as ready for review September 27, 2024 17:37
@meetrajvala meetrajvala force-pushed the gpu-support branch 5 times, most recently from e9588fb to 2654a31 Compare September 27, 2024 22:37
@meetrajvala
Author

/gcbrun

@meetrajvala meetrajvala force-pushed the gpu-support branch 11 times, most recently from 494337b to f4c4fba Compare September 30, 2024 21:38
@meetrajvala meetrajvala force-pushed the gpu-support branch 4 times, most recently from 5774e43 to e155130 Compare October 8, 2024 22:31
@yawangwang
Collaborator

Please squash your commits to have a cleaner history.

code, _, _ := status.Result()
di.logger.Printf("Gpu driver installation task exited with status: %d\n", code)

err = remountAsExecutable(InstallationHostDir)
Contributor

This feels weird.

Can you try one thing? Create the directory before launching the installer container, and see whether the resulting driver is executable without remounting.

Author

@meetrajvala meetrajvala Oct 11, 2024

If we just create the directory with the required permissions before the installation step, it still fails. If we run the same remountAsExecutable function before the installation step, it works fine.
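
For readers following this thread, here is a rough sketch of what a remount helper like `remountAsExecutable` could look like. It is an assumption, not the code under review: it follows the bind-mount-then-remount pattern commonly used on Container-Optimized OS, where writable paths are typically mounted `noexec`. The `remountCommands` helper and the `/var/lib/nvidia` path are hypothetical.

```go
package main

import (
	"fmt"
	"os/exec"
)

// remountCommands returns the two mount invocations used by this sketch:
// a bind mount of dir onto itself, followed by a remount with the exec
// option. Hypothetical helper, split out so the commands can be inspected.
func remountCommands(dir string) [][]string {
	return [][]string{
		{"mount", "--bind", dir, dir},
		{"mount", "-o", "remount,exec", dir},
	}
}

// remountAsExecutable runs the commands above in order and stops at the
// first failure. It needs root privileges on a Linux host.
func remountAsExecutable(dir string) error {
	for _, args := range remountCommands(dir) {
		out, err := exec.Command(args[0], args[1:]...).CombinedOutput()
		if err != nil {
			return fmt.Errorf("%v failed: %w (output: %s)", args, err, out)
		}
	}
	return nil
}

func main() {
	// Print the commands rather than running them, since mounting
	// requires root. The path is a placeholder for the installation dir.
	for _, args := range remountCommands("/var/lib/nvidia") {
		fmt.Println(args)
	}
}
```

If the directory sits on a `noexec` mount, setting permissions on it alone would not make its contents executable, which would be consistent with the behavior described above.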

}

func getInstallerImageReference() (string, error) {
installerImageRefBytes, err := exec.Command("cos-extensions", "list", "--", "--gpu-installer").Output()
Contributor

Can we hardcode this ref? When I ran the command, I saw "us.gcr.io/cos-cloud/cos-gpu-installer:v2.4.1".
I assume this won't change within the same image?

Author

@meetrajvala meetrajvala Oct 10, 2024

Because we use the image family field (e.g. cos-113-lts) in cloudbuild, the build uses the latest image from the family, which may ship an updated installer version, so a hardcoded reference could break. Since cos-gpu-installer is not guaranteed to be backward compatible, it is recommended to get the version from cos-extensions for the given base COS image.

Contributor

This could be hardcoded by running this command at build time. That's because it will always be built on the same COS version. This should be fixed by GA; please add a TODO.

@meetrajvala
Author

/gcbrun

@meetrajvala
Author

/gcbrun

@meetrajvala
Author

/gcbrun

@meetrajvala
Author

/gcbrun

@meetrajvala
Author

/gcbrun

Collaborator

@yawangwang yawangwang left a comment

Please make sure to get approvals from other reviewers before merging this PR. Their expertise in the CS image will also provide valuable insights.

gcloud builds submit --config=test_gpu_driver_installation_cloudbuild.yaml --region us-west1 \
--substitutions _IMAGE_NAME=${OUTPUT_IMAGE_PREFIX}-hardened-${OUTPUT_IMAGE_SUFFIX},_IMAGE_PROJECT=${PROJECT_ID}
exit
# TODO: Enable these tests for debug image once gpu quota is set up for the build project.
Collaborator

nit: create a bug item to track TODOs.

@meetrajvala meetrajvala changed the base branch from main to cs_gpu October 11, 2024 03:56
@meetrajvala
Author

/gcbrun

@meetrajvala meetrajvala merged commit 796705b into google:cs_gpu Oct 11, 2024
12 checks passed
Contributor

@alexmwu alexmwu left a comment

Mostly minor comments.

Comment on lines 261 to 262
- name: 'gcr.io/cloud-builders/gcloud'
id: GpuDriverInstallationDebugImageTests
Contributor

Why does this test work while GpuDriverInstallationHardenedImageTests does not?

@@ -0,0 +1,97 @@
#!/bin/bash
local OPTIND
Contributor

I don't understand why this needs to be a separate script rather than extending the existing util.

Comment on lines +114 to +115
oci.WithHostHostsFile,
oci.WithHostResolvconf,
Contributor

Since a lot of this is duplicated, it would be useful to have helper functions that set up the host namespace, hosts file, and resolv.conf.

oci.WithHostDevices,
oci.WithMounts(mounts),
oci.WithHostNamespace(specs.NetworkNamespace),
oci.WithHostNamespace(specs.PIDNamespace),
Contributor

Why do we need the same PID namespace?

4 participants