-
Notifications
You must be signed in to change notification settings - Fork 71
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add GPU driver installation support in launcher #497
Conversation
/gcbrun |
1 similar comment
/gcbrun |
103b6fd
to
63548fb
Compare
e9588fb
to
2654a31
Compare
/gcbrun |
494337b
to
f4c4fba
Compare
5774e43
to
e155130
Compare
Please squash your commits to have a cleaner history. |
8919575
to
a603a6d
Compare
a603a6d
to
8a29735
Compare
code, _, _ := status.Result() | ||
di.logger.Printf("Gpu driver installation task exited with status: %d\n", code) | ||
|
||
err = remountAsExecutable(InstallationHostDir) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels weird,
can you try one thing? try to create the directory before launching the install container? And see if the result driver is executable without remounting
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we just create the directory with required permission before installation step, then it still fails. if we run the same remountAsExecutable
function before the installation steps, then it works fine.
} | ||
|
||
func getInstallerImageReference() (string, error) { | ||
installerImageRefBytes, err := exec.Command("cos-extensions", "list", "--", "--gpu-installer").Output() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we hardcode this ref? I run the command I saw "us.gcr.io/cos-cloud/cos-gpu-installer:v2.4.1"
I assume this won't change if in the same image?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As we use image family (e.g. cos-113-lts
) field in cloudbuild, build will use the latest image from the family which may have the installer version updated and may break if we hardcode it. So it is recommended to get the version using cos-extensions
for given base cos image as cos-gpu-installer
is not guaranteed to be backward compatible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be hardcoded by running this command at build time. That's because it will always be built on the same COS version. This should be fixed by GA; please add a TODO.
8a29735
to
d658692
Compare
/gcbrun |
d658692
to
2ed4e4a
Compare
/gcbrun |
2ed4e4a
to
6e65eb6
Compare
/gcbrun |
6e65eb6
to
bae1a53
Compare
/gcbrun |
bae1a53
to
91c74da
Compare
/gcbrun |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please ensure to get approvals from other reviewers before merging this PR. Their expertise in CS image will also provide valuable insights.
gcloud builds submit --config=test_gpu_driver_installation_cloudbuild.yaml --region us-west1 \ | ||
--substitutions _IMAGE_NAME=${OUTPUT_IMAGE_PREFIX}-hardened-${OUTPUT_IMAGE_SUFFIX},_IMAGE_PROJECT=${PROJECT_ID} | ||
exit | ||
# TODO: Enable these tests for debug image once gpu qouta is setup for the build project. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: create a bug item to track TODOs.
/gcbrun |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly minor comments.
launcher/cloudbuild.yaml
Outdated
- name: 'gcr.io/cloud-builders/gcloud' | ||
id: GpuDriverInstallationDebugImageTests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does this test work but GpuDriverInstallationHardenedImageTests not work?
@@ -0,0 +1,97 @@ | |||
#!/bin/bash | |||
local OPTIND |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand why this needs to be a separate script rather than extending the existing util.
} | ||
|
||
func getInstallerImageReference() (string, error) { | ||
installerImageRefBytes, err := exec.Command("cos-extensions", "list", "--", "--gpu-installer").Output() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be hardcoded by running this command at build time. That's because it will always be built on the same COS version. This should be fixed by GA; please add a TODO.
oci.WithHostHostsFile, | ||
oci.WithHostResolvconf, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since a lot of this is duplicated, it would be useful to have helper functions that help mount with hostnamespace, hosts file and resolve conf.
oci.WithHostDevices, | ||
oci.WithMounts(mounts), | ||
oci.WithHostNamespace(specs.NetworkNamespace), | ||
oci.WithHostNamespace(specs.PIDNamespace), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need the same PID namespace?
This PR adds the following changes for supporting the GPU driver installation in launcher component:
cos-gpu-installer
.container_runnner
to ensure that GPU device is accessible to the workload container.