Advertising specific GPU types as separate extended resource #424
There is still no planned support for this in the k8s-device-plugin. All of the functionality is there (as described in the link you provided), but it is explicitly disabled by this line in the code: https://github.com/NVIDIA/k8s-device-plugin/blob/main/cmd/nvidia-device-plugin/main.go#L322. The future for supporting multiple GPU cards per node is a new Kubernetes mechanism called Dynamic Resource Allocation (DRA).
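For readers unfamiliar with DRA, a request via that mechanism would look roughly like the sketch below. This is a hedged illustration only: DRA is alpha and its API has changed between Kubernetes releases (the fields here follow the `resource.k8s.io/v1alpha2` shape), and the resource class name `gpu.nvidia.com` is a placeholder, not something any driver currently publishes under that name.

```yaml
# Hypothetical sketch only; DRA field names and API versions vary by release.
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaim
metadata:
  name: a100-claim
spec:
  # Placeholder resource class name; a real DRA driver defines its own.
  resourceClassName: gpu.nvidia.com
---
apiVersion: v1
kind: Pod
metadata:
  name: dra-example
spec:
  resourceClaims:
    - name: gpu
      source:
        resourceClaimName: a100-claim
  containers:
    - name: app
      image: ubuntu:22.04
      resources:
        claims:
          - name: gpu
```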
Hey Kevin, I was actually asking about specific GPU resource naming for GPUs on different nodes (not on the same node).
Hey Kevin, is there any functionality to advertise these specified resources/resourceClaims? Example -
Will the device plugin advertise resource usage details, e.g. how many A100 devices are in use? Will it provide details such as the below in any manner?
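If type-specific extended resources were advertised, usage would surface the same way `nvidia.com/gpu` does today: in the node's capacity/allocatable and in `kubectl describe node` output. A hypothetical snippet (the resource name `nvidia.com/a100` is an assumption; the plugin does not emit it today):

```yaml
# Hypothetical node status if A100s were advertised as their own resource;
# the resource name nvidia.com/a100 is illustrative only.
status:
  capacity:
    nvidia.com/a100: "8"
  allocatable:
    nvidia.com/a100: "8"
```

`kubectl describe node` would then also list it under "Allocated resources" with the requested count against the allocatable count.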
We're looking to install the YuniKorn scheduler on the cluster, and having different resources for different GPUs would help a lot in prioritizing the use of more powerful (and less available) GPUs among users via fair share. It's impossible to do with just labels.
Is there a reason why this isn't planned to be implemented here? This seems like an essential feature for any cluster with more than one model of GPU, and there's currently no adequate workaround at all.
It was a product decision, not an engineering one. All of the code to support it is merged in the plugin and simply disabled by https://github.com/NVIDIA/k8s-device-plugin/blob/main/cmd/nvidia-device-plugin/main.go#L239. The decision not to support this gets revisited periodically, but our product team is still not in favor of it, so our hands are tied. If you want to enable it in a custom build of the plugin, just remove the line referenced above and it should work as described in https://docs.google.com/document/d/1dL67t9IqKC2-xqonMi6DV7W2YNZdkmfX7ibB6Jb-qmk/edit#heading=h.jw5js7865egx.
@klueska thanks for the explanation. We also explored the extended-resource options, and we even have a component we wrote ourselves to patch nodes with GPU extended resources. Just curious: would you be open to adding a flag to turn this feature on/off, so we don't have to deploy a customized version of the NVIDIA device plugin?
@yuzliu Do you have multiple GPU types per node? If not, are node labels from GFD / nodeSelectors not enough for your use case?
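For the one-GPU-type-per-node case, the GFD label route looks like the following. The `nvidia.com/gpu.product` label is published by gpu-feature-discovery, but the exact product string varies by card model and driver, so treat the value below as an example and check your actual node labels first:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: needs-a100
spec:
  nodeSelector:
    # Label published by gpu-feature-discovery; the exact product
    # string varies by card, so verify it with `kubectl get nodes --show-labels`.
    nvidia.com/gpu.product: A100-SXM4-40GB
  containers:
    - name: app
      image: nvidia/cuda:12.2.0-base-ubuntu22.04
      resources:
        limits:
          nvidia.com/gpu: 1
```

Note this only steers scheduling; the resource being counted is still the generic `nvidia.com/gpu`, which is why it doesn't help schedulers that do fair-share accounting per resource type.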
@klueska Thanks for the reply! We don't have multiple GPU types per node, but we do have multiple GPU types per cluster. We have already deployed GPU feature discovery and have a gpu product label on each GPU node, but that doesn't solve our problem because:
Got it -- labels from GPU feature discovery are sufficient for 1, but not for 2 and 3 -- for those you need unique extended resources.
Yep, we even have an internal component to advertise extended resources, e.g. V100, A100, and T4. But I'd really love to have less customized logic internally and instead rely on NVIDIA's official component, to make our long-term maintenance easier.
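For anyone else carrying a similar workaround: the upstream-documented way to advertise a custom extended resource is to JSON-patch the node's status subresource. A sketch of the patch body (the resource name `nvidia.com/a100` is illustrative; note the `~1` escape for `/` in JSON-Pointer paths):

```json
[
  {
    "op": "add",
    "path": "/status/capacity/nvidia.com~1a100",
    "value": "8"
  }
]
```

This is applied as a PATCH with `Content-Type: application/json-patch+json` against `/api/v1/nodes/<node-name>/status` (e.g. through `kubectl proxy`). The downside, as noted above, is that someone has to keep the counts in sync, which is exactly the customized logic we'd like the plugin to own.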
This issue has become stale and will be closed automatically within 30 days if no activity is recorded.
Any progress on this issue?
Hello,
I am working at Uber.
1. Feature description
Advertising special hardware (specific GPU types, say A100) as a separate extended resource.
As of now, we have a blanket "nvidia.com/gpu" resource for all GPU types that this plugin supports. If we want our pods to run only on certain GPU types, we need to be able to request such a resource.
There are 2 ways to request such a specific resource -
This added functionality can be enabled via a configuration flag and can use gpu-feature-discovery labels to extract the SKU/GPU type.
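With the proposed behavior enabled, a pod would pin itself to a GPU type by requesting the type-specific resource directly. A sketch (the resource name `nvidia.com/a100` is hypothetical and depends on how the flag would render names):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: a100-only
spec:
  containers:
    - name: train
      image: nvidia/cuda:12.2.0-base-ubuntu22.04
      resources:
        limits:
          # Hypothetical type-specific resource name; today the plugin
          # only advertises the generic nvidia.com/gpu.
          nvidia.com/a100: 1
```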
2. Why
3. Similar existing work
I found a design doc, "Custom Resource Naming and Supporting Multiple GPU SKUs on a Single Node in Kubernetes".
It advertises different types of GPUs under new resource names, but those different GPU cards are expected to be on the same node. I am not sure whether the same approach also works when the corresponding GPU cards/types are on different nodes.
4. Summary of queries