[feature] Dynamic MIG partitioner #361
Comments
Please see this document on why this is not feasible under the current Kubernetes resource model. Once the following newly accepted Kubernetes Enhancement Proposal gets implemented, we will be able to build a device plugin that properly supports what you suggest.
I updated the example to emphasise the capability to perform MIG-sliced multi-GPU training, e.g. by requesting eight 40Gi MIG slices (from different GPU cards on the same DGX). AFAIK this is currently not possible, not even with a static MIG layout.
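For illustration, the request described in this comment might look like the pod manifest below, built here as a Python dict. This is only a sketch under stated assumptions: it assumes the device plugin's `mixed` MIG strategy, where slices are advertised as `nvidia.com/mig-<profile>` resources, and that a 40Gi slice corresponds to the `3g.40gb` profile of an 80 GB A100. The pod name, container name, and image are hypothetical.

```python
import json


def mig_pod_manifest(profile: str, count: int) -> dict:
    """Build a pod manifest that asks for `count` MIG slices of `profile`."""
    # Assumption: MIG slices are exposed as extended resources named
    # nvidia.com/mig-<profile> (device plugin "mixed" strategy).
    resource = f"nvidia.com/mig-{profile}"
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": "mig-training"},  # hypothetical name
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "trainer",  # hypothetical name
                    "image": "example.com/trainer:latest",  # hypothetical image
                    "resources": {"limits": {resource: count}},
                }
            ],
        },
    }


# Eight 40 GB slices, as in the example from this comment.
manifest = mig_pod_manifest("3g.40gb", 8)
print(json.dumps(manifest, indent=2))
```

The manifest only expresses the request. Whether the scheduler can satisfy it, and whether the eight slices come from different cards, depends on the MIG layout and plugin behaviour discussed in this thread.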
Once we have Dynamic Resource Allocation, all of what you propose will be possible. We do not plan to "hack" this support onto the existing plugin; instead, we will put all our effort into supporting an API like this in the new plugin for DRA.
I agree with @klueska that DRA is the right way.
Hi @klueska, I cannot wait to try this new DRA feature, but after reading the KEP I have some concerns about how the resource driver will be implemented. What I want is not only dynamic MIG configuration but also dynamic allocation of network-attached GPUs. As I understand it, a Resource Driver needs to define its own ResourceClaimParameter CRD, allocate and configure the devices, and interact with the kubelet to prepare devices for containers. Most of this work is device-specific and should, I believe, be handled by the device vendor, but allocation seems different and more complicated when the devices are dynamically attached over the network. Could you share NVIDIA's thoughts on how to implement a Resource Driver and how to support dynamically attached devices?
Looks like kubernetes/enhancements#3064 has merged! Any thoughts on this ask?
Right now I am using https://github.com/nebuly-ai/nos for dynamic GPU partitioning. It serves the purpose for now, but I am running into issues when using it with Karpenter. These days the group is not active enough to contribute a solution.
Would it be possible for the GPU operator to make the MIG setup transparent, so that the end user can directly request per-pod GPU memory on demand while, under the hood, the MIG configuration is dynamically re-partitioned, i.e. without any intervention from a sysadmin / DevOps team?
https://www.nvidia.com/en-us/technologies/multi-instance-gpu/
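To make the request above more concrete, here is a minimal sketch of the first step such a partitioner would need: mapping a per-pod GPU memory request to the smallest MIG profile that satisfies it. It is not how the GPU operator works today. The function name is hypothetical, the profile table assumes an 80 GB A100, and the memory figures are the nominal sizes in the profile names, not the exact usable memory.

```python
# MIG profiles on an 80 GB A100 (assumed hardware), ordered from smallest
# to largest: (profile name, nominal memory in GB, compute slices out of 7).
A100_80GB_PROFILES = [
    ("1g.10gb", 10, 1),
    ("2g.20gb", 20, 2),
    ("3g.40gb", 40, 3),
    ("4g.40gb", 40, 4),
    ("7g.80gb", 80, 7),
]


def smallest_profile(requested_gb: float) -> str:
    """Return the smallest profile whose nominal memory covers the request.

    Hypothetical helper: a real partitioner would also have to check which
    profiles can still be placed on a partially used GPU.
    """
    if requested_gb <= 0:
        raise ValueError("requested memory must be positive")
    for name, memory_gb, _compute_slices in A100_80GB_PROFILES:
        if memory_gb >= requested_gb:
            return name
    raise ValueError(f"no single MIG profile provides {requested_gb} GB")


for request in (8, 12, 40):
    print(request, "GB ->", smallest_profile(request))
```

Choosing the profile is the easy part. The hard part, as the comments explain, is re-partitioning a GPU while other pods are using it and advertising the result to the scheduler.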