Register device memory using the GPU memory factor #31
Conversation
Did you verify it locally?
I think we already changed the number of reported devices here: volcano-vgpu-device-plugin/pkg/plugin/nvidia/utils.go, lines 82 to 88 in 506be8b
Defer to @archlitchi for more review.
Sorry, I misunderstood the device-memory-scaling param in HAMi. I thought the gpu memory modified by the following code was used by the device plugin for node capacity, node allocatable, and so on: volcano-vgpu-device-plugin/pkg/plugin/nvidia/utils.go, lines 82 to 88 in 506be8b
Instead, that code registers the gpu memory with the volcano scheduler via an annotation, which is used for resource scheduling when the request does not specify vgpu-memory. So this bug can also affect the scheduler's resource calculation.
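To illustrate the fallback described above, here is a minimal, hypothetical Go sketch; effectiveMemoryRequest and its parameter names are my own for illustration and are not taken from the volcano scheduler or the device plugin:

package main

import "fmt"

// effectiveMemoryRequest is a hypothetical illustration (not the real
// scheduler code): when a pod does not request volcano.sh/vgpu-memory,
// scheduling falls back to the memory the device plugin registered for the
// device, so that registered value must already have the GPU memory factor
// applied.
func effectiveMemoryRequest(requestedMiB, registeredMiB uint64) uint64 {
	if requestedMiB == 0 {
		return registeredMiB // no explicit vgpu-memory: use the registered device memory
	}
	return requestedMiB
}

func main() {
	// With the bug, the registered value is the raw 2004 MiB; with the fix it
	// is 1002 (2004 divided by a factor of 2), matching volcano.sh/vgpu-memory.
	fmt.Println(effectiveMemoryRequest(0, 2004)) // buggy registration: 2004
	fmt.Println(effectiveMemoryRequest(0, 1002)) // fixed registration: 1002
}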
I checked the current behavior:
# nvidia-smi
Mon Oct 14 20:46:23 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.23.05 Driver Version: 455.23.05 CUDA Version: 11.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce GTX 860M On | 00000000:01:00.0 Off | N/A |
| N/A 46C P8 N/A / N/A | 15MiB / 2004MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
The K8s node allocatable GPU memory is reported as 1002 (2004 / gpu-memory-factor = 2004 / 2 = 1002), which is expected.
allocatable:
  cpu: "8"
  memory: 16112380Ki
  pods: "110"
  volcano.sh/vgpu-cores: "100"
  volcano.sh/vgpu-memory: "1002"
  volcano.sh/vgpu-number: "10"
capacity:
  cpu: "8"
  memory: 16214780Ki
  pods: "110"
  volcano.sh/vgpu-cores: "100"
  volcano.sh/vgpu-memory: "1002"
  volcano.sh/vgpu-number: "10"
However, the volcano.sh/node-vgpu-register annotation still reports the unscaled value:
volcano.sh/node-vgpu-register: 'GPU-xxx,10,2004,NVIDIA-xxx,false:'
We should keep these consistent, so I think this PR makes sense. Thanks @cybergeek2077.
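For illustration only (assuming the third field of that annotation is the device memory, as the 2004 above suggests), after this PR the node should register the scaled value, consistent with volcano.sh/vgpu-memory:
volcano.sh/node-vgpu-register: 'GPU-xxx,10,1002,NVIDIA-xxx,false:'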
Please use …
Signed-off-by: cybergeek2077 <zhaoshen@buaa.edu.cn>
When the GPUMemoryFactor argument is used and the request does not specify vgpu-memory, pod allocation relies on the total gpumem multiplied by the factor.
This PR fixes that bug by aligning device memory registration with GPUMemoryFactor, ensuring pod allocation matches the total gpumem.
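As a rough sketch of the idea (assumed names, not the actual plugin code), the registered device memory would be divided by the factor before it is reported:

package main

import "fmt"

// registeredDeviceMemory is a simplified illustration of this PR's idea:
// apply the GPU memory factor when registering device memory, so the
// annotation and the node allocatable report the same unit.
func registeredDeviceMemory(totalMiB, gpuMemoryFactor uint64) uint64 {
	if gpuMemoryFactor == 0 {
		gpuMemoryFactor = 1 // defensive guard against division by zero
	}
	return totalMiB / gpuMemoryFactor
}

func main() {
	// 2004 MiB with a factor of 2 registers as 1002, matching the node's
	// volcano.sh/vgpu-memory value shown in the conversation above.
	fmt.Println(registeredDeviceMemory(2004, 2)) // 1002
}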