Register device memory using the GPU memory factor #31

Merged: 1 commit merged into Project-HAMi:main on Oct 15, 2024

Conversation

cybergeek2077 (Contributor)

When the GPUMemoryFactor argument is set and the request does not specify vgpu-memory, pod allocation falls back to the total GPU memory (gpumem) multiplied by the factor.

This PR fixes the bug by aligning device memory registration with GPUMemoryFactor, so that pod allocation matches the actual total gpumem.
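
For context, here is a minimal sketch of the failure mode described above. defaultPodMemoryMiB is a hypothetical helper that only mirrors the reported behavior (allocation falling back to total memory times the factor); it is not a function from HAMi or Volcano, and the numbers are illustrative.

package main

import "fmt"

// Hypothetical sketch: when a pod omits vgpu-memory, allocation falls back to
// the memory figure the device plugin registered for the GPU, scaled back up
// by the GPU memory factor.
func defaultPodMemoryMiB(registeredMemMiB, gpuMemoryFactor uint) uint {
    return registeredMemMiB * gpuMemoryFactor
}

func main() {
    // Example: a 2004 MiB GPU with --gpu-memory-factor=2.
    fmt.Println(defaultPodMemoryMiB(2004, 2))   // before the fix: raw total registered, pod gets 4008 MiB
    fmt.Println(defaultPodMemoryMiB(2004/2, 2)) // after the fix: total/factor registered, pod gets 2004 MiB
}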

cybergeek2077 (Contributor, Author)

@SataQiu @archlitchi

SataQiu (Member) commented Oct 9, 2024

Did you verify it locally?

cybergeek2077 (Contributor, Author) commented Oct 10, 2024

> SGTM, Did you verify it locally?

Yes, I have verified it. My GPUMemoryFactor is 10. When vgpu-memory is not specified, nvidia-smi shows 10x the GPU memory. After the patch, it works correctly.

Reproduce: [screenshot]
Fix: [screenshot]

SataQiu (Member) commented Oct 10, 2024

Do the --gpu-memory-factor and --device-memory-scaling parameters mean the same thing? I think there is a difference: --device-memory-scaling enlarges the reported GPU memory to make use of virtual memory, while --gpu-memory-factor reduces the number of vGPU devices.

I think we already changed the number of reported devices by --gpu-memory-factor here:

for j := uint(0); j < GetGPUMemory()/gpuMemoryFactor; j++ {
    fakeID := GenerateVirtualDeviceID(id, j)
    virtualDevs = append(virtualDevs, &pluginapi.Device{
        ID:     fakeID,
        Health: pluginapi.Healthy,
    })
}

defer @archlitchi for more review
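
To spell out the distinction drawn in the comment above, here is a rough sketch with illustrative numbers; none of these variable names come from the HAMi code base.

package main

import "fmt"

func main() {
    totalMemMiB := 2004.0
    deviceMemoryScaling := 2.0 // --device-memory-scaling: oversubscribe GPU memory (virtual memory)
    gpuMemoryFactor := 2.0     // --gpu-memory-factor: coarsen the vGPU memory unit

    // --device-memory-scaling enlarges the memory the scheduler may hand out.
    scaledMemMiB := totalMemMiB * deviceMemoryScaling // 4008

    // --gpu-memory-factor shrinks the number of registered memory units
    // (and, as the loop above shows, the number of fake device IDs).
    registeredUnits := totalMemMiB / gpuMemoryFactor // 1002

    fmt.Println(scaledMemMiB, registeredUnits)
}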

cybergeek2077 (Contributor, Author) commented Oct 10, 2024

Sorry, I misunderstood the device-memory-scaling param in HAMi.

I thought the GPU memory adjusted by the following code was what the device plugin used for node capacity, node allocatable, and so on.

for j := uint(0); j < GetGPUMemory()/gpuMemoryFactor; j++ {
    fakeID := GenerateVirtualDeviceID(id, j)
    virtualDevs = append(virtualDevs, &pluginapi.Device{
        ID:     fakeID,
        Health: pluginapi.Healthy,
    })
}

However, the code changed in this PR registers the GPU memory with the Volcano scheduler through an annotation, which is what resource scheduling uses, including in the scenario where vgpu-memory is not specified.

So this bug could also impact scheduler resource calculation.
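
To make the concern concrete, here is a sketch of the kind of annotation-based registration being referred to. The comma-separated layout follows the volcano.sh/node-vgpu-register value quoted further down in this thread, but encodeNodeRegister, its signature, and the meaning of the trailing boolean are assumptions, not HAMi's actual encoder.

package main

import "fmt"

// Hypothetical encoder: the layout mimics "GPU-xxx,10,2004,NVIDIA-xxx,false:".
func encodeNodeRegister(uuid string, vgpuNumber, totalMemMiB, gpuMemoryFactor uint, model string, flag bool) string {
    // Dividing by the factor here is the point of the PR: the scheduler then
    // sees the same figure as the node's volcano.sh/vgpu-memory resource.
    return fmt.Sprintf("%s,%d,%d,%s,%t:", uuid, vgpuNumber, totalMemMiB/gpuMemoryFactor, model, flag)
}

func main() {
    fmt.Println(encodeNodeRegister("GPU-xxx", 10, 2004, 2, "NVIDIA-xxx", false))
    // "GPU-xxx,10,1002,NVIDIA-xxx,false:" rather than the unscaled 2004
}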

SataQiu (Member) commented Oct 14, 2024

I checked the current behavior.
I set --gpu-memory-factor=2, and the node in my dev environment has 2004 MiB of GPU memory.

# nvidia-smi
Mon Oct 14 20:46:23 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.23.05    Driver Version: 455.23.05    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 860M    On   | 00000000:01:00.0 Off |                  N/A |
| N/A   46C    P8    N/A /  N/A |     15MiB /  2004MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

The K8s Node's allocatable GPU memory resource is reported as 1002 (2004 / gpu-memory-factor = 2004 / 2 = 1002); this is expected.

    allocatable:
      cpu: "8"
      memory: 16112380Ki
      pods: "110"
      volcano.sh/vgpu-cores: "100"
      volcano.sh/vgpu-memory: "1002"
      volcano.sh/vgpu-number: "10"
    capacity:
      cpu: "8"
      memory: 16214780Ki
      pods: "110"
      volcano.sh/vgpu-cores: "100"
      volcano.sh/vgpu-memory: "1002"
      volcano.sh/vgpu-number: "10"

However, the volcano.sh/node-vgpu-register annotation of the Node reports 2004 (NOT 1002); it carries the real GPU memory of the host GPU device.

 volcano.sh/node-vgpu-register: 'GPU-xxx,10,2004,NVIDIA-xxx,false:'

We should keep it consistent, so I think this PR makes sense, thanks @cybergeek2077
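
As a quick sanity check of the figures above (2004 MiB and --gpu-memory-factor=2), treating the annotation value as something that should carry the factored figure; this is a sketch, not a test from the repository.

package main

import "fmt"

func main() {
    totalMemMiB := uint(2004)
    gpuMemoryFactor := uint(2)

    allocatable := totalMemMiB / gpuMemoryFactor          // 1002, matches volcano.sh/vgpu-memory above
    annotationBeforePatch := totalMemMiB                  // 2004, the unscaled value node-vgpu-register carried
    annotationAfterPatch := totalMemMiB / gpuMemoryFactor // 1002, consistent with the allocatable resource

    fmt.Println(allocatable == annotationAfterPatch, allocatable == annotationBeforePatch) // true false
}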

Signed-off-by: cybergeek2077 <zhaoshen@buaa.edu.cn>
archlitchi merged commit 86d7a26 into Project-HAMi:main on Oct 15, 2024
2 checks passed