Register device memory using the GPU memory factor #31

Merged: 1 commit merged into Project-HAMi:main on Oct 15, 2024

Conversation

cybergeek2077 (Contributor)

When the GPUMemoryFactor argument is set and the request does not specify vgpu-memory, pod allocation falls back to the total GPU memory (gpumem) multiplied by the factor.

This PR fixes the bug by aligning device memory registration with GPUMemoryFactor, so that pod allocation matches the actual total gpumem.
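
For context, here is a minimal sketch of the failure mode described above. defaultPodMemoryMiB is a hypothetical helper that only mirrors the reported behavior (allocation falling back to total memory times the factor); it is not a function from HAMi or Volcano, and the numbers are illustrative.

package main

import "fmt"

// Hypothetical sketch: when a pod omits vgpu-memory, allocation falls back to
// the memory figure the device plugin registered for the GPU, scaled back up
// by the GPU memory factor.
func defaultPodMemoryMiB(registeredMemMiB, gpuMemoryFactor uint) uint {
    return registeredMemMiB * gpuMemoryFactor
}

func main() {
    // Example: a 2004 MiB GPU with --gpu-memory-factor=2.
    fmt.Println(defaultPodMemoryMiB(2004, 2))   // before the fix: raw total registered, pod gets 4008 MiB
    fmt.Println(defaultPodMemoryMiB(2004/2, 2)) // after the fix: total/factor registered, pod gets 2004 MiB
}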

cybergeek2077 (Contributor, Author)

@SataQiu @archlitchi

SataQiu (Member) commented Oct 9, 2024

Did you verify it locally?

cybergeek2077 (Contributor, Author) commented Oct 10, 2024

> SGTM, Did you verify it locally?

Yes, I have verified it. My GPUMemoryFactor is 10. When vgpu-memory is not specified, nvidia-smi shows 10x the GPU memory. After the patch, it works correctly.

Reproduce: [screenshot]
Fix: [screenshot]

SataQiu (Member) commented Oct 10, 2024

Do the --gpu-memory-factor and --device-memory-scaling parameters mean the same thing? I think there is a difference: --device-memory-scaling enlarges the reported GPU memory to make use of virtual memory, while --gpu-memory-factor reduces the number of vGPU devices.

I think we already changed the number of reported devices by --gpu-memory-factor here:

for j := uint(0); j < GetGPUMemory()/gpuMemoryFactor; j++ {
    fakeID := GenerateVirtualDeviceID(id, j)
    virtualDevs = append(virtualDevs, &pluginapi.Device{
        ID:     fakeID,
        Health: pluginapi.Healthy,
    })
}

defer @archlitchi for more review
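
To spell out the distinction drawn in the comment above, here is a rough sketch with illustrative numbers; none of these variable names come from the HAMi code base.

package main

import "fmt"

func main() {
    totalMemMiB := 2004.0
    deviceMemoryScaling := 2.0 // --device-memory-scaling: oversubscribe GPU memory (virtual memory)
    gpuMemoryFactor := 2.0     // --gpu-memory-factor: coarsen the vGPU memory unit

    // --device-memory-scaling enlarges the memory the scheduler may hand out.
    scaledMemMiB := totalMemMiB * deviceMemoryScaling // 4008

    // --gpu-memory-factor shrinks the number of registered memory units
    // (and, as the loop above shows, the number of fake device IDs).
    registeredUnits := totalMemMiB / gpuMemoryFactor // 1002

    fmt.Println(scaledMemMiB, registeredUnits)
}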

cybergeek2077 (Contributor, Author) commented Oct 10, 2024

Sorry, I misunderstood the device-memory-scaling param in HAMi.

I thought the GPU memory adjusted by the following code was what the device plugin used for node capacity, node allocatable, and so on.

for j := uint(0); j < GetGPUMemory()/gpuMemoryFactor; j++ {
    fakeID := GenerateVirtualDeviceID(id, j)
    virtualDevs = append(virtualDevs, &pluginapi.Device{
        ID:     fakeID,
        Health: pluginapi.Healthy,
    })
}

However, the code changed in this PR registers the GPU memory with the Volcano scheduler through an annotation, which is what resource scheduling uses, including in the scenario where vgpu-memory is not specified.

So this bug could also impact scheduler resource calculation.
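
To make the concern concrete, here is a sketch of the kind of annotation-based registration being referred to. The comma-separated layout follows the volcano.sh/node-vgpu-register value quoted further down in this thread, but encodeNodeRegister, its signature, and the meaning of the trailing boolean are assumptions, not HAMi's actual encoder.

package main

import "fmt"

// Hypothetical encoder: the layout mimics "GPU-xxx,10,2004,NVIDIA-xxx,false:".
func encodeNodeRegister(uuid string, vgpuNumber, totalMemMiB, gpuMemoryFactor uint, model string, flag bool) string {
    // Dividing by the factor here is the point of the PR: the scheduler then
    // sees the same figure as the node's volcano.sh/vgpu-memory resource.
    return fmt.Sprintf("%s,%d,%d,%s,%t:", uuid, vgpuNumber, totalMemMiB/gpuMemoryFactor, model, flag)
}

func main() {
    fmt.Println(encodeNodeRegister("GPU-xxx", 10, 2004, 2, "NVIDIA-xxx", false))
    // "GPU-xxx,10,1002,NVIDIA-xxx,false:" rather than the unscaled 2004
}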

SataQiu (Member) commented Oct 14, 2024

I checked the current behavior.
I set --gpu-memory-factor=2, and the node in my dev environment has 2004 MiB of GPU memory.

# nvidia-smi
Mon Oct 14 20:46:23 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.23.05    Driver Version: 455.23.05    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 860M    On   | 00000000:01:00.0 Off |                  N/A |
| N/A   46C    P8    N/A /  N/A |     15MiB /  2004MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

The K8s Node's allocatable GPU memory resource is reported as 1002 (2004 / gpu-memory-factor = 2004 / 2 = 1002); this is expected.

    allocatable:
      cpu: "8"
      memory: 16112380Ki
      pods: "110"
      volcano.sh/vgpu-cores: "100"
      volcano.sh/vgpu-memory: "1002"
      volcano.sh/vgpu-number: "10"
    capacity:
      cpu: "8"
      memory: 16214780Ki
      pods: "110"
      volcano.sh/vgpu-cores: "100"
      volcano.sh/vgpu-memory: "1002"
      volcano.sh/vgpu-number: "10"

However, the volcano.sh/node-vgpu-register annotation of the Node reports 2004 (NOT 1002); it carries the real GPU memory of the host GPU device.

 volcano.sh/node-vgpu-register: 'GPU-xxx,10,2004,NVIDIA-xxx,false:'

We should keep it consistent, so I think this PR makes sense, thanks @cybergeek2077
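
As a quick sanity check of the figures above (2004 MiB and --gpu-memory-factor=2), treating the annotation value as something that should carry the factored figure; this is a sketch, not a test from the repository.

package main

import "fmt"

func main() {
    totalMemMiB := uint(2004)
    gpuMemoryFactor := uint(2)

    allocatable := totalMemMiB / gpuMemoryFactor          // 1002, matches volcano.sh/vgpu-memory above
    annotationBeforePatch := totalMemMiB                  // 2004, the unscaled value node-vgpu-register carried
    annotationAfterPatch := totalMemMiB / gpuMemoryFactor // 1002, consistent with the allocatable resource

    fmt.Println(allocatable == annotationAfterPatch, allocatable == annotationBeforePatch) // true false
}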

Signed-off-by: cybergeek2077 <zhaoshen@buaa.edu.cn>
archlitchi merged commit 86d7a26 into Project-HAMi:main on Oct 15, 2024
2 checks passed