What does "does not support GPU migration" mean specifically? #18
Hi, just wondering what "does not support GPU migration" means specifically? Exact same GPU on exact same host? Same GPU model? Etc.
Thanks,
Geoff
Comments
Right now on the restore end it's required that the GPU type and order be the same as on the checkpoint side, so you can migrate your CRIU checkpoint from machine to machine as long as the layouts are identical at the moment. For example, if you checkpoint on a 4-GPU A100 system, it's required to restore on a 4-GPU A100 system; attempting to restore on, say, an 8-GPU A100 system, or a 4-GPU system that contains other GPUs, will fail. Some of this will be relaxed in future releases.
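To illustrate the current constraint, a pre-restore sanity check could compare the target host's GPU inventory against what was recorded at checkpoint time. This is only a sketch, not project code; the checkpoint-side record below is a made-up example, and the layout is read with `nvidia-smi`:

```python
import subprocess

def gpu_layout():
    """List GPU model names in enumeration order using nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        check=True, capture_output=True, text=True,
    )
    return [line.strip() for line in out.stdout.splitlines() if line.strip()]

# Hypothetical record taken on the checkpoint side (a 4-GPU A100 system).
checkpoint_layout = ["NVIDIA A100-SXM4-80GB"] * 4

# Current releases require the same GPU type, count, and order on restore,
# so anything other than an exact match means the restore will fail.
if gpu_layout() != checkpoint_layout:
    raise SystemExit("GPU layout differs from the checkpoint host; restore is not supported")
```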
@jesus-ramos thanks for the speedy reply, it's appreciated.
Given this example, does "order" mean the number of GPUs then? How would/does this impact checkpointing containerized CUDA applications? Take the simple example of only one GPU being shared with the container namespace: does the number of physical GPUs need to be the same number and type regardless of the container? Does it matter if a container was checkpointed with device 0 and then restarted with device 1? Any details around this space would be greatly appreciated.
The next release should have fixed support for containers that use partial GPU passthrough; with the current release, restoring requires a workaround to expose all the GPUs to the container. With the fix, though, as long as you expose at least as many GPUs of the same model as you checkpointed with, it will work, and the order of devices in heterogeneous GPU setups won't matter either. I'll be posting an updated patch to CRIU as well once the next release is live, with more details.
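To make the relaxed rule concrete, here is a small illustrative check (my own sketch, not project code) of the "at least as many GPUs of each checkpointed model, order ignored" condition:

```python
from collections import Counter

def restore_allowed(checkpoint_gpus, exposed_gpus):
    """Relaxed rule: the restore container must see at least as many GPUs of
    each model as were present at checkpoint time; device order is ignored."""
    needed = Counter(checkpoint_gpus)
    available = Counter(exposed_gpus)
    return all(available[model] >= count for model, count in needed.items())

# Checkpointed with 2x A100: restoring with 4x A100 plus an unrelated GPU is fine,
# but restoring with a single A100 is not.
print(restore_allowed(["A100"] * 2, ["A100"] * 4 + ["T4"]))  # True
print(restore_allowed(["A100"] * 2, ["A100"]))               # False
```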
We don't have any plans for it at the moment.
@jesus-ramos I assume the driver version needs to be the same between checkpoint and restore?
@ayushr2 Would you be able to confirm whether you are asking about migration with CRIU, with gVisor, or whether restore from a checkpoint would work on the same system after upgrading the driver version? For example, it is possible to migrate CUDA processes across different nodes with CRIU, but CRIU requires all libraries to be the same on both source and destination nodes to correctly restore open file descriptors.
For CUDA, the driver version you restore to has to be the same.
I was asking about both gVisor and CRIU. gVisor will have constraints similar to CRIU's.
Good point, this is true for gVisor as well. To restore FDs to host files, gVisor requires the container filesystem to be the same. And since the user-mode driver (files like libcuda.so) is packaged with the kernel driver, upgrading the kernel driver can also change the container filesystem and hence violate the restore contract. Thanks, so the driver version needs to be the same.
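As a hedged illustration of both points (the recorded values and the libcuda.so path below are placeholders, not anything the projects themselves record), one could note the kernel driver version and a digest of the user-mode driver at checkpoint time and refuse to restore if either has changed:

```python
import hashlib
import subprocess

def driver_version():
    """Kernel driver version as reported by nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        check=True, capture_output=True, text=True,
    )
    return out.stdout.splitlines()[0].strip()

def file_digest(path):
    """SHA-256 of a file, used to detect changes to the user-mode driver."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def restore_is_safe(recorded_version, recorded_libcuda_digest,
                    libcuda_path="/usr/lib/x86_64-linux-gnu/libcuda.so.1"):
    """True if the driver version and the libcuda.so digest still match the
    values recorded at checkpoint time; the path and recorded values here
    are placeholders."""
    return (driver_version() == recorded_version
            and file_digest(libcuda_path) == recorded_libcuda_digest)
```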
Is the fix available now? What is the minimum driver version that ships with the fix?
The following page contains some information about these updates:
Hi @jesus-ramos, we've been experimenting with checkpointing CUDA workloads and ran into the partial GPU passthrough issue discussed above. Would you have any updates regarding that issue? TL;DR: we expose different combinations of GPUs to workloads, and restore fails unless all of the host's GPUs are exposed to the container.
@luiscape I was able to replicate the same error when using partial GPU passthrough with Podman and CRIU. The checkpoint/restore operations work as expected when all GPUs are exposed in the container.
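For reference, a minimal way to drive that checkpoint/restore flow with Podman looks roughly like the sketch below. The container name and archive path are made up, Podman invokes CRIU under the hood, and it assumes the `podman container checkpoint --export` / `podman container restore --import` interface:

```python
import subprocess

# Hypothetical container name and archive path.
container = "cuda-workload"
archive = "/tmp/cuda-workload.tar.gz"

# Checkpoint the running container to an archive (Podman drives CRIU underneath).
subprocess.run(
    ["podman", "container", "checkpoint", container, "--export", archive],
    check=True,
)

# Restore later; with the current release this only succeeds when the set of GPUs
# exposed to the container matches the checkpoint side.
subprocess.run(
    ["podman", "container", "restore", "--import", archive],
    check=True,
)
```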
@rst0git re exposing all GPUs: good suggestion, but we can't expose all GPUs inside a container because hosts are shared between different workloads/users. One workload shouldn't have access to the GPUs from a different workload. This wouldn't work for us, unfortunately.