What does "does not support GPU migration" mean specifically? #18

Open
gflarity opened this issue Nov 12, 2024 · 14 comments

@gflarity

Hi, just wondering what "does not support GPU migration" means specifically? Exact same GPU on the exact same host? Same GPU model? Etc.

Thanks,
Geoff

@jesus-ramos
Collaborator

Right now, on the restore end, the GPU type and order must be the same as on the checkpoint side. So you can migrate your CRIU checkpoint from machine to machine as long as the layouts are identical at the moment. For example, if you checkpoint on a 4-GPU A100 system, you must restore on a 4-GPU A100 system; attempting to restore on, say, an 8-GPU A100 system, or on a 4-GPU system that contains other GPUs, will fail. Some of this will be relaxed in future releases.
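
A rough sketch (not part of the original reply) of the machine-to-machine flow described above, assuming identical 4-GPU A100 layouts on both nodes and that cuda-checkpoint is driven by hand alongside CRIU; the PID, paths, and hostname are placeholders:

# Suspend CUDA state into host memory, then checkpoint the whole process with CRIU
cuda-checkpoint --toggle --pid 1234
criu dump -t 1234 --shell-job -D /tmp/ckpt

# Copy the images to a node with an identical GPU layout, restore, and resume CUDA
rsync -a /tmp/ckpt/ restore-host:/tmp/ckpt/
ssh restore-host "criu restore -d --shell-job -D /tmp/ckpt && cuda-checkpoint --toggle --pid 1234"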

@viktoriaas

@jesus-ramos
Do you plan to support the following (or either of the two options) in future releases?

  1. restore on a GPU of the same architecture but a different model (e.g. A100->A40)
  2. restore on a GPU of an architecture different from the GPU where the checkpoint was performed (e.g. H100->A100)

Thanks,
Viktoria

@gflarity
Author

@jesus-ramos thanks for the speedy reply, it's appreciated.

> Right now, on the restore end, the GPU type and order must be the same as on the checkpoint side. So you can migrate your CRIU checkpoint from machine to machine as long as the layouts are identical at the moment. For example, if you checkpoint on a 4-GPU A100 system, you must restore on a 4-GPU A100 system; attempting to restore on, say, an 8-GPU A100 system, or on a 4-GPU system that contains other GPUs, will fail. Some of this will be relaxed in future releases.

Given this example, does "order" mean the number of GPUs then? How would/does this impact checkpointing containerized CUDA applications? Let's say the simple example of only one GPU being shared with the container namespace. Does the number of physical GPUs need to be the same number and type regardless of the container? Does it matter if a container was checkpointed with device 0 and then restarted with device 1? Any details around this space would be greatly appreciated.

@jesus-ramos
Collaborator

> @jesus-ramos thanks for the speedy reply, it's appreciated.
>
> Right now, on the restore end, the GPU type and order must be the same as on the checkpoint side. So you can migrate your CRIU checkpoint from machine to machine as long as the layouts are identical at the moment. For example, if you checkpoint on a 4-GPU A100 system, you must restore on a 4-GPU A100 system; attempting to restore on, say, an 8-GPU A100 system, or on a 4-GPU system that contains other GPUs, will fail. Some of this will be relaxed in future releases.
>
> Given this example, does "order" mean the number of GPUs then? How would/does this impact checkpointing containerized CUDA applications? Let's say the simple example of only one GPU being shared with the container namespace. Does the number of physical GPUs need to be the same number and type regardless of the container? Does it matter if a container was checkpointed with device 0 and then restarted with device 1? Any details around this space would be greatly appreciated.

The next release should have fixed support for containers that use partial GPU passthrough; with the current release, restoring requires a workaround that exposes all the GPUs to the container.

With the fix, though, it will work as long as you expose at least as many GPUs of the same model as you checkpointed with, and the order of devices in heterogeneous GPU setups won't matter either. I'll be posting an updated patch to CRIU as well once the next release is live, with more details.
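
As a quick sanity check for that rule, a minimal sketch (an illustration, not part of the comment above): list the GPU models visible on the checkpoint and restore sides and confirm the restore side exposes at least as many devices of the same model.

# List visible GPUs with their model names; run on both sides and compare
nvidia-smi -L
# Example output on a 2-GPU A100 node:
#   GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-...)
#   GPU 1: NVIDIA A100-SXM4-80GB (UUID: GPU-...)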

@jesus-ramos
Collaborator

> @jesus-ramos Do you plan to support the following (or either of the two options) in future releases?
>
>   1. restore on a GPU of the same architecture but a different model (e.g. A100->A40)
>   2. restore on a GPU of an architecture different from the GPU where the checkpoint was performed (e.g. H100->A100)
>
> Thanks, Viktoria

We don't have any plans for it at the moment.

@ayushr2

ayushr2 commented Nov 15, 2024

@jesus-ramos I assume the driver version needs to be the same between checkpoint and restore?

@rst0git

rst0git commented Nov 15, 2024

@ayushr2 Would you be able to confirm whether you are asking about migration with CRIU, with gVisor, or whether restore from a checkpoint would work on the same system after upgrading the driver version?

For example, it is possible to migrate CUDA processes across different nodes with CRIU, but CRIU requires all libraries to be the same on both source and destination nodes to correctly restore open file descriptors.

@jesus-ramos
Collaborator

For CUDA, the driver version you restore onto has to be the same as the one you checkpointed with.
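
As an illustration only (the hostname is a placeholder, not from this thread), a minimal shell sketch that compares the driver versions reported on the source and destination nodes before migrating:

# Abort early if the checkpoint and restore hosts report different driver versions
SRC_VER=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)
DST_VER=$(ssh restore-host nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)
if [ "$SRC_VER" != "$DST_VER" ]; then
    echo "driver mismatch: $SRC_VER vs $DST_VER" >&2
    exit 1
fi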

@ayushr2

ayushr2 commented Nov 15, 2024

> @ayushr2 Would you be able to confirm whether you are asking about migration with CRIU, with gVisor, or whether restore from a checkpoint would work on the same system after upgrading the driver version?

I was asking about both gVisor and CRIU; gVisor will have similar constraints to CRIU.

> CRIU requires all libraries to be the same on both source and destination nodes to correctly restore open file descriptors.

Good point, this is true for gVisor as well. To restore FDs to host files, gVisor requires the container filesystem to be the same. And since the user-mode driver (files like libcuda.so) is packaged with the kernel driver, upgrading the kernel driver can also change the container filesystem and hence violate the restore contract.

Thanks, so the driver version needs to be the same.
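
To illustrate the point about the user-mode and kernel-mode pieces shipping together, a minimal sketch (the library path is a Debian/Ubuntu-style assumption and varies by distro):

# Kernel-mode driver version, as loaded on the host
cat /proc/driver/nvidia/version
# User-mode driver libraries visible inside the container; the version suffix
# should match the kernel driver reported above
ls /usr/lib/x86_64-linux-gnu/libcuda.so.*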

@tkj666

tkj666 commented Jan 21, 2025

> @jesus-ramos thanks for the speedy reply, it's appreciated.
>
> Right now, on the restore end, the GPU type and order must be the same as on the checkpoint side. So you can migrate your CRIU checkpoint from machine to machine as long as the layouts are identical at the moment. For example, if you checkpoint on a 4-GPU A100 system, you must restore on a 4-GPU A100 system; attempting to restore on, say, an 8-GPU A100 system, or on a 4-GPU system that contains other GPUs, will fail. Some of this will be relaxed in future releases.
>
> Given this example, does "order" mean the number of GPUs then? How would/does this impact checkpointing containerized CUDA applications? Let's say the simple example of only one GPU being shared with the container namespace. Does the number of physical GPUs need to be the same number and type regardless of the container? Does it matter if a container was checkpointed with device 0 and then restarted with device 1? Any details around this space would be greatly appreciated.
>
> The next release should have fixed support for containers that use partial GPU passthrough; with the current release, restoring requires a workaround that exposes all the GPUs to the container.
>
> With the fix, though, it will work as long as you expose at least as many GPUs of the same model as you checkpointed with, and the order of devices in heterogeneous GPU setups won't matter either. I'll be posting an updated patch to CRIU as well once the next release is live, with more details.

Is the fix available now? What is the minimal driver version that ships with the fix?

@rst0git

rst0git commented Jan 25, 2025

> Is the fix available now? What is the minimal driver version that ships with the fix?

The following page contains some information about these updates:
https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html

NVML Updates:

  • Added checkpoint/restore functionality for userspace applications

Userspace Checkpoint and Restore:

  • Added cross-system process migration support to enable process restoration on a computer different from the one where it was checkpointed
  • Added new driver API for checkpoint/restore operations
  • Added batch CUDA asynchronous memory copy APIs (cuMemcpyBatchAsync and cuMemcpyBatch3DAsync) for variable-sized transfers between multiple source and destination buffers
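
The excerpt above doesn't name a specific minimum driver version, so as a rough local check (an illustration, not from the release notes; the nvidia-smi output later in this thread shows a 570-series driver reporting CUDA 12.8):

# Show the installed driver version and the CUDA version it reports
nvidia-smi --query-gpu=driver_version --format=csv,noheader
nvidia-smi | grep "CUDA Version"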

@luiscape

luiscape commented Feb 6, 2025

Hi @jesus-ramos, we've been experimenting with cuda-checkpoint at Modal and think it's great. We're facing the issue you mention here:

> The next release should have fixed support for containers that use partial GPU passthrough; with the current release, restoring requires a workaround that exposes all the GPUs to the container.

Would you have any updates regarding that issue? We use gVisor.

TL;DR: we expose different combinations of GPUs to workloads, and cuda-checkpoint fails with:

Error toggling CUDA in process ID 2: "OS call failed or operation not supported on this OS"

@rst0git

rst0git commented Feb 6, 2025

@luiscape I was able to replicate the same error when using partial GPU passthrough with Podman and CRIU:

sudo podman run -d \
        --runtime=runc \
        --device nvidia.com/gpu=0 \
        --security-opt=label=disable \
        --name cuda-counter \
        quay.io/radostin/cuda-counter \
        bash -c "/benchmark/main"

# nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.86.10              Driver Version: 570.86.10      CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-80GB          Off |   00000000:00:05.0 Off |                    0 |
| N/A   33C    P0             63W /  400W |     429MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          Off |   00000000:00:06.0 Off |                    0 |
| N/A   30C    P0             65W /  400W |       1MiB /  81920MiB |     23%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A           40968      C   /benchmark/main                         416MiB |
+-----------------------------------------------------------------------------------------+

# cuda-checkpoint --get-state --pid 40968
running

# cuda-checkpoint --action lock --pid 40968
# cuda-checkpoint --action checkpoint --pid 40968
Could not checkpoint on process ID 40968: "operation not permitted"

# cuda-checkpoint --action unlock --pid 40968
Could not unlock on process ID 40968: "OS call failed or operation not supported on this OS"

The checkpoint/restore operations work as expected when all GPUs are exposed in the container, and CUDA_VISIBLE_DEVICES is used to specify which GPU devices are visible to the workloads:

sudo podman run -d --rm \
        --runtime=runc \
        --device nvidia.com/gpu=all \
        --env "CUDA_VISIBLE_DEVICES=0" \
        --security-opt=label=disable \
        --name cuda-counter \
        quay.io/radostin/cuda-counter \
        bash -c "/benchmark/main"

sudo podman container checkpoint -l -e /tmp/checkpoint.tar
sudo podman container restore -i /tmp/checkpoint.tar
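
A small follow-up check (an illustration, not part of the original comment) to confirm the container came back after the restore:

# The restored container should be running again and still producing output
sudo podman ps --filter name=cuda-counter
sudo podman logs --latest | tail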

@luiscape

luiscape commented Feb 6, 2025

@rst0git re

> The checkpoint/restore operations work as expected when all GPUs are exposed in the container,

Good suggestion. But we can't expose all GPUs inside a container because hosts are shared between different workloads / users. One workload shouldn't have access to the GPUs from a different workload. This wouldn't work for us, unfortunately.
