What does "does not support GPU migration" mean specifically? #18

Open
gflarity opened this issue Nov 12, 2024 · 14 comments

@gflarity

Hi, just wondering what "does not support GPU migration" means specifically? Exact same GPU on the exact same host? Same GPU model? Etc.

Thanks,
Geoff

@jesus-ramos
Collaborator

Right now, on the restore end, the GPU type and order must be the same as on the checkpoint side. So you can migrate your CRIU checkpoint from machine to machine as long as the layouts are identical at the moment. For example, if you checkpoint on a 4-GPU A100 system, you must restore on a 4-GPU A100 system; attempting to restore on, say, an 8-GPU A100 system, or on a 4-GPU system that contains other GPUs, will fail. Some of this will be relaxed in future releases.
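
A rough sketch (not part of the original reply) of the machine-to-machine flow described above, assuming identical 4-GPU A100 layouts on both nodes and that cuda-checkpoint is driven by hand alongside CRIU; the PID, paths, and hostname are placeholders:

# Suspend CUDA state into host memory, then checkpoint the whole process with CRIU
cuda-checkpoint --toggle --pid 1234
criu dump -t 1234 --shell-job -D /tmp/ckpt

# Copy the images to a node with an identical GPU layout, restore, and resume CUDA
rsync -a /tmp/ckpt/ restore-host:/tmp/ckpt/
ssh restore-host "criu restore -d --shell-job -D /tmp/ckpt && cuda-checkpoint --toggle --pid 1234"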

@viktoriaas

@jesus-ramos
Do you plan to support the following (or either of the two options) in future releases?

  1. restore on a GPU of the same architecture but a different model (e.g. A100->A40)
  2. restore on a GPU of an architecture different from the GPU where the checkpoint was performed (e.g. H100->A100)

Thanks,
Viktoria

@gflarity
Author

@jesus-ramos thanks for the speedy reply, it's appreciated.

> Right now, on the restore end, the GPU type and order must be the same as on the checkpoint side. So you can migrate your CRIU checkpoint from machine to machine as long as the layouts are identical at the moment. For example, if you checkpoint on a 4-GPU A100 system, you must restore on a 4-GPU A100 system; attempting to restore on, say, an 8-GPU A100 system, or on a 4-GPU system that contains other GPUs, will fail. Some of this will be relaxed in future releases.

Given this example, does "order" mean the number of GPUs then? How would/does this impact checkpointing containerized CUDA applications? Let's say the simple example of only one GPU being shared with the container namespace. Does the number of physical GPUs need to be the same number and type regardless of the container? Does it matter if a container was checkpointed with device 0 and then restarted with device 1? Any details around this space would be greatly appreciated.

@jesus-ramos
Collaborator

> @jesus-ramos thanks for the speedy reply, it's appreciated.
>
> Right now, on the restore end, the GPU type and order must be the same as on the checkpoint side. So you can migrate your CRIU checkpoint from machine to machine as long as the layouts are identical at the moment. For example, if you checkpoint on a 4-GPU A100 system, you must restore on a 4-GPU A100 system; attempting to restore on, say, an 8-GPU A100 system, or on a 4-GPU system that contains other GPUs, will fail. Some of this will be relaxed in future releases.
>
> Given this example, does "order" mean the number of GPUs then? How would/does this impact checkpointing containerized CUDA applications? Let's say the simple example of only one GPU being shared with the container namespace. Does the number of physical GPUs need to be the same number and type regardless of the container? Does it matter if a container was checkpointed with device 0 and then restarted with device 1? Any details around this space would be greatly appreciated.

The next release should have fixed support for containers that use partial GPU passthrough; with the current release, restoring requires a workaround that exposes all the GPUs to the container.

With the fix, though, it will work as long as you expose at least as many GPUs of the same model as you checkpointed with, and the order of devices in heterogeneous GPU setups won't matter either. I'll be posting an updated patch to CRIU as well once the next release is live, with more details.
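
As a quick sanity check for that rule, a minimal sketch (an illustration, not part of the comment above): list the GPU models visible on the checkpoint and restore sides and confirm the restore side exposes at least as many devices of the same model.

# List visible GPUs with their model names; run on both sides and compare
nvidia-smi -L
# Example output on a 2-GPU A100 node:
#   GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-...)
#   GPU 1: NVIDIA A100-SXM4-80GB (UUID: GPU-...)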

@jesus-ramos
Collaborator

> @jesus-ramos Do you plan to support the following (or either of the two options) in future releases?
>
>   1. restore on a GPU of the same architecture but a different model (e.g. A100->A40)
>   2. restore on a GPU of an architecture different from the GPU where the checkpoint was performed (e.g. H100->A100)
>
> Thanks, Viktoria

We don't have any plans for it at the moment.

@ayushr2

ayushr2 commented Nov 15, 2024

@jesus-ramos I assume the driver version needs to be the same between checkpoint and restore?

@rst0git

rst0git commented Nov 15, 2024

@ayushr2 Would you be able to confirm whether you are asking about migration with CRIU, with gVisor, or whether restore from a checkpoint would work on the same system after upgrading the driver version?

For example, it is possible to migrate CUDA processes across different nodes with CRIU, but CRIU requires all libraries to be the same on both source and destination nodes to correctly restore open file descriptors.

@jesus-ramos
Collaborator

For CUDA, the driver version you restore onto has to be the same as the one you checkpointed with.
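
As an illustration only (the hostname is a placeholder, not from this thread), a minimal shell sketch that compares the driver versions reported on the source and destination nodes before migrating:

# Abort early if the checkpoint and restore hosts report different driver versions
SRC_VER=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)
DST_VER=$(ssh restore-host nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)
if [ "$SRC_VER" != "$DST_VER" ]; then
    echo "driver mismatch: $SRC_VER vs $DST_VER" >&2
    exit 1
fi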

@ayushr2

ayushr2 commented Nov 15, 2024

> @ayushr2 Would you be able to confirm whether you are asking about migration with CRIU, with gVisor, or whether restore from a checkpoint would work on the same system after upgrading the driver version?

I was asking about both gVisor and CRIU; gVisor will have similar constraints to CRIU.

> CRIU requires all libraries to be the same on both source and destination nodes to correctly restore open file descriptors.

Good point, this is true for gVisor as well. To restore FDs to host files, gVisor requires the container filesystem to be the same. And since the user-mode driver (files like libcuda.so) is packaged with the kernel driver, upgrading the kernel driver can also change the container filesystem and hence violate the restore contract.

Thanks, so the driver version needs to be the same.
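
To illustrate the point about the user-mode and kernel-mode pieces shipping together, a minimal sketch (the library path is a Debian/Ubuntu-style assumption and varies by distro):

# Kernel-mode driver version, as loaded on the host
cat /proc/driver/nvidia/version
# User-mode driver libraries visible inside the container; the version suffix
# should match the kernel driver reported above
ls /usr/lib/x86_64-linux-gnu/libcuda.so.*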

@tkj666

tkj666 commented Jan 21, 2025

> @jesus-ramos thanks for the speedy reply, it's appreciated.
>
> Right now, on the restore end, the GPU type and order must be the same as on the checkpoint side. So you can migrate your CRIU checkpoint from machine to machine as long as the layouts are identical at the moment. For example, if you checkpoint on a 4-GPU A100 system, you must restore on a 4-GPU A100 system; attempting to restore on, say, an 8-GPU A100 system, or on a 4-GPU system that contains other GPUs, will fail. Some of this will be relaxed in future releases.
>
> Given this example, does "order" mean the number of GPUs then? How would/does this impact checkpointing containerized CUDA applications? Let's say the simple example of only one GPU being shared with the container namespace. Does the number of physical GPUs need to be the same number and type regardless of the container? Does it matter if a container was checkpointed with device 0 and then restarted with device 1? Any details around this space would be greatly appreciated.
>
> The next release should have fixed support for containers that use partial GPU passthrough; with the current release, restoring requires a workaround that exposes all the GPUs to the container.
>
> With the fix, though, it will work as long as you expose at least as many GPUs of the same model as you checkpointed with, and the order of devices in heterogeneous GPU setups won't matter either. I'll be posting an updated patch to CRIU as well once the next release is live, with more details.

Is the fix available now? What is the minimal driver version that ships with the fix?

@rst0git

rst0git commented Jan 25, 2025

> Is the fix available now? What is the minimal driver version that ships with the fix?

The following page contains some information about these updates:
https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html

NVML Updates:

  • Added checkpoint/restore functionality for userspace applications

Userspace Checkpoint and Restore:

  • Added cross-system process migration support to enable process restoration on a computer different from the one where it was checkpointed
  • Added new driver API for checkpoint/restore operations
  • Added batch CUDA asynchronous memory copy APIs (cuMemcpyBatchAsync and cuMemcpyBatch3DAsync) for variable-sized transfers between multiple source and destination buffers
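
The excerpt above doesn't name a specific minimum driver version, so as a rough local check (an illustration, not from the release notes; the nvidia-smi output later in this thread shows a 570-series driver reporting CUDA 12.8):

# Show the installed driver version and the CUDA version it reports
nvidia-smi --query-gpu=driver_version --format=csv,noheader
nvidia-smi | grep "CUDA Version"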

@luiscape

luiscape commented Feb 6, 2025

Hi @jesus-ramos, we've been experimenting with cuda-checkpoint at Modal and think it's great. We're facing the issue you mention here:

> The next release should have fixed support for containers that use partial GPU passthrough; with the current release, restoring requires a workaround that exposes all the GPUs to the container.

Would you have any updates regarding that issue? We use gVisor.

TL;DR: we expose different combinations of GPUs to workloads, and cuda-checkpoint fails with:

Error toggling CUDA in process ID 2: "OS call failed or operation not supported on this OS"

@rst0git

rst0git commented Feb 6, 2025

@luiscape I was able to replicate the same error when using partial GPU passthrough with Podman and CRIU:

sudo podman run -d \
        --runtime=runc \
        --device nvidia.com/gpu=0 \
        --security-opt=label=disable \
        --name cuda-counter \
        quay.io/radostin/cuda-counter \
        bash -c "/benchmark/main"

# nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.86.10              Driver Version: 570.86.10      CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-80GB          Off |   00000000:00:05.0 Off |                    0 |
| N/A   33C    P0             63W /  400W |     429MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          Off |   00000000:00:06.0 Off |                    0 |
| N/A   30C    P0             65W /  400W |       1MiB /  81920MiB |     23%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A           40968      C   /benchmark/main                         416MiB |
+-----------------------------------------------------------------------------------------+

# cuda-checkpoint --get-state --pid 40968
running

# cuda-checkpoint --action lock --pid 40968
# cuda-checkpoint --action checkpoint --pid 40968
Could not checkpoint on process ID 40968: "operation not permitted"

# cuda-checkpoint --action unlock --pid 40968
Could not unlock on process ID 40968: "OS call failed or operation not supported on this OS"

The checkpoint/restore operations work as expected when all GPUs are exposed in the container, and CUDA_VISIBLE_DEVICES is used to specify which GPU devices are visible to the workloads:

sudo podman run -d --rm \
        --runtime=runc \
        --device nvidia.com/gpu=all \
        --env "CUDA_VISIBLE_DEVICES=0" \
        --security-opt=label=disable \
        --name cuda-counter \
        quay.io/radostin/cuda-counter \
        bash -c "/benchmark/main"

sudo podman container checkpoint -l -e /tmp/checkpoint.tar
sudo podman container restore -i /tmp/checkpoint.tar
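
A small follow-up check (an illustration, not part of the original comment) to confirm the container came back after the restore:

# The restored container should be running again and still producing output
sudo podman ps --filter name=cuda-counter
sudo podman logs --latest | tail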

@luiscape

luiscape commented Feb 6, 2025

@rst0git re

> The checkpoint/restore operations work as expected when all GPUs are exposed in the container,

Good suggestion. But we can't expose all GPUs inside a container because hosts are shared between different workloads / users. One workload shouldn't have access to the GPUs from a different workload. This wouldn't work for us, unfortunately.
