cuda-checkpoint toggle program does not respond when restoring #17

Open
ToviHe opened this issue Nov 1, 2024 · 4 comments

Comments

@ToviHe

ToviHe commented Nov 1, 2024

I use the image provided by the Llama-factory framework to run the codegeex4-all-9b model. The command is as follows:

docker run --gpus '"device=3"' --ipc=host --ulimit memlock=-1 -itd -p 37860:7860 -p 38000:8000 -v /data/model:/data/model llama-factory:20240710 bash

Once the container has started, I enter it and start the model API service. The command is as follows:

llamafactory-cli api --model_name_or_path /data/model/codegeex4-all-9b --template codegeex4

The model service starts successfully and serves external requests normally.

With everything ready, I use cuda-checkpoint to try to suspend and resume the GPU process.
The command is as follows (run on the host, not inside the container):

./cuda-checkpoint --toggle --pid 93264
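(For reference: since cuda-checkpoint is run from the host, the PID passed to it has to be the host-namespace PID of the in-container CUDA process. One way to look that up, assuming Docker and a placeholder container name my_container, is:

docker top my_container

which lists the container's processes together with their host-side PIDs.)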

The command executed successfully, and nvidia-smi showed that no processes remained on card 3.
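(One way to double-check this from the host, assuming the standard nvidia-smi query options, is something like:

nvidia-smi -i 3 --query-compute-apps=pid,process_name,used_memory --format=csv

which should list no compute processes on card 3 while the process is checkpointed.)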
Then I tried to restore the process with the following command:

./cuda-checkpoint --toggle --pid 93264

The command blocked and never returned.
[Screenshot: the blocked cuda-checkpoint restore command]

At this point, the process log inside the container is as follows:
[Screenshot: container process log]

At this point, the process information on the host is as follows:
[Screenshot: host process list]

Can you help me figure out what is causing this? What do I need to do for the restore to succeed?

@jesus-ramos
Collaborator

I'll see about trying to repro this to see what could be going wrong.

A couple of things to try: make sure cuda-checkpoint is run as root, try passing all devices through Docker instead of a subset, and, if possible, try an r555 driver. There are some bugs with partial device passthrough, so for r555 you will have to pass all devices through to the container or restores will fail.
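For reference, a sketch of the reporter's docker run line with all devices passed through instead of a subset (image tag, ports, and mounts kept as in the report) would look something like:

docker run --gpus all --ipc=host --ulimit memlock=-1 -itd -p 37860:7860 -p 38000:8000 -v /data/model:/data/model llama-factory:20240710 bash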

@ToviHe
Author

ToviHe commented Nov 4, 2024

> I'll see about trying to repro this to see what could be going wrong.
>
> A couple of things to try: make sure cuda-checkpoint is run as root, try passing all devices through Docker instead of a subset, and, if possible, try an r555 driver. There are some bugs with partial device passthrough, so for r555 you will have to pass all devices through to the container or restores will fail.

Thank you for your reply.
cuda-checkpoint is already being run as root here.
How exactly do I 'pass all devices through docker instead of a subset', and how do I get the r555 driver?

@jesus-ramos
Collaborator

You can use the "--gpus all" flag instead. For the latest driver version, you can either use your distribution's package manager to get the latest available or download it directly from the NVIDIA website.
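For example, the currently installed driver version can be checked with:

nvidia-smi --query-gpu=driver_version --format=csv,noheader

and on Ubuntu-style systems an upgrade might look something like the following (the package name is an assumption here and varies by distribution):

sudo apt install nvidia-driver-555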

@ToviHe
Author

ToviHe commented Nov 6, 2024

> You can use the "--gpus all" flag instead. For the latest driver version, you can either use your distribution's package manager to get the latest available or download it directly from the NVIDIA website.

We are currently assigned only one GPU card on this server, so we can only specify device=3. As for the driver version, other services are running on the server, so for now it can only be upgraded to version 550. Is there any other way?
