cuda-checkpoint toggle program does not respond when restoring #17

Open
ToviHe opened this issue Nov 1, 2024 · 4 comments

Comments

@ToviHe

ToviHe commented Nov 1, 2024

I use the image provided by the Llama-factory framework to run the codegeex4-all-9b model. The command is as follows:

docker run --gpus '"device=3"' --ipc=host --ulimit memlock=-1 -itd -p 37860:7860 -p 38000:8000 -v /data/model:/data/model llama-factory:20240710 bash

Once the container has started, I enter it and start the model API service. The command is as follows:

llamafactory-cli api --model_name_or_path /data/model/codegeex4-all-9b --template codegeex4

The model service starts successfully and serves external requests normally.

With everything ready, I use cuda-checkpoint to try to suspend and resume the GPU process.
The command is as follows (run on the host, not inside the container):

./cuda-checkpoint --toggle --pid 93264
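(For reference: since cuda-checkpoint is run from the host, the PID passed to it has to be the host-namespace PID of the in-container CUDA process. One way to look that up, assuming Docker and a placeholder container name my_container, is:

docker top my_container

which lists the container's processes together with their host-side PIDs.)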

The command executed successfully, and nvidia-smi showed that no processes remained on card 3.
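(One way to double-check this from the host, assuming the standard nvidia-smi query options, is something like:

nvidia-smi -i 3 --query-compute-apps=pid,process_name,used_memory --format=csv

which should list no compute processes on card 3 while the process is checkpointed.)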
Then I tried to restore the process with the following command:

./cuda-checkpoint --toggle --pid 93264

The command blocked and never returned.
[Screenshot: the blocked cuda-checkpoint restore command]

At this point, the process log inside the container is as follows:
[Screenshot: container process log]

At this point, the process information on the host is as follows:
[Screenshot: host process list]

Can you help me figure out what is causing this? What do I need to do for the restore to succeed?

@jesus-ramos
Collaborator

I'll see about trying to repro this to see what could be going wrong.

A couple of things to try: make sure cuda-checkpoint is run as root, try passing all devices through Docker instead of a subset, and, if possible, try an r555 driver. There are some bugs with partial device passthrough, so for r555 you will have to pass all devices through to the container or restores will fail.
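For reference, a sketch of the reporter's docker run line with all devices passed through instead of a subset (image tag, ports, and mounts kept as in the report) would look something like:

docker run --gpus all --ipc=host --ulimit memlock=-1 -itd -p 37860:7860 -p 38000:8000 -v /data/model:/data/model llama-factory:20240710 bash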

@ToviHe
Author

ToviHe commented Nov 4, 2024

> I'll see about trying to repro this to see what could be going wrong.
>
> A couple of things to try: make sure cuda-checkpoint is run as root, try passing all devices through Docker instead of a subset, and, if possible, try an r555 driver. There are some bugs with partial device passthrough, so for r555 you will have to pass all devices through to the container or restores will fail.

Thank you for your reply.
cuda-checkpoint is already being run as root here.
How exactly do I 'pass all devices through docker instead of a subset', and how do I get the r555 driver?

@jesus-ramos
Collaborator

You can use the "--gpus all" flag instead. For the latest driver version, you can either use your distribution's package manager to get the latest available or download it directly from the NVIDIA website.
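For example, the currently installed driver version can be checked with:

nvidia-smi --query-gpu=driver_version --format=csv,noheader

and on Ubuntu-style systems an upgrade might look something like the following (the package name is an assumption here and varies by distribution):

sudo apt install nvidia-driver-555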

@ToviHe
Author

ToviHe commented Nov 6, 2024

> You can use the "--gpus all" flag instead. For the latest driver version, you can either use your distribution's package manager to get the latest available or download it directly from the NVIDIA website.

We are currently assigned only one GPU card on this server, so we can only specify device=3. As for the driver version, other services are running on the server, so for now it can only be upgraded to version 550. Is there any other way?
