Inquiry about the docker checkpoint function for AMD GPU #2160
I guess nobody tried it so far. So we don't really know. It would be interesting to see the errors you get. |
@GYDmedwin the plugin for AMD GPUs was released in version 3.17. However, it is currently not enabled by default in CRIU packages because it requires hardware specific dependencies. Thus, you might need to build and install CRIU from source to use this functionality.
There is an example Docker container with PyTorch that can be used for testing: |
@adrianreber Thank you for your reply! The command I use is: docker checkpoint create test checkpoint1 The following is the content of the criu-dump.log file: (00.000000) Unable to get $HOME directory, local configuration file will not be used.
/proc/self/fd/19/docker/088a3512b65d3467fc64ae7c66ee3cdd76b4eb7f90ad4d08606444d5faddaedc/tasks |
That line:
claims that the plugin cannot be found. Not sure why. @fxkamd can you help here? |
Thank you, I will try! |
@adrianreber On my machine, the current CRIU fails to run the AMD plugin. The test fails whether I use Docker or not. But things work when I get CRIU from here: The AMD plugin works even with Docker's checkpoint function. So I suspect that a recent change in CRIU is affecting the proper functioning of the AMD plugin. I hope you can also test whether my results are correct. Thank you very much. |
How are you installing CRIU? That is more or less the same as here, just older. |
I install CRIU from source. The command I use is:
|
That sounds correct if you have the corresponding libraries installed. If you say that https://github.com/RadeonOpenCompute/criu works, it could be that the AMD GPU support is broken in CRIU. We cannot test it as we do not have access to AMD GPUs in CI. |
Oh, then I hope that the AMD developers can see this problem and carry out further verification. Thanks a lot for your response, adrianreber! Problem solved, so I have closed this issue. |
https://github.com/RadeonOpenCompute/criu is probably not up to date. We have not touched this repository since upstreaming amdgpu CRIU support. I can't tell what's going wrong just by looking at the error message. Assuming it found the plugin, I'd expect more diagnostic messages. Our plugin is sprinkled with lots of pr_error, pr_info and pr_debug messages. |
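To check whether the plugin produced any diagnostics at all, one option is to filter criu-dump.log for plugin-related messages. The following is only a rough sketch, not part of CRIU: it assumes each log line starts with a "(seconds.microseconds)" timestamp, as in the excerpt quoted above, and that plugin messages mention "plugin" or "amdgpu" somewhere in the text. The function name plugin_lines and the keyword list are hypothetical and may need adjusting to the real log.

```python
import re

# Assumed line shape, based on the excerpt above: "(00.000000) message text"
LINE_RE = re.compile(r"^\((\d+\.\d+)\)\s+(.*)$")


def plugin_lines(log_text, keywords=("plugin", "amdgpu")):
    """Return (timestamp, message) pairs for log lines mentioning any keyword.

    The keywords are a guess at what plugin diagnostics contain;
    they are not taken from the CRIU sources.
    """
    hits = []
    for raw in log_text.splitlines():
        match = LINE_RE.match(raw.strip())
        if match is None:
            continue  # skip lines without the assumed timestamp prefix
        message = match.group(2)
        if any(word in message.lower() for word in keywords):
            hits.append((float(match.group(1)), message))
    return hits


# The first line is from the log quoted above; the second is a made-up
# placeholder used only to exercise the filter.
SAMPLE = (
    "(00.000000) Unable to get $HOME directory, "
    "local configuration file will not be used.\n"
    "(00.000123) hypothetical plugin message\n"
)

for stamp, message in plugin_lines(SAMPLE):
    print(stamp, message)
```

If the filter returns nothing for a real log, that would be consistent with the plugin never having been loaded, although it could also just mean the keywords do not match what the plugin prints.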
Hello, I see that a plugin for AMD GPUs is already available. There is no problem when I use it directly on a process that uses the GPU; it works fine.
But when I try to use the docker checkpoint function on a Docker container that uses the AMD GPU, it fails.
Therefore, I wonder if the dump function for AMD GPUs can only be used outside of Docker? Or is there any other way to dump a Docker container that uses an AMD GPU? Thank you!