Open MPI 2.1.0: MPI_Finalize hangs because cuIpcCloseMemHandle fails #3244
Comments
@sjeaugey This appears to be stuck in a CUDA call. Can you look into this?
Indeed, looks like a problem on our part. I'll look into this.
Maybe it's not even stuck. For some reason, there is a sleep(20) (!) for every IpcCloseMemHandle that fails. So if we're emptying our cache, there may be a lot of handles to close, and it can take a really long time! I noticed yesterday that my MTT run was not finished in the morning, so I had to kill it but didn't have time to look into it. Same today. Maybe that's the reason. I'm currently testing a fix that ignores CUDA_DEINITIALIZED return codes and, most importantly, removes the sleep(20).
I couldn't manage to reproduce the bug so far. I tested this patch:
which compiles and works, but since I can't reproduce the bug, I can't confirm for sure that it fixes the problem. @Evgueni-Petrov-aka-espetrov, can you give it a try?
Thanks for the fix, @sjeaugey!
Thanks for sharing the result. @jsquyres, is it OK for me to push that patch to master directly (since Evgueni confirmed it fixed the issue)?
Why not just put it in a branch and submit a PR like normal? It would allow the CI to ensure nothing broke outside of this environment.
Sure -- just takes more time. I'll submit a PR.
@sjeaugey I still have this problem when running the Horovod benchmark (an MPI framework for TensorFlow). I tried both Open MPI 2.1.1 and the latest 3.0.1, and the problem is still there. My CUDA version is 9.0.176 and the GPU driver is 387.26. The following are the last few lines of my output for Open MPI 3.0.1:
@hfutxrg This is expected, since the patch above has not been merged into 3.0.x, only into 3.1.x. Thanks!
Hi Open MPI,
Thank you very much for fixing #3042!
We want to switch from version 2.0.2 to 2.1.0, which contains the fix, but if we do, our application starts hanging in MPI_Finalize.
From our point of view, this behavior is a regression in version 2.1.0 with respect to version 2.0.2.
First, MPI_Finalize warns that cuIpcCloseMemHandle failed with the return value of 4 (CUDA_DEINITIALIZED), and then it prints the following messages in a loop:
...
Gdb shows the following stack:
I am not sure, but I would say that MPI_Finalize tries to close a remote memory handle after the remote MPI process has unloaded libcuda.so.
Perhaps getting CUDA_DEINITIALIZED from cuIpcCloseMemHandle is OK?
Our CUDA version is 7.5, CUDA driver version is 361.93.02.
Evgueni.