occasional hangs with 3.1.3 on Ubuntu 16.04 #6548
Comments
I'm seeing the same issue with openmpi-3.1.3 on Ubuntu 16.04. Sometimes the program finishes, sometimes it just hangs. When it does hang, strace just shows a lot of this: I do not see this with version 3.1.4.
Is this effectively a dup of #6568?
@cwilkes If this is fixed by open-mpi v3.1.4, can we close this?
I would ask @jscook2345 as they opened the bug, but for me upgrading worked.
This appears to be fixed in v3.1.4. Great work! Thanks so much for your help!
Thank you for taking the time to submit an issue!
Background information
Open MPI version 3.1.3.
Describe how Open MPI was installed
From a source tarball downloaded from https://open-mpi.org.
Please describe the system on which you are running
Details of the problem
We had some users hit an intermittent hang while executing `mpirun` on Ubuntu 16.04 with Open MPI 3.x. This happened both in a VM and in a container (Singularity). If I switch to either Ubuntu 18.04 (keeping Open MPI 3.x) or to Open MPI 4.x (keeping Ubuntu 16.04), things work as expected again.
I've dug into this a bit and can reproduce the problem as follows:
We're configuring like this:
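(The original configure line didn't make it into this report; purely as a sketch, a default source-tarball build on Ubuntu 16.04 would look something like the following. The install prefix and make settings are assumptions, not the exact command we used.)

```sh
# Hypothetical Open MPI 3.1.3 build from the release tarball.
# Prefix and flags are assumptions, not taken from the original report.
tar xf openmpi-3.1.3.tar.gz
cd openmpi-3.1.3
./configure --prefix=/opt/openmpi-3.1.3
make -j"$(nproc)"
sudo make install
```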
We are building a user's example like this:
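(The user's source and exact build command are likewise not shown here; a generic build of an MPI bandwidth test would be something like the line below. The file name `mpi_bandwidth.c` is a placeholder, not the user's actual source.)

```sh
# Hypothetical compile step; the source file name is a placeholder.
mpicc -O2 -o mpi_bandwidth mpi_bandwidth.c
```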
And I am running that like this:
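(The exact mpirun invocation was not captured above; a plausible two-process local run, with the process count and binary name as assumptions, would be:)

```sh
# Hypothetical run command matching the Send/Recv output below.
mpirun -np 2 ./mpi_bandwidth
```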
I expect the output to be similar to:
Kind        n         time (sec)   Rate (MB/sec)
Send/Recv   1         0.000000     25.223305
Send/Recv   2         0.000000     48.904837
Send/Recv   4         0.000000     99.975631
Send/Recv   8         0.000000     200.712529
Send/Recv   16        0.000000     362.217303
Send/Recv   32        0.000000     679.045087
Send/Recv   64        0.000000     1328.374984
Send/Recv   128       0.000000     2478.133012
Send/Recv   256       0.000001     3448.779057
Send/Recv   512       0.000001     3041.960447
Send/Recv   1024      0.000002     4432.900407
Send/Recv   2048      0.000003     6515.808284
Send/Recv   4096      0.000004     9256.497561
Send/Recv   8192      0.000006     11363.967504
Send/Recv   16384     0.000010     13134.124704
Send/Recv   32768     0.000018     14212.198410
Send/Recv   65536     0.000035     14833.231854
Send/Recv   131072    0.000069     15296.401890
Send/Recv   262144    0.000134     15631.898060
Send/Recv   524288    0.000276     15188.031504
Send/Recv   1048576   0.000822     10211.268340
And sometimes the example does finish. Sometimes it hangs. Always in a different spot.
I tried dumping some stack traces using `mpirun` with `--timeout`, but they did not seem useful (everything ended in fini). I will provide them if you think they would be useful, however. I also tried recording an `strace`, but again it did not seem useful.
A number of containers use Ubuntu 16.04 as a base with Open MPI 3.x (one of our users is getting theirs from the NVidia TensorFlow example), so I'd like to help you resolve this issue.
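(A sketch of what such a timeout-based stack-trace run might look like; the timeout value is arbitrary, and `--get-stack-traces` is the flag I believe pairs with `--timeout` in Open MPI 3.x.)

```sh
# Hypothetical invocation for dumping stack traces on a hang; values are assumptions.
mpirun --timeout 60 --get-stack-traces -np 2 ./mpi_bandwidth
```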
Let me know what else I can do to help. I can provide vagrantfiles, history, etc. Just let me know!
Thanks and have a great weekend!
Justin