
occasional hangs with 3.1.3 on Ubuntu 16.04 #6548

Closed
jscook2345 opened this issue Mar 30, 2019 · 5 comments

@jscook2345

jscook2345 commented Mar 30, 2019

Thank you for taking the time to submit an issue!

Background information

Open MPI version 3.1.3.

Describe how Open MPI was installed

From a source tarball downloaded from https://open-mpi.org

Please describe the system on which you are running

  • Operating system/version: Ubuntu 16.04
  • Computer hardware: Mac OS host, Vagrant VirtualBox VM
  • Network type: Ethernet

Details of the problem

Some of our users hit an 'intermittent hang' while executing mpirun on Ubuntu 16.04 with Open MPI 3.x. This happened both in a VM and in a container (Singularity).

If I switch to either Ubuntu 18.04 (keeping Open MPI 3.x) or Open MPI 4.x (keeping Ubuntu 16.04), things work as expected again.

I've dug into it a bit and can reproduce the problem as follows:

We're configuring like this:

./configure --prefix=/usr/local/openmpi --disable-getpwuid --enable-orterun-prefix-by-default --without-cuda

We are building a user's example like this:

curl -LO https://raw.githubusercontent.com/tomaslaz/handybox/master/MPI/Performance/point_to_point_mpi_send.c

/usr/local/openmpi/bin/mpicc -o point_to_point_mpi_send point_to_point_mpi_send.c

And I am running that like this:

/usr/local/openmpi/bin/mpirun -np 2 ./point_to_point_mpi_send

I expect the output to be similar to:

Kind            n       time (sec)      Rate (MB/sec)
Send/Recv       1       0.000000        25.223305
Send/Recv       2       0.000000        48.904837
Send/Recv       4       0.000000        99.975631
Send/Recv       8       0.000000        200.712529
Send/Recv       16      0.000000        362.217303
Send/Recv       32      0.000000        679.045087
Send/Recv       64      0.000000        1328.374984
Send/Recv       128     0.000000        2478.133012
Send/Recv       256     0.000001        3448.779057
Send/Recv       512     0.000001        3041.960447
Send/Recv       1024    0.000002        4432.900407
Send/Recv       2048    0.000003        6515.808284
Send/Recv       4096    0.000004        9256.497561
Send/Recv       8192    0.000006        11363.967504
Send/Recv       16384   0.000010        13134.124704
Send/Recv       32768   0.000018        14212.198410
Send/Recv       65536   0.000035        14833.231854
Send/Recv       131072  0.000069        15296.401890
Send/Recv       262144  0.000134        15631.898060
Send/Recv       524288  0.000276        15188.031504
Send/Recv       1048576 0.000822        10211.268340
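
For context, the benchmark is essentially a two-rank Send/Recv ping-pong that times round trips over increasing message sizes. The sketch below is my own simplified approximation of that idea, not the actual source:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, iters = 100;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        printf("Kind\t\tn\ttime (sec)\tRate (MB/sec)\n");

    for (int n = 1; n <= 1048576; n *= 2) {
        double *buf = calloc(n, sizeof(double));

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                /* rank 0: send the buffer, then wait for the echo */
                MPI_Send(buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                /* rank 1: receive the buffer, then echo it back */
                MPI_Recv(buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
            }
        }
        /* one-way time per message, and bandwidth in MB/sec */
        double t = (MPI_Wtime() - t0) / (2.0 * iters);
        if (rank == 0)
            printf("Send/Recv\t%d\t%f\t%f\n", n, t, n * sizeof(double) / t / 1.0e6);

        free(buf);
    }

    MPI_Finalize();
    return 0;
}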

And sometimes the example does finish. Sometimes it hangs. Always in a different spot.

I tried dumping stack traces using mpirun with --timeout, but they did not seem useful (they ended in fini). I can provide them if you think they would help, however.
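
For reference, those runs looked roughly like this (the exact timeout value is illustrative, and I'm assuming --get-stack-traces was passed alongside --timeout):

/usr/local/openmpi/bin/mpirun --timeout 240 --get-stack-traces -np 2 ./point_to_point_mpi_send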

I also tried doing some strace recording, but again it did not seem useful.
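
The recording was along these lines (the strace flags here are my reconstruction, based on the timestamps and per-call durations visible in the excerpt below):

strace -t -T /usr/local/openmpi/bin/mpirun --timeout 240 --get-stack-traces -np 2 ./point_to_point_mpi_send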

17:03:09 write(1, "Send/Recv\t256\t0.000001\t2974.5840"..., 35) = 35 <0.000170>
17:03:09 poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=23, events=POLLIN}, {fd=0, events=POLLIN}, {fd=28, events=POLLIN}, {fd=30, events=POLLIN}, {fd=32, events=POLLIN}, {fd=24, events=POLLIN}], 9, 239938) = 0 (Timeout) <240.035911>
17:07:09 open("/usr/local/openmpi/share/openmpi/help-orterun.txt", O_RDONLY) = 29 <0.000012>
17:07:09 ioctl(29, TCGETS, 0x7ffee16c3c50) = -1 ENOTTY (Inappropriate ioctl for device) <0.000004>
17:07:09 brk(0x2398000)                 = 0x2398000 <0.000006>
17:07:09 fstat(29, {st_mode=S_IFREG|0644, st_size=22942, ...}) = 0 <0.000005>
17:07:09 read(29, "# -*- text -*-\n#\n# Copyright (c)"..., 8192) = 8192 <0.000006>
17:07:09 read(29, "limits to 1,\nincreasing your lim"..., 8192) = 8192 <0.000006>
17:07:09 read(29, "ompi-server-pid-bad]\n%s was unab"..., 8192) = 6558 <0.000005>
17:07:09 read(29, "", 1024)             = 0 <0.000004>
17:07:09 close(29)                      = 0 <0.000008>
17:07:09 brk(0x2384000)                 = 0x2384000 <0.000008>
17:07:09 write(2, "--------------------------------"..., 432) = 432 <0.000501>
17:07:09 write(2, "Waiting for stack traces (this m"..., 58) = 58 <0.000010>

A number of containers use Ubuntu 16.04 as a base with Open MPI 3.x (one of our users gets theirs from the NVIDIA TensorFlow example), so I'd like to help you resolve this issue.

Let me know what else I can do to help. I can provide vagrantfiles, history, etc. Just let me know!

Thanks and have a great weekend!

Justin

@jscook2345 jscook2345 changed the title intermediate hang with 3.1.3 on Ubuntu 16.04 occasional hangs with 3.1.3 on Ubuntu 16.04 Mar 30, 2019
@cwilkes

cwilkes commented May 6, 2019

I'm seeing the same issue with openmpi-3.1.3 on Ubuntu 16.04. Sometimes the program finishes, sometimes it just hangs. When it does hang, strace just shows a lot of this:
poll([{fd=5, events=POLLIN}, {fd=15, events=POLLIN}], 2, 0) = 0 (Timeout)
I've included a good and a bad run, as recorded by:
strace -ff -o /tmp/out mpirun -n 2 IMB-MPI1 PingPong -msglog 10:11 -iter 10

I do not see this with version 3.1.4.

good.mpi.txt
bad.mpi.txt

@jsquyres
Member

Is this effectively a dup of #6568?

@gpaulsen
Member

gpaulsen commented Jun 3, 2019

@cwilkes If this is fixed by Open MPI v3.1.4, can we close this?

@cwilkes

cwilkes commented Jun 3, 2019

I would ask @jscook2345 as they opened the bug, but for me upgrading worked.

@jscook2345
Author

This appears to be fixed in v3.1.4. I'm not able to reproduce the hang.

Great work! Thanks so much for your help!
