SSH launch silently hangs with certain numbers of hosts in machine file #7087
Comments
I'd suggest first checking master and then working back to the release branches.
Fair enough. I'm still holding out a vague hope that it's some kind of configuration issue on this particular cluster, but I don't have access to another one of sufficient size to compare.
The problem does not occur with the 4.0.2 release.
The problem does not exist on master.
Closing this issue as a dup of #6198. If this is incorrect, please feel free to reopen and discuss.
Thank you for taking the time to submit an issue!
Background information
What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)
3.1.4, 3.1.x (tip of 3.1.x branch)
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
As part of OPA IFS install, or by manual build from source.
Please describe the system on which you are running
Details of the problem
3.1.4: The problem appears to occur for any number of hosts greater than 64.
3.1.x: If the machine file contains a particular number of hosts the job silently hangs during launch. Known bad numbers of hosts include 72 and 130. Known good values include 80 and 129 hosts.
The problem appears to be related to #6618, but unlike that issue, in this case the launch simply hangs, and the workarounds provided there (--mca routed_radix 1, --mca routed direct, etc.) do not resolve it. Completely disabling tree-based launching (-mca plm_rsh_no_tree_spawn 1) does resolve the issue. The problem may also differ somewhat between 3.1.4 and 3.1.x, and I am going to test whether it occurs in 4.0.x.
Sample command line:
[RHEL7.6 hds1fnb8261 20191011_0927 mpi_apps]# /usr/mpi/gcc/openmpi-3.1.4-hfi/bin/mpirun -np 80 -map-by node --allow-run-as-root --mca routed_radix 1 -machinefile /root/mpi_apps/mpi_hosts /bin/hostname
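For comparison, a minimal sketch of the workaround that the report says does resolve the hang: disabling tree-based launching entirely. The process count and hostfile path are taken from the sample command line above; the DRY_RUN guard is my own addition so the command can be inspected without access to a cluster of this size.

```shell
# Sketch of the reported workaround: disable tree-based rsh launch
# with -mca plm_rsh_no_tree_spawn 1. DRY_RUN (hypothetical, not part
# of Open MPI) prints the command instead of launching it.
DRY_RUN=${DRY_RUN:-1}
CMD="mpirun -np 80 -map-by node --allow-run-as-root \
-mca plm_rsh_no_tree_spawn 1 \
-machinefile /root/mpi_apps/mpi_hosts /bin/hostname"
if [ "$DRY_RUN" = "1" ]; then
  echo "$CMD"    # show the command for inspection
else
  eval "$CMD"    # perform the real launch on a cluster
fi
```

This trades launch scalability for reliability: mpirun then starts every remote daemon directly instead of fanning out through intermediate nodes, which is why it sidesteps whatever is going wrong in the tree spawn.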
Running verbose with 3.1.x, the last output is: