SSH launch fails when host file has more than 64 hosts #6198
Comments
Something has borked the routed setup, as the default radix is 64. Either we aren't computing the routes or the table is wrong.
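For readers wondering why the threshold is exactly 64 hosts, here is a minimal sketch of a generic radix (k-ary) routing tree. It is not Open MPI's actual routed code: the breadth-first numbering, the function names (`parent`, `children`, `max_hops`), and the assumption of one daemon per host plus the launcher at rank 0 are all illustrative. Under those assumptions, every daemon is a direct child of the launcher up to 64 hosts, and the 65th host is the first one that has to be reached through an intermediate daemon, which is where a wrong route or table would first show up.

```python
def parent(rank: int, radix: int) -> int:
    """Parent of `rank` in a breadth-first-numbered tree; rank 0 is the root."""
    if rank == 0:
        return -1  # the root (launcher) has no parent
    return (rank - 1) // radix


def children(rank: int, radix: int, num_daemons: int) -> list[int]:
    """Direct children of `rank`, limited to daemons that actually exist."""
    first = rank * radix + 1
    return [c for c in range(first, first + radix) if c < num_daemons]


def hops_to_root(rank: int, radix: int) -> int:
    """Number of relays a message from `rank` needs to reach rank 0."""
    hops = 0
    while rank != 0:
        rank = parent(rank, radix)
        hops += 1
    return hops


def max_hops(num_hosts: int, radix: int = 64) -> int:
    """Deepest route in the tree (ASSUMPTION: one daemon per host + launcher)."""
    num_daemons = num_hosts + 1
    return max(hops_to_root(r, radix) for r in range(num_daemons))


if __name__ == "__main__":
    for hosts in (2, 64, 65):
        print(hosts, "hosts -> max hops to launcher:", max_hops(hosts))
```

Shrinking the radix in this sketch (for example `max_hops(3, radix=2)`) produces a multi-level tree with only a handful of nodes, which is the same idea as testing the routed path on a couple of nodes.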
BTW: easiest way to test with only a couple of nodes is to add
Yeah :(. "Humorously", I still have your email from 12/19/17 with instructions on configuring routed so we catch these issues in MTT / CI. Guess I should have acted on that.
Any progress on this? I should think it a blocker for the branches.
@rhc54 mentioned on the call today, that he may have a fix for this.
We're also running into this problem on AWS, but luckily there is an easy workaround (
@rhc54 is pretty sure that he fixed this on master. @mkre @bwbarrett @dfaraj, can you try a nightly snapshot from master and see if the problem is resolved? See https://www.open-mpi.org/nightly/master/ According to #6786 (comment), it looks like it is still broken on the v4.0 branch as of 29 June 2019. If it is indeed fixed on master, @rhc54 graciously said he'd try to track down the list of commits that fixed the issue so that we can port them to the v4.0.x branch.
@jsquyres, we'll test this and report back, but it may take us a couple of days.
Does anyone have an idea under which circumstances this issue appears? As I said, so far we haven't seen this issue on one of our InfiniBand clusters, but only on AWS. Could it be the case that Open MPI takes a different code path on those systems, or are we just lucky with the IB system?
@jsquyres, we have tested this and I can confirm that the hang is resolved with the nightly snapshot.
@jjhursey, sorry for the late answer. I can confirm that the issue is fixed in Open MPI 4.0.2, but still persists on 3.1.5.
Looks like the 3.1.5 issue is reported in Issue #7087 as well.
Removing Target: Master and Target: v4.0.x labels, as this issue is now fixed in those branches.
FYI @mwheinz may also be interested in this fix on v3.1.x
Do we know what change fixed this in the 4.0.x branch? If we knew that I could try to back-port it myself...
as stated above.
We're seeing launch failures when the host file has more than 64 hosts, which is resolved with the --mca routed direct MCA parameter. Platform was x86_64 Linux in EC2. Each instance has 2 cores (4 hyperthreads). Hostfile looked like: