[btl_openib_component.c:3644:handle_wc REMOTE ACCESS ERROR #5047
@hjelmn Could you have a look?
This is a known issue in the v2.x series that I thought we had fixed. Please try v3.0.1.
Hmm. Or you already did. Odd.
Try a nightly master tarball.
I think I can actually test this again. We have an mlx5 system now.
Definitely fixed on master.
What interconnect are you using? I cannot reproduce this failure on an mlx5 system:
Verified it is using btl/openib and osc/rdma.
Never mind. I think I got a related failure with -n 3. Looking into it.
Nope, that failure was just caused by a missing commit from master. I can't get this error with any configuration on this system. I will try others.
It is a Mellanox ConnectX-3 system (the 1st system below), though I've tested the nightly master tarball on 4 systems and get the same issue, with some extra errors on the ConnectX-4 systems. The 1st and 2nd systems are our clusters; the 3rd and 4th systems are external.
On the 2nd system we get the error:
On the 3rd system:
On the 4th system:
Interesting. I will continue trying to reproduce this problem. Maybe the issue happens when there are 3 nodes (I could only get 2 the other day).
I needed at least 3 nodes, with 4, 4 and 2 ranks per node, though 4, 3 and 3 per node also reproduced it. There are other combinations at higher node/rank counts that produce this as well (see the sketch below).
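A minimal sketch of forcing that 4/4/2 layout, assuming the Open MPI 3.x --host slot syntax and hypothetical node names (the benchmark is the reproducer from the original report further down):
$ mpirun --host node01:4,node02:4,node03:2 -np 10 ./IMB-RMA All_get_all -npmin 10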
I reproducibly get the same error running the one-sided tests of the mpi_test_suite on our ConnectX-3 system with two nodes and 40 processes.
This issue is likely about to become moot:
There's been no activity on this for months, so I'm going to mark this as wontfix. The way forward is to use the UCX PML.
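For anyone hitting this, selecting the UCX PML on the reproducer from the report below would look roughly like this; it assumes a UCX-enabled build, and the additional osc selection is only available on newer releases that ship the UCX one-sided component:
$ mpirun --mca pml ucx -np 10 -npernode 4 ./IMB-RMA All_get_all -npmin 10
$ mpirun --mca pml ucx --mca osc ucx -np 10 -npernode 4 ./IMB-RMA All_get_all -npmin 10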
Since we installed 2.1.3 we have been seeing errors with one-sided calls (MPI_Get in this case) when the number of ranks on each node is not the same (though not in all configurations). I can reproduce the issue with the Intel MPI Benchmarks:
$ mpirun -np 10 -npernode 4 ./IMB-RMA All_get_all -npmin 10
I've also tested against 3.0.1, 3.1.0rc3, and the master branch, and we still see the same issue. 1.8.8 works OK.
It runs if I set -mca osc ^rdma, or if I disable fetching atomics with -mca btl_openib_flags send,put,get,need-ack,need-csum,hetero-rdma (example invocations below).
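For illustration, applying those two workarounds to the reproducer above would look roughly like this; the flag values are the ones quoted in this report, and the rest of the command line is unchanged:
$ mpirun -np 10 -npernode 4 -mca osc ^rdma ./IMB-RMA All_get_all -npmin 10
$ mpirun -np 10 -npernode 4 -mca btl_openib_flags send,put,get,need-ack,need-csum,hetero-rdma ./IMB-RMA All_get_all -npmin 10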
Perhaps this is a duplicate of #4529, though we don't see the system call error.