Fix OSC RDMA to work in cases where one BTL cannot reach all communicator group members #7830
Comments
I have the implementation mostly complete. It will require reworking how BTL receive completions are handled, but that is a good thing; I have been looking at reworking those for years. I have a branch with the changes, but I cannot test usnic or portals4. I verified vader, tcp, and self, and will validate uGNI and UCX today. |
Send me the branch and what tests to run, and I'll check usnic. |
https://github.com/hjelmn/ompi/tree/btl_base_atomics_are_awesome This only has the changes to the BTL receive logic. Please run a standard set of two-sided tests; IMB would work. Feel free to push back any changes btl/usnic needs. The old receive callback got btl, tag, descriptor, and cbdata. The new callback gets the btl and a new receive descriptor, which contains the tag, endpoint, received message, and cbdata. |
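To make the callback change described above concrete, here is a minimal sketch of the two shapes. Every type and function name in it (the `sketch_` prefix) is a hypothetical stand-in chosen for illustration, not the actual Open MPI declarations; only the list of fields follows the description in the comment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the real BTL types; illustration only. */
typedef struct sketch_btl { const char *name; } sketch_btl_t;
typedef struct sketch_endpoint { int peer_rank; } sketch_endpoint_t;
typedef uint8_t sketch_tag_t;

/* Old shape: btl, tag, descriptor, and cbdata are separate arguments. */
typedef struct sketch_old_descriptor {
    const void *payload;
    size_t length;
} sketch_old_descriptor_t;
typedef void (*sketch_old_recv_cb_t)(sketch_btl_t *btl, sketch_tag_t tag,
                                     sketch_old_descriptor_t *des, void *cbdata);

/* New shape: one receive descriptor bundles the tag, the endpoint,
 * the received message, and cbdata. */
typedef struct sketch_recv_descriptor {
    sketch_tag_t tag;
    sketch_endpoint_t *endpoint;
    const void *payload;   /* received message (assumed flat buffer) */
    size_t length;
    void *cbdata;
} sketch_recv_descriptor_t;
typedef void (*sketch_new_recv_cb_t)(sketch_btl_t *btl,
                                     const sketch_recv_descriptor_t *des);

/* Example new-style callback: records what it saw into cbdata. */
typedef struct sketch_record {
    int last_tag;
    int last_peer;
    size_t bytes;
} sketch_record_t;

static void sketch_record_cb(sketch_btl_t *btl, const sketch_recv_descriptor_t *des)
{
    (void) btl;  /* unused in this sketch */
    assert(des != NULL && des->endpoint != NULL && des->cbdata != NULL);
    sketch_record_t *rec = (sketch_record_t *) des->cbdata;
    rec->last_tag = des->tag;
    rec->last_peer = des->endpoint->peer_rank;
    rec->bytes += des->length;
}
```

With the bundled descriptor, a callback can identify the sending endpoint without a separate lookup, which the old four-argument form did not provide directly.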
Really curious if there are any performance regressions. There shouldn't be but that is going to be where any pushback will be. |
@jsquyres Ping. Please take a look at the branch mentioned above. I need to know if this change works for you, as I cannot verify the usnic BTL changes. I want to get this change approved and in this week so I can get the btl base atomics support finished. |
@hjelmn Sorry for the delay. I ran the https://github.com/hjelmn/ompi/tree/btl_base_atomics_are_awesome branch last night and things generally look good. I'm getting some random failures, but I'm apparently getting some random failures with master, too. Specifically: with the btl_base_atomics_are_awesome branch, I'm seeing most (all?) of the one-sided tests that were failing are now passing. We'll have to address the other random failures in another issue. |
This is still causing terabytes of corefiles with Cisco's MTT that I regularly have to manually clear out to avoid filling my filesystem. I'm going to have to disable Cisco MTT's one-sided testing until this is fixed. |
I don't know if this is related to #7892 or not, but I figured I'd put in the cross-reference. |
Austen said he has some time next week to take a look. Thanks! |
osc/rdma will internally call I have to think that #9010 is related. |
It helps but I still run into issues. Unfortunately I haven't had time to press on it much since the last update, and am out next week. Earliest I can give cycles on it would be the 28th. |
oops, I accidentally clicked close, then re-opened it immediately. |
Same issue as described here? Where the comm split is wrong? |
The communicator seems to be correct. I attached to one of the processes before it segfaulted, and the issue is in osc_rdma_lock.h, in the function ompi_osc_rdma_btl_fop, where the state_index is -1. The stack and peer variable do not look corrupted or uninitialized, but both btl_index fields (state and data) of the peer object are set to 255 (that is, -1 stored in an unsigned field), so the selected_btl ends up being some memory location outside of the allocated array of BTLs. |
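The failure mode described above, an index of -1 turning into 255 and then being used to index the BTL array, can be reproduced in isolation. The sketch below assumes the index field is an 8-bit unsigned integer (the reported value 255 suggests this, but it is an assumption) and uses hypothetical `sketch_` helper names; it is not the osc/rdma code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Storing -1 in an 8-bit unsigned field wraps around to 255. */
static uint8_t sketch_store_index(int index)
{
    return (uint8_t) index;
}

/* A bounds check that would catch the wrapped "not set" value. */
static bool sketch_index_valid(uint8_t idx, size_t count)
{
    return idx < count;
}

/* Return the BTL name at idx, or NULL instead of reading past the array. */
static const char *sketch_select_btl(const char *const btls[], size_t count,
                                     uint8_t idx)
{
    return sketch_index_valid(idx, count) ? btls[idx] : NULL;
}
```

Without a check like `sketch_index_valid`, an array of two BTLs indexed at 255 reads far outside the allocation, which matches the out-of-bounds selected_btl seen in the debugger.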
Any test from the onesided directory in ompi-tests. BTL: tcp and sm. As I said above I see either segfaults or deadlocks. |
FWIW: the Cisco MTT now only shows segfaults from onesided tests and from the datatype "pack" function. It is otherwise running pretty well. Strictly tcp and sm there. |
I applied the patch from #9400 and reran the tests. Similar story as before: segfaults and deadlocks. To give a precise example, here is how I run it:
$ salloc -N 2 -wa,b
$ mpirun -np 4 --map-by node --mca osc rdma --mca btl tcp,self --mca btl_tcp_if_include ib0 ./test_dan1
The test segfaults in the lock_unlock stage. However, the same test also segfaults when using UCX:
$ mpirun -np 4 --map-by node --mca pml ucx ./test_dan1 |
@bosilca I can reproduce the segfault when running |
@bosilca The segfault from the However, even with on_demand locking, I encountered numerous issues such as hangs, segfaults, and data corruption when running other tests under I will make a list and keep looking into it. |
Honestly let's just change the default to on demand locking in this case. The performance is not bad with on demand. |
I just updated #9400 to address several new issues I found.

The biggest issue continues to be that btl/tcp does not support self communication. To address it, I added b0197d9, which uses btl/self for self communication if the selected BTL does not support it. To make that work, I had to do some refactoring and make changes to btl/self and osc/rdma.

There is another bug in osc/rdma: the shm file name is not unique. I added 3c0ae03 to address it.

With the current PR, most of the tests under the ompi-tests/onesided directory pass, except:
- One negative test in t_winerror fails. The test tries to free a non-existent window and expects MPI_Win_free() to return MPI_ERR_WIN; however, the error is considered fatal, so MPI aborts. I am not sure how (or whether) we should fix it. test_start1 has a similar issue.
- lock4 and get2 randomly hang on two nodes. I will continue to investigate.
- put1 and put4 hit data validation errors on two nodes. I will continue to investigate. |
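The fallback described above (use btl/self when the selected BTL cannot reach the local process) might look roughly like the following. This is only a sketch of the decision rule as stated in the comment; the struct, the `supports_self` flag, and the function name are hypothetical and do not reflect how b0197d9 actually implements it.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of a BTL for this sketch. */
typedef struct sketch_btl {
    const char *name;
    bool supports_self;  /* assumed flag: can this BTL reach the local rank? */
} sketch_btl_t;

/* Pick the BTL to use for a given peer. If the peer is the calling process
 * and the selected BTL cannot do self communication, fall back to the self
 * BTL (which may be NULL if btl/self was excluded). */
static const sketch_btl_t *sketch_pick_btl(const sketch_btl_t *selected,
                                           const sketch_btl_t *self_btl,
                                           int my_rank, int peer_rank)
{
    if (peer_rank == my_rank && !selected->supports_self) {
        return self_btl;
    }
    return selected;
}
```

A caller would still need to handle a NULL result, for example when a user excludes both btl/sm and btl/self.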
Catching up. Can btl/sm not be used? It allows self communication. When using tcp it should always also use btl/sm. |
But I could be remembering incorrectly :). |
Problem is people are running tests with the sm btl excluded (by specifying I do not have the full background. I wonder if there is an agreement that btl/tcp must be used with btl/sm (for osc/rdma)? |
@bosilca You're assigned to this, but are you working this? |
#9696 will fix this issue too. |
@jsquyres can we close this? |
@jsquyres ping. can this be closed? |
I think this is done. @jsquyres if this is still an issue please re-open. |
Cisco MTT shows that MPI_WIN_CREATE is failing since the OSC pt2pt component was removed (i.e., MPI_WIN_CREATE fails because there is no OSC component available). Here's an example of a simple test failing because MPI_WIN_CREATE failed.
After discussion on the 16 June 2020 weekly Open MPI webex, @hjelmn said that he could add a few functions to make OSC RDMA work in all cases (not just in cases where a single BTL can reach all members in the communicator group).
Marking this as a blocker for v5.0 because it's preventing MPI_WIN_CREATE from working in at least some environments, and because -- per the discussion between @hjelmn and @bwbarrett -- the fix to OSC RDMA doesn't seem too difficult.