hang with Put;Accumulate;Barrier osc=rdma #3616
Comments
@hjelmn Can you look at this?
I think I've identified where the progression is going wrong: all the currently available fragments for sending have been consumed, an Breaking the call sequence down in more detail:
Inside that completion callback the stack trace gets stuck at
Inside that fop call the code does roughly
The atomic_fop call sees OUT_OF_RESOURCE and queues the work onto I think those resources won't become available and the newly queued operation won't run until the "for i" loop further up has a chance to iterate over the rest of the entries that have already been identified as completed in the earlier I'm leery of completion callbacks like Instead I made an experimental change that "works" but which I'm very uneasy about: inside
to continue draining those completion events to free resources without requiring it to return. The effect of this "solution" though is potentially absurd-looking stack traces. If a completion callback is written like above, and an ibv_poll_cq() piles 64 completions into the list, rather than a sensible "for i" loop iterating over the 64 items, it's entirely possible we'd have a 64-level deep stack trace with each callback's As a proof of concept, my "fix" does demonstrate what the problem is, but I think we'll need some discussion to decide what a real fix would be.
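To make the trade-off above concrete, here is a minimal toy model in C. It is not Open MPI code: `toy_state`, `toy_handle_next()`, `toy_run()` and `toy_init()` are hypothetical names invented for this sketch, and it assumes that every completion returns one fragment once it is fully handled and that all but the last callback need a fragment before they can finish. Under those assumptions, the flat loop cannot get past the first callback, while the recursive drain finishes but nests one level per completion.

```c
#include <assert.h>

#define TOY_MAX 64

/* Hypothetical completion kinds for this sketch only. */
enum toy_kind { TOY_PLAIN, TOY_NEEDS_FRAG };

struct toy_state {
    enum toy_kind kind[TOY_MAX]; /* completions returned by one poll */
    int count;      /* number of completions in the batch */
    int next;       /* index of the next unhandled completion */
    int free_frags; /* fragments available for new operations */
    int depth;      /* current nesting of toy_handle_next() */
    int max_depth;  /* deepest nesting observed */
};

/* Handle the next completion. Returns 0 on success, or -1 where a
 * real blocking callback would spin forever waiting for a fragment. */
int toy_handle_next(struct toy_state *s, int recursive)
{
    enum toy_kind kind = s->kind[s->next++];
    int rc = 0;

    s->depth++;
    if (s->depth > s->max_depth) {
        s->max_depth = s->depth;
    }

    if (kind == TOY_NEEDS_FRAG) {
        /* The callback wants to start a new operation and waits
         * until a fragment is free. */
        while (s->free_frags == 0) {
            if (!recursive || s->next >= s->count) {
                rc = -1; /* nothing can free a fragment from here */
                break;
            }
            /* Recursive drain: handle later completions in place. */
            rc = toy_handle_next(s, recursive);
            if (rc != 0) {
                break;
            }
        }
        if (rc == 0) {
            s->free_frags--; /* fragment consumed by the new operation */
        }
    }
    if (rc == 0) {
        s->free_frags++; /* this completion's own fragment is returned */
    }
    s->depth--;
    return rc;
}

/* The "for i" loop over one batch of polled completions. */
int toy_run(struct toy_state *s, int recursive)
{
    while (s->next < s->count) {
        if (toy_handle_next(s, recursive) != 0) {
            return -1;
        }
    }
    return 0;
}

/* Batch where every callback but the last needs a fragment. */
void toy_init(struct toy_state *s, int count)
{
    int i;

    assert(count >= 1 && count <= TOY_MAX);
    for (i = 0; i < count; i++) {
        s->kind[i] = (i == count - 1) ? TOY_PLAIN : TOY_NEEDS_FRAG;
    }
    s->count = count;
    s->next = 0;
    s->free_frags = 0;
    s->depth = 0;
    s->max_depth = 0;
}
```

Calling `toy_init(&s, 64)` followed by `toy_run(&s, 0)` models the original flat loop and reports the stall; `toy_run(&s, 1)` models the experimental recursive drain, after which `s.max_depth` shows how deep the nesting went.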
I'm tempted to make a PR though with draining the existing completion list tuned so it only happens on a percentage of |
I created a pull request for an example fix (allowing handle_wc() to go recursive when it's already inside a completion function). As noted in that PR, so far I can only imagine three categories of fix:
It should be possible to constrain solution #2 a bit further, perhaps instead of locating the recursion generically at the top of poll_device() it could be located here:
which would at least allow us to be more explicit about when the next work completion function is allowed to start. But it does concern me that my "fix" results in WC2.cbfunc being called potentially in the middle of WC1.cbfunc rather than strictly after WC1.cbfunc in the original.
@jsquyres Do you remember who said this could be solved easily by changing the ompi_osc_rdma_lock_release_exclusive() function, saying it didn't really need to block and was just written that way to be easier? Geoffrey thought it was Nathan, I thought it was George. I wanted to click them in as assignees on this one.
If I had to guess: @hjelmn said it.
I said it. It shouldn't take long to update.
This commit changes the locking code to allow the lock release to be non-blocking. This helps with releasing the accumulate lock which may occur in a BTL callback. Fixes open-mpi#3616 Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Confirmed, #3728 fixes this.
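As a rough illustration of what "non-blocking lock release" means in the commit message above, here is a small C sketch. It is a toy model rather than the code from that commit: `toy_endpoint`, `toy_try_atomic_release()`, `toy_lock_release_nb()` and `toy_progress_pending()` are hypothetical names, and it assumes the release needs one free fragment to issue its atomic operation. The point is only that a release called from a completion callback records the work and returns, and the retry happens later from the normal progress path.

```c
#include <assert.h>

/* Hypothetical endpoint state for this sketch only. */
struct toy_endpoint {
    int free_frags;      /* fragments available to start an atomic op */
    int lock_held;       /* 1 while the accumulate lock is held */
    int pending_release; /* 1 if a release was deferred */
};

/* Try to issue the releasing atomic operation.
 * Returns 0 on success, -1 when no fragment is available. */
int toy_try_atomic_release(struct toy_endpoint *ep)
{
    if (ep->free_frags == 0) {
        return -1;
    }
    ep->free_frags--;
    ep->lock_held = 0;
    return 0;
}

/* Non-blocking release: never waits, so it can be called from inside
 * a completion callback without stalling the completion loop. */
void toy_lock_release_nb(struct toy_endpoint *ep)
{
    if (toy_try_atomic_release(ep) != 0) {
        ep->pending_release = 1; /* retry later from the progress path */
    }
}

/* Called from the ordinary progress loop, outside any callback. */
void toy_progress_pending(struct toy_endpoint *ep)
{
    if (ep->pending_release && toy_try_atomic_release(ep) == 0) {
        ep->pending_release = 0;
    }
}
```

In this model the callback returns immediately even with zero free fragments, which lets the completion loop keep iterating and hand fragments back; the deferred release then succeeds on a later call to `toy_progress_pending()`.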
This commit changes the locking code to allow the lock release to be non-blocking. This helps with releasing the accumulate lock which may occur in a BTL callback. Fixes #3616 Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit changes the locking code to allow the lock release to be non-blocking. This helps with releasing the accumulate lock which may occur in a BTL callback. Fixes open-mpi#3616 Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov> (cherry picked from commit 022c658) Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
I'm using master and running a testcase between two InfiniBand machines as
% mpirun -mca pml ob1 -mca btl openib,vader,self -mca osc rdma -host mpi03,mpi04 ./x
and can hit the bug using pml yalla as well. It needs to span two hosts to fail.
Here's a gist of the testcase:
https://gist.github.com/markalle/ccbd729df75188378d538767c0321f4e
It boils down to
where mydt in the MPI_Put is non-contiguous.
The initiator of the Put and Accumulate ends up going from opal_progress() to handle_wc() where it's handling a completion callback for the Accumulate request, and hangs with the bottom of its stack trace looking like this
But I'm getting lost beyond that.