hang with Put;Accumulate;Barrier osc=rdma #3616

Closed
markalle opened this issue May 31, 2017 · 8 comments

I'm using master and running a test case between two InfiniBand machines as
% mpirun -mca pml ob1 -mca btl openib,vader,self -mca osc rdma -host mpi03,mpi04 ./x
and I can hit the bug with pml yalla as well. It needs to span two hosts to fail.

Here's a gist of the test case:
https://gist.github.com/markalle/ccbd729df75188378d538767c0321f4e
It boils down to

        MPI_Accumulate(sbuf, 100, MPI_INT, to, 0, 100, MPI_INT, MPI_MAX, win);
        MPI_Barrier(MPI_COMM_WORLD);
        MPI_Put(sbuf+2000, 100, mydt, to, 2000, 100, mydt, win);
        MPI_Barrier(MPI_COMM_WORLD);

where mydt in the MPI_Put is non-contiguous.
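
For context, a minimal standalone reproducer along the lines of the gist might look like the sketch below. This is an approximation, not the gist itself: the buffer sizes, the MPI_Type_vector definition of mydt, the choice of target rank, and the lock_all/flush synchronization are all assumptions.

        /* Rough sketch of the Put;Accumulate;Barrier pattern above.
         * NOT the actual gist: datatype, sizes and synchronization are guesses. */
        #include <mpi.h>
        #include <stdlib.h>

        int main(int argc, char **argv)
        {
            MPI_Init(&argc, &argv);

            int rank, size;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &size);
            int to = (rank + 1) % size;          /* assumed peer selection */

            /* Non-contiguous datatype: 100 ints, each separated by a gap. */
            MPI_Datatype mydt;
            MPI_Type_vector(100, 1, 2, MPI_INT, &mydt);
            MPI_Type_commit(&mydt);

            int *sbuf = calloc(1000000, sizeof(int));
            int *winbuf;
            MPI_Win win;
            MPI_Win_allocate(1000000 * sizeof(int), sizeof(int), MPI_INFO_NULL,
                             MPI_COMM_WORLD, &winbuf, &win);

            MPI_Win_lock_all(0, win);

            MPI_Accumulate(sbuf, 100, MPI_INT, to, 0, 100, MPI_INT, MPI_MAX, win);
            MPI_Win_flush(to, win);              /* assumed; the gist may synchronize differently */
            MPI_Barrier(MPI_COMM_WORLD);

            MPI_Put(sbuf + 2000, 100, mydt, to, 2000, 100, mydt, win);
            MPI_Win_flush(to, win);              /* assumed; the gist may synchronize differently */
            MPI_Barrier(MPI_COMM_WORLD);

            MPI_Win_unlock_all(win);
            MPI_Win_free(&win);
            MPI_Type_free(&mydt);
            free(sbuf);
            MPI_Finalize();
            return 0;
        }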

The initiator of the Put and Accumulate ends up going from opal_progress() to handle_wc(), where it is handling a completion callback for the Accumulate request, and hangs with the bottom of its stack trace looking like this:

#0  ompi_osc_rdma_lock_btl_fop (module=0xb07840, peer=0xbd70f0, offset=16)
    at ../../../../../../opensrc/ompi/ompi/mca/osc/rdma/osc_rdma_lock.h:63
#1  ompi_osc_rdma_lock_btl_op (module=0xb07840, peer=0xbd70f0, offset=16)
    at ../../../../../../opensrc/ompi/ompi/mca/osc/rdma/osc_rdma_lock.h:88
#2  ompi_osc_rdma_lock_release_exclusive (module=0xb07840, peer=0xbd70f0, 
    offset=16)
    at ../../../../../../opensrc/ompi/ompi/mca/osc/rdma/osc_rdma_lock.h:342
#3  0x00007fffe95c7805 in ompi_osc_rdma_acc_put_complete (btl=0x74f7b0, 
    endpoint=0x7fffe40125d0, local_address=0x7ffff0020020, 
    local_handle=0xc4e5d8, context=0xc63d90, data=0x0, status=0)
    at ../../../../../../opensrc/ompi/ompi/mca/osc/rdma/osc_rdma_accumulate.c:113

But I'm getting lost beyond that.

markalle (Contributor Author):

@hjelmn Can you look at this?

markalle commented Jun 1, 2017

I think I've identified where the progression goes wrong: all of the currently available send fragments have been consumed, an ibv_poll_cq() has identified 64 completions to be handled, and the handle_wc() callback wants to send another message, so it blocks looping on ompi_osc_rdma_progress() until it can send; but the necessary resources won't be freed until the 64 already-completed items get processed.

Breaking the call sequence down in more detail:

  • the non-contiguous MPI_Put performs 64+ ibv_post_send() calls, up to some resource limit, one for each piece of the message (more pieces exist and are queued up somewhere; a subsequent mca_btl_openib_frag_progress_pending_put_get() call is what will start the next piece)
  • in poll_device() an ibv_poll_cq() returns 64 completions
  • a simple "for i" loop wants to call handle_wc() on each
  • at i=0 the completion callback is ompi_osc_rdma_acc_put_complete()

Inside that completion callback the stack trace gets stuck at

ompi_osc_rdma_acc_put_complete
  ompi_osc_rdma_lock_release_exclusive
    ompi_osc_rdma_lock_btl_op
      ompi_osc_rdma_lock_btl_fop

Inside that fop call the code does roughly

        mca_btl_openib_atomic_fop(.., &atomic_complete, ..);
        while (!atomic_complete) {
            ompi_osc_rdma_progress (module);
        }

The atomic_fop call sees OUT_OF_RESOURCE and queues the work onto endpoint->pending_get_frags.

I think those resources won't become available and the newly queued operation won't run until the "for i" loop further up has a chance to iterate over the rest of the entries that have already been identified as completed in the earlier ibv_poll_cq().
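
To make the shape of the deadlock concrete, here is a small self-contained toy model (plain C, not OMPI code; the names and the fragment accounting are deliberately simplified):

        /* Toy model of the progress deadlock: the callback for wc[0] waits for
         * "fragments" that only handling wc[1..63] would free, but those
         * completions are queued behind it in the same for loop. */
        #include <stdio.h>
        #include <stdbool.h>

        #define NUM_WC 64

        static int free_frags = 0;       /* send fragments available right now */

        /* Stand-in for ompi_osc_rdma_progress(): in the hang it finds nothing
         * to drain, because the already-polled completions live in a local
         * wc[] array owned by the outer poll_device() call. */
        static void progress(void) { }

        /* Stand-in for handle_wc(): wc[0]'s callback needs a fragment for the
         * lock-release fetch-and-op and spins until one shows up; handling any
         * later completion would free a fragment. */
        static bool handle_wc(int wc_index)
        {
            if (0 == wc_index) {
                for (int spin = 0; spin < 1000; spin++) {   /* bounded, unlike the real hang */
                    if (free_frags > 0) return true;
                    progress();
                }
                printf("deadlock: wc[0]'s callback needs fragments that only "
                       "wc[1..%d] would free\n", NUM_WC - 1);
                return false;
            }
            free_frags++;
            return true;
        }

        int main(void)
        {
            /* poll_device(): a simple "for i" loop over the 64 polled completions. */
            for (int i = 0; i < NUM_WC; i++) {
                if (!handle_wc(i)) break;       /* never gets past i == 0 */
            }
            return 0;
        }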

I'm leery of completion callbacks like ompi_osc_rdma_acc_put_complete() being allowed to block like that, but I don't know how extensive a change it would be to avoid that (@hjelmn).

Instead I made an experimental change that "works" but which I'm very uneasy about: inside poll_device() I put the wc[] entries returned by ibv_poll_cq() onto an opal_list_t, and at the beginning of poll_device() I process completions from that list whenever it's not empty. This allows the completion callback's

        while (!atomic_complete) {
            ompi_osc_rdma_progress (module);
        }

to continue draining those completion events to free resources without requiring it to return.

The effect of this "solution", though, is potentially absurd-looking stack traces. If a completion callback is written like the one above, and an ibv_poll_cq() piles 64 completions into the list, then rather than a sensible "for i" loop iterating over the 64 items, it's entirely possible we'd get a 64-level-deep stack trace, with each callback's ompi_osc_rdma_progress() draining the next completion from the list.

As a proof of concept, my "fix" does demonstrate what the problem is, but I think we'll need some discussion to decide what a real fix would be.
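
Roughly, the experimental change looks like the sketch below. This is my reading of the idea, not the actual patch: the wrapper type, the list name, and the simplified argument lists are all made up.

        /* Hypothetical wrapper so a struct ibv_wc can sit on an opal_list_t. */
        typedef struct {
            opal_list_item_t super;
            struct ibv_wc    wc;
        } pending_wc_item_t;
        OBJ_CLASS_INSTANCE(pending_wc_item_t, opal_list_item_t, NULL, NULL);

        static opal_list_t pending_completions;   /* OBJ_CONSTRUCTed at init */

        static int poll_device(struct ibv_cq *cq, int count)
        {
            pending_wc_item_t *item;

            /* 1. Drain completions that a previous (possibly still-active)
             *    poll_device() call already pulled off the CQ, so a callback
             *    spinning in ompi_osc_rdma_progress() frees resources instead
             *    of deadlocking. */
            while (NULL != (item = (pending_wc_item_t *)
                                opal_list_remove_first(&pending_completions))) {
                handle_wc(&item->wc);              /* argument list simplified */
                OBJ_RELEASE(item);
            }

            /* 2. Park newly polled completions on the list instead of handling
             *    them in a local "for i" loop. */
            struct ibv_wc wc[64];
            int n = ibv_poll_cq(cq, 64, wc);
            for (int i = 0; i < n; i++) {
                item = OBJ_NEW(pending_wc_item_t);
                item->wc = wc[i];
                opal_list_append(&pending_completions, &item->super);
            }
            return n;
        }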

markalle commented Jun 1, 2017

I'm tempted to make a PR, though, with the draining of the existing completion list tuned so it only happens on a percentage of poll_device() calls, with the percentage getting lower the more levels of recursion are already present. That way incidental progress calls would be relatively unlikely to go recursive, at least not deeply, but "while" loops like the one above would still get the progression they need. I think that would mitigate the danger of absurdly deep stack traces.
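
One possible shape of that tuning (a sketch only; the depth counter, the modulo schedule, and drain_pending_completions() are all made up, and this is just one way to read "percentage getting lower with recursion depth"):

        /* Drain the pending-completion list only on a fraction of poll_device()
         * calls, with the fraction shrinking as the recursion gets deeper. */
        static int poll_depth = 0;
        static unsigned poll_calls = 0;

        static int poll_device(struct ibv_cq *cq, int count)
        {
            int n = 0;
            ++poll_depth;
            ++poll_calls;
            /* depth 1: every call, depth 2: every 4th, depth 3: every 16th, ... */
            if (0 == (poll_calls % (1u << (2 * (poll_depth - 1))))) {
                drain_pending_completions();   /* hypothetical helper from the sketch above */
            }
            /* ... normal ibv_poll_cq() handling ... */
            --poll_depth;
            return n;
        }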

markalle commented Jun 1, 2017

I created a pull request for an example fix (allowing handle_wc() to go recursive when it's already inside a completion function):
#3640

As noted in that PR, so far I can only imagine three categories of fix:

  1. don't let completion functions block like that
  2. allow recursive handle_wc() while already inside a completion function
  3. have some reserved resources for completion functions only so they can always complete

It should be possible to constrain solution #2 a bit further: perhaps instead of locating the recursion generically at the top of poll_device(), it could be located here:

        mca_btl_openib_atomic_fop(.., &atomic_complete, ..);
        while (!atomic_complete) {
            ompi_osc_rdma_progress (module);
            some_new_call_that_is_allowed_to_process_existing_work_completions(); // even though we're not strictly done with the current work completion
        }

which would at least allow us to be more explicit about when the next work completion function is allowed to start. But it does concern me that my "fix" results in WC2.cbfunc potentially being called in the middle of WC1.cbfunc, rather than strictly after WC1.cbfunc as in the original code.

markalle commented Jun 9, 2017

@jsquyres Do you remember who said this could be solved easily by changing the ompi_osc_rdma_lock_release_exclusive() function, saying it didn't really need to block and was just written that way because it was easier?

Geoffrey thought it was Nathan; I thought it was George. I wanted to add them as assignees on this one.

jsquyres commented Jun 9, 2017

If I had to guess: @hjelmn said it.

hjelmn commented Jun 9, 2017

I said it. It shouldn't take long to update.

hjelmn added a commit to hjelmn/ompi that referenced this issue Jun 21, 2017
This commit changes the locking code to allow the lock release to be
non-blocking. This helps with releasing the accumulate lock which may
occur in a BTL callback.

Fixes open-mpi#3616

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
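
Conceptually the change is along these lines (a sketch of the idea only, not the code from the referenced commit or #3728; the callback name is made up):

        /* Before: the release spins inside a BTL completion callback until the
         * lock-release fetch-and-op finishes. */
        mca_btl_openib_atomic_fop(.., &atomic_complete, ..);
        while (!atomic_complete) {
            ompi_osc_rdma_progress (module);
        }

        /* After (the idea): issue the fetch-and-op with a completion callback
         * and return immediately; the hypothetical callback finishes the unlock
         * bookkeeping when the atomic completes, so nothing blocks in handle_wc(). */
        mca_btl_openib_atomic_fop(.., lock_release_complete_cb, ..);
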
markalle (Contributor Author):

Confirmed, #3728 fixes this.

hjelmn added a commit that referenced this issue Jun 27, 2017
This commit changes the locking code to allow the lock release to be
non-blocking. This helps with releasing the accumulate lock which may
occur in a BTL callback.

Fixes #3616

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>