hang with Put;Accumulate;Barrier osc=rdma #3616

Closed
markalle opened this issue May 31, 2017 · 8 comments

I'm using master and running a test case between two InfiniBand machines as
% mpirun -mca pml ob1 -mca btl openib,vader,self -mca osc rdma -host mpi03,mpi04 ./x
and I can hit the bug with pml yalla as well. It needs to span two hosts to fail.

Here's a gist of the test case:
https://gist.github.com/markalle/ccbd729df75188378d538767c0321f4e
It boils down to

        MPI_Accumulate(sbuf, 100, MPI_INT, to, 0, 100, MPI_INT, MPI_MAX, win);
        MPI_Barrier(MPI_COMM_WORLD);
        MPI_Put(sbuf+2000, 100, mydt, to, 2000, 100, mydt, win);
        MPI_Barrier(MPI_COMM_WORLD);

where mydt in the MPI_Put is non-contiguous.
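
For context, a minimal standalone reproducer along the lines of the gist might look like the sketch below. This is an approximation, not the gist itself: the buffer sizes, the MPI_Type_vector definition of mydt, the choice of target rank, and the lock_all/flush synchronization are all assumptions.

        /* Rough sketch of the Put;Accumulate;Barrier pattern above.
         * NOT the actual gist: datatype, sizes and synchronization are guesses. */
        #include <mpi.h>
        #include <stdlib.h>

        int main(int argc, char **argv)
        {
            MPI_Init(&argc, &argv);

            int rank, size;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &size);
            int to = (rank + 1) % size;          /* assumed peer selection */

            /* Non-contiguous datatype: 100 ints, each separated by a gap. */
            MPI_Datatype mydt;
            MPI_Type_vector(100, 1, 2, MPI_INT, &mydt);
            MPI_Type_commit(&mydt);

            int *sbuf = calloc(1000000, sizeof(int));
            int *winbuf;
            MPI_Win win;
            MPI_Win_allocate(1000000 * sizeof(int), sizeof(int), MPI_INFO_NULL,
                             MPI_COMM_WORLD, &winbuf, &win);

            MPI_Win_lock_all(0, win);

            MPI_Accumulate(sbuf, 100, MPI_INT, to, 0, 100, MPI_INT, MPI_MAX, win);
            MPI_Win_flush(to, win);              /* assumed; the gist may synchronize differently */
            MPI_Barrier(MPI_COMM_WORLD);

            MPI_Put(sbuf + 2000, 100, mydt, to, 2000, 100, mydt, win);
            MPI_Win_flush(to, win);              /* assumed; the gist may synchronize differently */
            MPI_Barrier(MPI_COMM_WORLD);

            MPI_Win_unlock_all(win);
            MPI_Win_free(&win);
            MPI_Type_free(&mydt);
            free(sbuf);
            MPI_Finalize();
            return 0;
        }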

The initiator of the Put and Accumulate ends up going from opal_progress() to handle_wc(), where it is handling a completion callback for the Accumulate request, and hangs with the bottom of its stack trace looking like this:

#0  ompi_osc_rdma_lock_btl_fop (module=0xb07840, peer=0xbd70f0, offset=16)
    at ../../../../../../opensrc/ompi/ompi/mca/osc/rdma/osc_rdma_lock.h:63
#1  ompi_osc_rdma_lock_btl_op (module=0xb07840, peer=0xbd70f0, offset=16)
    at ../../../../../../opensrc/ompi/ompi/mca/osc/rdma/osc_rdma_lock.h:88
#2  ompi_osc_rdma_lock_release_exclusive (module=0xb07840, peer=0xbd70f0, 
    offset=16)
    at ../../../../../../opensrc/ompi/ompi/mca/osc/rdma/osc_rdma_lock.h:342
#3  0x00007fffe95c7805 in ompi_osc_rdma_acc_put_complete (btl=0x74f7b0, 
    endpoint=0x7fffe40125d0, local_address=0x7ffff0020020, 
    local_handle=0xc4e5d8, context=0xc63d90, data=0x0, status=0)
    at ../../../../../../opensrc/ompi/ompi/mca/osc/rdma/osc_rdma_accumulate.c:113

But I'm getting lost beyond that.

markalle (Contributor Author):

@hjelmn Can you look at this?

markalle commented Jun 1, 2017

I think I've identified where the progression goes wrong: all of the currently available send fragments have been consumed, an ibv_poll_cq() has identified 64 completions to be handled, and the handle_wc() callback wants to send another message, so it blocks looping on ompi_osc_rdma_progress() until it can send; but the necessary resources won't be freed until the 64 already-completed items get processed.

Breaking the call sequence down in more detail:

  • the non-contiguous MPI_Put performs 64+ ibv_post_send() calls, up to some resource limit, one for each piece of the message (more pieces exist and are queued up somewhere; a subsequent mca_btl_openib_frag_progress_pending_put_get() call is what will start the next piece)
  • in poll_device() an ibv_poll_cq() returns 64 completions
  • a simple "for i" loop wants to call handle_wc() on each
  • at i=0 the completion callback is ompi_osc_rdma_acc_put_complete()

Inside that completion callback the stack trace gets stuck at

ompi_osc_rdma_acc_put_complete
  ompi_osc_rdma_lock_release_exclusive
    ompi_osc_rdma_lock_btl_op
      ompi_osc_rdma_lock_btl_fop

Inside that fop call the code does roughly

        mca_btl_openib_atomic_fop(.., &atomic_complete, ..);
        while (!atomic_complete) {
            ompi_osc_rdma_progress (module);
        }

The atomic_fop call sees OUT_OF_RESOURCE and queues the work onto endpoint->pending_get_frags.

I think those resources won't become available and the newly queued operation won't run until the "for i" loop further up has a chance to iterate over the rest of the entries that have already been identified as completed in the earlier ibv_poll_cq().
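
To make the shape of the deadlock concrete, here is a small self-contained toy model (plain C, not OMPI code; the names and the fragment accounting are deliberately simplified):

        /* Toy model of the progress deadlock: the callback for wc[0] waits for
         * "fragments" that only handling wc[1..63] would free, but those
         * completions are queued behind it in the same for loop. */
        #include <stdio.h>
        #include <stdbool.h>

        #define NUM_WC 64

        static int free_frags = 0;       /* send fragments available right now */

        /* Stand-in for ompi_osc_rdma_progress(): in the hang it finds nothing
         * to drain, because the already-polled completions live in a local
         * wc[] array owned by the outer poll_device() call. */
        static void progress(void) { }

        /* Stand-in for handle_wc(): wc[0]'s callback needs a fragment for the
         * lock-release fetch-and-op and spins until one shows up; handling any
         * later completion would free a fragment. */
        static bool handle_wc(int wc_index)
        {
            if (0 == wc_index) {
                for (int spin = 0; spin < 1000; spin++) {   /* bounded, unlike the real hang */
                    if (free_frags > 0) return true;
                    progress();
                }
                printf("deadlock: wc[0]'s callback needs fragments that only "
                       "wc[1..%d] would free\n", NUM_WC - 1);
                return false;
            }
            free_frags++;
            return true;
        }

        int main(void)
        {
            /* poll_device(): a simple "for i" loop over the 64 polled completions. */
            for (int i = 0; i < NUM_WC; i++) {
                if (!handle_wc(i)) break;       /* never gets past i == 0 */
            }
            return 0;
        }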

I'm leery of completion callbacks like ompi_osc_rdma_acc_put_complete() being allowed to block like that, but I don't know how extensive a change it would be to avoid that (@hjelmn).

Instead I made an experimental change that "works" but which I'm very uneasy about: inside poll_device() I put the wc[] entries returned by ibv_poll_cq() onto an opal_list_t, and at the beginning of poll_device() I process completions from that list whenever it's not empty. This allows the completion callback's

        while (!atomic_complete) {
            ompi_osc_rdma_progress (module);
        }

to continue draining those completion events to free resources without requiring it to return.

The effect of this "solution", though, is potentially absurd-looking stack traces. If a completion callback is written like the one above, and an ibv_poll_cq() piles 64 completions into the list, then rather than a sensible "for i" loop iterating over the 64 items, it's entirely possible we'd get a 64-level-deep stack trace, with each callback's ompi_osc_rdma_progress() draining the next completion from the list.

As a proof of concept, my "fix" does demonstrate what the problem is, but I think we'll need some discussion to decide what a real fix would be.
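
Roughly, the experimental change looks like the sketch below. This is my reading of the idea, not the actual patch: the wrapper type, the list name, and the simplified argument lists are all made up.

        /* Hypothetical wrapper so a struct ibv_wc can sit on an opal_list_t. */
        typedef struct {
            opal_list_item_t super;
            struct ibv_wc    wc;
        } pending_wc_item_t;
        OBJ_CLASS_INSTANCE(pending_wc_item_t, opal_list_item_t, NULL, NULL);

        static opal_list_t pending_completions;   /* OBJ_CONSTRUCTed at init */

        static int poll_device(struct ibv_cq *cq, int count)
        {
            pending_wc_item_t *item;

            /* 1. Drain completions that a previous (possibly still-active)
             *    poll_device() call already pulled off the CQ, so a callback
             *    spinning in ompi_osc_rdma_progress() frees resources instead
             *    of deadlocking. */
            while (NULL != (item = (pending_wc_item_t *)
                                opal_list_remove_first(&pending_completions))) {
                handle_wc(&item->wc);              /* argument list simplified */
                OBJ_RELEASE(item);
            }

            /* 2. Park newly polled completions on the list instead of handling
             *    them in a local "for i" loop. */
            struct ibv_wc wc[64];
            int n = ibv_poll_cq(cq, 64, wc);
            for (int i = 0; i < n; i++) {
                item = OBJ_NEW(pending_wc_item_t);
                item->wc = wc[i];
                opal_list_append(&pending_completions, &item->super);
            }
            return n;
        }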

markalle commented Jun 1, 2017

I'm tempted to make a PR, though, with the draining of the existing completion list tuned so it only happens on a percentage of poll_device() calls, with the percentage getting lower the more levels of recursion are already present. That way incidental progress calls would be relatively unlikely to go recursive, at least not deeply, but "while" loops like the one above would still get the progression they need. I think that would mitigate the danger of absurdly deep stack traces.
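
One possible shape of that tuning (a sketch only; the depth counter, the modulo schedule, and drain_pending_completions() are all made up, and this is just one way to read "percentage getting lower with recursion depth"):

        /* Drain the pending-completion list only on a fraction of poll_device()
         * calls, with the fraction shrinking as the recursion gets deeper. */
        static int poll_depth = 0;
        static unsigned poll_calls = 0;

        static int poll_device(struct ibv_cq *cq, int count)
        {
            int n = 0;
            ++poll_depth;
            ++poll_calls;
            /* depth 1: every call, depth 2: every 4th, depth 3: every 16th, ... */
            if (0 == (poll_calls % (1u << (2 * (poll_depth - 1))))) {
                drain_pending_completions();   /* hypothetical helper from the sketch above */
            }
            /* ... normal ibv_poll_cq() handling ... */
            --poll_depth;
            return n;
        }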

markalle commented Jun 1, 2017

I created a pull request for an example fix (allowing handle_wc() to go recursive when it's already inside a completion function):
#3640

As noted in that PR, so far I can only imagine three categories of fix:

  1. don't let completion functions block like that
  2. allow recursive handle_wc() while already inside a completion function
  3. have some reserved resources for completion functions only so they can always complete

It should be possible to constrain solution #2 a bit further: perhaps instead of locating the recursion generically at the top of poll_device(), it could be located here:

        mca_btl_openib_atomic_fop(.., &atomic_complete, ..);
        while (!atomic_complete) {
            ompi_osc_rdma_progress (module);
            some_new_call_that_is_allowed_to_process_existing_work_completions(); // even though we're not strictly done with the current work completion
        }

which would at least allow us to be more explicit about when the next work completion function is allowed to start. But it does concern me that my "fix" results in WC2.cbfunc potentially being called in the middle of WC1.cbfunc, rather than strictly after WC1.cbfunc as in the original code.

markalle commented Jun 9, 2017

@jsquyres Do you remember who said this could be solved easily by changing the ompi_osc_rdma_lock_release_exclusive() function, saying it didn't really need to block and was just written that way because it was easier?

Geoffrey thought it was Nathan; I thought it was George. I wanted to add them as assignees on this one.

jsquyres commented Jun 9, 2017

If I had to guess: @hjelmn said it.

hjelmn commented Jun 9, 2017

I said it. It shouldn't take long to update.

hjelmn added a commit to hjelmn/ompi that referenced this issue Jun 21, 2017
This commit changes the locking code to allow the lock release to be
non-blocking. This helps with releasing the accumulate lock which may
occur in a BTL callback.

Fixes open-mpi#3616

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
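
Conceptually the change is along these lines (a sketch of the idea only, not the code from the referenced commit or #3728; the callback name is made up):

        /* Before: the release spins inside a BTL completion callback until the
         * lock-release fetch-and-op finishes. */
        mca_btl_openib_atomic_fop(.., &atomic_complete, ..);
        while (!atomic_complete) {
            ompi_osc_rdma_progress (module);
        }

        /* After (the idea): issue the fetch-and-op with a completion callback
         * and return immediately; the hypothetical callback finishes the unlock
         * bookkeeping when the atomic completes, so nothing blocks in handle_wc(). */
        mca_btl_openib_atomic_fop(.., lock_release_complete_cb, ..);
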
markalle (Contributor Author):

Confirmed, #3728 fixes this.

hjelmn added a commit that referenced this issue Jun 27, 2017
This commit changes the locking code to allow the lock release to be
non-blocking. This helps with releasing the accumulate lock which may
occur in a BTL callback.

Fixes #3616

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>