osc/rdma: Fixes when running with btl/tcp #8756

awlauria · 2021-04-01T20:32:52Z

No description provided.

devreal

Looks good except for the removal of the shared lock in the dynamic window querying, which seems suspicious.

devreal · 2021-04-01T21:22:47Z

ompi/mca/osc/rdma/osc_rdma_dynamic.c

@@ -428,10 +428,6 @@ static int ompi_osc_rdma_refresh_dynamic_region (ompi_osc_rdma_module_t *module,
        }
        peer->regions = temp;

-        /* lock the region */
-        ompi_osc_rdma_lock_acquire_shared (module, &peer->super, 1, offsetof (ompi_osc_rdma_state_t, regions_lock),


Can you say a word why you removed the lock? AFAICS, the owner of the region takes an exclusive lock in ompi_osc_rdma_attach and the shared lock here is needed to ensure the consistency of the region information.

I'm seeing a hang in this code path. From what I can see, because this process already has an exclusive lock on this peer, it can't acquire the shared lock (it thinks someone else owns it), and will loop for infinity trying to acquire this lock. To me this looks like a double-lock scenerio. @hjelmn can probably say if that is intended or not, perhaps there is a different fix needed for this case.

awlauria · 2021-04-01T21:48:39Z

I added an additional commit to fix #8677

bosilca

I don't understand what this PR is trying to do, and neither the commit logs or the PR description provides the necessary clues.

ompi/mca/osc/rdma/osc_rdma_component.c

ompi/mca/osc/rdma/osc_rdma_lock.h

bosilca · 2021-04-01T22:10:45Z

opal/mca/btl/self/btl_self.c

@@ -218,7 +218,7 @@ static int mca_btl_self_send(struct mca_btl_base_module_t *btl,
    if (btl_ownership) {
        mca_btl_self_free(btl, des);
    }
-    return 1;
+    return OPAL_SUCCESS;


There is a valid reason to return 1 from this async function. It is to indicate that everything has been taken care of, and the message is already done despite the fact that we expect it to happen in an async way. Without this return the upper level cannot make assumptions about the code in the case where MCA_BTL_DES_SEND_ALWAYS_CALLBACK is not provided.

This is a byproduct of btl/base now calling btl_send() directly. Here it expects a return code of OPAL_SUCCESS, or else it does not remove the message from the queue. For btl/self this is problematic, and will result in a double free:

https://github.com/open-mpi/ompi/blob/master/opal/mca/btl/base/btl_base_am_rdma.c#L784

I'm not sure of the best way to address this, maybe it should always remove it from the queue?

The code you pointed to is incorrect, it should not test for OMPI_SUCCESS but should use OPAL_UNLIKELY(rc >= 0) as everywhere else. I think the fix is simple, you can remove the item from the list as soon as the ret is OPAL_UNLIKELY(ret >= 0), as this means either that the message has been already sent, or that there will be a callback once the send completes (but in both cases the BTL will handle the rest of this message lifetime).

@hjelmn any comment here.

@bosilca Yes. We should be checking if the return code is >= 0 not just success. Forgot about that case. 1 means the data is gone.

Making this change back to returning 1 for btl_self_send() seems to open up a whole can of worms for osc/rdma that need to get flushed out. Unfortunately it doesn't seem like a trivial fix - it just exposes other incompatibilities.

opal/mca/btl/tcp/btl_tcp.c

Running with btl/tcp: - Make sure peer->state_endpoint is set correctly. Cleanup/leaks: - Don't parse ompi_osc_rdma_btl_alternate_names twice. - free temp variable in allocate_state_shared() loop. Signed-off-by: Austen Lauria <awlauria@us.ibm.com>

- Add it to ompi_osc_rdma_btl_alternate_names. This will allow it to be used with the active messaging btl/base. - btl/self: Return OPAL_SUCCESS from mca_btl_self_send(). btl/base expects it to return OPAL_SUCCESS, otherwise it will be incorrectly queued. Signed-off-by: Austen Lauria <awlauria@us.ibm.com>

Use btl/base active messaging protocols instead. Signed-off-by: Austen Lauria <awlauria@us.ibm.com>

Prior to entering this, it already grabs an exclusive lock. Trying to acquire a shared lock here will hang. Signed-off-by: Austen Lauria <awlauria@us.ibm.com>

The put/get functions can now use the new active messaging protocol provided by btl/base. Signed-off-by: Austen Lauria <awlauria@us.ibm.com>

- Make sure the epoch type is set before returning from MPI_Win_start(). - Make sure the group is only free'd if it is valid in MPI_Win_complete(). - Fix possible double free() of the group. Signed-off-by: Austen Lauria <awlauria@us.ibm.com>

hjelmn · 2021-04-02T02:27:05Z

btl/self should never be used by osc/rdma

bosilca · 2021-04-02T03:43:27Z

btl/self should never be used by osc/rdma

That was my understanding as well, but if you read the discussion related to the rdma_btl_alternate_names it appears it is necessary in order to allow a process to lock itself.

awlauria · 2021-04-02T04:00:52Z

It's possible that the wire-up isn't happening right when running tcp across node.

I think what's happening is when running across node, the OMPI_OSC_RDMA_PEER_LOCAL_STATE flag isn't being set on the the 'self' peer, forcing it down the fop/cswap path for lock/unlock and company.

By the comments though - that seems intentional to not set that flag when running across multiple nodes as I did in my example.

awlauria · 2021-04-02T04:10:06Z

To be clear - running tcp on a single node works for lock/unlock on myself, so:

mpirun --np 2 --mca osc rdma --mca btl self,tcp ./lock

works, but forcing that across node:

mpirun --np 2 --host "hostA:1,hostB:1" --mca osc rdma --mca btl self,tcp ./lock

is where the issues happen. The single node example will go down the ompi_osc_rdma_lock_add() path, knowing itself is on-node. But when running on multiple nodes, it will go down the btl_fop() path on itself and crash since btl_fop() isn't implemented with btl/self right now - hence my change to add it to the alternate btls to give it the active-message path. Intuitively this seems correct since the across node lock/unlock requests will also use the active-message rdma. But perhaps there is a different approach?

dalcinl · 2021-04-02T07:42:26Z

ompi/mca/osc/rdma/osc_rdma_active_target.c

@@ -466,17 +466,19 @@ int ompi_osc_rdma_complete_atomic (ompi_win_t *win)
    sync->type = OMPI_OSC_RDMA_SYNC_TYPE_NONE;
    sync->epoch_active = false;

-    /* phase 2 cleanup group */
-    OBJ_RELEASE(group);
-
    peers = sync->peer_list.peers;
    if (NULL == peers) {
        /* empty peer list */
        OPAL_THREAD_UNLOCK(&(module->lock));


Would't it be better to release the lock after OBj_RELEASE(group), right before the return statement? Any chance multiple threads would compete to decrement the group's reference count?

Maybe this is totally irrelevant, I do not know this codebase. In that case just ignore my comment.

Yeah that seems reasonable to me. I'll make the change. Thanks

The user provided group is unnecessary at this point, releasing it as early as possible seems correct.

In fact, the user provided group is only necessary early on while building the peers array. As we refcount each peer proc explicitly, technically we don't even need to keep a pointer to the group, because this leads to double refcounting the peers. From a performance perspective, I would rather keep and refcount only the group than refcounting each individual procs (because the procs are already refcounted by the group). We could save 2*peers atomic operations per epoch.

hjelmn · 2021-04-02T13:44:29Z

btl/self should never be used by osc/rdma

That was my understanding as well, but if you read the discussion related to the rdma_btl_alternate_names it appears it is necessary in order to allow a process to lock itself.

Yes. But all we should be setting a flag that local and btl atomics mix in this situation. osc/rdma should then never call the btl atomic for any local peer. It seems the problem is that it is not setting the local state as it should in this case. I can take a look a bit later and see why. Some of this code is quite complicated and due for a refactor.

awlauria · 2021-04-02T15:30:40Z

@hjelmn @bosilca I added a commit that seems to work. This avoids the btl atomics path for self, and restores mca-btl_self_send() to return 1 as before.

I did run into a problem where returning 1 from fop/cswap would try to do the callback twice, that might need more investigation and require a different fix. but the hack works.

if this is fine I can rebase and squash it down.

awlauria · 2021-04-02T15:50:57Z

scratch that - still seeing some issues. I'll investigate later.

hjelmn · 2021-04-02T15:58:26Z

So, alternate BTL -> always local state for self and local peers. That is because the alternate BTLs set the MCA_BTL_ATOMIC_SUPPORTS_GLOB flag. It really sounds like osc/rdma has an error in that logic.

- Remove self from alternate btls. - Force osc/rdma to take the non btl-atomic path for self. Signed-off-by: Austen Lauria <awlauria@us.ibm.com>

awlauria · 2021-05-19T20:31:55Z

Closing this as it is unlikely to be merged as is.

awlauria requested review from bosilca, hjelmn and devreal April 1, 2021 20:32

This was referenced Apr 1, 2021

osc/rdma: Fix some bugs running with btl/tcp. #8719

Closed

btl/tcp: MPI_Put/Accumulate(): Fix pre-mature completeion callback. #8700

Closed

devreal requested changes Apr 1, 2021

View reviewed changes

awlauria mentioned this pull request Apr 1, 2021

Regression: MPI_Win_start()/complete() with MPI_GROUP_EMPTY #8677

Closed

bosilca reviewed Apr 1, 2021

View reviewed changes

awlauria force-pushed the rdma_btl_tcp_fixes branch 2 times, most recently from 7b5f0f2 to 48bcb5f Compare April 1, 2021 23:22

awlauria added 6 commits April 1, 2021 19:36

osc/rdma: Fix some misc. bugs.

5e3219f

Running with btl/tcp: - Make sure peer->state_endpoint is set correctly. Cleanup/leaks: - Don't parse ompi_osc_rdma_btl_alternate_names twice. - free temp variable in allocate_state_shared() loop. Signed-off-by: Austen Lauria <awlauria@us.ibm.com>

btl/tcp: Don't use mca_btl_tcp_put.

95b1039

Use btl/base active messaging protocols instead. Signed-off-by: Austen Lauria <awlauria@us.ibm.com>

osc/rdma: Don't double lock in ompi_osc_rdma_refresh_dynamic_region().

532cd9e

Prior to entering this, it already grabs an exclusive lock. Trying to acquire a shared lock here will hang. Signed-off-by: Austen Lauria <awlauria@us.ibm.com>

btl/tcp: Clean up some dead code.

9dd1996

The put/get functions can now use the new active messaging protocol provided by btl/base. Signed-off-by: Austen Lauria <awlauria@us.ibm.com>

awlauria force-pushed the rdma_btl_tcp_fixes branch from 48bcb5f to 691037a Compare April 1, 2021 23:36

dalcinl reviewed Apr 2, 2021

View reviewed changes

awlauria added the ⚠️ WIP-DNM! label Apr 2, 2021

awlauria force-pushed the rdma_btl_tcp_fixes branch from 29fd7bd to a755288 Compare April 5, 2021 16:45

Restore mca_btl_self_send() to return 1

d57a944

- Remove self from alternate btls. - Force osc/rdma to take the non btl-atomic path for self. Signed-off-by: Austen Lauria <awlauria@us.ibm.com>

awlauria force-pushed the rdma_btl_tcp_fixes branch from a755288 to d57a944 Compare April 5, 2021 17:08

This was referenced Apr 5, 2021

README.md: trivial change to force mpi4py testing jsquyres/ompi#4

Open

test: Update testsuite to run with MPICH and Open MPI development mpi4py/mpi4py#37

Merged

awlauria mentioned this pull request May 19, 2021

btl/tcp: Don't use mca_btl_tcp_put. #8984

Closed

awlauria closed this May 19, 2021

awlauria deleted the rdma_btl_tcp_fixes branch March 17, 2022 17:24

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

osc/rdma: Fixes when running with btl/tcp #8756

osc/rdma: Fixes when running with btl/tcp #8756

awlauria commented Apr 1, 2021

devreal left a comment

devreal Apr 1, 2021

awlauria Apr 1, 2021

awlauria commented Apr 1, 2021

bosilca left a comment

bosilca Apr 1, 2021

awlauria Apr 1, 2021

awlauria Apr 1, 2021

bosilca Apr 2, 2021

bosilca Apr 2, 2021

hjelmn Apr 2, 2021

awlauria Apr 2, 2021

hjelmn commented Apr 2, 2021

bosilca commented Apr 2, 2021

awlauria commented Apr 2, 2021 •

edited

Loading

awlauria commented Apr 2, 2021

dalcinl Apr 2, 2021

awlauria Apr 2, 2021

bosilca Apr 2, 2021

hjelmn commented Apr 2, 2021

awlauria commented Apr 2, 2021 •

edited

Loading

awlauria commented Apr 2, 2021

hjelmn commented Apr 2, 2021 •

edited

Loading

awlauria commented May 19, 2021

osc/rdma: Fixes when running with btl/tcp #8756

osc/rdma: Fixes when running with btl/tcp #8756

Conversation

awlauria commented Apr 1, 2021

devreal left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

awlauria commented Apr 1, 2021

bosilca left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

hjelmn commented Apr 2, 2021

bosilca commented Apr 2, 2021

awlauria commented Apr 2, 2021 • edited Loading

awlauria commented Apr 2, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

hjelmn commented Apr 2, 2021

awlauria commented Apr 2, 2021 • edited Loading

awlauria commented Apr 2, 2021

hjelmn commented Apr 2, 2021 • edited Loading

awlauria commented May 19, 2021

awlauria commented Apr 2, 2021 •

edited

Loading

awlauria commented Apr 2, 2021 •

edited

Loading

hjelmn commented Apr 2, 2021 •

edited

Loading