Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

osc/rdma: Fixes when running with btl/tcp #8756

Closed
wants to merge 7 commits into from

Conversation

awlauria
Copy link
Contributor

@awlauria awlauria commented Apr 1, 2021

No description provided.

Copy link
Contributor

@devreal devreal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good except for the removal of the shared lock in the dynamic window querying, which seems suspicious.

@@ -428,10 +428,6 @@ static int ompi_osc_rdma_refresh_dynamic_region (ompi_osc_rdma_module_t *module,
}
peer->regions = temp;

/* lock the region */
ompi_osc_rdma_lock_acquire_shared (module, &peer->super, 1, offsetof (ompi_osc_rdma_state_t, regions_lock),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you say a word why you removed the lock? AFAICS, the owner of the region takes an exclusive lock in ompi_osc_rdma_attach and the shared lock here is needed to ensure the consistency of the region information.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm seeing a hang in this code path. From what I can see, because this process already has an exclusive lock on this peer, it can't acquire the shared lock (it thinks someone else owns it), and will loop for infinity trying to acquire this lock. To me this looks like a double-lock scenerio. @hjelmn can probably say if that is intended or not, perhaps there is a different fix needed for this case.

@awlauria
Copy link
Contributor Author

awlauria commented Apr 1, 2021

I added an additional commit to fix #8677

Copy link
Member

@bosilca bosilca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand what this PR is trying to do, and neither the commit logs or the PR description provides the necessary clues.

ompi/mca/osc/rdma/osc_rdma_component.c Outdated Show resolved Hide resolved
ompi/mca/osc/rdma/osc_rdma_lock.h Outdated Show resolved Hide resolved
ompi/mca/osc/rdma/osc_rdma_lock.h Outdated Show resolved Hide resolved
@@ -218,7 +218,7 @@ static int mca_btl_self_send(struct mca_btl_base_module_t *btl,
if (btl_ownership) {
mca_btl_self_free(btl, des);
}
return 1;
return OPAL_SUCCESS;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a valid reason to return 1 from this async function. It is to indicate that everything has been taken care of, and the message is already done despite the fact that we expect it to happen in an async way. Without this return the upper level cannot make assumptions about the code in the case where MCA_BTL_DES_SEND_ALWAYS_CALLBACK is not provided.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a byproduct of btl/base now calling btl_send() directly. Here it expects a return code of OPAL_SUCCESS, or else it does not remove the message from the queue. For btl/self this is problematic, and will result in a double free:

https://github.com/open-mpi/ompi/blob/master/opal/mca/btl/base/btl_base_am_rdma.c#L784

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure of the best way to address this, maybe it should always remove it from the queue?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code you pointed to is incorrect, it should not test for OMPI_SUCCESS but should use OPAL_UNLIKELY(rc >= 0) as everywhere else. I think the fix is simple, you can remove the item from the list as soon as the ret is OPAL_UNLIKELY(ret >= 0), as this means either that the message has been already sent, or that there will be a callback once the send completes (but in both cases the BTL will handle the rest of this message lifetime).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hjelmn any comment here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bosilca Yes. We should be checking if the return code is >= 0 not just success. Forgot about that case. 1 means the data is gone.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Making this change back to returning 1 for btl_self_send() seems to open up a whole can of worms for osc/rdma that need to get flushed out. Unfortunately it doesn't seem like a trivial fix - it just exposes other incompatibilities.

opal/mca/btl/tcp/btl_tcp.c Show resolved Hide resolved
@awlauria awlauria force-pushed the rdma_btl_tcp_fixes branch 2 times, most recently from 7b5f0f2 to 48bcb5f Compare April 1, 2021 23:22
awlauria added 6 commits April 1, 2021 19:36
Running with btl/tcp:
- Make sure peer->state_endpoint is set correctly.

Cleanup/leaks:
- Don't parse ompi_osc_rdma_btl_alternate_names twice.
- free temp variable in allocate_state_shared() loop.

Signed-off-by: Austen Lauria <awlauria@us.ibm.com>
- Add it to ompi_osc_rdma_btl_alternate_names.
  This will allow it to be used with the active messaging
  btl/base.

- btl/self: Return OPAL_SUCCESS from mca_btl_self_send().
            btl/base expects it to return OPAL_SUCCESS,
            otherwise it will be incorrectly queued.

Signed-off-by: Austen Lauria <awlauria@us.ibm.com>
Use btl/base active messaging protocols instead.

Signed-off-by: Austen Lauria <awlauria@us.ibm.com>
Prior to entering this, it already grabs an exclusive lock. Trying
to acquire a shared lock here will hang.

Signed-off-by: Austen Lauria <awlauria@us.ibm.com>
The put/get functions can now use the new
active messaging protocol provided by btl/base.

Signed-off-by: Austen Lauria <awlauria@us.ibm.com>
- Make sure the epoch type is set before returning from MPI_Win_start().
- Make sure the group is only free'd if it is valid in MPI_Win_complete().
  - Fix possible double free() of the group.

Signed-off-by: Austen Lauria <awlauria@us.ibm.com>
@awlauria awlauria force-pushed the rdma_btl_tcp_fixes branch from 48bcb5f to 691037a Compare April 1, 2021 23:36
@hjelmn
Copy link
Member

hjelmn commented Apr 2, 2021

btl/self should never be used by osc/rdma

@bosilca
Copy link
Member

bosilca commented Apr 2, 2021

btl/self should never be used by osc/rdma

That was my understanding as well, but if you read the discussion related to the rdma_btl_alternate_names it appears it is necessary in order to allow a process to lock itself.

@awlauria
Copy link
Contributor Author

awlauria commented Apr 2, 2021

It's possible that the wire-up isn't happening right when running tcp across node.

I think what's happening is when running across node, the OMPI_OSC_RDMA_PEER_LOCAL_STATE flag isn't being set on the the 'self' peer, forcing it down the fop/cswap path for lock/unlock and company.

By the comments though - that seems intentional to not set that flag when running across multiple nodes as I did in my example.

@awlauria
Copy link
Contributor Author

awlauria commented Apr 2, 2021

To be clear - running tcp on a single node works for lock/unlock on myself, so:

mpirun --np 2 --mca osc rdma --mca btl self,tcp ./lock

works, but forcing that across node:

mpirun --np 2 --host "hostA:1,hostB:1" --mca osc rdma --mca btl self,tcp ./lock

is where the issues happen. The single node example will go down the ompi_osc_rdma_lock_add() path, knowing itself is on-node. But when running on multiple nodes, it will go down the btl_fop() path on itself and crash since btl_fop() isn't implemented with btl/self right now - hence my change to add it to the alternate btls to give it the active-message path. Intuitively this seems correct since the across node lock/unlock requests will also use the active-message rdma. But perhaps there is a different approach?

@@ -466,17 +466,19 @@ int ompi_osc_rdma_complete_atomic (ompi_win_t *win)
sync->type = OMPI_OSC_RDMA_SYNC_TYPE_NONE;
sync->epoch_active = false;

/* phase 2 cleanup group */
OBJ_RELEASE(group);

peers = sync->peer_list.peers;
if (NULL == peers) {
/* empty peer list */
OPAL_THREAD_UNLOCK(&(module->lock));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would't it be better to release the lock after OBj_RELEASE(group), right before the return statement? Any chance multiple threads would compete to decrement the group's reference count?

Maybe this is totally irrelevant, I do not know this codebase. In that case just ignore my comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that seems reasonable to me. I'll make the change. Thanks

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The user provided group is unnecessary at this point, releasing it as early as possible seems correct.

In fact, the user provided group is only necessary early on while building the peers array. As we refcount each peer proc explicitly, technically we don't even need to keep a pointer to the group, because this leads to double refcounting the peers. From a performance perspective, I would rather keep and refcount only the group than refcounting each individual procs (because the procs are already refcounted by the group). We could save 2*peers atomic operations per epoch.

@hjelmn
Copy link
Member

hjelmn commented Apr 2, 2021

btl/self should never be used by osc/rdma

That was my understanding as well, but if you read the discussion related to the rdma_btl_alternate_names it appears it is necessary in order to allow a process to lock itself.

Yes. But all we should be setting a flag that local and btl atomics mix in this situation. osc/rdma should then never call the btl atomic for any local peer. It seems the problem is that it is not setting the local state as it should in this case. I can take a look a bit later and see why. Some of this code is quite complicated and due for a refactor.

@awlauria
Copy link
Contributor Author

awlauria commented Apr 2, 2021

@hjelmn @bosilca I added a commit that seems to work. This avoids the btl atomics path for self, and restores mca-btl_self_send() to return 1 as before.

I did run into a problem where returning 1 from fop/cswap would try to do the callback twice, that might need more investigation and require a different fix. but the hack works.

if this is fine I can rebase and squash it down.

@awlauria
Copy link
Contributor Author

awlauria commented Apr 2, 2021

scratch that - still seeing some issues. I'll investigate later.

@hjelmn
Copy link
Member

hjelmn commented Apr 2, 2021

So, alternate BTL -> always local state for self and local peers. That is because the alternate BTLs set the MCA_BTL_ATOMIC_SUPPORTS_GLOB flag. It really sounds like osc/rdma has an error in that logic.

@awlauria awlauria force-pushed the rdma_btl_tcp_fixes branch from 29fd7bd to a755288 Compare April 5, 2021 16:45
- Remove self from alternate btls.

- Force osc/rdma to take the non btl-atomic path for self.

Signed-off-by: Austen Lauria <awlauria@us.ibm.com>
@awlauria
Copy link
Contributor Author

Closing this as it is unlikely to be merged as is.

@awlauria awlauria closed this May 19, 2021
@awlauria awlauria deleted the rdma_btl_tcp_fixes branch March 17, 2022 17:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants