-
Notifications
You must be signed in to change notification settings - Fork 871
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
osc/rdma: Fixes when running with btl/tcp #8756
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good except for the removal of the shared lock in the dynamic window querying, which seems suspicious.
@@ -428,10 +428,6 @@ static int ompi_osc_rdma_refresh_dynamic_region (ompi_osc_rdma_module_t *module, | |||
} | |||
peer->regions = temp; | |||
|
|||
/* lock the region */ | |||
ompi_osc_rdma_lock_acquire_shared (module, &peer->super, 1, offsetof (ompi_osc_rdma_state_t, regions_lock), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you say a word why you removed the lock? AFAICS, the owner of the region takes an exclusive lock in ompi_osc_rdma_attach
and the shared lock here is needed to ensure the consistency of the region information.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm seeing a hang in this code path. From what I can see, because this process already has an exclusive lock on this peer, it can't acquire the shared lock (it thinks someone else owns it), and will loop for infinity trying to acquire this lock. To me this looks like a double-lock scenerio. @hjelmn can probably say if that is intended or not, perhaps there is a different fix needed for this case.
I added an additional commit to fix #8677 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand what this PR is trying to do, and neither the commit logs or the PR description provides the necessary clues.
opal/mca/btl/self/btl_self.c
Outdated
@@ -218,7 +218,7 @@ static int mca_btl_self_send(struct mca_btl_base_module_t *btl, | |||
if (btl_ownership) { | |||
mca_btl_self_free(btl, des); | |||
} | |||
return 1; | |||
return OPAL_SUCCESS; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a valid reason to return 1 from this async function. It is to indicate that everything has been taken care of, and the message is already done despite the fact that we expect it to happen in an async way. Without this return the upper level cannot make assumptions about the code in the case where MCA_BTL_DES_SEND_ALWAYS_CALLBACK is not provided.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a byproduct of btl/base now calling btl_send() directly. Here it expects a return code of OPAL_SUCCESS, or else it does not remove the message from the queue. For btl/self this is problematic, and will result in a double free:
https://github.com/open-mpi/ompi/blob/master/opal/mca/btl/base/btl_base_am_rdma.c#L784
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure of the best way to address this, maybe it should always remove it from the queue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code you pointed to is incorrect, it should not test for OMPI_SUCCESS but should use OPAL_UNLIKELY(rc >= 0)
as everywhere else. I think the fix is simple, you can remove the item from the list as soon as the ret is OPAL_UNLIKELY(ret >= 0)
, as this means either that the message has been already sent, or that there will be a callback once the send completes (but in both cases the BTL will handle the rest of this message lifetime).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hjelmn any comment here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bosilca Yes. We should be checking if the return code is >= 0 not just success. Forgot about that case. 1 means the data is gone.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Making this change back to returning 1 for btl_self_send() seems to open up a whole can of worms for osc/rdma that need to get flushed out. Unfortunately it doesn't seem like a trivial fix - it just exposes other incompatibilities.
7b5f0f2
to
48bcb5f
Compare
Running with btl/tcp: - Make sure peer->state_endpoint is set correctly. Cleanup/leaks: - Don't parse ompi_osc_rdma_btl_alternate_names twice. - free temp variable in allocate_state_shared() loop. Signed-off-by: Austen Lauria <awlauria@us.ibm.com>
- Add it to ompi_osc_rdma_btl_alternate_names. This will allow it to be used with the active messaging btl/base. - btl/self: Return OPAL_SUCCESS from mca_btl_self_send(). btl/base expects it to return OPAL_SUCCESS, otherwise it will be incorrectly queued. Signed-off-by: Austen Lauria <awlauria@us.ibm.com>
Use btl/base active messaging protocols instead. Signed-off-by: Austen Lauria <awlauria@us.ibm.com>
Prior to entering this, it already grabs an exclusive lock. Trying to acquire a shared lock here will hang. Signed-off-by: Austen Lauria <awlauria@us.ibm.com>
The put/get functions can now use the new active messaging protocol provided by btl/base. Signed-off-by: Austen Lauria <awlauria@us.ibm.com>
- Make sure the epoch type is set before returning from MPI_Win_start(). - Make sure the group is only free'd if it is valid in MPI_Win_complete(). - Fix possible double free() of the group. Signed-off-by: Austen Lauria <awlauria@us.ibm.com>
48bcb5f
to
691037a
Compare
btl/self should never be used by osc/rdma |
That was my understanding as well, but if you read the discussion related to the rdma_btl_alternate_names it appears it is necessary in order to allow a process to lock itself. |
It's possible that the wire-up isn't happening right when running tcp across node. I think what's happening is when running across node, the By the comments though - that seems intentional to not set that flag when running across multiple nodes as I did in my example. |
To be clear - running tcp on a single node works for lock/unlock on myself, so:
works, but forcing that across node:
is where the issues happen. The single node example will go down the |
@@ -466,17 +466,19 @@ int ompi_osc_rdma_complete_atomic (ompi_win_t *win) | |||
sync->type = OMPI_OSC_RDMA_SYNC_TYPE_NONE; | |||
sync->epoch_active = false; | |||
|
|||
/* phase 2 cleanup group */ | |||
OBJ_RELEASE(group); | |||
|
|||
peers = sync->peer_list.peers; | |||
if (NULL == peers) { | |||
/* empty peer list */ | |||
OPAL_THREAD_UNLOCK(&(module->lock)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would't it be better to release the lock after OBj_RELEASE(group)
, right before the return statement? Any chance multiple threads would compete to decrement the group's reference count?
Maybe this is totally irrelevant, I do not know this codebase. In that case just ignore my comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that seems reasonable to me. I'll make the change. Thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The user provided group is unnecessary at this point, releasing it as early as possible seems correct.
In fact, the user provided group is only necessary early on while building the peers array. As we refcount each peer proc explicitly, technically we don't even need to keep a pointer to the group, because this leads to double refcounting the peers. From a performance perspective, I would rather keep and refcount only the group than refcounting each individual procs (because the procs are already refcounted by the group). We could save 2*peers atomic operations per epoch.
Yes. But all we should be setting a flag that local and btl atomics mix in this situation. osc/rdma should then never call the btl atomic for any local peer. It seems the problem is that it is not setting the local state as it should in this case. I can take a look a bit later and see why. Some of this code is quite complicated and due for a refactor. |
@hjelmn @bosilca I added a commit that seems to work. This avoids the btl atomics path for self, and restores mca-btl_self_send() to return 1 as before. I did run into a problem where returning 1 from fop/cswap would try to do the callback twice, that might need more investigation and require a different fix. but the hack works. if this is fine I can rebase and squash it down. |
scratch that - still seeing some issues. I'll investigate later. |
So, alternate BTL -> always local state for self and local peers. That is because the alternate BTLs set the MCA_BTL_ATOMIC_SUPPORTS_GLOB flag. It really sounds like osc/rdma has an error in that logic. |
29fd7bd
to
a755288
Compare
- Remove self from alternate btls. - Force osc/rdma to take the non btl-atomic path for self. Signed-off-by: Austen Lauria <awlauria@us.ibm.com>
a755288
to
d57a944
Compare
Closing this as it is unlikely to be merged as is. |
No description provided.