btl/smcuda: Add atomic_wmb() before sm_fifo_write
This change fixes open-mpi#12270

Testing on a c7g instance type (arm64) confirms this change eliminates
the hangs and crashes that were previously observed in 1 in 30 runs of
the IMB alltoall benchmark.  Tested with over 300 runs and no failures.

The write memory barrier prevents other CPUs from observing the fifo
update before they observe the updated contents of the header
itself.  Without the barrier, a peer could read uninitialized header
contents, which caused the crashes and invalid data.

Signed-off-by: Luke Robison <lrbison@amazon.com>
(cherry picked from commit 71f378d)
lrbison committed Feb 15, 2024
1 parent a641879 commit b6bfd8e
Showing 1 changed file with 2 additions and 0 deletions.
2 changes: 2 additions & 0 deletions opal/mca/btl/smcuda/btl_smcuda_fifo.h
@@ -86,6 +86,8 @@ add_pending(struct mca_btl_base_endpoint_t *ep, void *data, bool resend)
#define MCA_BTL_SMCUDA_FIFO_WRITE(endpoint_peer, my_smp_rank, \
peer_smp_rank, hdr, resend, retry_pending_sends, rc) \
do { \
/* memory barrier: ensure writes to the hdr have completed */ \
opal_atomic_wmb(); \
sm_fifo_t* fifo = &(mca_btl_smcuda_component.fifo[peer_smp_rank][FIFO_MAP(my_smp_rank)]); \
\
if ( retry_pending_sends ) { \
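
The ordering requirement described in the commit message can be sketched with standard C11 atomics. This is an illustration only, not Open MPI code: `demo_hdr_t`, `demo_slot`, `demo_publish`, and `demo_poll` are hypothetical names, the one-entry slot stands in for the real fifo, and the C11 release fence is used here as an analogue of a write barrier rather than as a statement of how `opal_atomic_wmb()` is implemented.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical header type standing in for the real fragment header. */
typedef struct {
    int tag;
    int len;
} demo_hdr_t;

/* Hypothetical one-entry "fifo": NULL means the slot is empty. */
static _Atomic(demo_hdr_t *) demo_slot;

/* Writer side: fill in the header, then publish its address. */
static void demo_publish(demo_hdr_t *hdr, int tag, int len)
{
    hdr->tag = tag;
    hdr->len = len;

    /* Analogue of a write barrier: the header stores above must become
     * visible before the pointer store below. */
    atomic_thread_fence(memory_order_release);

    atomic_store_explicit(&demo_slot, hdr, memory_order_relaxed);
}

/* Reader side: returns NULL while the slot is empty. */
static demo_hdr_t *demo_poll(void)
{
    demo_hdr_t *hdr = atomic_load_explicit(&demo_slot, memory_order_relaxed);

    if (hdr != NULL) {
        /* Pairs with the writer's fence, so the header fields read after
         * this point are the ones written before the pointer was stored. */
        atomic_thread_fence(memory_order_acquire);
    }
    return hdr;
}
```

In a real exchange `demo_publish` and `demo_poll` would run on different CPUs. If the release fence were dropped, a weakly ordered CPU such as arm64 could make the pointer store visible before the header stores, and the reader could see a non-NULL pointer to a header whose fields were not yet written.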
