btl/smcuda: Add atomic_wmb() before sm_fifo_write
This change fixes open-mpi#12270

Testing on a c7g instance (arm64) confirms that this change eliminates
the hangs and crashes previously observed in 1 in 30 runs of the
IMB alltoall benchmark.  Tested with over 300 runs and no failures.

The write memory barrier prevents other CPUs from observing the fifo
update before they observe the updated contents of the header
itself.  Without the barrier, a peer could read uninitialized header
contents, which caused the crashes and invalid data.
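
The ordering described above can be sketched in portable C11.  This is
a minimal illustration, not the smcuda code: demo_hdr_t, demo_slot,
demo_publish() and demo_poll() are hypothetical names, and
atomic_thread_fence(memory_order_release) only stands in for the role
that opal_atomic_wmb() plays in the macro.  The publisher and the
poller are assumed to run on different CPUs.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a fragment header; not the real smcuda type. */
typedef struct {
    int tag;
    int len;
} demo_hdr_t;

static demo_hdr_t demo_hdr;              /* header contents filled by the sender */
static _Atomic(demo_hdr_t *) demo_slot;  /* one-entry "fifo" slot, NULL when empty */

/* Sender side: fill in the header, then publish its address. */
static void demo_publish(int tag, int len)
{
    demo_hdr.tag = tag;
    demo_hdr.len = len;

    /* Write barrier: the header stores above must become visible
     * before the slot store below (assumed analogue of opal_atomic_wmb()). */
    atomic_thread_fence(memory_order_release);

    atomic_store_explicit(&demo_slot, &demo_hdr, memory_order_relaxed);
}

/* Receiver side: returns the published header, or NULL if the slot is empty. */
static demo_hdr_t *demo_poll(void)
{
    demo_hdr_t *hdr = atomic_load_explicit(&demo_slot, memory_order_relaxed);

    if (hdr != NULL) {
        /* Pairs with the release fence: header reads that follow
         * cannot be satisfied with stale, pre-publication values. */
        atomic_thread_fence(memory_order_acquire);
    }
    return hdr;
}
```

Without the release fence, a weakly ordered CPU such as arm64 may make
the slot store visible first, so a poller on another core could
dereference the pointer and read header fields that have not been
written yet.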

Signed-off-by: Luke Robison <lrbison@amazon.com>
(cherry picked from commit 71f378d)
lrbison committed Feb 15, 2024
1 parent 8b4e629 commit 0ebea59
Showing 1 changed file with 2 additions and 0 deletions.
2 changes: 2 additions & 0 deletions opal/mca/btl/smcuda/btl_smcuda_fifo.h
@@ -85,6 +85,8 @@ static void add_pending(struct mca_btl_base_endpoint_t *ep, void *data, bool res
#define MCA_BTL_SMCUDA_FIFO_WRITE(endpoint_peer, my_smp_rank, peer_smp_rank, hdr, resend, \
retry_pending_sends, rc) \
do { \
+ /* memory barrier: ensure writes to the hdr have completed */ \
+ opal_atomic_wmb(); \
sm_fifo_t *fifo = &(mca_btl_smcuda_component.fifo[peer_smp_rank][FIFO_MAP(my_smp_rank)]); \
\
if (retry_pending_sends) { \