Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[v4.1.x] btl/smcuda: Add atomic_wmb() before sm_fifo_write #12344

Merged
merged 1 commit into from
Feb 20, 2024

Conversation

lrbison
Copy link
Contributor

@lrbison lrbison commented Feb 15, 2024

cherry-pick #12338 to fix #12270 on v4.1.x

This change fixes open-mpi#12270

Testing on c7g instance type (arm64) confirms this change elminates
hangs and crashes that were previously observed in 1 in 30 runs of
IMB alltoall benchmark.  Tested with over 300 runs and no failures.

The write memory barrier prevents other CPUs from observing the fifo
get updated before they observe the updated contents of the header
itself.  Without the barrier, uninitialized header contents caused
the crashes and invalid data.

Signed-off-by: Luke Robison <lrbison@amazon.com>
(cherry picked from commit 71f378d)
Copy link
Contributor

@wenduwan wenduwan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix merged in main and v5.0.x.

@jsquyres
Copy link
Member

Does this PR replace #12005? Or is this in addition to #12005?

@wenduwan
Copy link
Contributor

This PR is not directly related to #12005 - it addresses #12270.

@jsquyres
Copy link
Member

This PR is not directly related to #12005 - it addresses #12270.

Just to be clear: does that mean we need both this PR and #12005?

@wenduwan
Copy link
Contributor

Just to be clear: does that mean we need both this PR and #12005?

Nope this PR goes alone.

@jsquyres jsquyres merged commit 16721dd into open-mpi:v4.1.x Feb 20, 2024
9 checks passed
@lrbison lrbison deleted the smcuda_wmb_41x branch February 21, 2024 04:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants