[arm64] fix atomic operations opal_atomic_compare_exchange_strong_ #11999
Hello! The Git Commit Checker CI bot found a few problems with this PR: c832a6f: fix atomic operations and spin lock bug in arm64
Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks! |
@yuncliu Thanks for the contribution! Can you add a Signed-off-by line to your commit message? https://docs.open-mpi.org/en/v5.0.x/developers/git-github.html#git-commits-open-source-contributor-s-declaration
c832a6f to f40e3e9
Hello! The Git Commit Checker CI bot found a few problems with this PR: f40e3e9: fix atomic operation and spinlock bug
Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks! |
f40e3e9 to fbb3dce
Thank you for finding these. I agree with most of the changes, but I found some of the changes should be unnecessary given the function names.
Would you be able to try with these modifications and see if you still observe the crash? If so it's possible the code using these atomics might need some changes.
One other comment: could you include arm64 in the commit message? |
We do not need the barrier (or memory ordering requirement) semantics for these atomics. In OMPI we split these in two different operations, atomics only do the atomic update of the value referred to, while the different memory barriers are provided via the Read this stackoverflow for more information. |
Hm. I was trying to confirm what you said. I find that opal_atomic_trylock is implemented as:
I don't see a memory barrier there, and I've looked at where atomic_trylock is used, and I don't see barriers there either. Additionally the comment around exchange_strong_acq_32 from arm64/atomic.h
I read this comment to mean that the exchange_strong_acq should include both the _rmb() and the exchange. This also seems to be a unique comment in the arm64 branch. Conclusion: I think we still need this PR. |
Signed-off-by: liuyuncheng <liuyuncheng@huawei.com>
fbb3dce to 0e9268e
OK, so now that we are down to a more reasonable patch, we need to decide whether the CAS needs strong memory ordering semantics. I am not sure what point @lrbison is trying to make with the discussion on
Locks in OMPI have acquire/release semantics. That is prudent, and removing it would break existing code. All other atomic operations have relaxed semantics. For specializations that carry acquire/release in their names, we should provide the appropriate memory ordering or we open the doors to eternal suffering.
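To illustrate the distinction drawn above, here is a minimal sketch of a spinlock with acquire/release semantics written with C11 `<stdatomic.h>`. It is not the OPAL implementation: the `sketch_` names and the lock type are hypothetical and exist only for this example.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock type for illustration; not the OPAL lock structure. */
typedef struct {
    atomic_int value; /* 0 = unlocked, 1 = locked */
} sketch_lock_t;

static inline void sketch_lock_init(sketch_lock_t *lock)
{
    atomic_init(&lock->value, 0);
}

/* Try to take the lock once. Acquire ordering on success keeps the
 * critical section's loads and stores from moving before the lock
 * is taken. Failure needs no ordering. */
static inline bool sketch_trylock(sketch_lock_t *lock)
{
    int expected = 0;
    return atomic_compare_exchange_strong_explicit(
        &lock->value, &expected, 1,
        memory_order_acquire, memory_order_relaxed);
}

/* Spin until the lock is taken. */
static inline void sketch_lock(sketch_lock_t *lock)
{
    while (!sketch_trylock(lock)) {
        /* busy-wait */
    }
}

/* Release ordering makes every write done inside the critical section
 * visible before another thread can observe the lock as free. */
static inline void sketch_unlock(sketch_lock_t *lock)
{
    atomic_store_explicit(&lock->value, 0, memory_order_release);
}
```

With this split, only the lock and unlock paths pay for ordering; a plain CAS used elsewhere can stay relaxed.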
This patch does not remove them, it adds more. Based on what example do the locks in OMPI have strong memory semantics?
The spinlock implementation uses Release is done using an explicit The functions changed by @yuncliu are not used in the spinlock so I'm not sure about the initial motivation. Maybe the I agree with @bosilca that adding memory ordering to the "normal" CAS operations doesn't seem right. Sorry for the confusion earlier. |
Here is test code that reproduces the bug. When run on arm64, it produces a wrong answer or crashes.
I may have made a mistake: the problem is not the spin lock but exchange_strong_32/64. After I changed those, the problem did not occur again.
That suggests we're missing a memory barrier somewhere. Can you provide more details on where the failure happens? I don't have easy access to an arm system. A stack trace would be useful. |
Only exchange_strong_32/64 needs to change, replacing stxr with stlxr.
I also rarely get time on an arm server. The crash happens rarely, so it may take a long time to capture a crash stack; I'll report it when I get one. The wrong answer happens a lot, though: with 8 threads and 1,000,000 allreduce sum iterations, I may get 0-100 wrong answers.
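For readers following the stxr versus stlxr question, the C11 sketch below shows the same compare-and-swap with three different success orderings. The function names are hypothetical and are not the functions in arm64/atomic.h. As a rough guide, and depending on compiler and flags, an AArch64 compiler that uses load/store-exclusive loops typically lowers the relaxed form to ldxr/stxr, the acquire form to ldaxr/stxr, and the release form to ldxr/stlxr.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for illustration only. */

/* Relaxed CAS: the update is atomic, but it orders nothing around it. */
static inline bool sketch_cas_32(_Atomic int32_t *addr, int32_t *oldval,
                                 int32_t newval)
{
    return atomic_compare_exchange_strong_explicit(
        addr, oldval, newval,
        memory_order_relaxed, memory_order_relaxed);
}

/* Acquire CAS: later accesses cannot be reordered before a successful swap. */
static inline bool sketch_cas_acq_32(_Atomic int32_t *addr, int32_t *oldval,
                                     int32_t newval)
{
    return atomic_compare_exchange_strong_explicit(
        addr, oldval, newval,
        memory_order_acquire, memory_order_relaxed);
}

/* Release CAS: earlier accesses cannot be reordered after a successful swap. */
static inline bool sketch_cas_rel_32(_Atomic int32_t *addr, int32_t *oldval,
                                     int32_t newval)
{
    return atomic_compare_exchange_strong_explicit(
        addr, oldval, newval,
        memory_order_release, memory_order_relaxed);
}
```

On failure, each variant writes the value it observed back into `*oldval`, so a caller can retry without a separate load.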
I think this is the wrong fix for a problem we have somewhere in the codebase.
I see your point now George and Joseph. I'll remove my approval. I was thrown off by the fact that the (relaxed) exchange still had an acquire in the load ( |
I opened #12011 to track this issue. I don't have the resources to track it down myself though. |
I wonder if we can remove the acquire load from the CAS. I don't think it is necessary, but it will need some deeper investigation. @yuncliu we need to understand whether the issue arises from the SM transport or somewhere else.
Single node.
Are they not working for some other reason, or did you hit the same type of threading issue?
@yuncliu thank you again for reporting this issue. I think it may be fixed in #12338. Can we have any further discussion in the issue #12011 rather than this PR? I will close this PR for now, as I don't think it is the correct change. However if you could confirm the additional write memory barrier in smcuda fixed your issue #12011 it would be greatly appreciated! Thank you |
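As a generic illustration of why a missing write memory barrier can produce wrong results on a weakly ordered CPU such as arm64, here is a minimal C11 sketch of publishing data through a flag. It is not the smcuda change from #12338; the slot type and `sketch_` names are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical shared slot: a payload plus a "ready" flag. */
typedef struct {
    int payload;
    atomic_bool ready;
} sketch_slot_t;

static inline void sketch_slot_init(sketch_slot_t *slot)
{
    slot->payload = 0;
    atomic_init(&slot->ready, false);
}

/* Producer: write the payload, then a release fence (the write barrier),
 * then set the flag. Without the fence, another core could observe the
 * flag before the payload write. */
static inline void sketch_publish(sketch_slot_t *slot, int value)
{
    slot->payload = value;
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&slot->ready, true, memory_order_relaxed);
}

/* Consumer: check the flag, then an acquire fence (the read barrier),
 * then read the payload. Returns false if nothing was published yet. */
static inline bool sketch_try_consume(sketch_slot_t *slot, int *out)
{
    if (!atomic_load_explicit(&slot->ready, memory_order_relaxed)) {
        return false;
    }
    atomic_thread_fence(memory_order_acquire);
    *out = slot->payload;
    return true;
}
```

The barriers live in the code that publishes and consumes the data, which leaves the CAS primitives themselves free to stay relaxed.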
On arm64, ldxr and stxr must be used with a memory barrier. Without one, the spinlock does not work on arm64, and a multi-threaded program may crash or produce a wrong result. We have actually seen both a program crash and wrong allreduce sum results when using multiple pthreads, and this modification fixes them.