
[arm64] fix atomic operations opal_atomic_compare_exchange_strong_ #11999

Closed
wants to merge 1 commit

Conversation


@yuncliu yuncliu commented Oct 17, 2023

On arm64, ldxr and stxr must be paired with a memory barrier. Without one the spinlock does not work on arm64, and a multi-threaded program may crash or produce wrong results. We in fact observed both crashes and wrong allreduce-sum results when using multiple pthreads, and this modification fixes them.
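
For context, here is a minimal sketch (hypothetical, not the OMPI source) of a 32-bit LL/SC compare-and-swap on aarch64. ldxr/stxr by themselves impose no ordering; ldaxr adds acquire semantics to the load and stlxr adds release semantics to the store, which is the distinction at issue in this patch:

#include <stdint.h>
#include <stdbool.h>

/* Illustration only: a relaxed CAS built from ldxr/stxr. Swapping ldxr
   for ldaxr would make the load an acquire; swapping stxr for stlxr
   would make the store a release. */
static inline bool cas32_relaxed(int32_t *ptr, int32_t expected, int32_t desired)
{
    int32_t old;
    int fail;
    __asm__ __volatile__(
        "1: ldxr  %w0, [%2]      \n" /* load-exclusive, no ordering    */
        "   cmp   %w0, %w3       \n"
        "   b.ne  2f             \n" /* value mismatch: give up        */
        "   stxr  %w1, %w4, [%2] \n" /* store-exclusive, no ordering   */
        "   cbnz  %w1, 1b        \n" /* exclusive store failed: retry  */
        "2:"
        : "=&r"(old), "=&r"(fail)
        : "r"(ptr), "r"(expected), "r"(desired)
        : "cc", "memory");
    return old == expected;
}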

@github-actions

Hello! The Git Commit Checker CI bot found a few problems with this PR:

c832a6f: fix atomic operations and spin lock bug in arm64

  • check_signed_off: does not contain a valid Signed-off-by line

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

2 similar comments

@jsquyres
Member

@yuncliu Thanks for the contribution! Can you add a Signed-off-by line to your commit message? https://docs.open-mpi.org/en/v5.0.x/developers/git-github.html#git-commits-open-source-contributor-s-declaration
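(For a single-commit branch, one common way is git commit --amend -s followed by a force-push of the branch.)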

@github-actions

Hello! The Git Commit Checker CI bot found a few problems with this PR:

f40e3e9: fix atomic operation and spinlock bug

  • check_signed_off: does not contain a valid Signed-off-by line

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

Contributor

@lrbison lrbison left a comment

Thank you for finding these. I agree with most of the changes, but I found that some of them should be unnecessary given the function names.

Would you be able to try with these modifications and see if you still observe the crash? If so, it's possible the code using these atomics might need some changes.

4 review comments on opal/include/opal/sys/arm64/atomic.h (outdated; resolved)
@lrbison
Contributor

lrbison commented Oct 18, 2023

One other comment: could you include arm64 in the commit message?

@bosilca
Member

bosilca commented Oct 18, 2023

We do not need the barrier (or memory-ordering) semantics for these atomics. In OMPI we split these into two different operations: the atomics only do the atomic update of the value referred to, while the memory barriers are provided separately via opal_atomic_[r|w|]mb.

Read this Stack Overflow post for more information.
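
To make that split concrete, a rough sketch using GCC builtins (an illustration, not the actual OPAL implementations): the update itself is relaxed, and a caller that needs ordering pairs it with a separate barrier call.

#include <stdint.h>
#include <stdbool.h>

/* Sketch: relaxed strong CAS plus an explicit, separate acquire fence,
   standing in for opal_atomic_compare_exchange_strong_32 followed by
   opal_atomic_rmb(). */
static bool cas_then_rmb(int32_t *addr, int32_t *expected, int32_t desired)
{
    bool ok = __atomic_compare_exchange_n(addr, expected, desired,
                                          false, /* strong */
                                          __ATOMIC_RELAXED, __ATOMIC_RELAXED);
    __atomic_thread_fence(__ATOMIC_ACQUIRE); /* opal_atomic_rmb() analogue */
    return ok;
}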

@lrbison
Contributor

lrbison commented Oct 19, 2023

@bosilca

Hm. I was trying to confirm what you said. I find that opal_atomic_trylock is implemented as:

static inline int opal_atomic_trylock(opal_atomic_lock_t *lock)
{
    int32_t unlocked = OPAL_ATOMIC_LOCK_UNLOCKED;
    bool ret = opal_atomic_compare_exchange_strong_acq_32(lock, &unlocked,
                                                          OPAL_ATOMIC_LOCK_LOCKED);
    return (ret == false) ? 1 : 0;
}

I don't see a memory barrier there, and I've looked at where atomic_trylock is used, and I don't see barriers there either. Additionally, the comment around exchange_strong_acq_32 from arm64/atomic.h:

/* these two functions aren't inlined in the non-gcc case because then
   there would be two function calls (since neither cmpset_32 nor
   atomic_?mb can be inlined).  Instead, we "inline" them by hand in
   the assembly, meaning there is one function call overhead instead
   of two */

I read this comment to mean that exchange_strong_acq should include both the _rmb() and the exchange. This comment also seems to be unique to the arm64 branch.
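
Put concretely (a hypothetical sketch using GCC builtins, not the OPAL source), under this reading the acq variant should itself carry acquire ordering on success, with no separate opal_atomic_rmb() call left to the caller:

#include <stdint.h>
#include <stdbool.h>

/* Sketch: what exchange_strong_acq_32 is expected to provide, in contrast
   to the relaxed CAS plus separate fence shown earlier. */
static inline bool cas_acq_32(int32_t *addr, int32_t *oldval, int32_t newval)
{
    return __atomic_compare_exchange_n(addr, oldval, newval,
                                       false, /* strong */
                                       __ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
}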

Conclusion: I think we still need this PR.

Signed-off-by: liuyuncheng <liuyuncheng@huawei.com>
@yuncliu yuncliu changed the title fix atomic operations and spin lock bug in arm64 [arm64 ]fix atomic operations opal_atomic_compare_exchange_strong_ Oct 20, 2023
@yuncliu yuncliu changed the title [arm64 ]fix atomic operations opal_atomic_compare_exchange_strong_ [arm64] fix atomic operations opal_atomic_compare_exchange_strong_ Oct 20, 2023
@bosilca
Member

bosilca commented Oct 20, 2023

OK, so now that we are down to a more reasonable patch, we need to decide whether or not the CAS needs strong memory-ordering semantics.

I am not sure what point @lrbison is trying to make with the discussion of opal_atomic_trylock, or why there would be a need for memory ordering around trylock. The upper layers, such as the SM BTL or the OB1 PML, shall add the required memory barriers if needed, not the intermediary layers such as atomic_lock. Moreover, what is the case in which this particular change is needed? How did OMPI on Fugaku work without this change? What about PPC?

@devreal
Contributor

devreal commented Oct 20, 2023

Locks in OMPI have acquire/release semantics. That is prudent, and removing it would break existing code. All other atomic operations have relaxed semantics. For specializations that carry acquire/release in their names we should provide the appropriate memory ordering, or we open the doors to eternal suffering.

@bosilca
Member

bosilca commented Oct 20, 2023

This patch does not remove them; it adds more. Based on what example do the locks in OMPI have strong memory semantics?

@devreal
Contributor

devreal commented Oct 20, 2023

The spinlock implementation uses opal_atomic_compare_exchange_strong_acq_32: https://github.com/open-mpi/ompi/blob/main/opal/include/opal/sys/atomic_impl_spinlock.h#L38

Release is done using an explicit opal_atomic_wmb(): https://github.com/open-mpi/ompi/blob/main/opal/include/opal/sys/atomic_impl_spinlock.h#L54

The functions changed by @yuncliu are not used in the spinlock, so I'm not sure about the initial motivation. Maybe opal_atomic_compare_exchange_strong_acq_[32|64] needs to be checked?

I agree with @bosilca that adding memory ordering to the "normal" CAS operations doesn't seem right. Sorry for the confusion earlier.
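
Paraphrasing the pattern at those two links (a sketch assuming an int32_t lock word, not the OPAL source): acquire is folded into the locking CAS, while unlock is an explicit write barrier followed by a plain store.

#include <stdint.h>

/* Sketch of the described spinlock pattern; names are made up. */
static void spinlock_lock(int32_t *lock)
{
    int32_t unlocked;
    do {
        unlocked = 0; /* OPAL_ATOMIC_LOCK_UNLOCKED; reset after failed CAS */
    } while (!__atomic_compare_exchange_n(lock, &unlocked, 1, /* LOCKED */
                                          false, /* strong */
                                          __ATOMIC_ACQUIRE, __ATOMIC_RELAXED));
}

static void spinlock_unlock(int32_t *lock)
{
    __atomic_thread_fence(__ATOMIC_RELEASE); /* opal_atomic_wmb() analogue */
    __atomic_store_n(lock, 0, __ATOMIC_RELAXED);
}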

@yuncliu
Author

yuncliu commented Oct 20, 2023

Here is test code that reproduces the bug. When run on arm64, it gets wrong answers or crashes.

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <sys/types.h>
#include "mpi.h"

#define MAX_THREADS (20)

int g_rankSize = 0;
int g_rank = 0;
MPI_Comm g_comm[MAX_THREADS];

/* Each thread runs 1,000,000 allreduce sums of 1 on its own communicator
   and counts results that differ from the communicator size. */
void *mpi_thread(void* p)
{
    int id = *(int*)p;
    free(p);
    int i;
    int count = 0;
    for (i = 0; i < 1000000; ++i) {
        int s = 1;
        int r = 0;
        MPI_Allreduce(&s, &r, 1, MPI_INT, MPI_SUM, g_comm[id]);
        if (r != g_rankSize) {
            count++;
        }
    }
    printf("rank %d id %d error count = %d\n", g_rank, id, count);
    return NULL;
}

int main(int argc, char** argv)
{
    int mpi_threads_provided;
    int req = MPI_THREAD_MULTIPLE;
    pthread_t threads[MAX_THREADS];
    const int threadNum = 10;
    int64_t i;


    MPI_Init_thread(&argc, &argv, req, &mpi_threads_provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &g_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &g_rankSize);

    MPI_Group worldGroup;
    MPI_Comm_group(MPI_COMM_WORLD, &worldGroup);
    for (i = 0; i < threadNum; ++i) {
        MPI_Comm_create(MPI_COMM_WORLD, worldGroup, &g_comm[i]);
    }

    for (i = 0; i < threadNum; ++i) {
        int *p = (int*)malloc(sizeof(int));
        *p = (int)i;
        pthread_create(&threads[i], NULL, mpi_thread, (void*)p);
    }

    for (i = 0; i < threadNum; ++i) {
        pthread_join(threads[i], NULL);
    }
    MPI_Finalize();
    return 0;
}
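
For reference, a reproducer along these lines would be built and run with something like mpicc repro.c -o repro -lpthread and mpirun -np 2 ./repro (file and binary names arbitrary); on a correct build every thread should print an error count of 0.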

@yuncliu
Author

yuncliu commented Oct 20, 2023

The spinlock implementation uses opal_atomic_compare_exchange_strong_acq_32: https://github.com/open-mpi/ompi/blob/main/opal/include/opal/sys/atomic_impl_spinlock.h#L38

Release is done using an explicit opal_atomic_wmb(): https://github.com/open-mpi/ompi/blob/main/opal/include/opal/sys/atomic_impl_spinlock.h#L54

The functions changed by @yuncliu are not used in the spinlock, so I'm not sure about the initial motivation. Maybe opal_atomic_compare_exchange_strong_acq_[32|64] needs to be checked?

I agree with @bosilca that adding memory ordering to the "normal" CAS operations doesn't seem right. Sorry for the confusion earlier.

I may have made a mistake earlier: the problem is not the spinlock but the exchange_strong_32/64. I changed them and the problem did not come back.

@devreal
Contributor

devreal commented Oct 20, 2023

That suggests we're missing a memory barrier somewhere. Can you provide more details on where the failure happens? I don't have easy access to an arm system. A stack trace would be useful.

@yuncliu yuncliu requested a review from lrbison October 20, 2023 15:08
Author

@yuncliu yuncliu left a comment

Only the exchange_strong_32/64 need the stxr changed to stlxr.

@yuncliu
Author

yuncliu commented Oct 20, 2023

That suggests we're missing a memory barrier somewhere. Can you provide more details on where the failure happens? I don't have easy access to an arm system. A stack trace would be useful.

I also don't have many chances to get time on an arm server. The crash happens rarely, so it may take a long time to capture a crash stack; I'll report it when I get one. But the wrong answers happen a lot: 8 threads each doing 1,000,000 allreduce sums may see 0-100 wrong answers.

Contributor

@devreal devreal left a comment

I think this is the wrong fix for a problem we have somewhere in the codebase.

@lrbison
Contributor

lrbison commented Oct 20, 2023

I see your point now, George and Joseph. I'll remove my approval.

I was thrown off by the fact that the (relaxed) exchange still had an acquire in the load (ldaxr), so I had assumed it should be sequential.

@lrbison lrbison self-requested a review October 20, 2023 15:54
@devreal
Contributor

devreal commented Oct 20, 2023

I opened #12011 to track this issue. I don't have the resources to track it down myself though.

@bosilca
Member

bosilca commented Oct 21, 2023

I was thrown off by the fact that the (relaxed) exchange still had an acquire in the load (ldaxr), so I had assumed it should be sequential.

I wonder if we can remove the acquire load from the CAS. I don't think it is necessary, but it will need some deeper investigation.

@yuncliu we need to understand whether the issue arises from the SM transport or somewhere else.

  1. Are you using UCX or OB1? You can run mpirun --mca pml ob1 to force the switch to our own communication library.
  2. Single-node or multi-node runs? Let's force TCP everywhere to see if we can replicate. Run mpirun --mca pml ob1 --mca btl tcp,self.

@yuncliu
Author

yuncliu commented Oct 23, 2023

I was thrown off by the fact that the (relaxed) exchange still had an acquire in the load (ldaxr), so I had assumed it should be sequential.

I wonder if we can remove the acquire load from the CAS. I don't think it is necessary, but it will need some deeper investigation.

@yuncliu we need to understand whether the issue arises from the SM transport or somewhere else.

  1. Are you using UCX or OB1? You can run mpirun --mca pml ob1 to force the switch to our own communication library.

--mca pml ob1 did not work.

  1. Single-node or multi-node runs? Let's force TCP everywhere to see if we can replicate. Run mpirun --mca pml ob1 --mca btl tcp,self.

Single node. mpirun --mca pml ob1 --mca btl tcp,self also did not work. My hardware is a server with 192 arm64 cores and 4 NUMA nodes.

@bosilca
Member

bosilca commented Oct 23, 2023

Are they not working for some other reason, or did you hit the same type of threading issue?

@lrbison
Contributor

lrbison commented Feb 21, 2024

@yuncliu thank you again for reporting this issue. I think it may be fixed in #12338. Can we move any further discussion to issue #12011 rather than this PR?

I will close this PR for now, as I don't think it is the correct change. However, if you could confirm in #12011 that the additional write memory barrier in smcuda fixed your issue, it would be greatly appreciated!

Thank you

@lrbison lrbison closed this Feb 21, 2024