
[arm64] fix atomic operations opal_atomic_compare_exchange_strong_ #11999

Closed
wants to merge 1 commit

Conversation


@yuncliu yuncliu commented Oct 17, 2023

On arm64, ldxr and stxr must be paired with a memory barrier. Without one the spinlock does not work on arm64, and a multi-threaded program may crash or produce wrong results. We in fact observed both crashes and wrong allreduce-sum results when using multiple pthreads, and this modification fixes them.
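
For context, here is a minimal sketch (hypothetical, not the OMPI source) of a 32-bit LL/SC compare-and-swap on aarch64. ldxr/stxr by themselves impose no ordering; ldaxr adds acquire semantics to the load and stlxr adds release semantics to the store, which is the distinction at issue in this patch:

#include <stdint.h>
#include <stdbool.h>

/* Illustration only: a relaxed CAS built from ldxr/stxr. Swapping ldxr
   for ldaxr would make the load an acquire; swapping stxr for stlxr
   would make the store a release. */
static inline bool cas32_relaxed(int32_t *ptr, int32_t expected, int32_t desired)
{
    int32_t old;
    int fail;
    __asm__ __volatile__(
        "1: ldxr  %w0, [%2]      \n" /* load-exclusive, no ordering    */
        "   cmp   %w0, %w3       \n"
        "   b.ne  2f             \n" /* value mismatch: give up        */
        "   stxr  %w1, %w4, [%2] \n" /* store-exclusive, no ordering   */
        "   cbnz  %w1, 1b        \n" /* exclusive store failed: retry  */
        "2:"
        : "=&r"(old), "=&r"(fail)
        : "r"(ptr), "r"(expected), "r"(desired)
        : "cc", "memory");
    return old == expected;
}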

@github-actions

Hello! The Git Commit Checker CI bot found a few problems with this PR:

c832a6f: fix atomic operations and spin lock bug in arm64

  • check_signed_off: does not contain a valid Signed-off-by line

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

2 similar comments

@jsquyres
Member

@yuncliu Thanks for the contribution! Can you add a Signed-off-by line to your commit message? https://docs.open-mpi.org/en/v5.0.x/developers/git-github.html#git-commits-open-source-contributor-s-declaration
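(For a single-commit branch, one common way is git commit --amend -s followed by a force-push of the branch.)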

@github-actions

Hello! The Git Commit Checker CI bot found a few problems with this PR:

f40e3e9: fix atomic operation and spinlock bug

  • check_signed_off: does not contain a valid Signed-off-by line

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

Contributor

@lrbison lrbison left a comment

Thank you for finding these. I agree with most of the changes, but I found that some of them should be unnecessary given the function names.

Would you be able to try with these modifications and see if you still observe the crash? If so, it's possible the code using these atomics might need some changes.

4 review comments on opal/include/opal/sys/arm64/atomic.h (outdated; resolved)
@lrbison
Contributor

lrbison commented Oct 18, 2023

One other comment: could you include arm64 in the commit message?

@bosilca
Member

bosilca commented Oct 18, 2023

We do not need the barrier (or memory-ordering) semantics for these atomics. In OMPI we split these into two different operations: the atomics only do the atomic update of the value referred to, while the memory barriers are provided separately via opal_atomic_[r|w|]mb.

Read this Stack Overflow post for more information.
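
To make that split concrete, a rough sketch using GCC builtins (an illustration, not the actual OPAL implementations): the update itself is relaxed, and a caller that needs ordering pairs it with a separate barrier call.

#include <stdint.h>
#include <stdbool.h>

/* Sketch: relaxed strong CAS plus an explicit, separate acquire fence,
   standing in for opal_atomic_compare_exchange_strong_32 followed by
   opal_atomic_rmb(). */
static bool cas_then_rmb(int32_t *addr, int32_t *expected, int32_t desired)
{
    bool ok = __atomic_compare_exchange_n(addr, expected, desired,
                                          false, /* strong */
                                          __ATOMIC_RELAXED, __ATOMIC_RELAXED);
    __atomic_thread_fence(__ATOMIC_ACQUIRE); /* opal_atomic_rmb() analogue */
    return ok;
}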

@lrbison
Contributor

lrbison commented Oct 19, 2023

@bosilca

Hm. I was trying to confirm what you said. I find that opal_atomic_trylock is implemented as:

static inline int opal_atomic_trylock(opal_atomic_lock_t *lock)
{
    int32_t unlocked = OPAL_ATOMIC_LOCK_UNLOCKED;
    bool ret = opal_atomic_compare_exchange_strong_acq_32(lock, &unlocked,
                                                          OPAL_ATOMIC_LOCK_LOCKED);
    return (ret == false) ? 1 : 0;
}

I don't see a memory barrier there, and I've looked at where atomic_trylock is used, and I don't see barriers there either. Additionally, the comment around exchange_strong_acq_32 from arm64/atomic.h:

/* these two functions aren't inlined in the non-gcc case because then
   there would be two function calls (since neither cmpset_32 nor
   atomic_?mb can be inlined).  Instead, we "inline" them by hand in
   the assembly, meaning there is one function call overhead instead
   of two */

I read this comment to mean that exchange_strong_acq should include both the _rmb() and the exchange. This comment also seems to be unique to the arm64 branch.
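
Put concretely (a hypothetical sketch using GCC builtins, not the OPAL source), under this reading the acq variant should itself carry acquire ordering on success, with no separate opal_atomic_rmb() call left to the caller:

#include <stdint.h>
#include <stdbool.h>

/* Sketch: what exchange_strong_acq_32 is expected to provide, in contrast
   to the relaxed CAS plus separate fence shown earlier. */
static inline bool cas_acq_32(int32_t *addr, int32_t *oldval, int32_t newval)
{
    return __atomic_compare_exchange_n(addr, oldval, newval,
                                       false, /* strong */
                                       __ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
}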

Conclusion: I think we still need this PR.

Signed-off-by: liuyuncheng <liuyuncheng@huawei.com>
@yuncliu yuncliu changed the title fix atomic operations and spin lock bug in arm64 [arm64 ]fix atomic operations opal_atomic_compare_exchange_strong_ Oct 20, 2023
@yuncliu yuncliu changed the title [arm64 ]fix atomic operations opal_atomic_compare_exchange_strong_ [arm64] fix atomic operations opal_atomic_compare_exchange_strong_ Oct 20, 2023
@bosilca
Member

bosilca commented Oct 20, 2023

OK, so now that we are down to a more reasonable patch, we need to decide whether or not the CAS needs strong memory-ordering semantics.

I am not sure what point @lrbison is trying to make with the discussion of opal_atomic_trylock, or why there would be a need for memory ordering around trylock. The upper layers, such as the SM BTL or the OB1 PML, shall add the required memory barriers if needed, not the intermediary layers such as atomic_lock. Moreover, what is the case in which this particular change is needed? How did OMPI on Fugaku work without this change? What about PPC?

@devreal
Contributor

devreal commented Oct 20, 2023

Locks in OMPI have acquire/release semantics. That is prudent, and removing it would break existing code. All other atomic operations have relaxed semantics. For specializations that carry acquire/release in their names we should provide the appropriate memory ordering, or we open the doors to eternal suffering.

@bosilca
Member

bosilca commented Oct 20, 2023

This patch does not remove them; it adds more. Based on what example do the locks in OMPI have strong memory semantics?

@devreal
Contributor

devreal commented Oct 20, 2023

The spinlock implementation uses opal_atomic_compare_exchange_strong_acq_32: https://github.com/open-mpi/ompi/blob/main/opal/include/opal/sys/atomic_impl_spinlock.h#L38

Release is done using an explicit opal_atomic_wmb(): https://github.com/open-mpi/ompi/blob/main/opal/include/opal/sys/atomic_impl_spinlock.h#L54

The functions changed by @yuncliu are not used in the spinlock, so I'm not sure about the initial motivation. Maybe opal_atomic_compare_exchange_strong_acq_[32|64] needs to be checked?

I agree with @bosilca that adding memory ordering to the "normal" CAS operations doesn't seem right. Sorry for the confusion earlier.
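
Paraphrasing the pattern at those two links (a sketch assuming an int32_t lock word, not the OPAL source): acquire is folded into the locking CAS, while unlock is an explicit write barrier followed by a plain store.

#include <stdint.h>

/* Sketch of the described spinlock pattern; names are made up. */
static void spinlock_lock(int32_t *lock)
{
    int32_t unlocked;
    do {
        unlocked = 0; /* OPAL_ATOMIC_LOCK_UNLOCKED; reset after failed CAS */
    } while (!__atomic_compare_exchange_n(lock, &unlocked, 1, /* LOCKED */
                                          false, /* strong */
                                          __ATOMIC_ACQUIRE, __ATOMIC_RELAXED));
}

static void spinlock_unlock(int32_t *lock)
{
    __atomic_thread_fence(__ATOMIC_RELEASE); /* opal_atomic_wmb() analogue */
    __atomic_store_n(lock, 0, __ATOMIC_RELAXED);
}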

@yuncliu
Author

yuncliu commented Oct 20, 2023

Here is test code that reproduces the bug. When run on arm64, it gets wrong answers or crashes.

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <sys/types.h>
#include "mpi.h"

#define MAX_THREADS (20)

int g_rankSize = 0;
int g_rank = 0;
MPI_Comm g_comm[MAX_THREADS];

/* Each thread runs 1,000,000 allreduce sums of 1 on its own communicator
   and counts results that differ from the communicator size. */
void *mpi_thread(void* p)
{
    int id = *(int*)p;
    free(p);
    int i;
    int count = 0;
    for (i = 0; i < 1000000; ++i) {
        int s = 1;
        int r = 0;
        MPI_Allreduce(&s, &r, 1, MPI_INT, MPI_SUM, g_comm[id]);
        if (r != g_rankSize) {
            count++;
        }
    }
    printf("rank %d id %d error count = %d\n", g_rank, id, count);
    return NULL;
}

int main(int argc, char** argv)
{
    int mpi_threads_provided;
    int req = MPI_THREAD_MULTIPLE;
    pthread_t threads[MAX_THREADS];
    const int threadNum = 10;
    int64_t i;


    MPI_Init_thread(&argc, &argv, req, &mpi_threads_provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &g_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &g_rankSize);

    MPI_Group worldGroup;
    MPI_Comm_group(MPI_COMM_WORLD, &worldGroup);
    for (i = 0; i < threadNum; ++i) {
        MPI_Comm_create(MPI_COMM_WORLD, worldGroup, &g_comm[i]);
    }

    for (i = 0; i < threadNum; ++i) {
        int *p = (int*)malloc(sizeof(int));
        *p = (int)i;
        pthread_create(&threads[i], NULL, mpi_thread, (void*)p);
    }

    for (i = 0; i < threadNum; ++i) {
        pthread_join(threads[i], NULL);
    }
    MPI_Finalize();
    return 0;
}
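
For reference, a reproducer along these lines would be built and run with something like mpicc repro.c -o repro -lpthread and mpirun -np 2 ./repro (file and binary names arbitrary); on a correct build every thread should print an error count of 0.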

@yuncliu
Author

yuncliu commented Oct 20, 2023

The spinlock implementation uses opal_atomic_compare_exchange_strong_acq_32: https://github.com/open-mpi/ompi/blob/main/opal/include/opal/sys/atomic_impl_spinlock.h#L38

Release is done using an explicit opal_atomic_wmb(): https://github.com/open-mpi/ompi/blob/main/opal/include/opal/sys/atomic_impl_spinlock.h#L54

The functions changed by @yuncliu are not used in the spinlock, so I'm not sure about the initial motivation. Maybe opal_atomic_compare_exchange_strong_acq_[32|64] needs to be checked?

I agree with @bosilca that adding memory ordering to the "normal" CAS operations doesn't seem right. Sorry for the confusion earlier.

I may have made a mistake earlier: the problem is not the spinlock but the exchange_strong_32/64. I changed them and the problem did not come back.

@devreal
Contributor

devreal commented Oct 20, 2023

That suggests we're missing a memory barrier somewhere. Can you provide more details on where the failure happens? I don't have easy access to an arm system. A stack trace would be useful.

@yuncliu yuncliu requested a review from lrbison October 20, 2023 15:08
Author

@yuncliu yuncliu left a comment

Only the exchange_strong_32/64 need the stxr changed to stlxr.

@yuncliu
Author

yuncliu commented Oct 20, 2023

That suggests we're missing a memory barrier somewhere. Can you provide more details on where the failure happens? I don't have easy access to an arm system. A stack trace would be useful.

I also don't have many chances to get time on an arm server. The crash happens rarely, so it may take a long time to capture a crash stack; I'll report it when I get one. But the wrong answers happen a lot: 8 threads each doing 1,000,000 allreduce sums may see 0-100 wrong answers.

Contributor

@devreal devreal left a comment

I think this is the wrong fix for a problem we have somewhere in the codebase.

@lrbison
Contributor

lrbison commented Oct 20, 2023

I see your point now, George and Joseph. I'll remove my approval.

I was thrown off by the fact that the (relaxed) exchange still had an acquire in the load (ldaxr), so I had assumed it should be sequential.

@lrbison lrbison self-requested a review October 20, 2023 15:54
@devreal
Contributor

devreal commented Oct 20, 2023

I opened #12011 to track this issue. I don't have the resources to track it down myself though.

@bosilca
Member

bosilca commented Oct 21, 2023

I was thrown off by the fact that the (relaxed) exchange still had an acquire in the load (ldaxr), so I had assumed it should be sequential.

I wonder if we can remove the acquire load from the CAS. I don't think it is necessary, but it will need some deeper investigation.

@yuncliu we need to understand whether the issue arises from the SM transport or somewhere else.

  1. Are you using UCX or OB1? You can run mpirun --mca pml ob1 to force the switch to our own communication library.
  2. Single-node or multi-node runs? Let's force TCP everywhere to see if we can replicate. Run mpirun --mca pml ob1 --mca btl tcp,self.

@yuncliu
Author

yuncliu commented Oct 23, 2023

I was thrown off by the fact that the (relaxed) exchange still had an acquire in the load (ldaxr), so I had assumed it should be sequential.

I wonder if we can remove the acquire load from the CAS. I don't think it is necessary, but it will need some deeper investigation.

@yuncliu we need to understand whether the issue arises from the SM transport or somewhere else.

  1. Are you using UCX or OB1? You can run mpirun --mca pml ob1 to force the switch to our own communication library.

--mca pml ob1 did not work.

  1. Single-node or multi-node runs? Let's force TCP everywhere to see if we can replicate. Run mpirun --mca pml ob1 --mca btl tcp,self.

Single node. mpirun --mca pml ob1 --mca btl tcp,self also did not work. My hardware is a server with 192 arm64 cores and 4 NUMA nodes.

@bosilca
Member

bosilca commented Oct 23, 2023

Are they not working for some other reason, or did you hit the same type of threading issue?

@lrbison
Contributor

lrbison commented Feb 21, 2024

@yuncliu thank you again for reporting this issue. I think it may be fixed in #12338. Can we move any further discussion to issue #12011 rather than this PR?

I will close this PR for now, as I don't think it is the correct change. However, if you could confirm in #12011 that the additional write memory barrier in smcuda fixed your issue, it would be greatly appreciated!

Thank you

@lrbison lrbison closed this Feb 21, 2024