UCX: Segfault in Fetch_and_op after Get #6552

Closed
devreal opened this issue Apr 1, 2019 · 43 comments

@devreal
Contributor

devreal commented Apr 1, 2019

I am observing a segfault in UCX when issuing a specific sequence of operations on an exclusively locked window. I was able to boil it down to an example that calls MPI_Rget+MPI_Wait followed by MPI_Fetch_and_op. The segfault in the fetch-and-op seems to depend on the previous operation: a preceding put or another fetch-and-op does not trigger it.

The example code:

#include <mpi.h>
#include <stdio.h>
#include <stdint.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Win win;
    char *base;

    MPI_Win_allocate(
        sizeof(uint64_t),
        1,
        MPI_INFO_NULL,
        MPI_COMM_WORLD,
        &base,
        &win);

    int target = 0;

    if (size == 2) {
      if (rank != target) {
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, target, 0, win);
        uint64_t res;
        uint64_t val;
        MPI_Request req;
        MPI_Rget(&val, 1, MPI_UINT64_T, target, 0, 1, MPI_UINT64_T, win, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        // SEGFAULTs
        MPI_Fetch_and_op(&val, &res, MPI_UINT64_T, target, 0, MPI_SUM, win);
        MPI_Win_flush(target, win);

        MPI_Win_unlock(target, win);
      }
    } else {
      printf("Skipping exclusive lock test for more than 2 ranks!\n");
    }

    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Win_free(&win);

    MPI_Finalize();

    return 0;
}

Built with:

$ mpicc mpi_lock_segfault.c -o mpi_lock_segfault -g
$ mpirun -n 2 -N 1 ./mpi_lock_segfault
[n082701:36468:0:36468] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x1)
==== backtrace ====
    0  /lustre/nec/ws2/ws/hpcjschu-amsgq-eurompi/opt-vulcan/ucx-1.5.0/lib/libucs.so.0(+0x286f1) [0x2b29a0ce96f1]
    1  /lustre/nec/ws2/ws/hpcjschu-amsgq-eurompi/opt-vulcan/ucx-1.5.0/lib/libucs.so.0(+0x28893) [0x2b29a0ce9893]
    2  /lustre/nec/ws2/ws/hpcjschu-amsgq-eurompi/opt-vulcan/openmpi-4.0.1-ucx-intel-debug/lib/openmpi/mca_osc_ucx.so(req_completion+0x34a) [0x2b29a52e2faa]
    3  /lustre/nec/ws2/ws/hpcjschu-amsgq-eurompi/opt-vulcan/ucx-1.5.0/lib/libucp.so.0(ucp_atomic_rep_handler+0x2af) [0x2b29a0031901]
    4  /lustre/nec/ws2/ws/hpcjschu-amsgq-eurompi/opt-vulcan/ucx-1.5.0/lib/libuct.so.0(+0x3d36c) [0x2b29a030636c]
    5  /lustre/nec/ws2/ws/hpcjschu-amsgq-eurompi/opt-vulcan/ucx-1.5.0/lib/libuct.so.0(+0x3de68) [0x2b29a0306e68]
    6  /lustre/nec/ws2/ws/hpcjschu-amsgq-eurompi/opt-vulcan/ucx-1.5.0/lib/libucp.so.0(+0x220c4) [0x2b29a00220c4]
    7  /lustre/nec/ws2/ws/hpcjschu-amsgq-eurompi/opt-vulcan/ucx-1.5.0/lib/libucp.so.0(ucp_worker_progress+0x1d5) [0x2b29a0027c26]
    8  /lustre/nec/ws2/ws/hpcjschu-amsgq-eurompi/opt-vulcan/openmpi-4.0.1-ucx-intel-debug/lib/openmpi/mca_osc_ucx.so(ompi_osc_ucx_fetch_and_op+0x1cc3) [0x2b29a52d3847]
    9  /lustre/nec/ws2/ws/hpcjschu-amsgq-eurompi/opt-vulcan/openmpi-4.0.1-ucx-intel-debug/lib/libmpi.so.40(MPI_Fetch_and_op+0x319) [0x2b298cf5d015]
   10  ./mpi_lock_segfault() [0x400e12]
   11  /lib64/libc.so.6(__libc_start_main+0xf5) [0x2b298dad13d5]
   12  ./mpi_lock_segfault() [0x400b89]
===================

DDT reports:

Processes,Threads,Function
1,1,main (mpi_lock_segfault.c:33)
1,1,  PMPI_Fetch_and_op
1,1,    ompi_osc_ucx_fetch_and_op
1,1,      ucp_worker_progress (ucp_worker.c:1426)
1,1,        uct_worker_progress (uct.h:1677)
1,1,          ucs_callbackq_dispatch (callbackq.h:209)
1,1,            uct_rc_verbs_iface_progress (rc_verbs_iface.c:111)
1,1,              uct_rc_verbs_iface_poll_rx_common (rc_verbs_common.h:191)
1,1,                uct_rc_verbs_iface_handle_am (rc_verbs_common.h:162)
1,1,                  uct_iface_invoke_am (uct_iface.h:535)
1,1,                    ucp_atomic_rep_handler (amo_sw.c:250)
1,1,                      ucp_request_complete_send (ucp_request.inl:97)
1,1,                        req_completion
1,1,ucs_async_thread_func (thread.c:93)
1,1,  epoll_wait
1,2,progress_engine
1,2,  opal_libevent2022_event_base_loop (event.c:1630)
1,1,    epoll_dispatch (epoll.c:407)
1,1,      epoll_wait
1,1,    poll_dispatch (poll.c:165)
1,1,      poll

Not sure whether this is related to #6549 or #6546.

I used the Open MPI 4.0.1 release tarball and the UCX 1.5.0 release tarball. Open MPI was configured using:

../configure CC=icc CXX=icpc FTN=ifort --with-ucx=/path/to/ucx-1.5.0/ --without-verbs --enable-debug

The segfault is consistently reproducible on our IB cluster.

Please let me know if I can provide any other information.

@jsquyres
Member

jsquyres commented Apr 9, 2019

@yosefe @jladd-mlnx FYI.

@vspetrov

I was able to repro this bug with ompi v4.0.x and the fix is: #6594. OMPI master does not reproduce this.

@gpaulsen
Member

@jladd-mlnx Any progress?

@artpol84
Contributor

artpol84 commented Jul 1, 2019

@devreal it seems that @vspetrov fixed the issue. Can you confirm?

@devreal
Contributor Author

devreal commented Jul 2, 2019

I just tested 514e273 with the UCX 1.6.x branch and see the exact same error as reported above.

@yosefe
Contributor

yosefe commented Jul 2, 2019

@janjust can you please confirm whether the OSC/UCX component checks that the atomic operation size is 32 or 64 bits before posting it to UCX?
It seems to me that it does not (for example in ompi_osc_ucx_fetch_and_op).

edit: the comment is irrelevant for this issue.

@janjust
Contributor

janjust commented Jul 2, 2019

@yosefe , hmm, you're right, it does not.

@jladd-mlnx
Member

@janjust So is the plan to add fallback to AM logic for 32 and 64 bit atomics?

@janjust
Contributor

janjust commented Jul 9, 2019

@jladd-mlnx 32/64-bit atomic ops are handled. It's the 8-bit and 16-bit ones that are not, and yes, for those the plan is to use AM logic.
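
A minimal sketch of what that dispatch could look like (hypothetical helper names, not the actual osc/ucx code): 4- and 8-byte operands go through the UCX atomic path, while smaller operands would have to take an AM-based fallback, since UCX atomics only accept 32- or 64-bit operands.

#include <stdint.h>
#include <stddef.h>

/* Placeholder helpers -- hypothetical, not real osc/ucx functions. */
static int post_ucx_atomic(size_t size)  { (void)size; return 0; }
static int post_am_fallback(size_t size) { (void)size; return 0; }

/* Dispatch on operand size before posting an atomic operation. */
static int fetch_and_op_dispatch(size_t dt_size)
{
    switch (dt_size) {
    case sizeof(uint32_t):
    case sizeof(uint64_t):
        return post_ucx_atomic(dt_size);    /* 32/64-bit: UCX atomic path  */
    default:
        return post_am_fallback(dt_size);   /* 8/16-bit: AM-based fallback */
    }
}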

@jladd-mlnx
Member

Target date?

@janjust
Contributor

janjust commented Jul 9, 2019

@jladd-mlnx I don't have one yet

@jladd-mlnx jladd-mlnx assigned janjust and unassigned yosefe Jul 9, 2019
@artpol84
Contributor

artpol84 commented Jul 9, 2019

@devreal
Could you clarify a couple of things:

  1. What IB device are you using? The reason I'm asking is that I see in your backtrace that SW atomics are used, which is not what I would expect. @yosefe, can you comment?
  2. With 68c88e8, are you seeing exactly the same backtrace?

@yosefe
Contributor

yosefe commented Jul 9, 2019

@artpol84 @janjust sorry, my comment #6552 (comment) is misleading; it was supposed to be for #6777.

The issue here is probably a lack of progress calls in OSC/UCX, which causes a deadlock when using SW atomics. For example, start_atomicity() looks suspicious because it does not call opal_progress() while waiting for completion.
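
A minimal sketch of the kind of wait loop this implies, assuming a helper of roughly this shape (not the actual start_atomicity() code): driving opal_progress() alongside the local UCP worker lets completions that depend on other workers in the process (e.g. pml/ucx) advance.

#include <ucp/api/ucp.h>

extern int opal_progress(void);   /* OMPI's global progress entry point */

/* Assumed shape of a request-wait helper: progressing only the local
 * worker can deadlock with SW atomics if completion depends on another
 * worker in the same process. */
static ucs_status_t wait_ucx_request(ucp_worker_h worker, void *request)
{
    if (request == NULL) {
        return UCS_OK;                        /* completed immediately    */
    }
    if (UCS_PTR_IS_ERR(request)) {
        return UCS_PTR_STATUS(request);
    }
    while (ucp_request_check_status(request) == UCS_INPROGRESS) {
        ucp_worker_progress(worker);          /* this OSC worker          */
        opal_progress();                      /* everything else in OMPI  */
    }
    ucp_request_free(request);
    return UCS_OK;
}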

@artpol84
Contributor

artpol84 commented Jul 9, 2019

But why are SW atomics involved at all on an IB system?

@artpol84
Contributor

artpol84 commented Jul 9, 2019

@yosefe

for example. start_atomicity() looks suspicious because it does not call opal_progress() while waiting for completion.

But it does call UCP progress, which should be enough for AMOs, shouldn't it?
https://github.com/open-mpi/ompi/blob/v4.0.x/ompi/mca/osc/ucx/osc_ucx_comm.c#L297

@yosefe
Contributor

yosefe commented Jul 9, 2019

The problem is that it progresses only one worker; if there is some other worker on that process (e.g. pml/ucx or some other osc/ucx worker), that worker will not be progressed.

@yosefe
Contributor

yosefe commented Jul 9, 2019

We use SW atomics if the HW does not support them directly, for example on older hardware.

@artpol84
Contributor

OK, this is important, because @janjust was trying to reproduce and couldn't.
I guess if we force SW AMOs we will have a better chance of reproducing.
I assume that is possible.

@artpol84
Contributor

We will check the progress hypothesis.
Since this is 4.0.x, it has only one UCX worker per OSC, though I assume the pml/ucx worker is there as well.

@devreal
Contributor Author

devreal commented Jul 10, 2019

@artpol84 I can confirm that the issue persists with 68c88e8:

$ git checkout 68c88e8
$ cd build
$ ../configure CC=gcc CXX=g++ FTN=gfortran --with-ucx=$HOME/opt/ucx-1.6.x-gnu/ --without-verbs --prefix=$HOME/opt/openmpi-git-68c88e8-ucx
$ export PATH=$HOME/opt/openmpi-git-68c88e8-ucx/bin:$PATH
$ mpicc test_mpi_rget_fetch_op.c -o test_mpi_rget_fetch_op
$ mpirun -n 2 -N 1 ./test_mpi_rget_fetch_op
Processes,Threads,Function
1,1,ucs_async_thread_func (thread.c:93)
1,1,  epoll_wait
1,2,progress_engine
1,2,  opal_libevent2022_event_base_loop (event.c:1630)
1,1,    epoll_dispatch (epoll.c:407)
1,1,      epoll_wait
1,1,    poll_dispatch (poll.c:165)
1,1,      poll
1,1,main
1,1,  PMPI_Fetch_and_op
1,1,    ompi_osc_ucx_fetch_and_op
1,1,      ucp_worker_progress (ucp_worker.c:1632)
1,1,        uct_worker_progress (uct.h:1806)
1,1,          ucs_callbackq_dispatch (callbackq.h:211)
1,1,            uct_rc_verbs_iface_progress (rc_verbs_iface.c:131)
1,1,              uct_rc_verbs_iface_poll_rx_common (rc_verbs_impl.h:125)
1,1,                uct_rc_verbs_iface_handle_am (rc_verbs_impl.h:96)
1,1,                  uct_iface_invoke_am (uct_iface.h:596)
1,1,                    ucp_atomic_rep_handler (amo_sw.c:253)
1,1,                      ucp_request_complete_send (ucp_request.inl:97)
1,1,                        req_completion

(req_completion is where the Segfault happens)

Output from ibv_devices:

$ mpirun -n 2 -N 1 ibv_devices 
    device          	   node GUID
    ------          	----------------
    mlx5_0          	506b4b0300ddfb46
    device          	   node GUID
    ------          	----------------
    mlx5_0          	506b4b0300da7010

Also:

$ mpirun -n 1 lspci | grep -i infini
5e:00.0 Infiniband controller: Mellanox Technologies MT27800 Family [ConnectX-5]

@artpol84
Contributor

@devreal, thanks!

@yosefe
Contributor

yosefe commented Jul 11, 2019

@devreal may I ask what versions of libibverbs and OFED you are using? They can be detected by running rpm -q libibverbs and ofed_info -s. Thanks!

@devreal
Contributor Author

devreal commented Jul 11, 2019

Sure:

$ rpm -q libibverbs
libibverbs-17.2-3.el7.x86_64

There is no ofed_info available on the system and I am not able to change the software environment. According to our admins, OFED is shipped with the installed Red Hat 7.6. I can try to query specific package versions if that helps (and I know what to look for ;))

@janjust
Contributor

janjust commented Jul 11, 2019

@devreal Is this issue always reproducible for you? I can reproduce a similar trace (mine ends up in HW atomics), but the issue does not reproduce every time for me.

[sputnik6:19430:0:19430] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x1)

/hpc/mtr_scrap/users/tomislavj/ompi/ompi/mca/osc/ucx/../../../../opal/threads/thread_usage.h: [ opal_thread_add_fetch_32() ]
      ...
      145 }
      146
      147 OPAL_THREAD_DEFINE_ATOMIC_OP(int32_t, add, +, 32)
==>   148 OPAL_THREAD_DEFINE_ATOMIC_OP(size_t, add, +, size_t)
      149 OPAL_THREAD_DEFINE_ATOMIC_OP(int32_t, and, &, 32)
      150 OPAL_THREAD_DEFINE_ATOMIC_OP(int32_t, or, |, 32)
      151 OPAL_THREAD_DEFINE_ATOMIC_OP(int32_t, xor, ^, 32)

==== backtrace (tid:  19430) ====
 0 0x000000000000c2e6 opal_thread_add_fetch_32()  /hpc/mtr_scrap/users/tomislavj/ompi/ompi/mca/osc/ucx/../../../../opal/threads/thread_usage.h:148
 1 0x0000000000022e05 ucp_request_complete_send()  /hpc/mtr_scrap/users/tomislavj/ucx/src/ucp/core/ucp_request.inl:97
 2 0x0000000000043372 uct_invoke_completion()  /hpc/mtr_scrap/users/tomislavj/ucx/src/uct/base/uct_iface.h:614
 3 0x0000000000040385 uct_rc_txqp_completion_op()  /hpc/mtr_scrap/users/tomislavj/ucx/src/uct/ib/rc/base/rc_ep.h:380
 4 0x000000000001fefa ucs_callbackq_dispatch()  /hpc/mtr_scrap/users/tomislavj/ucx/src/ucs/datastruct/callbackq.h:211
 5 0x0000000000007d5d opal_common_ucx_wait_request()  /hpc/mtr_scrap/users/tomislavj/ompi/ompi/mca/osc/ucx/../../../../opal/mca/common/ucx/common_ucx.h:136
 6 0x0000000000063276 PMPI_Fetch_and_op()  /hpc/mtr_scrap/users/tomislavj/ompi/ompi/mpi/c/profile/pfetch_and_op.c:73
 7 0x0000000000400d22 main()  ???:0
 8 0x0000000000021c05 __libc_start_main()  ???:0
 9 0x0000000000400b39 _start()  ???:0
=================================

@janjust
Contributor

janjust commented Jul 11, 2019

@yosefe @artpol84 Just a note on progress: it seems that in v4.0.x opal_progress isn't called, but it is called in master and, by extension, in the failing commit 68c88e8.

https://github.com/open-mpi/ompi/blob/master/opal/mca/common/ucx/common_ucx.h#L66
Called in start/end_atomicity via:

opal_common_ucx_atomic_cswap()
  +- opal_common_ucx_atomic_fetch()
      +- opal_common_ucx_wait_request()

@devreal
Contributor Author

devreal commented Jul 12, 2019

@janjust Yes, I can reproduce this consistently every time I run the reproducer. Is there a way to figure out why software atomics are being used?

@janjust
Contributor

janjust commented Jul 12, 2019

@devreal I would assume it's due to the inbox/OFED version; I'm not sure of a surefire way to determine why SW atomics are used. Your HW does support HW atomics.

@artpol84
Contributor

@devreal I think we pinpointed the issue.
The error is not in Fetch_and_op but in the Rget implementation:
the OMPI request is not correctly cleaned up by osc/ucx in MPI_Wait.
@janjust will open a PR shortly.
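
To illustrate the hazard with purely hypothetical types (this is not the actual osc/ucx code): if the UCX-side request still points at the OMPI request after MPI_Wait has released it, a completion callback fired while a later operation progresses the worker writes to freed memory; unlinking the two before freeing avoids that.

#include <stdlib.h>

/* Hypothetical stand-ins for the OMPI request and the UCX-side request
 * that carries a completion callback. */
typedef struct user_request {
    int complete;
} user_request_t;

typedef struct ucx_side_request {
    user_request_t *external_req;        /* set when the Rget is posted  */
} ucx_side_request_t;

/* Completion callback, guarded so that a late completion is harmless. */
static void req_completion_cb(ucx_side_request_t *req)
{
    if (req->external_req != NULL) {
        req->external_req->complete = 1;
    }
}

/* Cleanup as MPI_Wait would do it: detach before freeing, so a callback
 * firing later (e.g. during MPI_Fetch_and_op's progress) cannot touch
 * freed memory. */
static void release_user_request(ucx_side_request_t *ucx_req,
                                 user_request_t *user_req)
{
    ucx_req->external_req = NULL;
    free(user_req);
}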

@artpol84
Contributor

artpol84 commented Jul 12, 2019

Is there a way to figure out why software atomics are being used?

According to @yosefe, one of the causes could be an old OFED.

@janjust
Contributor

janjust commented Jul 12, 2019

@devreal Can you please try testing #6810? It fixes the issue for me. I'm using HW atomics, but your segfault signature
[n082701:36468:0:36468] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x1)
looks identical, so I'm certain it's the same issue.

@artpol84
Contributor

@devreal as @janjust noted in #6810, the issue doesn't reproduce on master (at least not directly) because we significantly redesigned osc/ucx there.
I see the "TargetMaster" tag is assigned here. Was there any particular reason for that? Have you experienced this with master?

@devreal
Contributor Author

devreal commented Jul 14, 2019

@artpol84 The target:master label was added by @hppritcha; I have not tried it with master. I will give it a shot on master as well, though, just to make sure. (I'm running into trouble running anything on master right now; I need to figure out what's going wrong there.)

@janjust Unfortunately, the error still occurs. I tried running under DDT again but the segfault occurs in a macro so I'm at a loss there (I might try inlining that specific macro to get more information). I also tried running with UCX_LOG_LEVEL=trace but I don't seem to get the trace message from ucp_request_complete_send right before the macro invocation, only the following:

[1563108074.533099] [n062602:18481:0]          mpool.c:200  UCX  DEBUG mpool rc_send_desc: allocated chunk 0x2b99b1000018 of 10485736 bytes with 1270 elements
[1563108074.533148] [n062602:18481:0]       ib_iface.c:1161 UCX  TRACE arm_cq: got 0 send and 1 recv events, returning BUSY
[1563108074.533151] [n062602:18481:0]     ucp_worker.c:689  UCX  TRACE arm iface 0x240d020 returned BUSY
[1563108074.533154] [n062602:18481:0]     ucp_worker.c:657  UCX  TRACE iface 0x22e79e0 already activated
[1563108074.533159] [n062602:18481:0]     ucp_worker.c:675  UCX  TRACE armed iface 0x240d020
[n062602:18481:0:18481] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x1)
==== backtrace (tid:  18481) ====
 0 0x000000000000c581 req_completion()  ???:0
 1 0x000000000002830c ucp_request_complete_send()  /zhome/academic/HLRS/hlrs/hpcjschu/src/openucx/git/build_debug/../src/ucp/core/ucp_request.inl:97
 2 0x000000000002830c ucp_atomic_rep_handler()  /zhome/academic/HLRS/hlrs/hpcjschu/src/openucx/git/build_debug/src/ucp/../../../src/ucp/rma/amo_sw.c:253
 3 0x00000000000212a4 uct_iface_invoke_am()  /zhome/academic/HLRS/hlrs/hpcjschu/src/openucx/git/build_debug/../src/uct/base/uct_iface.h:596
 4 0x0000000000021bdb uct_rc_verbs_iface_handle_am()  /zhome/academic/HLRS/hlrs/hpcjschu/src/openucx/git/build_debug/src/uct/ib/../../../../src/uct/ib/rc/verbs/rc_verbs_impl.h:96
 5 0x0000000000021bdb uct_rc_verbs_iface_poll_rx_common()  /zhome/academic/HLRS/hlrs/hpcjschu/src/openucx/git/build_debug/src/uct/ib/../../../../src/uct/ib/rc/verbs/rc_verbs_impl.h:125
 6 0x0000000000021bdb uct_rc_verbs_iface_progress()  /zhome/academic/HLRS/hlrs/hpcjschu/src/openucx/git/build_debug/src/uct/ib/../../../../src/uct/ib/rc/verbs/rc_verbs_iface.c:131
 7 0x000000000001c564 ucs_callbackq_dispatch()  /zhome/academic/HLRS/hlrs/hpcjschu/src/openucx/git/build_debug/../src/ucs/datastruct/callbackq.h:211
 8 0x00000000000216e0 uct_worker_progress()  /zhome/academic/HLRS/hlrs/hpcjschu/src/openucx/git/build_debug/../src/uct/api/uct.h:1806
 9 0x0000000000007fff ompi_osc_ucx_fetch_and_op()  ???:0
10 0x00000000000724c5 PMPI_Fetch_and_op()  ???:0
11 0x0000000000400d05 main()  ???:0
12 0x00000000000223d5 __libc_start_main()  ???:0
13 0x0000000000400b39 _start()  ???:0
=================================

@artpol84
Contributor

artpol84 commented Jul 14, 2019

@devreal,

  • Could you try with a debug build of OMPI (with “-O0”)? I recall that it shifted the segfault to a more accurate place that wasn't in the macro.
  • Also, the DDT backtrace looks less informative than GDB's in the sense that it doesn't print function arguments. Can we get them somehow?
  • In our case the problem was a double request completion resulting from the absence of proper cleanup.
    I'll propose a version with additional printouts later today and will point you to it. I would appreciate it if you could run it.

@yosefe is there a way to force SW atomics so we can try to reproduce locally?

@artpol84
Contributor

@devreal,
And just in case, please double-check that our fix was present in the build (maybe by adding a printf next to our added line),
because I was 100% sure that you had the same issue as we observed.

@devreal
Contributor Author

devreal commented Jul 14, 2019

I just realized that I used the wrong version in my last test; I will report back once I have tested the actual branch with #6810.

@devreal
Contributor Author

devreal commented Jul 14, 2019

OK, when using @janjust's branch it works as expected. Sorry for the noise, and thanks a lot for the fix, @artpol84 and @janjust :)

@artpol84
Contributor

Great! That was my expectation!

@devreal
Contributor Author

devreal commented Jul 15, 2019

On the use of SW atomics: I'm fairly sure the good ol' ibverbs btl supported hardware atomics on this system, so in that regard switching to UCX causes a performance regression on that machine (for our use case at least). I haven't tested the other IB-based systems I have access to; I will do that today. What are the minimum OFED requirements for HW atomics to be available?

@yosefe
Contributor

yosefe commented Jul 15, 2019

@devreal you seem to be using rdma-core 17, which does not support HW atomics with UCX (the problem is that it supports only 64-bit atomics, whereas UCX requires both 32- and 64-bit support). With MLNX_OFED 4.6 (the latest release) they would be supported.

@devreal
Contributor Author

devreal commented Jul 15, 2019

@yosefe Just so I understand it correctly: unless the site manages to install the latest MLNX_OFED, RMA atomics will be done in software if UCX is used (that is, in 4.x and the upcoming 5.x)? That seems like a significant step back for Open MPI's RMA implementation, which so far has been the only fully usable MPI RMA implementation with decent hardware support for me. IIRC, @hjelmn mentioned the issue with 32-bit integers somewhere and that the ibverbs btl uses CAS in that case (which makes it slower than 64-bit, yes). I'd argue that this is an acceptable trade-off if 64-bit operations can be done in hardware instead...
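
For illustration, the CAS-based fallback mentioned above boils down to something like this local sketch (C11 atomics standing in for the network-level verbs operations, with the 32-bit value assumed to sit in the low half of a 64-bit word): a 32-bit fetch-and-add retried on top of a 64-bit compare-and-swap, slower than a native 64-bit atomic but still done in hardware.

#include <stdatomic.h>
#include <stdint.h>

/* Emulate a 32-bit fetch-and-add using a 64-bit compare-and-swap,
 * retrying until the swap succeeds.  An RDMA btl would issue the CAS
 * over the network instead of against local memory. */
static uint32_t fetch_add32_via_cas64(_Atomic uint64_t *word, uint32_t inc)
{
    uint64_t old = atomic_load(word);
    uint64_t desired;
    do {
        uint32_t low = (uint32_t)old + inc;            /* wraps like HW would */
        desired = (old & 0xffffffff00000000ULL) | low;
    } while (!atomic_compare_exchange_weak(word, &old, desired));
    return (uint32_t)old;                              /* previous low 32 bits */
}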

@janjust
Contributor

janjust commented Jul 17, 2019

@devreal So this issue is fixed with my branch; can it be closed?

@devreal
Contributor Author

devreal commented Jul 18, 2019

@janjust Yes, this can be closed when #6810 is merged 👍

@devreal
Contributor Author

devreal commented Jul 24, 2019

Fixed with #6810 in the 4.0.x branch. This issue does not occur on master. Thanks for the fix!

@devreal devreal closed this as completed Jul 24, 2019