
UCX: Segfault in Fetch_and_op after Get #6552

Closed

Description

@devreal

I am observing a segfault in UCX when issuing a specific sequence of operations on an exclusively locked window. I was able to boil it down to an example that calls MPI_Rget+MPI_Wait followed by MPI_Fetch_and_op. The segfault in the fetch-and-op seems to depend on the previous operation, i.e., a put or another fetch-and-op in place of the MPI_Rget does not trigger it (see the contrasting sketch after the example below).

The example code:

#include <mpi.h>
#include <stdio.h>
#include <stdint.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Win win;
    char *base;

    MPI_Win_allocate(
        sizeof(uint64_t),
        1,
        MPI_INFO_NULL,
        MPI_COMM_WORLD,
        &base,
        &win);

    int target = 0;

    if (size == 2) {
      if (rank != target) {
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, target, 0, win);
        uint64_t res;
        uint64_t val;
        MPI_Request req;
        MPI_Rget(&val, 1, MPI_UINT64_T, target, 0, 1, MPI_UINT64_T, win, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        // SEGFAULTs
        MPI_Fetch_and_op(&val, &res, MPI_UINT64_T, target, 0, MPI_SUM, win);
        MPI_Win_flush(target, win);

        MPI_Win_unlock(target, win);
      }
    } else {
      printf("Skipping exclusive lock test: requires exactly 2 ranks!\n");
    }

    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Win_free(&win);

    MPI_Finalize();

    return 0;
}
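
For comparison, this is roughly the put-based variant mentioned above, which does not trigger the crash for me. It is only a minimal sketch: the initial value of val and the extra flush after the put are my choices and not part of the failing reproducer.

// Drop-in replacement for the body of the (rank != target) branch above.
// An MPI_Put instead of MPI_Rget + MPI_Wait, followed by the same
// MPI_Fetch_and_op, does not segfault in my runs.
MPI_Win_lock(MPI_LOCK_EXCLUSIVE, target, 0, win);
uint64_t res;
uint64_t val = 42;                      /* arbitrary test value */
MPI_Put(&val, 1, MPI_UINT64_T, target, 0, 1, MPI_UINT64_T, win);
MPI_Win_flush(target, win);             /* complete the put before the AMO */
// No segfault observed with this sequence
MPI_Fetch_and_op(&val, &res, MPI_UINT64_T, target, 0, MPI_SUM, win);
MPI_Win_flush(target, win);
MPI_Win_unlock(target, win);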

Built and run with:

$ mpicc mpi_lock_segfault.c -o mpi_lock_segfault -g
$ mpirun -n 2 -N 1 ./mpi_lock_segfault
[n082701:36468:0:36468] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x1)
==== backtrace ====
    0  /lustre/nec/ws2/ws/hpcjschu-amsgq-eurompi/opt-vulcan/ucx-1.5.0/lib/libucs.so.0(+0x286f1) [0x2b29a0ce96f1]
    1  /lustre/nec/ws2/ws/hpcjschu-amsgq-eurompi/opt-vulcan/ucx-1.5.0/lib/libucs.so.0(+0x28893) [0x2b29a0ce9893]
    2  /lustre/nec/ws2/ws/hpcjschu-amsgq-eurompi/opt-vulcan/openmpi-4.0.1-ucx-intel-debug/lib/openmpi/mca_osc_ucx.so(req_completion+0x34a) [0x2b29a52e2faa]
    3  /lustre/nec/ws2/ws/hpcjschu-amsgq-eurompi/opt-vulcan/ucx-1.5.0/lib/libucp.so.0(ucp_atomic_rep_handler+0x2af) [0x2b29a0031901]
    4  /lustre/nec/ws2/ws/hpcjschu-amsgq-eurompi/opt-vulcan/ucx-1.5.0/lib/libuct.so.0(+0x3d36c) [0x2b29a030636c]
    5  /lustre/nec/ws2/ws/hpcjschu-amsgq-eurompi/opt-vulcan/ucx-1.5.0/lib/libuct.so.0(+0x3de68) [0x2b29a0306e68]
    6  /lustre/nec/ws2/ws/hpcjschu-amsgq-eurompi/opt-vulcan/ucx-1.5.0/lib/libucp.so.0(+0x220c4) [0x2b29a00220c4]
    7  /lustre/nec/ws2/ws/hpcjschu-amsgq-eurompi/opt-vulcan/ucx-1.5.0/lib/libucp.so.0(ucp_worker_progress+0x1d5) [0x2b29a0027c26]
    8  /lustre/nec/ws2/ws/hpcjschu-amsgq-eurompi/opt-vulcan/openmpi-4.0.1-ucx-intel-debug/lib/openmpi/mca_osc_ucx.so(ompi_osc_ucx_fetch_and_op+0x1cc3) [0x2b29a52d3847]
    9  /lustre/nec/ws2/ws/hpcjschu-amsgq-eurompi/opt-vulcan/openmpi-4.0.1-ucx-intel-debug/lib/libmpi.so.40(MPI_Fetch_and_op+0x319) [0x2b298cf5d015]
   10  ./mpi_lock_segfault() [0x400e12]
   11  /lib64/libc.so.6(__libc_start_main+0xf5) [0x2b298dad13d5]
   12  ./mpi_lock_segfault() [0x400b89]
===================

DDT reports:

Processes,Threads,Function
1,1,main (mpi_lock_segfault.c:33)
1,1,  PMPI_Fetch_and_op
1,1,    ompi_osc_ucx_fetch_and_op
1,1,      ucp_worker_progress (ucp_worker.c:1426)
1,1,        uct_worker_progress (uct.h:1677)
1,1,          ucs_callbackq_dispatch (callbackq.h:209)
1,1,            uct_rc_verbs_iface_progress (rc_verbs_iface.c:111)
1,1,              uct_rc_verbs_iface_poll_rx_common (rc_verbs_common.h:191)
1,1,                uct_rc_verbs_iface_handle_am (rc_verbs_common.h:162)
1,1,                  uct_iface_invoke_am (uct_iface.h:535)
1,1,                    ucp_atomic_rep_handler (amo_sw.c:250)
1,1,                      ucp_request_complete_send (ucp_request.inl:97)
1,1,                        req_completion
1,1,ucs_async_thread_func (thread.c:93)
1,1,  epoll_wait
1,2,progress_engine
1,2,  opal_libevent2022_event_base_loop (event.c:1630)
1,1,    epoll_dispatch (epoll.c:407)
1,1,      epoll_wait
1,1,    poll_dispatch (poll.c:165)
1,1,      poll

Not sure whether this is related to #6549 or #6546.

I used the Open MPI 4.0.1 release tarball and the UCX 1.5.0 release tarball. Open MPI was configured using:

../configure CC=icc CXX=icpc FTN=ifort --with-ucx=/path/to/ucx-1.5.0/ --without-verbs --enable-debug

The segfault is consistently reproducible on our IB cluster.

Please let me know if I can provide any other information.
