Description
I am observing a segfault in UCX when issuing a specific sequence of operations on an exclusively locked window. I was able to boil the problem down to an example that calls MPI_Rget + MPI_Wait followed by MPI_Fetch_and_op. The segfault in the fetch-and-op seems to depend on the preceding operation, i.e., a put or another fetch-and-op does not trigger it (a sketch of such a non-crashing variant follows the example below).
The example code:
#include <mpi.h>
#include <stdio.h>
#include <stdint.h>
int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Win win;
    char *base;
    MPI_Win_allocate(
        sizeof(uint64_t),
        1,
        MPI_INFO_NULL,
        MPI_COMM_WORLD,
        &base,
        &win);
    int target = 0;
    if (size == 2) {
        if (rank != target) {
            MPI_Win_lock(MPI_LOCK_EXCLUSIVE, target, 0, win);
            uint64_t res;
            uint64_t val;
            MPI_Request req;
            MPI_Rget(&val, 1, MPI_UINT64_T, target, 0, 1, MPI_UINT64_T, win, &req);
            MPI_Wait(&req, MPI_STATUS_IGNORE);
            // SEGFAULTs
            MPI_Fetch_and_op(&val, &res, MPI_UINT64_T, target, 0, MPI_SUM, win);
            MPI_Win_flush(target, win);
            MPI_Win_unlock(target, win);
        }
    } else {
        printf("Skipping exclusive lock test for more than 2 ranks!\n");
    }
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
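For comparison, here is a minimal sketch of the non-crashing variant mentioned above, with the MPI_Rget + MPI_Wait replaced by an MPI_Put. The intermediate MPI_Win_flush is my addition to complete the put before the fetch-and-op, analogous to the MPI_Wait in the original; replacing the put with another MPI_Fetch_and_op behaves the same way for me.
#include <mpi.h>
#include <stdio.h>
#include <stdint.h>
int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Win win;
    char *base;
    MPI_Win_allocate(sizeof(uint64_t), 1, MPI_INFO_NULL, MPI_COMM_WORLD, &base, &win);
    int target = 0;
    if (size == 2) {
        if (rank != target) {
            MPI_Win_lock(MPI_LOCK_EXCLUSIVE, target, 0, win);
            uint64_t res;
            uint64_t val = 0;
            /* Preceding operation is a put instead of MPI_Rget + MPI_Wait. */
            MPI_Put(&val, 1, MPI_UINT64_T, target, 0, 1, MPI_UINT64_T, win);
            MPI_Win_flush(target, win);
            /* No segfault observed with this sequence. */
            MPI_Fetch_and_op(&val, &res, MPI_UINT64_T, target, 0, MPI_SUM, win);
            MPI_Win_flush(target, win);
            MPI_Win_unlock(target, win);
        }
    } else {
        printf("Skipping exclusive lock test for more than 2 ranks!\n");
    }
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}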
Built and run with:
$ mpicc mpi_lock_segfault.c -o mpi_lock_segfault -g
$ mpirun -n 2 -N 1 ./mpi_lock_segfault
[n082701:36468:0:36468] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x1)
==== backtrace ====
0 /lustre/nec/ws2/ws/hpcjschu-amsgq-eurompi/opt-vulcan/ucx-1.5.0/lib/libucs.so.0(+0x286f1) [0x2b29a0ce96f1]
1 /lustre/nec/ws2/ws/hpcjschu-amsgq-eurompi/opt-vulcan/ucx-1.5.0/lib/libucs.so.0(+0x28893) [0x2b29a0ce9893]
2 /lustre/nec/ws2/ws/hpcjschu-amsgq-eurompi/opt-vulcan/openmpi-4.0.1-ucx-intel-debug/lib/openmpi/mca_osc_ucx.so(req_completion+0x34a) [0x2b29a52e2faa]
3 /lustre/nec/ws2/ws/hpcjschu-amsgq-eurompi/opt-vulcan/ucx-1.5.0/lib/libucp.so.0(ucp_atomic_rep_handler+0x2af) [0x2b29a0031901]
4 /lustre/nec/ws2/ws/hpcjschu-amsgq-eurompi/opt-vulcan/ucx-1.5.0/lib/libuct.so.0(+0x3d36c) [0x2b29a030636c]
5 /lustre/nec/ws2/ws/hpcjschu-amsgq-eurompi/opt-vulcan/ucx-1.5.0/lib/libuct.so.0(+0x3de68) [0x2b29a0306e68]
6 /lustre/nec/ws2/ws/hpcjschu-amsgq-eurompi/opt-vulcan/ucx-1.5.0/lib/libucp.so.0(+0x220c4) [0x2b29a00220c4]
7 /lustre/nec/ws2/ws/hpcjschu-amsgq-eurompi/opt-vulcan/ucx-1.5.0/lib/libucp.so.0(ucp_worker_progress+0x1d5) [0x2b29a0027c26]
8 /lustre/nec/ws2/ws/hpcjschu-amsgq-eurompi/opt-vulcan/openmpi-4.0.1-ucx-intel-debug/lib/openmpi/mca_osc_ucx.so(ompi_osc_ucx_fetch_and_op+0x1cc3) [0x2b29a52d3847]
9 /lustre/nec/ws2/ws/hpcjschu-amsgq-eurompi/opt-vulcan/openmpi-4.0.1-ucx-intel-debug/lib/libmpi.so.40(MPI_Fetch_and_op+0x319) [0x2b298cf5d015]
10 ./mpi_lock_segfault() [0x400e12]
11 /lib64/libc.so.6(__libc_start_main+0xf5) [0x2b298dad13d5]
12 ./mpi_lock_segfault() [0x400b89]
===================
DDT reports:
Processes,Threads,Function
1,1,main (mpi_lock_segfault.c:33)
1,1, PMPI_Fetch_and_op
1,1, ompi_osc_ucx_fetch_and_op
1,1, ucp_worker_progress (ucp_worker.c:1426)
1,1, uct_worker_progress (uct.h:1677)
1,1, ucs_callbackq_dispatch (callbackq.h:209)
1,1, uct_rc_verbs_iface_progress (rc_verbs_iface.c:111)
1,1, uct_rc_verbs_iface_poll_rx_common (rc_verbs_common.h:191)
1,1, uct_rc_verbs_iface_handle_am (rc_verbs_common.h:162)
1,1, uct_iface_invoke_am (uct_iface.h:535)
1,1, ucp_atomic_rep_handler (amo_sw.c:250)
1,1, ucp_request_complete_send (ucp_request.inl:97)
1,1, req_completion
1,1,ucs_async_thread_func (thread.c:93)
1,1, epoll_wait
1,2,progress_engine
1,2, opal_libevent2022_event_base_loop (event.c:1630)
1,1, epoll_dispatch (epoll.c:407)
1,1, epoll_wait
1,1, poll_dispatch (poll.c:165)
1,1, poll
I am not sure whether this is related to #6549 or #6546.
I used the Open MPI 4.0.1 release tarball and the UCX 1.5.0 release tarball. Open MPI was configured using:
../configure CC=icc CXX=icpc FTN=ifort --with-ucx=/path/to/ucx-1.5.0/ --without-verbs --enable-debug
The segfault is consistently reproducible on our IB cluster.
Please let me know if I can provide any other information.