Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UCX hangs in MPI_Win_lock on master #6816

Closed
devreal opened this issue Jul 15, 2019 · 13 comments
Closed

UCX hangs in MPI_Win_lock on master #6816

devreal opened this issue Jul 15, 2019 · 13 comments

Comments

@devreal
Copy link
Contributor

devreal commented Jul 15, 2019

I'm trying to run the example from #6552 on master (020a591) with UCX 1.5.2 (built from release tarball), which leads to process 1 being stuck in MPI_Win_lock and process 0 waiting in MPI_Barrier:

Process 0:

#6 main (argc=3, argv=0x7fffffffb208) at test_mpi_rget_fetch_op.c:44 (at 0x0000000000400dab)
#5 PMPI_Barrier () from openmpi-git-ucx/lib/libmpi.so.0 (at 0x00002aaaaab39e45)
#4 ompi_coll_base_barrier_intra_two_procs () from openmpi-git-ucx/lib/libmpi.so.0 (at 0x00002aaaaab84a32)
#3 ompi_request_default_wait () from openmpi-git-ucx/lib/libmpi.so.0 (at 0x00002aaaaab21255)
#2 opal_progress () from openmpi-git-ucx/lib/libopen-pal.so.0 (at 0x00002aaaab2e4d1c)
#1 progress_callback () from openmpi-git-ucx/lib/openmpi/mca_osc_ucx.so (at 0x00002aaaad52835c)
#0 pthread_mutex_unlock () from /usr/lib64/libpthread.so.0 (at 0x00002aaaaacd9e53)

Process 1:

#6 main (argc=3, argv=0x7fffffffb208) at test_mpi_rget_fetch_op.c:28 (at 0x0000000000400cf1)
#5 PMPI_Win_lock () from openmpi-git-ucx/lib/libmpi.so.0 (at 0x00002aaaaab6868d)
#4 ompi_osc_ucx_lock () from openmpi-git-ucx/lib/openmpi/mca_osc_ucx.so (at 0x00002aaaad52c067)
#3 ucp_worker_progress (worker=0x2aaaaf02b010) at openucx/ucx-1.5.2/build/src/ucp/../../../src/ucp/core/ucp_worker.c:1426 (at 0x00002aaaac1b9ee2)
#2 uct_worker_progress (worker=<optimized out>) at openucx/ucx-1.5.2/build/../src/uct/api/uct.h:1677 (at 0x00002aaaac1b9ee2)
#1 ucs_callbackq_dispatch (cbq=<optimized out>) at openucx/ucx-1.5.2/build/../src/ucs/datastruct/callbackq.h:209 (at 0x00002aaaac1b9ee2)
#0 uct_ud_verbs_iface_progress (tl_iface=0x6ae610) at openucx/ucx-1.5.2/build/src/uct/../../../src/uct/ib/ud/verbs/ud_verbs.c:376 (at 0x00002aaaac336294)

Process 1 seems to never get back from ompi_osc_ucx_lock.

Infiniband devices:

$ srun ibv_devices
    device          	   node GUID
    ------          	----------------
    mlx5_0          	08003800013c7663
    device          	   node GUID
    ------          	----------------
    mlx5_0          	08003800013c7507

ibverbs version:

$ rpm -q libibverbs
libibverbs-41mlnx1-OFED.4.2.0.1.4.42100.x86_64

Open MPI was configured using:

../configure CC=gcc CXX=g++ FC=gfortran --with-ucx=$HOME/opt/ucx-1.5.2 --prefix=$HOME/opt/openmpi-git-ucx --with-slurm --with-pmi=/usr --with-pmi-libdir=/usr/lib64

In case it matters: this is a different system than the one used in #6552, mainly due to #6815 preventing me to run on the original system. The main difference is that the current system uses SLURM instead of PBS but I don't think that has any impact on UCX.

Potentially related: #6549 (although that was reported on v4.0.x and managed to get an exclusive lock)

Curiously, trying to run with UCX logging enabled leads to memory corruption:

$ UCX_LOG_LEVEL=trace srun ./test_mpi_rget_fetch_op
[1563195672.358175] [taurusi6605:32400:0]          debug.c:1099 UCX  DEBUG using signal stack 0x2b48f90e2000 size 141824
[1563195672.358199] [taurusi6605:32400:0]        profile.c:302  UCX  TRACE profiling is disabled
[1563195672.358214] [taurusi6605:32400:0]           init.c:89   UCX  DEBUG /home/jschuch/opt/ucx-1.5.2/lib/libucs.so.0 loaded at 0x2b48f8f5e000
[1563195672.358245] [taurusi6605:32400:0]           init.c:90   UCX  DEBUG cmd line: /home/h4/jschuch/src/test/./test_mpi_rget_fetch_op 
[1563195672.373363] [taurusi6605:32400:0]          debug.c:1099 UCX  DEBUG using signal stack 0x2b48f8a77000 size 141824
[1563195672.373379] [taurusi6605:32400:0]        profile.c:302  UCX  TRACE profiling is disabled
[1563195672.373394] [taurusi6605:32400:0]           init.c:89   UCX  DEBUG /home/jschuch/opt/ucx-1.5.2/lib/libucs.so.0 loaded at 0x2b48f9756000
[1563195672.373412] [taurusi6605:32400:0]           init.c:90   UCX  DEBUG cmd line: /home/h4/jschuch/src/test/./test_mpi_rget_fetch_op 
[1563195672.374687] [taurusi6605:32400:0]    ucp_context.c:1069 UCX  DEBUG Estimated number of endpoints is 2
[1563195672.374693] [taurusi6605:32400:0]    ucp_context.c:1105 UCX  DEBUG allocation method[0] is md 'sysv'
[1563195672.374694] [taurusi6605:32400:0]    ucp_context.c:1105 UCX  DEBUG allocation method[1] is md 'posix'
[1563195672.374698] [taurusi6605:32400:0]    ucp_context.c:1117 UCX  DEBUG allocation method[2] is 'huge'
[1563195672.374699] [taurusi6605:32400:0]    ucp_context.c:1117 UCX  DEBUG allocation method[3] is 'thp'
[1563195672.374701] [taurusi6605:32400:0]    ucp_context.c:1105 UCX  DEBUG allocation method[4] is md '*'
[1563195672.374702] [taurusi6605:32400:0]    ucp_context.c:1117 UCX  DEBUG allocation method[5] is 'mmap'
[1563195672.374703] [taurusi6605:32400:0]    ucp_context.c:1117 UCX  DEBUG allocation method[6] is 'heap'
[1563195672.375295] [taurusi6605:32400:0]    ucp_context.c:556  UCX  TRACE allowed transport 0 : 'all'
[1563195672.375302] [taurusi6605:32400:0]    ucp_context.c:347  UCX  TRACE enabling tl 'all'
[1563195672.375304] [taurusi6605:32400:0]    ucp_context.c:473  UCX  TRACE self/self is enabled
[1563195672.375342] [taurusi6605:32400:0]            sys.c:700  UCX  DEBUG ioctl(req=35093, ifr_name=.) failed: No such device
[1563195672.375353] [taurusi6605:32400:0]            sys.c:700  UCX  DEBUG ioctl(req=35093, ifr_name=..) failed: No such device
[1563195672.375389] [taurusi6605:32400:0]            sys.c:700  UCX  DEBUG ioctl(req=35093, ifr_name=enp7s0f1) failed: Cannot assign requested address
[1563195672.375398] [taurusi6605:32400:0]    ucp_context.c:556  UCX  TRACE allowed transport 0 : 'all'
[1563195672.375400] [taurusi6605:32400:0]    ucp_context.c:347  UCX  TRACE enabling tl 'all'
[1563195672.375401] [taurusi6605:32400:0]    ucp_context.c:473  UCX  TRACE tcp/virbr0 is enabled
[1563195672.375403] [taurusi6605:32400:0]    ucp_context.c:347  UCX  TRACE enabling tl 'all'
[1563195672.375404] [taurusi6605:32400:0]    ucp_context.c:473  UCX  TRACE tcp/ib0 is enabled
[1563195672.375405] [taurusi6605:32400:0]    ucp_context.c:347  UCX  TRACE enabling tl 'all'
[1563195672.375406] [taurusi6605:32400:0]    ucp_context.c:473  UCX  TRACE tcp/enp7s0f0 is enabled
[1563195672.375435] [taurusi6605:32400:0]          ib_md.c:1396 UCX  TRACE opening IB device ib/mlx5_0
[1563195672.380814] [taurusi6605:32400:0]      ib_device.c:347  UCX  DEBUG initialized device 'mlx5_0' (InfiniBand channel adapter) with 1 ports
[1563195672.386339] [taurusi6605:32400:0]          ib_md.c:298  UCX  DEBUG initialized UMR QP 0x387bd, max_inline_klm_list 20
[1563195672.386366] [taurusi6605:32400:0]          mpool.c:77   UCX  DEBUG mpool rcache_inv_mp: align 1, maxelems 4294967295, elemsize 32
[1563195672.387608] [taurusi6605:32400:0]          ib_md.c:1243 UCX  DEBUG mlx5_0: using registration cache
[1563195672.387629] [taurusi6605:32400:0]      ib_device.c:707  UCX  DEBUG found usable port for tl rc mlx5_0:1
[1563195672.387634] [taurusi6605:32400:0]      ib_device.c:466  UCX  TRACE mlx5_0:1 (Generic HCA) does not support flags 0x4
[1563195672.387639] [taurusi6605:32400:0]      ib_device.c:695  UCX  TRACE mlx5_0:1 does not support flags 0x4: Unsupported operation
[1563195672.387640] [taurusi6605:32400:0]      ib_device.c:712  UCX  DEBUG no compatible IB ports found for flags 0x4
[1563195672.387642] [taurusi6605:32400:0]         uct_md.c:164  UCX  DEBUG Failed to query rc_mlx5 resources: No such device
[1563195672.387650] [taurusi6605:32400:0]      ib_device.c:707  UCX  DEBUG found usable port for tl dc mlx5_0:1
[1563195672.387653] [taurusi6605:32400:0]      ib_device.c:466  UCX  TRACE mlx5_0:1 (Generic HCA) does not support flags 0x4
[1563195672.387661] [taurusi6605:32400:0]      ib_device.c:695  UCX  TRACE mlx5_0:1 does not support flags 0xc4: Unsupported operation
[1563195672.387662] [taurusi6605:32400:0]      ib_device.c:712  UCX  DEBUG no compatible IB ports found for flags 0xc4
[1563195672.387663] [taurusi6605:32400:0]         uct_md.c:164  UCX  DEBUG Failed to query dc_mlx5 resources: No such device
[1563195672.387668] [taurusi6605:32400:0]      ib_device.c:707  UCX  DEBUG found usable port for tl ud mlx5_0:1
[1563195672.387671] [taurusi6605:32400:0]      ib_device.c:466  UCX  TRACE mlx5_0:1 (Generic HCA) does not support flags 0x4
[1563195672.387673] [taurusi6605:32400:0]      ib_device.c:695  UCX  TRACE mlx5_0:1 does not support flags 0x4: Unsupported operation
[1563195672.387674] [taurusi6605:32400:0]      ib_device.c:712  UCX  DEBUG no compatible IB ports found for flags 0x4
[1563195672.387675] [taurusi6605:32400:0]         uct_md.c:164  UCX  DEBUG Failed to query ud_mlx5 resources: No such device
[1563195672.387679] [taurusi6605:32400:0]      ib_device.c:707  UCX  DEBUG found usable port for tl cm mlx5_0:1
[1563195672.387681] [taurusi6605:32400:0]    ucp_context.c:556  UCX  TRACE allowed transport 0 : 'all'
[1563195672.387684] [taurusi6605:32400:0]    ucp_context.c:347  UCX  TRACE enabling tl 'all'
[1563195672.387685] [taurusi6605:32400:0]    ucp_context.c:473  UCX  TRACE rc/mlx5_0:1 is enabled
[1563195672.387687] [taurusi6605:32400:0]    ucp_context.c:347  UCX  TRACE enabling tl 'all'
[1563195672.387688] [taurusi6605:32400:0]    ucp_context.c:473  UCX  TRACE dc/mlx5_0:1 is enabled
[1563195672.387690] [taurusi6605:32400:0]    ucp_context.c:347  UCX  TRACE enabling tl 'all'
[1563195672.387691] [taurusi6605:32400:0]    ucp_context.c:473  UCX  TRACE ud/mlx5_0:1 is enabled
[1563195672.387692] [taurusi6605:32400:0]    ucp_context.c:347  UCX  TRACE enabling tl 'all'
[1563195672.387693] [taurusi6605:32400:0]    ucp_context.c:473  UCX  TRACE cm/mlx5_0:1 is enabled
[1563195672.387704] [taurusi6605:32400:0]    ucp_context.c:556  UCX  TRACE allowed transport 0 : 'all'
[1563195672.387707] [taurusi6605:32400:0]    ucp_context.c:347  UCX  TRACE enabling tl 'all'
[1563195672.387708] [taurusi6605:32400:0]    ucp_context.c:473  UCX  TRACE rdmacm/sockaddr is enabled
[1563195672.387723] [taurusi6605:32400:0]    ucp_context.c:556  UCX  TRACE allowed transport 0 : 'all'
[1563195672.387725] [taurusi6605:32400:0]    ucp_context.c:347  UCX  TRACE enabling tl 'all'
[1563195672.387726] [taurusi6605:32400:0]    ucp_context.c:473  UCX  TRACE mm/sysv is enabled
[1563195672.387732] [taurusi6605:32400:0]    ucp_context.c:556  UCX  TRACE allowed transport 0 : 'all'
[1563195672.387733] [taurusi6605:32400:0]    ucp_context.c:347  UCX  TRACE enabling tl 'all'
[1563195672.387734] [taurusi6605:32400:0]    ucp_context.c:473  UCX  TRACE mm/posix is enabled
[1563195672.387738] [taurusi6605:32400:0]    ucp_context.c:556  UCX  TRACE allowed transport 0 : 'all'
[1563195672.387739] [taurusi6605:32400:0]    ucp_context.c:347  UCX  TRACE enabling tl 'all'
[1563195672.387740] [taurusi6605:32400:0]    ucp_context.c:473  UCX  TRACE cma/cma is enabled
[1563195672.387761] [taurusi6605:32400:0]          mpool.c:77   UCX  DEBUG mpool rcache_inv_mp: align 1, maxelems 4294967295, elemsize 32
[1563195672.387765] [taurusi6605:32400:0]    ucp_context.c:556  UCX  TRACE allowed transport 0 : 'all'
[1563195672.387767] [taurusi6605:32400:0]    ucp_context.c:347  UCX  TRACE enabling tl 'all'
[1563195672.387768] [taurusi6605:32400:0]    ucp_context.c:473  UCX  TRACE knem/knem is enabled
[1563195672.387776] [taurusi6605:32400:0]    ucp_context.c:347  UCX  TRACE enabling tl 'ud'
[1563195672.387777] [taurusi6605:32400:0]    ucp_context.c:347  UCX  TRACE enabling tl 'ud_x'
[1563195672.387782] [taurusi6605:32400:0]    ucp_context.c:347  UCX  TRACE enabling tl 'ud'
[1563195672.387783] [taurusi6605:32400:0]    ucp_context.c:347  UCX  TRACE enabling tl 'ud_x'
[1563195672.387787] [taurusi6605:32400:0]    ucp_context.c:347  UCX  TRACE enabling tl 'ud'
[1563195672.387792] [taurusi6605:32400:0]    ucp_context.c:347  UCX  TRACE enabling tl 'ud_x'
[1563195672.387796] [taurusi6605:32400:0]    ucp_context.c:347  UCX  TRACE enabling tl 'ud'
[1563195672.387797] [taurusi6605:32400:0]    ucp_context.c:347  UCX  TRACE enabling tl 'ud_x'
[1563195672.387801] [taurusi6605:32400:0]    ucp_context.c:347  UCX  TRACE enabling tl 'ud'
[1563195672.387802] [taurusi6605:32400:0]    ucp_context.c:347  UCX  TRACE enabling tl 'ud_x'
[1563195672.387807] [taurusi6605:32400:0]    ucp_context.c:347  UCX  TRACE enabling tl 'ud'
[1563195672.387808] [taurusi6605:32400:0]    ucp_context.c:347  UCX  TRACE enabling tl 'ud_x'
[1563195672.387811] [taurusi6605:32400:0]    ucp_context.c:347  UCX  TRACE enabling tl 'ud'
[1563195672.387813] [taurusi6605:32400:0]    ucp_context.c:347  UCX  TRACE enabling tl 'ud'
[1563195672.387815] [taurusi6605:32400:0]    ucp_context.c:347  UCX  TRACE enabling tl 'ud_x'
[1563195672.387819] [taurusi6605:32400:0]    ucp_context.c:347  UCX  TRACE enabling tl 'ud'
[1563195672.387820] [taurusi6605:32400:0]    ucp_context.c:347  UCX  TRACE enabling tl 'rdmacm'for alias 'ud'
[1563195672.387822] [taurusi6605:32400:0]    ucp_context.c:347  UCX  TRACE enabling tl 'ud'
[1563195672.387823] [taurusi6605:32400:0]    ucp_context.c:347  UCX  TRACE enabling tl 'ud_x'
[1563195672.387827] [taurusi6605:32400:0]    ucp_context.c:347  UCX  TRACE enabling tl 'ud'
[1563195672.387828] [taurusi6605:32400:0]    ucp_context.c:347  UCX  TRACE enabling tl 'ud_x'
[1563195672.387832] [taurusi6605:32400:0]    ucp_context.c:347  UCX  TRACE enabling tl 'ud'
[1563195672.387834] [taurusi6605:32400:0]    ucp_context.c:347  UCX  TRACE enabling tl 'ud_x'
[1563195672.387837] [taurusi6605:32400:0]    ucp_context.c:347  UCX  TRACE enabling tl 'ud'
[1563195672.387839] [taurusi6605:32400:0]    ucp_context.c:347  UCX  TRACE enabling tl 'ud_x'
[1563195672.387843] [taurusi6605:32400:0]          mpool.c:77   UCX  DEBUG mpool ucp_rkeys: align 64, maxelems 4294967295, elemsize 136
[1563195672.387845] [taurusi6605:32400:0]    ucp_context.c:1230 UCX  DEBUG created ucp context 0x18f16b0 [8 mds 13 tls] features 0x1
[1563195672.406328] [taurusi6605:32400:0]            sys.c:156  UCX  TRACE MAC address is 0x61763C380008
[1563195672.406364] [taurusi6605:32400:0]          mpool.c:77   UCX  DEBUG mpool ucp_requests: align 64, maxelems 4294967295, elemsize 400
[1563195672.406433] [taurusi6605:32400:0]          mpool.c:77   UCX  DEBUG mpool self_msg_desc: align 64, maxelems 4294967295, elemsize 8200
[1563195672.406435] [taurusi6605:32400:0]           self.c:182  UCX  DEBUG created self iface id 0x5034e09195255bbd send_size 8192
[1563195672.406440] [taurusi6605:32400:0]     ucp_worker.c:857  UCX  DEBUG created interface[0]=0x195bc60 using self/self on worker 0x2b48f9d73010
[1563195672.406447] [taurusi6605:32400:0]     ucp_worker.c:557  UCX  TRACE activate iface 0x195bc60 acount=0 aifaces=0
[1563195672.406500] [taurusi6605:32400:0]      tcp_iface.c:343  UCX  DEBUG tcp_iface 0x195c3c0: listening for connections on 192.168.0.254:47854
[1563195672.412483] [taurusi6605:32400:0]          mpool.c:189  UCX  DEBUG mpool rcache_inv_mp: allocated chunk 0x2b48f90c9008 of 36856 bytes with 1151 elements
[1563195672.412515] [taurusi6605:32400:0]          mpool.c:189  UCX  DEBUG mpool rcache_inv_mp: allocated chunk 0x2b48f90d2008 of 36856 bytes with 1151 elements
*** Error in `/home/h4/jschuch/src/test/./test_mpi_rget_fetch_op': free(): invalid pointer: 0x00002b48f7ae17b8 ***
======= Backtrace: =========
/usr/lib64/libc.so.6(+0x7c619)[0x2b48f77a1619]
ucx-1.5.2/lib/libucs.so.0(ucs_debug_cleanup+0x62)[0x2b48f97a21a2]
ucx-1.5.2/lib/libucs.so.0(+0x4c842)[0x2b48f97a2842]
/usr/lib64/libpthread.so.0(+0xf5e0)[0x2b48f75185e0]
ucx-1.5.2/lib/libucs.so.0(+0xde5f1)[0x2b48f98345f1]
ucx-1.5.2/lib/libucs.so.0(+0x73f7d)[0x2b48f97c9f7d]
ucx-1.5.2/lib/libucs.so.0(+0x4a469)[0x2b48f97a0469]
ucx-1.5.2/lib/libucs.so.0(+0x57c2c)[0x2b48f97adc2c]
ucx-1.5.2/lib/libucs.so.0(ucs_debug_lookup_address+0x119)[0x2b48f97a0ef9]
ucx-1.5.2/lib/libucs.so.0(ucs_debug_get_symbol_name+0x519)[0x2b48f97a18b9]
ucx-1.5.2/lib/libucs.so.0(+0x3c4e6)[0x2b48f97924e6]
ucx-1.5.2/lib/libucs.so.0(+0x3ceaf)[0x2b48f9792eaf]
ucx-1.5.2/lib/libucs.so.0(ucs_async_set_event_handler+0x3e)[0x2b48f979305e]
ucx-1.5.2/lib/libuct.so.0(+0x28c48)[0x2b48f8b0dc48]
ucx-1.5.2/lib/libuct.so.0(uct_iface_open+0x75)[0x2b48f8b08185]
ucx-1.5.2/lib/libucp.so.0(ucp_worker_iface_init+0x1bb)[0x2b48f89f07ab]
ucx-1.5.2/lib/libucp.so.0(ucp_worker_create+0x372)[0x2b48f89f1982]
openmpi-git-ucx/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_init+0x67)[0x2b48f89c3447]
openmpi-git-ucx/lib/openmpi/mca_pml_ucx.so(+0x849c)[0x2b48f89c549c]
openmpi-git-ucx/lib/libmpi.so.0(mca_pml_base_select+0x1d3)[0x2b48f73d7703]
openmpi-git-ucx/lib/libmpi.so.0(ompi_mpi_init+0x6ca)[0x2b48f73e462a]
openmpi-git-ucx/lib/libmpi.so.0(MPI_Init+0x5e)[0x2b48f7388b0e]
./test_mpi_rget_fetch_op[0x400c64]
/usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b48f7746c05]
./test_mpi_rget_fetch_op[0x400b99]
@hjelmn
Copy link
Member

hjelmn commented Jul 15, 2019 via email

@devreal
Copy link
Contributor Author

devreal commented Jul 15, 2019

@hjelmn In that case, the test case gets stuck in MPI_Win_flush:

Process 1:

#14 main (argc=3, argv=0x7fffffffb928) at test_mpi_rget_fetch_op.c:36 (at 0x0000000000400d84)
#13 PMPI_Win_flush () from openmpi-git-ucx/lib/libmpi.so.0 (at 0x00002aaaaab674a1)
#12 ompi_osc_pt2pt_flush () from openmpi-git-ucx/lib/openmpi/mca_osc_pt2pt.so (at 0x00002aaaad842d3d)
#11 ompi_osc_pt2pt_flush_lock () from openmpi-git-ucx/lib/openmpi/mca_osc_pt2pt.so (at 0x00002aaaad840b15)
#10 opal_progress () from openmpi-git-ucx/lib/libopen-pal.so.0 (at 0x00002aaaab2e4d1c)
#9 mca_pml_ucx_progress () from openmpi-git-ucx/lib/openmpi/mca_pml_ucx.so (at 0x00002aaaac8d07d7)
#8 ucp_worker_progress (worker=0x2aaaad879010) at openucx/ucx-1.5.2/build/src/ucp/../../../src/ucp/core/ucp_worker.c:1426 (at 0x00002aaaac170ee2)
#7 uct_worker_progress (worker=<optimized out>) at openucx/ucx-1.5.2/build/../src/uct/api/uct.h:1677 (at 0x00002aaaac170ee2)
#6 ucs_callbackq_dispatch (cbq=<optimized out>) at openucx/ucx-1.5.2/build/../src/ucs/datastruct/callbackq.h:209 (at 0x00002aaaac170ee2)
#5 uct_rc_verbs_iface_progress (arg=0x4f6770) at openucx/ucx-1.5.2/build/src/uct/../../../src/uct/ib/rc/verbs/rc_verbs_iface.c:116 (at 0x00002aaaac206347)
#4 uct_rc_verbs_iface_poll_tx (iface=0x4f6770) at openucx/ucx-1.5.2/build/src/uct/../../../src/uct/ib/rc/verbs/rc_verbs_iface.c:83 (at 0x00002aaaac206347)
#3 uct_ib_poll_cq (wcs=0x7fffffffb310, count=<synthetic pointer>, cq=<optimized out>) at openucx/ucx-1.5.2/build/../src/uct/ib/base/ib_device.h:289 (at 0x00002aaaac206347)
#2 ibv_poll_cq (wc=0x7fffffffb310, num_entries=<optimized out>, cq=<optimized out>) at /usr/include/infiniband/verbs.h:1458 (at 0x00002aaaac206347)
#1 mlx5_poll_cq () from /usr/lib64/libmlx5.so.1 (at 0x00002aaaac4f0ea7)
#0 pthread_spin_lock () from /usr/lib64/libpthread.so.0 (at 0x00002aaaaacdb483)

Process 0:

#13 main (argc=3, argv=0x7fffffffb928) at test/test_mpi_rget_fetch_op.c:44 (at 0x0000000000400dab)
#12 PMPI_Barrier () from openmpi-git-ucx/lib/libmpi.so.0 (at 0x00002aaaaab39e45)
#11 ompi_coll_base_barrier_intra_two_procs () from openmpi-git-ucx/lib/libmpi.so.0 (at 0x00002aaaaab84a32)
#10 ompi_request_default_wait () from openmpi-git-ucx/lib/libmpi.so.0 (at 0x00002aaaaab21255)
#9 opal_progress () from openmpi-git-ucx/lib/libopen-pal.so.0 (at 0x00002aaaab2e4d1c)
#8 mca_pml_ucx_progress () from openmpi-git-ucx/lib/openmpi/mca_pml_ucx.so (at 0x00002aaaac8d07d7)
#7 ucp_worker_progress (worker=0x2aaaad879010) at openucx/ucx-1.5.2/build/src/ucp/../../../src/ucp/core/ucp_worker.c:1426 (at 0x00002aaaac170ee2)
#6 uct_worker_progress (worker=<optimized out>) at openucx/ucx-1.5.2/build/../src/uct/api/uct.h:1677 (at 0x00002aaaac170ee2)
#5 ucs_callbackq_dispatch (cbq=<optimized out>) at openucx/ucx-1.5.2/build/../src/ucs/datastruct/callbackq.h:209 (at 0x00002aaaac170ee2)
#4 uct_rc_verbs_iface_progress (arg=0x4f6770) at openucx/ucx-1.5.2/build/src/uct/../../../src/uct/ib/rc/verbs/rc_verbs_iface.c:111 (at 0x00002aaaac2061ff)
#3 uct_rc_verbs_iface_poll_rx_common (iface=0x4f6770) at openucx/ucx-1.5.2/build/src/uct/../../../src/uct/ib/rc/verbs/rc_verbs_common.h:182 (at 0x00002aaaac2061ff)
#2 uct_ib_poll_cq (wcs=0x7fffffffb2b0, count=<synthetic pointer>, cq=<optimized out>) at openucx/ucx-1.5.2/build/../src/uct/ib/base/ib_device.h:289 (at 0x00002aaaac2061ff)
#1 ibv_poll_cq (wc=0x7fffffffb2b0, num_entries=<optimized out>, cq=<optimized out>) at /usr/include/infiniband/verbs.h:1458 (at 0x00002aaaac2061ff)
#0 mlx5_poll_cq () from /usr/lib64/libmlx5.so.1 (at 0x00002aaaac4f0e49)

@hjelmn
Copy link
Member

hjelmn commented Jul 15, 2019

Opps, fat-fingered the option: --mca btl_uct_memory_domains ib/mlx5_0 . Now I will need to track down why osc/pt2pt is hanging...

@devreal
Copy link
Contributor Author

devreal commented Jul 15, 2019

With --mca btl_uct_memory_domains ib/mlx5_0 the test runs through, thanks 👍

@devreal
Copy link
Contributor Author

devreal commented Jul 15, 2019

@hjelmn Should the osc/pt2pt issue be tracked separately?

@artpol84
Copy link
Contributor

@devreal we are currently in the process of verification of the master's implementation.
It was redesigned to better support multi-threads.
I had your test from #6552 in my todo list. We will return to this once we are done with the issues in the 4.0.x branch.
Thank you, for checking.

@hjelmn
Copy link
Member

hjelmn commented Jul 15, 2019

@devreal We should track the osc/pt2pt issue separately. It is still on my TODO list to eventually kill it but it needs to work until that point.

@janjust
Copy link
Contributor

janjust commented Jul 22, 2019

@devreal , can you try this one with ucx v1.6.x as well?

@devreal
Copy link
Contributor Author

devreal commented Jul 22, 2019

@janjust Just tested master in combination with UCX 1.6.x and the problem persists.

@janjust
Copy link
Contributor

janjust commented Aug 13, 2019

@devreal , so I can reproduce the issue with v1.5.x, but I cannot with 1.6.x or master ucx version, are you 100% sure you're using 1.6.x?

@devreal
Copy link
Contributor Author

devreal commented Aug 16, 2019

I have confirmed that the issue still occurs on latest Open MPI master and latest OpenUCX 1.6.x branch:

$ export PATH=$HOME/opt-vulcan/openmpi-git/bin:$PATH
$ ompi_info | grep revision
  Open MPI repo revision: v2.x-dev-7103-gd58c59e
$ ompi_info | grep ucx
  Configure command line: 'CC=gcc' 'CXX=g++' 'FTN=gfortran' '--prefix=$HOME/opt-vulcan/openmpi-git' '--with-ucx=$HOME/opt-vulcan/ucx-1.6.x-gnu/'
$ $HOME/opt-vulcan/ucx-1.6.x-gnu/bin/ucx_info -v
# UCT version=1.6.1 revision 736d503
$ mpicc test_mpi_rget_fetch_op.c -o test_mpi_rget_fetch_op
$ mpirun -n 2 -N 1 ./test_mpi_rget_fetch_op
<hangs>

[paths altered by me]

Interestingly, if I modify the test to have the target process lock and unlock the window first (and synchronized with the other process through a barrier) the test succeeds.

On the Bull system I have access to the same happens. If I set UCX_IB_DEVICE_SPECS=0x119f:4113:ConnectIB:5 (as suggested for this system in #6868), it does not hang. I thus suspect the SW atomics to be at fault.

@hppritcha
Copy link
Member

@artpol84 is this issue resolved?

@devreal
Copy link
Contributor Author

devreal commented Mar 11, 2021

I believe this was fixed with #7632. Closing

@devreal devreal closed this as completed Mar 11, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants