-
Notifications
You must be signed in to change notification settings - Fork 877
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
UCX hangs in MPI_Win_lock on master #6816
Comments
Might be worth running with --mca osc ^ucx --mca btl_uct_memory_domains mlx5_0 as a sanity check.
…-Nathan
On Jul 15, 2019, at 7:14 AM, Joseph Schuchart ***@***.***> wrote:
I'm trying to run the example from #6552 on master (020a591) with UCX 1.5.2 (built from release tarball), which leads to process 1 being stuck in MPI_Win_lock and process 0 waiting in MPI_Barrier:
Process 0:
#6 main (argc=3, argv=0x7fffffffb208) at test_mpi_rget_fetch_op.c:44 (at 0x0000000000400dab)
#5 PMPI_Barrier () from openmpi-git-ucx/lib/libmpi.so.0 (at 0x00002aaaaab39e45)
#4 ompi_coll_base_barrier_intra_two_procs () from openmpi-git-ucx/lib/libmpi.so.0 (at 0x00002aaaaab84a32)
#3 ompi_request_default_wait () from openmpi-git-ucx/lib/libmpi.so.0 (at 0x00002aaaaab21255)
#2 opal_progress () from openmpi-git-ucx/lib/libopen-pal.so.0 (at 0x00002aaaab2e4d1c)
#1 progress_callback () from openmpi-git-ucx/lib/openmpi/mca_osc_ucx.so (at 0x00002aaaad52835c)
#0 pthread_mutex_unlock () from /usr/lib64/libpthread.so.0 (at 0x00002aaaaacd9e53)
Process 1:
#6 main (argc=3, argv=0x7fffffffb208) at test_mpi_rget_fetch_op.c:28 (at 0x0000000000400cf1)
#5 PMPI_Win_lock () from openmpi-git-ucx/lib/libmpi.so.0 (at 0x00002aaaaab6868d)
#4 ompi_osc_ucx_lock () from openmpi-git-ucx/lib/openmpi/mca_osc_ucx.so (at 0x00002aaaad52c067)
#3 ucp_worker_progress (worker=0x2aaaaf02b010) at openucx/ucx-1.5.2/build/src/ucp/../../../src/ucp/core/ucp_worker.c:1426 (at 0x00002aaaac1b9ee2)
#2 uct_worker_progress (worker=<optimized out>) at openucx/ucx-1.5.2/build/../src/uct/api/uct.h:1677 (at 0x00002aaaac1b9ee2)
#1 ucs_callbackq_dispatch (cbq=<optimized out>) at openucx/ucx-1.5.2/build/../src/ucs/datastruct/callbackq.h:209 (at 0x00002aaaac1b9ee2)
#0 uct_ud_verbs_iface_progress (tl_iface=0x6ae610) at openucx/ucx-1.5.2/build/src/uct/../../../src/uct/ib/ud/verbs/ud_verbs.c:376 (at 0x00002aaaac336294)
Process 1 seems to never get back from ompi_osc_ucx_lock.
Infiniband devices:
$ srun ibv_devices
device node GUID
------ ----------------
mlx5_0 08003800013c7663
device node GUID
------ ----------------
mlx5_0 08003800013c7507
ibverbs version:
$ rpm -q libibverbs
libibverbs-41mlnx1-OFED.4.2.0.1.4.42100.x86_64
Open MPI was configured using:
../configure CC=gcc CXX=g++ FC=gfortran --with-ucx=$HOME/opt/ucx-1.5.2 --prefix=$HOME/opt/openmpi-git-ucx --with-slurm --with-pmi=/usr --with-pmi-libdir=/usr/lib64
In case it matters: this is a different system than the one used in #6552, mainly due to #6815 preventing me to run on the original system. The main difference is that the current system uses SLURM instead of PBS but I don't think that has any impact on UCX.
Potentially related: #6549 (although that was reported on v4.0.x and managed to get an exclusive lock)
Curiously, trying to run with UCX logging enabled leads to memory corruption:
$ UCX_LOG_LEVEL=trace srun ./test_mpi_rget_fetch_op
[1563195672.358175] [taurusi6605:32400:0] debug.c:1099 UCX DEBUG using signal stack 0x2b48f90e2000 size 141824
[1563195672.358199] [taurusi6605:32400:0] profile.c:302 UCX TRACE profiling is disabled
[1563195672.358214] [taurusi6605:32400:0] init.c:89 UCX DEBUG /home/jschuch/opt/ucx-1.5.2/lib/libucs.so.0 loaded at 0x2b48f8f5e000
[1563195672.358245] [taurusi6605:32400:0] init.c:90 UCX DEBUG cmd line: /home/h4/jschuch/src/test/./test_mpi_rget_fetch_op
[1563195672.373363] [taurusi6605:32400:0] debug.c:1099 UCX DEBUG using signal stack 0x2b48f8a77000 size 141824
[1563195672.373379] [taurusi6605:32400:0] profile.c:302 UCX TRACE profiling is disabled
[1563195672.373394] [taurusi6605:32400:0] init.c:89 UCX DEBUG /home/jschuch/opt/ucx-1.5.2/lib/libucs.so.0 loaded at 0x2b48f9756000
[1563195672.373412] [taurusi6605:32400:0] init.c:90 UCX DEBUG cmd line: /home/h4/jschuch/src/test/./test_mpi_rget_fetch_op
[1563195672.374687] [taurusi6605:32400:0] ucp_context.c:1069 UCX DEBUG Estimated number of endpoints is 2
[1563195672.374693] [taurusi6605:32400:0] ucp_context.c:1105 UCX DEBUG allocation method[0] is md 'sysv'
[1563195672.374694] [taurusi6605:32400:0] ucp_context.c:1105 UCX DEBUG allocation method[1] is md 'posix'
[1563195672.374698] [taurusi6605:32400:0] ucp_context.c:1117 UCX DEBUG allocation method[2] is 'huge'
[1563195672.374699] [taurusi6605:32400:0] ucp_context.c:1117 UCX DEBUG allocation method[3] is 'thp'
[1563195672.374701] [taurusi6605:32400:0] ucp_context.c:1105 UCX DEBUG allocation method[4] is md '*'
[1563195672.374702] [taurusi6605:32400:0] ucp_context.c:1117 UCX DEBUG allocation method[5] is 'mmap'
[1563195672.374703] [taurusi6605:32400:0] ucp_context.c:1117 UCX DEBUG allocation method[6] is 'heap'
[1563195672.375295] [taurusi6605:32400:0] ucp_context.c:556 UCX TRACE allowed transport 0 : 'all'
[1563195672.375302] [taurusi6605:32400:0] ucp_context.c:347 UCX TRACE enabling tl 'all'
[1563195672.375304] [taurusi6605:32400:0] ucp_context.c:473 UCX TRACE self/self is enabled
[1563195672.375342] [taurusi6605:32400:0] sys.c:700 UCX DEBUG ioctl(req=35093, ifr_name=.) failed: No such device
[1563195672.375353] [taurusi6605:32400:0] sys.c:700 UCX DEBUG ioctl(req=35093, ifr_name=..) failed: No such device
[1563195672.375389] [taurusi6605:32400:0] sys.c:700 UCX DEBUG ioctl(req=35093, ifr_name=enp7s0f1) failed: Cannot assign requested address
[1563195672.375398] [taurusi6605:32400:0] ucp_context.c:556 UCX TRACE allowed transport 0 : 'all'
[1563195672.375400] [taurusi6605:32400:0] ucp_context.c:347 UCX TRACE enabling tl 'all'
[1563195672.375401] [taurusi6605:32400:0] ucp_context.c:473 UCX TRACE tcp/virbr0 is enabled
[1563195672.375403] [taurusi6605:32400:0] ucp_context.c:347 UCX TRACE enabling tl 'all'
[1563195672.375404] [taurusi6605:32400:0] ucp_context.c:473 UCX TRACE tcp/ib0 is enabled
[1563195672.375405] [taurusi6605:32400:0] ucp_context.c:347 UCX TRACE enabling tl 'all'
[1563195672.375406] [taurusi6605:32400:0] ucp_context.c:473 UCX TRACE tcp/enp7s0f0 is enabled
[1563195672.375435] [taurusi6605:32400:0] ib_md.c:1396 UCX TRACE opening IB device ib/mlx5_0
[1563195672.380814] [taurusi6605:32400:0] ib_device.c:347 UCX DEBUG initialized device 'mlx5_0' (InfiniBand channel adapter) with 1 ports
[1563195672.386339] [taurusi6605:32400:0] ib_md.c:298 UCX DEBUG initialized UMR QP 0x387bd, max_inline_klm_list 20
[1563195672.386366] [taurusi6605:32400:0] mpool.c:77 UCX DEBUG mpool rcache_inv_mp: align 1, maxelems 4294967295, elemsize 32
[1563195672.387608] [taurusi6605:32400:0] ib_md.c:1243 UCX DEBUG mlx5_0: using registration cache
[1563195672.387629] [taurusi6605:32400:0] ib_device.c:707 UCX DEBUG found usable port for tl rc mlx5_0:1
[1563195672.387634] [taurusi6605:32400:0] ib_device.c:466 UCX TRACE mlx5_0:1 (Generic HCA) does not support flags 0x4
[1563195672.387639] [taurusi6605:32400:0] ib_device.c:695 UCX TRACE mlx5_0:1 does not support flags 0x4: Unsupported operation
[1563195672.387640] [taurusi6605:32400:0] ib_device.c:712 UCX DEBUG no compatible IB ports found for flags 0x4
[1563195672.387642] [taurusi6605:32400:0] uct_md.c:164 UCX DEBUG Failed to query rc_mlx5 resources: No such device
[1563195672.387650] [taurusi6605:32400:0] ib_device.c:707 UCX DEBUG found usable port for tl dc mlx5_0:1
[1563195672.387653] [taurusi6605:32400:0] ib_device.c:466 UCX TRACE mlx5_0:1 (Generic HCA) does not support flags 0x4
[1563195672.387661] [taurusi6605:32400:0] ib_device.c:695 UCX TRACE mlx5_0:1 does not support flags 0xc4: Unsupported operation
[1563195672.387662] [taurusi6605:32400:0] ib_device.c:712 UCX DEBUG no compatible IB ports found for flags 0xc4
[1563195672.387663] [taurusi6605:32400:0] uct_md.c:164 UCX DEBUG Failed to query dc_mlx5 resources: No such device
[1563195672.387668] [taurusi6605:32400:0] ib_device.c:707 UCX DEBUG found usable port for tl ud mlx5_0:1
[1563195672.387671] [taurusi6605:32400:0] ib_device.c:466 UCX TRACE mlx5_0:1 (Generic HCA) does not support flags 0x4
[1563195672.387673] [taurusi6605:32400:0] ib_device.c:695 UCX TRACE mlx5_0:1 does not support flags 0x4: Unsupported operation
[1563195672.387674] [taurusi6605:32400:0] ib_device.c:712 UCX DEBUG no compatible IB ports found for flags 0x4
[1563195672.387675] [taurusi6605:32400:0] uct_md.c:164 UCX DEBUG Failed to query ud_mlx5 resources: No such device
[1563195672.387679] [taurusi6605:32400:0] ib_device.c:707 UCX DEBUG found usable port for tl cm mlx5_0:1
[1563195672.387681] [taurusi6605:32400:0] ucp_context.c:556 UCX TRACE allowed transport 0 : 'all'
[1563195672.387684] [taurusi6605:32400:0] ucp_context.c:347 UCX TRACE enabling tl 'all'
[1563195672.387685] [taurusi6605:32400:0] ucp_context.c:473 UCX TRACE rc/mlx5_0:1 is enabled
[1563195672.387687] [taurusi6605:32400:0] ucp_context.c:347 UCX TRACE enabling tl 'all'
[1563195672.387688] [taurusi6605:32400:0] ucp_context.c:473 UCX TRACE dc/mlx5_0:1 is enabled
[1563195672.387690] [taurusi6605:32400:0] ucp_context.c:347 UCX TRACE enabling tl 'all'
[1563195672.387691] [taurusi6605:32400:0] ucp_context.c:473 UCX TRACE ud/mlx5_0:1 is enabled
[1563195672.387692] [taurusi6605:32400:0] ucp_context.c:347 UCX TRACE enabling tl 'all'
[1563195672.387693] [taurusi6605:32400:0] ucp_context.c:473 UCX TRACE cm/mlx5_0:1 is enabled
[1563195672.387704] [taurusi6605:32400:0] ucp_context.c:556 UCX TRACE allowed transport 0 : 'all'
[1563195672.387707] [taurusi6605:32400:0] ucp_context.c:347 UCX TRACE enabling tl 'all'
[1563195672.387708] [taurusi6605:32400:0] ucp_context.c:473 UCX TRACE rdmacm/sockaddr is enabled
[1563195672.387723] [taurusi6605:32400:0] ucp_context.c:556 UCX TRACE allowed transport 0 : 'all'
[1563195672.387725] [taurusi6605:32400:0] ucp_context.c:347 UCX TRACE enabling tl 'all'
[1563195672.387726] [taurusi6605:32400:0] ucp_context.c:473 UCX TRACE mm/sysv is enabled
[1563195672.387732] [taurusi6605:32400:0] ucp_context.c:556 UCX TRACE allowed transport 0 : 'all'
[1563195672.387733] [taurusi6605:32400:0] ucp_context.c:347 UCX TRACE enabling tl 'all'
[1563195672.387734] [taurusi6605:32400:0] ucp_context.c:473 UCX TRACE mm/posix is enabled
[1563195672.387738] [taurusi6605:32400:0] ucp_context.c:556 UCX TRACE allowed transport 0 : 'all'
[1563195672.387739] [taurusi6605:32400:0] ucp_context.c:347 UCX TRACE enabling tl 'all'
[1563195672.387740] [taurusi6605:32400:0] ucp_context.c:473 UCX TRACE cma/cma is enabled
[1563195672.387761] [taurusi6605:32400:0] mpool.c:77 UCX DEBUG mpool rcache_inv_mp: align 1, maxelems 4294967295, elemsize 32
[1563195672.387765] [taurusi6605:32400:0] ucp_context.c:556 UCX TRACE allowed transport 0 : 'all'
[1563195672.387767] [taurusi6605:32400:0] ucp_context.c:347 UCX TRACE enabling tl 'all'
[1563195672.387768] [taurusi6605:32400:0] ucp_context.c:473 UCX TRACE knem/knem is enabled
[1563195672.387776] [taurusi6605:32400:0] ucp_context.c:347 UCX TRACE enabling tl 'ud'
[1563195672.387777] [taurusi6605:32400:0] ucp_context.c:347 UCX TRACE enabling tl 'ud_x'
[1563195672.387782] [taurusi6605:32400:0] ucp_context.c:347 UCX TRACE enabling tl 'ud'
[1563195672.387783] [taurusi6605:32400:0] ucp_context.c:347 UCX TRACE enabling tl 'ud_x'
[1563195672.387787] [taurusi6605:32400:0] ucp_context.c:347 UCX TRACE enabling tl 'ud'
[1563195672.387792] [taurusi6605:32400:0] ucp_context.c:347 UCX TRACE enabling tl 'ud_x'
[1563195672.387796] [taurusi6605:32400:0] ucp_context.c:347 UCX TRACE enabling tl 'ud'
[1563195672.387797] [taurusi6605:32400:0] ucp_context.c:347 UCX TRACE enabling tl 'ud_x'
[1563195672.387801] [taurusi6605:32400:0] ucp_context.c:347 UCX TRACE enabling tl 'ud'
[1563195672.387802] [taurusi6605:32400:0] ucp_context.c:347 UCX TRACE enabling tl 'ud_x'
[1563195672.387807] [taurusi6605:32400:0] ucp_context.c:347 UCX TRACE enabling tl 'ud'
[1563195672.387808] [taurusi6605:32400:0] ucp_context.c:347 UCX TRACE enabling tl 'ud_x'
[1563195672.387811] [taurusi6605:32400:0] ucp_context.c:347 UCX TRACE enabling tl 'ud'
[1563195672.387813] [taurusi6605:32400:0] ucp_context.c:347 UCX TRACE enabling tl 'ud'
[1563195672.387815] [taurusi6605:32400:0] ucp_context.c:347 UCX TRACE enabling tl 'ud_x'
[1563195672.387819] [taurusi6605:32400:0] ucp_context.c:347 UCX TRACE enabling tl 'ud'
[1563195672.387820] [taurusi6605:32400:0] ucp_context.c:347 UCX TRACE enabling tl 'rdmacm'for alias 'ud'
[1563195672.387822] [taurusi6605:32400:0] ucp_context.c:347 UCX TRACE enabling tl 'ud'
[1563195672.387823] [taurusi6605:32400:0] ucp_context.c:347 UCX TRACE enabling tl 'ud_x'
[1563195672.387827] [taurusi6605:32400:0] ucp_context.c:347 UCX TRACE enabling tl 'ud'
[1563195672.387828] [taurusi6605:32400:0] ucp_context.c:347 UCX TRACE enabling tl 'ud_x'
[1563195672.387832] [taurusi6605:32400:0] ucp_context.c:347 UCX TRACE enabling tl 'ud'
[1563195672.387834] [taurusi6605:32400:0] ucp_context.c:347 UCX TRACE enabling tl 'ud_x'
[1563195672.387837] [taurusi6605:32400:0] ucp_context.c:347 UCX TRACE enabling tl 'ud'
[1563195672.387839] [taurusi6605:32400:0] ucp_context.c:347 UCX TRACE enabling tl 'ud_x'
[1563195672.387843] [taurusi6605:32400:0] mpool.c:77 UCX DEBUG mpool ucp_rkeys: align 64, maxelems 4294967295, elemsize 136
[1563195672.387845] [taurusi6605:32400:0] ucp_context.c:1230 UCX DEBUG created ucp context 0x18f16b0 [8 mds 13 tls] features 0x1
[1563195672.406328] [taurusi6605:32400:0] sys.c:156 UCX TRACE MAC address is 0x61763C380008
[1563195672.406364] [taurusi6605:32400:0] mpool.c:77 UCX DEBUG mpool ucp_requests: align 64, maxelems 4294967295, elemsize 400
[1563195672.406433] [taurusi6605:32400:0] mpool.c:77 UCX DEBUG mpool self_msg_desc: align 64, maxelems 4294967295, elemsize 8200
[1563195672.406435] [taurusi6605:32400:0] self.c:182 UCX DEBUG created self iface id 0x5034e09195255bbd send_size 8192
[1563195672.406440] [taurusi6605:32400:0] ucp_worker.c:857 UCX DEBUG created interface[0]=0x195bc60 using self/self on worker 0x2b48f9d73010
[1563195672.406447] [taurusi6605:32400:0] ucp_worker.c:557 UCX TRACE activate iface 0x195bc60 acount=0 aifaces=0
[1563195672.406500] [taurusi6605:32400:0] tcp_iface.c:343 UCX DEBUG tcp_iface 0x195c3c0: listening for connections on 192.168.0.254:47854
[1563195672.412483] [taurusi6605:32400:0] mpool.c:189 UCX DEBUG mpool rcache_inv_mp: allocated chunk 0x2b48f90c9008 of 36856 bytes with 1151 elements
[1563195672.412515] [taurusi6605:32400:0] mpool.c:189 UCX DEBUG mpool rcache_inv_mp: allocated chunk 0x2b48f90d2008 of 36856 bytes with 1151 elements
*** Error in `/home/h4/jschuch/src/test/./test_mpi_rget_fetch_op': free(): invalid pointer: 0x00002b48f7ae17b8 ***
======= Backtrace: =========
/usr/lib64/libc.so.6(+0x7c619)[0x2b48f77a1619]
ucx-1.5.2/lib/libucs.so.0(ucs_debug_cleanup+0x62)[0x2b48f97a21a2]
ucx-1.5.2/lib/libucs.so.0(+0x4c842)[0x2b48f97a2842]
/usr/lib64/libpthread.so.0(+0xf5e0)[0x2b48f75185e0]
ucx-1.5.2/lib/libucs.so.0(+0xde5f1)[0x2b48f98345f1]
ucx-1.5.2/lib/libucs.so.0(+0x73f7d)[0x2b48f97c9f7d]
ucx-1.5.2/lib/libucs.so.0(+0x4a469)[0x2b48f97a0469]
ucx-1.5.2/lib/libucs.so.0(+0x57c2c)[0x2b48f97adc2c]
ucx-1.5.2/lib/libucs.so.0(ucs_debug_lookup_address+0x119)[0x2b48f97a0ef9]
ucx-1.5.2/lib/libucs.so.0(ucs_debug_get_symbol_name+0x519)[0x2b48f97a18b9]
ucx-1.5.2/lib/libucs.so.0(+0x3c4e6)[0x2b48f97924e6]
ucx-1.5.2/lib/libucs.so.0(+0x3ceaf)[0x2b48f9792eaf]
ucx-1.5.2/lib/libucs.so.0(ucs_async_set_event_handler+0x3e)[0x2b48f979305e]
ucx-1.5.2/lib/libuct.so.0(+0x28c48)[0x2b48f8b0dc48]
ucx-1.5.2/lib/libuct.so.0(uct_iface_open+0x75)[0x2b48f8b08185]
ucx-1.5.2/lib/libucp.so.0(ucp_worker_iface_init+0x1bb)[0x2b48f89f07ab]
ucx-1.5.2/lib/libucp.so.0(ucp_worker_create+0x372)[0x2b48f89f1982]
openmpi-git-ucx/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_init+0x67)[0x2b48f89c3447]
openmpi-git-ucx/lib/openmpi/mca_pml_ucx.so(+0x849c)[0x2b48f89c549c]
openmpi-git-ucx/lib/libmpi.so.0(mca_pml_base_select+0x1d3)[0x2b48f73d7703]
openmpi-git-ucx/lib/libmpi.so.0(ompi_mpi_init+0x6ca)[0x2b48f73e462a]
openmpi-git-ucx/lib/libmpi.so.0(MPI_Init+0x5e)[0x2b48f7388b0e]
./test_mpi_rget_fetch_op[0x400c64]
/usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b48f7746c05]
./test_mpi_rget_fetch_op[0x400b99]
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub, or mute the thread.
|
@hjelmn In that case, the test case gets stuck in Process 1:
Process 0:
|
Opps, fat-fingered the option: --mca btl_uct_memory_domains ib/mlx5_0 . Now I will need to track down why osc/pt2pt is hanging... |
With |
@hjelmn Should the osc/pt2pt issue be tracked separately? |
@devreal We should track the osc/pt2pt issue separately. It is still on my TODO list to eventually kill it but it needs to work until that point. |
@devreal , can you try this one with ucx v1.6.x as well? |
@janjust Just tested |
@devreal , so I can reproduce the issue with v1.5.x, but I cannot with 1.6.x or master ucx version, are you 100% sure you're using 1.6.x? |
I have confirmed that the issue still occurs on latest Open MPI
[paths altered by me] Interestingly, if I modify the test to have the target process lock and unlock the window first (and synchronized with the other process through a barrier) the test succeeds. On the Bull system I have access to the same happens. If I set |
@artpol84 is this issue resolved? |
I believe this was fixed with #7632. Closing |
I'm trying to run the example from #6552 on
master
(020a591
) with UCX 1.5.2 (built from release tarball), which leads to process 1 being stuck inMPI_Win_lock
and process 0 waiting inMPI_Barrier
:Process 0:
Process 1:
Process 1 seems to never get back from
ompi_osc_ucx_lock
.Infiniband devices:
ibverbs version:
Open MPI was configured using:
In case it matters: this is a different system than the one used in #6552, mainly due to #6815 preventing me to run on the original system. The main difference is that the current system uses SLURM instead of PBS but I don't think that has any impact on UCX.
Potentially related: #6549 (although that was reported on v4.0.x and managed to get an exclusive lock)
Curiously, trying to run with UCX logging enabled leads to memory corruption:
The text was updated successfully, but these errors were encountered: