Description
I tried to run a rather simple benchmark (`osu_latency_mt` ported to use Argobots) to test the Argobots integration on an IB cluster using UCX. It appears that the UCX PML does not use the sync objects in blocking operations but instead hammers the UCX library, with occasional calls to `opal_progress`. Of course, there will never be a switch to another argobot in that case, causing deadlocks if there are more argobots than execution streams.
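For context, the blocking path in the UCX PML boils down to a busy-wait of roughly the following shape. This is a simplified sketch, not the actual Open MPI source; `pml_ucx_wait_sketch` is a made-up name and `progress_iterations` stands in for the value of `pml_ucx_progress_iterations`:

```c
#include <ucp/api/ucp.h>   /* ucp_worker_progress() */

void opal_progress(void);  /* prototype as in opal/runtime/opal_progress.h */

/* Simplified sketch of the blocking-wait pattern described above. */
static void pml_ucx_wait_sketch(ucp_worker_h worker, volatile int *completed,
                                int progress_iterations)
{
    int count = 0;
    while (!*completed) {
        ucp_worker_progress(worker);     /* hammer the UCX library */
        if (++count >= progress_iterations) {
            opal_progress();             /* occasional general progress */
            count = 0;
        }
        /* No yield to the Argobots scheduler anywhere in this loop:
         * an argobot blocked here never releases its execution stream,
         * so with more argobots than streams this deadlocks. */
    }
}
```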
One way to work around this, I guess, is to call `ABT_thread_yield` in `opal_progress`, so I replaced the call to `sched_yield` with `ABT_thread_yield`
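Paraphrased, the idle path in `opal/runtime/opal_progress.c` then looks something like the sketch below; `USE_ARGOBOTS_YIELD` is a hypothetical compile-time switch and the real code is structured differently:

```c
#include <stdbool.h>
#include <sched.h>   /* sched_yield() */
#include <abt.h>     /* Argobots: ABT_thread_yield() */

/* Paraphrased sketch of the idle-yield path in opal_progress();
 * 'events' is the number of events handled in this pass. */
static void progress_yield_sketch(bool yield_when_idle, int events)
{
    if (yield_when_idle && 0 == events) {
#ifdef USE_ARGOBOTS_YIELD
        /* Yield to another argobot on this execution stream so a
         * blocked ULT gets a chance to run and complete requests. */
        ABT_thread_yield();
#else
        /* Original behavior: give the core back to the OS scheduler. */
        sched_yield();
#endif
    }
}
```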
. Curiously, it seems that running with `--mca mpi_yield_when_idle true` does not actually cause `opal_progress_yield_when_idle` to be set to `true`, so that code path isn't taken (I am not familiar with the MCA parameter code and couldn't spot anything wrong at first glance). I can open a separate issue for this if people think this is fishy.
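As a sanity check (assuming a standard install), the value the runtime actually ends up with can be inspected with `ompi_info --param all all --level 9 | grep yield_when_idle`.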
If I remove the check for `opal_progress_yield_when_idle` in `opal_progress`, things start to work. Now the latency seems sensitive to `pml_ucx_progress_iterations`: the smaller the value, the better the latency (as the UCX PML then calls into `opal_progress` more often).
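For reference, that parameter can be lowered on the command line in the usual way, e.g. something like `mpirun --mca pml ucx --mca pml_ucx_progress_iterations 1 ./osu_latency_mt` (exact invocation depends on the setup).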