Argobots integration vs UCX PML #7702

Closed
@devreal

Description

I tried to run a fairly simple benchmark (osu_latency_mt, ported to use Argobots) to test the Argobots integration on an InfiniBand cluster using UCX. It appears that the UCX PML does not use the sync objects in blocking operations; instead it polls the UCX library in a tight loop, with only occasional calls to opal_progress. As a result, the waiting argobot never switches to another one, which causes deadlocks if there are more argobots than execution streams.
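
To illustrate the failure mode, here is a minimal sketch of two cooperatively scheduled tasks sharing a single execution stream. It is a toy simulation only: it does not use Argobots or Open MPI, and every name in it (`run_on_one_stream`, `receiver_step`, `sender_step`, `sim_state`) is hypothetical. The receiver stands in for a blocking MPI call that polls for completion, and the sender stands in for the argobot that would complete it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* What a task reports to the (hypothetical) scheduler after one step. */
typedef enum { STEP_SPIN, STEP_YIELD, STEP_DONE } step_result;

typedef struct {
    bool message_arrived;  /* set by the sender, polled by the receiver */
    bool receiver_yields;  /* does the receiver's wait loop ever yield? */
} sim_state;

typedef step_result (*task_fn)(sim_state *);

/* Receiver: models a blocking receive that polls for completion. */
step_result receiver_step(sim_state *s)
{
    if (s->message_arrived) return STEP_DONE;
    return s->receiver_yields ? STEP_YIELD : STEP_SPIN;
}

/* Sender: completes the receiver's operation as soon as it gets to run. */
step_result sender_step(sim_state *s)
{
    s->message_arrived = true;
    return STEP_DONE;
}

/* Run both tasks on ONE execution stream with cooperative scheduling:
 * the running task keeps the stream until it yields or finishes.
 * Returns true if both tasks finish within max_steps scheduler steps. */
bool run_on_one_stream(bool receiver_yields, size_t max_steps)
{
    assert(max_steps > 0);
    sim_state s = { false, receiver_yields };
    task_fn tasks[2] = { receiver_step, sender_step };
    bool done[2] = { false, false };
    int cur = 0;  /* the receiver is scheduled first */

    for (size_t i = 0; i < max_steps; i++) {
        if (done[0] && done[1]) return true;
        if (done[cur]) { cur = 1 - cur; continue; }

        step_result r = tasks[cur](&s);
        if (r == STEP_DONE) {
            done[cur] = true;
            cur = 1 - cur;
        } else if (r == STEP_YIELD) {
            cur = 1 - cur;  /* hand the stream to the other task */
        }
        /* STEP_SPIN: the same task runs again; nobody else gets the stream. */
    }
    return done[0] && done[1];
}
```

With `receiver_yields` set to true the sender gets scheduled and both tasks finish; with it set to false the receiver holds the stream for the whole step budget and the run never completes, which is the simulated equivalent of the deadlock described above.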

One way to work around this, I guess, is to call ABT_thread_yield in opal_progress, so I replaced the call to sched_yield with ABT_thread_yield. Curiously, running with --mca mpi_yield_when_idle true does not seem to set opal_progress_yield_when_idle to true, so that code path isn't taken (I am not familiar with the MCA parameter code and couldn't spot anything wrong at first glance). I can open a separate issue for this if people think it looks fishy.
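
For readers unfamiliar with that code path, the following is a rough sketch of a progress function whose yield is gated by a flag. It is not the Open MPI source: `progress_cfg`, `progress_once`, and `os_yield` are hypothetical names, and the gating condition is an assumption based on the behavior described above. The yield call sits behind a function pointer so that a threading-library yield (such as ABT_thread_yield in an Argobots build) could be swapped in for sched_yield.

```c
#include <sched.h>
#include <stdbool.h>

/* Hook type for "give up the processor"; hypothetical. */
typedef void (*idle_yield_fn)(void);

/* Default hook: yield to the OS scheduler (POSIX sched_yield). */
void os_yield(void) { (void)sched_yield(); }

typedef struct {
    bool yield_when_idle;       /* models opal_progress_yield_when_idle */
    idle_yield_fn yield;        /* sched_yield wrapper or a ULT yield */
    unsigned long yield_calls;  /* bookkeeping for this sketch only */
} progress_cfg;

/* One progress pass. `events` is the number of events the pass handled.
 * The yield only happens when the flag is set AND nothing progressed,
 * so with the flag left at false the hook is never reached. */
int progress_once(progress_cfg *cfg, int events)
{
    if (cfg->yield_when_idle && events <= 0) {
        cfg->yield();
        cfg->yield_calls++;
    }
    return events;
}
```

In this model, swapping the hook alone changes nothing while `yield_when_idle` stays false, which matches the observation that the replaced yield call was never taken.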

If I remove the check for opal_progress_yield_when_idle in opal_progress, things start to work. The latency then appears sensitive to pml_ucx_progress_iterations: the smaller the value, the better, since the UCX PML calls into opal_progress more often.
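
The relationship between the iteration count and the yield frequency can be sketched as follows. This is a simplified model, not the actual UCX PML wait loop: `blocking_wait`, `transport_progress`, `general_progress`, and `poll_stats` are hypothetical names, the transport poll is a counter rather than a real UCX call, and the general progress function is assumed to yield unconditionally (that is, with the opal_progress_yield_when_idle check removed).

```c
#include <assert.h>
#include <stdbool.h>

typedef struct {
    unsigned long worker_polls;      /* calls to the transport-level poll */
    unsigned long yields;            /* calls to the yield hook */
    unsigned long polls_until_done;  /* simulated completion time, in polls */
} poll_stats;

typedef void (*yield_fn)(poll_stats *);

/* Counting stand-in for a real yield such as ABT_thread_yield. */
void counting_yield(poll_stats *st) { st->yields++; }

/* Stand-in for polling the transport; true once the request completes. */
bool transport_progress(poll_stats *st)
{
    st->worker_polls++;
    return st->worker_polls >= st->polls_until_done;
}

/* Stand-in for the general progress function, assumed to always yield. */
void general_progress(poll_stats *st, yield_fn yield)
{
    yield(st);
}

/* Poll the transport until completion; on every progress_iterations-th
 * unsuccessful poll, also call the general progress function. */
void blocking_wait(poll_stats *st, unsigned progress_iterations, yield_fn yield)
{
    assert(progress_iterations > 0);
    unsigned long spins = 0;
    while (!transport_progress(st)) {
        if (++spins % progress_iterations == 0) {
            general_progress(st, yield);
        }
    }
}
```

In this model a request that needs 1001 polls yields 10 times with `progress_iterations` set to 100 and 100 times with it set to 10, so a smaller value gives other argobots more chances to run while the wait is in progress.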
