UCX osc: missing progress in shared memory #7631

Closed
devreal opened this issue Apr 15, 2020 · 0 comments · Fixed by #7632

devreal commented Apr 15, 2020

I encountered the following problem when forcing the UCX osc component with --mca osc ucx on a shared-memory system (mainly for testing; I know it's not optimal, but it shouldn't hang):

MPI_Win win;
MPI_Win_allocate(..., &win);
if (myrank == 1) {
  MPI_Win_lock_all(0, win); // <- hangs (first argument is the assert flag)
}
MPI_Barrier(MPI_COMM_WORLD);

Process 1 hangs in MPI_Win_lock_all because MPI_Barrier does not progress UCX (presumably because UCX is not used for collectives in shared memory?). The UCX osc progress callback does get called during the barrier, but since process 0 has no active workers, the callback does not progress the operations that MPI_Win_lock_all on process 1 needs.
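To make the suspected mechanism concrete, here is a toy model in plain C. It is not Open MPI or UCX code: every name in it (toy_proc, toy_progress_gated, toy_progress_always, toy_lock_all_completes) is hypothetical, and the "always progress" variant only illustrates the contrast; it is not meant to describe the actual change in the upcoming PR. The model assumes that process 1's lock request can only complete once process 0 drives its worker.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: these types and functions are hypothetical and do not
 * correspond to real Open MPI or UCX symbols. */
typedef struct {
    int active_workers;   /* workers this process has marked as in use */
    int pending_requests; /* incoming requests waiting to be served */
} toy_proc;

/* Progress callback that skips the worker when nothing is marked active,
 * mirroring the behaviour described in the report.
 * Returns the number of requests served. */
int toy_progress_gated(toy_proc *p) {
    if (p->active_workers == 0) {
        return 0; /* nothing active: pending requests are never looked at */
    }
    int served = p->pending_requests;
    p->pending_requests = 0;
    return served;
}

/* Variant that drives the worker unconditionally (an assumed alternative,
 * shown only for contrast). */
int toy_progress_always(toy_proc *p) {
    int served = p->pending_requests;
    p->pending_requests = 0;
    return served;
}

/* Models process 0 spinning in a barrier for `spins` iterations while
 * process 1 waits for its lock request to be handled.
 * Returns true if the request was served within that time. */
bool toy_lock_all_completes(int (*progress)(toy_proc *), int spins) {
    assert(spins >= 0);
    toy_proc target = { .active_workers = 0, .pending_requests = 1 };
    for (int i = 0; i < spins; i++) {
        progress(&target);
        if (target.pending_requests == 0) {
            return true;
        }
    }
    return false;
}
```

In this model, toy_lock_all_completes(toy_progress_gated, 1000) returns false no matter how many iterations the barrier spins, which corresponds to the hang, while toy_lock_all_completes(toy_progress_always, 1000) returns true on the first iteration.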

I have a possible fix that I will PR soon.
