UCX osc: make progress on idle worker if none are active #7632
f22c532 to 9c03904
@@ -308,6 +309,10 @@ opal_common_ucx_wpool_progress(opal_common_ucx_wpool_t *wpool)
         }
         opal_mutex_unlock(&winfo->mutex);
     }
+    if (active_workers == 0 && opal_list_get_size(&wpool->idle_workers)) {
Actually, I think we need to progress it in any case.
It might happen that it's in the "idle" list even if we have other workers non-idling.
We are currently doing a significant refactoring of this code BTW.
But this is a good hint for us. So thank you very much.
For now, I think, we can just always progress this one. But with locking precautions.
The wpool->mutex is held at this point, so that should be sufficient, right?
@devreal we need to progress the default worker.
@artpol84 Is there anything left to do here?
Can one of the admins verify this patch?
9c03904 to 64856d2
@artpol84 @janjust I rebased this PR onto current master. I was surprised to see that the
+    if (active_workers == 0 && opal_list_get_size(&wpool->idle_workers)) {
+        /* make sure to progress at least some */
+        ucp_worker_progress(wpool->dflt_worker);
+    }
Suggested change:
-    if (active_workers == 0 && opal_list_get_size(&wpool->idle_workers)) {
-        /* make sure to progress at least some */
-        ucp_worker_progress(wpool->dflt_worker);
-    }
+    ucp_worker_progress(wpool->dflt_worker);
@@ -282,6 +283,10 @@ opal_common_ucx_wpool_progress(opal_common_ucx_wpool_t *wpool)
     } while (progressed);
We can skip the default worker here to avoid double-progress.
But we can't skip progressing the default worker if there are active workers. The default worker might still not be active, but you want to progress it.
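The two points above can be sketched with stand-in types. This is a minimal sketch under stated assumptions, not the Open MPI code: `worker_t`, `pool_t`, `worker_progress`, and `pool_progress` are hypothetical stand-ins for a UCX worker, `opal_common_ucx_wpool_t`, `ucp_worker_progress`, and `opal_common_ucx_wpool_progress`, and the real function's locking and `do { ... } while (progressed)` loop are left out. The loop skips the default worker if it shows up among the active workers, and the default worker is then progressed exactly once, whether or not any other worker is active.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a UCX worker; counts how often it is progressed. */
typedef struct worker {
    int progress_calls;
} worker_t;

/* Hypothetical stand-in for the worker pool. */
typedef struct pool {
    worker_t  *dflt_worker; /* may or may not also appear in `active` */
    worker_t **active;      /* currently active workers */
    size_t     num_active;
} pool_t;

/* Stand-in for ucp_worker_progress(): records the call, reports no events. */
unsigned worker_progress(worker_t *w)
{
    w->progress_calls++;
    return 0;
}

/* Progress each active worker once, skipping the default worker inside the
 * loop, then progress the default worker unconditionally. */
unsigned pool_progress(pool_t *pool)
{
    unsigned completed = 0;

    for (size_t i = 0; i < pool->num_active; i++) {
        if (pool->active[i] == pool->dflt_worker) {
            continue; /* avoid double-progress */
        }
        completed += worker_progress(pool->active[i]);
    }

    /* Always progress the default worker, even if others are active. */
    completed += worker_progress(pool->dflt_worker);
    return completed;
}
```

With this shape, a pool that has no active workers still drives the default worker on every call, which is the case the original `active_workers == 0` check was meant to cover.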
@devreal let's sync if you would still like to discuss this.
64856d2 to f3e632a
@artpol84 I made a couple of changes:
Calloc/Malloc + OBJ_CONSTRUCT is morally equivalent to OBJ_NEW. It can therefore be used with OBJ_RELEASE.
@bosilca I am aware of that, but there was no
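The equivalence can be illustrated with a miniature reference-counted object. This is a sketch only: `obj_t`, `obj_construct`, `obj_new`, and `obj_release` are hypothetical names that mimic the roles of `OBJ_CONSTRUCT`, `OBJ_NEW`, and `OBJ_RELEASE`; they are not the OPAL class system. The point it shows is that heap storage initialized in place ends up in the same state as an object from the allocating constructor, so releasing the last reference can run the destructor and free the storage in both cases.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical miniature of a reference-counted object. */
typedef struct obj {
    int  refcount;
    int *destructed; /* incremented when the destructor step runs */
} obj_t;

/* In-place initialization on caller-provided storage
 * (the role OBJ_CONSTRUCT plays). */
void obj_construct(obj_t *o, int *destructed)
{
    o->refcount = 1;
    o->destructed = destructed;
}

/* Allocate and initialize in one step (the role OBJ_NEW plays). */
obj_t *obj_new(int *destructed)
{
    obj_t *o = malloc(sizeof(*o));
    if (NULL != o) {
        obj_construct(o, destructed);
    }
    return o;
}

/* Drop one reference; at zero, run the destructor step and free the storage.
 * Only valid for heap storage because it calls free().
 * Returns 1 if the object was freed, 0 otherwise. */
int obj_release(obj_t *o)
{
    if (--o->refcount > 0) {
        return 0;
    }
    (*o->destructed)++;
    free(o);
    return 1;
}
```

The same release path must not be used for an object constructed on the stack or embedded in another structure, since the final `free()` would then act on storage that was never allocated on its own.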
Signed-off-by: Joseph Schuchart <schuchart@hlrs.de>
f3e632a to 581478d
Overall looks ok.
Though please let @janjust check which sequence of commits is easier in the context of the upcoming refactoring.
@@ -272,17 +278,33 @@ opal_common_ucx_wpool_progress(opal_common_ucx_wpool_t *wpool)
     /* Go over all active workers and progress them
      * TODO: may want to have some partitioning to progress only part of
      * workers */
-    opal_mutex_lock(&wpool->mutex);
+    if (0 != opal_mutex_trylock(&wpool->mutex)) {
We already have this in @janjust's PR.
Maybe we should have merged it first :(
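The switch from a blocking lock to a trylock in the hunk above can be sketched with POSIX threads. This is an illustration under assumptions, not the Open MPI code: `pool_t` and `pool_try_progress` are hypothetical, `pthread_mutex_trylock` stands in for `opal_mutex_trylock`, and returning without doing any work when the lock is busy is an assumption, because the body of the `if` is not shown in the diff.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-in for the worker pool. */
typedef struct pool {
    pthread_mutex_t mutex;
    int             progress_calls; /* number of completed progress passes */
} pool_t;

/* Try to make one progress pass.
 * Returns 1 if this call did the pass, 0 if the pool was busy. */
int pool_try_progress(pool_t *pool)
{
    /* pthread_mutex_trylock() returns 0 on success and a nonzero error
     * code (EBUSY) if the mutex is already locked; it never blocks. */
    if (0 != pthread_mutex_trylock(&pool->mutex)) {
        return 0; /* assumption: another thread is already progressing */
    }

    pool->progress_calls++; /* placeholder for progressing the workers */

    pthread_mutex_unlock(&pool->mutex);
    return 1;
}
```

The design choice is that a progress function gets called repeatedly anyway, so a caller that finds the pool locked can skip this round instead of waiting behind the current holder.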
Thanks a lot!
When using UCX over shared memory, progress is missing if one process waits in a barrier while another process attempts to perform RMA operations. This PR adds progress on the first inactive worker if no active workers are available.
Fixes #7631
Signed-off-by: Joseph Schuchart <schuchart@hlrs.de>