coll/libnbc: demote progress_lock to regular flag #3901
Can one of the admins verify this patch?
@ggouaillardet am I missing something here?
if i understand this PR correctly, you replaced an opal_atomic_lock_t with a boolean that is protected by mca_coll_libnbc_component.lock, is that right ?
iirc, there is some recursion involved in coll/libnbc, and this is regardless of opal_using_threads()
so in the likely case where opal is not using threads, in_progress is no longer protected, so this is no longer a mutex (and that could be fine, i am not really sure about it at this stage)
imho, the right approach, if feasible of course, would be to get rid of the recursion.
ompi/mca/coll/libnbc/coll_libnbc.h
@@ -75,7 +75,7 @@ struct ompi_coll_libnbc_component_t {
     opal_free_list_t requests;
     opal_list_t active_requests;
     int32_t active_comms;
-    opal_atomic_lock_t progress_lock; /* protect from recursive calls */
+    bool in_progress; /* protect from recursive calls */
imho, in_progress should rather be a static variable of coll_libnbc_component.c since it is not used anywhere else
Right, I've updated the PR.
    }
    OPAL_THREAD_LOCK(&mca_coll_libnbc_component.lock);
    mca_coll_libnbc_component.in_progress = false;
at first glance, this should be in the if (! in_progress) block.
makes sense ?
It is, the diff without -w is making it hard to read.
Yes, that's correct.
Sadly I'm not familiar enough with the inner workings of libnbc (or openmpi) to figure that out.
FWIW if this isn't going to get merged (it slashes around 40% of libnbc time when running single-threaded), the recursive calls are being caused by this line which has been in libnbc since the beginning. Stack trace goes:
Could you please squash the two commits into a single one ?
Signed-off-by: Carlos Bederián <bc@famaf.unc.edu.ar>
There you go, thanks
thanks, i just merged the PR
In Quantum Espresso PWscf benchmark runs, libnbc went from taking 2% of walltime to around 1.1%. I should take OMB for a spin.
instead of invoking ompi_request_test_all(), that will end up calling opal_progress() recursively, manually check the status of the requests. the same method is used in ompi_comm_request_progress() Refs open-mpi#3901 Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
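The idea in that commit message can be sketched roughly as follows. This is only an illustration under stated assumptions: `request_t`, `progress_engine()`, `test_all_with_progress()` and `all_complete_no_progress()` are hypothetical stand-ins and not the Open MPI request API. The point is that a "test" style call may drive the progress engine (and so re-enter it when called from a progress callback), while a manual scan of completion flags never does.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical request type; the real Open MPI request struct differs. */
typedef struct {
    volatile bool complete;
} request_t;

/* Hypothetical progress engine; counts how often it is entered. */
static int progress_entries = 0;
static void progress_engine(void) { progress_entries++; }

/* Manual check: only reads the completion flags, never drives progress. */
static bool all_complete_no_progress(request_t *const *reqs, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (!reqs[i]->complete) {
            return false;
        }
    }
    return true;
}

/* "Test" style check: if something is still pending it drives the
 * progress engine, which is a recursive entry when the caller is
 * itself a progress callback. */
static bool test_all_with_progress(request_t *const *reqs, size_t n)
{
    if (all_complete_no_progress(reqs, n)) {
        return true;
    }
    progress_engine();
    return all_complete_no_progress(reqs, n);
}
```

With the manual scan, a component progress function can poll its sub-requests on every pass and simply return when some are still pending, leaving it to the outer progress loop to call it again.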
The LOCK CMPXCHG that grabs mca_coll_libnbc_component.progress_lock (which detects recursion and defines a critical section) has been showing up on my profiles during single-threaded runs, where there shouldn't be any contention. This PR changes the lock to an OPAL_THREAD_LOCKed boolean flag to speed up the single-thread case while keeping the critical section otherwise.
Signed-off-by: Carlos Bederián <bc@famaf.unc.edu.ar>
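For readers unfamiliar with the pattern, here is a minimal sketch of a recursion guard built from a plain boolean plus a lock that is only taken when threads are in use. It is not the patch itself: `using_threads`, `thread_lock()`, `thread_unlock()` and `component_progress()` are hypothetical names that only mimic the role of opal_using_threads() and OPAL_THREAD_LOCK/OPAL_THREAD_UNLOCK, and a pthread mutex is assumed in place of the component lock.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-ins, not the Open MPI API. */
static bool using_threads = false;          /* plays the role of opal_using_threads() */
static pthread_mutex_t component_lock = PTHREAD_MUTEX_INITIALIZER;
static bool in_progress = false;            /* recursion flag guarded by component_lock */

static int bodies_run = 0;                  /* instrumentation for the example */
static int recursions_rejected = 0;

/* Take the lock only when threads are in use, so the single-threaded
 * path costs a branch instead of an atomic instruction. */
static void thread_lock(void)   { if (using_threads) pthread_mutex_lock(&component_lock); }
static void thread_unlock(void) { if (using_threads) pthread_mutex_unlock(&component_lock); }

/* Returns 1 if this call did the work, 0 if it was a nested call.
 * depth > 0 simulates the body re-entering the progress function. */
static int component_progress(int depth)
{
    thread_lock();
    if (in_progress) {
        recursions_rejected++;
        thread_unlock();
        return 0;                           /* nested call: bail out early */
    }
    in_progress = true;
    thread_unlock();

    bodies_run++;
    if (depth > 0) {
        component_progress(depth - 1);      /* simulated recursive entry */
    }

    /* Only the call that set the flag clears it. */
    thread_lock();
    in_progress = false;
    thread_unlock();
    return 1;
}
```

Calling `component_progress(3)` in this sketch runs the body once and rejects the single nested call. As the review discussion points out, when threads are not in use the flag is an ordinary variable rather than a mutex, so it only guards against recursion on the same thread.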