coll/libnbc: demote progress_lock to regular flag #3901


Merged (1 commit, Jul 25, 2017)

Conversation

zzzoom
Contributor

@zzzoom zzzoom commented Jul 17, 2017

The LOCK CMPXCHG that grabs mca_coll_libnbc_component.progress_lock (which detects recursion and defines a critical section) has been showing up in my profiles during single-threaded runs, where there shouldn't be any contention. This PR changes the lock to an OPAL_THREAD_LOCKed boolean flag to speed up the single-thread case while otherwise keeping the critical section.

Signed-off-by: Carlos Bederián <bc@famaf.unc.edu.ar>

@ompiteam-bot

Can one of the admins verify this patch?

@zzzoom
Contributor Author

zzzoom commented Jul 17, 2017

@ggouaillardet am I missing something here?

Contributor

@ggouaillardet ggouaillardet left a comment


If I understand this PR correctly, you replaced an opal_atomic_lock_t with a boolean that is protected by mca_coll_libnbc_component.lock, is that right?
IIRC, there is some recursion involved in coll/libnbc, and that happens regardless of opal_using_threads().
So in the likely case where OPAL is not using threads, in_progress is no longer protected, so this is no longer a mutex (and that could be fine, I am not really sure about it at this stage).
IMHO, the right approach, if feasible of course, would be to get rid of the recursion.

@@ -75,7 +75,7 @@ struct ompi_coll_libnbc_component_t {
     opal_free_list_t requests;
     opal_list_t active_requests;
     int32_t active_comms;
-    opal_atomic_lock_t progress_lock;  /* protect from recursive calls */
+    bool in_progress;                  /* protect from recursive calls */
Contributor


IMHO, in_progress should rather be a static variable in coll_libnbc_component.c, since it is not used anywhere else.

Contributor Author


Right, I've updated the PR.

     }
     OPAL_THREAD_LOCK(&mca_coll_libnbc_component.lock);
     mca_coll_libnbc_component.in_progress = false;
Contributor


At first glance, this should be inside the if (!in_progress) block.
Makes sense?

Contributor Author


It is; the diff without -w makes it hard to read.

@zzzoom
Contributor Author

zzzoom commented Jul 17, 2017

If I understand this PR correctly, you replaced an opal_atomic_lock_t with a boolean that is protected by mca_coll_libnbc_component.lock, is that right?

Yes, that's correct.

IMHO, the right approach, if feasible of course, would be to get rid of the recursion.

Sadly, I'm not familiar enough with the inner workings of libnbc (or Open MPI) to figure that out.

@zzzoom
Contributor Author

zzzoom commented Jul 24, 2017

FWIW, in case this isn't going to get merged (it cuts around 40% of libnbc time when running single-threaded): the recursive calls are caused by this line, which has been in libnbc since the beginning.

Stack trace goes:

#0  ompi_coll_libnbc_progress () at coll_libnbc_component.c:295
#1  0x00007ffff7562641 in opal_progress () at runtime/opal_progress.c:226
#2  0x00007ffff7b0daf1 in ompi_request_default_test_all (count=8013936, 
    requests=0x7fffe39d7b60, completed=0x1, statuses=0x2)
    at request/req_test.c:204
#3  0x00007fffe39d87b5 in NBC_Progress (handle=0x7a4870) at nbc.c:329
#4  0x00007fffe39d7bcb in ompi_coll_libnbc_progress ()
    at coll_libnbc_component.c:275
#5  0x00007ffff7562641 in opal_progress () at runtime/opal_progress.c:226
#6  0x00007ffff7b0e499 in ompi_request_wait_completion (req_ptr=0x7a4870, 
    status=0x7fffe39d7b60) at ../ompi/request/request.h:392
#7  ompi_request_default_wait (req_ptr=0x7a4870, status=0x7fffe39d7b60)
    at request/req_wait.c:41
#8  0x00007ffff7b4207e in PMPI_Wait (request=0x7a4870, status=0x7fffe39d7b60)
    at pwait.c:70

@ggouaillardet
Contributor

Could you please squash the two commits into a single one?
Then I will merge this PR and backport it to the release branches.

Signed-off-by: Carlos Bederián <bc@famaf.unc.edu.ar>
@zzzoom
Contributor Author

zzzoom commented Jul 24, 2017

There you go, thanks

ggouaillardet merged commit e054f87 into open-mpi:master on Jul 25, 2017
@ggouaillardet
Contributor

Thanks, I just merged the PR.
What benchmark did you use to show the 40% performance improvement?

@zzzoom
Contributor Author

zzzoom commented Jul 25, 2017

Quantum ESPRESSO PWscf benchmark runs: libnbc went from taking 2% of walltime to around 1.1%. I should take OMB for a spin.

ggouaillardet added a commit to ggouaillardet/ompi that referenced this pull request Jan 14, 2018
Instead of invoking ompi_request_test_all(), which would end up
calling opal_progress() recursively, manually check the status
of the requests.

the same method is used in ompi_comm_request_progress()

Refs open-mpi#3901

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
ggouaillardet added a commit to ggouaillardet/ompi that referenced this pull request Jan 21, 2018
ggouaillardet added a commit to ggouaillardet/ompi that referenced this pull request Jan 21, 2018
hjelmn pushed a commit that referenced this pull request Jul 17, 2018