v4.1.x Thread fixes. #9312

awlauria · 2021-08-25T21:28:50Z

ompi/request: Add a read memory barrier to sync the receive buffer after wait completes.

We found an issue where, when using multiple threads, the data may not yet be
in the buffer when MPI_Wait() returns. Without the rmb(), testing the buffer
some time after MPI_Wait() returned showed that the data does eventually arrive.

We have seen this issue intermittently on Power9 using PAMI, but in theory it
could happen with any transport.

Signed-off-by: Austen Lauria <awlauria@us.ibm.com>
(cherry picked from commit 12192f1)
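
To make the ordering problem concrete, here is a minimal C11 sketch of the pattern the commit message describes. It is not the Open MPI code: `recv_buffer`, `request_done`, `deliver()`, `wait_then_read()`, and `run_demo()` are hypothetical names, and `atomic_thread_fence(memory_order_acquire)` stands in for the rmb(). One thread fills a buffer and then sets a completion flag; the waiter spins on the flag and issues the read-side fence before it touches the buffer.

```c
/* Illustrative sketch only: not Open MPI code. All names are hypothetical. */
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static int recv_buffer;          /* stands in for the user's receive buffer */
static atomic_int request_done;  /* stands in for a request's completion flag */

/* "Progress" thread: deliver the payload, then mark the request complete. */
static void *deliver(void *arg)
{
    (void)arg;
    recv_buffer = 42;
    /* Release store: the buffer write is ordered before the flag store. */
    atomic_store_explicit(&request_done, 1, memory_order_release);
    return NULL;
}

/* Waiter: spin until the flag is set, then issue the read-side barrier. */
static int wait_then_read(void)
{
    while (0 == atomic_load_explicit(&request_done, memory_order_relaxed)) {
        /* spin until the request is marked complete */
    }
    /* Plays the role of the rmb(): without it, a weakly ordered CPU may
     * satisfy the buffer load below before the flag load above. */
    atomic_thread_fence(memory_order_acquire);
    return recv_buffer;
}

/* Run one delivery/wait round and return the value the waiter observed. */
int run_demo(void)
{
    pthread_t tid;
    int rc, value;

    recv_buffer = 0;
    atomic_store(&request_done, 0);
    rc = pthread_create(&tid, NULL, deliver, NULL);
    assert(0 == rc);
    (void)rc;
    value = wait_then_read();
    pthread_join(tid, NULL);
    return value;
}
```

With the fence in place, `run_demo()` is guaranteed to observe the delivered value. If the fence is dropped, the C11 memory model no longer orders the buffer read after the flag read, which matches the intermittent behavior reported on Power9.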

ibm-ompi · 2021-08-25T21:54:56Z

The IBM CI (GNU/Scale) build failed! Please review the log, linked below.

Gist: https://gist.github.com/5993f493f00f83864718692818bc75dc

ibm-ompi · 2021-08-25T21:57:18Z

The IBM CI (XL) build failed! Please review the log, linked below.

Gist: https://gist.github.com/8c4777045b26173501a49ea7ff4f9b15

ibm-ompi · 2021-08-25T22:04:11Z

The IBM CI (PGI) build failed! Please review the log, linked below.

Gist: https://gist.github.com/77f2bd1854f5da192ec6c2107c8049fe

jsquyres · 2021-08-29T18:34:46Z

ompi/mpi/c/init_thread.c

    ompi_hook_base_mpi_init_thread_top(argc, argv, required, provided);

+    if (NULL != (env = getenv("OMPI_MPI_THREAD_LEVEL")))  {
+        required = atoi(env);


Is this documented anywhere? It would seem pretty important to document this behavior -- e.g., in the man page.

I also don't quite understand the commit message. It says "Allow mpi_init_thread to override the MPI_THREAD_LEVEL", but it looks more like the commit allows an env variable to override the thread level set by the app in the call to MPI_INIT_THREAD.

This is new behavior that has not existed in 4.0.x and 4.1.x. There are small (but nonzero) effects on backwards compatibility here, too.

It's making it consistent with ompi_mpi_init, which also uses this ENV variable.

ompi_mpi_init() does not check the environment variable for the thread level: MPI_Init() does.

We created an env variable "back door" for legacy apps that want to run with MPI threading support but couldn't change their source code to call MPI_Init_thread(). Honestly, it was mainly a way for us to enable MPI_THREAD_MULTIPLE with all the existing OMPI MPI test codes without changing all of them to MPI_Init_thread().

We intentionally did not put in the same backdoor to MPI_Init_thread() because MPI_Init_thread() allows the app to say exactly what it wants (where MPI_Init() does not). Having an environment variable override to something that was specified as a parameter to MPI_Init_thread() seems... sketchy.
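
The difference between the two entry points can be sketched in a few lines of C. This is an illustration of the behavior under discussion, not the Open MPI implementation: the `LVL_*` constants, their 0..3 values, and all three `level_for_*` functions are hypothetical, and falling back on an unparsable value is a choice made for this sketch only.

```c
/* Illustrative sketch only: not the Open MPI implementation. */
#define _POSIX_C_SOURCE 200809L  /* so setenv() is declared for callers */
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the four MPI thread levels; the numeric
 * values 0..3 are an assumption made for this sketch. */
enum thread_level { LVL_SINGLE = 0, LVL_FUNNELED, LVL_SERIALIZED, LVL_MULTIPLE };

/* Return the level named by OMPI_MPI_THREAD_LEVEL, or `fallback` when the
 * variable is unset or does not parse to a value in 0..3. */
static int level_from_env(int fallback)
{
    const char *env = getenv("OMPI_MPI_THREAD_LEVEL");
    char *end = NULL;
    long v;

    if (NULL == env || '\0' == *env) {
        return fallback;
    }
    errno = 0;
    v = strtol(env, &end, 10);
    if (0 != errno || '\0' != *end || v < LVL_SINGLE || v > LVL_MULTIPLE) {
        return fallback;
    }
    return (int)v;
}

/* MPI_Init()-style path: the app cannot state a level, so the env var is
 * the only way to ask for more than the default. */
int level_for_init(void)
{
    return level_from_env(LVL_SINGLE);
}

/* MPI_Init_thread()-style path without a back door: honor the argument. */
int level_for_init_thread(int required)
{
    assert(required >= LVL_SINGLE && required <= LVL_MULTIPLE);
    return required;
}

/* Same path with the debated override: the env var wins over the argument. */
int level_for_init_thread_with_override(int required)
{
    assert(required >= LVL_SINGLE && required <= LVL_MULTIPLE);
    return level_from_env(required);
}
```

With `OMPI_MPI_THREAD_LEVEL=3` in the environment, `level_for_init_thread(LVL_FUNNELED)` still returns 1, while `level_for_init_thread_with_override(LVL_FUNNELED)` returns 3. That second result is the case the review objects to: the level the app passed in is silently discarded.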

jsquyres

This PR is named "Thread fixes", but contains a fairly significant behavior change.

Indeed, that behavior change was added after the PR was positively reviewed by @bosilca. That is not cool.

The errant commit should be removed from this PR, and the first commit -- which is actually a thread fix -- can probably be merged.

The commit about adding a new feature to MPI_Init_thread() should be debated separately on a different PR. If that commit has already been merged to the v4.0.x branch, it should be reverted.

awlauria · 2021-08-30T14:57:47Z

@jsquyres that was not added after @bosilca reviewed. In fact - he commented on it in the master PR:

> Funny that we support OMPI_MPI_THREAD_LEVEL in MPI_Init but not in MPI_Init_thread.

Originally posted by @bosilca in #9302 (comment)

jsquyres · 2021-08-30T15:02:58Z

He may have commented on it elsewhere, but on the timeline of this PR, the commit came in afterwards.

Regardless, this is a big behavior change. It's not a fix. And I would argue against this behavior change, because it is specifically discarding what the app stated in the provided argument. We can/should have the debate as to whether this is a good feature to have or not (I'm not personally in favor of it, but I could be overruled). If it is behavior we want, perhaps this should actually be turned into a real MCA var and moved to ompi_mpi_init() (not handled independently in MPI_Init() and MPI_Init_thread(), and not a standalone env var). Additionally, the commit message should be fixed to be correct.

awlauria · 2021-08-30T15:06:08Z

@jsquyres if you actually clicked on the 'forced-push' link you will see that I fixed a bad cherry-pick of that exact commit - in that I only changed the name of the variable which caused the build to fail (a slight difference from the master cherry-pick)

I did not add this after @bosilca reviewed.

awlauria · 2021-08-30T15:06:57Z

if you don't want this that is fine - but for the record I did not "sneak this in" after it was reviewed, and it matches what went into v4.0.x, v5.0.x and master.

jsquyres · 2021-08-30T15:09:28Z

> if you don't want this that is fine - but for the record I did not "sneak this in" after it was reviewed, and it matches what went into v4.0.x, v5.0.x and master.

Ok. That is not immediately clear from the github UI. In the future, it would probably be good to either re-request reviews or put a comment in explaining what happened and why a re-review was not necessary.

awlauria · 2021-08-30T15:12:01Z

> > if you don't want this that is fine - but for the record I did not "sneak this in" after it was reviewed, and it matches what went into v4.0.x, v5.0.x and master.
>
> Ok. That is not immediately clear from the github UI. In the future, it would probably be good to either re-request reviews or put a comment in explaining what happened and why a re-review was not necessary.

it is clear from github - it's literally documented in the link.

I did not want to waste people's time with re-reviewing a mistake I made in the cherry-pick. If it was a real change, I would have re-requested a review from the appropriate parties.

jsquyres · 2021-08-30T15:40:42Z

You have educated me on a point that I didn't know: you can click on a force push link on the github UI and it shows the change from that force push. Thanks.

However, the github UI very much makes it look like a commit was added after the review: there's a line for the commit and then there's another line for the force push. So there's no indication left on the UI that the 2nd commit was there before the review.

This is why I say that I think it's helpful in a collaborative spirit to add a comment saying "Hey, that force push was just me making a trivial fix. No need to review again." Indeed, sometimes there are a bunch of force pushes, and having to click through a bunch of them to determine what changed is more work for each reviewer vs. the author adding a quick comment. Additionally, we also get emails upon force pushes. Rather than having to click through to chase down everything a PR author did, if the PR author just adds a quick comment summarizing the changes, that can be helpful.

Just my $0.02.

awlauria · 2021-08-30T15:53:44Z

Yeah - redundancy is never a bad thing I suppose. I didn't think such a trivial change needed to be documented/commented - and figured that the link to the code change was sufficient.

if this is such a huge issue I think github has an option to invalidate reviews on a force push. Maybe we should consider turning that on.

awlauria · 2021-08-30T16:02:30Z

I will say that commenting on what a force-push does isn't bound to anything - and people could say anything about it if they think maintainers won't follow the push and really wanted to sneak code in.

So - following and viewing the diffs is still good diligence - even if the person doing the force-push commented on what it did.

jsquyres · 2021-08-30T16:39:27Z

> I will say that commenting on what a force-push does isn't bound to anything - and people could say anything about it if they think maintainers won't follow the push and really wanted to sneak code in.
>
> So - following and viewing the diffs is still good diligence - even if the person doing the force-push commented on what it did.

Understood. I don't think we disagree. We are a trusting dev community, probably because most of us know each other. The reasons for my "request changes" review (and the corresponding #9331 revert to v4.0.x) are more about the fact that I don't think that we want to make this potentially-backwards-compatibility-affecting change to the release branches.

#9332 is a place to discuss what we want to do with this feature for v5.0.x and forward.

Removed commit

jsquyres

Thanks for removing the additional commit. All is good now.

awlauria added the Target: v4.1.x label Aug 25, 2021

awlauria added this to the v4.1.2 milestone Aug 25, 2021

awlauria requested review from bosilca and jsquyres August 25, 2021 21:28

bosilca approved these changes Aug 26, 2021

awlauria force-pushed the v4.1.x_threads branch from 2bfe913 to 55bcda1 Compare August 26, 2021 12:46

jsquyres reviewed Aug 29, 2021

jsquyres previously requested changes Aug 30, 2021

jsquyres mentioned this pull request Aug 30, 2021

v4.0.x: Revert "Allow mpi_init_thread to override the MPI_THREAD_LEVEL" #9331

Merged

bosilca mentioned this pull request Aug 30, 2021

What to do with OMPI_MPI_THREAD_LEVEL env variable? #9332

Open

awlauria force-pushed the v4.1.x_threads branch from 55bcda1 to aa4529b Compare September 3, 2021 15:14

awlauria requested a review from jsquyres September 3, 2021 15:15

jsquyres added the RM approved label Sep 7, 2021

jsquyres approved these changes Sep 7, 2021

jsquyres merged commit 66d34c3 into open-mpi:v4.1.x Sep 7, 2021

awlauria deleted the v4.1.x_threads branch March 17, 2022 17:28


5 participants
