-
Notifications
You must be signed in to change notification settings - Fork 860
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
v4.1.x Thread fixes. #9312
v4.1.x Thread fixes. #9312
Conversation
…on after wait completes. We found an issue where with using multiple threads, it is possible for the data to not be in the buffer before MPI_Wait() returns. Testing the buffer later after MPI_Wait() returned would show the data arrives eventually without the rmb(). We have seen this issue on Power9 intermittently using PAMI, but in theory could happen with any transport. Signed-off-by: Austen Lauria <awlauria@us.ibm.com> (cherry picked from commit 12192f1)
The IBM CI (GNU/Scale) build failed! Please review the log, linked below. Gist: https://gist.github.com/5993f493f00f83864718692818bc75dc |
The IBM CI (XL) build failed! Please review the log, linked below. Gist: https://gist.github.com/8c4777045b26173501a49ea7ff4f9b15 |
The IBM CI (PGI) build failed! Please review the log, linked below. Gist: https://gist.github.com/77f2bd1854f5da192ec6c2107c8049fe |
2bfe913
to
55bcda1
Compare
ompi/mpi/c/init_thread.c
Outdated
|
||
ompi_hook_base_mpi_init_thread_top(argc, argv, required, provided); | ||
|
||
if (NULL != (env = getenv("OMPI_MPI_THREAD_LEVEL"))) { | ||
required = atoi(env); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this documented anywhere? It would seem pretty important to document this behavior -- e.g., in the man page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also don't quite understand the commit message. It says "Allow mpi_init_thread to override the MPI_THREAD_LEVEL", but it looks more like the commit allows an env variable to override the thread level set by the app in the call to MPI_INIT_THREAD.
This is new behavior that has not existed in 4.0.x and 4.1.x. There are small (but nonzero) affects to backwards compatibility here, too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's making it consistent with ompi_mpi_init, which also uses this ENV variable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ompi_mpi_init()
does not check the environment variable for the thread level: MPI_Init()
does.
We created an env variable "back door" for legacy apps that want to run with MPI threading support but couldn't change their source code to call MPI_Init_thread()
. Honestly, it was mainly a way for us to enable MPI_THREAD_MULTIPLE
with all the existing OMPI MPI test codes without changing all of them to MPI_Init_thread()
.
We intentionally did not put in the same backdoor to MPI_Init_thread()
because MPI_Init_thread()
allows the app to say exactly what it wants (where MPI_Init()
does not). Having an environment variable override to something that was specified as a parameter to MPI_Init_thread()
seems... sketchy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR is named "Thread fixes", but contains a fairly significant behavior change.
Indeed, that behavior change was added after the PR was positively reviewed by @bosilca. That is not cool.
The errant commit should be removed from this PR, and the first commit -- which is actually a thread fix -- can probably be merged.
The commit about adding a new feature to MPI_Init_thread()
should be debated separately on a different PR. If that commit has already been merged to the v4.0.x branch, it should be reverted.
@jsquyres that was not added after @bosilca reviewed. In fact - he commented on it in the master PR: Funny that we support OMPI_MPI_THREAD_LEVEL in MPI_Init but not in MPI_Init_thread. Originally posted by @bosilca in #9302 (comment) |
He may have commented on it elsewhere, but on the timeline of this PR, the commit came in afterwards. Regardless, this is a big behavior change. It's not a fix. And I would argue against this behavior change, because it is specifically discarding what the app stated in the |
if you don't want this that is fine - but for the record I did not "sneak this in" after it was reviewed, and it matches what went into v4.0.x, v5.0.x and master. |
Ok. That is not immediately clear from the github UI. In the future, it would probably be good to either re-request reviews or put a comment in explaining what happened and why a re-review was not necessary. |
it is clear from github - it's literally documented in the link. I did not want to waste peoples time with re-reviewing a mistake I made in the cherry-pick. If it was a real change, I would have re-requested a review from the appropriate parties. |
You have educated me on a point that I didn't know: you can click on a force push link on the github UI and it shows the change from that force push. Thanks. However, the github UI very much makes it look like a commit was added after the review: there's a line for the commit and then there's another line for the force push. So there's no indication left on the UI that the 2nd commit was there before the review. This is why I say that I think it's helpful in a collaborative spirit to add a comment saying "Hey, that force push was just me making a trivial fix. No need to review again." Indeed, sometimes there are a bunch of force pushes, and having to click through a bunch of them to determine what changed is more work for each reviewer vs. the author adding a quick comment. Additionally, we also get emails upon force pushes. Rather than having to click through to chase down everything a PR author did, if the PR author just adds a quick comment summarizing the changes, that can be helpful. Just my $0.02. |
Yeah - redundancy is never a bad thing I suppose. I didn't think such a trivial change needed to be documented/commented - and figured that the link to the code change was sufficient. if this is such a huge issue I think github has an option to invalidate reviews on a force push. Maybe we should consider turning that on. |
I will say that commenting on what a force-push does isn't bound to anything - and people could say anything about it if they think maintainers won't follow the push and really wanted to sneak code in. So - following and viewing the diffs is still good diligence - even if the person doing the force-push commented on what it did. |
Understood. I don't think we disagree. We are a trusting dev community, probably because most of us know each other. The reasons for my "request changes" review (and the corresponding #9331 revert to v4.0.x) are more about the fact that I don't think that we want to make this potentially-backwards-compatibility-affecting change to the release branches. #9332 is a place to discuss what we want to do with this feature for v5.0.x and forward. |
55bcda1
to
aa4529b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for removing the additional commit. All is good now.
We found an issue where with using multiple threads, it is possible for the data
to not be in the buffer before MPI_Wait() returns. Testing the buffer later after
MPI_Wait() returned would show the data arrives eventually without the rmb().
We have seen this issue on Power9 intermittently using PAMI, but in theory could
happen with any transport.
Signed-off-by: Austen Lauria awlauria@us.ibm.com
(cherry picked from commit 12192f1)