Skip to content
This repository has been archived by the owner on Sep 30, 2022. It is now read-only.

ompi/request: change semantics of ompi request callbacks #1325

Merged
merged 1 commit into from
Sep 6, 2016

Conversation

hjelmn
Copy link
Member

@hjelmn hjelmn commented Aug 18, 2016

This commit changes the sematics of ompi request callbacks. If a
request's callback has freed or re-posted (using start) a request
the callback must return 1 instead of OMPI_SUCCESS. This indicates
to ompi_request_complete that the request should not be modified
further. This fixes a race condition in osc/pt2pt that could lead
to the req_state being inconsistent if a request is freed between
the callback and setting the request as complete.

Signed-off-by: Nathan Hjelm hjelmn@lanl.gov

(cherry picked from commit open-mpi/ompi@6aa658a)

Signed-off-by: Nathan Hjelm hjelmn@lanl.gov

This commit changes the sematics of ompi request callbacks. If a
request's callback has freed or re-posted (using start) a request
the callback must return 1 instead of OMPI_SUCCESS. This indicates
to ompi_request_complete that the request should not be modified
further. This fixes a race condition in osc/pt2pt that could lead
to the req_state being inconsistent if a request is freed between
the callback and setting the request as complete.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>

(cherry picked from commit open-mpi/ompi@6aa658a)

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
@hjelmn
Copy link
Member Author

hjelmn commented Aug 18, 2016

:bot🏷️bug
:bot:milestone:v2.0.2
:bot:assign: @bosilca

This fixes a race in osc/pt2pt when using multiple threads. @bosilca and I have agreed on this fix. This change only affects osc/pt2pt on v2.x at the moment.

@ompiteam-bot ompiteam-bot added this to the v2.0.2 milestone Aug 18, 2016
@hjelmn
Copy link
Member Author

hjelmn commented Aug 18, 2016

@hppritcha I marked this for 2.0.2 because it is not 100% necessary for 2.0.1 but it is a nice to have. The problem leads to a hang or a crash with lots of threads and osc/pt2pt.

@bosilca
Copy link
Member

bosilca commented Aug 18, 2016

👍

@jsquyres
Copy link
Member

@hppritcha We should consider moving this back to v2.0.1. It's a real threaded race condition.

@mellanox-github
Copy link

Test PASSed.
See http://bgate.mellanox.com/jenkins/job/gh-ompi-release-pr/2100/ for details.

@hjelmn
Copy link
Member Author

hjelmn commented Aug 25, 2016

@jsquyres I have one more commit for this. If possible please consider for 2.0.1. I am testing the last fix now. So far I have not been able to get it to hang.

@jsquyres
Copy link
Member

@hjelmn I think we missed the boat on this one for v2.0.1, especially since there's another commit coming.

Specifically: there's always going to be one more commit from a random developer -- we have to close the door and actually do a release at some point. Sorry.

@hjelmn
Copy link
Member Author

hjelmn commented Aug 26, 2016

Yeah, not unexpected. I think osc/pt2pt is clean now. Spent most of yesterday tracking down a hang on Cray systems. Looks like that one is a btl/ugni bug. Will have that fixed for 2.0.2.

@hjelmn hjelmn mentioned this pull request Sep 2, 2016
@jsquyres
Copy link
Member

jsquyres commented Sep 6, 2016

@hppritcha and I talked -- approved.

@jsquyres jsquyres merged commit b2d1c50 into open-mpi:v2.x Sep 6, 2016
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants