Skip to content

Address a race condition in libevent select. #6784

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jun 30, 2020

Conversation

abouteiller
Copy link
Member

In some rare instances (race), when an fd is closed, the libevent will enter an infinite loop.
If we do not remove a bad fd from the select list we keep getting the same error from select, and we stop doing any progress on the communication side. Thus, we forcefully disable all bad fd as soon
as select fails, and we are back in track, progress ensure and everything seems to work as expected (no leftover events in the event base).

Signed-off-by: George Bosilca bosilca@icl.utk.edu

This is not really a fix for the race condition because I could not
figure out how it happen, but it does address the problem generated by
the race. If we do not remove a bad fd from the select list we keep
getting the same error from select, and we stop doing any progress on
the communication side. Thus, we forcefully disable all bad fd as soon
as select fails, and we are back in track, progress ensure and
everything seems to work as expected (no leftover events in the event
base).

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
@rhc54
Copy link
Contributor

rhc54 commented Jun 28, 2019

Would you please push this upstream? It affects a lot of us who use libevent but don't embed it.

@abouteiller
Copy link
Member Author

libevent/libevent#911

@jsquyres
Copy link
Member

jsquyres commented Oct 8, 2019

Any reason this should not be merged?

@awlauria awlauria added this to the v5.0.0 milestone Mar 23, 2020
@awlauria
Copy link
Contributor

This looks like a good change. Not merged with upstream libevent though - but no complaints over it.

Adding it to the 5.0 milestone for tracking.

@awlauria
Copy link
Contributor

5.0 is slated to branch on April 30th. Since this is a bug fix, it doesn't need to get in by then, but the sooner the better. Thanks!

@awlauria
Copy link
Contributor

I ping'd upstream libevent on it - if they sign off I think we should pull it in.

@awlauria
Copy link
Contributor

Based on the comments in libevent/libevent#911 this should be good to go - they just have a minor nit on syntax and want a unit test before they approve.

@awlauria
Copy link
Contributor

@gpaulsen do you agree this is good to go?

@awlauria awlauria merged commit b6b300d into open-mpi:master Jun 30, 2020
@awlauria
Copy link
Contributor

I forgot we don't patch other components anymore - my bad. Revert here:

#7940

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants