Revert "Address a race condition in libevent select." #7940
We do not want to be patching upstream components anymore. The proper method is to get this merged upstream, then pull it in the next upstream release. This reverts commit c39fb57. Signed-off-by: Austen Lauria <awlauria@us.ibm.com>
This is bad policy. We can't accept living with broken functionality until some external project deems our patch interesting enough to pull it in and propagate it through their own releases.
I'm afraid we don't have a choice. With the move to git submodules for all external packages, we no longer have the ability to support OMPI-committed changes. Besides, downstream packagers routinely refuse to build with our internal versions - so if the code cannot work with the stock release of a package, then our users are trashed anyway. Best if we get the package fixed, or figure out how to work around the error.
We could still make something work with submodules, although it would be more work. The second (and a related third) aspect are more the issue, in my mind. We decided for 5.0 to get more aggressive about preferring external packages (of hwloc, pmix, libevent, prrte) rather than our internal packages. So if there's a bug in upstream, it is going to impact most of our users. When I surveyed distros for libevent, all but the soon-to-be-EOLed CentOS/RHEL 6 had a recent enough libevent. So we need to think more about mitigations / avoidance and less about direct bug fixes, or we're not actually helping our customers.
I'm kind of in between. I think that we should try to patch external functionality for critical issues - something like it causing a wrong answer in OMPI or it segv'ing in a critical path. In those cases I think a good argument can be made that we can't wait - especially if the external project is dragging its feet (such as in this patch). But patching every fix will be difficult to maintain with the new submodule system. So I guess the question is, do we think this is a critical issue? Reading the PR it doesn't seem like one to me, but I could be wrong on that. It is a good fix though and we should try to get it into libevent.
I think @bwbarrett hit it correctly - we have to get off of customized embedded versions because it fools us into thinking all is okay...and then our downstream packagers connect to the official releases, and our users can't figure out why everything is broken.
Right, instead we prefer to have our users figuring out why some capabilities described in the CHANGELOG are not available on their system, which is as bad. I'm not arguing against trying to push the fix upstream in the correct package, that's a no-brainer, the desirable long term approach. I am mostly concerned about the timeframe of such code propagations and the status of our codebase meanwhile. And here my concern is two-fold:
Last but not least, we are not very good at following our own timeframe for releases. But if we start depending on external packages to release their own stable versions such that we can pull them into our stable releases and do the appropriate testing, I am not even sure we should try setting any timeframe. @awlauria in particular this PR enables ULFM capabilities over TCP, by not generating spurious error events on remote socket closing. We might see this in master as well, but as we only close sockets on MPI_Finalize we don't really care.
This is up to you OMPI folks, but I cannot help but laugh about the whole thing. The original rationale for including embedded code was that the packages weren't universally available and we wanted the user to be able to do one download and build - i.e., we didn't want to make them install other packages just to build/run OMPI. Once OMPI became more accepted, we found that the downstream packagers refuse to package OMPI with the embedded packages, insisting instead that we operate with the external releases of those packages. So we adjusted and are moving towards removing the embedded packages. Now you are arguing that we can only work with our embedded packages, which means that anyone installing from downstream is screwed - and that this is a better situation? I guess I don't get it. 🤷♂️ whatever you guys decide
I don't think I said which solution is better; I just made clear that none of the current proposals is good, and that we should decide what we want to sacrifice.
We talked about this today on the Webex:
Hang on, how does this fit with the strategy of "use an external libevent if it is found"? It feels like the discussion went wrong somewhere (likely because I unexpectedly could not attend today).
I think the conversation went okay - what we basically concluded was that IF someone configures OMPI with ULFM, then we would fall back to using the internal libevent. If they try to configure with ULFM and with an external libevent, then we would just error out. Someday, when we know an external libevent release version has the patch, we can modify that to allow the external libevent if it is at least the "known good" level. The only alternative is to reject the ULFM patch - it cannot safely run with an unpatched libevent.
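The decision described in the comment above can be sketched as a small function. This is only an illustration of the logic, not Open MPI's configure code: the names `choose_libevent`, `ConfigureError`, and `KNOWN_GOOD_LIBEVENT` are hypothetical, and the "known good" version is left unset because the thread does not name a libevent release that carries the patch.

```python
from enum import Enum
from typing import Optional, Tuple

Version = Tuple[int, int, int]

# Assumption: no libevent release is known to carry the fix yet, so there is
# no "known good" version. Set this to a version tuple once one exists.
KNOWN_GOOD_LIBEVENT: Optional[Version] = None


class Libevent(Enum):
    INTERNAL = "internal"
    EXTERNAL = "external"


class ConfigureError(Exception):
    """Raised when the requested combination cannot be built safely."""


def choose_libevent(
    ulfm: bool,
    want_external: bool,
    external_version: Optional[Version] = None,
    known_good: Optional[Version] = KNOWN_GOOD_LIBEVENT,
) -> Libevent:
    """Pick a libevent following the rule discussed in the thread (hypothetical helper)."""
    if not ulfm:
        # Without ULFM the selection is unchanged: honor the user's request.
        return Libevent.EXTERNAL if want_external else Libevent.INTERNAL
    if not want_external:
        # ULFM falls back to the internal (patched) libevent.
        return Libevent.INTERNAL
    # ULFM plus external libevent: allowed only at or above a known good version.
    if (
        known_good is not None
        and external_version is not None
        and external_version >= known_good
    ):
        return Libevent.EXTERNAL
    raise ConfigureError(
        "ULFM cannot safely run with an unpatched external libevent"
    )
```

With the defaults, `choose_libevent(ulfm=True, want_external=True, external_version=(2, 1, 12))` raises `ConfigureError`, which matches the "just error out" behavior. Passing a `known_good` tuple later would relax the check without changing the callers.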
I ran with the latest libevent release, 2.1.12, which is only a few weeks old.
As for mitigation on OS-X:
So it sounds like we should:
Correct?
See #7666 for a discussion / handy table for minimum supported versions of libevent that we talked about for v5.0.
This is going away once we update the libevent submodule and the new configury work comes in, but for completeness and git history, I am merging this.
Notes from Webex discussion on 28 July 2020: