Why is Big-endian powerpc no longer supported ? #4349
Comments
Thanks for asking your question. I'm sorry that dropping support for Big-endian powerpc has affected you. The community makes every effort to support as wide of a selection of platforms as possible, with our limited volunteered resources. During the development of Open MPI v3.0.0, during our Open MPI supported platforms discussion, we discussed who might be willing to work on and support Open MPI on Big-Endian Power PC platforms. IBM committed to support the ppc64le (Little Endian) platform on Linux, however all IBM roadmaps only have Little Endian products. Due to perceived lack of interest, IBM chose not to support Big Endian Platforms. Another aspect to the removal of support is due to our new improved continuous integration testing automation. Whenever anyone creates a pull request to either master branch or a release branch, the automation automatically begins a build and test of that pull request on all supported platforms (existing across various different member organizations). This automation has drastically improved quality to the point where we now define our support statement for Open MPI such that we only support platforms that we have added to this infrastructure and test regularly. The Open MPI community is interested in supporting as many platforms as practical, and would be interested in working with anyone who is interested in supporting a platform, provided we can obtain a testing platform, and a contact person or organization. If you are still interested, please reply, and we can continue discussing. |
Thanks for this reply. My concern is that, from a Debian Linux perspective, we support and test multiple architectures that are not part of the CI infrastructure upstream (see https://buildd.debian.org/status/package.php?p=openmpi); we then run our own tests on these. PPC on Big-endian systems is arguably one of the better supported of the lesser archs. It seems strange that we will support alpha, hppa, sparc64, etc (without the latest fabrics, of course) but not powerpc. This testing on other archs has two main benefits beyond supporting the tiny number of users doing HPC on minority archs. It shakes out the bugs latent in the codebase, and then in doing so, it makes future ports to new archs (arm64, etc) possible as developers know the codebase actually supports the standards rather than having many assumptions ("All the world's a VAX", as they used to say). Can we downgrade the failure to build on BE PPC from a hard error to a warning - "WARNING: this architecture is not tested and officially supported" ? |
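To make the "hard error vs. warning" request concrete, here is a minimal Python sketch of the two policies. It is only an illustration: the real check lives in Open MPI's configure script, which is not shown in this thread, and the names `TESTED_ARCHS` and `check_platform` as well as the list of architectures are hypothetical.

```python
import warnings

# Hypothetical set of architectures covered by upstream CI.
# This is an assumption for illustration, not Open MPI's real support list.
TESTED_ARCHS = {"x86_64", "aarch64", "ppc64le"}


def check_platform(machine: str, strict: bool = False) -> bool:
    """Return True if `machine` is in the tested set.

    With strict=True an untested architecture raises, which models the
    "hard error" behaviour. Otherwise only a warning is emitted and the
    caller can carry on with the build.
    """
    if machine in TESTED_ARCHS:
        return True
    message = (
        f"WARNING: architecture {machine!r} is not tested "
        "and not officially supported"
    )
    if strict:
        raise RuntimeError(message)
    warnings.warn(message, stacklevel=2)
    return False
```

Calling `check_platform("ppc64")` would warn and return `False`, while `check_platform("ppc64", strict=True)` would abort, which is roughly the difference between the two behaviours being debated here.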
I think the work required is a bit more involved than just changing the error to a warning. I thought I remembered some serious bug reports against PowerPC Big Endian that would need to be resolved. |
Of course, it was IBM. They also forcefully removed PowerPC BE support in Golang for no reason (I still have a working fork which I regularly rebase against master) and also almost objected to my work on ZFS on Linux on PowerPC BE.
We in Debian are happy to support big-endian PowerPC targets and if there are bugs to be fixed, there are many people who are happy to help. It would be great if this could be turned into a warning again and if there are issues, we will be seeing them on our infrastructure and provide patches to fix them. Thanks! |
Fedora still builds for big-endian powerpc. I'm not sure what will be more painful - dealing with a buggy open-mpi on BE powerpc or putting in all of the ExcludeArch ppc64 statements in all of the dependent libraries. FWIW - BE powerpc64 is one of the three architectures that Fedora's CI system runs builds for, see https://apps.fedoraproject.org/koschei/package/openmpi. You'll also see there that there are sporadic test failures for ppc64. |
Hmm, really surprised to see this change introduced in 2.1.2 - ostensibly a bug-fix release. |
The Open MPI community has a policy to discontinue support in OpenMPI v3.0.0 for any platform that does not have regular continuous integration testing for new pull requests, and also a clearly identified maintainer who is willing to step up and help debug and fix problems found on that platform. Unfortunately ppc64 Big Endian was one of the architectures that no one in the community stepped up to this level of support. The Open MPI Community welcomes additional continuous integration platforms, help maintaining those platforms, submitting CI test results to the community jenkins server, and help in debugging issues found on those platforms during continuous integration. If you're interested, please let us know. |
I believe the question wasn't about v3.0.0, but rather why it disappeared in v2.1.2 - which was supposed to be a bug-fix release and should have fallen under the compatibility promise. |
What's wrong with keeping the code if it doesn't hurt the other architectures? I don't understand a policy like this. There is code that people are using, so why not just keep it? I would understand this argument if the code would actually hurt other targets. But as long as that doesn't happen, why not just leave it in and accept drive-by patches to fix issues. These super-strict policies just result in lots of frustration in downstream projects. Debian and other downstreams are willing to help and provide patches, but please understand that OpenMPI isn't the only upstream project on this planet and if every other project imposed such high standards on their downstream projects, we could throw away all ports except x86_64 and arm64. We are helping wherever we can, contributing patches to almost any important project in the community. Heck, I have even become upstream committer in OpenJDK and Firefox because I am so busy contributing such patches. But we (I) can't be upstream maintainer and committer in every project on the planet. Really, please don't do that. It's incredibly frustrating. |
@gpaulsen , for reference you can see the build systems we use here: In building packages, we also run any test suites contained within these. Periodically, but in particular as part of transitions (major library changes) we rebuild everything and test everything - eg. for openmpi2 -> openmpi3 testing I've rebuilt all MPI packages in Debian against openmpi3 - minor bugs ensued - eg. python-escript presumed Openmpi version numbers are always dotted integers, which 3.0.1rc1 broke - minor patch sent to them for this, etc. So we are typically doing a lot of testing that is not seen upstream (here) unless bugs are found and the change is in openmpi not the client package (or below, on a lib used by openmpi). It would be useful if there was a reference here to the changeset that dropped BE support, and the bugs that this closed. |
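The python-escript breakage mentioned above (assuming version numbers are always dotted integers) is easy to reproduce. The sketch below is not the actual python-escript code or the patch that was sent; the function names and the regular expression are assumptions chosen for illustration.

```python
import re

# Matches "MAJOR.MINOR.PATCH" optionally followed by a suffix such as "rc1".
# The pattern is an illustrative assumption, not an official version grammar.
_VERSION_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)(.*)$")


def parse_version_naive(version: str) -> tuple:
    """Assume every dot-separated field is an integer.

    This raises ValueError on a string like "3.0.1rc1", because the last
    field "1rc1" is not a valid integer literal.
    """
    return tuple(int(part) for part in version.split("."))


def parse_version(version: str) -> tuple:
    """Return (major, minor, patch, suffix); suffix is '' for final releases."""
    match = _VERSION_RE.match(version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    major, minor, patch, suffix = match.groups()
    return int(major), int(minor), int(patch), suffix
```

The naive parser works for "2.1.2" but fails on a release candidate, while the tolerant one keeps the numeric fields comparable and carries the suffix separately.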
Thanks for bringing up all these issues and keeping us honest. Let me tell you what we have been doing over the past 36 hours (reminder that "Power BE" in this conversation effectively means "Power 7"):
|
Fedora is definitely moving away from BE ppc64 as well - see https://lists.fedoraproject.org/archives/list/ppc@lists.fedoraproject.org/message/C23EQYITA4DQWM7CQF6LJC5ABXY2XIEM/ Although unfortunately we haven't just dropped it yet. |
Well, there are other companies besides IBM making PowerPC hardware. Many users are running Linux ppc64 Big-Endian on FreeScale processors, for example. Does keeping the ppc64 big-endian code hurt the rest of the OpenMPI codebase? Or what exactly is the reason it should get removed? I understand that IBM wants to deprecate anything older than POWER8 because they have a strong interest in selling new hardware instead of keeping old hardware supported. But I don't think it's fair that IBM alone gets to decide about the PowerPC support status in free software projects. It might surprise some people of upstream projects, but distributions like Debian and Gentoo support hardware as old as DEC Alpha, Motorola 68000 and HP PA-RISC. And as long as keeping the code for these architectures around doesn't hurt any of the other architectures, I don't see why this should be of any problem. Other projects like systemd, OpenJDK, LibreOffice and so on don't have problems with supporting old architectures. Why is it so much of a problem for OpenMPI? |
@opoplawski Ok, good to know. So just to be clear: are you saying that Fedora doesn't care if we put Power BE support back? @glaubitz It feels like you didn't read my entire post. Can you re-read, see where I already answered your questions, and then answer the questions that I asked of distros? (by your GitHub profile, it looks like you're a SuSE employee) Thanks! |
@jsquyres You posed a single question which is: "What do the distros support in terms of BE on Power 8/9?" And the answer is none. No one who owns hardware which is capable of running little-endian code will run anything big-endian on it, there is simply no market for it. Therefore there is no distribution which has a POWER8/9 port in big-endian. And I am not sure what my employment status with SUSE has to do with this thread. I am talking in my role as a Debian Developer here, not as a SUSE employee.
And many other projects prove that this is possible. It's perfectly fine to mark old platforms as "Tier 2" and mark them as not officially supported but still allow people to use the code. It works well for many projects and the community often steps up to provide patches when things break. |
@jsquyres Here's the thing - there is a lot more distribution work required to build packages for only certain architectures. All packages that depend on said package need to have conditionals to work around the fact that a certain dependency is not present in certain locations. I'm not looking forward to doing this for the openmpi stack in Fedora. So giving up openmpi on ppc64 in Fedora, while certainly consistent with the "best effort" level of support, will be hard. That said, there are certainly times when the ppc64 build has failed and required work to fix, so that's work too. At this point I would prefer it if Fedora dropped ppc64, but not sure that's going to happen any time soon. I'm sure some old ppc mac owners would complain, both of them. :) |
@glaubitz Sure. I assumed you were asking such questions because of your employer. My mistake. However, if you're asking for any distro (e.g., Debian instead of SuSE), for the purposes of this conversation, it doesn't really matter which one. I also asked if someone would maintain Open MPI on Power BE platforms. If so, we'll re-enable it. Right now, we have no one to test+maintain our code base on Power BE platforms, and we're deeply concerned about shipping code for a platform that we have no one watching over at all: we don't know when we break it, we don't have anyone to fix bugs, etc. (it's pretty clear that you and I disagree on this point; there's probably not a lot of point in going back and forth about it). If that changes, and someone helps us out with maintaining Open MPI on Power BE platforms, great. But let's also keep in mind the truly practical point of: who on earth is running Open MPI on Power BE platforms? I'm all for free software, but I'm not in favor of doing work to support a platform on which Open MPI will realistically never be used. Specifically: I really don't want to re-enable support for a platform that is probably only compile-tested (e.g., in distro automated build farms) but largely -- or even entirely -- run-time untested just for the sake of filling in a check box on a support matrix. |
@opoplawski Fair points. While we're talking through all the options here, let me ask this (not saying this is the final solution -- I just want to ask a question here): is it viable to carry a local patch in your package that reverts #4104 / #4105? |
@jsquyres The concept of "supported" for a distribution like Debian is fuzzier than for commercial distributions: if you look at the logs (eg. https://buildd.debian.org/status/package.php?p=openmpi&suite=experimental) the grayed out architectures are "unofficial", 2nd tier: they are not candidates for the next official release and bugs for these archs will not block releases (we commit to keep the supported archs in sync). They may be included in the next release if their quality and support by the HPPA maintainer(s) is good enough, but bugfixes in them are lower priority for the package maintainer. This is the approach I advocate: keep the current set as release archs / officially supported; mark any bugs for PPC BE etc as 'unsupported' and to be fixed as best-effort. In practice the "supported" is a matter for the package maintainer: the majority of bugs for such archs are latent standards bugs: BE/LE, 32-vs-64 bit bugs, etc. Fixing these is good for the codebase not just the architecture. My intro to Debian came as a systems developer tasked with bringing up a Unix-like userspace on a handheld device with a new mips-based ASIC. The well-worn path at Debian for bringing up new archs meant this was a feasible task for a single engineer in a few months. Today in HPC I see a similar issue when exploring e.g. new cpus on FPGA-based accelerators, getting netcdf to work on the accelerators. I see the work I do in Debian as systems integration and testing, relevant here: I don't seriously expect to see anyone use OpenFOAM on m68k just because it compiles and runs, but keeping the codebase healthy enables new developments. If you look at the list of Debian archs above you'll see that OpenMPI 3.0 works on HPPA. That's because Debian maintains a trivial patch.
Now HPPA is pretty quixotic, and Debian's inclusion of it is mostly humouring some hobbyists who don't use HPC, but as a systems integrator, I see OpenMPI works on HPPA simply as a matter of course: it should work on HPPA because it's well engineered and all the patch does is enable gcc atomics. I would consider it not working on HPPA as a problem in OpenMPI to be investigated, not an HPPA problem. This is why dropping PPC support irks so much: it's a newer, better architecture and should just work. @opoplawski "dependency contagion" is a real problem but the answer has to be to use it to drive the engineering - eg keep dependencies controlled within as few packages as possible and build abstractions accordingly (eg. layers above the mpi layer should be oblivious to fabrics; packages using netcdf/hdf5 should be oblivious to whether the netcdf layer is serial/mpi or which version of mpi is used, etc). In summary, "re-enable support" should, at the OpenMPI level, come down to a compile-time warning of "THIS IS NOT A SUPPORTED ARCHITECTURE" and no checks there, rather than a compile failure, and labelling bugs as 'unsupported/low priority'. OpenMPI shouldn't worry about whether distros/hobbyists/researchers are using their code on unsupported archs, they just shouldn't break it. |
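As an illustration of the "latent standards bugs: BE/LE" point in the comment above, the following Python sketch shows how a hidden little-endian assumption stays invisible on x86_64 and only surfaces once data written in big-endian order is read back. It is not taken from Open MPI; all function names are hypothetical, and only the standard `struct` and `sys` modules are used.

```python
import struct
import sys


def encode_length_native(n: int) -> bytes:
    """Pack a 32-bit length in host byte order (layout depends on the machine)."""
    return struct.pack("=I", n)


def encode_length_portable(n: int) -> bytes:
    """Pack a 32-bit length in network (big-endian) byte order on every host."""
    return struct.pack("!I", n)


def decode_length_assuming_le(buf: bytes) -> int:
    """Latent bug: silently assumes the writer was a little-endian machine."""
    return struct.unpack("<I", buf)[0]


def decode_length_portable(buf: bytes) -> int:
    """Decode a length written by encode_length_portable."""
    return struct.unpack("!I", buf)[0]


if __name__ == "__main__":
    print("host byte order:", sys.byteorder)
    wire = encode_length_portable(1)
    print("portable decode:", decode_length_portable(wire))
    print("little-endian assumption:", decode_length_assuming_le(wire))
```

On a little-endian host, pairing `encode_length_native` with `decode_length_assuming_le` happens to round-trip, which is why this class of bug tends to go unnoticed until the code is built and run on a big-endian architecture.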
@amckinstry You make valid points. We'll discuss this in the developer community. My question still stands, though: do you downstream packagers / distro-representing people on this issue want Power BE re-enabled in the v2.0.x and v2.1.x series? I think whether we put a "This is not supported!" output and/or whether we re-enable Power BE for v3.0.x and v3.1.x are separate questions. |
For Debian we have 2.1.1 in stable and will not be moving to 2.1.2; I'm planning 3.0* for the next release, so no opinion on re-enabling Power BE for the 2.* series. |
@amckinstry @opoplawski @glaubitz We had a lengthy discussion about this stuff yesterday in our face-to-face Open MPI development meeting. Let me report the results to you...
Short version
More detail
Earlier in this issue, I cited that we could not remember why we had removed POWER7/BE support in v2.x. After much discussion today, we remembered: we thought we had a silent data corruption issue. In our world, that's about the most serious kind of bug that there is (i.e., you run and simply get wrong answers, but no obvious error occurs). Also earlier in this issue, I stated that we dropped POWER7/BE in v3.0.x because there was no one to maintain it. That is true, but the (much) more serious issue at the time was the silent data corruption -- that's why we took the extraordinary step of blocking it in configure. I.e., we thought it was seriously broken and no one was going to fix it. At the time, we did not understand the exact problem. We thought there was a strong possibility of silent data corruption (i.e., run an MPI program and get wrong results). No one could fix it, so we decided just to turn it off. This was deemed better than shipping known-seriously-broken code. Hence, we didn't want the casual user to be able to build/install Open MPI at all on this platform (because they would silently get wrong answers), so we added the configure block to all of v2.x, v3.0.x, and v3.1.x.
Later, we figured out that we had an atomic issue that would lead to deadlocks (#4563). I don't think we put two-and-two together at the time to realize that what we thought was a POWER7/BE silent data corruption was this atomic/deadlocking issue. Regardless, we now understand the issue much better and @hjelmn has said that he will work on the fix for #4563 this week. We'll get that backported to v2.x, v3.0.x, and v3.1.x. Then we'll remove the POWER7/BE block in configure in all those branches. This doesn't mean that POWER7/BE is supported -- it just means that it is no longer known to be bad (even though it's not as bad as we thought it was). We'll add something to NEWS about this as well (that the configure block for POWER7/BE was removed, but it doesn't mean it is supported, yadda yadda yadda). That being said, this issue has basically highlighted the fact that we need to communicate with you, our downstream packagers, better. E.g., perhaps you could have helped us debug / fix this issue. To that end, we have set up a new mailing list: ompi-packagers. We'd like to use this list (and intentionally try to keep it low volume) to communicate with you about such issues in the future. |
OK, thanks, this is a good outcome. I agree with the distinction 'not supported' vs 'known to be bad'. And yes, the problem was when the cause/nature of the 'silent data corruption' issue accidentally got dropped; if there was a trail that led back to a bugreport, then better decisions could be made (on all parts). I'm signing up to the ompi-packagers, thanks |
We thought there was a silent data corruption issue on POWER 7/BE systems, so we blocked building on POWER 7/BE systems altogether. We later figured out that it was just data hangs -- not silent data corruption. So in hindsight, the configure block probably wasn't necessary -- but we didn't know it at the time. Regardless, the hangs have now been fixed, and we're removing the POWER 7/BE block in configure. For more detail on the entire saga, see open-mpi#4349 (comment). Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
We thought there was a silent data corruption issue on POWER 7/BE systems, so we blocked building on POWER 7/BE systems altogether. We later figured out that it was just data hangs -- not silent data corruption. So in hindsight, the configure block probably wasn't necessary -- but we didn't know it at the time. Regardless, the hangs have now been fixed, and we're removing the POWER 7/BE block in configure. For more detail on the entire saga, see open-mpi#4349 (comment). Signed-off-by: Jeff Squyres <jsquyres@cisco.com> (cherry picked from commit 3f0ccff)
I am late here, but we in Macports support PowerPC on MacOS and actively develop for it (that is, not only maintain the code, but improve and fix new software for PPC). |
@glaubitz OMG, how I share your feelings. Also been astonished and disappointed how some upstream just throws away working code simply because no one happened to comment on time in an obscure GitHub thread… And then it takes months sometimes to restore it and non-trivial efforts to convince that “yes, it is actually used”. |
@barracuda156 you cannot reasonably expect to put the burden of supporting recent software on obsolete hardware on anyone but the hobbyists who do that in their spare time. |
I don’t think anyone expects you to *support* it. Just don’t break it intentionally.
|
And keep shipping code that has not even been compiled for 5+ years? Just revert the infamous commit, and since you do not expect any support, do what you have to do to make it work...
|
Multiple platforms support PowerPC presently and compile code for it: Macports, FreeBSD, some versions of Linux. |
@barracuda156 From my reading of this issue (and the PR's that were cross-linked at the end), the |
Sorry about the "non-bug" nature of this issue, but I'm working on OpenMPI 3.0.0 for Debian, and see BE powerpc is no longer supported.
Given that other powerpc and other BE systems are supported, I'm curious about this, and wondering what's going on.