opal_lifo test fails on s390x #10988
@opoplawski I cannot find the error in the build log. Could you please provide some more info on your environment? What compiler are you using? |
Ah, shoot, I linked the wrong build. This is Fedora Rawhide with gcc 12.2.1: https://kojipkgs.fedoraproject.org//work/tasks/4918/93474918/build.log |
Thanks! Any chance you can access |
That's the output I pasted in the first comment. |
I see. The tests run successfully on Summit with GCC 12.1.0 (the latest GCC available on that machine), but I'm not even sure that's the same architecture. I don't have access to any other IBM machine. |
Can you build in a Fedora Rawhide mock environment on that machine? Looks like I have access to some kind of s390x test machine if there are any particular things you'd like me to try.
|
Any chance you could try a different compiler like LLVM? Also, we recently switched the default from C11 atomics to GCC builtin atomics. Could you try running with C11 atomics instead by passing
Summit has Power9 CPUs, so it is a different architecture. |
So, some data points:
|
I'll also note that the test succeeds in 4.1.4 - but maybe the test has changed in 5.0. |
I've managed to reproduce it on our test machine in case there are any other local tests you would like me to run. |
Still present in 5.0.0rc10 |
Still present in 5.0.0 final |
@opoplawski Realistically, I don't know who is going to fix this. I don't know if anyone has ever run an MPI job on an IBM s390 mainframe. I don't think anyone in the known community has the resources to fix this. |
I tried to reproduce this on current |
@opoplawski Is this still an issue in Fedora rawhide? I tried again with an emulated s390x rawhide docker container and couldn't reproduce this error. |
Well, 5.0.1 still failed - see the s390x build.log from https://koji.fedoraproject.org/koji/buildinfo?buildID=2336590
I can't manage to build the rpm from a github tarball. autogen fails with:
|
I don't believe OMPI supports that approach, if you are talking about the GitHub tarballs they attach to the repo tags. I've had a rare request/discussion about that and believe it traces to the use of submodules, which leaves some dangling connections in the GitHub tarball (since it is literally just created using
That said, this specific error is one I encountered elsewhere and resolved by executing |
Since we already build with external libs, I'm getting around the submodule issue with:
Thanks for the suggestion, but the |
I got a free instance on the IBM community cloud but still had no luck reproducing this. Since there are only two cores available and this seems to be some multi-threading/atomic issue, I might not be able to trigger it there.
Interestingly, when building with clang 17 on that system
To summarize what I observed:
I am running out of time to spend on this, unfortunately. And finding proper docs on this architecture is tedious. Either someone who cares about s390x (anyone from IBM?) will pick it up or OMPI stays broken on that arch. Sorry. |
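If the failure is timing-dependent, as the comment above suspects, one common way to raise the odds of hitting it on a small machine is to run the test binary many times and count non-zero exits. The following is only a sketch: the helper `count_failures` and the test path in the usage comment are hypothetical, and nothing here is part of Open MPI's test harness.

```python
import subprocess
import sys


def count_failures(command, runs):
    """Run `command` `runs` times and return how many runs exited non-zero."""
    failures = 0
    for _ in range(runs):
        result = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if result.returncode != 0:
            failures += 1
    return failures


if __name__ == "__main__":
    # Stand-in command that always succeeds; replace it with the real test,
    # e.g. ["./test/class/opal_lifo"] (hypothetical path, adjust to the build tree).
    stand_in = [sys.executable, "-c", "pass"]
    print(count_failures(stand_in, runs=5))
```

A run count in the hundreds or thousands may be needed before a rare race shows up, and zero failures on a two-core machine does not prove the code is correct.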
Sadly, that won't take care of it - the problem is that there is another submodule attached to the
You should check to see if the GitHub tarball populates that directory. Pretty sure it doesn't, and that is why you are hitting all those errors. |
I re-discovered the "nightly" tarballs - https://www.open-mpi.org/nightly/main/ and that works for me. FWIW - build on s390x with failure: https://kojipkgs.fedoraproject.org//work/tasks/1335/111371335/build.log |
@joseemoreira Can you help here? |
Hello. Sorry for my delay in responding. I was not aware of this issue until a colleague from IBM pointed it out to me. I have to find the right person in our System z development team to address this. Will get back to you all soon. PS: Do I need to do something so that issues like this show up in my Dashboard? |
I think an @-mention will just send you an email (depending on what your github notification settings for this org are). I just assigned the issue to you, so perhaps it will show up in your dashboard now...? |
Interestingly, I have exactly the same issue on FreeBSD.
This happens with:
To reproduce the problem, simply do a |
@LaurentChardon Am I to understand that this is happening to you on an x86-64 platform? If so, that's a very different error than what is being reported here (on an IBM s390x platform). |
@jsquyres yes it's FreeBSD on amd64. |
@LaurentChardon It would be good to open a new issue on this (and possibly cross-link it to this one) -- the title of this issue pretty much guarantees that it won't get much attention. |
@jsquyres you're right, I'll do that. I will also add some data points, for example it works with aarch64, but with a different version of clang. I'll investigate a little more with various configurations and then open a new issue with the relevant info. |
Can you please look in opal_config.h for OPAL_HAVE_ATOMIC_COMPARE_EXCHANGE_128. I would also compare the assembly code for |
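The opal_config.h check requested above can be done with a small script that searches the generated header for the macro. This is a minimal sketch: the helper `find_macro` is hypothetical, and the sample header text (including the value shown) is made up for illustration, not copied from any real build.

```python
import re


def find_macro(header_text, name):
    """Return the value of `#define <name> <value>` as a string.

    Returns None when the macro is not defined, which includes the
    autoconf-style commented-out form `/* #undef <name> */`.
    """
    pattern = re.compile(
        r"^\s*#\s*define\s+" + re.escape(name) + r"\b[ \t]*(.*)$",
        re.MULTILINE,
    )
    match = pattern.search(header_text)
    if match is None:
        return None
    return match.group(1).strip()


# Hypothetical excerpt: the values here are assumptions for the example only.
sample = """
/* Hypothetical excerpt, not copied from a real build */
#define OPAL_HAVE_ATOMIC_COMPARE_EXCHANGE_128 1
/* #undef SOME_OTHER_MACRO */
"""

if __name__ == "__main__":
    print(find_macro(sample, "OPAL_HAVE_ATOMIC_COMPARE_EXCHANGE_128"))
    print(find_macro(sample, "SOME_OTHER_MACRO"))
```

On a real build tree, read the generated opal_config.h and pass its contents to `find_macro`. Comparing the result between a failing and a working build would show whether the two were configured with the same 128-bit compare-exchange support.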
Looking at updating the Fedora openmpi package to 5.0.0rc9. I'm getting the following test failure on s390x:
Full log here (for a few days at least): https://kojipkgs.fedoraproject.org//work/tasks/3745/93473745/build.log