-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Regression on PPC970, resulting in [some] dependents of OpenBLAS being unusable #5034
Comments
@martin-frbg Could you please take a look? I can do bisecting is needed, but it will be helpful to have an idea what is likely to have gone wrong. Otherwise it will take forever, OpenBLAS is not a quick thing to build on a G5. Update. I will try to rebuild 0.3.28 with native opts to make sure if the regression happened in fact after it or just went unnoticed. |
@martin-frbg Turns out 0.3.28 is also broken if built with optimizations. So it is not something after 0.3.28 for sure. Could be optimization-specific bug in OpenBLAS, could be a bug in GCC. |
Would you happen to know which older version definitely worked for you ? This is probably an older bug, if it is a bug in OpenBLAS at all - last time I changed something for the old G-series was 0.3.26 I think, and most BLAS kernels are likely original GotoBLAS material from 15 years ago. Any new "POWERPC" code is targeting at least POWER8, and I do not have any G4/G5 era hardware |
@martin-frbg We had this issue earlier, related to optimizations: #4376 It is hard to say which version worked for sure (i.e. worked with optimizations enabled), since the criterion is unclear (I mean, the bug may have been present but was unnoticed). Everything was building against both versions of OpenBLAS (with the exception of the issue linked), and I was using OpenBLAS-devel quite often, I think. However that in itself does not guarantee that it was built correctly. If this is a possible bug in GCC, I will bring it to upstream, but I need some meaningful and minimal test case to reproduce the problem. Could you advise what to check specifically, given the context we have? |
OK, that would be the change in 0.3.26 that I remembered - but I do not think I wii be able to be of much help here. If the error is intermittent, maybe it depends on register use or data alignment - I can look into that, but the ppc970 is an ancient and unfamiliar cpu to me. |
I can certainly build 2–3 releases with |
The thing is that I do not think any recent commit could have been particularly problematic, as all ongoing powerpc development is focused on current server cpus. If there is a way for you to figure out from your Python backtrace which BLAS function is causing the crash, I can take a closer look - but there should be minimal overlap, if any, between the optimized kernels on a POWER9 or 10 cpu and what gets invoked on a 20+ year old desktop machine. |
That said, what (I) changed in 0.3.28 was the SCAL kernel in #4807 - introducing an additional variable and conditional into the assembly code that is used across multiple ppc cpus. But if that commit is broken, it should have terminated the build on more that one target, no matter if "native" optimization is enabled or not... |
@martin-frbg Thank you. The log seems to mention P. S. MacPorts does not build anything at all on PowerPC, buildbots are not there. I am building some ports (not in a reference environment), but it is unfeasible to test everything which is being built. I try to test important stuff though or when needed to debug something, and OpenBLAS is important certainly. |
thanks - DGEQR2 is (unoptimized Fortran) LAPACK but the call graph at https://www.netlib.org/lapack/explore-html/d6/da5/group__geqr2_ga0ff91490bc2e246cabb8fe02f3f1da97.html touches DSCAL so it could have been me (if the new variable lives at a different stack location on MacOS, which I have no way of testing). Reverting #4807 should cause a number of test failures unless you disable the SCAL-related utests - this is an unfortunate trade-off berween fixing the NaN handling in user-generated calls vs. the naive behaviour of SCAL when called by other BLAS functions, often as a quick way to zero an array |
Looking at this again (and on more than just a phone screen), it occurs to me that (1) if the DSCAL changes were wrong, they should blow up both with and without native optimizations - if by native optimizations you mean compiler flags (?) |
Maybe time to bring GCC's Iain Sandoe back into the loop (though I think he'd want a very small and self-contained reproducer while we're not even sure where we are exactly) ? Or try dropping the |
I will try your suggestion, thanks. |
@martin-frbg OpenBLAS from 20240704 with native opts (which adds |
gfortran would be implied as the culprit if the crash is in DGEQR2, simply because dgeqr2 is LAPACK code copied from the Reference-LAPACK project, which is entirely written in Fortran. |
I am running the build of OpenBLAS from e1eef56 (20240704) now on a system where the failure was observed, so soon I will know if it is broken or now. On another test system where 20240704 worked it was built with gcc 14.1.0, so we got different gcc versions. I just thought whether gcc config matters: I have an optimized version with was built with So there are a number of variables here which may affect the result. I will run several builds to exclude, hopefully, some possibilities, at least those which do not require rebuilding gcc itself. |
@martin-frbg Okay, e1eef56
|
Let me try bisecting from here. We know that e1eef56 works but 0.3.28 does not. There are perhaps not too many rebuilds needed to identify a breaking commit. |
@martin-frbg Bisecting was quick. It is indeed #4807 PR broken it. 15c53dd is still fine, but fb7c53c fails:
|
@martin-frbg Since you have no older powerpc hardware, can I assist with finding more specifically what causes the breakage? Rebuilds on the Quad are quick, so if you can make a branch with possible fix, I can try that, or w/e else may be needed. |
#4807 would be about the only relevant change between July 4 and July 25 - at least I don't think #4796 could break anything. Quickest workaround would probably be to drop
into kernel/power/KERNEL.PPC970 |
Yes, it is exactly fb7c53c which introduced the breakage, since immediately previous commit works fine. I will try the suggestion. |
Thank you - but frankly the problem is that I'm not really sure I know what I'm doing on these old machines. Looks like I'm either using a register that the OS or compiler want to have for different things, or the stack address I'm reading the flag value from is not the correct one when running MacOS. Guess I would need a stack dump from inside a debugger to figure out at least the second option. Not sure if that is worth the trouble, performance is probably not that much worse with the simple C kernels and I suspect nobody is using 20 years old hardware for performance reasons anyway. What might be good to know is if the same problem exists on the PPC440 or PPCG4 targets with current |
@martin-frbg Regarding registers, if you tell me in a couple of words how R2, R11, R12, R13 are used, I can verify with ABI docs or we can ask Iain to help here. Few things to note:
I will check G4 build. Rebuilding now from the develop branch with added patch which you suggested to try. P. S. It is worth fixing this, IMO, as long as it does not take huge effort, at least for two reasons. Of course, nobody will use OpenBLAS commercially on a G5, but a) it is used on a number of systems (all three BSDs run on PowerPC Macs, some Linux distros do, some it is not just Apple-relevant), and users may end up with a broken OpenBLAS, once they build cpu-specific code; b) some computation-heavy stuff depends on OpenBLAS, and testing that takes time, so performance reason is still there. |
@martin-frbg
|
To test G4 kernel, do I need to physically build that on a G4 or I can just force the build to use G4-specific options, running on G5? And also, should I start with the same patch or without it? |
I think simply building on your G5 with TARGET=PPC440 or PPCG4 should be sufficient, thank you. Regarding register use, r12 is simply used to take an integer that (I assume) is passed on the stack, and compare that to see if it is zero or one. I have used r13 instead of r12 in a branch that I think does not get executed at all in your case, for DSCAL in ppc mode (I was blindly following the register use in existing code there). In principle, any other available register would do to serve as temporary storage for the FLAG value, I saw r12 used in existing code e.g. gemv so assumed it would be safe. |
From the ABI document, I still think my assumption that the added parameter would be in SP+120 on the stack is correct. |
@martin-frbg Sorry, I missed that earlier: what MacPorts does by default (as an allegedly non-optimized build) is just forcing G4 kernel for ppc:
So no need to test that, we know that this build worked fine. Only PPC970 was affected. Allow me to look into the ABI tomorrow, it was a long day. P. S. By the way why adding those from arm dir fixed the problem?
|
Good to know that G4 is unaffected. Adding the "generic" C kernels to the build configuration overrides the default from kernel/Makefile.L1 which specifies SSCALKERNEL=scal.S and DSCALKERNEL=scal.S . (Overriding the defaults for CSCAL and ZSCAL should indeed be unnecessary, as the new flag parameter for NaN handling was only added to the real-space functions. The fact that a lot of the generic BLAS kernels ended up in "arm" or "mips" rather than "generic" is just a historic oddity - not sure if OpenBLAS inherited this from GotoBLAS already, or if it crept in later as more architectures got added). |
Thank you, and the same to you! (Yes, I am in TW at the moment, so time is accurate.) |
@martin-frbg Yes, I did not forget, sorry for a delay. In plans for this weekend. |
No problem, I just wanted to make the temporary fix public, and perhaps release the much-delayed 0.3.29 with it while I have free time. |
@martin-frbg Hold on, let’s give it a try to fix it now. Since the issue is localized now, we can probably figure out the correct code for Darwin ABI just by trying a number of possible options. I do not need to rebuild the whole of OpenBLAS for that, it is four files, so should be a matter of less than an hour to try everything which can make sense. |
Easiest would be changing the register for "FLAG" in scal.S to r10 or r14 (if r14 exists on PPC970):
|
I will try build now with this patch and let you know soon. |
Ok, the patch with
I will try r10. |
I have an idea, let me try something else. |
Less than an hour was overly optimistic estimate LOL Comparing G4 kernel code for G4 version:
G5 version:
Also, shouldn't it be |
Oh, I have fixed it by the way. |
To be honest, I have no idea - this is very old code and a platform I'm neither familiar with nor have available for testing. My addition was mostly a guess based on what worked on other platforms and what the other code for G4/G5 looked like. Thanks for fixing this. |
@martin-frbg Did you or someone test the new code on 32-bit AIX and/or Linux? I know ABI differs from Darwin, but loads are bitness-conditional for all OS in ppc440 but are fixed for all OS in ppc970, so that would probably be sufficient in this context. |
All I can do for this old hardware is try to compile-test on Power7/AIX (Gcc Compile Farm machine provided by OSUOSL). I guess there would eventually be a bug report like yours from some Linux distribution still catering to these old systems. |
@martin-frbg Thanks for merging! |
I have built OpenBLAS from the current master 17803e7 and it turns out the library is broken on Darwin PowerPC.
Updated. Release 0.3.28 works fine without native optimizations, but exhibits the same problem when optimizations are enabled.
Specifically, I tried building py-scipy against the new OpenBLAS, and it failed at configure. Meson was not really helpful with the output, however eventually I found where exactly configure fails and just tried the same command outside of the build:
It led to a bus error:
Switching to 0.3.28, I was able to configure py-scipy normally, and the build runs now with no issues. Nothing else changed, I did not even have to rebuild
py-numpy
, only switch OpenBLAS from 17803e7 to the release.Importantly, MacPorts build
OpenBLAS-devel
(the one tracking upstream) with native optimizations, while releases are built without those.The text was updated successfully, but these errors were encountered: