FlexiBLAS cause core dump (simple test example) #16387
@grisuthedragon Any thoughts on this? |
I can reproduce this too, looks like it's a segfault problem. GDB backtrace:
|
@schiotz It looks like this could be a problem in recent versions of OpenBLAS, and that FlexiBLAS has nothing to do with it (but I'm not sure) |
@boegel Do you have any suggestions for workarounds or how to fix it? It is a showstopper for us, basically locking us on the foss/2020b toolchain. We see this core dump all over our code. |
Maybe you can try installing If that doesn't help, it gets more interesting, we would need to hunt down the cause of the problem, and probably come up with a patch to fix it. We should also see if the problem only happens if FlexiBLAS is used (by tweaking |
|
@boegel Can you compile OpenBLAS (same version as above) with debug info and then run your reproducer with |
@boegel I also see the problem with foss/2022a, which uses However, I tried to use IMKL as a backend (I am not sure I know what I am doing) and it crashed in the same way. This could indicate that it is a FlexiBLAS issue, perhaps the calling convention issue that @bartoldeman is referring to.
|
@schiotz you should not use Can you also try with BLIS?
(last one just to confirm it's using BLIS) |
I can confirm it does not crash with IMKL or BLIS. Here is the output when running with BLIS:
|
@schiotz can you run the original (crashing) testcase with FLEXIBLAS_VERBOSE=1 as well? |
|
Here's (the relevant part of) the GDB backtrace with
|
@bartoldeman wrote:
It certainly only occurs with complex arrays, but it is also important that the axes of the second array are swapped, so it must somehow be related to the array being non-contiguous. |
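To make the "non-contiguous" observation concrete, here is a minimal sketch of how swapping axes on a complex NumPy array yields a non-contiguous view. This is not the reproducer from the original report: the shape, the choice of axes, and the variable names are assumptions for illustration, and the sketch deliberately avoids calling any BLAS-backed routine.

```python
import numpy as np

# Assumed example data: a small complex128 array (shape chosen arbitrarily).
a = np.arange(12, dtype=complex).reshape(3, 4)

# swapaxes returns a view with permuted strides; no data is copied.
b = a.swapaxes(0, 1)

print(a.flags["C_CONTIGUOUS"])  # the original array is C-contiguous
print(b.flags["C_CONTIGUOUS"])  # the swapped view is not
print(a.strides, b.strides)     # strides in bytes, before and after the swap

# A contiguous copy holds the same values in a fresh, C-ordered buffer.
c = np.ascontiguousarray(b)
print(c.flags["C_CONTIGUOUS"])
```

Comparing a run that uses `b` with one that uses `c` may help narrow down whether strided access is involved; whether the copy avoids the crash is not established in this thread.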
Is there anything I can do to help make progress on this? Can I test something to see whether it is due to how EasyBuild builds it or to a bug in OpenBLAS? If the latter, I guess it should be reported upstream. |
I'm working on reproducing this, I suspect it's indeed something upstream in the assembly language kernel of zdot, but will isolate a bit further. |
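For readers unfamiliar with the routine under suspicion, the following is a plain-Python sketch of what the BLAS complex dot products compute over strided vectors. It is a reference for cross-checking results, not the OpenBLAS kernel; the function names are hypothetical, only positive increments are handled, and the thread does not say whether the unconjugated (`zdotu`) or conjugated (`zdotc`) variant is the one affected.

```python
def zdotu_reference(n, x, incx, y, incy):
    """Unconjugated dot product: sum of x[i*incx] * y[i*incy] for i < n."""
    total = 0j
    ix = iy = 0
    for _ in range(n):
        total += x[ix] * y[iy]
        ix += incx  # a stride larger than 1 models a non-contiguous vector
        iy += incy
    return total


def zdotc_reference(n, x, incx, y, incy):
    """Conjugated dot product: sum of conj(x[i*incx]) * y[i*incy] for i < n."""
    total = 0j
    ix = iy = 0
    for _ in range(n):
        total += x[ix].conjugate() * y[iy]
        ix += incx
        iy += incy
    return total


if __name__ == "__main__":
    x = [1 + 2j, 99 + 0j, 3 - 1j, 99 + 0j]  # every second element is padding
    y = [2 - 1j, 0 + 1j]
    print(zdotu_reference(2, x, 2, y, 1))
    print(zdotc_reference(2, x, 2, y, 1))
```

A slow reference like this can be compared against the library result on a machine where the optimized kernel does not crash.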
I'll try this. I assume I have to rebuild GCCcore, then OpenBLAS, and try again. I'll have to be careful to use the modules I built myself and not the ones on the system, but I can use FLEXIBLAS_VERBOSE to see which library it picks up. The tricky part may be checking that EasyBuild itself uses the right GCCcore module, but I should be able to see that from the full paths shown by ps while compiling. I'll report back once the builds are finished. |
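The check described above, running a test with `FLEXIBLAS_VERBOSE` set to see which library gets loaded, could be scripted roughly as follows. This is a sketch under assumptions: the helper names are hypothetical, the backend name passed via the `FLEXIBLAS` environment variable must match a backend configured on the system, and the command in the demo is a stand-in for the real test case.

```python
import os
import subprocess
import sys


def flexiblas_env(backend=None, verbose=True, base=None):
    """Return a copy of the environment with FlexiBLAS settings applied."""
    env = dict(os.environ if base is None else base)
    if verbose:
        env["FLEXIBLAS_VERBOSE"] = "1"  # ask FlexiBLAS to report what it loads
    if backend is not None:
        env["FLEXIBLAS"] = backend      # assumed backend name, e.g. "BLIS"
    return env


def run_with_flexiblas(args, backend=None):
    """Run a command with the FlexiBLAS environment and capture its output."""
    return subprocess.run(
        args, env=flexiblas_env(backend), capture_output=True, text=True
    )


if __name__ == "__main__":
    # Stand-in command; replace with the actual reproducer script.
    demo = [sys.executable, "-c",
            "import os; print(os.environ.get('FLEXIBLAS_VERBOSE'))"]
    result = run_with_flexiblas(demo)
    print(result.stdout.strip())
```

Capturing both stdout and stderr keeps the diagnostic messages next to the test output, which makes it easier to confirm that a self-built module was picked up rather than the system one.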
@bartoldeman Unfortunately, recompiling GCCcore and then both OpenBLAS and FlexiBLAS did not change anything. Regarding reproducibility: We have four different login nodes on our cluster, with four slightly different architectures. I only see the bug on three of them. Edit Affected:
Not affected:
|
I've been able to reproduce it now, so I can debug the issue. |
Checking the assembly language there's another compiler vectorization bug :(, where the loop in
gets compiled (if I understand it well!) as the equivalent of
(EDIT: No, this isn't the case; it does load for the final loop iteration.) I'm checking if the fix to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107212 fixes this. |
Adding Still bad to have another GCC issue as it could affect other code. The fix above didn't solve the issue. Will try with a snapshot, then see if it's another GCC issue with a small test case if that fails too. |
GCC bug report here https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107451 |
Thank you very much, @bartoldeman for looking into this. Is it a workaround to use |
Thank you very much indeed, @bartoldeman I can confirm that this appears to fix our problems, at least for the test case. We are now rebuilding on all platforms, and testing our code. I would expect it to work now. |
@schiotz Can this issue be closed? |
I am not 100% sure, we seem to still have some issues and are trying to figure out if it is related to this problem, or something else. |
After rebuilding GCCcore and OpenBLAS, we are seeing lots of segfaults in OpenMPI that we did not see before. I cannot in any way imagine how fixing a vectorization bug could cause that, but perhaps something else changed as well. I'll continue to investigate. |
@bartoldeman @boegel |
It is a bit strange, since OpenBLAS/FlexiBLAS do not interact with Open MPI. It's possible something in the Open MPI easyconfig was changed between compilations as well. In any case, I wouldn't worry about it. I'll close this one though, as the segfault is fixed. |
Hi EasyBuilders,
We have problems with a core-dump inside our GPAW code, from within FlexiBLAS:
We can reproduce the bug with this four-line code snippet (pure numpy code) on most (but not all) of our machines:
We see the problem with SciPy-bundle/2021.10-foss-2021b and SciPy-bundle/2022.05-foss-2022a, but not with the corresponding intel-toolchain packages. Nor do we see the problem with SciPy-bundle/2020.11-foss-2020b, which does not use FlexiBLAS (I think...).
CC: @jjmortensen