segfault in sgemm_kernel_direct on x86_64: Illegal instruction #2526
Comments
I don't think any non-Haswell instructions could have crept into that file recently. Could it be an alignment issue?
Maybe this is caused by a DYNAMIC_ARCH=1 compilation on an SKX CPU that is then run on a HSW (or lower) CPU. The /interface/gemm.c file was compiled with USE_SGEMM_KERNEL_DIRECT=1 and is thus linked to the sgemm_kernel_direct() function, which can only run on CPUs supporting AVX512.
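To illustrate the kind of runtime check this implies, here is a minimal sketch, assuming GCC or Clang on x86. The helper name `cpu_has_avx512f` is hypothetical and not part of OpenBLAS, and testing only the AVX512F bit is a simplification, since the SkylakeX kernels may rely on further AVX512 subsets.

```c
#include <stdbool.h>

/* Hypothetical helper, not an OpenBLAS function: reports whether the CPU
 * running this code advertises the AVX512 Foundation instruction set.
 * Relies on the GCC/Clang x86 builtins; on other compilers or
 * architectures it conservatively returns false. */
static bool cpu_has_avx512f(void) {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();                      /* populate the CPU feature cache */
    return __builtin_cpu_supports("avx512f") != 0;
#else
    return false;
#endif
}
```

A caller could skip any AVX512-only fast path whenever this returns false, independent of the machine the library was built on.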
Ah yes, you are right. Then this must have been lurking for a while... Technically the interface code should not be assuming anything about the CPU, I think. Now how to resolve this, apart from disabling it in the DYNAMIC_ARCH case as a stopgap measure? Perhaps that sgemm_direct_performant function can be lifted out of the skylakex kernel and/or overridden in driver/others/dynamic.c
I guess that the simplest way is to compare the "gotoblas" struct pointer to the address of gotoblas_SKYLAKEX before calling sgemm_kernel_direct(), when DYNAMIC_ARCH is enabled.
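A rough sketch of that pointer comparison, using stand-in types rather than the real OpenBLAS definitions: `core_table_t`, `table_HASWELL`, `table_SKYLAKEX`, `active_table`, and the two stub kernels are all hypothetical names that mimic the roles of the gotoblas struct, gotoblas_SKYLAKEX, and sgemm_kernel_direct().

```c
#include <stddef.h>

/* Stand-in for the per-core dispatch table (hypothetical, not the real struct). */
typedef struct {
    const char *name;
} core_table_t;

static const core_table_t table_HASWELL  = { "HASWELL" };
static const core_table_t table_SKYLAKEX = { "SKYLAKEX" };

/* Plays the role of the "gotoblas" pointer chosen by runtime CPU detection. */
static const core_table_t *active_table = &table_HASWELL;

/* Counters so the chosen path can be observed. */
static int calls_direct  = 0;
static int calls_generic = 0;

static void sgemm_direct_stub(void)  { calls_direct++;  }  /* AVX512-only path */
static void sgemm_generic_stub(void) { calls_generic++; }  /* portable path    */

/* Take the direct kernel only when the detected core is SKYLAKEX;
 * every other core falls through to the generic path. */
static void sgemm_dispatch(void) {
    if (active_table == &table_SKYLAKEX)
        sgemm_direct_stub();
    else
        sgemm_generic_stub();
}
```

The point of comparing addresses rather than build-time macros is that the decision follows the CPU detected at load time, not the CPU of the build host.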
Going with that theory: this would explain why the azure CI is showing these failures while the travis CI is not: the original OpenBLAS libraries are compiled on travis machines, which I think have avx512 CPUs.
@wjc404 thanks for your suggestion, glad this is what I arrived at just a little later (now hope I did it correctly)
Hmm. Wouldn't that fail only if the DYNAMIC_ARCH build was done on an AVX512 host without setting TARGET to something less powerful? (In which case the build would be hosed anyway, as the compiler could have generated AVX512 instructions anywhere in the common code.)
I guess this is consistent with the following from the README.
Maybe TARGET should be PRESCOTT or something like that if not specified in a DYNAMIC_ARCH x86 build?
There probably will be "legitimate" cases where the default TARGET can be the build host. (And eventually there will be something newer than SKYLAKEX, so even defaulting to something non-AVX512 when building on SKX may not be a good idea in the future.)
The theory that it is due to running on Azure machines when OpenBLAS is compiled on Travis-ci was debunked in numpy/numpy#15809. Forcing the buffers to be misaligned does not seem to trigger an "illegal instruction" error. It may be some kind of race condition. Are there environment variables I can use to help debug this from a release build of OpenBLAS?
Not sure how that was debunked, but the sgemm_kernel_direct function should only ever get called when the build was done on an AVX512-capable machine without TARGET being explicitly set to some older cpu.
To actually answer your question (sorry), you can set OPENBLAS_CORETYPE at runtime to override automatic cpu detection (e.g.
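As an illustration only, the override could be applied from a small POSIX launcher before the process that loads OpenBLAS is started. The function name `force_openblas_coretype` and the value "Haswell" are assumptions made for this sketch, not taken from the comment above. As far as I understand, the variable is read when the library initializes, so it has to be in the environment before OpenBLAS is loaded.

```c
#define _POSIX_C_SOURCE 200112L
#include <stdlib.h>

/* Hypothetical helper: put OPENBLAS_CORETYPE into this process's
 * environment so that a child started afterwards (for example via
 * execvp) inherits it. Returns 0 on success, -1 on failure. */
static int force_openblas_coretype(const char *coretype) {
    if (coretype == NULL || coretype[0] == '\0')
        return -1;                      /* refuse an empty override */
    return setenv("OPENBLAS_CORETYPE", coretype, 1) == 0 ? 0 : -1;
}
```

Calling `force_openblas_coretype("Haswell")` and then exec'ing the test program would be one way to check whether the crash disappears once the SkylakeX code paths are out of the picture.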
Merged #1527 now
Assumed fixed by #2533 (though it is probably still not a good idea to build DYNAMIC_ARCH on SKX without specifying a lower-capability TARGET for the common codes)
@isuruf this makes sense. Is PRESCOTT going to be the lowest current CPU NumPy is likely to encounter?
On x86_64, yes.
Over in NumPy, we are getting a segfault using commit ddcbed6. xref numpy/numpy#15796. I can reliably trigger this in numpy on my x86_64 laptop via
It does require sudo to install some helper packages, then downloads a pre-built OpenBLAS, and runs NumPy tests. I don't think this is a threading problem, since I don't see the
[New Thread 0x7fffee89b700 (LWP 14552)]
message around this call like I do when threads are active. Could there be a buffer overrun somewhere? I am not sure it is an OpenBLAS error, since it only happens on PyPy. Any hints to debug would be welcome.