segfault in sgemm_kernel_direct on x86_64: Illegal instruction #2526

mattip · 2020-03-22T11:17:50Z

Over in NumPy, we are getting a segfault using commit ddcbed6. xref numpy/numpy#15796. I can reliably trigger this in numpy on my x86_64 laptop via

git checkout master
bash tools/pypy_test.sh

It does require sudo to install some helper packages, then downloads a pre-built OpenBLAS, and runs NumPy tests. I don't think this is a threading problem, since I don't see the [New Thread 0x7fffee89b700 (LWP 14552)] message around this call like I do when threads are active. Could there be a buffer overrun somewhere?

I am not sure it is an OpenBLAS error, since it only happens on PyPy. Any hints on debugging this would be welcome.

martin-frbg · 2020-03-22T11:40:46Z

I don't think any non-Haswell instructions could have crept into that file recently. Could it be an alignment issue?

wjc404 · 2020-03-22T12:35:34Z

Maybe this is caused by compiling with DYNAMIC_ARCH=1 on an SKX CPU and then running on an HSW (or older) CPU. In that case interface/gemm.c was compiled with USE_SGEMM_KERNEL_DIRECT=1 and is therefore linked to the sgemm_kernel_direct() function, which can only run on CPUs supporting AVX512.

martin-frbg · 2020-03-22T12:48:26Z

Ah yes, you are right, so this must have been lurking for a while... Technically the interface code should not be assuming anything about the CPU, I think. Now how to resolve this, apart from disabling it in the DYNAMIC_ARCH case as a stopgap measure? Perhaps the sgemm_direct_performant function can be lifted out of the skylakex kernel and/or overridden in driver/others/dynamic.c.

wjc404 · 2020-03-22T13:06:11Z

I guess that the simplest way is to compare "gotoblas" struct pointer to the address of gotoblas_SKYLAKEX before calling sgemm_kernel_direct(), when DYNAMIC_ARCH is enabled.
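
The guard described above can be modelled in a few lines. This is only a toy sketch in Python, not OpenBLAS's C code: `KernelTable`, `GOTOBLAS_SKYLAKEX`, `GOTOBLAS_HASWELL` and `choose_sgemm_path` are hypothetical stand-ins, and the identity test (`is`) plays the role of the proposed pointer comparison.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KernelTable:
    """Stand-in for one per-CPU function table in a DYNAMIC_ARCH build."""
    name: str


# Hypothetical stand-ins for the per-CPU tables (assumption, not real symbols
# accessible from Python).
GOTOBLAS_SKYLAKEX = KernelTable("SKYLAKEX")
GOTOBLAS_HASWELL = KernelTable("HASWELL")


def choose_sgemm_path(gotoblas: KernelTable, direct_compiled_in: bool) -> str:
    """Pick the code path the interface layer would take.

    The direct kernel is chosen only if it was compiled in AND the table
    selected at runtime is the SKYLAKEX one. The check is an identity
    comparison, mirroring the pointer comparison proposed above.
    """
    if direct_compiled_in and gotoblas is GOTOBLAS_SKYLAKEX:
        return "sgemm_kernel_direct"
    return "generic sgemm"
```

With this guard, a library built on SKX but running with the Haswell table selected falls back to the generic path instead of reaching AVX512 code.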

mattip · 2020-03-22T13:28:15Z

Going with that theory: this would explain why the azure CI is showing these failures while the travis CI is not: the original OpenBLAS libraries are compiled on travis machines, which I think have avx512 CPUs.
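
One way to check this theory on a given CI machine is to look for the AVX512 feature flag directly. A minimal sketch, assuming a Linux host whose /proc/cpuinfo has a `flags` line; `cpu_flags` and `has_avx512f` are hypothetical helper names, not part of NumPy or OpenBLAS:

```python
from pathlib import Path


def cpu_flags(cpuinfo_text: str) -> set:
    """Return the feature flags from the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def has_avx512f(cpuinfo_text: str) -> bool:
    """True if the CPU reports the AVX-512 Foundation feature flag."""
    return "avx512f" in cpu_flags(cpuinfo_text)


if __name__ == "__main__":
    path = Path("/proc/cpuinfo")  # Linux only (assumption)
    if path.exists():
        print("avx512f:", has_avx512f(path.read_text()))
    else:
        print("no /proc/cpuinfo on this platform")
```

Running this on both the build machine and the failing test machine would show whether the two actually differ in AVX512 support.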

martin-frbg · 2020-03-22T14:03:22Z

@wjc404 thanks for your suggestion, glad this is what I arrived at just a little later (now hope I did it correctly)

martin-frbg · 2020-03-22T14:47:22Z

Hmm. Wouldn't that fail only if the DYNAMIC_ARCH build was done on an AVX512 host without setting TARGET to something less powerful? (In which case the build would be hosed anyway, as the compiler could have generated AVX512 instructions anywhere in the common code.)

isuruf · 2020-03-22T14:53:33Z

I guess this is consistent with the following from the README.

The TARGET option can be used in conjunction with DYNAMIC_ARCH=1 to specify which cpu model should be assumed for all the common code in the library, usually you will want to set this to the oldest model you expect to encounter.

Maybe TARGET should be PRESCOTT or something like that if not specified in a DYNAMIC_ARCH x86 build?

martin-frbg · 2020-03-22T14:59:09Z

There probably will be "legitimate" cases where the default TARGET can be the build host. (And eventually there will be something newer than SKYLAKEX, so even defaulting to something non-AVX512 when building on SKX may not be a good idea in the future)

mattip · 2020-03-23T10:46:27Z

The theory that it is due to running on Azure machines when OpenBLAS is compiled on Travis CI was debunked in numpy/numpy#15809. Forcing the buffers to be misaligned does not seem to trigger an "illegal instruction". It may be some kind of race condition. Are there environment variables I can use to help debug this from a release build of OpenBLAS?

martin-frbg · 2020-03-23T10:56:27Z

Not sure how that was debunked, but the sgemm_kernel_direct function should only ever get called when the build was done on an AVX512-capable machine without TARGET being explicitly set to some older cpu.

martin-frbg · 2020-03-23T11:32:02Z

To actually answer your question (sorry), you can set OPENBLAS_CORETYPE at runtime to override automatic cpu detection (e.g. export OPENBLAS_CORETYPE=NEHALEM), and OPENBLAS_NUM_THREADS (or OMP_NUM_THREADS if using OpenMP) to limit the number of threads to less than the available (semi)cores in the system. As noted above I believe neither is likely to have any influence on this issue.
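
For a Python test run, these variables can be set for a child process so they are in place from the start. A minimal sketch, assuming the overrides need to be present in the environment before the library is loaded; `openblas_debug_env` and `run_with_overrides` are hypothetical helper names, and the child command is a placeholder for the real test invocation:

```python
import os
import subprocess
import sys


def openblas_debug_env(coretype: str, num_threads: int = 1) -> dict:
    """Copy the current environment and add the OpenBLAS overrides."""
    env = dict(os.environ)
    env["OPENBLAS_CORETYPE"] = coretype
    env["OPENBLAS_NUM_THREADS"] = str(num_threads)
    return env


def run_with_overrides(argv: list, coretype: str, num_threads: int = 1) -> int:
    """Run a command in a child process with the overrides set; return its exit code."""
    result = subprocess.run(argv, env=openblas_debug_env(coretype, num_threads))
    return result.returncode


if __name__ == "__main__":
    # Placeholder child process standing in for the real test command.
    code = run_with_overrides([sys.executable, "-c", "print('ok')"], "NEHALEM")
    print("exit code:", code)
```

Building a fresh environment dict rather than mutating os.environ keeps the parent process unaffected, so the same script can try several core types in turn.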

martin-frbg · 2020-03-23T11:48:45Z

Merged #1527 now

martin-frbg · 2020-03-30T18:18:15Z

Assumed fixed by #2533 (though it is probably still not a good idea to build DYNAMIC_ARCH on SKX without specifying a lower-capability TARGET for the common codes)

mattip · 2020-04-01T05:26:19Z

Maybe TARGET should be PRESCOTT or something like that if not specified in a DYNAMIC_ARCH x86 build

@isuruf this makes sense. Is PRESCOTT going to be the lowest current CPU NumPy is likely to encounter?

isuruf · 2020-04-01T05:34:55Z

Is PRESCOTT going to be the lowest current CPU NumPy is likely to encounter?

On x86_64, yes.

martin-frbg mentioned this issue Mar 22, 2020

reboots on haswell system #2524

Closed

martin-frbg mentioned this issue Mar 22, 2020

Avoid calling DIRECT codepath in DYNAMIC_ARCH on non-SKX #2527

Merged

martin-frbg mentioned this issue Mar 23, 2020

Add message highlighting minimum target choice at end of DYNAMIC_ARCH… #2530

Merged

mattip mentioned this issue Mar 25, 2020

pick up PR 2527 from OpenBLAS MacPython/openblas-libs#28

Closed

martin-frbg mentioned this issue Mar 26, 2020

Use runtime check for AVX512 capability in DYNAMIC_ARCH builds made on SKX #2533

Merged

mattip mentioned this issue Mar 30, 2020

TST: use draft OpenBLAS build numpy/numpy#15868

Merged

martin-frbg closed this as completed Mar 30, 2020

mattip mentioned this issue Apr 1, 2020

set a TARGET when using DYNAMIC_ARCH=1 MacPython/openblas-libs#30

Closed

saudet mentioned this issue Apr 29, 2020

OpenBLAS 0.3.8 issue on AMD Threadripper cpu deeplearning4j/deeplearning4j#8747

Closed

mattip mentioned this issue Jul 3, 2020

PyPy + NumPy + SVX causes a segfault #2705

Closed
