PyPy + NumPy + SVX causes a segfault #2705

Closed
mattip opened this issue Jul 3, 2020 · 16 comments
Comments

@mattip
Contributor

mattip commented Jul 3, 2020

Not sure what is going on; maybe someone here has an idea. I don't have access to x64 Skylake (SKX) hardware to try this out. Over at NumPy, using OpenBLAS 0.3.10, we are seeing intermittent segfaults when testing with PyPy 3.6-v7.3.1. xref numpy/numpy#16737. Any thoughts? We recently improved the CPU detection done in NumPy in preparation for refactoring our SIMD code, which is why we now know the CPU architecture that segfaults.

@martin-frbg
Collaborator

That matmul is basically DGEMM, right? If so, the change in #2646 could play a role, though I wonder why our tests would not catch this. If this is actually OpenBLAS' fault, the build-time test/dblat3 should have segfaulted as well, as it runs dgemm for various matrix sizes including 63x63.

@mattip
Contributor Author

mattip commented Jul 3, 2020

The strange thing is that it only seems to happen on PyPy (which has a JIT). I wonder if somehow some registers or floating-point flags are getting messed up.

@martin-frbg
Collaborator

Can't really help you with PyPy or its JIT, but did you start seeing this only with 0.3.10 (which would suggest a connection with recent changes in OpenBLAS)?

@mattip
Contributor Author

mattip commented Jul 3, 2020

We had segfaults with PyPy, reported in #2526, that disappeared after #1527 #2527. Then after merging 0.3.10 they started up again.

Edit: wrong PR.

@mattip
Contributor Author

mattip commented Jul 3, 2020

BTW, #2526 mentioned NumPy should be setting TARGET, which is not done.

@martin-frbg
Collaborator

Without TARGET, OpenBLAS would get built for whatever the build host is - which can mess up DYNAMIC_ARCH builds by adding AVX512 instructions to the common code. 0.3.9 (when compiled on SKX) caused an additional problem by missing a runtime check for AVX512 capability in the SGEMM kernel (which should be fixed by #2527 in 0.3.10 - I do not think a later change could have overwritten this fix, but I will check this)

@mattip
Contributor Author

mattip commented Jul 3, 2020

Since the machine that segfaults reports it is SKX capable, I don’t think TARGET is the problem.

@martin-frbg
Collaborator

martin-frbg commented Jul 3, 2020

So what exactly do I need to (install and) do to reproduce this?

@mattip
Contributor Author

mattip commented Jul 4, 2020

After you git clone numpy, this is the script run by CI. You may want to comment out the sudo lines near the top. The script downloads OpenBLAS and puts it into a tmpdir, then adjusts the build to use it, then builds and tests numpy via ./runtests.py. After it has run once, you can rerun a test with

pypy3/bin/pypy3 runtests.py -- path/to/file -k testname

which will hand off arguments after the double dash to pytest.
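
To illustrate the double-dash convention, here is a minimal sketch of how a wrapper script can split its own arguments from the ones it forwards. The helper name `split_at_double_dash` is hypothetical and this is not taken from runtests.py itself; it only shows the general pattern.

```python
def split_at_double_dash(argv):
    """Split argv into (own_args, passthrough_args) at the first bare '--'.

    Hypothetical helper for illustration; not the actual runtests.py code.
    """
    if "--" in argv:
        i = argv.index("--")
        return list(argv[:i]), list(argv[i + 1:])
    # No separator: everything belongs to the wrapper, nothing is forwarded.
    return list(argv), []


own, passthrough = split_at_double_dash(
    ["runtests.py", "--", "path/to/file", "-k", "testname"]
)
print(own)          # ['runtests.py']
print(passthrough)  # ['path/to/file', '-k', 'testname']
```

Everything in `passthrough` would be what the test runner (pytest here) receives.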

@wjc404
Contributor

wjc404 commented Jul 4, 2020

I've just cloned numpy and followed the steps shown by Prof. Matti Picus to build it with OpenBLAS-0.3.10 in a temporary directory.

wang@wang-Z390-M-GAMING:~$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Core(TM) i7-9800X CPU @ 3.80GHz
Stepping: 4
CPU MHz: 1200.240
CPU max MHz: 3801.0000
CPU min MHz: 1200.0000
BogoMIPS: 7599.80
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 16896K
NUMA node0 CPU(s): 0-15
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts md_clear flush_l1d

wang@wang-Z390-M-GAMING:~/numpy$ pypy3/bin/pypy3 tools/openblas_support.py --check_version
OpenBLAS get_config returned b'OpenBLAS 0.3.10 DYNAMIC_ARCH NO_AFFINITY SkylakeX MAX_THREADS=64'
b'OpenBLAS 0.3.10'
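
When comparing machines, it can help to break that config string into its parts. The following is a rough sketch that assumes the layout seen in the one string above (a leading "OpenBLAS <version>", then space-separated flags and KEY=VALUE options); `parse_openblas_config` is a hypothetical helper, not part of NumPy or OpenBLAS.

```python
def parse_openblas_config(raw):
    """Split an OpenBLAS config string into version, flags and options.

    Assumption: tokens are space-separated, as in the single example
    string shown in this thread. Other builds may format it differently.
    """
    text = raw.decode("ascii") if isinstance(raw, bytes) else raw
    tokens = text.split()
    info = {"version": None, "flags": [], "options": {}}
    if len(tokens) >= 2 and tokens[0] == "OpenBLAS":
        info["version"] = tokens[1]
        tokens = tokens[2:]
    for tok in tokens:
        if "=" in tok:
            key, value = tok.split("=", 1)
            info["options"][key] = value
        else:
            info["flags"].append(tok)
    return info


cfg = parse_openblas_config(
    b"OpenBLAS 0.3.10 DYNAMIC_ARCH NO_AFFINITY SkylakeX MAX_THREADS=64"
)
print(cfg["version"])  # 0.3.10
print(cfg["flags"])    # ['DYNAMIC_ARCH', 'NO_AFFINITY', 'SkylakeX']
print(cfg["options"])  # {'MAX_THREADS': '64'}
```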

Then I ran the code in numpy/numpy#16737 without observing any hang or crash.

wang@wang-Z390-M-GAMING:~$ numpy/pypy3/bin/pypy3
Python 3.6.9 (2ad108f17bdb, Apr 07 2020, 02:59:05)
[PyPy 7.3.1 with GCC 7.3.1 20180303 (Red Hat 7.3.1-5)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>> import numpy as np
>>>> shape = (50, 50)
>>>> a = np.random.randn(*shape)
>>>> a = np.matmul(a.T, a)
>>>> exit()

wang@wang-Z390-M-GAMING:~/numpy$ pypy3/bin/pypy3 runtests.py
Building, see build.log...
Build OK
NumPy version 1.20.0.dev0+9298eeb
NumPy relaxed strides checking option: True
NumPy CPU features: SSE SSE2 SSE3 SSSE3* SSE41* POPCNT* SSE42* AVX* F16C* FMA3* AVX2* AVX512F* AVX512CD* AVX512_KNL? AVX512_SKX* AVX512_CNL?
"A list of '.', 'x' and 's', not shown here"
[100%]
11449 passed, 160 skipped, 111 deselected, 77 xfailed, 11 xpassed in 273.63s (0:04:33)

wang@wang-Z390-M-GAMING:~/numpy$ pypy3/bin/pypy3 numpy/linalg/tests/test_linalg.py
wang@wang-Z390-M-GAMING:~/numpy$

@martin-frbg
Collaborator

No hang or other "unexpected" failure observed on my system either (Xeon-W 2123, gcc 9.3, python 3.6.9, script summary "11460 passed, 149 skipped, 111 deselected, 77 xfailed, 11 xpassed")

@mattip
Contributor Author

mattip commented Jul 4, 2020

@wjc404 (I am not a professor, but thanks), @martin-frbg thanks for checking. The failing machine is reporting

NumPy version 1.20.0.dev0+4fd295c
NumPy relaxed strides checking option: True
NumPy CPU features: SSE SSE2 SSE3 SSSE3* SSE41* POPCNT* SSE42* AVX* F16C* FMA3* AVX2* 
AVX512F* \
AVX512CD* AVX512_KNL? AVX512_KNM? AVX512_SKX* AVX512_CNL?

Compared to @wjc404's machine there is an additional AVX512_KNM?. The question mark means "Dispatched features that are not supported by the running machine end with ?", whereas the asterisk means "Supported dispatched features by the running machine end with *". I am not sure what to make of that, but it seems the CI machine that segfaults is subtly different from these machines?
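
To make this kind of comparison less error-prone, the two feature lines can be diffed mechanically. This is a small sketch under the marker rules quoted above ("*" for supported dispatched features, "?" for unsupported ones); the function name `classify_cpu_features` is hypothetical, and names without a marker are simply collected as "unmarked".

```python
def classify_cpu_features(line):
    """Group feature names from a 'NumPy CPU features:' line by marker."""
    prefix = "NumPy CPU features:"
    if line.startswith(prefix):
        line = line[len(prefix):]
    unmarked, supported, unsupported = [], [], []
    # The CI log wraps the line with a trailing backslash; treat it as space.
    for tok in line.replace("\\", " ").split():
        if tok.endswith("*"):
            supported.append(tok[:-1])
        elif tok.endswith("?"):
            unsupported.append(tok[:-1])
        else:
            unmarked.append(tok)
    return unmarked, supported, unsupported


passing = ("NumPy CPU features: SSE SSE2 SSE3 SSSE3* SSE41* POPCNT* SSE42* "
           "AVX* F16C* FMA3* AVX2* AVX512F* AVX512CD* AVX512_KNL? "
           "AVX512_SKX* AVX512_CNL?")
failing = ("NumPy CPU features: SSE SSE2 SSE3 SSSE3* SSE41* POPCNT* SSE42* "
           "AVX* F16C* FMA3* AVX2* AVX512F* \\ AVX512CD* AVX512_KNL? "
           "AVX512_KNM? AVX512_SKX* AVX512_CNL?")

_, ok_pass, no_pass = classify_cpu_features(passing)
_, ok_fail, no_fail = classify_cpu_features(failing)
print(set(ok_fail) ^ set(ok_pass))  # set(): same supported features
print(set(no_fail) ^ set(no_pass))  # {'AVX512_KNM'}
```

For the two lines in this thread, the supported sets are identical and the only difference is in the unsupported ("?") group.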

@martin-frbg
Collaborator

I'd read that as "the software knows how to pass along the specific AVX512 feature sets of Knights Landing, Knights Mill and Cannon Lake, but this particular hardware does not support them"? For the record, mine reports

NumPy version 1.20.0.dev0+9298eeb
NumPy relaxed strides checking option: True
NumPy CPU features: SSE SSE2 SSE3 SSSE3* SSE41* POPCNT* SSE42* AVX* F16C* FMA3* AVX2*
 AVX512F* \
 AVX512CD* AVX512_KNL? AVX512_KNM? AVX512_SKX* AVX512_CLX? AVX512_CNL? AVX512_ICL?

so the ones with a question mark are at least not relevant for proper functioning (I think AVX512CD is currently the bare minimum the OpenBLAS SKX kernels require, and an unsupported AVX512 instruction would probably cause SIGILL rather than SIGSEGV)
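
Since the SIGILL versus SIGSEGV distinction matters here, one way to confirm which signal actually killed a test process is to run it as a child and decode the return code. This is a sketch for POSIX systems only, where `subprocess` reports death by signal N as return code -N; the helper names are hypothetical and the crashing child below is simulated rather than a real OpenBLAS failure.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Turn a subprocess return code into a signal name or exit code."""
    if returncode < 0:
        # On POSIX, a negative value means "killed by signal -returncode".
        return signal.Signals(-returncode).name
    return "exit code %d" % returncode


def run_and_describe(code):
    """Run a Python snippet in a child interpreter and describe how it ended."""
    proc = subprocess.run([sys.executable, "-c", code])
    return describe_exit(proc.returncode)


if __name__ == "__main__":
    # Simulated crash: the child sends SIGSEGV to itself.
    crash = "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"
    print(run_and_describe(crash))
    print(run_and_describe("pass"))
```

Replacing the simulated snippet with the failing test invocation would show whether the CI failure is an illegal instruction or a genuine memory fault.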

@brada4
Contributor

brada4 commented Jul 5, 2020

In a linked thread I find that a * a' was failing while a * a is not.

@brada4
Contributor

brada4 commented Jul 5, 2020

Does the sample fail only in a long test batch, i.e. after some corruption has accumulated previously, or does a simple standalone script that just sets an array and multiplies it also fail?
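
A standalone check along these lines could help separate the two cases. It is only a sketch: the 50x50 shape comes from the snippet tried earlier in this thread, while the iteration count, the seed and the symmetry check are assumptions added for illustration. It would need to be run with the same PyPy and OpenBLAS build as the CI job.

```python
import numpy as np


def stress_matmul(size=50, iterations=1000, seed=0):
    """Repeatedly compute a.T @ a and sanity-check the result.

    Returns True if every product looks sane; a segfault would of course
    kill the process before anything is returned.
    """
    rng = np.random.RandomState(seed)
    for _ in range(iterations):
        a = rng.randn(size, size)
        product = np.matmul(a.T, a)
        # a.T @ a is symmetric by construction, so a badly corrupted
        # result should show up as an asymmetric or non-finite matrix.
        if not np.all(np.isfinite(product)):
            return False
        if not np.allclose(product, product.T):
            return False
    return True


if __name__ == "__main__":
    print(stress_matmul())
```

If this passes on the affected machine while the full test run still crashes, that would point toward state accumulated earlier in the batch rather than the multiplication itself.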

@mattip
Contributor Author

mattip commented Jul 9, 2020

Closing as "can't reproduce". Thanks all for digging into this, sorry for the time invested with no conclusion.

@mattip mattip closed this as completed Jul 9, 2020