PyPy + NumPy + SVX causes a segfault #2705
Comments
That matmul is basically DGEMM, right? If so, the change in #2646 could play a role, though I wonder why our tests would not catch this. If this is actually OpenBLAS' fault, the build-time test/dblat3 should have segfaulted as well, as it runs dgemm for various matrix sizes including 63x63. |
The strange thing about this is it only seems to happen on PyPy (which has a JIT): I wonder if somehow some registers or floating point flags are getting messed up |
Can't really help you with PyPy or its JIT, but did you start seeing this only with 0.3.10 (which would suggest a connection with recent changes in OpenBLAS) ? |
BTW, #2526 mentioned NumPy should be setting |
Without TARGET, OpenBLAS would get built for whatever the build host is - which can mess up DYNAMIC_ARCH builds by adding AVX512 instructions to the common code. 0.3.9 (when compiled on SKX) caused an additional problem by missing a runtime check for AVX512 capability in the SGEMM kernel (which should be fixed by #2527 in 0.3.10 - I do not think a later change could have overwritten this fix, but I will check this) |
Since the machine that segfaults reports it is SVX capable, I don’t think TARGET is the problem |
So what exactly do I need to (install and) do to reproduce this ? |
After you git clone numpy, this is the script run by CI. You may want to comment out the
which will hand off arguments after the double dash to pytest |
I've just cloned numpy and followed the steps shown by Prof. Matti Picus to build it with OpenBLAS-0.3.10 in a temporary directory.
Then I ran the code in numpy/numpy#16737 and did not observe any hang or crash.
|
No hang or other "unexpected" failure observed on my system either (Xeon-W 2123, gcc 9.3, python 3.6.9, script summary "11460 passed, 149 skipped, 111 deselected, 77 xfailed, 11 xpassed") |
@wjc404 (I am not a professor, but thanks), @martin-frbg thanks for checking. The failing machine is reporting
Compared to @wjc404's machine there is an additional |
I'd read that as "the software knows how to pass along the specific AVX512 feature sets of Knights Landing, Knights Mill and Cannon Lake, but this particular hardware does not support them" ? For the record, mine reports
so the ones with a question mark are at least not relevant for proper functioning (I think AVX512CD is currently the bare minimum the OpenBLAS SKX kernels require, and an unsupported AVX512 instruction would probably cause SIGILL rather than SIGSEGV) |
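For anyone comparing machines as above, here is a minimal sketch that lists the AVX512-related flags the Linux kernel reports. It assumes a Linux host with /proc/cpuinfo; the helper names `avx512_flags` and `read_cpu_flags` and the sample flags line are made up for illustration, and the kernel's flag names are not the same as the feature names printed by NumPy's own CPU detection.

```python
def avx512_flags(flags_line):
    """Return the sorted AVX512-related entries from a cpuinfo 'flags' line."""
    head, sep, tail = flags_line.partition(":")
    values = tail if sep else head  # accept a bare list of flags too
    return sorted(f for f in values.split() if f.startswith("avx512"))


def read_cpu_flags(path="/proc/cpuinfo"):
    """Return the first 'flags' line of /proc/cpuinfo, or '' if unavailable."""
    try:
        with open(path) as fh:
            for line in fh:
                if line.startswith("flags"):
                    return line
    except OSError:
        pass  # not Linux, or the file is not readable
    return ""


if __name__ == "__main__":
    # Hypothetical sample line, used only when the real file is unavailable.
    sample = "flags\t\t: fpu sse2 avx2 avx512f avx512dq avx512cd avx512bw avx512vl"
    print(avx512_flags(read_cpu_flags() or sample))
```

Comparing the output of the failing and the working machine shows which AVX512 subsets differ, without relying on either library's detection code.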
In a linked thread I find that |
Does the sample fail only as part of the long test batch (i.e. from some corruption accumulated earlier), or does a simple standalone script that just sets up an array and multiplies it also fail? |
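A standalone check along those lines could look like the sketch below. It is not the failing test from numpy/numpy#16737; it assumes NumPy is installed and linked against the OpenBLAS build under test, and the function name `stress_matmul`, the 63x63 size (taken from the dblat3 remark earlier in the thread) and the repeat count are illustrative choices.

```python
import numpy as np


def stress_matmul(n=63, repeats=1000, seed=0):
    """Repeat the same float64 matmul and count results that drift from the first."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((n, n))
    b = rng.standard_normal((n, n))
    reference = a @ b  # float64 matmul, dispatched to the BLAS dgemm when one is linked
    mismatches = 0
    for _ in range(repeats):
        if not np.allclose(a @ b, reference):
            mismatches += 1
    return mismatches


if __name__ == "__main__":
    print("mismatches:", stress_matmul())
```

If this runs cleanly under PyPy on the affected machine while the full test suite still segfaults, that would point toward state accumulated earlier in the batch rather than the multiplication itself.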
Closing as "can't reproduce". Thanks all for digging into this, sorry for the time invested with no conclusion. |
Not sure what is going on, maybe someone here has an idea. I don't have access to x64 Skylake SVX hardware to try this out. Over at NumPy, using OpenBLAS 0.3.10, we are seeing intermittent segfaults when testing with PyPy 3.6-v7.3.1. xref numpy/numpy#16737. Any thoughts? We recently improved the CPU detection done in NumPy in preparation for refactoring our SIMD code, which is why we now know the CPU architecture that segfaults.