LAPACK tests are failing with OpenBLAS-0.3.20 and GCC-11.3.0 #16380
Comments
@akesandgren Any thoughts on this? |
@maxim-masterov Did you check whether building |
W.r.t. things looking worse with GCC 12, that's probably because the auto-vectorizer is enabled by default there, see also https://www.phoronix.com/news/GCC-12-Auto-Vec-O2 (hat tip @zao) |
Maybe related to https://github.com/xianyi/OpenBLAS/pull/3745/files resp. the crashes on OSX that disabling tree-vectorize fixes |
Seems the added |
Update: errors look significant (1e6 and above), but the majority already arise from compiling just the netlib-derived LAPACK part with |
One quick note on LAPACK testing. It is imperative to compile things under TESTING and MATGEN with -O0. |
Also things like Reference-LAPACK/lapack#679, where tests expect the exact same result as from their own non-optimized BLAS... |
Not with |
OpenBLAS 0.3.21 (vs 0.3.20) updated the copy of Reference-LAPACK to 3.10.1 plus fixes, which may have fixed your use case of GGEV through changes therein. I am not immediately aware of any other change that would have affected thread safety or resilience against more aggressive optimisation, but unfortunately I am not equipped to test either VASP or EPYC (although I am a computational chemist by training - now self-employed in an unrelated sector) |
Some more results. To avoid questions like "why we didn't use the internal LAPACK tests available with OpenBLAS releases and used LAPACK tests taken directly from netlib": I downloaded OpenBLAS-0.3.20 and compiled it with GCC/11.3.0 on the AMD EPYC ROME (zen2) machine. Then I built netlib's LAPACK tests available from the OpenBLAS release. To change optimization flags I modified the Every OpenBLAS version and LAPACK test were built from scratch in a new folder after untarring the OpenBLAS tarball (so, no
Steps to reproduce:
$ module purge
$ module load 2022 GCC/11.3.0
$ wget https://github.com/xianyi/OpenBLAS/archive/refs/tags/v0.3.20.tar.gz
$ tar -xf v0.3.20.tar.gz
$ cd OpenBLAS-0.3.20
$ vim Makefile.rule
# modify optimization flags
$ make -j 32
...
$ make PREFIX=${PWD}/install install
...
$ cd lapack-netlib/TESTING
$ make -j 32
$ cd .. && python3 ./lapack_testing.py -t eig -p x
Results
The lists of flags below are taken from the log, therefore there are some repetitions, e.g. two
1
Flags:
Output:
--> LAPACK TESTING SUMMARY <--
Processing LAPACK Testing output found in the TESTING directory
SUMMARY nb test run numerical error other error
================ =========== ================= ================
REAL 889893 0 (0.000%) 0 (0.000%)
DOUBLE PRECISION 889893 0 (0.000%) 0 (0.000%)
COMPLEX 336649 0 (0.000%) 0 (0.000%)
COMPLEX16 336649 0 (0.000%) 0 (0.000%)
--> ALL PRECISIONS 2453084 0 (0.000%) 0 (0.000%)
2
Flags:
Output:
--> LAPACK TESTING SUMMARY <--
Processing LAPACK Testing output found in the TESTING directory
SUMMARY nb test run numerical error other error
================ =========== ================= ================
REAL 889893 0 (0.000%) 0 (0.000%)
DOUBLE PRECISION 889893 0 (0.000%) 0 (0.000%)
COMPLEX 336649 0 (0.000%) 0 (0.000%)
COMPLEX16 336649 0 (0.000%) 0 (0.000%)
--> ALL PRECISIONS 2453084 0 (0.000%) 0 (0.000%)
3
Flags:
Output:
--> LAPACK TESTING SUMMARY <--
Processing LAPACK Testing output found in the TESTING directory
SUMMARY nb test run numerical error other error
================ =========== ================= ================
REAL 886365 3 (0.000%) 0 (0.000%)
DOUBLE PRECISION 889893 0 (0.000%) 0 (0.000%)
COMPLEX 336649 0 (0.000%) 0 (0.000%)
COMPLEX16 336649 0 (0.000%) 0 (0.000%)
--> ALL PRECISIONS 2449556 3 (0.000%) 0 (0.000%)
4
Flags:
Output:
--> LAPACK TESTING SUMMARY <--
Processing LAPACK Testing output found in the TESTING directory
SUMMARY nb test run numerical error other error
================ =========== ================= ================
REAL 886365 3 (0.000%) 0 (0.000%)
DOUBLE PRECISION 889893 0 (0.000%) 0 (0.000%)
COMPLEX 336649 0 (0.000%) 0 (0.000%)
COMPLEX16 336649 0 (0.000%) 0 (0.000%)
--> ALL PRECISIONS 2449556 3 (0.000%) 0 (0.000%)
5
Flags:
Output:
--> LAPACK TESTING SUMMARY <--
Processing LAPACK Testing output found in the TESTING directory
SUMMARY nb test run numerical error other error
================ =========== ================= ================
REAL 848925 1187 (0.140%) 0 (0.000%)
DOUBLE PRECISION 875853 1153 (0.132%) 0 (0.000%)
COMPLEX 322609 985 (0.305%) 0 (0.000%)
COMPLEX16 336649 0 (0.000%) 0 (0.000%)
--> ALL PRECISIONS 2384036 3325 (0.139%) 0 (0.000%)
6
Flags:
Output:
--> LAPACK TESTING SUMMARY <--
Processing LAPACK Testing output found in the TESTING directory
SUMMARY nb test run numerical error other error
================ =========== ================= ================
REAL 889893 0 (0.000%) 0 (0.000%)
DOUBLE PRECISION 889893 0 (0.000%) 0 (0.000%)
COMPLEX 328369 38 (0.012%) 0 (0.000%)
COMPLEX16 328369 94 (0.029%) 0 (0.000%)
--> ALL PRECISIONS 2436524 132 (0.005%) 0 (0.000%)
To me, these results show that the more aggressive the implicit vectorisation we use, the more LAPACK tests fail with OpenBLAS-0.3.20 and GCC-11.3.0. Also, I think that test #1 should indicate that there are no problems with the compiler, since it uses |
@maxim-masterov Can you check if you're seeing the same problems for I plan to open a PR for the relevant OpenBLAS easyconfigs to disable the use of Longer term, we should enhance the OpenBLAS easyblock to more carefully check the result of the tests being run (and perhaps also expand the set of tests being run). |
This is an issue with the reference LAPACK, nothing to do with OpenBLAS in principle, since you can get the same errors with reference LAPACK combined with BLIS. The |
Crude fix could be to change line 281 in the toplevel OpenBLAS Makefile
to add |
@boegel here are some results from OpenBLAS-0.3.20 built with GCC/11.2.0. I used the same procedure as before, see #16380 (comment).
The test command:
1
Flags:
Output:
--> LAPACK TESTING SUMMARY <--
Processing LAPACK Testing output found in the TESTING directory
SUMMARY nb test run numerical error other error
================ =========== ================= ================
REAL 889893 0 (0.000%) 0 (0.000%)
DOUBLE PRECISION 889893 0 (0.000%) 0 (0.000%)
COMPLEX 336649 0 (0.000%) 0 (0.000%)
COMPLEX16 336649 0 (0.000%) 0 (0.000%)
--> ALL PRECISIONS 2453084 0 (0.000%) 0 (0.000%)
2
Flags:
Output:
--> LAPACK TESTING SUMMARY <--
Processing LAPACK Testing output found in the TESTING directory
SUMMARY nb test run numerical error other error
================ =========== ================= ================
REAL 889893 0 (0.000%) 0 (0.000%)
DOUBLE PRECISION 889893 0 (0.000%) 0 (0.000%)
COMPLEX 336649 0 (0.000%) 0 (0.000%)
COMPLEX16 336649 0 (0.000%) 0 (0.000%)
--> ALL PRECISIONS 2453084 0 (0.000%) 0 (0.000%)
3
Flags:
Output:
--> LAPACK TESTING SUMMARY <--
Processing LAPACK Testing output found in the TESTING directory
SUMMARY nb test run numerical error other error
================ =========== ================= ================
REAL 886365 3 (0.000%) 0 (0.000%)
DOUBLE PRECISION 889893 0 (0.000%) 0 (0.000%)
COMPLEX 336649 0 (0.000%) 0 (0.000%)
COMPLEX16 336649 0 (0.000%) 0 (0.000%)
--> ALL PRECISIONS 2449556 3 (0.000%) 0 (0.000%)
4
Flags:
Output:
--> LAPACK TESTING SUMMARY <--
Processing LAPACK Testing output found in the TESTING directory
SUMMARY nb test run numerical error other error
================ =========== ================= ================
REAL 886365 3 (0.000%) 0 (0.000%)
DOUBLE PRECISION 889893 0 (0.000%) 0 (0.000%)
COMPLEX 336649 0 (0.000%) 0 (0.000%)
COMPLEX16 336649 0 (0.000%) 0 (0.000%)
--> ALL PRECISIONS 2449556 3 (0.000%) 0 (0.000%)
5
Flags:
Output:
--> LAPACK TESTING SUMMARY <--
Processing LAPACK Testing output found in the TESTING directory
SUMMARY nb test run numerical error other error
================ =========== ================= ================
REAL 848925 1187 (0.140%) 0 (0.000%)
DOUBLE PRECISION 875853 1153 (0.132%) 0 (0.000%)
COMPLEX 322609 985 (0.305%) 0 (0.000%)
COMPLEX16 336649 0 (0.000%) 0 (0.000%)
--> ALL PRECISIONS 2384036 3325 (0.139%) 0 (0.000%)
6
Flags:
Output:
--> LAPACK TESTING SUMMARY <--
Processing LAPACK Testing output found in the TESTING directory
SUMMARY nb test run numerical error other error
================ =========== ================= ================
REAL 889893 0 (0.000%) 0 (0.000%)
DOUBLE PRECISION 889893 0 (0.000%) 0 (0.000%)
COMPLEX 328369 38 (0.012%) 0 (0.000%)
COMPLEX16 328369 94 (0.029%) 0 (0.000%)
--> ALL PRECISIONS 2436524 132 (0.005%) 0 (0.000%)
7
Built OpenBLAS using
Output:
--> LAPACK TESTING SUMMARY <--
Processing LAPACK Testing output found in the TESTING directory
SUMMARY nb test run numerical error other error
================ =========== ================= ================
REAL 850647 1187 (0.140%) 4 (0.000%)
DOUBLE PRECISION 877585 1153 (0.131%) 4 (0.000%)
COMPLEX 323912 985 (0.304%) 8 (0.002%)
COMPLEX16 336859 4 (0.001%) 8 (0.002%)
--> ALL PRECISIONS 2389003 3329 (0.139%) 24 (0.001%) |
I found miscompilation in |
This is for GCC 11.3, 9.3 doesn't fail. will check a few more compiler versions... |
Submitted |
@bartoldeman So long story short, we should avoid using |
I've opened a PR for the OpenBLAS easyblock that adds support for opting in to running the LAPACK test suite, and for catching too many failing tests, see easybuilders/easybuild-easyblocks#2801. We should also update the most recent OpenBLAS easyconfigs to i) disable the use of |
…max. number of failing tests due to numerical errors to 150 (cfr. easybuilders#16380) requires: * easybuilders/easybuild-easyblocks#2801
@boegel a conservative and easy fix is to disable A more targeted fix is to only compile the LAPACK (Fortran) parts of OpenBLAS and FlexiBLAS with The GCC bug is making progress though: it's already fixed on trunk, and I'll check whether that patch can be trivially backported. In the GCC bug it's also mentioned that |
I tested myself with reference LAPACK 3.10.1 + BLIS, with in LAPACK's make.inc:
and backported the GCC patch (https://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;h=9ed4a849afb5b18b462bea311e7eee454c2c9f68), which only needs .cc changed to .c in the filenames. The number of failures is a lot lower, though not quite at zero (they could come from BLIS as well, still to check). Before
After
With OpenBLAS-0.3.21, similar procedure as above, patched compiler:
so only failures left in complex. Certainly a LOT better but I'm still going to check if those complex failures are worrying. |
A patch for GCC 11.3.0 is here: In the last testing output above, almost all the complex tests use CGEEV and related functions with and without computation of eigenvectors (eigenvalues are computed in both cases) and compare the eigenvalues; in the longer explanation you can see that as "result 5" or "test(5)" failing. If they're not numerically exactly the same, the test fails, even if those eigenvalues are extremely close. It'll take some time to sort those out, but this shouldn't have real-world significance. I believe a test such as |
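To illustrate why an exact-equality check on floating-point results is fragile, here is a minimal sketch. It is not taken from the LAPACK test suite: the function name `sum_order_demo`, the three summed values, and the `1e-12` tolerance are all chosen for illustration only. It adds the same three numbers in two different orders, which is the kind of reordering a vectorizer or a different code path may introduce, and compares the results exactly and with a tolerance.

```shell
# Illustrative only: same terms, two summation orders, compared two ways.
# The values and the 1e-12 tolerance are assumptions made for this sketch.
sum_order_demo() {
    awk 'BEGIN {
        a = (0.1 + 0.2) + 0.3      # one summation order
        b = 0.1 + (0.2 + 0.3)      # same terms, different order
        d = a - b
        if (d < 0) d = -d          # absolute difference
        printf "exact_equal=%d within_tol=%d\n", (a == b), (d <= 1e-12)
    }'
}

sum_order_demo
```

With IEEE double arithmetic the two sums differ in the last bit, so the exact comparison reports a mismatch while the tolerance-based one passes.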
Upstream issue: |
Thanks to the changes in #16406, we are now running the LAPACK tests for recent OpenBLAS easyconfigs, and too many failing LAPACK tests (> 150) will lead to an installation error. Note that the enhanced OpenBLAS easyblock from easybuilders/easybuild-easyblocks#2801 (which adds support for running the LAPACK tests and checking on the results) is required, and that the patch for GCC 11.x + 12.x that was added in #16411 is also required to ensure a low number of failing LAPACK tests due to numerical errors, so both |
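The idea of failing a build when too many LAPACK tests report numerical errors can be sketched as below. This is not the easyblock's actual implementation: the function names `count_numerical_errors` and `check_lapack_summary` are hypothetical, and the parsing assumes the summary rows look exactly like the ones pasted in this issue (precision name, number of tests run, numerical error count and percentage, other error count and percentage).

```shell
# Hypothetical helper: sum the "numerical error" column over the
# per-precision rows of a LAPACK testing summary (file args or stdin).
count_numerical_errors() {
    awk '
        /^(REAL|DOUBLE PRECISION|COMPLEX|COMPLEX16)[ \t]/ {
            # Counting from the right: other-error pct, other-error count,
            # numerical-error pct, numerical-error count.
            total += $(NF - 3)
        }
        END { print total + 0 }
    ' "$@"
}

# Hypothetical helper: return non-zero if the summed numerical errors
# exceed the limit given as first argument.
check_lapack_summary() {
    max_failures="$1"
    shift
    failures=$(count_numerical_errors "$@")
    if [ "$failures" -gt "$max_failures" ]; then
        echo "FAIL: $failures numerical errors (limit $max_failures)"
        return 1
    fi
    echo "OK: $failures numerical errors (limit $max_failures)"
}
```

Assuming the summary is written to standard output, something like `python3 ./lapack_testing.py -t eig -p x | check_lapack_summary 150` would apply the limit of 150 mentioned above. Applied to the summaries earlier in this thread, run 5 (3325 numerical errors) would be rejected and run 6 (132) accepted.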
Creating this issue to properly log all the progress.
How it started
It was observed that the VASP6 installation with foss/2022a led to inaccurate results. After some digging the culprit was found: the DGGEV subroutine from LAPACK. To simplify debugging of the problem we isolated the LAPACK tests from the official netlib distribution (3.10.1) and started to run them using different combinations of compiler flags and OpenBLAS versions.
What we have
The following tests were performed on AMD EPYC ROME (zen2 architecture):
OpenBLAS/0.3.15-GCC-10.3.0 (taken from foss/2021a) results in ~130 failed tests:
OpenBLAS/0.3.20-GCC-11.3.0 (taken from foss/2022a) results in ~4.2k failed tests
OpenBLAS/0.3.15-GCC-11.3.0 (new build) results in ~4.2k failed tests
@zao got similar results on Ryzen 9 3900X (zen2 desktop) when building the LAPACK tests with full foss/2022a and buildenv that picked up FlexiBLAS as the USE_OPTIMIZED_BLAS implementation:
The main question: are the failing tests caused by FlexiBLAS or by the optimization flags?
Update 1
From @zao :
Stripping -ftree-vectorize from the build flags that buildenv sets (leaving -O2 -march=native) makes it behave, so it's probably the better vectorizer in GCC11 lifting up some latent problem in OpenBLAS. It wouldn't be the first time...
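Stripping a single flag from a flag string, as described above, could be done with a small helper like this one. It is a sketch only: `strip_flag` is a hypothetical name, and the flag string in the example is the one quoted in this update rather than the full set that buildenv exports.

```shell
# Hypothetical helper: print a space-separated flag string with every
# occurrence of one flag removed.
strip_flag() {
    flag="$1"
    shift
    out=""
    for word in $*; do
        if [ "$word" != "$flag" ]; then
            out="${out:+$out }$word"   # append, with a separating space
        fi
    done
    printf '%s\n' "$out"
}

# Example with the flags quoted in this update.
strip_flag -ftree-vectorize "-O2 -march=native -ftree-vectorize"
```

Assuming the build flags live in environment variables such as `CFLAGS` and `FFLAGS`, each could be rewritten in place, e.g. `CFLAGS="$(strip_flag -ftree-vectorize "$CFLAGS")"`.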
Update 2
From @zao
I've set up a fresh environment on a Haswell machine, got the same grade of broken outcome as on our zen2 so not µarch-dependent. Steps:
Update 3
From @zao
Update 4
I got the following number of numerical errors using lapack_testing.py -p x -t eig from the LAPACK distribution:
Build with GCC-11.3:
-O2 -march=znver2 -funroll-all-loops -fno-math-errno -ftree-vectorize: 4090
-O2 -march=znver2 -funroll-all-loops -fno-math-errno: 136
-O2 -march=znver2 -fno-math-errno: 136
-O2 -fno-math-errno: 7
Build with GCC-10.3:
-O2 -march=znver2 -funroll-all-loops -fno-math-errno -ftree-vectorize: 136
Every OpenBLAS version was built manually using the GCC/11.3.0 or GCC/10.3.0 module (no FlexiBLAS involved).
Way to reproduce: