Segmentation fault using symmetric eigenvalue problems for large matrices #4868
Somewhat strange, as there is no obvious (to me) change in 0.3.28 that could have caused this, and no such failures in the LAPACK testsuite even if I increase matrix sizes significantly beyond the defaults. That is at least when compiled with GCC; maybe LLVM 18 plays a role here, although the CPU seems to be fairly standard AVX2-only.
That said, the backtrace (from Julia 1.12.0-DEV.1043, aka 96fd25a764c) has dgemm_itcopy called from dsyr2k_UN (with no changes to either DGEMM or DSYR2K code in 0.3.28, but maybe Julia used a version earlier than 0.3.27 previously?)
For what it's worth, I believe the segmentation fault may depend on the version of GCC used. I can see it when we compile OpenBLAS with GCC 11 but not with GCC 8. I'm trying other combinations now.
Julia nightly was using OpenBLAS v0.3.27 until 3 days ago. Upcoming Julia v1.11 will use OpenBLAS v0.3.27 (feature freeze was a few months ago); Julia v1.10 (current stable version) uses OpenBLAS v0.3.23.
It seems to work for me when compiling OpenBLAS with GCC 12.
Can you tell me your build options for OpenBLAS, please? I'm trying with clang/flang-new from LLVM 18.1 but getting a missing reference to FortranACharacterCompareScalar1 when I replace the library that comes with the Julia nightly.
I was wrong with all my references to GCC versions; I forgot a detail that was mentioned in JuliaLang/LinearAlgebra.jl#1086, but not here: the segmentation fault happens when setting the number of threads explicitly. For x86_64 it should be the `OPENBLAS_NUM_THREADS` environment variable.
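For anyone reproducing: the thread count can be pinned either through the environment, which OpenBLAS reads once at initialization, or through the API call used in the reproducer further down. A minimal sketch, assuming the `64_`-suffixed symbol names from Julia's ILP64 OpenBLAS build (a stock libopenblas exports the unsuffixed `openblas_set_num_threads`):

```c
// Assumed 64_-suffixed entry point from Julia's ILP64 OpenBLAS build; a
// stock libopenblas exports plain openblas_set_num_threads instead.
extern void openblas_set_num_threads64_(int num_threads);

int main(void) {
    // Equivalent in effect to launching the process with
    // OPENBLAS_NUM_THREADS=16 set in the environment, except that the
    // environment variable is read once at library initialization while
    // this call can resize the thread pool at runtime.
    openblas_set_num_threads64_(16);
    return 0;
}
```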
As an additional datapoint, I don't get any crash when the environment variable is not set. Is there a different setup of threading in OpenBLAS when the env var is set?
I could reproduce it without setting any environment variable; perhaps the difference is just in random memory layout/contents, if this turns out to be uninitialized or overflowed memory.
Indeed, it looks like the code reshuffle to support other threading backends has created a loophole for starting a thread without having a memory buffer allocated for it. Working on a patch (or rather, I have a patch but no full understanding of the root cause yet).
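To illustrate the class of bug being described, a hedged sketch (not OpenBLAS's actual code): if the thread pool is grown along a code path that skips the per-thread buffer allocation, a freshly started worker dereferences a NULL buffer pointer.

```c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NTHREADS 4

// Per-thread scratch buffers; in a correct pool these are allocated
// before the owning thread starts doing work.
static double *thread_buffer[NTHREADS];

static void *worker(void *arg) {
    int id = (int)(long)arg;
    // If the allocation step was skipped for this id, this writes
    // through a NULL pointer and segfaults.
    memset(thread_buffer[id], 0, 64 * sizeof(double));
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    for (int i = 0; i < NTHREADS; i++) {
        // Simulated loophole: one thread is started without its buffer,
        // e.g. because a second threading backend grows the pool without
        // going through the allocation path.
        if (i != NTHREADS - 1)
            thread_buffer[i] = malloc(64 * sizeof(double));
        pthread_create(&tid[i], NULL, worker, (void *)(long)i);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    puts("done (reached only if every buffer was allocated)");
    return 0;
}
```

Compile with `gcc -pthread`; the last worker segfaults because its buffer was never allocated, which is the same failure mode as a pool thread running without its backing buffer.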
@martin-frbg I backported your patch from JuliaLang/julia#4871 with JuliaPackaging/Yggdrasil#9262, but I'm still seeing segmentation faults on some machines, though not everywhere.
What's interesting is that the Ryzen 9 3950X and the EPYC 7742 have the same OpenBLAS core type, both "Zen", but the Ryzen 9 3950X also has fewer threads.
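For comparing machines like this, OpenBLAS exposes the detected core name and the configured thread count via `openblas_get_corename` and `openblas_get_num_threads`; a small sketch, again assuming the `64_`-suffixed names from Julia's build:

```c
#include <stdio.h>

// Utility symbols from OpenBLAS; the 64_ suffix is an assumption based on
// Julia's renamed ILP64 build, a stock libopenblas exports the unsuffixed
// names.
extern char *openblas_get_corename64_(void);
extern int openblas_get_num_threads64_(void);

int main(void) {
    printf("core type: %s\n", openblas_get_corename64_());
    printf("threads:   %d\n", openblas_get_num_threads64_());
    return 0;
}
```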
Indeed, it looks like I simplified yesterday's patch too far before committing, sorry. I have reproduced the persistence of the problem on the EPYC in the GCC compile farm and will push an updated PR shortly.
Out of curiosity, did you manage to craft a C reproducer? I'm failing badly at that 😞
I haven't even tried yet, and I'm wondering if this may somehow be connected to how Julia uses OpenBLAS (though I couldn't say what might be special about it). If it were that easy to get on the "bad" code path simply by calling TRMM or GEMM, this should have blown up in the LAPACK, numpy and scipy testsuites, or even one of our trivial benchmarks, prior to the release.
@ViralBShah over at JuliaLang/LinearAlgebra.jl#1086 prepared a C program which I can confirm segfaults for me on a MacBook M1 with OpenBLAS v0.3.28, but not v0.3.23, nor with JuliaLang/julia#4879:

```c
#include <stdio.h>
#include <stdlib.h>

// Prototypes for the ILP64 (64_-suffixed) symbols in the OpenBLAS build
// shipped with Julia; the originally posted reproducer omitted these,
// hence the implicit-declaration warnings quoted further below.
extern void openblas_set_num_threads64_(int num_threads);
extern void dsyevd_64_(char *jobz, char *uplo, long *n, double *a, long *lda,
                       double *w, double *work, long *lwork, long *iwork,
                       long *liwork, long *info, size_t jobz_len,
                       size_t uplo_len);

int main() {
    long n = 1024;
    double *a;
    a = (double*) malloc(n*n*sizeof(double));
    openblas_set_num_threads64_(16);
    // Check if the memory has been successfully
    // allocated by malloc or not
    if (a == NULL) {
        printf("Memory not allocated.\n");
        exit(0);
    }
    for(long row=0; row<n; row++){
        for(long col=0; col<=row; col++){
            a[row*n + col] = 1.0;
        }
    }
    // Allocate space for eigenvalues
    double *w;
    w = (double*) malloc(n * sizeof(double));
    if (w == NULL) {
        printf("Memory not allocated.\n");
        exit(0);
    }
    // Leading dimension of the matrix
    long lda = n;
    // workspace query
    double *work = (double*) malloc(sizeof(double));
    long lwork = -1;
    long *iwork = (long*) malloc(sizeof(long));
    long liwork = -1;
    long info = -100;
    char jobz = 'V';
    char uplo = 'U';
    dsyevd_64_(&jobz, &uplo, &n, a, &lda, w, work, &lwork, iwork, &liwork, &info, 1, 1);
    // Workspace allocation
    lwork = work[0];
    work = (double *) malloc(lwork * sizeof(double));
    liwork = iwork[0];
    iwork = (long *) malloc(liwork * sizeof(long));
    dsyevd_64_(&jobz, &uplo, &n, a, &lda, w, work, &lwork, iwork, &liwork, &info, 1, 1);
    // Print eigenvalues
    printf("First 5 Eigenvalues:\n");
    for (int i = 0; i < 5; i++) {
        printf("%f\n", w[i]);
    }
    return 0;
}
```

However the same program doesn't segfault for me with OpenBLAS v0.3.28 on the other machines I mentioned at #4868 (comment) where I could still reproduce the segfault after JuliaLang/julia#4871.
Actually, I can reproduce it on the other machines, but I need to increase the number of threads quite a bit, perhaps depending on the system. For example this segfaults for me on an AMD EPYC 7742 with 384 threads:

```diff
- openblas_set_num_threads64_(16);
+ openblas_set_num_threads64_(384);
```

```console
$ gcc -o test test.c -L${HOME}/.julia/juliaup/julia-nightly/lib/julia -Wl,-rpath,${HOME}/.julia/juliaup/julia-nightly/lib/julia -lopenblas64_
test.c: In function ‘main’:
test.c:9:5: warning: implicit declaration of function ‘openblas_set_num_threads64_’ [-Wimplicit-function-declaration]
     openblas_set_num_threads64_(384);
     ^~~~~~~~~~~~~~~~~~~~~~~~~~~
test.c:46:5: warning: implicit declaration of function ‘dsyevd_64_’ [-Wimplicit-function-declaration]
     dsyevd_64_(&jobz, &uplo, &n, a, &lda, w, work, &lwork, iwork, &liwork, &info, 1, 1);
     ^~~~~~~~~~
$ ./test
Segmentation fault (core dumped)
```
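Since the crashing thread count appears to be machine-dependent, a variant of the reproducer that takes the count from the command line makes it easier to scan for the threshold. A sketch under the same assumptions as above (`64_`-suffixed symbols; the file name `scan.c` is hypothetical):

```c
#include <stdio.h>
#include <stdlib.h>

// Same assumed ILP64-suffixed symbols as in the reproducer above.
extern void openblas_set_num_threads64_(int num_threads);
extern void dsyevd_64_(char *jobz, char *uplo, long *n, double *a, long *lda,
                       double *w, double *work, long *lwork, long *iwork,
                       long *liwork, long *info, size_t jobz_len,
                       size_t uplo_len);

int main(int argc, char **argv) {
    long n = 1024;
    int nthreads = argc > 1 ? atoi(argv[1]) : 16;
    openblas_set_num_threads64_(nthreads);

    // Lower triangle of ones, as in the original reproducer.
    double *a = malloc(n * n * sizeof(double));
    double *w = malloc(n * sizeof(double));
    for (long row = 0; row < n; row++)
        for (long col = 0; col <= row; col++)
            a[row * n + col] = 1.0;

    char jobz = 'V', uplo = 'U';
    long lda = n, info = -100, lwork = -1, liwork = -1;

    // Workspace query followed by the real decomposition.
    double wkopt;
    long iwkopt;
    dsyevd_64_(&jobz, &uplo, &n, a, &lda, w, &wkopt, &lwork, &iwkopt,
               &liwork, &info, 1, 1);
    lwork = (long) wkopt;
    liwork = iwkopt;
    double *work = malloc(lwork * sizeof(double));
    long *iwork = malloc(liwork * sizeof(long));
    dsyevd_64_(&jobz, &uplo, &n, a, &lda, w, work, &lwork, iwork,
               &liwork, &info, 1, 1);

    printf("nthreads=%d info=%ld w[0]=%f\n", nthreads, info, w[0]);
    return 0;
}
```

Then something like `for t in 16 32 64 128 256 384; do ./scan $t; done` shows where (and whether) a given machine starts to crash.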
Cross-posting from the Julia issue tracker (JuliaLang/LinearAlgebra.jl#1086): symmetric eigenvalue problems (`dsyev`/`dsyevd`/`dsyevr`) seem to be segfaulting for matrix sizes of 64x64 and above on x86_64-linux-gnu using OpenBLAS v0.3.28.

How to reproduce in Julia (using a nightly version > v"1.12.0-DEV.1038" that includes OpenBLAS v0.3.28):
System info:
Matrices smaller than 64x64 do not produce this error.