
OpenBLAS crash from GSL (only with very large matrix) #2387

Closed
ghost opened this issue Feb 4, 2020 · 28 comments

Comments

@ghost commented Feb 4, 2020

I've encountered a crash in OpenBLAS, in both version 0.2.19 (from the Fedora EPEL repo for RHEL) and version 0.3.8 (which I built myself). The crash occurs on an AWS x1e.32xlarge instance running the RHEL/Fedora-family OS that ships with the Amazon Linux AMI. The instance has about 4 TB of RAM and 128 CPUs.

The stack trace is:
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff6b6eae0 in dscal_kernel_8_zero () from /usr/lib64/libopenblas.so.0

#0 0x00007ffff6b6eae0 in dscal_kernel_8_zero () from /usr/lib64/libopenblas.so.0
#1 0x00007ffff6b6ed34 in dscal_k_HASWELL () from /usr/lib64/libopenblas.so.0
#2 0x00007ffff5e17bff in symv_kernel () from /usr/lib64/libopenblas.so.0
#3 0x00007ffff5ff3549 in exec_blas () from /usr/lib64/libopenblas.so.0
#4 0x00007ffff5e17eb6 in dsymv_thread_U () from /usr/lib64/libopenblas.so.0
#5 0x00007ffff5dd4a46 in cblas_dsymv () from /usr/lib64/libopenblas.so.0
#6 0x00007ffff79b7d80 in gsl_blas_dsymv (Uplo=Uplo@entry=CblasLower, alpha=alpha@entry=1.1044668107764084, A=A@entry=0x7fffffffc800, X=X@entry=0x7fffffffc7a0,
beta=beta@entry=0, Y=Y@entry=0x7fffffffc7d0) at blas.c:782
#7 0x00007ffff7a1d732 in gsl_linalg_symmtd_decomp (A=A@entry=0x6b1eac50, tau=tau@entry=0x7fffffffc970) at symmtd.c:97
#8 0x00007ffff79c07e9 in gsl_eigen_symm (A=A@entry=0x6b1eac50, eval=eval@entry=0x6b1f5cf0, w=w@entry=0x6b12f280) at symm.c:126

You can see that the crash occurs during a call to the GSL function gsl_eigen_symm. The input to gsl_eigen_symm, in this case, is a 42898x42898 symmetric matrix.

At the time of the crash there are 128 threads, most of them in either dscal_kernel_8_zero or dsymv_U_HASWELL.

I saw some old posts about a crash with a similar stack trace, but it wasn't clear to me whether the underlying issue was fixed. Is this a known issue with a simple workaround?

@ghost (Author) commented Feb 4, 2020

In the particular case I'm debugging now, the crash occurred in thread 29, with the backtrace:

#0 0x00007ffff6b6eae0 in dscal_kernel_8_zero () from /usr/lib64/libopenblas.so.0
#1 0x00007ffff6b6ed34 in dscal_k_HASWELL () from /usr/lib64/libopenblas.so.0
#2 0x00007ffff5e17bff in symv_kernel () from /usr/lib64/libopenblas.so.0
#3 0x00007ffff5ff2e77 in blas_thread_server () from /usr/lib64/libopenblas.so.0
#4 0x00007ffff41b4e75 in start_thread (arg=0x7fffb205e700) at pthread_create.c:307
#5 0x00007ffff3edd8fd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

@martin-frbg (Collaborator)

Probably #1698 - meaning you would need to increase the BUFFER_SIZE in common_x86_64.h

@ghost (Author) commented Feb 4, 2020

OK. I can test this now. I also saw some mention of increasing the stack limit in #1698. Should I do that as well? ulimit -a tells me my stack size is currently 8192 KB. Any guidance on how much I should increase the buffer size to be safe?

@ghost (Author) commented Feb 4, 2020

Informational note: increasing the stack size from 8 MB to 64 MB did not fix the crash. Next, I'll rebuild the library with an aggressive increase in the buffer size.

@brada4 (Contributor) commented Feb 5, 2020

Are you using single-threaded OpenBLAS from multiple threads, or how is your application set up?

@ghost (Author) commented Feb 5, 2020

I'm using multi-threaded OpenBLAS, which is being invoked from a single thread of my application.

@ghost (Author) commented Feb 5, 2020

I'm still debugging the crash. When I changed the BUFFER_SIZE from "32 << 20" to "1024UL << 20", that did, in fact, stop the crash. But, after the change, it seemed as if the code was going to run indefinitely, so I terminated the program. (The run time was extraordinarily large, even compared to other large matrices of comparable size.) I'll look deeper into what's happening today.
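For anyone following along, the change described here amounts to editing one define before building. A minimal sketch, run on a stand-in copy of the header (in a real build you would edit common_x86_64.h in the OpenBLAS source tree):

```shell
# Stand-in file for the demo; in a real tree, edit OpenBLAS's common_x86_64.h.
printf '#define BUFFER_SIZE\t(32 << 20)\n' > common_x86_64.h
# Widen the per-thread buffer from 32 MB to 1 GB (the values used in this thread).
sed -i 's/#define BUFFER_SIZE[[:space:]]*(32 << 20)/#define BUFFER_SIZE (1024UL << 20)/' common_x86_64.h
cat common_x86_64.h
# → #define BUFFER_SIZE (1024UL << 20)
```

After the edit, one would rebuild with the same make invocation used elsewhere in this thread (make FC=gfortran DYNAMIC_ARCH=1 NUM_THREADS=128).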

@martin-frbg (Collaborator)

Yep, the whole thing is fishy, which is why #1698 is still open (though I did not see excessive run times with that test case). With 128 threads and a huge matrix you could be hitting some other bug as well - e.g. there may be a race condition in level3_thread.c (#2363, helgrind warnings on code protected only by memory barriers).

@ghost (Author) commented Feb 5, 2020

Update: I swapped out OpenBLAS with the Intel MKL, and this also solves the crash. With the MKL, the runtime is also longer than I expected (appears to be the same as with the hacked OpenBLAS I built). I'm investigating the runtime in depth now, but the preliminary evidence suggests that increasing the buffer size did fix the problem for OpenBLAS.

@brada4 (Contributor) commented Feb 5, 2020

Can you produce a profiler report?
perf record <your command>
then
perf report
and just post the text, like the first screen. Maybe it reveals some anomaly.

@ghost (Author) commented Feb 6, 2020

After further investigation, there's no problem at all with the speed. The computation took longer than I expected because it hit a corner case which required the computation of all the eigenvalues of a symmetric matrix (which I've now improved to compute only a single eigenvalue).

@ghost (Author) commented Feb 6, 2020

The conclusion is the following. There's a bug in OpenBLAS which causes a segmentation fault, but that crash can be avoided by increasing the BUFFER_SIZE, as suggested above. Alternatively, the crash can also be avoided by swapping out OpenBLAS with another BLAS library, like the Intel MKL.

@ghost (Author) commented Feb 6, 2020

I understand how incredibly difficult it is to maintain any software, and much more so when the code is highly optimized, portable, and extremely low level. (And, on top of it all, to do it essentially for free!) But if OpenBLAS has not been thoroughly tested on architectures with more than 32 cores (as I saw in a recent post here), and many people have reported problems in that regime, then this caveat should be made much clearer to prospective users on the OpenBLAS website. Computing on machines with many CPU cores is now very common (anyone can get access to such machines via AWS EC2), and the CPU limitations of the library will be an important consideration for many scientists and engineers deciding which open-source library to use.

Even though the change in the BUFFER_SIZE has fixed the crash for me, I worry about whether I should continue using the library for very large computations, or whether I should just switch over to the Intel MKL, which is now available for free as well. What do you, as the maintainers of OpenBLAS, recommend?

@martin-frbg (Collaborator) commented Feb 6, 2020

I am not into marketing - use whatever you like, but be sure to check results (for plausibility at least) wherever you can. Do not expect MKL (or anything else) to be bug-free just because it is supplied by a big corporation. Contributors to free software may be amateurs when it comes to that particular kind of task, but more often than not they will be professionals in a related discipline.

All I can say is that I can not vouch for OpenBLAS performance on big systems at the moment, and a few issues have been reported lately (which are out in the open here, though not all may turn out to be actual OpenBLAS bugs). (But if you read the MKL end user license, I bet it will be fit for decorative purpose only as well 😄 ) OpenBLAS is now used by very many programs (knowingly or unknowingly), so it could perhaps even be claimed that it must be perfect if there are only around 130 bugs reported against it. (Most of them involve specific, even outdated hardware, or build problems with a particular compiler or operating system).

I may not be up to date on the licensing terms of MKL (and the other bits related to the Intel compilers) - at least until a few years ago, it used to be non-redistributable and any hint of grant money or other payments received in relation to one's work disallowed use of the free version.

@brada4 (Contributor) commented Feb 6, 2020

Could you elaborate on

very large computations

You still did not elaborate on the size of input matrices. Those pesky debug symbols, you know.

@martin-frbg (Collaborator) commented Feb 6, 2020

@brada4 a matrix size of 42k by 42k was mentioned from the very start; in #1698 it was a 3 by 100,000,000 array (more manageable in terms of memory, but still too big for the default BUFFER_SIZE). As there seem to be no other side effects, this setting could be exposed in Makefile.rule (and CMakeLists.txt) - I guess not everybody would require (nor want to malloc) a gigabyte-sized buffer, although "common" memory sizes on x86_64 have certainly improved since the days of GotoBLAS. (The x86_64 default was copied from GotoBLAS2 of 2009; GotoBLAS-1.00 from 2005 even had 64<<20 once. All later architectures like ARM or ZARCH probably started out by borrowing settings from an existing one - with common_zarch.h it is even obvious that it was copypasted from common_arm64.h.)
Perhaps the actual message here is that OpenBLAS needs to attract more users from structural genomics :)
Perhaps the actual message here is that OpenBLAS needs to attract more users from structural genomics :)

@brada4 (Contributor) commented Feb 6, 2020

There is one STACK_ALLOC instance in interface/gemv.c which in turn makes note about being too fragile for optimisations.
Not sure if MAX_STACK_ALLOC=0 does some good in this case.
What I suspected was int32 overflow in internal calculations, not yet there....

@martin-frbg (Collaborator)

Lowercase "buffer_size" in that file is something else entirely, and is only on the order of the sum of the matrix dimensions (and won't actually go on the stack if it gets too big for that).

@ghost (Author) commented Feb 6, 2020

I have a technical question now about the patched build of the library I made. I built the library using
'make FC=gfortran DYNAMIC_ARCH=1 NUM_THREADS=128'
This results in a shared object libopenblasp-r0.3.8.dev.so and a couple of symlinks:
libopenblas.so.0 -> libopenblasp-r0.3.8.dev.so
libopenblas.so -> libopenblasp-r0.3.8.dev.so

Now, in the makefile for my application, I have -lblas on the link line, so I also made the symlink
libblas.so -> libopenblasp-r0.3.8.dev.so
(I do this so that my application is unaware of which BLAS library is actually being used.)

I noticed that, if I remove the libopenblas symlinks, my application fails at startup because it can't find the libopenblas .so files. Any idea why that is? The program should be looking up libblas.so, not libopenblas.so. I'm running on a RHEL/Fedora system, which doesn't have an alternatives mechanism like Ubuntu (as far as I know).

I then checked the older version of libopenblas that I got from my package manager. In /usr/lib64, there are many shared objects and symlinks, but two of them are:
libblas.so -> libopenblasp64-r0.2.19.so
libopenblas.so -> libopenblas-r0.2.19.so
So the libblas is multi-threaded, but the libopenblas is not (right?). The version of openblas that I had linked into my program before was clearly multi-threaded, but presumably it also depended on libblas.so?

I'm a little confused at this point. I'm asking this question because I want to ensure I haven't made some mistake with my custom build of the library.

@martin-frbg (Collaborator)

Check with ldd what libraries it is actually trying to load (perhaps the "old" symlinks in /usr/lib64 take precedence over wherever you put the new libopenblas?). Also make sure that your libblas.so symlink actually points where you wanted it to go. A libopenblas.so is highly unlikely to depend on a separate libblas.so, as it contains a complete BLAS implementation itself. (There is an option to compile without the LAPACK component, however.)
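A minimal illustration of the ldd suggestion, shown on /bin/ls since any dynamically linked binary works (my_app below is a placeholder name, not from this thread):

```shell
# ldd prints the libraries the dynamic loader would resolve for a binary.
# For the case discussed here you would run something like:
#   ldd ./my_app | grep -i blas      # my_app is a hypothetical binary name
ldd /bin/ls | head -n 5
```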

@ghost (Author) commented Feb 6, 2020

OK, I see from ldd that my application was built against libopenblas.so.0, not libblas.so, which explains that observation.

@ghost (Author) commented Feb 6, 2020

Aha, now I see what happened. The following command
'readelf -a libopenblasp-r0.3.8.dev.so | grep SONAME'
returns
0x000000000000000e (SONAME) Library soname: [libopenblas.so.0]

This means that the library I built, libopenblasp-r0.3.8.dev.so, has the DT_SONAME field set to "libopenblas.so.0". (This must be specified somewhere in the OpenBLAS build logic.) Now, whenever a program is linked against a shared object, the linker records the DT_SONAME field of that .so file. If it's set, then at run time the loader will look up the shared object by that name, NOT by the name of the library that was specified on the link line!

So I linked my application against -lblas (/usr/lib64/libblas.so), which is a symlink to /usr/lib64/libopenblasp-r0.3.8.dev.so. Then, when I run my application, the loader actually looks for the file libopenblas.so.0, because that's the DT_SONAME in /usr/lib64/libopenblasp-r0.3.8.dev.so.

This means that the libopenblas.so.0 file must always be present, or else the loader will fail.

@martin-frbg (Collaborator)

Yes, sorry for not being more specific earlier - this naming is by convention, to include the exact major version of the library required (in case of ABI-breaking changes in later library versions). This is set in exports/Makefile.

@ghost (Author) commented Feb 7, 2020

One last question before I happily close this issue, with thanks for all the help and insight I've received. I have other applications which rely on OpenBLAS, besides the one discussed here. In those applications, different application threads (C++ threads, using std::thread) independently call routines which invoke OpenBLAS methods. In these cases, I set the environment variable OPENBLAS_NUM_THREADS to a number (still greater than one) to avoid thread contention between my application threads and the threads needed by OpenBLAS. Is this thread-safe, in both versions 0.2.19 and 0.3.8 of the library? (The separate threads all operate on different variables with non-overlapping memory addresses--they're not reading or writing the same data.) I checked that the library was thread-safe in this way before I developed those applications, but now I just want to double-check with the experts themselves.
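The capping pattern just described, as a minimal sketch (the binary name is hypothetical and the cap of 4 is arbitrary):

```shell
# With e.g. 8 application threads each calling into OpenBLAS, capping the
# library's internal pool bounds the total BLAS worker count at 8 * 4 = 32.
export OPENBLAS_NUM_THREADS=4
echo "OPENBLAS_NUM_THREADS=$OPENBLAS_NUM_THREADS"
# → OPENBLAS_NUM_THREADS=4
# ./my_app    # hypothetical application binary; the variable is read at startup
```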

@martin-frbg (Collaborator)

In that case it is advisable to build 0.3.8 with USE_LOCKING=1 - any version before 0.2.20 is unlikely to have been thread safe at all. (Finding out about this and trying to fix it was what got me involved in OpenBLAS development)

@ghost (Author) commented Feb 7, 2020

I thought the USE_LOCKING option was only needed if one builds the single-threaded version of OpenBLAS. But I always build/use the multi-threaded version. Should I still build 0.3.8 with USE_LOCKING if I’m only building multi-threaded OpenBLAS?

@martin-frbg (Collaborator)

Meh - I misread, sorry about that. The comment about 0.2.19 is valid however.

@ghost (Author) commented Feb 7, 2020

OK. Got it. I'm closing this issue now, with many thanks for your help and support.

This issue was closed.