Intermittent segfault from getenv #716
Comments
There should be one more line at the top of the backtrace (bt), and (gdb) thread apply all bt may be more useful.
It could also be that the actual memory corruption happens before this, and the code only blows up at the next malloc. Can you update your OpenBLAS to 0.2.15 (or even a git develop branch snapshot), and/or run your code under valgrind in the hope of catching any earlier corruption?
Interesting. I have never seen a segfault from getenv before.
@martin-frbg this thread is new; some other thread is already in a weird state.
@brada4 yes I see that. My take is that memory management got trashed before this point, which went unnoticed until the next allocation, which happens to be the tiny amount of memory needed for the OMP_NUM_THREADS value or whatever getenv() is sent to fetch on creation of a new thread. So the current backtrace only tells us that there is a serious problem "somewhere", perhaps not even through OpenBLAS' fault.
I have pasted the result of "(gdb) thread apply all bt" below. Also we installed the debug symbols for glibc in the Amazon EC2 instances so we got better info about crash in
It seems that OpenBLAS is initialised on a thread and calls
On a different occasion I got the following backtraces (there is more in thread 4 but I pasted the important bits):
Looking into this with @dedalusj, I think the detailed chain is something like:
Now the race condition - the following occur simultaneously:
Now, reading the glibc source,
If all of the above is correct, I think the fix is something like calling
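As a general illustration of keeping environment lookups out of worker threads, here is a minimal Python sketch that reads OMP_NUM_THREADS once during single-threaded start-up and serves a cached value afterwards. POSIX does not require getenv() to be safe against concurrent modification of the environment, which is why this pattern is common. It is an assumption about one way to avoid concurrent environment access, not the change proposed in this thread and not OpenBLAS's actual code; init_config, num_threads, parse_num_threads, and worker are hypothetical names.

```python
import os
import threading

_DEFAULT_THREADS = 1
_cached_num_threads = None  # filled in once by init_config()


def parse_num_threads(raw, default=_DEFAULT_THREADS):
    """Parse a thread-count string; fall back to the default on bad input."""
    try:
        value = int(raw)
    except (TypeError, ValueError):
        return default
    return value if value > 0 else default


def init_config(environ=os.environ):
    """Read OMP_NUM_THREADS once, before any worker threads are started."""
    global _cached_num_threads
    _cached_num_threads = parse_num_threads(environ.get("OMP_NUM_THREADS"))


def num_threads():
    """Return the cached value; this never touches the environment."""
    if _cached_num_threads is None:
        raise RuntimeError("init_config() must be called first")
    return _cached_num_threads


def worker(results, index):
    # Workers only read the cached setting, so they cannot race with
    # another thread that is modifying the environment.
    results[index] = num_threads()


if __name__ == "__main__":
    init_config()  # single-threaded start-up phase
    results = [None] * 4
    threads = [threading.Thread(target=worker, args=(results, i)) for i in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(results)
```

The design choice here is that the only environment read happens before any thread exists, so later thread creation does not depend on the state of the environment at all.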
@c42f, thank you for the suggestion. I will work on this issue.
@xianyi thanks a lot. I'm fairly satisfied we had the root cause in the above, but I'm afraid we never managed to get a minimal reproducible test case, so testing will be difficult. @dedalusj do we still have the infrastructure in place to rerun our large-scale code when the changes here flow through into numpy?
@c42f Sure, no problem.
I am experiencing intermittent segmentation faults from OpenBLAS when it is used through numpy.
I got a core dump and the backtrace is the following:
I noticed the same backtrace coming from various code paths.
The version of OpenBLAS is 0.2.14. If it makes your life easier, I got it through the conda packaging system, and their build number is 3.