gbmv! segfaults for large matrices #1698
Comments
Which version of OpenBLAS is this ? |
According to
|
4 cores I think. It's a MacBook Pro 2017. |
reproduced with a git clone of julia. gdb claims the action happens in dscal_kernel_8_zero called from dscal_k_HASWELL. (So pretty much the same signature as #1089 indeed). |
It seems macOS now adds mprotect-ed guard pages around allocations to catch overflows and underflows. I suspect https://github.com/xianyi/OpenBLAS/blob/25f2d25cfecf6cf850aa5b989f279e5023b6234b/driver/level2/gbmv_thread.c#L98 where I removed buffer: the padding-looking adjustment should perhaps go to x instead (?), in which case the code was wrong both before and after my change (?) |
Unlikely as that line is inside an |
Is L246 writing past the end of queue? Same rationale as 529bfc3 (but not reflecting the files there 1:1) |
Nope, does not help. (And interestingly it appears to be a random thread from the middle of the queue array that experiences the segfault. Valgrind is no use because of JuliaLang/julia#27131 ) |
The work divider behaves somewhat counter-intuitively, in this particular case 25(0) 25(0)1 25(0) 24(9) |
...It does not crash with a single thread, but does crash with multiple threads |
Do you have a reproducer that does not need julia ? |
@martin-frbg it should be identical to scipy's dgbmv, but that goes via cblas by preference. R exports it via R_ext.h, but no standard or popular module wraps it into a scriptable resource. |
Unfortunately building with -fsanitize=address appears to make the bug go away. (it reappears when running julia from gdb, but address sanitizer does not provide any additional information in that context) |
Hmm. Perhaps BUFFER_SIZE as defined in common_x86_64.h is actually too small here ? |
It points to the threading threshold in interface/scal.c
I don't know if that is a thread-safety issue (in principle, dscal called at this level should run single-threaded?) |
Not sure I understand your find - when scal is called from gbmv_thread, it gets passed an invalid buffer pointer ("y") already. All the setup code in gbmv_thread.c seems to be concerned with partitioning a buffer that is apparently assumed to always be big enough. That is why I now suspect that setting BUFFER_SIZE to 64<<20 or 128<<20 (as seen in the ppc and ia64 headers respectively) might help. I will not be able to test this before the next weekend unfortunately. |
Increasing BUFFER_SIZE does not help. (Increasing the stack size limit does, same as in #1089) |
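For anyone wanting to check which stack limit their process is actually running under, here is a minimal Python sketch using the standard `resource` module (Unix only). The helper names are made up for illustration; it only reads the limit and does not claim to reproduce or fix the crash.

```python
import resource


def stack_limit():
    """Return the (soft, hard) stack size limits of the current process, in bytes."""
    return resource.getrlimit(resource.RLIMIT_STACK)


def describe(limit):
    """Format a limit value; RLIM_INFINITY means no limit is set."""
    if limit == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{limit // 1024} KiB"


if __name__ == "__main__":
    soft, hard = stack_limit()
    print("soft stack limit:", describe(soft))
    print("hard stack limit:", describe(hard))
```

The soft limit is what `ulimit -s` reports in a shell; a process started from that shell inherits it.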
|
Some of the valgrind traces look as if Julia's garbage collector moved the work array while the OpenBLAS threads were trying to access it, but I cannot really make sense of it yet. What seems clear however is that this condition already existed with 0.2.13 (and probably much earlier, but trying older versions gets tedious as they lacked the symbol suffixes for interface64 builds.) |
Revisiting this, it now looks to me as if this is a fundamental issue rather than a coding bug - each thread needs to allocate a memory range big enough to hold the entire matrix from within the "buffer" region that is defined by the BUFFER_SIZE constant in |
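To illustrate the arithmetic behind this, here is a minimal Python sketch. It assumes, as described above, that every thread claims a region the size of the whole matrix from one shared buffer. The function name and the 32<<20 figure are assumptions for illustration only, not taken from the OpenBLAS sources; 64<<20 and 128<<20 are the sizes mentioned earlier in this thread.

```python
def max_threads_for_buffer(rows, cols, buffer_size, elem_size=8):
    """How many per-thread regions of rows*cols elements fit into buffer_size bytes.

    elem_size defaults to 8 bytes (double precision).
    """
    per_thread = rows * cols * elem_size
    if per_thread <= 0:
        raise ValueError("matrix dimensions must be positive")
    return buffer_size // per_thread


if __name__ == "__main__":
    # Hypothetical matrix shape, checked against three candidate buffer sizes.
    for size in (32 << 20, 64 << 20, 128 << 20):
        print(size, max_threads_for_buffer(2000, 2000, size))
```

Under this assumption the number of threads that can be served falls off quickly as the matrix grows, and a result of 0 means even a single thread's region would not fit.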
Moving to next milestone as this is still unclear to me, could be similar in concept to bug #173 |
Made configurable in the build system for now to avoid having to hack common_x86_64.h. |
closing as the original issue is fixed by increasing the default size and making it configurable - the requirement for this big buffer as such is a fundamental design "feature" that will need revisiting at some point |
Segfaulting by design seems like a weird choice... |
...from something like 20 years ago when computer memory was scarce and where it would have been the norm that users would configure the innards of the library for their particular workloads. |
An error saying “increase the default size” seems much more sensible |
assuming an appropriate single location for such a check can be found (that does not cost performance during regular operation). Increasing the default to accommodate your case (and match what the various internal parameters would have required in the first place) has already led to certain misgivings in the python camp, so a more fundamental long-term resolution would certainly be desirable. |
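As a rough illustration of what such a check could look like, here is a Python sketch of a guard that fails with a descriptive message instead of running past the buffer. All names here are hypothetical and nothing like this is claimed to exist in OpenBLAS; the point is only that the test is a single integer comparison made once before any work is handed to threads.

```python
class BufferTooSmallError(RuntimeError):
    """Raised when the requested work cannot fit into the fixed-size buffer."""


def check_buffer(required_bytes, buffer_size):
    """Raise a descriptive error if required_bytes exceeds buffer_size."""
    if required_bytes > buffer_size:
        raise BufferTooSmallError(
            f"work area needs {required_bytes} bytes but BUFFER_SIZE is "
            f"{buffer_size}; rebuild with a larger BUFFER_SIZE"
        )


if __name__ == "__main__":
    check_buffer(1 << 20, 32 << 20)  # fits, returns silently
    try:
        check_buffer(64 << 20, 32 << 20)
    except BufferTooSmallError as err:
        print("error:", err)
```

Whether a single suitable place for this comparison exists in the real call path is exactly the open question raised above.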
I'm getting the following issue calling gbmv! in Julia, that is only present when compiled with OpenBLAS:
When compiled with Apple's system BLAS, the above examples work fine.