Crash in ZTRMV for large thread counts #4269
"Crash" meaning a segfault without additional information, or do you get some kind of error or warning message around the time of the crash? (And does it crash only when the number of threads is large?)
It just segfaults without additional information. It works for 32 threads (and fewer). It segfaults for 64 threads and more. I haven't tested between 32 and 64 threads but can do that if it would help.
It survived on IBM POWER9 with 128 threads, suggesting the problem is not in the interface/thread-setup code shared between all architectures. Now looking for a sufficiently big arm64 host... One explanation could be running out of preallocated memory buffers (especially if you did not build OpenBLAS on that exact machine configuration, or did not set NUM_THREADS sufficiently high at build time), but that would normally print a warning.
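For context, NUM_THREADS is a build-time option that sizes OpenBLAS's preallocated buffer pool; a build sketch (the value 128 is an illustrative assumption, not taken from this thread) looks roughly like:

```shell
# Size the thread/buffer pool for up to 128 threads at build time,
# so a binary built on a small machine still works on a big one.
make NUM_THREADS=128
```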
The build machine is a much more modest computer. I typically compile OpenBLAS with:
Sorry for overlooking the NUM_THREADS setting. I wonder if you could run a DEBUG=1 build from gdb so that we get an idea where it blows up? (Or, if that is too bothersome, perhaps a build of the current
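The suggested debugging session would look roughly like this (a sketch; `./repro` is a placeholder for the crashing binary, not a name from this thread):

```shell
# Rebuild OpenBLAS with debug symbols and without optimization.
make clean
make DEBUG=1

# Run the reproducer under gdb and capture a backtrace at the segfault:
gdb --args ./repro
# inside gdb:
#   (gdb) run
#   (gdb) bt     # shows which function/kernel the crash occurred in
```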
I gave it a try with OpenBLAS-master and
Thanks. That's an unexpected location for the crash, and one that points more towards the common code for splitting the workload between threads again. I wonder why I do not see this on the architectures where I have big systems to play with. (Again, you could try replacing ZSCALKERNEL in kernel/arm64/KERNEL.ARMV8SVE with ../arm/zscal.c to get a more naive and more readable C implementation, but if that gets fed nonsense by the level2 trmv_thread.c it won't work either...)
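The suggested kernel substitution amounts to a one-line change in the KERNEL file (a sketch; the original right-hand side is whatever SVE kernel the file currently names):

```make
# kernel/arm64/KERNEL.ARMV8SVE
# Swap the SVE assembly zscal kernel for the generic C implementation:
ZSCALKERNEL = ../arm/zscal.c
```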
I am currently running on Neoverse N1 (without SVE), so I changed from
Weird, but at least it seems to match the previous crash in the assembly kernel: the scaling function was told to scale by zero, so it is setting array elements to zero, but the address of the array element appears to be completely bogus.
But multithreaded TRMV had a somewhat gory history in the 2017..2019 timeframe (#1332), and my go-to platform for large thread counts has a larger DTB_ENTRIES count than just about everybody else - maybe the old problem has been pushed out to larger matrix dimensions but not actually fixed yet.
I still fail to reproduce this (even if I reduce DTB_ENTRIES on platforms that have a larger setting); the partitioning of the workload looks perfectly fine.
The benchmark doesn't crash, which is very interesting. One difference is that I am calling the code from C++ with
Interesting. If it was a mismatch of complex number representation I would expect garbage output, or crashing that does not depend on the number of threads. But have you verified that the output is correct for 32 threads or less, or just that it did not crash there?
OK, so here is a quick-and-dirty demo. Build OpenBLAS master with:
Then the example program:
Build with
Run with up to (and including) 57 threads - no problem. Run with 58 or more threads - segfault.
No crash seen so far on POWER9 (128 threads), so probably not a general problem in the code parts common to all architectures. (Currently waiting to see if valgrind brings up anything interesting there, while the arm64 machine is still struggling to fill the input array.)
I ran the example on a Cooper Lake system with 32c/64t and
It could be, but all the workload splitting and thread setup happens in common code, and I see no indication of anything being wrong at lower thread counts on ARM64. (ZSCAL is actually the first BLAS function to be called in OpenBLAS' implementation of ZTRMV, and you/we already replaced that arm64 kernel with a trivial C implementation above.) Maybe the size of the on-stack buffer as calculated in interface/ztrmv.c is invalid, but there is nothing arm64-specific to that. I'll see if I can get a test run on AWS - CirrusCI does not let me use that many cores, and the GCC Compile Farm lacks a big armv8 server.
Hmm. One other thing that is architecture-specific is the size of the memory buffer used to transfer partial results from parallel threads in Goto's blocked matrix algorithm - there is some dark magic in there, and the default base value for the computation, BUFFER_SIZE in common_arm64.h, is smaller on ARM64 than it is on either x86_64 or POWER (probably owing to a legacy of smartphones and small appliances).
Yes, with
Thanks for testing - I hadn't gotten around to mangling our AWS Cirun job yet. Will commit that PR then - and there's probably something "historically" wrong in the calculation of the buffer size in general.
Thanks!
I am observing a crash in ZTRMV for large thread counts. I have observed the crash with 64 and 128 threads on ARM64. The crash occurs on Graviton 3 and Ampere Altra CPUs, at least.
This is how OpenBLAS is compiled:
It still crashes even if I disable DYNAMIC_ARCH.

The input for which ZTRMV crashes:
I tested versions 0.3.23 and 0.3.24 - both have the problem. Interestingly, the NVIDIA implementation (which comes with the NVIDIA HPC SDK) suffers from the same problem, but Arm Performance Libraries does not.