
Crash in ZTRMV for large thread counts #4269

Closed
dnoan opened this issue Oct 18, 2023 · 21 comments · Fixed by #4279

dnoan (Contributor) commented Oct 18, 2023:

I am observing a crash in ZTRMV at large thread counts. I have seen it with 64 and 128 threads on ARM64; the crash occurs on Graviton 3 and Ampere Altra CPUs, at least.

This is how OpenBLAS is compiled:

TARGET = ARMV8
DYNAMIC_ARCH = 1
CC = gcc-12
FC = gfortran-12
USE_THREAD = 1
USE_LOCKING = 1
USE_OPENMP = 1
NUM_THREADS = 256

It still crashes even if I disable DYNAMIC_ARCH.

The input for which ZTRMV crashes:

UPLO = 'U'
TRANS = 'N'
DIAG = 'N'
N = 42474
LDA = 42474
INCX = 1

I tested versions 0.3.23 and 0.3.24 - both have the problem. Interestingly, the NVIDIA implementation (which comes with NVIDIA HPC SDK) suffers from the same problem but not Arm Performance Libraries.

martin-frbg (Collaborator):

Crash meaning a segfault without additional information, or do you get some kind of error or warning message around the time of the crash? (And does it crash only when the number of threads is large?)

dnoan (Contributor, Author) commented Oct 18, 2023:

It just segfaults without additional information. It works for 32 threads (and less). It segfaults for 64 threads and more. I haven't tested between 32 and 64 threads but can do it if this would help.

martin-frbg (Collaborator):

It survived on ibm power9 with 128 threads, suggesting the problem is not in the interface/thread setup code shared between all architectures. Now looking for a sufficiently big arm64 host... One explanation could be running out of preallocated memory buffers (especially if you did not build OpenBLAS on that exact machine configuration, or set NUM_THREADS sufficiently high at build time) but that would normally print a warning.

dnoan (Contributor, Author) commented Oct 18, 2023:

The build machine is a much more modest computer. I typically compile OpenBLAS with -j32. NUM_THREADS is set to 256 as mentioned above. When I run on the build machine with 64 threads (even though it has only 32 cores), it doesn't crash.

martin-frbg (Collaborator):

Sorry for overlooking the NUM_THREADS setting. Could you run a DEBUG=1 build under gdb so that we get an idea where it blows up? If that is too bothersome, you could instead try a build of the current develop branch, to see if I might have "unintentionally" fixed this already with #4233, or a build with the ZGEMVNKERNEL and ZGEMVTKERNEL entries in kernel/arm64/KERNEL.ARMV8SVE set to ../arm/zgemv_n.c and ../arm/zgemv_t.c respectively, instead of the assembly ones. I'd have to look into setting up a "big" AWS instance to investigate this myself.
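The kernel substitution suggested here would look something like the following in kernel/arm64/KERNEL.ARMV8SVE (variable names as given above; the exact surrounding file contents differ between OpenBLAS versions):

```make
# Point the ZGEMV kernels at the generic C fallbacks instead of the
# arm64 assembly implementations (paths are relative to kernel/arm64/).
ZGEMVNKERNEL = ../arm/zgemv_n.c
ZGEMVTKERNEL = ../arm/zgemv_t.c
```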

dnoan (Contributor, Author) commented Oct 19, 2023:

I gave it a try with OpenBLAS-master and DEBUG=1:

Thread 50 "ztrmv" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xfff19f56b060 (LWP 648911)]
zscal_begin () at ../kernel/arm64/zscal.S:384
384     ../kernel/arm64/zscal.S: No such file or directory.

martin-frbg (Collaborator):

Thanks. That's an unexpected location for the crash, and one that points more towards the common code for splitting the workload between threads again. I wonder why I do not see this on the architectures where I have big systems to play with. (Again you could try replacing ZSCALKERNEL in kernel/arm64/KERNEL.ARMV8SVE with ../arm/zscal.c to get a more naive and more readable C implementation, but if that gets fed nonsense by the level2 trmv_thread.c it won't work either...)

dnoan (Contributor, Author) commented Oct 19, 2023:

I am currently running on Neoverse N1 (without SVE), so I changed from zscal.S to ../arm/zscal.c in KERNEL.NEOVERSEN1 and it crashed with the following:

Thread 50 "ztrmv" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xfff198f2b060 (LWP 675448)]
zscal_k_NEOVERSEN1 (n=20330, dummy0=0, dummy1=0, da_r=0, da_i=0, x=0xfff1915fdca0, inc_x=1, y=0x0, inc_y=0, dummy=0x0, dummy2=0) at ../kernel/arm64/../arm/zscal.c:58
58      ../kernel/arm64/../arm/zscal.c: No such file or directory.

martin-frbg (Collaborator):

Weird, but it at least seems to match the previous crash in the assembly kernel: the scaling function was told to scale by zero, so it is setting array elements to zero, but the address of the array element appears to be completely bogus.

martin-frbg (Collaborator):

But multithreaded TRMV had a somewhat gory history in the 2017..2019 timeframe (#1332) and my go-to platform for large thread counts has a larger DTB_ENTRIES count than just about everybody else - maybe the old problem has been pushed out to larger matrix dimensions but not actually fixed yet.

martin-frbg (Collaborator):

I still fail to reproduce this (even if I reduce DTB_ENTRIES on platforms that have a larger setting), the partitioning of the workload looks perfectly fine.
Are you testing with some code of your own, or with the ztrmv.goto benchmark ? (If the former, could you try building the benchmarks and running OPENBLAS_DIAG=N ./ztrmv.goto 42474 42474 1 to see if the problem persists in a simple program that does not do much besides calling ZTRMV ?)

dnoan (Contributor, Author) commented Oct 23, 2023:

The benchmark doesn't crash, which is very interesting. One difference is that I am calling the code from C++ with std::vector<std::complex<double>> as input. Maybe I can write a small demo program in the coming days.

martin-frbg (Collaborator):

Interesting. If it was a mismatch of complex number representation I would expect garbage output or crashing that does not depend on the number of threads. But have you verified that the output is correct for 32 threads or less, or just that it did not crash there ?

dnoan (Contributor, Author) commented Oct 29, 2023:

OK, so here is a quick-and-dirty demo. Build OpenBLAS master with:

diff a/Makefile.rule b/Makefile.rule
index 707924904..21d094918 100644
--- a/Makefile.rule
+++ b/Makefile.rule
-# TARGET = PENRYN
+TARGET = NEOVERSEN1
 
-# USE_THREAD = 0
+USE_THREAD = 1
 
-# USE_LOCKING = 1
+USE_LOCKING = 1
 
-# USE_OPENMP = 1
+USE_OPENMP = 1
 
-# NUM_THREADS = 24
+NUM_THREADS = 256

Then the example program:

// ztrmv.cpp
#include <iostream>
#include <random>
#include <vector>
#include <complex>
#include <cstdlib>  // for std::abs(int) used in the X sizing below

constexpr int DOUBLE_MIN = -10000;
constexpr int DOUBLE_MAX = 10000;

extern "C" {
  void ztrmv_(char *Uplo, char *TransA, char *Diag, int *N, std::complex<double> *A,
              int *lda, std::complex<double> *X, int *incX);
};

int main(int argc, char *argv[]) {
  char Uplo = 'U';
  char TransA = 'N';
  char Diag = 'N';
  int N = 42474;
  int lda = N;
  int incX = 1;

  std::random_device rd;
  std::default_random_engine eng(rd());
  std::uniform_real_distribution<> distr(DOUBLE_MIN, DOUBLE_MAX);

  std::cout << "Allocating memory..." << std::endl;

  std::vector<std::complex<double>> A, X;
  A.resize(lda * N);
  X.resize(1 + (N - 1) * std::abs(incX));

  std::cout << "Generating input data..." << std::endl;

  for (auto &d : A)
    d = std::complex<double>(distr(eng), distr(eng));

  for (auto &d : X)
    d = std::complex<double>(distr(eng), distr(eng));

  std::cout << "Computing..." << std::endl;

  ztrmv_(&Uplo, &TransA, &Diag, &N, &*A.begin(), &lda, &*X.begin(), &incX);

  std::cout << "Done" << std::endl;

  return 0;
}

Build with

g++ -fopenmp -O2 ztrmv.cpp libopenblas_neoversen1p-r0.3.24.dev.a -o ztrmv

Run with up to (and including) 57 threads - no problem. Run with 58 or more threads - segfault.

martin-frbg (Collaborator):

No crash seen so far on Power9 (128 threads) so probably not a general problem in the code parts common to all architectures. (Currently waiting to see if valgrind brings up anything interesting there, while the arm64 machine is still struggling to fill the input array)

dnoan (Contributor, Author) commented Oct 30, 2023:

I ran the example on a Cooper Lake system with 32c/64t and OMP_NUM_THREADS=64 and it didn't crash. The problem could be ARM64-specific.

martin-frbg (Collaborator):

It could be, but all the workload splitting and thread setup happens in common code and I see no indication of anything being wrong at lower thread counts on ARM64. (ZSCAL is actually the first BLAS function to be called in OpenBLAS' implementation of ZTRMV, and you/we already replaced that arm64 kernel with a trivial C implementation above). Maybe the size of the on-stack buffer as calculated in interface/ztrmv.c is invalid, but there is nothing arm64-specific to that. I'll see if I can get a test run on AWS - CirrusCI does not let me use that many cores and the GCC Compile Farm lacks a big armv8 server.

martin-frbg (Collaborator):

Hmm. One other thing that is architecture-specific is the size of the memory buffer used to transfer partial results from parallel threads in Goto's blocked matrix algorithm. There is some dark magic in there, and the default base value for the computation, BUFFER_SIZE in common_arm64.h, is smaller on ARM64 than it is on either x86_64 or power (probably a legacy of smartphones and small appliances).

dnoan (Contributor, Author) commented Oct 31, 2023:

Yes, with #define BUFFER_SIZE (32 << 22) it doesn't crash.

martin-frbg (Collaborator):

Thanks for testing, I hadn't gotten around to mangling our AWS Cirun job yet. Will commit that PR then - and there's probably something "historically" wrong in the calculation of the buffer size in general.

dnoan (Contributor, Author) commented Oct 31, 2023:

Thanks!
