
Crash in ZTRMV for large thread counts #4269

Closed
dnoan opened this issue Oct 18, 2023 · 21 comments · Fixed by #4279

dnoan (Contributor) commented Oct 18, 2023:

I am observing a crash in ZTRMV at large thread counts. I have seen it with 64 and 128 threads on ARM64; the crash occurs on Graviton 3 and Ampere Altra CPUs, at least.

This is how OpenBLAS is compiled:

TARGET = ARMV8
DYNAMIC_ARCH = 1
CC = gcc-12
FC = gfortran-12
USE_THREAD = 1
USE_LOCKING = 1
USE_OPENMP = 1
NUM_THREADS = 256

It still crashes even if I disable DYNAMIC_ARCH.

The input for which ZTRMV crashes:

UPLO = 'U'
TRANS = 'N'
DIAG = 'N'
N = 42474
LDA = 42474
INCX = 1

I tested versions 0.3.23 and 0.3.24 - both have the problem. Interestingly, the NVIDIA implementation (which comes with NVIDIA HPC SDK) suffers from the same problem but not Arm Performance Libraries.

martin-frbg (Collaborator):

Crash meaning a segfault without additional information, or do you get some kind of error or warning message around the time of the crash? (And does it crash only when the number of threads is large?)

dnoan (Contributor, Author) commented Oct 18, 2023:

It just segfaults without additional information. It works for 32 threads (and less). It segfaults for 64 threads and more. I haven't tested between 32 and 64 threads but can do it if this would help.

martin-frbg (Collaborator):

It survived on ibm power9 with 128 threads, suggesting the problem is not in the interface/thread setup code shared between all architectures. Now looking for a sufficiently big arm64 host... One explanation could be running out of preallocated memory buffers (especially if you did not build OpenBLAS on that exact machine configuration, or set NUM_THREADS sufficiently high at build time) but that would normally print a warning.

dnoan (Contributor, Author) commented Oct 18, 2023:

The build machine is a much more modest computer. I typically compile OpenBLAS with -j32. NUM_THREADS is set to 256 as mentioned above. When I run on the build machine with 64 threads (even though it has only 32 cores), it doesn't crash.

martin-frbg (Collaborator):

Sorry for overlooking the NUM_THREADS setting. Could you run a DEBUG=1 build under gdb so that we get an idea where it blows up? If that is too bothersome, you could instead try a build of the current develop branch, to see if I might have "unintentionally" fixed this already with #4233, or a build with the ZGEMVNKERNEL and ZGEMVTKERNEL entries in kernel/arm64/KERNEL.ARMV8SVE set to ../arm/zgemv_n.c and ../arm/zgemv_t.c respectively, instead of the assembly ones. I'd have to look into setting up a "big" AWS instance to investigate this myself.
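The kernel substitution suggested here would look something like the following in kernel/arm64/KERNEL.ARMV8SVE (variable names as given above; the exact surrounding file contents differ between OpenBLAS versions):

```make
# Point the ZGEMV kernels at the generic C fallbacks instead of the
# arm64 assembly implementations (paths are relative to kernel/arm64/).
ZGEMVNKERNEL = ../arm/zgemv_n.c
ZGEMVTKERNEL = ../arm/zgemv_t.c
```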

dnoan (Contributor, Author) commented Oct 19, 2023:

I gave it a try with OpenBLAS-master and DEBUG=1:

Thread 50 "ztrmv" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xfff19f56b060 (LWP 648911)]
zscal_begin () at ../kernel/arm64/zscal.S:384
384     ../kernel/arm64/zscal.S: No such file or directory.

martin-frbg (Collaborator):

Thanks. That's an unexpected location for the crash, and one that points more towards the common code for splitting the workload between threads again. I wonder why I do not see this on the architectures where I have big systems to play with. (Again you could try replacing ZSCALKERNEL in kernel/arm64/KERNEL.ARMV8SVE with ../arm/zscal.c to get a more naive and more readable C implementation, but if that gets fed nonsense by the level2 trmv_thread.c it won't work either...)

dnoan (Contributor, Author) commented Oct 19, 2023:

I am currently running on Neoverse N1 (without SVE), so I changed from zscal.S to ../arm/zscal.c in KERNEL.NEOVERSEN1 and it crashed with the following:

Thread 50 "ztrmv" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xfff198f2b060 (LWP 675448)]
zscal_k_NEOVERSEN1 (n=20330, dummy0=0, dummy1=0, da_r=0, da_i=0, x=0xfff1915fdca0, inc_x=1, y=0x0, inc_y=0, dummy=0x0, dummy2=0) at ../kernel/arm64/../arm/zscal.c:58
58      ../kernel/arm64/../arm/zscal.c: No such file or directory.

martin-frbg (Collaborator):

Weird, but it at least seems to match the previous crash in the assembly kernel: the scaling function was told to scale by zero, so it is setting array elements to zero, but the address of the array element appears to be completely bogus.

martin-frbg (Collaborator):

But multithreaded TRMV had a somewhat gory history in the 2017..2019 timeframe (#1332) and my go-to platform for large thread counts has a larger DTB_ENTRIES count than just about everybody else - maybe the old problem has been pushed out to larger matrix dimensions but not actually fixed yet.

martin-frbg (Collaborator):

I still fail to reproduce this (even if I reduce DTB_ENTRIES on platforms that have a larger setting), the partitioning of the workload looks perfectly fine.
Are you testing with some code of your own, or with the ztrmv.goto benchmark ? (If the former, could you try building the benchmarks and running OPENBLAS_DIAG=N ./ztrmv.goto 42474 42474 1 to see if the problem persists in a simple program that does not do much besides calling ZTRMV ?)

dnoan (Contributor, Author) commented Oct 23, 2023:

The benchmark doesn't crash, which is very interesting. One difference is that I am calling the code from C++ with std::vector<std::complex<double>> as input. Maybe I can write a small demo program in the coming days.

martin-frbg (Collaborator):

Interesting. If it was a mismatch of complex number representation I would expect garbage output or crashing that does not depend on the number of threads. But have you verified that the output is correct for 32 threads or less, or just that it did not crash there ?

dnoan (Contributor, Author) commented Oct 29, 2023:

OK, so here is a quick-and-dirty demo. Build OpenBLAS master with:

diff a/Makefile.rule b/Makefile.rule
index 707924904..21d094918 100644
--- a/Makefile.rule
+++ b/Makefile.rule
-# TARGET = PENRYN
+TARGET = NEOVERSEN1
 
-# USE_THREAD = 0
+USE_THREAD = 1
 
-# USE_LOCKING = 1
+USE_LOCKING = 1
 
-# USE_OPENMP = 1
+USE_OPENMP = 1
 
-# NUM_THREADS = 24
+NUM_THREADS = 256

Then the example program:

// ztrmv.cpp
#include <iostream>
#include <random>
#include <vector>
#include <complex>
#include <cstdlib>  // for std::abs(int) used in the X sizing below

constexpr int DOUBLE_MIN = -10000;
constexpr int DOUBLE_MAX = 10000;

extern "C" {
  void ztrmv_(char *Uplo, char *TransA, char *Diag, int *N, std::complex<double> *A,
              int *lda, std::complex<double> *X, int *incX);
};

int main(int argc, char *argv[]) {
  char Uplo = 'U';
  char TransA = 'N';
  char Diag = 'N';
  int N = 42474;
  int lda = N;
  int incX = 1;

  std::random_device rd;
  std::default_random_engine eng(rd());
  std::uniform_real_distribution<> distr(DOUBLE_MIN, DOUBLE_MAX);

  std::cout << "Allocating memory..." << std::endl;

  std::vector<std::complex<double>> A, X;
  A.resize(lda * N);
  X.resize(1 + (N - 1) * std::abs(incX));

  std::cout << "Generating input data..." << std::endl;

  for (auto &d : A)
    d = std::complex<double>(distr(eng), distr(eng));

  for (auto &d : X)
    d = std::complex<double>(distr(eng), distr(eng));

  std::cout << "Computing..." << std::endl;

  ztrmv_(&Uplo, &TransA, &Diag, &N, &*A.begin(), &lda, &*X.begin(), &incX);

  std::cout << "Done" << std::endl;

  return 0;
}

Build with

g++ -fopenmp -O2 ztrmv.cpp libopenblas_neoversen1p-r0.3.24.dev.a -o ztrmv

Run with up to (and including) 57 threads - no problem. Run with 58 or more threads - segfault.

martin-frbg (Collaborator):

No crash seen so far on Power9 (128 threads) so probably not a general problem in the code parts common to all architectures. (Currently waiting to see if valgrind brings up anything interesting there, while the arm64 machine is still struggling to fill the input array)

dnoan (Contributor, Author) commented Oct 30, 2023:

I ran the example on a Cooper Lake system with 32c/64t and OMP_NUM_THREADS=64 and it didn't crash. The problem could be ARM64-specific.

martin-frbg (Collaborator):

It could be, but all the workload splitting and thread setup happens in common code and I see no indication of anything being wrong at lower thread counts on ARM64. (ZSCAL is actually the first BLAS function to be called in OpenBLAS' implementation of ZTRMV, and you/we already replaced that arm64 kernel with a trivial C implementation above). Maybe the size of the on-stack buffer as calculated in interface/ztrmv.c is invalid, but there is nothing arm64-specific to that. I'll see if I can get a test run on AWS - CirrusCI does not let me use that many cores and the GCC Compile Farm lacks a big armv8 server.

martin-frbg (Collaborator):

Hmm. One other thing that is architecture-specific is the size of the memory buffer used to transfer partial results from parallel threads in Goto's blocked matrix algorithm. There is some dark magic in there, and the default base value for the computation, BUFFER_SIZE in common_arm64.h, is smaller on ARM64 than it is on either x86_64 or power (probably a legacy of smartphones and small appliances).

dnoan (Contributor, Author) commented Oct 31, 2023:

Yes, with #define BUFFER_SIZE (32 << 22) it doesn't crash.

martin-frbg (Collaborator):

Thanks for testing, I hadn't gotten around to mangling our AWS Cirun job yet. Will commit that PR then - and there's probably something "historically" wrong in the calculation of the buffer size in general.

dnoan (Contributor, Author) commented Oct 31, 2023:

Thanks!
