
dtbmv in libopenblas_haswellp-r0.2.18.dylib on macOS Sierra (10.12.3) fails? #1089

Closed
hiro4bbh opened this issue Feb 9, 2017 · 18 comments

Comments

@hiro4bbh

hiro4bbh commented Feb 9, 2017

dtbmv in libopenblas_haswellp-r0.2.18.dylib on macOS Sierra (10.12.3) may fail for large problem sizes, as in the following code.

// clang -L/usr/local/opt/openblas/lib -I/usr/local/opt/openblas/include -lopenblas openblas_dtbmv.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#include <cblas.h>

int main(int argc, char **argv)
{
  // if n <= 2097136 then dtbmv almost always works...
  int n = 2097136 + 1;
  double *p = calloc(n, sizeof(double));
  p[0] = 4.0;
  p[1] = 2.0;
  p[2] = 1.0;
  printf("p@%p = (%g, %g, %g, %g, ...)\n", p, p[0], p[1], p[2], p[3]);
  cblas_dtbmv(CblasColMajor, CblasUpper, CblasNoTrans, CblasNonUnit, n, 0, p, 1, p, 1);
  printf("p@%p = (%g, %g, %g, %g, ...)\n", p, p[0], p[1], p[2], p[3]);
  free(p);
  return 0;
}

I only have a MacBook running macOS Sierra, so I have no memory debugger like valgrind.
Instead, here is the console output from running the program under lldb.

$ lldb ./a.out
(lldb) target create "./a.out"
Current executable set to './a.out' (x86_64).
(lldb) r
Process 48585 launched: './a.out' (x86_64)
p@0x104800000 = (4, 2, 1, 0, ...)
Process 48585 stopped
* thread #2: tid = 0x125f57, 0x00007fffe2027c13 libsystem_platform.dylib`_platform_bzero$VARIANT$Haswell + 115, stop reason = EXC_BAD_ACCESS (code=1, address=0x107800000)
    frame #0: 0x00007fffe2027c13 libsystem_platform.dylib`_platform_bzero$VARIANT$Haswell + 115
libsystem_platform.dylib`_platform_bzero$VARIANT$Haswell:
->  0x7fffe2027c13 <+115>: movq   %rsi, (%rdi,%rdx)
    0x7fffe2027c17 <+119>: subq   $0x8, %rdx
    0x7fffe2027c1b <+123>: jae    0x7fffe2027c13            ; <+115>
    0x7fffe2027c1d <+125>: addq   $0x8, %rdx
(lldb) thread backtrace all
  thread #1: tid = 0x125f42, 0x000000010047e000 libopenblas_haswellp-r0.2.18.dylib`dscal_kernel_8_zero + 65412, queue = 'com.apple.main-thread'
    frame #0: 0x000000010047e000 libopenblas_haswellp-r0.2.18.dylib`dscal_kernel_8_zero + 65412
    frame #1: 0x000000010045e056 libopenblas_haswellp-r0.2.18.dylib`dscal_k + 86
    frame #2: 0x00000001000f2782 libopenblas_haswellp-r0.2.18.dylib`trmv_kernel + 210
    frame #3: 0x000000010036375d libopenblas_haswellp-r0.2.18.dylib`exec_blas + 110
    frame #4: 0x00000001000f2613 libopenblas_haswellp-r0.2.18.dylib`dtbmv_thread_NUN + 828
    frame #5: 0x00000001000ac43b libopenblas_haswellp-r0.2.18.dylib`cblas_dtbmv + 457
    frame #6: 0x0000000100000edd a.out`main + 237
    frame #7: 0x00007fffe1e16255 libdyld.dylib`start + 1

* thread #2: tid = 0x125f57, 0x00007fffe2027c13 libsystem_platform.dylib`_platform_bzero$VARIANT$Haswell + 115, stop reason = EXC_BAD_ACCESS (code=1, address=0x107800000)
  * frame #0: 0x00007fffe2027c13 libsystem_platform.dylib`_platform_bzero$VARIANT$Haswell + 115
    frame #1: 0x000000010045e19a libopenblas_haswellp-r0.2.18.dylib`dscal_k + 410
    frame #2: 0x00000001000f2782 libopenblas_haswellp-r0.2.18.dylib`trmv_kernel + 210
    frame #3: 0x000000010036352d libopenblas_haswellp-r0.2.18.dylib`blas_thread_server + 326
    frame #4: 0x00007fffe202daab libsystem_pthread.dylib`_pthread_body + 180
    frame #5: 0x00007fffe202d9f7 libsystem_pthread.dylib`_pthread_start + 286
    frame #6: 0x00007fffe202d1fd libsystem_pthread.dylib`thread_start + 13
(lldb)

The library was installed with brew install openblas --build-from-source.

@martin-frbg
Collaborator

valgrind on Linux with a Kaby Lake CPU (using the Haswell kernel) says

Invalid write of size 8
==1856== at 0x521C540: dscal_kernel_8_zero (dscal_microk_haswell-2.c:150)
==1856== by 0x521C847: dscal_k (dscal.c:215)
==1856== by 0x4F3E49B: trmv_kernel (tbmv_thread.c:115)
==1856== by 0x51AE341: blas_thread_server (blas_server.c:433)
==1856== by 0x67A4733: start_thread (in /lib64/libpthread-2.22.so)
==1856== Address 0x11d35100 is not stack'd, malloc'd or (recently) free'd
==1856==
==1856== Process terminating with default action of signal 11 (SIGSEGV)
==1856== Access not within mapped region at address 0x11D35100
==1856== at 0x521C540: dscal_kernel_8_zero (dscal_microk_haswell-2.c:150)

but then suggests rerunning with a stack size bigger than the 8388608 bytes it had used.
valgrind --main-stacksize=16800000 ./dtbmv runs to completion with the output

p@0x8d35040 = (4, 2, 1, 0, ...)
p@0x8d35040 = (16, 4, 1, 0, ...)

@hiro4bbh
Author

hiro4bbh commented Feb 9, 2017

Thank you for your rapid reply.
I tested with valgrind on OS X El Capitan (10.11.5) on a Mac mini (Late 2014) with
--main-stacksize=16800000, but I still got the same invalid access - not always,
but sometimes.

How does stack size affect this problem?

@martin-frbg
Collaborator

martin-frbg commented Feb 9, 2017

I am not sure yet; I just wanted to put that observation here in case it gives someone an idea. Maybe it just serves to move "other things" out of the way so that whatever collateral damage is done by dscal_kernel_8_zero() goes unnoticed.

I see now that the same function is/was also implicated in #730 (though there it was only perceived as a performance bottleneck). That issue thread is a somewhat frustrating read, but what you could do is
try the patch from the Jan 19 posting by jakirkham that simply replaces the inline assembly with
a plain C loop to zero the array. (Or, if you want to humor me, first drop only the single ".align 16" line from the assembly and see if that changes anything at all - which it probably won't.)
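
For reference, that patch essentially reduces the kernel to a plain loop, roughly like this (a minimal sketch only - the function name and signature are assumed from the dscal microkernel sources; see #730 for the actual patch):

/* Sketch of the #730 idea: zero the vector with a plain C loop instead of
 * inline assembly. Signature assumed to match the Haswell dscal microkernel
 * (BLASLONG/FLOAT are OpenBLAS typedefs); see the linked posting for the
 * real code. */
static void dscal_kernel_8_zero(BLASLONG n, FLOAT *alpha, FLOAT *x)
{
  BLASLONG i;
  for (i = 0; i < n; i++)
    x[i] = 0.0;
}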

@hiro4bbh
Author

hiro4bbh commented Feb 9, 2017

OK, thank you for your reply.

According to tbmv_thread.c, using dtbmv for a Hadamard product is
inefficient (it seems to call xaxpy for each element?), so I will use my own
Hadamard product routine in C.
However, this looks like a mysterious and possibly deep bug that happens only sometimes,
or is influenced by heap usage.
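
For what it's worth, the replacement routine is just an element-wise loop, roughly like this (a sketch; hadamard_product is my own name, not an OpenBLAS routine):

/* Plain element-wise (Hadamard) product in C, the kind of helper meant above. */
static void hadamard_product(int n, const double *a, const double *b, double *out)
{
  int i;
  for (i = 0; i < n; i++)
    out[i] = a[i] * b[i];
}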

@martin-frbg
Collaborator

martin-frbg commented Feb 10, 2017

Just for the record, replacing dscal_kernel_8_zero with its C equivalent from the proposed patch in #730 does not fix this - it still crashes on what appears to be the first write to x in that function. It also fails on "Nehalem"-class hardware, but works when run with at most two threads (on a dual-core machine at least).

@martin-frbg
Collaborator

On the Nehalem system at least, the problem is reproducible with snapshots corresponding to versions as far back as 0.2.9 (making it likely that this problem is inherited from libgoto2).
If there is a hard limit on "n" at all, rather than just a gradual increase in the likelihood of failure with larger n, it appears to be slightly below 1048576 (i.e. 1024*1024). helgrind does not detect any threading errors in either case.

@brada4
Contributor

brada4 commented Feb 28, 2017

Why is a read-only argument being modified? What result should be returned when the purported input gets sprayed with partial results as the computation goes ahead?

@martin-frbg
Collaborator

@brada4 mind being less aggressive but more verbose? Where do you see input getting overwritten - do you think the test case is wrong, or did you find an error in the implementation of dtbmv?

@brada4
Contributor

brada4 commented Feb 28, 2017

Just that A gets rewritten with the result X, so the input is nibbled away as BLAS processes it.
At some degree of out-of-order processing, that creates an unavoidable data race.
I.e. the example code is bad, but it happens to work with the in-order netlib implementation.

@martin-frbg
Collaborator

I see nothing wrong with the example, and in fact introducing a separate array q for the X parameter of dtbmv does not change anything. (In any case, what you are suggesting would not lead to the segmentation fault that this issue is about.)
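
(Concretely, the variant tried looks roughly like the following modification of the original test case - a sketch only; q is just a name introduced here for illustration:)

  // Use a separate vector q as the X argument, so p is only read as the band matrix A.
  double *q = calloc(n, sizeof(double));
  memcpy(q, p, 3 * sizeof(double)); // copy the nonzero leading entries
  cblas_dtbmv(CblasColMajor, CblasUpper, CblasNoTrans, CblasNonUnit, n, 0, p, 1, q, 1);
  printf("q@%p = (%g, %g, %g, %g, ...)\n", q, q[0], q[1], q[2], q[3]);
  free(q);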

@brada4
Contributor

brada4 commented Mar 5, 2017

weird heap corruption every 5-6 runs:

dscal_k_PILEDRIVER () at ../kernel/x86_64/scal_sse2.S:105
105     ../kernel/x86_64/scal_sse2.S: No such file or directory.
(gdb) bt
#0  dscal_k_PILEDRIVER () at ../kernel/x86_64/scal_sse2.S:105
#1  0x00007ffff5d05458 in trmv_kernel (args=0x7fffffffad10, range_m=<optimized out>, range_n=0x7fffffffada8, dummy1=<optimized out>, buffer=<optimized out>, pos=<optimized out>) at tbmv_thread.c:115
#2  0x00007ffff5ec9eec in blas_thread_server (arg=<optimized out>) at blas_server.c:418
#3  0x00007ffff52d70a4 in start_thread (arg=0x7fffefb53700) at pthread_create.c:309
#4  0x00007ffff58d202d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

and at other occurrences:

p@0x7fffee353010 = (4, 2, 1, 0, ...)
p@0x7fffee353010 = (0, 0, 0, 0, ...)
*** Error in `/tmp/a.out': free(): invalid pointer: 0x00007fffee353010 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x7277f)[0x7ffff585f77f]
/lib64/libc.so.6(+0x78026)[0x7ffff5865026]
/lib64/libc.so.6(+0x78d53)[0x7ffff5865d53]
/tmp/a.out[0x40092a]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7ffff580eb25]
/tmp/a.out[0x4006e9]
  1. The Piledriver kernel set is not supposed to include scal_sse2?
  2. No problem with a single thread.

@brada4
Contributor

brada4 commented Jul 12, 2017

@hiro4bbh can you retest? It does not fail anymore for me. (You just need to add the +1 in one file and re-run the last 'make'; just make sure you build and run OpenBLAS on the same Mac.)

@martin-frbg
Collaborator

@brada4 unfortunately, at least the behaviour under valgrind is unchanged with your patch.

@brada4
Contributor

brada4 commented Jul 26, 2017

I ran it a hundred times; it should have crashed at least a few times.
Maybe it is just a partial fix for a thread slot buffer overflow, and scal still writes one element past the end [+1].

@martin-frbg
Collaborator

Seems it is the funky bitmask operation logic in the blas_quickdivide branch at the end of tbmv_thread.c that is broken - given the chance, it will happily produce a range_n argument for the last batch that exceeds the actual n of the range it was tasked to divide among the threads. So for the original poster's testcase I get matrix elements 1048569 to 2097168 when the last is actually 2097137, and even the built-in test cases have a maximum range_n of 80 where the data only goes to 63. A simple
if (range_n[num_cpu] > n) range_n[num_cpu] = n; appears to fix it (but may not be the best solution).
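
(Schematically, the splitting with such a clamp would look something like this - a rough sketch only; the loop structure and variable names only approximate the real tbmv_thread.c code:)

/* Rough sketch of the clamp described above; width, num_cpu and range_n
 * only approximate the real splitting code in tbmv_thread.c. */
range_n[0] = 0;
for (num_cpu = 0; num_cpu < nthreads; num_cpu++) {
  range_n[num_cpu + 1] = range_n[num_cpu] + width;          // width from blas_quickdivide, rounded up
  if (range_n[num_cpu + 1] > n) range_n[num_cpu + 1] = n;   // never run past the actual data
}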

@brada4
Contributor

brada4 commented Jul 26, 2017

Mine also did not try to be correct, just to pad the structure so it would not overflow...
The minimum memory protection domain (guard pages etc.) is one page. Typically a malloc of half a page or more gets aligned and comes with free slack space up to the next page boundary. I think it is possible to write a test that mprotects the ends of the inputs and outputs at the word+page boundary to catch such cases.
The funky bitops part may slip through unnoticed anyway... maybe it stays aligned when the rest is aligned.
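
(A rough sketch of that guard-page idea - illustrative only, not part of any existing test; alloc_with_guard is a name made up here:)

#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Allocate n doubles so that the array ends exactly at a page boundary and
 * the following page is inaccessible; any write past the end then faults
 * at the offending instruction instead of silently corrupting the heap. */
double *alloc_with_guard(size_t n)
{
  size_t page  = (size_t)sysconf(_SC_PAGESIZE);
  size_t bytes = n * sizeof(double);
  size_t total = ((bytes + page - 1) / page + 1) * page;   /* data pages + one guard page */
  char *base = mmap(NULL, total, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);   /* MAP_ANON on some systems */
  if (base == MAP_FAILED) return NULL;
  mprotect(base + total - page, page, PROT_NONE);          /* guard page at the end */
  return (double *)(base + total - page - bytes);          /* last element touches the guard */
}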

@martin-frbg
Collaborator

martin-frbg commented Aug 1, 2017

I had to revert my patch for now as it introduces accuracy errors - I still believe the fundamental idea behind it is essentially correct, but I may have introduced one range limit too many, or put one in the wrong place.
(Most likely it is something as simple as a range limit that should be on m rather than n in the TRANS case, but I will make extra sure before I try to commit something again.)

@martin-frbg
Collaborator

Just for the record, the corrected patch for this went in as #1262 in August, so this is believed to have been fixed on the develop branch for a long time.
