
dtbmv in libopenblas_haswellp-r0.2.18.dylib on macOS Sierra (10.12.3) fails? #1089

Closed
hiro4bbh opened this issue Feb 9, 2017 · 18 comments

Comments

@hiro4bbh

hiro4bbh commented Feb 9, 2017

dtbmv in libopenblas_haswellp-r0.2.18.dylib on macOS Sierra (10.12.3) may fail for large problem sizes, as in the following code.

// clang -L/usr/local/opt/openblas/lib -I/usr/local/opt/openblas/include -lopenblas openblas_dtbmv.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#include <cblas.h>

int main(int argc, char **argv)
{
  // if n <= 2097136 then dtbmv almost always works...
  int n = 2097136 + 1;
  double *p = calloc(n, sizeof(double));
  p[0] = 4.0;
  p[1] = 2.0;
  p[2] = 1.0;
  printf("p@%p = (%g, %g, %g, %g, ...)\n", p, p[0], p[1], p[2], p[3]);
  cblas_dtbmv(CblasColMajor, CblasUpper, CblasNoTrans, CblasNonUnit, n, 0, p, 1, p, 1);
  printf("p@%p = (%g, %g, %g, %g, ...)\n", p, p[0], p[1], p[2], p[3]);
  free(p);
  return 0;
}

I only have a MacBook running macOS Sierra, so I have no memory debugger like valgrind.
Instead, here is the console output from running the program under lldb.

$ lldb ./a.out
(lldb) target create "./a.out"
Current executable set to './a.out' (x86_64).
(lldb) r
Process 48585 launched: './a.out' (x86_64)
p@0x104800000 = (4, 2, 1, 0, ...)
Process 48585 stopped
* thread #2: tid = 0x125f57, 0x00007fffe2027c13 libsystem_platform.dylib`_platform_bzero$VARIANT$Haswell + 115, stop reason = EXC_BAD_ACCESS (code=1, address=0x107800000)
    frame #0: 0x00007fffe2027c13 libsystem_platform.dylib`_platform_bzero$VARIANT$Haswell + 115
libsystem_platform.dylib`_platform_bzero$VARIANT$Haswell:
->  0x7fffe2027c13 <+115>: movq   %rsi, (%rdi,%rdx)
    0x7fffe2027c17 <+119>: subq   $0x8, %rdx
    0x7fffe2027c1b <+123>: jae    0x7fffe2027c13            ; <+115>
    0x7fffe2027c1d <+125>: addq   $0x8, %rdx
(lldb) thread backtrace all
  thread #1: tid = 0x125f42, 0x000000010047e000 libopenblas_haswellp-r0.2.18.dylib`dscal_kernel_8_zero + 65412, queue = 'com.apple.main-thread'
    frame #0: 0x000000010047e000 libopenblas_haswellp-r0.2.18.dylib`dscal_kernel_8_zero + 65412
    frame #1: 0x000000010045e056 libopenblas_haswellp-r0.2.18.dylib`dscal_k + 86
    frame #2: 0x00000001000f2782 libopenblas_haswellp-r0.2.18.dylib`trmv_kernel + 210
    frame #3: 0x000000010036375d libopenblas_haswellp-r0.2.18.dylib`exec_blas + 110
    frame #4: 0x00000001000f2613 libopenblas_haswellp-r0.2.18.dylib`dtbmv_thread_NUN + 828
    frame #5: 0x00000001000ac43b libopenblas_haswellp-r0.2.18.dylib`cblas_dtbmv + 457
    frame #6: 0x0000000100000edd a.out`main + 237
    frame #7: 0x00007fffe1e16255 libdyld.dylib`start + 1

* thread #2: tid = 0x125f57, 0x00007fffe2027c13 libsystem_platform.dylib`_platform_bzero$VARIANT$Haswell + 115, stop reason = EXC_BAD_ACCESS (code=1, address=0x107800000)
  * frame #0: 0x00007fffe2027c13 libsystem_platform.dylib`_platform_bzero$VARIANT$Haswell + 115
    frame #1: 0x000000010045e19a libopenblas_haswellp-r0.2.18.dylib`dscal_k + 410
    frame #2: 0x00000001000f2782 libopenblas_haswellp-r0.2.18.dylib`trmv_kernel + 210
    frame #3: 0x000000010036352d libopenblas_haswellp-r0.2.18.dylib`blas_thread_server + 326
    frame #4: 0x00007fffe202daab libsystem_pthread.dylib`_pthread_body + 180
    frame #5: 0x00007fffe202d9f7 libsystem_pthread.dylib`_pthread_start + 286
    frame #6: 0x00007fffe202d1fd libsystem_pthread.dylib`thread_start + 13
(lldb)

The library was installed with brew install openblas --build-from-source.

@martin-frbg
Collaborator

valgrind on Linux with a Kaby Lake CPU (using the Haswell kernel) says

Invalid write of size 8
==1856== at 0x521C540: dscal_kernel_8_zero (dscal_microk_haswell-2.c:150)
==1856== by 0x521C847: dscal_k (dscal.c:215)
==1856== by 0x4F3E49B: trmv_kernel (tbmv_thread.c:115)
==1856== by 0x51AE341: blas_thread_server (blas_server.c:433)
==1856== by 0x67A4733: start_thread (in /lib64/libpthread-2.22.so)
==1856== Address 0x11d35100 is not stack'd, malloc'd or (recently) free'd
==1856==
==1856== Process terminating with default action of signal 11 (SIGSEGV)
==1856== Access not within mapped region at address 0x11D35100
==1856== at 0x521C540: dscal_kernel_8_zero (dscal_microk_haswell-2.c:150)

but then suggests rerunning with a stack size bigger than the 8388608 bytes it had used.
valgrind --main-stacksize=16800000 ./dtbmv runs to completion with the output

p@0x8d35040 = (4, 2, 1, 0, ...)
p@0x8d35040 = (16, 4, 1, 0, ...)

@hiro4bbh
Author

hiro4bbh commented Feb 9, 2017

Thank you for your rapid reply.
I tested with valgrind on OS X El Capitan (10.11.5) on a Mac mini (Late 2014) with
--main-stacksize=16800000, but I still got the same invalid access - not always,
but sometimes.

How does stack size affect this problem?

@martin-frbg
Collaborator

martin-frbg commented Feb 9, 2017

I am not sure yet; I just wanted to put that observation here in case it gives someone an idea. Maybe it just serves to move "other things" out of the way so that whatever collateral damage is done by dscal_kernel_8_zero() goes unnoticed.

I see now that the same function is/was also implicated in #730 (though there it was only perceived as a performance bottleneck). That issue thread is a somewhat frustrating read, but what you could do is
try the patch from the Jan 19 posting by jakirkham that simply replaces the inline assembly with
a plain C loop to zero the array. (Or, if you want to humor me, first drop only the single ".align 16" line from the assembly and see if that changes anything at all - which it probably won't.)
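
For reference, that patch essentially reduces the kernel to a plain loop, roughly like this (a minimal sketch only - the function name and signature are assumed from the dscal microkernel sources; see #730 for the actual patch):

/* Sketch of the #730 idea: zero the vector with a plain C loop instead of
 * inline assembly. Signature assumed to match the Haswell dscal microkernel
 * (BLASLONG/FLOAT are OpenBLAS typedefs); see the linked posting for the
 * real code. */
static void dscal_kernel_8_zero(BLASLONG n, FLOAT *alpha, FLOAT *x)
{
  BLASLONG i;
  for (i = 0; i < n; i++)
    x[i] = 0.0;
}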

@hiro4bbh
Author

hiro4bbh commented Feb 9, 2017

OK, thank you for your reply.

According to tbmv_thread.c, using dtbmv for a Hadamard product is
inefficient (it seems to call xaxpy for each element?), so I will use my own
Hadamard product routine in C.
However, this looks like a mysterious and possibly deep bug that happens only sometimes,
or is influenced by heap usage.
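
For what it's worth, the replacement routine is just an element-wise loop, roughly like this (a sketch; hadamard_product is my own name, not an OpenBLAS routine):

/* Plain element-wise (Hadamard) product in C, the kind of helper meant above. */
static void hadamard_product(int n, const double *a, const double *b, double *out)
{
  int i;
  for (i = 0; i < n; i++)
    out[i] = a[i] * b[i];
}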

@martin-frbg
Collaborator

martin-frbg commented Feb 10, 2017

Just for the record, replacing dscal_kernel_8_zero with its C equivalent from the proposed patch in #730 does not fix this - it still crashes on what appears to be the first write to x in that function. It also fails on "Nehalem"-class hardware, but works when run with at most two threads (on a dual-core machine at least).

@martin-frbg
Collaborator

On the Nehalem system at least, the problem is reproducible with snapshots corresponding to versions as far back as 0.2.9 (making it likely that this problem is inherited from libgoto2).
If there is a hard limit on "n" at all, rather than just a gradual increase in the likelihood of failure with larger n, it appears to be slightly below 1048576 (i.e. 1024*1024). helgrind does not detect any threading errors in either case.

@brada4
Contributor

brada4 commented Feb 28, 2017

Why is a read-only argument being modified? What result should be returned when the purported input gets sprayed with partial results as the computation goes ahead?

@martin-frbg
Collaborator

@brada4 mind being less aggressive but more verbose? Where do you see input getting overwritten - do you think the test case is wrong, or did you find an error in the implementation of dtbmv?

@brada4
Contributor

brada4 commented Feb 28, 2017

Just that A gets rewritten with the result X, so the input is nibbled away as BLAS processes it.
At some degree of out-of-order processing, that creates an unavoidable data race.
I.e. the example code is bad, but it happens to work with the in-order netlib implementation.

@martin-frbg
Collaborator

I see nothing wrong with the example, and in fact introducing a separate array q for the X parameter of dtbmv does not change anything. (In any case, what you are suggesting would not lead to the segmentation fault that this issue is about.)
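
(Concretely, the variant tried looks roughly like the following modification of the original test case - a sketch only; q is just a name introduced here for illustration:)

  // Use a separate vector q as the X argument, so p is only read as the band matrix A.
  double *q = calloc(n, sizeof(double));
  memcpy(q, p, 3 * sizeof(double)); // copy the nonzero leading entries
  cblas_dtbmv(CblasColMajor, CblasUpper, CblasNoTrans, CblasNonUnit, n, 0, p, 1, q, 1);
  printf("q@%p = (%g, %g, %g, %g, ...)\n", q, q[0], q[1], q[2], q[3]);
  free(q);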

@brada4
Contributor

brada4 commented Mar 5, 2017

weird heap corruption every 5-6 runs:

dscal_k_PILEDRIVER () at ../kernel/x86_64/scal_sse2.S:105
105     ../kernel/x86_64/scal_sse2.S: No such file or directory.
(gdb) bt
#0  dscal_k_PILEDRIVER () at ../kernel/x86_64/scal_sse2.S:105
#1  0x00007ffff5d05458 in trmv_kernel (args=0x7fffffffad10, range_m=<optimized out>, range_n=0x7fffffffada8, dummy1=<optimized out>, buffer=<optimized out>, pos=<optimized out>) at tbmv_thread.c:115
#2  0x00007ffff5ec9eec in blas_thread_server (arg=<optimized out>) at blas_server.c:418
#3  0x00007ffff52d70a4 in start_thread (arg=0x7fffefb53700) at pthread_create.c:309
#4  0x00007ffff58d202d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

and at other occurrences:

p@0x7fffee353010 = (4, 2, 1, 0, ...)
p@0x7fffee353010 = (0, 0, 0, 0, ...)
*** Error in `/tmp/a.out': free(): invalid pointer: 0x00007fffee353010 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x7277f)[0x7ffff585f77f]
/lib64/libc.so.6(+0x78026)[0x7ffff5865026]
/lib64/libc.so.6(+0x78d53)[0x7ffff5865d53]
/tmp/a.out[0x40092a]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7ffff580eb25]
/tmp/a.out[0x4006e9]
  1. The Piledriver kernel set is not supposed to include scal_sse2?
  2. No problem with a single thread.

@brada4
Contributor

brada4 commented Jul 12, 2017

@hiro4bbh can you retest? It does not fail anymore for me. (You just need to add the +1 in one file and re-run the last 'make'; just make sure you build and run OpenBLAS on the same Mac.)

@martin-frbg
Collaborator

@brada4 unfortunately, at least the behaviour under valgrind is unchanged with your patch.

@brada4
Contributor

brada4 commented Jul 26, 2017

I ran it a hundred times; it should have crashed at least a few times.
Maybe it is just a partial fix for a thread slot buffer overflow, and scal still writes one element past the end [+1].

@martin-frbg
Collaborator

Seems it is the funky bitmask operation logic in the blas_quickdivide branch at the end of tbmv_thread.c that is broken - given the chance, it will happily produce a range_n argument for the last batch that exceeds the actual n of the range it was tasked to divide among the threads. So for the original poster's testcase I get matrix elements 1048569 to 2097168 when the last is actually 2097137, and even the built-in test cases have a maximum range_n of 80 where the data only goes to 63. A simple
if (range_n[num_cpu] > n) range_n[num_cpu] = n; appears to fix it (but may not be the best solution).
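
(Schematically, the splitting with such a clamp would look something like this - a rough sketch only; the loop structure and variable names only approximate the real tbmv_thread.c code:)

/* Rough sketch of the clamp described above; width, num_cpu and range_n
 * only approximate the real splitting code in tbmv_thread.c. */
range_n[0] = 0;
for (num_cpu = 0; num_cpu < nthreads; num_cpu++) {
  range_n[num_cpu + 1] = range_n[num_cpu] + width;          // width from blas_quickdivide, rounded up
  if (range_n[num_cpu + 1] > n) range_n[num_cpu + 1] = n;   // never run past the actual data
}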

@brada4
Contributor

brada4 commented Jul 26, 2017

Mine also did not try to be correct, just to pad the structure so it would not overflow...
The minimum memory protection domain (guard pages etc.) is one page. Typically a malloc of half a page or more gets aligned and comes with free slack space up to the next page boundary. I think it is possible to write a test that mprotects the ends of the inputs and outputs at the word+page boundary to catch such cases.
The funky bitops part may slip through unnoticed anyway... maybe it stays aligned when the rest is aligned.
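
(A rough sketch of that guard-page idea - illustrative only, not part of any existing test; alloc_with_guard is a name made up here:)

#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Allocate n doubles so that the array ends exactly at a page boundary and
 * the following page is inaccessible; any write past the end then faults
 * at the offending instruction instead of silently corrupting the heap. */
double *alloc_with_guard(size_t n)
{
  size_t page  = (size_t)sysconf(_SC_PAGESIZE);
  size_t bytes = n * sizeof(double);
  size_t total = ((bytes + page - 1) / page + 1) * page;   /* data pages + one guard page */
  char *base = mmap(NULL, total, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);   /* MAP_ANON on some systems */
  if (base == MAP_FAILED) return NULL;
  mprotect(base + total - page, page, PROT_NONE);          /* guard page at the end */
  return (double *)(base + total - page - bytes);          /* last element touches the guard */
}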

@martin-frbg
Collaborator

martin-frbg commented Aug 1, 2017

I had to revert my patch for now as it introduces accuracy errors - I still believe the fundamental idea behind it is essentially correct, but I may have introduced one range limit too many, or put one in the wrong place.
(Most likely it is something as simple as a range limit that should be on m rather than n in the TRANS case, but I will make extra sure before I try to commit something again.)

@martin-frbg
Collaborator

Just for the record, the corrected patch for this went in as #1262 in August, so this is believed to have been fixed on the develop branch for a long time.
