
Race Condition in OpenBLAS on IBM OpenPower 8 #1191

Closed
grisuthedragon opened this issue May 30, 2017 · 20 comments

Comments

@grisuthedragon
Contributor

This issue is a follow-up to #1071; it extends the problem and gives a clearer example. I consider a code computing the TILE-QR decomposition using OpenMP 4 task-depend parallelization. The code, including a makefile to reproduce the errors presented here, is available at: https://github.com/grisuthedragon/openblas-tileqr. Details about the algorithm can be found here.
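For readers unfamiliar with the tile algorithm, its elimination scheme (factor the diagonal tile, then annihilate each sub-diagonal tile against it, updating the trailing tiles of the affected rows) can be sketched serially in Python/NumPy. This is only an illustration of the arithmetic, not the OpenMP task code from the linked repository; the function name `tiled_qr_R` and the tile size are this sketch's own inventions:

```python
import numpy as np

def tiled_qr_R(A, b):
    """Serial sketch of the tile elimination scheme: returns the R factor.
    Each step factors the diagonal tile (GEQRT-like), then annihilates the
    sub-diagonal tiles pairwise against it (TSQRT-like), updating the
    trailing tiles of the rows involved."""
    R = A.copy()
    nt = R.shape[0] // b                  # number of tile rows/columns
    for k in range(nt):
        kk = slice(k * b, (k + 1) * b)
        # factor the diagonal tile and apply Q^T across row k
        Q1, R[kk, kk] = np.linalg.qr(R[kk, kk])
        for j in range(k + 1, nt):
            jj = slice(j * b, (j + 1) * b)
            R[kk, jj] = Q1.T @ R[kk, jj]
        for i in range(k + 1, nt):
            ii = slice(i * b, (i + 1) * b)
            # QR of the stacked pair [R_kk; A_ik] eliminates tile (i, k)
            Q2, Rfull = np.linalg.qr(
                np.vstack([R[kk, kk], R[ii, kk]]), mode="complete")
            R[kk, kk] = Rfull[:b]
            R[ii, kk] = 0.0
            for j in range(k + 1, nt):
                jj = slice(j * b, (j + 1) * b)
                both = Q2.T @ np.vstack([R[kk, jj], R[ii, jj]])
                R[kk, jj], R[ii, jj] = both[:b], both[b:]
    return R

# R agrees with LAPACK's R up to row signs (QR of a full-rank matrix is
# unique up to a diagonal sign matrix), which is what the "max diff
# between R from tiled and LAPACK" check in the testcase measures.
A = np.random.default_rng(0).standard_normal((64, 64))
R_tiled = tiled_qr_R(A, 16)
R_ref = np.linalg.qr(A)[1]
print(float(np.abs(np.abs(R_tiled) - np.abs(R_ref)).max()))
```

In the real code each of these tile operations becomes an OpenMP task with `depend` clauses, which is why concurrent BLAS calls occur.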

System Details:

  • CentOS 7.3, little endian
  • IBM Power8 (SL822C) (20 Cores, 160 Threads)
  • GCC 5.4.1

For the first experiment I compiled OpenBLAS in single-thread mode using

make USE_OPENMP=0 USE_THREAD=0 NO_CBLAS=1 NO_LAPACKE=1 

then compiled and executed the example code via

gfortran -o tileqr_serial -fopenmp tileqr.f90 libopenblas_serial.a -lgomp 
OMP_NUM_THREADS=20 ./tileqr_serial 5120 5120 256 32

and I obtained:

LAPACK   -- Time:    8.576    Gflops/s:   20.867
Tiled    -- Time:    0.930    Gflops/s:  192.450
 
Max diff between R from tiled   and LAPACK is: 0.2E+09

which is a wrong result, because the code has to provide the same result as LAPACK.

Next, I compiled OpenBLAS with

make USE_OPENMP=1 USE_THREAD=1 NO_CBLAS=1 NO_LAPACKE=1 

then compiled and executed the example code via

gfortran -o tileqr_openmp -fopenmp tileqr.f90 libopenblas_openmp.a -lgomp 
OMP_NUM_THREADS=20 ./tileqr_openmp 5120 5120 256 32

and I obtained:

LAPACK   -- Time:    6.553    Gflops/s:   27.310
Tiled    -- Time:    1.120    Gflops/s:  159.760
 
Max diff between R from tiled   and LAPACK is: 0.1E+04

which is also wrong.

Finally I used the old threading implementation for level-3 operations by compiling OpenBLAS via:

make USE_OPENMP=1 USE_THREAD=1 USE_SIMPLE_THREADED_LEVEL3=1 NO_CBLAS=1 NO_LAPACKE=1

and for the example

gfortran -o tileqr_openmp_simple -fopenmp tileqr.f90 libopenblas_openmp_simple.a -lgomp 
OMP_NUM_THREADS=20 ./tileqr_openmp_simple 5120 5120 256 32

I get

LAPACK   -- Time:    7.574    Gflops/s:   23.627
Tiled    -- Time:    1.138    Gflops/s:  157.312
 
Max diff between R from tiled   and LAPACK is: 0.5E+07

If I compile the example against plain Netlib BLAS and Netlib LAPACK, I obtain the correct results.

On Ubuntu 16.04 (x86_64) with a 4-core Intel(R) Core(TM) i7-6700 CPU I got the following correct results:

./tileqr_serial 5120 5120 256 32
LAPACK   -- Time:    7.577    Gflops/s:   23.620
Tiled    -- Time:    3.499    Gflops/s:   51.145
 
Max diff between R from tiled   and LAPACK is: 0.2E-16

for the serial OpenBLAS variant, and for the OpenMP multithreaded version:

./tileqr_openmp 5120 5120 256 32
LAPACK   -- Time:    6.133    Gflops/s:   29.177
Tiled    -- Time:    4.047    Gflops/s:   44.220
 
Max diff between R from tiled   and LAPACK is: 0.2E-16

and even with the old threading scheme I get

./tileqr_openmp_simple 5120 5120 256 32
LAPACK   -- Time:    8.444    Gflops/s:   21.194
Tiled    -- Time:    3.752    Gflops/s:   47.701
 
Max diff between R from tiled   and LAPACK is: 0.2E-16
@martin-frbg
Collaborator

So all your OpenBLAS calculations on Power8 give the wrong result, no matter which build options you use. From #1071 I believe you already tried removing some/all of the now-suspect assembly implementations from the KERNEL file - did this lead anywhere?

@grisuthedragon
Contributor Author

@martin-frbg Yes, with this example, in contrast to the initial one from #1071, all calculations go wrong. I added the generic KERNEL.POWER8 file to the git project and even there I obtain the error; in addition to the tuned KERNELs, I also get the error during the BLAS testing, as mentioned in #1071.

@martin-frbg
Collaborator

I am not sure if what you did is sufficient to avoid using any of the .S files. As mentioned in #1071, it seems necessary to declare the generic implementations of the xSYMV functions in the kernel file to ensure these get used instead of their power8.S counterparts (which, I recall, removed the errors seen during BLAS testing); similarly, I notice .S implementations for the xGEMM_BETA functions named in the default KERNEL file. Other power8.S files may be picked up automatically by virtue of their names.
Hopefully it is apparent from the compile log and/or the generated .o files in kernel/power whether any of the plain assembly routines are still used in your build - at the moment, only the inline assembly contained in .c files can reasonably be assumed to be safe.

@martin-frbg
Collaborator

martin-frbg commented Oct 8, 2017

With my patches from PR #1317 this test now passes with a max. diff. of 0.3E-16 at least on an emulated ppc64le system.

@grisuthedragon
Contributor Author

I did the experiments from above on my real ppc64le system and got the following:

OpenBLAS with OpenMP support:

OMP_NUM_THREADS=20 ./tileqr_openmp  10240 10240 512 64
LAPACK   -- Time:   13.016    Gflops/s:  109.992
Tiled    -- Time:    5.438    Gflops/s:  263.256
 
Max diff between R from tiled   and LAPACK is: 0.2E-16

OpenBLAS with simple OpenMP support:

 OMP_NUM_THREADS=20 ./tileqr_openmp_simple  10240 10240 512 64
LAPACK   -- Time:   29.076    Gflops/s:   49.238
Tiled    -- Time:    6.747    Gflops/s:  212.177
 
Max diff between R from tiled   and LAPACK is: 0.2E-16

OpenBLAS without OpenMP support:

OMP_NUM_THREADS=20 ./tileqr_serial  10240 10240 256 32
LAPACK   -- Time:   71.130    Gflops/s:   20.127
Tiled    -- Time:    5.393    Gflops/s:  265.441

Max diff between R from tiled   and LAPACK is: 0.1E+02

As we see, it works for both OpenMP-enabled OpenBLAS builds, but not for the serial one. To me this is clear: in that case we get race conditions on the internal buffers inside OpenBLAS. For normal users, however, a warning should appear if a non-thread-safe OpenBLAS is called from a parallel region. I know this warning appears for parallel for loops etc., but it seems not to appear inside OpenMP tasks, which gives the user no hint that the environment is wrong. Interestingly, on x86-64 this does not disturb the result.
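The hazardous usage pattern described here - BLAS calls issued concurrently from the application's own threads against a single-threaded, non-thread-safe build - can be mimicked in Python, since NumPy's `@` dispatches to the linked BLAS GEMM and releases the GIL, so the calls genuinely overlap. This is a hedged illustration of the pattern, not a guaranteed reproducer; on a thread-safe build the consistency check simply passes:

```python
import threading
import numpy as np

rng = np.random.default_rng(42)
a = rng.standard_normal((256, 256))
b = rng.standard_normal((256, 256))
reference = a @ b                 # serial dgemm result

results = [None] * 8

def worker(i):
    # concurrent dgemm calls; with a non-thread-safe BLAS build these
    # may race on shared internal buffers and corrupt the output
    results[i] = a @ b

threads = [threading.Thread(target=worker, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

ok = all(np.allclose(r, reference) for r in results)
print("all threads consistent:", ok)
```

This is essentially what each OpenMP task in the tiled QR code does when it calls into a serial OpenBLAS.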

@martin-frbg
Collaborator

Indeed, the non-OPENMP level3 thread race issue is not addressed by my PR. I wonder if that case would pass as well if OpenBLAS was compiled with USE_SIMPLE_THREADED_LEVEL3=1 - or is that your default now anyway?

@martin-frbg
Collaborator

BTW the Makefile.rule carries a clear warning (added by wernsaar) to always use OPENMP with POWER8, so arguably the misbehaviour of your serial case is a "documented feature".

@grisuthedragon
Contributor Author

@martin-frbg Yes, you are right, but I think we should mention it in the README.md file (where POWER8 support is not mentioned at all), because a normal user reads the README, not something documented in the makefile.

I let some more tests run overnight to see whether the above results are a random success or not, and I found several runs where the difference between the results is not on the order of the machine epsilon.
For many calls to

OMP_NUM_THREADS=20 ./tileqr_openmp  10240 10240 512 64

I obtain

Max diff between R from tiled   and LAPACK is: 0.2E-16

but sometimes it gives

Max diff between R from tiled   and LAPACK is: 0.2E-04

or

Max diff between R from tiled   and LAPACK is: 0.1E-05

and similar wrong results. This means the race condition is not completely fixed.

@martin-frbg
Collaborator

Agreed about the README (which is probably outdated in other areas as well - e.g., do the ARM specialists here still consider the ARMV8 implementation(s) experimental?). Putting vital information into Makefile.rule is one of the legacies from libGoto.
And I think there are two different bugs at work here - one the race condition(s) between threads (where USE_SIMPLE_THREADED_LEVEL3 may come in as a band-aid), the other the potential for collateral damage to the contents of the vector registers, which my PR is trying to address. As this is my first foray into assembly programming (besides reading a book on Z80 assembly in the very distant past), it is unfortunately possible that I overlooked or misinterpreted one of the offsets in the code, and some calculation done on the stack may still scribble over the temporary storage area I added there for saving the vector registers.

@martin-frbg
Collaborator

Note that I cannot test this any further - within the limitations of my qemu setup, my builds all appear to pass both with and without USE_SIMPLE_THREADED_LEVEL3 (though the "max diff" value is marginally smaller with this option).

@edelsohn

@martin-frbg Have you requested a VM at OSUOSL instead of trying to use qemu?

@martin-frbg
Collaborator

martin-frbg commented Jan 6, 2018

@grisuthedragon, you mentioned in #1332 that disabling multithreading in trmv fixes this issue here as well; do I understand correctly that this was without the more general USE_SIMPLE_THREADED_LEVEL3 option?
(At the very least this finding would suggest that the assembly-level problems are hopefully fixed now.)
As the recommendation to use OPENMP with POWER8 builds was added to README.md by your PR #1325, perhaps this issue can be closed now?

@grisuthedragon
Contributor Author

Yes, I compiled it without USE_SIMPLE_THREADED_LEVEL3. I ran some long-running tests over the last night: the dtrmv fix with one thread decreases the number of cases where the code fails on POWER8, but does not remove all of them.

@edelsohn

edelsohn commented Feb 4, 2019

Do we have any better analysis of the root cause of the problem? IBM is willing to open a bounty to solve this.

@martin-frbg
Collaborator

Somebody needs to check if this is still reproducible; with luck, the fix for #1851 may have helped. (The link to the original testcase is dead - I'll see if I still have it archived.)

@grisuthedragon
Contributor Author

@martin-frbg
I will check it tonight or tomorrow on the system mentioned above.

@hartb

hartb commented Feb 4, 2019

@edelsohn asked me to comment about another OMP_NUM_THREADS-related symptom we're seeing.

We're exercising OpenBLAS via PyTorch:

#!/usr/bin/env python

import torch

for sz in [127, 128, 129, 130, 500]:
    print("Size = {}".format(sz))
    for i in range(10):
        a = torch.randn(sz, sz)
        q, r = torch.qr(a)
        a_qr = torch.mm(q, r)
        # Q @ R should reproduce the original matrix; take abs() so
        # negative deviations are not missed
        m = float((a - a_qr).abs().max())
        print("max diff = {0:.6f}{1}".format(m, " FAIL!" if m > 0.001 else ""))

The intent is that QR decomposition followed by MM should result in the
original matrix.

This test always succeeds on Power 8, and always succeeds when
OMP_NUM_THREADS=1. It fails occasionally on Power 9 when OMP_NUM_THREADS
is > 1.

Even in that configuration, the test always succeeds when the (square) matrix
size is < 128 x 128.

Otherwise, the test fails with greater frequency as OMP_NUM_THREADS grows and
(more strongly) as the matrix size grows. In a recent run with
OMP_NUM_THREADS=32 we saw:

  • no failures at 127 x 127
  • 1 in 10 failures from 128 x 128 through 130 x 130
  • 10 in 10 failures at 500 x 500

We had done some library tracing during test runs, and see markedly
different behavior when the matrix size crosses the 127 x 127 to 128 x 128
boundary (many more MM calls and copies in the larger case). Presumably the
library switches to a different algorithm at the larger size, and that
exposes some serialization issue.

I'm afraid we don't have an OpenBLAS-only version of the testcase.

The problem was seen at both 0.2.20 and 0.3.3 OpenBLAS.
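Lacking an OpenBLAS-only testcase, a NumPy-level stand-in is at least straightforward to write, since `np.linalg.qr` drives the same LAPACK/BLAS path as `torch.qr`. This sketch mirrors the PyTorch script's sizes and 0.001 threshold but has not been verified to reproduce the original failure:

```python
import numpy as np

rng = np.random.default_rng(0)
for sz in [127, 128, 129, 130, 500]:
    worst = 0.0
    for _ in range(10):
        a = rng.standard_normal((sz, sz))
        q, r = np.linalg.qr(a)        # LAPACK dgeqrf/dorgqr underneath
        worst = max(worst, float(np.abs(a - q @ r).max()))
    print("size {}: max diff = {:.6f}{}".format(
        sz, worst, " FAIL!" if worst > 0.001 else ""))
```

On a healthy build every size should report a max diff near machine epsilon.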

@hartb

hartb commented Feb 4, 2019

I intended to confirm whether the fix for #1851 also resolved the QR problem (see above), but now I'm unable to reproduce at 0.3.x versions of OpenBLAS:

OMP_NUM_THREADS=32

LD_PRELOAD=/home/builder/libopenblas.so.0.2.20
Size = 500
FFFF.FFFFFFFFFFFFFFFFFFFF  96.00% failure

LD_PRELOAD=/home/builder/libopenblas.so.0.3.2
Size = 500
.........................   0.00% failure

LD_PRELOAD=/home/builder/libopenblas.so.0.3.3
Size = 500
.........................   0.00% failure

LD_PRELOAD=/home/builder/libopenblas.so.0.3.4
Size = 500
.........................   0.00% failure

I thought we'd seen this at 0.3.3, but it seems I was mistaken. Sorry for the noise!

@grisuthedragon
Contributor Author

I tried it on the above-mentioned system with the current development branch (0.3.6-dev) 100 times, and the example code (now re-uploaded) works correctly on my POWER8 system now.

@martin-frbg
Collaborator

Great, thanks for testing.
