
Race Condition in OpenBLAS on IBM OpenPower 8 #1191

Closed
grisuthedragon opened this issue May 30, 2017 · 20 comments

Comments

@grisuthedragon
Contributor

This issue is a follow-up to #1071; it extends the problem and gives a clearer example. I consider a code computing the TILE-QR decomposition using OpenMP 4 task-depend parallelization. The code, including a makefile to reproduce the errors presented here, is available at: https://github.com/grisuthedragon/openblas-tileqr. Details about the algorithm can be found here.
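For readers unfamiliar with the tile algorithm, its elimination scheme (factor the diagonal tile, then annihilate each sub-diagonal tile against it, updating the trailing tiles of the affected rows) can be sketched serially in Python/NumPy. This is only an illustration of the arithmetic, not the OpenMP task code from the linked repository; the function name `tiled_qr_R` and the tile size are this sketch's own inventions:

```python
import numpy as np

def tiled_qr_R(A, b):
    """Serial sketch of the tile elimination scheme: returns the R factor.
    Each step factors the diagonal tile (GEQRT-like), then annihilates the
    sub-diagonal tiles pairwise against it (TSQRT-like), updating the
    trailing tiles of the rows involved."""
    R = A.copy()
    nt = R.shape[0] // b                  # number of tile rows/columns
    for k in range(nt):
        kk = slice(k * b, (k + 1) * b)
        # factor the diagonal tile and apply Q^T across row k
        Q1, R[kk, kk] = np.linalg.qr(R[kk, kk])
        for j in range(k + 1, nt):
            jj = slice(j * b, (j + 1) * b)
            R[kk, jj] = Q1.T @ R[kk, jj]
        for i in range(k + 1, nt):
            ii = slice(i * b, (i + 1) * b)
            # QR of the stacked pair [R_kk; A_ik] eliminates tile (i, k)
            Q2, Rfull = np.linalg.qr(
                np.vstack([R[kk, kk], R[ii, kk]]), mode="complete")
            R[kk, kk] = Rfull[:b]
            R[ii, kk] = 0.0
            for j in range(k + 1, nt):
                jj = slice(j * b, (j + 1) * b)
                both = Q2.T @ np.vstack([R[kk, jj], R[ii, jj]])
                R[kk, jj], R[ii, jj] = both[:b], both[b:]
    return R

# R agrees with LAPACK's R up to row signs (QR of a full-rank matrix is
# unique up to a diagonal sign matrix), which is what the "max diff
# between R from tiled and LAPACK" check in the testcase measures.
A = np.random.default_rng(0).standard_normal((64, 64))
R_tiled = tiled_qr_R(A, 16)
R_ref = np.linalg.qr(A)[1]
print(float(np.abs(np.abs(R_tiled) - np.abs(R_ref)).max()))
```

In the real code each of these tile operations becomes an OpenMP task with `depend` clauses, which is why concurrent BLAS calls occur.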

System Details:

  • CentOS 7.3, little endian
  • IBM Power8 (SL822C) (20 Cores, 160 Threads)
  • GCC 5.4.1

For the first experiment I compiled OpenBLAS in single-thread mode using

make USE_OPENMP=0 USE_THREAD=0 NO_CBLAS=1 NO_LAPACKE=1 

then compiled and executed the example code via

gfortran -o tileqr_serial -fopenmp tileqr.f90 libopenblas_serial.a -lgomp 
OMP_NUM_THREADS=20 ./tileqr_serial 5120 5120 256 32

and I obtained:

LAPACK   -- Time:    8.576    Gflops/s:   20.867
Tiled    -- Time:    0.930    Gflops/s:  192.450
 
Max diff between R from tiled   and LAPACK is: 0.2E+09

which is a wrong result, because the code has to provide the same result as LAPACK.

Next, I compiled OpenBLAS with

make USE_OPENMP=1 USE_THREAD=1 NO_CBLAS=1 NO_LAPACKE=1 

then compiled and executed the example code via

gfortran -o tileqr_openmp -fopenmp tileqr.f90 libopenblas_openmp.a -lgomp 
OMP_NUM_THREADS=20 ./tileqr_openmp 5120 5120 256 32

and I obtained:

LAPACK   -- Time:    6.553    Gflops/s:   27.310
Tiled    -- Time:    1.120    Gflops/s:  159.760
 
Max diff between R from tiled   and LAPACK is: 0.1E+04

which is also wrong.

Finally I used the old threading implementation for level-3 operations by compiling OpenBLAS via:

make USE_OPENMP=1 USE_THREAD=1 USE_SIMPLE_THREADED_LEVEL3=1 NO_CBLAS=1 NO_LAPACKE=1

and for the example

gfortran -o tileqr_openmp_simple -fopenmp tileqr.f90 libopenblas_openmp_simple.a -lgomp 
OMP_NUM_THREADS=20 ./tileqr_openmp_simple 5120 5120 256 32

I get

LAPACK   -- Time:    7.574    Gflops/s:   23.627
Tiled    -- Time:    1.138    Gflops/s:  157.312
 
Max diff between R from tiled   and LAPACK is: 0.5E+07

If I compile the example against plain Netlib BLAS and Netlib LAPACK, I obtain the correct results.

On Ubuntu 16.04 (x86_64) with a 4-core Intel(R) Core(TM) i7-6700 CPU I got the following correct results:

./tileqr_serial 5120 5120 256 32
LAPACK   -- Time:    7.577    Gflops/s:   23.620
Tiled    -- Time:    3.499    Gflops/s:   51.145
 
Max diff between R from tiled   and LAPACK is: 0.2E-16

for the serial OpenBLAS variant, and for the OpenMP multithreaded version:

./tileqr_openmp 5120 5120 256 32
LAPACK   -- Time:    6.133    Gflops/s:   29.177
Tiled    -- Time:    4.047    Gflops/s:   44.220
 
Max diff between R from tiled   and LAPACK is: 0.2E-16

and even with the old threading scheme I get

./tileqr_openmp_simple 5120 5120 256 32
LAPACK   -- Time:    8.444    Gflops/s:   21.194
Tiled    -- Time:    3.752    Gflops/s:   47.701
 
Max diff between R from tiled   and LAPACK is: 0.2E-16
@martin-frbg
Collaborator

So all your OpenBLAS calculations on Power8 give the wrong result, no matter which build options you use. From #1071 I believe you already tried removing some/all of the now-suspect assembly implementations from the KERNEL file - did this lead anywhere?

@grisuthedragon
Contributor Author

@martin-frbg Yes, with this example, in contrast to the initial one from #1071, all calculations go wrong. I added the generic KERNEL.POWER8 file to the git project and even there I obtain the error; in addition to the tuned KERNELs, I also get the error during the BLAS testing, as mentioned in #1071.

@martin-frbg
Collaborator

I am not sure if what you did is sufficient to avoid using any of the .S files. As mentioned in #1071, it seems necessary to declare the generic implementations of the xSYMV functions in the kernel file to ensure these get used instead of their power8.S counterparts (which, I recall, removed the errors seen during BLAS testing); similarly, I notice .S implementations for the xGEMM_BETA functions named in the default KERNEL file. Other power8.S files may be picked up automatically by virtue of their names.
Hopefully it is apparent from the compile log and/or the generated .o files in kernel/power whether any of the plain assembly routines are still used in your build - at the moment, only the inline assembly contained in .c files can reasonably be assumed to be safe.

@martin-frbg
Collaborator

martin-frbg commented Oct 8, 2017

With my patches from PR #1317 this test now passes with a max. diff. of 0.3E-16 at least on an emulated ppc64le system.

@grisuthedragon
Contributor Author

I did the experiments from above on my real ppc64le system and got the following:

OpenBLAS with OpenMP support:

OMP_NUM_THREADS=20 ./tileqr_openmp  10240 10240 512 64
LAPACK   -- Time:   13.016    Gflops/s:  109.992
Tiled    -- Time:    5.438    Gflops/s:  263.256
 
Max diff between R from tiled   and LAPACK is: 0.2E-16

OpenBLAS with simple OpenMP support:

 OMP_NUM_THREADS=20 ./tileqr_openmp_simple  10240 10240 512 64
LAPACK   -- Time:   29.076    Gflops/s:   49.238
Tiled    -- Time:    6.747    Gflops/s:  212.177
 
Max diff between R from tiled   and LAPACK is: 0.2E-16

OpenBLAS without OpenMP support:

OMP_NUM_THREADS=20 ./tileqr_serial  10240 10240 256 32
LAPACK   -- Time:   71.130    Gflops/s:   20.127
Tiled    -- Time:    5.393    Gflops/s:  265.441

Max diff between R from tiled   and LAPACK is: 0.1E+02

As we see, it works for both OpenMP-enabled OpenBLAS builds, but not for the serial one. To me this is clear: in that case we get race conditions on the internal buffers inside OpenBLAS. For normal users, however, a warning should appear if a non-thread-safe OpenBLAS is called from a parallel region. I know this warning appears for parallel for loops etc., but it seems not to appear inside OpenMP tasks, which gives the user no hint that the environment is wrong. Interestingly, on x86-64 this does not disturb the result.
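The hazardous usage pattern described here - BLAS calls issued concurrently from the application's own threads against a single-threaded, non-thread-safe build - can be mimicked in Python, since NumPy's `@` dispatches to the linked BLAS GEMM and releases the GIL, so the calls genuinely overlap. This is a hedged illustration of the pattern, not a guaranteed reproducer; on a thread-safe build the consistency check simply passes:

```python
import threading
import numpy as np

rng = np.random.default_rng(42)
a = rng.standard_normal((256, 256))
b = rng.standard_normal((256, 256))
reference = a @ b                 # serial dgemm result

results = [None] * 8

def worker(i):
    # concurrent dgemm calls; with a non-thread-safe BLAS build these
    # may race on shared internal buffers and corrupt the output
    results[i] = a @ b

threads = [threading.Thread(target=worker, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

ok = all(np.allclose(r, reference) for r in results)
print("all threads consistent:", ok)
```

This is essentially what each OpenMP task in the tiled QR code does when it calls into a serial OpenBLAS.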

@martin-frbg
Collaborator

Indeed, the non-OPENMP level3 thread race issue is not addressed by my PR. I wonder if that case would pass as well if OpenBLAS was compiled with USE_SIMPLE_THREADED_LEVEL3=1 - or is that your default now anyway?

@martin-frbg
Collaborator

BTW the Makefile.rule carries a clear warning (added by wernsaar) to always use OPENMP with POWER8, so arguably the misbehaviour of your serial case is a "documented feature".

@grisuthedragon
Contributor Author

@martin-frbg Yes, you are right, but I think we should mention it in the README.md file (where POWER8 support is not mentioned at all), because a normal user reads the README, not something documented in the makefile.

I let some more tests run overnight to see whether the above results are a random success or not, and I found several runs where the difference between the results is not on the order of the machine epsilon.
For many calls to

OMP_NUM_THREADS=20 ./tileqr_openmp  10240 10240 512 64

I obtain

Max diff between R from tiled   and LAPACK is: 0.2E-16

but sometimes it gives

Max diff between R from tiled   and LAPACK is: 0.2E-04

or

Max diff between R from tiled   and LAPACK is: 0.1E-05

and similar wrong results. This means the race condition is not completely fixed.

@martin-frbg
Collaborator

Agreed about the README (which is probably outdated in other areas as well - e.g., do the ARM specialists here still consider the ARMV8 implementation(s) experimental?). Putting vital information into Makefile.rule is one of the legacies from libGoto.
And I think there are two different bugs at work here - one the race condition(s) between threads (where USE_SIMPLE_THREADED_LEVEL3 may come in as a band-aid), the other the potential for collateral damage to the contents of the vector registers, which my PR is trying to address. As this is my first foray into assembly programming (besides reading a book on Z80 assembly in the very distant past), it is unfortunately possible that I overlooked or misinterpreted one of the offsets in the code, and some calculation done on the stack may still scribble over the temporary storage area I added there for saving the vector registers.

@martin-frbg
Collaborator

Note that I cannot test this any further - within the limitations of my qemu setup, my builds all appear to pass both with and without USE_SIMPLE_THREADED_LEVEL3 (though the "max diff" value is marginally smaller with this option).

@edelsohn

@martin-frbg Have you requested a VM at OSUOSL instead of trying to use qemu?

@martin-frbg
Collaborator

martin-frbg commented Jan 6, 2018

@grisuthedragon, you mentioned in #1332 that disabling multithreading in trmv fixes this issue here as well; do I understand correctly that this was without the more general USE_SIMPLE_THREADED_LEVEL3 option?
(At the very least this finding would suggest that the assembly-level problems are hopefully fixed now.)
As the recommendation to use OPENMP with POWER8 builds was added to README.md by your PR #1325, perhaps this issue can be closed now?

@grisuthedragon
Contributor Author

Yes, I compiled it without USE_SIMPLE_THREADED_LEVEL3. I ran some long-running tests over the last night: the dtrmv fix with one thread decreases the number of cases where the code fails on POWER8, but does not remove all of them.

@edelsohn

edelsohn commented Feb 4, 2019

Do we have any better analysis of the root cause of the problem? IBM is willing to open a bounty to solve this.

@martin-frbg
Collaborator

Somebody needs to check if this is still reproducible; with luck, the fix for #1851 may have helped. (The link to the original testcase is dead - I'll see if I still have it archived.)

@grisuthedragon
Contributor Author

@martin-frbg
I will check it tonight or tomorrow on the system mentioned above.

@hartb

hartb commented Feb 4, 2019

@edelsohn asked me to comment about another OMP_NUM_THREADS-related symptom we're seeing.

We're exercising OpenBLAS via PyTorch:

#!/usr/bin/env python

import torch

for sz in [127, 128, 129, 130, 500]:
    print("Size = {}".format(sz))
    for i in range(10):
        a = torch.randn(sz, sz)
        q, r = torch.qr(a)
        a_qr = torch.mm(q, r)
        # Q @ R should reproduce the original matrix; take abs() so
        # negative deviations are not missed
        m = float((a - a_qr).abs().max())
        print("max diff = {0:.6f}{1}".format(m, " FAIL!" if m > 0.001 else ""))

The intent is that QR decomposition followed by MM should result in the
original matrix.

This test always succeeds on Power 8, and always succeeds when
OMP_NUM_THREADS=1. It fails occasionally on Power 9 when OMP_NUM_THREADS
is > 1.

Even in that configuration, the test always succeeds when the (square) matrix
size is < 128 x 128.

Otherwise, the test fails with greater frequency as OMP_NUM_THREADS grows and
(more strongly) as the matrix size grows. In a recent run with
OMP_NUM_THREADS=32 we saw:

  • no failures at 127 x 127
  • 1 in 10 failures from 128 x 128 through 130 x 130
  • 10 in 10 failures at 500 x 500

We had done some library tracing during test runs, and see markedly
different behavior when the matrix size crosses the 127 x 127 to 128 x 128
boundary (many more MM calls and copies in the larger case). Presumably the
library switches to a different algorithm at the larger size, and that
exposes some serialization issue.

I'm afraid we don't have an OpenBLAS-only version of the testcase.

The problem was seen at both 0.2.20 and 0.3.3 OpenBLAS.
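Lacking an OpenBLAS-only testcase, a NumPy-level stand-in is at least straightforward to write, since `np.linalg.qr` drives the same LAPACK/BLAS path as `torch.qr`. This sketch mirrors the PyTorch script's sizes and 0.001 threshold but has not been verified to reproduce the original failure:

```python
import numpy as np

rng = np.random.default_rng(0)
for sz in [127, 128, 129, 130, 500]:
    worst = 0.0
    for _ in range(10):
        a = rng.standard_normal((sz, sz))
        q, r = np.linalg.qr(a)        # LAPACK dgeqrf/dorgqr underneath
        worst = max(worst, float(np.abs(a - q @ r).max()))
    print("size {}: max diff = {:.6f}{}".format(
        sz, worst, " FAIL!" if worst > 0.001 else ""))
```

On a healthy build every size should report a max diff near machine epsilon.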

@hartb

hartb commented Feb 4, 2019

I intended to confirm whether the fix for #1851 also resolved the QR problem (see above), but now I'm unable to reproduce at 0.3.x versions of OpenBLAS:

OMP_NUM_THREADS=32

LD_PRELOAD=/home/builder/libopenblas.so.0.2.20
Size = 500
FFFF.FFFFFFFFFFFFFFFFFFFF  96.00% failure

LD_PRELOAD=/home/builder/libopenblas.so.0.3.2
Size = 500
.........................   0.00% failure

LD_PRELOAD=/home/builder/libopenblas.so.0.3.3
Size = 500
.........................   0.00% failure

LD_PRELOAD=/home/builder/libopenblas.so.0.3.4
Size = 500
.........................   0.00% failure

I thought we'd seen this at 0.3.3, but it seems I was mistaken. Sorry for the noise!

@grisuthedragon
Contributor Author

I tried it on the above-mentioned system with the current development branch (0.3.6-dev) 100 times, and the example code (now re-uploaded) works correctly on my POWER8 system now.

@martin-frbg
Collaborator

Great, thanks for testing.
