Race Condition in OpenBLAS on IBM OpenPower 8 #1191
Comments
So all your OpenBLAS calculations on Power8 give the wrong result no matter which build options you use. From #1071 I believe you already tried removing some/all of the now-suspect assembly implementations from the KERNEL file - did this lead anywhere?
@martin-frbg Yes, with this example, in contrast to the initial one from #1071, all calculations went wrong. I added the generic KERNEL.POWER8 file to the git project and even there I obtain the error; in addition to the tuned KERNELs, I also get the error during the BLAS testing, as mentioned in #1071.
I am not sure if what you did is sufficient to avoid using any of the .S files - as mentioned in #1071 it seems necessary to declare the generic implementations of the xSYMV functions in the kernel file to […]
With my patches from PR #1317 this test now passes with a max. diff. of 0.3E-16, at least on an emulated ppc64le system.
I did the experiments from above on my real ppc64le system and I get the following:
OpenBLAS with OpenMP support:
OpenBLAS with simple OpenMP support:
OpenBLAS without OpenMP support:
As we see, it works for both OpenMP-enabled OpenBLAS versions, but not for the serial one. To me this is clear, because in this case we have race conditions on the internal buffers inside OpenBLAS; but for normal users a warning should appear if OpenBLAS is called from a parallel region. I know that this warning appears for OpenMP for loops etc., but it seems not to appear for OpenMP tasks, which gives the user no hint that their environment is wrong. Interestingly, on x86-64 this does not disturb the result.
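To make that calling pattern concrete, here is a minimal sketch (not the testcase from the linked repo, with arbitrary sizes) of OpenBLAS being entered from inside OpenMP tasks via cblas_dgemm. With a non-OpenMP OpenBLAS build this is the situation that can race on the library's internal buffers; the OpenMP-enabled builds tested above are reported to handle it.

```c
/* Minimal sketch (not the original testcase): many OpenMP tasks each
 * calling cblas_dgemm concurrently, i.e. OpenBLAS entered from inside
 * an OpenMP parallel region. */
#include <stdio.h>
#include <stdlib.h>
#include <cblas.h>

#define N      256   /* matrix dimension, arbitrary */
#define NTASKS 32    /* number of concurrent GEMM tasks, arbitrary */

int main(void) {
    double *A = malloc((size_t)N * N * sizeof(double));
    double *B = malloc((size_t)N * N * sizeof(double));
    for (int i = 0; i < N * N; i++) { A[i] = 1.0; B[i] = 2.0; }

    #pragma omp parallel
    #pragma omp single
    for (int t = 0; t < NTASKS; t++) {
        #pragma omp task firstprivate(t)
        {
            double *C = calloc((size_t)N * N, sizeof(double));
            /* Each task performs its own GEMM into a private output. */
            cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                        N, N, N, 1.0, A, N, B, N, 0.0, C, N);
            /* For these inputs every entry of C must be exactly 2*N. */
            if (C[0] != 2.0 * N)
                printf("task %d: unexpected C[0] = %g\n", t, C[0]);
            free(C);
        }
    }

    free(A);
    free(B);
    return 0;
}
```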
Indeed, the non-OPENMP level3 thread race issue is not addressed by my PR. I wonder if that case would pass as well if OpenBLAS was compiled with USE_SIMPLE_THREADED_LEVEL3=1, or is that your default now anyway?
BTW the Makefile.rule carries a clear warning (added by wernsaar) to always use OPENMP with POWER8, so arguably the misbehaviour of your serial case is a "documented feature".
@martin-frbg Yes, you are right, but I think we should mention it in the README.md file (where the POWER8 support is not mentioned), because if a normal user reads anything, it is normally the README file and not something documented in the Makefile. I let some more tests run overnight to see whether the above results are a random success or not, and I found several runs where the difference between the results is not in the order of the machine epsilon.
I obtain
but sometimes it gives
or
and similar wrong results. This means the race condition is not fixed completely.
Agreed about the README (which is probably outdated in other areas as well - e.g. do the ARM specialists here still consider the ARMV8 implementation(s) experimental?). Putting vital information into Makefile.rule is one of the legacies from libGoto.
Note that I cannot test this any further - within the limitations of my qemu setup, my builds all appear to pass both with and without USE_SIMPLE_THREADED_LEVEL3 (though the "max diff" value […])
@martin-frbg Have you requested a VM at OSUOSL instead of trying to use qemu?
@grisuthedragon, you mentioned in #1332 that disabling multithreading in trmv fixes this issue here as well, do I understand correctly that was without the more general USE_SIMPLE_THREADED_LEVEL3 option?
Yes, I compiled it without USE_SIMPLE_THREADED_LEVEL3. I did some long-running tests over the last night; the dtrmv fix with one thread decreases the number of cases where the code fails on POWER8, but does not remove all of them.
Do we have any better analysis of the root cause of the problem? IBM is willing to open a bounty to solve this.
Somebody needs to check if this is still reproducible, with luck the fix for #1851 may have helped. (The link to the original testcase is dead, I'll see if I still have it archived.)
@martin-frbg […]
@edelsohn asked me to comment about another […]. We're exercising OpenBLAS via PyTorch:
The intent is that QR decomposition followed by MM should result in the original matrix. This test always succeeds on Power 8, and always succeeds when […]. Even in that configuration, the test always succeeds when the (square) matrix […]. Otherwise, the test fails with greater frequency as […].
We had done some library tracing during test runs, and see much different […]. I'm afraid we don't have an OpenBLAS-only version of the testcase. The problem was seen with both OpenBLAS 0.2.20 and 0.3.3.
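For reference, a rough OpenBLAS-level analogue of that check (factor A = QR, form Q explicitly, multiply Q by R, compare with A) could look like the following LAPACKE/CBLAS sketch; this is not the PyTorch testcase itself, and the matrix size, random inputs, and absence of a pass/fail threshold are simplifications.

```c
/* Rough C analogue of the "QR followed by MM should give back A"
 * check described above (the real test runs through PyTorch). */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <cblas.h>
#include <lapacke.h>

int main(void) {
    const int n = 512;   /* square matrix, size arbitrary */
    double *A   = malloc((size_t)n * n * sizeof(double));
    double *QR  = malloc((size_t)n * n * sizeof(double));
    double *R   = calloc((size_t)n * n, sizeof(double));
    double *P   = calloc((size_t)n * n, sizeof(double));
    double *tau = malloc((size_t)n * sizeof(double));

    srand(42);
    for (int i = 0; i < n * n; i++) A[i] = QR[i] = (double)rand() / RAND_MAX;

    /* Factor A = Q*R (column-major, QR holds the packed factorization). */
    if (LAPACKE_dgeqrf(LAPACK_COL_MAJOR, n, n, QR, n, tau) != 0) return 1;

    /* Save the upper-triangular R before QR is overwritten with Q. */
    for (int j = 0; j < n; j++)
        for (int i = 0; i <= j; i++)
            R[i + j * n] = QR[i + j * n];

    /* Form Q explicitly. */
    if (LAPACKE_dorgqr(LAPACK_COL_MAJOR, n, n, n, QR, n, tau) != 0) return 1;

    /* P = Q*R should reproduce A up to rounding. */
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, QR, n, R, n, 0.0, P, n);

    double maxdiff = 0.0;
    for (int i = 0; i < n * n; i++)
        maxdiff = fmax(maxdiff, fabs(P[i] - A[i]));
    printf("max |Q*R - A| = %e\n", maxdiff);

    free(A); free(QR); free(R); free(P); free(tau);
    return 0;
}
```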
I intended to confirm whether the fix for #1851 also resolved the QR problem (see above), but now I'm unable to reproduce at […]
I thought we'd seen this at 0.3.3, but it seems I was mistaken. Sorry for the noise!
I tried it 100 times on the above-mentioned system with the current development branch (0.3.6-dev), and the example code (now reuploaded) now works correctly on my POWER8 system.
Great, thanks for testing.
This issue is a follow-up to #1071; it extends the problem and gives a clearer example. I consider a code computing the TILE-QR decomposition using OpenMP 4 task-depend parallelization. The code, including a makefile to reproduce the errors presented here, is available at: https://github.com/grisuthedragon/openblas-tileqr. Details about the algorithm can be found here.
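The actual TILE-QR kernels are in the linked repository; as an illustration of the OpenMP 4 task-depend style such tile algorithms use, the following sketch applies the same pattern to a tiled GEMM (tile size, matrix size, and the use of cblas_dgemm are arbitrary choices, not taken from the repo).

```c
/* Not the TILE-QR kernels from the linked repo: a sketch of the
 * OpenMP 4 "task depend" style such tile algorithms use, shown here
 * with a tiled GEMM (C += A*B) so the example stays self-contained. */
#include <stdlib.h>
#include <cblas.h>

#define NT   4             /* tiles per dimension, arbitrary */
#define TS   128           /* tile size, arbitrary           */
#define NMAT (NT * TS)     /* full column-major matrix size  */

/* Pointer to tile (i, j) of a column-major NMAT x NMAT matrix. */
static double *tile(double *M, int i, int j) {
    return M + (size_t)j * TS * NMAT + (size_t)i * TS;
}

int main(void) {
    size_t nelem = (size_t)NMAT * NMAT;
    double *A = malloc(nelem * sizeof(double));
    double *B = malloc(nelem * sizeof(double));
    double *C = calloc(nelem, sizeof(double));
    for (size_t i = 0; i < nelem; i++) { A[i] = 1.0; B[i] = 1.0; }

    #pragma omp parallel
    #pragma omp single
    for (int i = 0; i < NT; i++)
        for (int j = 0; j < NT; j++)
            for (int k = 0; k < NT; k++) {
                double *a = tile(A, i, k), *b = tile(B, k, j), *c = tile(C, i, j);
                /* Data flow is expressed via depend clauses on the tiles:
                 * updates of the same C tile are serialized, while
                 * independent tiles run as parallel tasks. */
                #pragma omp task depend(in: a[0], b[0]) depend(inout: c[0]) firstprivate(a, b, c)
                cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                            TS, TS, TS, 1.0, a, NMAT, b, NMAT, 1.0, c, NMAT);
            }

    free(A); free(B); free(C);
    return 0;
}
```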
System Details:
For the first experiment I compiled OpenBLAS in single-thread mode using
and compiled and executed the example code via
and I obtained:
which is the wrong result, because the code has to provide the same result as LAPACK.
Next I compiled OpenBLAS with
and compiled and executed the example code via
and I obtained:
which is also wrong.
Finally I used the old threading implementation for level-3 operations by compiling OpenBLAS via:
and for the example
I get
If I compile the example using pure Netlib BLAS and Netlib LAPACK I obtain the correct results as well.
On an Ubuntu 16.04 x86_64 system with a 4-core Intel(R) Core(TM) i7-6700 CPU I got the following correct results:
for the serial OpenBLAS variant, and for the OpenMP multithreaded version:
and even with the old threading scheme I get