
Incorrect results when loading OpenBLAS with OpenMP prior to pytorch on POWER9 #2869

Closed
Flamefire opened this issue Sep 30, 2020 · 16 comments

@Flamefire
Contributor

For context: we run a POWER9 cluster with large GPUs in our HPC system and compile everything from source via EasyBuild. One such application is PyTorch, which used to show faulty behavior until a recent change that I tracked down to no longer importing numpy prior to torch. The problem reappears when numpy is imported manually before torch.

In short:

  • Loading OpenBLAS prior to import torch leads to miscalculations. The load can happen either via import numpy, which depends on OpenBLAS, or via ctypes.CDLL, which basically dlopens the library
  • Building OpenBLAS without OpenMP support (as done by e.g. conda) makes the problem disappear, which leads me to think this is a bug in OpenBLAS rather than in PyTorch
  • The same setup (same OpenBLAS version, compilation settings, ...) works on x86

The test case I'm using is from e.g. #1191 (comment) / pytorch/pytorch#3716 (comment), hence cc'ing @hartb:

#!/usr/bin/env python

import sys
import os
import ctypes
import subprocess

if os.getenv('IMPORT_NUMPY'):
    import numpy as _np  # pulls in OpenBLAS as a numpy dependency
else:
    # dlopen the library directly instead
    ctypes.CDLL('/sw/installed/OpenBLAS/0.3.7-GCC-8.3.0/lib/libopenblas.so.0', mode=ctypes.RTLD_LOCAL)

import torch

isOK = True

for sz in [127, 128, 129, 130]:
    diffs = []
    print("Size = {}".format(sz))
    for i in range(10):
        a = torch.randn(sz, sz)
        q, r = torch.qr(a)
        a_qr = torch.mm(q, r)
        m = float(max(max(x) for x in a - a_qr))
        diffs.append("max diff = {0:.6f}{1}".format(m, " FAIL!" if m > 0.001 else ""))
        if m > 0.001:
            isOK = False
    print(", ".join(diffs))

if not isOK:
    sys.exit(1)

I see failures consistently already at size 127. As mentioned, compiling OpenBLAS without OpenMP makes this pass; not loading OpenBLAS (either via ctypes.CDLL or via numpy) prior to torch also works, and setting OMP_NUM_THREADS=1 works too. Reducing it to e.g. 2 or 4 makes the failures less frequent; in those cases they appear more often at the larger sizes.
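
For reference, a small driver sketch for running the script under the conditions above (it assumes the reproducer is saved as qr_repro.py, a file name I am making up here):

import os
import subprocess
import sys

# Run the reproducer under the three loading/threading conditions described above.
for label, extra_env in [
    ("ctypes.CDLL load", {}),
    ("numpy load", {"IMPORT_NUMPY": "1"}),
    ("numpy load, single thread", {"IMPORT_NUMPY": "1", "OMP_NUM_THREADS": "1"}),
]:
    rc = subprocess.call([sys.executable, "qr_repro.py"], env=dict(os.environ, **extra_env))
    print("{}: {}".format(label, "FAIL" if rc else "OK"))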

@martin-frbg
Collaborator

Which OpenBLAS version are you using (the path in the CDLL line suggests 0.3.7, current is 0.3.10)?

@Flamefire
Contributor Author

I tried 0.3.10 by compiling it as usual and adapting the LD_LIBRARY_PATH, and double-checked that it gets loaded, but the issue persists. However, I could confirm that with 0.3.10 DYNAMIC_ARCH=1 is working again (the regular tests pass) and there are fewer failures in the extended lapack-test (but there are still some).
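
For the record, this is roughly how I double-check which libopenblas actually gets mapped into the process (a sketch, Linux-only, reading /proc/self/maps):

import numpy  # or torch / ctypes.CDLL(...), whichever load path is being tested

# List every OpenBLAS shared object currently mapped into this process.
with open('/proc/self/maps') as maps:
    paths = sorted({line.split()[-1] for line in maps if 'openblas' in line})
print('\n'.join(paths))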

@martin-frbg
Collaborator

There still being some lapack-test failures is largely a function of the compiler used, and I hope the situation is already better with current develop. The original issue looks like the OpenMP build erroneously relies on OpenMP to prevent a race condition where the plain-threads code employs locks. (Which is a bit weird, as the "official" stance used to be that compiling with OpenMP is absolutely mandatory for OpenBLAS on POWER.)

@Flamefire
Contributor Author

If I understand https://github.com/xianyi/OpenBLAS/blob/4d36711547e71dbebaded85f9ddd92f9dfbe887a/Makefile.power#L2-L10 correctly, then there is only single-threaded or OpenMP on POWER, so no "plain threads code"?

I tried with develop; it is still failing, e.g. Nonsymmetric-Eigenvalue-Problem-EIG/xeigtsts and RFP-linear-equation-routines-LIN/xlintstrfs.

For reference, I also tried using the numpy functions instead of torch and they exhibit the same faulty behavior, so I assume both numpy and pytorch map straight to OpenBLAS.
I'm really puzzled that this only happens when OpenBLAS is loaded before torch.

@martin-frbg
Collaborator

Reviewing the pytorch issue ticket, this appears to be a very old bug; a pity that it apparently never surfaced here (not that I need it now, though...).
A reproducer without torch would be helpful, I think; not sure who packages pytorch for ppc64le. Or does the bug require both OpenBLAS and torch to be present? Multithreaded OpenBLAS allocates some buffer space for communicating partial results between threads, and some of that gets put on the stack, but I do not think we are actively overwriting anybody there.

@Flamefire
Contributor Author

I failed to reproduce this without torch. As mentioned, the mystery to me is that loading OpenBLAS before torch triggers the bug, but loading it afterwards (a no-op, as it is already loaded by torch) does not. Easiest to test by importing numpy beforehand or not. And even for torch: I manually loaded all libraries that torch loads until only the torch libraries themselves remained (checked using strace), then loaded OpenBLAS, then torch, so it really is torch and not any other lib.

Currently I have another lead where some minor difference in the make command leads to 0.3.10 passing the test; trying to isolate and reproduce that.

@hartb

hartb commented Sep 30, 2020

I'm afraid I've moved on to other responsibilities since updating that PyTorch issue, but some recollections from when I was looking at it....

  • We were looking mostly at OpenBLAS 0.2.20, and things seemed improved by the 0.3.3 release.

  • We didn't see failures with input sizes <= 128x128. I used ftrace to try to see what was happening, and could see that (at least) one of the OpenBLAS routines switched to a different strategy when the input size grew to 129x129. It would use a single-threaded solution for the smaller sizes, and decompose into a parallel solution for the larger sizes. Perhaps the switch-point was based on the vector size provided by the platform. I didn't find any OpenBLAS QR unit tests with input sizes approaching anything like the switch-over point size (maybe the largest dimension of test input sizes was 4-5, or maybe around 20? I can't remember, but much smaller than 129x129 from what I could find).

  • I tried recreating this outside pytorch but wasn't able to. I'll attach a test program (qrt.c.gz) I was trying with. A few caveats:

A couple other thoughts, probably unrelated, but for completeness:

@Flamefire
Contributor Author

@hartb Thanks for the notes, I'll add my findings:

  • I'm seeing this issue with 0.3.7 and 0.3.10
  • failures occur already at 127x127, although the bigger the size, the more often it fails. So this might be a slightly different issue
  • I'm using pytorch 1.6; however, as mentioned, even numpy breaks when it is loaded before pytorch (i.e. loading numpy+pytorch breaks, loading pytorch+numpy works). I also could not recreate it without loading pytorch
  • the fork might actually be an issue. Using gdb I found PyTorch executing a fork through subprocess, but that only runs uname, so maybe it's not the issue. Maybe loading the OpenBLAS library already runs some OpenMP initialization that then violates the fork conditions when pytorch forks before loading OpenBLAS? Just a wild guess, not sure how to verify. Disabling OpenMP for the cluster-wide installed OpenBLAS would likely be an issue, as it means slowing down numpy and other "safe" usages
  • That TLS patch still exists but is not used. It worked around a bug in OpenBLAS that enabled TLS by default, which is no longer the case: https://github.com/AnacondaRecipes/openblas-feedstock/blob/4841a03388955fccb667588f5bdd132822b0b2f5/recipe/meta.yaml#L18 So we can exclude that as the cause

I experimented a bit and found that setting NUM_THREADS to the number of CPU cores (in my case 176) leads to the breakage, while setting it to any lower value (e.g. 175) works. Maybe some off-by-one error?
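
For completeness, this is how the thread configuration of the loaded library can be queried (a sketch using ctypes with the library path from my installation above, assuming the openblas_get_* query functions are exported by the build):

import ctypes

lib = ctypes.CDLL('/sw/installed/OpenBLAS/0.3.7-GCC-8.3.0/lib/libopenblas.so.0')
lib.openblas_get_config.restype = ctypes.c_char_p
print(lib.openblas_get_config())        # build-time options, including the thread limit
print(lib.openblas_get_num_threads())   # number of threads OpenBLAS will actually use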

Oh, my... Just tested the fork theory and it really is able to trigger the issue:

#!/usr/bin/env python

import sys
import os
import ctypes
import subprocess

import numpy as np
import platform
if os.getenv('RUN_FORK') and platform.system():
    # platform.system() forks under the hood, which is enough to trigger the bug
    pass

isOK = True

for sz in [127, 128, 129, 180]:
    diffs = []
    print("Size = {}".format(sz))
    for i in range(10):
        a = np.random.randn(sz, sz)
        q, r = np.linalg.qr(a)
        a_qr = np.dot(q, r)
        m = float(max(max(x) for x in a - a_qr))
        diffs.append("max diff = {0:.6f}{1}".format(m, " FAIL!" if m > 0.001 else ""))
        if m > 0.001:
            isOK = False
    print(", ".join(diffs))

if not isOK:
    sys.exit(1)

When running this with RUN_FORK set it fails O.o
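
For clarity, my assumption (not separately verified in this exact form) is that the platform.system() call only matters because it forks under the hood in this Python version, so an explicit fork at the same point should trigger the problem just as well:

import os
import subprocess

# Hypothetical equivalent trigger: any fork after numpy (and therefore OpenBLAS)
# has been loaded, but before the first BLAS call is made.
if os.getenv('RUN_FORK'):
    subprocess.call(['true'])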

@martin-frbg
Collaborator

IIRC (and probably unhelpful), the fork problems are a feature of GNU libgomp, not shared by the LLVM libomp(?).
OTOH I do not get the connection to the number of CPUs, so perhaps there are two bugs?

@Flamefire
Contributor Author

FWIW, I tried to build 0.3.10 with clang and it failed due to some POWER flags not known to clang...
Yeah, no idea about the NUM_THREADS issue or why lowering it avoids the problem. Maybe some buffer gets corrupted in a way that only shows up when all slots are used?

@Flamefire
Contributor Author

Ouch... It seems the issue is fixed by #2765

@hartb

hartb commented Sep 30, 2020

Ah; good find!

@martin-frbg
Collaborator

Sorry, my brain is mush from too many competing issues (not all OpenBLAS-related) right now. I knew (felt, rather?) there was something possibly relevant in the recent fixes but had not gotten round to refreshing my memory.

@Flamefire
Contributor Author

I can confirm that this is indeed the issue, but I am not sure why it depends on forks and NUM_THREADS.

I built OpenBLAS with -DDEBUG=1 and used this Python code to check for sync problems leading to double use of a buffer:

import re

# Track which of OpenBLAS's internal buffer slots are currently in use, based on
# the DEBUG log lines: "Position -> N" when a slot is taken, "Position : N" when
# it is released. Report any slot taken twice or released twice in a row.
inUse = [False] * 300

with open('/home/alex/np3.log') as f:
    for line in f:
        m = re.search(r'Position -> (\d+)', line)
        if m:
            idx = int(m.group(1))
            if inUse[idx]:
                print('Double use of %s' % idx)
            inUse[idx] = True
        m = re.search(r'Position : (\d+)', line)
        if m:
            idx = int(m.group(1))
            if not inUse[idx]:
                print('Double free of %s' % idx)
            inUse[idx] = False

On the system with 176 cores/threads and 0.3.10 without the patch I get many messages; after applying the patch I get none. Without the patch and NUM_THREADS=174 I also get none, and with NUM_THREADS=175 I get a single "Double use of 0", which is strange as the matching double free is missing.

@Flamefire
Contributor Author

Flamefire commented Oct 1, 2020

I found the issue: a fork clears the per-thread memory buffers that are created on application init when using OpenMP. So on every BLAS invocation all threads race to allocate memory, which in turn exposes the missing sync barrier and degrades performance. Fixed in #2876

I verified that after fixing this only a single thread is making calls to blas_memory_alloc

@Flamefire
Contributor Author

Given that there are now two quite serious issues (one specific to POWER and one performance issue affecting all OpenMP builds), I'd suggest releasing 0.3.11 early so package maintainers can pick up those fixes.

@martin-frbg martin-frbg added this to the 0.3.11 milestone Oct 6, 2020