
Incorrect results when loading OpenBLAS with OpenMP prior to pytorch on POWER9 #2869

Closed
Flamefire opened this issue Sep 30, 2020 · 16 comments

@Flamefire
Contributor

For context: we run a POWER9 cluster with large GPUs in our HPC system and compile everything from source via EasyBuild. One such application is PyTorch, which used to show faulty behavior until a recent change that I tracked down to no longer importing numpy prior to torch. The problem reappears when numpy is imported manually before torch.

In short:

  • Loading OpenBLAS prior to import torch leads to miscalculations. The load can happen either via import numpy, which depends on OpenBLAS, or via ctypes.CDLL, which basically dlopens the library
  • Building OpenBLAS without OpenMP support (as done by e.g. conda) makes the problem disappear, which leads me to think this is a bug in OpenBLAS rather than in PyTorch
  • The same setup (same OpenBLAS version, compilation settings, ...) works on x86

The test case I'm using is from e.g. #1191 (comment) / pytorch/pytorch#3716 (comment), hence cc'ing @hartb:

#!/usr/bin/env python

import sys
import os
import ctypes
import subprocess

if os.getenv('IMPORT_NUMPY'):
    import numpy as _np  # pulls in OpenBLAS as a numpy dependency
else:
    # dlopen the library directly instead
    ctypes.CDLL('/sw/installed/OpenBLAS/0.3.7-GCC-8.3.0/lib/libopenblas.so.0', mode=ctypes.RTLD_LOCAL)

import torch

isOK = True

for sz in [127, 128, 129, 130]:
    diffs = []
    print("Size = {}".format(sz))
    for i in range(10):
        a = torch.randn(sz, sz)
        q, r = torch.qr(a)
        a_qr = torch.mm(q, r)
        m = float(max(max(x) for x in a - a_qr))
        diffs.append("max diff = {0:.6f}{1}".format(m, " FAIL!" if m > 0.001 else ""))
        if m > 0.001:
            isOK = False
    print(", ".join(diffs))

if not isOK:
    sys.exit(1)

I see failures consistently already at size 127. As mentioned, compiling OpenBLAS without OpenMP makes this pass; not loading OpenBLAS (either via ctypes.CDLL or via numpy) prior to torch also works, and setting OMP_NUM_THREADS=1 works too. Reducing it to e.g. 2 or 4 makes the failures less frequent; in those cases they appear more often at the larger sizes.
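
For reference, a small driver sketch for running the script under the conditions above (it assumes the reproducer is saved as qr_repro.py, a file name I am making up here):

import os
import subprocess
import sys

# Run the reproducer under the three loading/threading conditions described above.
for label, extra_env in [
    ("ctypes.CDLL load", {}),
    ("numpy load", {"IMPORT_NUMPY": "1"}),
    ("numpy load, single thread", {"IMPORT_NUMPY": "1", "OMP_NUM_THREADS": "1"}),
]:
    rc = subprocess.call([sys.executable, "qr_repro.py"], env=dict(os.environ, **extra_env))
    print("{}: {}".format(label, "FAIL" if rc else "OK"))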

@martin-frbg
Collaborator

Which OpenBLAS version are you using (the path in the CDLL line suggests 0.3.7, current is 0.3.10)?

@Flamefire
Contributor Author

I tried 0.3.10 by compiling it as usual and adapting the LD_LIBRARY_PATH, and double-checked that it gets loaded, but the issue persists. However, I could confirm that with 0.3.10 DYNAMIC_ARCH=1 is working again (the regular tests pass) and there are fewer failures in the extended lapack-test (but there are still some).
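
For the record, this is roughly how I double-check which libopenblas actually gets mapped into the process (a sketch, Linux-only, reading /proc/self/maps):

import numpy  # or torch / ctypes.CDLL(...), whichever load path is being tested

# List every OpenBLAS shared object currently mapped into this process.
with open('/proc/self/maps') as maps:
    paths = sorted({line.split()[-1] for line in maps if 'openblas' in line})
print('\n'.join(paths))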

@martin-frbg
Collaborator

There still being some lapack-test failures is largely a function of the compiler used, and I hope the situation is already better with current develop. The original issue looks like the OpenMP build erroneously relies on OpenMP to prevent a race condition where the plain-threads code employs locks. (Which is a bit weird, as the "official" stance used to be that compiling with OpenMP is absolutely mandatory for OpenBLAS on POWER.)

@Flamefire
Contributor Author

If I understand https://github.com/xianyi/OpenBLAS/blob/4d36711547e71dbebaded85f9ddd92f9dfbe887a/Makefile.power#L2-L10 correctly, then there is only single-threaded or OpenMP on POWER, so no "plain threads code"?

I tried with develop; it is still failing, e.g. Nonsymmetric-Eigenvalue-Problem-EIG/xeigtsts and RFP-linear-equation-routines-LIN/xlintstrfs.

For reference, I also tried using the numpy functions instead of torch and they exhibit the same faulty behavior, so I assume both numpy and pytorch map straight to OpenBLAS.
I'm really puzzled that this only happens when OpenBLAS is loaded before torch.

@martin-frbg
Collaborator

Reviewing the pytorch issue ticket, this appears to be a very old bug; a pity that it apparently never surfaced here (not that I need it now, though...).
A reproducer without torch would be helpful, I think; not sure who packages pytorch for ppc64le. Or does the bug require both OpenBLAS and torch to be present? Multithreaded OpenBLAS allocates some buffer space for communicating partial results between threads, and some of that gets put on the stack, but I do not think we are actively overwriting anybody there.

@Flamefire
Contributor Author

I failed to reproduce this without torch. As mentioned, the mystery to me is that loading OpenBLAS before torch triggers the bug, but loading it afterwards (a no-op, as it is already loaded by torch) does not. Easiest to test by importing numpy beforehand or not. And even for torch: I manually loaded all libraries that torch loads until only the torch libraries themselves remained (checked using strace), then loaded OpenBLAS, then torch, so it really is torch and not any other lib.

Currently I have another lead where some minor difference in the make command leads to 0.3.10 passing the test; trying to isolate and reproduce that.

@hartb

hartb commented Sep 30, 2020

I'm afraid I've moved on to other responsibilities since updating that PyTorch issue, but some recollections from when I was looking at it....

  • We were looking mostly at OpenBLAS 0.2.20, and things seemed improved by the 0.3.3 release.

  • We didn't see failures with input sizes <= 128x128. I used ftrace to try to see what was happening, and could see that (at least) one of the OpenBLAS routines switched to a different strategy when the input size grew to 129x129. It would use a single-threaded solution for the smaller sizes, and decompose into a parallel solution for the larger sizes. Perhaps the switch-point was based on the vector size provided by the platform. I didn't find any OpenBLAS QR unit tests with input sizes approaching anything like the switch-over point size (maybe the largest dimension of test input sizes was 4-5, or maybe around 20? I can't remember, but much smaller than 129x129 from what I could find).

  • I tried recreating this outside pytorch but wasn't able to. I'll attach a test program (qrt.c.gz) I was trying with. A few caveats:

A couple other thoughts, probably unrelated, but for completeness:

@Flamefire
Contributor Author

@hartb Thanks for the notes, I'll add my findings:

  • I'm seeing this issue with 0.3.7 and 0.3.10
  • failures occur already at 127x127, although the bigger the size, the more often it fails. So this might be a slightly different issue
  • I'm using pytorch 1.6; however, as mentioned, even numpy breaks when it is loaded before pytorch (i.e. loading numpy+pytorch breaks, loading pytorch+numpy works). I also could not recreate it without loading pytorch
  • the fork might actually be an issue. Using gdb I found PyTorch executing a fork through subprocess, but that only runs uname, so maybe it's not the issue. Maybe loading the OpenBLAS library already runs some OpenMP initialization that then violates the fork conditions when pytorch forks before loading OpenBLAS? Just a wild guess, not sure how to verify. Disabling OpenMP for the cluster-wide installed OpenBLAS would likely be an issue, as it means slowing down numpy and other "safe" usages
  • That TLS patch still exists but is not used. It worked around a bug in OpenBLAS that enabled TLS by default, which is no longer the case: https://github.com/AnacondaRecipes/openblas-feedstock/blob/4841a03388955fccb667588f5bdd132822b0b2f5/recipe/meta.yaml#L18 So we can exclude that as the cause

I experimented a bit and found that setting NUM_THREADS to the number of CPU cores (in my case 176) leads to the breakage, while setting it to any lower value (e.g. 175) works. Maybe some off-by-one error?
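
For completeness, this is how the thread configuration of the loaded library can be queried (a sketch using ctypes with the library path from my installation above, assuming the openblas_get_* query functions are exported by the build):

import ctypes

lib = ctypes.CDLL('/sw/installed/OpenBLAS/0.3.7-GCC-8.3.0/lib/libopenblas.so.0')
lib.openblas_get_config.restype = ctypes.c_char_p
print(lib.openblas_get_config())        # build-time options, including the thread limit
print(lib.openblas_get_num_threads())   # number of threads OpenBLAS will actually use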

Oh, my... Just tested the fork theory and it really is able to trigger the issue:

#!/usr/bin/env python

import sys
import os
import ctypes
import subprocess

import numpy as np
import platform
if os.getenv('RUN_FORK') and platform.system():
    # platform.system() forks under the hood, which is enough to trigger the bug
    pass

isOK = True

for sz in [127, 128, 129, 180]:
    diffs = []
    print("Size = {}".format(sz))
    for i in range(10):
        a = np.random.randn(sz, sz)
        q, r = np.linalg.qr(a)
        a_qr = np.dot(q, r)
        m = float(max(max(x) for x in a - a_qr))
        diffs.append("max diff = {0:.6f}{1}".format(m, " FAIL!" if m > 0.001 else ""))
        if m > 0.001:
            isOK = False
    print(", ".join(diffs))

if not isOK:
    sys.exit(1)

When running this with RUN_FORK set it fails O.o
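
For clarity, my assumption (not separately verified in this exact form) is that the platform.system() call only matters because it forks under the hood in this Python version, so an explicit fork at the same point should trigger the problem just as well:

import os
import subprocess

# Hypothetical equivalent trigger: any fork after numpy (and therefore OpenBLAS)
# has been loaded, but before the first BLAS call is made.
if os.getenv('RUN_FORK'):
    subprocess.call(['true'])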

@martin-frbg
Collaborator

IIRC (and probably unhelpful), the fork problems are a feature of GNU libgomp, not shared by the LLVM libomp(?).
OTOH I do not get the connection to the number of CPUs, so perhaps there are two bugs?

@Flamefire
Contributor Author

FWIW, I tried to build 0.3.10 with clang and it failed due to some POWER flags not known to clang...
Yeah, no idea about the NUM_THREADS issue or why lowering it avoids the problem. Maybe some buffer gets corrupted in a way that only shows up when all slots are used?

@Flamefire
Contributor Author

Ouch... It seems the issue is fixed by #2765

@hartb

hartb commented Sep 30, 2020

Ah; good find!

@martin-frbg
Collaborator

Sorry, my brain is mush from too many competing issues (not all OpenBLAS-related) right now. I knew (felt, rather?) there was something possibly relevant in the recent fixes but had not gotten round to refreshing my memory.

@Flamefire
Contributor Author

I can confirm that this is indeed the issue, but I am not sure why it depends on forks and NUM_THREADS.

I built OpenBLAS with -DDEBUG=1 and used this Python code to check for sync problems leading to double use of a buffer:

import re

# Track which of OpenBLAS's internal buffer slots are currently in use, based on
# the DEBUG log lines: "Position -> N" when a slot is taken, "Position : N" when
# it is released. Report any slot taken twice or released twice in a row.
inUse = [False] * 300

with open('/home/alex/np3.log') as f:
    for line in f:
        m = re.search(r'Position -> (\d+)', line)
        if m:
            idx = int(m.group(1))
            if inUse[idx]:
                print('Double use of %s' % idx)
            inUse[idx] = True
        m = re.search(r'Position : (\d+)', line)
        if m:
            idx = int(m.group(1))
            if not inUse[idx]:
                print('Double free of %s' % idx)
            inUse[idx] = False

On the system with 176 cores/threads and 0.3.10 without the patch I get many messages; after applying the patch I get none. Without the patch and NUM_THREADS=174 I also get none, and with NUM_THREADS=175 I get a single "Double use of 0", which is strange as the matching double free is missing.

@Flamefire
Contributor Author

Flamefire commented Oct 1, 2020

I found the issue: a fork clears the per-thread memory buffers that are created on application init when using OpenMP. So on every BLAS invocation all threads race to allocate memory, which in turn exposes the missing sync barrier and degrades performance. Fixed in #2876

I verified that after fixing this only a single thread is making calls to blas_memory_alloc

@Flamefire
Contributor Author

Given that there are now two quite serious issues (one specific to POWER and one performance issue affecting all OpenMP builds), I'd suggest releasing 0.3.11 early so package maintainers can pick up those fixes.

@martin-frbg martin-frbg added this to the 0.3.11 milestone Oct 6, 2020