Incorrect results when loading OpenBLAS with OpenMP prior to pytorch on POWER9 #2869
Which OpenBLAS version are you using (the path in the CDLL line suggests 0.3.7, current is 0.3.10)?
I tried using 0.3.10 by compiling it as usual and adapting the LD_LIBRARY_PATH, and double-checked it gets loaded, but the issue persists. However, I could confirm that with 0.3.10 …
That there are still some lapack-test failures is largely a function of the compiler used, and I hope the situation is already better with current …
If I understand https://github.com/xianyi/OpenBLAS/blob/4d36711547e71dbebaded85f9ddd92f9dfbe887a/Makefile.power#L2-L10 correctly, then there is only single-threaded or OpenMP on POWER, so no "plain thread code"? I tried with develop and it is still failing, e.g. Nonsymmetric-Eigenvalue-Problem-EIG/xeigtsts and RFP-linear-equation-routines-LIN/xlintstrfs. For reference I also tried using the numpy functions instead of torch and they exhibit the same faulty behavior, so I assume both numpy and pytorch map straight to OpenBLAS.
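For reference, a sketch of the build variants under discussion; `USE_OPENMP` and `USE_THREAD` are the standard OpenBLAS make flags, and the POWER-specific limitation is as read from `Makefile.power` above (assumptions, not verbatim commands from this thread):

```shell
# OpenMP threading build (the failing configuration in this issue)
make USE_OPENMP=1

# Single-threaded build (reported to pass)
make USE_THREAD=0

# A "plain pthreads" build would normally be USE_THREAD=1 USE_OPENMP=0,
# which Makefile.power does not appear to offer on POWER.
```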
Reviewing the pytorch issue ticket, this appears to be a very old bug; a pity that it apparently never surfaced here (not that I need it now though...)
I failed to reproduce without torch. As mentioned, the mystery to me is that loading OpenBLAS before torch triggers the bug, but loading it afterwards (a no-op, as it is already loaded by torch) does not. Easiest to test by importing numpy first or not. And even for torch: I manually loaded all libraries that torch loads until only torch libs remained (checked using strace), then loaded OpenBLAS, then torch, and it still failed — so it is torch itself, not any other lib. Currently I have another lead where some minor difference in the make command leads to 0.3.10 passing; trying to isolate and reproduce.
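A minimal sketch of this kind of load-order bisection (not the exact harness from this thread): run the numerical check in a fresh interpreter so the library load order can be controlled per run. The library name `libopenblas.so.0` and the 127×127 check are illustrative assumptions.

```python
import os
import subprocess
import sys
import textwrap

# The check runs in a subprocess so each run starts with a clean set of
# loaded libraries; PRELOAD_BLAS controls whether OpenBLAS is dlopen'd
# before the suspect import.
CHECK = textwrap.dedent("""
    import ctypes, os
    if os.environ.get("PRELOAD_BLAS") == "1":
        try:
            # ctypes.CDLL is essentially a dlopen() of the shared library
            ctypes.CDLL("libopenblas.so.0", mode=ctypes.RTLD_GLOBAL)
        except OSError:
            pass  # library not installed here; preload is skipped
    import numpy as np
    a = np.ones((127, 127), dtype=np.float32)
    print(float((a @ a)[0, 0]))  # each entry is a sum of 127 ones
""")

def run_check(preload: bool) -> str:
    env = dict(os.environ, PRELOAD_BLAS="1" if preload else "0")
    out = subprocess.run([sys.executable, "-c", CHECK],
                         capture_output=True, text=True, env=env, check=True)
    return out.stdout.strip()

print("without preload:", run_check(False))
print("with preload:   ", run_check(True))
```

On a healthy system both runs print `127.0`; on the affected POWER9 setup the preloaded run would be the one producing wrong values.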
I'm afraid I've moved on to other responsibilities since updating that PyTorch issue, but some recollections from when I was looking at it....
A couple other thoughts, probably unrelated, but for completeness:
@hartb Thanks for the notes, I'll add my findings:
I experimented a bit and found that setting … Oh, my... Just tested the fork theory and it really is able to trigger the issue:
When running this with …
IIRC (and probably unhelpful) the fork problems are a feature of the GNU libgomp, not shared by the LLVM libomp(?)
FWIW I tried to build 0.3.10 with clang and it failed due to some POWER flags not known to clang...
Ouch... It seems the issue is fixed by #2765
Ah; good find!
Sorry, my brain is mush from too many competing issues (not all OpenBLAS-related) right now. I knew (felt, rather?) there was something possibly relevant in the recent fixes but had not gotten round to refreshing my memory.
I can confirm that this is indeed the issue, but I am not sure why it depends on forks and NUM_THREADS. I built OpenBLAS with …
On the system with 176 cores/threads and 0.3.10 without the patch I get many messages; after applying the patch I get none. Without the patch and NUM_THREADS=174 I also get none, and with NUM_THREADS=175 I get a single one.
I found the issue: a fork clears the per-thread memory buffers that are created at application init when using OpenMP. So on every BLAS invocation all threads race to allocate memory, which in turn exposes the missing sync barrier and degrades performance. Fixed in #2876. I verified that after fixing this only a single thread is making calls to …
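The fork interaction above rests on standard POSIX semantics: `fork()` duplicates only the calling thread, so per-thread state owned by the parent's worker threads simply does not exist in the child and must be rebuilt. A minimal illustration using plain Python threads as a stand-in for an OpenMP team (an analogy, not OpenBLAS code):

```python
import os
import threading
import time

stop = threading.Event()

def worker():
    # Idle worker standing in for an OpenMP team member holding
    # per-thread state (e.g. a pre-allocated memory buffer).
    while not stop.is_set():
        time.sleep(0.01)

threads = [threading.Thread(target=worker, daemon=True) for _ in range(4)]
for t in threads:
    t.start()

pid = os.fork()
if pid == 0:
    # Child process: the four workers were not duplicated by fork(),
    # so only the main thread exists here.
    print("child sees", threading.active_count(), "thread(s)")
    os._exit(0)

os.waitpid(pid, 0)
print("parent sees", threading.active_count(), "thread(s)")  # main + workers
stop.set()
for t in threads:
    t.join()
```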
Given that there are now two quite serious issues (one specific to POWER and one performance issue affecting all OpenMP builds), I'd suggest releasing 0.3.11 earlier so package maintainers can pick those fixes up.
For context: we have a POWER9 cluster with big GPUs in our HPC system and are compiling everything from source via EasyBuild. One such application is PyTorch, which used to show faulty behavior until a recent change that I tracked down to no longer importing `numpy` prior to `torch`. But the behavior reappears when doing that manually.

In short: loading OpenBLAS prior to `import torch` leads to miscalculations. The load can happen either via `import numpy`, which depends on OpenBLAS, or via `ctypes.CDLL`, which basically `dlopen`s the library.

The test case I'm using is from e.g. #1191 (comment) / pytorch/pytorch#3716 (comment), hence ccing @hartb.

I see failures consistently already at the 127 sizes. As mentioned, compiling OpenBLAS without OpenMP makes this pass; not loading OpenBLAS (either via `ctypes.CDLL` or via numpy) prior to torch also works, and setting `OMP_NUM_THREADS=1` works too. Reducing it to e.g. 2 or 4 makes the failures less frequent; in that case they appear more often at the higher sizes.
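A hedged sketch of the kind of correctness check used in the linked issues: compare a float32 matrix multiply (dispatched to the loaded BLAS's sgemm) against a float64 reference computed without BLAS. The sizes and tolerance here are illustrative assumptions, not the exact script from those tickets.

```python
import numpy as np

def check_matmul(n: int, tol: float = 1e-2) -> bool:
    rng = np.random.default_rng(0)
    a = rng.standard_normal((n, n)).astype(np.float32)
    b = rng.standard_normal((n, n)).astype(np.float32)
    got = a @ b  # goes through the loaded BLAS (sgemm)
    # einsum with the default optimize=False does not call into BLAS,
    # so it serves as an independent (if slow) reference.
    ref = np.einsum("ij,jk->ik", a.astype(np.float64), b.astype(np.float64))
    return bool(np.allclose(got, ref, atol=tol))

# 127/128 mirror the sizes where the thread reports consistent failures.
for size in (64, 127, 128, 256):
    print(size, "OK" if check_matmul(size) else "FAIL")
```

On an unaffected build every size prints `OK`; on the broken OpenMP configuration the 127-sized case would be among the first to report `FAIL`.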