Docbuild segfaults when pari is compiled with threading #26608
Comments
comment:1
I think there's a little bit of misinformation / misconception here. There's nothing about Sage's docbuild program that uses multi-threading. It uses a process pool and builds each sub-document in a separate process. (There are some cases where it does not run builds in subprocesses when it probably should, and I think that is contributing somewhat to the explosion of memory usage in the docbuild, but that's a separate issue.)
comment:2
I don't know if PARI uses openblas in its multi-threaded mode, but I wonder if this is related to #26585.
comment:3
Note that the code excerpts in the last lines of the backtrace are nonsense since I was compiling an older version of the docs. Here's the "translated" version:
comment:5
Replying to @embray:
I'll test if that openblas patch fixes it. Very interesting ticket; I wonder if that also causes #26130 (I've heard darwin is somewhat more prone to threading bugs).
comment:6
I'm not sure if it does, but it might. I grepped the pari/gp source and it doesn't use
comment:7
There's also some multi-threading support in FLINT which could be problematic, but I have no idea if that's relevant in this case.
comment:8
I read in the mailing list post: "It is called indirectly via matplotlib when rendering plots, see full backtrace below (btw, I had to downgrade to an old version of Sage to get a meaningful backtrace - I really dislike this trend of hiding build output, it makes it very hard to debug stuff)". I had a very similar problem to this; it actually came from the BLAS library by way of a Numpy ufunc (I think for the "dot" product of a matrix and a vector, or two matrices). I feel like I actually fixed this but now I can't remember.
comment:9
Do you know some direct, specific way to reproduce this so that I can try it?
comment:10
Replying to @timokau:
I don't think it's related, because this only showed up when we patched fflas-ffpack to allow configuring the number of threads to use with openblas (by default it just sets it to 1). But conceivably there's a similar bug elsewhere. Possibly related to fork(). I have found many bugs in different projects related to threads/fork interaction.
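For anyone following along, here is a minimal Python sketch of the threads/fork interaction mentioned above. It assumes a POSIX system where os.fork() is available, and the helper name count_threads_after_fork is made up for illustration. It starts one background thread, forks, and reports how many threads each side sees. The child only inherits the thread that called fork(), so any worker threads a library relies on are simply absent there.

```python
import os
import threading


def count_threads_after_fork():
    """Return (threads_in_parent, threads_in_child) around an os.fork()."""
    stop = threading.Event()
    # Background thread that just waits; it stands in for a library's worker.
    worker = threading.Thread(target=stop.wait, daemon=True)
    worker.start()
    parent_count = threading.active_count()

    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child process: only the thread that called fork() exists here.
        os.close(read_fd)
        os.write(write_fd, str(threading.active_count()).encode())
        os._exit(0)

    # Parent process: read the child's answer, then clean up.
    os.close(write_fd)
    with os.fdopen(read_fd) as pipe:
        child_count = int(pipe.read())
    os.waitpid(pid, 0)
    stop.set()
    worker.join()
    return parent_count, child_count
```

Recent Python versions may emit a DeprecationWarning when a multi-threaded process forks, which is the same hazard seen from the Python side.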
comment:11
You know what though--I'm looking at the relevant code in openblas, and
comment:12
Replying to @embray:
I don't think that PARI uses BLAS in any way.
comment:14
Replying to @embray:
I'm also betting on this. PARI might set up some data structures related to threading (when compiled with threading support) which are invalid when running in a forked child process.
comment:15
Never mind; that does not appear to be the case, I don't think.
comment:16
Replying to @jdemeyer:
Yes, I think you must be right. PARI has its own thread management, and it does not implement any pthread_atfork handler that I can find, which is strong cause to suspect it...
comment:17
ISTM PARI/GP is not even built with multi-threading enabled unless you run its
comment:18
Now if I build PARI with a) Anyone having this problem is building PARI with b) Exactly what code is being run in Sage that invokes multi-threading in PARI.
comment:19
Replying to @embray:
Yes, that is why this issue came up on sage-packaging. Some distros ship pari with threading enabled. Sage does not. My effort in #26002 was not to change that, but to make sage compatible with a system pari.
comment:20
I can also make it deadlock with the right combination of evil calls.
comment:21
Even when you explicitly set
comment:22
I don't know. I've been sick for the last week so I completely forget exactly where I left this. Nevertheless, now I know that unless I compile pari with
comment:23
Okay, glad you're feeling better :) Let me know if I can help.
comment:24
Totally forgot about this...
comment:25
Too bad, I also forgot about this. I'm literally right now returning from a week-long PARI/GP workshop where I could have discussed this.
comment:26
The alternative multiprocess doc build introduced in #27490 works as a temporary workaround. I just replaced the revert with a one-line patch to enable it unconditionally for 8.7.
comment:27
Removing most of the rest of my open tickets out of the 8.7 milestone, which should be closed.
comment:28
See also #28242, where we enable system PARI in vanilla Sage.
comment:29
Replying to @embray:
I wish I knew what I meant by this, because I want to investigate this again but I don't have a clear way yet to even reproduce the issue, and I can't find whatever example code this might be referring to... :(
comment:30
Some PARI notes regarding threading:
comment:31
and Sage installs
Well, I can use this on #28242 to test whether we got single-threaded libpari.
comment:32
Managed to reproduce the problem again by doing a parallel docbuild after building PARI with Unfortunately the
diff --git a/src/sage_setup/docbuild/__init__.py b/src/sage_setup/docbuild/__init__.py
index e406bca..4912b5c 100644
--- a/src/sage_setup/docbuild/__init__.py
+++ b/src/sage_setup/docbuild/__init__.py
@@ -49,6 +49,7 @@ import shutil
 import subprocess
 import sys
 import time
+import traceback
 import warnings

 logger = logging.getLogger(__name__)
@@ -136,6 +137,8 @@ def builder_helper(type):
             if ABORT_ON_ERROR:
                 raise
         except BaseException as e:
+            exc_type, exc_value, exc_traceback = sys.exc_info()
+            traceback.print_tb(exc_traceback)
             # We need to wrap a BaseException that is not an Exception in a
             # regular Exception. Otherwise multiprocessing.Pool.get hangs, see
             # #25161
With that, I get a long traceback (much of which is stuff in Sphinx that isn't interesting). But as previously reported--and unsurprisingly--the segfault originates from some code for a plot:
Here it's probably building the reference docs for It looks like it's not actually reaching the
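As a standalone illustration of what the patch above is doing, here is a small sketch that captures the traceback of a non-Exception error before wrapping it in a regular Exception. The function name run_wrapped and the message text are made up for this example; only the general pattern (record the traceback, then re-raise as Exception so a pool does not hang) is taken from the diff and its comment.

```python
import traceback


def run_wrapped(task):
    """Run task(); re-raise any non-Exception BaseException as an Exception."""
    try:
        return task()
    except Exception:
        # Ordinary exceptions already propagate fine through a process pool.
        raise
    except BaseException as e:
        # Capture the traceback text now; it would be lost after wrapping.
        details = traceback.format_exc()
        raise Exception(
            "Non-exception during build: %r\n%s" % (e, details)
        ) from e
```

Calling run_wrapped on a task that raises KeyboardInterrupt then yields a plain Exception whose message carries the original traceback text.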
comment:33
The
comment:34
If #26002 has any effect, the pool should have just a single thread, no?
comment:35
In further digging, yes, I just found #26002 and that indeed pari has
comment:36
Hmm, maybe it still does something with thread-local variables?
comment:37
On a wild guess, I tried switching the docbuild to use my The major difference is that in Somehow this alone is enough to leave some structures in PARI in a bad state, and only if it was built with multithreading support in the first place (apparently).
comment:38
Going back to comment:30, if The only effect of
So variables in PARI declared
comment:39
Replying to @embray:
This is matching what I observed on #28242 (and the workaround I added to that branch) --
comment:40
Replying to @embray:
Yup. That's really all it is. PARI has tons of global variables declared This isn't so unusual in PARI's case. It assumes that it's the only one that will be starting new threads that it manages. It does not assume it will ever be used in someone else's multi-threaded application.
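A small Python sketch of the failure mode described here, assuming the elided declarations refer to thread-local storage as guessed in comment:36. The names _tls, init_current_thread and stack_is_initialized are hypothetical and only mimic the pattern: a library initializes its thread-local globals in the threads it knows about, and a thread started by someone else sees them uninitialized.

```python
import threading

# Stand-in for a thread-local global inside a C library (hypothetical).
_tls = threading.local()


def init_current_thread():
    # What the library's own start-up code would do for threads it manages.
    _tls.stack = []


def stack_is_initialized():
    return hasattr(_tls, "stack")


def check_in_foreign_thread():
    """Ask a thread the library never initialized whether its state exists."""
    result = {}
    thread = threading.Thread(
        target=lambda: result.update(ok=stack_is_initialized())
    )
    thread.start()
    thread.join()
    return result["ok"]


# The library initializes only the thread that loaded it.
init_current_thread()
```

In this sketch stack_is_initialized() is true in the initializing thread, while check_in_foreign_thread() reports false. In C the equivalent would be reading an uninitialized pointer rather than getting a clean False.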
comment:41
#28356 proposes a workaround for this issue. It won't solve the issue in general (PARI is not safe to use in arbitrary multi-threaded code), but at least it won't crash when building the docs.
comment:42
Did somebody check that this problem is limited to docbuilds? Does the Sage testsuite pass (apart from doc-related tests of course)?
comment:43
Replying to @jdemeyer:
Yes, there are no test suite failures related to this.
comment:44
Replying to @jdemeyer:
In principle it's not just limited to docbuilds: The broader problem, for which I would like to find a better resolution, is that the cypari2 As it is, Sage doesn't use threads for much of anything, whether in the tests, or in general, so the problem arises primarily in the docbuild. But this can cause problems for anyone who carelessly tries to use !* If they do anything in those threads that happens to use PARI. |
This ticket is a followup to this sage-packaging discussion. To summarize:
sage does not work together with pari's threading. Instead of relying on it being compiled without threading, I made use of the "nthreads" option to disable threading at runtime in #26002.
However, since #24655 (unconditionally enabling threaded docbuild), the docbuild segfaults when pari is compiled with threading support. Apparently sage somehow uses pari while ignoring the nthreads option. We get the following backtrace (provided by Antonio with an older version of sage):
That shows us that src/sage/matrix/matrix_integer_dense.pyx is involved. Apparently that file directly uses cypari c-bindings instead of the libs/pari.py interface (where the nthreads option is added). For example:
Can someone more familiar with cython and cypari tell if the options defined in libs/pari.py would apply here? Why isn't libs/pari.py used?
CC: @antonio-rojas @jdemeyer @kiwifb @dimpase @saraedum @embray
Component: documentation
Keywords: docbuild, pari
Issue created by migration from https://trac.sagemath.org/ticket/26608