Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Segfaults in cypari2 when running out of stack during deep object deallocation #27267

Closed
embray opened this issue Feb 12, 2019 · 60 comments
Closed

Comments

@embray
Copy link
Contributor

embray commented Feb 12, 2019

Since cypari2-2 (#26442), the __dealloc__ for Gen may cause deep (but finite) recursions, causing a stack overflow. This happens in particular on Cygwin which has a smallish stack by default (and maybe it also uses more stack space?) but it can be provoked easily on other systems too with

pari.allocatemem(2^28); L = [pari(i) for i in range(2^20)]; x = pari.Pi(); del L; del x

For reference, this is the test failure on Cygwin:

sage -t --long src/sage/rings/number_field/totallyreal.pyx
**********************************************************************
File "src/sage/rings/number_field/totallyreal.pyx", line 243, in sage.rings.number_field.totallyreal.?
Failed example:
    len(enumerate_totallyreal_fields_prim(2,2**15)) # long time
Exception raised:
    Traceback (most recent call last):
      File "/home/embray/src/sagemath/sage/local/lib/python2.7/site-packages/sage/doctest/forker.py", line 671, in _run
        self.compile_and_execute(example, compiler, test.globs)
      File "/home/embray/src/sagemath/sage/local/lib/python2.7/site-packages/sage/doctest/forker.py", line 1095, in compile_and_execute
        exec(compiled, globs)
      File "<doctest sage.rings.number_field.totallyreal.?[3]>", line 1, in <module>
        len(enumerate_totallyreal_fields_prim(Integer(2),Integer(2)**Integer(15))) # long time
      File "sage/misc/lazy_import.pyx", line 354, in sage.misc.lazy_import.LazyImport.__call__ (build/cythonized/sage/misc/lazy_import.c:3683)
        return self.get_object()(*args, **kwds)
      File "sage/rings/number_field/totallyreal.pyx", line 393, in sage.rings.number_field.totallyreal.enumerate_totallyreal_fields_prim (build/cythonized/sage/rings/number_field/totallyreal.c:5753)
        ng = <pari_gen>((<pari_gen>(pari([nf,zk]))).polredabs())
      File "cypari2/auto_gen.pxi", line 22098, in cypari2.gen.Gen_base.polredabs
      File "cypari2/stack.pyx", line 156, in cypari2.stack.new_gen
      File "cypari2/stack.pyx", line 194, in cypari2.stack.new_gen_noclear
      File "cypari2/stack.pyx", line 128, in cypari2.stack.move_gens_to_heap
    SignalError: Segmentation fault

It turns out that CPython's "trashcan" mechanism was precisely designed to solve this problem. So we added support upstream in Cython for @cython.trashcan and then use that in cypari2. While we're at it, we upgrade to the latest Cython release

Tarball: https://files.pythonhosted.org/packages/e0/31/4a166556f92c469d8291d4b03a187f325c773c330fffc1e798bf83d947f2/Cython-0.29.5.tar.gz

Upstream tickets:

Upstream: Fixed upstream, but not in a stable release.

CC: @jdemeyer

Component: packages: standard

Keywords: upgrade, cypari2

Author: Jeroen Demeyer

Branch/Commit: b6ea17e

Reviewer: Erik Bray

Issue created by migration from https://trac.sagemath.org/ticket/27267

@embray embray added this to the sage-8.7 milestone Feb 12, 2019
@embray
Copy link
Contributor Author

embray commented Feb 12, 2019

comment:1

I have no further information to add at this point, but CC-ing you in case anything about this is obvious to you.

@jdemeyer
Copy link

comment:2

Are these problems reproducible? Do they occur when running the tests individually?

It would also be good to know the precise value of the pari stack size (pari.default("parisizemax")). Does changing the value of sizemax in src/sage/libs/pari/__init__.py affect the crashes?

@embray
Copy link
Contributor Author

embray commented Feb 13, 2019

comment:3

It's completely reproducible and deterministic, at least when running the test suite. I haven't tried putting together a simpler way to reproduce it. But it was definitely introduced by #26442, and it's caused by an infinite recursion, apparently, in remove_from_pari_stack():

#17155 0x0000000398b9e477 in __pyx_pw_4sage_4misc_11lazy_import_10LazyImport_17__call__ ()
   from /home/embray/src/sagemath/sage/local/lib/python2.7/site-packages/sage/misc/lazy_import.dll
(gdb)
#17154 0x0000000398b915fd in __Pyx_PyObject_Call ()
   from /home/embray/src/sagemath/sage/local/lib/python2.7/site-packages/sage/misc/lazy_import.dll
(gdb)
#17153 0x00000003933faf4a in __pyx_pw_4sage_5rings_12number_field_11totallyreal_3enumerate_totallyreal_fields_prim ()
   from /home/embray/src/sagemath/sage/local/lib/python2.7/site-packages/sage/rings/number_field/totallyreal.dll
(gdb)
#17152 0x00000003933e47d5 in __Pyx_PyObject_CallOneArg ()
   from /home/embray/src/sagemath/sage/local/lib/python2.7/site-packages/sage/rings/number_field/totallyreal.dll
(gdb)
#17151 0x00000003933e2a36 in __Pyx__PyObject_CallOneArg ()
   from /home/embray/src/sagemath/sage/local/lib/python2.7/site-packages/sage/rings/number_field/totallyreal.dll
(gdb)
#17150 0x000000039881394c in __Pyx_CyFunction_CallAsMethod ()
   from /home/embray/src/sagemath/sage/local/lib/python2.7/site-packages/sage/misc/persist.dll
(gdb)
#17149 0x00000003b277220e in __pyx_pw_7cypari2_3gen_8Gen_base_1255polredabs ()
   from /home/embray/src/sagemath/sage/local/lib/python2.7/site-packages/cypari2/gen.dll
(gdb)
#17148 0x00000003b26a2eb9 in __pyx_pf_7cypari2_3gen_8Gen_base_1254polredabs ()
   from /home/embray/src/sagemath/sage/local/lib/python2.7/site-packages/cypari2/gen.dll
(gdb)
#17147 0x00000003b1c69540 in __pyx_f_7cypari2_5stack_new_gen ()
   from /home/embray/src/sagemath/sage/local/lib/python2.7/site-packages/cypari2/stack.dll
#17147 0x00000003b1c69540 in __pyx_f_7cypari2_5stack_new_gen ()
   from /home/embray/src/sagemath/sage/local/lib/python2.7/site-packages/cypari2/stack.dll
(gdb) 
#17146 0x00000003b1c68871 in __pyx_f_7cypari2_5stack_new_gen_noclear ()
   from /home/embray/src/sagemath/sage/local/lib/python2.7/site-packages/cypari2/stack.dll
(gdb) 
#17145 0x00000003b1c67dc8 in __pyx_f_7cypari2_5stack_move_gens_to_heap ()
   from /home/embray/src/sagemath/sage/local/lib/python2.7/site-packages/cypari2/stack.dll
(gdb) 
#17144 0x00000003b1c6693d in __pyx_f_7cypari2_5stack_remove_from_pari_stack ()
   from /home/embray/src/sagemath/sage/local/lib/python2.7/site-packages/cypari2/stack.dll
(gdb) 
#17143 0x00000003b2692095 in __pyx_tp_dealloc_7cypari2_3gen_Gen ()
   from /home/embray/src/sagemath/sage/local/lib/python2.7/site-packages/cypari2/gen.dll
(gdb) 
#17142 0x00000003b1c6693d in __pyx_f_7cypari2_5stack_remove_from_pari_stack ()
   from /home/embray/src/sagemath/sage/local/lib/python2.7/site-packages/cypari2/stack.dll
(gdb) 
#17141 0x00000003b2692095 in __pyx_tp_dealloc_7cypari2_3gen_Gen ()
   from /home/embray/src/sagemath/sage/local/lib/python2.7/site-packages/cypari2/gen.dll
...

@jdemeyer
Copy link

comment:4

Would it be possible to check if it's an infinite recursion or just a very deep recursion, for example by increasing the process stack size (if this is possible on Cygwin)? GC is known to introduce deep-but-finite recursions (the Python trashcan mechanism was invented to deal with that).

@jdemeyer jdemeyer changed the title Cygwin: Segfaults in cypari2 and/or PARI Cygwin: Segfaults in cypari2 Feb 13, 2019
@jdemeyer
Copy link

comment:6

Confirmed on Linux with ulimit -s 1000 (1MB stack size).

@jdemeyer
Copy link

Upstream: Reported upstream. Developers acknowledge bug.

@jdemeyer
Copy link

comment:8

Digging deep in the CPython trashcan mechanism...

@jdemeyer
Copy link

comment:9

Ideally, this would be fixed by Cython upstream in cython/cython#2842

@embray
Copy link
Contributor Author

embray commented Feb 18, 2019

comment:10

This all seems interesting and I want to take a closer look. But fixing this issue obviously cannot rely, at least in the short term, on both a patch to CPython and a new feature in Cython. That begins to feel like there must be a simpler approach by just reworking how cypari2 is handling these deallocations.

@jdemeyer
Copy link

comment:11

Just for the record: the fix does not depend on the CPython patch. The CPython patch is meant to fix an issue in the trashcan when subclassing is involved. But the trashcan would work without that patch.

@jdemeyer
Copy link

comment:12

And yes, this could be fixed without any Cython patch but I'd rather not bother.

@embray
Copy link
Contributor Author

embray commented Feb 18, 2019

comment:13

Regardless, it's quite an integration hassle to depend on a new feature in Cython which does not exist yet, and a new release of Cython, in order to build this in such a way that it does not easily cause stack overflows. I'd rather revert the upgrade to cypari2 until this can be resolved.

@embray
Copy link
Contributor Author

embray commented Feb 18, 2019

comment:14

Replying to @jdemeyer:

Just for the record: the fix does not depend on the CPython patch. The CPython patch is meant to fix an issue in the trashcan when subclassing is involved. But the trashcan would work without that patch.

It depends on what you mean by "subclassing is involved". In this case subclassing is involved, as Gen is a subclass of Gen_base. But maybe this is not what you mean, as there is no subclassing at the Python level (currently), and Gen_base does not implement a custom tp_dealloc, nor use the "trashcan" mechanism. So I don't think the issue is immediately in effect here, but I wanted to be sure about that. It's not obvious.

@jdemeyer
Copy link

comment:15

Would it be possible (as interim work-around) to run Sage with a larger stack size (ulimit -s 8192 for example)?

@jdemeyer
Copy link

comment:16

With "subclassing is involved", I mean more precisely a class inheriting from a base class, where the base class uses the trashcan. This is broken in CPython (for example, for subclasses of list) but that bug is unrelated to this ticket here.

@embray
Copy link
Contributor Author

embray commented Feb 18, 2019

comment:17

I did a build of Python 2 with a 16 MB stack for the Python executable, and it fixed some of the segfaulting tests, but not others. E.g.

$ ./sage -t --long src/sage/libs/libecm.pyx
too many failed tests, not using stored timings
Running doctests with ID 2019-02-18-14-30-03-ccffc1e9.
Git branch: HEAD
Using --optional=dochtml,mpir,python2,sage
Doctesting 1 file.
sage -t --long src/sage/libs/libecm.pyx
**********************************************************************
File "src/sage/libs/libecm.pyx", line 111, in sage.libs.libecm.ecmfactor
Failed example:
    N.is_prime()
Exception raised:
    Traceback (most recent call last):
      File "/home/embray/src/sagemath/sage/local/lib/python2.7/site-packages/sage/doctest/forker.py", line 671, in _run
        self.compile_and_execute(example, compiler, test.globs)
      File "/home/embray/src/sagemath/sage/local/lib/python2.7/site-packages/sage/doctest/forker.py", line 1086, in compile_and_execute
        exec(compiled, globs)
      File "<doctest sage.libs.libecm.ecmfactor[8]>", line 1, in <module>
        N.is_prime()
      File "sage/rings/integer.pyx", line 5129, in sage.rings.integer.Integer.is_prime (build/cythonized/sage/rings/integer.c:32101)
        return self.__pari__().isprime()
      File "cypari2/gen.pyx", line 2077, in cypari2.gen.Gen.isprime
    SignalError: Segmentation fault
**********************************************************************
1 item had failures:
   1 of  21 in sage.libs.libecm.ecmfactor
    [28 tests, 1 failure, 1.53 s]

Among others. What should the stack size be so that this doesn't blow up? 16MB seems like plenty...

@embray
Copy link
Contributor Author

embray commented Feb 18, 2019

comment:18

Replying to @jdemeyer:

With "subclassing is involved", I mean more precisely a class inheriting from a base class, where the base class uses the trashcan. This is broken in CPython (for example, for subclasses of list) but that bug is unrelated to this ticket here.

I see that now. Would a class that uses @cython.trashcan have the same problem if it were subclassed? Or does your implementation take care of working around that bug?

@embray
Copy link
Contributor Author

embray commented Feb 18, 2019

comment:19

Replying to @jdemeyer:

Would it be possible (as interim work-around) to run Sage with a larger stack size (ulimit -s 8192 for example)?

Setting ulimit -s doesn't work on Cygwin: You have to actually link executables with a fixed maximum stack size, which is a field in the binary's headers. In this case it should suffice to link the python interpreter executable with a large enough stack that it at least doesn't crash on the tests. That would be acceptable to me as a temporary workaround. But I've gone now up to 32 MB and it's still crashing. I wonder if there is something else at play here.

@embray
Copy link
Contributor Author

embray commented Feb 18, 2019

comment:20

Changed title since this does not just affect Cygwin.

@embray embray changed the title Cygwin: Segfaults in cypari2 Segfaults in cypari2 when running out of stack during deep object deallocation Feb 18, 2019
@jdemeyer
Copy link

comment:21

On Linux (at least the laptop that I'm writing this on), the default stack size is 8MB. So either programs under Cygwin use much more stack or there is indeed something else.

@jdemeyer
Copy link

comment:22

Replying to @embray:

Or does your implementation take care of working around that bug?

The current Cython PR uses the same code that I proposed to CPython. So yes, it works around that bug.

@jdemeyer
Copy link

comment:23

The Cython PR is merged.

@jdemeyer
Copy link

Changed upstream from Reported upstream. Developers acknowledge bug. to Fixed upstream, but not in a stable release.

@jdemeyer

This comment has been minimized.

@jdemeyer
Copy link

comment:39

Replying to @jdemeyer:

You could try pari(2^127 - 1).isprime() in Sage and see whether that crashes.

What about pari.isprime(2^127 - 1)?

@jdemeyer
Copy link

comment:40

And also try pari("isprime(2^127 - 1)")

@jdemeyer
Copy link

comment:41

And pari("2^127 - 1").isprime()

@jdemeyer
Copy link

Author: Jeroen Demeyer

@jdemeyer

This comment has been minimized.

@jdemeyer
Copy link

@jdemeyer
Copy link

Commit: b6ea17e

@jdemeyer
Copy link

New commits:

4569a83Cython 0.29.5; add trashcan patch
b6ea17ecypari2: use trashcan for Gen

@embray
Copy link
Contributor Author

embray commented Feb 20, 2019

comment:45

I'm a little confused by the attached tarball. Is a Cython upgrade needed for this, or just applying the patch?

@jdemeyer
Copy link

comment:46

Replying to @embray:

Is a Cython upgrade needed for this

No, only the patch is needed. But I thought that it would make sense to upgrade from 0.29.2 to 0.29.5 while we're updating Cython anyway.

@embray
Copy link
Contributor Author

embray commented Feb 21, 2019

comment:47

I've confirmed to my satisfaction that this does fix the recursion depth issue with deallocations. The isprime segfault persists, but appears to be orthogonal so I'll open a separate ticket for that.

I just want to do one run of the full test suite to ensure that no other new failures occur, but otherwise this LGTM.

@embray
Copy link
Contributor Author

embray commented Feb 21, 2019

comment:48

Opened #27335 for the Gen.isprime bug.

@embray
Copy link
Contributor Author

embray commented Feb 25, 2019

comment:49

I just upgraded my Linux build to 8.7.beta5 and now I'm getting this crash in tons of places, even with

$ ulimit -s
8192

without even having to try. I will try rebasing this ticket on 8.7.beta5 and resuming...

I'm pretty satisfied now that it works on Cygwin as well.

@slel
Copy link
Member

slel commented Feb 25, 2019

Changed keywords from none to upgrade, cypari2

@slel
Copy link
Member

slel commented Feb 25, 2019

comment:50

Adding "upgrade" to keywords so release manager knows to upload tarball.

@slel slel added the c: cython label Feb 25, 2019
@nbruin
Copy link
Contributor

nbruin commented Feb 25, 2019

comment:51

(this comment is only triggered by the mention of cython.trashcan -- your expertise with it here may be useful elsewhere!)

I know of at least one place where cpython's trashcan is necessary in sagelib: for entries in MonoDict and TripleDict in sage/structure/coerce_dict.pyx. There we use extract_mono_cell and extract_triple_cell to move the relevant references to a fresh tuple and then rely on the trashcan mechanism for the deletion of tuples to avoid stack overflow. That means we are incurring the penalty of allocating an extra tuple there for deletion. If cython now exposes a way of participating in the trashcan directly, it may be worth moving that code over as well.

@jdemeyer
Copy link

comment:52

For the sake of supporting distributions, we should wait until Cython 3.0 is actually released before we use @cython.trashcan in the Sage library.

@embray
Copy link
Contributor Author

embray commented Feb 26, 2019

Reviewer: Erik Bray

@embray
Copy link
Contributor Author

embray commented Feb 26, 2019

comment:54

Replying to @jdemeyer:

For the sake of supporting distributions, we should wait until Cython 3.0 is actually released before we use @cython.trashcan in the Sage library.

I agree, but how does that impact cypari2 which is a dependency of Sage anyways?

@jdemeyer
Copy link

comment:55

Replying to @embray:

I agree, but how does that impact cypari2 which is a dependency of Sage anyways?

This ticket adds patches for both cypari2 and Cython. Distros will probably add patches to neither. Both are acceptable.

@vbraun
Copy link
Member

vbraun commented Feb 28, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants