Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crash in clear_weakref() in pydantic test suite, in 3.12.0rc1 and newer #108295

Closed
mgorny opened this issue Aug 22, 2023 · 19 comments
Closed

Crash in clear_weakref() in pydantic test suite, in 3.12.0rc1 and newer #108295

mgorny opened this issue Aug 22, 2023 · 19 comments
Labels
3.12 bugs and security fixes 3.13 bugs and security fixes release-blocker topic-typing type-crash A hard crash of the interpreter, possibly with a core dump

Comments

@mgorny
Copy link
Contributor

mgorny commented Aug 22, 2023

Crash report

CPython versions tested on:

3.12, CPython main branch

Operating systems tested on:

Linux

Output from running 'python -VV' on the command line:

Python 3.12.0rc1+ (heads/3.12:149d70c254, Aug 22 2023, 16:11:47) [GCC 13.2.0]

What happened?

The pydantic test suite started causing pytest to segfault on exit recently, with Python 3.12. I've been able to bisect it to pydantic commit pydantic/pydantic@f5e3aa9 that looks relatively harmless. I've also been able to bisect CPython into commit 58f9c88:

gh-106403: Restore weakref support for TypeVar and friends (GH-106418)

I have been able to reproduce the problem with tip of main as well, though I have to note it's a bit of heisenbug. I've basically had a very lucky day today that I've managed to correctly bisect on the second attempt.

To reproduce:

git clone https://github.com/pydantic/pydantic/
cd pydantic
python -m venv .venv
. .venv/bin/activate
pip install -e . pytest dirty-equals
# may need to be repeated a few times
python -m pytest -o addopts= tests/test_generics.py --deselect tests/test_generics.py::test_partial_specification_name -s

Error messages

Backtrace
Core was generated by `python -m pytest -o addopts= tests/test_generics.py --deselect tests/test_gener'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  clear_weakref (self=self@entry=0x7f0a9a171df0) at Objects/weakrefobject.c:62
62	        if (*list == self)
(gdb) bt
#0  clear_weakref (self=self@entry=0x7f0a9a171df0) at Objects/weakrefobject.c:62
#1  0x0000557c3663d843 in _PyWeakref_ClearRef (self=self@entry=0x7f0a9a171df0) at Objects/weakrefobject.c:102
#2  0x0000557c366ec77e in handle_weakrefs (unreachable=unreachable@entry=0x7ffeec923ad0, 
    old=old@entry=0x557c36a40f68 <_PyRuntime+76552>) at Modules/gcmodule.c:804
#3  0x0000557c366ecad3 in gc_collect_main (tstate=0x557c36a9e7b0 <_PyRuntime+459600>, generation=generation@entry=2, 
    n_collected=n_collected@entry=0x0, n_uncollectable=n_uncollectable@entry=0x0, nofail=nofail@entry=1) at Modules/gcmodule.c:1284
#4  0x0000557c366ed333 in _PyGC_CollectNoFail (tstate=tstate@entry=0x557c36a9e7b0 <_PyRuntime+459600>) at Modules/gcmodule.c:2135
#5  0x0000557c366c39bb in finalize_modules (tstate=tstate@entry=0x557c36a9e7b0 <_PyRuntime+459600>) at Python/pylifecycle.c:1602
#6  0x0000557c366c524e in Py_FinalizeEx () at Python/pylifecycle.c:1863
#7  0x0000557c366eb4f9 in Py_RunMain () at Modules/main.c:691
#8  0x0000557c366eb56e in pymain_main (args=args@entry=0x7ffeec923c10) at Modules/main.c:719
#9  0x0000557c366eb63d in Py_BytesMain (argc=<optimized out>, argv=<optimized out>) at Modules/main.c:743
#10 0x0000557c3655979e in main (argc=<optimized out>, argv=<optimized out>) at ./Programs/python.c:15
(gdb) p self
$1 = (PyWeakReference *) 0x7f0a9a171df0
(gdb) p *self
$2 = {ob_base = {{ob_refcnt = 1, ob_refcnt_split = {1, 0}}, ob_type = 0x557c38bda680}, wr_object = 0x7f0a9a171c70, wr_callback = 0x0, 
  hash = -1, wr_prev = 0x0, wr_next = 0x0, vectorcall = 0x557c3663aa08 <weakref_vectorcall>}
(gdb) p *list
Cannot access memory at address 0xfe15355ba7e0
(gdb) p list
$3 = (PyWeakReference **) 0xfe15355ba7e0

Linked PRs

@mgorny mgorny added the type-crash A hard crash of the interpreter, possibly with a core dump label Aug 22, 2023
@Eclips4
Copy link
Member

Eclips4 commented Aug 22, 2023

cc @JelleZijlstra

@mgorny
Copy link
Contributor Author

mgorny commented Aug 22, 2023

Oh, and this is also filed at pydantic/pydantic#7181. I'm not really 100% sure whether pydantic isn't doing something wrong that got exposed here.

@Eclips4
Copy link
Member

Eclips4 commented Aug 22, 2023

Oh, and this is also filed at pydantic/pydantic#7181. I'm not really 100% sure whether pydantic isn't doing something wrong that got exposed here.

It would be nice if you could provide a pure python code(without pydantic) which crashes =)

@mgorny
Copy link
Contributor Author

mgorny commented Aug 22, 2023

I'm afraid that's as far as I managed (reducing it to one test file). Even with that, there are times when I can't reproduce it a single time — and times when it fails all the time. Right now I'm at the former, so I have no clue how to move forward.

@immolo
Copy link

immolo commented Aug 23, 2023

I've done a few quick tests on a clean system to see what I can find, so far I've found this issue doesn't exist when using 3.12beta4 and below.

I'll add an strace from my system as it seems to failing at the same place for me every time and then see if I can confirm anything more to help solve this.
strace.txt

@mgorny
Copy link
Contributor Author

mgorny commented Aug 24, 2023

@immolo, thanks. I've already pointed out the specific commit (58f9c88) where this starts, and strace is unlikely to help.

@immolo
Copy link

immolo commented Aug 24, 2023

@immolo, thanks. I've already pointed out the specific commit (58f9c88) where this starts, and strace is unlikely to help.

Sorry I worded this poorly @mgorny

I meant this as a I've confirmed what you are seeing post.

@mgorny
Copy link
Contributor Author

mgorny commented Aug 26, 2023

Eh, this bug really sucks. All I've been able to establish so far is that — obviously — it's GC-related. If I decrease GC threshold low enough or disable GC, it stops happening. If I reduce the file too much, it stops happening. And ofc it also randomly stops happening independently of that, and FWICS it's not affected by PYTHONHASHSEED.

So far my only guess is that if something is GC-ed too late, it segfaults.

@JelleZijlstra
Copy link
Member

I suspect we have to call PyObject_ClearWeakRefs(self) in typevar_dealloc after we call _PyObject_GC_UNTRACK. I haven't tried to repro yet, so I'll do that first and then see if this suggested change makes the issue go away.

@mgorny
Copy link
Contributor Author

mgorny commented Aug 26, 2023

In the meantime, I managed to produce a segfault when allocating a large WeakValueDictionary but I'm not sure if it's the same bug:

from typing import TypeVar
from weakref import WeakValueDictionary


vals = WeakValueDictionary()

for x in range(100000):
    vals[x] = TypeVar(15*str(x))

print(len(vals))
$ ./python test.py 
100000
Segmentation fault (core dumped)
Program terminated with signal SIGSEGV, Segmentation fault.
#0  clear_weakref (self=0x7f4b0e522ec0) at Objects/weakrefobject.c:62
62	        if (*list == self)
(gdb) bt
#0  clear_weakref (self=0x7f4b0e522ec0) at Objects/weakrefobject.c:62
#1  weakref_dealloc (self=0x7f4b0e522ec0) at Objects/weakrefobject.c:110
#2  0x000055a89f1942b1 in subtype_dealloc (self=0x7f4b0e522ec0) at Objects/typeobject.c:2051
#3  0x000055a89f154665 in Py_DECREF (op=<optimized out>) at ./Include/object.h:708
#4  Py_XDECREF (op=<optimized out>) at ./Include/object.h:801
#5  free_keys_object (interp=0x55a89f671eb0 <_PyRuntime+76368>, keys=0x7f4b0d163010) at Objects/dictobject.c:681
#6  0x000055a89f15c57e in dictkeys_decref (dk=<optimized out>, interp=<optimized out>) at Objects/dictobject.c:333
#7  dict_dealloc (mp=0x7f4b0f5c5c80) at Objects/dictobject.c:2374
#8  0x000055a89f15b628 in Py_DECREF (op=<optimized out>) at ./Include/object.h:708
#9  Py_XDECREF (op=<optimized out>) at ./Include/object.h:801
#10 _PyObject_FreeInstanceAttributes (self=<optimized out>) at Objects/dictobject.c:5571
#11 0x000055a89f194394 in subtype_dealloc (self=0x7f4b0f102210) at Objects/typeobject.c:2017
#12 0x000055a89f15465d in Py_DECREF (op=<optimized out>) at ./Include/object.h:708
#13 Py_XDECREF (op=<optimized out>) at ./Include/object.h:801
#14 free_keys_object (interp=0x55a89f671eb0 <_PyRuntime+76368>, keys=0x7f4b0e52c800) at Objects/dictobject.c:673
#15 0x000055a89f15c57e in dictkeys_decref (dk=<optimized out>, interp=<optimized out>) at Objects/dictobject.c:333
#16 dict_dealloc (mp=0x7f4b0f5c5bc0) at Objects/dictobject.c:2374
#17 0x000055a89f170ab5 in Py_DECREF (op=<optimized out>) at ./Include/object.h:708
#18 Py_XDECREF (op=<optimized out>) at ./Include/object.h:801
#19 module_dealloc (m=0x7f4b0f593380) at Objects/moduleobject.c:721
#20 0x000055a89f15d9b8 in Py_DECREF (op=0x7f4b0f593380) at ./Include/object.h:708
#21 Py_XDECREF (op=<optimized out>) at ./Include/object.h:801
#22 insertdict (interp=0x55a89f671eb0 <_PyRuntime+76368>, mp=0x7f4b0f558200, key=0x7f4b0f5c5b70, hash=<optimized out>, 
    value=0x55a89f57c600 <_Py_NoneStruct>) at Objects/dictobject.c:1319
#23 0x000055a89f0d6a08 in PyObject_SetItem (o=o@entry=0x7f4b0f558200, key=<optimized out>, value=0x55a89f57c600 <_Py_NoneStruct>)
    at Objects/abstract.c:212
#24 0x000055a89f2b0b28 in finalize_remove_modules (verbose=0, modules=0x7f4b0f558200) at Python/pylifecycle.c:1416
#25 finalize_modules (tstate=tstate@entry=0x55a89f6cf7b0 <_PyRuntime+459600>) at Python/pylifecycle.c:1552
#26 0x000055a89f2b1327 in Py_FinalizeEx () at Python/pylifecycle.c:1863
#27 0x000055a89f2b4415 in Py_FinalizeEx () at Python/pylifecycle.c:1981
#28 0x000055a89f2ec95f in Py_RunMain () at Modules/main.c:691
#29 pymain_main (args=0x7ffdee78eca0) at Modules/main.c:719
#30 Py_BytesMain (argc=<optimized out>, argv=<optimized out>) at Modules/main.c:743
#31 0x00007f4b0f223f0a in __libc_start_call_main (main=main@entry=0x55a89f085800 <main>, argc=argc@entry=2, 
    argv=argv@entry=0x7ffdee78ee28) at ../sysdeps/nptl/libc_start_call_main.h:58
#32 0x00007f4b0f223fc5 in __libc_start_main_impl (main=0x55a89f085800 <main>, argc=2, argv=0x7ffdee78ee28, init=<optimized out>, 
    fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffdee78ee18) at ../csu/libc-start.c:360
#33 0x000055a89f0963b1 in _start ()

@JelleZijlstra
Copy link
Member

That one doesn't repro for me with a freshly built main:

% ./python.exe wvd.py
0

@JelleZijlstra
Copy link
Member

Oh wait, I had already applied my suggested patch when I tried your repro. Without the patch, it does segfault, which strongly suggests that my patch is right. Thanks for the much smaller repro case!

The repro from the original post doesn't work on main because pydantic-core doesn't build there. I'll build 3.12 when I get a chance, but in the meantime I'll send a PR for the PyObject_ClearWeakRefs change.

@mgorny
Copy link
Contributor Author

mgorny commented Aug 26, 2023

Thanks. I can confirm that the change fixes my reproducer, and probably pydantic too. I've tried 100 iterations of the test call, and it didn't crash (while without the change it crashes on the first attempt).

@JelleZijlstra
Copy link
Member

That's great to hear! I got the original reproducer running on 3.12, but I get a different crash. Possibly Pydantic just doesn't work with a debug-mode Python?

....FF...F....E....F...F....F.F..F.F.F...E.Fsx...Assertion failed: (!_PyErr_Occurred(tstate)), function type_call, file typeobject.c, line 1630.
warning: python was compiled with optimization - stepping may behave oddly; variables may not be available.
Process 93123 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = hit program assert
    frame #4: 0x000000010025ddb0 python`type_call.cold.6 at typeobject.c:1630:5 [opt]
   1627	    /* type_call() must not be called with an exception set,
   1628	       because it can clear it (directly or indirectly) and so the
   1629	       caller loses its exception */
-> 1630	    assert(!_PyErr_Occurred(tstate));
   1631	#endif
   1632	
   1633	    /* Special case: type(x) should return Py_TYPE(x) */
Target 0: (python) stopped.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = hit program assert
    frame #0: 0x00000001a259c724 libsystem_kernel.dylib`__pthread_kill + 8
    frame #1: 0x00000001a25d3c28 libsystem_pthread.dylib`pthread_kill + 288
    frame #2: 0x00000001a24e1ae8 libsystem_c.dylib`abort + 180
    frame #3: 0x00000001a24e0e44 libsystem_c.dylib`__assert_rtn + 272
  * frame #4: 0x000000010025ddb0 python`type_call.cold.6 at typeobject.c:1630:5 [opt]
    frame #5: 0x00000001000ea080 python`type_call(type=0x000000010042c370, args=0x00000001055b6a80, kwds=0x0000000000000000) at typeobject.c:1630:5 [opt]
    frame #6: 0x00000001000895b8 python`_PyObject_MakeTpCall(tstate=0x00000001005797e0, callable=0x000000010042c370, args=0x000000016fdfcd00, nargs=<unavailable>, keywords=0x0000000000000000) at call.c:240:18 [opt]
    frame #7: 0x0000000100089328 python`_PyObject_VectorcallTstate(tstate=0x00000001005797e0, callable=0x000000010042c370, args=0x000000016fdfcd00, nargsf=9223372036854775809, kwnames=0x0000000000000000) at pycore_call.h:90:16 [opt]
    frame #8: 0x000000010008a048 python`PyObject_CallOneArg(func=0x000000010042c370, arg=0x000000010565ab90) at call.c:401:12 [opt]
    frame #9: 0x0000000100130d40 python`warn_explicit(tstate=0x00000001005797e0, category=0x000000010042c370, message=0x000000010565ab90, filename=0x00000001055b51c0, lineno=0, module=0x00000001055b6cf0, registry=0x0000000000000000, sourceline=0x0000000000000000, source=0x0000000000000000) at _warnings.c:669:19 [opt]
    frame #10: 0x00000001001311c4 python`PyErr_WarnExplicitFormat(category=0x000000010042c370, filename_str=<unavailable>, lineno=0, module_str=<unavailable>, registry=0x0000000000000000, format=<unavailable>) at _warnings.c:1356:19 [opt]
    frame #11: 0x00000001001e71b8 python`PyObject_GC_Del(op=0x0000000105221a90) at gcmodule.c:2385:13 [opt]
    frame #12: 0x0000000104a1c404 _pydantic_core.cpython-312-darwin.so`___lldb_unnamed_symbol772 + 360
    frame #13: 0x00000001000d7ce8 python`_Py_Dealloc(op=0x0000000105221a90) at object.c:2625:5 [opt]
    frame #14: 0x0000000104b5c73c _pydantic_core.cpython-312-darwin.so`___lldb_unnamed_symbol2379 + 660
    frame #15: 0x0000000104a1bdf4 _pydantic_core.cpython-312-darwin.so`___lldb_unnamed_symbol769 + 492
    frame #16: 0x0000000104b112e0 _pydantic_core.cpython-312-darwin.so`___lldb_unnamed_symbol1908 + 72
    frame #17: 0x0000000100094a54 python`method_vectorcall_FASTCALL_KEYWORDS(func=0x000000010454d250, args=<unavailable>, nargsf=<unavailable>, kwnames=0x0000000104793ed0) at descrobject.c:427:24 [opt]
    frame #18: 0x00000001000892ec python`_PyObject_VectorcallTstate(tstate=0x00000001005797e0, callable=0x000000010454d250, args=0x0000000100911850, nargsf=9223372036854775810, kwnames=0x0000000104793ed0) at pycore_call.h:92:11 [opt]

It's trying to dealloc one of these:

  tp_name = 0x000000010500be20 "pydantic_core._pydantic_core.ValidatorCallable"

So this seems like a different bug.

@davidhewitt
Copy link
Contributor

That's possibly related to PyO3/pyo3#3064. In general I think it's a reasonable objective for us to support debug builds, but we can't easily test them in CI (actions/setup-python#86)

@mgorny
Copy link
Contributor Author

mgorny commented Aug 26, 2023

--with-pydebug? Yeah, I was wondering if it's supposed to be used against other packages, or just to test CPython itself — because so many things failed in pydantic. That's why I used --with-assertions instead.

@JelleZijlstra
Copy link
Member

Ideally all C extensions should work in a pydebug build, but that's a bigger project and it sounds like it would also need work in PyO3.

miss-islington pushed a commit to miss-islington/cpython that referenced this issue Aug 27, 2023
(cherry picked from commit 482fad7)

Co-authored-by: Jelle Zijlstra <jelle.zijlstra@gmail.com>
Yhg1s pushed a commit that referenced this issue Aug 27, 2023
)

gh-108295: Fix crashes with TypeVar weakrefs (GH-108517)
(cherry picked from commit 482fad7)

Co-authored-by: Jelle Zijlstra <jelle.zijlstra@gmail.com>
@erlend-aasland erlend-aasland moved this from Todo to In Progress in Release and Deferred blockers 🚫 Aug 27, 2023
@erlend-aasland
Copy link
Contributor

It looks to me that this has been resolved.

@davidhewitt
Copy link
Contributor

Thanks, yes I will investigate debug build support in pyo3/ pydantic separately.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3.12 bugs and security fixes 3.13 bugs and security fixes release-blocker topic-typing type-crash A hard crash of the interpreter, possibly with a core dump
Projects
Development

No branches or pull requests

7 participants