Program aborts when Python's garbage collector gets called from another thread and attempts to traverse an unsendable pyclass instance. #3688

JRRudy1 · 2023-12-21T20:57:45Z

I have created a repository providing a full breakdown and minimal reproducible example of the error
at https://github.com/JRRudy1/pyo3_gc_error. I will provide a summary below, but please check out
the repository instead as I put a lot of effort into clearly presenting and investigating the issue.

In summary, I have discovered an error, or perhaps an undocumented limitation, in the way
PyO3 handles thread-checking for "unsendable" pyclass instances as they are being traversed
by Python's garbage collector (GC). In particular, this occurs when garbage collection is triggered
from a separate thread, and the pyclasses integrate with the GC by implementing the __traverse__
magic method. The error (or limitation) results in a hard abort, and is particularly problematic
since it cannot be caught from Python using a try/except block.

The conditions and sequence of events leading to the error can be summarized as:

1. Two (or more) instances of an "unsendable" pyclass are created from Python.
2. The objects are in a reference cycle, and the pyclass defines __traverse__/__clear__ to break it.
3. All references to them outside the cycle are dropped, so the next GC cycle should clean them up.
4. Before the GC runs automatically, it gets explicitly called from another thread (gc.collect from Python or PyGC_Collect from C).
5. When the GC calls back into Rust to traverse the objects, PyO3 detects that the calling thread is not the original thread and incorrectly deduces that the object was sent between threads.
6. PyO3 triggers a panic and the program aborts with a misleading error message.
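
The premise behind this sequence, that the collector works on whichever thread asks for a collection, can be checked without PyO3. The following is a minimal plain-Python sketch, not the reproduction from the linked repository: `Node` is a hypothetical stand-in for a GC-tracked object, and `__del__` merely records which thread finalized it. With a real unsendable pyclass, the same cross-thread callback is what reaches the thread check.

```python
import gc
import threading


class Node:
    """Hypothetical stand-in for a GC-tracked object that remembers its home thread."""

    finalized_on = []  # (owner thread id, thread id that ran __del__)

    def __init__(self):
        self.owner = threading.get_ident()
        self.other = None

    def __del__(self):
        Node.finalized_on.append((self.owner, threading.get_ident()))


def run_demo():
    gc.disable()  # keep the automatic collector from getting there first
    try:
        a, b = Node(), Node()
        a.other, b.other = b, a  # build a reference cycle
        del a, b                 # now only the cycle keeps the objects alive

        # Explicit collection requested from a different thread.
        worker = threading.Thread(target=gc.collect)
        worker.start()
        worker.join()
    finally:
        gc.enable()
    return list(Node.finalized_on)


records = run_demo()
for owner, ran_on in records:
    print(f"created on thread {owner}, finalized on thread {ran_on}")
```

Both objects are created on the main thread but are finalized on the worker thread, because that is the thread that called `gc.collect`.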

I have gotten reasonably familiar with PyO3's internals and may be interested in working on this,
but I would need some guidance from an "expert" with a more nuanced understanding of the
possible implications. It is possible that the limitation cannot be safely fixed, and the only solution
is to improve the error message and add a warning to the documentation.

As mentioned above, please visit https://github.com/JRRudy1/pyo3_gc_error for more information.

davidhewitt · 2023-12-21T21:14:22Z

Hmm. This is unfortunate, but not entirely a surprise. At least we crash safely.

> When the GC calls back into Rust to traverse the objects, PyO3 detects that the calling thread is
> not the original thread and incorrectly deduces that the object was sent between threads

I disagree that this deduction is incorrect. From PyO3's perspective this is true; the data is being read on another thread, which violates the !Send nature of that type.

One option could be to make unsendable pyclass not support gc integration, forcing the user to choose what functionality they desire. I think this seems like a reasonable restriction, because the Python GC can run on any thread.

JRRudy1 · 2023-12-21T22:55:18Z

> From PyO3's perspective this is true; the data is being read on another thread, which violates the !Send nature of that type.

I was afraid you'd say that! Unfortunate indeed.

> One option could be to make unsendable pyclass not support gc integration

I suppose that's fair. However the issue only arises in the fairly niche case where a GC call from another thread happens to occur while there are unsendable GC-integrated objects in a reference cycle waiting to be collected, so I'm not sure whether it would be a worthy motivation for removing functionality that works fine in most cases. But maybe it is?

I did just think of another possible solution; see this github.dev link. Apparently the ThreadCheckerImpl struct (and PyClassThreadChecker trait) already has a can_drop method that seems to check for and react to this exact problem, but it only gets called in the context of the struct being dropped. By updating the _call_traverse function to call can_drop (or a similar new method) before attempting to borrow from the cell, the error could be handled more gracefully with a more informative error message. Of course we'd need to add a higher-level method to PyCell or something that would call can_drop when appropriate, instead of calling it directly in the _call_traverse function like I did in the dev link.

adamreichold · 2023-12-22T08:05:52Z

One other option I see is, instead of an error, to make unsendable pyclasses "invisible" to the GC when it is running on a different thread, i.e. turn __traverse__ into a no-op and only actually traverse anything if on the original thread. This would imply leaks for such objects as e.g. the home thread could already have exited and the GC would never be able to run again there. But this might be a smaller caveat than raising an error (which I think is not supported by the contract of __traverse__, i.e. we would always abort).
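
The idea can be illustrated with a small plain-Python model. This is only a sketch of the behaviour described above, not PyO3's implementation: `UnsendableModel`, its `traverse` method, and the `visit` callback are hypothetical names standing in for an unsendable pyclass, its `__traverse__` handler, and the collector's visitor.

```python
import threading


class UnsendableModel:
    """Hypothetical model of an unsendable, GC-integrated object."""

    def __init__(self, *children):
        self._owner = threading.get_ident()  # thread that created the object
        self._children = list(children)

    def traverse(self, visit):
        # On any thread other than the owner, report nothing: the object
        # looks like a leaf to the collector instead of triggering an abort.
        if threading.get_ident() != self._owner:
            return
        for child in self._children:
            visit(child)


def visited_from_new_thread(obj):
    """Call obj.traverse on a fresh thread and return what was visited."""
    seen = []
    worker = threading.Thread(target=lambda: obj.traverse(seen.append))
    worker.start()
    worker.join()
    return seen


obj = UnsendableModel("a", "b")

seen_home = []
obj.traverse(seen_home.append)            # owning thread: children are visited
seen_away = visited_from_new_thread(obj)  # other thread: nothing is visited
```

Since the children are never reported on the foreign thread, a cycle running through such an object cannot be detected from there, which is the leak caveat mentioned above.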

davidhewitt · 2023-12-22T10:35:45Z

I think making them opaque to other threads is quite a reasonable option; we can also document this caveat as part of the offering of unsendable. That's a softer form of "do not support GC integration", I guess, which is practical for truly single threaded programs 👍

That said, I think it's possible that these things might still get collected by another thread running a GC collection? E.g. if the unsendable class itself does not directly contain the cycle but is referenced from an object that does participate in a cycle. Then when the cycle gets collected, the unsendable class gets dropped by the wrong thread. IIRC we leak and warn in this situation already, as per #3176, so I think this edge case is ok but unfortunate.

(The only solution I can see to mitigate that would be to have a per-thread queue so that unsendable classes could post themselves to their owning thread instead of leaking, but I'm not sure that it's worth the complexity.)
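
For illustration only, such a per-thread queue might look roughly like the plain-Python sketch below. Nothing here exists in PyO3: `OwnedResource`, `_drop_queues`, and `drain_current_thread` are hypothetical names, and the sketch assumes the owning thread is still alive and periodically drains its queue.

```python
import queue
import threading

# Hypothetical process-wide registry: owning thread id -> queue of deferred drops.
_drop_queues = {}
_registry_lock = threading.Lock()


def _queue_for(thread_id):
    with _registry_lock:
        return _drop_queues.setdefault(thread_id, queue.SimpleQueue())


class OwnedResource:
    """Hypothetical object that must only be released on its owning thread."""

    def __init__(self, name):
        self.name = name
        self.owner = threading.get_ident()
        self.released_on = None

    def _release(self):
        self.released_on = threading.get_ident()

    def drop(self):
        # Release immediately on the owning thread; otherwise post the
        # object to the owner's queue instead of leaking it.
        if threading.get_ident() == self.owner:
            self._release()
        else:
            _queue_for(self.owner).put(self)


def drain_current_thread():
    """Release everything other threads posted to the calling thread."""
    pending = _queue_for(threading.get_ident())
    released = 0
    while True:
        try:
            pending.get_nowait()._release()
        except queue.Empty:
            return released
        released += 1


res = OwnedResource("handle")
worker = threading.Thread(target=res.drop)  # drop attempted on the wrong thread
worker.start()
worker.join()
still_pending = res.released_on is None     # deferred, not released yet
drained = drain_current_thread()            # the owner releases it later
```

The sketch makes the cost visible: it needs a shared registry and a cooperating owner thread, and nothing is ever drained if that thread has already exited.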

adamreichold · 2023-12-22T10:40:39Z

> I think making them opaque to other threads is quite a reasonable option; we can also document this caveat as part of the offering of unsendable. That's a softer form of "do not support GC integration", I guess, which is practical for truly single threaded programs 👍

Will prepare a PR to turn __traverse__ into a no-op on threads other than the original one then.

> (The only solution I can see to mitigate that would be to have a per-thread queue so that unsendable classes could post themselves to their owning thread instead of leaking, but I'm not sure that it's worth the complexity.)

I think we should definitely try to reduce global state in PyO3; we already have quite too much and I would like to avoid adding more. If something like this is desired, I would prefer to have that in downstream code which actually knows how threading is used.

davidhewitt · 2023-12-22T10:42:52Z

Agreed very much so on that point 👍

JRRudy1 · 2023-12-22T18:49:28Z

Wow that was fast, thank you for your effort! I like that solution and implementation, great work guys

adamreichold · 2023-12-22T18:56:30Z

So this means you tested your PoC using the proposed change and it worked as expected?

adamreichold mentioned this issue Dec 22, 2023

Turn calls of __traverse__ into no-ops for unsendable pyclass if on the wrong thread #3689

Merged

adamreichold closed this as completed in #3689 Dec 23, 2023
