@rwgk
Collaborator

@rwgk rwgk commented Dec 14, 2025

Description

This PR makes two related changes aimed at stabilizing and debugging the free-threaded CPython 3.14 CI jobs:

  • Pin the 3.14t CI jobs to 3.14.0t explicitly, instead of allowing actions/setup-python to pick the latest 3.14t.
  • Add a custom Catch2 "progress" reporter for the test_with_catch executable so that C++ tests print a per-test progress line (and the Python version) in CI and local logs.

Together, these changes (1) avoid a recently introduced CPython regression in the free-threaded 3.14t builds, and (2) make it much easier to see where C++ tests are hanging or failing when such regressions occur.


Why pin the 3.14t CI jobs to 3.14.0t

The free-threaded jobs in this project have been running against the moving 3.14t target provided by actions/setup-python. Recently, those same jobs began to hang in the C++ test binary (test_with_catch) once the underlying CPython moved from 3.14.0t to 3.14.1t+. The symptoms, however, differ by platform:

Windows: hang in Move Subinterpreter teardown

On Windows, with python-version: 3.14t:

  • The cpptest target runs and the test_with_catch executable starts.
  • Using a custom Catch reporter and additional debug markers in the Move Subinterpreter test, the logs show that:
    • All earlier tests in test_with_catch pass.
    • TEST_CASE("Move Subinterpreter") starts and runs through all of its internal steps, including:
      • Creating a py::subinterpreter.
      • Importing datetime, threading, and external_module inside the subinterpreter.
      • Reusing the same subinterpreter from a second native thread.
      • Destroying the subinterpreter from that second thread.
      • Calling unsafe_reset_internals_for_single_interpreter() on return.
    • All internal LOOOK file:line debug markers (since reverted) fire up to the very end of the test body.
  • Only after the test body returns do we see the hang: Catch never prints the final [ OK ] Move Subinterpreter, and the process sits until the CI job times out and the runner kills the test_with_catch process.

This strongly suggests the hang is in interpreter teardown / subinterpreter finalization, not in the user-visible test code itself.

Ubuntu and macOS: hang in CMake VerifyGlobs.cmake before C++ tests start

On Ubuntu and macOS, with python-version: 3.14t, the failure mode is different but appears to be related:

  • The Python tests (pytest) complete successfully.

  • When building the cpptest target, Ninja runs:

    • cmake -P .../CMakeFiles/VerifyGlobs.cmake
  • The jobs hang inside this VerifyGlobs.cmake step, with no output at all from test_with_catch:

    • test_with_catch is never actually invoked.
    • As a result, the Catch-based per-test progress and the instrumentation in Move Subinterpreter never have a chance to run on these platforms.

While we do not have a minimal reproducer for the VerifyGlobs.cmake hang, the timing is suspicious:

  • The last good CI runs used 3.14.0t and did not exhibit this behavior.

  • The first bad CI runs used 3.14.1t+, and both the Windows Move Subinterpreter hang and the Ubuntu/macOS VerifyGlobs.cmake hang appeared at the same time.

  • A local bisect over the CPython 3.14 branch (using a small SCons-based harness and a 5-second timeout on test_with_catch) isolates the first failing CPython commit to 08bea299bfd4377611df42e4e42414ffacea4f7f:

    [3.14] gh-112729: Correctly fail when the process is out of memory during interpreter creation (GH-139164) (GH-139168)

This commit changes interpreter creation/teardown logic in Python/pylifecycle.c to handle OOM more robustly. Given that:

  • On Windows we see a hang after a subinterpreter-heavy test returns, during teardown.
  • On Ubuntu/macOS we see a hang in a CMake script that is likely to trigger CMake/CPython interactions (e.g. subprocesses, interpreter creation, or probing scripts) before test_with_catch is even run.

…it is very plausible that this is one underlying CPython regression manifesting differently on different platforms, rather than two unrelated issues that just happened to appear at the same time.

Rationale for pinning

Pinning the free-threaded CI jobs to 3.14.0t is therefore a pragmatic, short-term workaround that:

  • Keeps exercising the free-threaded configuration (3.14t) in CI, but
  • Avoids the specific 3.14.1t+ regression in interpreter creation/teardown that is currently breaking these jobs, and
  • Makes CI behavior stable and predictable again while an upstream fix is discussed and implemented on the CPython side.

Concretely, .github/workflows/ci.yml is updated so that the free-threaded entries in the matrices use:

  • python-version: '3.14.0t' on Ubuntu, macOS, and Windows where we previously used '3.14t'.

All other Python versions in the matrices continue to behave as before.


Progress reporter for test_with_catch

Before this PR, the test_with_catch C++ binary only printed a summary like:

Passed all N test cases with M assertions.

This made it very difficult to see which test was hanging or failing in CI, especially once free-threaded regressions began to affect subinterpreter-related tests.

This PR updates tests/test_with_catch/catch.cpp to:

  • Define and register a simple Catch2 ProgressReporter that:
    • Prints a one-line marker before each test case of the form:
      • [ RUN ] <test-name>
    • Prints a one-line result after each test case of the form:
      • [ OK ] <test-name> or [ FAILED ] <test-name>
    • Prints a one-time Python version banner at the start of the run using Py_GetVersion():
      • [ PYTHON ] 3.14.1+ free-threading build (tags/v3.14.1:..., ...) [GCC ...]
  • Make ProgressReporter the default reporter for this binary via CATCH_CONFIG_DEFAULT_REPORTER "progress".
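
The output logic described in the list above can be sketched without Catch2. This is a minimal illustration, not the code in catch.cpp: `ProgressPrinter`, its method names, and `demo_progress_output` are hypothetical. The version string is passed in rather than read from Py_GetVersion(), and the exact padding of the `[ PYTHON ]` and `[ FAILED ]` markers is an assumption. In the actual reporter these methods would be driven by Catch2's test-case start and end hooks.

```cpp
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical stand-in for the reporter's printing logic (not the PR's code).
class ProgressPrinter {
public:
    explicit ProgressPrinter(std::ostream &out) : out_(out) {}

    // Print the Python version banner once, however often this is called.
    void print_banner_once(const std::string &python_version) {
        if (banner_printed_) {
            return;
        }
        banner_printed_ = true;
        out_ << "[ PYTHON   ] " << python_version << '\n' << std::flush;
    }

    // One line before the test body runs; flushed so a hang leaves it visible.
    void test_case_starting(const std::string &name) {
        out_ << "[ RUN      ] " << name << '\n' << std::flush;
    }

    // One line after the test body returns.
    void test_case_ended(const std::string &name, bool all_ok) {
        out_ << (all_ok ? "[       OK ] " : "[  FAILED  ] ") << name << '\n' << std::flush;
    }

private:
    std::ostream &out_;
    bool banner_printed_ = false;
};

// Drives the printer the way a test run would and returns the captured log.
inline std::string demo_progress_output() {
    std::ostringstream log;
    ProgressPrinter printer(log);
    printer.print_banner_once("3.14.0 free-threading build");
    printer.print_banner_once("3.14.0 free-threading build"); // ignored: already printed
    printer.test_case_starting("Threads");
    printer.test_case_ended("Threads", true);
    return log.str();
}
```

Flushing after every line is the important design choice here: if a test hangs, the last `[ RUN ]` line without a matching result line names the culprit.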

This has several concrete benefits:

  • CI observability (Windows case):

    • On Windows, where test_with_catch actually runs and reaches Move Subinterpreter before hanging, the new reporter makes it immediately obvious which test is in progress and which Python version is in use.
    • When combined with additional instrumentation, this was critical in confirming that the Move Subinterpreter test body completes and that the hang is in teardown.
  • CI observability (Ubuntu/macOS case):

    • On Ubuntu and macOS, the current hang occurs before test_with_catch is ever invoked, so the progress reporter does not yet help with that specific symptom.
    • However, once the upstream CPython regression is fixed and VerifyGlobs.cmake no longer hangs, we will immediately benefit from the improved C++ test logging on those platforms as well.
  • Local debugging:

    • The same binary now gives per-test progress when run locally, which greatly simplifies reproducing and narrowing down issues, especially for subinterpreter and GIL-related tests.

In short, even though the primary motivation was diagnosing the free-threaded regression, the progress reporter is a generally useful improvement to the C++ test experience and is intended to stay.

For easy future reference, this is the ChatGPT 5.2 Pro Thinking chat related to developing the implementation of the progress reporter:

According to the chat, there is no equivalent built-in reporter.


Opting back into the old compact output

If a developer prefers the old, more compact Catch2 output locally, they can still request it explicitly on the command line, for example:

  • test_with_catch -r compact

This overrides the default progress reporter set at compile time and restores the previous summary-style behavior for that run.

In other words, the new progress reporter is:

  • On by default for CI and local runs, to make hangs and regressions much easier to diagnose (especially on Windows today, and on Ubuntu/macOS as soon as the upstream regression is addressed).
  • Opt-out at runtime by passing -r compact (or another reporter) for developers who prefer the old output style for ad hoc runs.

Summary of intent

  • This PR does not attempt to "paper over" a pybind11 bug; rather, it:
    • Pins CI to 3.14.0t, the last-known-good free-threaded build for this test suite, as a temporary workaround for a CPython regression identified via bisect.
    • Improves the C++ test harness so that any future issues—whether in pybind11 or in the surrounding ecosystem—are easier to localize and report.

The expectation is that, once the CPython regression in interpreter creation/teardown is fixed upstream and available via actions/setup-python, we can relax the pin back to 3.14t (or another appropriate selector), while keeping the progress reporter as a permanent quality-of-life improvement for debugging.

Suggested changelog entry:

  • n/a

Add explicit timeouts to the busy-wait coordination loops in the
Per-Subinterpreter GIL test in tests/test_with_catch/test_subinterpreter.cpp.
Previously those loops spun indefinitely waiting for shared atomics like
`started` and `sync` to change, which is fine when CPython's free-threading
and per-interpreter GIL behavior matches the test's expectations but becomes
pathologically bad when that behavior regresses: the `test_with_catch`
executable can then hang forever, causing our 3.14t CI jobs to time out
after 90 minutes.

This change keeps the structure and intent of the test but adds a
std::chrono::steady_clock deadline to each of the coordination loops,
using a conservative 10 second bound. Worker threads record a failure and
return if they hit the timeout, while the main thread fails the test via
Catch2 instead of hanging. That way, if future CPython free-threading
patches change the semantics again, the test will fail quickly and
produce a diagnosable error instead of wedging the CI job.
@rwgk
Collaborator Author

rwgk commented Dec 14, 2025

EDIT: REVERTED


Summary

This PR adds timeouts to the Per-Subinterpreter GIL test in test_with_catch so that free‑threading regressions in CPython result in fast, diagnosable failures instead of 90‑minute CI hangs.

Background and suspected root cause

  • We have strong evidence that the change from Python 3.14.0 / 3.14.0t to 3.14.1 / 3.14.1t triggered the new CI failures:
    • The last good CI run (logs_19839060935, 2025‑12‑01) used:
      • macOS 3.14t: python-3.14.0-darwin-arm64-freethreaded.tar.gz (CPython 3.14.0t)
      • Ubuntu 3.14: CPython 3.14.0
      • Windows 3.14: CPython 3.14.0
    • The first bad CI run (logs_19884911572, 2025‑12‑03) used:
      • macOS 3.14t: python-3.14.1-darwin-arm64-freethreaded.tar.gz (CPython 3.14.1t)
      • Ubuntu 3.14t: python-3.14.1-linux-24.04-x64-freethreaded.tar.gz (CPython 3.14.1t)
    • Later runs on 3.14.2t show the same pattern: the Python tests pass, and the hang occurs while running the C++ test_with_catch executable as part of the cpptest target.
  • All three problematic jobs hang only after:
    • cmake --build build --target pytest completes successfully, and
    • cmake --build build --target cpptest starts test_with_catch, which runs our embedded/interpreter/subinterpreter tests under free‑threading.

Currently affected CI matrix entries

All failures are in the standard CI workflow (.github/workflows/ci.yml), in the jobs that exercise Python 3.14 free‑threading (3.14t):

  • standard-small job (uses reusable-standard.yml)

    • Matrix entry:
      • runs-on: ubuntu-latest
      • python-version: '3.14t'
      • cmake-args: -DCMAKE_CXX_STANDARD=17 -DPYBIND11_TEST_SMART_HOLDER=ON
    • Displayed job name (from logs):
      • 🐍 (ubuntu-latest, 3.14t, -DCMAKE_CXX_STANDARD=17 -DPYBIND11_TEST_SMART_HOLDER=ON) / 🧪
  • standard-large job (uses reusable-standard.yml)

    • Matrix entry:

      • runs-on: macos-latest
      • python-version: '3.14t'
      • cmake-args: -DCMAKE_CXX_STANDARD=20
    • Displayed job name:

      • 🐍 (macos-latest, 3.14t, -DCMAKE_CXX_STANDARD=20) / 🧪
    • Matrix entry:

      • runs-on: windows-latest
      • python-version: '3.14t'
      • cmake-args: -DCMAKE_CXX_STANDARD=23
    • Displayed job name:

      • 🐍 (windows-latest, 3.14t, -DCMAKE_CXX_STANDARD=23) / 🧪

All three of these jobs currently run until the workflow’s 90‑minute timeout, with the only stuck process being the test_with_catch binary.

Change in this PR

  • Limit busy‑wait loops in tests/test_with_catch/test_subinterpreter.cpp’s TEST_CASE("Per-Subinterpreter GIL"):
    • Worker‑thread loops that previously spun on started and sync now:
      • Use std::chrono::steady_clock and a 10‑second timeout.
      • Record a failure via the existing T_REQUIRE helper and return if they time out.
    • Main‑thread loops that waited for started == 3 and sync == 3 now:
      • Also use a 10‑second timeout.
      • Call FAIL(...) on timeout instead of hanging the entire job.
  • The intent is to preserve the semantics of the test (detect incorrect free‑threading / per‑interpreter GIL behavior) while ensuring that future CPython changes cause fast, explicit test failures instead of silent 90‑minute job hangs.
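
A bounded busy-wait of the kind described above might look like the following sketch. It is not the (since reverted) code from test_subinterpreter.cpp: `wait_for_at_least` and the two demo functions are hypothetical names, and the real test would report a timeout through T_REQUIRE or FAIL(...) rather than a bool.

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Hypothetical helper: spin until `counter` reaches `target`, or give up at the deadline.
// Returns true if the target was reached, false on timeout.
inline bool wait_for_at_least(const std::atomic<int> &counter,
                              int target,
                              std::chrono::milliseconds timeout) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (counter.load() < target) {
        if (std::chrono::steady_clock::now() >= deadline) {
            return false; // caller turns this into a test failure
        }
        std::this_thread::yield();
    }
    return true;
}

// Three workers each bump `started`; the main thread waits with a 10-second bound.
inline bool demo_all_workers_start() {
    std::atomic<int> started{0};
    std::thread workers[3];
    for (auto &worker : workers) {
        worker = std::thread([&started] { ++started; });
    }
    const bool ok = wait_for_at_least(started, 3, std::chrono::seconds(10));
    for (auto &worker : workers) {
        worker.join();
    }
    return ok;
}

// Only two of three workers ever report in, so the wait must time out.
inline bool demo_missing_worker_times_out() {
    std::atomic<int> started{2};
    return !wait_for_at_least(started, 3, std::chrono::milliseconds(50));
}
```

std::chrono::steady_clock is used because it is monotonic, so the deadline is unaffected by wall-clock adjustments on the CI runner.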

@rwgk
Collaborator Author

rwgk commented Dec 14, 2025

@b-pass Could you please help work on this?

Last-good run was: https://github.com/pybind/pybind11/actions/runs/19839060935

First-bad run was: https://github.com/pybind/pybind11/actions/runs/19884911572

I'm using Cursor, thus far without looking much at the logs or the code myself. The analysis in the comment above was completely generated by Cursor.

Could you please let me know any ideas or suggestions you have? I'll feed them into my Cursor chat. If you want to make changes here directly, please let me know, I'll figure out how to get you write access.

rwgk added 3 commits December 13, 2025 19:04
Introduce a custom Catch2 reporter for tests/test_with_catch that prints a
simple one-line status for each test case as it starts and ends, and wire the
cpptest CMake target to invoke test_with_catch with -r progress. This makes
it much easier to see where the embedded/interpreter test binary is spending
its time in CI logs, and in particular to pinpoint which test case is stuck
when the free-threading builds hang.

Compared to adding ad hoc timeouts around potentially infinite busy-wait
loops in individual tests, a progress reporter is a more general and robust
approach: it gives visibility into all tests (including future ones) without
changing their behavior, and turns otherwise opaque 90-minute timeouts into
locatable issues in the Catch output.
@rwgk rwgk changed the title Limit busy-wait loops in per-subinterpreter GIL test WIP resolve test_with_catch hangs Dec 14, 2025
@rwgk rwgk requested a review from henryiii as a code owner December 14, 2025 03:27
@rwgk rwgk removed the request for review from henryiii December 14, 2025 03:31
@rwgk
Collaborator Author

rwgk commented Dec 14, 2025

@b-pass The initial timeout idea was a fail. It's reverted. I'm trying a different hammer now. Cursor added a progress reporter for catch (commit 179a66f) that produces the output below. Hopefully that'll help us home in on the troublemaker test.

[ RUN      ] PYTHONPATH is used to update sys.path
[       OK ] PYTHONPATH is used to update sys.path
[ RUN      ] Pass classes and data between modules defined in C++ and Python
[       OK ] Pass classes and data between modules defined in C++ and Python
[ RUN      ] Override cache
[       OK ] Override cache
[ RUN      ] Import error handling
[       OK ] Import error handling
[ RUN      ] There can be only one interpreter
[       OK ] There can be only one interpreter
[ RUN      ] Custom PyConfig
[       OK ] Custom PyConfig
[ RUN      ] scoped_interpreter with PyConfig_InitIsolatedConfig and argv
[       OK ] scoped_interpreter with PyConfig_InitIsolatedConfig and argv
[ RUN      ] scoped_interpreter with PyConfig_InitPythonConfig and argv
[       OK ] scoped_interpreter with PyConfig_InitPythonConfig and argv
[ RUN      ] Add program dir to path pre-PyConfig
[       OK ] Add program dir to path pre-PyConfig
[ RUN      ] Add program dir to path using PyConfig
[       OK ] Add program dir to path using PyConfig
[ RUN      ] Restart the interpreter
[       OK ] Restart the interpreter
[ RUN      ] Execution frame
[       OK ] Execution frame
[ RUN      ] Threads
[       OK ] Threads
[ RUN      ] Reload module from file
[       OK ] Reload module from file
[ RUN      ] sys.argv gets initialized properly
[       OK ] sys.argv gets initialized properly
[ RUN      ] make_iterator can be called before then after finalizing an interpreter
[       OK ] make_iterator can be called before then after finalizing an interpreter

@rwgk
Collaborator Author

rwgk commented Dec 14, 2025

Quick update: I switched to root-causing locally, by installing cpython from sources, then building the pybind11 tests with it, and running test_with_catch.

Already confirmed:

  • test_with_catch passes locally with v3.14.0 built from sources.
  • test_with_catch fails locally with the 3.14 branch on current HEAD (7297d3a98d377519c83ef142043ad22376abfe7c).
  • test_with_catch fails locally with v3.14.1

Cursor is currently running git bisect to find the commit between v3.14.0 and v3.14.1 that breaks the test.

@rwgk
Collaborator Author

rwgk commented Dec 14, 2025

Result of the bisect

  • First bad commit (on the 3.14 branch, free-threaded build, with your test_with_catch oracle):

    08bea299bfd4377611df42e4e42414ffacea4f7f
    [3.14] gh-112729: Correctly fail when the process is out of memory during interpreter creation (GH-139164) (GH-139168)

  • Behavior around it:

    • Earlier commits (up to and including ddd12644698 and a69bdab5410) are PASS: Move Subinterpreter runs to completion and prints [ OK ] Move Subinterpreter.
    • Starting at 08bea299bfd and later, your Move Subinterpreter test starts to show the same pattern as on CI HEAD: it runs through all LOOOK points and then the process hangs in post-test teardown (in your CI case) or, under the 5s timeout oracle, is classified as BAD.
    • All intermediate commits I logged in sha_pass_fail.txt have consistent PASS/FAIL status with this boundary.
  • What the commit touches (per message diff summary):

    • Modifies Python/pylifecycle.c in the interpreter creation / lifecycle path to "correctly fail when the process is out of memory during interpreter creation".
    • Adds stress tests under Lib/test/test_interpreters/test_stress.py.
    • Adds a NEWS entry.

Given that this is squarely in the interpreter creation/teardown code path, and your hang appears after Move Subinterpreter returns (i.e. during interpreter teardown / subinterpreter handling), this commit is an extremely strong candidate for the regression you're seeing with 3.14.1+ free-threading and your Move Subinterpreter test.
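
The "5s timeout oracle" used for the bisect can be illustrated with a small POSIX-only sketch. This is not the SCons-based harness mentioned in this PR: `Verdict` and `run_with_timeout` are hypothetical names, and the code assumes fork, execvp, and waitpid are available. It runs a command, classifies a zero exit status as good, and treats a non-zero exit, a crash, or a hang past the deadline as bad.

```cpp
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#include <chrono>
#include <string>
#include <thread>
#include <vector>

enum class Verdict { Good, Bad };

// Hypothetical bisect oracle (POSIX only): run `argv` and kill it if it
// outlives `timeout`. Only a clean exit with status 0 counts as Good.
inline Verdict run_with_timeout(const std::vector<std::string> &argv,
                                std::chrono::milliseconds timeout) {
    if (argv.empty()) {
        return Verdict::Bad;
    }
    std::vector<char *> c_argv;
    for (const auto &arg : argv) {
        c_argv.push_back(const_cast<char *>(arg.c_str()));
    }
    c_argv.push_back(nullptr);

    const pid_t pid = fork();
    if (pid < 0) {
        return Verdict::Bad; // could not start the child at all
    }
    if (pid == 0) {
        execvp(c_argv[0], c_argv.data());
        _exit(127); // only reached if exec failed
    }

    const auto deadline = std::chrono::steady_clock::now() + timeout;
    int status = 0;
    for (;;) {
        const pid_t done = waitpid(pid, &status, WNOHANG);
        if (done == pid) {
            const bool ok = WIFEXITED(status) && WEXITSTATUS(status) == 0;
            return ok ? Verdict::Good : Verdict::Bad;
        }
        if (done < 0) {
            return Verdict::Bad;
        }
        if (std::chrono::steady_clock::now() >= deadline) {
            kill(pid, SIGKILL);       // a hang is classified as a failure
            waitpid(pid, &status, 0); // reap the killed child
            return Verdict::Bad;
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
}
```

Classifying a timeout as bad is what lets a teardown hang, which never produces an exit status on its own, take part in an automated bisect.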

rwgk added 8 commits December 13, 2025 23:25
Print the CPython version once at the start of the Catch-based
interpreter tests using Py_GetVersion(). This makes it trivial to
confirm which free-threaded build a failing run is using when
inspecting CI or local logs.
Update the standard-small and standard-large GitHub Actions jobs to
request python-version 3.14.0t instead of 3.14t. This forces setup-python
to use the last-known-good 3.14.0 free-threaded build rather than the
newer 3.14.1+ builds where subinterpreter finalization regressed.
Update the standard-small and standard-large GitHub Actions jobs to
request python-version 3.14.0t instead of 3.14t. This forces setup-python
to use the last-known-good 3.14.0 free-threaded build rather than the
newer 3.14.1+ builds where subinterpreter finalization regressed.
@rwgk rwgk changed the title WIP resolve test_with_catch hangs Pin 3.14t CI jobs to 3.14.0t to work around a regression in 3.14.1t+ Dec 14, 2025
@b-pass
Contributor

b-pass commented Dec 14, 2025

Hmmm... I spent a while trying to figure out how that CPython change could've caused an issue, and it seems like it should be harmless. So I'm not sure yet what's going on. Based on how CPython handles ThreadState lifecycles in 3.14, it should be OK to move them to a different thread than the one where they were created (which is what the failing test exercises).

rwgk added 4 commits December 14, 2025 19:05
The progress reporter emits output as test cases start and finish and
flushes immediately to keep CI logs current with progress (so that we can see
immediately where tests hang). StreamingReporterBase matches this behavior
directly, whereas CumulativeReporterBase is meant for reporters that collect
results and emit output at the end of the run.
/__w/pybind11/pybind11/tests/test_with_catch/catch.cpp:62:22: error: statement should be inside braces [readability-braces-around-statements,-warnings-as-errors]
   62 |         if (printed_)
      |                      ^
      |                       {
   63 |             return;
      |
@rwgk
Collaborator Author

rwgk commented Dec 15, 2025

@b-pass

@XuehaiPan

The text below is mostly generated by Cursor, but with some non-trivial manual edits:


With a lot of help from Cursor I’ve put together a small standalone reproducer that tries to mirror TEST_CASE("Move Subinterpreter") as closely as possible, but written directly against the CPython C API.

You can find it in this PR under:

  • move_subinterpreter_redux/move_subinterpreter_redux.c
  • move_subinterpreter_redux/build_and_run.sh

The idea is to exercise essentially the same lifecycle as Move Subinterpreter:

  • Create a subinterpreter via Py_NewInterpreterFromConfig with a PyInterpreterConfig that matches what pybind11 uses (allow_threads = 1, check_multi_interp_extensions = 1, gil = PyInterpreterConfig_OWN_GIL).
  • On the main thread, create a temporary PyThreadState for that interpreter, PyThreadState_Swap into it, import some moderately non-trivial modules (sys, datetime, threading), then clear/delete that PyThreadState again.
  • On a worker thread, repeat the same pattern: create a fresh PyThreadState for the same PyInterpreterState, run a bit of code, then destroy the subinterpreter from that worker thread by calling Py_EndInterpreter with another fresh PyThreadState for the subinterpreter (mirroring what py::subinterpreter::~subinterpreter does on 3.13+).

There are no pybind11 internals involved here, just the raw C API and a single global PyInterpreterState * for the subinterpreter (sub_interp).

Behavior with 3.14.0t and 3.14.1t

Using two local builds:

  • 3.14.0t at ~/wrk/cpython_installs/v3.14_ebf955df7a8 (tags/v3.14.0:ebf955df7a8)
  • 3.14.1t at ~/wrk/cpython_installs/v3.14_57e0d177c26 (tags/v3.14.1:57e0d177c26)

I ran:

cd move_subinterpreter_redux
./build_and_run.sh "$HOME/wrk/cpython_installs/v3.14_ebf955df7a8/bin/python3.14t-config"
./build_and_run.sh "$HOME/wrk/cpython_installs/v3.14_57e0d177c26/bin/python3.14t-config"

Both succeed and print very similar output:

With 3.14.0t:

Building move_subinterpreter_redux with: /home/rgrossekunst/wrk/cpython_installs/v3.14_ebf955df7a8/bin/python3.14t-config
+ gcc -O0 -g -Wall -Wextra -o move_subinterpreter_redux move_subinterpreter_redux.c -I/home/rgrossekunst/wrk/cpython_installs/v3.14_ebf955df7a8/include/python3.14t -I/home/rgrossekunst/wrk/cpython_installs/v3.14_ebf955df7a8/include/python3.14t -fno-strict-overflow -Wsign-compare -DNDEBUG -g -O3 -Wall -L/home/rgrossekunst/wrk/cpython_installs/v3.14_ebf955df7a8/lib -lpython3.14t -ldl -lm -lpython3.14t -ldl -lm -lpthread
+ set -x
++ /home/rgrossekunst/wrk/cpython_installs/v3.14_ebf955df7a8/bin/python3.14t-config --prefix
+ prefix=/home/rgrossekunst/wrk/cpython_installs/v3.14_ebf955df7a8
+ export LD_LIBRARY_PATH=/home/rgrossekunst/wrk/cpython_installs/v3.14_ebf955df7a8/lib
+ LD_LIBRARY_PATH=/home/rgrossekunst/wrk/cpython_installs/v3.14_ebf955df7a8/lib
+ echo 'Running move_subinterpreter_redux...'
Running move_subinterpreter_redux...
+ set -x
+ set +e
+ timeout 3s ./move_subinterpreter_redux
Python version: 3.14.0 free-threading build (tags/v3.14.0:ebf955df7a8, Dec 13 2025, 22:38:04) [GCC 13.3.0]
Main interpreter initialized.
Subinterpreter created.
main: activating subinterpreter on this thread
main: finished running code in subinterpreter
Subinterpreter imports on main thread done.
worker: activating subinterpreter on this thread
worker: finished running code in subinterpreter
worker: calling Py_EndInterpreter on subinterpreter
main: ran code in subinterpreter
worker: ran code in subinterpreter
worker: returned from Py_EndInterpreter
Worker thread joined.
Py_FinalizeEx() returned 0.
+ status=0
+ set -e
+ set +x
move_subinterpreter_redux: finished successfully (exit code 0)

With 3.14.1t:

Building move_subinterpreter_redux with: /home/rgrossekunst/wrk/cpython_installs/v3.14_57e0d177c26/bin/python3.14t-config
+ gcc -O0 -g -Wall -Wextra -o move_subinterpreter_redux move_subinterpreter_redux.c -I/home/rgrossekunst/wrk/cpython_installs/v3.14_57e0d177c26/include/python3.14t -I/home/rgrossekunst/wrk/cpython_installs/v3.14_57e0d177c26/include/python3.14t -fno-strict-overflow -Wsign-compare -DNDEBUG -g -O3 -Wall -L/home/rgrossekunst/wrk/cpython_installs/v3.14_57e0d177c26/lib -lpython3.14t -ldl -lm -lpython3.14t -ldl -lm -lpthread
+ set -x
++ /home/rgrossekunst/wrk/cpython_installs/v3.14_57e0d177c26/bin/python3.14t-config --prefix
+ prefix=/home/rgrossekunst/wrk/cpython_installs/v3.14_57e0d177c26
+ export LD_LIBRARY_PATH=/home/rgrossekunst/wrk/cpython_installs/v3.14_57e0d177c26/lib
+ LD_LIBRARY_PATH=/home/rgrossekunst/wrk/cpython_installs/v3.14_57e0d177c26/lib
+ echo 'Running move_subinterpreter_redux...'
Running move_subinterpreter_redux...
+ set -x
+ set +e
+ timeout 3s ./move_subinterpreter_redux
Python version: 3.14.1 free-threading build (tags/v3.14.1:57e0d177c26, Dec 13 2025, 22:49:30) [GCC 13.3.0]
Main interpreter initialized.
Subinterpreter created.
main: activating subinterpreter on this thread
main: finished running code in subinterpreter
Subinterpreter imports on main thread done.
worker: activating subinterpreter on this thread
worker: finished running code in subinterpreter
worker: calling Py_EndInterpreter on subinterpreter
main: ran code in subinterpreter
worker: ran code in subinterpreter
worker: returned from Py_EndInterpreter
Worker thread joined.
Py_FinalizeEx() returned 0.
+ status=0
+ set -e
+ set +x
move_subinterpreter_redux: finished successfully (exit code 0)

So the basic C-level pattern that Move Subinterpreter is trying to follow (create subinterpreter with its own GIL, use it from main thread and worker thread, then end it from the worker) appears to be accepted by both 3.14.0t and 3.14.1t, provided we:

  • Don’t keep any long-lived PyThreadState* for the subinterpreter around; and
  • Always create fresh PyThreadState instances on whichever thread is currently using/destroying the subinterpreter, clearing/deleting them when done.

In other words, the low-level CPython story looks consistent between 3.14.0t and 3.14.1t for this pattern.

Why I suspect a higher-level pybind11 issue (#5926 style)

Given that this C-only reproducer passes on both versions, but the pybind11 test suite is still seeing hangs during teardown with 3.14.1t, my current suspicion is that the remaining problem is elsewhere in pybind11, very much in the spirit of #5926:

  • pybind11 (or an extension using it) caches Python objects in process-global statics (e.g. via py::gil_safe_call_once_and_store, or similar helpers).
  • Those cached objects are interpreter-local (modules, dicts, capsules, or even interned strings created in a subinterpreter).
  • If a subinterpreter initializes one of those statics, and that interpreter is later destroyed, then a subsequent access from another interpreter (often the main interpreter, typically during teardown) will hit a dangling or cross-interpreter object.
  • On earlier Python versions (or with slightly different interpreter teardown ordering) this may have "worked by accident" or just crashed in a more obvious way. With 3.14.1t’s stricter subinterpreter semantics, it’s much easier to end up in a bad state that manifests as a hang during interpreter shutdown rather than a clean error.

This lines up pretty closely with the diagnosis in #5926: a lifetime mismatch between C++ static storage and per-interpreter Python objects. In that issue it showed up as a segfault; in our CI it looks more like a teardown hang, but the root cause could be very similar.
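
To make the lifetime mismatch concrete, here is a toy model in plain C++. It is not pybind11's implementation and not the approach taken in any linked issue or PR. Integer ids stand in for interpreters, strings stand in for interpreter-local Python objects, and `PerInterpreterCache` and `demo_factory_calls` are hypothetical names. The sketch shows the alternative to a single process-global slot: key the cached value by interpreter and drop the entry when that interpreter is destroyed.

```cpp
#include <cstddef>
#include <functional>
#include <mutex>
#include <string>
#include <unordered_map>

// Toy model: one cached value per interpreter id instead of one per process.
template <typename T>
class PerInterpreterCache {
public:
    // Return the value cached for `interp_id`, creating it on first use.
    T get_or_create(int interp_id, const std::function<T()> &factory) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = values_.find(interp_id);
        if (it == values_.end()) {
            it = values_.emplace(interp_id, factory()).first;
        }
        return it->second;
    }

    // Call when an interpreter is destroyed so no stale entry outlives it.
    void drop(int interp_id) {
        std::lock_guard<std::mutex> lock(mutex_);
        values_.erase(interp_id);
    }

    std::size_t size() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return values_.size();
    }

private:
    mutable std::mutex mutex_;
    std::unordered_map<int, T> values_;
};

// Counts how often the factory runs across create, reuse, destroy, and re-create.
inline int demo_factory_calls() {
    PerInterpreterCache<std::string> cache;
    int calls = 0;
    auto make = [&calls]() {
        ++calls;
        return std::string("interpreter-local object");
    };
    cache.get_or_create(0, make); // main interpreter: factory runs
    cache.get_or_create(1, make); // subinterpreter: separate entry, factory runs
    cache.get_or_create(1, make); // cached: factory does not run
    cache.drop(1);                // subinterpreter destroyed
    cache.get_or_create(1, make); // new interpreter reusing id 1: factory runs again
    return calls;
}
```

With a single process-global slot, the third and fifth calls would both return whatever the first initializer stored, even if the interpreter that created it is gone, which is the dangling or cross-interpreter access described above.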

Ask / next steps

I’d really appreciate it if you could take a look at move_subinterpreter_redux.c and sanity-check the pattern there against your mental model of how pybind11’s subinterpreter and subinterpreter_scoped_activate are supposed to behave.

If you agree that the C-level pattern is sound (and that it maps onto what the pybind11 code is intending to do), then my conclusion would be:

  • The remaining 3.14.1t teardown issues are likely coming from pybind11’s internals elsewhere (e.g. remaining uses of gil_safe_call_once_and_store or other process-global caches that still assume a single interpreter), rather than from the raw Py_EndInterpreter/PyThreadState usage.

In that case, focusing effort on auditing/removing interpreter-local objects from process-global statics (as in #5926) seems like the most promising path to making pybind11 reliably subinterpreter-safe on 3.14.1t+.

@rwgk
Collaborator Author

rwgk commented Dec 15, 2025

@b-pass @XuehaiPan I forgot to add: I'll stop working on this PR for now, until #5933 is ready for review, and we can test if it resolves the 3.14t hangs.
