-
Notifications
You must be signed in to change notification settings - Fork 2.3k
Pin 3.14t CI jobs to 3.14.0t to work around a regression in 3.14.1t+ #5934
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Add explicit timeouts to the busy-wait coordination loops in the Per-Subinterpreter GIL test in tests/test_with_catch/test_subinterpreter.cpp. Previously those loops spun indefinitely waiting for shared atomics like `started` and `sync` to change, which is fine when CPython's free-threading and per-interpreter GIL behavior matches the test's expectations but becomes pathologically bad when that behavior regresses: the `test_with_catch` executable can then hang forever, causing our 3.14t CI jobs to time out after 90 minutes. This change keeps the structure and intent of the test but adds a std::chrono::steady_clock deadline to each of the coordination loops, using a conservative 10 second bound. Worker threads record a failure and return if they hit the timeout, while the main thread fails the test via Catch2 instead of hanging. That way, if future CPython free-threading patches change the semantics again, the test will fail quickly and produced a diagnosable error instead of wedging the CI job.
|
EDIT: REVERTED SummaryThis PR adds timeouts to the Background and suspected root cause
Currently affected CI matrix entriesAll failures are in the standard CI workflow (
All three of these jobs currently run until the workflow’s 90‑minute timeout, with the only stuck process being the Change in this PR
|
|
@b-pass Could you please help working on this? Last-good run was: https://github.com/pybind/pybind11/actions/runs/19839060935 First-bad run was: https://github.com/pybind/pybind11/actions/runs/19884911572 I'm using Cursor, thus far without looking much at the logs or the code myself. The analysis in the comment above was completely generated by Cursor. Could you please let me know any ideas or suggestions you have? I'll feed them into my Cursor chat. If you want to make changes here directly, please let me know, I'll figure out how to get you write access. |
This reverts commit 7847ada.
Introduce a custom Catch2 reporter for tests/test_with_catch that prints a simple one-line status for each test case as it starts and ends, and wire the cpptest CMake target to invoke test_with_catch with -r progress. This makes it much easier to see where the embedded/interpreter test binary is spending its time in CI logs, and in particular to pinpoint which test case is stuck when the free-threading builds hang. Compared to adding ad hoc timeouts around potentially infinite busy-wait loops in individual tests, a progress reporter is a more general and robust approach: it gives visibility into all tests (including future ones) without changing their behavior, and turns otherwise opaque 90-minute timeouts into locatable issues in the Catch output.
|
@b-pass The initial timeout idea was a fail. It's reverted. I'm trying a different hammer now. Cursor added a progress reporter for catch (commit 179a66f) that produces the output below. Hopefully that'll help us home in on the troublemaker test. |
|
Quick update: I switched to root-causing locally, by installing cpython from sources, then building the pybind11 tests with it, and running test_with_catch. Already confirmed:
Cursor is currently running git bisect to find the commit between v3.14.0 and v3.14.1 that breaks the test. |
|
Result of the bisect
Given that this is squarely in the interpreter creation/teardown code path, and your hang appears after Move Subinterpreter returns (i.e. during interpreter teardown / subinterpreter handling), this commit is an extremely strong candidate for the regression you're seeing with 3.14.1+ free-threading and your Move Subinterpreter test. |
Print the CPython version once at the start of the Catch-based interpreter tests using Py_GetVersion(). This makes it trivial to confirm which free-threaded build a failing run is using when inspecting CI or local logs.
This reverts commit ad3e1c3.
Update the standard-small and standard-large GitHub Actions jobs to request python-version 3.14.0t instead of 3.14t. This forces setup-python to use the last-known-good 3.14.0 free-threaded build rather than the newer 3.14.1+ builds where subinterpreter finalization regressed.
This reverts commit 5281e1c.
This reverts commit ed11292.
This reverts commit 0fe6a42.
This reverts commit 60ae0e8.
Update the standard-small and standard-large GitHub Actions jobs to request python-version 3.14.0t instead of 3.14t. This forces setup-python to use the last-known-good 3.14.0 free-threaded build rather than the newer 3.14.1+ builds where subinterpreter finalization regressed.
|
Hmmm... I spent a while trying to figure out how that CPython change could've caused an issue and it seems like it should be harmless. So I'm not sure yet what's going on it. Based on how cpython handles ThreadState lifecycles in 3.14 it should be OK to move them to different thread than where they were created (which is what is being tested by the failing test). |
The progress reporter emits output as test cases start and finish and flushes immediately to keep CI logs current with progress (so that we can see immediately where tests hang). StreamingReporterBase matches this behavior directly, whereas CumulativeReporterBase is meant for reporters that collect results and emit output at the end of the run.
/__w/pybind11/pybind11/tests/test_with_catch/catch.cpp:62:22: error: statement should be inside braces [readability-braces-around-statements,-warnings-as-errors]
62 | if (printed_)
| ^
| {
63 | return;
|
|
The text below is mostly generated by Cursor, but with some non-trivial manual edits: With a lot of help from Cursor I’ve put together a small standalone reproducer that tries to mirror You can find it in this PR under:
The idea is to exercise essentially the same lifecycle as
There are no pybind11 internals involved here, just the raw C API and a single global Behavior with 3.14.0t and 3.14.1tUsing two local builds:
I ran: cd move_subinterpreter_redux
./build_and_run.sh "$HOME/wrk/cpython_installs/v3.14_ebf955df7a8/bin/python3.14t-config"
./build_and_run.sh "$HOME/wrk/cpython_installs/v3.14_57e0d177c26/bin/python3.14t-config"Both succeed and print very similar output: With 3.14.0t: With 3.14.1t: So the basic C-level pattern that
In other words, the low-level CPython story looks consistent between 3.14.0t and 3.14.1t for this pattern. Why I suspect a higher-level pybind11 issue (#5926 style)Given that this C-only reproducer passes on both versions, but the pybind11 test suite is still seeing hangs during teardown with 3.14.1t, my current suspicion is that the remaining problem is elsewhere in pybind11, very much in the spirit of #5926:
This lines up pretty closely with the diagnosis in #5926: a lifetime mismatch between C++ Ask / next stepsI’d really appreciate it if you could take a look at If you agree that the C-level pattern is sound (and that it maps onto what the pybind11 code is intending to do), then my conclusion would be:
In that case, focusing effort on auditing/removing interpreter-local objects from process-global statics (as in #5926) seems like the most promising path to making pybind11 reliably subinterpreter-safe on 3.14.1t+. |
|
@b-pass @XuehaiPan I forgot to add: I'll stop working on this PR for now, until #5933 is ready for review, and we can test if it resolves the 3.14t hangs. |
Description
This PR makes two related changes aimed at stabilizing and debugging the free-threaded CPython 3.14 CI jobs:
3.14.0texplicitly, instead of allowingactions/setup-pythonto pick the latest3.14t.test_with_catchexecutable so that C++ tests print a per-test progress line (and the Python version) in CI and local logs.Together, these changes (1) avoid a recently introduced CPython regression in the free-threaded 3.14t builds, and (2) make it much easier to see where C++ tests are hanging or failing when such regressions occur.
Why pin the 3.14t CI jobs to
3.14.0tThe free-threaded jobs in this project have been running against the moving
3.14ttarget provided byactions/setup-python. Recently, those same jobs began to hang in the C++ test binary (test_with_catch) once the underlying CPython moved from 3.14.0t to 3.14.1t+. The symptoms, however, differ by platform:Windows: hang in
Move SubinterpreterteardownOn Windows, with
python-version: 3.14t:cpptesttarget runs and thetest_with_catchexecutable starts.Move Subinterpretertest, the logs show that:test_with_catchpass.TEST_CASE("Move Subinterpreter")starts and runs through all of its internal steps, including:py::subinterpreter.datetime,threading, andexternal_moduleinside the subinterpreter.unsafe_reset_internals_for_single_interpreter()on return.LOOOK file:linemarkers (reverted) fire up to the very end of the test body.[ OK ] Move Subinterpreter, and the process sits until the CI job times out and the runner kills thetest_with_catchprocess.This strongly suggests the hang is in interpreter teardown / subinterpreter finalization, not in the user-visible test code itself.
Ubuntu and macOS: hang in CMake
VerifyGlobs.cmakebefore C++ tests startOn Ubuntu and macOS, with
python-version: 3.14t, the failure mode is different but appears to be related:The Python tests (
pytest) complete successfully.When building the
cpptesttarget, Ninja runs:cmake -P .../CMakeFiles/VerifyGlobs.cmakeThe jobs hang inside this
VerifyGlobs.cmakestep, with no output at all fromtest_with_catch:test_with_catchis never actually invoked.Move Subinterpreternever have a chance to run on these platforms.While we do not have a minimal reproducer for the
VerifyGlobs.cmakehang, the timing is suspicious:The last good CI runs used 3.14.0t and did not exhibit this behavior.
The first bad CI runs used 3.14.1t+, and both the Windows
Move Subinterpreterhang and the Ubuntu/macOSVerifyGlobs.cmakehang appeared at the same time.A local bisect over the CPython 3.14 branch (using a small SCons-based harness and a 5-second timeout on
test_with_catch) isolates the first failing CPython commit to08bea299bfd4377611df42e4e42414ffacea4f7f:This commit changes interpreter creation/teardown logic in
Python/pylifecycle.cto handle OOM more robustly. Given that:test_with_catchis even run.…it is very plausible that this is one underlying CPython regression manifesting differently on different platforms, rather than two unrelated issues that just happened to appear at the same time.
Rationale for pinning
Pinning the free-threaded CI jobs to
3.14.0tis therefore a pragmatic, short-term workaround that:Concretely,
.github/workflows/ci.ymlis updated so that the free-threaded entries in the matrices use:python-version: '3.14.0t'on Ubuntu, macOS, and Windows where we previously used'3.14t'.All other Python versions in the matrices continue to behave as before.
Progress reporter for
test_with_catchBefore this PR, the
test_with_catchC++ binary only printed a summary like:This made it very difficult to see which test was hanging or failing in CI, especially once free-threaded regressions began to affect subinterpreter-related tests.
This PR updates
tests/test_with_catch/catch.cppto:ProgressReporterthat:[ RUN ] <test-name>[ OK ] <test-name>or[ FAILED ] <test-name>Py_GetVersion():[ PYTHON ] 3.14.1+ free-threading build (tags/v3.14.1:..., ...) [GCC ...]ProgressReporterthe default reporter for this binary viaCATCH_CONFIG_DEFAULT_REPORTER "progress".This has several concrete benefits:
CI observability (Windows case):
test_with_catchactually runs and reachesMove Subinterpreterbefore hanging, the new reporter makes it immediately obvious which test is in progress and which Python version is in use.Move Subinterpretertest body completes and that the hang is in teardown.CI observability (Ubuntu/macOS case):
test_with_catchis ever invoked, so the progress reporter does not yet help with that specific symptom.VerifyGlobs.cmakeno longer hangs, we will immediately benefit from the improved C++ test logging on those platforms as well.Local debugging:
In short, even though the primary motivation was diagnosing the free-threaded regression, the progress reporter is a generally useful improvement to the C++ test experience and is intended to stay.
For easy future reference, this is the ChatGPT 5.2 Pro Thinking chat related to developing the implementation of the progress reporter:
According to the chat, there is no equivalent built-in reporter.
Opting back into the old compact output
If a developer prefers the old, more compact Catch2 output locally, they can still request it explicitly on the command line, for example:
test_with_catch -r compactThis overrides the default
progressreporter set at compile time and restores the previous summary-style behavior for that run.In other words, the new progress reporter is:
-r compact(or another reporter) for developers who prefer the old output style for ad hoc runs.Summary of intent
The expectation is that, once the CPython regression in interpreter creation/teardown is fixed upstream and available via
actions/setup-python, we can relax the pin back to3.14t(or another appropriate selector), while keeping the progress reporter as a permanent quality-of-life improvement for debugging.Suggested changelog entry: