fix: modify the internals pointer-to-pointer implementation to not use `thread_local` #5709

b-pass · 2025-06-02T02:14:18Z

Description

As mentioned in #5705, a couple of platforms don't support C++11's `thread_local` keyword, but they do still support Python's thread-specific storage (TSS).

So this implementation restructures the part of sub-interpreter support that was using thread_local to instead use CPython's TSS.

To make this a little easier I added a wrapper class around the PYBIND11_TLS_* macros. This makes accessing them feel a lot more like accessing a pointer, and puts their allocation and release into RAII.

I also changed the internals use of PYBIND11_TLS (tstate and loader_life_support_tls_key members) to use the new wrapper. This was not strictly required to address the goal of the PR, but it makes sense to do this since there is a wrapper for it now.
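
For readers unfamiliar with the pattern, here is a minimal sketch of what an RAII wrapper around a thread-specific-storage key can look like. It is not the class added in this PR: the name `tss_pointer` is hypothetical, and POSIX `pthread_key_*` calls stand in for the `PYBIND11_TLS_*` macros (and the CPython TSS API behind them) so that the sketch compiles without Python headers.

```cpp
#include <cassert>
#include <pthread.h>
#include <stdexcept>
#include <thread>

// Hypothetical illustration, not pybind11's actual wrapper.
// ASSUMPTION: pthread keys stand in for the PYBIND11_TLS_* macros.
template <typename T>
class tss_pointer {
public:
    // Allocate the key on construction (RAII acquire).
    tss_pointer() {
        if (pthread_key_create(&key_, nullptr) != 0) {
            throw std::runtime_error("could not allocate a TSS key");
        }
    }
    // Release the key on destruction (RAII release).
    ~tss_pointer() { pthread_key_delete(key_); }

    tss_pointer(const tss_pointer &) = delete;
    tss_pointer &operator=(const tss_pointer &) = delete;

    // Read the calling thread's slot; nullptr if this thread never set it.
    T *get() const { return static_cast<T *>(pthread_getspecific(key_)); }

    // Assigning a raw pointer stores it in the calling thread's slot.
    tss_pointer &operator=(T *value) {
        pthread_setspecific(key_, value);
        return *this;
    }

    // Pointer-like access to the calling thread's value.
    T &operator*() const {
        assert(get() != nullptr);
        return *get();
    }
    T *operator->() const { return get(); }
    explicit operator bool() const { return get() != nullptr; }

private:
    pthread_key_t key_;
};

// Demo helper: reports whether a freshly started thread sees an empty slot.
inline bool other_thread_sees_null(const tss_pointer<int> &slot) {
    bool is_null = false;
    std::thread worker([&] { is_null = (slot.get() == nullptr); });
    worker.join();
    return is_null;
}
```

With a wrapper of this shape, call sites read like `slot = &value;` and `*slot`, and the key cannot leak because its lifetime is tied to the wrapper object.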

NOTE: This should probably increment the internals ABI number, but since that was already changed for RC1, I didn't change it in this PR. Should it be changed?

Suggested changelog entry:

Modify internals pointer-to-pointer implementation to not use thread_local (better iOS support)

henryiii · 2025-06-02T03:59:03Z

I'll try to get #5705, rebase this, and add a commit that removes the subinterp disable define. That should also trigger the full CI.

I think we do need an internals bump just in case people built with the old RC. @rwgk might also bump the internals once more for the reworking he proposed in #5700 (maybe next weekend?).

rwgk

A couple comments.

I don't have the free bandwidth at the moment to fully review this. I only scrolled through and stopped here and there. I like what I'm seeing!

@henryiii please go ahead and merge this if it looks good to you. (Assuming you're bumping the internals version.)

rwgk · 2025-07-10T18:39:30Z

Hi @b-pass, I'm picking this PR more or less randomly to get your attention.

Coincidentally, I noticed the "ignored" AssertionError below today. I don't remember seeing this before, but it's easily missed in the wall of output (I'm using VERBOSE), and my dev environment changed a lot in the last couple weeks. The tests ran under Windows 11 WSL/Ubuntu 24.04.

Do you have any ideas what could be behind the assert tlock.locked() AssertionError?

make  -f tests/test_embed/CMakeFiles/cpptest.dir/build.make tests/test_embed/CMakeFiles/cpptest.dir/build
make[3]: Entering directory '/home/rgrossekunst/bld/cmake_build'
cd /home/rgrossekunst/bld/cmake_build/tests/test_embed && /home/rgrossekunst/bld/cmake_build/tests/test_embed/test_embed
Exception ignored in: <module 'threading' from '/usr/lib/python3.12/threading.py'>
Traceback (most recent call last):
  File "/usr/lib/python3.12/threading.py", line 1600, in _shutdown
    assert tlock.locked()
           ^^^^^^^^^^^^^^
AssertionError:
===============================================================================
All tests passed (1589 assertions in 20 test cases)

make[3]: Leaving directory '/home/rgrossekunst/bld/cmake_build'

b-pass · 2025-07-15T22:35:36Z

> Do you have any ideas what could be behind the assert tlock.locked() AssertionError?

I poked around a bit, but I'm not totally sure. This part of CPython's internal threading is very different in newer versions. My guess is that this is related to a subinterpreter state having already been released when the threading module goes to do its cleanup in this version.

* Use thread_local for loader_life_support to improve performance

  As explained in a new code comment, `loader_life_support` needs to be `thread_local` but does not need to be isolated to a particular interpreter, because any given function call is already going to happen on only a single interpreter by definition.

  Performance before:

  - On M4 Max using pybind/pybind11_benchmark unmodified repo:

    ```
    > python -m timeit --setup 'from pybind11_benchmark import collatz' 'collatz(4)'
    5000000 loops, best of 5: 63.8 nsec per loop
    ```

  - Linux server:

    ```
    python -m timeit --setup 'from pybind11_benchmark import collatz' 'collatz(4)' (pytorch)
    2000000 loops, best of 5: 120 nsec per loop
    ```

  After:

  - M4 Max:

    ```
    python -m timeit --setup 'from pybind11_benchmark import collatz' 'collatz(4)'
    5000000 loops, best of 5: 53.1 nsec per loop
    ```

  - Linux server:

    ```
    > python -m timeit --setup 'from pybind11_benchmark import collatz' 'collatz(4)' (pytorch)
    2000000 loops, best of 5: 101 nsec per loop
    ```

  A quick profile with perf shows that pthread_setspecific and pthread_getspecific are gone.

  Open questions:

  - How do we determine whether we can safely use `thread_local`? I see concerns about old iOS versions on #5705 (comment) and #5709; is there anything else?
  - Do we have a test that covers "function called in one interpreter calls a C++ function that causes a function call in another interpreter"? I think it's fine, but can it happen?
  - Are we happy with what we think will happen in the case where multiple extensions compiled with and without this PR interoperate? I think it's fine -- each dispatch pushes and cleans up its own state -- but a second opinion is certainly welcome.

* Remove PYBIND11_CAN_USE_THREAD_LOCAL

* clarify comment

* Simplify loader_life_support TLS storage

  Replace the `fake_thread_specific_storage` struct with a direct thread-local pointer managed via a function-local static:

      static loader_life_support *& tls_current_frame()

  This retains the "stack of frames" behavior via the `parent` link. It also reduces indirection and clarifies intent.

  Note: this form is C++11-compatible; once pybind11 requires C++17, the helper can be simplified to:

      inline static thread_local loader_life_support *tls_current_frame = nullptr;

* loader_life_support: avoid duplicate tls_current_frame() calls

  Replace repeated calls with a single local reference:

      auto &frame = tls_current_frame();

  This ensures the thread_local initialization guard is checked only once per constructor/destructor call site, avoids potential clang-tidy complaints, and makes the code more readable. Functional behavior is unchanged.

* Add REMINDER for next version bump in internals.h

---------

Co-authored-by: Ralf W. Grosse-Kunstleve <rgrossekunst@nvidia.com>
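
The "stack of frames" pattern described in the commit message above can be sketched roughly as follows. This is an illustration under assumptions, not the pybind11 source: `frame_guard`, `depth()`, and `depth_on_new_thread()` are hypothetical names, and the real `loader_life_support` does additional work (such as keeping temporaries alive) that is omitted here.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>

// Hypothetical stand-in for loader_life_support; names are illustrative only.
class frame_guard {
public:
    frame_guard() {
        auto &frame = tls_current_frame(); // look up the thread-local slot once
        parent_ = frame;                   // remember the enclosing frame
        frame = this;                      // push this frame
    }
    ~frame_guard() {
        auto &frame = tls_current_frame();
        assert(frame == this); // frames are expected to unwind in LIFO order
        frame = parent_;       // pop back to the enclosing frame
    }
    frame_guard(const frame_guard &) = delete;
    frame_guard &operator=(const frame_guard &) = delete;

    // Number of live frames on the calling thread, found by walking parents.
    static std::size_t depth() {
        std::size_t n = 0;
        for (const frame_guard *f = tls_current_frame(); f != nullptr; f = f->parent_) {
            ++n;
        }
        return n;
    }

private:
    // C++11-compatible form: a function-local static thread_local pointer.
    static frame_guard *&tls_current_frame() {
        static thread_local frame_guard *frame_ptr = nullptr;
        return frame_ptr;
    }

    frame_guard *parent_ = nullptr;
};

// Demo helper: a new thread starts with its own empty stack of frames.
inline std::size_t depth_on_new_thread() {
    std::size_t n = 0;
    std::thread worker([&] {
        frame_guard guard;
        n = frame_guard::depth();
    });
    worker.join();
    return n;
}
```

Because each thread has its own head pointer, nested guards on one thread form a linked stack through `parent_`, while guards created on other threads never appear in it.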

b-pass marked this pull request as draft June 2, 2025 02:14

b-pass marked this pull request as ready for review June 2, 2025 03:37

henryiii force-pushed the subinterpreter-in-tss branch from dd73712 to cb9227a Compare June 2, 2025 04:31

rwgk reviewed Jun 2, 2025

henryiii mentioned this pull request Jun 2, 2025

iOS issues pypa/cibuildwheel#2435

Closed

b-pass added 10 commits June 2, 2025 16:59

Refactor internals to use a holder that manages the PP

fc9c4aa

Refactor internals to use a holder that manages the PP

f85aa9c

Fix cleanup/destruction issues.

35e2231

Fix one more destruction issue

3e2592b

Should now just be able to delete the internals PP on destruction

Make clang-tidy happy

b793555

Try to fix exception translators issue on certain platforms

ef830c3

Also fix a couple more pedantic warnings

Fix test, after internals is free'd it can come back at the same address

6989cb6

So instead, just make sure it was zero'd and don't try to compare the addresses. Also a little code cleanup

Comment tweak [skip ci]

edf6933

Switch to ifdef instead of if

e31ec0b

Re-enable subinterpreters in iOS

20f3606

b-pass force-pushed the subinterpreter-in-tss branch from cb9227a to 20f3606 Compare June 2, 2025 21:06

pre-commit-ci bot and others added 3 commits June 2, 2025 21:06

style: pre-commit fixes

add793f

Oops, this snuck in on merge

9493089

fix: bump ABI version to 10

01ff5a9

Signed-off-by: Henry Schreiner <henryschreineriii@gmail.com>

henryiii merged commit c7026d0 into pybind:master Jun 3, 2025
83 checks passed

github-actions bot added the needs changelog Possibly needs a changelog entry label Jun 3, 2025

henryiii changed the title ~~Modify the internals pointer-to-pointer implementation to not use thread_local~~ fix: modify the internals pointer-to-pointer implementation to not use thread_local Jun 3, 2025

BrewTestBot mentioned this pull request Jul 10, 2025

pybind11 3.0.0 Homebrew/homebrew-core#229675

Merged

rwgk removed the needs changelog Possibly needs a changelog entry label Jul 10, 2025

swolchok mentioned this pull request Sep 5, 2025

Use thread_local for loader_life_support to improve performance #5830

Merged
