@b-pass (Contributor) commented Jun 2, 2025

Description

As mentioned in #5705, there are a couple of platforms that don't support C++11's thread_local keyword but do still support Python's thread-specific storage (TSS).

So this implementation restructures the part of sub-interpreter support that was using thread_local to instead use CPython's TSS.

To make this a little easier I added a wrapper class around the PYBIND11_TLS_* macros. This makes accessing them feel a lot more like accessing a pointer, and puts their allocation and release into RAII.
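
For readers unfamiliar with the pattern, here is a minimal sketch of what such an RAII wrapper could look like. It is not the code from this PR: the class name `tss_pointer` and its members are hypothetical, and POSIX pthread keys stand in for CPython's TSS API so that the example compiles without Python headers.

```cpp
#include <cassert>
#include <pthread.h>
#include <stdexcept>

// Hypothetical RAII wrapper around one thread-specific storage (TSS) key.
// Assumption: pthread keys are used here in place of CPython's TSS API;
// the class name and interface are illustrative, not pybind11's.
template <typename T>
class tss_pointer {
public:
    tss_pointer() {
        // Allocate the key once; each thread then gets its own slot.
        if (pthread_key_create(&key_, nullptr) != 0) {
            throw std::runtime_error("could not allocate a TSS key");
        }
    }
    ~tss_pointer() { pthread_key_delete(key_); }

    // The key is a process-wide resource, so copying is disabled.
    tss_pointer(const tss_pointer &) = delete;
    tss_pointer &operator=(const tss_pointer &) = delete;

    // Read the calling thread's slot (nullptr until that thread stores something).
    T *get() const { return static_cast<T *>(pthread_getspecific(key_)); }

    // Write the calling thread's slot.
    void reset(T *value = nullptr) {
        if (pthread_setspecific(key_, value) != 0) {
            throw std::runtime_error("could not store a TSS value");
        }
    }

    // Pointer-like conveniences.
    tss_pointer &operator=(T *value) {
        reset(value);
        return *this;
    }
    T *operator->() const { return get(); }
    T &operator*() const { return *get(); }
    explicit operator bool() const { return get() != nullptr; }

private:
    pthread_key_t key_;
};

// Small usage check on the calling thread.
inline int tss_pointer_demo() {
    tss_pointer<int> slot;
    assert(!slot); // nothing stored yet
    int value = 41;
    slot = &value; // store a pointer in this thread's slot
    *slot += 1;    // dereference like an ordinary pointer
    return *slot.get();
}
```

The key is allocated in the constructor and released in the destructor, so callers never pair create/delete calls by hand, and reads and writes look like ordinary pointer operations.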

I also changed the internals' uses of PYBIND11_TLS (the tstate and loader_life_support_tls_key members) to use the new wrapper. This was not strictly required for the goal of the PR, but it makes sense now that a wrapper exists.

NOTE: This should probably increment the internals ABI number, but since that was already changed for RC1, I didn't change it in this PR. Should it be changed?

Suggested changelog entry:

  • Modify internals pointer-to-pointer implementation to not use thread_local (better iOS support)

@b-pass b-pass marked this pull request as draft June 2, 2025 02:14
@b-pass b-pass marked this pull request as ready for review June 2, 2025 03:37
@henryiii (Collaborator) commented Jun 2, 2025

I'll try to get #5705 in, rebase this, and add a commit that removes the subinterp disable define. That should also trigger the full CI.

I think we do need an internals bump just in case people built with the old RC. @rwgk might also bump the internals once more for the reworking he proposed in #5700 (maybe next weekend?).

@henryiii henryiii force-pushed the subinterpreter-in-tss branch from dd73712 to cb9227a June 2, 2025 04:31
@rwgk (Collaborator) left a comment

A couple comments.

I don't have the free bandwidth at the moment to fully review this. I only scrolled through and stopped here and there. I like what I'm seeing!

@henryiii please go ahead and merge this if it looks good to you. (Assuming you're bumping the internals version.)

@b-pass b-pass force-pushed the subinterpreter-in-tss branch from cb9227a to 20f3606 June 2, 2025 21:06
pre-commit-ci bot and others added 3 commits June 2, 2025 21:06
Signed-off-by: Henry Schreiner <henryschreineriii@gmail.com>
@henryiii henryiii merged commit c7026d0 into pybind:master Jun 3, 2025
83 checks passed
@github-actions github-actions bot added the "needs changelog" label Jun 3, 2025
@henryiii henryiii changed the title from "Modify the internals pointer-to-pointer implementation to not use thread_local" to "fix: modify the internals pointer-to-pointer implementation to not use thread_local" Jun 3, 2025
@rwgk (Collaborator) commented Jul 10, 2025

Hi @b-pass, I'm picking this PR more or less randomly to get your attention.

Coincidentally, I noticed the "ignored" AssertionError below today. I don't remember seeing this before, but it's easily missed in the wall of output (I'm using VERBOSE), and my dev environment changed a lot in the last couple weeks. The tests ran under Windows 11 WSL/Ubuntu 24.04.

Do you have any ideas what could be behind the assert tlock.locked() AssertionError?

make  -f tests/test_embed/CMakeFiles/cpptest.dir/build.make tests/test_embed/CMakeFiles/cpptest.dir/build
make[3]: Entering directory '/home/rgrossekunst/bld/cmake_build'
cd /home/rgrossekunst/bld/cmake_build/tests/test_embed && /home/rgrossekunst/bld/cmake_build/tests/test_embed/test_embed
Exception ignored in: <module 'threading' from '/usr/lib/python3.12/threading.py'>
Traceback (most recent call last):
  File "/usr/lib/python3.12/threading.py", line 1600, in _shutdown
    assert tlock.locked()
           ^^^^^^^^^^^^^^
AssertionError:
===============================================================================
All tests passed (1589 assertions in 20 test cases)

make[3]: Leaving directory '/home/rgrossekunst/bld/cmake_build'

@rwgk rwgk removed the "needs changelog" label Jul 10, 2025
@b-pass (Contributor, Author) commented Jul 15, 2025

> Do you have any ideas what could be behind the assert tlock.locked() AssertionError?

I poked around a bit, but I'm not totally sure. This part of CPython's internal threading is very different in newer versions. My guess is that, in this version, a subinterpreter's state has already been released by the time the threading module goes to do its cleanup.

swolchok added a commit to swolchok/pybind11 that referenced this pull request Sep 5, 2025
As explained in a new code comment, loader_life_support needs to be
thread_local but does not need to be isolated to a particular
interpreter because any given function call is already going to only
happen on a single interpreter by definition.

Performance before on M4 Max using pybind/pybind11_benchmark unmodified repo:
```
> python -m timeit --setup 'from pybind11_benchmark import collatz' 'collatz(4)'
5000000 loops, best of 5: 63.8 nsec per loop
```

After:
```
python -m timeit --setup 'from pybind11_benchmark import collatz' 'collatz(4)'
5000000 loops, best of 5: 53.1 nsec per loop
```

Open questions:

- How do we determine whether we can safely use `thread_local`? I see
  concerns about old iOS versions on
  pybind#5705 (comment)
  and pybind#5709; is there anything
  else?
- Do we have a test that covers "function called in one interpreter
  calls a C++ function that causes a function call in another
  interpreter"?
swolchok added a commit to swolchok/pybind11 that referenced this pull request Sep 5, 2025
As explained in a new code comment, `loader_life_support` needs to be
`thread_local` but does not need to be isolated to a particular
interpreter because any given function call is already going to only
happen on a single interpreter by definition.

Performance before:
- on M4 Max using pybind/pybind11_benchmark unmodified repo:
```
> python -m timeit --setup 'from pybind11_benchmark import collatz' 'collatz(4)'
5000000 loops, best of 5: 63.8 nsec per loop
```

- Linux server:
```
python -m timeit --setup 'from pybind11_benchmark import collatz' 'collatz(4)'
2000000 loops, best of 5: 120 nsec per loop
```

After:
- M4 Max:
```
python -m timeit --setup 'from pybind11_benchmark import collatz' 'collatz(4)'
5000000 loops, best of 5: 53.1 nsec per loop
```

- Linux server:
```
> python -m timeit --setup 'from pybind11_benchmark import collatz' 'collatz(4)'
2000000 loops, best of 5: 101 nsec per loop
```

A quick profile with perf shows that pthread_setspecific and pthread_getspecific are gone.

Open questions:

- How do we determine whether we can safely use `thread_local`? I see
  concerns about old iOS versions on
  pybind#5705 (comment)
  and pybind#5709; is there anything
  else?
- Do we have a test that covers "function called in one interpreter
  calls a C++ function that causes a function call in another
  interpreter"? I think it's fine, but can it happen?
- Are we happy with what we think will happen in the case where
  multiple extensions compiled with and without this PR interoperate?
  I think it's fine -- each dispatch pushes and cleans up its own
  state -- but a second opinion is certainly welcome.
rwgk added a commit that referenced this pull request Sep 12, 2025
* Use thread_local for loader_life_support to improve performance

* Remove PYBIND11_CAN_USE_THREAD_LOCAL

* clarify comment

* Simplify loader_life_support TLS storage

Replace the `fake_thread_specific_storage` struct with a direct
thread-local pointer managed via a function-local static:

    static loader_life_support *& tls_current_frame()

This retains the "stack of frames" behavior via the `parent` link. It also
reduces indirection and clarifies intent.

Note: this form is C++11-compatible; once pybind11 requires C++17, the
helper can be simplified to:

    inline static thread_local loader_life_support *tls_current_frame = nullptr;

* loader_life_support: avoid duplicate tls_current_frame() calls

Replace repeated calls with a single local reference:

    auto &frame = tls_current_frame();

This ensures the thread_local initialization guard is checked only once
per constructor/destructor call site, avoids potential clang-tidy
complaints, and makes the code more readable. Functional behavior is
unchanged.

* Add REMINDER for next version bump in internals.h
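
The frame-stack storage described in the commit messages above (a function-local `thread_local` pointer, a `parent` link, and a single local reference per constructor/destructor) might look roughly like the sketch below. The class `life_support_frame` is a hypothetical stand-in for `loader_life_support`, and `current()` and `depth()` are illustrative helpers; only the bookkeeping is modelled, not pybind11's actual implementation.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for loader_life_support: models only the per-thread
// stack of frames, not the keep-alive logic of the real class.
class life_support_frame {
public:
    life_support_frame() {
        auto &frame = tls_current_frame(); // look the slot up once
        parent_ = frame;                   // remember the enclosing frame
        frame = this;                      // push this frame
    }
    ~life_support_frame() {
        auto &frame = tls_current_frame();
        assert(frame == this && "frames must be destroyed in LIFO order");
        frame = parent_;                   // pop back to the enclosing frame
    }
    life_support_frame(const life_support_frame &) = delete;
    life_support_frame &operator=(const life_support_frame &) = delete;

    // Innermost live frame on the calling thread, or nullptr if none.
    static life_support_frame *current() { return tls_current_frame(); }

    // Number of live frames on the calling thread (illustrative helper).
    static std::size_t depth() {
        std::size_t n = 0;
        for (auto *f = tls_current_frame(); f != nullptr; f = f->parent_) {
            ++n;
        }
        return n;
    }

private:
    // C++11-compatible form: a function-local thread_local needs no
    // out-of-class definition and no C++17 inline variable.
    static life_support_frame *&tls_current_frame() {
        static thread_local life_support_frame *frame_ptr = nullptr;
        return frame_ptr;
    }

    life_support_frame *parent_ = nullptr;
};

// Nested frames on one thread: the inner frame is popped at scope exit.
inline std::size_t nested_depth_demo() {
    life_support_frame outer;
    {
        life_support_frame inner;
        assert(life_support_frame::current() == &inner);
    }
    return life_support_frame::depth(); // only `outer` is still live
}
```

Because the pointer is `thread_local`, each thread has its own stack, and no key lookup through `pthread_getspecific` is needed on the call path.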

---------

Co-authored-by: Ralf W. Grosse-Kunstleve <rgrossekunst@nvidia.com>