You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Use thread_local for loader_life_support to improve performance
As explained in a new code comment, `loader_life_support` needs to be
`thread_local` but does not need to be isolated to a particular
interpreter because any given function call is already going to only
happen on a single interpreter by definiton.
Performance before:
- on M4 Max using pybind/pybind11_benchmark unmodified repo:
```
> python -m timeit --setup 'from pybind11_benchmark import collatz' 'collatz(4)'
5000000 loops, best of 5: 63.8 nsec per loop
```
- Linux server:
```
python -m timeit --setup 'from pybind11_benchmark import collatz' 'collatz(4)' (pytorch)
2000000 loops, best of 5: 120 nsec per loop
```
After:
- M4 Max:
```
python -m timeit --setup 'from pybind11_benchmark import collatz' 'collatz(4)'
5000000 loops, best of 5: 53.1 nsec per loop
```
- Linux server:
```
> python -m timeit --setup 'from pybind11_benchmark import collatz' 'collatz(4)' (pytorch)
2000000 loops, best of 5: 101 nsec per loop
```
A quick profile with perf shows that pthread_setspecific and pthread_getspecific are gone.
Open questions:
- How do we determine whether we can safely use `thread_local`? I see
concerns about old iOS versions on
#5705 (comment)
and #5709; is there anything
else?
- Do we have a test that covers "function called in one interpreter
calls a C++ function that causes a function call in another
interpreter"? I think it's fine, but can it happen?
- Are we happy with what we think will happen in the case where
multiple extensions compiled with and without this PR interoperate?
I think it's fine -- each dispatch pushes and cleans up its own
state -- but a second opinion is certainly welcome.
0 commit comments