-
Notifications
You must be signed in to change notification settings - Fork 2.2k
fix: modify the internals pointer-to-pointer implementation to not use thread_local
#5709
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
I'll try to get #5705, rebase this and add a comment the removes the subinterp disable define. That should also trigger the full CI. I think we do need an internals bump just in case people built with the old RC. @rwgk might also bump the internals once more for the reworking he proposed in #5700 (maybe next weekend?). |
dd73712
to
cb9227a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple comments.
I don't have the free bandwidth at the moment to fully review this. I only scrolled through and stopped here and there. I like what I'm seeing!
@henryiii please go ahead and merge this if it looks good to you. (Assuming you're bumping the internals version.)
Should now just be able to delete the internals PP on destruction
Also fix a couple more pedantic warings
So instead, just make sure it was zero'd and don't try to compare the addresses. Also a little code cleanup
cb9227a
to
20f3606
Compare
Signed-off-by: Henry Schreiner <henryschreineriii@gmail.com>
thread_local
thread_local
Hi @b-pass, I'm picking this PR more or less randomly to get your attention. Coincidentally, I noticed the "ignored" Do you have any ideas what could be behind the
|
I poked around a bit, I'm not totally sure.... this part of internal CPython threading is very different on newer versions. My guess is that this is related to a subinterpreter state being already release when the threading module goes to do its cleanup in this version. |
As explained in a new code comment, loader_life_support needs to be thread_local but does not need to be isolated to a particular interpreter because any given function call is already going to only happen on a single interpreter by definiton. Performance before on M4 Max using pybind/pybind11_benchmark unmodified repo: ``` > python -m timeit --setup 'from pybind11_benchmark import collatz' 'collatz(4)' 5000000 loops, best of 5: 63.8 nsec per loop ``` After: ``` python -m timeit --setup 'from pybind11_benchmark import collatz' 'collatz(4)' 5000000 loops, best of 5: 53.1 nsec per loop ``` Open questions: - How do we determine whether we can safely use `thread_local`? I see concerns about old iOS versions on pybind#5705 (comment) and pybind#5709; is there anything else? - Do we have a test that covers "function called in one interpreter calls a C++ function that causes a function call in another interpreter??
As explained in a new code comment, `loader_life_support` needs to be `thread_local` but does not need to be isolated to a particular interpreter because any given function call is already going to only happen on a single interpreter by definiton. Performance before: - on M4 Max using pybind/pybind11_benchmark unmodified repo: ``` > python -m timeit --setup 'from pybind11_benchmark import collatz' 'collatz(4)' 5000000 loops, best of 5: 63.8 nsec per loop ``` - Linux server: ``` python -m timeit --setup 'from pybind11_benchmark import collatz' 'collatz(4)' (pytorch) 2000000 loops, best of 5: 120 nsec per loop ``` After: - M4 Max: ``` python -m timeit --setup 'from pybind11_benchmark import collatz' 'collatz(4)' 5000000 loops, best of 5: 53.1 nsec per loop ``` - Linux server: ``` > python -m timeit --setup 'from pybind11_benchmark import collatz' 'collatz(4)' (pytorch) 2000000 loops, best of 5: 101 nsec per loop ``` A quick profile with perf shows that pthread_setspecific and pthread_getspecific are gone. Open questions: - How do we determine whether we can safely use `thread_local`? I see concerns about old iOS versions on pybind#5705 (comment) and pybind#5709; is there anything else? - Do we have a test that covers "function called in one interpreter calls a C++ function that causes a function call in another interpreter"? I think it's fine, but can it happen? - Are we happy with what we think will happen in the case where multiple extensions compiled with and without this PR interoperate? I think it's fine -- each dispatch pushes and cleans up its own state -- but a second opinion is certainly welcome.
* Use thread_local for loader_life_support to improve performance As explained in a new code comment, `loader_life_support` needs to be `thread_local` but does not need to be isolated to a particular interpreter because any given function call is already going to only happen on a single interpreter by definiton. Performance before: - on M4 Max using pybind/pybind11_benchmark unmodified repo: ``` > python -m timeit --setup 'from pybind11_benchmark import collatz' 'collatz(4)' 5000000 loops, best of 5: 63.8 nsec per loop ``` - Linux server: ``` python -m timeit --setup 'from pybind11_benchmark import collatz' 'collatz(4)' (pytorch) 2000000 loops, best of 5: 120 nsec per loop ``` After: - M4 Max: ``` python -m timeit --setup 'from pybind11_benchmark import collatz' 'collatz(4)' 5000000 loops, best of 5: 53.1 nsec per loop ``` - Linux server: ``` > python -m timeit --setup 'from pybind11_benchmark import collatz' 'collatz(4)' (pytorch) 2000000 loops, best of 5: 101 nsec per loop ``` A quick profile with perf shows that pthread_setspecific and pthread_getspecific are gone. Open questions: - How do we determine whether we can safely use `thread_local`? I see concerns about old iOS versions on #5705 (comment) and #5709; is there anything else? - Do we have a test that covers "function called in one interpreter calls a C++ function that causes a function call in another interpreter"? I think it's fine, but can it happen? - Are we happy with what we think will happen in the case where multiple extensions compiled with and without this PR interoperate? I think it's fine -- each dispatch pushes and cleans up its own state -- but a second opinion is certainly welcome. * Remove PYBIND11_CAN_USE_THREAD_LOCAL * clarify comment * Simplify loader_life_support TLS storage Replace the `fake_thread_specific_storage` struct with a direct thread-local pointer managed via a function-local static: static loader_life_support *& tls_current_frame() This retains the "stack of frames" behavior via the `parent` link. It also reduces indirection and clarifies intent. Note: this form is C++11-compatible; once pybind11 requires C++17, the helper can be simplified to: inline static thread_local loader_life_support *tls_current_frame = nullptr; * loader_life_support: avoid duplicate tls_current_frame() calls Replace repeated calls with a single local reference: auto &frame = tls_current_frame(); This ensures the thread_local initialization guard is checked only once per constructor/destructor call site, avoids potential clang-tidy complaints, and makes the code more readable. Functional behavior is unchanged. * Add REMINDER for next version bump in internals.h --------- Co-authored-by: Ralf W. Grosse-Kunstleve <rgrossekunst@nvidia.com>
Description
As mentioned in #5705, there are a couple platforms that don't support C++11's
thread_local
keyword but they do still support Python's thread-specific-storage.So this implementation restructures the part of sub-interpreter support that was using thread_local to instead use CPython's TSS.
To make this a little easier I added a wrapper class around the PYBIND11_TLS_* macros. This makes accessing them feel a lot more like accessing a pointer, and puts their allocation and release into RAII.
I also changed the
internals
use of PYBIND11_TLS (tstate
andloader_life_support_tls_key
members) to use the new wrapper. This was not strictly required to address the goal of the PR, but it makes sense to do this since there is a wrapper for it now.NOTE: This should probably increment the internals ABI number, but since that was already changed for RC1, I didn't change it in this PR. Should it be changed?
Suggested changelog entry:
thread_local
(better iOS support)