Skip to content

Commit a912648

Browse files
committed
Use thread_local for loader_life_support to improve performance
As explained in a new code comment, `loader_life_support` needs to be `thread_local` but does not need to be isolated to a particular interpreter because any given function call is already going to only happen on a single interpreter by definiton. Performance before: - on M4 Max using pybind/pybind11_benchmark unmodified repo: ``` > python -m timeit --setup 'from pybind11_benchmark import collatz' 'collatz(4)' 5000000 loops, best of 5: 63.8 nsec per loop ``` - Linux server: ``` python -m timeit --setup 'from pybind11_benchmark import collatz' 'collatz(4)' (pytorch) 2000000 loops, best of 5: 120 nsec per loop ``` After: - M4 Max: ``` python -m timeit --setup 'from pybind11_benchmark import collatz' 'collatz(4)' 5000000 loops, best of 5: 53.1 nsec per loop ``` - Linux server: ``` > python -m timeit --setup 'from pybind11_benchmark import collatz' 'collatz(4)' (pytorch) 2000000 loops, best of 5: 101 nsec per loop ``` A quick profile with perf shows that pthread_setspecific and pthread_getspecific are gone. Open questions: - How do we determine whether we can safely use `thread_local`? I see concerns about old iOS versions on #5705 (comment) and #5709; is there anything else? - Do we have a test that covers "function called in one interpreter calls a C++ function that causes a function call in another interpreter"? I think it's fine, but can it happen? - Are we happy with what we think will happen in the case where multiple extensions compiled with and without this PR interoperate? I think it's fine -- each dispatch pushes and cleans up its own state -- but a second opinion is certainly welcome.
1 parent 852a4b5 commit a912648

File tree

2 files changed

+37
-3
lines changed

2 files changed

+37
-3
lines changed

include/pybind11/detail/common.h

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1344,5 +1344,8 @@ constexpr
13441344
# define PYBIND11_BACKWARD_COMPATIBILITY_TP_DICTOFFSET
13451345
#endif
13461346

1347+
// TODO: determine which platforms cannot use thread_local.
1348+
#define PYBIND11_CAN_USE_THREAD_LOCAL 1
1349+
13471350
PYBIND11_NAMESPACE_END(detail)
13481351
PYBIND11_NAMESPACE_END(PYBIND11_NAMESPACE)

include/pybind11/detail/type_caster_base.h

Lines changed: 34 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -45,17 +45,48 @@ class loader_life_support {
4545
loader_life_support *parent = nullptr;
4646
std::unordered_set<PyObject *> keep_alive;
4747

48+
#if defined(PYBIND11_CAN_USE_THREAD_LOCAL)
49+
struct fake_thread_specific_storage {
50+
loader_life_support *instance = nullptr;
51+
loader_life_support *get() const { return instance; }
52+
53+
fake_thread_specific_storage &operator=(loader_life_support *pval) {
54+
instance = pval;
55+
return *this;
56+
}
57+
};
58+
using loader_storage = fake_thread_specific_storage;
59+
#else
60+
using loader_storage = thread_specific_storage<loader_life_support>;
61+
#endif
62+
63+
static loader_storage &get_stack_top() {
64+
#if defined(PYBIND11_CAN_USE_THREAD_LOCAL)
65+
// Without this branch, loader_life_support destruction is a
66+
// significant cost per function call.
67+
//
68+
// Observation: loader_life_support needs to be thread-local, but
69+
// we don't need to go to extra effort to keep it per-interpreter
70+
// (i.e., by putting it in internals) since function calls are
71+
// already isolated to a single interpreter.
72+
static thread_local fake_thread_specific_storage storage;
73+
return storage;
74+
#else
75+
return get_internals().loader_life_support_tls;
76+
#endif
77+
}
78+
4879
public:
4980
/// A new patient frame is created when a function is entered
5081
loader_life_support() {
51-
auto &stack_top = get_internals().loader_life_support_tls;
82+
auto &stack_top = get_stack_top();
5283
parent = stack_top.get();
5384
stack_top = this;
5485
}
5586

5687
/// ... and destroyed after it returns
5788
~loader_life_support() {
58-
auto &stack_top = get_internals().loader_life_support_tls;
89+
auto &stack_top = get_stack_top();
5990
if (stack_top.get() != this) {
6091
pybind11_fail("loader_life_support: internal error");
6192
}
@@ -68,7 +99,7 @@ class loader_life_support {
6899
/// This can only be used inside a pybind11-bound function, either by `argument_loader`
69100
/// at argument preparation time or by `py::cast()` at execution time.
70101
PYBIND11_NOINLINE static void add_patient(handle h) {
71-
loader_life_support *frame = get_internals().loader_life_support_tls.get();
102+
loader_life_support *frame = get_stack_top().get();
72103
if (!frame) {
73104
// NOTE: It would be nice to include the stack frames here, as this indicates
74105
// use of pybind11::cast<> outside the normal call framework, finding such

0 commit comments

Comments
 (0)