-
Notifications
You must be signed in to change notification settings - Fork 285
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle early finalization of TSFN by Node.js #744
Conversation
@@ -34,6 +35,7 @@ unsafe impl Sync for Tsfn {} | |||
/// function for scheduling tasks to execute on a JavaScript thread. | |||
pub struct ThreadsafeFunction<T> { | |||
tsfn: Tsfn, | |||
is_finalized: Arc<Mutex<bool>>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be an AtomicBool
instead of a Mutex<bool>
? I'm also not clear why it needs to be an Arc
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Arc
provides a nice way of passing the pointer to the C function, because unlike Box
you can clone it before turning into a raw pointer, and then decrement the reference count when turning it back from the raw pointer to the Arc
instance and dropping it.
I think it could be AtomicBool
. Let me try that in a moment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved to AtomicBool
. @jrose-signal must be really happy to see this happen 😉
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it has to be something indirect because ThreadsafeFunction is passed around by value. Box is probably safe because the finalizer callback is guaranteed to be invoked before Drop::drop
completes (either Node invokes it on shutdown or we invoke it in drop
), but it's a lot easier to mess that up somehow.
I think you're right about the race in #739 (comment). :-( A 99% multithreading fix can be worse than no fix, because it makes it harder to track down further issues. |
There more I think about it the more I agree that this is a serious bug in Node.js that has to be fixed. Another race condition that I just discovered is with the |
Nevermind, I'm a fool. ThreadsafeFunction can only be destroyed from the main loop because it might hold a reference to a JS function. The documentation explicitly says:
Now, I've checked all code paths and it cannot crash as I previously thought. To prevent the race condition in this patch, however, we'll have to hold the mutex for the duration of the |
Alright, pushed an update. Sadly
In both of the cases we get the desired result. This PR is ready for a review again. |
Had to add a mutex to the
This is quite inconvenient for Neon, but would likely be needed until we'll get some resolution in nodejs/node#38775 (which would be an documentation change, but hopefully not ABI change) |
Thread-safe functions can get released early by Node.js itself during the teardown of the Environment that holds them. When this happens - `finalize_cb` of the thread-safe function is invoked to notify us. This change makes us remember if the thread-safe function was finalized before we got into the `Drop` implementation for it so that we don't deallocate it twice. Additionally, we should not invoke `call_threadsafe_function` after either `finalize_cb` call or `napi_closing` return value from previous `call_threadsafe_function`. Handle this by holding a mutex around the call, which doesn't introduce any additional locking because `call_threadsafe_function` itself holds another mutex while active.
Force pushed PR with a rebase over main branch and all changes squashed. |
// Node.js | ||
if *is_finalized { | ||
return; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't you need to drop the lock before calling napi::release_threadsafe_function
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope, actually I don't the idea here is that we want finalize_cb
to block until release_threadsafe_function
completes, because otherwise finalize_cb
might finish first and delete the tsfn in C++.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mutex isn't re-entrant, though, so in the non-early-exit case you'll lock, call release_threadsafe_function
, and fail when it's time to lock again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
release_threadsafe_function
doesn't call finalize_cb
directly. It schedules an event on the loop thread and executes finalize_cb
in another tick.
See: napi-rs/napi-rs#522 . We are not the first to run into it! napi-rs took a milder approach of just leaking the function and letting Node.js reclaim it 😂 |
After fixing a leak in #750, this started happening every time with the global queue. I think I fully understand this fix now and appreciate it. Thanks! |
No worries! Let's land the thing! 😉 |
Thanks so much for this fix! I'm going to release it along with this PR: #750 |
Thread-safe functions can get released early by Node.js itself during
the teardown of the Environment that holds them. When this happens -
finalize_cb
of the thread-safe function is invoked to notify us. Thischange makes us remember if the thread-safe function was finalized
before we got into the
Drop
implementation for it so that we don'tdeallocate it twice.
Additionally, we should not invoke
call_threadsafe_function
aftereither
finalize_cb
call ornapi_closing
return value from previouscall_threadsafe_function
. Handle this by holding a mutex around thecall, which doesn't introduce any additional locking because
call_threadsafe_function
itself holds another mutex while active.See: #739 (comment)