Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle early finalization of TSFN by Node.js #744

Merged
merged 1 commit into from
Jun 1, 2021

Conversation

indutny
Copy link
Contributor

@indutny indutny commented May 20, 2021

Thread-safe functions can get released early by Node.js itself during
the teardown of the Environment that holds them. When this happens -
finalize_cb of the thread-safe function is invoked to notify us. This
change makes us remember if the thread-safe function was finalized
before we got into the Drop implementation for it so that we don't
deallocate it twice.

Additionally, we should not invoke call_threadsafe_function after
either finalize_cb call or napi_closing return value from previous
call_threadsafe_function. Handle this by holding a mutex around the
call, which doesn't introduce any additional locking because
call_threadsafe_function itself holds another mutex while active.

See: #739 (comment)

@@ -34,6 +35,7 @@ unsafe impl Sync for Tsfn {}
/// function for scheduling tasks to execute on a JavaScript thread.
pub struct ThreadsafeFunction<T> {
tsfn: Tsfn,
is_finalized: Arc<Mutex<bool>>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be an AtomicBool instead of a Mutex<bool>? I'm also not clear why it needs to be an Arc.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arc provides a nice way of passing the pointer to the C function, because unlike Box you can clone it before turning into a raw pointer, and then decrement the reference count when turning it back from the raw pointer to the Arc instance and dropping it.

I think it could be AtomicBool. Let me try that in a moment.

Copy link
Contributor

@indutny-signal indutny-signal May 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to AtomicBool. @jrose-signal must be really happy to see this happen 😉

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it has to be something indirect because ThreadsafeFunction is passed around by value. Box is probably safe because the finalizer callback is guaranteed to be invoked before Drop::drop completes (either Node invokes it on shutdown or we invoke it in drop), but it's a lot easier to mess that up somehow.

@jrose-signal
Copy link
Contributor

I think you're right about the race in #739 (comment). :-( A 99% multithreading fix can be worse than no fix, because it makes it harder to track down further issues.

@indutny-signal
Copy link
Contributor

There more I think about it the more I agree that this is a serious bug in Node.js that has to be fixed. Another race condition that I just discovered is with the finalize_cb. It is invoked after closing all UV handles that are used in the thread-safe function so Node will just crash if you try to release it at a bad time.

@indutny
Copy link
Contributor Author

indutny commented May 22, 2021

Nevermind, I'm a fool. ThreadsafeFunction can only be destroyed from the main loop because it might hold a reference to a JS function. The documentation explicitly says:

Aside from the main loop thread, no threads should be using the thread-safe function after the finalize callback completes.

Now, I've checked all code paths and it cannot crash as I previously thought. To prevent the race condition in this patch, however, we'll have to hold the mutex for the duration of the drop() method. I'll modify this PR accordingly.

@indutny
Copy link
Contributor Author

indutny commented May 22, 2021

Alright, pushed an update. Sadly AtomicBool had to go. When finalize() and drop() are called concurrently - two different orderings are possible:

  • finalize() gets to the mutex first (and returns) and thus drop() will see is_finalized = true and won't call release
  • drop() holds the mutex and observes is_finalized = false, finalize() is waiting on mutex on the UV loop thread and does not return until drop() calls the release(). In this case release() will attempt to send a uv_async_t event to the UV loop, which won't ever trigger because right after finalize() returns Node.js will close the uv_async_t handle

In both of the cases we get the desired result. This PR is ready for a review again.

@indutny
Copy link
Contributor Author

indutny commented May 22, 2021

Had to add a mutex to the call() method as well because we shouldn't be calling it after either of:

  • getting napi_closed from a previous call
  • finalize_cb was called

This is quite inconvenient for Neon, but would likely be needed until we'll get some resolution in nodejs/node#38775 (which would be an documentation change, but hopefully not ABI change)

Thread-safe functions can get released early by Node.js itself during
the teardown of the Environment that holds them. When this happens -
`finalize_cb` of the thread-safe function is invoked to notify us. This
change makes us remember if the thread-safe function was finalized
before we got into the `Drop` implementation for it so that we don't
deallocate it twice.

Additionally, we should not invoke `call_threadsafe_function` after
either `finalize_cb` call or `napi_closing` return value from previous
`call_threadsafe_function`. Handle this by holding a mutex around the
call, which doesn't introduce any additional locking because
`call_threadsafe_function` itself holds another mutex while active.
@indutny
Copy link
Contributor Author

indutny commented May 22, 2021

Force pushed PR with a rebase over main branch and all changes squashed.

// Node.js
if *is_finalized {
return;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't you need to drop the lock before calling napi::release_threadsafe_function?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, actually I don't the idea here is that we want finalize_cb to block until release_threadsafe_function completes, because otherwise finalize_cb might finish first and delete the tsfn in C++.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mutex isn't re-entrant, though, so in the non-early-exit case you'll lock, call release_threadsafe_function, and fail when it's time to lock again.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

release_threadsafe_function doesn't call finalize_cb directly. It schedules an event on the loop thread and executes finalize_cb in another tick.

@indutny-signal
Copy link
Contributor

See: napi-rs/napi-rs#522 . We are not the first to run into it! napi-rs took a milder approach of just leaking the function and letting Node.js reclaim it 😂

@kjvalencik
Copy link
Member

After fixing a leak in #750, this started happening every time with the global queue.

I think I fully understand this fix now and appreciate it. Thanks!

@indutny-signal
Copy link
Contributor

No worries! Let's land the thing! 😉

@kjvalencik kjvalencik merged commit dd0af64 into neon-bindings:main Jun 1, 2021
@kjvalencik
Copy link
Member

Thanks so much for this fix! I'm going to release it along with this PR: #750

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants