Deadlock when a language worker restart crashes on initialization #8403

Closed
brettsam opened this issue May 18, 2022 · 2 comments

@brettsam
Member

brettsam commented May 18, 2022

We have a deadlock in the language worker restart flow.

  1. For some reason (a timeout, say), we need to kill the worker process.
  2. We restart the process and wait for the init call to come from the worker (waiting on a TaskCompletionSource to be completed).
  3. But the worker process crashes before connecting (which is what happened in your incident). Therefore, we never complete the _workerInitTask.
  4. We see the error and try to restart again.
  5. We hit a lock in the restart loop that is meant to prevent multiple restarts from occurring simultaneously... but because we never completed the init task from step 2, the lock is never released. As far as it's concerned, 2 restarts are happening simultaneously.

This can cause the process to effectively hang, as it's waiting on two events (the init task and the restart lock), neither of which can complete.
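
To make the sequence above concrete, here is a minimal sketch of the same shape of deadlock, modelled with Python's asyncio rather than the host's actual C# code. `WorkerChannel`, `Restarter`, and `crash_before_init` are hypothetical names invented for illustration; the future stands in for the TaskCompletionSource behind _workerInitTask, and the lock stands in for the restart lock.

```python
import asyncio


class WorkerChannel:
    """Hypothetical stand-in for a worker channel; not the host's real type."""

    def __init__(self, crash_before_init: bool):
        self.crash_before_init = crash_before_init
        # Plays the role of the TaskCompletionSource behind _workerInitTask.
        self.worker_init = asyncio.get_running_loop().create_future()

    async def start(self) -> None:
        if not self.crash_before_init:
            # Worker connected and sent its init message.
            self.worker_init.set_result(True)
        # If the worker crashed first, nothing ever completes this future.
        await self.worker_init


class Restarter:
    def __init__(self):
        # Meant to prevent multiple restarts from running at the same time.
        self.restart_lock = asyncio.Lock()

    async def restart(self, crash_before_init: bool) -> str:
        async with self.restart_lock:
            channel = WorkerChannel(crash_before_init)
            await channel.start()
            return "started"


async def demo():
    restarter = Restarter()
    # First restart: the worker crashes before connecting, so it waits forever
    # while still holding the restart lock.
    first = asyncio.create_task(restarter.restart(crash_before_init=True))
    await asyncio.sleep(0)  # let the first restart take the lock
    # Second restart (triggered by the crash) blocks on the lock.
    second = asyncio.create_task(restarter.restart(crash_before_init=False))
    done, pending = await asyncio.wait({first, second}, timeout=0.2)
    counts = (len(done), len(pending))
    # Clean up the stuck tasks so the event loop can shut down.
    for task in pending:
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)
    return counts
```

Running `asyncio.run(demo())` leaves both restarts pending when the timeout expires: the first is waiting on an init that never arrives, and the second is waiting on a lock the first never releases.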

Note: In some other flows, we do explicitly set the _workerInitTask -- https://github.com/Azure/azure-functions-host/blob/dev/src/WebJobs.Script.Grpc/Channel/GrpcWorkerChannel.cs#L815
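
As a rough illustration of why explicitly setting the init task matters, the sketch below fails the pending future when the process exits before init, which lets the lock holder unwind and the next restart proceed. This is an assumption-laden model in Python's asyncio, not the actual fix that went into the host; `WorkerCrashed`, `on_process_exited`, and `on_worker_init` are hypothetical names.

```python
import asyncio


class WorkerCrashed(Exception):
    """Hypothetical error used to fail the pending init future."""


class WorkerChannel:
    """Hypothetical stand-in for a worker channel; not the host's real type."""

    def __init__(self):
        # Plays the role of the TaskCompletionSource behind _workerInitTask.
        self.worker_init = asyncio.get_running_loop().create_future()

    def on_process_exited(self) -> None:
        # Fail the init future if it is still pending so that waiters wake up.
        if not self.worker_init.done():
            self.worker_init.set_exception(WorkerCrashed("exited before init"))

    def on_worker_init(self) -> None:
        if not self.worker_init.done():
            self.worker_init.set_result(True)


async def restart(lock: asyncio.Lock, crash_before_init: bool) -> str:
    async with lock:
        channel = WorkerChannel()
        loop = asyncio.get_running_loop()
        # Simulate either a crash or a successful init shortly after start.
        if crash_before_init:
            loop.call_soon(channel.on_process_exited)
        else:
            loop.call_soon(channel.on_worker_init)
        try:
            await channel.worker_init
            return "started"
        except WorkerCrashed:
            # Leaving the `async with` block releases the restart lock.
            return "crashed"


async def demo():
    lock = asyncio.Lock()
    first = asyncio.create_task(restart(lock, crash_before_init=True))
    second = asyncio.create_task(restart(lock, crash_before_init=False))
    # The timeout only guards against a regression back into the deadlock.
    return await asyncio.wait_for(asyncio.gather(first, second), timeout=1.0)
```

With the future completed on exit, the first restart reports the crash and releases the lock, and the second restart is then free to start the worker.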

@alrod
Member

alrod commented May 19, 2022

Possibly related:
Azure/azure-functions-nodejs-worker#482

@brettsam
Member Author

brettsam commented Jun 8, 2022

This was actually related to a fix already made by @alrod -- it took a while to realize this. Dupe of #7983, which has already been fixed and deployed. The incident mentioned above happened in 4.1.0 -- which was before this was deployed in 4.2.0.

@brettsam brettsam closed this as completed Jun 8, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Jul 8, 2022
@fabiocav fabiocav modified the milestones: functions sprint 123, Left Overs Jul 12, 2023

4 participants