
Fix Local RPC race condition on app startup #1719

Merged · 9 commits merged into dev on Mar 15, 2021

Conversation

@ConnorMcMahon (Contributor) commented Mar 11, 2021:

Our previous attempt to dynamically select ports for the local RPC endpoint fixed many issues customers had, but a race condition still exists when multiple hosts start up at the same time. To completely resolve this issue, this PR makes the following changes (a rough sketch of the combined flow follows the list):

  • Move the code that finds a free port closer to the code that attempts to listen on that port.
  • Add a retry layer for the case where the free port has been taken by another worker between when we found it and when we tried to listen on it.
  • If we somehow hit this race condition 3 times in a row, even though it should now be much rarer, we gracefully fall back to using the public endpoints for out-of-process clients instead of failing to start up the host.
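
The sketch below is not the PR's actual implementation, just an illustration of how these three changes fit together; the names TryStartLocalListener and GetFreeTcpPort are assumptions made for this sketch.

```csharp
using System;
using System.Net;
using System.Net.Sockets;

public static class LocalRpcPortSketch
{
    // Returns the port we successfully started listening on, or null if every
    // attempt lost the race, in which case the caller falls back to the public
    // out-of-process endpoints instead of failing host startup.
    public static int? TryStartLocalListener(int maxAttempts = 3)
    {
        int numAttempts = 1;
        do
        {
            // Pick a free port as close as possible to the point where we
            // actually listen on it, to shrink the race window.
            int candidatePort = GetFreeTcpPort();
            var listener = new TcpListener(IPAddress.Loopback, candidatePort);
            try
            {
                listener.Start();
                return candidatePort;
            }
            catch (SocketException)
            {
                // Another worker grabbed the port between selection and
                // Start(); retry with a freshly selected port.
                numAttempts++;
            }
        }
        while (numAttempts <= maxAttempts);

        return null; // graceful fallback: use the public endpoints
    }

    private static int GetFreeTcpPort()
    {
        // Binding to port 0 asks the OS for any available ephemeral port.
        var probe = new TcpListener(IPAddress.Loopback, 0);
        probe.Start();
        int port = ((IPEndPoint)probe.LocalEndpoint).Port;
        probe.Stop();
        return port;
    }
}
```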

resolves #1273

Pull request checklist

  • My changes do not require documentation changes
    • Otherwise: Documentation PR is ready to merge and referenced in pending_docs.md
  • My changes should not be added to the release notes for the next release
    • Otherwise: I've added my notes to release_notes.md
  • My changes do not need to be backported to a previous version
  • I have added all required tests (Unit tests, E2E tests)

@ConnorMcMahon ConnorMcMahon changed the title [Draft] Add test for local rpc failures [Draft] Fix Local RPC race condition on app startup Mar 11, 2021
@ConnorMcMahon ConnorMcMahon changed the title [Draft] Fix Local RPC race condition on app startup Fix Local RPC race condition on app startup Mar 11, 2021
@davidmrdavid (Contributor) left a comment:


Looks good to me; I left some stylistic requests and some questions. Some comments were not associated with the review itself, but please give those a look as well. Thanks!

src/WebJobs.Extensions.DurableTask/LocalHttpListener.cs (review comments, outdated/resolved)
test/Common/DurableTaskEndToEndTests.cs (review comments, outdated/resolved)
@cgillum (Member) left a comment:


Mostly nits. My only real concern is what to do if we fail to find a free port. I'm not super comfortable with falling back to public endpoints, because that could put a customer into an unpredictable state (i.e., it works on some machines but not on others), and diagnosing such issues will be a lot harder. I'm also not sure what this means when someone is running outside of the Azure Functions hosted service (K8s, WebJobs, etc.), where they may not have any public HTTP endpoints. Throwing an exception might actually be safer.

src/WebJobs.Extensions.DurableTask/LocalHttpListener.cs (review comments, outdated/resolved)
src/WebJobs.Extensions.DurableTask/LocalHttpListener.cs (review comments, outdated/resolved)
test/Common/DurableTaskEndToEndTests.cs (review comments, outdated/resolved)
numAttempts++;
}
}
while (numAttempts <= 3);
Member commented on the code above:


I wonder if we should try more times, like 10. I also wonder if we should have a small sleep in-between retries. I'm not sure how random the port selection is, so it would be nice if we could mitigate the case where two hosts are starting up at the same time and selecting the same random sequence of available port numbers.

@ConnorMcMahon (Contributor, Author) replied:


Makes sense. If we are worried about them being in sync, would a semi-random amount of sleep time help (e.g., between 50-100 ms)? That way we keep the two hosts out of lockstep.

@ConnorMcMahon (Contributor, Author) replied:


Added a dynamic sleep between 0 and 1 seconds here. Let me know if that's what you had in mind.
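
For illustration only (not the exact change made in the PR), a randomized delay of that shape could look like this; the 0-1000 ms bound just mirrors the "between 0 and 1 seconds" description above:

```csharp
using System;
using System.Threading.Tasks;

static class RetryJitterSketch
{
    private static readonly Random Random = new Random();

    // Sleeping a random 0-1000 ms between retries keeps two hosts that started
    // at the same time from re-selecting the same ports in lockstep.
    public static Task DelayWithJitterAsync()
        => Task.Delay(TimeSpan.FromMilliseconds(Random.Next(0, 1000)));
}
```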

@ConnorMcMahon (Contributor, Author) commented:

@cgillum, I am a bit afraid of just throwing an exception, because I don't know if that shuts down the whole host. I know some customers who hit the current race condition seem to end up stuck in a bad state. Would a FailFast be better?

@cgillum (Member) commented Mar 11, 2021:

Oy...if an exception here doesn't restart the host then that would be really bad. I hate to resort to FailFast because that's a pretty severe hammer, but I guess I don't have any better ideas than this...
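
For context, Environment.FailFast terminates the process immediately, skipping finally blocks and any graceful shutdown, and relies on the hosting platform to bring the host back up, which is why it reads as a heavy hammer here. A hypothetical call would look like:

```csharp
using System;

// Hypothetical illustration only; the message text is made up for this sketch.
Environment.FailFast(
    "Could not bind a local RPC port after multiple attempts; terminating so the host can restart.");
```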

@ConnorMcMahon (Contributor, Author) commented:

I'll see if I can find an example of what happens when an exception is thrown. Overall, I expect this to happen incredibly rarely, especially if we up the retry count to 10 and add some randomized delays, as we have already made the race condition less likely by reducing the time window between port selection and listening on that port.

@cgillum (Member) left a comment:


LGTM!

@ConnorMcMahon ConnorMcMahon merged commit 7617df9 into dev Mar 15, 2021
@davidmrdavid davidmrdavid deleted the LocalRpcRaceCondition branch March 15, 2021 19:59
Development

Successfully merging this pull request may close these issues:

  • Local HTTP listener causing startup issues
3 participants