Fix Local RPC race condition on app startup #1719
Conversation
Looks good to me; I left some stylistic requests and some questions. Some comments are not associated with the review itself, but please give those a look as well. Thanks!
Mostly nits. My only real concern is what to do if we fail to find a free port. I'm not super comfortable with falling back to public endpoints, because that could put a customer into an unpredictable state - i.e. it works on some machines but not on others - and diagnosing such issues will be a lot harder. I'm also not sure what this means when someone is running outside of the Azure Functions hosted service (K8s, WebJobs, etc.), where they may not have any public HTTP endpoints. Throwing an exception might actually be safer.
        numAttempts++;
    }
}
while (numAttempts <= 3);
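For context, the fragment above shows only the tail of the retry logic. A minimal sketch of the overall shape it implies is below; the port range, the use of `TcpListener`, and all names are assumptions for illustration, not the actual code in this PR.

```csharp
using System;
using System.Net;
using System.Net.Sockets;

class LocalRpcPortSketch
{
    // Hedged sketch of a bounded port-selection retry loop; the port range,
    // names, and use of TcpListener are assumptions, not the PR's actual code.
    public static int BindLocalRpcPort()
    {
        var random = new Random();
        int numAttempts = 0;
        do
        {
            int candidatePort = random.Next(30000, 31000); // assumed port range
            var listener = new TcpListener(IPAddress.Loopback, candidatePort);
            try
            {
                listener.Start();     // succeeds only if the port is free
                return candidatePort; // caller keeps listening on this port
            }
            catch (SocketException)
            {
                numAttempts++;        // port was taken; try the next candidate
            }
        }
        while (numAttempts <= 3);

        // What to do when every attempt fails is the open question discussed below.
        throw new InvalidOperationException("No free port found for the local RPC endpoint.");
    }
}
```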
I wonder if we should try more times, like 10. I also wonder if we should have a small sleep in-between retries. I'm not sure how random the port selection is, so it would be nice if we could mitigate the case where two hosts are starting up at the same time and selecting the same random sequence of available port numbers.
Makes sense. If we are worried about them being in sync, would a semi-random amount of sleep time help (i.e., between 50 and 100 ms)? That way we keep the two hosts out of lockstep.
Added a dynamic sleep of between 0 and 1 second here. Let me know if that's what you had in mind.
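For reference, that kind of randomized back-off can be as simple as the sketch below; the exact bounds and where it sits inside the retry loop are assumptions, not necessarily what was committed.

```csharp
using System;
using System.Threading;

class RetryJitterSketch
{
    // Hedged sketch of the randomized back-off idea: sleep a random 0-1000 ms
    // between attempts so two hosts starting at the same time fall out of
    // lockstep instead of repeatedly racing for the same ports.
    public static void SleepBetweenAttempts(Random random)
    {
        int delayMs = random.Next(0, 1000); // upper bound is exclusive (0-999 ms)
        Thread.Sleep(delayMs);
    }
}
```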
@cgillum, I am a bit afraid of just throwing an exception, because I don't know whether that shuts down the whole host. Some customers who hit the current race condition seem to get stuck in a bad state. Would a FailFast be better?
Oy... if an exception here doesn't restart the host, then that would be really bad. I hate to resort to FailFast because that's a pretty severe hammer, but I guess I don't have any better ideas than this...
I'll see if I can find an example of what happens when an exception is thrown. Overall, I expect this to happen incredibly rarely, especially if we up the retry count to 10 and add some randomized delays, as we have already made the race condition less likely by reducing the time window between port selection and listening on that port.
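To make the trade-off concrete, the two failure modes being weighed look roughly like the sketch below; neither is claimed to be what the PR ultimately does, and the message text is illustrative.

```csharp
using System;

class StartupFailureSketch
{
    // Hedged illustration of the two options discussed above.
    public static void HandlePortExhaustion(bool useFailFast)
    {
        if (useFailFast)
        {
            // Environment.FailFast terminates the process immediately, letting
            // the platform restart the host in a clean state instead of leaving
            // it half-initialized.
            Environment.FailFast("Unable to bind a local RPC port after all retry attempts.");
        }

        // Throwing relies on the host treating an unhandled startup exception as
        // fatal; the concern above is that it might instead leave the host stuck.
        throw new InvalidOperationException("Unable to bind a local RPC port after all retry attempts.");
    }
}
```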
LGTM!
Our previous attempt to dynamically select ports for the local RPC endpoint fixed many issues customers had, but a race condition still exists when multiple hosts start up at the same time. To completely resolve this issue, this PR makes the following changes:
resolves #1273
Pull request checklist
pending_docs.md
release_notes.md