
Fix Local RPC race condition on app startup #1719

Merged · 9 commits merged into dev on Mar 15, 2021

Conversation

@ConnorMcMahon (Contributor) commented Mar 11, 2021:

Our previous attempt to dynamically select ports for the local RPC endpoint fixed many issues customers had, but a race condition still exists when multiple hosts start up at the same time. To completely resolve this issue, this PR makes the following changes (a rough sketch of the combined flow follows the list):

  • Move the code that finds a free port closer to the code that attempts to listen on that port.
  • Add a retry layer for the case where the free port has been taken by another worker between when we found it and when we tried to listen on it.
  • If we somehow hit this race condition 3 times in a row, even though it should now be much rarer, we gracefully fall back to using the public endpoints for out-of-process clients instead of failing to start up the host.
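
The sketch below is not the PR's actual implementation, just an illustration of how these three changes fit together; the names TryStartLocalListener and GetFreeTcpPort are assumptions made for this sketch.

```csharp
using System;
using System.Net;
using System.Net.Sockets;

public static class LocalRpcPortSketch
{
    // Returns the port we successfully started listening on, or null if every
    // attempt lost the race, in which case the caller falls back to the public
    // out-of-process endpoints instead of failing host startup.
    public static int? TryStartLocalListener(int maxAttempts = 3)
    {
        int numAttempts = 1;
        do
        {
            // Pick a free port as close as possible to the point where we
            // actually listen on it, to shrink the race window.
            int candidatePort = GetFreeTcpPort();
            var listener = new TcpListener(IPAddress.Loopback, candidatePort);
            try
            {
                listener.Start();
                return candidatePort;
            }
            catch (SocketException)
            {
                // Another worker grabbed the port between selection and
                // Start(); retry with a freshly selected port.
                numAttempts++;
            }
        }
        while (numAttempts <= maxAttempts);

        return null; // graceful fallback: use the public endpoints
    }

    private static int GetFreeTcpPort()
    {
        // Binding to port 0 asks the OS for any available ephemeral port.
        var probe = new TcpListener(IPAddress.Loopback, 0);
        probe.Start();
        int port = ((IPEndPoint)probe.LocalEndpoint).Port;
        probe.Stop();
        return port;
    }
}
```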

resolves #1273

Pull request checklist

  • My changes do not require documentation changes
    • Otherwise: Documentation PR is ready to merge and referenced in pending_docs.md
  • My changes should not be added to the release notes for the next release
    • Otherwise: I've added my notes to release_notes.md
  • My changes do not need to be backported to a previous version
  • I have added all required tests (Unit tests, E2E tests)

@ConnorMcMahon ConnorMcMahon changed the title [Draft] Add test for local rpc failures [Draft] Fix Local RPC race condition on app startup Mar 11, 2021
@ConnorMcMahon ConnorMcMahon changed the title [Draft] Fix Local RPC race condition on app startup Fix Local RPC race condition on app startup Mar 11, 2021
@davidmrdavid (Contributor) left a comment:


Looks good to me; I left some stylistic requests and some questions. Some comments were not associated with the review itself, but please give those a look as well. Thanks!

src/WebJobs.Extensions.DurableTask/LocalHttpListener.cs (review comments, outdated/resolved)
test/Common/DurableTaskEndToEndTests.cs (review comments, outdated/resolved)
@cgillum (Member) left a comment:


Mostly nits. My only real concern is what to do if we fail to find a free port. I'm not super comfortable with falling back to public endpoints, because that could put a customer into an unpredictable state (i.e., it works on some machines but not on others), and diagnosing such issues will be a lot harder. I'm also not sure what this means when someone is running outside of the Azure Functions hosted service (K8s, WebJobs, etc.), where they may not have any public HTTP endpoints. Throwing an exception might actually be safer.

src/WebJobs.Extensions.DurableTask/LocalHttpListener.cs (review comments, outdated/resolved)
src/WebJobs.Extensions.DurableTask/LocalHttpListener.cs (review comments, outdated/resolved)
test/Common/DurableTaskEndToEndTests.cs (review comments, outdated/resolved)
numAttempts++;
}
}
while (numAttempts <= 3);
Member commented on the code above:


I wonder if we should try more times, like 10. I also wonder if we should have a small sleep in-between retries. I'm not sure how random the port selection is, so it would be nice if we could mitigate the case where two hosts are starting up at the same time and selecting the same random sequence of available port numbers.

@ConnorMcMahon (Contributor, Author) replied:


Makes sense. If we are worried about them being in sync, would a semi-random amount of sleep time help (e.g., between 50-100 ms)? That way we keep the two hosts out of lockstep.

@ConnorMcMahon (Contributor, Author) replied:


Added a dynamic sleep between 0 and 1 seconds here. Let me know if that's what you had in mind.
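
For illustration only (not the exact change made in the PR), a randomized delay of that shape could look like this; the 0-1000 ms bound just mirrors the "between 0 and 1 seconds" description above:

```csharp
using System;
using System.Threading.Tasks;

static class RetryJitterSketch
{
    private static readonly Random Random = new Random();

    // Sleeping a random 0-1000 ms between retries keeps two hosts that started
    // at the same time from re-selecting the same ports in lockstep.
    public static Task DelayWithJitterAsync()
        => Task.Delay(TimeSpan.FromMilliseconds(Random.Next(0, 1000)));
}
```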

@ConnorMcMahon (Contributor, Author) commented:

@cgillum, I am a bit afraid of just throwing an exception, because I don't know if that shuts down the whole host. I know some customers who hit the current race condition seem to end up stuck in a bad state. Would a FailFast be better?

@cgillum (Member) commented Mar 11, 2021:

Oy...if an exception here doesn't restart the host then that would be really bad. I hate to resort to FailFast because that's a pretty severe hammer, but I guess I don't have any better ideas than this...
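
For context, Environment.FailFast terminates the process immediately, skipping finally blocks and any graceful shutdown, and relies on the hosting platform to bring the host back up, which is why it reads as a heavy hammer here. A hypothetical call would look like:

```csharp
using System;

// Hypothetical illustration only; the message text is made up for this sketch.
Environment.FailFast(
    "Could not bind a local RPC port after multiple attempts; terminating so the host can restart.");
```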

@ConnorMcMahon (Contributor, Author) commented:

I'll see if I can find an example of what happens when an exception is thrown. Overall, I expect this to happen incredibly rarely, especially if we up the retry count to 10 and add some randomized delays, as we have already made the race condition less likely by reducing the time window between port selection and listening on that port.

@cgillum (Member) left a comment:


LGTM!

@ConnorMcMahon ConnorMcMahon merged commit 7617df9 into dev Mar 15, 2021
@davidmrdavid davidmrdavid deleted the LocalRpcRaceCondition branch March 15, 2021 19:59
Development

Successfully merging this pull request may close these issues:

  • Local HTTP listener causing startup issues
3 participants