Pass cancellation token to grpc reader #2133
Merged
Description
Regardless of all of the following, this should be a good change, since we are simply passing a cancellation token somewhere you would expect it to be passed. Nonetheless, I've outlined everything below.
This fixes a difficult-to-reproduce issue.
It would occur when members were being added / removed from the cluster. The GossipActor on an existing member (not a new member) would stop receiving messages and everything would time out for 15 min, which of course would cause various actors to become unresponsive or stop working.
After applying this fix I was no longer able to reproduce the issue with the following method.
The easiest way for me to reproduce it was to continually kill a pod in k8s every 20s; it would join the cluster as a new member each time it restarted, and 20s gave it enough time to start and for the other members to log that it had joined the cluster. Of course, killing a pod like this without a graceful shutdown will cause some temporary issues, but this is not about that.
My theory of how this occurred:
When a member is removed, it has to be removed from the EndpointManager on the other members. Each removal is done separately within a lock. Part of that removal involves stopping the ServerConnector for that endpoint: a cancellation token is canceled and then it waits for the connector's main method to return. Depending on the timing, it can end up waiting on the grpc call
await call.ResponseStream.MoveNext()
and nothing in there will see the cancellation token (this PR changes that). Then something in the grpc call would time out after 15 min (every time I reproduced the issue, it lasted for 15 min and then recovered), producing a [ServerConnector] Error in reader RpcException
error log. Then it would finally finish and exit the loop, allowing other processes to continue. On another thread, SendGossipStateRequest is started but doesn't complete until after the 15 min. During this time we continually see
Gossip loop failed
error logs and many other errors as a result. SendGossipStateRequest does use reentry, but the handler for that message does not return until the GossipRequest is ultimately written to the Channel in the Endpoint class. You can follow the code that runs synchronously all the way down: RequestReenter => RequestAsync => RequestAsync => Send => SendUserMessage => SendUserMessage => RemoteProcess.SendUserMessage => Send => GetEndpoint, then SendMessage => TryWrite. GetEndpoint calls GetOrAddServerEndpoint, which will wait to enter the lock that is blocked by the ServerConnector if the endpoint is not already there (or, in this case, has already been removed). There is logic for blocking endpoints for a time, but while it's stuck waiting for the ServerConnector to finish, the endpoint hasn't been added to the block list yet. So it waits 15 min for that lock, and after that things start flowing again and recover; a sketch of this contention follows.
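To make the contention easier to follow, here is a minimal, self-contained sketch of that pattern, assuming a single lock guards the endpoint table. FakeConnector, EndpointRegistry, Remove and GetOrAdd are illustrative names, not the actual Proto.Remote types; the point is only that the removal path holds the same lock the send path needs, for as long as the connector takes to stop.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for the ServerConnector: StopAsync cancels a token and
// then waits for the reader loop to finish, mirroring the shutdown described above.
public sealed class FakeConnector
{
    private readonly CancellationTokenSource _cts = new();
    private readonly Task _readerLoop;

    public FakeConnector() =>
        // Stands in for the loop around call.ResponseStream.MoveNext(); before
        // this PR it only completed when the call itself finally failed.
        _readerLoop = Task.Delay(TimeSpan.FromMinutes(15));

    public async Task StopAsync()
    {
        _cts.Cancel();     // before this PR, nothing in the pending read observed this
        await _readerLoop; // so stopping blocks until the call times out
    }
}

// Hypothetical stand-in for the endpoint bookkeeping guarded by a single lock.
public sealed class EndpointRegistry
{
    private readonly object _lock = new();
    private readonly Dictionary<string, FakeConnector> _connectors = new();

    // Removal path: holds the lock for as long as StopAsync takes to complete.
    public void Remove(string address)
    {
        lock (_lock)
        {
            if (_connectors.Remove(address, out var connector))
            {
                connector.StopAsync().GetAwaiter().GetResult();
            }
        }
    }

    // Send path (GetEndpoint => GetOrAddServerEndpoint): blocks here until the
    // removal above releases the lock, which is where the gossip send got stuck.
    public FakeConnector GetOrAdd(string address)
    {
        lock (_lock)
        {
            if (!_connectors.TryGetValue(address, out var connector))
            {
                connector = new FakeConnector();
                _connectors[address] = connector;
            }

            return connector;
        }
    }
}
```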
This PR causes the grpc call
await call.ResponseStream.MoveNext()
to return right away when the ServerConnector is stopped, by passing the cancellation token.
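For context, here is a hedged sketch of how a reader loop can observe that token. The ReadAll helper, the onMessage callback, and the exact exception handling are illustrative assumptions; the real change is simply passing the connector's token into the existing read (the parameterless MoveNext() extension effectively passes CancellationToken.None, while IAsyncStreamReader<T>.MoveNext(CancellationToken) observes the token).

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Grpc.Core;

public static class ReaderLoopSketch
{
    // Illustrative reader loop: the connector's stop token is passed into
    // MoveNext so the pending read completes as soon as the connector is
    // stopped, instead of waiting for the underlying call to fail much later.
    public static async Task ReadAll<T>(
        IAsyncStreamReader<T> responseStream,
        Func<T, Task> onMessage,
        CancellationToken cancellationToken)
    {
        try
        {
            // Before: await responseStream.MoveNext();
            // After:  the token is observed by the read itself.
            while (await responseStream.MoveNext(cancellationToken))
            {
                await onMessage(responseStream.Current);
            }
        }
        catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested)
        {
            // Expected when the connector is being stopped; not an error.
        }
        catch (RpcException ex) when (ex.StatusCode == StatusCode.Cancelled &&
                                      cancellationToken.IsCancellationRequested)
        {
            // Depending on the gRPC implementation, cancellation can surface
            // as an RpcException rather than an OperationCanceledException.
        }
    }
}
```

Whether cancellation surfaces as an OperationCanceledException or a Cancelled RpcException depends on the gRPC stack in use, so the surrounding error handling may need to treat both as a normal shutdown rather than logging them as reader errors.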
Purpose
This pull request is a:
Checklist