release-20.2: kvcoord: fix rangefeed retries on transport errors #67024
Backport 1/2 commits from #66910.
Also pulls in gomock additions from #63616, as well as pkg/cmd/import-tools/stub.go from #62560 to get around a testrace CI failure for that package (build constraints exclude all Go files).
Failure is somewhat different in 20.2, since e.g. decommissioned nodes do not return `PermissionDenied`. Same bug in the range feed though, in that transport errors are not retried.
/cc @cockroachdb/release @cockroachdb/kv
`DistSender.RangeFeed()` was meant to retry transport errors after
refreshing the range descriptor (invalidating the cached entry).
However, due to an incorrect error type check (`*sendError` vs
`sendError`), these errors failed the range feed without invalidating
the cached range descriptor.
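The pointer-versus-value mismatch described above can be illustrated with a minimal, standalone Go sketch. Everything here is hypothetical: the `sendError` type is a stand-in, not the kvcoord definition, and the sketch assumes the error is returned by value while the buggy check matched the pointer type. It only shows why a Go type switch on `*T` never matches a value of type `T`.

```go
package main

import "fmt"

// sendError is a hypothetical stand-in for a transport error that is
// returned by value (an assumption made for this sketch).
type sendError struct{ message string }

// Error uses a value receiver, so both sendError and *sendError satisfy
// the error interface and both type switch cases below compile.
func (s sendError) Error() string { return "failed to send RPC: " + s.message }

// shouldRetryBuggy matches only the pointer type, so an error returned
// by value falls through to the default branch.
func shouldRetryBuggy(err error) bool {
	switch err.(type) {
	case *sendError:
		return true
	default:
		return false
	}
}

// shouldRetryFixed matches the value type that is actually returned.
func shouldRetryFixed(err error) bool {
	switch err.(type) {
	case sendError:
		return true
	default:
		return false
	}
}

func main() {
	var err error = sendError{message: "connection refused"}
	fmt.Println("buggy check retries:", shouldRetryBuggy(err))
	fmt.Println("fixed check retries:", shouldRetryFixed(err))
}
```

Because the compiler accepts both cases, a mismatch like this produces no build error and only shows up at runtime, when the retry branch is silently skipped.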
This was particularly severe in cases where a large number of nodes had
been decommissioned, where some stale range descriptors on some nodes
contained only decommissioned nodes. Since change feeds set up range
feeds across many nodes and ranges in the cluster, they are likely to
encounter these decommissioned nodes and return an error -- and since
the descriptor cache wasn't invalidated they would keep erroring until
the nodes were restarted such that the caches were flushed (often
requiring a full cluster restart).
Resolves #66636, touches #66586.
Release note (bug fix): Change feeds now properly invalidate cached
range descriptors and retry when encountering decommissioned nodes.