Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

release-20.2: kvcoord: fix rangefeed retries on transport errors #67024

Merged

Conversation

erikgrinaker
Copy link
Contributor

@erikgrinaker erikgrinaker commented Jun 29, 2021

Backport 1/2 commits from #66910.

Also pulls in gomock additions from #63616, as well as pkg/cmd/import-tools/stub.go from #62560 to get around a testrace CI failure for that package (build constraints exclude all Go files).

Failure is somewhat different in 20.2, since e.g. decommissioned nodes do not return PermissionDenied. Same bug in the range feed though, in that transport errors are not retried.

/cc @cockroachdb/release @cockroachdb/kv


DistSender.RangeFeed() was meant to retry transport errors after
refreshing the range descriptor (invalidating the cached entry).
However, due to an incorrect error type check (*sendError vs
sendError), these errors failed the range feed without invalidating
the cached range descriptor.

This was particularly severe in cases where a large number of nodes had
been decommissioned, where some stale range descriptors on some nodes
contained only decommissioned nodes. Since change feeds set up range
feeds across many nodes and ranges in the cluster, they are likely to
encounter these decommissioned nodes and return an error -- and since
the descriptor cache wasn't invalidated they would keep erroring until
the nodes were restarted such that the caches were flushed (often
requiring a full cluster restart).

Resolves #66636, touches #66586.

Release note (bug fix): Change feeds now properly invalidate cached
range descriptors and retry when encountering decommissioned nodes.

@erikgrinaker erikgrinaker self-assigned this Jun 29, 2021
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Contributor

@aliher1911 aliher1911 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Looks like backporting mocks starting to get hairy.

Reviewed 12 of 12 files at r1.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @tbg)

@erikgrinaker
Copy link
Contributor Author

Looks like backporting mocks starting to get hairy.

Less hairy here than 21.1, since there's no Bazel build to update.

`DistSender.RangeFeed()` was meant to retry transport errors after
refreshing the range descriptor (invalidating the cached entry).
However, due to an incorrect error type check (`*sendError` vs
`sendError`), these errors failed the range feed without invalidating
the cached range descriptor.

This was particularly severe in cases where a large number of nodes had
been decommissioned, where some stale range descriptors on some nodes
contained only decommissioned nodes. Since change feeds set up range
feeds across many nodes and ranges in the cluster, they are likely to
encounter these decommissioned nodes and return an error -- and since
the descriptor cache wasn't invalidated they would keep erroring until
the nodes were restarted such that the caches were flushed (often
requiring a full cluster restart).

Release note (bug fix): Change feeds now properly invalidate cached
range descriptors and retry when encountering decommissioned nodes.
@erikgrinaker erikgrinaker merged commit 65b4076 into cockroachdb:release-20.2 Jun 30, 2021
@erikgrinaker erikgrinaker deleted the backport20.2-66910 branch July 1, 2021 11:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants