
changefeedccl: Rangefeeds might fail due to stale range cache #66636

Closed
miretskiy opened this issue Jun 18, 2021 · 4 comments · Fixed by #66910
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-kv KV Team
@miretskiy (Contributor)

Customer observed changefeeds failing after some of the nodes were decommissioned:

ERROR: failed to connect to n4 at 172.31.91.238:26257: initial connection heartbeat failed: rpc error: code = PermissionDenied desc = n4 was permanently removed from the cluster at 2021-05-27 21:08:12.282567682 +0000 U…

In addition, the following error message was observed:

    return args.Timestamp, newSendError(
        fmt.Sprintf("sending to all %d replicas failed", len(replicas)))

The hypothesis is that we might not be handling decommissioned nodes correctly. In particular, perhaps we need to handle the InitialHeartbeatFailed error in addition to the errors we already handle in partialRangeFeed.

  1. We need to write tests around decommissioned nodes.
  2. We need to improve the above error message in dist_sender_rangefeed; we should print all replicas.
  3. We need to handle decommissioned nodes.
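Item 2 above can be sketched as follows. This is a hypothetical illustration (the function name `exhaustedTransportError` and the plain-string replica list are inventions for this sketch; the real dist_sender_rangefeed code would format roachpb replica descriptors): instead of reporting only the replica count, the error names every replica that was tried.

```go
package main

import (
	"fmt"
	"strings"
)

// exhaustedTransportError is a hypothetical sketch of item 2: when the
// transport runs out of replicas, include each replica that was tried in
// the error message, not just the count.
func exhaustedTransportError(replicas []string) error {
	return fmt.Errorf("sending to all %d replicas failed; tried: %s",
		len(replicas), strings.Join(replicas, ", "))
}

func main() {
	err := exhaustedTransportError([]string{"(n1,s1):1", "(n4,s4):2"})
	fmt.Println(err)
}
```

A message like this would have immediately shown whether the decommissioned node n4 was among the replicas being tried.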
@miretskiy miretskiy added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Jun 18, 2021
@erikgrinaker (Contributor)

Note that this isn't really about InitialHeartbeatFailed, but rather the wrapped grpcstatus.PermissionDenied, which is returned for decommissioned nodes. This usually gets classified as a permanent error via grpcutil.IsAuthenticationError(), bypassing retries and often cache invalidation.

Also note that #66199 changes the symmetry of PermissionDenied for decommissioned nodes. After that change, the decommissioned node itself gets PermissionDenied when talking to other nodes, but other nodes talking to a decommissioned node get FailedPrecondition. The former should always be considered a permanent error (to avoid hangs and infinite retry loops on decommissioned nodes), but the latter should be retried when talking to a range (since there could be other viable replicas) but not when addressing the decommissioned node specifically.

Most of this applies to master and 21.1, where there have been several changes to error handling. It's unclear whether we want to backport all of this to 20.2, so we may need a simpler fix there.
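The retry policy described above can be sketched as a small decision function. This is a hedged illustration only: the names `shouldRetry`, `permissionDenied`, and `failedPrecondition` are inventions for this sketch, and the real code inspects wrapped grpcstatus errors rather than plain constants.

```go
package main

import "fmt"

// code models the two gRPC status codes discussed above as plain constants.
type code int

const (
	permissionDenied   code = iota // this node itself was decommissioned
	failedPrecondition             // the remote node was decommissioned
)

// shouldRetry sketches the desired policy: PermissionDenied is always
// permanent (to avoid hangs and infinite retry loops on decommissioned
// nodes); FailedPrecondition is retryable only when addressing a range,
// since other replicas may still serve it, but not when addressing the
// decommissioned node specifically.
func shouldRetry(c code, addressingRange bool) bool {
	switch c {
	case permissionDenied:
		return false
	case failedPrecondition:
		return addressingRange
	}
	return false
}

func main() {
	fmt.Println(shouldRetry(permissionDenied, true))    // false: permanent
	fmt.Println(shouldRetry(failedPrecondition, true))  // true: try other replicas
	fmt.Println(shouldRetry(failedPrecondition, false)) // false: node is gone
}
```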

miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Jun 21, 2021
Avoid non-active nodes (i.e. those that are decommissioning or
decommissioned) when planning distributed SQL flows.

Informs cockroachdb#66586
Informs cockroachdb#66636

Release Notes: None
@erikgrinaker (Contributor)

erikgrinaker commented Jun 24, 2021

So I've dug into the RangeFeed code path here in light of #66199, and I believe this should do the right thing now. Here's my reasoning:

  • The error logged by the changefeed was "sending to all %d replicas failed". This is returned as a sendError here:

    if transport.IsExhausted() {
        return args.Timestamp, newSendError(
            fmt.Sprintf("sending to all %d replicas failed", len(replicas)))
    }

  • This sendError will be caught in partialRangeFeed(), invalidating the range cache token and retrying:

    case errors.HasType(err, (*sendError)(nil)), errors.HasType(err, (*roachpb.RangeNotFoundError)(nil)):
        // Evict the descriptor from the cache and reload on next attempt.
        rangeInfo.token.Evict(ctx)
        rangeInfo.token = rangecache.EvictionToken{}
        continue

  • Originally, the retry would get a PermissionDenied error when refreshing the range info and hitting the decommissioned node, which was returned to the caller without retrying:

    ri, err := ds.getRoutingInfo(ctx, rangeInfo.rs.Key, rangecache.EvictionToken{}, false)
    if err != nil {
        log.VErrEventf(ctx, 1, "range descriptor re-lookup failed: %s", err)
        if !rangecache.IsRangeLookupErrorRetryable(err) {
            return err
        }
        continue
    }

  • After #66199 (server: return FailedPrecondition when talking to decom node), the IsRangeLookupErrorRetryable() check here will still trigger on FailedPrecondition and return without retrying. However, as discussed in the #66199 review, it uses a DistSender as a source of range data, which will manage retries of FailedPrecondition internally:

    if grpcutil.IsAuthError(err) {
        // Authentication or authorization error. Propagate.
        if ambiguousError != nil {
            return nil, roachpb.NewAmbiguousResultErrorf("error=%s [propagate]", ambiguousError)
        }
        return nil, err
    }

Thus, I believe this should already be resolved. We may want to change IsRangeLookupErrorRetryable to return true for FailedPrecondition to make this explicit, as long as it doesn't have any adverse effects elsewhere, but it shouldn't really matter either way (cc @aliher1911).

In any case, I'll try to write up a unit/integration test to verify this. I've been testing with roachprod and so far been unable to reproduce any of these decommissioned node errors with the existing PRs applied, so it's looking promising.

@miretskiy (Contributor, Author)

Thanks, @erikgrinaker, for the solid analysis. I think this does indeed fix the problem. Closing this issue.

@erikgrinaker (Contributor)

Although the above reasoning was valid, there was a subtle bug in the logic: it checks for *sendError while the transport returns a sendError (a value, not a pointer), so the branch is never taken and the range cache is never invalidated. Finishing up a test case and fix.
