
changefeedccl: Rangefeeds might fail due to stale range cache #66636

Closed
miretskiy opened this issue Jun 18, 2021 · 4 comments · Fixed by #66910
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-kv KV Team
@miretskiy (Contributor)

Customer observed changefeeds failing after some of the nodes were decommissioned:

ERROR: failed to connect to n4 at 172.31.91.238:26257: initial connection heartbeat failed: rpc error: code = PermissionDenied desc = n4 was permanently removed from the cluster at 2021-05-27 21:08:12.282567682 +0000 U…

In addition, the following error message was observed:

    return args.Timestamp, newSendError(
        fmt.Sprintf("sending to all %d replicas failed", len(replicas)))

The hypothesis is that we might not be handling decommissioned nodes correctly. In particular, perhaps we need to handle the InitialHeartbeatFailed error in addition to the errors we already handle in partialRangeFeed.

  1. We need to write tests around decommissioned nodes.
  2. We need to improve the above error message in dist_sender_rangefeed; we should print all replicas.
  3. We need to handle decommissioned nodes.
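Item 2 above can be sketched as follows. This is a hypothetical illustration (the function name `exhaustedTransportError` and the plain-string replica list are inventions for this sketch; the real dist_sender_rangefeed code would format roachpb replica descriptors): instead of reporting only the replica count, the error names every replica that was tried.

```go
package main

import (
	"fmt"
	"strings"
)

// exhaustedTransportError is a hypothetical sketch of item 2: when the
// transport runs out of replicas, include each replica that was tried in
// the error message, not just the count.
func exhaustedTransportError(replicas []string) error {
	return fmt.Errorf("sending to all %d replicas failed; tried: %s",
		len(replicas), strings.Join(replicas, ", "))
}

func main() {
	err := exhaustedTransportError([]string{"(n1,s1):1", "(n4,s4):2"})
	fmt.Println(err)
}
```

A message like this would have immediately shown whether the decommissioned node n4 was among the replicas being tried.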
@miretskiy miretskiy added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Jun 18, 2021
@erikgrinaker (Contributor)

Note that this isn't really about InitialHeartbeatFailed, but rather the wrapped grpcstatus.PermissionDenied, which is returned for decommissioned nodes. This usually gets classified as a permanent error via grpcutil.IsAuthenticationError(), bypassing retries and often cache invalidation.

Also note that #66199 changes the symmetry of PermissionDenied for decommissioned nodes. After that change, the decommissioned node itself gets PermissionDenied when talking to other nodes, but other nodes talking to a decommissioned node get FailedPrecondition. The former should always be considered a permanent error (to avoid hangs and infinite retry loops on decommissioned nodes), but the latter should be retried when talking to a range (since there could be other viable replicas) but not when addressing the decommissioned node specifically.

Most of this applies to master and 21.1, where there have been several changes to error handling. It's unclear whether we want to backport all of this to 20.2, so we may need a simpler fix there.
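The retry policy described above can be sketched as a small decision function. This is a hedged illustration only: the names `shouldRetry`, `permissionDenied`, and `failedPrecondition` are inventions for this sketch, and the real code inspects wrapped grpcstatus errors rather than plain constants.

```go
package main

import "fmt"

// code models the two gRPC status codes discussed above as plain constants.
type code int

const (
	permissionDenied   code = iota // this node itself was decommissioned
	failedPrecondition             // the remote node was decommissioned
)

// shouldRetry sketches the desired policy: PermissionDenied is always
// permanent (to avoid hangs and infinite retry loops on decommissioned
// nodes); FailedPrecondition is retryable only when addressing a range,
// since other replicas may still serve it, but not when addressing the
// decommissioned node specifically.
func shouldRetry(c code, addressingRange bool) bool {
	switch c {
	case permissionDenied:
		return false
	case failedPrecondition:
		return addressingRange
	}
	return false
}

func main() {
	fmt.Println(shouldRetry(permissionDenied, true))    // false: permanent
	fmt.Println(shouldRetry(failedPrecondition, true))  // true: try other replicas
	fmt.Println(shouldRetry(failedPrecondition, false)) // false: node is gone
}
```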

miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Jun 21, 2021
Avoid non-active nodes (i.e. those that are decommissioning or
decommissioned) when planning distributed SQL flows.

Informs cockroachdb#66586
Informs cockroachdb#66636

Release Notes: None
@erikgrinaker (Contributor)

erikgrinaker commented Jun 24, 2021

So I've dug into the RangeFeed code path here in light of #66199, and I believe this should do the right thing now. Here's my reasoning:

  • The error logged by the changefeed was "sending to all %d replicas failed". This is returned as a sendError here:

    if transport.IsExhausted() {
        return args.Timestamp, newSendError(
            fmt.Sprintf("sending to all %d replicas failed", len(replicas)))
    }

  • This sendError will be caught in partialRangeFeed(), invalidating the range cache token and retrying:

    case errors.HasType(err, (*sendError)(nil)), errors.HasType(err, (*roachpb.RangeNotFoundError)(nil)):
        // Evict the descriptor from the cache and reload on next attempt.
        rangeInfo.token.Evict(ctx)
        rangeInfo.token = rangecache.EvictionToken{}
        continue

  • Originally, the retry would get a PermissionDenied error when refreshing the range info and hitting the decommissioned node, which was returned to the caller without retrying:

    ri, err := ds.getRoutingInfo(ctx, rangeInfo.rs.Key, rangecache.EvictionToken{}, false)
    if err != nil {
        log.VErrEventf(ctx, 1, "range descriptor re-lookup failed: %s", err)
        if !rangecache.IsRangeLookupErrorRetryable(err) {
            return err
        }
        continue
    }

  • After #66199 (server: return FailedPrecondition when talking to decom node), the IsRangeLookupErrorRetryable() check here will still trigger on FailedPrecondition and return without retrying. However, as discussed in the #66199 review, it uses a DistSender as a source of range data, which will manage retries of FailedPrecondition internally:

    if grpcutil.IsAuthError(err) {
        // Authentication or authorization error. Propagate.
        if ambiguousError != nil {
            return nil, roachpb.NewAmbiguousResultErrorf("error=%s [propagate]", ambiguousError)
        }
        return nil, err
    }

Thus, I believe this should already be resolved. We may want to change IsRangeLookupErrorRetryable to return true for FailedPrecondition to make this explicit, as long as it doesn't have any adverse effects elsewhere, but it shouldn't really matter either way (cc @aliher1911).

In any case, I'll try to write up a unit/integration test to verify this. I've been testing with roachprod and so far been unable to reproduce any of these decommissioned node errors with the existing PRs applied, so it's looking promising.

@miretskiy (Contributor, Author)

Thanks, @erikgrinaker, for the solid analysis. I think this does indeed fix the problem. Closing this issue.

@erikgrinaker (Contributor)

Although the above reasoning was valid, there was a subtle bug in the logic: it checks for *sendError while the transport returns a sendError (a value, not a pointer), so the branch is never taken and the range cache is never invalidated. Finishing up a test case and fix.
