kv: operations failing when encountering decommissioned nodes #66586

erikgrinaker · 2021-06-17T13:05:17Z

On 21.1.1, operations tend to fail when encountering decommissioned nodes. Users have reported that after doing a full cycle of spinning up new nodes and decommissioning all old nodes, CHANGEFEED and BACKUP operations would fail:

CREATE CHANGEFEED FOR TABLE t INTO 'experimental-s3://bucket/changefeed/?AUTH=implicit' WITH compression = 'gzip', diff, format = 'json', initial_scan, resolved = '5m', schema_change_events = 'column_changes', schema_change_policy = 'backfill', updated;
ERROR: failed to connect to n4 at 172.31.91.238:26257: initial connection heartbeat failed: rpc error: code = PermissionDenied desc = n4 was permanently removed from the cluster at 2021-05-27 21:08:12.282567682 +0000 UTC; it is not allowed to rejoin the cluster

Restarting the affected nodes fixed the problem, so it seems related to some transitory state like range descriptor/leaseholder caches or gossip state.

This hasn't been reproducible, but other operations such as SET CLUSTER SETTING and SELECT * FROM t have occasionally been seen to fail this way.

There seems to be multiple reasons for this:

Any RPC ping involving a decommissioned node (both to and from) returned PermissionDenied, which is treated as a permanent error. However, this was too symmetrical: when e.g. the DistSender tries to contact a range and encounters a decommissioned node, it shouldn't give up, but rather refresh its caches and retry against a different replica. Any KV operation could fail due to this. Fixed by server: return FailedPrecondition when talking to decom node #66199.
Operations that explicitly choose nodes to operate on (in particular, DistSQL jobs) often only consider node liveness, but not its decommissioned state. Fixed by *: check node decommissioned/draining state for DistSQL/consistency #66632.
changefeedccl: Rangefeeds might fail due to stale range cache #66636 Range feeds appear to not not flush range descriptor caches or lease caches when encountering authentication errors, this was found to be due to a bug in range feed error handling. Fixed by kvcoord: fix rangefeed retries on transport errors #66910.

/cc @cockroachdb/kv

The text was updated successfully, but these errors were encountered:

Avoid non-active nodes (i.e. those that are decomissioning or decomissioned) when planning distributed sql flows. Informs cockroachdb#66586 Informs cockroachdb#66636 Release Notes: None

66632: *: check node decommissioned/draining state for DistSQL/consistency r=tbg,knz a=erikgrinaker The DistSQL planner and consistency queue did not take the nodes' decommissioned or draining states into account, which in particular could cause spurious errors when interacting with decommissioned nodes. This patch adds convenience methods for checking node availability and draining states, and avoids scheduling DistSQL flows on unavailable nodes and consistency checks on unavailable/draining nodes. Touches #66586, touches #45123. Release note (bug fix): Avoid interacting with decommissioned nodes during DistSQL planning and consistency checking. /cc @cockroachdb/kv Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>

66910: kvcoord: fix rangefeed retries on transport errors r=miretskiy,tbg,aliher1911,rickystewart a=erikgrinaker ### rangecache: consider FailedPrecondition a retryable error In #66199, the gRPC error `FailedPrecondition` (returned when contacting a decommissioned node) was incorrectly considered a permanent error during range descriptor lookups, via `IsRangeLookupErrorRetryable()`. These lookups are all trying to reach a meta-range (addressed by key), so when encountering a decommissioned node they should evict the cache token and retry the lookup. This patch reverses that change, making only authentication errors permanent (i.e. when the local node is decommissioned or otherwise not part of the cluster). Release note: None ### kvcoord: fix rangefeed retries on transport errors `DistSender.RangeFeed()` was meant to retry transport errors after refreshing the range descriptor (invalidating the cached entry). However, due to an incorrect error type check (`*sendError` vs `sendError`), these errors failed the range feed without invalidating the cached range descriptor. This was particularly severe in cases where a large number of nodes had been decommissioned, where some stale range descriptors on some nodes contained only decommissioned nodes. Since change feeds set up range feeds across many nodes and ranges in the cluster, they are likely to encounter these decommissioned nodes and return an error -- and since the descriptor cache wasn't invalidated they would keep erroring until the nodes were restarted such that the caches were flushed (often requiring a full cluster restart). Resolves #66636, touches #66586. Release note (bug fix): Change feeds now properly invalidate cached range descriptors and retry when encountering decommissioned nodes. /cc @cockroachdb/kv Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>

erikgrinaker · 2021-06-29T10:25:37Z

I believe all of the known bugs should be fixed now.

erikgrinaker · 2021-06-29T12:49:30Z

To verify, I wrote up the following script which readily failed before these fixes with:

rpc error: code = PermissionDenied desc = n1 was permanently removed from the cluster at 2021-06-29 12:41:50.019428021 +0000 UTC; it is not allowed to rejoin the cluster

After these fixes, I did dozens of runs without seeing a single failure.

roachprod destroy local

set -euxo pipefail

roachprod create local -n 9
roachprod start local:1-3
roachprod sql local:1 -- -e 'CREATE TABLE t (id int primary key, value int);'
roachprod sql local:1 -- -e 'INSERT INTO t VALUES (1, 1), (2, 2), (3, 3);'
sleep 3

roachprod start local:4-9
for N in 4 5 6 7 8 9; do
  roachprod sql local:$N -- -e 'SELECT * FROM t;'
done

cockroach node decommission --insecure 1 2 3

roachprod sql local:5 -- -e 'SET CLUSTER SETTING kv.rangefeed.enabled = true;'
roachprod sql local:7 -- -e "CREATE CHANGEFEED FOR TABLE t INTO 'experimental-nodelocal://6/tmp/changefeed';"
roachprod sql local:8 -- -e "BACKUP TABLE t INTO 'nodelocal://9/tmp/backup/t';"
for N in 4 5 6 7 8 9; do
  roachprod sql local:$N -- -e 'SELECT * FROM t;'
done

erikgrinaker · 2021-06-30T09:41:00Z

Relevant fixes have been backported to 20.2 and 21.1, will be included in the next patch releases (20.2.13 and 21.1.5).

erikgrinaker added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-kv-client Relating to the KV client and the KV interface. T-kv KV Team labels Jun 17, 2021

erikgrinaker self-assigned this Jun 17, 2021

erikgrinaker mentioned this issue Jun 18, 2021

*: check node decommissioned/draining state for DistSQL/consistency #66632

Merged

erikgrinaker changed the title ~~kv: fix operations failing when encountering decommissioned nodes~~ kv: operations failing when encountering decommissioned nodes Jun 18, 2021

miretskiy mentioned this issue Jun 21, 2021

distsql: Avoid decomissioned nodes. #66671

Closed

This was referenced Jun 25, 2021

kvcoord: fix rangefeed retries on transport errors #66910

Merged

release-21.1: *: check node decommissioned/draining state for DistSQL/consistency #66950

Merged

release-20.2: *: check node decommissioned/draining state for DistSQL/consistency #66951

Merged

erikgrinaker closed this as completed Jun 29, 2021

This was referenced Jun 29, 2021

release-21.1: kvcoord: fix rangefeed retries on transport errors #67013

Merged

release-20.2: kvcoord: fix rangefeed retries on transport errors #67024

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

kv: operations failing when encountering decommissioned nodes #66586

kv: operations failing when encountering decommissioned nodes #66586

erikgrinaker commented Jun 17, 2021 •

edited

Loading

erikgrinaker commented Jun 29, 2021

erikgrinaker commented Jun 29, 2021 •

edited

Loading

erikgrinaker commented Jun 30, 2021

kv: operations failing when encountering decommissioned nodes #66586

kv: operations failing when encountering decommissioned nodes #66586

Comments

erikgrinaker commented Jun 17, 2021 • edited Loading

erikgrinaker commented Jun 29, 2021

erikgrinaker commented Jun 29, 2021 • edited Loading

erikgrinaker commented Jun 30, 2021

erikgrinaker commented Jun 17, 2021 •

edited

Loading

erikgrinaker commented Jun 29, 2021 •

edited

Loading