
kvserver: use shorter lease expiration for liveness range #88443

Open
erikgrinaker opened this issue Sep 22, 2022 · 5 comments
Labels
A-kv Anything in KV that doesn't belong in a more specific category. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-kv KV Team

Comments

@erikgrinaker
Contributor

erikgrinaker commented Sep 22, 2022

If the liveness range leaseholder is lost, the range may be unavailable for long enough that all other leaseholders also lose their epoch-based lease, since they all have the same lease expiration time. We should use a shorter lease expiration interval for the liveness range, to ensure that in the typical case, a non-cooperative lease transfer can happen without disrupting other leases.
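As a rough illustration of the idea (not CockroachDB code; all durations and names below are made up for the sketch), the liveness lease only needs to be reacquirable comfortably before other ranges' epoch-based leases run out:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Illustrative durations only, not CockroachDB's actual settings.
	epochLeaseExpiration := 10 * time.Second // other ranges' epoch-based leases

	// If the liveness leaseholder dies non-cooperatively, heartbeats stall
	// until its lease expires and another replica can acquire it. Other
	// nodes keep their leases only if that happens before their own expiry.
	for _, livenessLeaseExpiration := range []time.Duration{
		10 * time.Second, // status quo: same interval as everything else
		3 * time.Second,  // hypothetical shorter interval for the liveness range
	} {
		safe := livenessLeaseExpiration < epochLeaseExpiration
		fmt.Printf("liveness lease %v vs epoch leases %v: reacquired before others expire? %v\n",
			livenessLeaseExpiration, epochLeaseExpiration, safe)
	}
}
```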

Relates to #41162.
Jira issue: CRDB-19826

Epic CRDB-40200

@erikgrinaker erikgrinaker added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-kv-recovery T-kv-replication labels Sep 22, 2022
@blathers-crl

blathers-crl bot commented Sep 22, 2022

cc @cockroachdb/replication

@erikgrinaker erikgrinaker added A-kv Anything in KV that doesn't belong in a more specific category. and removed A-kv-recovery labels Sep 22, 2022
@jjathman

jjathman commented Oct 5, 2022

> a non-cooperative lease transfer can happen without disrupting other leases

Can you explain a bit more about how things should work and how this problem may manifest? We had an issue in one of our clusters where the liveness leaseholder lost network connectivity, which seemed to cause the entire cluster to lock up and stop accepting any DB connections until the node that held the liveness lease was restarted. The entire cluster was unusable for over 24 hours (this was a dev cluster, so it was not immediately noticed).

@erikgrinaker
Contributor Author

Practically all leases in the system are what's called epoch-based leases, and these are tied to node heartbeats. If a node fails to heartbeat for some time, it loses all of its leases. Node heartbeats are essentially a write to the liveness range. If the liveness range leaseholder is lost, the liveness range will be unavailable until the lease interval (~10s) expires. During this time, all node heartbeats will fail, which may cause other nodes to lose their leases as well.

This will manifest as many/most leases in the system being invalidated (causing range unavailability) following loss of the liveness leaseholder. This lasts until the liveness lease is reacquired, at which point other nodes can acquire the remaining invalid leases -- ideally about 10 seconds, but it can take longer due to various other interactions.
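A back-of-the-envelope sketch of that timeline (the numbers are assumptions for illustration, not exact CockroachDB defaults):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Assumed, illustrative values.
	livenessLeaseExpiration := 10 * time.Second  // time until the liveness lease can be retaken
	heartbeatInterval := 4500 * time.Millisecond // how often nodes extend their liveness record
	livenessRecordTTL := 9 * time.Second         // epoch leases stay valid while the record is live

	// t=0: the liveness leaseholder is lost and all heartbeats start failing.
	// Worst case, a node last extended its record one heartbeat interval ago,
	// so its record (and with it, its epoch leases) lapses at:
	epochLeasesExpireAt := livenessRecordTTL - heartbeatInterval
	// The liveness range becomes writable again once its lease can be retaken:
	livenessRecoversAt := livenessLeaseExpiration

	if epochLeasesExpireAt < livenessRecoversAt {
		fmt.Printf("epoch leases lapse at t=%v, before liveness recovers at t=%v: cascading unavailability\n",
			epochLeasesExpireAt, livenessRecoversAt)
	} else {
		fmt.Printf("liveness recovers at t=%v, before epoch leases lapse at t=%v: no cascade\n",
			livenessRecoversAt, epochLeasesExpireAt)
	}
}
```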

This issue is specifically about avoiding this unavailability blip, by ensuring the liveness lease can be reacquired fast enough that other nodes won't lose their leases in the meantime. The problem you're describing sounds different, in that the outage persisted. It may, for example, be related to partial network partitions or node unresponsiveness, which we've seen cause these symptoms.

I see that we have an RCA in progress for this outage. That should shed some light on the specific failure mode here.

@jjathman

jjathman commented Oct 6, 2022

Thank you for the links. It does seem like the symptoms of our outage are more closely aligned with what you've posted. We did get a couple log messages on the node that was the liveness lease holder about disk stall problems right before this issue manifested so maybe that's what we've hit.

craig bot pushed a commit that referenced this issue Dec 8, 2022
93039: roachtest: add `failover/liveness` r=erikgrinaker a=erikgrinaker

This patch adds a roachtest that measures the duration of *user* range unavailability following a liveness leaseholder failure, as well as the number of expired leases. When the liveness range is unavailable, other nodes are unable to heartbeat and extend their leases, which can cause them to expire and these ranges to become unavailable as well.

The test sets up a 4-node cluster with all other ranges on n1-n3, and the liveness range on n1-n4 with the lease on n4. A kv workload is run against n1-n3 while n4 fails and recovers repeatedly (both with process crashes and network outages). Workload latency histograms are recorded, where the pMax latency is a measure of the failure impact, as well as the `replicas_leaders_invalid_lease` metric over time.

Touches #88443.

Epic: none
Release note: None

Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>
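Schematically, the test loop looks something like the sketch below (stand-in helpers, not the real roachtest harness or its APIs):

```go
package main

import (
	"fmt"
	"time"
)

// Stand-ins for the real roachtest machinery; names and signatures are made up.
func failNode(n int)    { fmt.Printf("n%d: inject failure (process crash or network outage)\n", n) }
func recoverNode(n int) { fmt.Printf("n%d: recover\n", n) }

// runWorkload pretends to run the kv workload against n1-n3 for the given
// duration and return the worst-case (pMax) latency seen in that window.
func runWorkload(d time.Duration) time.Duration { return 0 } // placeholder value

func main() {
	// Per the commit message: user ranges on n1-n3, liveness range on n1-n4
	// with its lease pinned to n4, so failing n4 only affects liveness.
	const livenessLeaseholder = 4
	for i := 0; i < 5; i++ {
		failNode(livenessLeaseholder)
		pMax := runWorkload(30 * time.Second)
		recoverNode(livenessLeaseholder)
		fmt.Printf("iteration %d: pMax=%v (also track replicas_leaders_invalid_lease)\n", i, pMax)
	}
}
```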
@erikgrinaker
Contributor Author

erikgrinaker commented Dec 9, 2022

We did some experiments over in #93073, and this probably isn't viable because we'd have to drop the Raft election timeout extremely low -- so low that it'd likely destabilize multiregion clusters with high latencies.

We should consider other approaches to avoiding the impact of liveness range unavailability, e.g. storing node liveness info (or rather, coalesced lease extensions) somewhere else, possibly sharded.
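The tension, roughly (numbers are illustrative, not taken from #93073): to retake the liveness lease within a much shorter expiration, the Raft election timeout has to fit well inside that window, but it also has to stay comfortably above cross-region round-trip times to avoid spurious elections:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Illustrative numbers only.
	targetLivenessExpiration := 2 * time.Second // hypothetical shorter liveness lease
	multiRegionRTT := 300 * time.Millisecond    // cross-region round trip

	// Leader election (and then lease acquisition) must complete well within
	// the shortened expiration window.
	maxElectionTimeout := targetLivenessExpiration / 2

	// Rough stability rule of thumb: keep the election timeout at many round
	// trips, or slow heartbeats will trigger needless elections.
	minStableElectionTimeout := 10 * multiRegionRTT

	fmt.Printf("need election timeout <= %v, stability wants >= %v: feasible? %v\n",
		maxElectionTimeout, minStableElectionTimeout,
		maxElectionTimeout >= minStableElectionTimeout)
}
```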
