
kvserver: use shorter lease expiration for liveness range #88443

Open
erikgrinaker opened this issue Sep 22, 2022 · 5 comments
Labels
A-kv Anything in KV that doesn't belong in a more specific category. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-kv KV Team

Comments

@erikgrinaker
Contributor

erikgrinaker commented Sep 22, 2022

If the liveness range leaseholder is lost, the range may be unavailable for long enough that all other leaseholders also lose their epoch-based lease, since they all have the same lease expiration time. We should use a shorter lease expiration interval for the liveness range, to ensure that in the typical case, a non-cooperative lease transfer can happen without disrupting other leases.
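As a rough illustration of the idea (not CockroachDB code; all durations and names below are made up for the sketch), the liveness lease only needs to be reacquirable comfortably before other ranges' epoch-based leases run out:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Illustrative durations only, not CockroachDB's actual settings.
	epochLeaseExpiration := 10 * time.Second // other ranges' epoch-based leases

	// If the liveness leaseholder dies non-cooperatively, heartbeats stall
	// until its lease expires and another replica can acquire it. Other
	// nodes keep their leases only if that happens before their own expiry.
	for _, livenessLeaseExpiration := range []time.Duration{
		10 * time.Second, // status quo: same interval as everything else
		3 * time.Second,  // hypothetical shorter interval for the liveness range
	} {
		safe := livenessLeaseExpiration < epochLeaseExpiration
		fmt.Printf("liveness lease %v vs epoch leases %v: reacquired before others expire? %v\n",
			livenessLeaseExpiration, epochLeaseExpiration, safe)
	}
}
```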

Relates to #41162.
Jira issue: CRDB-19826

Epic CRDB-40200

@erikgrinaker erikgrinaker added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-kv-recovery T-kv-replication labels Sep 22, 2022
@blathers-crl

blathers-crl bot commented Sep 22, 2022

cc @cockroachdb/replication

@erikgrinaker erikgrinaker added A-kv Anything in KV that doesn't belong in a more specific category. and removed A-kv-recovery labels Sep 22, 2022
@jjathman

jjathman commented Oct 5, 2022

> a non-cooperative lease transfer can happen without disrupting other leases

Can you explain a bit more about how things should work and how this problem may manifest? We had an issue in one of our clusters where the liveness leaseholder lost network connectivity, which seemed to cause the entire cluster to lock up and stop accepting any DB connections until the node that held the liveness lease was restarted. The entire cluster was unusable for over 24 hours (this was a dev cluster, so it was not immediately noticed).

@erikgrinaker
Contributor Author

Practically all leases in the system are what's called epoch-based leases, and these are tied to node heartbeats. If a node fails to heartbeat for some time, it loses all of its leases. Node heartbeats are essentially a write to the liveness range. If the liveness range leaseholder is lost, the liveness range will be unavailable until the lease interval (~10s) expires. During this time, all node heartbeats will fail, which may cause other nodes to lose their leases as well.

This will manifest as many/most leases in the system being invalidated (causing range unavailability) following loss of the liveness leaseholder. This lasts until the liveness lease is reacquired, at which point other nodes can acquire the remaining invalid leases -- ideally about 10 seconds, but it can take longer due to various other interactions.
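A back-of-the-envelope sketch of that timeline (the numbers are assumptions for illustration, not exact CockroachDB defaults):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Assumed, illustrative values.
	livenessLeaseExpiration := 10 * time.Second  // time until the liveness lease can be retaken
	heartbeatInterval := 4500 * time.Millisecond // how often nodes extend their liveness record
	livenessRecordTTL := 9 * time.Second         // epoch leases stay valid while the record is live

	// t=0: the liveness leaseholder is lost and all heartbeats start failing.
	// Worst case, a node last extended its record one heartbeat interval ago,
	// so its record (and with it, its epoch leases) lapses at:
	epochLeasesExpireAt := livenessRecordTTL - heartbeatInterval
	// The liveness range becomes writable again once its lease can be retaken:
	livenessRecoversAt := livenessLeaseExpiration

	if epochLeasesExpireAt < livenessRecoversAt {
		fmt.Printf("epoch leases lapse at t=%v, before liveness recovers at t=%v: cascading unavailability\n",
			epochLeasesExpireAt, livenessRecoversAt)
	} else {
		fmt.Printf("liveness recovers at t=%v, before epoch leases lapse at t=%v: no cascade\n",
			livenessRecoversAt, epochLeasesExpireAt)
	}
}
```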

This issue is specifically about avoiding this unavailability blip, by ensuring the liveness lease can be reacquired fast enough that other nodes won't lose their leases in the meantime. The problem you're describing sounds different, in that the outage persisted. It may, for example, be related to partial network partitions or node unresponsiveness, which we've seen cause these symptoms.

I see that we have an RCA in progress for this outage. That should shed some light on the specific failure mode here.

@jjathman

jjathman commented Oct 6, 2022

Thank you for the links. It does seem like the symptoms of our outage are more closely aligned with what you've posted. We did get a couple log messages on the node that was the liveness lease holder about disk stall problems right before this issue manifested so maybe that's what we've hit.

craig bot pushed a commit that referenced this issue Dec 8, 2022
93039: roachtest: add `failover/liveness` r=erikgrinaker a=erikgrinaker

This patch adds a roachtest that measures the duration of *user* range unavailability following a liveness leaseholder failure, as well as the number of expired leases. When the liveness range is unavailable, other nodes are unable to heartbeat and extend their leases, which can cause them to expire and these ranges to become unavailable as well.

The test sets up a 4-node cluster with all other ranges on n1-n3, and the liveness range on n1-n4 with the lease on n4. A kv workload is run against n1-n3 while n4 fails and recovers repeatedly (both with process crashes and network outages). Workload latency histograms are recorded, where the pMax latency is a measure of the failure impact, as well as the `replicas_leaders_invalid_lease` metric over time.

Touches #88443.

Epic: none
Release note: None

Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>
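Schematically, the test loop looks something like the sketch below (stand-in helpers, not the real roachtest harness or its APIs):

```go
package main

import (
	"fmt"
	"time"
)

// Stand-ins for the real roachtest machinery; names and signatures are made up.
func failNode(n int)    { fmt.Printf("n%d: inject failure (process crash or network outage)\n", n) }
func recoverNode(n int) { fmt.Printf("n%d: recover\n", n) }

// runWorkload pretends to run the kv workload against n1-n3 for the given
// duration and return the worst-case (pMax) latency seen in that window.
func runWorkload(d time.Duration) time.Duration { return 0 } // placeholder value

func main() {
	// Per the commit message: user ranges on n1-n3, liveness range on n1-n4
	// with its lease pinned to n4, so failing n4 only affects liveness.
	const livenessLeaseholder = 4
	for i := 0; i < 5; i++ {
		failNode(livenessLeaseholder)
		pMax := runWorkload(30 * time.Second)
		recoverNode(livenessLeaseholder)
		fmt.Printf("iteration %d: pMax=%v (also track replicas_leaders_invalid_lease)\n", i, pMax)
	}
}
```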
@erikgrinaker
Contributor Author

erikgrinaker commented Dec 9, 2022

We did some experiments over in #93073, and this probably isn't viable because we'd have to drop the Raft election timeout extremely low -- so low that it'd likely destabilize multiregion clusters with high latencies.

We should consider other approaches to avoiding the impact of liveness range unavailability, e.g. storing node liveness info (or rather, coalesced lease extensions) somewhere else, possibly sharded.
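The tension, roughly (numbers are illustrative, not taken from #93073): to retake the liveness lease within a much shorter expiration, the Raft election timeout has to fit well inside that window, but it also has to stay comfortably above cross-region round-trip times to avoid spurious elections:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Illustrative numbers only.
	targetLivenessExpiration := 2 * time.Second // hypothetical shorter liveness lease
	multiRegionRTT := 300 * time.Millisecond    // cross-region round trip

	// Leader election (and then lease acquisition) must complete well within
	// the shortened expiration window.
	maxElectionTimeout := targetLivenessExpiration / 2

	// Rough stability rule of thumb: keep the election timeout at many round
	// trips, or slow heartbeats will trigger needless elections.
	minStableElectionTimeout := 10 * multiRegionRTT

	fmt.Printf("need election timeout <= %v, stability wants >= %v: feasible? %v\n",
		maxElectionTimeout, minStableElectionTimeout,
		maxElectionTimeout >= minStableElectionTimeout)
}
```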
