kvserver: use shorter lease expiration for liveness range #88443
cc @cockroachdb/replication
Can you explain a bit more about how things should work and how this problem may manifest? We had an issue in one of our clusters where the liveness leaseholder lost network connectivity, which seemed to cause the entire cluster to lock up and stop accepting any DB connections until the node that was the liveness leaseholder was restarted. The entire cluster was unusable for over 24 hours (this was a dev cluster, so it was not immediately noticed).
Practically all leases in the system are what's called epoch-based leases, and these are tied to node heartbeats. If a node fails to heartbeat for some time, it loses all of its leases. Node heartbeats are essentially writes to the liveness range, so if the liveness range leaseholder is lost, the liveness range will be unavailable until the lease interval (~10s) expires. During this time, all node heartbeats will fail, which may cause other nodes to lose their leases as well. This manifests as many or most leases in the system being invalidated (causing range unavailability) following loss of the liveness leaseholder, and it lasts until the liveness lease is reacquired, at which point other nodes can reacquire the remaining invalid leases -- ideally about 10 seconds, but it can take longer due to various other interactions.

This issue is specifically about avoiding that unavailability blip, by ensuring the liveness lease can be reacquired fast enough that other nodes won't lose their leases in the meantime.

The problem you're describing sounds different, in that the outage persisted. It may e.g. be related to partial network partitions or node unresponsiveness, which we've seen cause these symptoms:
I see that we have an RCA in progress for this outage. That should shed some light on the specific failure mode here.
Thank you for the links. It does seem like the symptoms of our outage are more closely aligned with what you've posted. We did get a couple of log messages on the node that was the liveness leaseholder about disk stall problems right before this issue manifested, so maybe that's what we've hit.
93039: roachtest: add `failover/liveness` r=erikgrinaker a=erikgrinaker

This patch adds a roachtest that measures the duration of *user* range unavailability following a liveness leaseholder failure, as well as the number of expired leases. When the liveness range is unavailable, other nodes are unable to heartbeat and extend their leases, which can cause them to expire and these ranges to become unavailable as well.

The test sets up a 4-node cluster with all other ranges on n1-n3, and the liveness range on n1-n4 with the lease on n4. A kv workload is run against n1-n3 while n4 fails and recovers repeatedly (both with process crashes and network outages). Workload latency histograms are recorded, where the pMax latency is a measure of the failure impact, as well as the `replicas_leaders_invalid_lease` metric over time.

Touches #88443.

Epic: none

Release note: None

Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>
We did some experiments over in #93073, and this probably isn't viable because we'd have to drop the Raft election timeout extremely low -- so low that it'd likely destabilize multiregion clusters with high latencies. We should consider other approaches to avoiding the impact of liveness range unavailability, e.g. storing node liveness info (or rather, coalesced lease extensions) somewhere else, possibly sharded.
If the liveness range leaseholder is lost, the range may be unavailable for long enough that all other leaseholders also lose their epoch-based lease, since they all have the same lease expiration time. We should use a shorter lease expiration interval for the liveness range, to ensure that in the typical case, a non-cooperative lease transfer can happen without disrupting other leases.
Relates to #41162.
Jira issue: CRDB-19826
Epic: CRDB-40200