liveness: medium-probability global ~5s cluster unavailability if liveness range leaseholder dies #41162
Comments
@tbg let me know if I remember this right: your proposed solution would be to tweak the lease expiration delay in relation to the liveness record expiration delay. What was the ratio you were thinking about again?
Another suspicious 5-second log interval was observed in the same situation.
@knz: your hypothesis sounds right; this is (unfortunately) a very possible blip in that situation. It's not clear to me what can immediately be done about it, however. No matter what the ratio between lease expiration and liveness record expiration, the simultaneous expiration could still very well happen at roughly the same point in time, no?
Yes, as long as the protocol is unchanged the fault remains. A different protocol might be an option, perhaps pre-electing the next leaseholder ahead of time.
No, because the liveness record is heartbeated ahead of its expiration. I think the liveness expiration is 9s, and after 4.5s nodes will try to extend it. So say the lease duration on the liveness range is 3s: in the "worst case" (excluding network latencies etc., which will make this a little worse), a node wants to refresh its liveness record when it has 4.5s left on the clock. The liveness range is down for the first 3s of that and then recovers, which leaves 1.5s for the node to heartbeat its liveness before it goes dark. It does seem like a little tuning can get us a long way here. Right now, the expiration-based lease duration matches the liveness record duration (Lines 436 to 458 in 1d05979).
It doesn't seem unreasonable to do something like this:

```go
// RangeLeaseActiveDuration is the duration of the active period of leader
// leases requested.
func (cfg RaftConfig) RangeLeaseActiveDuration() time.Duration {
	rangeLeaseActive, _ := cfg.RangeLeaseDurations()
	if rangeLeaseActive < time.Second { // avoid being overly aggressive in tests; not sure this is needed, but probably
		return rangeLeaseActive
	}
	return rangeLeaseActive / 3
}
```

There are some loose ends (if we set the lease expiration to 3s, but the raft election timeout is 10s, it doesn't buy us anything, though I think it's closer to 3s as well and so it should be fine).
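For concreteness, here is a minimal back-of-the-envelope sketch of the timing argument above, assuming the figures from the comment (9s liveness record duration, renewal attempted with ~4.5s remaining, 3s proposed lease on the liveness range). The variable names are illustrative, not actual RaftConfig fields:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Figures assumed from the discussion above, not read from a real config.
	livenessRecordDuration := 9 * time.Second     // node liveness record expiration
	renewalWindow := livenessRecordDuration / 2   // nodes try to extend with ~4.5s left
	proposedLivenessRangeLease := 3 * time.Second // shorter lease on the liveness range

	// Worst case: the liveness range leaseholder dies just as a node starts
	// renewing its liveness record. The liveness range stays unavailable until
	// its own lease expires and another replica acquires it.
	fmt.Println("slack with a 3s liveness range lease:", renewalWindow-proposedLivenessRangeLease) // 1.5s

	// Today the liveness range's expiration-based lease matches the liveness
	// record duration, so the record expires long before the range recovers.
	currentLivenessRangeLease := livenessRecordDuration
	fmt.Println("slack with the current settings:", renewalWindow-currentLivenessRangeLease) // -4.5s
}
```

Positive slack means a node still has time to heartbeat its liveness record after the liveness range recovers; the raft election timeout loose end mentioned above would shrink this slack further if elections take longer than the lease expiration.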
Describe the problem
Take a 3+ node cluster subject to load (e.g. kv or tpcc).
While the workload is running, randomly kill (not quit) a node that the client is not connected to and restart it.
Most of the time, the workload continues unaffected, or with a small throughput dip. Sometimes, however (with likelihood decreasing as the number of nodes grows), QPS drops to zero for about 5 seconds, then resumes.
When this happens, it appears to correlate with corresponding entries in the logs of other nodes.
The hypothesis is that if the leaseholder for the liveness range (rL) dies, and the lease on another, unrelated range (rX) expires on a different node nX, then nX needs a live liveness record in order to request a new lease, which will fail for about 5 seconds until rL finds a new leaseholder.
If rX is the meta range or a system range (namespace, etc.), the unavailability can become global.
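To make the dependency chain in that hypothesis concrete, here is a rough illustrative sketch (hand-rolled types, not CockroachDB's actual lease-acquisition code path):

```go
package main

import (
	"errors"
	"fmt"
)

// Illustrative types only; this does not mirror CockroachDB's internals.
type rangeID string

type cluster struct {
	// leaseholders maps each range to its current leaseholder node,
	// or "" if no node currently holds a valid lease.
	leaseholders map[rangeID]string
}

// acquireLease sketches the dependency in the hypothesis: before node nX can
// take the lease on rX, it must prove it is live, and proving liveness means
// writing its liveness record, which requires the liveness range (rL) to have
// a working leaseholder.
func (c *cluster) acquireLease(rX rangeID, nX string) error {
	const livenessRange = rangeID("liveness")
	if c.leaseholders[livenessRange] == "" {
		// rL's leaseholder died; until rL's own lease expires and another
		// replica picks it up (~5s), nX cannot heartbeat its liveness record,
		// so it cannot acquire the lease on rX either.
		return errors.New("liveness range unavailable: cannot heartbeat liveness record")
	}
	c.leaseholders[rX] = nX
	return nil
}

func main() {
	c := &cluster{leaseholders: map[rangeID]string{"liveness": ""}}
	fmt.Println(c.acquireLease("meta1", "n2"))
	// If rX is a meta or system range, everything that routes through it
	// stalls too, which is why the outage can look global.
}
```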
Some example runs with attached artifacts:
http://shakespeare-artifacts.crdb.io/public/201909-report/tpcc-small-g/20190911124637/index.html
http://shakespeare-artifacts.crdb.io/public/201909-report/tpcc-small-g/20190911130744/index.html
http://shakespeare-artifacts.crdb.io/public/201909-report/tpcc-small-h/20190911140000/index.html
(many more available)
cc @tbg
Jira issue: CRDB-5485