
liveness: medium-probability global ~5s cluster unavailability if liveness range leaseholder dies #41162

Open
knz opened this issue Sep 27, 2019 · 6 comments
Labels
A-kv-replication Relating to Raft, consensus, and coordination. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-2-temp-unavailability Temp crashes or other availability problems. Can be worked around or resolved by restarting. T-kv KV Team

Comments

@knz
Contributor

knz commented Sep 27, 2019

Describe the problem

Take a 3+ node cluster subject to load (e.g. kv or tpcc).

While the workload is running, randomly kill (not gracefully quit) a node that the client is not connected to, then restart it.

Most of the time, the workload continues unaffected, or with a small throughput dip. Sometimes, however (with likelihood decreasing as the number of nodes grows), QPS drops to zero for about 5 seconds, then resumes.

When this happens, it seems to correlate with the following entries in the logs of other nodes:

I190911 12:40:08.053787 317 storage/node_liveness.go:836  [n3,liveness-hb] retrying liveness update after storage.errRetryLiveness: result is ambiguous (error=unable to dial n2: breaker open [exhausted])
W190911 12:40:08.053807 317 storage/node_liveness.go:559  [n3,liveness-hb] slow heartbeat took 4.5s
W190911 12:40:08.053832 317 storage/node_liveness.go:484  [n3,liveness-hb] failed node liveness heartbeat: operation "node liveness heartbeat" timed out after 4.5s

The hypothesis is that if the leaseholder for the liveness range (rL) dies, and the lease on another, unrelated range (rX) expires on a different node nX, then nX needs a valid liveness record in order to ask for a new lease; that request fails for about 5 seconds, until rL finds a new leaseholder.

If rX is the meta range or a system range (namespace etc.), the unavailability can become global.
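To make the dependency chain explicit, here is a minimal toy sketch in Go (hypothetical code, not CockroachDB's; all names are invented for illustration): acquiring an epoch-based lease on rX first requires nX to heartbeat its liveness record, and that write has to go through the leaseholder of the liveness range rL.

package main

import (
	"errors"
	"fmt"
)

// livenessRange stands in for rL. While it has no leaseholder, every write to
// a node's liveness record fails.
type livenessRange struct{ hasLeaseholder bool }

func (l *livenessRange) heartbeatLiveness(nodeID int) error {
	if !l.hasLeaseholder {
		return errors.New("liveness range unavailable: no leaseholder")
	}
	return nil
}

// acquireEpochLease models node nX asking for a new epoch-based lease on rX:
// it must first successfully heartbeat its own liveness record on rL.
func acquireEpochLease(nodeID int, rL *livenessRange) error {
	if err := rL.heartbeatLiveness(nodeID); err != nil {
		return fmt.Errorf("n%d cannot acquire lease on rX: %v", nodeID, err)
	}
	return nil
}

func main() {
	rL := &livenessRange{hasLeaseholder: false}
	// While rL's leaseholder is dead, lease acquisition on rX stalls; this is
	// the ~5s window of unavailability described above.
	fmt.Println(acquireEpochLease(3, rL))
	// Once rL gets a new leaseholder, lease acquisition succeeds again.
	rL.hasLeaseholder = true
	fmt.Println(acquireEpochLease(3, rL))
}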

Some example runs with attached artifacts:

http://shakespeare-artifacts.crdb.io/public/201909-report/tpcc-small-g/20190911124637/index.html

http://shakespeare-artifacts.crdb.io/public/201909-report/tpcc-small-g/20190911130744/index.html

http://shakespeare-artifacts.crdb.io/public/201909-report/tpcc-small-h/20190911140000/index.html

(many more available)

cc @tbg

Jira issue: CRDB-5485

@knz knz added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-2-temp-unavailability Temp crashes or other availability problems. Can be worked around or resolved by restarting. A-kv-replication Relating to Raft, consensus, and coordination. labels Sep 27, 2019
@knz knz changed the title liveness: global cluster unavailability if liveness range leasholder dies liveness: low-probability global cluster unavailability if liveness range leasholder dies Sep 27, 2019
@knz knz changed the title liveness: low-probability global cluster unavailability if liveness range leasholder dies liveness: medium-probability global ~5s cluster unavailability if liveness range leasholder dies Sep 27, 2019
@knz
Contributor Author

knz commented Sep 27, 2019

@tbg let me know if I remember this right: your proposed solution was to tweak the lease expiration delay relative to the liveness record expiration delay. What was the ratio you had in mind again?

@knz
Contributor Author

knz commented Sep 27, 2019

Another suspicious 5-second interval in the logs:

I190911 13:59:07.475858 196 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:322  [n2] circuitbreaker: rpc [::]:26257 [n3] tripped: failed to resolve n3: unable to look up descriptor for n3
I190911 13:59:07.475875 196 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:447  [n2] circuitbreaker: rpc [::]:26257 [n3] event: BreakerTripped
I190911 13:59:07.669786 197 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:322  [n2] circuitbreaker: rpc [::]:26257 [n3] tripped: failed to resolve n3: unable to look up descriptor for n3
I190911 13:59:07.669800 197 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:447  [n2] circuitbreaker: rpc [::]:26257 [n3] event: BreakerTripped
I190911 13:59:07.669881 197 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:322  [n2] circuitbreaker: rpc [::]:26257 [n4] tripped: failed to resolve n4: unable to look up descriptor for n4
I190911 13:59:07.669907 197 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:447  [n2] circuitbreaker: rpc [::]:26257 [n4] event: BreakerTripped
I190911 13:59:08.130072 54 storage/node_liveness.go:474  [n2,liveness-hb] heartbeat failed on epoch increment; retrying
I190911 13:59:08.263502 412 storage/node_liveness.go:790  [n2,s2,r3/2:/System/{NodeLive…-tsd}] incremented n3 liveness epoch to 6
I190911 13:59:09.585542 164 gossip/gossip.go:1531  [n2] node has connected to cluster via gossip
I190911 13:59:09.585655 164 storage/stores.go:259  [n2] wrote 1 node addresses to persistent storage
I190911 13:59:11.333594 86 storage/stores.go:259  [n2] wrote 2 node addresses to persistent storage
I190911 13:59:11.408464 198 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:447  [n2] circuitbreaker: rpc [::]:26257 [n3] event: BreakerReset
I190911 13:59:12.596335 263 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:447  [n2] circuitbreaker: rpc [::]:26257 [n3] event: BreakerReset
I190911 13:59:13.279968 218 storage/queue.go:518  [n2,s2,r3/2:/System/{NodeLive…-tsd}] rate limited in MaybeAdd (raftlog): context canceled

@irfansharif
Contributor

@knz: your hypothesis sounds right; this is (unfortunately) a very plausible blip in that situation. It's not clear to me what can be done about it immediately, however. No matter what the ratio between the lease expiration and the liveness record expiration, the two could still very well expire at roughly the same point in time, no?

@knz
Contributor Author

knz commented Jun 11, 2020

Yes, as long as the protocol is unchanged, the fault remains.

A different protocol might be an option, perhaps pre-electing the next leaseholder ahead of time.

@tbg
Member

tbg commented Nov 12, 2020

> @knz: your hypothesis sounds right; this is (unfortunately) a very plausible blip in that situation. It's not clear to me what can be done about it immediately, however. No matter what the ratio between the lease expiration and the liveness record expiration, the two could still very well expire at roughly the same point in time, no?

No, because the liveness record is heartbeated ahead of its expiration. I think the liveness expiration is 9s, and after 4.5s nodes will try to extend it. So let's say the lease duration on the liveness range is 3s: in the "worst case" (excluding network latencies, etc., which make this a little worse), a node wants to refresh its liveness record when it has 4.5s left on the clock. The liveness range is down for the first 3s of that and then recovers. That leaves 1.5s for the node to heartbeat its liveness before it goes dark.
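As a back-of-the-envelope check of this arithmetic (the numbers below are taken from the paragraph above; the 3s liveness-range lease is the hypothetical value being discussed):

package main

import (
	"fmt"
	"time"
)

func main() {
	// Liveness records expire after 9s and a node starts renewing its record
	// once 4.5s remain; the liveness range itself carries a 3s lease.
	livenessExpiration := 9 * time.Second
	renewalLead := livenessExpiration / 2 // renewal starts with 4.5s left on the clock
	livenessRangeLease := 3 * time.Second

	// Worst case (ignoring network latency): the liveness range is dark for a
	// full lease interval exactly when the node starts renewing.
	slack := renewalLead - livenessRangeLease
	fmt.Printf("time left to heartbeat after the liveness range recovers: %s\n", slack) // 1.5s
}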

It does seem like a little tuning could get us a long way here. Right now, the expiration-based lease duration matches the liveness record duration:

// RangeLeaseActiveDuration is the duration of the active period of leader
// leases requested.
func (cfg RaftConfig) RangeLeaseActiveDuration() time.Duration {
	rangeLeaseActive, _ := cfg.RangeLeaseDurations()
	return rangeLeaseActive
}

// RangeLeaseRenewalDuration specifies a time interval at the end of the
// active lease interval (i.e. bounded to the right by the start of the stasis
// period) during which operations will trigger an asynchronous renewal of the
// lease.
func (cfg RaftConfig) RangeLeaseRenewalDuration() time.Duration {
	_, rangeLeaseRenewal := cfg.RangeLeaseDurations()
	return rangeLeaseRenewal
}

// NodeLivenessDurations computes durations for node liveness expiration and
// renewal based on a default multiple of Raft election timeout.
func (cfg RaftConfig) NodeLivenessDurations() (livenessActive, livenessRenewal time.Duration) {
	livenessActive = cfg.RangeLeaseActiveDuration()
	livenessRenewal = time.Duration(float64(livenessActive) * livenessRenewalFraction)
	return
}

It doesn't seem unreasonable to do something like this:

// RangeLeaseActiveDuration is the duration of the active period of leader
// leases requested.
func (cfg RaftConfig) RangeLeaseActiveDuration() time.Duration {
	rangeLeaseActive, _ := cfg.RangeLeaseDurations()
	if rangeLeaseActive < time.Second { // avoid being overly aggressive in tests, not sure this is needed but probably
		return rangeLeaseActive
	}
	return rangeLeaseActive / 3
}

There are some loose ends: if we set the lease expiration to 3s but the Raft election timeout is 10s, it doesn't buy us anything. I think the election timeout is closer to 3s as well, though, so it should be fine.
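The loose end can be sketched the same way (hedged: the recovery model below is a deliberate simplification, and the 3s/10s figures are the hypotheticals from the sentence above):

package main

import (
	"fmt"
	"time"
)

// recoveryTime is a crude model: the liveness range cannot get a new
// leaseholder before both the old lease has expired and a Raft election
// has completed.
func recoveryTime(leaseExpiration, raftElectionTimeout time.Duration) time.Duration {
	if raftElectionTimeout > leaseExpiration {
		return raftElectionTimeout
	}
	return leaseExpiration
}

func main() {
	renewalLead := 4500 * time.Millisecond // a node starts heartbeating with 4.5s left

	// If the Raft election timeout dominates, shrinking the lease buys nothing.
	fmt.Println(renewalLead - recoveryTime(3*time.Second, 10*time.Second)) // -5.5s: the node still goes dark
	// If the election timeout is also ~3s, the node keeps ~1.5s of slack.
	fmt.Println(renewalLead - recoveryTime(3*time.Second, 3*time.Second)) // 1.5s
}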

@github-actions

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
10 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!

@github-actions github-actions bot closed this as not planned Oct 2, 2023
@yuzefovich yuzefovich removed the X-stale label May 1, 2024