
liveness: medium-probability global ~5s cluster unavailability if liveness range leaseholder dies #41162

Open
knz opened this issue Sep 27, 2019 · 6 comments
Labels
A-kv-replication Relating to Raft, consensus, and coordination. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-2-temp-unavailability Temp crashes or other availability problems. Can be worked around or resolved by restarting. T-kv KV Team

Comments

@knz
Contributor

knz commented Sep 27, 2019

Describe the problem

Take a 3+ node cluster subject to load (e.g. kv or tpcc).

While the workload is running, randomly kill (not gracefully quit) a node that the client is not connected to, then restart it.

Most of the time, the workload continues unaffected, or with a small throughput dip. Sometimes, however (with likelihood decreasing as the number of nodes grows), QPS drops to zero for about 5 seconds, then resumes.

When this happens, it seems to correlate with the following entries in the logs of other nodes:

I190911 12:40:08.053787 317 storage/node_liveness.go:836  [n3,liveness-hb] retrying liveness update after storage.errRetryLiveness: result is ambiguous (error=unable to dial n2: breaker open [exhausted])
W190911 12:40:08.053807 317 storage/node_liveness.go:559  [n3,liveness-hb] slow heartbeat took 4.5s
W190911 12:40:08.053832 317 storage/node_liveness.go:484  [n3,liveness-hb] failed node liveness heartbeat: operation "node liveness heartbeat" timed out after 4.5s

The hypothesis is that if the leaseholder for the liveness range (rL) dies, and the lease on another, unrelated range (rX) expires on a different node nX, then nX needs a valid liveness record in order to ask for a new lease; that request fails for about 5 seconds, until rL finds a new leaseholder.

If rX is the meta range or a system range (namespace etc.), the unavailability can become global.
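To make the dependency chain explicit, here is a minimal toy sketch in Go (hypothetical code, not CockroachDB's; all names are invented for illustration): acquiring an epoch-based lease on rX first requires nX to heartbeat its liveness record, and that write has to go through the leaseholder of the liveness range rL.

package main

import (
	"errors"
	"fmt"
)

// livenessRange stands in for rL. While it has no leaseholder, every write to
// a node's liveness record fails.
type livenessRange struct{ hasLeaseholder bool }

func (l *livenessRange) heartbeatLiveness(nodeID int) error {
	if !l.hasLeaseholder {
		return errors.New("liveness range unavailable: no leaseholder")
	}
	return nil
}

// acquireEpochLease models node nX asking for a new epoch-based lease on rX:
// it must first successfully heartbeat its own liveness record on rL.
func acquireEpochLease(nodeID int, rL *livenessRange) error {
	if err := rL.heartbeatLiveness(nodeID); err != nil {
		return fmt.Errorf("n%d cannot acquire lease on rX: %v", nodeID, err)
	}
	return nil
}

func main() {
	rL := &livenessRange{hasLeaseholder: false}
	// While rL's leaseholder is dead, lease acquisition on rX stalls; this is
	// the ~5s window of unavailability described above.
	fmt.Println(acquireEpochLease(3, rL))
	// Once rL gets a new leaseholder, lease acquisition succeeds again.
	rL.hasLeaseholder = true
	fmt.Println(acquireEpochLease(3, rL))
}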

Some example runs with attached artifacts:

http://shakespeare-artifacts.crdb.io/public/201909-report/tpcc-small-g/20190911124637/index.html

http://shakespeare-artifacts.crdb.io/public/201909-report/tpcc-small-g/20190911130744/index.html

http://shakespeare-artifacts.crdb.io/public/201909-report/tpcc-small-h/20190911140000/index.html

(many more available)

cc @tbg

Jira issue: CRDB-5485

@knz knz added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-2-temp-unavailability Temp crashes or other availability problems. Can be worked around or resolved by restarting. A-kv-replication Relating to Raft, consensus, and coordination. labels Sep 27, 2019
@knz knz changed the title liveness: global cluster unavailability if liveness range leasholder dies liveness: low-probability global cluster unavailability if liveness range leasholder dies Sep 27, 2019
@knz knz changed the title liveness: low-probability global cluster unavailability if liveness range leasholder dies liveness: medium-probability global ~5s cluster unavailability if liveness range leasholder dies Sep 27, 2019
@knz
Contributor Author

knz commented Sep 27, 2019

@tbg let me know if I remember this right: your proposed solution was to tweak the lease expiration delay relative to the liveness record expiration delay. What was the ratio you had in mind again?

@knz
Contributor Author

knz commented Sep 27, 2019

Another suspicious 5-second interval in the logs:

I190911 13:59:07.475858 196 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:322  [n2] circuitbreaker: rpc [::]:26257 [n3] tripped: failed to resolve n3: unable to look up descriptor for n3
I190911 13:59:07.475875 196 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:447  [n2] circuitbreaker: rpc [::]:26257 [n3] event: BreakerTripped
I190911 13:59:07.669786 197 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:322  [n2] circuitbreaker: rpc [::]:26257 [n3] tripped: failed to resolve n3: unable to look up descriptor for n3
I190911 13:59:07.669800 197 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:447  [n2] circuitbreaker: rpc [::]:26257 [n3] event: BreakerTripped
I190911 13:59:07.669881 197 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:322  [n2] circuitbreaker: rpc [::]:26257 [n4] tripped: failed to resolve n4: unable to look up descriptor for n4
I190911 13:59:07.669907 197 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:447  [n2] circuitbreaker: rpc [::]:26257 [n4] event: BreakerTripped
I190911 13:59:08.130072 54 storage/node_liveness.go:474  [n2,liveness-hb] heartbeat failed on epoch increment; retrying
I190911 13:59:08.263502 412 storage/node_liveness.go:790  [n2,s2,r3/2:/System/{NodeLive…-tsd}] incremented n3 liveness epoch to 6
I190911 13:59:09.585542 164 gossip/gossip.go:1531  [n2] node has connected to cluster via gossip
I190911 13:59:09.585655 164 storage/stores.go:259  [n2] wrote 1 node addresses to persistent storage
I190911 13:59:11.333594 86 storage/stores.go:259  [n2] wrote 2 node addresses to persistent storage
I190911 13:59:11.408464 198 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:447  [n2] circuitbreaker: rpc [::]:26257 [n3] event: BreakerReset
I190911 13:59:12.596335 263 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:447  [n2] circuitbreaker: rpc [::]:26257 [n3] event: BreakerReset
I190911 13:59:13.279968 218 storage/queue.go:518  [n2,s2,r3/2:/System/{NodeLive…-tsd}] rate limited in MaybeAdd (raftlog): context canceled

@irfansharif
Contributor

@knz: your hypothesis sounds right; this is (unfortunately) a very plausible blip in that situation. It's not clear to me what can be done about it immediately, however. No matter what the ratio between the lease expiration and the liveness record expiration, the two could still very well expire at roughly the same point in time, no?

@knz
Contributor Author

knz commented Jun 11, 2020

Yes, as long as the protocol is unchanged, the fault remains.

A different protocol might be an option, perhaps pre-electing the next leaseholder ahead of time.

@tbg
Member

tbg commented Nov 12, 2020

> @knz: your hypothesis sounds right; this is (unfortunately) a very plausible blip in that situation. It's not clear to me what can be done about it immediately, however. No matter what the ratio between the lease expiration and the liveness record expiration, the two could still very well expire at roughly the same point in time, no?

No, because the liveness record is heartbeated ahead of its expiration. I think the liveness expiration is 9s, and after 4.5s nodes will try to extend it. So let's say the lease duration on the liveness range is 3s: in the "worst case" (excluding network latencies, etc., which make this a little worse), a node wants to refresh its liveness record when it has 4.5s left on the clock. The liveness range is down for the first 3s of that and then recovers. That leaves 1.5s for the node to heartbeat its liveness before it goes dark.
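As a back-of-the-envelope check of this arithmetic (the numbers below are taken from the paragraph above; the 3s liveness-range lease is the hypothetical value being discussed):

package main

import (
	"fmt"
	"time"
)

func main() {
	// Liveness records expire after 9s and a node starts renewing its record
	// once 4.5s remain; the liveness range itself carries a 3s lease.
	livenessExpiration := 9 * time.Second
	renewalLead := livenessExpiration / 2 // renewal starts with 4.5s left on the clock
	livenessRangeLease := 3 * time.Second

	// Worst case (ignoring network latency): the liveness range is dark for a
	// full lease interval exactly when the node starts renewing.
	slack := renewalLead - livenessRangeLease
	fmt.Printf("time left to heartbeat after the liveness range recovers: %s\n", slack) // 1.5s
}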

It does seem like a little tuning could get us a long way here. Right now, the expiration-based lease duration matches the liveness record duration:

// RangeLeaseActiveDuration is the duration of the active period of leader
// leases requested.
func (cfg RaftConfig) RangeLeaseActiveDuration() time.Duration {
	rangeLeaseActive, _ := cfg.RangeLeaseDurations()
	return rangeLeaseActive
}

// RangeLeaseRenewalDuration specifies a time interval at the end of the
// active lease interval (i.e. bounded to the right by the start of the stasis
// period) during which operations will trigger an asynchronous renewal of the
// lease.
func (cfg RaftConfig) RangeLeaseRenewalDuration() time.Duration {
	_, rangeLeaseRenewal := cfg.RangeLeaseDurations()
	return rangeLeaseRenewal
}

// NodeLivenessDurations computes durations for node liveness expiration and
// renewal based on a default multiple of Raft election timeout.
func (cfg RaftConfig) NodeLivenessDurations() (livenessActive, livenessRenewal time.Duration) {
	livenessActive = cfg.RangeLeaseActiveDuration()
	livenessRenewal = time.Duration(float64(livenessActive) * livenessRenewalFraction)
	return
}

It doesn't seem unreasonable to do something like this:

// RangeLeaseActiveDuration is the duration of the active period of leader
// leases requested.
func (cfg RaftConfig) RangeLeaseActiveDuration() time.Duration {
	rangeLeaseActive, _ := cfg.RangeLeaseDurations()
	if rangeLeaseActive < time.Second { // avoid being overly aggressive in tests, not sure this is needed but probably
		return rangeLeaseActive
	}
	return rangeLeaseActive / 3
}

There are some loose ends: if we set the lease expiration to 3s but the Raft election timeout is 10s, it doesn't buy us anything. I think the election timeout is closer to 3s as well, though, so it should be fine.
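The loose end can be sketched the same way (hedged: the recovery model below is a deliberate simplification, and the 3s/10s figures are the hypotheticals from the sentence above):

package main

import (
	"fmt"
	"time"
)

// recoveryTime is a crude model: the liveness range cannot get a new
// leaseholder before both the old lease has expired and a Raft election
// has completed.
func recoveryTime(leaseExpiration, raftElectionTimeout time.Duration) time.Duration {
	if raftElectionTimeout > leaseExpiration {
		return raftElectionTimeout
	}
	return leaseExpiration
}

func main() {
	renewalLead := 4500 * time.Millisecond // a node starts heartbeating with 4.5s left

	// If the Raft election timeout dominates, shrinking the lease buys nothing.
	fmt.Println(renewalLead - recoveryTime(3*time.Second, 10*time.Second)) // -5.5s: the node still goes dark
	// If the election timeout is also ~3s, the node keeps ~1.5s of slack.
	fmt.Println(renewalLead - recoveryTime(3*time.Second, 3*time.Second)) // 1.5s
}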

@github-actions

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
10 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!

@github-actions github-actions bot closed this as not planned Oct 2, 2023
@yuzefovich yuzefovich removed the X-stale label May 1, 2024