kvserver: don't check ConnHealth when releasing proposal quota #85565
Conversation
I'll work on a test tomorrow, but for now: is the approach correct?
Force-pushed from 9e87a02 to 4735a38.
// w.r.t. the actual state of that replica, but it will be disambiguated
// soon. We do a second check using the NodeLiveness to increase the chance
// of disambiguating here.
if !r.store.cfg.NodeLiveness.IsAvailable(rep.NodeID) {
Alternatives:
1. Only do this check if we recently became the leader, because otherwise isFollowerActiveSince does the job.
2. Only do the NodeLiveness check. Are there cases when it would not detect things that the per-replica lastUpdateTimes detects?
3. Leave the ConnHealth check, but gate it by "I recently became the leader", similarly to (1). Are there cases when ConnHealth is more accurate than NodeLiveness?
I think the case we're interested in here is where the follower replica is active, and the node is up and running, but the node is struggling and is failing its liveness heartbeats. This is unrelated to whether we just became the leader or not -- even if we've been the leader for a long time, and isFollowerActiveSince returns true, if the node is struggling then we don't want it to hold up the range.
One typical example of this is a stalled disk, since we do a synchronous disk write during node heartbeats specifically to check this (stalled disks are a common failure mode with networked disks used in cloud VMs):
cockroach/pkg/kv/kvserver/liveness/liveness.go
Lines 1268 to 1290 in eeb7236
// We do a sync write to all disks before updating liveness, so that a
// faulty or stalled disk will cause us to fail liveness and lose our leases.
// All disks are written concurrently.
//
// We do this asynchronously in order to respect the caller's context, and
// coalesce concurrent writes onto an in-flight one. This is particularly
// relevant for a stalled disk during a lease acquisition heartbeat, where
// we need to return a timely NLHE to the caller such that it will try a
// different replica and nudge it into acquiring the lease. This can leak a
// goroutine in the case of a stalled disk.
nl.mu.RLock()
engines := nl.mu.engines
nl.mu.RUnlock()
resultCs := make([]<-chan singleflight.Result, len(engines))
for i, eng := range engines {
    eng := eng // pin the loop variable
    resultCs[i], _ = nl.engineSyncs.DoChan(strconv.Itoa(i), func() (interface{}, error) {
        return nil, nl.stopper.RunTaskWithErr(ctx, "liveness-hb-diskwrite",
            func(ctx context.Context) error {
                return storage.WriteSyncNoop(eng)
            })
    })
}
So we should have both checks:
- isFollowerActiveSince will check that the Raft replica is live.
- IsAvailable will check that the node is live and reasonably healthy.
This is related to a broader discussion on whether we even want the quota pool to ever hold up the range at all (#75978, and semi-related to #79215). Most of the time, we really don't. But for now, this is one step in the right direction.
I'll also note that IsAvailable() will return false if the node is not in the local liveness cache (i.e. we don't know about it). This seems fine, since that will default to not holding up the range.
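To make the combined gating concrete, here is a minimal, self-contained Go sketch. All names (replicaInfo, followersToWaitFor, and the predicate functions) are illustrative assumptions, not the actual CockroachDB quota-pool code; it only shows that a follower keeps holding up quota release only if both the replica-activity check and the node-availability check pass.

package main

import "fmt"

// replicaInfo is a hypothetical stand-in for the follower descriptors the
// quota pool iterates over (the real code uses roachpb types).
type replicaInfo struct {
    replicaID int64
    nodeID    int64
}

// followersToWaitFor returns the followers whose quota we keep waiting on:
// only those whose replica looks active (an isFollowerActiveSince-style
// check) AND whose node passes an IsAvailable-style liveness check. Any
// follower failing either check has its quota released instead.
func followersToWaitFor(
    followers []replicaInfo,
    replicaActive func(replicaID int64) bool,
    nodeAvailable func(nodeID int64) bool,
) []replicaInfo {
    var out []replicaInfo
    for _, f := range followers {
        if replicaActive(f.replicaID) && nodeAvailable(f.nodeID) {
            out = append(out, f)
        }
    }
    return out
}

func main() {
    followers := []replicaInfo{{replicaID: 2, nodeID: 2}, {replicaID: 3, nodeID: 3}}
    active := func(_ int64) bool { return true }        // both replicas sent Raft messages recently
    available := func(id int64) bool { return id != 3 } // node 3 is failing its liveness heartbeats
    // Only replica 2 still holds up quota release; replica 3's node is not available.
    fmt.Println(followersToWaitFor(followers, active, available))
}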
Is this disk write on the requested node, or on some other range/nodes where the liveness records are stored?
This happens on every node every time they send a heartbeat.
For reference, liveness records themselves are stored as key/value pairs in a range, just like any other data in the system. So there will be one leaseholder somewhere in the cluster who's responsible for storing them. A heartbeat is just a write to this key with some metadata -- including the current time, which is then used for expiration when the node fails to heartbeat.
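For illustration, here is a minimal sketch of the expiration idea described above. The livenessRecord type and its fields are simplified assumptions (the real liveness record is a protobuf with more fields, such as an epoch), but the core check is just whether the last heartbeat's expiration is still in the future.

package main

import (
    "fmt"
    "time"
)

// livenessRecord is a simplified, hypothetical stand-in for the liveness
// record stored as an ordinary key/value pair in the liveness range.
type livenessRecord struct {
    nodeID     int64
    expiration time.Time // pushed forward a few seconds by every successful heartbeat
}

// isLive: a node counts as live only while its latest heartbeat has not expired.
func isLive(rec livenessRecord, now time.Time) bool {
    return now.Before(rec.expiration)
}

func main() {
    now := time.Now()
    rec := livenessRecord{nodeID: 1, expiration: now.Add(9 * time.Second)}
    fmt.Println(isLive(rec, now))                     // true: heartbeat still valid
    fmt.Println(isLive(rec, now.Add(20*time.Second))) // false: the node failed to heartbeat again in time
}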
Ah, got it.
What kinds of events refresh isFollowerActiveSince timestamps? Are these successful Raft RPCs, or any interactions?
Any example when isFollowerActiveSince is true, but the node is not live?
> What kinds of events refresh isFollowerActiveSince timestamps? Are these successful Raft RPCs, or any interactions?
Any Raft message received from the follower:
cockroach/pkg/kv/kvserver/replica_raft.go
Line 518 in d10bde7
r.mu.lastUpdateTimes.update(req.FromReplica.ReplicaID, timeutil.Now())
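A small sketch of how such a per-replica activity map can work; the type and method names mirror the ones mentioned above, but the implementation details are assumptions rather than the actual replica code.

package main

import (
    "fmt"
    "time"
)

// lastUpdateTimes is a simplified stand-in for the per-replica activity map
// the leader keeps; the real map is guarded by the replica mutex and keyed by
// roachpb.ReplicaID.
type lastUpdateTimes map[int64]time.Time

// update records activity for a replica. As noted above, this runs for every
// Raft message received from that follower, regardless of message type.
func (l lastUpdateTimes) update(replicaID int64, now time.Time) {
    l[replicaID] = now
}

// isFollowerActiveSince: a follower counts as active if we heard any Raft
// message from it within the given threshold.
func (l lastUpdateTimes) isFollowerActiveSince(
    replicaID int64, now time.Time, threshold time.Duration,
) bool {
    last, ok := l[replicaID]
    return ok && now.Sub(last) <= threshold
}

func main() {
    now := time.Now()
    lut := lastUpdateTimes{}
    lut.update(2, now.Add(-3*time.Second))
    fmt.Println(lut.isFollowerActiveSince(2, now, 10*time.Second)) // true: heard from it 3s ago
    fmt.Println(lut.isFollowerActiveSince(3, now, 10*time.Second)) // false: never heard from replica 3
}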
> Any example when isFollowerActiveSince is true, but the node is not live?
A simple but contrived example might be a node with a partial network partition: it's not able to communicate with the liveness leaseholder (so it can't heartbeat), but it is able to communicate with other nodes. This isn't all that relevant, because in this case the node itself isn't necessarily struggling as such, but it's a fairly clear example.
A more relevant example might be a very overloaded node. Maybe it's taking more than 4.5 seconds (the heartbeat timeout) to fsync the disk, so node heartbeats aren't being sent. Or maybe the heartbeat goroutine isn't getting scheduled, or the network is backlogged. However, it is able to occasionally/eventually append Raft proposals and respond to Raft leaders. In these cases, replicas on that node would be very slow to append Raft proposals, and we would rather sacrifice the followers on that node than slow down the entire range.
Force-pushed from 4735a38 to 73cb5f1.
How about an alternative simple approach (either instead of, or after, this PR): if a majority of replicas is healthy, we progress at their median speed. This has less chance of being stuck, even if the "slowest" node is in fact inactive or fails to update liveness. With such an approach, we can avoid both checks.
I guess the problem with this approach is that the slowest replicas will increasingly fall behind. Are there any other throttling mechanisms that try slowing down the quick replicas, so that all replicas can proceed at the same speed? However, honouring the slow node too much is also not great. Maybe in such cases the distribution layer should step in and rebalance nodes to have similar speed.
What you're saying makes a lot of sense, and I think this is the direction we want to move in. Rather than checking the median and the log index, we'd maybe want to check that a quorum have applied up to some index. Raft already requires a quorum to have appended an entry before it is committed, which provides a baseline of throttling here.
We did something similar recently in #83851, where we stop replicating to followers that are experiencing IO overload (specifically, where Pebble compactions aren't able to keep up with incoming writes, inverting the LSM). But we make sure we keep a quorum of followers to make forward progress.
This is related to admission control, which monitors the local Pebble store's health and begins backpressuring and prioritizing incoming traffic once Pebble starts struggling. Raft traffic is a bit trickier, in that the Raft log must be deterministic, so once a command has been committed to it we can't drop or reorder commands. On the leaseholder, we can (and do) apply admission control to incoming traffic before it hits the Raft log, but on followers there's really nothing we can do -- we'll just have to append and apply those entries as-is. So longer-term, the leaseholder needs to take the health of followers into account for purposes of admission control, and if followers are getting overloaded, we'll have to rely on the allocator to start moving them around (as you allude to in your comment).
The role of the quota pool in all of this is still a bit unclear. There have been a lot of discussions about this recently, e.g. on #75066 and #83851. There are also related challenges with flow control of Raft messages, since we can currently OOM if the Raft queues fill up with huge messages (e.g. SSTable ingestion), see #79755.
For now though, we'll have to get rid of the useless RPC check anyway, and we may as well add in a liveness check while we're here. We're planning on doing more work around admission control and the quota pool for 23.1, but we haven't really firmed up a concrete plan yet (Tobi and Sumeer have been more involved in this than I have).
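To illustrate the "quorum has applied up to some index" idea, here is a small self-contained sketch. The function quorumAppliedIndex and its inputs are hypothetical, but the computation (take the highest index reached by at least a quorum of replicas) is the standard Raft-style quorum calculation.

package main

import (
    "fmt"
    "sort"
)

// quorumAppliedIndex returns the highest log index that at least a quorum of
// replicas has already applied. Releasing quota up to this index would let
// the range progress at quorum speed rather than at the pace of the slowest
// follower. This is an illustration of the idea discussed above, not the
// CockroachDB implementation.
func quorumAppliedIndex(applied []uint64) uint64 {
    if len(applied) == 0 {
        return 0
    }
    sorted := append([]uint64(nil), applied...)
    // Sort descending; the element at position quorum-1 is reached by a quorum.
    sort.Slice(sorted, func(i, j int) bool { return sorted[i] > sorted[j] })
    quorum := len(sorted)/2 + 1
    return sorted[quorum-1]
}

func main() {
    // Three replicas have applied up to indexes 100, 90, and 10 (a very slow follower).
    fmt.Println(quorumAppliedIndex([]uint64{100, 90, 10})) // 90: two of three replicas have applied >= 90
}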
Did you and @aliher1911 come up with an easy way to test this? If not, I think this is fine, and we can rely on broader integration/end-to-end tests to pick up any issues.
PS: let's tack on " in quota pool" in the commit title.
Force-pushed from 73cb5f1 to 1257456.
LGTM, congrats on your first PR! 🚀
Let's change "Fixes #84202" to "Touches #84202", and add a comment on the issue explaining why we didn't do it this time around. It's something we may want to consider doing later, so let's leave it open. Also, the "Fixes/touches #84202" part isn't usually included in the commit message, only the PR description.
Note: This PR changed its focus from changing a check to just removing it. As a result, there are strictly fewer cases when the quota is held up, so this change won't introduce new quota-related blockages. The risk is in introducing more cases where followers are overloaded, but there is a feedback mechanism ("paused replicas") which pushes back, so there is already good protection against this. Not adding tests.
Force-pushed from 1257456 to 49e130f.
All done.
Let's backport this change too. You can add on the
bors r=erikgrinaker
Build failed (retrying...)
Build succeeded
The proposal quota release procedure checked node connection health (ConnHealth) for every
node that appeared active after the replica activity checks. This is expensive, and
previously caused issues like #84943.
This change removes the ConnHealth check, because other checks, such as
isFollowerActiveSince and paused replicas, provide sufficient protection from
various kinds of overloads.
Touches #84202
Release note: None