
kvserver: don't check ConnHealth when releasing proposal quota #85565

Merged
merged 1 commit on Aug 5, 2022

Conversation

pav-kv
Collaborator

@pav-kv pav-kv commented Aug 3, 2022

The proposal quota release procedure checked node connection health for every
node that appeared to be active according to the per-replica activity checks.
This is expensive, and previously caused issues like #84943.

This change removes the ConnHealth check, because other checks, such as
isFollowerActiveSince and paused replicas, provide sufficient protection from
various kinds of overloads.

Touches #84202

Release note: None

@pav-kv pav-kv requested a review from a team as a code owner August 3, 2022 20:25
@cockroach-teamcity
Member

This change is Reviewable

@pav-kv
Collaborator Author

pav-kv commented Aug 3, 2022

I'll work on a test tomorrow, but for now: is the approach correct?

@pav-kv pav-kv force-pushed the proposal_quota_remove_healthcheck branch from 9e87a02 to 4735a38 on August 3, 2022 20:29
// w.r.t. the actual state of that replica, but it will be disambiguated
// soon. We do a second check using the NodeLiveness to increase the chance
// of disambiguating here.
if !r.store.cfg.NodeLiveness.IsAvailable(rep.NodeID) {
Collaborator Author

@pav-kv pav-kv Aug 3, 2022

Alternatives:

  1. Only do this check if we recently became the leader, because otherwise isFollowerActiveSince does the job.
  2. Only do the NodeLiveness check. Are there cases when it would not detect things that per-replica lastUpdateTimes detects?
  3. Leave the ConnHealth, but gate it by "I recently became the leader", similarly to (1). Are there cases when ConnHealth is more accurate than NodeLiveness?

Contributor

@erikgrinaker erikgrinaker Aug 4, 2022

I think the case we're interested in here is where the follower replica is active, and the node is up and running, but the node is struggling and is failing its liveness heartbeats. This is unrelated to whether we just became the leader or not -- even if we've been the leader for a long time, and isFollowerActiveSince returns true, if the node is struggling then we don't want it to hold up the range.

One typical example of this is a stalled disk, since we do a synchronous disk write during node heartbeats specifically to check this (stalled disks are a common failure mode with networked disks used in cloud VMs):

// We do a sync write to all disks before updating liveness, so that a
// faulty or stalled disk will cause us to fail liveness and lose our leases.
// All disks are written concurrently.
//
// We do this asynchronously in order to respect the caller's context, and
// coalesce concurrent writes onto an in-flight one. This is particularly
// relevant for a stalled disk during a lease acquisition heartbeat, where
// we need to return a timely NLHE to the caller such that it will try a
// different replica and nudge it into acquiring the lease. This can leak a
// goroutine in the case of a stalled disk.
nl.mu.RLock()
engines := nl.mu.engines
nl.mu.RUnlock()
resultCs := make([]<-chan singleflight.Result, len(engines))
for i, eng := range engines {
	eng := eng // pin the loop variable
	resultCs[i], _ = nl.engineSyncs.DoChan(strconv.Itoa(i), func() (interface{}, error) {
		return nil, nl.stopper.RunTaskWithErr(ctx, "liveness-hb-diskwrite",
			func(ctx context.Context) error {
				return storage.WriteSyncNoop(eng)
			})
	})
}

So we should have both checks:

  • isFollowerActiveSince will check that the Raft replica is live.
  • IsAvailable will check that the node is live and reasonably healthy.
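
In sketch form, combining the two could look roughly like this (illustrative Go with made-up names and types, not the actual quota-pool code):

package quotasketch

// followerState is a made-up container for the two signals discussed above.
type followerState struct {
	index         uint64 // last log index we know the follower has reached
	replicaActive bool   // the isFollowerActiveSince signal: heard from over Raft recently
	nodeLive      bool   // the IsAvailable signal: node is passing its liveness heartbeats
}

// minIndexGaters returns the log indexes of the followers that are allowed to
// hold up quota release; struggling followers are sacrificed rather than
// holding up the whole range.
func minIndexGaters(followers []followerState) []uint64 {
	var indexes []uint64
	for _, f := range followers {
		if f.replicaActive && f.nodeLive {
			indexes = append(indexes, f.index)
		}
	}
	return indexes
}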

This is related to a broader discussion on whether we even want the quota pool to ever hold up the range at all (#75978, and semi-related to #79215). Most of the time, we really don't. But for now, this is one step in the right direction.

I'll also note that IsAvailable() will return false if the node is not in the local liveness cache (i.e. we don't know about it). This seems fine, since that will default to not holding up the range.

Collaborator Author

Is this disk write done on the heartbeating node itself, or on the node(s) holding the range where the liveness records are stored?

Contributor

@erikgrinaker erikgrinaker Aug 4, 2022

This happens on every node, every time it sends a heartbeat.

For reference, liveness records themselves are stored as key/value pairs in a range, just like any other data in the system. So there will be one leaseholder somewhere in the cluster that's responsible for storing them. A heartbeat is just a write to this key with some metadata -- including the current time, which is then used for expiration when the node fails to heartbeat.
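
As a toy model of that flow (names, fields, and the key format here are made up for illustration; the real liveness proto and keys differ):

package livenesssketch

import (
	"fmt"
	"time"
)

// LivenessRecord is a toy stand-in for a node liveness record: a heartbeat is
// just a KV write of this value under a per-node key.
type LivenessRecord struct {
	NodeID     int
	Epoch      int64
	Expiration time.Time // now + liveness TTL; the node counts as dead once this passes
}

// heartbeat writes the record through the liveness range's leaseholder, like
// any other KV write. kvPut is a stand-in for the KV API.
func heartbeat(
	nodeID int, epoch int64, now time.Time, ttl time.Duration,
	kvPut func(key string, value LivenessRecord) error,
) error {
	rec := LivenessRecord{NodeID: nodeID, Epoch: epoch, Expiration: now.Add(ttl)}
	return kvPut(fmt.Sprintf("liveness/%d", nodeID), rec)
}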

Collaborator Author

Ah, got it.

Collaborator Author

What kinds of events refresh isFollowerActiveSince timestamps? Are these successful Raft RPCs, or any interactions?
Any example when isFollowerActiveSince is true, but the node is not live?

Contributor

@erikgrinaker erikgrinaker Aug 4, 2022

What kinds of events refresh isFollowerActiveSince timestamps? Are these successful Raft RPCs, or any interactions?

Any Raft message received from the follower:

r.mu.lastUpdateTimes.update(req.FromReplica.ReplicaID, timeutil.Now())
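
For illustration, the mechanism is roughly a per-replica timestamp map plus a threshold check (a simplified sketch, not the exact kvserver types):

package activitysketch

import "time"

type replicaID int32

// lastUpdateTimes records, per follower, when we last heard from it over Raft.
type lastUpdateTimes map[replicaID]time.Time

// update is called for every Raft message received from the follower.
func (m lastUpdateTimes) update(id replicaID, now time.Time) { m[id] = now }

// isFollowerActiveSince reports whether the follower has been heard from
// within the given threshold.
func (m lastUpdateTimes) isFollowerActiveSince(
	id replicaID, now time.Time, threshold time.Duration,
) bool {
	last, ok := m[id]
	return ok && now.Sub(last) <= threshold
}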

Any example when isFollowerActiveSince is true, but the node is not live?

A simple but contrived example might be a node with a partial network partition: it's not able to communicate with the liveness leaseholder (so it can't heartbeat), but it is able to communicate with other nodes. This isn't all that relevant, because in this case the node itself isn't necessarily struggling as such, but it's a fairly clear example.

A more relevant example might be a very overloaded node. Maybe it's taking more than 4.5 seconds (the heartbeat timeout) to fsync the disk, so node heartbeats aren't being sent. Or maybe the heartbeat goroutine isn't getting scheduled, or the network is backlogged. However, it is able to occasionally/eventually append Raft proposals and respond to Raft leaders. In these cases, replicas on that node would be very slow to append Raft proposals, and we would rather sacrifice the followers on that node than slow down the entire range.

@pav-kv pav-kv force-pushed the proposal_quota_remove_healthcheck branch from 4735a38 to 73cb5f1 on August 4, 2022 16:33
@pav-kv
Collaborator Author

pav-kv commented Aug 4, 2022

How about an alternative simple approach (either instead of, or after, this PR):

  • Collect progress of the replicas (the same way as we do now), particularly the log size estimate from each.
  • For those replicas that we didn't hear from, be pessimistic.
  • Set the minIndex to be the median of the log size estimates (all of them, not just those that are alive; the dead/slow ones will simply be at the bottom of the sorted list).

If a majority of the replicas is healthy, we will progress at their median speed. This has less chance of getting stuck, even if the "slowest" node is in fact inactive or fails to update liveness. With such an approach, we can avoid both the isFollowerActiveSince and the NodeLiveness checks.
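
Roughly (a sketch of the idea only, with made-up names, treating the estimates as plain log indexes; real code would work off the Raft progress map):

package mediansketch

import "sort"

// medianMinIndex returns the median of the followers' log index estimates.
// Unresponsive followers are included pessimistically (e.g. with index 0), so
// they sort to the bottom and stop mattering once a majority is healthy.
func medianMinIndex(indexes []uint64) uint64 {
	if len(indexes) == 0 {
		return 0
	}
	sorted := append([]uint64(nil), indexes...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	return sorted[len(sorted)/2]
}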

I guess the problem with this approach is that the slowest replicas will increasingly fall behind. Are there any other throttling mechanisms that try to slow down the fast replicas, so that all replicas can move at roughly the same speed?

However, honouring the slow node too much is also not great. Maybe in such cases the distribution layer should step in and rebalance replicas so that nodes run at similar speeds.

@erikgrinaker
Contributor

What you're saying makes a lot of sense, and I think this is the direction we want to move in. Rather than checking the median log index, we'd maybe want to check that a quorum has applied up to some index. Raft already requires a quorum to have appended an entry before it is committed, which provides a baseline of throttling here.
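
In sketch form (illustrative only, with made-up names):

package quorumsketch

import "sort"

// quorumAppliedIndex returns the highest log index that at least a quorum of
// replicas has applied, mirroring how Raft derives the commit index from the
// per-replica match indexes.
func quorumAppliedIndex(applied []uint64) uint64 {
	if len(applied) == 0 {
		return 0
	}
	sorted := append([]uint64(nil), applied...)
	// Descending order: sorted[q-1] is the largest index reached by at least q replicas.
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] > sorted[j] })
	quorum := len(applied)/2 + 1
	return sorted[quorum-1]
}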

We did something similar recently in #83851, where we stop replicating to followers that are experiencing IO overload (specifically, where Pebble compactions aren't able to keep up with incoming writes, inverting the LSM). But we make sure we keep a quorum of followers to make forward progress.

This is related to admission control, which monitors the local Pebble store's health and begins backpressuring and prioritizing incoming traffic once Pebble starts struggling. Raft traffic is a bit trickier, in that the Raft log must be deterministic: once a command has been committed to it, we can't drop or reorder commands. On the leaseholder, we can (and do) apply admission control to incoming traffic before it hits the Raft log, but on followers there's really nothing we can do -- we'll just have to append and apply those entries as-is.

So longer-term, the leaseholder needs to take the health of followers into account for purposes of admission control, and if followers are getting overloaded, we'll have to rely on the allocator to start moving them around (as you allude to in your comment). The role of the quota pool in all of this is still a bit unclear. There have been a lot of discussions about this recently, e.g. on #75066 and #83851. There are also related challenges with flow control of Raft messages, since we can currently OOM if the Raft queues fill up with huge messages (e.g. SSTable ingestion), see #79755.

For now though, we'll have to get rid of the useless RPC check anyway, and we may as well add in a liveness check while we're here. We're planning on doing more work around admission control and the quota pool for 23.1, but we haven't really firmed up a concrete plan yet (Tobi and Sumeer have been more involved in this than I have).

Contributor

@erikgrinaker erikgrinaker left a comment

Did you and @aliher1911 come up with an easy way to test this? If not, I think this is fine, and we can rely on broader integration/end-to-end tests to pick up any issues.

PS: let's tack on " in quota pool" in the commit title.

pkg/kv/kvserver/replica_proposal_quota.go (review comment outdated; resolved)
@pav-kv pav-kv force-pushed the proposal_quota_remove_healthcheck branch from 73cb5f1 to 1257456 on August 5, 2022 14:08
@pav-kv pav-kv changed the title from "kvserver: check liveness instead of ConnHealth" to "kvserver: don't check ConnHealth when releasing proposal quota" on Aug 5, 2022
@pav-kv pav-kv requested a review from erikgrinaker August 5, 2022 14:11
Contributor

@erikgrinaker erikgrinaker left a comment

LGTM, congrats on your first PR! 🚀

Let's change "Fixes #84202" to "Touches #84202", and add a comment on the issue explaining why we didn't do it this time around. It's something we may want to consider doing later, so let's leave it open. Also, the "Fixes/touches #84202" part isn't usually included in the commit message, only the PR description.

@pav-kv
Collaborator Author

pav-kv commented Aug 5, 2022

Note: This PR changed its focus from changing a check to just removing it. As a result, there are strictly fewer cases in which the quota is held up, so this change won't introduce new quota-related blockages. The risk is that it introduces more cases where followers get overloaded, but there is a feedback mechanism ("paused replicas") which pushes back, so there is already good protection against this.

Not adding tests.

@pav-kv pav-kv force-pushed the proposal_quota_remove_healthcheck branch from 1257456 to 49e130f on August 5, 2022 14:26
@pav-kv
Collaborator Author

pav-kv commented Aug 5, 2022

Let's change "Fixes #84202" to "Touches #84202", and add a comment on the issue explaining why we didn't do it this time around. It's something we may want to consider doing later, so let's leave it open. Also, the "Fixes/touches #84202" part isn't usually included in the commit message, only the PR description.

All done.

@erikgrinaker
Contributor

Let's backport this change too. You can add on the backport-21.2.x and backport-22.1.x labels, and Blathers will open the backport PRs for you when this merges (if the patch applies cleanly).

@pav-kv
Collaborator Author

pav-kv commented Aug 5, 2022

bors r=erikgrinaker

@craig
Contributor

craig bot commented Aug 5, 2022

Build failed (retrying...):

@craig
Contributor

craig bot commented Aug 5, 2022

Build succeeded:

@pav-kv pav-kv deleted the proposal_quota_remove_healthcheck branch August 11, 2022 14:29