kvserver: don't check ConnHealth when releasing proposal quota #85565
Conversation
I'll work on a test tomorrow, but for now: is the approach correct?
Force-pushed from 9e87a02 to 4735a38.
// w.r.t. the actual state of that replica, but it will be disambiguated
// soon. We do a second check using the NodeLiveness to increase the chance
// of disambiguating here.
if !r.store.cfg.NodeLiveness.IsAvailable(rep.NodeID) {
Alternatives:
1. Only do this check if we recently became the leader, because otherwise isFollowerActiveSince does the job.
2. Only do the NodeLiveness check. Are there cases when it would not detect things that the per-replica lastUpdateTimes detects?
3. Leave the ConnHealth check, but gate it by "I recently became the leader", similarly to (1). Are there cases when ConnHealth is more accurate than NodeLiveness?
I think the case we're interested in here is where the follower replica is active, and the node is up and running, but the node is struggling and is failing its liveness heartbeats. This is unrelated to whether we just became the leader or not -- even if we've been the leader for a long time, and isFollowerActiveSince returns true, if the node is struggling then we don't want it to hold up the range.
One typical example of this is a stalled disk, since we do a synchronous disk write during node heartbeats specifically to check this (stalled disks are a common failure mode with networked disks used in cloud VMs):
cockroach/pkg/kv/kvserver/liveness/liveness.go
Lines 1268 to 1290 in eeb7236
// We do a sync write to all disks before updating liveness, so that a
// faulty or stalled disk will cause us to fail liveness and lose our leases.
// All disks are written concurrently.
//
// We do this asynchronously in order to respect the caller's context, and
// coalesce concurrent writes onto an in-flight one. This is particularly
// relevant for a stalled disk during a lease acquisition heartbeat, where
// we need to return a timely NLHE to the caller such that it will try a
// different replica and nudge it into acquiring the lease. This can leak a
// goroutine in the case of a stalled disk.
nl.mu.RLock()
engines := nl.mu.engines
nl.mu.RUnlock()
resultCs := make([]<-chan singleflight.Result, len(engines))
for i, eng := range engines {
    eng := eng // pin the loop variable
    resultCs[i], _ = nl.engineSyncs.DoChan(strconv.Itoa(i), func() (interface{}, error) {
        return nil, nl.stopper.RunTaskWithErr(ctx, "liveness-hb-diskwrite",
            func(ctx context.Context) error {
                return storage.WriteSyncNoop(eng)
            })
    })
}
So we should have both checks:
- isFollowerActiveSince will check that the Raft replica is live.
- IsAvailable will check that the node is live and reasonably healthy.
This is related to a broader discussion on whether we even want the quota pool to ever hold up the range at all (#75978, and semi-related to #79215). Most of the time, we really don't. But for now, this is one step in the right direction.
I'll also note that IsAvailable() will return false if the node is not in the local liveness cache (i.e. we don't know about it). This seems fine, since that will default to not holding up the range.
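To make the combined gating concrete, here is a minimal, self-contained Go sketch. All names (replicaInfo, followersToWaitFor, and the predicate functions) are illustrative assumptions, not the actual CockroachDB quota-pool code; it only shows that a follower keeps holding up quota release only if both the replica-activity check and the node-availability check pass.

package main

import "fmt"

// replicaInfo is a hypothetical stand-in for the follower descriptors the
// quota pool iterates over (the real code uses roachpb types).
type replicaInfo struct {
    replicaID int64
    nodeID    int64
}

// followersToWaitFor returns the followers whose quota we keep waiting on:
// only those whose replica looks active (an isFollowerActiveSince-style
// check) AND whose node passes an IsAvailable-style liveness check. Any
// follower failing either check has its quota released instead.
func followersToWaitFor(
    followers []replicaInfo,
    replicaActive func(replicaID int64) bool,
    nodeAvailable func(nodeID int64) bool,
) []replicaInfo {
    var out []replicaInfo
    for _, f := range followers {
        if replicaActive(f.replicaID) && nodeAvailable(f.nodeID) {
            out = append(out, f)
        }
    }
    return out
}

func main() {
    followers := []replicaInfo{{replicaID: 2, nodeID: 2}, {replicaID: 3, nodeID: 3}}
    active := func(_ int64) bool { return true }        // both replicas sent Raft messages recently
    available := func(id int64) bool { return id != 3 } // node 3 is failing its liveness heartbeats
    // Only replica 2 still holds up quota release; replica 3's node is not available.
    fmt.Println(followersToWaitFor(followers, active, available))
}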
Is this disk write on the requested node, or on some other range/nodes where the liveness records are stored?
This happens on every node every time they send a heartbeat.
For reference, liveness records themselves are stored as key/value pairs in a range, just like any other data in the system. So there will be one leaseholder somewhere in the cluster who's responsible for storing them. A heartbeat is just a write to this key with some metadata -- including the current time, which is then used for expiration when the node fails to heartbeat.
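For illustration, here is a minimal sketch of the expiration idea described above. The livenessRecord type and its fields are simplified assumptions (the real liveness record is a protobuf with more fields, such as an epoch), but the core check is just whether the last heartbeat's expiration is still in the future.

package main

import (
    "fmt"
    "time"
)

// livenessRecord is a simplified, hypothetical stand-in for the liveness
// record stored as an ordinary key/value pair in the liveness range.
type livenessRecord struct {
    nodeID     int64
    expiration time.Time // pushed forward a few seconds by every successful heartbeat
}

// isLive: a node counts as live only while its latest heartbeat has not expired.
func isLive(rec livenessRecord, now time.Time) bool {
    return now.Before(rec.expiration)
}

func main() {
    now := time.Now()
    rec := livenessRecord{nodeID: 1, expiration: now.Add(9 * time.Second)}
    fmt.Println(isLive(rec, now))                     // true: heartbeat still valid
    fmt.Println(isLive(rec, now.Add(20*time.Second))) // false: the node failed to heartbeat again in time
}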
Ah, got it.
What kinds of events refresh isFollowerActiveSince timestamps? Are these successful Raft RPCs, or any interactions?
Any example when isFollowerActiveSince is true, but the node is not live?
> What kinds of events refresh isFollowerActiveSince timestamps? Are these successful Raft RPCs, or any interactions?
Any Raft message received from the follower:
cockroach/pkg/kv/kvserver/replica_raft.go
Line 518 in d10bde7
r.mu.lastUpdateTimes.update(req.FromReplica.ReplicaID, timeutil.Now())
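A small sketch of how such a per-replica activity map can work; the type and method names mirror the ones mentioned above, but the implementation details are assumptions rather than the actual replica code.

package main

import (
    "fmt"
    "time"
)

// lastUpdateTimes is a simplified stand-in for the per-replica activity map
// the leader keeps; the real map is guarded by the replica mutex and keyed by
// roachpb.ReplicaID.
type lastUpdateTimes map[int64]time.Time

// update records activity for a replica. As noted above, this runs for every
// Raft message received from that follower, regardless of message type.
func (l lastUpdateTimes) update(replicaID int64, now time.Time) {
    l[replicaID] = now
}

// isFollowerActiveSince: a follower counts as active if we heard any Raft
// message from it within the given threshold.
func (l lastUpdateTimes) isFollowerActiveSince(
    replicaID int64, now time.Time, threshold time.Duration,
) bool {
    last, ok := l[replicaID]
    return ok && now.Sub(last) <= threshold
}

func main() {
    now := time.Now()
    lut := lastUpdateTimes{}
    lut.update(2, now.Add(-3*time.Second))
    fmt.Println(lut.isFollowerActiveSince(2, now, 10*time.Second)) // true: heard from it 3s ago
    fmt.Println(lut.isFollowerActiveSince(3, now, 10*time.Second)) // false: never heard from replica 3
}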
> Any example when isFollowerActiveSince is true, but the node is not live?
A simple but contrived example might be a node with a partial network partition: it's not able to communicate with the liveness leaseholder (so it can't heartbeat), but it is able to communicate with other nodes. This isn't all that relevant, because in this case the node itself isn't necessarily struggling as such, but it's a fairly clear example.
A more relevant example might be a very overloaded node. Maybe it's taking more than 4.5 seconds (the heartbeat timeout) to fsync the disk, so node heartbeats aren't being sent. Or maybe the heartbeat goroutine isn't getting scheduled, or the network is backlogged. However, it is able to occasionally/eventually append Raft proposals and respond to Raft leaders. In these cases, replicas on that node would be very slow to append Raft proposals, and we would rather sacrifice the followers on that node than slow down the entire range.
Force-pushed from 4735a38 to 73cb5f1.
How about an alternative simple approach (either instead of, or after, this PR): if a majority of replicas is healthy, we progress at their median speed. This has less chance of being stuck, even if the "slowest" node is in fact inactive or fails to update liveness. With such an approach, we can avoid both checks.
I guess the problem with this approach is that the slowest replicas will increasingly fall behind. Are there any other throttling mechanisms that try slowing down the quick replicas, so that all replicas can proceed at the same speed? However, honouring the slow node too much is also not great. Maybe in such cases the distribution layer should step in and rebalance nodes to have similar speed.
What you're saying makes a lot of sense, and I think this is the direction we want to move in. Rather than checking the median and the log index, we'd maybe want to check that a quorum have applied up to some index. Raft already requires a quorum to have appended an entry before it is committed, which provides a baseline of throttling here.
We did something similar recently in #83851, where we stop replicating to followers that are experiencing IO overload (specifically, where Pebble compactions aren't able to keep up with incoming writes, inverting the LSM). But we make sure we keep a quorum of followers to make forward progress.
This is related to admission control, which monitors the local Pebble store's health and begins backpressuring and prioritizing incoming traffic once Pebble starts struggling. Raft traffic is a bit trickier, in that the Raft log must be deterministic, so once a command has been committed to it we can't drop or reorder commands. On the leaseholder, we can (and do) apply admission control to incoming traffic before it hits the Raft log, but on followers there's really nothing we can do -- we'll just have to append and apply those entries as-is. So longer-term, the leaseholder needs to take the health of followers into account for purposes of admission control, and if followers are getting overloaded, we'll have to rely on the allocator to start moving them around (as you allude to in your comment).
The role of the quota pool in all of this is still a bit unclear. There have been a lot of discussions about this recently, e.g. on #75066 and #83851. There are also related challenges with flow control of Raft messages, since we can currently OOM if the Raft queues fill up with huge messages (e.g. SSTable ingestion), see #79755.
For now though, we'll have to get rid of the useless RPC check anyway, and we may as well add in a liveness check while we're here. We're planning on doing more work around admission control and the quota pool for 23.1, but we haven't really firmed up a concrete plan yet (Tobi and Sumeer have been more involved in this than I have).
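To illustrate the "quorum has applied up to some index" idea, here is a small self-contained sketch. The function quorumAppliedIndex and its inputs are hypothetical, but the computation (take the highest index reached by at least a quorum of replicas) is the standard Raft-style quorum calculation.

package main

import (
    "fmt"
    "sort"
)

// quorumAppliedIndex returns the highest log index that at least a quorum of
// replicas has already applied. Releasing quota up to this index would let
// the range progress at quorum speed rather than at the pace of the slowest
// follower. This is an illustration of the idea discussed above, not the
// CockroachDB implementation.
func quorumAppliedIndex(applied []uint64) uint64 {
    if len(applied) == 0 {
        return 0
    }
    sorted := append([]uint64(nil), applied...)
    // Sort descending; the element at position quorum-1 is reached by a quorum.
    sort.Slice(sorted, func(i, j int) bool { return sorted[i] > sorted[j] })
    quorum := len(sorted)/2 + 1
    return sorted[quorum-1]
}

func main() {
    // Three replicas have applied up to indexes 100, 90, and 10 (a very slow follower).
    fmt.Println(quorumAppliedIndex([]uint64{100, 90, 10})) // 90: two of three replicas have applied >= 90
}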
Did you and @aliher1911 come up with an easy way to test this? If not, I think this is fine, and we can rely on broader integration/end-to-end tests to pick up any issues.
PS: let's tack on " in quota pool" in the commit title.
Force-pushed from 73cb5f1 to 1257456.
LGTM, congrats on your first PR! 🚀
Let's change "Fixes #84202" to "Touches #84202", and add a comment on the issue explaining why we didn't do it this time around. It's something we may want to consider doing later, so let's leave it open. Also, the "Fixes/touches #84202" part isn't usually included in the commit message, only the PR description.
Note: This PR changed its focus from changing a check to just removing it. As a result, there are strictly fewer cases when the quota is held up, so this change won't introduce new quota-related blockages. The risk is in introducing more cases where followers are overloaded, but there is a feedback mechanism ("paused replicas") which pushes back, so there is already good protection against this. Not adding tests.
Force-pushed from 1257456 to 49e130f.
All done.
Let's backport this change too. You can add on the
bors r=erikgrinaker
Build failed (retrying...)
Build succeeded
The proposal quota release procedure checked node connection health (ConnHealth) for every
node that appeared active after the replica activity checks. This is expensive, and
previously caused issues like #84943.
This change removes the ConnHealth check, because other checks, such as
isFollowerActiveSince and paused replicas, provide sufficient protection from
various kinds of overloads.
Touches #84202
Release note: None