
admission: experiment with quota pool awareness of IO overload on remote stores #82132

Closed · tbg opened this issue May 31, 2022 · 4 comments

Labels: A-admission-control · C-enhancement (Solution expected to add code/behavior + preserve backward-compat; pg compat issues are exception)

tbg (Member) commented May 31, 2022

Is your feature request related to a problem? Please describe.

See #79215. At the time of writing, MsgApp is not subject to admission control. Direct throttling at the receiver would lead to quota pool fullness and thus to throttling at the user. While this may be desirable in some settings (e.g. a marginally slower follower that, without such throttling, would fall further and further behind), in practice a big problem is posed by stores that are so slow that taking them into account for throttling has an unacceptable effect on latency.

Describe the solution you'd like

We should consider such stores "non-live" for the purposes of replication. Concretely, they ought to be treated like followers in StateProbing (visible in (*RawNode).Status().Progress[replicaID].State); raft will only append to them ~once per second until they acknowledge. By artificially treating followers as probing for as long as their (gossiped, in this experiment) store stats indicate IO overload, i.e. that admission control on them is throttling, we can cheaply emulate the behavior of a "real" flow control mechanism (#79755), which would necessarily be more complex (multi-tenancy, priorities, etc.).
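
To make that concrete, here is a minimal Go sketch of the per-follower decision a leaseholder would make; the type, the helper names, and the sub-level threshold are illustrative assumptions, not CockroachDB's actual API.

package overloadsketch

// gossipedStoreStats stands in for the store stats a leaseholder learns via
// gossip; only the field needed for this sketch is included.
type gossipedStoreStats struct {
	L0Sublevels int
}

// l0SublevelThreshold is an illustrative stand-in for admission control's
// notion of an inverted LSM.
const l0SublevelThreshold = 20

// treatAsProbing reports whether a follower on the given store should be
// handled as if raft had it in StateProbing, i.e. appended to only ~once per
// second until it acknowledges, so that a slow store doesn't fill up the
// quota pool at the leaseholder.
func treatAsProbing(stats gossipedStoreStats, haveStats bool) bool {
	// Without gossiped stats, err on the side of replicating normally.
	if !haveStats {
		return false
	}
	return stats.L0Sublevels > l0SublevelThreshold
}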

Essentially, the intuition behind this approach is that a follower that isn't "live" for the purpose of quickly acking MsgApps should be treated as absent. In the experiment we substitute "live" with "has no inverted LSM as determined by admission control", which is a sensible first approximation. A more complete solution (out of scope here) might be based on the store's recently measured write throughput, which leaseholders could use to shape traffic to the node. However, that verges on distributed rate limiting (a slow store will have leaders on many other stores; how do they coordinate who gets to go first?). (I'll point out that we already do a form of distributed rate limiting for tenant request units, so some reuse might be possible.)

Describe alternatives you've considered
There are possibly other alternatives, which should be linked into the table in #79215.

Jira issue: CRDB-16308

Epic: CRDB-15069

tbg added the C-enhancement and A-admission-control labels on May 31, 2022

sumeerbhola (Collaborator) commented:
> A more complete solution (out of scope here) might be based on the store's recently measured write throughput, which leaseholders could use to shape traffic to the node. However, that verges on distributed rate limiting (a slow store will have leaders on many other stores; how do they coordinate who gets to go first?).

(adding what I mentioned in our synchronous discussion)
Randomization could avoid the need for explicit coordination. For example, 50% of the leaseholders could start treating the follower as "non-live" once the sub-level count exceeds 20, then another 50% once it exceeds 25, and so on. We could also factor in whether the range is only receiving low-priority backfill traffic, i.e. use a much higher probability for such ranges.
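
A minimal sketch of that randomized scheme, assuming illustrative names and thresholds (none of this is from the codebase): each leaseholder flips an independent coin whose bias grows with the follower's sub-level count, so no explicit coordination is needed.

package overloadsketch

import (
	"math"
	"math/rand"
)

// probNonLive maps a follower's gossiped L0 sub-level count (and whether the
// range only sees low-priority backfill traffic) to the probability with
// which this leaseholder treats the follower as "non-live". The steps mirror
// the example above: ~50% of leaseholders back off above 20 sub-levels, more
// above 25, and so on.
func probNonLive(l0Sublevels int, lowPriOnly bool) float64 {
	var p float64
	switch {
	case l0Sublevels > 30:
		p = 1.0
	case l0Sublevels > 25:
		p = 0.75
	case l0Sublevels > 20:
		p = 0.5
	}
	// Ranges that only receive low-priority backfill traffic can back off
	// much more aggressively without hurting foreground latency.
	if lowPriOnly {
		p = math.Min(1, 2*p)
	}
	return p
}

// treatNonLive makes the independent, per-leaseholder randomized decision;
// no cross-leaseholder coordination is required.
func treatNonLive(rng *rand.Rand, l0Sublevels int, lowPriOnly bool) bool {
	return rng.Float64() < probNonLive(l0Sublevels, lowPriOnly)
}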

tbg (Member, Author) commented Jun 21, 2022

I ran a first experiment here: upon sending a MsgApp, check whether the recipient store has L0 overload, and if so, just drop the MsgApp:

diff --git a/pkg/kv/kvserver/replica_raft.go b/pkg/kv/kvserver/replica_raft.go
index f80f91c4cd..c954375490 100644
--- a/pkg/kv/kvserver/replica_raft.go
+++ b/pkg/kv/kvserver/replica_raft.go
@@ -1496,6 +1496,10 @@ func (r *Replica) sendRaftMessagesRaftMuLocked(ctx context.Context, messages []r
 	}
 }
 
+var errOverloaded = errors.New("follower is under IO overload; dropping MsgApp")
+
+var toErrEveryN = log.Every(time.Second)
+
 // sendRaftMessageRaftMuLocked sends a Raft message.
 func (r *Replica) sendRaftMessageRaftMuLocked(ctx context.Context, msg raftpb.Message) {
 	r.mu.RLock()
@@ -1518,6 +1522,10 @@ func (r *Replica) sendRaftMessageRaftMuLocked(ctx context.Context, msg raftpb.Me
 				startKey = r.descRLocked().StartKey
 			}
 		})
+		if storeDesc, ok := r.store.allocator.StorePool.GetStoreDescriptor(toReplica.StoreID); ok && storeDesc.Capacity.L0Sublevels > 20 {
+			// NB: this isn't good enough for prod. If multiple followers are overloaded, we must still send to a quorum.
+			toErr = errOverloaded
+		}
 	}
 	r.mu.RUnlock()
 
@@ -1527,8 +1535,10 @@ func (r *Replica) sendRaftMessageRaftMuLocked(ctx context.Context, msg raftpb.Me
 		return
 	}
 	if toErr != nil {
-		log.Warningf(ctx, "failed to look up recipient replica %d in r%d while sending %s: %s",
-			msg.To, r.RangeID, msg.Type, toErr)
+		if toErrEveryN.ShouldLog() {
+			log.Warningf(ctx, "failed to look up recipient replica %d in r%d while sending %s: %s",
+				msg.To, r.RangeID, msg.Type, toErr)
+		}
 		return
 	}
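
The NB in the diff is the main gap in this prototype: if several followers are overloaded at once, blindly dropping MsgApps could leave the range without a quorum. A hedged sketch (hypothetical names, not part of the prototype above) of the additional check a production version would need:

package overloadsketch

// canDropForOverload reports whether MsgApps to the overloaded voters may be
// dropped while the remaining, healthy voters still form a quorum. The
// prototype diff above does not perform this check.
func canDropForOverload(voters []uint64, overloaded map[uint64]bool) bool {
	healthy := 0
	for _, id := range voters {
		if !overloaded[id] {
			healthy++
		}
	}
	quorum := len(voters)/2 + 1
	return healthy >= quorum
}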

I then ran #82968, invoked via

gh pr checkout -f 81516 && ./dev build roachtest && \
  bin/roachtest run --port 0 --cockroach artifacts/cockroach \
  admission/follower-overload/presplit-with-leases --debug --cloud aws --cluster tobias-aws-drop-msgapp

After the cluster had been running for about an hour, I SSHed into n3 and stopped fio. In its place, I induced more aggressive disk overload à la @irfansharif (this is the first annotation in the graphs):

sudo bash -c 'echo "259:0  20971520" > /sys/fs/cgroup/blkio/system.slice/cockroach.service/blkio.throttle.write_bps_device'

which I removed only a few hours later (second annotation).

This experiment validated a few things I had expected, namely:

- The LSM on the follower was "protected": file and sub-level counts periodically breached the admission control limits, but dropped back below them again as other nodes started dropping MsgApps:
[image]

- The quota pools on the remote nodes filled up occasionally, stalling the entire workload (bad), but when that wasn't happening, p99s for kv0 were essentially unaffected (~30ms p99; good). Had we added the "quota pool awareness" to the experiment, we would expect not to see the foreground workload stalls (at least for the workload that has its leases on n1 and n2, both unaffected by the overload). So this remains a good next thing to try, and a possible band-aid for 22.2.

- As expected, the workload that had its leases on n3 degraded, since admission control would periodically kick in and do what it could to protect the node (throttle foreground traffic):

[images]

Interestingly, I also managed to reproduce #82116 ad hoc when I removed the rate limit. This is because, as n3 fell behind, it started to lose access to the raft log (due to log truncation), resulting in snapshots being sent to it:

[images]

You can see that when the rate limit was lifted, throughput for these snapshots increased, which correlates with an instance of L0 overload; the likely explanation is that much of the snapshot data went into L0:

[image]

(A subsequent comment from tbg was marked as off-topic.)

tbg (Member, Author) commented Jun 23, 2022

I re-ran the experiment today and, without the config hiccups, it performed as expected. n3 would become overloaded and we'd stop appending to it (all the while avoiding quota pool build-up). This cycle repeated a few times, at which point (I believe) most of the replicas on n3 had been permanently cut off from replication traffic due to log truncations, and snapshots would only trickle through due to the bandwidth limits. The kv0 workload directed at n1 and n2 performed consistently throughout (i.e. it didn't get bogged down by n3's demise); when the throttle was removed, n3 caught up and the cluster is healthy again.

The one "unrealistic" thing here is that we "ignore" n3 for 10 minutes, whereas L0 really drops back to sub-critical within seconds. This is an artifact of using the windowed L0 signal that is part of the store gossip; I should repeat this with a more up-to-date signal. But all in all, this is about as good as CRDB can perform under these circumstances.

[images]

^-- NB: I have no idea about the latency spike at ~1:30pm; it doesn't correlate with anything I can see in the other graphs and I don't think it's related to the experiment. These are p50/p90/p100, by the way.

[images]
