kvserver: avoid need for manual tuning of rebalance rate setting #14768

Open · petermattis opened this issue Apr 10, 2017 · 30 comments
Labels
A-admission-control · A-kv-client (Relating to the KV client and the KV interface.) · C-performance (Perf of queries or internals. Solution not expected to change functional behavior.) · O-postmortem (Originated from a Postmortem action item.) · P-3 (Issues/test failures with no fix SLA) · T-kv (KV Team)

Comments

@petermattis (Collaborator) commented Apr 10, 2017

#14718 limited the bandwidth for preemptive snapshots (i.e. rebalancing) to 2 MB/sec. This is a blunt instrument. @bdarnell says:

What we really need is some sort of prioritization scheme that would allow snapshots to use the bandwidth only if it's not needed for other stuff. But I don't have any concrete suggestions so maybe we should go ahead and do this anyway.

Jira issue: CRDB-6099

@petermattis added this to the Later milestone on Apr 10, 2017
@petermattis (Collaborator, Author)

Perhaps something like weighted fair queueing. Here's a Go implementation.
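
A rough sketch of the idea (illustrative only; the flow names and weights are made up, and this is not the linked implementation): each flow has a weight, each enqueued item gets a virtual finish time proportional to its size divided by the flow's weight, and items are dequeued in virtual-finish-time order, so a heavily weighted "foreground" flow drains ahead of a lightly weighted "snapshot" flow.

package wfq

import "container/heap"

// Item is one unit of work (e.g. an outgoing message) belonging to a flow.
type Item struct {
	Flow   string
	Size   int64
	finish float64 // virtual finish time assigned at enqueue
}

type byFinish []Item

func (q byFinish) Len() int            { return len(q) }
func (q byFinish) Less(i, j int) bool  { return q[i].finish < q[j].finish }
func (q byFinish) Swap(i, j int)       { q[i], q[j] = q[j], q[i] }
func (q *byFinish) Push(x interface{}) { *q = append(*q, x.(Item)) }
func (q *byFinish) Pop() interface{} {
	old := *q
	it := old[len(old)-1]
	*q = old[:len(old)-1]
	return it
}

// Scheduler dequeues items in virtual-finish-time order; flows with a larger
// weight accumulate virtual time more slowly and therefore get a larger share.
type Scheduler struct {
	Weights map[string]float64 // e.g. {"foreground": 8, "snapshot": 1}
	last    map[string]float64 // last virtual finish time handed out per flow
	queue   byFinish
}

func (s *Scheduler) Enqueue(it Item) {
	if s.last == nil {
		s.last = map[string]float64{}
	}
	// Simplification: a full WFQ implementation starts from
	// max(lastFinish, current virtual time) so idle flows don't bank credit.
	it.finish = s.last[it.Flow] + float64(it.Size)/s.Weights[it.Flow]
	s.last[it.Flow] = it.finish
	heap.Push(&s.queue, it)
}

func (s *Scheduler) Dequeue() (Item, bool) {
	if s.queue.Len() == 0 {
		return Item{}, false
	}
	return heap.Pop(&s.queue).(Item), true
}

With weights {foreground: 8, snapshot: 1}, a 64 KiB foreground message enqueued after a 64 KiB snapshot chunk still dequeues first, which is the "use the bandwidth only if it's not needed for other stuff" behavior described above.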

@tbg (Member) commented Jun 1, 2017

Would an ideal solution have the limiting sit on the incoming bandwidth rather than the outgoing? I'm thinking of the case in which nodes 1 and 2 both send to node 3, but node 1 streams a snapshot while node 2 has foreground traffic. It would really have to be node 3 that backs off the other two; nodes 1 and 2 would each try to go at full speed without a chance of determining their relative priorities. Naively, I'd think that WFQ for reading from the connection should work, since there's flow control and the sender has to wait once it has filled up its window.

I've searched around for precedent which I'm sure must exist somewhere, but haven't really been successful.

@petermattis (Collaborator, Author)

I think we'd want foreground/background traffic prioritization on both the sender and receiver. The receiver might not have any foreground traffic, but the sender might. And vice versa.

@cuongdo (Contributor) commented Aug 30, 2017

@m-schneider Can you take this on during 1.2? The issue is a little vague now, and this will require an RFC. So, it'd be helpful if you worked with @tschottdorf on making this specific enough to be actionable as a first step.

@petermattis (Collaborator, Author)

I'd estimate that the coding portion of this is a small fraction of the work. The bigger task is to test the prioritization mechanism under a variety of rebalancing and recovery scenarios on real clusters.

@tbg (Member) commented Aug 31, 2017

  • performance testing, as the mechanism will sit squarely in the hot path.

@m-schneider (Contributor)

@cuongdo Sure, sounds like a great project for 1.2!

@m-schneider (Contributor)

Toby and I discussed this and looked into what we can do with gRPC. Prioritizing traffic on a sender is fairly straightforward: we can use gRPC interceptors and the WFQ implementation that @petermattis linked. However, on a receiver there doesn't seem to be any straightforward way to do this. Before a recent optimization in gRPC (https://grpc.io/2017/08/22/grpc-go-perf-improvements.html) we could have blocked connections by stalling reads on connections that we wished to deprioritize via interceptors. After that change, by the time the interceptor is invoked, the message has already been consumed and its connection window quota released, so there is an off-by-one that becomes an off-by-infinity for non-streaming RPCs (which we think use a new stream each time).
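
A rough sketch of the sender side (illustrative; the prioritizer interface and the method-name classification are assumptions, not existing code): a gRPC stream client interceptor can block each outgoing message on a scheduler so snapshot streams yield to foreground traffic.

package rpcprio

import (
	"context"
	"strings"

	"google.golang.org/grpc"
)

// prioritizer stands in for a WFQ-style scheduler; Acquire blocks until the
// given traffic class is allowed to send its next message.
type prioritizer interface {
	Acquire(ctx context.Context, class string) error
}

type gatedStream struct {
	grpc.ClientStream
	p     prioritizer
	class string
}

// SendMsg waits for the scheduler before every outgoing message, so a
// long-running snapshot stream can be interleaved with foreground traffic.
func (s *gatedStream) SendMsg(m interface{}) error {
	if err := s.p.Acquire(s.Context(), s.class); err != nil {
		return err
	}
	return s.ClientStream.SendMsg(m)
}

// NewStreamInterceptor classifies RPCs by method name and wraps the stream.
func NewStreamInterceptor(p prioritizer) grpc.StreamClientInterceptor {
	return func(ctx context.Context, desc *grpc.StreamDesc, cc *grpc.ClientConn,
		method string, streamer grpc.Streamer, opts ...grpc.CallOption) (grpc.ClientStream, error) {
		class := "foreground"
		if strings.Contains(method, "RaftSnapshot") { // assumed classification rule
			class = "background"
		}
		cs, err := streamer(ctx, desc, cc, method, opts...)
		if err != nil {
			return nil, err
		}
		return &gatedStream{ClientStream: cs, p: p, class: class}, nil
	}
}

This would be installed via grpc.WithStreamInterceptor at dial time; there's no equivalent hook for deprioritizing reads on the receiver without reaching into gRPC internals, which is the problem described above.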

If we fork gRPC, we can modify the connection and interceptor code to give us everything we need to block reads on a connection and throttle on the receiver side.

We're following issue #17370 because it touches many of the same pathways.

@petermattis (Collaborator, Author)

Do we need to prioritize traffic at both the sender and recipient? I was imagining that we'd only prioritize traffic on the sender side, though I haven't thought this through in depth.

@tbg (Member) commented Oct 5, 2017

Change of heart from #14768 (comment)? :-)

The basic problem is that if we only throttle on the sender and a node sends snapshots at full volume (because nothing else is going on), it doesn't matter what's going on on the recipients -- the foreground traffic will be impacted.

There's also a question of scope here: is this specifically to put something into place that allows snapshots to go fast when nothing else is going on, or is the next thing we're going to want a prioritization of Raft traffic vs foreground traffic also?

If there's a relatively clean, workable solution for snapshots only that's not as invasive, that might be something to consider, but to "really" solve the problem it seems we'd be adding some hooks to gRPC's internals where we need them, and living with a fork.

@a-robinson (Contributor)

cc @rytaft, who may have some wisdom to share

@petermattis (Collaborator, Author)

> Change of heart from #14768 (comment)? :-)

Heh, I knew I had thought about this before.

> There's also a question of scope here: is this specifically to put something into place that allows snapshots to go fast when nothing else is going on, or is the next thing we're going to want a prioritization of Raft traffic vs foreground traffic also?

The initial scope was all about snapshots: allowing snapshots to be sent as fast as possible as long as there is nothing else going on. Prioritizing Raft traffic vs foreground traffic seems trickier as sometimes that Raft traffic is necessary to service the foreground traffic.

@rytaft (Collaborator) commented Oct 5, 2017

@a-robinson Happy to help, but I think I need a bit more context. Is the main issue that multiple nodes are sending snapshots to the same recipient simultaneously? If so, would it be a problem to have them coordinate?

Also, is the bottleneck at the RPC layer, the network, CPU utilization, or something else? I could also talk about this offline with someone if that would be easier...

@tbg (Member) commented Oct 5, 2017

@rytaft the TL;DR is that we currently have two restrictions in place:

  1. a node only accepts one incoming snapshot at a time
  2. on the sending side, the snapshot stream is rate limited at 2-4 MB/s

The main goal here is to relax 2) by allowing snapshots to be transferred faster, but without impacting foreground traffic (as in, use the bandwidth if nobody else is using it, but don't try to compete, at least not too hard).
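
To make 2) concrete, the sender-side limit is conceptually just a fixed byte-rate limiter in front of the chunk sends, along these lines (a sketch assuming golang.org/x/time/rate; the chunking and send function are illustrative, not the actual code path):

package snapshot

import (
	"context"

	"golang.org/x/time/rate"
)

// limiter caps outgoing snapshot bytes at a static 2 MB/s with a 256 KiB
// burst; chunks are assumed to be no larger than the burst.
var limiter = rate.NewLimiter(rate.Limit(2<<20), 256<<10)

// sendChunk blocks until the limiter has released enough tokens to cover the
// chunk, then hands it to the (hypothetical) stream send function.
func sendChunk(ctx context.Context, send func([]byte) error, chunk []byte) error {
	if err := limiter.WaitN(ctx, len(chunk)); err != nil {
		return err
	}
	return send(chunk)
}

Whatever replaces this needs to open the rate up when the link is otherwise idle rather than pinning it to a constant.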

@rytaft (Collaborator) commented Oct 5, 2017

Makes sense. I was talking about this with Cuong at lunch, and the main question I have is: are you sure it's the network bandwidth that is the bottleneck, or could it be the processing of RPC calls? In the latter case, you could just create a different channel/port for snapshot traffic....

@tbg (Member) commented Oct 5, 2017

We're somewhat certain that it's the network bandwidth, but @m-schneider is running more experiments now as the original issue #10972 didn't conclusively prove that.

By different channel, do you mean using a different port and then letting the OS throttle that against the rest? That's unlikely to happen, for two reasons: a) we only have one IANA-assigned port (26257), and b) we don't want to burden the operator with setting up said limits.

@tbg changed the title from "storage: prioritization mechanism for foreground/background traffic" to "rpc: prioritization mechanism for foreground/background traffic" on Mar 9, 2021
@tbg (Member) commented Mar 9, 2021

The other thing I'm noticing is that it looks like we're using the same connection for snapshots and for "general" traffic:

conn, err := t.dialer.Dial(ctx, nodeID, rpc.DefaultClass)
if err != nil {
	return err
}
client := NewMultiRaftClient(conn)
stream, err := client.RaftSnapshot(ctx)
if err != nil {
	return err
}

I wonder if changing this alone can produce any benefits.
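
For illustration, the call-site change would be roughly the following (rpc.SnapshotClass is hypothetical, standing in for whatever dedicated connection class such a change would introduce):

// Dial a dedicated connection class for snapshots so they get their own TCP
// connection (and HTTP/2 flow-control window) instead of sharing
// rpc.DefaultClass with foreground traffic. rpc.SnapshotClass is assumed here.
conn, err := t.dialer.Dial(ctx, nodeID, rpc.SnapshotClass)
if err != nil {
	return err
}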

@tbg changed the title from "rpc: prioritization mechanism for foreground/background traffic" to "kvserver: avoid need for manual tuning of rebalance rate setting" on Mar 9, 2021
@jlinder added the T-kv (KV Team) label on Jun 16, 2021
@irfansharif (Contributor)

+cc @shralex.

@mwang1026 added the O-postmortem (Originated from a Postmortem action item.) label on May 27, 2022
@erikgrinaker added the T-kv-replication label and removed the T-kv (KV Team) label on May 31, 2022
@blathers-crl (bot) added the T-kv (KV Team) label on Jun 2, 2022
@irfansharif (Contributor)

x-ref #63728.

@andrewbaptist (Collaborator) commented Apr 28, 2023

There are likely a few things we should do here:
