kvserver: avoid need for manual tuning of rebalance rate setting #14768

Open · petermattis opened this issue Apr 10, 2017 · 30 comments
Labels
A-admission-control · A-kv-client (Relating to the KV client and the KV interface.) · C-performance (Perf of queries or internals. Solution not expected to change functional behavior.) · O-postmortem (Originated from a Postmortem action item.) · P-3 (Issues/test failures with no fix SLA) · T-kv (KV Team)

Comments

@petermattis (Collaborator) commented Apr 10, 2017

#14718 limited the bandwidth for preemptive snapshots (i.e. rebalancing) to 2 MB/sec. This is a blunt instrument. @bdarnell says:

What we really need is some sort of prioritization scheme that would allow snapshots to use the bandwidth only if it's not needed for other stuff. But I don't have any concrete suggestions so maybe we should go ahead and do this anyway.

Jira issue: CRDB-6099

@petermattis added this to the Later milestone on Apr 10, 2017
@petermattis (Collaborator, Author)

Perhaps something like weighted fair queueing. Here's a Go implementation.
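
A rough sketch of the idea (illustrative only; the flow names and weights are made up, and this is not the linked implementation): each flow has a weight, each enqueued item gets a virtual finish time proportional to its size divided by the flow's weight, and items are dequeued in virtual-finish-time order, so a heavily weighted "foreground" flow drains ahead of a lightly weighted "snapshot" flow.

package wfq

import "container/heap"

// Item is one unit of work (e.g. an outgoing message) belonging to a flow.
type Item struct {
	Flow   string
	Size   int64
	finish float64 // virtual finish time assigned at enqueue
}

type byFinish []Item

func (q byFinish) Len() int            { return len(q) }
func (q byFinish) Less(i, j int) bool  { return q[i].finish < q[j].finish }
func (q byFinish) Swap(i, j int)       { q[i], q[j] = q[j], q[i] }
func (q *byFinish) Push(x interface{}) { *q = append(*q, x.(Item)) }
func (q *byFinish) Pop() interface{} {
	old := *q
	it := old[len(old)-1]
	*q = old[:len(old)-1]
	return it
}

// Scheduler dequeues items in virtual-finish-time order; flows with a larger
// weight accumulate virtual time more slowly and therefore get a larger share.
type Scheduler struct {
	Weights map[string]float64 // e.g. {"foreground": 8, "snapshot": 1}
	last    map[string]float64 // last virtual finish time handed out per flow
	queue   byFinish
}

func (s *Scheduler) Enqueue(it Item) {
	if s.last == nil {
		s.last = map[string]float64{}
	}
	// Simplification: a full WFQ implementation starts from
	// max(lastFinish, current virtual time) so idle flows don't bank credit.
	it.finish = s.last[it.Flow] + float64(it.Size)/s.Weights[it.Flow]
	s.last[it.Flow] = it.finish
	heap.Push(&s.queue, it)
}

func (s *Scheduler) Dequeue() (Item, bool) {
	if s.queue.Len() == 0 {
		return Item{}, false
	}
	return heap.Pop(&s.queue).(Item), true
}

With weights {foreground: 8, snapshot: 1}, a 64 KiB foreground message enqueued after a 64 KiB snapshot chunk still dequeues first, which is the "use the bandwidth only if it's not needed for other stuff" behavior described above.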

@tbg (Member) commented Jun 1, 2017

Would an ideal solution have the limiting sit on the incoming bandwidth rather than the outgoing? I'm thinking of the case in which nodes 1 and 2 both send to node 3, but node 1 streams a snapshot while node 2 has foreground traffic. It would really have to be node 3 that backs off the other two; nodes 1 and 2 would each try to go at full speed without a chance of determining their relative priorities. Naively, I'd think that WFQ for reading from the connection should work, since there's flow control and the sender has to wait once it has filled up its window.

I've searched around for precedent which I'm sure must exist somewhere, but haven't really been successful.

@petermattis (Collaborator, Author)

I think we'd want foreground/background traffic prioritization on both the sender and receiver. The receiver might not have any foreground traffic, but the sender might. And vice versa.

@cuongdo (Contributor) commented Aug 30, 2017

@m-schneider Can you take this on during 1.2? The issue is a little vague now, and this will require an RFC. So, it'd be helpful if you worked with @tschottdorf on making this specific enough to be actionable as a first step.

@petermattis (Collaborator, Author)

I'd estimate that the coding portion of this is a small fraction of the work. The bigger task is to test the prioritization mechanism under a variety of rebalancing and recovery scenarios on real clusters.

@tbg (Member) commented Aug 31, 2017

  • performance testing, as the mechanism will sit squarely in the hot path.

@m-schneider (Contributor)

@cuongdo Sure, sounds like a great project for 1.2!

@m-schneider (Contributor)

Toby and I discussed this and looked into what we can do with gRPC. Prioritizing traffic on a sender is fairly straightforward: we can use gRPC interceptors and the WFQ implementation that @petermattis linked. However, on a receiver there doesn't seem to be any straightforward way to do this. Before a recent optimization in gRPC (https://grpc.io/2017/08/22/grpc-go-perf-improvements.html) we could have blocked connections by stalling reads on connections that we wished to deprioritize via interceptors. After that change, by the time the interceptor is invoked, the message has already been consumed and its connection window quota released, so there is an off-by-one that becomes an off-by-infinity for non-streaming RPCs (which we think use a new stream each time).
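
A rough sketch of the sender side (illustrative; the prioritizer interface and the method-name classification are assumptions, not existing code): a gRPC stream client interceptor can block each outgoing message on a scheduler so snapshot streams yield to foreground traffic.

package rpcprio

import (
	"context"
	"strings"

	"google.golang.org/grpc"
)

// prioritizer stands in for a WFQ-style scheduler; Acquire blocks until the
// given traffic class is allowed to send its next message.
type prioritizer interface {
	Acquire(ctx context.Context, class string) error
}

type gatedStream struct {
	grpc.ClientStream
	p     prioritizer
	class string
}

// SendMsg waits for the scheduler before every outgoing message, so a
// long-running snapshot stream can be interleaved with foreground traffic.
func (s *gatedStream) SendMsg(m interface{}) error {
	if err := s.p.Acquire(s.Context(), s.class); err != nil {
		return err
	}
	return s.ClientStream.SendMsg(m)
}

// NewStreamInterceptor classifies RPCs by method name and wraps the stream.
func NewStreamInterceptor(p prioritizer) grpc.StreamClientInterceptor {
	return func(ctx context.Context, desc *grpc.StreamDesc, cc *grpc.ClientConn,
		method string, streamer grpc.Streamer, opts ...grpc.CallOption) (grpc.ClientStream, error) {
		class := "foreground"
		if strings.Contains(method, "RaftSnapshot") { // assumed classification rule
			class = "background"
		}
		cs, err := streamer(ctx, desc, cc, method, opts...)
		if err != nil {
			return nil, err
		}
		return &gatedStream{ClientStream: cs, p: p, class: class}, nil
	}
}

This would be installed via grpc.WithStreamInterceptor at dial time; there's no equivalent hook for deprioritizing reads on the receiver without reaching into gRPC internals, which is the problem described above.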

If we fork gRPC, we can modify the connection and interceptor code to give us everything we need to block reads on a connection and throttle on the receiver side.

We're following issue #17370 because it touches many of the same pathways.

@petermattis (Collaborator, Author)

Do we need to prioritize traffic at both the sender and recipient? I was imagining that we'd only prioritize traffic on the sender side, though I haven't thought this through in depth.

@tbg (Member) commented Oct 5, 2017

Change of heart from #14768 (comment)? :-)

The basic problem is that if we only throttle on the sender and a node sends snapshots at full volume (because nothing else is going on), it doesn't matter what's going on on the recipients -- the foreground traffic will be impacted.

There's also a question of scope here: is this specifically to put something into place that allows snapshots to go fast when nothing else is going on, or is the next thing we're going to want a prioritization of Raft traffic vs foreground traffic also?

If there's a relatively clean, workable solution for snapshots only that's not as invasive, that might be something to consider, but to "really" solve the problem it seems we'd be adding some hooks to gRPC's internals where we need them, and living with a fork.

@a-robinson (Contributor)

cc @rytaft, who may have some wisdom to share

@petermattis (Collaborator, Author)

> Change of heart from #14768 (comment)? :-)

Heh, I knew I had thought about this before.

> There's also a question of scope here: is this specifically to put something into place that allows snapshots to go fast when nothing else is going on, or is the next thing we're going to want a prioritization of Raft traffic vs foreground traffic also?

The initial scope was all about snapshots: allowing snapshots to be sent as fast as possible as long as there is nothing else going on. Prioritizing Raft traffic vs foreground traffic seems trickier as sometimes that Raft traffic is necessary to service the foreground traffic.

@rytaft (Collaborator) commented Oct 5, 2017

@a-robinson Happy to help, but I think I need a bit more context. Is the main issue that multiple nodes are sending snapshots to the same recipient simultaneously? If so, would it be a problem to have them coordinate?

Also, is the bottleneck at the RPC layer, the network, CPU utilization, or something else? I could also talk about this offline with someone if that would be easier...

@tbg (Member) commented Oct 5, 2017

@rytaft the TL;DR is that we currently have two restrictions in place:

  1. a node only accepts one incoming snapshot at a time
  2. on the sending side, the snapshot stream is rate limited at 2-4 MB/s

The main goal here is to relax 2) by allowing snapshots to be transferred faster, but without impacting foreground traffic (as in, use the bandwidth if nobody else is using it, but don't try to compete, at least not too hard).
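
To make 2) concrete, the sender-side limit is conceptually just a fixed byte-rate limiter in front of the chunk sends, along these lines (a sketch assuming golang.org/x/time/rate; the chunking and send function are illustrative, not the actual code path):

package snapshot

import (
	"context"

	"golang.org/x/time/rate"
)

// limiter caps outgoing snapshot bytes at a static 2 MB/s with a 256 KiB
// burst; chunks are assumed to be no larger than the burst.
var limiter = rate.NewLimiter(rate.Limit(2<<20), 256<<10)

// sendChunk blocks until the limiter has released enough tokens to cover the
// chunk, then hands it to the (hypothetical) stream send function.
func sendChunk(ctx context.Context, send func([]byte) error, chunk []byte) error {
	if err := limiter.WaitN(ctx, len(chunk)); err != nil {
		return err
	}
	return send(chunk)
}

Whatever replaces this needs to open the rate up when the link is otherwise idle rather than pinning it to a constant.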

@rytaft (Collaborator) commented Oct 5, 2017

Makes sense. I was talking about this with Cuong at lunch, and the main question I have is: are you sure it's the network bandwidth that is the bottleneck, or could it be the processing of RPC calls? In the latter case, you could just create a different channel/port for snapshot traffic....

@tbg (Member) commented Oct 5, 2017

We're somewhat certain that it's the network bandwidth, but @m-schneider is running more experiments now as the original issue #10972 didn't conclusively prove that.

By different channel, do you mean using a different port and then letting the OS throttle that against the rest? That's unlikely to happen, for two reasons: a) we only have one IANA-assigned port (26257), and b) we don't want to burden the operator with setting up said limits.

@tbg changed the title from "storage: prioritization mechanism for foreground/background traffic" to "rpc: prioritization mechanism for foreground/background traffic" on Mar 9, 2021
@tbg (Member) commented Mar 9, 2021

The other thing I'm noticing is that it looks like we're using the same connection for snapshots and for "general" traffic:

conn, err := t.dialer.Dial(ctx, nodeID, rpc.DefaultClass)
if err != nil {
	return err
}
client := NewMultiRaftClient(conn)
stream, err := client.RaftSnapshot(ctx)
if err != nil {
	return err
}

I wonder if changing this alone can produce any benefits.
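
For illustration, the call-site change would be roughly the following (rpc.SnapshotClass is hypothetical, standing in for whatever dedicated connection class such a change would introduce):

// Dial a dedicated connection class for snapshots so they get their own TCP
// connection (and HTTP/2 flow-control window) instead of sharing
// rpc.DefaultClass with foreground traffic. rpc.SnapshotClass is assumed here.
conn, err := t.dialer.Dial(ctx, nodeID, rpc.SnapshotClass)
if err != nil {
	return err
}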

@tbg changed the title from "rpc: prioritization mechanism for foreground/background traffic" to "kvserver: avoid need for manual tuning of rebalance rate setting" on Mar 9, 2021
@jlinder added the T-kv (KV Team) label on Jun 16, 2021
@irfansharif (Contributor)

+cc @shralex.

@mwang1026 added the O-postmortem (Originated from a Postmortem action item.) label on May 27, 2022
@erikgrinaker added the T-kv-replication label and removed the T-kv (KV Team) label on May 31, 2022
@blathers-crl (bot) added the T-kv (KV Team) label on Jun 2, 2022
@irfansharif (Contributor)

x-ref #63728.

@andrewbaptist (Collaborator) commented Apr 28, 2023

There are likely a few things we should do here:
