upstream: desynchronize WRR picks. #3271
Conversation
This PR fixes two issues:
* Avoid having synchronized behavior across the fleet by having a seed applied to each WRR balancer instance.
* Avoid having WRR reset to the same host on each refresh, biasing towards earlier hosts in the schedule.

Risk Level: Medium (there will be observable traffic effects due to changes in host pick schedule).
Testing: New unit tests, modified existing.

Signed-off-by: Harvey Tuch <htuch@google.com>
Thanks for tackling this one!
hosts.size() == 0
    ? 0
    : (scheduler_.find(source) != scheduler_.end() ? scheduler_[source].rr_index_ : seed_) %
          hosts.size();
can you break this down for readability? Also comments :-)
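One way to break the ternary chain into named steps, as requested. This is a hypothetical standalone sketch, not the code that landed: `Scheduler` and the host-source key are simplified stand-ins for the Envoy types in this diff, and `computeOffset` is an illustrative name.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Simplified stand-in for the Scheduler struct in this diff.
struct Scheduler {
  uint64_t rr_index_{};
};

// Hypothetical readable decomposition of the offset expression.
uint64_t computeOffset(const std::vector<int>& hosts,
                       const std::unordered_map<int, Scheduler>& schedulers,
                       int source, uint64_t seed) {
  if (hosts.empty()) {
    return 0;
  }
  // If we have seen this host source before, continue the round-robin where
  // we left off; otherwise start at a seed-derived position so that balancers
  // across a fleet do not all start at host 0.
  const auto it = schedulers.find(source);
  const uint64_t start = it != schedulers.end() ? it->second.rr_index_ : seed;
  return start % hosts.size();
}
```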
  scheduler.edf_.add(host->weight(), host);
}
// Cycle through hosts to achieve the intended offset behavior.
for (uint32_t i = 0; i < offset; ++i) {
Would it be sufficient to add the hosts starting at index N and wrapping, rather than adding in-order and doing N picks?
Sadly not, since with weighted, the initial insertion order doesn't matter, the weights govern where in the schedule you are.
Ok, having thought about this for more than 5 seconds in a meeting :-P Now I'm not convinced this is particularly fair anyway. To be truly fair wouldn't we have to loop through between 0 and [the sum of each host's weight]?
Check my math because I'm terrible at these. If you have A w=100, B w=100, C w=1 you'd start with
[A=.01, B=.01, C=1].
Assuming our rng told us to max out our picks at 3, we'd pop A, assign it a deadline of .01 + 1/100 -> .02, so now the queue is
B=.01, A=.02, C=1. Pop B, assign it whatever deadline, and you still won't pop C as your third pick. C should have a 1 in 201 chance of getting picked and I think it deterministically still has a 0% chance of being picked, no?
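To sanity-check this fairness concern, here is a minimal EDF stand-in (not Envoy's `EdfScheduler`; `Entry` and `edfPicks` are hypothetical names, and the tie-break rule is an assumption). It shows that with weights 100/100/1 the early pick schedule is fully deterministic and C does not appear anywhere near the front, so a small offset of initial picks cannot emulate a fair random starting point in the weighted schedule.

```cpp
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Minimal EDF stand-in: each pick pops the host with the earliest deadline
// and reinserts it with deadline += 1/weight.
struct Entry {
  double deadline;
  int order; // insertion order, used as a deterministic tie-break
  std::string host;
  double weight;
  bool operator>(const Entry& o) const {
    return deadline != o.deadline ? deadline > o.deadline : order > o.order;
  }
};

std::vector<std::string>
edfPicks(const std::vector<std::pair<std::string, double>>& hosts, int n) {
  std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> q;
  int order = 0;
  // Initial deadline for each host is 1/weight.
  for (const auto& h : hosts) {
    q.push({1.0 / h.second, order++, h.first, h.second});
  }
  std::vector<std::string> picks;
  for (int i = 0; i < n; ++i) {
    Entry e = q.top();
    q.pop();
    picks.push_back(e.host);
    q.push({e.deadline + 1.0 / e.weight, order++, e.host, e.weight});
  }
  return picks;
}
```

Running this with A w=100, B w=100, C w=1 yields a strictly alternating A/B prefix: the schedule is deterministic, and C (which under fair weighted sampling would have a 1-in-201 chance per pick) has zero probability of being among the early picks.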
Yes, I came to the same conclusion and have already removed the offset for fairness. I'm still keeping it there to break synchronization and have left a TODO to make it fair.
@@ -481,6 +488,9 @@ HostConstSharedPtr RoundRobinLoadBalancer::chooseHost(LoadBalancerContext*) {
  // We do not expire any hosts from the host list without rebuilding the scheduler.
  ASSERT(host != nullptr);
  scheduler.edf_.add(host->weight(), host);
  // Track the nubmer of picks so we can pick up where we left of when
nubmer->number
of -> off
@@ -228,13 +228,16 @@ class RoundRobinLoadBalancer : public LoadBalancer, ZoneAwareLoadBalancerBase {
  struct Scheduler {
    // EdfScheduler for weighted RR.
    EdfScheduler<const Host> edf_;
    // Simple clock hand for when we do unweighted.
    // Simple clock hand for when we do unweighted. Tracks the total number of
    // picks when weighted.
does it not functionally track the number of picks either way?
    size_t rr_index_{};
    bool weighted_{};
  };

  // Scheduler for each valid HostsSource.
  std::unordered_map<HostsSource, Scheduler, HostsSourceHash> scheduler_;
  // Seed to allow us to desynchronize WRR balancers across a fleet.
Mind adding a few sentences about this here, or other place of your choice? I think it would be helpful if this were spelled out more for folks who haven't seen a tech talk (or a postmortem) of what happens in prod recently :-)
const uint64_t offset =
    hosts.size() == 0
        ? 0
        : (scheduler_.find(source) != scheduler_.end() ? scheduler_[source].rr_index_ : seed_) %
when not using weights, is there a reason that we don't always use a random offset?
Yeah; if you use the last offset we had for a locality, we have the nice property that if we update twice with the same host set, we continue the RR where we left off. This avoids biasing towards the earlier hosts in the schedule.
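The property described here can be sketched in isolation. This is a hypothetical illustration, not Envoy's implementation: `UnweightedRR` is an invented class showing why carrying `rr_index_` across a rebuild (rather than re-seeding randomly each time) continues the rotation instead of snapping back to the first host.

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical unweighted RR hand. On a refresh, the caller constructs a new
// instance from the previous instance's index (or from a seed on first build),
// so the rotation continues where it left off.
class UnweightedRR {
public:
  UnweightedRR(std::vector<std::string> hosts, uint64_t initial_index)
      : hosts_(std::move(hosts)), rr_index_(initial_index) {}

  // Simple clock hand: each pick advances the index.
  const std::string& pick() { return hosts_[rr_index_++ % hosts_.size()]; }

  // Exposed so a rebuild with the same host set can resume the rotation.
  uint64_t index() const { return rr_index_; }

private:
  std::vector<std::string> hosts_;
  uint64_t rr_index_;
};
```

A rebuild with the same host set and the carried-over index yields the next host in sequence, with no bias back towards the earlier hosts.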
@@ -481,6 +488,9 @@ HostConstSharedPtr RoundRobinLoadBalancer::chooseHost(LoadBalancerContext*) {
  // We do not expire any hosts from the host list without rebuilding the scheduler.
  ASSERT(host != nullptr);
  scheduler.edf_.add(host->weight(), host);
  // Track the nubmer of picks so we can pick up where we left of when
s/nubmer/number/
and s/left of/left off/
Signed-off-by: Harvey Tuch <htuch@google.com>
…ht for now. Signed-off-by: Harvey Tuch <htuch@google.com>
LGTM, thanks for jumping on this.
// Cycle through hosts to achieve the intended offset behavior.
// TODO(htuch): Consider how we can avoid biasing towards earlier hosts in the
// schedule across refreshes for the weighted case.
for (uint32_t i = 0; i < seed_ % hosts.size(); ++i) {
A lot of this behavior in the EDF case is not intuitive to me at all. Not a big deal right now but is it worth at some point writing some LB simulations to cover this case so we can better understand the behavior? I wonder for example if we really have to cycle through all hosts to get reasonable randomness since doing that might be costly for large host sets?
I agree, EDF is pretty confusing and I think we might want to move away from it longer term. I've filed #3278 and will merge as is.
This PR fixes two issues:
* Avoid having synchronized behavior across the fleet by having a seed applied to each WRR balancer instance.
* Avoid having unweighted RR reset to the same host on each refresh, biasing towards earlier hosts in the schedule. This TODO still exists for weighted.

Risk Level: Medium (there will be observable traffic effects due to changes in host pick schedule).
Testing: New unit tests, modified existing.

Signed-off-by: Harvey Tuch <htuch@google.com>
Signed-off-by: Rama <rama.rao@salesforce.com>