Poller Scaling Decisions #7300
base: main
Conversation
} else if stats.TasksAddRate/stats.TasksDispatchRate < pm.config.PollerScalingDispatchDownFraction() {
	// Decrease if we're dispatching tasks faster than we're adding them. This case can come up as the converse of
	// the above, where we have cleared the backlog but are still not hitting the wait-time sync matching case. We
	// can still begin to scale down before hitting that case.
For this I could imagine a scenario, but I'm still not convinced it's good to have it. The scenario it covers is an edge case in which we want to scale pollers down because the backlog just got drained.
- But why don't we want to wait until the regular condition of pollWaitTime >= pm.config.PollerScalingSyncMatchWaitTime() is met? What do we get for reducing the poller count just 1s earlier?
- It seems the equilibrium poller count is where pollers wait just below PollerScalingSyncMatchWaitTime. So we don't want to ask the SDK to reduce pollers below this equilibrium, because it will have to increase the poller count again to get back to it. In other words, this behavior may add unnecessary "swinging".
- Without it the code will be simpler and easier to reason about, and we'd have one less knob to tune.
I am totally fine with removing both of these conditions given that they likely don't contribute a whole lot, but I haven't tested without them. If they aren't doing much (which is probably likely) I'm happy to get rid of them.
then I'd vote for deleting them in the interest of simplicity.
I guess I'm neutral on this.. I'm always for simplicity but ultimately these are all heuristics, if it performs better in practice and isn't expensive/very complicated to do, then there's not much harm in keeping a few extra conditions.
Based on the tests I've done they made next-to-no difference. Of course it's possible they would make a difference for some scenario I didn't try, but 🤷. They are super simple. The config knobs are maybe the most annoying part of keeping them. We could keep them in with hardcoded values?
We technically do have a backlog for queries and nexus tasks, just not a persistent one. But I think using the incoming / outgoing rate is probably a pretty good signal.
It's not a "backlog" that comes out in stats.ApproximateBacklogCount, so it needs extra logic somehow (or we could change the backlog stats to include in-mem queries + nexus; that would be interesting...)
@dnr Would you be fine using the rate (and ignoring backlog count) for Nexus?
I'm just pointing out an alternative solution where we do include nexus in the "backlog count" (i.e. make it durable backlog plus in-mem backlog). Right now it might be a little awkward to get that count, though. The new matcher will make that easier. I'm also totally fine with the rate.
Thinking through it a little more: the count is the integral of the rate. The backlog condition is basically: has the count been positive for x time (age > x), or has the count been 0 for y time (poller waited > y)? If the rate is > z then we can "guess" that the count will be positive for a while (this seems safe), or if it's < z then we can guess that the count will soon be 0 (this one seems less safe... but the number of query/nexus in memory is implicitly limited by a timeout, so maybe it's okay?)
I think the rate is okay for now, but note that we also want a way to instruct users to scale out their workers to handle more Nexus traffic, and I think the transient backlog size should fit into the current worker scaling model.
(We could use transient backlog size to scale for high query load as well).
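To make the rate-as-a-proxy-for-count idea above concrete, here is a minimal Go sketch; the function name, parameter names, and thresholds are illustrative, not the actual server code:

```go
// suggestDeltaFromRates sketches the reasoning above: the backlog count is the integral of
// (add rate - dispatch rate), so for task types whose backlog is never persisted (queries,
// nexus tasks) the ratio of the two rates can stand in for a backlog count.
func suggestDeltaFromRates(addRate, dispatchRate, upFraction, downFraction float64) int32 {
	if dispatchRate <= 0 {
		return 0 // nothing dispatched yet; not enough signal either way
	}
	ratio := addRate / dispatchRate
	switch {
	case ratio > upFraction:
		// Adding faster than dispatching: the in-memory backlog will stay positive for a
		// while, so suggest one more poller (the "safe" guess).
		return 1
	case ratio < downFraction:
		// Dispatching faster than adding: the backlog should soon hit zero, so suggest one
		// fewer poller (less safe, but in-memory tasks are bounded by a timeout anyway).
		return -1
	default:
		return 0
	}
}
```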
MatchingPollerScalingDecisionsPerSecond = NewTaskQueueFloatSetting(
	"matching.pollerScalingDecisionsPerSecond",
	10,
	`MatchingPollerScalingDecisionsPerSecond is the maximum number of scaling decisions that will be issued per
second by one partition manager`,
One more note: ideally this would be proportional to worker count, or even a rate limit per worker count. For my projects I'm going to have an "approximate per-key rate limiter" that accepts arbitrary keys but uses only fixed memory. We could use that with poller identity to enforce a rate limit per worker.
Or we could actually just count recent pollers from poller history and multiply the rate limit by that (and then make it like 1 or 2 per worker).
+1 on proportionality to worker count. Especially because the root is exclusively making the scale-down decisions, even if the number of partitions changes when workers scale out, the scale-down decisions will be rate-limited by this value globally.
We don't want the rate of poller scale-down to be the same (in terms of pollers/sec) when there is one worker vs. when there are 1000 workers, right?
mmmmm... I'm not sure about this. A lot of my testing was done with just one worker, and fast (Rust, Go, Java) workers can easily perform a huge number of tasks per second when the tasks are simple (which isn't uncommon). So being able to burst up pollers pretty fast for a single worker is fairly desirable.
Now, I'd agree it should maybe be even faster for more workers, but I wouldn't want to do only 1 per sec for 1 worker.
If we did do that, where would I count the pollers?
See GetAllPollerInfo and similar functions. You could add new ones to get just a count.
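For illustration, a hedged sketch of that direction; the helper, the receiver, and the 2-per-worker rate are made up, and it assumes the limiter is a golang.org/x/time/rate.Limiter:

```go
// decisionLimitFor scales the per-partition decision rate limit by the number of
// recently-seen pollers, where the count could come from GetAllPollerInfo or a new
// count-only accessor. Hypothetical helper, not part of this PR.
func decisionLimitFor(numPollers int, perWorkerPerSecond float64) float64 {
	if numPollers < 1 {
		numPollers = 1 // never go below one worker's worth of decisions
	}
	return float64(numPollers) * perWorkerPerSecond
}

// Usage sketch: limiter.SetLimit(rate.Limit(decisionLimitFor(len(pm.GetAllPollerInfo()), 2)))
```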
Thought a bit more about this - I ran some tests where I simply moved the wait-time-based scale down to be before any rate limiting, which seemingly makes good sense anyway. If all those polls really did wait that long, they should probably scale down. It works perfectly fine.
Then, for scale up, the current rate limiting seems to be plenty fast enough.
I would have to add a new rate limiter implementation to do the multiplicative scaling, and maybe it makes more sense to just wait for what @dnr is likely to add.
My trick for multiplicative scaling that I was going to suggest: let the rate limit be 10*1e6 with a proportional burst, then on each poll call AllowN(1e6 / numPollers)
Ah, ok yeah good trick
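For reference, a minimal sketch of that trick with golang.org/x/time/rate; the package placement, constant, and helper names are made up:

```go
package matching // placement is illustrative

import (
	"time"

	"golang.org/x/time/rate"
)

// Allow 10 * 1e6 "micro-decisions" per second with a proportional burst; each poll then
// consumes 1e6/numPollers of them, which works out to roughly 10 decisions per second
// per poller without ever calling SetLimit.
const decisionsPerSecond = 10

func newScaledLimiter() *rate.Limiter {
	return rate.NewLimiter(decisionsPerSecond*1e6, decisionsPerSecond*1e6)
}

func allowDecision(lim *rate.Limiter, numPollers int) bool {
	if numPollers < 1 {
		numPollers = 1
	}
	return lim.AllowN(time.Now(), 1e6/numPollers)
}
```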
// sticky queues.
pd.PollRequestDeltaSuggestion = 1
} else if !pm.partition.IsRoot() {
	// Non-root partitions don't have an appropriate view of the data to make decisions beyond backlog.
Is this still true if we're not using add/dispatch rates?
Maybe not. I will run some tests without it.
We are now, and additionally I moved the time-based scale-down check in front of this, which I think makes sense.
// makePollerScalingDecision makes a decision on whether to scale pollers up or down based on the current state of the
// task queue and the task about to be returned. Does not modify inputs.
func (pm *taskQueuePartitionManagerImpl) makePollerScalingDecision(
Should we make scaling decisions for forwarded polls? I don't see a clear reason to say yes or no yet...
I'm not sure. This is why I had that open question about keeping track of the overall wait time including forwarding, since I think that would affect this.
Good point. Though I'm not sure we actually need to pass the start time through forwarding: either a poll is forwarded immediately or not at all, so the wait time on the parent should be only slightly less than the wait time on the child
assert.NoError(t, err)
assert.NotNil(t, resp.PollerScalingDecision)
assert.GreaterOrEqual(t, int32(1), resp.PollerScalingDecision.PollRequestDeltaSuggestion)
assert calls don't stop execution, and you can't use require in EventuallyWithT blocks. You'll want to ensure your code won't panic if it gets past a failed assertion.
Yeah, I wasn't expecting them to; if the eventually passes, that's all I need here.
If the test panics it will fail without rerunning the eventually block. I don't think that's what you want.
I'm not following you. This is doing exactly what I expect: it re-runs the block until the asserts pass. Nothing is panicking.
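For what it's worth, a sketch of the guard being suggested; pollTaskQueue, the response shape, and the timeouts are placeholders, not real helpers from this repo:

```go
func TestPollerScalingSuggestion(t *testing.T) {
	// Assumed imports: context, testing, time, and testify's assert and require packages.
	ctx := context.Background()
	require.EventuallyWithT(t, func(c *assert.CollectT) {
		resp, err := pollTaskQueue(ctx) // hypothetical poll call under test
		// assert doesn't stop execution, so bail out before dereferencing anything
		// that a failed assertion implies is nil.
		if !assert.NoError(c, err) || !assert.NotNil(c, resp) || !assert.NotNil(c, resp.PollerScalingDecision) {
			return
		}
		assert.GreaterOrEqual(c, int32(1), resp.PollerScalingDecision.PollRequestDeltaSuggestion)
	}, 10*time.Second, 100*time.Millisecond)
}
```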
@@ -60,7 +60,7 @@ const (
 MinLongPollTimeout = time.Second * 2
 // CriticalLongPollTimeout is a threshold for the context timeout passed into long poll API,
 // below which a warning will be logged
-CriticalLongPollTimeout = time.Second * 20
+CriticalLongPollTimeout = time.Second * 10
Changed this since it's useful, SDK-side, to reduce the timeout while under load (so that when load runs out, we scale down faster).
Talked with David and it seems like reducing this ought to be fine, but we can undo it if there are any objections.
Important
Needs temporalio/api#553 to be merged for CI to pass
See also the SDK side here: temporalio/sdk-core#874
What changed?
Optionally attach scaling decisions, suggesting whether the SDK should increase or decrease its pollers, to task responses.
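Conceptually, the suggestion is just a signed delta (the PollRequestDeltaSuggestion field visible in the tests) that an SDK can apply to its poller target and clamp to its own bounds. A hypothetical sketch of that SDK-side handling, with made-up names:

```go
// applyPollerDelta is an illustrative sketch only; none of these names come from an
// actual SDK implementation.
func applyPollerDelta(current int, delta int32, minPollers, maxPollers int) int {
	next := current + int(delta)
	if next < minPollers {
		next = minPollers
	}
	if next > maxPollers {
		next = maxPollers
	}
	return next
}
```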
Why?
Part of the worker management effort to simplify configuration of workers for users.
How did you test it?
Unit and basic functional tests are added here. More comprehensive benchmarking was done in conjunction with the SDK changes. Overall, there can be as much as a 100% increase in task throughput compared to a static number of pollers in very bursty scenarios. At worst, there is no difference.
(A sample of the benchmarking results was attached as a screenshot in the PR description.)
Potential risks
This will not be flipped on by default in SDKs until we want it to be. That said, the obvious risk here is significantly increased polling RPCs overall; however, the SDK logic respects rate-limiting errors.
Documentation
Docs will be coming as we roll out the SDK flag to enable this.
Is hotfix candidate?
No