
Poller Scaling Decisions #7300

Open · wants to merge 8 commits into main from spencer/poller-scaling

Conversation

@Sushisource Sushisource (Member) commented Feb 8, 2025

Important

Needs temporalio/api#553 to be merged for CI to pass

See also the SDK side here: temporalio/sdk-core#874

What changed?

Optionally attach scaling decisions to task responses, indicating whether the SDK should increase or decrease its poller count.
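
For illustration, a minimal sketch of how an SDK-side poll loop might consume the new field, assuming the generated getters for the PollerScalingDecision field from temporalio/api#553; the pollerPool type is a hypothetical stand-in, and this is not code from the SDK PR:

```go
package pollerscaling

import (
	"context"

	"go.temporal.io/api/workflowservice/v1"
)

// pollerPool is a hypothetical stand-in for whatever structure an SDK uses to manage
// its concurrent pollers; only the Adjust method matters for this sketch.
type pollerPool struct{ target int }

func (p *pollerPool) Adjust(delta int) { p.target += delta }

// pollOnce issues a single workflow task poll and applies any scaling suggestion the
// server attached to the response.
func pollOnce(ctx context.Context, svc workflowservice.WorkflowServiceClient,
	req *workflowservice.PollWorkflowTaskQueueRequest, pool *pollerPool) error {
	resp, err := svc.PollWorkflowTaskQueue(ctx, req)
	if err != nil {
		return err
	}
	if d := resp.GetPollerScalingDecision(); d != nil {
		// A positive suggestion means "add pollers"; a negative one means "remove pollers".
		pool.Adjust(int(d.GetPollRequestDeltaSuggestion()))
	}
	return nil
}
```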

Why?

Part of the worker management effort to simplify configuration of workers for users.

How did you test it?

Unit and basic functional tests are added here. More comprehensive benchmarking was done in conjunction with the SDK changes. Overall, task throughput can be as much as 100% higher than with a static number of pollers in very bursty scenarios; at worst, there is no difference.

A sample of some of the benchmarking results:

  • 250 workflows running 5 activities each
    • Autoscale ~5s / No Autoscale ~10s
  • 1000 workflows, 5 activities each
    • Autoscale ~28s / No Autoscale ~44s
  • 1000 workflows, 5 activities each, 4 workers
    • No Autoscale: ~20s first load / ~25s second load
    • Autoscale: ~14s first load / ~25s second load

Potential risks

This will not be flipped on by default in SDKs until we want to. The obvious risk is a significant overall increase in polling RPCs; that said, the SDK logic respects rate-limiting errors.

Documentation

Docs will be coming as we roll out the SDK flag to enable this.

Is hotfix candidate?

No

Comment on lines 1113 to 1116
} else if stats.TasksAddRate/stats.TasksDispatchRate < pm.config.PollerScalingDispatchDownFraction() {
// Decrease if we're dispatching tasks faster than we're adding them. This case can come up as the converse of
// the above, where we have cleared the backlog but are still not hitting the wait-time sync matching case. We
// will still begin to scale down before hitting that case.
Collaborator

For this I could imagine a scenario, but I'm still not convinced it's good to have. The scenario it covers is an edge case in which we want to scale pollers down because the backlog just got drained.

  • But why don't we want to wait until the regular condition of pollWaitTime >= pm.config.PollerScalingSyncMatchWaitTime() is met? What do we get for reducing the poller count just 1s earlier?
  • It seems the equilibrium poller count is where pollers wait just below PollerScalingSyncMatchWaitTime. So we do not want to ask the SDK to reduce pollers below this equilibrium, because it'll have to increase the poller count again to get back to it. In other words, this behavior may add unnecessary "swinging".
  • Without it the code will be simpler and easier to reason about, and we'd have one less knob to tune.

Member Author

I am totally fine with removing both of these conditions given that they likely don't contribute a whole lot, but I haven't tested without them. If they aren't doing much (which is probably the case), I'm happy to get rid of them.

Collaborator

Then I'd vote for deleting them in the interest of simplicity.

Member

I guess I'm neutral on this... I'm always for simplicity, but ultimately these are all heuristics; if it performs better in practice and isn't expensive or very complicated to do, then there's not much harm in keeping a few extra conditions.

Member Author

Based on the tests I've done, they made next to no difference. Of course it's possible they would make a difference in some scenario I didn't try, but 🤷‍♂️. They are super simple. The config knobs are maybe the most annoying thing about keeping them. We could keep them with hardcoded values?

Member

We technically do have a backlog for queries and nexus tasks, just not a persistent one. But I think using the incoming / outgoing rate is probably a pretty good signal.

Member

It's not a "backlog" that comes out in stats.ApproximateBacklogCount, so it needs extra logic somehow (or we could change backlog stats to include in-mem queries + Nexus; that would be interesting...).

Member Author
@Sushisource Sushisource Feb 18, 2025

@dnr Would you be fine using the rate (and ignoring backlog count) for Nexus?

Member

I'm just pointing out an alternative solution where we do include nexus in the "backlog count" (i.e. make it durable backlog plus in-mem backlog). Right now it might be a little awkward to get that count, though. The new matcher will make that easier. I'm also totally fine with the rate.

Thinking through it a little more: the count is the integral of the rate. The backlog condition is basically: has the count been positive for x time (age > x), or has the count been 0 for y time (poller waited > y)? If the rate is > z then we can "guess" that the count will be positive for a while (this seems safe), or if it's < z then we can guess that the count will soon be 0 (this one seems less safe... but the number of query/Nexus tasks in memory is implicitly limited by a timeout, so maybe it's okay?).
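
A rough Go sketch of the heuristic described above, using local stand-in types; only PollerScalingSyncMatchWaitTime and PollerScalingDispatchDownFraction are knobs named in this thread, and the backlog-age threshold is hypothetical. This is not the PR's actual implementation:

```go
package matchingsketch

import "time"

// queueStats is a local stand-in for the stats fields referenced in this thread
// (ApproximateBacklogCount, TasksAddRate, TasksDispatchRate); the age field mirrors
// the "age > x" condition described above.
type queueStats struct {
	ApproximateBacklogCount int64
	ApproximateBacklogAge   time.Duration
	TasksAddRate            float64
	TasksDispatchRate       float64
}

// scalingConfig stands in for the matching config; BacklogAgeScaleUpThreshold is a
// hypothetical placeholder knob.
type scalingConfig struct {
	PollerScalingSyncMatchWaitTime    time.Duration
	PollerScalingDispatchDownFraction float64
	BacklogAgeScaleUpThreshold        time.Duration // hypothetical
}

// suggestPollerDelta sketches the decision: scale up while the backlog has been
// non-empty for a while, scale down once pollers wait long enough on an empty queue,
// and use the add/dispatch rate ratio as an early scale-down signal.
func suggestPollerDelta(stats queueStats, pollWaitTime time.Duration, cfg scalingConfig) int32 {
	switch {
	case stats.ApproximateBacklogCount > 0 && stats.ApproximateBacklogAge > cfg.BacklogAgeScaleUpThreshold:
		return 1 // "count has been positive for x time": suggest more pollers
	case pollWaitTime >= cfg.PollerScalingSyncMatchWaitTime:
		return -1 // "count has been 0 for y time": suggest fewer pollers
	case stats.TasksDispatchRate > 0 && stats.TasksAddRate/stats.TasksDispatchRate < cfg.PollerScalingDispatchDownFraction:
		return -1 // dispatching faster than adding: guess the backlog will soon drain
	default:
		return 0
	}
}
```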

Member

I think for now the rate is okay, but note that we also want a way to instruct users to scale out their workers to handle more Nexus traffic, and I think the transient backlog size should fit into the current worker scaling model. (We could use transient backlog size to scale for high query load as well.)

Comment on lines +1269 to +1279
MatchingPollerScalingDecisionsPerSecond = NewTaskQueueFloatSetting(
"matching.pollerScalingDecisionsPerSecond",
10,
`MatchingPollerScalingDecisionsPerSecond is the maximum number of scaling decisions that will be issued per
second by one partition manager`,
Member

One more note: ideally this would be proportional to worker count, or even a per-worker rate limit. For my projects I'm going to have an "approximate per-key rate limiter" that accepts arbitrary keys but uses only fixed memory. We could use that with poller identity to enforce a rate limit per worker.

Or we could just count recent pollers from poller history and multiply the rate limit by that (and then make it something like 1 or 2 per worker).

Collaborator

+1 on proportionality to worker count. Especially since the root partition exclusively makes the scale-down decisions, even if the number of partitions changes as workers scale out, scale-down decisions will be rate limited by this value globally.
We don't want the rate of poller scale-down (in pollers/sec) to be the same when there is one worker as when there are 1000 workers, right?

Member Author
@Sushisource Sushisource Feb 13, 2025

Mmmmm... I'm not sure about this. A lot of my testing was done with just one worker, and fast (Rust, Go, Java) workers can easily execute a huge number of tasks per second when the tasks are simple (which isn't uncommon). So being able to burst up pollers pretty fast for a single worker is fairly desirable.

Now, I'd agree it maybe should be even faster for more workers, but I wouldn't want to allow only 1 per second for 1 worker.

If we did do that, where would I count the pollers?

Member

See GetAllPollerInfo and similar functions. You could add new ones to get just a count.

Member Author

Thought a bit more about this: I ran some tests where I simply moved the scale-down on wait time to be before any rate limiting, which seems to make good sense anyway. If all those polls actually did wait that long, they should probably scale down. It works perfectly fine.

Then, for scale-up, the current rate limiting seems to be plenty fast enough.

I would have to add a new rate limiter implementation to do the multiplicative scaling, and maybe it makes more sense to just wait for what @dnr is likely to add.

Member

My trick for multiplicative scaling that I was going to suggest: let the rate limit be 10*1e6 with a proportional burst, then on each poll call AllowN(1e6 / numPollers).
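
A minimal sketch of that trick using golang.org/x/time/rate; the per-poller rate and the way the poller count is obtained are illustrative assumptions, not the PR's code:

```go
package main

import (
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	// Give the limiter 10*1e6 tokens/sec with a proportional burst, then charge each
	// poll 1e6/numPollers tokens, so the effective limit is roughly 10 decisions/sec
	// per poller.
	limiter := rate.NewLimiter(rate.Limit(10*1e6), 10*1e6)

	numPollers := 4 // in practice this would come from recent poller history (e.g. GetAllPollerInfo)
	for i := 0; i < 3; i++ {
		if limiter.AllowN(time.Now(), 1e6/numPollers) {
			fmt.Println("emit poller scaling decision")
		} else {
			fmt.Println("rate limited; skip attaching a decision")
		}
	}
}
```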

Member Author

Ah, ok yeah good trick

// sticky queues.
pd.PollRequestDeltaSuggestion = 1
} else if !pm.partition.IsRoot() {
// Non-root partitions don't have an appropriate view of the data to make decisions beyond backlog.
Member

Is this still true if we're not using add/dispatch rates?

Member Author

Maybe not. I will run some tests without it.

Member Author

We are now, and additionally I moved the time-based scale-down check to be in front of this, which I think makes sense.


// makePollerScalingDecision makes a decision on whether to scale pollers up or down based on the current state of the
// task queue and the task about to be returned. Does not modify inputs.
func (pm *taskQueuePartitionManagerImpl) makePollerScalingDecision(
Member

Should we make scaling decisions for forwarded polls? I don't see a clear reason to say yes or no yet...

Member Author

I'm not sure. This is why I had that open question about keeping track of the overall wait time including forwarding, since that would affect this, I think.

Member

Good point. Though I'm not sure we actually need to pass the start time through forwarding: either a poll is forwarded immediately or not at all, so the wait time on the parent should be only slightly less than the wait time on the child.

@Sushisource Sushisource force-pushed the spencer/poller-scaling branch 3 times, most recently from 2d6bd0d to b82ea1f Compare February 15, 2025 00:15
Comment on lines +113 to +139
assert.NoError(t, err)
assert.NotNil(t, resp.PollerScalingDecision)
assert.GreaterOrEqual(t, int32(1), resp.PollerScalingDecision.PollRequestDeltaSuggestion)
Member

assert calls don't stop execution, and you can't use require in EventuallyWithT blocks. You'll want to ensure your code won't panic if it continues past a failed assertion.

Member Author

Yeah, I wasn't expecting them to; if the eventually passes, that's all I need here.

Member

If the test panics it will fail without rerunning the eventually block. I don't think that's what you want.

Member Author

I'm not following you. This is doing exactly what I expect: it re-runs the block until the asserts pass. Nothing is panicking.
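
For reference, a minimal sketch of the guard pattern the reviewer is describing, assuming testify's require.EventuallyWithT; the response and poll helper types are hypothetical stand-ins, not the PR's test code:

```go
package example

import (
	"testing"
	"time"

	"github.com/stretchr/testify/assert"
	"github.com/stretchr/testify/require"
)

// pollerScalingDecision and pollResponse are hypothetical stand-ins; only the shape of
// the scaling decision matters for this sketch.
type pollerScalingDecision struct {
	PollRequestDeltaSuggestion int32
}

type pollResponse struct {
	PollerScalingDecision *pollerScalingDecision
}

// pollTaskQueue is a hypothetical helper standing in for the real poll call.
func pollTaskQueue() (*pollResponse, error) {
	return &pollResponse{PollerScalingDecision: &pollerScalingDecision{PollRequestDeltaSuggestion: 1}}, nil
}

func TestPollerScalingDecisionEventually(t *testing.T) {
	require.EventuallyWithT(t, func(c *assert.CollectT) {
		resp, err := pollTaskQueue()
		// assert.* records failures but keeps executing, so bail out before touching a
		// nil decision; EventuallyWithT will retry the block.
		if !assert.NoError(c, err) || !assert.NotNil(c, resp.PollerScalingDecision) {
			return
		}
		assert.GreaterOrEqual(c, int32(1), resp.PollerScalingDecision.PollRequestDeltaSuggestion)
	}, 10*time.Second, 100*time.Millisecond)
}
```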

@Sushisource Sushisource force-pushed the spencer/poller-scaling branch 2 times, most recently from cc93b8d to 9d46d2e Compare February 19, 2025 18:30
@Sushisource Sushisource force-pushed the spencer/poller-scaling branch from f8a0e9f to b6c9d55 Compare February 19, 2025 23:05
@@ -60,7 +60,7 @@ const (
 MinLongPollTimeout = time.Second * 2
 // CriticalLongPollTimeout is a threshold for the context timeout passed into long poll API,
 // below which a warning will be logged
-CriticalLongPollTimeout = time.Second * 20
+CriticalLongPollTimeout = time.Second * 10
Member Author

Changed this since it's useful, SDK-side, to reduce the timeout while under load (so that when load runs out, we scale down faster).

Talked w/ David and it seems like reducing this ought to be fine, but we can undo it if there are any objections.

@Sushisource Sushisource force-pushed the spencer/poller-scaling branch from dade58b to a5bf22b Compare February 20, 2025 01:05
@Sushisource Sushisource force-pushed the spencer/poller-scaling branch from 6dd2b16 to a35c853 Compare February 20, 2025 17:49