
Poller Scaling Decisions #7300

Open · wants to merge 8 commits into main from spencer/poller-scaling

Conversation

@Sushisource Sushisource (Member) commented Feb 8, 2025

Important

Needs temporalio/api#553 to be merged for CI to pass

See also the SDK side here: temporalio/sdk-core#874

What changed?

Optionally attach scaling decisions to task responses, indicating whether the SDK should increase or decrease its poller count.
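
For illustration, a minimal sketch of how an SDK-side poll loop might consume the new field, assuming the generated getters for the PollerScalingDecision field from temporalio/api#553; the pollerPool type is a hypothetical stand-in, and this is not code from the SDK PR:

```go
package pollerscaling

import (
	"context"

	"go.temporal.io/api/workflowservice/v1"
)

// pollerPool is a hypothetical stand-in for whatever structure an SDK uses to manage
// its concurrent pollers; only the Adjust method matters for this sketch.
type pollerPool struct{ target int }

func (p *pollerPool) Adjust(delta int) { p.target += delta }

// pollOnce issues a single workflow task poll and applies any scaling suggestion the
// server attached to the response.
func pollOnce(ctx context.Context, svc workflowservice.WorkflowServiceClient,
	req *workflowservice.PollWorkflowTaskQueueRequest, pool *pollerPool) error {
	resp, err := svc.PollWorkflowTaskQueue(ctx, req)
	if err != nil {
		return err
	}
	if d := resp.GetPollerScalingDecision(); d != nil {
		// A positive suggestion means "add pollers"; a negative one means "remove pollers".
		pool.Adjust(int(d.GetPollRequestDeltaSuggestion()))
	}
	return nil
}
```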

Why?

Part of the worker management effort to simplify configuration of workers for users.

How did you test it?

Unit and basic functional tests are added here. More comprehensive benchmarking was done in conjunction with the SDK changes. Overall, task throughput can be as much as 100% higher than with a static number of pollers in very bursty scenarios; at worst, there is no difference.

A sample of some of the benchmarking results:

  • 250 workflows running 5 activities each
    • Autoscale ~5s / No Autoscale ~10s
  • 1000 workflows, 5 activities each
    • Autoscale ~28s / No Autoscale ~44s
  • 1000 workflows, 5 activities each, 4 workers
    • No Autoscale: ~20s first load / ~25s second load
    • Autoscale: ~14s first load / ~25s second load

Potential risks

This will not be flipped on by default in SDKs until we want to. The obvious risk is a significant overall increase in polling RPCs; that said, the SDK logic respects rate-limiting errors.

Documentation

Docs will be coming as we roll out the SDK flag to enable this.

Is hotfix candidate?

No

Comment on lines 1113 to 1116
} else if stats.TasksAddRate/stats.TasksDispatchRate < pm.config.PollerScalingDispatchDownFraction() {
// Decrease if we're dispatching tasks faster than we're adding them. This case can come up as the converse of
// the above, where we have cleared the backlog but are still not hitting the wait-time sync matching case. We
// will still begin to scale down before hitting that case.
Collaborator

For this I could imagine a scenario, but I'm still not convinced it's good to have. The scenario it covers is an edge case in which we want to scale pollers down because the backlog just got drained.

  • But why don't we want to wait until the regular condition of pollWaitTime >= pm.config.PollerScalingSyncMatchWaitTime() is met? What do we get for reducing the poller count just 1s earlier?
  • It seems the equilibrium poller count is where pollers wait just below PollerScalingSyncMatchWaitTime. So we do not want to ask the SDK to reduce pollers below this equilibrium, because it'll have to increase the poller count again to get back to it. In other words, this behavior may add unnecessary "swinging".
  • Without it the code will be simpler and easier to reason about, and we'd have one less knob to tune.

Member Author

I am totally fine with removing both of these conditions given that they likely don't contribute a whole lot, but I haven't tested without them. If they aren't doing much (which is probably the case), I'm happy to get rid of them.

Collaborator

Then I'd vote for deleting them in the interest of simplicity.

Member

I guess I'm neutral on this... I'm always for simplicity, but ultimately these are all heuristics; if it performs better in practice and isn't expensive or very complicated to do, then there's not much harm in keeping a few extra conditions.

Member Author

Based on the tests I've done, they made next to no difference. Of course it's possible they would make a difference in some scenario I didn't try, but 🤷‍♂️. They are super simple. The config knobs are maybe the most annoying thing about keeping them. We could keep them with hardcoded values?

Member

We technically do have a backlog for queries and nexus tasks, just not a persistent one. But I think using the incoming / outgoing rate is probably a pretty good signal.

Member

It's not a "backlog" that comes out in stats.ApproximateBacklogCount, so it needs extra logic somehow (or we could change backlog stats to include in-mem queries + Nexus; that would be interesting...).

Member Author
@Sushisource Sushisource Feb 18, 2025

@dnr Would you be fine using the rate (and ignoring backlog count) for Nexus?

Member

I'm just pointing out an alternative solution where we do include nexus in the "backlog count" (i.e. make it durable backlog plus in-mem backlog). Right now it might be a little awkward to get that count, though. The new matcher will make that easier. I'm also totally fine with the rate.

Thinking through it a little more: the count is the integral of the rate. The backlog condition is basically: has the count been positive for x time (age > x), or has the count been 0 for y time (poller waited > y)? If the rate is > z then we can "guess" that the count will be positive for a while (this seems safe), or if it's < z then we can guess that the count will soon be 0 (this one seems less safe... but the number of query/Nexus tasks in memory is implicitly limited by a timeout, so maybe it's okay?).
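
A rough Go sketch of the heuristic described above, using local stand-in types; only PollerScalingSyncMatchWaitTime and PollerScalingDispatchDownFraction are knobs named in this thread, and the backlog-age threshold is hypothetical. This is not the PR's actual implementation:

```go
package matchingsketch

import "time"

// queueStats is a local stand-in for the stats fields referenced in this thread
// (ApproximateBacklogCount, TasksAddRate, TasksDispatchRate); the age field mirrors
// the "age > x" condition described above.
type queueStats struct {
	ApproximateBacklogCount int64
	ApproximateBacklogAge   time.Duration
	TasksAddRate            float64
	TasksDispatchRate       float64
}

// scalingConfig stands in for the matching config; BacklogAgeScaleUpThreshold is a
// hypothetical placeholder knob.
type scalingConfig struct {
	PollerScalingSyncMatchWaitTime    time.Duration
	PollerScalingDispatchDownFraction float64
	BacklogAgeScaleUpThreshold        time.Duration // hypothetical
}

// suggestPollerDelta sketches the decision: scale up while the backlog has been
// non-empty for a while, scale down once pollers wait long enough on an empty queue,
// and use the add/dispatch rate ratio as an early scale-down signal.
func suggestPollerDelta(stats queueStats, pollWaitTime time.Duration, cfg scalingConfig) int32 {
	switch {
	case stats.ApproximateBacklogCount > 0 && stats.ApproximateBacklogAge > cfg.BacklogAgeScaleUpThreshold:
		return 1 // "count has been positive for x time": suggest more pollers
	case pollWaitTime >= cfg.PollerScalingSyncMatchWaitTime:
		return -1 // "count has been 0 for y time": suggest fewer pollers
	case stats.TasksDispatchRate > 0 && stats.TasksAddRate/stats.TasksDispatchRate < cfg.PollerScalingDispatchDownFraction:
		return -1 // dispatching faster than adding: guess the backlog will soon drain
	default:
		return 0
	}
}
```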

Member

I think for now the rate is okay, but note that we also want a way to instruct users to scale out their workers to handle more Nexus traffic, and I think the transient backlog size should fit into the current worker scaling model. (We could use transient backlog size to scale for high query load as well.)

Comment on lines +1269 to +1279
MatchingPollerScalingDecisionsPerSecond = NewTaskQueueFloatSetting(
"matching.pollerScalingDecisionsPerSecond",
10,
`MatchingPollerScalingDecisionsPerSecond is the maximum number of scaling decisions that will be issued per
second by one partition manager`,
Member

One more note: ideally this would be proportional to worker count, or even a per-worker rate limit. For my projects I'm going to have an "approximate per-key rate limiter" that accepts arbitrary keys but uses only fixed memory. We could use that with poller identity to enforce a rate limit per worker.

Or we could just count recent pollers from poller history and multiply the rate limit by that (and then make it something like 1 or 2 per worker).

Collaborator

+1 on proportionality to worker count. Especially since the root partition exclusively makes the scale-down decisions, even if the number of partitions changes as workers scale out, scale-down decisions will be rate limited by this value globally.
We don't want the rate of poller scale-down (in pollers/sec) to be the same when there is one worker as when there are 1000 workers, right?

Member Author
@Sushisource Sushisource Feb 13, 2025

Mmmmm... I'm not sure about this. A lot of my testing was done with just one worker, and fast (Rust, Go, Java) workers can easily execute a huge number of tasks per second when the tasks are simple (which isn't uncommon). So being able to burst up pollers pretty fast for a single worker is fairly desirable.

Now, I'd agree it maybe should be even faster for more workers, but I wouldn't want to allow only 1 per second for 1 worker.

If we did do that, where would I count the pollers?

Member

See GetAllPollerInfo and similar functions. You could add new ones to get just a count.

Member Author

Thought a bit more about this: I ran some tests where I simply moved the scale-down on wait time to be before any rate limiting, which seems to make good sense anyway. If all those polls actually did wait that long, they should probably scale down. It works perfectly fine.

Then, for scale-up, the current rate limiting seems to be plenty fast enough.

I would have to add a new rate limiter implementation to do the multiplicative scaling, and maybe it makes more sense to just wait for what @dnr is likely to add.

Member

My trick for multiplicative scaling that I was going to suggest: let the rate limit be 10*1e6 with a proportional burst, then on each poll call AllowN(1e6 / numPollers).
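
A minimal sketch of that trick using golang.org/x/time/rate; the per-poller rate and the way the poller count is obtained are illustrative assumptions, not the PR's code:

```go
package main

import (
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	// Give the limiter 10*1e6 tokens/sec with a proportional burst, then charge each
	// poll 1e6/numPollers tokens, so the effective limit is roughly 10 decisions/sec
	// per poller.
	limiter := rate.NewLimiter(rate.Limit(10*1e6), 10*1e6)

	numPollers := 4 // in practice this would come from recent poller history (e.g. GetAllPollerInfo)
	for i := 0; i < 3; i++ {
		if limiter.AllowN(time.Now(), 1e6/numPollers) {
			fmt.Println("emit poller scaling decision")
		} else {
			fmt.Println("rate limited; skip attaching a decision")
		}
	}
}
```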

Member Author

Ah, ok yeah good trick

// sticky queues.
pd.PollRequestDeltaSuggestion = 1
} else if !pm.partition.IsRoot() {
// Non-root partitions don't have an appropriate view of the data to make decisions beyond backlog.
Member

Is this still true if we're not using add/dispatch rates?

Member Author

Maybe not. I will run some tests without it.

Member Author

We are now, and additionally I moved the time-based scale-down check to be in front of this, which I think makes sense.


// makePollerScalingDecision makes a decision on whether to scale pollers up or down based on the current state of the
// task queue and the task about to be returned. Does not modify inputs.
func (pm *taskQueuePartitionManagerImpl) makePollerScalingDecision(
Member

Should we make scaling decisions for forwarded polls? I don't see a clear reason to say yes or no yet...

Member Author

I'm not sure. This is why I had that open question about keeping track of the overall wait time including forwarding, since that would affect this, I think.

Member

Good point. Though I'm not sure we actually need to pass the start time through forwarding: either a poll is forwarded immediately or not at all, so the wait time on the parent should be only slightly less than the wait time on the child.

@Sushisource Sushisource force-pushed the spencer/poller-scaling branch 3 times, most recently from 2d6bd0d to b82ea1f Compare February 15, 2025 00:15
Comment on lines +113 to +139
assert.NoError(t, err)
assert.NotNil(t, resp.PollerScalingDecision)
assert.GreaterOrEqual(t, int32(1), resp.PollerScalingDecision.PollRequestDeltaSuggestion)
Member

assert calls don't stop execution, and you can't use require in EventuallyWithT blocks. You'll want to ensure your code won't panic if it continues past a failed assertion.

Member Author

Yeah, I wasn't expecting them to; if the eventually passes, that's all I need here.

Member

If the test panics it will fail without rerunning the eventually block. I don't think that's what you want.

Member Author

I'm not following you. This is doing exactly what I expect: it re-runs the block until the asserts pass. Nothing is panicking.
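
For reference, a minimal sketch of the guard pattern the reviewer is describing, assuming testify's require.EventuallyWithT; the response and poll helper types are hypothetical stand-ins, not the PR's test code:

```go
package example

import (
	"testing"
	"time"

	"github.com/stretchr/testify/assert"
	"github.com/stretchr/testify/require"
)

// pollerScalingDecision and pollResponse are hypothetical stand-ins; only the shape of
// the scaling decision matters for this sketch.
type pollerScalingDecision struct {
	PollRequestDeltaSuggestion int32
}

type pollResponse struct {
	PollerScalingDecision *pollerScalingDecision
}

// pollTaskQueue is a hypothetical helper standing in for the real poll call.
func pollTaskQueue() (*pollResponse, error) {
	return &pollResponse{PollerScalingDecision: &pollerScalingDecision{PollRequestDeltaSuggestion: 1}}, nil
}

func TestPollerScalingDecisionEventually(t *testing.T) {
	require.EventuallyWithT(t, func(c *assert.CollectT) {
		resp, err := pollTaskQueue()
		// assert.* records failures but keeps executing, so bail out before touching a
		// nil decision; EventuallyWithT will retry the block.
		if !assert.NoError(c, err) || !assert.NotNil(c, resp.PollerScalingDecision) {
			return
		}
		assert.GreaterOrEqual(c, int32(1), resp.PollerScalingDecision.PollRequestDeltaSuggestion)
	}, 10*time.Second, 100*time.Millisecond)
}
```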

@Sushisource Sushisource force-pushed the spencer/poller-scaling branch 2 times, most recently from cc93b8d to 9d46d2e Compare February 19, 2025 18:30
@Sushisource Sushisource force-pushed the spencer/poller-scaling branch from f8a0e9f to b6c9d55 Compare February 19, 2025 23:05
@@ -60,7 +60,7 @@ const (
 MinLongPollTimeout = time.Second * 2
 // CriticalLongPollTimeout is a threshold for the context timeout passed into long poll API,
 // below which a warning will be logged
-CriticalLongPollTimeout = time.Second * 20
+CriticalLongPollTimeout = time.Second * 10
Member Author

Changed this since it's useful, SDK-side, to reduce the timeout while under load (so that when load runs out, we scale down faster).

Talked w/ David and it seems like reducing this ought to be fine, but we can undo it if there are any objections.

@Sushisource Sushisource force-pushed the spencer/poller-scaling branch from dade58b to a5bf22b Compare February 20, 2025 01:05
@Sushisource Sushisource force-pushed the spencer/poller-scaling branch from 6dd2b16 to a35c853 Compare February 20, 2025 17:49