[Serve] Refactor `ReplicaQueueLengthAutoscalingPolicy` into `AutoscalingPolicyManager` and policy function #42242
Conversation
Signed-off-by: Gene Su <e870252314@gmail.com>

Refactor `ReplicaQueueLengthAutoscalingPolicy` into `AutoscalingPolicyManager` and `replica_queue_length_autoscaling_policy` function
```python
    def get_decision_num_replicas(
        self,
        curr_target_num_replicas: int,
```
nit: Let's be consistent w/ naming -- either prefix all params w/ "curr" or "current"
👍 will rename in the upcoming PR
```python
        if self.config:
            self.policy = self.config.get_policy()

    def should_autoscale(self) -> bool:
```
nit: I'd rather call this is_enabled, since we're checking whether the APM is working rather than whether we want to autoscale it.
This is refactored from https://github.com/ray-project/ray/pull/42242/files#diff-c4c2583a4c2f3a3c87ada6faebd0b2dfef404e165df133a11195be5d65fcb387L1267 and keeps the existing naming. I feel we could change both places to is_autoscaling_policy_enabled to be more descriptive. What do you think?
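For concreteness, here is a minimal, illustrative sketch of what that rename could look like, assuming (as in the snippet above) that the manager resolves an optional policy from its config. This is a sketch of the idea, not the actual implementation in the PR:

```python
from typing import Callable, Optional


class AutoscalingPolicyManager:
    """Illustrative sketch only; mirrors the snippet quoted above."""

    def __init__(self, config=None):
        self.config = config
        self.policy: Optional[Callable] = None
        if self.config:
            self.policy = self.config.get_policy()

    def is_autoscaling_policy_enabled(self) -> bool:
        # Same check as the current `should_autoscale`: autoscaling is only
        # active when a policy was configured.
        return self.policy is not None
```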
```
@@ -142,136 +107,59 @@ class ReplicaQueueLengthAutoscalingPolicy(AutoscalingPolicy):
    `get_decision_num_replicas` is called once every CONTROL_LOOP_PERIOD_S
    seconds.
```
Why do we need such an assumption?
The logic for our default autoscaling policy has not changed, and I think the comment is still suitable. Basically, it assumes the controller's event loop runs roughly every 0.1 seconds and uses that to delay upscaling/downscaling (e.g., an upscale_delay_s of 30 seconds becomes 30 / 0.1 = 300 consecutive upscale decisions before the replicas actually change). I think we can follow up to clean up some of that logic later, but for the purpose of adding the custom autoscaling functionality I'd rather not touch this right now.
```python
# Scale up.
if desired_num_replicas > curr_target_num_replicas:
    # If the previous decision was to scale down (the counter was
    # negative), we reset it and then increment it (set to 1).
    # Otherwise, just increment.
    if decision_counter < 0:
        decision_counter = 0
    decision_counter += 1

    # Only actually scale the replicas if we've made this decision for
    # 'scale_up_consecutive_periods' in a row.
    if decision_counter > int(config.upscale_delay_s / CONTROL_LOOP_PERIOD_S):
        decision_counter = 0
        decision_num_replicas = desired_num_replicas

# Scale down.
elif desired_num_replicas < curr_target_num_replicas:
    # If the previous decision was to scale up (the counter was
    # positive), reset it to zero before decrementing.
    if decision_counter > 0:
        decision_counter = 0
    decision_counter -= 1

    # Only actually scale the replicas if we've made this decision for
    # 'scale_down_consecutive_periods' in a row.
    if decision_counter < -int(config.downscale_delay_s / CONTROL_LOOP_PERIOD_S):
        decision_counter = 0
        decision_num_replicas = desired_num_replicas

# Do nothing.
else:
    decision_counter = 0
```
This should be deferred to the PolicyManager, which should decide whether to accept, reject, or delay the scaling recommendation produced by the policy. The policy by itself should be stateless, though it could accept historical data (say, the last 5 minutes of resource usage) to make its decision.
Totally agreed that the policy itself should be stateless; that's why it's been refactored into just a function. However, I don't think the policy manager should decide how fast or slow the replicas scale. There might be a use case where the user tracks some sort of session count on the main page and wants to warm up/cool down the ML service by scaling the deployment's replicas up/down accordingly. I don't think we should stop customers from scaling up/down quickly by moving this logic out.
Yes, the policy should be able to control how fast we autoscale, but the policy shouldn't rely on the frequency of its invocation (CONTROL_LOOP_PERIOD_S).
Agreed. This is currently out of scope for the custom autoscaling project, but we can follow up to rewrite this logic to use the exact time when autoscaling last happened :)
We can persist history within policy_state, for example: #42284 (comment)
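To illustrate that suggestion, here is a rough sketch (the function name, signature, and state keys are assumptions for illustration, not this PR's actual API) of how the delay counter could live in a policy_state dict that the manager persists between calls, keeping the policy function itself stateless:

```python
from typing import Any, Dict, Tuple


def replica_queue_length_policy_sketch(
    curr_target_num_replicas: int,
    desired_num_replicas: int,
    policy_state: Dict[str, Any],
    upscale_periods: int,
    downscale_periods: int,
) -> Tuple[int, Dict[str, Any]]:
    """Same counter logic as the block above, but all state is passed in/out."""
    counter = policy_state.get("decision_counter", 0)
    decision = curr_target_num_replicas

    if desired_num_replicas > curr_target_num_replicas:
        # Reset a scale-down streak before counting a scale-up decision.
        counter = max(counter, 0) + 1
        if counter > upscale_periods:
            counter = 0
            decision = desired_num_replicas
    elif desired_num_replicas < curr_target_num_replicas:
        # Reset a scale-up streak before counting a scale-down decision.
        counter = min(counter, 0) - 1
        if counter < -downscale_periods:
            counter = 0
            decision = desired_num_replicas
    else:
        counter = 0

    # The manager stores the returned state and hands it back on the next call.
    return decision, {"decision_counter": counter}
```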
…ingPolicyManager` and policy function (ray-project#42242) Refactor the existing ReplicaQueueLengthAutoscalingPolicy into AutoscalingPolicyManager to managing the lifecycle of policy call and applies bounds. AutoscalingPolicyManager also served as interface layer between the policy and the DeploymentState. Refactored the rest of core policy logics into replica_queue_length_autoscaling_policy policy function for use by AutoscalingPolicyManager. --------- Signed-off-by: Gene Su <e870252314@gmail.com>
Why are these changes needed?
Refactor the existing `ReplicaQueueLengthAutoscalingPolicy` into `AutoscalingPolicyManager`, which manages the lifecycle of the policy call and applies bounds. `AutoscalingPolicyManager` also serves as the interface layer between the policy and the `DeploymentState`. The rest of the core policy logic is refactored into the `replica_queue_length_autoscaling_policy` policy function for use by `AutoscalingPolicyManager`.
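As a rough sketch of the described split (parameter names and the `min_replicas`/`max_replicas` bound fields are assumptions for illustration, not the exact code in this PR):

```python
from typing import Any, Callable, Optional


class AutoscalingPolicyManager:
    """Sketch: interface layer between the policy function and DeploymentState."""

    def __init__(self, config):
        self.config = config
        self.policy: Optional[Callable] = config.get_policy() if config else None

    def get_decision_num_replicas(
        self, curr_target_num_replicas: int, **policy_kwargs: Any
    ) -> int:
        # Ask the policy function (e.g. replica_queue_length_autoscaling_policy)
        # for a recommended target...
        decision = self.policy(
            curr_target_num_replicas=curr_target_num_replicas, **policy_kwargs
        )
        # ...then clamp it to the configured bounds before it reaches
        # DeploymentState.
        return max(
            self.config.min_replicas, min(self.config.max_replicas, decision)
        )
```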
Related issue number
Second PR for #41135
Checks
- I've signed off every commit (with `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I've added a new method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.