Description
Currently, Ray Serve Autoscaling only supports scaling on the ongoing-request metrics used by the built-in policy; it does not support user-defined metrics. This proves inflexible in some practical scenarios. For example, an application that wants to autoscale based on the recent average CPU and memory utilization of the nodes hosting its deployment replicas cannot do so today. Issue #31540 describes the same scenario and requirement.
To solve this problem and support user-defined metrics and policies in Ray Serve Autoscaling, we propose a design in this document, which we have implemented and verified in our internal version.
Use case
At the usage level, we extend the autoscaling_config configuration with two new options, custom_metrics and policy, to support user-defined scaling metrics and a user-defined scaling policy. For example:
@serve.deployment(
    max_ongoing_requests=10,
    autoscaling_config=dict(
        min_replicas=1,
        initial_replicas=1,
        max_replicas=10,
        custom_metrics=[
            "ray_node_cpu_utilization",
            "ray_node_mem_used",
        ],
        policy="autoscale_policy:custom_autoscaling_policy",
    ),
)

Here is an implementation example of a simple custom policy, autoscale_policy:custom_autoscaling_policy:
from typing import Any, Dict, Optional

from ray.serve.config import AutoscalingConfig
# Note: ReplicaID currently lives in Serve's private common module.
from ray.serve._private.common import ReplicaID


def cal_decision_num_replicas_by_custom_metrics(
    curr_target_num_replicas: int,
    total_num_requests: int,
    num_running_replicas: int,
    config: Optional[AutoscalingConfig],
    capacity_adjusted_min_replicas: int,
    capacity_adjusted_max_replicas: int,
    policy_state: Dict[str, Any],
    # The custom metrics are passed to the custom policy.
    custom_metrics: Dict[ReplicaID, Dict[str, float]],
) -> int:
    """Read ray_node_cpu_utilization and ray_node_mem_used from custom_metrics:

    - If the average CPU utilization of some node over the recent period is
      greater than 90%, scale up by one replica.
    - If the average CPU utilization of some node over the recent period is
      less than 10%, scale down by one replica.
    - If the average memory utilization of some node over the recent period is
      greater than 80%, scale up by one replica.
    - If the average memory utilization of some node over the recent period is
      less than 10%, scale down by one replica.
    """
    if any(m["ray_node_cpu_utilization"] > 90.0 for m in custom_metrics.values()):
        decision_num_replicas = num_running_replicas + 1
    elif any(m["ray_node_cpu_utilization"] < 10.0 for m in custom_metrics.values()):
        decision_num_replicas = num_running_replicas - 1
    elif any(m["ray_node_mem_used"] > 80.0 for m in custom_metrics.values()):
        decision_num_replicas = num_running_replicas + 1
    elif any(m["ray_node_mem_used"] < 10.0 for m in custom_metrics.values()):
        decision_num_replicas = num_running_replicas - 1
    else:
        decision_num_replicas = curr_target_num_replicas
    # Keep the decision within the capacity-adjusted bounds.
    return max(
        capacity_adjusted_min_replicas,
        min(decision_num_replicas, capacity_adjusted_max_replicas),
    )


custom_autoscaling_policy = cal_decision_num_replicas_by_custom_metrics

Since the metrics must be reported by the replicas themselves rather than by the deployment handles, the environment variable RAY_SERVE_COLLECT_AUTOSCALING_METRICS_ON_HANDLE=0 must be set.
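As a quick sanity check, the policy can also be invoked directly with a hand-built custom_metrics dictionary. The values below are made up, and plain string keys stand in for ReplicaID objects for illustration:

sample_metrics = {
    "replica-1": {"ray_node_cpu_utilization": 95.0, "ray_node_mem_used": 40.0},
    "replica-2": {"ray_node_cpu_utilization": 35.0, "ray_node_mem_used": 50.0},
}

decision = custom_autoscaling_policy(
    curr_target_num_replicas=2,
    total_num_requests=0,
    num_running_replicas=2,
    config=None,
    capacity_adjusted_min_replicas=1,
    capacity_adjusted_max_replicas=10,
    policy_state={},
    custom_metrics=sample_metrics,
)
assert decision == 3  # replica-1's node is above the 90% CPU threshold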
At the design and implementation level, as shown in the following figure, since Ray already supports exporting metrics through the Prometheus Metrics Exporter, our implementation builds on that mechanism and has each deployment replica report the metrics the user expects:
The core execution process can be described as follows:
- Each deployment replica periodically queries the local Prometheus Metrics Exporter and, according to the custom_metrics configuration, reports the metrics the user is interested in to the ServeController for aggregation (a minimal sketch of this polling loop follows the list).
- The ServeController periodically checks and updates the status of each deployment. During this check, it passes the aggregated custom metrics to the custom scaling policy to calculate the desired number of deployment replicas in the current cluster.
- When the desired number of replicas does not match the number of currently running replicas, the DeploymentStateManager executes the replica scaling operation.
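To make the first step concrete, here is a minimal sketch of the replica-side collection loop. The exporter address and port, the scrape period, the simplified Prometheus text parsing, and the record_custom_metrics controller method are all illustrative assumptions for this sketch, not the actual Ray Serve internals:

import time
from typing import Dict, List

import requests


def scrape_local_metrics(exporter_url: str, wanted: List[str]) -> Dict[str, float]:
    # Fetch the Prometheus text exposition and keep only the wanted metrics.
    # Deliberately simplified: labels and timestamps are ignored, and the last
    # sample seen for each metric name wins.
    values: Dict[str, float] = {}
    for line in requests.get(exporter_url, timeout=5).text.splitlines():
        if line.startswith("#"):
            continue
        tokens = line.split()
        if len(tokens) < 2:
            continue
        name = tokens[0].split("{", 1)[0]
        if name in wanted:
            try:
                values[name] = float(tokens[1])
            except ValueError:
                pass
    return values


def report_loop(controller, replica_id: str, wanted: List[str], period_s: float = 10.0):
    # Periodically push the selected metrics to the controller for aggregation.
    exporter_url = "http://localhost:8080/metrics"  # illustrative exporter address
    while True:
        metrics = scrape_local_metrics(exporter_url, wanted)
        # record_custom_metrics is a hypothetical controller method in this sketch.
        controller.record_custom_metrics.remote(replica_id, metrics)
        time.sleep(period_s)

On the controller side, these per-replica dictionaries are aggregated into the Dict[ReplicaID, Dict[str, float]] structure that is handed to the custom policy shown earlier.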
