
[Serve] Ray Serve Autoscaling supports the configuration of custom-metrics and policy #51632

@plotor

Description


Currently, Ray Serve autoscaling only supports scaling on the number of ongoing HTTP requests via the built-in policy and does not support custom-defined metrics. This is inflexible in many practical scenarios. For example, an application that wants to autoscale based on the recent average CPU and memory utilization of the nodes hosting its deployment replicas cannot do so today. Issue #31540 describes the same scenario and requirements.

To support custom-defined metrics and policies in Ray Serve autoscaling, we propose the design described in this document; we have implemented and verified it in our internal version.

Use case

At the usage level, we extend the autoscaling_config configuration with custom_metrics and policy options so that users can specify custom scaling metrics and a custom scaling policy. For example:

from ray import serve


@serve.deployment(
    max_ongoing_requests=10,
    autoscaling_config=dict(
        min_replicas=1,
        initial_replicas=1,
        max_replicas=10,
        custom_metrics=[
            "ray_node_cpu_utilization",
            "ray_node_mem_used",
        ],
        policy="autoscale_policy:custom_autoscaling_policy",
    ),
)
class MyDeployment:
    # Minimal deployment body added for illustration.
    async def __call__(self, request) -> str:
        return "ok"

Here is an example implementation of a simple custom policy, autoscale_policy:custom_autoscaling_policy:

from typing import Any, Dict, Optional

from ray.serve.config import AutoscalingConfig
# ReplicaID is Ray Serve's internal replica identifier; the import path below
# reflects current Ray internals and may differ between versions.
from ray.serve._private.common import ReplicaID


def cal_decision_num_replicas_by_custom_metrics(
        curr_target_num_replicas: int,
        total_num_requests: int,
        num_running_replicas: int,
        config: Optional[AutoscalingConfig],
        capacity_adjusted_min_replicas: int,
        capacity_adjusted_max_replicas: int,
        policy_state: Dict[str, Any],
        # The custom metrics are passed to the custom policy per replica.
        custom_metrics: Dict[ReplicaID, Dict[str, float]],
) -> int:
    """Decide the target number of replicas from the ray_node_cpu_utilization
    and ray_node_mem_used values in custom_metrics:
    - If the recent average CPU utilization of any node is greater than 90%,
      scale up by one replica.
    - If the recent average CPU utilization of any node is less than 10%,
      scale down by one replica.
    - If the recent average memory utilization of any node is greater than 80%,
      scale up by one replica.
    - If the recent average memory utilization of any node is less than 10%,
      scale down by one replica.
    """
    if any(m["ray_node_cpu_utilization"] > 90.0 for m in custom_metrics.values()):
        decision_num_replicas = num_running_replicas + 1
    elif any(m["ray_node_cpu_utilization"] < 10.0 for m in custom_metrics.values()):
        decision_num_replicas = num_running_replicas - 1
    elif any(m["ray_node_mem_used"] > 80.0 for m in custom_metrics.values()):
        decision_num_replicas = num_running_replicas + 1
    elif any(m["ray_node_mem_used"] < 10.0 for m in custom_metrics.values()):
        decision_num_replicas = num_running_replicas - 1
    else:
        decision_num_replicas = curr_target_num_replicas

    return decision_num_replicas


custom_autoscaling_policy = cal_decision_num_replicas_by_custom_metrics

Because the replicas themselves must report the metrics (rather than having them collected on the handles), RAY_SERVE_COLLECT_AUTOSCALING_METRICS_ON_HANDLE=0 must be set.
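As a minimal sketch, assuming the variable only needs to be visible to the process that starts Serve, it can be set before Serve is started; depending on how the cluster is launched, it may instead need to be propagated to the Ray worker processes (for example through the runtime environment):

import os

# Must be set before Ray Serve starts so that replicas, not handles,
# collect and report the autoscaling metrics.
os.environ["RAY_SERVE_COLLECT_AUTOSCALING_METRICS_ON_HANDLE"] = "0"

from ray import serve

serve.start()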

At the design and implementation level, since Ray already supports exporting metrics through the Prometheus Metrics Exporter, we build on that mechanism and have each deployment replica report the metrics the user is interested in, as shown in the following figure:

[Figure: each deployment replica scrapes the local Prometheus Metrics Exporter and reports the configured custom metrics to the ServeController, which feeds them to the custom autoscaling policy]

The core execution flow is as follows:

  1. Each deployment replica periodically queries its local Prometheus Metrics Exporter and, according to the custom_metrics configuration, reports the metrics the user is interested in to the ServeController for aggregation (a hypothetical sketch of this step follows the list).
  2. The ServeController periodically checks and updates the state of each deployment. During this step, it passes the aggregated custom metrics to the custom scaling policy, which computes the desired number of replicas for the deployment.
  3. When the desired number of replicas does not match the number of currently running replicas, the DeploymentStateManager performs the corresponding scale-up or scale-down operation.
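
To make step 1 concrete, below is a hypothetical sketch of the replica-side collection loop. The exporter address, the parsing helper, and report_fn are illustrative assumptions rather than actual Ray Serve APIs; in the real implementation the replica would push the values through its existing reporting channel to the ServeController.

import re
import time
import urllib.request
from typing import Callable, Dict, List

# Assumed address of the node-local Prometheus Metrics Exporter (illustrative).
LOCAL_METRICS_URL = "http://localhost:8080/metrics"


def scrape_custom_metrics(metric_names: List[str]) -> Dict[str, float]:
    """Parse the Prometheus text format and keep only the configured metrics."""
    body = urllib.request.urlopen(LOCAL_METRICS_URL, timeout=5).read().decode()
    values: Dict[str, float] = {}
    for line in body.splitlines():
        if line.startswith("#"):
            continue
        for name in metric_names:
            # Matches lines such as: ray_node_cpu_utilization{SessionName="..."} 37.5
            match = re.match(rf"^{name}(\{{[^}}]*\}})?\s+([-+0-9.eE]+)", line)
            if match:
                values[name] = float(match.group(2))
    return values


def report_loop(
    custom_metrics: List[str],
    report_fn: Callable[[Dict[str, float]], None],
    interval_s: float = 10.0,
) -> None:
    """Periodically scrape the configured metrics and push them to the controller."""
    while True:
        report_fn(scrape_custom_metrics(custom_metrics))
        time.sleep(interval_s)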
