
[Serve] Ray Serve Autoscaling supports the configuration of custom-metrics and policy #51632

@plotor

Description


Currently, Ray Serve autoscaling only supports scaling on the number of ongoing HTTP requests via the built-in policy and does not support custom-defined metrics. This is inflexible in many practical scenarios. For example, an application that wants to autoscale based on the recent average CPU and memory utilization of the nodes hosting its deployment replicas cannot do so today. Issue #31540 describes the same scenario and requirements.

To support custom-defined metrics and policies in Ray Serve autoscaling, we propose the design described in this document; we have implemented and verified it in our internal version.

Use case

At the usage level, we extend the autoscaling_config configuration with custom_metrics and policy options so that users can specify custom scaling metrics and a custom scaling policy. For example:

from ray import serve


@serve.deployment(
    max_ongoing_requests=10,
    autoscaling_config=dict(
        min_replicas=1,
        initial_replicas=1,
        max_replicas=10,
        custom_metrics=[
            "ray_node_cpu_utilization",
            "ray_node_mem_used",
        ],
        policy="autoscale_policy:custom_autoscaling_policy",
    ),
)
class MyDeployment:
    # Minimal deployment body added for illustration.
    async def __call__(self, request) -> str:
        return "ok"

Here is an example implementation of a simple custom policy, autoscale_policy:custom_autoscaling_policy:

from typing import Any, Dict, Optional

from ray.serve.config import AutoscalingConfig
# ReplicaID is Ray Serve's internal replica identifier; the import path below
# reflects current Ray internals and may differ between versions.
from ray.serve._private.common import ReplicaID


def cal_decision_num_replicas_by_custom_metrics(
        curr_target_num_replicas: int,
        total_num_requests: int,
        num_running_replicas: int,
        config: Optional[AutoscalingConfig],
        capacity_adjusted_min_replicas: int,
        capacity_adjusted_max_replicas: int,
        policy_state: Dict[str, Any],
        # The custom metrics are passed to the custom policy per replica.
        custom_metrics: Dict[ReplicaID, Dict[str, float]],
) -> int:
    """Decide the target number of replicas from the ray_node_cpu_utilization
    and ray_node_mem_used values in custom_metrics:
    - If the recent average CPU utilization of any node is greater than 90%,
      scale up by one replica.
    - If the recent average CPU utilization of any node is less than 10%,
      scale down by one replica.
    - If the recent average memory utilization of any node is greater than 80%,
      scale up by one replica.
    - If the recent average memory utilization of any node is less than 10%,
      scale down by one replica.
    """
    if any(m["ray_node_cpu_utilization"] > 90.0 for m in custom_metrics.values()):
        decision_num_replicas = num_running_replicas + 1
    elif any(m["ray_node_cpu_utilization"] < 10.0 for m in custom_metrics.values()):
        decision_num_replicas = num_running_replicas - 1
    elif any(m["ray_node_mem_used"] > 80.0 for m in custom_metrics.values()):
        decision_num_replicas = num_running_replicas + 1
    elif any(m["ray_node_mem_used"] < 10.0 for m in custom_metrics.values()):
        decision_num_replicas = num_running_replicas - 1
    else:
        decision_num_replicas = curr_target_num_replicas

    return decision_num_replicas


custom_autoscaling_policy = cal_decision_num_replicas_by_custom_metrics

Because the replicas themselves must report the metrics (rather than having them collected on the handles), RAY_SERVE_COLLECT_AUTOSCALING_METRICS_ON_HANDLE=0 must be set.
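As a minimal sketch, assuming the variable only needs to be visible to the process that starts Serve, it can be set before Serve is started; depending on how the cluster is launched, it may instead need to be propagated to the Ray worker processes (for example through the runtime environment):

import os

# Must be set before Ray Serve starts so that replicas, not handles,
# collect and report the autoscaling metrics.
os.environ["RAY_SERVE_COLLECT_AUTOSCALING_METRICS_ON_HANDLE"] = "0"

from ray import serve

serve.start()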

At the design and implementation level, since Ray already supports exporting metrics through the Prometheus Metrics Exporter, we build on that mechanism and have each deployment replica report the metrics the user is interested in, as shown in the following figure:

[Figure: each deployment replica scrapes the local Prometheus Metrics Exporter and reports the configured custom metrics to the ServeController, which feeds them to the custom autoscaling policy]

The core execution flow is as follows:

  1. Each deployment replica periodically queries its local Prometheus Metrics Exporter and, according to the custom_metrics configuration, reports the metrics the user is interested in to the ServeController for aggregation (a hypothetical sketch of this step follows the list).
  2. The ServeController periodically checks and updates the state of each deployment. During this step, it passes the aggregated custom metrics to the custom scaling policy, which computes the desired number of replicas for the deployment.
  3. When the desired number of replicas does not match the number of currently running replicas, the DeploymentStateManager performs the corresponding scale-up or scale-down operation.
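
To make step 1 concrete, below is a hypothetical sketch of the replica-side collection loop. The exporter address, the parsing helper, and report_fn are illustrative assumptions rather than actual Ray Serve APIs; in the real implementation the replica would push the values through its existing reporting channel to the ServeController.

import re
import time
import urllib.request
from typing import Callable, Dict, List

# Assumed address of the node-local Prometheus Metrics Exporter (illustrative).
LOCAL_METRICS_URL = "http://localhost:8080/metrics"


def scrape_custom_metrics(metric_names: List[str]) -> Dict[str, float]:
    """Parse the Prometheus text format and keep only the configured metrics."""
    body = urllib.request.urlopen(LOCAL_METRICS_URL, timeout=5).read().decode()
    values: Dict[str, float] = {}
    for line in body.splitlines():
        if line.startswith("#"):
            continue
        for name in metric_names:
            # Matches lines such as: ray_node_cpu_utilization{SessionName="..."} 37.5
            match = re.match(rf"^{name}(\{{[^}}]*\}})?\s+([-+0-9.eE]+)", line)
            if match:
                values[name] = float(match.group(2))
    return values


def report_loop(
    custom_metrics: List[str],
    report_fn: Callable[[Dict[str, float]], None],
    interval_s: float = 10.0,
) -> None:
    """Periodically scrape the configured metrics and push them to the controller."""
    while True:
        report_fn(scrape_custom_metrics(custom_metrics))
        time.sleep(interval_s)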
