
Conversation


@abrarsheikh abrarsheikh commented Oct 16, 2025

flaky test

```
RAY_SERVE_HANDLE_AUTOSCALING_METRIC_PUSH_INTERVAL_S=0.1 \
RAY_SERVE_AGGREGATE_METRICS_AT_CONTROLLER=1 \
RAY_SERVE_COLLECT_AUTOSCALING_METRICS_ON_HANDLE=0 \
pytest -svvx "python/ray/serve/tests/test_autoscaling_policy.py::TestAutoscalingMetrics::test_basic[min]"
```

What I think is the likely cause

When using `RAY_SERVE_AGGREGATE_METRICS_AT_CONTROLLER=1` with `min` aggregation (see the sketch after this list):

1. **Replicas emit metrics at slightly different times** (even if only 10ms apart, due to timestamp bucketing/rounding).
2. **The merged timeseries reflects the ramp-up**:
   - At t=0: maybe only replica 1 is reporting → total = 25 requests
   - At t=0.01: replica 2 starts reporting → total = 40 requests
   - At t=0.02: replica 3 starts reporting → total = 50 requests
   - etc.
3. **`min` aggregation captures the starting point**:
   - `aggregate_timeseries(..., aggregation_function="min")` takes the minimum value from the merged timeseries.
   - That minimum is always one of the initial low values (like 25) from when only a subset of replicas had started reporting.
   - Such a value can never be ≥ 45, so the test is inherently flaky.
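A minimal standalone sketch of the failure mode, using made-up replica reports that mirror the numbers above (this is not Ray Serve's actual merging code):

```python
from collections import defaultdict

# Hypothetical per-replica reports as (timestamp, running_requests) pairs.
# Replicas start reporting ~10ms apart, so early buckets only include a subset.
replica_reports = {
    "replica_1": [(0.00, 25), (0.01, 25), (0.02, 25)],
    "replica_2": [(0.01, 15), (0.02, 15)],  # starts reporting 10ms late
    "replica_3": [(0.02, 10)],              # starts reporting 20ms late
}

# Merge: sum whatever values exist in each time bucket across replicas.
merged = defaultdict(float)
for points in replica_reports.values():
    for ts, value in points:
        merged[ts] += value
merged_series = sorted(merged.items())
# merged_series == [(0.0, 25.0), (0.01, 40.0), (0.02, 50.0)]

# "min" over the merged series always lands on the ramp-up bucket (25),
# which can never satisfy the test's >= 45 expectation -> flaky.
print(min(value for _, value in merged_series))  # 25.0
print(max(value for _, value in merged_series))  # 50.0
```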

Removing `min` from the test fixture.

I think a more robust solution is to keep the last report in the controller, generate the final time series using both reports, clip the data at the mid-point, and then apply the aggregation function.
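A hedged sketch of what that could look like; the function name, report format, and midpoint-clipping rule here are assumptions for illustration, not Ray Serve's actual API:

```python
from bisect import bisect_left


def aggregate_with_previous_report(prev_report, curr_report, aggregation_function):
    """prev_report / curr_report: lists of (timestamp, value), sorted by timestamp."""
    combined = sorted(prev_report + curr_report)
    if not combined:
        return None
    # Clip at the midpoint of the combined window so the retained half only
    # contains buckets where every replica has had a chance to report,
    # avoiding the ramp-up artifact described above.
    midpoint = (combined[0][0] + combined[-1][0]) / 2
    start = bisect_left(combined, (midpoint,))
    clipped = combined[start:] or combined[-1:]
    return aggregation_function(value for _, value in clipped)


# Example with the numbers above: "min" over the clipped half now sees the
# steady-state total (50) instead of the ramp-up value (25).
prev = [(0.00, 25.0), (0.01, 40.0)]
curr = [(0.02, 50.0), (0.03, 50.0)]
print(aggregate_with_previous_report(prev, curr, min))  # 50.0
```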

Signed-off-by: abrar <abrar@anyscale.com>
@abrarsheikh abrarsheikh requested a review from a team as a code owner October 16, 2025 05:36

@akyang-anyscale
Contributor

what's the reasoning behind the flakiness?

@abrarsheikh abrarsheikh added the go add ONLY when ready to merge, run all tests label Oct 16, 2025
@ray-gardener ray-gardener bot added the serve Ray Serve Related Issue label Oct 16, 2025
@zcin zcin merged commit 978a9af into master Oct 16, 2025
6 checks passed
@zcin zcin deleted the SERVE-1239-abrar-flaky branch October 16, 2025 17:51
justinyeh1995 pushed a commit to justinyeh1995/ray that referenced this pull request Oct 20, 2025
xinyuangui2 pushed a commit to xinyuangui2/ray that referenced this pull request Oct 22, 2025
elliot-barn pushed a commit that referenced this pull request Oct 23, 2025
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
Future-Outlier pushed a commit to Future-Outlier/ray that referenced this pull request Dec 7, 2025